udb-size

Description

A Udb server holds its data in memory. A database can be held either by a single server or by a pool of servers. The number of servers to use must be determined before populating the database.

A crude estimate is the input data size itself. For example, if the data size is 100GB (uncompressed size), and each server has 16GB available, 7 servers are needed (100 / 16, rounded up). This is simple, but also very inaccurate. A better way to estimate the amount of memory needed per server is outlined below. It depends mainly on the database definition and the characteristics of the data set to be processed.
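The crude estimate above is a ceiling division. A minimal sketch follows; the function name is hypothetical and not part of any Udb tool.

```python
import math

def crude_server_count(data_size_gb, memory_per_server_gb):
    """Crude estimate: uncompressed input size divided by per-server memory, rounded up."""
    return math.ceil(data_size_gb / memory_per_server_gb)

# The example from the text: 100GB of data, 16GB available per server.
print(crude_server_count(100, 16))  # 7
```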

Database definition related parameters can be obtained from udb.spec. Data characteristics related parameters can be obtained from the report produced by loginf on the data set (or part of it).

Parameters

Parameters needed for the estimate:

ptr_z: The pointer size, an intrinsic software overhead.
- ptr_z = 4 on a 32 bit platform
- ptr_z = 8 on a 64 bit platform

Num_server: The number of servers in the pool, one or more.

Num_row: The overall row count of a table in the data set. This is a table specific characteristic.

Num_string_per_server: Strings are hashed so that only the unique ones are stored. Strings come from the string columns, excluding the PKEY column. Num_string_per_server is the unique string count per server. It is usually less than the overall unique count in the data set, but it does not shrink in proportion to the number of servers. In fact, the overall unique count is often a good estimate.

Avg_string_length: The average unique string length on a server. It should be a per-server estimate; however, it is often independent of the number of servers so that the overall average can be used.

Num_bucket: The overall unique PKEY count in the data set.

Avg_num_pluskey: The average unique pluskey count per bucket of a table. This is a table specific characteristic. It should be a per-server estimate; however, it is often independent of the number of servers so that the overall average can be used.

Num_vector: Number of vectors in the data definition.

Num_table_with_pluskey: Number of tables having a +KEY in the data definition.

Num_table: Number of tables in the data definition, excluding vectors and the Var table.

Num_vector_and_table: Number of vectors and tables in the data definition, excluding the Var table.

Estimation

The amount of memory needed by each server in a pool is the sum of these contributions.

Table rows:

Per_column_size:
- I, IS = 4
- F, L, LS = 8
- IP = 20
- S = ptr_z (strings are stored in a hash table; only pointers to the hash entries are stored in a row)

Total_column_size = Sum_over_columns(Per_column_size)

Per_row_padding:
- Up to 8 bytes on a 32 bit platform.
- Up to 4 bytes on a 64 bit platform.

Per_table_size_per_server = Num_row * (Total_column_size + Per_row_padding) / Num_server

Total_row_size_per_server = Sum_over_tables(Per_table_size_per_server)
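The row-size formulas can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's own code: the function names and the sample table are hypothetical, and the padding is passed in explicitly because the text only gives an upper bound for it.

```python
# Fixed per-column sizes in bytes, as listed above. "S" columns cost one pointer.
PER_COLUMN_SIZE = {"I": 4, "IS": 4, "F": 8, "L": 8, "LS": 8, "IP": 20}

def column_size(col_type, ptr_z):
    """Size of one column value in a row."""
    return ptr_z if col_type == "S" else PER_COLUMN_SIZE[col_type]

def table_size_per_server(num_row, col_types, ptr_z, num_server, per_row_padding):
    """Per_table_size_per_server for one table."""
    total_column_size = sum(column_size(t, ptr_z) for t in col_types)
    return num_row * (total_column_size + per_row_padding) / num_server

def total_row_size_per_server(tables, ptr_z, num_server, per_row_padding):
    """Sum over tables; each table is a (num_row, col_types) pair."""
    return sum(
        table_size_per_server(num_row, col_types, ptr_z, num_server, per_row_padding)
        for num_row, col_types in tables
    )

# Hypothetical table: 1,000,000 rows with columns I, S, F on a 64 bit platform,
# 4 servers, using the worst-case padding of 4 bytes.
print(table_size_per_server(1_000_000, ["I", "S", "F"], 8, 4, 4))
```

Using the upper bound for the padding gives a conservative figure; the actual padding depends on the column layout.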

Strings:

Hash_size_per_server:
- (2M * ptr_z) for up to (2M * 12) Num_string_per_server
- (16M * ptr_z) for up to (16M * 12) Num_string_per_server
- (128M * ptr_z) max

Per_string_size = ptr_z + 6 + Avg_string_length (rounded up to multiple of ptr_z)

Total_string_size_per_server = Hash_size_per_server + (Num_string_per_server * Per_string_size)
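A sketch of the string estimate follows. Two readings are assumptions here: "M" is taken as 2^20, and the rounding to a multiple of ptr_z is applied to the whole Per_string_size sum. The function names are hypothetical.

```python
import math

M = 1 << 20  # assumption: "M" in the hash size tiers means 2**20

def hash_size(count, ptr_z):
    """Hash table size in bytes for the given entry count, using the three tiers."""
    if count <= 2 * M * 12:
        return 2 * M * ptr_z
    if count <= 16 * M * 12:
        return 16 * M * ptr_z
    return 128 * M * ptr_z

def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return math.ceil(n / multiple) * multiple

def string_size_per_server(num_string_per_server, avg_string_length, ptr_z):
    """Total_string_size_per_server."""
    # Assumption: the rounding covers the whole per-string sum.
    per_string_size = round_up(ptr_z + 6 + avg_string_length, ptr_z)
    return hash_size(num_string_per_server, ptr_z) + num_string_per_server * per_string_size

# Hypothetical data set: 1,000,000 unique strings of average length 10, 64 bit platform.
print(string_size_per_server(1_000_000, 10, 8))  # 40777216
```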

User buckets:

Num_bucket_per_server = Num_bucket / Num_server

Hash_size_per_server:
- (2M * ptr_z) for up to (2M * 12) Num_bucket_per_server
- (16M * ptr_z) for up to (16M * 12) Num_bucket_per_server
- (128M * ptr_z) max

Vector_flag_size = Num_vector * 1 (rounded up to multiple of ptr_z)

Per_bucket_size = ptr_z + 6 + Avg_pkey_length + Vector_flag_size + Num_table_with_pluskey * (8 + ptr_z) + Num_table * ptr_z + Num_vector_and_table * ptr_z

Total_bucket_size_per_server = Hash_size_per_server + (Num_bucket_per_server * Per_bucket_size)
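The bucket estimate can be sketched the same way. Assumptions: "M" is taken as 2^20, and Avg_pkey_length, which the formula uses but the parameter list does not define, is taken to be the average PKEY length in bytes. The function names and sample values are hypothetical.

```python
import math

M = 1 << 20  # assumption: "M" in the hash size tiers means 2**20

def hash_size(count, ptr_z):
    """Hash table size in bytes for the given entry count, using the three tiers."""
    if count <= 2 * M * 12:
        return 2 * M * ptr_z
    if count <= 16 * M * 12:
        return 16 * M * ptr_z
    return 128 * M * ptr_z

def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return math.ceil(n / multiple) * multiple

def bucket_size_per_server(num_bucket, num_server, ptr_z, avg_pkey_length,
                           num_vector, num_table_with_pluskey, num_table,
                           num_vector_and_table):
    """Total_bucket_size_per_server."""
    num_bucket_per_server = num_bucket / num_server
    vector_flag_size = round_up(num_vector * 1, ptr_z)  # one byte per vector
    per_bucket_size = (
        ptr_z + 6 + avg_pkey_length  # avg_pkey_length: assumed average PKEY length in bytes
        + vector_flag_size
        + num_table_with_pluskey * (8 + ptr_z)
        + num_table * ptr_z
        + num_vector_and_table * ptr_z
    )
    return hash_size(num_bucket_per_server, ptr_z) + num_bucket_per_server * per_bucket_size

# Hypothetical definition: 1 vector, 2 tables (1 with a +KEY), 4,000,000 unique
# PKEYs of average length 10, spread over 4 servers on a 64 bit platform.
print(bucket_size_per_server(4_000_000, 4, 8, 10, 1, 1, 2, 3))
```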

Pluskey (+KEY) overhead:

Hash_size_per_table (per bucket):
- 0 for up to (1 * 16) Avg_num_pluskey
- (8 * ptr_z) for up to (8 * 16) Avg_num_pluskey
- (8^n * ptr_z) for up to (8^n * 16) Avg_num_pluskey
- (16M * ptr_z) max

Per_pluskey_overhead:
- 8 on a 32 bit platform
- 16 on a 64 bit platform

Per_pluskey_table_overhead (per bucket) = Hash_size_per_table + (Avg_num_pluskey * Per_pluskey_overhead)

Total_pluskey_overhead_per_server = Num_bucket_per_server * Sum_over_pluskey_tables(Per_pluskey_table_overhead)
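The tiered per-bucket hash size is easier to follow as a loop. In this sketch the hash grows by a factor of 8 until it covers Avg_num_pluskey at 16 entries per slot, capped at 16M slots. "M" is assumed to mean 2^20, which makes 16M equal to 8^8 and so fits the 8^n progression. The function names are hypothetical.

```python
M = 1 << 20              # assumption: "M" means 2**20
MAX_SLOTS = 16 * M       # equals 8**8 under that assumption

# Per_pluskey_overhead in bytes, keyed by ptr_z.
PER_PLUSKEY_OVERHEAD = {4: 8, 8: 16}

def pluskey_hash_size(avg_num_pluskey, ptr_z):
    """Hash_size_per_table (per bucket) for one +KEY table."""
    if avg_num_pluskey <= 16:
        return 0  # no hash needed for small buckets
    slots = 8
    while avg_num_pluskey > slots * 16 and slots < MAX_SLOTS:
        slots *= 8
    return slots * ptr_z

def pluskey_overhead_per_server(num_bucket_per_server, avg_num_pluskey_per_table, ptr_z):
    """Total_pluskey_overhead_per_server; one Avg_num_pluskey value per +KEY table."""
    per_bucket = sum(
        pluskey_hash_size(avg, ptr_z) + avg * PER_PLUSKEY_OVERHEAD[ptr_z]
        for avg in avg_num_pluskey_per_table
    )
    return num_bucket_per_server * per_bucket

# Hypothetical: two +KEY tables averaging 10 and 100 pluskeys per bucket,
# 1,000,000 buckets per server, 64 bit platform.
print(pluskey_overhead_per_server(1_000_000, [10, 100], 8))  # 1824000000
```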

Total_storage_per_server = Total_row_size_per_server + Total_string_size_per_server + Total_bucket_size_per_server + Total_pluskey_overhead_per_server
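As a final step, the four contributions are added and compared with the memory available on each server. The sketch below uses made-up byte counts and a hypothetical 16GB budget (taken here as 16 * 2^30 bytes) purely to show the arithmetic.

```python
def total_storage_per_server(row_size, string_size, bucket_size, pluskey_overhead):
    """Total_storage_per_server: the sum of the four per-server contributions, in bytes."""
    return row_size + string_size + bucket_size + pluskey_overhead

def fits(total_bytes, memory_per_server_bytes):
    """True if the estimate fits within the memory available on one server."""
    return total_bytes <= memory_per_server_bytes

# Hypothetical per-server contributions in bytes.
total = total_storage_per_server(6_000_000, 40_777_216, 104_777_216, 1_824_000_000)
print(total)                      # 1975554432
print(fits(total, 16 * 2**30))    # True
```

If the total does not fit, increasing Num_server reduces the row and bucket contributions, while the string contribution may stay close to constant, as noted under Num_string_per_server.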
