A Udb server holds its data in memory. A database can be held either by a single server or by a pool of servers. The number of servers must be determined before the database is populated.
A crude estimate is the input data size itself. For example, if the data size is 100GB (uncompressed) and each server has 16GB available, 7 servers are needed. This is simple, but also very inaccurate. A better way to estimate the amount of memory needed per server is outlined below. It depends mainly on the database definition and the characteristics of the data set to be processed.
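The crude estimate is just a ceiling division. A minimal sketch in Python; the function name is hypothetical and not part of Udb:

```python
import math

def crude_server_count(data_size_gb: float, memory_per_server_gb: float) -> int:
    """Smallest number of servers whose combined memory covers the raw data size."""
    return math.ceil(data_size_gb / memory_per_server_gb)

# The example from the text: 100GB of data, 16GB available per server.
print(crude_server_count(100, 16))  # 100 / 16 = 6.25, rounded up to 7
```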
Parameters related to the database definition can be obtained from udb.spec. Parameters related to data characteristics can be obtained from the report that loginf produces on the data set (or part of it).
Parameters needed for the estimate:
ptr_z
The pointer size, an intrinsic software overhead.
Num_server
Num_row
Num_string_per_server
The unique string count per server. It is usually less than the overall unique count in the data set, but it does not scale linearly with the number of servers. In fact, the overall unique count is often a good estimate.
Avg_string_length
Num_bucket
Avg_num_pluskey
Num_vector
Num_table_with_pluskey
Num_table
Num_vector_and_table
The amount of memory needed by each server in a pool is the sum of these contributions.
Table rows:
Per_column_size:
Total_column_size =
Sum_over_columns(Per_column_size)
Per_row_padding:
Per_table_size_per_server =
Num_row * (Total_column_size + Per_row_padding) / Num_server
Total_row_size_per_server =
Sum_over_tables(Per_table_size_per_server)
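The table row formulas can be sketched in Python as follows. The per-column sizes and the per-row padding are taken as inputs because their values are not given here; the function names and the sample numbers are hypothetical placeholders, not Udb figures.

```python
def table_size_per_server(num_row, column_sizes, per_row_padding, num_server):
    """Num_row * (Total_column_size + Per_row_padding) / Num_server."""
    total_column_size = sum(column_sizes)  # Sum_over_columns(Per_column_size)
    return num_row * (total_column_size + per_row_padding) / num_server

def total_row_size_per_server(tables, num_server):
    """Sum_over_tables(Per_table_size_per_server).

    Each table is a dict with num_row, column_sizes and per_row_padding.
    """
    return sum(
        table_size_per_server(
            t["num_row"], t["column_sizes"], t["per_row_padding"], num_server
        )
        for t in tables
    )

# Placeholder inputs: one table, 1000 rows, two columns, 4 servers.
tables = [{"num_row": 1000, "column_sizes": [8, 4], "per_row_padding": 4}]
print(total_row_size_per_server(tables, num_server=4))
```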
Strings:
Hash_size_per_server:
Per_string_size =
ptr_z + 6 + Avg_string_length (rounded up to multiple of ptr_z)
Total_string_size_per_server =
Hash_size_per_server + (Num_string_per_server * Per_string_size)
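A sketch of the string formulas in Python. It assumes the rounding applies to the whole sum ptr_z + 6 + Avg_string_length; if only Avg_string_length is meant to be rounded, the result differs. Hash_size_per_server is passed in as an input because its value is not given here. All function names are hypothetical.

```python
import math

def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return math.ceil(value / multiple) * multiple

def per_string_size(ptr_z, avg_string_length):
    # Assumption: the rounding covers the whole expression.
    return round_up(ptr_z + 6 + avg_string_length, ptr_z)

def total_string_size_per_server(hash_size_per_server, num_string_per_server,
                                 ptr_z, avg_string_length):
    """Hash_size_per_server + (Num_string_per_server * Per_string_size)."""
    return hash_size_per_server + num_string_per_server * per_string_size(
        ptr_z, avg_string_length
    )

# Placeholder inputs: 8-byte pointers, 12-character average string.
print(per_string_size(8, 12))
print(total_string_size_per_server(1024, 1000, 8, 12))
```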
User buckets:
Num_bucket_per_server =
Num_bucket / Num_server
Hash_size_per_server:
Vector_flag_size =
Num_vector * 1 (rounded up to multiple of ptr_z)
Per_bucket_size =
ptr_z + 6 + Avg_pkey_length + Vector_flag_size + Num_table_with_pluskey * (8 + ptr_z) + Num_table * ptr_z + Num_vector_and_table * ptr_z
Total_bucket_size_per_server =
Hash_size_per_server + (Num_bucket_per_server * Per_bucket_size)
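The user bucket formulas translate to Python roughly as below. Hash_size_per_server is an input because its value is not given here; the function names and sample numbers are hypothetical placeholders.

```python
import math

def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return math.ceil(value / multiple) * multiple

def per_bucket_size(ptr_z, avg_pkey_length, num_vector,
                    num_table_with_pluskey, num_table, num_vector_and_table):
    # Vector_flag_size: one unit per vector, rounded up to a multiple of ptr_z.
    vector_flag_size = round_up(num_vector * 1, ptr_z)
    return (ptr_z + 6 + avg_pkey_length
            + vector_flag_size
            + num_table_with_pluskey * (8 + ptr_z)
            + num_table * ptr_z
            + num_vector_and_table * ptr_z)

def total_bucket_size_per_server(hash_size_per_server, num_bucket, num_server,
                                 bucket_size):
    """Hash_size_per_server + (Num_bucket_per_server * Per_bucket_size)."""
    num_bucket_per_server = num_bucket / num_server
    return hash_size_per_server + num_bucket_per_server * bucket_size

# Placeholder inputs, not measured values.
size = per_bucket_size(ptr_z=8, avg_pkey_length=10, num_vector=3,
                       num_table_with_pluskey=1, num_table=2,
                       num_vector_and_table=5)
print(size)
print(total_bucket_size_per_server(0, 1_000_000, 4, size))
```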
Pluskey (+KEY) overhead:
Hash_size_per_table (per bucket):
Per_pluskey_overhead =
Per_pluskey_table_overhead (per bucket) =
Hash_size_per_table + (Avg_num_pluskey * Per_pluskey_overhead)
Total_pluskey_overhead_per_server =
Num_bucket_per_server * Sum_over_pluskey_tables(Per_pluskey_table_overhead)
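A sketch of the +KEY overhead in Python. Hash_size_per_table and Per_pluskey_overhead are inputs because their values are not given here, and Avg_num_pluskey is assumed to be known per table. The function names and sample numbers are hypothetical placeholders.

```python
def per_pluskey_table_overhead(hash_size_per_table, avg_num_pluskey,
                               per_pluskey_overhead):
    """Per-bucket overhead of one table that has a +KEY."""
    return hash_size_per_table + avg_num_pluskey * per_pluskey_overhead

def total_pluskey_overhead_per_server(num_bucket_per_server, pluskey_tables):
    """Num_bucket_per_server * Sum_over_pluskey_tables(Per_pluskey_table_overhead).

    pluskey_tables is a list of
    (hash_size_per_table, avg_num_pluskey, per_pluskey_overhead) tuples.
    """
    per_bucket = sum(per_pluskey_table_overhead(*t) for t in pluskey_tables)
    return num_bucket_per_server * per_bucket

# Placeholder inputs: two +KEY tables, 250000 buckets per server.
pluskey_tables = [(64, 3, 16), (64, 5, 16)]
print(total_pluskey_overhead_per_server(250_000, pluskey_tables))
```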
Total_storage_per_server =
Total_row_size_per_server + Total_string_size_per_server + Total_bucket_size_per_server + Total_pluskey_overhead_per_server
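The final sum, with a check against the memory available on each server, can be sketched in Python as below. The function names are hypothetical, the four inputs are placeholder byte counts, and 16GB is assumed to mean 16 * 1024^3 bytes.

```python
def total_storage_per_server(total_row_size, total_string_size,
                             total_bucket_size, total_pluskey_overhead):
    """Sum of the four per-server contributions."""
    return (total_row_size + total_string_size
            + total_bucket_size + total_pluskey_overhead)

def fits_in_memory(total_storage, available_bytes):
    """True if the estimate does not exceed the memory available per server."""
    return total_storage <= available_bytes

# Placeholder contributions in bytes, not measured values.
estimate = total_storage_per_server(4000, 33024, 26_000_000, 64_000_000)
print(estimate)
print(fits_in_memory(estimate, 16 * 1024**3))
```

If the estimate does not fit, the per-server terms have to be recomputed with a larger Num_server; Num_string_per_server in particular does not shrink linearly as servers are added.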