Select Data Types
One of the core reasons for ClickHouse's query performance is its efficient data compression. Less data on disk results in faster queries and inserts by minimizing I/O overhead. ClickHouse's column-oriented architecture naturally arranges similar data adjacently, enabling compression algorithms and codecs to reduce data size dramatically. To maximize these compression benefits, it's essential to carefully choose appropriate data types.
Compression efficiency in ClickHouse depends mainly on three factors: the ordering key, data types, and codecs, all defined through the table schema. Choosing optimal data types yields immediate improvements in both storage and query performance.
Some straightforward guidelines can significantly enhance the schema:
-
Use Strict Types: Always select the correct data type for columns. Numeric and date fields should use appropriate numeric and date types rather than general-purpose String types. This ensures correct semantics for filtering and aggregations.
-
Avoid Nullable Columns: Nullable columns introduce additional overhead by maintaining separate columns for tracking null values. Only use Nullable if explicitly required to distinguish between empty and null states. Otherwise, default or zero-equivalent values typically suffice. For further information on why this type should be avoided unless needed, see Avoid Nullable Columns.
-
Minimize Numeric Precision: Select numeric types with minimal bit-width that still accommodate the expected data range. For instance, prefer UInt16 over Int32 if negative values aren't needed, and the range fits within 0–65535.
-
Optimize Date and Time Precision: Choose the most coarse-grained date or datetime type that meets query requirements. Use Date or Date32 for date-only fields, and prefer DateTime over DateTime64 unless millisecond or finer precision is essential.
-
Leverage LowCardinality and Specialized Types: For columns with fewer than approximately 10,000 unique values, use LowCardinality types to significantly reduce storage through dictionary encoding. Similarly, use FixedString only when the column values are strictly fixed-length strings (e.g., country or currency codes), and prefer Enum types for columns with a finite set of possible values to enable efficient storage and built-in data validation.
-
Enums for data validation: The Enum type can be used to efficiently encode enumerated types. Enums can either be 8 or 16 bits, depending on the number of unique values they are required to store. Consider using this if you need either the associated validation at insert time (undeclared values will be rejected) or wish to perform queries which exploit a natural ordering in the Enum values e.g. imagine a feedback column containing user responses Enum(':(' = 1, ':|' = 2, ':)' = 3).
Example
ClickHouse offers built-in tools to streamline type optimization. For example, schema inference can automatically identify initial types. Consider the Stack Overflow dataset, publicly available in Parquet format. Running a simple schema inference via the DESCRIBE
command provides an initial non-optimized schema.
By default, ClickHouse maps these to equivalent Nullable types. This is preferred as the schema is based on a sample of the rows only.
Note below we use the glob pattern *.parquet to read all files in the stackoverflow/parquet/posts folder.
By applying our early simple rules to our posts table, we can identify an optimal type for each column:
Column | Is Numeric | Min, Max | Unique Values | Nulls | Comment | Optimized Type |
---|---|---|---|---|---|---|
PostTypeId | Yes | 1, 8 | 8 | No | Enum('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8) | |
AcceptedAnswerId | Yes | 0, 78285170 | 12282094 | Yes | Differentiate Null with 0 value | UInt32 |
CreationDate | No | 2008-07-31 21:42:52.667000000, 2024-03-31 23:59:17.697000000 | - | No | Millisecond granularity is not required, use DateTime | DateTime |
Score | Yes | -217, 34970 | 3236 | No | Int32 | |
ViewCount | Yes | 2, 13962748 | 170867 | No | UInt32 | |
Body | No | - | - | No | String | |
OwnerUserId | Yes | -1, 4056915 | 6256237 | Yes | Int32 | |
OwnerDisplayName | No | - | 181251 | Yes | Consider Null to be empty string | String |
LastEditorUserId | Yes | -1, 9999993 | 1104694 | Yes | 0 is an unused value can be used for Nulls | Int32 |
LastEditorDisplayName | No | - | 70952 | Yes | Consider Null to be an empty string. Tested LowCardinality and no benefit | String |
LastEditDate | No | 2008-08-01 13:24:35.051000000, 2024-04-06 21:01:22.697000000 | - | No | Millisecond granularity is not required, use DateTime | DateTime |
LastActivityDate | No | 2008-08-01 12:19:17.417000000, 2024-04-06 21:01:22.697000000 | - | No | Millisecond granularity is not required, use DateTime | DateTime |
Title | No | - | - | No | Consider Null to be an empty string | String |
Tags | No | - | - | No | Consider Null to be an empty string | String |
AnswerCount | Yes | 0, 518 | 216 | No | Consider Null and 0 to same | UInt16 |
CommentCount | Yes | 0, 135 | 100 | No | Consider Null and 0 to same | UInt8 |
FavoriteCount | Yes | 0, 225 | 6 | Yes | Consider Null and 0 to same | UInt8 |
ContentLicense | No | - | 3 | No | LowCardinality outperforms FixedString | LowCardinality(String) |
ParentId | No | - | 20696028 | Yes | Consider Null to be an empty string | String |
CommunityOwnedDate | No | 2008-08-12 04:59:35.017000000, 2024-04-01 05:36:41.380000000 | - | Yes | Consider default 1970-01-01 for Nulls. Millisecond granularity is not required, use DateTime | DateTime |
ClosedDate | No | 2008-09-04 20:56:44, 2024-04-06 18:49:25.393000000 | - | Yes | Consider default 1970-01-01 for Nulls. Millisecond granularity is not required, use DateTime | DateTime |
Identifying the type for a column relies on understanding its numeric range and number of unique values. To find the range of all columns, and the number of distinct values, users can use the simple query SELECT * APPLY min, * APPLY max, * APPLY uniq FROM table FORMAT Vertical
. We recommend performing this over a smaller subset of the data as this can be expensive.
This results in the following optimized schema (with respect to types):
Avoid Nullable columns
Nullable
column (e.g. Nullable(String)
) creates a separate column of UInt8
type. This additional column has to be processed every time a user works with a Nullable column. This leads to additional storage space used and almost always negatively affects performance.
To avoid Nullable
columns, consider setting a default value for that column. For example, instead of:
use
Consider your use case, a default value may be inappropriate.