Skip to main content
Skip to main content

Select Data Types

One of the core reasons for ClickHouse's query performance is its efficient data compression. Less data on disk results in faster queries and inserts by minimizing I/O overhead. ClickHouse's column-oriented architecture naturally arranges similar data adjacently, enabling compression algorithms and codecs to reduce data size dramatically. To maximize these compression benefits, it's essential to carefully choose appropriate data types.

Compression efficiency in ClickHouse depends mainly on three factors: the ordering key, data types, and codecs, all defined through the table schema. Choosing optimal data types yields immediate improvements in both storage and query performance.

Some straightforward guidelines can significantly enhance the schema:

  • Use Strict Types: Always select the correct data type for columns. Numeric and date fields should use appropriate numeric and date types rather than general-purpose String types. This ensures correct semantics for filtering and aggregations.

  • Avoid Nullable Columns: Nullable columns introduce additional overhead by maintaining separate columns for tracking null values. Only use Nullable if explicitly required to distinguish between empty and null states. Otherwise, default or zero-equivalent values typically suffice. For further information on why this type should be avoided unless needed, see Avoid Nullable Columns.

  • Minimize Numeric Precision: Select numeric types with minimal bit-width that still accommodate the expected data range. For instance, prefer UInt16 over Int32 if negative values aren't needed, and the range fits within 0–65535.

  • Optimize Date and Time Precision: Choose the most coarse-grained date or datetime type that meets query requirements. Use Date or Date32 for date-only fields, and prefer DateTime over DateTime64 unless millisecond or finer precision is essential.

  • Leverage LowCardinality and Specialized Types: For columns with fewer than approximately 10,000 unique values, use LowCardinality types to significantly reduce storage through dictionary encoding. Similarly, use FixedString only when the column values are strictly fixed-length strings (e.g., country or currency codes), and prefer Enum types for columns with a finite set of possible values to enable efficient storage and built-in data validation.

  • Enums for data validation: The Enum type can be used to efficiently encode enumerated types. Enums can either be 8 or 16 bits, depending on the number of unique values they are required to store. Consider using this if you need either the associated validation at insert time (undeclared values will be rejected) or wish to perform queries which exploit a natural ordering in the Enum values e.g. imagine a feedback column containing user responses Enum(':(' = 1, ':|' = 2, ':)' = 3).

Example

ClickHouse offers built-in tools to streamline type optimization. For example, schema inference can automatically identify initial types. Consider the Stack Overflow dataset, publicly available in Parquet format. Running a simple schema inference via the DESCRIBE command provides an initial non-optimized schema.

note

By default, ClickHouse maps these to equivalent Nullable types. This is preferred as the schema is based on a sample of the rows only.

note

Note below we use the glob pattern *.parquet to read all files in the stackoverflow/parquet/posts folder.

By applying our early simple rules to our posts table, we can identify an optimal type for each column:

ColumnIs NumericMin, MaxUnique ValuesNullsCommentOptimized Type
PostTypeIdYes1, 88NoEnum('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8)
AcceptedAnswerIdYes0, 7828517012282094YesDifferentiate Null with 0 valueUInt32
CreationDateNo2008-07-31 21:42:52.667000000, 2024-03-31 23:59:17.697000000-NoMillisecond granularity is not required, use DateTimeDateTime
ScoreYes-217, 349703236NoInt32
ViewCountYes2, 13962748170867NoUInt32
BodyNo--NoString
OwnerUserIdYes-1, 40569156256237YesInt32
OwnerDisplayNameNo-181251YesConsider Null to be empty stringString
LastEditorUserIdYes-1, 99999931104694Yes0 is an unused value can be used for NullsInt32
LastEditorDisplayNameNo-70952YesConsider Null to be an empty string. Tested LowCardinality and no benefitString
LastEditDateNo2008-08-01 13:24:35.051000000, 2024-04-06 21:01:22.697000000-NoMillisecond granularity is not required, use DateTimeDateTime
LastActivityDateNo2008-08-01 12:19:17.417000000, 2024-04-06 21:01:22.697000000-NoMillisecond granularity is not required, use DateTimeDateTime
TitleNo--NoConsider Null to be an empty stringString
TagsNo--NoConsider Null to be an empty stringString
AnswerCountYes0, 518216NoConsider Null and 0 to sameUInt16
CommentCountYes0, 135100NoConsider Null and 0 to sameUInt8
FavoriteCountYes0, 2256YesConsider Null and 0 to sameUInt8
ContentLicenseNo-3NoLowCardinality outperforms FixedStringLowCardinality(String)
ParentIdNo-20696028YesConsider Null to be an empty stringString
CommunityOwnedDateNo2008-08-12 04:59:35.017000000, 2024-04-01 05:36:41.380000000-YesConsider default 1970-01-01 for Nulls. Millisecond granularity is not required, use DateTimeDateTime
ClosedDateNo2008-09-04 20:56:44, 2024-04-06 18:49:25.393000000-YesConsider default 1970-01-01 for Nulls. Millisecond granularity is not required, use DateTimeDateTime
tip

Identifying the type for a column relies on understanding its numeric range and number of unique values. To find the range of all columns, and the number of distinct values, users can use the simple query SELECT * APPLY min, * APPLY max, * APPLY uniq FROM table FORMAT Vertical. We recommend performing this over a smaller subset of the data as this can be expensive.

This results in the following optimized schema (with respect to types):

Avoid Nullable columns

Nullable column (e.g. Nullable(String)) creates a separate column of UInt8 type. This additional column has to be processed every time a user works with a Nullable column. This leads to additional storage space used and almost always negatively affects performance.

To avoid Nullable columns, consider setting a default value for that column. For example, instead of:

use

Consider your use case, a default value may be inappropriate.