OLAP on Tap: Much Ado About Nulling

Published Wed, October 8, 2025 ∙ OLAP, ClickHouse ∙ by Johanan Ottensooser

In the OLTP world, NULL is virtually costless — and often helpful. It preserves semantic nuance in your data: NULL doesn’t mean FALSE or 0; it just means unknown or not yet provided.

Because OLTP systems typically read or mutate individual rows rather than scan entire tables, the overhead of storing or checking for NULLs is negligible in most workloads. Even if there’s a small cost to skipping or branching on nulls, it’s triggered so rarely (per-row lookups, point updates) that it’s inconsequential.

That’s why most OLTP databases default columns to nullable unless you explicitly define them as NOT NULL.

This is generally great, it is super flexible, and it preserves the semantic meaning of NULL.

Uno reverso: OLAP NULLs

As I learned when working with ClickHouse and MooseStack, this is absolutely not the case whatsoever with OLAP.

OLAP columns assume they don't have NULL in them. They assume this so hard that you have to explicitly tell them that you might have a nullable(Type). And even then, my gut tells me that they don’t like it. You can’t have nullable nested fields. You can’t have nullable keys or order by fields (with most columns), and you need to read an entire column to be able to find nulls within it.

NULLs also hurt the performance of queries significantly.

Everything in the user experience is trying to kindly shove you away from using them.

Why OLAP and NULL aren’t a happy match

See CH docs: https://clickhouse.com/docs/sql-reference/data-types/nullable

Flip the context (from row-based updates to columnar scans) and NULL stops being a free lunch.

The intuitive, high level explanation of why NULLs aren’t well loved in the OLAP world, is that, as discussed in all the previous OLAP on Tap articles, OLAP is generally columnar, and requires reading huge amounts of data from a column efficiently to thrive.

If you have nulls in a column, the data in that column is not all of the same type. You have your typed data, and you have your place-holding sentinel, representing the NULL. Compression breaks. SIMD operations break. You get a bunch of inefficiency.

Under the hood (and I have only a very high level understanding of this, so please correct me if I oversimplified):

  • ClickHouse stores all data for a column together on disk
  • Except for NULLs, which it stores differently: it stores a separate map of where each null is in the column, such that other data in the column are all compliant with the type constraints of that column

This means, when the database engine is processing a column with nulls, it isn’t batching vectors and doing SIMD operations on the data:

  1. It is looking at each datapoint in the column
  2. Seeing if that datapoint is a “normal” one, or a null, and branching into one of two operations
  3. The calculation over the column, or the appropriate way to skip that null in the calculation

What you can do instead: default

See CH docs: https://clickhouse.com/docs/sql-reference/statements/create/table#default_values

If the compute and performance cost of adding nullability to your column are too great, or if you need to use the column as a key, or an order by field, you may want to define a default value on write, rather than allow for NULLs.

Trade off: be careful about your default value, and consider the way your data is going to be queried. If you are averaging a column full of numbers, having 20% of them be NULL and ignored by the calculation is very very different to having 20% of them be 0 and ruin your average.

Exception

If your column is mainly NULL, some of the sources I read say 95%+; having a nullable column can actually be quite performant, since that null bitmap will be cheaper than a bunch of 0 data in the column.

TL;DR

Like pretty much everything we learned, you sacrifice write time freedom for read time performance (and in this case, storage efficiency too).

In OLAP, NULL isn’t free — it’s a tab you’ll pay at query time.

Heuristics

ScenarioRecommended StrategyRationale / Trade-offs
🟢 Column is always presentNOT NULL (default)Simplest & fastest — no null bitmap, full SIMD & compression
🟢 Occasional blanks, small % (<5%)Use DEFAULT valueAvoids null map overhead; document chosen sentinel (e.g. 0, 'UNKNOWN')
🟡 Semantically meaningful “unknown”Nullable(Type) + guard in queryPreserve semantics, accept slower scans; avoid as key/ORDER BY
🟡 Needs to be key or in ORDER BYNon-nullable + DEFAULTKeys can’t be NULL; encode missing as sentinel
🔴 High frequency of nulls (5–95%)❌ Avoid. Does this need to be in your OLAP table? Is there a suitable DEFAULT?You’ll pay the branching + null map cost on every vector
🟢 Mostly nulls (95%+)Nullable(Type)Null bitmap cheaper than storing many zeroes; scans skip efficiently
🟡 Nested or Array fields❌ Disallow nulls insideClickHouse Nested() & arrays can’t hold NULL — use empty defaults
🟢 JSON or Object('json') columnsKeep path typed, use defaults or assumeNotNull()Subcolumns infer types; avoid per-row null parsing

Agent Docs

🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/4-managing-NULL