In previous articles, I wrote about normalization and nestedness. Today, I’m going to dive into something mentioned in the nestedness article a little bit deeper: ClickHouse’s new JSON column type, and how this can help obviate a bunch of the pain we spoke about.

It may be that this should have been the next article in the series, but the discussion of NULLs will actually end up being quite useful!

The JSON type is a really interesting approach by ClickHouse that allows you a bunch of the flexibility of JSON, whilst offering two tools that allow you to maintain some of the primary benefits of OLAP.

Problem: scanning / retrieving / processing raw JSON nullifies the benefits of OLAP

JSON, especially the JSON that you wouldn’t have already extracted fields from, is usually super high cardinality. It is terrible to order by, to select, to run operations on. Pretty much the worst thing you can do to an OLAP database.

Before ClickHouse’s new JSON Type, you would use ETL to extract whatever data you wanted to ORDER BY or SELECT or analyze, and then just shove the rest of the JSON in a column that you prayed you never had to access at scale.

ClickHouse’s new `JSON` type

ClickHouse’s new JSON type allows you to do some of this natively, in database, on write, explicitly. And when you don’t do it explicitly, it tries its best on your behalf.

It does this by explicitly or implicitly declaring JSON paths as subcolumns.

Lets take the order with line items like the one form the nestedness article (minor change to highlight the point of this!):

Here, we have an order, with line items, with details.

As discussed in the nesting article (and assuming we are primarily doing order grain analysis), it is probably best to extract as much as possible relevant to that grain:

But the line items can’t easily be extracted like this. Lets see how the JSON type can help.

Explicit subcolumn declaration

The JSON type allows for two treatments of paths. The first is explicit subcolumn declaration. Here, you tell ClickHouse how to declare a particular JSON path as a subcolumn. Let’s see how that looks when we apply that to the main information in a line item:

Here, we say that there’s going to be a JSON object called line_items that we want to store. In that object, we tell ClickHouse about the sku, quantity and price paths, and their types. So ClickHouse creates those sub-columns accordingly, with functionally equivalent scan and retrieval performance as if they were extracted columns.

Clickhouse is essentially doing this:

order_id	line_items.sku	line_items.quantity
`ORD-1001`	`['SKU-001','SKU-002']`	`[1,2]`
`ORD-1002`	`['SKU-003']`	`[1]`

Implicit subcolumn declaration

For any path that you don’t tell ClickHouse about (up to the path limit, which you can set explicitly, but is by default set to 1024 paths), ClickHouse tries its best to infer the type. Even if it successfully does so, though, there are performance hits to scanning or retrieving from these inferred subcolumns.

Even when ClickHouse correctly infers the type, these inferred subcolumns cost CPU — type inference and decoding happen at read time. That means you lose the usual OLAP advantage of doing minimal work on scan.

Implicit paths also can’t use codecs or low-cardinality encodings, so you miss out on columnar compression wins.

Let’s see how you use it. Here’s how that looks when you define only the typed paths and let ClickHouse infer the rest.

You don’t have to tell it to type the details, and they remain dynamic (its OK that one of the line items has an average rating, and the other doesn’t).

From the above (assuming two rows), you’d functionally get this:

order_id	line_items.sku	line_items.quantity	line_items.price	line_items.details.average_rating
`ORD-1001`	`['SKU-001','SKU-002']`	`[1,2]`	`[19.99,34.99]`	`[4.2,null]`
`ORD-1002`	`['SKU-003']`	`[1]`	`[19.99]`	`[null]`

As you can see, the dynamic JSON is dealt with in the arrays of the implied subcolumns ( where the data was missing, ClickHouse NULLed the data.

What the NULL?

Notice how missing data becomes NULL inside the arrays.

And yes, that’s super unusual for ClickHouse: JSON subcolumns are implicitly nullable, even if their declared type is non-Nullable. You need that for the above to work.

What’s happening under the hood is:

Is essentially read by ClickHouse as:

That JSON() is doing a lot of work.

Heuristics / what should you do:

When you have a column that you are going to ORDER BY or frequently operate on, it is best to extract it
If you want to keep it in a JSON type, explicitly type it
If a path is irregular or infrequently touched, feel free to let ClickHouse handle it with its defaults.
If you ignore the JSON type and save the data as a string, you’ll get the worst performance, everything will have to happen on scan/retrieval.

Performance comparison for JSON

Path type	Example	Storage	Performance
JSON Type: Explicitly typed	`line_items.sku`	Typed column file	Fast
JSON Type: Implicit inferred	`line_items.details.average_rating`	Generic dynamic	Medium
Fully raw JSON (no subpath)	`line_items`	Compressed text blob	Slow

Agent docs:

🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/3.1-New-JSON-type.md

OLAP on Tap: Freddy v JSON (the new JSON type)

Problem: scanning / retrieving / processing raw JSON nullifies the benefits of OLAP

ClickHouse’s new `JSON` type

Explicit subcolumn declaration

Implicit subcolumn declaration

What the NULL?

Heuristics / what should you do:

Performance comparison for JSON

Agent docs:

Interested in learning more?

Related posts

OLAP migration complexity is the cost of fast reads

Ship your data with Moose APIs