In previous articles, I wrote about normalization and nestedness. Today, I’m going to dive into something mentioned in the nestedness article a little bit deeper: ClickHouse’s new JSON
column type, and how this can help obviate a bunch of the pain we spoke about.
It may be that this should have been the next article in the series, but the discussion of NULL
s will actually end up being quite useful!
The JSON type is a really interesting approach by ClickHouse that allows you a bunch of the flexibility of JSON, whilst offering two tools that allow you to maintain some of the primary benefits of OLAP.
JSON, especially the JSON that you wouldn’t have already extracted fields from, is usually super high cardinality. It is terrible to order by, to select, to run operations on. Pretty much the worst thing you can do to an OLAP database.
Before ClickHouse’s new JSON
Type, you would use ETL to extract whatever data you wanted to ORDER BY
or SELECT
or analyze, and then just shove the rest of the JSON in a column that you prayed you never had to access at scale.
JSON
typeClickHouse’s new JSON type allows you to do some of this natively, in database, on write, explicitly. And when you don’t do it explicitly, it tries its best on your behalf.
It does this by explicitly or implicitly declaring JSON paths as subcolumns.
Lets take the order with line items like the one form the nestedness article (minor change to highlight the point of this!):
Here, we have an order, with line items, with details.
As discussed in the nesting article (and assuming we are primarily doing order grain analysis), it is probably best to extract as much as possible relevant to that grain:
But the line items can’t easily be extracted like this. Lets see how the JSON
type can help.
The JSON type allows for two treatments of paths. The first is explicit subcolumn declaration. Here, you tell ClickHouse how to declare a particular JSON path as a subcolumn. Let’s see how that looks when we apply that to the main information in a line item:
Here, we say that there’s going to be a JSON object called line_items
that we want to store. In that object, we tell ClickHouse about the sku
, quantity
and price
paths, and their types. So ClickHouse creates those sub-columns accordingly, with functionally equivalent scan and retrieval performance as if they were extracted columns.
Clickhouse is essentially doing this:
order_id | line_items.sku | line_items.quantity |
---|---|---|
ORD-1001 | ['SKU-001','SKU-002'] | [1,2] |
ORD-1002 | ['SKU-003'] | [1] |
For any path that you don’t tell ClickHouse about (up to the path limit, which you can set explicitly, but is by default set to 1024 paths), ClickHouse tries its best to infer the type. Even if it successfully does so, though, there are performance hits to scanning or retrieving from these inferred subcolumns.
Even when ClickHouse correctly infers the type, these inferred subcolumns cost CPU — type inference and decoding happen at read time. That means you lose the usual OLAP advantage of doing minimal work on scan.
Implicit paths also can’t use codecs or low-cardinality encodings, so you miss out on columnar compression wins.
Let’s see how you use it. Here’s how that looks when you define only the typed paths and let ClickHouse infer the rest.
You don’t have to tell it to type the details, and they remain dynamic (its OK that one of the line items has an average rating, and the other doesn’t).
From the above (assuming two rows), you’d functionally get this:
order_id | line_items.sku | line_items.quantity | line_items.price | line_items.details.average_rating |
---|---|---|---|---|
ORD-1001 | ['SKU-001','SKU-002'] | [1,2] | [19.99,34.99] | [4.2,null] |
ORD-1002 | ['SKU-003'] | [1] | [19.99] | [null] |
As you can see, the dynamic JSON is dealt with in the arrays of the implied subcolumns ( where the data was missing, ClickHouse NULL
ed the data.
Notice how missing data becomes NULL
inside the arrays.
And yes, that’s super unusual for ClickHouse: JSON subcolumns are implicitly nullable, even if their declared type is non-Nullable. You need that for the above to work.
What’s happening under the hood is:
Is essentially read by ClickHouse as:
That JSON()
is doing a lot of work.
ORDER BY
or frequently operate on, it is best to extract itJSON
type, explicitly type itJSON
type and save the data as a string, you’ll get the worst performance, everything will have to happen on scan/retrieval.Path type | Example | Storage | Performance |
---|---|---|---|
JSON Type: Explicitly typed | line_items.sku | Typed column file | Fast |
JSON Type: Implicit inferred | line_items.details.average_rating | Generic dynamic | Medium |
Fully raw JSON (no subpath) | line_items | Compressed text blob | Slow |
🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/3.1-New-JSON-type.md