Shredding Deeply Nested JSON, One Vector at a Time
2023-03-03 by Laurens Kuiper
TL;DR: We've recently improved DuckDB's JSON extension so that JSON files can be directly queried as if they were tables.
DuckDB has a JSON extension that can be installed and loaded through SQL:
INSTALL 'json';
LOAD 'json';
The JSON extension supports various functions to create, read, and manipulate JSON strings.
These functions are similar to the JSON functionality provided by other databases such as PostgreSQL and MySQL.
DuckDB uses yyjson internally to parse JSON, a high-performance JSON library written in ANSI C. Many thanks to the yyjson authors and contributors!
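For a quick taste of the scalar JSON functions (the JSON literal and aliases here are our own, not from the original post):
SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[1]') AS second_element,            -- JSON value 2
       json_extract_string('{"duck": [1, 2, 3]}', '$.duck[1]') AS second_element_str; -- VARCHAR '2'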
Besides these functions, DuckDB is now able to read JSON directly!
This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB's vectors.
The automated schema detection dramatically simplifies working with JSON data, and subsequent queries on DuckDB's vectors are significantly faster!
With the 0.7.0 release, DuckDB has added JSON table functions.
To demonstrate these, we will read todos.json, a fake TODO list containing 200 fake TODO items (only the first two items are shown):
[
{
"userId": 1,
"id": 1,
"title": "delectus aut autem",
"completed": false
},
{
"userId": 1,
"id": 2,
"title": "quis ut nam facilis et officia qui",
"completed": false
},
]
Every TODO item is an entry in the JSON array, but in DuckDB, we need a table where every entry is a row.
This is now (since DuckDB's 0.7.0 release in February 2023) as easy as:
SELECT * FROM 'todos.json';
userId | id | title | completed |
---|---|---|---|
1 | 1 | delectus aut autem | false |
1 | 2 | quis ut nam facilis et officia qui | false |
1 | 3 | fugiat veniam minus | false |
1 | 4 | et porro tempora | true |
1 | 5 | laboriosam mollitia et enim quasi adipisci quia provident illum | false |
(Note: Only 5 rows shown)
Now, finding out which user completed the most TODO items is as simple as:
SELECT userId, sum(completed::int) total_completed
FROM 'todos.json'
GROUP BY userId
ORDER BY total_completed DESC
LIMIT 1;
userId | total_completed |
---|---|
5 | 12 |
Under the hood, DuckDB recognizes the .json file extension in 'todos.json' and calls read_json_auto('todos.json') instead.
This function is similar to our read_csv_auto function, which automatically infers column names and types for CSV files.
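In other words, the query above is equivalent to calling the table function explicitly:
SELECT * FROM read_json_auto('todos.json');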
Like our other table functions, read_json_auto supports reading multiple files by passing a list, e.g., read_json_auto(['file1.json', 'file2.json']), but also globbing, e.g., read_json_auto('file*.json').
DuckDB will read multiple files in parallel.
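For example (the second file name below is hypothetical):
SELECT * FROM read_json_auto(['todos.json', 'more_todos.json']);
SELECT * FROM read_json_auto('todos*.json');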
Not all JSON adheres to the format used in todos.json, which is an array of 'records'.
Newline-delimited JSON, or NDJSON, stores every row on a new line.
DuckDB also supports reading (and writing!) this format.
First, let's write our TODO list as NDJSON:
COPY (SELECT * FROM 'todos.json') TO 'todos2.json';
Again, DuckDB recognizes the .json suffix in the output file and automatically infers that we mean to use (FORMAT JSON).
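Written out explicitly, the same statement would be:
COPY (SELECT * FROM 'todos.json') TO 'todos2.json' (FORMAT JSON);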
The created file looks like this (only the first two records are shown):
{"userId":1,"id":1,"title":"delectus aut autem","accomplished":false}
{"userId":1,"id":2,"title":"quis ut nam facilis et officia qui","accomplished":false}
DuckDB can read this file in exactly the same way as the original one:
SELECT * FROM 'todos2.json';
If your JSON file is newline-delimited, DuckDB can parallelize reading.
This is specified with nd or the lines parameter:
SELECT * FROM read_ndjson_auto('todos2.json');
SELECT * FROM read_json_auto('todos2.json', lines='true');
You can also set lines='auto' to auto-detect whether the JSON file is newline-delimited.
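For example:
SELECT * FROM read_json_auto('todos2.json', lines='auto');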
If you are using the read_json function directly, the format of the JSON can be specified using the json_format parameter.
This parameter defaults to 'auto', which tells DuckDB to infer what kind of JSON we are dealing with.
The first json_format is 'array_of_records', while the second is 'records'.
This can be specified like so:
SELECT * FROM read_json('todos.json', auto_detect=true, json_format='array_of_records');
SELECT * FROM read_json('todos2.json', auto_detect=true, json_format='records');
Other supported formats are 'values' and 'array_of_values', which are similar to 'records' and 'array_of_records'.
However, with these formats, each 'record' is not required to be a JSON object but can also be a JSON array, string, or anything supported in JSON.
What you may also have noticed is the auto_detect parameter.
This parameter tells DuckDB to infer the schema, i.e., determine the names and types of the returned columns.
These can be manually specified like so:
SELECT * FROM read_json('todos.json',
columns={userId: 'INT', id: 'INT', title: 'VARCHAR', completed: 'BOOLEAN'},
json_format='array_of_records');
You don't have to specify all fields, just the ones you are interested in:
SELECT * FROM read_json('todos.json',
columns={userId: 'INT', completed: 'BOOLEAN'},
json_format='array_of_records');
Now that we know how to use the new DuckDB JSON table functions, let's dive into some analytics!
GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
Every hour, a GZIP-compressed, newline-delimited JSON file containing all public events on GitHub is uploaded.
I've downloaded a full day (2023-02-08) of activity using wget and stored the 24 files in a directory called gharchive_gz.
wget https://data.gharchive.org/2023-02-08-0.json.gz
wget https://data.gharchive.org/2023-02-08-1.json.gz
...
wget https://data.gharchive.org/2023-02-08-23.json.gz
Keep in mind that the data is compressed:
$ du -sh gharchive_gz
2.3G gharchive_gz
$ gunzip -dc gharchive_gz/* | wc -c
18396198934
One day of GitHub activity amounts to more than 18GB of JSON, which compresses to 2.3GB with GZIP.
To get a feel of what the data looks like, we run the following query:
SELECT json_group_structure(json)
FROM (
SELECT *
FROM read_ndjson_objects('gharchive_gz/*.json.gz')
LIMIT 2048
);
Here, we use our read_ndjson_objects function, which reads the JSON objects in the files as raw JSON, i.e., as strings.
The query reads the first 2048 records of JSON from the JSON files in the gharchive_gz directory and describes the structure.
You can also directly query the JSON files from GH Archive using DuckDB's httpfs extension, but we will be querying the files multiple times, so it is better to download them in this case.
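As a sketch of that remote option (assuming the httpfs extension is installed; we stick to the local files for the rest of this post):
INSTALL httpfs;
LOAD httpfs;
SELECT count(*) count
FROM read_ndjson_auto('https://data.gharchive.org/2023-02-08-0.json.gz');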
I've formatted the resulting structure using an online JSON formatter & validator:
{
"id":"VARCHAR",
"sort":"VARCHAR",
"actor":{
"id":"UBIGINT",
"login":"VARCHAR",
"display_login":"VARCHAR",
"gravatar_id":"VARCHAR",
"url":"VARCHAR",
"avatar_url":"VARCHAR"
},
"repo":{
"id":"UBIGINT",
"title":"VARCHAR",
"url":"VARCHAR"
},
"payload":{"..."},
"public":"BOOLEAN",
"created_at":"VARCHAR",
"org":{
"id":"UBIGINT",
"login":"VARCHAR",
"gravatar_id":"VARCHAR",
"url":"VARCHAR",
"avatar_url":"VARCHAR"
}
}
I've left "payload" out because it consists of deeply nested JSON, and its formatted structure takes up more than 1000 lines!
So, how many records are we dealing with exactly? Let's count them using DuckDB:
SELECT count(*) count FROM 'gharchive_gz/*.json.gz';
That's around 4.4M daily events, which amounts to almost 200K events per hour.
This query takes around 7.3 seconds on my laptop, a 2020 MacBook Pro with an M1 chip and 16GB of memory.
This is the time it takes to decompress the GZIP compression and parse every JSON record.
To see how much time is spent decompressing GZIP in the query, I've also created a gharchive directory containing the same data but uncompressed.
Running the same query on the uncompressed files takes around 5.4s, almost 2 seconds faster.
So we got faster, but we also read more than 18GB of data from storage, as opposed to 2.3GB when it was compressed.
So, this comparison really depends on the speed of your storage.
I prefer to keep the data compressed.
As a side note, the speed of this query really shows how fast yyjson is!
So, what kind of events are in the GitHub data?
SELECT type, count(*) count
FROM 'gharchive_gz/*.json.gz'
GROUP BY type
ORDER BY count DESC;
type | count |
---|---|
PushEvent | 2359096 |
CreateEvent | 624062 |
PullRequestEvent | 366090 |
IssueCommentEvent | 238660 |
WatchEvent | 231486 |
DeleteEvent | 154383 |
PullRequestReviewEvent | 131107 |
IssuesEvent | 88917 |
PullRequestReviewCommentEvent | 79540 |
ForkEvent | 64233 |
CommitCommentEvent | 36823 |
ReleaseEvent | 23004 |
MemberEvent | 14872 |
PublicEvent | 14500 |
GollumEvent | 8180 |
This query takes around 7.4s, not much more than the count(*) query.
So as we can see, data analysis is very fast once everything has been decompressed and parsed.
The most common event type is the PushEvent, making up more than half of all events: unsurprisingly, people pushing their committed code to GitHub.
The least common event type is the GollumEvent, making up less than 1% of all events, which is the creation or update of a wiki page.
If we want to analyze the same data multiple times, decompressing and parsing it every time is redundant.
Instead, we can create a DuckDB table like so:
CREATE TABLE events AS
SELECT * EXCLUDE (payload)
FROM 'gharchive_gz/*.json.gz';
This takes around 9s if you're using an in-memory database.
If you're using an on-disk database, it takes around 13s and results in a database size of 444MB.
When using an on-disk database, DuckDB ensures the table is persistent and performs all kinds of compression.
Note that we have temporarily ignored the payload field using the convenient EXCLUDE clause.
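Without EXCLUDE, we would have had to spell out the remaining columns by hand:
CREATE TABLE events AS
SELECT id, type, actor, repo, public, created_at, org
FROM 'gharchive_gz/*.json.gz';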
To get a feel of what we read, we can ask DuckDB to describe the table:
DESCRIBE SELECT * FROM events;
This gives us the following:
cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|
0 | id | BIGINT | false | false | |
1 | type | VARCHAR | false | false | |
2 | actor | STRUCT(id UBIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) | false | false | |
3 | repo | STRUCT(id UBIGINT, name VARCHAR, url VARCHAR) | false | false | |
4 | public | BOOLEAN | false | false | |
5 | created_at | TIMESTAMP | false | false | |
6 | org | STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) | false | false |
As we can see, the "actor", "repo" and "org" fields, which are JSON objects, have been converted to DuckDB structs.
The "id" column was a string in the original JSON but has been converted to a BIGINT by DuckDB's automatic type detection.
DuckDB can also detect a few different DATE/TIMESTAMP formats within JSON strings, as well as TIME and UUID.
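The struct fields can be queried directly with dot notation, for example (a quick look at a few columns; the same pattern appears in the queries below):
SELECT type, actor.login, repo.name, created_at
FROM events
LIMIT 5;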
Now that we've created the table, we can analyze it like any other DuckDB table!
Let's see how much activity there was in the duckdb/duckdb GitHub repository on this specific day:
SELECT type, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
GROUP BY type
ORDER BY count DESC;
type | count |
---|---|
PullRequestEvent | 35 |
IssueCommentEvent | 30 |
WatchEvent | 29 |
PushEvent | 15 |
PullRequestReviewEvent | 14 |
IssuesEvent | 9 |
PullRequestReviewCommentEvent | 7 |
ForkEvent | 3 |
That's a lot of pull request activity!
Note that this doesn't mean that 35 pull requests were opened on this day; activity within a pull request is also counted.
If we search through the pull requests for that day, we see that there are only 15.
This is more activity than usual because most of the DuckDB developers were busy fixing bugs for the 0.7.0 release.
Now, let's see who was the most active:
SELECT actor.login, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
  AND type = 'PullRequestEvent'
GROUP BY actor.login
ORDER BY count DESC
LIMIT 5;
login | count |
---|---|
Mytherin | 19 |
Mause | 4 |
carlopi | 3 |
Tmonster | 2 |
lnkuiper | 2 |
As expected, Mark (Mytherin, co-founder of DuckDB Labs) was the most active!
My activity (lnkuiper, software engineer at DuckDB Labs) also shows up.
So far, we have ignored the "payload" of the events.
We've ignored it because the contents of this field differ based on the type of event.
We can see how they differ with the following query:
SELECT json_group_structure(payload) structure
FROM (SELECT *
      FROM read_json(
        'gharchive_gz/*.json.gz',
        columns={
          id: 'BIGINT',
          type: 'VARCHAR',
          actor: 'STRUCT(id UBIGINT,
                         login VARCHAR,
                         display_login VARCHAR,
                         gravatar_id VARCHAR,
                         url VARCHAR,
                         avatar_url VARCHAR)',
          repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
          payload: 'JSON',
          public: 'BOOLEAN',
          created_at: 'TIMESTAMP',
          org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
        },
        lines='true'
      )
      WHERE type = 'WatchEvent'
      LIMIT 2048);
structure |
---|
{"action":"VARCHAR"} |
The "payload"
area is easy for occasions of sort WatchEvent
.
Nevertheless, if we modify the kind to PullRequestEvent
, we get a JSON construction of greater than 500 traces when formatted with a JSON formatter.
We don’t need to look by means of all these fields, so we can not use our automated schema detection, which can attempt to get all of them.
As a substitute, we are able to manually provide the construction of the fields we’re taken with.
DuckDB will skip studying the opposite fields.
One other strategy is to retailer the "payload"
area as DuckDB’s JSON information sort and parse it at question time (see the instance later on this submit!).
I’ve stripped down the JSON construction for the "payload"
of occasions with the kind PullRequestEvent
to the issues I’m really taken with:
{
  "action":"VARCHAR",
  "number":"UBIGINT",
  "pull_request":{
    "url":"VARCHAR",
    "id":"UBIGINT",
    "title":"VARCHAR",
    "user":{
      "login":"VARCHAR",
      "id":"UBIGINT",
    },
    "body":"VARCHAR",
    "created_at":"TIMESTAMP",
    "updated_at":"TIMESTAMP",
    "assignee":{
      "login":"VARCHAR",
      "id":"UBIGINT",
    },
    "assignees":[
      {
        "login":"VARCHAR",
        "id":"UBIGINT",
      }
    ],
  }
}
This is technically not valid JSON because there are trailing commas.
However, we try to allow trailing commas wherever possible in DuckDB, including in JSON!
We can now plug this into the columns parameter of read_json, but we need to convert it to a DuckDB type first.
I'm lazy, so I prefer to let DuckDB do this for me:
SELECT typeof(json_transform('{}', '{
  "action":"VARCHAR",
  "number":"UBIGINT",
  "pull_request":{
    "url":"VARCHAR",
    "id":"UBIGINT",
    "title":"VARCHAR",
    "user":{
      "login":"VARCHAR",
      "id":"UBIGINT",
    },
    "body":"VARCHAR",
    "created_at":"TIMESTAMP",
    "updated_at":"TIMESTAMP",
    "assignee":{
      "login":"VARCHAR",
      "id":"UBIGINT",
    },
    "assignees":[
      {
        "login":"VARCHAR",
        "id":"UBIGINT",
      }
    ],
  }
}'));
This gives us back a DuckDB type that we can plug into our function!
Note that because we are not auto-detecting the schema, we have to supply timestampformat to be able to parse the timestamps correctly.
The key "user" must be surrounded by quotes because it is a reserved keyword in SQL:
CREATE TABLE pr_events AS
SELECT *
FROM read_json(
  'gharchive_gz/*.json.gz',
  columns={
    id: 'BIGINT',
    type: 'VARCHAR',
    actor: 'STRUCT(id UBIGINT,
                   login VARCHAR,
                   display_login VARCHAR,
                   gravatar_id VARCHAR,
                   url VARCHAR,
                   avatar_url VARCHAR)',
    repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
    payload: 'STRUCT(
      action VARCHAR,
      number UBIGINT,
      pull_request STRUCT(
        url VARCHAR,
        id UBIGINT,
        title VARCHAR,
        "user" STRUCT(
          login VARCHAR,
          id UBIGINT
        ),
        body VARCHAR,
        created_at TIMESTAMP,
        updated_at TIMESTAMP,
        assignee STRUCT(login VARCHAR, id UBIGINT),
        assignees STRUCT(login VARCHAR, id UBIGINT)[]
      )
    )',
    public: 'BOOLEAN',
    created_at: 'TIMESTAMP',
    org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
  },
  json_format='records',
  lines='true',
  timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';
This query completes in around 36s with an on-disk database (the resulting size is 478MB) and 9s with an in-memory database.
If you don't care about preserving insertion order, you can speed the query up with this setting:
SET preserve_insertion_order=false;
With this setting, the query completes in around 27s with an on-disk database and 8.5s with an in-memory database.
The difference between the on-disk and in-memory case is quite substantial here because DuckDB has to compress and persist much more data.
Now we can analyze pull request events! Let's see what the maximum number of assignees is:
SELECT max(length(payload.pull_request.assignees)) max_assignees
FROM pr_events;
That's a lot of people reviewing a single pull request!
We can check who was assigned the most:
WITH assignees AS (
  SELECT payload.pull_request.assignee.login assignee
  FROM pr_events
  UNION ALL
  SELECT unnest(payload.pull_request.assignees).login assignee
  FROM pr_events
)
SELECT assignee, count(*) count
FROM assignees
WHERE assignee NOT NULL
GROUP BY assignee
ORDER BY count DESC
LIMIT 5;
assignee | count |
---|---|
poad | 494 |
vinayakkulkarni | 268 |
tmtmtmtm | 198 |
fisker | 98 |
icemac | 84 |
That's a lot of assignments!
Although I suspect there are duplicates in here.
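One way to reduce the double counting would be to count each assignee at most once per pull request, for example (a rough sketch of my own, not verified against the data):
WITH assignees AS (
  SELECT payload.pull_request.id pr_id,
         payload.pull_request.assignee.login assignee
  FROM pr_events
  UNION ALL
  SELECT payload.pull_request.id pr_id,
         unnest(payload.pull_request.assignees).login assignee
  FROM pr_events
)
SELECT assignee, count(DISTINCT pr_id) count
FROM assignees
WHERE assignee NOT NULL
GROUP BY assignee
ORDER BY count DESC
LIMIT 5;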
Specifying the JSON schema of the "payload" field was helpful because it allowed us to directly analyze what is there, and subsequent queries are much faster.
Still, it can also be quite cumbersome if the schema is complex.
If you don't want to specify the schema of a field, you can set the type to 'JSON':
CREATE TABLE pr_events AS
SELECT *
FROM read_json(
  'gharchive_gz/*.json.gz',
  columns={
    id: 'BIGINT',
    type: 'VARCHAR',
    actor: 'STRUCT(id UBIGINT,
                   login VARCHAR,
                   display_login VARCHAR,
                   gravatar_id VARCHAR,
                   url VARCHAR,
                   avatar_url VARCHAR)',
    repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
    payload: 'JSON',
    public: 'BOOLEAN',
    created_at: 'TIMESTAMP',
    org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
  },
  json_format='records',
  lines='true',
  timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';
This will load the "payload" field as a JSON string, and we can use DuckDB's JSON functions to analyze it when querying.
For example:
SELECT DISTINCT payload->>'action' AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;
The ->> arrow is short-hand for our json_extract_string function.
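Its sibling -> is short-hand for json_extract, which returns JSON rather than VARCHAR. A tiny illustration with a literal of our own:
SELECT '{"action": "opened"}'::JSON->'action' AS as_json,     -- JSON value '"opened"'
       '{"action": "opened"}'::JSON->>'action' AS as_varchar; -- VARCHAR value 'opened'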
Creating the entire "payload" field as a column with type JSON is not the most efficient way to get just the "action" field, but this example is just to show the flexibility of read_json.
The query results in the following table:
action | count |
---|---|
opened | 189096 |
closed | 174914 |
reopened | 2080 |
As we can see, only a few pull requests were reopened.
DuckDB aims to be an easy-to-use tool that can read all kinds of data formats.
In the 0.7.0 release, we have added support for reading JSON.
JSON comes in many formats and all kinds of schemas.
DuckDB's rich support for nested types (LIST, STRUCT) allows it to fully 'shred' the JSON to a columnar format for more efficient analysis.
We are excited to hear what you think about our new JSON functionality.
If you have any questions or suggestions, please reach out to us on Discord or GitHub!