Summing columns in remote Parquet files using DuckDB
vivym/midjourney-messages on Hugging Face is a large (~8GB) dataset consisting of 55,082,563 Midjourney images – each one with the prompt and a URL to the image hosted on Discord.
TLDR: Each record links to a Discord CDN URL, and the total size of all of those images is 148.07 TB – so Midjourney has cost Discord a LOT of money in CDN costs!
Read on for the details of how I figured this out.
Each of the records looks like this (converted to JSON):
{
"id": "1144508197969854484",
"channel_id": "989268300473192561",
"content material": "**grownup Goku in Dragonball Z, strolling on a seashore, in a Akira Toriyama anime type** - Picture #1 <@1016225582566101084>",
"timestamp": "2023-08-25T05:46:58.330000+00:00",
"image_id": "1144508197693046875",
"peak": 1024,
"width": 1024,
"url": "https://cdn.discordapp.com/attachments/989268300473192561/1144508197693046875/anaxagore54_adult_Goku_in_Dragonball_Z_walking_on_a_beach_in_a__987e6fd5-64a1-43f6-83dd-c58d2eb42948.png",
"dimension": 1689284
}
The records are hosted on Hugging Face as Parquet files – 56 of them, each around 160MB. Here's the full list – they're hosted in Git LFS, but Hugging Face also offers an HTTP download link.
I wanted to total up the size column there to see how much space on Discord's CDN is taken up by these Midjourney images. But I didn't want to download all 8GB of data just to run that query.
DuckDB can query remotely hosted Parquet files and use HTTP Range header tricks to retrieve just the subset of data needed to answer a query. It should be able to sum up the size column without downloading the entire dataset.
The URL to each Parquet file looks like this (this actually redirects to a final URL, but DuckDB seems to be able to follow those redirects transparently):
https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet
Querying a single file
Here's a DuckDB query that calculates the sum of that size column without retrieving the entire file:
SELECT SUM(size) FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet';
Running that in the duckdb CLI tool gives me this result:
┌───────────────┐
│   sum(size)   │
│    int128     │
├───────────────┤
│ 3456458790156 │
└───────────────┘
That's 3.14TB, just for the first file.
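As a sanity check, converting that byte count to terabytes (binary units, so strictly TiB) is a one-liner:

```python
total_bytes = 3456458790156  # sum(size) for 000000.parquet, from the query above

# 1 TB here means 1024**4 bytes (binary units)
print(f"{total_bytes / 1024**4:.2f} TB")  # → 3.14 TB
```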
Tracking network usage with nettop
How many bytes of data did that retrieve? We can find out on macOS using the nettop command.
First we need the PID of our duckdb process:
ps aux | grep duckdb
simon 19992 0.0 0.0 408644352 1520 s114 S+ 2:30PM 0:00.00 grep duckdb
simon 19985 0.0 0.0 408527616 4752 s118 S+ 2:30PM 0:00.01 duckdb
Now, before executing that SQL query, we can point nettop at that process in a separate window (nettop -p 19985, using the PID found above).
Then I ran the SQL query, and saw this in the output from nettop:
5331 KiB
So DuckDB retrieved 5.3MB of data (from a file that was 159MB) to answer that query.
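Put another way, that's only about 3% of the file – a quick calculation (treating the 159MB figure as 159 × 1024 × 1024 bytes, which is an assumption):

```python
fetched_bytes = 5331 * 1024      # 5331 KiB, as reported by nettop
file_bytes = 159 * 1024 * 1024   # the ~159MB Parquet file

print(f"{fetched_bytes / file_bytes:.1%} of the file was fetched")  # → 3.3% of the file was fetched
```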
Using UNION ALL
How about the total across all 56 files? I generated this UNION ALL query to answer that question:
SELECT SUM(size_total)
FROM (
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
);
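I said this query was generated, not hand-typed. As a sketch (assuming nothing beyond the URL pattern and column alias visible in the query above), a few lines of Python can produce it:

```python
# Hypothetical generator for the UNION ALL query above
BASE = (
    "https://huggingface.co/datasets/vivym/midjourney-messages"
    "/resolve/main/data/{:06d}.parquet"
)

# One SELECT per file, numbered 000000.parquet through 000055.parquet
selects = [
    f"SELECT SUM(size) as size_total FROM '{BASE.format(n)}'"
    for n in range(56)
]

query = "SELECT SUM(size_total)\nFROM (\n" + " UNION ALL\n".join(selects) + "\n);"
print(query)
```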
Using read_parquet()
Update: Alex Monahan tipped me off to a more concise alternative for the same query:
SELECT SUM(size)
FROM read_parquet([
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
]);
This version displays a handy progress bar while the query is executing:
Using list_transform()
@adityawarmanfw shared this much more elegant solution:
SELECT
    SUM(size) AS size
FROM read_parquet(
    list_transform(
        generate_series(0, 55),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);
That's using a DuckDB lambda function – really neat!
To measure the network usage, I ran the query in a fresh DuckDB instance with nettop watching the network traffic. Here's what that looked like while it was running:
The total data transferred was 287 MiB – still a lot of data, but a big saving on 8GB.
That's also around what I would expect for 56 files, given that a single file fetched 5.3MB earlier and 5.3 * 56 = 296.8.
The end result
The result it gave me was:
┌─────────────────┐
│ sum(size_total) │
│ int128 │
├─────────────────┤
│ 162800469938172 │
└─────────────────┘
That's 162800469938172 bytes – or 148.07 TB of CDN space used by Midjourney images!
(I got ChatGPT to build me a tool for converting bytes to KB/MB/GB/TB: Byte Size Converter.)
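A minimal Python version of that conversion is only a few lines (this uses binary units, so strictly TiB, which matches the 148.07 figure; the function name is my own):

```python
def human_size(num_bytes: float) -> str:
    """Format a byte count using binary units (1 KB = 1024 bytes here)."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if num_bytes < 1024 or unit == "TB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

print(human_size(162800469938172))  # → 148.07 TB
```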
CTEs and views work too
You can run this query using a CTE to make it nicer to read:
with midjourney_messages as (
    select
        *
    from read_parquet(
        list_transform(
            generate_series(0, 2),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    )
)
select sum(size) as size from midjourney_messages;
(I used generate_series(0, 2) here instead of (0, 55) to speed up these subsequent experiments.)
Or you can define a view, which lets you refer to midjourney_messages in multiple queries. This is a bad idea though, as each time you execute a query against the view it will download the data again:
create view midjourney_messages as
select * from read_parquet(
    list_transform(
        generate_series(0, 2),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);
Running create view here transferred 37 KiB according to nettop – presumably from loading metadata in order to be able to answer describe queries like this one:
describe midjourney_messages;
Output:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ VARCHAR     │ YES     │         │         │       │
│ channel_id  │ VARCHAR     │ YES     │         │         │       │
│ content     │ VARCHAR     │ YES     │         │         │       │
│ timestamp   │ VARCHAR     │ YES     │         │         │       │
│ image_id    │ VARCHAR     │ YES     │         │         │       │
│ height      │ BIGINT      │ YES     │         │         │       │
│ width       │ BIGINT      │ YES     │         │         │       │
│ url         │ VARCHAR     │ YES     │         │         │       │
│ size        │ BIGINT      │ YES     │         │         │       │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
nettop showed that each time I ran this query:
select count(*) from midjourney_messages;
Approximately 114 KiB of data was fetched.
parquet_metadata()
chrisjc tipped me off about the parquet_metadata() function, which can be used like this:
select * from parquet_metadata(
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
);
That returns the following, after fetching 37 KiB:
┌──────────────────────┬──────────────┬────────────────────┬───┬──────────────────┬──────────────────────┬──────────────────────┐
│ file_name │ row_group_id │ row_group_num_rows │ … │ data_page_offset │ total_compressed_s… │ total_uncompressed… │
│ varchar │ int64 │ int64 │ │ int64 │ int64 │ int64 │
├──────────────────────┼──────────────┼────────────────────┼───┼──────────────────┼──────────────────────┼──────────────────────┤
│ https://huggingfac… │ 0 │ 1000000 │ … │ 601280 │ 13133418 │ 23093988 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 13133571 │ 116 │ 112 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 13396455 │ 46191873 │ 208657682 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 59593218 │ 9046231 │ 36052113 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 68973087 │ 13118570 │ 23093988 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 81491806 │ 498549 │ 915584 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 81990515 │ 496767 │ 916607 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 82909448 │ 71430496 │ 180090922 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 154593238 │ 5392260 │ 8286381 │
├──────────────────────┴──────────────┴────────────────────┴───┴──────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows 23 columns (6 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Since among the columns right here had been truncated within the center, I typed .columns
to change modes and ran the question once more:
┌──────────────────────┬─────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Column │ Type │ Row 1 │ … │ Row 7 │ Row 8 │ Row 9 │
├──────────────────────┼─────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│ file_name │ varchar │ https://huggingfac… │ … │ https://huggingfac… │ https://huggingfac… │ https://huggingfac… │
│ row_group_id │ int64 │ 0 │ … │ 0 │ 0 │ 0 │
│ row_group_num_rows │ int64 │ 1000000 │ … │ 1000000 │ 1000000 │ 1000000 │
│ row_group_num_colu… │ int64 │ 9 │ … │ 9 │ 9 │ 9 │
│ row_group_bytes │ int64 │ 481107377 │ … │ 481107377 │ 481107377 │ 481107377 │
│ column_id │ int64 │ 0 │ … │ 6 │ 7 │ 8 │
│ file_offset │ int64 │ 13133422 │ … │ 82486423 │ 153917026 │ 159309733 │
│ num_values │ int64 │ 1000000 │ … │ 1000000 │ 1000000 │ 1000000 │
│ path_in_schema │ varchar │ id │ … │ width │ url │ size │
│ type │ varchar │ BYTE_ARRAY │ … │ INT64 │ BYTE_ARRAY │ INT64 │
│ stats_min │ varchar │ │ … │ 256 │ │ 312 │
│ stats_max │ varchar │ │ … │ 9408 │ │ 17937790 │
│ stats_null_count │ int64 │ 0 │ … │ 0 │ 0 │ 0 │
│ stats_distinct_count │ int64 │ │ … │ │ │ │
│ stats_min_value │ varchar │ 1089054097631629352 │ … │ 256 │ https://cdn.discor… │ 312 │
│ stats_max_value │ varchar │ 1144508197969854484 │ … │ 9408 │ https://cdn.discor… │ 17937790 │
│ compression │ varchar │ SNAPPY │ … │ SNAPPY │ SNAPPY │ SNAPPY │
│ encodings │ varchar │ PLAIN, RLE, RLE_DI… │ … │ PLAIN, RLE, RLE_DI… │ PLAIN, RLE, RLE_DI… │ PLAIN, RLE, RLE_DI… │
│ index_page_offset │ int64 │ │ … │ │ │ │
│ dictionary_page_of… │ int64 │ 4 │ … │ 81989656 │ 82486530 │ 153917473 │
│ data_page_offset │ int64 │ 601280 │ … │ 81990515 │ 82909448 │ 154593238 │
│ total_compressed_s… │ int64 │ 13133418 │ … │ 496767 │ 71430496 │ 5392260 │
│ total_uncompressed… │ int64 │ 23093988 │ … │ 916607 │ 180090922 │ 8286381 │
├──────────────────────┴─────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows (4 shown) 23 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Typing .rows switches the mode back to the default again.
The same trick in ClickHouse
richraposa on Hacker News pointed out that ClickHouse can do the same HTTP Range header trick:
SELECT sum(size)
FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')
Output:
┌───────sum(size)─┐
│ 159344011148016 │
└─────────────────┘
1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
Peak memory usage: 458.88 KiB.
This transfers around 290 MiB, effectively the same as DuckDB.
Related
Created 2023-11-14T14:43:17-08:00, updated 2023-11-16T19:37:36-08:00