
Summing columns in remote Parquet files using DuckDB

2023-11-14 17:03:08

vivym/midjourney-messages on Hugging Face is a big (~8GB) dataset consisting of 55,082,563 Midjourney images – each one with the prompt and a URL to the image hosted on Discord.

TLDR: Each record links to a Discord CDN URL, and the total size of all of those images is 148.07 TB – so Midjourney has cost Discord a LOT of money in CDN costs!

Read on for the details of how I figured this out.

Each of the records looks like this (converted to JSON):

{
  "id": "1144508197969854484",
  "channel_id": "989268300473192561",
  "content material": "**grownup Goku in Dragonball Z, strolling on a seashore, in a Akira Toriyama anime type** - Picture #1 <@1016225582566101084>",
  "timestamp": "2023-08-25T05:46:58.330000+00:00",
  "image_id": "1144508197693046875",
  "peak": 1024,
  "width": 1024,
  "url": "https://cdn.discordapp.com/attachments/989268300473192561/1144508197693046875/anaxagore54_adult_Goku_in_Dragonball_Z_walking_on_a_beach_in_a__987e6fd5-64a1-43f6-83dd-c58d2eb42948.png",
  "dimension": 1689284
}

The records are hosted on Hugging Face as Parquet files – 56 of them, each around 160MB. Here's the full list – they're hosted in Git LFS, but Hugging Face also offers an HTTP download link.

I wanted to total up the size column there to see how much space on Discord's CDN is taken up by these Midjourney images. But I didn't want to download all 8GB of data just to run that query.

DuckDB can query remotely hosted Parquet files and use HTTP Range header tricks to retrieve just the subset of data needed to answer a query. It should be able to sum up the size column without downloading the whole dataset.

The URL to each Parquet file looks like this (this actually redirects to a final URL, but DuckDB seems to be able to follow those redirects transparently):

https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet
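
You can sanity-check that the server honors Range requests with curl before involving DuckDB at all – this is just an illustrative check, not a required step. The Range header asks for only the first 100 bytes, and a 206 Partial Content response (after following the redirect) confirms partial fetches work:

curl -sL -o /dev/null -D - -H 'Range: bytes=0-99' \
  'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'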

Querying a single file

Here's a DuckDB query that calculates the sum of that size column without retrieving the entire file:

SELECT SUM(size) FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet';

Running that in the duckdb CLI tool gives me this result:

┌───────────────┐
│   sum(size)   │
│    int128     │
├───────────────┤
│ 3456458790156 │
└───────────────┘

That's 3.14TB, just for the first file.

Monitoring network usage with nettop

How many bytes of data did that retrieve? We can find out on macOS using the nettop command.

First we need the PID of our duckdb process:

ps aux | grep duckdb
simon            19992   0.0  0.0 408644352   1520 s114  S+    2:30PM   0:00.00 grep duckdb
simon            19985   0.0  0.0 408527616   4752 s118  S+    2:30PM   0:00.01 duckdb

Now we can run something like the following, before we execute that SQL query, to watch network traffic for just that process (using the duckdb PID from the ps output above):
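
nettop -p 19985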

Then I ran the SQL query, and saw this in the output from nettop:

5331 KiB

So DuckDB retrieved 5.3MB of data (from a file that was 159MB) to answer that query. Parquet stores data column-by-column with metadata in a footer, so DuckDB only needs to read the footer plus the chunks for the size column.

Using UNION ALL

How about the total across all 56 files? I generated this UNION ALL query to answer that question:

SELECT SUM(size_total)
FROM (
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
);
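
A query like this is easier to generate than to type. One way (just a sketch of one option, not necessarily how I built it) is to have DuckDB itself assemble the body, using generate_series() to produce the numbers and string_agg() to join the SELECT statements:

SELECT string_agg(
    'SELECT SUM(size) AS size_total FROM ''https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/'
        || format('{:06d}', n) || '.parquet''',
    ' UNION ALL '
) AS query_body
FROM generate_series(0, 55) t(n);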

Using read_parquet()

Update: Alex Monahan tipped me off to a more concise alternative for the same query:

SELECT SUM(size)
FROM read_parquet([
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
]);

This version displays a helpful progress bar while the query is executing:

The results of the query with a progress bar at 100%

Using list_transform()

@adityawarmanfw shared this much more elegant solution:

SELECT
    SUM(size) AS size
FROM read_parquet(
    list_transform(
        generate_series(0, 55),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);

That's using a DuckDB lambda function – really neat!
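
If you want to see what that lambda produces before pointing read_parquet() at it, you can evaluate just the list expression on its own – a trivial illustrative query, trimmed here to three entries and dropping the URL prefix for readability:

select list_transform(
    generate_series(0, 2),
    n -> format('{:06d}', n) || '.parquet'
) as filenames;
-- ['000000.parquet', '000001.parquet', '000002.parquet']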

To measure the data transferred, I ran the query in a fresh DuckDB instance with nettop watching the network traffic. Here's what that looked like while it was running:

Animated GIF of nettop showing different connections being made and how much bandwidth is used for each one

The total data transferred was 287 MiB – still a lot of data, but a big saving compared to the full 8GB.

That's also around what I would expect for 56 files, given that a single file fetched 5.3MB earlier and 5.3 * 56 = 296.8.

The end result

The result it gave me was:

┌─────────────────┐
│ sum(size_total) │
│     int128      │
├─────────────────┤
│ 162800469938172 │
└─────────────────┘

That's 162800469938172 bytes – or 148.07 TB of CDN space used by Midjourney images!

(I got ChatGPT to build me a tool for converting bytes to KB/MB/GB/TB: Byte Size Converter.)
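
You can also check that conversion directly in DuckDB. Worth noting: 148.07 is the binary (1024-based) interpretation of that byte count; in decimal units it comes to about 162.8 TB:

SELECT 162800469938172 / pow(1024, 4) AS binary_tb,   -- ≈ 148.07
       162800469938172 / pow(1000, 4) AS decimal_tb;  -- ≈ 162.80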

CTEs and views work too

You can run this query using a CTE to make it nicer to read:


with midjourney_messages as (
    select
        *
    from read_parquet(
        list_transform(
            generate_series(0, 2),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    )
)
select sum(size) as size from midjourney_messages;

(I used generate_series(0, 2) here instead of (0, 55) to speed up these subsequent experiments.)

Or you can define a view, which lets you refer to midjourney_messages in multiple queries. This is a bad idea though, as each time you execute a query against the view it will download the data again:

create view midjourney_messages as
select * from read_parquet(
    list_transform(
        generate_series(0, 2),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);

Running create view here transferred 37 KiB according to nettop – presumably from loading metadata in order to be able to answer describe queries like this one:

describe midjourney_messages;

Output:

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ VARCHAR     │ YES     │         │         │       │
│ channel_id  │ VARCHAR     │ YES     │         │         │       │
│ content     │ VARCHAR     │ YES     │         │         │       │
│ timestamp   │ VARCHAR     │ YES     │         │         │       │
│ image_id    │ VARCHAR     │ YES     │         │         │       │
│ height      │ BIGINT      │ YES     │         │         │       │
│ width       │ BIGINT      │ YES     │         │         │       │
│ url         │ VARCHAR     │ YES     │         │         │       │
│ size        │ BIGINT      │ YES     │         │         │       │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘

nettop showed that each time I ran this query:

select count(*) from midjourney_messages;

Roughly 114 KiB of data was fetched.

parquet_metadata()

chrisjc tipped me off about the parquet_metadata() function, which can be used like this:

select * from parquet_metadata(
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
);

That returns the following, after fetching 37 KiB:

┌──────────────────────┬──────────────┬────────────────────┬───┬──────────────────┬──────────────────────┬──────────────────────┐
│      file_name       │ row_group_id │ row_group_num_rows │ … │ data_page_offset │ total_compressed_s…  │ total_uncompressed…  │
│       varchar        │    int64     │       int64        │   │      int64       │        int64         │        int64         │
├──────────────────────┼──────────────┼────────────────────┼───┼──────────────────┼──────────────────────┼──────────────────────┤
│ https://huggingfac…  │            0 │            1000000 │ … │           601280 │             13133418 │             23093988 │
│ https://huggingfac…  │            0 │            1000000 │ … │         13133571 │                  116 │                  112 │
│ https://huggingfac…  │            0 │            1000000 │ … │         13396455 │             46191873 │            208657682 │
│ https://huggingfac…  │            0 │            1000000 │ … │         59593218 │              9046231 │             36052113 │
│ https://huggingfac…  │            0 │            1000000 │ … │         68973087 │             13118570 │             23093988 │
│ https://huggingfac…  │            0 │            1000000 │ … │         81491806 │               498549 │               915584 │
│ https://huggingfac…  │            0 │            1000000 │ … │         81990515 │               496767 │               916607 │
│ https://huggingfac…  │            0 │            1000000 │ … │         82909448 │             71430496 │            180090922 │
│ https://huggingfac…  │            0 │            1000000 │ … │        154593238 │              5392260 │              8286381 │
├──────────────────────┴──────────────┴────────────────────┴───┴──────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows                                                                                                    23 columns (6 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Since some of the columns here were truncated in the middle, I typed .columns to switch modes and ran the query again:

┌──────────────────────┬─────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│        Column        │  Type   │        Row 1         │ … │        Row 7         │        Row 8         │        Row 9         │
├──────────────────────┼─────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│ file_name            │ varchar │  https://huggingfac… │ … │  https://huggingfac… │  https://huggingfac… │  https://huggingfac… │
│ row_group_id         │ int64   │                    0 │ … │                    0 │                    0 │                    0 │
│ row_group_num_rows   │ int64   │              1000000 │ … │              1000000 │              1000000 │              1000000 │
│ row_group_num_colu…  │ int64   │                    9 │ … │                    9 │                    9 │                    9 │
│ row_group_bytes      │ int64   │            481107377 │ … │            481107377 │            481107377 │            481107377 │
│ column_id            │ int64   │                    0 │ … │                    6 │                    7 │                    8 │
│ file_offset          │ int64   │             13133422 │ … │             82486423 │            153917026 │            159309733 │
│ num_values           │ int64   │              1000000 │ … │              1000000 │              1000000 │              1000000 │
│ path_in_schema       │ varchar │                   id │ … │                width │                  url │                 size │
│ type                 │ varchar │           BYTE_ARRAY │ … │                INT64 │           BYTE_ARRAY │                INT64 │
│ stats_min            │ varchar │                      │ … │                  256 │                      │                  312 │
│ stats_max            │ varchar │                      │ … │                 9408 │                      │             17937790 │
│ stats_null_count     │ int64   │                    0 │ … │                    0 │                    0 │                    0 │
│ stats_distinct_count │ int64   │                      │ … │                      │                      │                      │
│ stats_min_value      │ varchar │  1089054097631629352 │ … │                  256 │  https://cdn.discor… │                  312 │
│ stats_max_value      │ varchar │  1144508197969854484 │ … │                 9408 │  https://cdn.discor… │             17937790 │
│ compression          │ varchar │               SNAPPY │ … │               SNAPPY │               SNAPPY │               SNAPPY │
│ encodings            │ varchar │  PLAIN, RLE, RLE_DI… │ … │  PLAIN, RLE, RLE_DI… │  PLAIN, RLE, RLE_DI… │  PLAIN, RLE, RLE_DI… │
│ index_page_offset    │ int64   │                      │ … │                      │                      │                      │
│ dictionary_page_of…  │ int64   │                    4 │ … │             81989656 │             82486530 │            153917473 │
│ data_page_offset     │ int64   │               601280 │ … │             81990515 │             82909448 │            154593238 │
│ total_compressed_s…  │ int64   │             13133418 │ … │               496767 │             71430496 │              5392260 │
│ total_uncompressed…  │ int64   │             23093988 │ … │               916607 │            180090922 │              8286381 │
├──────────────────────┴─────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows  (4 shown)                                                                                                    23 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Typing .rows switches the mode back to the default again.
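
That metadata is enough to see where the 159MB in each file actually goes. Here's a quick follow-up query (my own sketch, built on the path_in_schema and total_compressed_size columns shown above) that totals the compressed bytes per column – in this file the url and content columns account for most of the space:

select path_in_schema,
       sum(total_compressed_size) as compressed_bytes
from parquet_metadata(
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
)
group by path_in_schema
order by compressed_bytes desc;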

The same trick in ClickHouse

richraposa on Hacker News pointed out that ClickHouse can do the same HTTP Range header trick:

SELECT sum(size)
    FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')

Output:

┌───────sum(size)─┐
│ 159344011148016 │
└─────────────────┘

1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
Peak memory usage: 458.88 KiB.

This transfers around 290 MiB, effectively the same as DuckDB. (Note that the {01..55} pattern skips the first file, 000000.parquet, which is why this total is lower than DuckDB's by exactly 3456458790156 bytes – the sum for that first file.)

Created 2023-11-14T14:43:17-08:00, updated 2023-11-16T19:37:36-08:00


