Announcing DuckDB 0.8.0

2023-05-17, by Mark Raasveldt and Hannes Mühleisen
The DuckDB team is happy to announce the latest DuckDB release (0.8.0). This release is named “Fulvigula” after the Mottled Duck (Anas fulvigula) native to the Gulf of Mexico.
To install the new version, please visit the installation guide. The full release notes can be found here.
Note: the release build is still in progress, so artifacts and extensions may still be missing.
There have been too many changes to discuss each in detail, but we would like to highlight several particularly exciting features!
- New pivot and unpivot statements
- Improvements to parallel data import/export
- Time series joins
- Recursive globbing
- Lazy-loading of storage metadata for faster startup times
- User-defined functions for Python
- Arrow Database Connectivity (ADBC) support
- New Swift integration
Below is a summary of these new features with examples, starting with two breaking changes in our SQL dialect that are designed to produce more intuitive results by default.
This release includes two breaking changes to the SQL dialect: the division operator uses floating point division by default, and the default null sort order has changed from NULLS FIRST to NULLS LAST. While DuckDB is still in beta, we recognize that many DuckDB queries are already used in production, so the old behavior can be restored using the following settings:
SET integer_division=true;
SET default_null_order='nulls_first';
Division Operator. The division operator / will now always perform floating point division, even with integer parameters. The new operator // retains the old semantics and can be used to perform integer division. This makes DuckDB’s division operator less error-prone for beginners, and consistent with the division operator in Python 3 and other systems in the OLAP space like Spark, Snowflake and BigQuery.
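Since the new semantics match Python 3’s own division operators, plain Python can illustrate the change without DuckDB installed (note that for negative operands Python’s // floors while some SQL engines truncate, so this sketch only covers the non-negative case):

```python
# DuckDB 0.8.0's `/` and `//` behave like Python 3's operators
# for non-negative integers:
print(7 / 2)   # `/` is floating point division, even for integer operands
print(7 // 2)  # `//` is integer division (the old behavior of `/`)
```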
Default Null Sort Order. The default null sort order has changed from NULLS FIRST to NULLS LAST. The rationale for this change is that the NULLS LAST sort order is more intuitive when combined with LIMIT. With NULLS FIRST, Top-N queries always return the NULL values first. With NULLS LAST, the actual Top-N values are returned instead.
CREATE TABLE bigdata(col INTEGER);
INSERT INTO bigdata VALUES (NULL), (42), (NULL), (43);
FROM bigdata ORDER BY col DESC LIMIT 3;
| v0.7.1 | v0.8.0 |
|---|---|
| NULL | 43 |
| NULL | 42 |
| 43 | NULL |
Pivot and Unpivot. Data comes in many shapes and sizes, and we don’t always have control over the process by which it is generated. While SQL is well-suited for reshaping datasets, turning columns into rows or rows into columns is tedious in vanilla SQL. With this release, DuckDB introduces the PIVOT and UNPIVOT statements, which allow reshaping datasets so that rows are turned into columns or vice versa. A key advantage of DuckDB’s syntax is that the column names to pivot or unpivot can be automatically deduced. Here is a short example:
CREATE TABLE sales(year INT, amount INT);
INSERT INTO sales VALUES (2021, 42), (2022, 100), (2021, 42);
PIVOT sales ON year USING SUM(amount);
The documentation contains more examples.
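For intuition, what this PIVOT computes can be sketched in plain Python: group the rows by year, sum the amounts, and turn each distinct year into a column. This is only an illustrative stand-in for the semantics, not DuckDB’s implementation:

```python
from collections import defaultdict

# Rows of the sales table from the example above: (year, amount)
rows = [(2021, 42), (2022, 100), (2021, 42)]

# PIVOT sales ON year USING SUM(amount):
# each distinct year becomes a column holding the summed amount
pivoted = defaultdict(int)
for year, amount in rows:
    pivoted[year] += amount

print(dict(pivoted))  # {2021: 84, 2022: 100}
```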
ASOF Joins for Time Series. When joining time series data with background fact tables, the timestamps often do not match exactly. In this case it is often desirable to join rows so that each timestamp is matched with the nearest timestamp. The ASOF join can be used for this purpose: it performs a fuzzy join to find the closest join partner for each row instead of requiring an exact match.
CREATE TABLE a(ts TIMESTAMP);
CREATE TABLE b(ts TIMESTAMP);
INSERT INTO a VALUES (TIMESTAMP '2023-05-15 10:31:00'), (TIMESTAMP '2023-05-15 11:31:00');
INSERT INTO b VALUES (TIMESTAMP '2023-05-15 10:30:00'), (TIMESTAMP '2023-05-15 11:30:00');
FROM a ASOF JOIN b ON a.ts >= b.ts;
| a.ts | b.ts |
|---|---|
| 2023-05-15 10:31:00 | 2023-05-15 10:30:00 |
| 2023-05-15 11:31:00 | 2023-05-15 11:30:00 |
Please refer to the documentation for a more in-depth explanation.
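The matching rule from the example, pairing each left-hand row with the greatest right-hand timestamp that is less than or equal to it (per the `a.ts >= b.ts` condition), can be sketched in plain Python. This is a simplified illustration of the semantics on sorted inputs, not DuckDB’s implementation:

```python
from bisect import bisect_right
from datetime import datetime

a = [datetime(2023, 5, 15, 10, 31), datetime(2023, 5, 15, 11, 31)]
b = [datetime(2023, 5, 15, 10, 30), datetime(2023, 5, 15, 11, 30)]  # sorted

def asof_join(left, right):
    # For each left timestamp, pick the greatest right timestamp <= it.
    pairs = []
    for ts in left:
        i = bisect_right(right, ts)
        if i > 0:  # skip rows with no earlier right-hand timestamp
            pairs.append((ts, right[i - 1]))
    return pairs

for l, r in asof_join(a, b):
    print(l, '|', r)
```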
Default Parallel CSV Reader. In this release, the parallel CSV reader has been greatly improved and is now the default CSV reader. We would like to thank everyone who tried out the experimental reader for their valuable feedback and reports. The experimental_parallel_csv flag has been deprecated and is no longer required. The parallel CSV reader enables much more efficient reading of large CSV files.
CREATE TABLE lineitem AS FROM 'lineitem.csv';
Parallel Parquet, CSV and JSON Writing. This release includes support for parallel order-preserving writing of Parquet, CSV and JSON files. As a result, writing to these file formats is parallelized by default, even without disabling insertion order preservation, and writing to these formats is greatly sped up.
COPY lineitem TO 'lineitem.csv';
COPY lineitem TO 'lineitem.parquet';
COPY lineitem TO 'lineitem.json';
| Format | v0.7.1 | v0.8.0 |
|---|---|---|
| CSV | 3.9s | 0.6s |
| Parquet | 8.1s | 1.2s |
| JSON | 4.4s | 1.1s |
Recursive File Globbing Using **. This release adds support for recursive globbing, where an arbitrary number of subdirectories can be matched using the ** operator (double-star).
FROM 'data/glob/crawl/stackoverflow/**/*.csv';
The documentation has been updated with numerous examples of this syntax.
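The ** pattern follows the same double-star convention as other tools; for instance, Python’s standard library glob module behaves the same way. The following self-contained sketch (independent of DuckDB, using a temporary directory tree) shows the convention:

```python
import glob
import os
import tempfile

# Build a small directory tree with a nested CSV file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'a', 'b'))
open(os.path.join(root, 'a', 'b', 'x.csv'), 'w').close()

# '**' matches any number of subdirectories (including zero)
# when recursive=True.
matches = glob.glob(os.path.join(root, '**', '*.csv'), recursive=True)
print(matches)
```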
Lazy-Loading Table Metadata. DuckDB’s internal storage format stores metadata for every row group in a table, such as min-max indices and where in the file each row group is stored. In the past, DuckDB would load this metadata immediately once the database was opened. However, once the data gets very large, the metadata can also get quite large, leading to a noticeable delay on database startup. In this release, we have optimized DuckDB’s metadata handling so that table metadata is only read as it is accessed. As a result, startup is near-instantaneous even for large databases, and metadata is only loaded for columns that are actually used in queries. The benchmarks below are for a database file containing a single large TPC-H lineitem table (120x SF1) with ~770 million rows and 16 columns:
| Query | v0.6.1 | v0.7.1 | v0.8.0 | Parquet |
|---|---|---|---|---|
| SELECT 42 | 1.60s | 0.31s | 0.02s | – |
| FROM lineitem LIMIT 1; | 1.62s | 0.32s | 0.03s | 0.27s |
User-Defined Scalar Functions for Python. Arbitrary Python functions can now be registered as scalar functions within SQL queries. This only works when using DuckDB from Python, because it uses the actual Python runtime that DuckDB is running within. While plain Python values can be passed to the function, there is also a vectorized variant that uses PyArrow under the hood for higher efficiency and better parallelism.
import duckdb
from duckdb.typing import *
from faker import Faker

def random_date():
    fake = Faker()
    return fake.date_between()

duckdb.create_function('random_date', random_date, [], DATE)
res = duckdb.sql('select random_date()').fetchall()
print(res)
# [(datetime.date(2019, 5, 15),)]
Arrow Database Connectivity Support (ADBC). ADBC is a database API standard for database access libraries that uses Apache Arrow to transfer query result sets and to ingest data. Using Arrow for this is particularly beneficial for columnar data management systems, which traditionally suffered a performance hit from emulating row-based APIs such as JDBC/ODBC. As of this release, DuckDB natively supports ADBC. We are happy to be one of the first systems to offer native support, and DuckDB’s in-process design fits well with ADBC.
Swift Integration. DuckDB has gained another official language integration: Swift. Swift is a language developed by Apple that is most notably used to create apps for Apple devices, but is also increasingly used for server-side development. The DuckDB Swift API allows developers on all Swift platforms to harness DuckDB using a native Swift interface, with support for Swift features like strong typing and concurrency.
The full release notes can be found on GitHub. We would like to thank all of the contributors for their hard work on improving DuckDB.