Multi-Database Support in DuckDB

2024-01-26
Mark Raasveldt

TL;DR: DuckDB can attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner.

DuckDB supports reading from and writing to MySQL, Postgres, and SQLite

In modern data analysis, data must often be combined from a wide variety of different sources. Data might sit in CSV files on your machine, in Parquet files in a data lake, or in an operational database. DuckDB has strong support for moving data between many different data sources. However, this support has previously been limited to reading data from and writing data to files.

DuckDB supports advanced operations on its own native storage format – such as deleting rows, updating values, or altering the schema of a table. It supports all of these operations using ACID semantics. This ensures that your database is always left in a sane state – operations are atomic and do not partially complete.
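For illustration, all of the statements below run with full transactional guarantees on DuckDB's native storage; the table and columns are hypothetical:

CREATE TABLE people (id INTEGER, name VARCHAR);
INSERT INTO people VALUES (1, 'Alice'), (2, 'Bob');
UPDATE people SET name = 'Bobby' WHERE id = 2;  -- update values
DELETE FROM people WHERE id = 1;                -- delete rows
ALTER TABLE people ADD COLUMN age INTEGER;      -- alter the schema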

DuckDB now has a pluggable storage and transactional layer. This flexible layer allows new storage back-ends to be created by DuckDB extensions. These storage back-ends can support all database operations in the same way that DuckDB supports them, including inserting data and even modifying schemas.

The MySQL, Postgres, and SQLite extensions implement this new pluggable storage and transactional layer, allowing DuckDB to connect to these systems and operate on them in the same way that it operates on its own native storage engine.

These extensions enable a variety of useful features. For example, using these extensions you can:

  • Export data from SQLite to JSON
  • Read data from Parquet into Postgres
  • Move data from MySQL to Postgres

… and much more. As a minimal sketch of the first two items (the file, table, and connection names here are illustrative):
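-- export data from SQLite to JSON
ATTACH 'sqlite:source.db' AS sqlite_db;
COPY sqlite_db.some_table TO 'some_table.json';

-- read data from Parquet into Postgres
ATTACH 'postgres:dbname=postgresscanner' AS postgres_db;
CREATE TABLE postgres_db.from_parquet AS FROM 'data.parquet';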

The ATTACH statement can be used to attach a new database to the system. By default, a native DuckDB file will be attached. The TYPE parameter can be used to specify a different storage type. Alternatively, the {type}: prefix can be used. The three forms look as follows; each statement is an alternative, and the file names are illustrative:
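ATTACH 'file.db';                      -- native DuckDB file (the default)
ATTACH 'file.db' AS db (TYPE sqlite);  -- explicit TYPE parameter
ATTACH 'sqlite:file.db' AS db;         -- equivalent {type}: prefix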

For example, using the SQLite extension, we can open a SQLite database file and query it as we would query a DuckDB database.

ATTACH 'sakila.db' AS sakila (TYPE sqlite);
SELECT title, release_year, length FROM sakila.film LIMIT 5;
┌──────────────────┬──────────────┬────────┐
│      title       │ release_year │ length │
│     varchar      │   varchar    │ int64  │
├──────────────────┼──────────────┼────────┤
│ ACADEMY DINOSAUR │ 2006         │     86 │
│ ACE GOLDFINGER   │ 2006         │     48 │
│ ADAPTATION HOLES │ 2006         │     50 │
│ AFFAIR PREJUDICE │ 2006         │    117 │
│ AFRICAN EGG      │ 2006         │    130 │
└──────────────────┴──────────────┴────────┘

The USE command switches the default database.

USE sakila;
SELECT first_name, last_name FROM actor LIMIT 5;
┌────────────┬──────────────┐
│ first_name │  last_name   │
│  varchar   │   varchar    │
├────────────┼──────────────┤
│ PENELOPE   │ GUINESS      │
│ NICK       │ WAHLBERG     │
│ ED         │ CHASE        │
│ JENNIFER   │ DAVIS        │
│ JOHNNY     │ LOLLOBRIGIDA │
└────────────┴──────────────┘

The SQLite database can be manipulated as if it were a native DuckDB database. For example, we can create a new table, populate it with values from a Parquet file, delete a few rows from the table, and alter the schema of the table.

CREATE TABLE lineitem AS FROM 'lineitem.parquet' LIMIT 1000;
DELETE FROM lineitem WHERE l_returnflag = 'N';
ALTER TABLE lineitem DROP COLUMN l_comment;

The duckdb_databases table contains a list of all attached databases and their types.

SELECT database_name, path, type FROM duckdb_databases;
┌───────────────┬───────────┬─────────┐
│ database_name │   path    │  type   │
│    varchar    │  varchar  │ varchar │
├───────────────┼───────────┼─────────┤
│ sakila        │ sakila.db │ sqlite  │
│ memory        │ NULL      │ duckdb  │
└───────────────┴───────────┴─────────┘

While attaching to different database types is useful on its own, it becomes even more powerful when used in combination. For example, we can attach a SQLite, a MySQL, and a Postgres database at the same time.

ATTACH 'sqlite:sakila.db' AS sqlite;
ATTACH 'postgres:dbname=postgresscanner' AS postgres;
ATTACH 'mysql:user=root database=mysqlscanner' AS mysql;

Now we can move data between these attached databases and query them together. Let's copy the film table to MySQL, and the actor table to Postgres:

CREATE TABLE mysql.film AS FROM sqlite.film;
CREATE TABLE postgres.actor AS FROM sqlite.actor;

We can now join tables from these three attached databases together. Let's find all of the actors that starred in Ace Goldfinger.

SELECT first_name, last_name
FROM mysql.film
JOIN sqlite.film_actor ON (film.film_id = film_actor.film_id)
JOIN postgres.actor ON (actor.actor_id = film_actor.actor_id)
WHERE title = 'ACE GOLDFINGER';
┌────────────┬───────────┐
│ first_name │ last_name │
│  varchar   │  varchar  │
├────────────┼───────────┤
│ BOB        │ FAWCETT   │
│ MINNIE     │ ZELLWEGER │
│ SEAN       │ GUINESS   │
│ CHRIS      │ DEPP      │
└────────────┴───────────┘

Running EXPLAIN on the query shows how the data from the different engines is combined into the final query result.

┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         first_name        │
│         last_name         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │
│     film_id = film_id     ├───────────────────────────────────────────┐
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │                             │ (title = 'ACE GOLDFINGER')│
│    actor_id = actor_id    ├──────────────┐              │                           │
└─────────────┬─────────────┘              │              └─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│        SQLITE_SCAN        ││       POSTGRES_SCAN       ││         MYSQL_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│    sakila.db:film_actor   ││           actor           ││            film           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          film_id          ││          actor_id         ││          film_id          │
│          actor_id         ││         first_name        ││           title           │
│                           ││         last_name         ││                           │
└───────────────────────────┘└───────────────────────────┘└───────────────────────────┘

Several changes have been made to the Postgres extension since the last release. Use FORCE INSTALL postgres to install the latest version of the extension.

All statements executed within DuckDB are executed inside a transaction. If an explicit BEGIN TRANSACTION is not called, every statement will execute in its own transaction. This also applies to queries that are executed over other storage engines. These storage engines also support explicit BEGIN, COMMIT, and ROLLBACK statements.

For example, we can begin a transaction within our attached SQLite database, make a change, and then roll it back. The original data will be restored.

BEGIN;
TRUNCATE film;
SELECT title, release_year, length FROM film;
┌─────────┬──────────────┬────────┐
│  title  │ release_year │ length │
│ varchar │   varchar    │ int64  │
├─────────────────────────────────┤
│             0 rows              │
└─────────────────────────────────┘
ROLLBACK;
SELECT title, release_year, length FROM film LIMIT 5;
┌──────────────────┬──────────────┬────────┐
│      title       │ release_year │ length │
│     varchar      │   varchar    │ int64  │
├──────────────────┼──────────────┼────────┤
│ ACADEMY DINOSAUR │ 2006         │     86 │
│ ACE GOLDFINGER   │ 2006         │     48 │
│ ADAPTATION HOLES │ 2006         │     50 │
│ AFFAIR PREJUDICE │ 2006         │    117 │
│ AFRICAN EGG      │ 2006         │    130 │
└──────────────────┴──────────────┴────────┘

Multi-Database Transactions

Each storage engine has its own transactions that are stand-alone and managed by the storage engine itself. Opening a transaction in Postgres, for example, calls BEGIN TRANSACTION in the Postgres client. The transaction is managed by Postgres itself. Similarly, when the transaction is committed or rolled back, the storage engine handles this on its own. A minimal sketch of what this looks like, using a hypothetical table in an attached Postgres database:
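BEGIN;                                     -- a matching transaction is opened in Postgres
UPDATE postgres.some_table SET i = i + 1;  -- hypothetical table and column
COMMIT;                                    -- the commit is handled by Postgres itself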

Transactions are used both for reading and for writing data. For reading, they are used to provide a consistent snapshot of the database. For writing, they are used to ensure all data in a transaction is committed together and written at the same time.

When executing a transaction that involves multiple attached databases, we need to open multiple transactions: one per attached database that is used in the transaction. While this is not a problem when reading from the databases, it becomes challenging when writing. In particular, when we want to COMMIT a transaction it is difficult to ensure that either (a) every database has successfully committed, or (b) every database has rolled back.

For that reason, writing to multiple attached databases in a single transaction is currently not supported. Instead, an error is thrown when this is attempted:

BEGIN;
CREATE TABLE postgres.new_table(i INT);
CREATE TABLE mysql.new_table(i INT);
Error: Attempting to write to database "mysql" in a transaction that has
already modified database "postgres" – a single transaction can only write
to a single attached database.

CREATE TABLE AS, INSERT INTO and COPY can be used to copy data between different attached databases. The dedicated COPY FROM DATABASE ... TO statement can be used to copy all data from one database to another. This includes all tables and views that are stored in the source database.

-- attach a Postgres database
ATTACH 'postgres:dbname=postgresscanner' AS postgres;
-- attach a DuckDB file
ATTACH 'database.db' AS ddb;
-- export all tables and views from the Postgres database to the DuckDB file
COPY FROM DATABASE postgres TO ddb;

Note that this statement is currently only available in the development build. It will be available in the next DuckDB release (v0.10). For copying individual tables rather than a whole database, the other statements mentioned above can be used; a minimal sketch reusing the postgres and ddb aliases from the previous example (the lineitem table is illustrative):
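-- copy a single table from the attached Postgres database into the DuckDB file
CREATE TABLE ddb.lineitem AS FROM postgres.lineitem;
-- or append the rows into an existing table instead
INSERT INTO ddb.lineitem SELECT * FROM postgres.lineitem;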

An explicit ATTACH statement is not required to connect to a different database type. When instantiating a DuckDB instance, a connection can be made directly to a different database type using the {type}: prefix. For example, to connect to a SQLite file, use sqlite:file.db. To connect to a Postgres instance, use postgres:dbname=postgresscanner. This can be done in any client, including the CLI. For instance:

CLI:

duckdb sqlite:file.db

Python:

import duckdb
con = duckdb.connect('sqlite:file.db')

This is equivalent to attaching the storage engine and running USE afterwards. Spelled out in SQL, the direct connection above behaves like the following sequence (the alias is illustrative):
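ATTACH 'sqlite:file.db' AS file_db;
USE file_db;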

DuckDB's pluggable storage engine architecture enables many use cases. By attaching multiple databases, data can be extracted in a transactionally safe manner for bulk ETL or ELT workloads, as well as for on-the-fly data virtualization workloads. These approaches also work well in combination, for example, by moving data in bulk on a regular cadence, while filling in the last few data points on the fly.

Pluggable storage engines also unlock new ways to handle concurrent writers in a data platform. Each separate process could write its output to a transactional database, and the results could be combined within DuckDB – all in a transactionally safe manner. Then, data analysis tasks can take place on the centralized DuckDB database for improved performance. A minimal sketch of that pattern, with hypothetical database and table names:
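-- each writer process appended its results to a transactional Postgres database
ATTACH 'postgres:dbname=results' AS results_db;
-- combine the per-process outputs into a single native DuckDB table for analysis
CREATE TABLE combined AS
    SELECT * FROM results_db.worker_a
    UNION ALL
    SELECT * FROM results_db.worker_b;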

We look forward to hearing about the many creative ways you will use this feature!

We intend to continue improving the performance and capabilities of the current extensions. In addition, all of these features can be leveraged by the community to connect to other databases.
