Now Reading
What If OpenDocument Used SQLite?

What If OpenDocument Used SQLite?

2023-09-18 03:11:21



What If OpenDocument Used SQLite?




Small. Quick. Dependable.
Select any three.

Introduction

Suppose the
OpenDocument file format,
and particularly the “ODP” OpenDocument Presentation format, have been
constructed round SQLite. Advantages would come with:

  • Smaller paperwork
  • Sooner File/Save instances
  • Sooner startup instances
  • Much less reminiscence used
  • Doc versioning
  • A greater consumer expertise

Word that that is solely a thought experiment.
We’re not suggesting that OpenDocument be modified.
Neither is this text a criticism of the present OpenDocument
design. The purpose of this essay is to recommend methods to enhance
future file format designs.

About OpenDocument And OpenDocument Presentation

The OpenDocument file format is used for workplace functions:
phrase processors, spreadsheets, and displays. It was initially
designed for the OpenOffice suite however has since been included into
different desktop software suites. The OpenOffice software has been
forked and renamed a number of instances. This writer’s main use for OpenDocument is
constructing slide displays with both
NeoOffice on Mac, or
LibreOffice on Linux and Home windows.

An OpenDocument Presentation or “ODP” file is a
ZIP archive containing
XML recordsdata describing presentation slides and separate picture recordsdata for the
numerous photographs which might be included as a part of the presentation.
(OpenDocument phrase processor and spreadsheet recordsdata are equally
structured however should not thought of by this text.) The reader can
simply see the content material of an ODP file through the use of the “zip -l” command.
For instance, the next is the “zip -l” output from a 49-slide presentation
about SQLite from the 2014
SouthEast LinuxFest
convention:

Archive:  self2014.odp
  Size      Date    Time    Title
---------  ---------- -----   ----
       47  2014-06-21 12:34   mimetype
        0  2014-06-21 12:34   Configurations2/statusbar/
        0  2014-06-21 12:34   Configurations2/accelerator/present.xml
        0  2014-06-21 12:34   Configurations2/floater/
        0  2014-06-21 12:34   Configurations2/popupmenu/
        0  2014-06-21 12:34   Configurations2/progressbar/
        0  2014-06-21 12:34   Configurations2/menubar/
        0  2014-06-21 12:34   Configurations2/toolbar/
        0  2014-06-21 12:34   Configurations2/photographs/Bitmaps/
    54702  2014-06-21 12:34   Photos/10000000000001F40000018C595A5A3D.png
    46269  2014-06-21 12:34   Photos/100000000000012C000000A8ED96BFD9.png
... 58 different photos omitted...
    13013  2014-06-21 12:34   Photos/10000000000000EE0000004765E03BA8.png
  1005059  2014-06-21 12:34   Photos/10000000000004760000034223EACEFD.png
   211831  2014-06-21 12:34   content material.xml
    46169  2014-06-21 12:34   types.xml
     1001  2014-06-21 12:34   meta.xml
     9291  2014-06-21 12:34   Thumbnails/thumbnail.png
    38705  2014-06-21 12:34   Thumbnails/thumbnail.pdf
     9664  2014-06-21 12:34   settings.xml
     9704  2014-06-21 12:34   META-INF/manifest.xml
---------                     -------
 10961006                     78 recordsdata

The ODP ZIP archive comprises 4 completely different XML recordsdata:
content material.xml, types.xml, meta.xml, and settings.xml. These 4 recordsdata
outline the slide structure, textual content content material, and styling. This explicit
presentation comprises 62 photographs, starting from full-screen photos to
tiny icons, every saved as a separate file within the Photos
folder. The “mimetype” file comprises a single line of textual content that claims:

software/vnd.oasis.opendocument.presentation

The aim of the opposite recordsdata and folders is presently
unknown to the writer however might be not troublesome to determine.

Limitations Of The OpenDocument Presentation Format

The usage of a ZIP archive to encapsulate XML recordsdata plus sources is an
elegant strategy to an software file format.
It’s clearly superior to a customized binary file format.
However utilizing an SQLite database because the
container, as an alternative of ZIP, can be extra elegant nonetheless.

A ZIP archive is principally a key/worth database, optimized for
the case of write-once/read-many and for a comparatively small quantity
of distinct keys (a number of hundred to a couple thousand) every with a big BLOB
as its worth. A ZIP archive will be considered as a “pile-of-files”
database. This works, nevertheless it has some shortcomings relative to an
SQLite database, as follows:

  1. Incremental replace is tough.

    It’s troublesome to replace particular person entries in a ZIP archive.
    It’s particularly troublesome to replace particular person entries in a ZIP
    archive in a approach that doesn’t destroy
    the complete doc if the pc loses energy and/or crashes
    in the course of the replace. It isn’t inconceivable to do that, however
    it’s sufficiently troublesome that no person really does it. As an alternative, each time
    the consumer selects “File/Save”, the complete ZIP archive is rewritten.
    Therefore, “File/Save” takes longer than it ought, particularly on
    older {hardware}. Newer machines are sooner, however it’s nonetheless bothersome
    that altering a single character in a 50 megabyte presentation causes one
    to burn via 50 megabytes of the finite write life on the SSD.

  2. Startup is sluggish.

    In line with the pile-of-files theme, OpenDocument shops all slide
    content material in a single large XML file named “content material.xml”.
    LibreOffice reads and parses this whole file simply to show
    the primary slide.
    LibreOffice additionally appears to
    learn all photographs into reminiscence as effectively, which is smart seeing as when
    the consumer does “File/Save” it’s going to have to jot down all of them again out
    once more, regardless that none of them modified. The web impact is that
    start-up is sluggish. Double-clicking an OpenDocument file brings up a
    progress bar somewhat than the primary slide.
    This leads to a nasty consumer expertise.
    The scenario grows ever extra annoying as
    the doc measurement will increase.

  3. Extra reminiscence is required.

    As a result of ZIP archives are optimized for storing large chunks of content material, they
    encourage a method of programming the place the complete doc is learn into
    reminiscence at startup, all modifying happens in reminiscence, then the complete doc
    is written to disk throughout “File/Save”. OpenOffice and its descendants
    embrace that sample.

    One may argue that it’s okay, on this period of multi-gigabyte desktops, to
    learn the complete doc into reminiscence.
    However it’s not okay.
    For one, the quantity of reminiscence used far exceeds the (compressed) file measurement
    on disk. So a 50MB presentation may take 200MB or extra RAM.
    That also will not be an issue if one solely edits a single doc at a time.
    However when engaged on a chat, this writer will sometimes have 10 or 15 completely different
    displays up all on the similar
    time (to facilitate copy/paste of slides from previous presentation) and so
    gigabytes of reminiscence are required.
    Add in an open internet browser or two and some different
    desktop apps, and all of the sudden the disk is whirling and the machine is swapping.
    And even having only a single doc is an issue when working
    on a cheap Chromebook retrofitted with Ubuntu.
    Utilizing much less reminiscence is at all times higher.

  4. Crash restoration is troublesome.

    The descendants of OpenOffice are inclined to segfault extra typically than industrial
    rivals. Maybe for that reason, the OpenOffice forks make
    periodic backups of their in-memory paperwork in order that customers don’t lose
    all pending edits when the inevitable software crash does happen.
    This causes irritating pauses within the software for the few seconds
    whereas every backup is being made.
    After restarting from a crash, the consumer is introduced with a dialog field
    that walks them via the restoration course of. Managing the crash
    restoration this manner entails numerous further software logic and is
    typically an annoyance to the consumer.

  5. Content material is inaccessible.

    One can not simply view, change, or extract the content material of an
    OpenDocument presentation utilizing generic instruments.
    The one cheap solution to view or edit an OpenDocument doc is to open
    it up utilizing an software that’s particularly designed to learn or write
    OpenDocument (learn: LibreOffice or considered one of its cousins). The scenario
    may very well be worse. One can extract and think about particular person photographs (say) from
    a presentation utilizing simply the “zip” archiver software. However it’s not cheap
    attempt to extract the textual content from a slide. Keep in mind that all content material is saved
    in a single “context.xml” file. That file is XML, so it’s a textual content file.
    However it’s not a textual content file that may be managed with an strange textual content
    editor. For the instance presentation above, the content material.xml file
    include precisely two traces. The primary line of the file is simply:

    <?xml model="1.0" encoding="UTF-8"?>
    

    The second line of the file comprises 211792 characters of
    impenetrable XML. Sure, 211792 characters all on one line.
    This file is an effective stress-test for a textual content editor.
    Fortunately, the file will not be some obscure
    binary format, however by way of accessibility, it would as effectively be
    written in Klingon.

First Enchancment: Substitute ZIP with SQLite

Allow us to suppose that as an alternative of utilizing a ZIP archive to retailer its recordsdata,
OpenDocument used a quite simple SQLite database with the next
single-table schema:

CREATE TABLE OpenDocTree(
  filename TEXT PRIMARY KEY,  -- Title of file
  filesize BIGINT,            -- Measurement of file after decompression
  content material BLOB                -- Compressed file content material
);

For this primary experiment, nothing else in regards to the file format is modified.
The OpenDocument continues to be a pile-of-files, solely now every file is a row
in an SQLite database somewhat than an entry in a ZIP archive.
This straightforward change doesn’t use the facility of a relational
database. Even so, this easy change reveals some enhancements.

Surprisingly, utilizing SQLite instead of ZIP makes the presentation
file smaller. Actually. One would assume {that a} relational database file
can be bigger than a ZIP archive, however at the least within the case of NeoOffice
that isn’t so. The next is an precise screen-scrape exhibiting
the sizes of the identical NeoOffice presentation, each in its authentic
ZIP archive format as generated by NeoOffice (self2014.odp), and
as repacked as an SQLite database utilizing the
SQLAR utility:

-rw-r--r--  1 drh  workers  10514994 Jun  8 14:32 self2014.odp
-rw-r--r--  1 drh  workers  10464256 Jun  8 14:37 self2014.sqlar
-rw-r--r--  1 drh  workers  10416644 Jun  8 14:40 zip.odp

The SQLite database file (“self2014.sqlar”) is a couple of
half % smaller than the equal ODP file! How can this be?
Apparently the ZIP archive generator logic in NeoOffice
will not be as environment friendly because it may very well be, as a result of when the identical pile-of-files
is recompressed utilizing the command-line “zip” utility, one will get a file
(“zip.odp”) that’s smaller nonetheless, by one other half %, as seen
within the third line above. So, a well-written ZIP archive
will be barely smaller than the equal SQLite database, as one would
count on. However the distinction is slight. The important thing take-away is that an
SQLite database is size-competitive with a ZIP archive.

The opposite benefit to utilizing SQLite instead of
ZIP is that the doc can now be up to date incrementally, with out danger
of corrupting the doc if an influence loss or different crash happens within the
center of the replace. (Keep in mind that writes to
SQLite databases are atomic.) True, all of the
content material continues to be saved in a single large XML file (“content material.xml”) which should
be fully rewritten if a lot as a single character adjustments. However
with SQLite, solely that one file wants to alter. The opposite 77 recordsdata within the
repository can stay unaltered. They don’t all must be rewritten,
which in flip makes “File/Save” run a lot sooner and saves put on on SSDs.

Second Enchancment: Cut up content material into smaller items

A pile-of-files encourages content material to be saved in a number of massive chunks.
Within the case of ODP, there are simply 4 XML recordsdata that outline the structure
off all slides in a presentation. An SQLite database permits storing
data in a number of massive chunks, however SQLite can also be adept and environment friendly
at storing data in quite a few smaller items.

So then, as an alternative of storing all content material for all slides in a single
outsized XML file (“content material.xml”), suppose there was a separate desk
for storing the content material of every slide individually. The desk schema
may look one thing like this:

CREATE TABLE slide(
  pageNumber INTEGER,   -- The slide web page quantity
  slideContent TEXT     -- Slide content material as XML or JSON
);
CREATE INDEX slide_pgnum ON slide(pageNumber); -- Non-obligatory

The content material of every slide may nonetheless be saved as compressed XML.
However now every web page is saved individually. So when opening a brand new doc,
the appliance may merely run:

SELECT slideContent FROM slide WHERE pageNumber=1;

This question will rapidly and effectively return the content material of the primary
slide, which may then be speedily parsed and exhibited to the consumer.
Just one web page must be learn and parsed so as render the primary display,
which signifies that the primary display seems a lot sooner and
there isn’t any longer a necessity for an annoying progress bar.

If the appliance needed
to maintain all content material in reminiscence, it may proceed studying and parsing the
different pages utilizing a background thread after drawing the primary web page. Or,
since studying from SQLite is so environment friendly, the appliance may
as an alternative select to scale back its reminiscence footprint and solely preserve a single
slide in reminiscence at a time. Or possibly it retains the present slide and the
subsequent slide in reminiscence, to facility speedy transitions to the subsequent slide.

Discover that dividing up the content material into smaller items utilizing an SQLite
desk provides flexibility to the implementation. The appliance can select
to learn all content material into reminiscence at startup. Or it might learn only a
few pages into reminiscence and preserve the remaining on disk. Or it might learn simply
single web page into reminiscence at a time. And completely different variations of the appliance
could make completely different selections with out having to make any adjustments to the
file format. Such choices should not accessible when all content material is in
a single large XML file in a ZIP archive.

Splitting content material into smaller items additionally helps File/Save operations
to go sooner. As an alternative of getting to jot down again the content material of all pages
when doing a File/Save, the appliance solely has to jot down again these
pages which have really modified.

One minor draw back of splitting content material into smaller items is that
compression doesn’t work as effectively on shorter texts and so the scale of
the doc may improve. However as the majority of the doc house
is used to retailer photographs, a small discount within the compression effectivity
of the textual content content material will hardly be noticeable, and is a small worth
to pay for an improved consumer expertise.

Third Enchancment: Versioning

As soon as one is comfy with the idea of storing every slide individually,
it’s a small step to help versioning of the presentation. Contemplate
the next schema:

CREATE TABLE slide(
  slideId INTEGER PRIMARY KEY,
  derivedFrom INTEGER REFERENCES slide,
  content material TEXT     -- XML or JSON or no matter
);
CREATE TABLE model(
  versionId INTEGER PRIMARY KEY,
  priorVersion INTEGER REFERENCES model,
  checkinTime DATETIME,   -- When this model was saved
  remark TEXT,           -- Description of this model
  manifest TEXT           -- Record of integer slideIds
);

On this schema, as an alternative of every slide having a web page quantity that determines
its order throughout the presentation, every slide has a singular
integer identifier that’s unrelated to the place it happens in sequence.
The order of slides within the presentation is decided by an inventory of
slideIds, saved as a textual content string within the MANIFEST column of the VERSION
desk.
Since a number of entries are allowed within the VERSION desk, that signifies that
a number of displays will be saved in the identical doc.

On startup, the appliance first decides which model it
needs to show. Because the versionId will naturally improve in time
and one would usually wish to see the most recent model, an acceptable
question is likely to be:

SELECT manifest, versionId FROM model ORDER BY versionId DESC LIMIT 1;

Or maybe the appliance would somewhat use the
most up-to-date checkinTime:

See Also

SELECT manifest, versionId, max(checkinTime) FROM model;

Utilizing a single question such because the above, the appliance obtains an inventory
of the slideIds for all slides within the presentation. The appliance then
queries for the content material of the primary slide, and parses and shows that
content material, as earlier than.

(Apart: Sure, that second question above that makes use of “max(checkinTime)”
actually does work and actually does return a well-defined reply in SQLite.
Such a question both returns an undefined reply or generates an error
in lots of different SQL database engines, however in SQLite it does what you’d
count on: it returns the manifest and versionId of the entry that has the
most checkinTime.)

When the consumer does a “File/Save”, as an alternative of overwriting the modified
slides, the appliance can now make new entries within the SLIDE desk for
simply these slides which have been added or altered. Then it creates a
new entry within the VERSION desk containing the revised manifest.

The VERSION desk proven above has columns to document a check-in remark
(presumably equipped by the consumer) and the time and date at which the File/Save
motion occurred. It additionally information the mum or dad model to document the historical past
of adjustments. Maybe the manifest may very well be saved as a delta from the
mum or dad model, although sometimes the manifest might be sufficiently small that
storing a delta is likely to be extra bother than it’s price. The SLIDE desk
additionally comprises a derivedFrom column which may very well be used for delta encoding
whether it is decided that saving the slide content material as a delta from its
earlier model is a worthwhile optimization.

So with this easy change, the ODP file now shops not simply probably the most
latest edit to the presentation, however a historical past of all historic edits. The
consumer would usually wish to see simply the latest version of the
presentation, but when desired, the consumer can now go backwards in time to
see historic variations of the identical presentation.

Or, a number of displays may very well be saved throughout the similar doc.

With such a schema, the appliance would not must make
periodic backups of the unsaved adjustments to a separate file to keep away from misplaced
work within the occasion of a crash. As an alternative, a particular “pending” model may
be allotted and unsaved adjustments may very well be written into the pending model.
As a result of solely adjustments would have to be written, not the complete doc,
saving the pending adjustments would solely contain writing a number of kilobytes of
content material, not a number of megabytes, and would take milliseconds as an alternative of
seconds, and so it may very well be finished continuously and silently within the background.
Then when a crash happens and the consumer reboots, all (or virtually all)
of their work is retained. If the consumer decides to discard unsaved adjustments,
they merely return to the earlier model.

There are particulars to fill in right here.
Maybe a display will be offered that shows a historical past adjustments
(maybe with a graph) permitting the consumer to pick which model they
wish to view or edit. Maybe some facility will be offered to merge
forks that may happen within the model historical past. And maybe the
software ought to present a way to purge previous and undesirable variations.
The important thing level is that utilizing an SQLite database to retailer the content material,
somewhat than a ZIP archive, makes all of those options a lot, a lot simpler
to implement, which will increase the likelihood that they may finally
get carried out.

And So Forth…

Within the earlier sections, we now have seen how shifting from a key/worth
retailer carried out as a ZIP archive to a easy SQLite database
with simply three tables can add important capabilities to an software
file format.
We may proceed to reinforce the schema with new tables, with indexes
added for efficiency, with triggers and views for programming comfort,
and constraints to implement consistency of content material even within the face of
programming errors. Additional enhancement concepts embrace:

  • Retailer an automated undo/redo stack in a database desk in order that
    Undo may return into prior edit classes.
  • Add full text search capabilities to the slide deck, or throughout
    a number of slide decks.
  • Decompose the “settings.xml” file into an SQL desk that
    is extra simply considered and edited by separate functions.
  • Escape the “Presentor Notes” from every slide right into a separate
    desk, for simpler entry from third-party functions and/or scripts.
  • Improve the presentation idea past the straightforward linear sequence of
    slides to permit for side-tracks and excursions to be taken relying on
    how the viewers is responding.

An SQLite database has a variety of functionality, which
this essay has solely begun to the touch upon. However hopefully this fast glimpse
has satisfied some readers that utilizing an SQL database as an software
file format is price a re-evaluation.

Some readers may resist utilizing SQLite as an software
file format as a result of prior publicity to enterprise SQL databases and
the caveats and limitations of these different techniques.
For instance, many enterprise database
engines advise towards storing massive strings or BLOBs within the database
and as an alternative recommend that giant strings and BLOBs be saved as separate
recordsdata and the filename saved within the database. However SQLite
will not be like that. Any column of an SQLite database can maintain
a string or BLOB as much as a couple of gigabyte in measurement. And for strings and
BLOBs of 100 kilobytes or much less,
I/O performance is better than utilizing separate
recordsdata.

Some readers is likely to be reluctant to think about SQLite as an software
file format as a result of they’ve been inculcated with the concept all
SQL database schemas should be factored into third regular type and retailer
solely small primitive knowledge varieties reminiscent of strings and integers. Definitely
relational idea is essential and designers ought to attempt to know
it. However, as demonstrated above, it’s typically fairly acceptable to retailer
complicated data as XML or JSON in textual content fields of a database.
Do what works, not what your database professor mentioned you should do.

Overview Of The Advantages Of Utilizing SQLite

In abstract,
the declare of this essay is that utilizing SQLite as a container for an software
file format like OpenDocument
and storing numerous smaller objects in that container
works out significantly better than utilizing a ZIP archive holding a number of bigger objects.
To wit:

  1. An SQLite database file is roughly the identical measurement, and in some circumstances
    smaller, than a ZIP archive holding the identical data.

  2. The atomic update capabilities
    of SQLite enable small incremental adjustments
    to be safely written into the doc. This reduces whole disk I/O
    and improves File/Save efficiency, enhancing the consumer expertise.

  3. Startup time is decreased by permitting the appliance to learn in solely the
    content material proven for the preliminary display. This largely eliminates the
    want to point out a progress bar when opening a brand new doc. The doc
    simply pops up instantly, additional enhancing the consumer expertise.

  4. The reminiscence footprint of the appliance will be dramatically decreased by
    solely loading content material that’s related to the present show and protecting
    the majority of the content material on disk. The quick question functionality of SQLite
    make this a viable different to protecting all content material in reminiscence always.
    And when functions use much less reminiscence, it makes the complete laptop extra
    responsive, additional enhancing the consumer expertise.

  5. The schema of an SQL database is ready to signify data extra straight
    and succinctly than a key/worth database reminiscent of a ZIP archive. This makes
    the doc content material extra accessible to third-party functions and scripts
    and facilitates superior options reminiscent of built-in doc versioning, and
    incremental saving of labor in progress for restoration after a crash.

These are only a few of the advantages of utilizing SQLite as an software file
format — the advantages that appear almost definitely to enhance the consumer
expertise for functions like OpenOffice. Different functions may
profit from SQLite in numerous methods. See the Application File Format
doc for added concepts.

Lastly, allow us to reiterate that this essay is a thought experiment.
The OpenDocument format is well-established and already well-designed.
No one actually believes that OpenDocument needs to be modified to make use of SQLite
as its container as an alternative of ZIP. Neither is this text a criticism of
OpenDocument for not selecting SQLite as its container since OpenDocument
predates SQLite. Somewhat, the purpose of this text is to make use of OpenDocument
as a concrete instance of how SQLite can be utilized to construct higher
software file codecs for future initiatives.

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top