What if OpenDocument used SQLite? (2014)






It's somewhat off topic I know, but is there something like sqlite but tailored for hierachical data? Like a xml document store rather than for relational data like sqlite is.


There’s ASN.1 for hierarchical data with a schema. It doesn’t provide a query language though.


ASN.1 in itself is a schema syntax. That schema can be serialized into various related forms, but all of them are more or less a transport formats that cannot be reasonably used for random access.

There are some more or less general hierarchical formats with support for random access, but most of them are tightly related with particular technology stack (ie. MS's COM Compound Document) or with particular usage area (there is HDF5 for scientific data and many multimedia containers are in fact a hierarchical databases, with both the various IFF variants and EBML being explicitly designed as reusable formats for arbitrary data). And then there are formats that implicitly contain some kind of hierarchical container mechanism (PDF, TIFF, DICOM, FPS game map files…).


I shipped a product that used both SQLite and XML files.

One of the improvements that I made was moving a few tables that contained small amounts of data to xml files. Because these files were small and rarely written; it simplified the data access layer, and simplified diagnostics. (I made sure the files were multi-line tabbed xml.)

For "technical" people who needed to diagnose the product, asking them to crack open a SQLite database was a huge ask; but for the major part of the product that used SQLite, it was hands-down better than XML files. (An older version of the product used XML files. It had scalability problems because there's no good way to make an incremental update to an XML file.)

The advantages of XML, specifically, a human-readable format; really only work for small files when the design of the schema is optimized for readable XML. Unfortunately, the need to always rewrite the entire XML file, and the "complexities" that come with lots and lots of features will quickly erode XML's biggest advantages.

IMO: A "lay" person needing to muck around with the internals of an office document is fringe enough that learning to use a SQLite reader is an acceptable speed bump. The limitations of XML + Zip, when it comes to random writes in the middle of a file, just can't be overcome by Moore's law.


I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?


SQLite does not compress, as far as I know.

Engineering is all about tradeoffs: SQLite is optimized for quick incremental updates where you don't need to rewrite the whole file. Zip & xml aren't. (IE, if you decide to add a letter to a word at the beginning of a document, with zip & XML you have to rewrite the whole document. SQLite can make a minor change without the whole rewrite.)

In our case, file size was not a factor in choosing between SQLite and XML.

But, remember that file size is deceptive: Disks are block devices; the 30 byte and 1k file take up the same space if you block size is 2k. (I've shipped a filesystem driver.) HTTP servers gzip on download. It's more important to know your needs than to get hung up on a single metric like file size.

> I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?

Remember, XML writes each tag name 1 time if there's no content and twice if there is. Each attribute has it's name written every time. I doubt SQLite writes all the metadata in each row.


Yeah what if? Then they haven't really understood the purpose of markup languages as plain text files for viewing/editing using generic text editors. There was no lack of proprietary formats such as MS Structured Format (used by MSO) and it was considered a big success when customers demanded open formats such as SGML/XML-based ones in late 90s/00's. The alternatives aren't even sequential (have fragments and cross pointers, etc). Yes they might be faster because they're closer to the in-memory representations as used by the original/historic app or even primitive memory dumps; marginal speed or size improvements were never a consideration though. And if anything, SQL (almost as old as SGML btw) is a joke as document query language compared to basically any alternative specifically designed for the job (ISO topic maps query language ie. Datalog, XPath and co, SPARQL, DSSSL/Scheme, ...) because of SQL's COBOLness, non-schemalessness, lock semantics/granularity a really bad fit, etc.).




Good article. Although one thing I do like about OpenDocument being just a bunch of XML files in a ZIP archive is that it is fairly easy to generate documents like spreadsheets without using a (potentially hefty) library which knows about the document format.

I have a use case where users of a web service want to use data exported as a bunch of rows in a table in a variety of tools. Now, CSV with UTF-8 encoding is of course, totally open, conventional, and workable, but anyone who has ever offered CSV files to end users will know the pain of these users getting stuck when they want to use these files in a spreadsheet application¹. So I saved a sample spreadsheet in OpenDocument's ODS and another in that Microsoft XML abomination called OOXML as XLSX, and just figured out the basics of those XML formats. I trimmed the ZIP archives down to the essentials, marked the places where content goes, and just build a new spreadsheet file whenever data is requested in that format. Now I can output CSV, ODS, and XLSX (and JSON thrown in for good measure) of the same data.

Doing this with SQLite would be possible of course, just a tad more complex and with a lower development speed. Being able to fire up the office suite, create a template document, and just dig into its XML files in the saved file is a nice feature (although admittedly of niche interest).

1: More specifically, users who use Excel in a locale like nl_NL, where CSV files are, hardcoded, assumed to have their columns separated by semicolons, because Microsoft once notoriously decided that the Dutch did not use comma's in a comma separated values file.


As for [1], it is not really hardcoded, but depends on what is the value of localeconv()->decimal_point, if it is “,”, excel uses semicolons both in CSV files and formula expression language.

This used to be configurable when opening CSV/TXT file in excel (and still is in LibreOffice) but as a part of the overall UI dumbification was moved somewhere under the “Data” menu/ribbon tab (so you have to open new workbook and find the right option, or well, use LibreOffice if you value your time).


> decimal_point

Are you sure that affects it? The decimal point parameter sounds like it decides how to write out 5½ (i.e., 5.5 (English style) or 5,5 (Dutch style)) surely? Although on the topic of this particular bête noire I would not be surprised.


Sadly they did not include bad sides:

1) Vulnerabilities: not only in SQLite, but also in wrappers like https://nvd.nist.gov/vuln/detail/CVE-2023-32697

2) Lack of transparency: zip with xml's contains only xml's; meanwhile SQLite contains by design all kinds of traces with sensitive information or empty blocks. Attempts to fix these issues removes benefits that were mentioned.

3) Lack of implementer support. It was one of the reasons for WebSQL deprecation many years ago.

4) Lack of standardization for file format. SQLite does not even promise forward compatibility, only backward one. Which means that new documents might not open in old software, or vendor should fork SQLite and only backport security patches.


4) is enough for me, so I agree with your general point, but 1) 2) and 3) aren't really cons for SQLite.

1) Makes sense only if the average XML parsers and zip libraries in use have fewer vulnerabilities and are actively maintained as well.

2) You can store sensitive data in a SQLite database or XML file, there's no real difference. You can clean up a SQLite database pretty easily if you want and that doesn't take away all the benefits.

3) What does implementer support even mean? I believe they are open to custom work... WebSQL died because it doesn't make sense to pretend SQLite is some kind of standard -- that brings us back to 4), which is the valid reason to avoid SQLite.

Actually, your 4) is worded too strongly. They say they're committed to forward compatibility as long as you don't use the new features. That makes forward compatibility the decision of the app: an app can have forward compatibility and not use newer features OR lose forward compatibility and use newer features.


> Vulnerabilities: not only in SQLite, but also in wrappers like

Yes, parsing encoded files tends to introduce vulnerabilities. ZIP parsers have had plenty of vulnerabilities. This is not exclusive to SQLite.

> Lack of transparency: zip with xml's contains only xml's

Both zips and sqlite cannot be read with a text editor. Both are open formats with widely available tools to read them. The sqlite binary might, in fact, be more widely available than unzipping tools.

> meanwhile SQLite contains by design all kinds of traces with sensitive information or empty blocks.


> Lack of implementer support. It was one of the reasons for WebSQL deprecation many years ago.

I don't understand how this is relevant?

> SQLite does not even promise forward compatibility, only backward one. Which means that new documents might not open in old software

Neither does OpenDocument. SQLite is actually more solid in this regard – forwards compatibility is still a thing unless new features are used.


> Both zips and sqlite cannot be read with a text editor. Both are open formats with widely available tools to read them.

Well, that's why your archive format of choice should be cpio, which is almost a text file except that modern implementations tend to 0-terminate the filename!

Jokes aside, there are widely-distributed tools that can take in an almost-arbitrary zip file and account for every byte in it. The format is straightforward enough that, were you so inclined, you can do most of it (other than, like, decompression and crc-checking) manually in a text editor. The SQLite format is not like this. There is one implementation, and relatively easy to "hide" data in a database file that its tooling will not reveal.


> > meanwhile SQLite contains by design all kinds of traces with sensitive information or empty blocks.

> Elaborate?

When you delete something from a SQLite database, it isn't necessarily actually removed from the file unless you VACUUM or have the secure_delete PRAMGA turned on. Either of these should solve the problem.

VACUUM INTO is a good way to export sqlite databases from an application for this reason.


> parsing encoded files tends to introduce vulnerabilities

If we are talking about binary formats, now there are systematic solutions like https://github.com/google/wuffs that protect against vulnerabilities. But SQLite is not just a format - it's an evolving ecosystem with constantly added features. And the most prominent issue was not even in core, it was in FTS3. What will SQLite add next? More json-related functions? Maybe BSON? It is useful, but does not help in this situation.

Regarding traces, there are many forensics tools and even books about forensic analysis of SQLite databases. In well-designed format such tools should not exist in the first place. This is hard requirement: if it requires rewriting the whole file - then so be it.


There really should be a "NoSQLite" or something equivalent to store hierarchical data instead of normalized data.


You can probably use SQLite for that, with a single key-value table.


The json* family of tree and table functions are nowadays built in.


It's trivial to implement hierarchical data with recursive common table expressions. https://www.sqlite.org/lang_with.html




> Nobody really believes that OpenDocument should be changed to use SQLite as its container instead of ZIP. […] Rather, the point of this article is to use OpenDocument as a concrete example of how SQLite can be used to build better application file formats for future projects.


Would be interesting to see a fork implementing SQLite. Time would tell how well it would compete with the standard.


Sqlite-based file formats are also very easy to debug, which saves a lot of dev time. After my app writes to a file and loading back doesn't work, I can just open it in Sqlite and inspect it in any way I wish because I have the full power of SQL at my fingertips.


OpenDocument is zipped images and XML. Implying you parse the entire format and put it in RAM. And frankly I don't see how SQLite can improve this. Well XML isn't ideal, but it's zipped, so there's no huge penalty in size here.

All benefits SQLite's article lists (and I love SQLite to death by the way) can be implemented by having SQLite be the runtime model of the document. On disk and in memory. But SQLite doesn't need to be the transport format. In fact SQLite can easily get bigger than the current format, SQLite is full of unused space when you mutate it around, it can get fragmented and sparse. And if you need to optimize it every time, then the "fast save" etc. benefit goes away.

There are formats which do need delta updates and quick indexed look-ups without fully loading the file in RAM, and this is why so many apps do use SQLite as a file format. I just feel OpenDocument was a bad pick to use SQLite for in this hypothetical scenario.


XML and Zip don't really do incremental updates, meaning the whole application file has to be written on save, meaning corruption can occur due to hiccups mid-write. Sqlite as a disk format and the right application implementation means you can't end up in a corrupted state.

I think you can achieve the same thing with xml/zip and some rename shenanigans, but sqlite lets you get that in a single file on disk.

Also if you are using sqlite as the memory model, why not use it as the disk/transport format? It's basically free at that point.

The file size issue can be dealt with VACUUM (I believe, haven't personally dealt with sqlite-as-file-format).


Incremental updates don't matter in a transport format.

The claim "it's basically free" isn't right, as for transport you need to VACUUM. And possibly COMPRESS too. And if you do that... might as well use the existing format. VACUUM completely rewrites the file from scratch. You can't do incremental updates in a VACUUMed file as it stops being VACUUMed, so you need to VACUUM it again to ensure minimal file size. Nothing is free.

ZIP also can be incrementally updated (file by file) by the way, I think MS Word uses this feature in some saves. But that's beside the point. You simply do not need incremental updates in a transport format.

I'm not sure what "hiccups mid-write" you're referring to. Any such hiccup that would damage an XML or ZIP file would also damage an SQLite file.

The distinction between a working disk file and a transport format are important. The working disk file is large, binary, messy, complex, optimized for quick look-ups and quick partial updates. If your word processor crashes, it can restore state from the working disk format in no time.

But the transport format needs to be small, readable, debuggable, simple, stable. And SQLite simply doesn't offer anything significantly superior in that department compared to the existing format. Especially nothing to justify the additional effort of changing an already working solution.

There's a reason "serialization" is called that, it's just serial data. No random access structures, no indices, single representation, often text-based. Throughout the decades, we've learned this is the best way to transport data of any kind. The messy/partial/polymorphic/cryptic/hyperoptimized/indexed formats are not for transport. They're intended to do work in, locally.

[deleted by user]

Shameless plug of a couple of Python libraries I’ve been involved with that work around memory issues of ODS files (for very specific use cases):

https://github.com/uktrade/stream-read-ods https://github.com/uktrade/stream-write-ods


The problem with SQLite is that it's not a standardized file format. It's well-documented and pretty well understood for sure, but there's no ISO standard defining how to interpret an SQLite file in excruciating detail. Same goes for competing implementations, Zip and XML have a much smaller API surface than SQLite, whose API, apart from a bunch of C functions, is the SQL language itself. Writing an XML parser is not a trivial task, but it's still simpler than writing an SQL parser, query optimizer, compiler, bytecode VM, full-text search engine, and whatever else Sqlite offers, without any data corruption in the process. If Open Office used SQLite, its programmers would inevitably start using its more esoteric features and writing queries that a less-capable engine wouldn't be able to optimize too well.

This isn't a concern for most software. If you're writing a domain-specific, closed-source application where interoperability with other apps or ISO standardization isn't a concern, SQLite is a perfectly fine file format, but as far as I understand the situation, those concerns did exist for Open Office.


> This isn't a concern for most software.

It's not even a concern for the US Library of Congress, which defined SQLite as a recommended storage format for datasets alongside CSV, XML, and JSON.


But those are completely different uses of a storage format.

Library of congress considers if someone a 100 years from now could write a new importer in whatever langauge/AI they might use by then.

Office documents are something you send in email attachments to people you often barely know, and expect them to read it in whatever office system they have. And if the recipient uses e.g., Microsoft Word, OFD/Sqlite might not work.


It is true that it requires effort for the developers of a software program to support a given file format. Beyond that I'm not sure what your point is.


Not the op, but one point would be, why did we even pick xml, when we had latex and html? Why is a relational database the right tool for a document format?


> Office documents are something you send in email attachments to people you often barely know, and expect them to read it in whatever office system they have.

Eh, if they're not running the same office system, down to patches, you can't really expect much.


Just define the schema and the semantics of each column for each table.


I've never seen this as a problem, since plenty of random things are distributed as sqlite files. All the remaining questions for ODF would be about the schema design.


But you don't need a standard, because all interaction between applications and the document is made through SQL. And SQL is standardized (at least the parts that matter). If you have concerns about compatibility, make sure that the document can also be accessed through other databases (like mysql).


>at least the parts that matter

In my experience every part matters in non-trivial use cases since someone somewhere will use that part.


But other databases cannot access sqlite databases, because the file format is internal...


SQL file format is very well documented. In some universities it is an assignment to directly read and write sqllite files from disk and understand the paged and blocks structure.

You don’t need sql for any of it.



> In some universities it is an assignment to directly read and write sqllite files from disk and understand the paged and blocks structure.

do you have any links?


It's interesting that this is a classroom assignment, like the sibling comment, I'd curious which university / class this was. I did the read part (+ query planning) on my own as an exercise, but I haven't gotten around to implementing writing yet.

You do need to parse DDL to get the column names, they're stored as a "CREATE TABLE" string. But you don't have to if you want to dump the file without names.



They can if they want to, using the standard SQLite lib or their own implementation.


I'll admit, that's a fantastic third party effort. But there definitely isn't the same level of first party support as there is for zip files.


You seem to be mixing up the file format with how it's used. An application that uses SQLite's file format would use SQLite's library as part of the application. Yes, it would be quite a lot of work to replicate that library but in the same way that replicating the code that uses OpenDocument's file format would be.

The file format itself is pretty straightforward.


> Writing an XML parser is not a trivial task, but it's still simpler than writing an SQL parser, query optimizer, compiler, bytecode VM, full-text search engine, and whatever else Sqlite offers, without any data corruption in the process.

Just to clarify: You don't actually need to implement all that for it to be a standardized file format, any more than you need to implement all the spreadsheet functionality to be able to read a LibreOffice spreadsheet. All you need to do is to be able to reconstruct the tables. There's no reason, having reconstructed the tables, you couldn't write your own imperative code in the language of your choice to go over them and get whatever information you wanted.


I'm not sure if the problem you are pointing out has to do with:

a) SQLite the file format - which is Public Domain and so well documented that parsers for it exist in numerous other languages even though it's almost pointless because...

b) SQLite, the Public Domain (and thus entirely source available) C implementation of the library that can operate on the file format -- and is documented to a level well above what most ISO standards shoot for. It's designed to be used in other software and has bindings for pretty much every major language.

c) Some notional OpenDocument stored in a SQLite file that's really just waiting for somebody to make and document.

ISO standards are great, but if we had to wait for ISO to define a file format we'd have pitifully little to work with.


It is possible that the C implementation of SQLite is the single most commonly deployed software library ever. If not, then it is probably the second, after zlib.


Therefore I consider it a better supported format than most standardized formats.


for one, it's bundled with consumer versions of Windows as winsqlite3.dll. not sure when this started though


That page makes the argument for zlib & sqlite, but Daniel Stenberg makes some good points here[0].

My guess would be zlib is still number 1 though, even accounting for Daniel's considerations.

[0] https://daniel.haxx.se/blog/2021/10/21/the-most-used-softwar...


Sounds like a solution is to use the C implementation to define the standard and have it canonized in to an ISO.


Yeah, a solution in search of a problem.


That's what Opus did. The RFC[1] has a base-64 encoded libopus.tar.gz appendix (Appendix A), which is the "primary normative part of this [Opus] specification." If the prose and source code disagree, the source code takes priority and "wins" when it comes to which is normative.

I have a love-hate relationship with this approach.

[1]: https://datatracker.ietf.org/doc/html/rfc6716


funny, the RFC even includes a shell command pipeline to extract the base64 out of the awkward RFC formatting.

Using the C source code still leaves room for ambiguities / under-specification, no? After all, the semantics rely on the particular gcc release used for compiling the code.


There is still the possibility of a bug or under-specification, but that's always the case in any spec. At least with Opus they document what implementation-defined behavior they require, so assuming there aren't any hidden bugs then you should get consistent output across compilers.


but the semantics change depending on the build tool version and other factors.


That is common for codec standards, the normative part of many MPEG specifications is the parser/decoder in C-like pseudo-code. What is somewhat unique for Xiph is that their normative reference decoders are actually usable.


I think this has been discussed before about WebSQL.

> The [WebSQL] specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.



The Chrome blog post about deprecating sqlite-based WebSQL makes an interesing point. I believe it applies to OpenDocument as well.

> The Web SQL specification cannot be implemented sustainably, which limits innovation and new functionality. The last version of the standard literally states "User agents must implement the SQL dialect supported by Sqlite 3.6.19". SQLite was not initially designed to run malicious SQL statements, yet implementing Web SQL means browsers have to do exactly this. The need to keep up with security and stability fixes dictates updating SQLite in Chromium. This comes in direct conflict with Web SQL's requirement of behaving exactly as SQLite 3.6.19.



In other words, "all the implementors chose a standard, but we're the standard deciders so we're killing the whole idea".


The standard deciders are the implementors. There is no point in opposing them. The W3C is actually the representatives of Google, Mozilla, Microsoft, Opera and so on.


One problem was the standard was bug-for-bug replication of a particular version of SQLite.

There’s very good reason for that not to be a standard. (Now, assuming the SQLite documentation is licensed in a way which supports this, copying the documentation of SQLite’s supported SQL as of that version into the standard might have been viable, but no one interested in having WebSQL proposed that or any other resolution.

That relates to the cited issue of absence of independent implementations, which would have been a problem even with a spec that supported independent implementations and verification of their compliance independent of a particular reference implementation. though I personally think the spec problem is a bigger real problem (even if not the decisive policy problem) than the “everyone is using the same underlying software to implement the spec” problem is in this case, where the shared implementation is a permissively licensed open source implementation sponsored by several of the browser vendors, among others.


hmmm...I appreciate the thoughtful reply. You bring up an interesting point. What is the SQLite documentation licensed as? I would assume PD like the rest of it, but I don't know that for sure.


SQLite itself is in the public domain.


This could have been used as an opportunity to standardize the SQLite3 DB file format.


> less capable engine

There wouldn't be another engine.

It would be SQLite. Period.


It almost seems worth giving up ISO for SQLite, but I understand there are real concerns when you get into enterprisy stuff.

SQLite is kind of its own standard. It's public domain and they don't do breaking changes all day, and it's in C. As long as C is still viable, SQLite is usable on basically all non embedded platforms, and nobody really needs to reimplement it, unless they want to port it to Rust or something.

Not that you'd need to, since it's already very reliable.


This sounds exactly like the argument that killed WebSQL in 2010: https://en.wikipedia.org/wiki/Web_SQL_Database

I am still salty about this, as WebSQL would have made it much easier to build a certain class of web apps.


Coupling a file format to SQLite smells wrong.

SQLite is good, but it is also fairly unique in this space. Why? Because it’s hard to replicate everything it does, because it does a lot.

But… for this case, do we need it do a lot? No, not really. We don’t need the full SQL standard, a query optimiser, etc etc for basic (+ safe) transaction semantics and the ability to store data in a basic table structure.

Perhaps there is a better file format we can use, but it would be better if it was decoupled from SQLite.


You do need all these things for these applications. Efficiently and safely querying and writing data is central to any document format; you'll leverage both the file structure and in-memory structs to do this. SQL would probably work for this, in fact it's especially natural for spreadsheets (rows x cols).


I mean, clearly you don’t: OpenDocument works just fine without it.

You need key/value lookup, a way to list/paginate, and transaction semantics for updates.

AKA: a zip file with entries as keys, and XML documents or attachment blobs for values. What’s lacking and causes issues is the update semantics.

You can wack a SQL query language over those 3 operations if you’d like. Or don’t. Up to you, because the format is defined and can be reimplemented rather than the large, complex library api.


OpenDocument also leverages the file/dir structure for efficient querying like we're both saying. If you mean that you don't need SQLite over ODF, well yeah, ODF works too. I just wouldn't prefer it.

What's the issue with the update semantics, though?


The issue is incremental updates, and this is where things get complex. If you have a file embedded in the middle of a zip file that is 100 bytes, and you want to resize it to 150 bytes, how do you do that?

You can’t squeeze it in without moving everything else about, which disrupts other readers. You could append it to the end maybe, but you need to handle concurrent writers. Compression also is an issue here - I expect zip compression is applied to multiple files at once, rather than per file? So now you might need to update multiple seemingly unrelated files.

You need to step down from the concept of a whole file as a unit and move towards pages of data that can be incrementally updated/reused/freed, where each page might contain one, many or even only a part of a “unit” (file/row/whatever)

This makes things more complex for sure


So basically ODF loads everything into memory, relies heavily on in-memory structs for quick unsaved updates, and is crash-safe by writing the whole zip to a temp location during saves. Kinda similar to MS Office. The file structure also helps a little. This is good enough for small docs.

Many times have I encountered large docs, often spreadsheets, that push the limits here and become noticeably slow. If you want to get more sophisticated with the indexing and paging, SQLite is a very natural path. Anything else would be reinventing the same wheels SQLite has spent decades refining.


> Anything else would be reinventing the same wheels SQLite has spent decades refining.

Which is exactly the problem. They (I.e one dude?), and they alone have spent decades refining a single implementation.

Before we go and lock the entirety of the worlds documents into what’s essentially a proprietary format specific to a single implementation of a single library written by a single dude… we should double check if that’s a good idea or not, and if we can, collectively, solve some of these issues without reimplementing the whole of SQLite.

Because that’s complex. Perhaps more complex than it needs to be for most applications, which would benefit from the storage part more than the query part. And then we are back at the start of our discussion?


- Why not? https://www.sqlite.org/appfileformat.html

- Its size is less than a megabyte: https://sqlite.org/footprint.html

- 750KB if all features are enabled: https://www.sqlite.org/about.html

- Looks like fair amount of functionality can be left out when compiling sqlite and with options to influence/strip down query planner: https://www.sqlite.org/compile.html

- And "SQLite does not compete with client/server databases. SQLite competes with fopen()": https://www.sqlite.org/whentouse.html

In the end, you don't need a database, but a library that gives you database API and behavior.


The complaint is not “it isn’t good” but rather “it is not replaceable”. Since SQLite is so powerful, once you specify it as a format, you are stuck with SQLite forever.


Which is also "not a big issue", since it's a recommended Library of Congress storage format, and supported long term:




It is somewhat of a problem: the development team is very small, they don't take outside contributions (so nobody outside the core team really builds up expertise over time), and the vast majority of tests are proprietary. I hope they have a contingency plan just in case (some sort of a dead man's switch that publishes the test suite under a permissible license) as it would probably be quite difficult for others to maintain the same quality without those tests, or re-implement them in a reasonable time frame.


I'm pretty sure this is why libsql was created https://github.com/libsql/libsql


But that equally applies to getting critical bug fixes for your particular usage scenario of SQLite. It's not just about the viability as a storage format.

For the latter, because the stored data has such a simple format and the implementation has so few dependencies, I expect it will be very easy to get your data out for a long time to come. It's going to be tougher if you have business logic in views or other SQL expressions, of course, and if you rely on SQLite's particular approach to data types (as in “values have types“, but not much more).


Which is why I was clarifying the original complaint and not supporting the original complaint.


Once you pick ODF as a format, you're stuck with it forever... except I wouldn't categorize it as powerful.

[deleted by user]

> In the end, you don't need a database, but a library that gives you database API and behavior.

Why do you need a single library that gives you a database API and behaviour?

Wouldn't it be better to decouple those: provide an open, standard format that enables compact, fast, structured storage that is built to allow transaction/atomic updates.

If that exists then you can plug sqlite on top of that, or something else. Because you don't _need_ any of SQL, or really sqlite to improve the OpenDocument format. You need the storage format.

OpenDocument is very different from the pretty scientific/niche/highly-vendor-locked examples given in replies by others here. Locking this into a format developed by essentially a single person with a single implementation is absolutely mad.

But... it's less mad if the file format wasn't coupled to sqlite.


The Sqlite format is open and the spec is here: https://www.sqlite.org/fileformat2.html

I haven't studied the spec in detail but it seems comprehensive.

The fact that there also exists a high-quality, stable, public domain reference implementation can't really be counted against the format, can it?


> Wouldn't it be better to decouple those: provide an open, standard format that enables compact, fast, structured storage that is built to allow transaction/atomic updates.

The high-level software abstraction approach doesn't hold up when it comes to databases. This is such a wide and performance-critical interface that any abstractions are gonna leak badly. Even the SQL standard has all these impl-specific flavors. Many have tried to build layers on top that'll work with multiple DBMSes, and it's never worth. Anyone writing an app backend is just gonna marry a particular DBMS for the performance benefits (puns intended).

If for some reason an alternative implementation really needs to exist, SQLite is simple and open enough that someone can do it.


You’re totally right, but there are a few things missing: this isn’t a DBMS, really, and the files are not going to be huge.

You need fast listing/pagination, key value get/set, and transactional updates. Basically DynamoDB, but for a single file. Build a query layer on top of that, sure. Use those primitive to build persistent indexes if you want.

Or just iterate through the keys in a for loop. It fits in memory anyway.

You don’t need a fully fledged DBMS for a word document. And if you’re shuffling around lots of data in a structured format with no updates needed, you probably want arrow/parquet rather than sqlite because the read performance is going to crush SQLite.


I don't know, probably a lot of us have dealt with large docs that become noticeably slow to edit and scary to save, mostly spreadsheets.


Ok cool: so adding SQL to that is going to magically speed it up?

No. It’s the on disk format that matters. Because it would be just as slow and scary if it used a sqlite file that was embedded in a zip file or something equally as mad.

It’s not the SQL, it’s the file format.

If you decouple the file format from the SQL engine, it becomes simpler to reimplement, more agnostic and less vendor locked.


SQLite (not just the language SQL) would make it much easier to reimplement in a way that's fast and safe, yes.

> If you decouple the file format from the SQL engine

That alone would be a difficult project. If you really want to break things down, easier to say your document standard relies on SQLite's rather simple query language (https://www.sqlite.org/lang.html), and there, it's independent of SQLite's query planner and file format. Wouldn't be hard to make it work with Postgres or MySQL, for instance.


I’d love to understand your thinking behind the idea that a document standard should rely on a query language and not a file format…

Document standards are file formats…

Or are you saying a document format should just be some DDL statements? What? How is that interoperable? It’s coupled to the database that is storing the data as an implementation detail, which is exactly the problem with using SQLite.

> That alone would be a difficult project

I’m not suggesting using the SQLite file format, I’m suggesting the pretty basic idea that the storage for a general purpose widely used and interoperable document format should be logically decoupled from anything else, and definitely not be tied to the implantation details of a single library or even a single version of that library.

The file format is the most important part. It’s the only part. Nothing else matters because there is nothing else.

It’s not rocket science.


> Or are you saying a document format should just be some DDL statements? What? How is that interoperable?

Yes. How is it interoperable, because it's quite easy to make DDL for SQLite that also works for many other DBMSes, given that SQLite is kinda the lowest common denominator of those.

Maybe not as interoperable as ODF since it's easier to implement an ODF parser/writer than a SQLite clone, but probably more interoperable than some kind of advanced ODF designed for efficient updates. Just because you define a standard doesn't mean there are good portable implementations out there.


> Maybe not as interoperable as ODF since it's easier to implement an ODF parser/writer than a SQLite clone

Ladies and gentlemen: he’s so close, he’s nearly there, but he just can’t make the final connection!


A great thing about just using sqlite as the format is that you get lots of potential features. Sure most applications don't need full SQL power just to save and load data. But then at some point you might want more advanced functionality, or migrate to a new structure. And both you and your users get tools for free, e.g. to extract data or fix problems, or just look around. Other applications can quite easily read your files, without you needing to write various language libraries. Very few projects get around to building that kind of tools for their made up format.

I could agree about the single implementation, but if the alternative is making something new up I am not sure in what way that would be better.


Exactly. Some formats are designed, first and foremost, for interchange. SQLite is pitching that you, as an "app" owner, force the SQLite format upon your users to make it a de-facto standard, without putting the work in to make it a de-jure standard.

Show me a formalised ISO / IEC / ANSI / ETSI SQLite standard that the Richard Hipp and his company never deviates from, and the full legal search to ensure there are no patents that might affect it, and show me the multiple compatible implementations of SQLite that _all_ have these touted advantages, and _then_ we can talk about prosletizing it as a file format. If they don't, they're saying "take a hard dependency on a single-source implementation, and make all your users take it too".

XML is a formal standard. ASN.1 is a formal standard. JFIF is a formal standard. Even ZIP is a formal standard (adopted as part of standardising OpenDocument: ISO/IEC 21320-1:2015)

The most important thing about a document is that everyone _else_ can read it. Saving time on writing updates to disk is an irrelevant sideshow. Did we learn nothing from Microsoft perverting the standards bodies to try and keep its lock-in?


> A letter of resignation written by the departing members and made public by The Inquirer accuses the standards body of folding to pressure from Microsoft, violating its own procedural rules, and ignoring the analysis of the technical committee tasked with evaluating OOXML.


> SQLite is pitching that you, as an "app" owner, force the SQLite format upon your users to make it a de-facto standard, without putting the work in to make it a de-jure standard.

From TFA:

Note that this is only a thought experiment. We are not suggesting that OpenDocument be changed.


I think SQL, a formal standard, has shown that formal standards fail to define a good way to interact with a database. The only real implementations all broke the standard. And an editable document isn't far from a database.


How much work does it take to go from an engine that can read standard XML to one that can read an ODT document's XML and do something useful with it? At what point of complexity does that engine create a de facto standard?


Performance matters and is sufficiently captured via working incremental updates. The single largest upside of a proposal like this is captured by using SQLAR over ZIP. That's what the Library of Congress does when SQLite claims them as a proponent. It's what Fossil does as others in this thread have pointed out. It's suggested as "first improvement" in the linked article. It's also the only part that should actually be considered for implementation.

You are right to point out the folly of deeper implementations like having and needing to understand table structures for things like slides. However, the current status quo involves Microsoft implementing a fairly esoteric "update the XML file's bytes as they would be encoded in a ZIP file" in their proprietary tool (where they have enough money to invest the engineering time) and all other tools use the slower "whole file in memory" approach.

User visible features like incremental fast saves (and shared editing) keep people on closed systems and give Microsoft the leverage to do the things you warn against. SQLite as a container format could have prevented that by giving everyone a shot at a lower cost but still fast implementation.


To add to your point, fossil, the versioning system designed by the people of SQLite, and using SQLite, doesn't even use SQLite as a file format. It's all a bunch of blobs, each with its own format, that happen to be stored on SQLite. SQLite offers safe storage and a bunch of helpful indexes and views, but is not necessary for fossil-the-data to work.


Looking in sqlite.fossil there are 27 tables in it and most are not used for storing blobs. I know when looking up how to do things in the past the answer has sometimes been "run this SQL query". The event table for instance looks like a list of all commits with dates and comments etc. There is a config table that looks like the kind of stuff git stores in .git/config (URL to upstream repo etc) and so on. Well, yes there are some blobs in it too.


As described in https://fossil-scm.org/home/doc/trunk/www/tech_overview.wiki, all the commits are stored as artifacts, and then fossil creates metadata tables for quick access to useful information.

Configuration of a repo indeed isn't defined as an artifact but as a SQLite table. One may wonder if this should be part of a repo, and I would say it should, so it actually is surprising that it's not also stored as artifacts


Have you checked the Apple apps? Most of them use SQLite as storage format. iMovie, iPhoto, Voice recording…

Same with Docker. Can’t be that wrong?


The Apple apps are using Core Data, which uses SQLite as its persistent store by default. So Apple could in theory migrate away from SQLite by changing Core Data’s behaviour without any application-level impact. So in a way, these applications are already decoupled from SQLite in the way the parent comment suggests.


App using a format specific to their own and unique implementation, that ends up kind of proprietary is perfectly ok.

Using it for a open specification which target is cross implementation compatibility makes the move way more hazardous. Meaning, every implementation has to run on environment targetable and compatible wit sqlite or has to re-implement a compatibility layer on something complex enough that you only reliable definitive source of truth is the very famous sqlite test suite.

It the same reason why Web SQL has being abandoned: if sqlite is the sole api implementor, it takes precedence on any others specs, and you have no control on your standard.

I would be 100% for a specification on how to map open docs files to an relational structure, though, with a well know sqlite-backed implementation.


> Meaning, every implementation has to run on environment targetable and compatible wit sqlite

Well i get what you are saying, sqlite has been ported all over the place. It probably wouldn't be the limiting factor portability wise.


Yes, right now, no problem, and there little foreseeable future where a sqlite ported to anything would pose a problem. Still, decisions with no way back like this one requires extra cautions.


SQLite is already used for exactly this purpose. It's used as OGC GeoPackage and Mabox/Maptiler datasets use this.


I found the transactional aspect surprisingly difficult, especially with concurrent file access. SQLITE_BUSY handling was quite hard at the time. I know that serialization failures are expected in transaction processing, but for SQLite it was very difficult to tell persistent failures (say, due to self-deadlock) apart from transient concurrent update problems. For transient failure, you can re-execute the closure defining the transactional operation, but for persistent failure, that's of course pointless.

Part of the problem is that sqlite3_stmt combines aspects of both prepared statements and result sets. There is a tendency to keep them around to cache the compiled bytecode (prepared statement), but your might code might stop mid-iteration (result set), maybe holding a lock at this point. This can lead to surprising lock-upgrade failures. In the end, I wrote extensive error reporting using sqlite3_next_stmt, sqlite3_stmt_busy, sqlite3_sql, just to weed out those issues. The entire transaction retry code I wrote is full of optional logging and many comments, even though it was just for my own personal use. Before that, I wrote transaction retry logic for PostgreSQL, and that was so much easier (but it was before fully SERIALIZABLE transactions arrived).

The other surprise is that “ A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash.” (https://sqlite.org/pragma.html#pragma_synchronous), but that wasn't relevant to my application.


If you retry your write several times and it doesn't succeed you can tell the user it is a persistent failure without agonizing too much over the diagnosis: it is persistent enough to be a significant problem, even without proof that it is an application bug.

Who would attempt to make concurrent writes to an application document format? And how wouldn't such an attempt be a user mistake? Failing to write is the solution, not the problem.


These concurrency failures in transaction processing can be quite rare, but you have to fix them if you want 24/7 unattended operation. SQLite only has timeout-based conflict detection, so you basically have to decide whether you want to wait 60 seconds (or so, depends on how glitchy your storage is) before reporting a potential self-deadlock, which isn't great for development, or risk failing unnecessarily when actually running the job. I had no idea what the right timeout was, which is why I wrote some of the custom self-deadlock detection logic. I think I got to run it completely reliable in the end (no false aborts even under load), but as I said, it was surprisingly hard.

By the way, concurrent read/writes on locally stored documents happen, even on single-user machines. If the reading process uses the SQLite structure (say a document indexer that knows about the format), it has to take some locks and may also need to flush data from the WAL log (depending on implementation details). At this point you have to deal with concurrency issues in the application, too. Unless you rewrite the entire document from scratch on every save and put it in place with an atomic rename (which I ended up doing for a different application, not the transaction-processing one). But that loses some of the advantages of SQLite.


You are clearly discussing a shared database for concurrent transaction processing ("you want 24/7 unattended operation"), not people editing application document files.

Setting aside technological details, multiple clients operating on independent rows of the same table can only, at worst, waste time by retrying a transaction, while multiple concurrent users attempting to modify the same document are asking for trouble, and if they succeed they probably succeed at corrupting the document.

Even without lock contention the aggregate document state can be incoherent (for example, Alice and Bruno edit a text, but they accidentally modify the same section and the latest save prevails and nobody notices).


Implementing versioning in the file format conflicts with git, because each document is essentially its own little source control system. This can be surprising to users who copy the file and don’t realize that they’ve effectively copied the entire repo. Copying a file will sometimes include drafts they didn’t want to share. It can mean you lose control over when things are committed, and so you don’t end up with a useful history.

If you then check the file into git, you are storing one source control system into another one, and older versions appear in two different histories. To be git friendly, you don’t want to save anything other than the current version, and then let git do its thing.

Possibly the answer is “don’t use git, we have it covered,” but then the app developer should realize that they are implementing something like a source control system. How do people share drafts, review them, and merge changes? How do you publish a release that only includes the version you wanted to release?

And it does seem relevant that the developer of Sqlite actually did implement their own source control system [1]. Maybe they could have warned people about what they’re getting themselves into if they go down this route?

I wonder how terrible it would be to either use a git repo as your file format, or to build in git compatibility into your app somehow so you could push and pull?

[1] https://en.m.wikipedia.org/wiki/Fossil_(software)


It's pretty rare to put office docs into version control, as they are typically binary instead of text. So, doesn't work well. Perhaps there is a version of open-doc that doesn't use the zip file but a folder of XML instead? Also the XML might need to be optimized to prefer line-oriented operations.


Yes, it's rare to use git, but it's also pretty well-known that people can share more than they intended in a Word document. Perhaps true of Open Office as well? See:




Don't forget Exif, its thumbnail and even reflections, in photos.


Yes, in LibreOffice you can save as FODT: flat ODT, which is a single unzipped XML. That's what I use to store my resume in git.


I’m curious to know what a gsheet/doc/slide file actually is under the hood. I as the user am only ever presented with a link, there’s no way to download a gsheet in its native format.


Is SQLite’s disk format an open, versioned standard? Or is it just “however SQLite saves data to disk”?


SQLite file format spec: https://www.sqlite.org/fileformat2.html

Complete version history: https://sqlite.org/docsrc/finfo/pages/fileformat2.in

Note that there have been no breaking changes since the file format was designed in 2004. The changes shows in the version history above have all be one of (1) typo fixes, (2) clarifications, or (3) filling in the "reserved for future extensions" bits with descriptions of those extensions as they occurred.


Thanks for elaborating so thoroughly. I didn’t even realize you were on this platform!



What If OpenDocument Used SQLite? (2014) - https://news.ycombinator.com/item?id=25462814 - Dec 2020 (194 comments)

What If OpenDocument Used SQLite? - https://news.ycombinator.com/item?id=15607316 - Nov 2017 (190 comments)


Love the vibe of artivles, which present let's say reason-driven development vs habit-driven.

Why habit? Well, I can imagine back at the time OpenOffice was a fresh project, it went like this: "XML is going to stay forever and everybody uses XML, so ofc we use one... oh, it is so big! And there are many files, so we just zip'em"...

To be fair, the author of this excellent article doesn't even say about getting rid of XML in this format- but that could also be achieved by storing stuff in a SQLite file. Usage of XML was habitual thinking there- and not very visionary, as the format is dead now...


> Well, I can imagine back at the time OpenOffice was a fresh project

OpenOffice was born when Sun bought StarOffice, which was initially released in 1985 (on Z80 and certainly without any XML). So the project itself was far from fresh. OpenDocument was developed from OpenOffice.org XML format which was developed after Sun bought StarOffice in 1999. At the time XML was not used everywhere, but it was very much in vogue, certainly at Sun where the official line was that Java (created at Sun) and XML are going to conquer the world.


Could you clarify the "XML is dead" comment? Don't all the major document formats still use zipped xml? I had to interface with an xml format recently, and that isn't something I ever did, and when I went looking for a crate that parses an xml schema I kept running across this whole xml is dead thing. But it still seems to be everywhere.


Not GP, but I believe the "XML is dead" sentiment stems from the observation that very few greenfield applications are deliberately choosing xml. Sure you have legacy giants like (X)HTML, SVG, office formats, etc, but you'd be hard-pressed to convince developers (especially a younger crowd) to select it as a data format. It's seen as warty, cumbersome, unwieldy, verbose.


Other example: raster map tiles (basically up to millions of tiny square pictures)

Zip vs tar vs filesystem vs sqlite. Tested all these scenarios, and sqlite was the fastest and the smallest, even beating plain archives with no overhead


Many filesystems have an issue with tens of thousands or more files in a single directory, which is exactly what you can get with map tiles. No wonder sqlite is faster.


Yeah, that's why sqlite was adopted for this back then - many devices still used FAT32 on the storage volumes where tiles we often stored/cached and that had horrendous small file performance - a plain white 130 Byte PNG tile could result in 64 kB being used.


It is not just fat32 and overhead up to cluster size; once I had 800k tiles in a single directory on ntfs.

It was unusable. The only thing that I was able to do is to tar it up and move to a machine with xfs, where I was able to sort it up into more balanced subdirs and then move it back (for processing using windows-only tool). Just tarring that single directory up took several days.


It's not just ntfs. I tested this a few months ago in a pretty unscientific manner using ~50 million files in one directory.

btrfs was unusable (not only that particular directory, but the whole filesystem became noticeably slower).

ext4 was ok. xfs didn't break a sweat. I don't recall any practical difference when compared against a nested tree like

  ├── aa
  │   ├── aa
  │   │   └── aaaaf3ee5e6b4b0d3255bfef95601890afd80709
  │   ├── ab
  │   └── ac
  └── ab
      ├── aa
      ├── ab
      └── ac
[deleted by user]

Once we had to ship millions of extremely small files to our customer, we ended up throwing them into a MongoDB and serve them with a web server. It worked great.

We tried to use an image of traditional filesystems (ext4 and fat32), but with most files being under 1 KiB, it was super wasteful.


This is an extremely low quality comment, and I accept any downvotes, but I can't resist: would you say that MongoDB was web scale?


The way our "MongoFS" was organized is actually also low quality and probably would fall apart pretty quickly if used as a serious web service. However, it works if you try to quickly deploy and serve millions of files to a small number of clients.

Our use case was a Maven mirror for disconnected environment that only contains metadata (i.e. lots of small XML files, without the actual jar). We already had a MongoDB service for some other JSON data, so here we are.


If SQLite is faster, the problem is the zip library you use.

SQLite has a major draw back (and yes, I love SQLite and built a lot of things around it over the years): the blob you get from the DB cannot be mmap and you have to copy it to somewhere else. For zip files, as long as the file is not compressed, you can mmap it (or it is compressed using some exotic encoding such as PVRTC) just fine.


> there's no ISO standard defining how to interpret an SQLite file in excruciating detail.

There comes a point where ISOing things doesn't help. The SQLite format belongs to SQLite, and an ISO standard would result in that standard being rendered irrelevant by the SQLite team, should they wish to make a change for any reason. Also, people would have to pay ISO for access to the specifications. SQLite should be treated as a defacto standard defined by the SQLite project.


Just a heads up that it looks like you meant to reply to miki123211, but you've posted a top-level comment instead :)

[deleted by user]
[deleted by user]

> since OpenDocument predates SQLite

This shocked me. Impressive how far SQLite's come in such a short space of time.


Hmm, me too, and Wikipedia says:

> OpenDocument - Initial release: 1 May 2005; 18 years ago

> SQLite - Initial release: 17 August 2000; 23 years ago

Wonder what gives.


Hah - that tallies with my instinct on ODF at least. I'm confused too, then.


OpenDocument traces it's ancestry to OpenOffice XML format, which traces it's ancestry to StarOffice, which was xmlized around the time Sun bought it in 1999


That's fair, I wouldn't standardize on a 1-year-old database.


Not to be confused with Office Open XML (OOXML), Microsoft's "standard".


Thanks for clarifying the somewhat messy history of the format!

[deleted by user]
[deleted by user]

I was optimistic that Audacity adopting SQLite would be a substantial improvement in its file saving capabilities. In practice I encountered many gotchas:

- On Linux, saving into a new file onto a root-owned but world-writable NTFS mount created in /etc/fstab, fails due to permission errors or something. Saving into an existing file works as usual.

- Files are modified on disk when you edit the project in the program, creating spurious Git diffs if you check Audacity projects into Git as binary blobs. And when you save the file, old and deleted data is left in the SQLite file until you close the project's window (unlike saving a file in a text editor), and you can accidentally commit that into a Git repo if you don't close the window before committing. (I recall at one point that you had to manually vacuum the .aup3 file, but now closing the window is sufficient.) I'm getting Word 2003 Fast Save vibes.


It’s also a bit of a bother if Audacity crashes (or is otherwise terminated abnormally), as the cleanup just doesn’t happen at all then, whereas in the past the recovery process would mention the presence of orphaned blocks and allow you to choose to keep or delete them. But when I had a several-gigabyte project that should have only been a couple of hundred megabytes, and needed to save disk space, I finally found a solution suitable for my simple single-track stuff: Mix and Render. Doesn’t change the audio, but allowed it to clean up the detritus on save and exit. But all up, this is clearly an application-level problem, not something inherent to SQLite.

Hmm… I think I vaguely recall that Audacity 2 had the concept of a temporary working space, whereas it seems that Audacity 3 just uses the .aup3 file as its working space? Some advantages, some disadvantages.

Mildly less on-topic: I looked into Audacity 3’s format, and was utterly baffled by what they’ve done with the project data (what used to be the .aup file). They still encode it as XML, storing it in a single-row table, but instead of just writing it as text, they use a simplistic dictionary coder on it. Just… why? Why did someone go to all the trouble of writing that code? It makes interoperability and inspection much harder, surely harms performance (even if by a trivial amount), and the space saving will be rounding error in every plausible case (like, maybe as much as a few kilobytes out of hundreds of megabytes of audio files).


Yes it should replicate the functionality user expects - save everything into temporary file and overwrite the original file only on explicit save action.

As for Git, it would benefit from using text format specifically aimed for easy diffing/merging. No idea how easy the sqlite dump is in this regard.


> Yes it should replicate the functionality user expects

Do users really expect this now a days? Most users use cloud apps, and almost all of those save after every operation automatically.


Which is a compromise for using browsers really. It isn't a good solution and no user really understands this and I believe it is the most hated feature of the new cloud world. Yes, leaving the page open for multiple hours might now allow you to save because your access token expired. No, communication in the background is unreliable too. Autosave is a bad band aid for a bad solution.

Doing periodic and automatic saves is good. Doing so on a document "in production" is majorly stupid. Not that I want to accidentally validate the busy work dev ops puts us through.


Its pretty easy to make a cloud app that emulates the traditional working draft/save workflow. Browsers all have pretty reliable local storage technology now a days if your network is unstable. I don't think this design choice is a compromise of the medium. If anything it seems like if you were going to have to compromise for web you would do it in the other direction so apps are more usable during poor network conditions.

I would say the traditional model is a compromise from back when disks were unacceptably slow to be saving constantly.


> As for Git, it would benefit from using text format specifically aimed for easy diffing/merging. No idea how easy the sqlite dump is in this regard.

The problem I'd predict here is that then people would expect to be able to do three-way merges. It might even work correctly a lot of the time, depending on the exact pattern of changes. But my gut feel is that unless the schema were designed just right, there would be possible merges that would result in a database that was valid from SQLite's point of view but insane from the application's point of view (broke expected variants, etc).


If you want to use flat files you should just use flat files. There are plenty of unix tools to treat them like DB.

You're not going to have a sensible text version of a btree that is reasonably editable by a text editor.


I set up my Git to use the SQLite dump on SQLite files when using “git diff”. This at least shows me the changes row-by-row, or shows nothing if no changes.

I don’t expect to be able to merge though.


I have been told that the new generation of users does not expect, want or appreciate applications that use explicit saves.

I've also been told that they don't understand or even want to understand folders...


The context here is someone using git. Who presumably understands folders.


My wife uses Audacity all day and every few days there is a corrupt sqlite file (duplicate key) which cannot be (as far as we know) repaired/reimported etc from Audacity. I can fix it manually if it's important, but usually just throw the file away and things work again.


Duplicate keys in a SQLite file sounds like an audacity bug. :(


Maybe. The SQLite list of gotchas [0] is quite something. NULLs in the PK? Sure. FKs don’t actually do anything unless you pass a PRAGMA? Why not? Etc. I could easily see someone not fully grasping just how much SQLite lets pass by default, and thus not having a test catch it.

[0]: https://www.sqlite.org/quirks.html

[deleted by user]

Yeah, it definitely is. And it's fixable manually. Kind of the advantage to an open file format with nice tooling.


It also sounds like something that could be manually prevented ahead of time. If you can crack open the file on first save and add the right uniqueness constraint, that should make Audacity crash when it tries to corrupt the data.


> ... that should make Audacity crash when it tries to corrupt the data.

That'd be fairly non-optimal behaviour. ;)

When the application tries to add wrong data (eg duplicate key violating uniqueness constraint), SQLite will return an error.

The application should handle things better than by crashing. In theory anyway. :)


If your choices are a crash or corruption, choose the crash.


Sure, data integrity is important.

But hopefully the choices are a bit better than just those two. :)


I don't want people to read my drafts. That could be highly embarassing, and they should not make it into the final saved document.

Past version and undo history should be stored separately from the document. They should be stored out of tree where they wont be commited into some git repository or be automatically synced or anything like that.


I want to be able to read my drafts, until I decide to bake a publication version.


Did you read the other part of my comment? Where I said to store the draft, but not in the document itself?


I did.


Then don't give people access to your drafts but exported versions without history? Why put the limits on the efficiency of a format by forcing it to store changes elsewhere?


It's better if such gotchas don't exist. Otherwise you'll have every user get burned by it at least once, and blaming them for not knowing the subtle consequences of using "Save As" instead of "Export As" is not going to help anyone.


There are plenty of burns on the other side as well with users losing edits and the consequences of copying your file in a file manager with/without some out-of-tree-out-of-sight history are even more subtle.

This is an app feature (it doesn't have to be "Export As", it can be a "clean history" toggle in the same "Save As" dialog and a separate command), so not a reason to excise efficient history preservation from the file format


As an aside, this blew me away. I can hardly believe it. No nested query required?

> SELECT manifest, versionId, max(checkinTime) FROM version;

> "Aside: Yes, that second query above that uses "max(checkinTime)" really does work and really does return a well-defined answer in SQLite. Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect: it returns the manifest and versionId of the entry that has the maximum checkinTime.)"


It's not really what one would expect in SQL, but SQLite often defies expectation. In this case, handy, but non-standard.


That's basically short-hand for

  SELECT manifest, versionId, max(checkinTime)
  FROM version
  GROUP BY manifest, versionId

  WITH m AS (SELECT max(checkinTime) AS checkinTime FROM version)
  SELECT v.manifest, v.versionId, v.checkinTime
  FROM version v
  JOIN m m USING (checkinTime)
  LIMIT 1;
It's a bit of a footgun though because there is some randomness here if multiple rows have the same max checkinTime, so I try not to use this SQLite3-ism. You want to also do something to deterministically pick a "best" row, but for that you need to do something like the above.

> Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect:

It may be a useful functionality, but it is NOT what I would expect such a query to return, to be frank.

Also you don't need a nested query in this specific, you can order by checkinTime and limit the result to one.

> select manifest, versionId, checkinTime from version order by checkinTime desc limit 1

or something like that. This should work in SQlite and Postgresql at the minimum. I think to remember that in Oracle you have to use "where rownum=1" so indeed you have to use a nested query. I don't know about other databases.


I agree, that doesn't make sense to me either. What about select versionId, max(checkinTime), min(checkinTime)? Can as well query SqlGPT. And above all, it's not what the SQL standard says when that's the entire point of using a standard in the first place.


Well, it doesn't error out! In this example, it seems like it picks the result from whatever matches the last column, but not sure if this is determinstic:

    sqlite> create table x(c1, c2);
    sqlite> insert into x values ("a", 1);
    sqlite> insert into x values ("b", 2);
    sqlite> insert into x values ("c", 3);
    sqlite> select c1, max(c2) from x;
    sqlite> select c1, max(c2), min(c2) from x;
    sqlite> select c1, min(c2), max(c2) from x;
(note: since SQLite is dynamically typed, no need to specify column types for simple examples like this).

> or something like that

That query isn't guaranteed to produce a well-defined result in most SQL engines. (For pretty much the same reason the original doesn't/can't/shouldn't…) In the simple case of two rows with the same `checkinTime`, many engines permit the results to be ordered arbitrarily.


The interesting thing is if you want more than one record, like you want the latest version number for each document ID. In SQLite you could do: `SELECT documentId, versionId, max(checkInTime) FROM version GROUP BY documentId`. In Postgres you can do `SELECT DISTINCT ON (documentId) documentId, versionId, checkInTime FROM version ORDER BY versionId, checkInTime DESC`.

See: https://www.sqlite.org/lang_select.html#bare_columns_in_an_a...


MySQL allows the query, but the non aggregate fields are selected randomly


Following MySQL's longstanding tradition of just doing whatever instead of showing an error message, no matter how unreasonable the result.


MySQL + PHP, name a more iconic match

[deleted by user]

randomly, but after filtering by the criteria in the WHERE part of the query. This can actually be useful sometimes if all non-aggregate fields contain the same value (though I wouldn't actually rely on it, since whether this is allowed depends on how the database is configured, and it makes it easy to introduce errors by changing the query)


It only allows that if you’ve set it to do so. The default SQL_MODE variable includes ONLY_FULL_GROUP_BY.

However, in their brilliance, AWS RDS defaults to only NO_ENGINE_SUBSTITUTION for SQL_MODE, thus merrily allowing partial aggregates with non-deterministic results. Wheee!



Prior to 5.7, MySQL always accepted non-aggregated fields.

Version 5.7 introduced ONLY_FULL_GROUP_BY, but since that change broke lots of code that depended on this historical behavior, many people disabled it.


The docs don't make it clear that this works as stated. The first docs I found don't say that they come from the matching row:

From https://www.sqlite.org/lang_select.html#generation_of_the_se...

> Each non-aggregate expression in the result-set is evaluated once for an arbitrarily selected row of the dataset. The same arbitrarily selected row is used for each non-aggregate expression.

Does `max` somehow only affect the selected rows? Or is this relying on a side affect of the query planner sorting the table to optimize max?

However then I found https://www.sqlite.org/lang_select.html#bare_columns_in_an_a...

> If there is exactly one min() or max() aggregate in the query, then all bare columns in the result set take values from an input row which also contains the minimum or maximum.

In all of the nearby examples contain an explicit "GROUP BY" clause but I don't think that this section says that one is required for this behaviour. So I guess this is the behaviour that is being described.

However I found this rule as well:

> If the same minimum or maximum value occurs on two or more rows, then bare values might be selected from any of those rows. [...] The choice might be different for different bare columns within the same query.

Which is in conflict with the earlier rule which says that the row is consistent. Or is this consistent row rule only provided for the implicit grouping. AKA is a and b guaranteed to be from the same row for the first query but not the second? That would be very surprising, maybe the docs just promise too little?

    SELECT a, b, MAX(c)
    FROM t

    SELECT 1 as group, a, b, MAX(c)
    FROM t
    GROUP BY 1
There are also more not-well sepcified results if multiple of MIN or MAX are used or if these functions or customized. So overall it is probably best to avoid this in "production" use. But can be convenient for some quick exportation if you are careful.

I think this is a convenient side-effect of the implementation which was later turned into official behaviour. A bit like Python dictionary key ordering.

In Postgres you can do similar things with a DISTINCT ON query.

I always found this one of the hardest simple things to do in SQL.


I think if you have a set that you want the latest value from, in all engines you can do something explicit like:

> SELECT manifest, versionId, checkinTime FROM version ORDER BY checkinTime DESC LIMIT 1

The problem with putting aggregation functions in the select output is that you're being unclear about what is aggregating; that pattern begins to break down once you have e.g. multiple documents in the same schema. Or if versionId somehow wasn't linear (e.g. branches of changes).


That's um… quite the aside. How can it possibly claim that to be well-defined, given that `manifest` and `versionId` are not functionally dependent¹ on `max(checkinTime)`?

¹e.g., there could be two rows with the same checkinTime, whose value happens to then be the max such.


Man do I love SQLite.

Over the past 1.5 yrs I've build a computer vision tool from recording hardware/software, to derp learning pipelines, to front-end; we had some requirements on the recording side that were difficult to solve with existing solutions (storing exactly timestamped camera frames, gps data, car telemetry and other metadata).

Using a SQLite-backed data format for the video recordings made implementing things by ourselves super straightforward.


I'm working on a similar problem and I've been struggling to convince all my colleagues that we should sqlite most things. By any chance do you have some public code, or blog posts to share?


Not in public repos, but sure. Drop me a line, hn at rombouts dot email.


> derp learning pipelines

This accurately describes the majority of my efforts, too.


Honest to god this was an unintentional typo, but I decided to leave it in as it was just too juicy


At this point, why are we still using JSON/XML when there is SQLite for new projects? Stop the non sense of JSON/XML. SQLite is like json, but very queryable. Just send SQLite files around.

MongoDB also saves document db type of store space just FYI.

[deleted by user]

With JSON/XML the app owner decides the schema of the saved file, as they should. One day Sqlite will do some perfectly fine change that’ll break people who outsource their file format to it. Own your file format!

That said there is some nuance and it depends what the user expects. Is you app more of an MSWord where people expect a format that is decades backward compatible and only changes on explicit save, or is it more like a live app with a db back end. If the latter there should be no save concept around the DB file but perhaps a backup and restore function that exports to a controlled format.


XML and JSON are for serialization. zip and sqlite, and file-system files, are for lossless persistence. They're separate issues.

An app can go bananas with serialization and use, I dunno, binary JSON or Matroska / ebml or .mp4 containers or whatever, and still serialize any way it wants.


In sqlite the on-disk file format does not matter.

All that matters is that you should be able to issue sql to the sqlite embedded library and get back the results.

Freeing you from the overhead of owning (thus inventing and then maintaining) your own file format is almost the entire point of using sqlite in this manner.


It matters for 2 reasons. One, the expectation that the file changes only when you click Save is broken (as mentioned in another comment), and Two, unless you pin the version of sqlite forever then the file format may have braking changes or your need to deal with migrations.


> expectation that the file changes only when you click Save is broken

This has nothing to do with sqlite. You can have (or not have) gradual saves in any file format. It's a choice that the developers of that app made.

> file format may have braking changes

The sqlite file format is unchanged for 19 years now. A world of features and capabilities have been added since. Don't hold your breath waiting for the sqlite format to change.


This is not gradual saves. File changes when nothing has saved even, according to that comment.

Fair enough about the history of it not changing and you can always embed a frozen copy if it does. But this is a pragmatic assumption not a guarantee.


Any text editor in the world, even the ones that ship with the most barebones shells, can open json and xml and present their data to the user.

SQLite files require opening in a DB terminal or using special software to even get to the point where one can see what’s there at all. Further the entire internet basically natively supports XML and JSON.


That is a good argument, however many people like some big game development company start to ship with GB of json file, at that point just use SQLite. It will be faster to query load. Also if you look at how DB such as Mongo (Not promoting them in any way), but when Maildir is used aginst Mongo for file storage, Mongo saves a lot of disk space. Again, it is about how we want to store files? NixOS is a quite a way to think about having a file system or db/store.


Outside of simple cases xml is too verbose and ugly (and in these cases usually zipped), so it's not suitable for a poor human with a plain text editor, so that doesn't give you much of a leg.

(Json has a higher threshold of complexity before it succumbs)


> The use of a ZIP archive to encapsulate XML files plus resources is an elegant approach to an application file format. It is clearly superior to a custom binary file format.

I feel like I have considerable disagreement with the author of these sentences.


Why do you disagree?


Why only documents? How about a SQLitefs?


WinFS (https://en.wikipedia.org/wiki/WinFS) without the mssql Engine?


Homebrew can't install its prerequisite osxfuse onto Ventura.

There is also this, which seems to work: https://github.com/jacobsa/fuse

and this: https://github.com/jilio/sqlitefs


XSLT processors work by accessing the file system. Would this sqlitefs be a way to run XSLT against an SQLite database? Or is there maybe some other way to run a file oriented XSLT processor against an SQLite database in the SQLAR format?


And of course there is sqlarfs, at the bottom of https://www.sqlite.org/sqlar/doc/trunk/README.md


I'm currently working on an application where I use SQLite as the file format. I want to keep a usual workflow for users where you can make edit to your document and it only changes the file when you save it.

So to open a file I copy it into the :memory: database [1], then the user can do whatever manipulation they want and I can directly make the change in the database I don't need to have a model of the document other than its database format. And to save the document I VACUUM [2] it back to the database file. It works quite well, at least for reasonably sized file (which is always the case for my app) :).

[1] https://www.sqlite.org/inmemorydb.html

[2] https://www.sqlite.org/lang_vacuum.html


Why do you use a secondary, volatile database ? Performance-wise you won't gain a lot more (we're talking about a user editing a file, so not even 1 write per second).

A proposal: write directly, and automatically in the database. No more Save button. There are multiple advantages:

- the system is crash-resistant. I like taking the approach of CouchDB where the only correct way to close the system is to crash it. That way a crash is an expected situation that you actually account for, not a special case that you might forget

- there is only one database. Less code, fewer bugs.

- it is safe. A write to SQLite works or doesn't work, there is no in-between. As said in the VACUUM doc you point to: "However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt"

- it is how SQLite was intended to work. And because of that, you won't have to think about it for the lifetime of SQLite


> Why do you use a secondary, volatile database ?

For the exact reason I gave in the comment you are replying to: I want to keep a usual workflow for users. Principle of least surprise.

Users are okay with change being autosaved when there is a single "thing" that can be edited to the point that you don't even have to open it, it's just there, it can be seen as a property (as in ownership) of the application more than of the user. For example, your music library in your jukebox application.

On the contrary, when the user have to open the "thing" with your application and can choose between many of their files that can be edited with your application, users do not expect their files to be automatically modified at all. For example users may start doing some heavy editing and then at the moment of saving their work, they might make a backup of the previous state file before saving, or choose to "save as…" in order to keep the old version just in case.

Crashes are not something that happen that often. It can become an actual problem when you have tens of thousands of users and rare-events do happen, but in the particular case of the application I working on, I do not actually have to worry about that (on the contrary, any solution would have downsides that are worst in the particular case of this application than having to do some work again because of a crash if it ever happens).


You can keep a distinction between old and new version inside the same database, by having a pointer to the "current" version, and updating the pointer when clicking on "save". You could store all changes in the database in a "staging area", such that when you reopen the app you can load the changes and you don't need a recovery phase, but with the "save" button active meaning that something changed since last save.


I could, but as you said in your previous comment (emphase is mine):

> Less code, fewer bugs.


Another option is to explicitly start in read-only mode (modification buttons hidden / grayed out, some distinct mode indicator "Viewing Document" next to a button to "Start Editing", etc), and when the user chooses then switch into autosave mode. At this point many users are used to autosave and don't pay due attention to the document saved state. With Microsoft Windows' habit of rebooting your system overnight without your explicit permission, I'm concerned that this might lead to a lot of lost work.


There is nothing I hate more than an app that modifies files secretly when I open them. Then I have to get all defensive to copy files before I open them to keep them intact. You may not see the problem with changing the checksum or hash of a file, but silently tampering with files is a nightmare in many domains. If you open a file and accidentally change something trivial (some apps like to store things like presentation state i. e. window positions, last page viewed, zoom level, ...)

For example in many regulated domains such as human subjects research files must be approved and only approved files may be used. "Is this version of the consent document the version that the IRB approved?" Well let's see... (1) file modification date is after the approval date and (2) checksums do not match.

Not to mention that writing a single byte of content to a filesystems marks the entire blob as needing backup.

The fact is the filesystem is the user's database, save is commit, and it should be under the users control because application developers do not have the faintest idea about user context.


Word has a fairly simple solution to this: there's a big slider labeled "Autosave" in the title bar, right next to the save button, allowing you to turn this behavior on and off at any time.

95% of the time I want changes persisted immediately, but it's nice to be able to turn it off when I don't.


Depends a little on the type of file. A prose document, sure, probably want autosave by default. A vector graphics file? I want autosave when I'm creating it, but I do NOT want autosave when I'm copying out a piece buried several groups in and behind some things I need to delete/move out of the way. I also don't want to have to think about whether I need it or not.

But generally the way autosave works is to save a copy that can be recovered on a crash, and only overwrite the original if directed by the user. That works for both use cases. (Haven't used Word in years, so I'm not sure if they have a different behavior now.)


For an application working with reasonably sized files sqlite files it would be reasonable to

1. on opening a file clone it to a temporary folder

2. edit the temporary file there on disk

3. on save mv/cp the temporary file over the destination

I am probably missing a lot of use cases, but it migth be a good idea for a game like Factorio where you are expected to have multiple on disk saves of the same run at different times.


In the sqlite case, I think it actually can save uncommited edits to a separate journal file until committed. At least, one of the systems I am familiar with that uses sqlite as a container format (MRI scanner) seems to do this, so I suspect sqlite supports that mode natively.

I'm just pushing back against the idea that its a good or helpful idea to "help the users" by taking the deliberate "save" action away from them.

As an aside, one of the things that has been learned from this class of MRI scanners is that users need to feel "in control" of the machines they're using. The "look how smart this machine is by doing all these magical things you used to do yourself!" attitude works well in sales but really does not go over well in the field because users encounter the fuckups and are held responsible for them. So they quickly start to distrust the machines.


> Not to mention that writing a single byte of content to a filesystems marks the entire blob as needing backup.

If the size, mtime, and inode number stay the same (i.e. it writes into the file directly instead of replacing it), then most backup software will skip it. AFAIK to do otherwise you either need to read the whole file every time, or be live monitoring audit events to see what files have been opened for write, or be the filesystem (e.g. ZFS snapshots, which can be maximally efficient since it knows exactly which blocks it's modified)

Of course this has its own downsides. While those writes may have been "unimportant", the fact is that your backups are now flawed. And if the application has had the foresight to distinguish unimportant writes, and preserve the mtime, I'd rather they just not make those writes in the first place


I am under the impression that modifying a file's content updates the modification time. Is this incorrect? Modifying a file without updating the mtime or allowing mtimes to be edited in userspace sounds like a security nightmare.


Yes it does. mtimes can definitely be edited, if you have permission, but it is rare. I have a photo script that pulls the taken time from Exif and writes it to the file mtime.


There's also SSD wear issues. There shouldn't be, because SSDs are durable, but some applications find dumb reasons to write multiple GB in a minute.

And by some applications I pretty much just mean browsers, but still.


You are right, this is a use case I absolutely did not take into account, but I want to separate user-defined actions and app-defined actions. A level of zoom is something a user does to read a document, but I wouldn't consider that as data to be persisted automatically, unlike characters typed or a font chosen. I value the idea of persisting it, but that would be a user-specific action ("Save view" or something like that)

In the case of checksums in a database, that is why read-only modes should be used and I don't see what automatically saving would change. If anything, when the user zooms on a document in read-only mode, either it shouldn't be stored or storing it should trigger the same flow as modifying the document


I see what your are saying, but in other use cases the presentation state needs to be considered as part of the document. This is one of the reasons zip/jar containers work somewhat well. You can audit different chunks of data separately and cryptographically sign them. sqlite actually has an archive format[1] that is interesting to think about and I have pondered using it for some applications (store the files and also store tables of metadata/analysis)

[1] https://sqlite.org/sqlar.html


Why not do it like Blender: just autosave into some software directory, have the possibility to restore on crash, have the possibility to restore the last n autosaves from disk and add a setting for how many to save etc in the options.


> I like taking the approach of CouchDB where the only correct way to close the system is to crash it.

The term you're looking for is (aptly named) crash-only software.[0]

0. https://en.m.wikipedia.org/wiki/Crash-only_software


a save button is still good, as it allows you to keep specific checkpoints.

but the save button could simply tag specific save points in a larger table.

if the format can roll up changes to compress them, they also indicate where which variants need to be kept indefinitely.


Auto-save is nice, but I think it works much better when it's a separate file.

This way "main" document file (which might be checked in to git, or shared via dropbox or read by some document) only contains nice, clean, saved version.

And yet if your computer crashes for whatever reason, the data is still not lost and can be trivially recovered.


In this view the save action is more like a commit, where the user manually checkpoints and also offers a simple description of why this is an important point. But in my view all intermediary points also need to be saved, because the user might have forgotten to explicitly checkpoint, and might still want some undo/redo capability that is more granular than just checkpoints.

[deleted by user]

> I can directly make the change in the database I don't need to have a model of the document other than its database format.

I don't get your point. Are you saying that you don't need to have a model of the document other than the model of the document? What's the nuance I'm missing?


When an application loads a document, for example if the document is formally a list of things (imagine a very simple TODO app), the usual approach is to have this data represented (modeled) as an actual list in your program, like a Python list of objects, because it's what is easy to manipulate programmatically.

Then, saving your document means serializing the data in some format (which could be JSON, XML, CSV, an SQLite database, …) and writing that to disk, and opening a document means reading the file from disk and unserializing it to your internal model.

What I'm saying is that my approach is to use an in-memory SQLite database as the internal model of the data in the applications. I presented an upside (opening and saving are easy), but is also has downsides: I have to do SQL queries to manipulate the data rather than manipulating objects directly (which could be mitigated using an ORM but that's outside my point). In Python-like pseudo-code you can imagining something like:

    self.todos[42].status = 'DONE'

    self._db.query("UPDATE todos SET status='DONE' WHERE id=42")
(Of course there is the possibility of using ORMs or other approach in between the two.)

I suppose this is in the context where you will be syncing up the changes to a backend server which will also be storing the document in an SQL database. Normally, you might expect that data format on the client to be JSON/XML/something else, and you'd need to maintain logic that marshalls the document representation

    SQL <-> In-memory representation <-> Disk format. 
With SQL on the client, in theory you only now need to maintain

    SQL <-> In-memory representation
Obviously I'm skirting over the format you would use to send either entire documents or partial updates of documents over the wire.

An in-memory data model often differs from the serialized data as it exists on disk. For example, emacs uses a gap buffer for text files; but it outputs plain linear text to disk.

Programmers often have to make software design decisions around how to represent a file in memory in order to manipulate it. For example, if I'm writing an HTML editor, should I mostly treat it like a text file (maybe a gap buffer) with syntax highlighting and auto indentation as an afterthought? Or should I maybe load the whole thing into a tree? What are the robustness and performance characteristics of each?

The commenter above was saying that using SQLite made that decision easy. He could keep traditional (or "atavistic" per the commenter upthread, depending on your perspective) load/save semantics while also making the data model easy to work with.


> An in-memory data model often differs from the serialized data as it exists on disk. For example, emacs uses a gap buffer for text files; but it outputs plain linear text to disk.

The whole point of my remark is that the domain model and the export document format are two entirely separate things.


Why not use a transaction?


A single transaction for the whole user session? That seems a bad idea. Also I'm not sure you can do transactions during another transaction, and I need them for other purpose, i.e., for what they were designed to do (doing changes in multiple tables that need to stay consistent).

[deleted by user]

It’s exactly what transactions are for. A nested transaction is called a savepoint, which sqlite does support.


Btw, Apple's CoreData, commonly used by iPhone and Mac apps, uses SQLite by default. That part works fine, so you can study it if you'd like and ignore all the bad parts built on top (ORM, MVC framework, etc).


We used something similar (DB doing caching run in memory but saved periodically on disk) but with backup API



Maybe simpler? When open the DB, change it to WAL mode, turn off the automatic checkpoint https://www.sqlite.org/pragma.html#pragma_wal_autocheckpoint

When user saves, you just checkpointing the file, merging it back into the main database.


> The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction. This means that when VACUUMing a database, as much as twice the size of the original database file is required in free disk space.

> The VACUUM INTO command works the same way except that it uses the file named on the INTO clause in place of the temporary database and omits the step of copying the vacuumed database back over top of the original database.

Do you use VACUUM (uses a write-ahead log to survive power-off) or VACUUM INTO (as far as I can tell, it doesn't survive power-off during writing, and might corrupt the existing file contents if the filename already exists)?


>> The file named by the INTO clause must not previously exist, or else it must be an empty file, or the VACUUM INTO command will fail with an error.

EDIT: there is no difference between VACUUM/VACUUM INTO - they both write to a new file (COW) it's just VACUUM [NOT INTO] does mv temp.sqlite originalfile.sqlite after that, while VACUUM INTO does not.


This means that like a regular app, you lose data if the app crashes or there is a power loss.

It's much better to save after each operation in a temporary place (probably in ~/.local/share/application/yourapp, using XDG directories), and when the user clicks save, just copy the file into the desired location. That way, if there is a power loss and you reopen the app, it opens right back where it was doing (losing maybe he last few seconds of changes, but not all unsaved data)


If you have a db, why not just model it as unsaved data? I.e. all changes get stored to the db, but have a flag of unsaved. If you open up a file and there are unsaved changes, you can prompt the user to either make them saved or discard them.


That feels like it requires the data model to be very different? The file format would essentially need to be a list of changes, with a "committed" flag.

Like, if someone changes some text in a paragraph, you can't just model that as "this paragraph now contains this new text". You have to model it as "this paragraph used to contain this text, but an uncommitted change changed it to this other text". User deletes an image? You have to still store the image and all the references to it, but with an uncommitted change to delete the image and remove the references to it.

And maybe that's a good thing, maybe a git-like system where the history of every change is tracked is what you want. But it certainly doesn't feel like it'd be appropriate for every application and file format.


You wouldn't necessarily need to track every change, you could just have 2 tables, one which contains the last "saved" version of the document, and one which contains the last modified version of the document. Upon opening after a crash, if there is a more recent modified version, the program will ask if you want to load that version.


Databases handle all this natively with transactions and WALs. i.e. Don't need to build a Flintstones version yourself.

Also binary documents are a lousy fit with git, smashing square peg into round hole makes little sense.


> and when the user clicks save, just copy the file into the desired location.

To be perfectly safe, you want to rename it, not copy it. If there’s a power loss during copying, you may endcup with corrupted data.

Renaming is, to coin a phrase, “more atomic” than copying (on Linux, the OS says it is atomic. ISO C says it, too, but POSIX doesn’t (https://pubs.opengroup.org/onlinepubs/000095399/functions/re...: “This rename() function is equivalent for regular files to that defined by the ISO C standard. Its inclusion here expands that definition to include actions on directories and specifies behavior when the new parameter names a file that already exists. That specification requires that the action of the function be atomic”)

Also, filesystems may have bugs, hardware may lie about syncing to disk, and network shares can be finicky.

Doing this properly isn’t as easy as one would think. You’ve to make sure to sync the file to be written and you’ll have to handle the case where the save location is on a different file system than your temporary file. If so, you’ll have to create a copy on that file system first.

I think many tools do not check whether they need to work cross filesystem and just write their scratch files to the save directory with a different name and then rename them.

Of course, that means you always need twice the disk space on the target disk to do a save. That used to be a problem almost everywhere, but nowadays mostly is restricted to embedded systems and USB sticks.

In this case, however, SQLite will do a lot for you, and probably better than you would do it. It claims (https://www.sqlite.org/atomiccommit.html#_multi_file_commit):

“SQLite allows a single database connection to talk to two or more database files simultaneously through the use of the ATTACH DATABASE command. When multiple database files are modified within a single transaction, all files are updated atomically. In other words, either all of the database files are updated or else none of them are. Achieving an atomic commit across multiple database files is more complex that doing so for a single file. This section describes how SQLite works that bit of magic.”

However, about VACUUM INTO, it says (https://www.sqlite.org/lang_vacuum.html):

“The VACUUM INTO command is transactional in the sense that the generated output database is a consistent snapshot of the original database. However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt. Also, SQLite does not invoke fsync() or FlushFileBuffers() on the generated database to ensure that it has reached non-volatile storage before completing.”

So, I don’t think doing “VACUUM INTO” is sufficient to guarantee that you get a good copy of your data on disk.


Well, copying is simply not atomic in linux; directory entry operations (rename, link, unlink) are atomic. There is a definition somewhere that says how many bytes may be written atomically; that's it -- past that writes are not atomic.

The prior comment of "model user interactions in the database" seems spot on -- just keep track of what the user's doing as unsaved data in the database and commit it (in the appropriate way) to the DB as it happens; save is just another user action.

Presumably sqlite has figured out how to write to the filesystem without corrupting itself even in a variety of adverse scenarios?

If not, commit to a temporary copy of the DB that gets renamed to the "main" name periodically or when the app closes. There's an xzzzbit meme there somewhere, yo.


> The prior comment of "model user interactions in the database" seems spot on -- just keep track of what the user's doing as unsaved data in the database and commit it (in the appropriate way) to the DB as it happens; save is just another user action.

The trouble is that often in the wild, the content of the file on the filesystem is a user-facing interface. Users will copy it around and attach the whole document onto emails. When they do, they do not expect the file to contain data that they didn't want to save.

(Yes, they sometimes also expect the file to contain data that they wanted to but didn't explicitly save. This is not a contradiction.)


> they do not expect the file to contain data that they didn't want to save.

I think that's fine. Just remove "save" from the UI, and save after every keystroke. This may sound crazy in 1970, but it's how nearly everything works today. It's really only us weirdos that started using computers before "the cloud" that think "save" is an operation that does something, and we're dwindling in numbers!


That's certainly true, and also the case in many modern file formats.

For example there's the story of the academic studies with falsified data [1] where forensics on the included excel documents showed they'd gone through and replaced data with "randomized" data (if I remember correctly); there are tons of examples of "redacted" data in pdf docs being visible under the blacked-out rectangles.

I'm not disagreeing with you, btw, such actions are certainly problematic, but hopefully kids will grow up knowing they need to run an export to sanitize their data if they don't want to show the whole world their transaction logs...

[1] https://www.npr.org/transcripts/1190568472


While "knowing they need to run an export to sanitize their data" is (unfortunately!) a thing user just Have to Know, a wider issue imo is that we also shouldn't really be encouraging developers to casually assume that users will only interact with their software through their own ordained interfaces.


> Linux, the OS says it is atomic

nitpick: At least on some filesystems if you rename a.txt to overwrite b.txt and the machine crashes, you might end up both a.txt and b.txt hardlinked so they contain the same data.

Of course this is no big deal since b.txt is still updated atomically so it contains the new data (assuming a.txt was fsynced) or the old data. I assume nobody depends on a.txt being deleted simultaneously.


You are right and like you explained this is trivially easily fixed by autosaving regularly.

What I have trouble imagining is people working with documents on computers for more than a few years yet somehow failing to develop the Always Save Instinct. I regularly catch myself saving unreasonably often.


That may have been true years ago in win 95 or xp days. The modern paradigm starting with google docs is that things are automatically saved and even always sharable through the cloud, making manual saving actually an atavistic leftover of a bygone era.


What if I actually don't want any changes saved because I've only opened a document for reference purposes?


Modern apps like gdocs have a toggle for switching between read-only, suggest changes, and editing. If you forgot to toggle, you can just open version history and revert. MS Office also had version history for nearly a decade.


Then you proactively prevent changes. Either "open a copy" or "open in readonly mode".

If you make saving the default you have to manually not save. It's a trade off versus default no saves with manual saves


Have a "Read-only" checkbox. For the love of God, have a "read-only" checkbox.


Users shouldn’t ever need to adapt to computer crashes like this. Software should always auto save or have recovery files or something. As a principle, software should hold anything a user inputs with reverence.


Yes, however you have to be sure every single program across decades is written with those rules in mind. Hard on any OS, but a lost cause on */Linux.


I agree. Maybe as a dev I've become cynical and don't trust anything. Least of all some app holding my document.

Makes me think of the "Voting software" xkcd: https://xkcd.com/2030

"I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die."


You think that's cynical? I must be some avatar curmudgeonliness then.

I'm pretty sure that what's actually going on is that everything we think of as a 'profession' is that way, and that people are in fact dying because of it.

The difference is that devs are honest about it.


I used to have that instinct but lost it in the age of auto save. The applications (web or native) I use most often all do it for me: Google docs, Dropbox paper, notion, vscode. I don’t think I’m alone in this!


I actually tried using open/libre docs a few years ago just because of it being open source. I was trying to make a point of using locally installed software and avoid google products. Then the thing crashed and I lost an hour of work because it didn't save a temporary version. That's when I gave up on it for good.


Libre office does keep a temporary version that allows you to recover, so you're talking crap.


It might very well keep it, but either that behavior wasn't turned on by default or it crashed in way where it wasn't recoverable. I know I lost work.


Doesn't need to be turned on. However it's always possible you mistakenly pressed Esc/Close, didn't read the dialog, or hit a very obscure bug?

However this has worked well for twenty years, so PEBKAC is a reasonable conclusion.


Yes, I am aware of that, and you are right about this in general. In my particular case however, it is preferable to loose some work in the rare cases were a crash occurs than to have a copy of the file in some place that the users are not aware of. Of course if crashes were frequent the trade-off would be different.


Good ole .filename.swp


ODT was designed to be standardised: while the predecessor format was very similar too, it relies very heavily on XHTML, SVG, and CSS, to name but three (there's a lot more).

Without being able to call out to existing standards, the ODT spec itself would suddenly become massive. The effort to update the standards appears to be significant and hasn't progressed much in recent years already :/

I think realistically, an Sqlite format could be offered as an option, but the office doc ship has really sailed.

Good argument to formalise the spec of Sqlite as a standard though...


The specification is massive (840 pages) even though it is written in very terse way that does not really specify the effects and behavior, only the syntax.

On the other hand if one ignores few warts (explosion of local styles and text spans due to ooo:rsid attribute, non-sparse spreedsheets and weird mechanism for styling tables as a few examples) it is really well designed markup for this kind of document data that strikes right balance between it being semantic markup and representing the kinds of stuff users want to do. Compare that with Office OpenXML with stateful formatting empty tags (yes, really, in DOCX <b/> _TOGGLES_ whether following text is bold).


AutoCAD uses a database as its file format, it is fairly slow.


Sqlite format is smaller than the original format only because xml is super verbose, so any uncompressed binary format ends up being less than lightly zipped xml.

But sqlite files aren't small. One thing I don't understand is why they don't do string deduplication in sqlite (as in you only store a string once and every other occurence is just a pointer to that string). It seems such an obvious and easy way to reduce file size, memory consumption and therefore increase performance (less I/O). Is there a technical reason why this would not be desirable?


My first guess is that if you always store the full string you don't need to scan the database to see if you already have the same string. Essentially you choose to use more space but reduce load. Regardless of whether you do the string deduping on inserts or async later on, you have to do it at some point and the unpredictable performance overhead might be undesirable.


Well it should be a dictionary lookup, it should be pretty fast and predictable. And for freeing it up, it should be a good candidate for reference counting.


If you have the same (long-ish) string repeating many times in a database, it points to a DB schema needing normalization.


I guess it depends on the use case. If you load a csv file into a sqlite database, normalisation isn't the first thing you do.


There is nonzero overhead for doing so: optimizing for duplicate strings invariably adds cost to handling unique strings.

This sounds like something you could do at the schema and application level.


BLOBs in sqlite can be up to 2GB or less, depending on the compilation flags. If you store 2GB and the other application uses sqlite compiled with support for less than 2GB BLOB size, good luck on getting them to work... If you want to store content larger than 2GB in sqlite, you have to chunk them, manage the chunk sequences, etc. And you can't overwrite a fixed size 2KB portion at the specified offset, you'll have to rewrite the entire 2GB chunk.