@samwillis
10d
DuckDB is awesome. As a comparison, I have a dataset that starts life as a 35 GB set of JSON files. Imported into Postgres it's ~6 GB, and a key query I run takes 3 min 33 seconds.
Imported into DuckDB (still about ~6 GB for all columns), the same SQL query takes 1.1 seconds!
The key thing is that the columns the query scans (across all rows) total only about 100 MB, so DuckDB has far less to read. On top of that, its vectorised query execution is incredibly quick.
https://mobile.twitter.com/samwillis/status/1633213350002798...
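Roughly, the pattern looks like this in DuckDB's Python API (a minimal sketch; the table name, column names and file glob are made-up placeholders, not my actual dataset):

  import duckdb

  con = duckdb.connect("events.duckdb")  # placeholder database file

  # Import the JSON once; DuckDB stores it in its own columnar format,
  # so later queries read only the columns they touch.
  con.execute("""
      CREATE TABLE IF NOT EXISTS events AS
      SELECT * FROM read_json_auto('data/*.json')
  """)

  # An aggregate over one or two columns scans roughly just those
  # columns, not the whole ~6 GB table.
  print(con.execute("""
      SELECT user_id, count(*) AS n
      FROM events
      GROUP BY user_id
      ORDER BY n DESC
      LIMIT 10
  """).fetchall())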
@nojito
9d
Calling Dask Python spaghetti is quite hilarious.
That spaghetti can auto-scale to hundreds of machines without skipping a beat, which makes it far more useful than the other tools you mentioned, which are only useful for one-off tasks.
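To make the scale-out point concrete, here's a minimal sketch with dask.distributed (the scheduler address, bucket path and column name are placeholders):

  import dask.dataframe as dd
  from dask.distributed import Client

  # The same code runs unchanged on a laptop (Client() with no address)
  # or against a scheduler fronting hundreds of worker machines.
  client = Client("tcp://scheduler:8786")  # placeholder address

  # Each partition of the dataframe can be processed by a different worker.
  df = dd.read_csv("s3://bucket/events-*.csv")  # placeholder path
  print(df.groupby("user_id").size().nlargest(10).compute())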
@pletnes
10d
DuckDB is fantastic. It doesn't need a schema, either.
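For example (a minimal sketch; the file name is a placeholder), you can query a file in place with no CREATE TABLE or schema declaration, since DuckDB infers column names and types from the file itself:

  import duckdb

  # No DDL needed: DuckDB infers the schema from the file.
  print(duckdb.sql("SELECT count(*) FROM 'events.parquet'").fetchall())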