@the_duke 10d
> JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

That depends a lot on the language and the JSON library.

Let's take `{"ab": 22}` as an example.

That's 10 bytes.

In a language like Rust, using the serde library, this could be deserialized directly into a struct with a single integer field, say a u32. That would only be four bytes.

But if it were deserialized to serde's dynamic Value type, this would effectively be a HashMap<String, u32>: the HashMap itself has a constant size of 48 bytes, plus a heap allocation of some unknown size (the first allocation will cover more than one entry), plus 16 bytes of overhead for the String, plus 2 bytes for the actual string contents, plus 4 bytes for the u32. That's already over 90 bytes, a lot more than the JSON.

Dynamic languages like Python also have a lot of overhead for all the objects.
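
The same kind of blow-up is easy to see in Python, for example (a rough sketch; exact figures vary by interpreter version):

  import sys, struct

  raw = '{"ab": 22}'              # 10 bytes of JSON text
  packed = struct.pack("<I", 22)  # the "struct with one u32" case: 4 bytes
  parsed = {"ab": 22}             # the dynamic, Value-like representation

  print(len(raw), len(packed))    # 10 4
  # getsizeof() counts only the dict itself, not the key string or the
  # boxed int it points to, so the real total is even higher.
  print(sys.getsizeof(parsed))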

Keys can of course be interned, but not that many JSON parser libraries do that by default, AFAIK.

@ot 10d
> Or write a custom little program that streams data from the JSON file without buffering it all in memory. JSON parsing libraries are plentiful so this should not take a lot of code in your favorite language.

Several years ago I wrote a paper [1] on representing the parse tree of a JSON document in a tiny fraction of the size of the JSON itself, using succinct data structures. The representation could be built in a single pass over the JSON, with basically constant additional memory.

The idea was to pre-process the JSON once and save the parse tree, so it could be kept in memory across several passes over the JSON data (which may not fit in memory), avoiding re-doing the parsing work on each pass.
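
Very roughly, the idea looks something like this toy Python sketch, which records only the byte offsets of the structural characters in a single pass (the paper uses succinct data structures instead of plain offset lists, which is what makes it tiny):

  def structural_index(raw: bytes):
      # One pass over the raw bytes: remember where every structural
      # character sits, skipping string literals (and escapes inside them).
      offsets = []
      in_string = escaped = False
      for i, b in enumerate(raw):
          c = chr(b)
          if in_string:
              if escaped:
                  escaped = False
              elif c == "\\":
                  escaped = True
              elif c == '"':
                  in_string = False
          elif c == '"':
              in_string = True
          elif c in '{}[]:,':
              offsets.append((i, c))
      return offsets

  # Later passes can jump between these offsets instead of re-parsing.
  print(structural_index(b'{"ab": [1, 2], "c": {"d": 3}}'))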

I don't think I've seen this idea used anywhere, but I still wonder if it could have applications :)

[1] http://groups.di.unipi.it/~ottavian/files/semi_index_cikm.pd...

@bastawhiz 10d
> JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

If you're parsing into structs, yes. Otherwise, no. Each object key is going to be a short string, which carries some amount of overhead. You're probably storing the objects as hash tables, which will necessarily be larger than the two bytes ({ and }) needed to represent them as text (and probably far larger than you expect, since they need enough free space to keep hash collisions sufficiently rare).

JSON numbers are also typically parsed as 64-bit floats, which will almost universally take up more bytes per number than their serialized form in most JSON data.
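
For example, on a 64-bit CPython:

  >>> import sys, struct
  >>> len('3.14')              # the serialized number
  4
  >>> struct.calcsize('d')     # the raw IEEE-754 double
  8
  >>> sys.getsizeof(3.14)      # the boxed Python float object
  24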

@coldtea 10d
> JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

Wanna bet?

  >>> import sys
  >>> import json
  >>> data_as_json_string = '{"a": 5000, "b": 1000}'
  >>> len(data_as_json_string)
  22
  >>> data_as_native_structure = json.loads(data_as_json_string)
  >>> sys.getsizeof(data_as_native_structure)
  232
That's not even the whole story: the 232 bytes aren't the contents of the dict, just the dict object itself with its metadata; getsizeof doesn't follow references to the keys and values. So the total for the structure is quite a bit bigger than 232 bytes.
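
Adding up the pieces getsizeof doesn't follow (the two key strings and the two ints) gets closer to the real footprint:

  d = data_as_native_structure
  total = sys.getsizeof(d) + sum(sys.getsizeof(k) + sys.getsizeof(v)
                                 for k, v in d.items())

On a typical 64-bit CPython that lands close to 400 bytes (each one-character key string is about 50 bytes, each boxed int 28), versus 22 bytes of JSON text.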

A single int wrapped as a Python object can be quite a lot by itself:

  >>> sys.getsizeof(1)
  28

A binary int64 would be 8 bytes for comparison.

@ZephyrBlu 10d
Have you done much data analysis?

In my experience, adding more steps to your pipeline (e.g. a database, deserialization, etc.) is a pain when you're still figuring things out, because nothing has solidified yet; you're just adding overhead that requires even more work to remove or alter later on. If you're not careful you end up with something unmaintainable extremely quickly.

Only analyzing a subset of your data is usually not a magic bullet either. Unless your data is extremely well cleaned and standardized, you're probably going to run into edge cases in the full dataset that were not in your subset.

Being able to run your full pipeline over the entire dataset in a short period of time is very useful for testing and for seeing realistic analysis results. If you're doing any sort of aggregate analysis it becomes even more important, if not required.

I now believe a relatively fast clean run is one of the most important things for performing data analysis. It increases your velocity tremendously.

@saidinesh5 10d
> * "You might find out that the data doesn’t fit into RAM (which it well might, JSON is a human-readable format after all)" -- if I'm reading this right, the author is saying that the parsed data takes _more_ space than the JSON version? JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

Not to mention, even when using bad data structures (e.g. a hashmap of hashmaps...), one can just add a large enough swapfile and brute-force their way through it, no?

@nerdponx 10d
If the JSON data has a regular structure, you probably want a database and/or a "data frame" library, with Parquet as the storage file format. SQLite, DuckDB, Polars: there are plenty of options nowadays that are usable from several different programming languages.
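
For example, a quick Polars round-trip (hypothetical file and column names, and assuming the JSON is newline-delimited records):

  import polars as pl

  # One-time conversion: parse the JSON once, store it columnar.
  df = pl.read_ndjson("events.ndjson")
  df.write_parquet("events.parquet")

  # Later passes scan the Parquet file lazily instead of re-parsing JSON.
  errors = (
      pl.scan_parquet("events.parquet")
      .filter(pl.col("status") == "error")
      .collect()
  )
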
@zigzag312 10d
> if I'm reading this right, the author is saying that the parsed data takes _more_ space than the JSON version? JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

UTF-8 JSON strings will get converted to UTF-16 strings in some languages (e.g. Java or C#), doubling the size of the strings in memory compared to their size on disk.
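
The doubling is easy to see for ASCII-heavy JSON (Python here just to measure the encodings; the widening happens in runtimes whose strings are UTF-16 internally, such as Java or C#):

  >>> s = '{"name": "mostly ASCII text"}'
  >>> len(s.encode('utf-8'))
  29
  >>> len(s.encode('utf-16-le'))
  58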

@walrus01 10d
> A few GBs of data isn't really that much.

The entire FCC radio license database (the ULS) is about 14 GB in text CSV format and can be imported into a SQLite or SQL database and easily queried in RAM on a local workstation...
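
A minimal sketch of that kind of import in Python (hypothetical file name and a made-up two-column schema; the real ULS data has far more fields):

  import csv, sqlite3

  con = sqlite3.connect(":memory:")  # or a file on disk
  con.execute("CREATE TABLE licenses (callsign TEXT, status TEXT)")
  with open("licenses.csv", newline="") as f:
      con.executemany("INSERT INTO licenses VALUES (?, ?)", csv.reader(f))
  con.commit()
  print(con.execute("SELECT COUNT(*) FROM licenses").fetchone()[0])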

@taeric 10d
It still surprises me how many people have the intuition that loading the data will take more space than the file.

Even more annoying when it actually does. (Compressed or binary formats notwithstanding.)