That depends a lot on the language and the JSON library.
Let's take `{"ab": 22}` as an example.
That's 10 bytes.
In a language like Rust, using the serde library, this could be deserialized directly into a struct with one integer field, say a u32, so it would only take four bytes.
But if it were deserialized to serde's dynamic Value type, this would be a HashMap<String, u32>, which has a constant size of 48 bytes, plus a heap allocation of some unknown size (the first allocation will cover more than one entry), plus 16 bytes of overhead for the string, plus 2 bytes for the actual string contents, plus the 4 bytes for the u32. So that's already over ~90 bytes, a lot more than the JSON.
Dynamic languages like Python also have a lot of overhead for all the objects.
Keys can of course be interned, but as far as I know not many default JSON parser libraries do that.
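As a rough illustration in CPython (exact sizes vary by version and platform, and sys.getsizeof only counts the container itself, not the objects it references):

    import json
    import sys

    obj = json.loads('{"ab": 22}')      # 10 bytes of JSON text

    print(sys.getsizeof(obj))           # the dict object alone: typically a couple hundred bytes
    print(sys.getsizeof("ab"))          # each key is a full str object, ~50 bytes for 2 chars
    print(sys.getsizeof(22))            # each small int is a 28-byte object on 64-bit builds

    # Interning makes repeated keys share a single str object:
    assert sys.intern("ab") is sys.intern("ab")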
Several years ago I wrote a paper [1] on representing the parse tree of a JSON document in a tiny fraction of the JSON size itself, using succinct data structures. The representation could be built with a single pass of the JSON, and basically constant additional memory.
The idea was to pre-process the JSON and then save the parse tree, so it could be kept in memory over several passes of the JSON data (which may not fit in memory), avoiding re-doing the parsing work on each pass.
I don't think I've seen this idea used anywhere, but I still wonder if it could have applications :)
[1] http://groups.di.unipi.it/~ottavian/files/semi_index_cikm.pd...
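For anyone curious about the single-pass part, here is a toy sketch of the idea in Python. The actual paper stores this information as succinct bitvectors and balanced parentheses rather than plain lists, which is what makes the index tiny; the function name and representation here are just for illustration:

    def build_structural_index(raw: str):
        """One pass over the raw JSON, recording where the structure is,
        so later passes can navigate the document without re-parsing it."""
        positions, kinds = [], []
        in_string = False
        escaped = False
        for i, c in enumerate(raw):
            if in_string:
                if escaped:
                    escaped = False
                elif c == "\\":
                    escaped = True
                elif c == '"':
                    in_string = False
            elif c == '"':
                in_string = True
            elif c in '{}[]:,':
                positions.append(i)
                kinds.append(c)
        return positions, kinds

    print(build_structural_index('{"ab": 22}'))  # ([0, 5, 9], ['{', ':', '}'])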
If you're parsing to structs, yes. Otherwise, no. Each object key is going to be a short string, which carries some amount of overhead. You're probably storing the objects as hash tables, which will necessarily be larger than the two bytes needed to represent them as text (and probably far larger than you expect, since they keep enough free space to avoid excessive hash collisions).
JSON numbers are also typically parsed as 64-bit floats, which for most JSON data take up more bytes per number than their serialized form.
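For example, in CPython (64-bit build) a JSON decimal becomes a 24-byte float object even when its textual form is shorter:

>>> import sys, json
>>> x = json.loads("3.14")
>>> sys.getsizeof(x)
24
>>> len("3.14")
4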
Wanna bet?
>>> import sys
>>> import json
>>> data_as_json_string = '{"a": 5000, "b": 1000}'
>>> len(data_as_json_string)
22
>>> data_as_native_structure = json.loads(data_as_json_string)
>>> sys.getsizeof(data_as_native_structure)
232
That's not even the whole story, as the 232 bytes doesn't include the contents of the dict, just the Python dict object itself with its metadata. So the total for the structure inside is much bigger than 232 bytes. A single int wrapped as a Python object is quite large by itself:
>>> sys.getsizeof(1)
28
A binary int64 would be 8 bytes for comparison.
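Continuing in the same REPL (64-bit CPython assumed), the struct module shows the two side by side:

>>> import struct
>>> len(struct.pack("<q", 5000))
8
>>> sys.getsizeof(5000)
28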
In my experience, adding more steps to your pipeline (e.g. a database, deserializing, etc.) is a pain when you are figuring things out, because nothing has solidified yet; you're just adding overhead that requires even more work to remove or alter later on. If you're not careful you end up with something unmaintainable extremely quickly.
Only analyzing a subset of your data is usually not a magic bullet either. Unless your data is extremely well cleaned and standardized, you're probably going to run into edge cases in the full dataset that were not in your subset.
Being able to run your full pipeline on the entire dataset in a short period of time is very useful for testing and for seeing realistic analysis results. If you're doing any sort of aggregate analysis it becomes even more important, if not required.
I now believe a relatively fast clean run is one of the most important things for performing data analysis. It increases your velocity tremendously.
Not to mention, even when using bad data structures (e.g. a hashmap of hashmaps), one can just add a large enough swapfile and brute-force their way through it, no?
UTF-8 JSON strings will get converted to UTF-16 strings in some languages, doubling the size of strings in memory compared to their size on disk.
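Python itself doesn't store strings as UTF-16, but its codecs make the doubling for ASCII-heavy JSON easy to see:

>>> s = '{"ab": 22}'
>>> len(s.encode("utf-8"))
10
>>> len(s.encode("utf-16-le"))
20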
The entire FCC radio license database (the ULS) is about 14 GB in text CSV format and can be imported into a SQLite or SQL database and easily queried in RAM on a local workstation.
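As a sketch of that workflow with Python's built-in sqlite3 module (the file name, table, and columns below are made-up placeholders, not the real ULS schema):

    import csv
    import sqlite3

    con = sqlite3.connect(":memory:")   # or a file path to persist the database
    con.execute("CREATE TABLE licenses (callsign TEXT, status TEXT, grant_date TEXT)")

    # 'uls_licenses.csv' and its column names are hypothetical.
    with open("uls_licenses.csv", newline="") as f:
        reader = csv.DictReader(f)
        con.executemany(
            "INSERT INTO licenses VALUES (?, ?, ?)",
            ((r["callsign"], r["status"], r["grant_date"]) for r in reader),
        )
    con.commit()

    # Example aggregate query over the whole dataset:
    for status, n in con.execute("SELECT status, COUNT(*) FROM licenses GROUP BY status"):
        print(status, n)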