Yeah, given that everything is now multi-core, it makes sense to use a natively parallel tool for anything compute-bound. And Spark will happily run locally and (unlike previous big data paradigms) doesn’t require excessive mental contortions.
Of course, while you’re at it, you should probably just convert all your JSON into Parquet to speed up subsequent queries…
How much memory would a Spark worker need to process a single JSON file that is 25 GB?
To clarify, this is not a JSONL or NDJSON file; it’s just a single JSON object.
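For context on why that distinction matters: a single JSON document can’t be split across tasks, so one task has to parse the whole tree, and the in-memory representation is typically several times larger than the raw text. A quick stdlib sketch of that blow-up (the numbers are illustrative, not a 25 GB measurement):

```python
import json
import tracemalloc

# Build a modest single JSON document (one object, not JSONL).
payload = json.dumps({"records": [{"id": i, "value": i * 1.5} for i in range(100_000)]})

# Measure peak allocation while parsing it into Python objects.
tracemalloc.start()
obj = json.loads(payload)
peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# The parsed tree of dicts/lists/floats dwarfs the raw text.
ratio = peak / len(payload)
print(f"raw text: {len(payload):,} bytes, peak while parsing: {peak:,} bytes ({ratio:.1f}x)")
```

The same shape of problem hits Spark’s JSON reader: with a single 25 GB object, one executor has to hold the entire parsed document in heap at once, so the budget is some multiple of the file size rather than the file size itself.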