I have been doing a comparative study of different ways to analyse a large JSON file in memory. Our basic requirement is interactive analysis on nested data; for test purposes I am refraining from using a distributed/big-data setup. For this use case, what’s interesting is the variance in test results between a simple row-wise approach and a columnar one.
Suppose we get a big chunk of JSON data – say from a source like the Twitter API – and we want to do some analysis on it by firing multiple random queries at this large JSON. To speed up the queries we’ll load the JSON into memory and answer them from there instead of reaching out to the disk every time (assuming the data fits in memory). For this test we won’t bother throwing in distributed computing or adding indices.
The test data I took was 100,000 records from the Twitter firehose, making up a 264 MB JSON array on disk. The test query pulls up the tweets that come from a particular time zone and a particular user name – something like:
select text where user.screen_name = <something> and user.timezone = <something>.
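In Python terms, the row-by-row version of this query is just a linear scan over the parsed array. A minimal sketch, assuming the fields are named `text`, `user.screen_name` and `user.timezone` as in the query above (the real Twitter payload has many more fields):

```python
import json

def load_tweets(path):
    # Parse the whole JSON array into memory once; subsequent queries
    # run against the in-memory list instead of re-reading the disk.
    with open(path) as f:
        return json.load(f)

def query_row_by_row(tweets, screen_name, timezone):
    # Linear scan: inspect every record, keep the matching tweet texts.
    return [t["text"] for t in tweets
            if t["user"]["screen_name"] == screen_name
            and t["user"]["timezone"] == timezone]
```

Every query pays the cost of touching every record in full, even though it only cares about three fields – which is exactly what the columnar layout below avoids.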
1. Row by row: Parse the Twitter stream data in a traditional row-by-row fashion, filter by both the user name and time zone parameters, and select the tweet text.
2. NoSQL DB: Load the test data into a NoSQL DB like MongoDB and run the query. This is not entirely in-memory analysis, but with MongoDB operating at near-cache speeds when the working set fits in RAM, I would like to get some stats here. I am keeping this test out of scope for now.
3. Columnar: This is where it gets interesting. First, we’ll convert the JSON to a columnar format, which is better suited to analytics (compression and indexing are not done for this test). There are known techniques for converting nested JSON to a columnar format like Parquet. In fact, the more I looked at it, the more I found that Parquet is intended to solve more problems than I needed and could potentially perform a lot better – I’ll probably revisit Parquet in depth later.
In this case, I decided to code my own algorithm to flatten the JSON and query it in an SQL fashion. My approach is a little different from Parquet’s, in that I’m not going by repetition level and definition level. Instead, I traverse the JSON, assigning row ids to each column’s values based on the nesting level and iteration.
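My exact implementation isn’t shown here, but the idea of flattening nested JSON into per-column value lists tagged with row ids can be sketched roughly like this. It is a simplified toy, not the real algorithm – among other things it ignores arrays of objects, which is where Parquet’s repetition levels earn their keep:

```python
from collections import defaultdict

def flatten(records):
    # columns maps a dotted path like "user.screen_name" to a list of
    # (row_id, value) pairs -- one entry per record that has the field.
    columns = defaultdict(list)

    def walk(node, path, row_id):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key, row_id)
        else:
            columns[path].append((row_id, node))

    for row_id, record in enumerate(records):
        walk(record, "", row_id)
    return columns

def query(columns, select_col, where):
    # Intersect the row ids matching each predicate, touching only the
    # queried columns -- the point of the columnar layout.
    matching = None
    for col, wanted in where.items():
        rows = {rid for rid, value in columns[col] if value == wanted}
        matching = rows if matching is None else matching & rows
    return [value for rid, value in columns[select_col] if rid in matching]
```

With this layout, the test query only scans the three columns it mentions instead of deserialising every field of every record:

```python
cols = flatten(tweets)
texts = query(cols, "text",
              {"user.screen_name": "alice", "user.timezone": "UTC"})
```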
With this basic setup, I ran my test query and obtained the stats below:

Columnar
MilliSecs for file load: 30432
MilliSecs for query exec: 33
Memory used: 1086 MB

Row by row
MilliSecs for file load: 34494
MilliSecs for query exec: 192
Memory used: 1349 MB
We can see right away that with this basic columnar implementation, query execution is more than 5x faster (33 ms vs 192 ms). I am unable to draw a firm conclusion on the memory footprint due to other variables, but it looks like the columnar format is lighter on memory by a small margin – not bad considering I have not done any compression yet.
The notion that columnar querying is faster and lighter than row-wise querying is not a new finding. What this test demonstrates is that it holds true when applied to JSON, a nested data format.
Industry trends for Big Data Interactive Analysis
If we were to solve the above use case at petabyte scale, the preferred option would be to use Spark with the Parquet format. Spark’s RDDs take care of keeping the data in memory for repeated queries on the same data set, and Parquet takes care of the columnar format. Other options are the Dremel clones – Impala, Drill or Druid. Impala is very much tied to Hadoop and doesn’t keep data in memory the way Spark does, while Drill looks conceptually promising but keeps us waiting for GA. Meanwhile, Hive on Tez is picking up steam and appears to be a more stable option – but again, not an in-memory one.
These performance results are consistent with the trend in interactive big data analytics towards in-memory processing combined with columnar formats. We are keeping a close watch on this space.