
Unable to process a large JSON file with nested/deep arrays containing tonnes of (sensor) data.

See original GitHub issue

I have a large JSON file (around 16 GB) with the following structure:

{
"ControlJob": 
	{
		"Keys": 
		{ 
			"CtrJobID": "KE-CJ0000000409",
			"EquipmentID": "TEST01" 
		},
		"Attributes": 
		{
			"FileType": "SensorData",
			"JsonSchemaVersion": "Draft",
			"SensorDescriptions": 
			[
				{
					"SensorID": "1723007",
					"SVID": "1723007",
					"Group": "Temp",
					"SensorType": "Actual_01",
					"Name": "U",
					"DataType": "Double",
					"Units": "degC"
				},
				{
					"SensorID": "2424009",
					"SVID": "2424009",
					"Group": "Press",
					"SensorType": "Actual_VG03_1Hz",
					"Name": "VG03",
					"DataType": "Double",
					"Units": "Pa"
				}
			]
		},
		"RecipeSteps": 
		[
		{"Keys": {
				"RecipeStepID": "START",
				"StepResult": "NormalEnd"
			},
			"SensorData":
			[
				{
					"Keys": {"SensorID": "1723007"},
					"Measurements": 
					[
						{
							"DateTime": "2021-11-16 21:18:37.000",
							"Value": 540.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.100",
							"Value": 539.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.200",
							"Value": 540.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.300",
							"Value": 540.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.400",
							"Value": 540.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.500",
							"Value": 540.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.600",
							"Value": 540.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.700",
							"Value": 538.0
						},
						{
							"DateTime": "2021-11-16 21:18:37.800",
							"Value": 540.0
						}						
					]
				},
				{
					"Keys": {"SensorID": "2424009"},
					"Measurements": 
					[
						{
							"DateTime": "2021-11-16 21:18:37.000",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.100",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.200",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.300",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.400",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.500",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.600",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.700",
							"Value": 1333.22
						},
						{
							"DateTime": "2021-11-16 21:18:37.800",
							"Value": 1333.22
						}						
					]
				}
			]
		}
		]
	}
}

The file also has a lot of other fields which are not relevant to me and which I do not wish to process. There is no way I can load the whole document into memory and process it. So, the strategy I am trying to follow is this:

Step 1: The properties “Keys” and “Attributes” are common to all the elements in the “RecipeSteps” array. So, I will parse them and write them to a new file.

Step 2: I will now try to split the file based on each element in the “RecipeSteps” array. The new split files will also carry the “Keys” and “Attributes” properties, since they are metadata common to all of them. At the end of this step, each split file will have the common data (Keys, Attributes) and a “RecipeSteps” array with one element.

Step 3: I will further split these files based on each element in the “SensorData” array. At the end of this step, each file will have the common data (Keys, Attributes) and a “RecipeSteps” array with one element, and that one element will have a “SensorData” array with one element.

Step 4: I will then iterate through all these files, compare them against the reference data I have, and post the results.
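
For reference, the kind of splitting I have in mind for steps 1–3 would look roughly like the sketch below. It uses Newtonsoft.Json's JsonTextReader rather than ChoETL, the file names (sensor_data.json, split_*.json) are placeholders, and it assumes the ControlJob-level "Keys" and "Attributes" appear before "RecipeSteps", as in the sample above:

using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

class Splitter
{
    static void Main()
    {
        JToken controlJobKeys = null, attributes = null;
        int stepIndex = 0;

        // Stream the big file token by token; nothing larger than one
        // recipe step is ever materialized in memory.
        using var reader = new JsonTextReader(new StreamReader("sensor_data.json"));
        while (reader.Read())
        {
            if (reader.TokenType != JsonToken.PropertyName)
                continue;

            switch ((string)reader.Value)
            {
                case "ControlJob":
                    break; // descend into it; its children are handled below

                case "Keys" when controlJobKeys == null:      // ControlJob-level Keys (step 1)
                    reader.Read();
                    controlJobKeys = JToken.ReadFrom(reader);
                    break;

                case "Attributes" when attributes == null:    // ControlJob-level Attributes (step 1)
                    reader.Read();
                    attributes = JToken.ReadFrom(reader);
                    break;

                case "RecipeSteps":                           // steps 2 + 3
                    reader.Read();                            // move onto StartArray
                    while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                    {
                        // Materialize ONE recipe step at a time.
                        var step = (JObject)JToken.ReadFrom(reader);
                        var sensorData = step["SensorData"] as JArray ?? new JArray();

                        foreach (var sensor in sensorData)
                        {
                            // One split file per SensorData element, carrying the common metadata.
                            var doc = new JObject
                            {
                                ["ControlJob"] = new JObject
                                {
                                    ["Keys"] = controlJobKeys?.DeepClone(),
                                    ["Attributes"] = attributes?.DeepClone(),
                                    ["RecipeSteps"] = new JArray(
                                        new JObject
                                        {
                                            ["Keys"] = step["Keys"]?.DeepClone(),
                                            ["SensorData"] = new JArray(sensor.DeepClone())
                                        })
                                }
                            };

                            var sensorId = sensor["Keys"]?["SensorID"];
                            File.WriteAllText($"split_{stepIndex}_{sensorId}.json",
                                              doc.ToString(Formatting.Indented));
                        }
                        stepIndex++;
                    }
                    break;

                default:
                    reader.Skip(); // skip all the fields I do not care about, without buffering them
                    break;
            }
        }
    }
}

Each split file then matches the step 3 shape (the common Keys/Attributes plus a single RecipeSteps element holding a single SensorData element), ready for the step 4 comparison.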

Note that I have posted only the properties relevant to my requirements. There are many other elements in the file which I want to ignore completely to make the processing faster.

I saw this article and tried it: https://www.codeproject.com/Tips/5315195/Cinchoo-ETL-Parsing-Huge-JSON-File-as-Stream

But this essentially tries to load the whole object into memory. I also skimmed through the documentation, with no luck.

What would you suggest I do to process this file in the most efficient manner? I can share the actual 16 GB file if you want a clearer picture of my problem.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
Cinchoo commented, Jan 13, 2022

Afraid not, as it is serial / forward-only processing.

1 reaction
Cinchoo commented, Jan 9, 2022

Well, I finally had a chance to download and look into the file. It was quite challenging to load with the current version (blowing up with an OOM exception!).

After a number of attempts, I came up with a solution to cater to your needs (also quite useful for others if they need to do…).

In the new approach, you can hook a callback into the loader and direct the output to a file (instead of loading it into memory).

Download and use the latest package, v1.2.1.40.

Sample fiddle: https://dotnetfiddle.net/I7z4tp
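
For a rough sense of the callback pattern described above (this is not the actual ChoETL API; the fiddle has that), a loader that pushes each SensorData block to a callback, which then writes it straight to disk, could be sketched with Newtonsoft.Json like this. The names LoadSensorBlocks, sensor_data.json and sensor_blocks.ndjson are made up for illustration:

using System;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static class CallbackLoaderDemo
{
    // Streams the file and invokes onSensorBlock once per SensorData element,
    // so only one small block is ever held in memory at a time.
    static void LoadSensorBlocks(string path, Action<JToken> onSensorBlock)
    {
        using var reader = new JsonTextReader(new StreamReader(path));
        while (reader.Read())
        {
            if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "SensorData")
            {
                reader.Read();                                // StartArray
                while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                    onSensorBlock(JToken.ReadFrom(reader));   // hand one element to the callback
            }
        }
    }

    static void Main()
    {
        using var output = new StreamWriter("sensor_blocks.ndjson");
        // The callback redirects each block straight to disk instead of keeping it in memory.
        LoadSensorBlocks("sensor_data.json",
                         block => output.WriteLine(block.ToString(Formatting.None)));
    }
}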

Read more comments on GitHub >

Top Results From Across the Web

  • Working with large JSON files in Snowflake — Part III: This solution presents a workaround to a specific JSON handling issue in Snowflake — the processing of JSON files that exceed the 16MB…
  • How can I load large JSON files from Snowflake Stage?: The JSON files seem to have nested arrays and the innermost one contains the chunky data. Is there a way to load an…
  • Processing large JSON files in Python without running out of…: Loading complete JSON files into Python can use too much memory, leading to slowness or crashes. The solution: process JSON data one chunk…
  • How to manage a large JSON file efficiently and quickly - Sease: In this blog post, I want to give you some tips and tricks to find efficient ways to read and parse a big…
  • How to Read Large JSON file in R?: Following R code is reading small JSON file but when I am applying huge JSON data (3 GB, 5,51,367 records, and 341 features)…
