Unable to process a large JSON file with deeply nested arrays containing tonnes of (sensor) data.
I have a large JSON file (around 16 GB) with the following structure:
{
  "ControlJob": {
    "Keys": {
      "CtrJobID": "KE-CJ0000000409",
      "EquipmentID": "TEST01"
    },
    "Attributes": {
      "FileType": "SensorData",
      "JsonSchemaVersion": "Draft",
      "SensorDescriptions": [
        {
          "SensorID": "1723007",
          "SVID": "1723007",
          "Group": "Temp",
          "SensorType": "Actual_01",
          "Name": "U",
          "DataType": "Double",
          "Units": "degC"
        },
        {
          "SensorID": "2424009",
          "SVID": "2424009",
          "Group": "Press",
          "SensorType": "Actual_VG03_1Hz",
          "Name": "VG03",
          "DataType": "Double",
          "Units": "Pa"
        }
      ]
    },
    "RecipeSteps": [
      {
        "Keys": {
          "RecipeStepID": "START",
          "StepResult": "NormalEnd"
        },
        "SensorData": [
          {
            "Keys": { "SensorID": "1723007" },
            "Measurements": [
              { "DateTime": "2021-11-16 21:18:37.000", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.100", "Value": 539.0 },
              { "DateTime": "2021-11-16 21:18:37.200", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.300", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.400", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.500", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.600", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.700", "Value": 538.0 },
              { "DateTime": "2021-11-16 21:18:37.800", "Value": 540.0 }
            ]
          },
          {
            "Keys": { "SensorID": "2424009" },
            "Measurements": [
              { "DateTime": "2021-11-16 21:18:37.000", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.100", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.200", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.300", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.400", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.500", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.600", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.700", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.800", "Value": 1333.22 }
            ]
          }
        ]
      }
    ]
  }
}
The file also has a lot of other fields which are not relevant to me and which I do not wish to process. There is no way I can load the whole document into memory and process it. So the strategy I am trying to follow is this:
Step 1: The properties “Keys” and “Attributes” are common to all the elements in the “RecipeSteps” array, so I will parse them and write them to a new file.
Step 2: I will now try to split the file based on each element in the “RecipeSteps” array (see the sketch after these steps). The new split files will include the “Keys” and “Attributes” properties, since they are common metadata for all of them. At the end of this step, each split file will have the common data (Keys, Attributes) and a “RecipeSteps” array with one element.
Step 3: I will further split these files based on each element in the “SensorData” array. At the end of this step, each file will have the common data (Keys, Attributes) and a “RecipeSteps” array with one element, and that element will have a “SensorData” array with one element.
Step 4: I will then iterate through all these files, compare them with the reference data I have, and post the results.
Note that I have posted only the properties relevant to my requirements. There are many other elements in the file that I want to ignore completely to make the processing faster.
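To make Step 2 concrete, this is roughly what I have in mind, sketched here with plain Newtonsoft.Json just to illustrate the idea (the file names are placeholders, and copying the common “Keys”/“Attributes” metadata into each output file is left out for brevity):

```csharp
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

class RecipeStepSplitter
{
    static void Main()
    {
        int part = 0;
        using var sr = new StreamReader("sensordata.json");   // placeholder input path
        using var reader = new JsonTextReader(sr);

        while (reader.Read())
        {
            // Walk forward token by token until the "RecipeSteps" property is reached.
            if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "RecipeSteps")
            {
                reader.Read();                                 // step onto the StartArray token
                while (reader.Read() && reader.TokenType == JsonToken.StartObject)
                {
                    // Materialize one array element at a time; only this element lives in memory.
                    var step = JObject.Load(reader);
                    File.WriteAllText($"recipestep_{part++}.json", step.ToString());
                }
            }
        }
    }
}
```

The idea is that only a single “RecipeSteps” element is ever materialized at once, so memory usage stays flat no matter how large the file is.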
I saw this article and tried it: https://www.codeproject.com/Tips/5315195/Cinchoo-ETL-Parsing-Huge-JSON-File-as-Stream
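Roughly the shape of what I ended up with after following that article (reconstructed from memory, so the exact options may differ slightly):

```csharp
using System;
using ChoETL;

class Program
{
    static void Main()
    {
        // Stream records matched by a JSONPath instead of deserializing the root object.
        foreach (dynamic step in new ChoJSONReader("sensordata.json")   // placeholder path
            .WithJSONPath("$..RecipeSteps[*]"))
        {
            Console.WriteLine(step);   // each matched "RecipeSteps" element as a dynamic record
        }
    }
}
```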
But this essentially ends up loading the whole object into memory. I also skimmed through the documentation, with no luck.
What would you suggest I do to process this file in the most efficient manner? I can share the actual 16 GB file if you want a clearer picture of my problem.
Issue Analytics
- Created 2 years ago
- Comments: 10 (6 by maintainers)
Top GitHub Comments
Afraid not, as it is serial / forward-only processing.
Well, finally had a chance to download and look into the file. It was quite challenging to load with the current version (blowing up with an OOM exception!).
After a number of attempts, came up with a solution to cater to your needs (also quite useful for others if they need to do…).
In the new approach, you can hook a callback up to the loader and direct the output to a file (instead of loading it into memory).
Download and use the latest package, v1.2.1.40.
Sample fiddle: https://dotnetfiddle.net/I7z4tp