Unable to process a large JSON file with deeply nested arrays containing tonnes of (sensor) data.
I have a large JSON file (around 16 GB) with the following structure:
{
  "ControlJob": {
    "Keys": {
      "CtrJobID": "KE-CJ0000000409",
      "EquipmentID": "TEST01"
    },
    "Attributes": {
      "FileType": "SensorData",
      "JsonSchemaVersion": "Draft",
      "SensorDescriptions": [
        {
          "SensorID": "1723007",
          "SVID": "1723007",
          "Group": "Temp",
          "SensorType": "Actual_01",
          "Name": "U",
          "DataType": "Double",
          "Units": "degC"
        },
        {
          "SensorID": "2424009",
          "SVID": "2424009",
          "Group": "Press",
          "SensorType": "Actual_VG03_1Hz",
          "Name": "VG03",
          "DataType": "Double",
          "Units": "Pa"
        }
      ]
    },
    "RecipeSteps": [
      {
        "Keys": {
          "RecipeStepID": "START",
          "StepResult": "NormalEnd"
        },
        "SensorData": [
          {
            "Keys": { "SensorID": "1723007" },
            "Measurements": [
              { "DateTime": "2021-11-16 21:18:37.000", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.100", "Value": 539.0 },
              { "DateTime": "2021-11-16 21:18:37.200", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.300", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.400", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.500", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.600", "Value": 540.0 },
              { "DateTime": "2021-11-16 21:18:37.700", "Value": 538.0 },
              { "DateTime": "2021-11-16 21:18:37.800", "Value": 540.0 }
            ]
          },
          {
            "Keys": { "SensorID": "2424009" },
            "Measurements": [
              { "DateTime": "2021-11-16 21:18:37.000", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.100", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.200", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.300", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.400", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.500", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.600", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.700", "Value": 1333.22 },
              { "DateTime": "2021-11-16 21:18:37.800", "Value": 1333.22 }
            ]
          }
        ]
      }
    ]
  }
}
The file also has a lot of other fields which are not relevant to me and which I do not wish to process. There is no way I can load the whole document into memory and process it. So the strategy I am trying to follow is this:
Step 1: The properties “Keys” and “Attributes” are common to all the elements in the “RecipeSteps” array, so I will parse them and write them to a new file.
Step 2: I will now try to split the file based on each element in the “RecipeSteps” array (see the sketch after these steps). The new split files will include the “Keys” and “Attributes” properties, since they are common metadata for all of them. At the end of this step, each split file will have the common data (Keys, Attributes) and a “RecipeSteps” array with one element.
Step 3: I will further split these files based on each element in the “SensorData” array. At the end of this step, each file will have the common data (Keys, Attributes) and a “RecipeSteps” array with one element, and that element will have a “SensorData” array with one element.
Step 4: I will then iterate through all these files, compare them with the reference data I have, and post the results.
Note that I have posted only the properties relevant to my requirements. There are many other elements in the file that I want to ignore completely to make the processing faster.
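To make Step 2 concrete, this is roughly what I have in mind, sketched here with plain Newtonsoft.Json just to illustrate the idea (the file names are placeholders, and copying the common “Keys”/“Attributes” metadata into each output file is left out for brevity):

```csharp
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

class RecipeStepSplitter
{
    static void Main()
    {
        int part = 0;
        using var sr = new StreamReader("sensordata.json");   // placeholder input path
        using var reader = new JsonTextReader(sr);

        while (reader.Read())
        {
            // Walk forward token by token until the "RecipeSteps" property is reached.
            if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "RecipeSteps")
            {
                reader.Read();                                 // step onto the StartArray token
                while (reader.Read() && reader.TokenType == JsonToken.StartObject)
                {
                    // Materialize one array element at a time; only this element lives in memory.
                    var step = JObject.Load(reader);
                    File.WriteAllText($"recipestep_{part++}.json", step.ToString());
                }
            }
        }
    }
}
```

The idea is that only a single “RecipeSteps” element is ever materialized at once, so memory usage stays flat no matter how large the file is.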
I saw this article and tried it: https://www.codeproject.com/Tips/5315195/Cinchoo-ETL-Parsing-Huge-JSON-File-as-Stream
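Roughly the shape of what I ended up with after following that article (reconstructed from memory, so the exact options may differ slightly):

```csharp
using System;
using ChoETL;

class Program
{
    static void Main()
    {
        // Stream records matched by a JSONPath instead of deserializing the root object.
        foreach (dynamic step in new ChoJSONReader("sensordata.json")   // placeholder path
            .WithJSONPath("$..RecipeSteps[*]"))
        {
            Console.WriteLine(step);   // each matched "RecipeSteps" element as a dynamic record
        }
    }
}
```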
But this essentially ends up loading the whole object into memory. I also skimmed through the documentation, with no luck.
What would you suggest I do to process this file in the most efficient manner? I can share the actual 16 GB file if you want a clearer picture of my problem.
Issue Analytics
- Created 2 years ago
- Comments: 10 (6 by maintainers)
Top GitHub Comments
Afraid not, as it is serial / forward-only processing.
Well, finally had a chance to download and look into the file. It was quite challenging to load with the current version (blowing up with an OOM exception!).
After a number of attempts, came up with a solution to cater to your needs (also quite useful for others if they need to do…).
In the new approach, you can hook a callback up to the loader and direct the output to a file (instead of loading it into memory).
Download and use the latest package, v1.2.1.40.
Sample fiddle: https://dotnetfiddle.net/I7z4tp