Reading a large file containing JSON objects on each line lazily
I have a large (multiple GBs) file which contains a single JSON object on each line in UTF-8 encoding (from sampling my data, lines seem to be about 400-700 UTF-8 characters each).
I’d like to read the file lazily (e.g. via an iterator) instead of in one shot. I have what I believe is a solution for that (see below).
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import java.util.zip.GZIPInputStream
import org.apache.hadoop.fs.Path
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

// Simplified JsonEvent
case class JsonEvent(
  id: String,
  eventType: String,
  time: Long,
  optionalField: Option[String],
  varLengthFieldA: Option[List[String]],
  varLengthFieldB: Map[String, String])

implicit val jsonEventCodec: JsonValueCodec[JsonEvent] = JsonCodecMaker.make(
  CodecMakerConfig.withFieldNameMapper(JsonCodecMaker.enforce_snake_case))

val gzipInputStreamBufferSize = 8388608
val inputStreamBufferSize = 8388608
val blockingQueueSize = 100000
val waitMs = 100L // poll timeout for the consumer loop (value not shown in the original snippet)
var eventConsumedCount = 0L
val jsoniterCharBufSize = 4096
val jsoniterBufSize = 8388608 // Half this size seems to give about the same performance

// I'm reading a gzipped file from HDFS, but opening a local file should be the same
val inputStream = hdfsFS.open(new Path(file), inputStreamBufferSize)
val gzipInputStream = new GZIPInputStream(inputStream, gzipInputStreamBufferSize)

// Fixed size so we don't OOM
val blockingQueue = new ArrayBlockingQueue[JsonEvent](blockingQueueSize)

val producerThread = new Thread(new Runnable {
  override def run(): Unit = {
    scanJsonValuesFromStream(
      gzipInputStream,
      ReaderConfig
        .withCheckForEndOfInput(false)
        .withPreferredCharBufSize(jsoniterCharBufSize)
        .withPreferredBufSize(jsoniterBufSize)) { (jsonValue: JsonEvent) =>
      try {
        blockingQueue.put(jsonValue)
      } catch {
        case _: InterruptedException => /* DO NOTHING */
      }
      true
    }(jsonEventCodec)
  }
})
producerThread.start()

while (producerThread.isAlive || !blockingQueue.isEmpty) {
  // Do something meaningful with it, currently just counting the events
  Option(blockingQueue.poll(waitMs, TimeUnit.MILLISECONDS)).foreach(_ => eventConsumedCount += 1)
}
Currently, I don’t believe I can do this:
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false))
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false)) // Doesn't read the "2nd line"
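(A possible workaround — only a sketch, assuming every line holds exactly one JSON object, reusing gzipInputStream, inputStreamBufferSize and the implicit codec from above, and not measured against the scanning approach — would be to split the stream into lines and parse each line with readFromString:)

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

// Sketch: decode the gzip stream as UTF-8 text and parse each non-empty line on its own
val reader = new BufferedReader(
  new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8), inputStreamBufferSize)
val events: Iterator[JsonEvent] =
  Iterator.continually(reader.readLine())
    .takeWhile(_ != null)
    .filter(_.nonEmpty)
    .map(line => readFromString[JsonEvent](line))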
With Jackson, I can do something like:
import java.util.zip.GZIPInputStream
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.hadoop.fs.Path

val objectMapper: ObjectMapper = ??? // e.g. an ObjectMapper with the Scala module registered
val gzipInputStreamBufferSize = 8388608
val inputStreamBufferSize = 8388608
val jsonFactory = new JsonFactory(objectMapper)

// I'm reading a gzipped file from HDFS, but opening a local file should be the same
val inputStream = hdfsFS.open(new Path(file), inputStreamBufferSize)
val gzipInputStream = new GZIPInputStream(inputStream, gzipInputStreamBufferSize)
val jsonParser: JsonParser = jsonFactory.createParser(gzipInputStream)

while (jsonParser.nextToken() != null) {
  val jsonEvent = objectMapper.readValue(jsonParser, classOf[JsonEvent])
  ...
}
This seems to perform better (purely in terms of raw throughput; I didn’t look at memory). I’m getting about 120 events/ms with Jackson and about 105 events/ms with jsoniter-scala in my test environment (averaged over 20 iterations). I’m still running more tests while tweaking the buffer sizes. Open to any suggestions you may have about improving the process.
EDIT: It’s pretty likely the blocking queue logic is adding some performance penalty here.
EDIT: A blockingQueueSize of ~5 seems big enough not to throttle the put.
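(If that penalty turns out to matter, one tweak I may try — only a sketch, not measured — is to replace the consumer loop above with one that drains the queue in batches instead of polling single elements:)

import java.util.{ArrayList => JArrayList}

// Sketch: drain up to maxBatch events per call to cut per-element locking overhead.
// Reuses blockingQueue, producerThread, eventConsumedCount, waitMs and TimeUnit from above.
val maxBatch = 1024
val batch = new JArrayList[JsonEvent](maxBatch)
while (producerThread.isAlive || !blockingQueue.isEmpty) {
  val drained = blockingQueue.drainTo(batch, maxBatch)
  if (drained > 0) {
    eventConsumedCount += drained
    batch.clear()
  } else {
    // Nothing available right now: fall back to a timed poll to avoid busy-waiting
    Option(blockingQueue.poll(waitMs, TimeUnit.MILLISECONDS)).foreach(_ => eventConsumedCount += 1)
  }
}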
Here is a version of Iterator[A] for parsing large JSON arrays:
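(The snippet from that comment isn’t reproduced above; purely as a rough, unofficial sketch of the idea — assuming a background thread feeding a bounded queue, which may well differ from the original version — such an Iterator[A] facade could look like this:)

import java.io.InputStream
import java.util.concurrent.ArrayBlockingQueue
import com.github.plokhotnyuk.jsoniter_scala.core._

// Sketch only: expose the elements of a large JSON array as an Iterator[A] by scanning
// on a background thread and handing elements over through a bounded queue.
// A None element marks the end of input (or a scan failure).
def jsonArrayIterator[A](in: InputStream, queueSize: Int = 1024)
                        (implicit codec: JsonValueCodec[A]): Iterator[A] = {
  val queue = new ArrayBlockingQueue[Option[A]](queueSize)
  val producer = new Thread(() => {
    try scanJsonArrayFromStream(in) { (a: A) => queue.put(Some(a)); true }
    finally queue.put(None)
  })
  producer.setDaemon(true) // don't keep the JVM alive if the consumer stops early
  producer.start()
  Iterator.continually(queue.take()).takeWhile(_.isDefined).map(_.get)
}

Usage would be something like jsonArrayIterator[JsonEvent](gzipInputStream).foreach(...), though error propagation and early termination would need more care than this sketch gives them.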
Everything seems to check out. One minor change I had to make was to account for an empty file; hopefully this is the correct way to do it. Thanks for adding this functionality!