Reading large file containing JSON objects on each line lazily

I have a large (multiple GB) file which contains a single JSON object on each line in UTF-8 encoding (I just sampled my data; lines seem to be about 400-700 UTF-8 characters each).

I’d like to read the file lazily (e.g. via an iterator) instead of in one shot. I have what I believe is a solution for that (see below).
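
For illustration, a single line of the input might look like the following (a hypothetical record, not actual data; the snake_case field names are assumed from the codec configuration below):

{"id":"abc123","event_type":"click","time":1620000000000,"optional_field":null,"var_length_field_a":["a","b"],"var_length_field_b":{"k":"v"}}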

// Imports needed by the snippet below (hdfsFS and file are assumed to be defined elsewhere)
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import java.util.zip.GZIPInputStream
import org.apache.hadoop.fs.Path
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

// Simplified JsonEvent
case class JsonEvent(id: String, eventType: String, time: Long, optionalField: Option[String], varLengthFieldA: Option[List[String]], varLengthFieldB: Map[String, String])

implicit val jsonEventCodec: JsonValueCodec[JsonEvent] = JsonCodecMaker.make(
  CodecMakerConfig
    .withFieldNameMapper(JsonCodecMaker.enforce_snake_case))

val gzipInputStreamBufferSize = 8388608
val inputStreamBufferSize = 8388608
val blockingQueueSize = 100000
val waitMs = 100L // consumer poll timeout; the value was not specified in the original snippet
var eventConsumedCount = 0L
val jsoniterCharBufSize = 4096
val jsoniterBufSize = 8388608 // Half this seems to be relatively the same performance
// I'm reading a gzipped file from HDFS but opening a file should be the same
val inputStream = hdfsFS.open(new Path(file), inputStreamBufferSize)
val gzipInputStream = new GZIPInputStream(inputStream, gzipInputStreamBufferSize)
// Fixed size so we don't OOM
val blockingQueue = new ArrayBlockingQueue[JsonEvent](blockingQueueSize)
val producerThread = new Thread(new Runnable {
  override def run(): Unit = {

    scanJsonValuesFromStream(
      gzipInputStream,
      ReaderConfig
        .withCheckForEndOfInput(false)
        .withPreferredCharBufSize(jsoniterCharBufSize)
        .withPreferredBufSize(jsoniterBufSize)) {

      (jsonValue: JsonEvent) =>
        try {
          blockingQueue.put(jsonValue)
        } catch {
          case _: InterruptedException => /* DO NOTHING */
        }
        true
    }(jsonEventCodec)
  }
})

producerThread.start()
while (producerThread.isAlive || !blockingQueue.isEmpty) {
  // Do something meaningful with it, currently just counting the events
  Option(blockingQueue.poll(waitMs, TimeUnit.MILLISECONDS)).foreach(_ => eventConsumedCount += 1)
}

Currently, I don’t believe I can do something like the following (calling readFromStream twice to read two consecutive objects from the same stream):

readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false))
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false)) // Doesn't read the "2nd line"
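
One workaround (not from the original issue, just a sketch of an alternative) is to split the stream into lines first and decode each line independently with readFromArray; this sidesteps the end-of-input check at the cost of an extra pass over the bytes:

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8

// Assumes gzipInputStream and jsonEventCodec from the snippet above
val lines = new BufferedReader(new InputStreamReader(gzipInputStream, UTF_8))
Iterator.continually(lines.readLine()).takeWhile(_ != null).foreach { line =>
  // Each line holds exactly one JSON object, so it can be decoded on its own
  val event = readFromArray(line.getBytes(UTF_8))(jsonEventCodec)
  // ... process event
}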

With Jackson, I can do something like:

// Imports for the Jackson variant (hdfsFS, file and the HDFS/GZIP imports are assumed as above)
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
import com.fasterxml.jackson.databind.ObjectMapper

val objectMapper: ObjectMapper = ???
val gzipInputStreamBufferSize = 8388608
val inputStreamBufferSize = 8388608
val jsonFactory = new JsonFactory(objectMapper)
// I'm reading a gzipped file from HDFS but opening a file should be the same
val inputStream = hdfsFS.open(new Path(file), inputStreamBufferSize)
val gzipInputStream = new GZIPInputStream(inputStream, gzipInputStreamBufferSize)
val jsonParser: JsonParser = jsonFactory.createParser(gzipInputStream)
while (jsonParser.nextToken() != null) {
  val jsonEvent = objectMapper.readValue(jsonParser, classOf[JsonEvent])
  // ... do something with jsonEvent
}

This seems to perform better (just in terms of pure raw throughput; I didn’t look at memory). I’m getting about 120 events/ms with Jackson and about 105 events/ms with jsoniter-scala in my test environment (averaged over 20 iterations). I’m still running more tests tweaking the buffer sizes. Open to any suggestions you may have about improving the process.

EDIT: It’s pretty likely the blocking-queue hand-off is adding some performance penalty here. EDIT 2: A blockingQueueSize of ~5 seems big enough not to throttle the put.
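
As a point of comparison, a minimal single-threaded sketch without the queue (assuming the same stream, config and codec as above) consumes events directly in the scan callback on the producer thread:

// No hand-off: the callback does the (cheap) work itself
scanJsonValuesFromStream(
  gzipInputStream,
  ReaderConfig
    .withCheckForEndOfInput(false)
    .withPreferredCharBufSize(jsoniterCharBufSize)
    .withPreferredBufSize(jsoniterBufSize)) { (jsonValue: JsonEvent) =>
  eventConsumedCount += 1 // replace with real processing
  true // keep scanning until end of stream
}(jsonEventCodec)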

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
plokhotnyuk commented, May 11, 2021

Here is a version of Iterator[A] for parsing large JSON arrays:

package com.github.plokhotnyuk.jsoniter_scala

import java.io.InputStream
import scala.collection.AbstractIterator
import com.github.plokhotnyuk.jsoniter_scala.core._

package object extra {
  def toJsonArrayIteratorFromStream[A](in: InputStream, config: ReaderConfig = ReaderConfig)
                                      (implicit codec: JsonValueCodec[A]): Iterator[A] =
    new AbstractIterator[A] {
      private[this] val reader = new JsonReader(
        buf = new Array[Byte](config.preferredBufSize),
        charBuf = new Array[Char](config.preferredCharBufSize),
        in = in,
        config = config)
      private[this] var continue: Boolean =
        if (reader.isNextToken('[')) !reader.isNextToken(']') && {
          reader.rollbackToken()
          true
        } else reader.readNullOrTokenError(false, '[')

      override def hasNext: Boolean = continue

      override def next(): A = {
        val x = codec.decodeValue(reader, codec.nullValue)
        continue = reader.isNextToken(',') || checkEndConditions()
        x
      }

      private[this] def checkEndConditions(): Boolean =
        (reader.isCurrentToken(']') || reader.decodeError("expected ']' or ','")) &&
          config.checkForEndOfInput && reader.endOfInputOrError()
    }
}
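
For completeness, a hypothetical usage sketch (it assumes the data is stored as one large JSON array rather than one object per line, and reuses the codec and gzipped stream from the question):

import com.github.plokhotnyuk.jsoniter_scala.extra._

// Elements are decoded lazily, one at a time, as the iterator is consumed
val events: Iterator[JsonEvent] = toJsonArrayIteratorFromStream[JsonEvent](gzipInputStream)(jsonEventCodec)
events.foreach(_ => eventConsumedCount += 1)
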
1 reaction
steven-lai commented, May 6, 2021

Everything seems to check out. One minor change I had to make was:

private[this] var continue: Boolean = reader.skipWhitespaces()

Hopefully this is the correct way to account for an empty file. Thanks for adding this functionality!
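
For the newline-delimited layout from the question, a full variant of the iterator might look like the following sketch. This is an assumption, not code from the issue: it relies on skipWhitespaces() returning false at end of input (as in the one-line change above) and must live in the same package so the package-private JsonReader constructor stays accessible:

def toJsonValuesIteratorFromStream[A](in: InputStream, config: ReaderConfig = ReaderConfig)
                                     (implicit codec: JsonValueCodec[A]): Iterator[A] =
  new AbstractIterator[A] {
    private[this] val reader = new JsonReader(
      buf = new Array[Byte](config.preferredBufSize),
      charBuf = new Array[Char](config.preferredCharBufSize),
      in = in,
      config = config)
    // false when the stream has no further tokens, which also covers an empty file
    private[this] var continue: Boolean = reader.skipWhitespaces()

    override def hasNext: Boolean = continue

    override def next(): A = {
      val x = codec.decodeValue(reader, codec.nullValue)
      continue = reader.skipWhitespaces()
      x
    }
  }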
