Reading a large file containing JSON objects on each line lazily
I have a large (multiple GBs) file which contains a single JSON object on each line in UTF-8 encoding (from sampling my data, lines seem to be about 400-700 UTF-8 characters each).
I’d like to read the file lazily (e.g. via an iterator) instead of in one shot. I have what I believe is a solution for that (see below).
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import java.util.zip.GZIPInputStream
import org.apache.hadoop.fs.Path
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

// Simplified JsonEvent
case class JsonEvent(
  id: String,
  eventType: String,
  time: Long,
  optionalField: Option[String],
  varLengthFieldA: Option[List[String]],
  varLengthFieldB: Map[String, String])

implicit val jsonEventCodec: JsonValueCodec[JsonEvent] = JsonCodecMaker.make(
  CodecMakerConfig.withFieldNameMapper(JsonCodecMaker.enforce_snake_case))

val gzipInputStreamBufferSize = 8388608
val inputStreamBufferSize = 8388608
val blockingQueueSize = 100000
val waitMs = 100L // poll timeout for the consumer loop (value not shown in the original snippet)
var eventConsumedCount = 0L
val jsoniterCharBufSize = 4096
val jsoniterBufSize = 8388608 // Half this size seems to give about the same performance

// I'm reading a gzipped file from HDFS, but opening a local file should be the same
val inputStream = hdfsFS.open(new Path(file), inputStreamBufferSize)
val gzipInputStream = new GZIPInputStream(inputStream, gzipInputStreamBufferSize)

// Fixed size so we don't OOM
val blockingQueue = new ArrayBlockingQueue[JsonEvent](blockingQueueSize)

val producerThread = new Thread(new Runnable {
  override def run(): Unit = {
    scanJsonValuesFromStream(
      gzipInputStream,
      ReaderConfig
        .withCheckForEndOfInput(false)
        .withPreferredCharBufSize(jsoniterCharBufSize)
        .withPreferredBufSize(jsoniterBufSize)) { (jsonValue: JsonEvent) =>
      try {
        blockingQueue.put(jsonValue)
      } catch {
        case _: InterruptedException => /* DO NOTHING */
      }
      true
    }(jsonEventCodec)
  }
})
producerThread.start()

while (producerThread.isAlive || !blockingQueue.isEmpty) {
  // Do something meaningful with it, currently just counting the events
  Option(blockingQueue.poll(waitMs, TimeUnit.MILLISECONDS)).foreach(_ => eventConsumedCount += 1)
}
Currently, I don’t believe I can do this:
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false))
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false)) // Doesn't read the "2nd line"
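(A possible workaround — only a sketch, assuming every line holds exactly one JSON object, reusing gzipInputStream, inputStreamBufferSize and the implicit codec from above, and not measured against the scanning approach — would be to split the stream into lines and parse each line with readFromString:)

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

// Sketch: decode the gzip stream as UTF-8 text and parse each non-empty line on its own
val reader = new BufferedReader(
  new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8), inputStreamBufferSize)
val events: Iterator[JsonEvent] =
  Iterator.continually(reader.readLine())
    .takeWhile(_ != null)
    .filter(_.nonEmpty)
    .map(line => readFromString[JsonEvent](line))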
With Jackson, I can do something like:
import java.util.zip.GZIPInputStream
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.hadoop.fs.Path

val objectMapper: ObjectMapper = ??? // e.g. an ObjectMapper with the Scala module registered
val gzipInputStreamBufferSize = 8388608
val inputStreamBufferSize = 8388608
val jsonFactory = new JsonFactory(objectMapper)

// I'm reading a gzipped file from HDFS, but opening a local file should be the same
val inputStream = hdfsFS.open(new Path(file), inputStreamBufferSize)
val gzipInputStream = new GZIPInputStream(inputStream, gzipInputStreamBufferSize)
val jsonParser: JsonParser = jsonFactory.createParser(gzipInputStream)

while (jsonParser.nextToken() != null) {
  val jsonEvent = objectMapper.readValue(jsonParser, classOf[JsonEvent])
  ...
}
This seems to perform better (purely in terms of raw throughput; I didn’t look at memory). I’m getting about 120 events/ms with Jackson and about 105 events/ms with jsoniter-scala in my test environment (averaged over 20 iterations). I’m still running more tests while tweaking the buffer sizes. Open to any suggestions you may have about improving the process.
EDIT: It’s pretty likely the blocking queue logic is adding some performance penalty here.
EDIT: A blockingQueueSize of ~5 seems big enough not to throttle the put.
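(If that penalty turns out to matter, one tweak I may try — only a sketch, not measured — is to replace the consumer loop above with one that drains the queue in batches instead of polling single elements:)

import java.util.{ArrayList => JArrayList}

// Sketch: drain up to maxBatch events per call to cut per-element locking overhead.
// Reuses blockingQueue, producerThread, eventConsumedCount, waitMs and TimeUnit from above.
val maxBatch = 1024
val batch = new JArrayList[JsonEvent](maxBatch)
while (producerThread.isAlive || !blockingQueue.isEmpty) {
  val drained = blockingQueue.drainTo(batch, maxBatch)
  if (drained > 0) {
    eventConsumedCount += drained
    batch.clear()
  } else {
    // Nothing available right now: fall back to a timed poll to avoid busy-waiting
    Option(blockingQueue.poll(waitMs, TimeUnit.MILLISECONDS)).foreach(_ => eventConsumedCount += 1)
  }
}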
Here is a version of Iterator[A] for parsing large JSON arrays:
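(The snippet from that comment isn’t reproduced above; purely as a rough, unofficial sketch of the idea — assuming a background thread feeding a bounded queue, which may well differ from the original version — such an Iterator[A] facade could look like this:)

import java.io.InputStream
import java.util.concurrent.ArrayBlockingQueue
import com.github.plokhotnyuk.jsoniter_scala.core._

// Sketch only: expose the elements of a large JSON array as an Iterator[A] by scanning
// on a background thread and handing elements over through a bounded queue.
// A None element marks the end of input (or a scan failure).
def jsonArrayIterator[A](in: InputStream, queueSize: Int = 1024)
                        (implicit codec: JsonValueCodec[A]): Iterator[A] = {
  val queue = new ArrayBlockingQueue[Option[A]](queueSize)
  val producer = new Thread(() => {
    try scanJsonArrayFromStream(in) { (a: A) => queue.put(Some(a)); true }
    finally queue.put(None)
  })
  producer.setDaemon(true) // don't keep the JVM alive if the consumer stops early
  producer.start()
  Iterator.continually(queue.take()).takeWhile(_.isDefined).map(_.get)
}

Usage would be something like jsonArrayIterator[JsonEvent](gzipInputStream).foreach(...), though error propagation and early termination would need more care than this sketch gives them.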
Everything seems to check out. One minor change I had to make was to account for an empty file; hopefully this is the correct way to do it. Thanks for adding this functionality!