StatisticsGen fails with OutOfMemoryError when using Flink, an external Beam worker pool, and a large amount of data
Hi there!
Environment:
- Flink 1.10.3
- Beam 2.27.0
- TFX (and subpackages) 0.27.0
- MinIO (S3) as storage
I’m developing a TFX pipeline: ImportExampleGen, StatisticsGen, etc.
I have ~5 GB of TF Example data, about 10 million rows.
I configured the Flink cluster to use almost 15 GB of RAM, but StatisticsGen fails with an OutOfMemoryError.
This dataset is not even the production one; production datasets are much larger.
The error occurs on the GenerateStatistics[train]/RunStatsGenerators/GenerateSlicedStatisticsImpl/TopKUniquesStatsGenerator/CombineCountsAndWeights/GroupByKey -> [2]GenerateStatistics[train]/RunStatsGenerators/GenerateSlicedStatisticsImpl/TopKUniquesStatsGenerator/{CombineCountsAndWeights, Rearrange} -> ([2]GenerateStatistics[train]/RunStatsGenerators/GenerateSlicedStatisticsImpl/TopKUniquesStatsGenerator/{Uniques_Keys, Uniques_CountPerFeatureName} -> ToKeyedWorkItem, ToKeyedWorkItem) task.
Detailed error from Flink logs:
Exception in thread "grpc-default-executor-4" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3520)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.ByteString$ArraysByteArrayCopier.copyFrom(ByteString.java:126)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.ByteString.copyFrom(ByteString.java:362)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.ByteString.copyFrom(ByteString.java:372)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.CodedInputStream$StreamDecoder.readBytesSlowPath(CodedInputStream.java:2978)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.CodedInputStream$StreamDecoder.readBytes(CodedInputStream.java:2386)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements$Data.<init>(BeamFnApi.java:29659)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements$Data.<init>(BeamFnApi.java:29600)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements$Data$1.parsePartialFrom(BeamFnApi.java:30539)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements$Data$1.parsePartialFrom(BeamFnApi.java:30533)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2366)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements.<init>(BeamFnApi.java:29463)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements.<init>(BeamFnApi.java:29412)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements$1.parsePartialFrom(BeamFnApi.java:32943)
at org.apache.beam.model.fnexecution.v1.BeamFnApi$Elements$1.parsePartialFrom(BeamFnApi.java:32937)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
at org.apache.beam.vendor.grpc.v1p26p0.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parseFrom(ProtoLiteUtils.java:223)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:215)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:118)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:299)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
How can I decrease memory usage?
Thanks!
Pipeline options:
--runner=FlinkRunner
--flink_master=flink-cluster:8081
--streaming
--auto_balance_write_files_sharding_enabled
--faster_copy
--max_parallelism=5
--object_reuse
--environment_type=EXTERNAL
--environment_config=127.0.0.1:50000
--flink_submit_uber_jar
--parallelism=5
--sdk_worker_parallelism=5
--execution_mode_for_batch=BATCH_FORCED
--s3_access_key=valid-access-key
--s3_secret_access_key=valid-secret-key
--s3_endpoint_url=minio:9000
--s3_verify=false
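For reference, flags like these are normally handed to TFX as a list of Beam pipeline arguments. A minimal sketch (values copied from the flags above, S3 credentials omitted; the pipeline definition itself is elided):

```python
# Sketch: passing the Flink portable-runner flags above to a TFX pipeline
# as beam_pipeline_args. In a real pipeline definition this list would be
# given to tfx.orchestration.pipeline.Pipeline(..., beam_pipeline_args=...).
beam_pipeline_args = [
    "--runner=FlinkRunner",
    "--flink_master=flink-cluster:8081",
    "--environment_type=EXTERNAL",
    "--environment_config=127.0.0.1:50000",
    "--parallelism=5",
    "--sdk_worker_parallelism=5",
    "--execution_mode_for_batch=BATCH_FORCED",
]

# The S3/MinIO flags (--s3_access_key, --s3_endpoint_url, ...) would be
# appended here in the same "--name=value" form.
```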
Flink memory configuration:
jobmanager.memory.process.size: 2048m
jobmanager.heap.size: 4096m
taskmanager.runtime.large-record-handler: true
taskmanager.runtime.hashjoin-bloom-filters: true
compiler.delimited-informat.max-line-samples: 100
taskmanager.memory.task.heap.size: 12gb
taskmanager.memory.managed.size: 4gb
taskmanager.memory.jvm-overhead.max: 4gb
taskmanager.memory.jvm-overhead.min: 256m
taskmanager.memory.jvm-overhead.fraction: 0.4
taskmanager.memory.task.off-heap.size: 4gb
taskmanager.memory.framework.heap.size: 512m
taskmanager.memory.framework.off-heap.size: 2gb
taskmanager.numberOfTaskSlots: 5
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.network.min: 64mb
taskmanager.memory.network.max: 4gb
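As a rough sanity check on the task-manager settings above, the explicitly sized components can be summed (a sketch only: Flink's real total-process-memory accounting is more involved, and the network/JVM-overhead maxima are upper bounds, not guaranteed allocations):

```python
# Rough sanity check: sum the explicitly configured task-manager memory
# components from the flink-conf above, taking the network and JVM-overhead
# maxima as upper bounds. This only shows that the configured ceiling is
# well above the ~15 GB the cluster actually has.
GB = 1024  # work in MiB

components_mib = {
    "task.heap": 12 * GB,
    "managed": 4 * GB,
    "task.off-heap": 4 * GB,
    "framework.heap": 512,
    "framework.off-heap": 2 * GB,
    "network.max": 4 * GB,
    "jvm-overhead.max": 4 * GB,
}

total_mib = sum(components_mib.values())
print(f"configured ceiling: {total_mib / 1024:.1f} GiB")  # prints: configured ceiling: 30.5 GiB
```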
[Flink task manager memory metrics from Grafana] As you can see, ImportExampleGen completed successfully; then, after a period of idle memory, StatisticsGen starts and fails.
Issue Analytics
- Created 3 years ago
- Comments: 25 (6 by maintainers)
Top GitHub Comments
@ibzib Thank you for all the tips so far. I’ve added
image_domain {}
to my features and it seems to help. Do you have recommended parallelism values that you have found to work well for TFX?
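For context, `image_domain {}` is set on a feature entry in the TFMD schema, so that statistics generation treats the feature's bytes as opaque image data instead of building top-K/unique statistics over the raw strings. A hypothetical schema.pbtxt fragment (the feature name `image_raw` is illustrative, not from this issue):

```proto
# Hypothetical schema.pbtxt fragment.
feature {
  name: "image_raw"   # illustrative feature name
  type: BYTES
  image_domain {}
}
```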
It’s rather a lot of values to configure. From reading the Flink documentation, one task manager with two slots is roughly equivalent to two task managers with one slot and half the memory each.
Since trial and error can be quite slow, I was wondering if you could give some ballpark values for optimal TFX performance?
Thanks! I think I’ve encountered this and didn’t know there was a solution!
I myself have had very mixed success with Flink since release 0.21 onward; it’s helped with larger datasets, but it’s its own can of worms (jobs sometimes hang inexplicably). The thing is, once datasets reach a few gigabytes you invariably hit OOM with DirectRunner, even if the dataset should in theory fit easily in RAM. E.g. in https://github.com/tensorflow/tfx/issues/1907 it’s recommended to use Spark/Flink/Dataflow. I’ve also found https://github.com/tensorflow/transform/issues/143 to be a problem (at least in 0.22; I haven’t confirmed in 0.27).
I think Spark is even less likely to work in general than Flink and seems to have basically no support from the Beam folks. As far as I can see, this makes Dataflow the only option for datasets more than a few GB in size, which gives people who are just starting out and are not on GCP a hard time.
I think it would be great to test with larger datasets on DirectRunner and Flink, at least to document some expectation of what should work in production with a chunky dataset. Being able to run reliably on large datasets without Dataflow also means one can use Jupyter notebooks more readily without crashing, which offers a lot of value.
TL;DR: I would love it if you tested the components with a huge dataset, if you don’t already.