GC overhead limit exceeded using Disk Mode
I am using store.mode = "disk", but I am still observing the GC overhead limit exceeded exception.
My cluster has 116 GB of RAM, with 10 executors of 3 cores each, and I am trying to index 180M documents.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.lucene.util.ByteBlockPool$DirectTrackingAllocator.getByteBlock(ByteBlockPool.java:103)
at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:203)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:118)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:189)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:843)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1235)
at org.zouzias.spark.lucenerdd.partition.LuceneRDDPartition$$anonfun$3.apply(LuceneRDDPartition.scala:72)
at org.zouzias.spark.lucenerdd.partition.LuceneRDDPartition$$anonfun$3.apply(LuceneRDDPartition.scala:69)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.zouzias.spark.lucenerdd.partition.LuceneRDDPartition.<init>(LuceneRDDPartition.scala:69)
at org.zouzias.spark.lucenerdd.partition.LuceneRDDPartition$.apply(LuceneRDDPartition.scala:260)
at org.zouzias.spark.lucenerdd.LuceneRDD$$anonfun$13$$anonfun$apply$7.apply(LuceneRDD.scala:520)
at org.zouzias.spark.lucenerdd.LuceneRDD$$anonfun$13$$anonfun$apply$7.apply(LuceneRDD.scala:517)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
19/03/13 14:35:53 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1506,5,main]
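For context, a minimal sketch of how an indexing job like this is typically set up with spark-lucenerdd. The input path and object name are assumptions, not taken from the issue, and the disk store mode itself is read from the library's Typesafe configuration (for example an application.conf entry or a -D system property passed to driver and executors), so it does not show up in the code:

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd._
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical driver program; only store.mode = "disk" and ~180M documents
// are stated in the issue.
object IndexJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lucenerdd-disk-mode").getOrCreate()

    val docs = spark.read.parquet("/path/to/documents") // assumed input, ~180M rows
    val luceneRDD = LuceneRDD(docs)                      // builds one Lucene index per partition
    luceneRDD.count()                                    // forces the per-partition indexing to run

    spark.stop()
  }
}
```

Note that the stack trace above is thrown while a partition is being indexed (LuceneRDDPartition.scala:72), i.e. while Lucene is still buffering terms in heap memory (ByteBlockPool) before the segment is flushed to the directory, which is why a very large partition can exhaust the heap even in disk mode.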
Issue Analytics
- State:
- Created 5 years ago
- Comments: 15 (15 by maintainers)
Top GitHub Comments
The disk storage mode stores the Lucene index on disk, so there should not be much memory overhead.
I think your problem might be the distribution skew of your data. Can you share the group by of:
B.groupBy(blockingFields).count.show()
and
A.groupBy(blockingFields).count.show()
Order the counts above in descending order.
It could be that the most popular blockingFields values end up in quite large partitions.
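Concretely, the check above could be run as follows (a sketch; A and B are the two DataFrames being linked, and blockingFields is assumed to be a Seq[String] with the blocking column names):

```scala
import org.apache.spark.sql.functions.{col, desc}

// Assumed blocking columns; replace with the fields the blocker actually uses.
val blockingFields: Seq[String] = Seq("lastName", "zipCode")

// Largest groups first: these are the blocks that become oversized Lucene
// partitions and can push a single executor into GC overhead.
B.groupBy(blockingFields.map(col): _*)
  .count()
  .orderBy(desc("count"))
  .show(50, truncate = false)

A.groupBy(blockingFields.map(col): _*)
  .count()
  .orderBy(desc("count"))
  .show(50, truncate = false)
```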
For a skewed dataset, where the blocker makes uneven partitions, is there any way to repartition even further? The main goal is to increase parallelism.
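One common way to do that (not suggested in the thread itself; a hedged sketch) is key salting: append a bounded random value to the blocking key so that an oversized block is spread across several partitions. For a linkage, the other side then has to be replicated over all salt values of a block, so the extra parallelism is paid for with extra candidate comparisons. The names numSalts, saltedB and replicatedA below are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, floor, lit, rand}

// blockingFields: Seq[String] as in the snippet above.
val numSalts = 10 // upper bound on how many pieces a single block is split into

// Large side: each row gets one random salt in [0, numSalts).
val saltedB: DataFrame = B.withColumn("salt", floor(rand() * numSalts).cast("int"))

// Small side: each row is duplicated once per salt value so no matches are lost.
val replicatedA: DataFrame =
  A.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

// Blocking (or an explicit repartition/join) now uses blockingFields :+ "salt",
// spreading the biggest blocks over up to numSalts partitions each.
val repartitionedB = saltedB.repartition((blockingFields :+ "salt").map(col): _*)
```

Whether this can be pushed through spark-lucenerdd's own blocking API or has to be done with an explicit join depends on which linkage method is used.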