Performance issue
Hi,
I tested ParquetSource on a Parquet file of about 5M records, split into 144 partitions (files), on an 8-CPU server in the default way:
ParquetSource(new Path(modelPath))
  .toFrame(16)
  .toList
  .map(convertRow)
After about 10 min the process got stuck and I had to kill it. All CPUs were at 100% for the whole duration.
I then tested with a slightly modified codebase:
ParquetSource(new Path(modelPath))
  .parts
  .par
  .flatMap { part =>
    part.data
      .map(convertRow)
      .toList.toBlocking.single
  }
And that finished the job in 30 sec. Even without the .par line it still only takes 2 min (running on a single CPU).
Any idea why the two implementations give such different results?
Issue Analytics
- State:
- Created 7 years ago
- Comments:49 (30 by maintainers)
Top GitHub Comments
Yes, you can do that, or the Frame API will allow you to multi-thread the code in the next release as well. The reason for the List[Row] is that it avoids a lot of contention on the locks.
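To illustrate the contention point in general terms, here is a minimal sketch using only the Scala standard library. It is not the eel API: Row, convertRow and PerPartitionSketch are hypothetical stand-ins, and partitions are modelled as plain in-memory sequences. The idea it shows is that each partition is converted into its own private List, and the lists are merged only once at the end, so worker threads never compete for a shared, lock-protected structure.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PerPartitionSketch {
  // Hypothetical stand-in: a row is just a sequence of values.
  type Row = Seq[Any]

  // Hypothetical conversion, standing in for convertRow from the issue.
  def convertRow(row: Row): String = row.mkString(",")

  // Each partition is converted in its own task into a private List,
  // so no task writes to a structure shared with another task.
  def convertAll(parts: Seq[Seq[Row]]): List[String] = {
    val tasks: Seq[Future[List[String]]] =
      parts.map(part => Future(part.map(convertRow).toList))
    // Merge once, after all tasks have finished; partition order is preserved.
    Await.result(Future.sequence(tasks), 1.minute).flatten.toList
  }
}
```

With two small partitions, `PerPartitionSketch.convertAll(Seq(Seq(Seq(1, "a"), Seq(2, "b")), Seq(Seq(3, "c"))))` returns `List("1,a", "2,b", "3,c")`. Whether this matches what the library does internally is an assumption; it only mirrors the shape of the second snippet in the issue.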
On 7 December 2016 at 09:58, Niek Bartholomeus notifications@github.com wrote:
Yes indeed, the performance is fine now. OK, I will see if I can reproduce the conversion error with a minimal scenario.