
Performance issue


Issue Description

Hi,

I tested the ParquetSource on a Parquet dataset of about 5M records, split into 144 partitions (files), on an 8-CPU server in the default way:

    ParquetSource(new Path(modelPath))
      .toFrame(16)
      .toList
      .map(convertRow)

After about 10 minutes the process got stuck and I had to kill it. All CPUs were at 100% for the whole duration.

I then tested with a slightly modified codebase:

    ParquetSource(new Path(modelPath))
      .parts
      .par
      .flatMap { part =>
        part.data
          .map(convertRow)
          .toList.toBlocking.single
      }

That finished the job in 30 seconds. Even without the .par line it still takes only 2 minutes (running on a single CPU).

Any idea why the two implementations give such different results?
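For readers who want to try the per-partition pattern without the library, here is a minimal sketch using only the Scala standard library. It is not eel-sdk code: `Record`, `convertRow`, and the in-memory "partitions" are hypothetical stand-ins for the Parquet parts and rows in the snippet above, and Futures are used instead of `.par` so the example has no extra dependencies.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PerPartitionLoad {
  // Hypothetical stand-in for a converted row.
  final case class Record(id: Int)

  // Hypothetical stand-in for the convertRow function in the issue.
  def convertRow(raw: Int): Record = Record(raw)

  // Convert each partition in its own Future, then concatenate the
  // per-partition lists in partition order.
  def loadAll(partitions: Seq[Seq[Int]]): List[Record] = {
    val perPartition: Seq[Future[List[Record]]] =
      partitions.map(part => Future(part.map(convertRow).toList))
    Await.result(Future.sequence(perPartition), 1.minute).flatten.toList
  }
}
```

Each partition is materialised independently, so the workers share no data until the final concatenation. For example, `PerPartitionLoad.loadAll(Seq(Seq(1, 2), Seq(3)))` yields the three converted records in their original order.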

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Comments:49 (30 by maintainers)

Top GitHub Comments

1 reaction
sksamuel commented, Dec 7, 2016

Yes you can do that, or the Frame API will allow you to multi-thread the code in the next release as well. The reason for the List[Row] is that it avoids a lot of the contention on the locks.
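The effect of handing over rows in batches can be illustrated with a generic producer/consumer sketch. This is an assumption-laden illustration, not eel-sdk's internals: `BatchedHandoff`, `transfer`, and the empty-list end marker are hypothetical names invented for this example. It only shows that passing a `List` per queue operation reduces the number of synchronised operations on the shared queue.

```scala
import java.util.concurrent.LinkedBlockingQueue

object BatchedHandoff {
  // An empty batch marks the end of the stream (a convention of this sketch only).
  private val Done: List[Int] = Nil

  // A producer thread publishes rows in batches of batchSize; the calling
  // thread drains the queue. Returns the rows received and the number of
  // non-empty batches taken from the queue.
  def transfer(rows: Seq[Int], batchSize: Int): (List[Int], Int) = {
    val queue = new LinkedBlockingQueue[List[Int]]()
    val producer = new Thread(new Runnable {
      def run(): Unit = {
        // grouped never yields an empty group, so Done is unambiguous.
        rows.grouped(batchSize).foreach(batch => queue.put(batch.toList))
        queue.put(Done)
      }
    })
    producer.start()

    val received = List.newBuilder[Int]
    var batches = 0
    var batch = queue.take()
    while (batch.nonEmpty) {
      received ++= batch
      batches += 1
      batch = queue.take()
    }
    producer.join()
    (received.result(), batches)
  }
}
```

With ten rows, a batch size of 1 needs ten queue hand-offs, while a batch size of 4 needs three; every hand-off acquires the queue's lock, so larger batches mean fewer opportunities for threads to contend.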

On 7 December 2016 at 09:58, Niek Bartholomeus notifications@github.com wrote:

Looks good!

I currently don’t have the full dataset of 5M words available but I tested with roughly a fifth of it. With the previous version the load took 15s and with the latest version it only takes 10s!

I will let you know the figures once I have my 5M dataset back.

One note: as I want to run parallel calculations based on the loaded dataset, I prefer to keep it split up in a ParMap. The parts API allows me to do this, so I'm not using the `toFrame` API. I noticed, however, that with the latest version it is less straightforward to collect all rows per part:

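    ParquetSource(new Path(wordVariationPath))
      .parts
      .par
      .flatMap { part =>
        // Accumulate every batch of rows produced by this part
        var rowsTotal = List[Row]()
        part.iterator.foreach { rows =>
          rowsTotal = rowsTotal ++ rows
        }
        // Return the collected rows so that flatMap has something to flatten
        rowsTotal
      }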

Is this the best way to proceed?

Thanks!


0 reactions
tolomaus commented, Jan 27, 2017

Yes indeed, the performance is fine now. OK, I will see if I can reproduce the conversion error with a minimal scenario.
