
duplicates when parsing - max workerNum ?


I’m getting duplicate objects in my parsing results. I’m parsing large CSV files (using fromFile and the on('json') event) with workers.

When I set workerNum: 4 everything seems to be OK (as far as I can tell), but if I use more I get duplicates. For example, with workerNum: 8 I get each object twice in my parsed result, and with workerNum: 12 I get each one three times.

Any idea why? Is there a limit of 4 workers?

Note: my machine has 48 vCPU

UPDATE: it seems 4 workers also produce duplicates.
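
For reference, the setup described above presumably looks roughly like this (csvtojson v1 API; the file path, workerNum value, and result handling are placeholders, not taken from the thread):

const csv = require('csvtojson');

const results = [];
csv({ workerNum: 8 })                 // more than 4 workers reportedly yields duplicate rows
  .fromFile('./large-file.csv')
  .on('json', (jsonObj) => {          // fires once per parsed row
    results.push(jsonObj);
  })
  .on('done', (error) => {
    if (error) throw error;
    console.log('parsed rows:', results.length);
  });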

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
Keyang commented, May 21, 2018

Hi, please check out the Asynchronously Process section here: https://github.com/Keyang/node-csvtojson#asynchronouse-result-process

For your situation you can do:

const csvToJson = require('csvtojson');

csvToJson({
    maxRowLength: 65535
})
.fromFile(csvFilePath)
.subscribe(async (json) => {
    // process each parsed row asynchronously; db.insert is your own storage call
    const storeResult = await db.insert(json);
});

I have temporarily removed support for workerNum in v2, as the workers created too much overhead on inter-process communication. If you are building a Node.js based web service, it would probably be better to use the built-in cluster feature to utilise multiple cores.
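
For context, Node’s built-in cluster feature mentioned above is used roughly like this (a minimal sketch, not from the thread; the HTTP handler and port are placeholders):

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // Fork one worker process per CPU core.
  os.cpus().forEach(() => cluster.fork());
} else {
  // Each worker runs its own server; incoming connections are
  // distributed across the workers by the cluster module.
  http.createServer((req, res) => {
    res.end('handled by worker ' + process.pid);
  }).listen(3000);
}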

I will add another way to support multiple CPU cores for parsing fromFile in the future, since dividing the file into multiple chunks seems to be the only proper way to parse in parallel.

~Keyang
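
Until that lands, one way to approximate the chunked approach Keyang describes is to split the file on row boundaries yourself and parse each piece in a child process. A rough sketch, assuming the chunk files already exist (parse-chunk.js is a hypothetical worker script, not part of csvtojson):

const { fork } = require('child_process');

// chunkFiles: paths to line-aligned pieces of the original CSV,
// produced beforehand by your own splitting step.
function parseChunksInParallel(chunkFiles) {
  return Promise.all(chunkFiles.map((file) => new Promise((resolve, reject) => {
    // parse-chunk.js (hypothetical) runs csvtojson on one piece and
    // sends each parsed row back over the IPC channel.
    const child = fork('./parse-chunk.js', [file]);
    const rows = [];
    child.on('message', (row) => rows.push(row));
    child.on('exit', (code) =>
      code === 0 ? resolve(rows) : reject(new Error('chunk ' + file + ' failed')));
  })));
}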

0 reactions
Fire-Brand commented, May 23, 2018

Great, thank you! Can’t wait for parallel parsing. Not sure I would go for cluster right now, but I’ll definitely look into it!
