
Cleaning large quantities of data takes far too long...

See original GitHub issue

This is more of a question than an issue per se, but I didn’t know where else to ask since there isn’t any official forum and there’s no recent activity on Gitter.

Before I go any further, let me preface this by saying I’m still using the older aldeed:simple-schema@1.5.3 that is built into the collection2 atmosphere package. Still, after reading the release notes of the newer versions, I have no reason to believe the clean() function will be faster in this regard.

I have to insert large quantities of data, tens of thousands of rows (documents) with about 50 keys, into my database and I’ve defined a SimpleSchema for them where most of the fields are just strings, but some are dates and decimal numbers. I run each row of data through schema.clean(row) and then I Match.test(row, schema) to make sure we’ve got all required fields. I assign some other fields prior to insertion, and then perform a bulk insertion by directly accessing the mongo object.

const today = new Date();
const valids = [];

// start recording time
_.each(rows, (row) => {
    schema.clean(row);  // removing this makes the whole process ~100x faster
    if (!Match.test(row, schema)) return;  // removing this has no significant effect on runtime

    // add some fields; these also don't take a significant amount of time
    row.date = today;
    row._id = Random.id();

    valids.push(row);
});
// time taken on 13000 rows: ~100s with cleaning vs. ~1s without ¯\_(ツ)_/¯
MyCollection.rawCollection().insertMany(valids, { ordered: false }, (err, res) => { ... });

The reason I do it like this is because inserting large amounts of data using Meteor’s Mongo.Collection functions can be slow and doesn’t scale very well, unfortunately. I’ve tried both ways, and can say for certain that independently cleaning/validating bulk data before performing a bulk insert directly into the database is always faster than looping through the data and relying on collection2’s insert hook.

However, this is still too slow because schema.clean(row) is taking too long. When I don’t call clean(), the time to validate and insert 10000+ rows drops from 100+ seconds to about 1 second, so something is obviously wrong here.

There just has to be a faster way of doing this.
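One workaround, not taken from this thread but a sketch under the assumption that these rows are plain documents needing only type coercion and trimming (no update modifiers, no autoValues), is to hand-roll a minimal clean step keyed on each field’s expected type and skip MongoObject handling entirely. The `fieldTypes` map and `fastClean` function here are hypothetical names for illustration:

```javascript
// Hypothetical minimal cleaner: fieldTypes maps each key to its expected type.
// For plain insert documents this does the same coercion work as a generic
// clean() without the per-key object-traversal machinery.
const fieldTypes = { name: String, price: Number, createdAt: Date };

function fastClean(row) {
  for (const [key, type] of Object.entries(fieldTypes)) {
    const value = row[key];
    if (value === undefined) continue;
    if (type === String) {
      row[key] = typeof value === 'string' ? value.trim() : String(value).trim();
    } else if (type === Number && typeof value === 'string') {
      row[key] = Number(value);
    } else if (type === Date && !(value instanceof Date)) {
      row[key] = new Date(value);
    }
  }
  return row;
}

const cleaned = fastClean({ name: ' Widget ', price: '9.99', createdAt: '2017-02-20' });
// cleaned.name === 'Widget', cleaned.price === 9.99, cleaned.createdAt is a Date
```

Because this mutates each row in place and touches only known keys, it stays O(keys) per row with no cloning or deep traversal; validation can still be done afterwards with Match.test as in the snippet above.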

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 9

Top GitHub Comments

1 reaction
aldeed commented, Feb 20, 2017

There is a chance this has improved in the rewritten SimpleSchema, but an even better chance that it’s much worse, at least by default, because one change is that it now clones rather than mutates by default. If you pass mutate: true, it will most likely be similar in speed to what you have now.
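To illustrate the clone-vs-mutate trade-off mentioned here, a minimal plain-JavaScript sketch (not SimpleSchema itself; `cleanClone` and `cleanMutate` are hypothetical stand-ins for the two behaviors):

```javascript
// cleanClone returns a new object (like the rewritten SimpleSchema's default),
// while cleanMutate edits the row in place (like passing mutate: true).
function coerce(value) {
  return typeof value === 'string' ? value.trim() : value;
}

function cleanClone(row) {
  const out = {};
  for (const key of Object.keys(row)) out[key] = coerce(row[key]);
  return out; // caller must use the returned copy; original is untouched
}

function cleanMutate(row) {
  for (const key of Object.keys(row)) row[key] = coerce(row[key]);
  return row; // same object, edited in place — no allocation per row
}

const row = { name: '  Ada  ', age: 36 };
const cloned = cleanClone(row);  // cloned.name === 'Ada', row.name unchanged
cleanMutate(row);                // now row.name === 'Ada' as well
```

On tens of thousands of 50-key rows, the clone path allocates a fresh object per row, which is exactly the overhead the mutate option avoids.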

The slowness is most certainly due to the MongoObject handling, which is necessary mainly because of supporting update modifiers and running autoValues. The quick fix would be to create a much simpler clean function that can be used when we know that the full treatment isn’t necessary.

I’ve been thinking about changing how a lot of this works to make it focus on the 95% cases and ditch support for the 5% that is adding all of the extreme complexity. These changes would also make things much faster.

0 reactions
aldeed commented, Apr 11, 2018

The Meteor SimpleSchema package is no longer maintained other than critical fixes to keep it running with each latest Meteor release. Closing non-critical issues in this repo. Anyone who cares about this may do one or more of the following:

  • Switch to the NPM package. Be sure to adjust for the breaking changes. There can be a lot of work involved in switching, but it is worth the effort.
  • If this issue still occurs in the NPM package, submit an issue there. Be sure to include a link to a reproduction repo or a PR that adds a failing test for the issue.
  • If you have more time than money, submit a pull request to resolve any issue labeled “help wanted” in the NPM package repo.
  • If you have more money than time, donate to help support development if you are able.
