
Cleaning large quantities of data takes far too long...

See original GitHub issue

This is more of a question than an issue per se, but I didn’t know where else to ask since there isn’t any official forum and there’s no recent activity on Gitter.

Before I go any further, let me preface this by saying I’m still using the older aldeed:simple-schema@1.5.3 that is built into the collection2 atmosphere package. Still, after reading the release notes of the newer versions, I have no reason to believe the clean() function will be faster in this regard.

I have to insert large quantities of data, tens of thousands of rows (documents) with about 50 keys, into my database and I’ve defined a SimpleSchema for them where most of the fields are just strings, but some are dates and decimal numbers. I run each row of data through schema.clean(row) and then I Match.test(row, schema) to make sure we’ve got all required fields. I assign some other fields prior to insertion, and then perform a bulk insertion by directly accessing the mongo object.

const today = new Date();
const valids = [];

// start recording time
_.each(rows, (row) => {
    schema.clean(row);  // removing this makes the whole process ~100x faster
    if (!Match.test(row, schema)) return;  // removing this has no significant effect on runtime

    // add some fields; these also don't take a significant amount of time
    row.date = today;
    row._id = Random.id();

    valids.push(row);
});
// time taken on 13000 rows: ~100s with cleaning vs. ~1s without ¯\_(ツ)_/¯
MyCollection.rawCollection().insertMany(valids, { ordered: false }, (err, res) => { ... });

The reason I do it like this is because inserting large amounts of data using Meteor’s Mongo.Collection functions can be slow and doesn’t scale very well, unfortunately. I’ve tried both ways, and can say for certain that independently cleaning/validating bulk data before performing a bulk insert directly into the database is always faster than looping through the data and relying on collection2’s insert hook.

However, this is still too slow because schema.clean(row) is taking too long. When I don’t call clean(), the time to validate and insert 10000+ rows drops from 100+ seconds to about 1 second, so something is obviously wrong here.

There just has to be a faster way of doing this.
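One workaround, not taken from this thread but a sketch under the assumption that these rows are plain documents needing only type coercion and trimming (no update modifiers, no autoValues), is to hand-roll a minimal clean step keyed on each field’s expected type and skip MongoObject handling entirely. The `fieldTypes` map and `fastClean` function here are hypothetical names for illustration:

```javascript
// Hypothetical minimal cleaner: fieldTypes maps each key to its expected type.
// For plain insert documents this does the same coercion work as a generic
// clean() without the per-key object-traversal machinery.
const fieldTypes = { name: String, price: Number, createdAt: Date };

function fastClean(row) {
  for (const [key, type] of Object.entries(fieldTypes)) {
    const value = row[key];
    if (value === undefined) continue;
    if (type === String) {
      row[key] = typeof value === 'string' ? value.trim() : String(value).trim();
    } else if (type === Number && typeof value === 'string') {
      row[key] = Number(value);
    } else if (type === Date && !(value instanceof Date)) {
      row[key] = new Date(value);
    }
  }
  return row;
}

const cleaned = fastClean({ name: ' Widget ', price: '9.99', createdAt: '2017-02-20' });
// cleaned.name === 'Widget', cleaned.price === 9.99, cleaned.createdAt is a Date
```

Because this mutates each row in place and touches only known keys, it stays O(keys) per row with no cloning or deep traversal; validation can still be done afterwards with Match.test as in the snippet above.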

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 9

Top GitHub Comments

1 reaction
aldeed commented, Feb 20, 2017

There is a chance this has improved in the rewritten SimpleSchema, but an even better chance that it’s much worse, at least by default, because one change is that it now clones rather than mutates by default. If you pass mutate: true, it will most likely be similar in speed to what you have now.
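To illustrate the clone-vs-mutate trade-off mentioned here, a minimal plain-JavaScript sketch (not SimpleSchema itself; `cleanClone` and `cleanMutate` are hypothetical stand-ins for the two behaviors):

```javascript
// cleanClone returns a new object (like the rewritten SimpleSchema's default),
// while cleanMutate edits the row in place (like passing mutate: true).
function coerce(value) {
  return typeof value === 'string' ? value.trim() : value;
}

function cleanClone(row) {
  const out = {};
  for (const key of Object.keys(row)) out[key] = coerce(row[key]);
  return out; // caller must use the returned copy; original is untouched
}

function cleanMutate(row) {
  for (const key of Object.keys(row)) row[key] = coerce(row[key]);
  return row; // same object, edited in place — no allocation per row
}

const row = { name: '  Ada  ', age: 36 };
const cloned = cleanClone(row);  // cloned.name === 'Ada', row.name unchanged
cleanMutate(row);                // now row.name === 'Ada' as well
```

On tens of thousands of 50-key rows, the clone path allocates a fresh object per row, which is exactly the overhead the mutate option avoids.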

The slowness is most certainly due to the MongoObject handling, which is necessary mainly because of supporting update modifiers and running autoValues. The quick fix would be to create a much simpler clean function that can be used when we know that the full treatment isn’t necessary.

I’ve been thinking about changing how a lot of this works to make it focus on the 95% cases and ditch support for the 5% that is adding all of the extreme complexity. These changes would also make things much faster.

0 reactions
aldeed commented, Apr 11, 2018

The Meteor SimpleSchema package is no longer maintained other than critical fixes to keep it running with each latest Meteor release. Closing non-critical issues in this repo. Anyone who cares about this may do one or more of the following:

  • Switch to the NPM package. Be sure to adjust for the breaking changes. There can be a lot of work involved in switching, but it is worth the effort.
  • If this issue still occurs in the NPM package, submit an issue there. Be sure to include a link to a reproduction repo or a PR that adds a failing test for the issue.
  • If you have more time than money, submit a pull request to resolve any issue labeled “help wanted” in the NPM package repo.
  • If you have more money than time, donate to help support development if you are able.
