
Performance issue on import - chatty database communication

See original GitHub issue

Hi,

I am importing CSVs with 10k rows and noticed that the import is very slow - roughly 10 minutes per file. This is against a Django app running locally, connected to a Postgres instance that is also running locally.

Upon further inspection, it looks like a select * from table where id = ? is issued for every row in the imported CSV, which slows down the import. It would be better to do a single select * from table where id in (?, ?, ?, ...) instead to speed up the process.

Thanks
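
As an illustration of the batched lookup being suggested, a minimal sketch in the Django ORM might look like this (Book is a hypothetical model standing in for whatever the resource imports into):

from myapp.models import Book  # hypothetical app and model

def prefetch_by_id(rows):
    """Fetch every existing record for the import in one query instead of one per row."""
    ids = [row['id'] for row in rows if row.get('id')]
    # Single round trip: SELECT ... WHERE id IN (...)
    return {obj.id: obj for obj in Book.objects.filter(id__in=ids)}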

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

10 reactions
franz-see commented, Mar 10, 2019

@andrewgy8 This is my workaround:

class BulkQueryMixin:
    # Cache from the composite id key to the pre-fetched model instance,
    # rebuilt once per import in before_import().
    _ids_to_objects = {}

    def before_import(self, dataset, using_transactions, dry_run, **kwargs):
        if len(dataset):
            # Positions of the import_id_fields columns in the uploaded dataset.
            header_indices = [idx for idx, header in enumerate(dataset.headers)
                              if header in self._meta.import_id_fields]
            # One list of (field, value) pairs per row.
            query_ids = [[(dataset.headers[header_idx], data[header_idx]) for header_idx in header_indices]
                         for data in dataset]
            # Build a single "(... and ...) or (... and ...)" clause covering every row.
            # Note: plain string interpolation, so only use with trusted input.
            where_clause = ' or '.join(
                '(%s)' % ' and '.join(self._to_where_condition(field, value) for field, value in query_id)
                for query_id in query_ids)
            select_query = 'select * from %s where %s' % (self._meta.model._meta.db_table, where_clause)
            # One round trip instead of one SELECT per row.
            pre_queried_objects = self._meta.model.objects.raw(select_query)
            self._ids_to_objects = self._map_out_objects(pre_queried_objects)

    def get_instance(self, instance_loader, row):
        # Look the row up in the pre-fetched cache instead of hitting the database.
        row_id = self._get_id(row)
        return self._ids_to_objects.get(row_id)

    def _map_out_objects(self, values):
        return {self._get_id(value): value for value in values}

    def _get_id(self, value):
        # Works for both row dicts (during import) and model instances (from the raw query).
        return '-'.join(str(self._get_header_value(header_id, value))
                        for header_id in self._meta.import_id_fields)

    def _get_header_value(self, header_id, value):
        header_value = value[header_id] if isinstance(value, dict) else getattr(value, header_id)
        return header_value if header_value else ''

    def _to_where_condition(self, column_name, column_value):
        if column_value:
            return "%s = '%s'" % (column_name, column_value)
        else:
            return '%s is null' % column_name

Then I use that BulkQueryMixin in my Resource to fetch all the data related to the uploaded import in one go, and on every get_instance() I just retrieve from that cached dict. The code is not pretty, but it seems to work and has greatly improved performance on my end.

But aside from that, I had to do a few more things (ranked in terms of performance gain; a sketch tying them together follows the list):

  1. Set skip_unchanged in my Resource - removes the update statements, which otherwise can be one per row
  2. Modify my Django models to override __deepcopy__ (and __copy__) - deep-copying also slows things down substantially
  3. Set report_skipped in my Resource - speeds up rendering of the confirmation page; also, with 10k records uploaded, it makes spotting the differences easier
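
For reference, a minimal sketch of how those pieces could fit together in a django-import-export resource (BookResource and Book are hypothetical names; skip_unchanged and report_skipped are Meta options of Resource):

from import_export import resources
from myapp.models import Book  # hypothetical model

class BookResource(BulkQueryMixin, resources.ModelResource):
    """Mixin first, so its before_import()/get_instance() override the defaults."""

    class Meta:
        model = Book
        import_id_fields = ('id',)
        skip_unchanged = True   # 1. skip rows that did not change, avoiding per-row updates
        report_skipped = False  # 3. don't list every skipped row on the confirmation page
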
2 reactions
franz-see commented, Mar 13, 2019

@andrewgy8 Good point. I’ll try to see if I can change it to use querysets and filters instead.
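
A rough sketch of what that queryset-based lookup could look like, replacing the hand-built SQL in before_import() with Q objects (this is an assumed direction, not code from the thread):

from functools import reduce
from operator import and_, or_

from django.db.models import Q

def build_bulk_filter(query_ids):
    """query_ids: one list of (field, value) pairs per CSV row, as built in before_import().
    Assumes query_ids is non-empty (the mixin already checks len(dataset))."""
    row_filters = []
    for query_id in query_ids:
        conditions = [
            Q(**{field: value}) if value else Q(**{'%s__isnull' % field: True})
            for field, value in query_id
        ]
        row_filters.append(reduce(and_, conditions))
    # Single query: (f1=a AND f2=b) OR (f1=c AND f2=d) OR ...
    return reduce(or_, row_filters)

# Inside before_import(), the raw SQL could then become:
#   pre_queried_objects = self._meta.model.objects.filter(build_bulk_filter(query_ids))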


Top Results From Across the Web

Why is My Database Application so Slow? - Simple Talk
A very common cause of performance problems, in our experience, is running “chatty” applications over high latency networks. A chatty ...

Chatty I/O antipattern - Performance antipatterns for cloud apps
Problem description. Network calls and other I/O operations are inherently slow compared to compute tasks. Each I/O request typically has significant overhead, ...

Urgent advise needed - Software AG Tech Community & Forums
It's been a while since I looked into it, but JDBC network traffic used to be “chatty” in small bursts. This caused performance...

SQL Server Advanced Troubleshooting and Performance ...
most problems present themselves as general performance issues: ... The Protocol Layer handles communication between SQL Server and client.

[ MATLAB + SQL Server ] Slow performance when fetching data
My problem is that the while-loop performs quite slowly. It takes up to 8-10 seconds to fetch the data. One might ask why...
