
Duplicate records when mirroring data to BigQuery

See original GitHub issue
  • Extension name: firestore-bigquery-export
  • Extension version: v0.1.12

The extension was running perfectly fine on v0.1.5, but after upgrading to v0.1.12 it started exporting duplicated records to BigQuery.

The only difference between two duplicate records is the document_id field: one of the records has a null document_id and the other has a non-null document_id.

This is a severe problem, since it has compromised many of my reports.

Weirdly, not all the records get duplicated. Sometimes they do and sometimes they don’t - and when they don’t, there’s only one record: the one with the non-null document_id. Still, this happens very frequently, and I already have thousands of duplicates.

It’s also worth mentioning that all of my extensions have been upgraded almost at the same time and that some of them are working perfectly.
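To gauge the scope of the problem, a query along these lines could surface affected rows. This is a sketch, not an official diagnostic: it assumes the extension’s default `_raw_changelog` table layout (with `document_name`, `document_id`, and `timestamp` columns), and it flags groups that contain at least one NULL `document_id` copy.

```sql
-- Sketch: find changelog entries recorded more than once for the same
-- document and timestamp, where at least one copy has a NULL document_id
-- (the duplicate pattern described above). Grouping by timestamp as well
-- avoids flagging legitimate repeat changes to the same document.
SELECT
  document_name,
  timestamp,
  COUNT(*) AS copies,
  COUNTIF(document_id IS NULL) AS null_id_copies
FROM `<your project id>.<your dataset id>.<exporter collection name>_raw_changelog`
GROUP BY document_name, timestamp
HAVING COUNT(*) > 1 AND COUNTIF(document_id IS NULL) > 0;
```

Run it as a dry read first; if the schema in your deployment differs, the column names will need adjusting.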

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
raubrey2014 commented on Mar 8, 2021

@dackers86 just as a follow up, yes I backfilled the entries to fix this discrepancy.

For anyone else running into this issue, I backfilled using the query below, but you should be very careful when doing this. The query assumes that the last 20 characters of the document_name column are the document id; there could be cases where that is not a safe assumption.

UPDATE `<your project id>.<your dataset id>.<exporter collection name>_raw_changelog` SET document_id=RIGHT(document_name, 20) WHERE document_id IS NULL;

So for me it looks something like:

UPDATE `ryans_project.firestore.comments_raw_changelog` SET document_id=RIGHT(document_name, 20) WHERE document_id IS NULL;
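As an alternative sketch that avoids the fixed-length assumption: since `document_name` is a full `/`-separated resource path ending in the document id, taking the last path segment yields the id regardless of its length. This is untested against every extension version, so verify on a table copy before running it on live data:

```sql
-- Sketch: derive document_id as the last '/'-separated segment of
-- document_name, instead of assuming a fixed 20-character id.
UPDATE `<your project id>.<your dataset id>.<exporter collection name>_raw_changelog`
SET document_id = ARRAY_REVERSE(SPLIT(document_name, '/'))[OFFSET(0)]
WHERE document_id IS NULL;
```

`SPLIT` and `ARRAY_REVERSE` are standard BigQuery functions, so this stays within native SQL and needs no external tooling.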

Hope this is helpful!

0 reactions
dackers86 commented on Jan 18, 2022

Closing as this appears to be resolved.


Top Results From Across the Web

  • Copy datasets | BigQuery - Google Cloud
    Your project can copy 1,000 tables per run to a destination dataset that is in a different region. For example, if you configure...

  • Google BigQuery There are no primary key or unique ...
    Google told us even we rerun the transfers, there will not be duplicated records. Is that bigquery transfer using the streaming? The duplicated...

  • Tips to Prevent Data Duplication in Google Big Query
    Tips to prevent data duplicates in Google BigQuery. The article presents common remedies using Dataddo ETL for preventing data duplication.

  • Chapter 4. Loading Data into BigQuery - O'Reilly
    Use JSON for small files where human readability is important. Impact of compression and staging via Google Cloud Storage. For formats such as...

  • How to Build a Unique MD5 Row Hash Using SQL in ...
    Using native BigQuery functionality to generate a dynamic, unique row identifier ... This means that your data might contain duplicate rows, ...
