EmrEtlRunner: fix srcPattern for copying stream enriched data to HDFS

Related to https://github.com/snowplow/snowplow/issues/3717

When we’re resuming from shred in stream enrich mode, S3DistCp tries to copy data from enriched/good rather than from enriched/good/run=2018..., and the step fails as a result.

Due to this bug, we can recover an R102 pipeline in stream enrich mode only by re-staging the enriched data back to enriched.stream.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

chuwy commented, Apr 16, 2018 (2 reactions)

Sorry, I was wrong: this regex handles $folder$ files and tries to move them from enriched/good to HDFS for shredding. I cannot find any evidence that S3DistCp ignores empty files, so I think it would still be better to stick with .gz.

chuwy commented, Apr 13, 2018 (1 reaction)

It turns out the fix should (or at least can) be different from the one described in the title. When we’re resuming from shred in batch-enrich mode, S3DistCp knows nothing about the run folder and also uses enriched/good, but with --srcPattern .*part-.*, which seems to handle files from subfolders. In stream-enrich mode we use --srcPattern .+, which somehow doesn’t handle subfolders in the same way.
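To make the difference between the candidate patterns concrete, here is a minimal sketch that applies each of them to a few object keys. It rests on two assumptions that are not confirmed in this thread: that --srcPattern is matched against the whole path, and that the keys look like the hypothetical ones below (the run folder name and file names are invented for illustration). It only illustrates which keys each regex would accept; it does not reproduce how S3DistCp traverses subfolders.

```python
import re

# Hypothetical object keys: the run folder name and file names are
# illustrative only and are not taken from the issue.
KEYS = [
    "enriched/good/run=2018-01-01-00-00-00/part-00000.gz",
    "enriched/good/run=2018-01-01-00-00-00_$folder$",
    "enriched/good_$folder$",
]

# Patterns mentioned in the comments: the batch-enrich pattern,
# the stream-enrich pattern, and the suggested .gz restriction.
PATTERNS = [".*part-.*", ".+", r".*\.gz"]


def selected(pattern, keys):
    """Return the keys the pattern accepts, ASSUMING whole-path matching."""
    rx = re.compile(pattern)
    return [k for k in keys if rx.fullmatch(k)]


for p in PATTERNS:
    print(p, selected(p, KEYS))
```

Under these assumptions, .+ accepts every key, including the empty $folder$ marker objects, while .*part-.* and .*\.gz accept only the data file, which is consistent with the suggestion to stick with .gz.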
