EmrEtlRunner: fix srcPattern for copying stream enriched data to HDFS
Related to https://github.com/snowplow/snowplow/issues/3717
When we're resuming from shred in stream enrich mode, S3DistCp tries to copy data from enriched/good, not from enriched/good/run=2018..., and fails because of that.
Due to this bug, the only way to recover in R102 stream enrich mode is to re-stage the enriched data back to enriched.stream.
Issue Analytics
- Created 5 years ago
- Comments: 15 (15 by maintainers)
Top GitHub Comments
Sorry, I was wrong: this regex handles $folder$ files and tries to move them from enriched/good to HDFS for shredding. I cannot find any evidence that S3DistCp ignores empty files, so I think it would still be better to stick with .gz.

Turns out the fix should (or at least can) be different from the one described in the title. When we're resuming from shred in batch-enrich mode, S3DistCp knows nothing about the run folder and also uses enriched/good, but with --srcPattern .*part-.*, which seems to handle files from subfolders. With stream-enrich mode we use --srcPattern .+, which somehow doesn't handle subfolders in the same way.
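
To make the difference between the patterns above concrete, here is a minimal Python sketch. It rests on assumptions not confirmed in this thread: that the pattern has to match each object's whole path, that the object keys look like the made-up ones below, and that "stick with .gz" would translate to something like .*\.gz. It does not reproduce how S3DistCp walks directories, so it says nothing about why .+ misses subfolders; it only shows which kinds of keys each regex would accept.

```python
import re

# Hypothetical object keys, invented for illustration only.
KEYS = [
    "enriched/good/run=example/part-00000.gz",  # batch-style output file
    "enriched/good/run=example_$folder$",       # empty S3 folder marker
    "enriched/good/stream-file.gz",             # stream-style output file
]

# .*part-.* and .+ come from the comment above;
# .*\.gz is an assumed spelling of "stick with .gz".
PATTERNS = [r".*part-.*", r".+", r".*\.gz"]


def matching_keys(pattern, keys):
    """Return the keys that the pattern matches in full."""
    return [key for key in keys if re.fullmatch(pattern, key)]


if __name__ == "__main__":
    for pattern in PATTERNS:
        print(pattern, matching_keys(pattern, KEYS))
```

Under these assumptions, .+ accepts the $folder$ marker along with the data files, while .*part-.* and the .gz-based pattern skip it.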