
Standalone ingestion using SegmentUriPush to S3 does not work


Here’s some of the setup:

# pinot controller properties.

# Requires `-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3`
#
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
# Any S3 region
pinot.controller.storage.factory.s3.region=us-west-1
# Data directory for Pinot.
controller.data.dir=s3://mybucket/myfolder/pinot

When using an ingestion spec like the following:

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: ...
outputDirURI: 's3://mybucket/myfolder/pinot'
overwriteOutput: true
pinotFSSpecs:
    - scheme: s3
      className: org.apache.pinot.plugin.filesystem.S3PinotFS
      configs:
        region: 'us-west-2'
pushJobSpec:
    # NB: This is particularly weird. It seems to be the "adjusted path"
    # that is provided to the controller; I assume that is because the
    # ingestion job URI may not be the same for the controller. (See the
    # sketch following this spec.)
    segmentUriPrefix: 's3://'
    segmentUriSuffix: ''
recordReaderSpec:
    # Dataset specific config.
tableSpec:
    # Table specific config.
pinotClusterSpecs:
    # Cluster specific config.

When running the standalone ingestion job via bin/pinot-ingestion-job.sh:

  • Segment generation is fine.
  • Data shows up on S3 as expected, and the S3PinotFS Copy log line has the correct path, but the SegmentPushUtils log lines do not, and the SegmentUriPushJobRunner fails with a 500 from the controller because the path is not found:
2020/08/17 14:45:38.719 INFO [S3PinotFS] [main] Copy /tmp/pinot-a4064eea-301d-4f24-8861-0575a73e6a0b/output/mytable_OFFLINE_1569293930_1569293987_0.tar.gz from local to s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz
2020/08/17 14:45:38.794 INFO [IngestionJobLauncher] [main] Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner
2020/08/17 14:45:38.795 INFO [PinotFSFactory] [main] Initializing PinotFS for scheme s3, classname org.apache.pinot.plugin.filesystem.S3PinotFS
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Start sending table mytable segment URIs: [s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz] to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@4e07b95f]
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Sending table mytable segment URI: s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz to location

I suspect it’s related to how the output path is constructed here, before SegmentPushUtils.sendSegmentUris is called, but I have not confirmed it:

https://github.com/apache/incubator-pinot/blob/2b58bfb520df074f691277f2ae5b01ecb5c686c2/pinot-plugins/pinot-batch-ingestion/pinot-batch-ingestion-standalone/src/main/java/org/apache/pinot/plugin/ingestion/batch/standalone/SegmentUriPushJobRunner.java#L90-L91
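If those two lines do what they appear to, the runner builds each segment URI as prefix + path + suffix, discarding the URI authority. A self-contained sketch of that suspected construction (paraphrased, not the verbatim source; the class and method names are hypothetical):

import java.net.URI;

// Sketch of the suspected construction in SegmentUriPushJobRunner
// (paraphrased, not verbatim; names here are hypothetical).
public class SegmentUriConstruction {
  static String buildSegmentUri(String outputFile, String prefix, String suffix) {
    // getRawPath() returns only "/myfolder/pinot/<segment>.tar.gz"; the
    // bucket lives in the URI authority and is dropped here.
    return prefix + URI.create(outputFile).getRawPath() + suffix;
  }

  public static void main(String[] args) {
    String file =
        "s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz";
    System.out.println(buildSegmentUri(file, "s3://", ""));
    // -> s3:///myfolder/pinot/...  (matches the failing URI in the logs above)
    System.out.println(buildSegmentUri(file, "s3://mybucket", ""));
    // -> s3://mybucket/myfolder/pinot/...  (a valid URI again)
  }
}

If that reading is right, a config-side workaround would be to set segmentUriPrefix to 's3://mybucket' (scheme plus bucket) so that prefix + path reconstitutes the full location, though arguably the runner should just use the file URI as-is.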

It is also unclear whether the same issue would occur with the Hadoop/Spark SegmentUri push jobs.

Issue Analytics

  • State: closed
  • Created: Aug 17, 2020
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
KKcorps commented, Aug 17, 2020

Looking into it @lgo

0 reactions
xiangfu0 commented, Aug 28, 2020
