Standalone ingestion using SegmentUriPush to S3 does not work
Here's some of the setup:
# pinot controller properties.
# Requires `-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3`
#
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
# Any S3 region
pinot.controller.storage.factory.s3.region=us-west-1
# Data directory for Pinot.
controller.data.dir=s3://mybucket/myfolder/pinot
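For reference, an s3:// location splits into a bucket (the URI authority) and a key path, which matters for the behavior described further down. A trivial java.net.URI check using the bucket and folder names from the config above:

import java.net.URI;

public class DataDirUriCheck {
  public static void main(String[] args) {
    // The controller.data.dir value from the properties above.
    URI dataDir = URI.create("s3://mybucket/myfolder/pinot");
    System.out.println(dataDir.getScheme()); // s3
    System.out.println(dataDir.getHost());   // mybucket -- the bucket is the URI authority
    System.out.println(dataDir.getPath());   // /myfolder/pinot -- the bucket is not part of the path
  }
}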
When using an ingestion spec like the following:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: ...
outputDirURI: 's3://mybucket/myfolder/pinot'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'
pushJobSpec:
  # NB: This is particularly weird. Specifically, this seems
  # to be the "adjusted path" that is provided to the controller. I assume
  # that is because the ingestion job URI may not be the same for a
  # controller?
  segmentUriPrefix: 's3://'
  segmentUriSuffix: ''
recordReaderSpec:
  # Dataset-specific config.
tableSpec:
  # Table-specific config.
pinotClusterSpecs:
  # Cluster-specific config.
When running the standalone ingestion job via bin/pinot-ingestion-job.sh:
- Segment generation is fine.
- Data shows up on S3 as expected, and the Copy log line from S3PinotFS has the correct path, but SegmentPushUtils does not; SegmentUriPushJobRunner then fails with a 500 from the controller because the path is not found.
2020/08/17 14:45:38.719 INFO [S3PinotFS] [main] Copy /tmp/pinot-a4064eea-301d-4f24-8861-0575a73e6a0b/output/mytable_OFFLINE_1569293930_1569293987_0.tar.gz from local to s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz
2020/08/17 14:45:38.794 INFO [IngestionJobLauncher] [main] Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner
2020/08/17 14:45:38.795 INFO [PinotFSFactory] [main] Initializing PinotFS for scheme s3, classname org.apache.pinot.plugin.filesystem.S3PinotFS
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Start sending table mytable segment URIs: [s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz] to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@4e07b95f]
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Sending table mytable segment URI: s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz to location
I suspect it's related to how the output path is constructed before SegmentPushUtils.sendSegmentUris is called, but I have not confirmed it.
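To illustrate the suspicion (a standalone sketch, not the actual SegmentPushUtils or job-runner code): if the push step keeps only the path component of the output file's URI and prepends segmentUriPrefix, the bucket is dropped and the result is exactly the s3:/// URI from the log above:

import java.net.URI;

public class PushUriSketch {
  public static void main(String[] args) {
    // Where the segment actually landed, per the S3PinotFS "Copy ... from local" log line.
    URI outputFile = URI.create(
        "s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz");

    String segmentUriPrefix = "s3://"; // from pushJobSpec in the spec above
    String segmentUriSuffix = "";      // from pushJobSpec in the spec above

    // Hypothetical construction: prefix + path-only component + suffix.
    // getPath() is "/myfolder/pinot/..." -- the bucket (the URI authority) is not included.
    String pushedUri = segmentUriPrefix + outputFile.getPath() + segmentUriSuffix;

    // Prints: s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz
    // which matches the URI that SegmentPushUtils logs and the controller then fails to resolve.
    System.out.println(pushedUri);
  }
}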
It is also not clear whether the same issue would happen with the Hadoop/Spark SegmentUri push jobs.
Top GitHub Comments
Looking into it @lgo
Here is also a doc for using s3 as deep store: https://docs.pinot.apache.org/users/tutorials/use-s3-as-deep-store-for-pinot