Standalone ingestion using SegmentUriPush to S3 does not work
Here's some of the setup:
# pinot controller properties.
# Requires `-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3`
#
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
# Any S3 region
pinot.controller.storage.factory.s3.region=us-west-1
# Data directory for Pinot.
controller.data.dir=s3://mybucket/myfolder/pinot
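For reference, an s3:// location splits into a bucket (the URI authority) and a key path, which matters for the behavior described further down. A trivial java.net.URI check using the bucket and folder names from the config above:

import java.net.URI;

public class DataDirUriCheck {
  public static void main(String[] args) {
    // The controller.data.dir value from the properties above.
    URI dataDir = URI.create("s3://mybucket/myfolder/pinot");
    System.out.println(dataDir.getScheme()); // s3
    System.out.println(dataDir.getHost());   // mybucket -- the bucket is the URI authority
    System.out.println(dataDir.getPath());   // /myfolder/pinot -- the bucket is not part of the path
  }
}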
When using an ingestion spec like the following:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: ...
outputDirURI: 's3://mybucket/myfolder/pinot'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'
pushJobSpec:
  # NB: This is particularly weird. Specifically, this seems
  # to be the "adjusted path" that is provided to the controller. I assume
  # that is because the ingestion job URI may not be the same for a
  # controller?
  segmentUriPrefix: 's3://'
  segmentUriSuffix: ''
recordReaderSpec:
  # Dataset-specific config.
tableSpec:
  # Table-specific config.
pinotClusterSpecs:
  # Cluster-specific config.
When running the standalone ingestion job via bin/pinot-ingestion-job.sh:
- Segment generation is fine.
- Data shows up on S3 as expected, and the Copy log line from S3PinotFS has the correct path, but SegmentPushUtils does not; SegmentUriPushJobRunner then fails with a 500 from the controller because the path is not found.
2020/08/17 14:45:38.719 INFO [S3PinotFS] [main] Copy /tmp/pinot-a4064eea-301d-4f24-8861-0575a73e6a0b/output/mytable_OFFLINE_1569293930_1569293987_0.tar.gz from local to s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz
2020/08/17 14:45:38.794 INFO [IngestionJobLauncher] [main] Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner
2020/08/17 14:45:38.795 INFO [PinotFSFactory] [main] Initializing PinotFS for scheme s3, classname org.apache.pinot.plugin.filesystem.S3PinotFS
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Start sending table mytable segment URIs: [s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz] to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@4e07b95f]
2020/08/17 14:45:38.920 INFO [SegmentPushUtils] [main] Sending table mytable segment URI: s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz to location
I suspect it's related to how the output path is constructed before SegmentPushUtils.sendSegmentUris is called, but I have not confirmed it.
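To illustrate the suspicion (a standalone sketch, not the actual SegmentPushUtils or job-runner code): if the push step keeps only the path component of the output file's URI and prepends segmentUriPrefix, the bucket is dropped and the result is exactly the s3:/// URI from the log above:

import java.net.URI;

public class PushUriSketch {
  public static void main(String[] args) {
    // Where the segment actually landed, per the S3PinotFS "Copy ... from local" log line.
    URI outputFile = URI.create(
        "s3://mybucket/myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz");

    String segmentUriPrefix = "s3://"; // from pushJobSpec in the spec above
    String segmentUriSuffix = "";      // from pushJobSpec in the spec above

    // Hypothetical construction: prefix + path-only component + suffix.
    // getPath() is "/myfolder/pinot/..." -- the bucket (the URI authority) is not included.
    String pushedUri = segmentUriPrefix + outputFile.getPath() + segmentUriSuffix;

    // Prints: s3:///myfolder/pinot/mytable_OFFLINE_1569293930_1569293987_0.tar.gz
    // which matches the URI that SegmentPushUtils logs and the controller then fails to resolve.
    System.out.println(pushedUri);
  }
}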
It is also not clear whether the same issue would happen with the Hadoop/Spark SegmentUri push jobs.
Top GitHub Comments
Looking into it @lgo
Here is also a doc for using s3 as deep store: https://docs.pinot.apache.org/users/tutorials/use-s3-as-deep-store-for-pinot