SegmentNameGenerator: Extend interface to accept input file name

Following Slack discussion: https://apache-pinot.slack.com/archives/C011C9JHN7R/p1624443889243500

Use case: Some Pinot tables are used in conjunction with IdSet filtering or Lookups, and in some cases they don’t have a time column.

  • Hence, the existing segment name generation strategies (i.e. time-based and fixed) do not allow for simple segment replacement whenever the data in those “dimension” tables changes (say, the data for ID<X> changed).

Proposition: Allow segment name generation to be based on the input file names, so that segments can be named after a user-provided ID found in the file names.

E.g.

basedir/id1/file.parquet
basedir/id2/file.parquet

Would generate segments

<table_name>_id1.segment
<table_name>_id2.segment

Currently, the SegmentNameGenerator interface doesn’t accept input file names, so it is not possible to implement a strategy like the one presented above.
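To make the proposal concrete, here is a minimal sketch of the mapping from input file path to segment name. The class and method names are hypothetical and are not part of Pinot or of the existing SegmentNameGenerator interface. It assumes the ID is the name of the file's parent directory, as in the example above, and it leaves out the `.segment` suffix.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not Pinot code.
class InputFileSegmentNameSketch {

    // Builds "<tableName>_<parentDirectoryName>" from an input file path.
    static String segmentNameFor(String tableName, String inputFilePath) {
        Path parent = Paths.get(inputFilePath).getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException(
                "Input file has no parent directory to use as an ID: " + inputFilePath);
        }
        return tableName + "_" + parent.getFileName();
    }

    public static void main(String[] args) {
        // prints "myTable_id1"
        System.out.println(segmentNameFor("myTable", "basedir/id1/file.parquet"));
        // prints "myTable_id2"
        System.out.println(segmentNameFor("myTable", "basedir/id2/file.parquet"));
    }
}
```

Because the name depends only on the table name and the ID in the path, re-ingesting `basedir/id1/file.parquet` would yield the same segment name, which is what makes a simple segment replacement possible.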

Note: If you know of any other way to achieve this use case, please feel free to share ideas!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 16 (12 by maintainers)

Top GitHub Comments

1 reaction
kkrugler commented, Jul 27, 2021

Hi @MrNeocore - yes, the scope of this issue is mapping n input files to n segments. I was pointing out that there’s one other place in the code where segments are created (SegmentProcessorFramework), but that code is (a) out of scope, and (b) doesn’t currently use the SegmentNameGenerator support in any case, though there’s a TODO comment in the code about that.

1 reaction
kkrugler commented, Jul 6, 2021

How about being able to define things via:

    "filePathPattern":".+/(.+)/.+\\.parquet",
    "segmentName":"${tableName}_${filePathPattern:\\1}.parquet"

or, for our use case:

    "filePathPattern":".+/(.+)\\.csv",
    "segmentName":"${filePathPattern:\\1}"

We could also support formatting of the input dates via say ${minTimeValue:yyyy-MM}, but maybe that’s a bridge too far.
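As a rough illustration of how the `filePathPattern` / `segmentName` pair suggested above might be evaluated, the sketch below matches the input path against the regex and substitutes `${tableName}` and `${filePathPattern:\N}` placeholders in the template. The class name, method name, and substitution rules are assumptions made for this example; nothing here is an existing Pinot API, and the `${minTimeValue:...}` idea is not covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed template expansion; not Pinot code.
class PatternSegmentNameSketch {

    // Matches placeholders of the form ${filePathPattern:\1}, capturing the group number.
    private static final Pattern GROUP_REF =
        Pattern.compile("\\$\\{filePathPattern:\\\\(\\d+)\\}");

    static String render(String tableName, String filePathPattern,
                         String template, String inputFilePath) {
        Matcher path = Pattern.compile(filePathPattern).matcher(inputFilePath);
        if (!path.matches()) {
            throw new IllegalArgumentException("Path does not match pattern: " + inputFilePath);
        }
        String withTable = template.replace("${tableName}", tableName);
        Matcher ref = GROUP_REF.matcher(withTable);
        StringBuffer out = new StringBuffer();
        while (ref.find()) {
            // Replace each placeholder with the text captured by that regex group.
            String captured = path.group(Integer.parseInt(ref.group(1)));
            ref.appendReplacement(out, Matcher.quoteReplacement(captured));
        }
        ref.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "myTable_id1.parquet"
        System.out.println(render("myTable", ".+/(.+)/.+\\.parquet",
            "${tableName}_${filePathPattern:\\1}.parquet", "basedir/id1/file.parquet"));
        // prints "us-2021"
        System.out.println(render("myTable", ".+/(.+)\\.csv",
            "${filePathPattern:\\1}", "data/us-2021.csv"));
    }
}
```

Note that the Java string literals carry the same escaping as the JSON values: `\\.` is the regex `\.`, and `\\1` is the literal text `\1` inside the placeholder.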
