Better documentation for pipelines

Feature request

The introduction to the pipelines documentation does not explain how additional parameters can be passed to the tokenizer during the preprocessing step. After walking through the source code, I can see that when instantiating a pipeline via transformers.pipeline(...) one can simply pass these arguments in as keyword arguments, but this is not documented anywhere, nor is it included in any examples.

This request is to have the documentation updated so that future users don’t need to read the source code. The update should go beyond tokenization, since the same keyword-argument mechanism also handles post-processing and other steps.

Motivation

A tokenizer is very often not called with its default arguments: padding, max length, and so on are frequently changed. The pipeline implementation actually makes setting these arguments very simple, but because this is not documented, it is difficult to take advantage of.

Your contribution

I can contribute to the documentation if needed.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
DIvkov575 commented, Oct 11, 2022

@Narsil @stevhliu @sgugger could I get assigned to this?

1 reaction
rhelmeczi commented, Oct 13, 2022

@Narsil Your suggestions are very helpful.

Adding separate documentation for each pipeline makes sense. For example, in the TextClassificationPipeline the keyword arguments are both keyword arguments for the tokenizer’s __call__ function and keyword arguments for the postprocess function. I think even a brief statement along the lines of (but not necessarily identical to):

keyword arguments passed to TextClassificationPipeline.tokenizer.__call__ and TextClassificationPipeline.postprocess

where the function names are clickable would be extremely helpful. Simply pointing to the recipient functions also makes this a beginner friendly task. I’m assuming of course that for each pipeline, the keyword arguments are only ever passed along to other functions.

Getting caught up on the documentation should probably be done over several commits: adding one commit at a time for each of the specific pipelines will be much easier to review. That’s just my two cents, though.

@DIvkov575 Keeping in mind that I’m not a maintainer of this repository, and that my suggestions above will therefore not necessarily be accepted, feel free to add documentation if you feel up to it.
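
As a rough illustration of the keyword-argument routing discussed in this thread, here is a minimal, self-contained Python sketch. It does not import or reproduce any transformers code: ToyTokenizer, ToyClassificationPipeline, and the top_k postprocess option are hypothetical stand-ins, and the _sanitize_parameters method here is only a simplified assumption about how keyword arguments might be split between the tokenizer call and postprocess. Check the real behaviour against the source of each pipeline.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a real tokenizer."""

    def __call__(
        self,
        text: str,
        truncation: bool = False,
        max_length: Optional[int] = None,
    ) -> List[str]:
        tokens = text.split()
        # Only truncate when explicitly asked to and a limit is given.
        if truncation and max_length is not None:
            tokens = tokens[:max_length]
        return tokens


class ToyClassificationPipeline:
    """Hypothetical pipeline that forwards keyword arguments to its steps."""

    # Assumption: these keys belong to postprocess; everything else is
    # treated as a tokenizer keyword argument.
    POSTPROCESS_KEYS = {"top_k"}

    def __init__(self, tokenizer: ToyTokenizer, **kwargs: Any) -> None:
        self.tokenizer = tokenizer
        # Keyword arguments given at construction become defaults.
        self._pre_defaults, self._post_defaults = self._sanitize_parameters(**kwargs)

    def _sanitize_parameters(
        self, **kwargs: Any
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        post = {k: v for k, v in kwargs.items() if k in self.POSTPROCESS_KEYS}
        pre = {k: v for k, v in kwargs.items() if k not in self.POSTPROCESS_KEYS}
        return pre, post

    def preprocess(self, text: str, **tokenizer_kwargs: Any) -> List[str]:
        return self.tokenizer(text, **tokenizer_kwargs)

    def postprocess(self, tokens: List[str], top_k: int = 1) -> List[str]:
        # Toy "scoring": rank tokens by length and keep the top_k longest.
        return sorted(tokens, key=len, reverse=True)[:top_k]

    def __call__(self, text: str, **kwargs: Any) -> List[str]:
        pre, post = self._sanitize_parameters(**kwargs)
        # Call-time keyword arguments override construction-time defaults.
        pre = {**self._pre_defaults, **pre}
        post = {**self._post_defaults, **post}
        return self.postprocess(self.preprocess(text, **pre), **post)


TEXT = "pipelines forward keyword arguments to the tokenizer"

# Tokenizer arguments supplied once, when the pipeline is created.
pipe = ToyClassificationPipeline(ToyTokenizer(), truncation=True, max_length=3)
print(pipe(TEXT))           # uses the construction-time tokenizer defaults
print(pipe(TEXT, top_k=2))  # a postprocess argument supplied at call time
```

The point of the sketch is the documentation gap raised above: a caller only sees **kwargs, so without a sentence naming the recipient functions there is no way to tell that truncation and max_length end up in the tokenizer call while top_k ends up in postprocess.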
