Better documentation for pipelines

Feature request

The introduction to the pipelines documentation does not explain how additional parameters can be passed to the tokenizer during the preprocessing step. After walking through the source code, I can see that when instantiating a pipeline via transformers.pipeline(...), one can simply pass these arguments in as keyword arguments. However, this is not documented anywhere, and it is not included in any examples.

This request is to have the documentation updated so future users don’t need to read the source code. The update should cover more than tokenization, since the same keyword arguments also reach other steps (post_processing, etc.).
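
To make the behaviour described above concrete, here is a minimal, self-contained sketch of the general pattern: keyword arguments given to a pipeline are split up and forwarded to the tokenizer call and to the postprocessing step. It is a toy model, not the transformers implementation. `ToyTokenizer`, `ToyPipeline`, `_split_kwargs`, and the `top_k` option are all hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not the transformers API."""

    def __call__(self, text: str, max_length: int = 8, padding: bool = False) -> List[str]:
        # Whitespace "tokenization", truncated to max_length.
        tokens = text.split()[:max_length]
        if padding:
            # Pad up to max_length with a placeholder token.
            tokens += ["[PAD]"] * (max_length - len(tokens))
        return tokens


class ToyPipeline:
    """Toy pipeline that routes keyword arguments to its processing steps."""

    # Assumption for this sketch: these keys belong to postprocess,
    # and every other keyword argument goes to the tokenizer.
    POSTPROCESS_KEYS = {"top_k"}

    def __init__(self, tokenizer: ToyTokenizer, **kwargs: Any) -> None:
        self.tokenizer = tokenizer
        # Keyword arguments given at construction time become defaults.
        self._pre_defaults, self._post_defaults = self._split_kwargs(kwargs)

    def _split_kwargs(self, kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        post = {k: v for k, v in kwargs.items() if k in self.POSTPROCESS_KEYS}
        pre = {k: v for k, v in kwargs.items() if k not in self.POSTPROCESS_KEYS}
        return pre, post

    def preprocess(self, text: str, **tokenizer_kwargs: Any) -> List[str]:
        return self.tokenizer(text, **tokenizer_kwargs)

    def postprocess(self, tokens: List[str], top_k: Optional[int] = None) -> List[str]:
        return tokens if top_k is None else tokens[:top_k]

    def __call__(self, text: str, **kwargs: Any) -> List[str]:
        # Call-time keyword arguments override the construction-time defaults.
        call_pre, call_post = self._split_kwargs(kwargs)
        pre = {**self._pre_defaults, **call_pre}
        post = {**self._post_defaults, **call_post}
        return self.postprocess(self.preprocess(text, **pre), **post)


pipe = ToyPipeline(ToyTokenizer(), max_length=4, padding=True)
padded = pipe("hello world")            # tokenizer receives max_length and padding
trimmed = pipe("hello world", top_k=1)  # top_k is routed to postprocess instead
```

Nothing in the call site reveals which step consumes `max_length` or `top_k`, which is why this kind of routing is hard to discover unless the documentation spells it out.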

Motivation

A tokenizer is very often not called with its default arguments: padding, max length, and similar settings are frequently changed. The pipeline implementation actually makes setting these arguments very simple, but because this is not communicated, it is difficult to take advantage of.

Your contribution

I can contribute to the documentation if needed.

Issue Analytics

Created a year ago
Reactions: 1
Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions

DIvkov575 commented, Oct 11, 2022

@Narsil @stevhliu @sgugger could I get assigned to this?

1 reaction

rhelmeczi commented, Oct 13, 2022

@Narsil Your suggestions are very helpful.

Adding separate documentation for each pipeline makes sense. For example, in the TextClassificationPipeline, the keyword arguments include both arguments for the tokenizer’s __call__ function and arguments for the postprocess function. I think even a brief statement along the lines of (but not necessarily identical to):

keyword arguments passed to TextClassificationPipeline.tokenizer.__call__ and TextClassificationPipeline.postprocess

where the function names are clickable would be extremely helpful. Simply pointing to the recipient functions also makes this a beginner-friendly task. I’m assuming, of course, that for each pipeline, the keyword arguments are only ever passed along to other functions.
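
As a rough illustration of that suggestion, the sketch below puts such a statement into a docstring and adds a small helper that reports which recipient function accepts a given keyword argument. `SketchTokenizer`, `SketchPipeline`, and `recipients` are hypothetical names for this example only, and the signatures are assumptions rather than the real transformers ones.

```python
import inspect
from typing import List, Optional


class SketchTokenizer:
    """Hypothetical tokenizer with a couple of commonly changed options."""

    def __call__(self, text: str, truncation: bool = False, max_length: Optional[int] = None) -> List[str]:
        tokens = text.split()
        if truncation and max_length is not None:
            tokens = tokens[:max_length]
        return tokens


class SketchPipeline:
    """Hypothetical pipeline used to show the proposed docstring wording.

    Args:
        **kwargs: Keyword arguments passed to `SketchPipeline.tokenizer.__call__`
            and `SketchPipeline.postprocess`.
    """

    def __init__(self, tokenizer: SketchTokenizer) -> None:
        self.tokenizer = tokenizer

    def postprocess(self, outputs: List[str], top_k: int = 1) -> List[str]:
        return outputs[:top_k]


def recipients(pipe: SketchPipeline, name: str) -> List[str]:
    """Return the labels of the recipient functions whose signature accepts `name`."""
    targets = {
        "tokenizer.__call__": pipe.tokenizer.__call__,
        "postprocess": pipe.postprocess,
    }
    return [
        label
        for label, fn in targets.items()
        if name in inspect.signature(fn).parameters
    ]


pipe = SketchPipeline(SketchTokenizer())
max_length_recipients = recipients(pipe, "max_length")
top_k_recipients = recipients(pipe, "top_k")
```

With the recipients named in the docstring, a reader can look up the accepted arguments on those functions directly instead of tracing the call chain.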

Getting caught up on the documentation should probably be done over several commits: adding one commit for each specific pipeline will be much easier to review. That’s just my two cents, though.

@DIvkov575 Keep in mind that I’m not a maintainer of this repository, so my suggestions above will not necessarily be accepted. That said, feel free to add documentation if you feel up to it.