Handling multiple fields of the custom input data in the preprocess_data.py
Describe the bug
The preprocess_data script expects a “text” column in the JSON input regardless of the json-keys passed in the arguments. This is because lmd.Reader(fname).stream_data() expects a “text” column in the JSON input.
To Reproduce
Run the script with a custom input file whose fields are named something other than “text”.
Expected behavior
Preprocessing should extract the JSON elements specified by the given json-keys.
Proposed solution
Modify lm_dataformat to accept a parameter that specifies which JSON key to read.
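A minimal sketch of the behavior asked for above, written against the standard library rather than lm_dataformat. The helper name iter_jsonl_fields and its json_keys parameter are assumptions made for illustration; they are not part of lm_dataformat or preprocess_data.py. It reads JSONL records and keeps only the requested keys instead of assuming a “text” field.

```python
import io
import json
from typing import Dict, Iterable, Iterator, Sequence


def iter_jsonl_fields(
    lines: Iterable[str], json_keys: Sequence[str]
) -> Iterator[Dict[str, str]]:
    """Yield one dict per JSONL record, restricted to the requested keys.

    Hypothetical helper for illustration only; not an lm_dataformat API.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the requested keys that are present in this record.
        selected = {k: record[k] for k in json_keys if k in record}
        if selected:
            yield selected


# Two records with "abstract" / "full_text" fields and no "text" field.
sample = io.StringIO(
    '{"abstract": "Short summary.", "full_text": "Long body."}\n'
    '{"abstract": "Another summary."}\n'
)
docs = list(iter_jsonl_fields(sample, ["abstract", "full_text"]))
```

Records that lack some of the requested keys are returned with only the keys they do have, so a missing field does not raise.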
Error
  File "tools/preprocess_data.py", line 193, in <module>
    main()
  File "tools/preprocess_data.py", line 163, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "tools/preprocess_data.py", line 143, in <genexpr>
    encoded_docs = (encoder.encode(doc) for doc in fin)
  File "tools/preprocess_data.py", line 120, in yield_from_files
    yield from yielder(fname, semaphore)
  File "tools/preprocess_data.py", line 113, in yielder
    for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
    yield from self._stream_data(get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
    yield from self.read_jsonl(f, get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
    yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
    text = ob['text']
KeyError: 'text'
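The failing lookup in the last frame can be reproduced in isolation. The sketch below is a stand-alone illustration, not the lm_dataformat source: get_text_hardcoded mirrors the hard-coded ob['text'] access from the traceback, and get_field is a hypothetical variant in which the caller chooses the key.

```python
import json

# A record with no "text" field, as in the custom input described above.
record = json.loads('{"abstract": "Short summary."}')


def get_text_hardcoded(ob):
    # Mirrors the lookup shown in the last traceback frame.
    return ob["text"]


def get_field(ob, key="text"):
    # Hypothetical alternative: the key is a parameter, defaulting to "text".
    return ob[key]


try:
    get_text_hardcoded(record)
    raised = False
    missing_key = None
except KeyError as err:
    raised = True
    missing_key = err.args[0]  # name of the key that was not found

abstract = get_field(record, key="abstract")
```

Keeping "text" as the default would leave existing callers unaffected while letting custom inputs name their own field.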
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and I have opened a PR in that repo to update the requirements.

@StellaAthena For example, we have multiple sections (e.g., abstracts and full texts) extracted from scientific articles that are stored in the same jsonl file, and we plan to train on them in parallel.