Handling multiple fields of the custom input data in the preprocess_data.py


Describe the bug
The preprocess_data script expects a “text” column in the JSON input regardless of the json-keys passed as arguments. This is because lmd.Reader(fname).stream_data() expects a “text” key in every JSON record.

To Reproduce
Run the script on a custom input file whose fields are named something other than “text”.

Expected behavior
Preprocessing should extract the JSON elements named by the json-keys argument.

Proposed solution
Modify lm_dataformat to accept a parameter specifying which JSON key to read.
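As a rough illustration of the expected behavior, here is a minimal standard-library sketch that pulls only the requested keys out of each JSONL record. It does not use or modify lm_dataformat; the function name `iter_jsonl_fields` and the sample field names `abstract` and `full_text` are hypothetical and chosen only for the example.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_jsonl_fields(lines: Iterable[str], json_keys: List[str]) -> Iterator[Dict[str, str]]:
    """Yield a {key: value} dict per JSONL record, limited to the requested keys.

    Hypothetical helper: keys absent from a record are skipped rather than
    raising KeyError, and blank lines are ignored.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {key: record[key] for key in json_keys if key in record}


# Example input: two records, the second of which lacks "full_text".
sample = [
    '{"abstract": "A", "full_text": "B"}',
    "",
    '{"abstract": "C"}',
]
extracted = list(iter_jsonl_fields(sample, ["abstract", "full_text"]))
```

Skipping missing keys is one possible design choice; a stricter variant could raise an error naming the offending record instead.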

Error

  File "tools/preprocess_data.py", line 193, in <module>
    main()
  File "tools/preprocess_data.py", line 163, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "tools/preprocess_data.py", line 143, in <genexpr>
    encoded_docs = (encoder.encode(doc) for doc in fin)
  File "tools/preprocess_data.py", line 120, in yield_from_files
    yield from yielder(fname, semaphore)
  File "tools/preprocess_data.py", line 113, in yielder
    for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
    yield from self._stream_data(get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
    yield from self.read_jsonl(f, get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
    yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
    text = ob['text']
KeyError: 'text'
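
The traceback ends in a plain dictionary lookup on 'text', so any record without that key triggers the failure. The following is a small diagnostic sketch, assuming the input is JSONL with one object per line; `records_missing_key` is a hypothetical helper and is not part of preprocess_data.py or lm_dataformat. It reports which lines would raise the KeyError before a long preprocessing run is started.

```python
import json
from typing import Iterable, List


def records_missing_key(lines: Iterable[str], key: str = "text") -> List[int]:
    """Return 1-based line numbers of JSONL records that lack `key`.

    Hypothetical pre-flight check: blank lines are skipped, and every
    remaining line is assumed to hold a JSON object.
    """
    missing = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        if key not in json.loads(line):
            missing.append(lineno)
    return missing


# Example input: the second record has no "text" field.
sample = ['{"text": "a"}', '{"abstract": "b"}']
missing_text = records_missing_key(sample)
```

To check a real file, the same function can be called on an open file handle, for example `records_missing_key(open(path, encoding="utf-8"))`, where `path` is a placeholder for the input file.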

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
StellaAthena commented, Nov 10, 2021

@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and opened a PR in that repo to update the requirements.

0 reactions
sameeravithana commented, Nov 6, 2021

@StellaAthena For example, we have multiple sections (e.g., abstracts, full texts) extracted from scientific articles that we store within the same jsonl file, and we plan to train on them in parallel.
