Handling multiple fields of the custom input data in the preprocess_data.py

Describe the bug: The preprocess_data script expects a “text” key in every JSON record of the input, regardless of the json-keys passed in the arguments. This is because lmd.Reader(fname).stream_data() itself expects a “text” key in each record.

To Reproduce: Run the script on a custom input file whose records use fields other than “text”.

Expected behavior: Preprocessing should extract the JSON elements named by the json-keys argument.

Proposed solution: Modify lm_dataformat to accept a parameter naming the JSON key to read.
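As a rough illustration of the requested behavior, here is a minimal standard-library sketch that reads JSONL records and yields only the requested keys. The helper name iter_jsonl_fields and the sample field names (abstract, full_text) are hypothetical; this is not lm_dataformat's API and not the fix that was eventually merged.

```python
import io
import json
from typing import Iterable, Iterator, TextIO, Tuple


def iter_jsonl_fields(lines: TextIO, json_keys: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) for every requested key present in each JSONL record.

    Hypothetical helper for illustration only; not part of lm_dataformat.
    """
    keys = list(json_keys)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for key in keys:
            value = record.get(key)
            if value:  # skip keys that are missing or empty in this record
                yield key, value


# Two sample records; the second has no "full_text" field.
sample = io.StringIO(
    '{"abstract": "A short summary.", "full_text": "The whole article."}\n'
    '{"abstract": "Another summary."}\n'
)
pairs = list(iter_jsonl_fields(sample, ["abstract", "full_text"]))
```

Using record.get(key) rather than record[key] means a record that lacks one of the requested fields is skipped for that field instead of aborting the whole run.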

Error

  File "tools/preprocess_data.py", line 193, in <module>
    main()
  File "tools/preprocess_data.py", line 163, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "tools/preprocess_data.py", line 143, in <genexpr>
    encoded_docs = (encoder.encode(doc) for doc in fin)
  File "tools/preprocess_data.py", line 120, in yield_from_files
    yield from yielder(fname, semaphore)
  File "tools/preprocess_data.py", line 113, in yielder
    for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
    yield from self._stream_data(get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
    yield from self.read_jsonl(f, get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
    yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
    text = ob['text']
KeyError: 'text'
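
The last frame of the traceback is an unconditional ob['text'] lookup. The snippet below imitates that lookup with the standard library only, to show why a record without a “text” field fails. The function read_text and the sample record are assumptions made for the demonstration, not code from lm_dataformat.

```python
import json

# A record that uses a custom field instead of "text" (hypothetical sample).
line = '{"abstract": "A short summary."}'
ob = json.loads(line)


def read_text(record):
    # Same kind of unconditional lookup as the last traceback frame.
    return record["text"]


try:
    read_text(ob)
    raised = None
except KeyError as err:
    raised = err.args[0]  # name of the missing key

# A tolerant lookup returns None instead of raising.
safe = ob.get("text")
```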

Issue Analytics

State:
Created 2 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

StellaAthena commented, Nov 10, 2021

@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and opened a PR in that repo to update the requirements.

sameeravithana commented, Nov 6, 2021

@StellaAthena For example, we have multiple sections (e.g., abstracts and full texts) extracted from scientific articles and stored in the same jsonl file, and we plan to train on them in parallel.
