How to convert a dict generator into a Hugging Face dataset
Description
Hey there, I have used seqio to get a well-distributed mixture of samples from multiple datasets. However, the resulting output from seqio is a Python generator of dicts, which I cannot turn back into a Hugging Face dataset.
The generator contains all the samples needed for training the model, but I cannot convert it into a Hugging Face dataset.
The code looks like this:
for ex in seqio_data:
    print(ex["text"])
I need to convert seqio_data (a generator) into a Hugging Face dataset.
The complete seqio code is below:
import functools

import seqio
import tensorflow as tf
import t5.data
from datasets import load_dataset
from t5.data import postprocessors
from t5.data import preprocessors
from t5.evaluation import metrics
from seqio import FunctionDataSource, utils

TaskRegistry = seqio.TaskRegistry


def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    # Load the Hugging Face dataset and yield one value of the text column at a time.
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]


def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    # Wrap the Python generator in a tf.data.Dataset of scalar strings.
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name),
    )


@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map."""
    return {**key_map, target_key: x}


dataset_name = 'oscar-corpus/OSCAR-2109'
subset = 'mr'
dataset_params = {"path": dataset_name, "language": subset, "use_auth_token": True}
dataset_shapes = None

TaskRegistry.add(
    "oscar_marathi_corpus",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
        splits=("train", "validation"),
        caching_permitted=False,
        num_input_examples=dataset_shapes,
    ),
    preprocessors=[
        functools.partial(
            target_to_key,
            key_map={"targets": None},
            target_key="targets",
        )
    ],
    output_features={"targets": seqio.Feature(vocabulary=seqio.PassThroughVocabulary, add_eos=False, dtype=tf.string, rank=0)},
    metric_fns=[],
)

dataset = seqio.get_mixture_or_task("oscar_marathi_corpus").get_dataset(
    sequence_length=None,
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42,
)

for _, ex in zip(range(5), dataset):
    print(ex['targets'].numpy().decode())
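For what the asker is after, here is a minimal sketch that wraps the seqio output above into a Hugging Face dataset with Dataset.from_generator (available in datasets 2.4+). The hf_generator name, the take(10_000) bound, and the "text" column name are arbitrary choices for illustration; some bound is needed because gen_dataset above repeats indefinitely:

from datasets import Dataset

def hf_generator():
    # The seqio pipeline above repeats forever, so bound the iteration explicitly.
    for ex in dataset.take(10_000).as_numpy_iterator():
        # Scalar tf.string tensors come back as bytes; decode them for the HF dataset.
        yield {"text": ex["targets"].decode("utf-8")}

hf_dataset = Dataset.from_generator(hf_generator)
print(hf_dataset)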
Top Results From Across the Web
- Convert a list of dictionaries to a Hugging Face dataset object: "I think the easiest way would be datasets.Dataset.from_pandas(pd.DataFrame(data=data))."
- How to wrap a generator with HF dataset: "Hi! Right now to do this you have to define your dataset using a dataset script, in which you can define your..."
- Load - Hugging Face: "A dataset without a loading script by default loads all the data into the train split. ... Create a dataset from a Python..."
- Source code for datasets.dataset_dict - Hugging Face: "The transform is set for every dataset in the dataset dictionary. As datasets.Dataset.set_format, this can be reset using datasets..."
- Main classes - Hugging Face: "Create DatasetInfo from the JSON file in dataset_info_dir. ... Convert dict to a pyarrow ... Create a Dataset from a generator."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@lhoestq +1 for the Dataset.from_generator idea. Having thought about it, let's avoid adding Dataset.from_iterable to the API, since dictionaries are technically iterables ("iterable" is a broad term in Python) and we already provide Dataset.from_dict. And for lists maybe we can add Dataset.from_list, similar to pa.Table.from_pylist. WDYT?

Hi @johann-petrak! You can pass the features directly to ArrowWriter's initializer, like so: ArrowWriter(..., features=features). And the reason why I prefer Dataset.from_generator over Dataset.from_iterable is mentioned in one of my previous comments.
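To make the distinction in these comments concrete, a short sketch of Dataset.from_dict (a mapping of column name to list of values) versus Dataset.from_list (a list of row dicts, added to datasets in a release after this discussion, mirroring pa.Table.from_pylist); the column names are made up:

from datasets import Dataset

# A dict of columns goes through Dataset.from_dict.
ds_from_dict = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# A list of row dicts goes through Dataset.from_list (newer datasets releases),
# analogous to pyarrow's pa.Table.from_pylist.
ds_from_list = Dataset.from_list([{"text": "a", "label": 0}, {"text": "b", "label": 1}])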