Import labeled dataset

Feature Request: import labeled data sets in BIO format. Like:

SOCCER	O
-	O
JAPAN	B-LOC
GET	O
LUCKY	O
WIN	O
,	O
CHINA	B-PER
IN	O
SURPRISE	O
DEFEAT	O
.	O

Nadim	B-PER
Ladki	I-PER

AL-AIN	B-LOC
,	O
United	B-LOC
Arab	I-LOC
Emirates	I-LOC
1996-12-06	O

Btw, I love your tool, thanks for doing it open source
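
For illustration, a file in the BIO layout shown above could be read with something like the following. This is only a sketch, not the tool's importer: the function names `parse_bio` and `bio_to_spans` are hypothetical, and it assumes one token per line, whitespace between token and tag, and blank lines between sentences.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]      # (word, tag)
Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def parse_bio(lines: Iterable[str]) -> List[List[Token]]:
    """Group BIO lines into sentences of (word, tag) pairs."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is the last whitespace-separated field (tab or space).
        # A line without a tag raises ValueError here.
        word, tag = line.rsplit(None, 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(tokens: List[Token]) -> List[Span]:
    """Collapse B-/I- tags into labeled token spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, (_, tag) in enumerate(tokens):
        # An I- tag only continues an open span with the same label.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # A stray I- tag is treated leniently as the start of a new span.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tokens), label))
    return spans


sample = "Nadim\tB-PER\nLadki\tI-PER\n\nAL-AIN\tB-LOC\n,\tO\nUnited\tB-LOC\n"
for sentence in parse_bio(sample.splitlines()):
    print(sentence, bio_to_spans(sentence))
```

Spans are expressed in token indices here; an importer that stores character offsets would need an extra step to map tokens back to positions in the joined text.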

Issue Analytics

State:
Created 5 years ago
Reactions: 4
Comments: 30 (12 by maintainers)

Top GitHub Comments

6 reactions

machakux commented, Feb 18, 2019

I would like to be able to import labelled datasets in order to (1) review them, (2) correct wrongly labelled data, (3) continue labelling a partially labelled dataset, or (4) add labelled data to an existing project (mostly use cases 1 and 2).

I think storing documents together with labels might simplify things.

If it is decided to store annotations together with the document, the document model could be something like


from django.db import models


class Document(models.Model):
    # Project is assumed to be defined earlier in the same models module.
    project = models.ForeignKey(Project, related_name='documents', on_delete=models.CASCADE)
    text = models.TextField()
    labels = models.TextField()  # or ManyToManyField() or ArrayField()
    annotations = models.TextField()  # or ManyToManyField() or JSONField()
    seq2seq_annotations = models.TextField()  # or ManyToManyField() or ArrayField()
    metadata = models.TextField(default='{}')  # or JSONField()
    # ...

Django has several third-party packages, like https://github.com/dmkoch/django-jsonfield, which can be used to provide somewhat more flexible data structures. And if you will be using PostgreSQL, Django has built-in fields for JSON, arrays and more; see https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/ .
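
To make the TextField variant concrete, here is a minimal plain-Python sketch of keeping annotations and metadata as JSON-encoded text next to the document text. `DocumentRecord` and its helper methods are hypothetical stand-ins for the model above, not part of the project, and the annotation keys (`start`, `end`, `label`) are assumed purely for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    """Hypothetical stand-in for the proposed Document model."""
    text: str
    annotations: str = "[]"  # JSON-encoded list, as a TextField would hold it
    metadata: str = "{}"     # JSON-encoded object, mirroring default='{}'

    def set_annotations(self, spans):
        # Serialize the list of span dicts into the text column.
        self.annotations = json.dumps(spans)

    def get_annotations(self):
        # Decode the text column back into Python objects.
        return json.loads(self.annotations)


doc = DocumentRecord(text="JAPAN GET LUCKY WIN")
doc.set_annotations([{"start": 0, "end": 5, "label": "LOC"}])
print(doc.get_annotations())
```

With a real JSONField the encode and decode steps would be handled by the field itself, at the cost of depending on a database or package that provides it.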

Assuming the basic functionality does not involve updating existing documents, the import will not need to account for an external_id, although users might be allowed to upload one as metadata for their own future reference.

Allowing users to update existing documents through bulk upload could be limited to the admin interface or a command-line interface, as advanced functionality for users who are sure of what they are doing. Here users could be allowed to provide the real id field (the object's primary key): if an object with the provided id already exists in the database it will be updated, otherwise a new object will be created.
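
The update-or-create rule described above can be sketched in plain Python as follows. The dict `store` is an assumed in-memory stand-in for the documents table, and `upsert_document` is a hypothetical name; in Django itself the queryset method `update_or_create` covers much the same ground.

```python
from typing import Dict, Tuple


def upsert_document(store: Dict[int, dict], row: dict) -> Tuple[int, bool]:
    """Update the document matching row['id'], otherwise create one.

    Returns (document id, created flag).
    """
    doc_id = row.get("id")
    fields = {key: value for key, value in row.items() if key != "id"}
    if doc_id is not None and doc_id in store:
        # The id already exists: overwrite only the uploaded fields.
        store[doc_id].update(fields)
        return doc_id, False
    if doc_id is None:
        # No id supplied: allocate the next free one (illustrative only).
        doc_id = max(store, default=0) + 1
    store[doc_id] = fields
    return doc_id, True


store = {1: {"text": "old text"}}
print(upsert_document(store, {"id": 1, "text": "corrected text"}))
print(upsert_document(store, {"text": "brand new document"}))
```

Because an uploaded id silently overwrites an existing row, restricting this path to admin or command-line use, as suggested above, limits the risk of accidental updates.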

If documents and annotations are stored together, it may also be easier to use existing tools like django-import-export, especially for imports via the admin interface.

4 reactions

Hironsan commented, Mar 12, 2019

I thoroughly redesigned APIs and models and supported labeled dataset import.

The supported import formats per task are as follows (○ = supported, X = not supported):

Task	Plain	CSV	JSON	CoNLL
Text Classification	○	○ (single label)	○	X
Sequence Labeling	○	X	○	○
Seq2seq	○	○	○	X
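
As a sketch only, the matrix above can be written down as a lookup that an upload form might consult before accepting a file. The names `SUPPORTED_FORMATS` and `is_supported` are hypothetical and not taken from the project's code; the single-label restriction on CSV for text classification is noted in a comment rather than enforced.

```python
# Task -> accepted upload formats, transcribed from the table above.
SUPPORTED_FORMATS = {
    "Text Classification": {"plain", "csv", "json"},  # CSV: single label only
    "Sequence Labeling": {"plain", "json", "conll"},
    "Seq2seq": {"plain", "csv", "json"},
}


def is_supported(task: str, file_format: str) -> bool:
    """Return True if the task accepts uploads in the given format."""
    return file_format.lower() in SUPPORTED_FORMATS.get(task, set())


print(is_supported("Sequence Labeling", "CoNLL"))
print(is_supported("Text Classification", "CoNLL"))
```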

You can check the detailed format on the upload page.

This is not a perfect feature, only a first step. There are some bugs and performance problems, so I welcome your opinions and feedback.

Thank you for your feedback and contribution.
