Possible bug with calculation of jaccard similarity

See original GitHub issue

Firstly, thanks a lot for this wonderful library and also for adding transformers in the latest release.

Describe the bug
I trained a model for multi-label classification. After training, I computed the Jaccard similarity of the predictions on the test set as intersection over union, and the score does not match the result reported by the library: test_statistics reported a Jaccard of 0.6661, but I calculated 0.5587. This issue was not present in the previous version.
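
For reference, the manual calculation described above corresponds to the sample-averaged Jaccard index. This is a sketch of the formula implied by the script in this report; the symbols $T_i$, $P_i$, and $N$ are notation introduced here and do not appear in the original issue:

```latex
J = \frac{1}{N} \sum_{i=1}^{N} \frac{|T_i \cap P_i|}{|T_i \cup P_i|}
```

Here $T_i$ is the set of true labels for test example $i$, $P_i$ is the set of predicted labels for that example, and $N$ is the number of test examples.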

To Reproduce
Steps to reproduce the behavior:

  1. Download train_data.csv from link and test_data.csv from link and place them in data dir.

  2. Copy

training:
    epochs: 1
    validation_field: labels
    validation_measure: jaccard
input_features:
    -
        name: comment_text
        type: sequence
        sequence_length_limit: 128
        representation: dense
        lowercase: true
        embedding_size: 256
        cell_type: lstm
        reduce_output: null
        num_layers: 1
        bidirectional: true

output_features:
    -
        name: labels
        type: set
        validation_field: jaccard_index

as model_definition.yaml

  3. Run

ludwig experiment -rs 42 --training_set data/train_data.csv \
    --test_set data/test_data.csv --data_format csv -cf model_definition.yaml

  4. After one epoch, in order to calculate the jaccard score, run

import pandas as pd
import csv
def compute_jaccard():
    
    # read test file
    df_test = pd.read_csv("data/test_data.csv")
    true_labels = list(df_test["labels"])

    # read predicted labels
    pred_labels = []
    with open("results/experiment_run/labels_predictions.csv") as csvfile:
        label_reader = csv.reader(csvfile, delimiter=',')
        for row in label_reader:
            all_sectors = ' '.join(row)
            pred_labels.append(all_sectors)

    # compute jaccard similarity
    list_jaccard = []
    for str_true, str_pred in zip(true_labels, pred_labels):
        set_true = set(str_true.split())
        set_pred = set(str_pred.split())
        tp = len(set_true.intersection(set_pred))
        union = len(set_true.union(set_pred))
        list_jaccard.append(tp / union)

    jaccard = sum(list_jaccard) / len(list_jaccard)
    return jaccard

print(compute_jaccard())

Environment:

  • OS: macOS Mojave
  • Version: 10.14.16
  • Python version: 3.7.3
  • Ludwig version: 0.3

Additional context
I used the same logic to calculate the Jaccard score with Ludwig version 0.2.2.8, and the issue does not occur there.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
jenishah commented, Oct 28, 2020

I checked and it’s giving correct results. Thank you very much for this quick fix! 😃

1 reaction
jimthompson5802 commented, Oct 27, 2020

@jenishah Thank you for providing a complete and detailed description of the issue. I was able to reproduce it on my side using the code and data you provided. I’m still looking into the root cause, but one thing I can share: in v0.3, instead of calculating the metric manually, Ludwig uses the Keras metric class MeanIoU. Right now I’m assessing how the difference comes about. This may take me a day or two.
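
To illustrate how a mean-IoU style metric can differ from the sample-averaged Jaccard score computed in the report, here is a minimal pure-Python sketch. It assumes the metric is accumulated over flattened multi-hot label vectors with two classes (label absent, label present) and averaged over those classes, which is how a confusion-matrix-based mean IoU behaves in general. The thread does not confirm that this is how Ludwig invokes the Keras metric, and the function names and toy data below are hypothetical.

```python
def samplewise_jaccard(true_sets, pred_sets):
    """Average over samples of |T & P| / |T | P|, as in the manual script."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # Assumption: two empty sets count as a perfect match.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def classwise_mean_iou(true_sets, pred_sets, vocabulary):
    """Mean IoU over the classes {absent, present} of flattened multi-hot vectors."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_sets, pred_sets):
        for label in vocabulary:
            in_true, in_pred = label in t, label in p
            if in_true and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_true:
                fn += 1
            else:
                tn += 1
    ious = []
    # IoU of the "present" class: TP / (TP + FP + FN).
    if tp + fp + fn:
        ious.append(tp / (tp + fp + fn))
    # IoU of the "absent" class: the roles of positives and negatives swap.
    if tn + fp + fn:
        ious.append(tn / (tn + fp + fn))
    return sum(ious) / len(ious)


# Toy data (hypothetical, not taken from the issue's dataset).
vocabulary = ["a", "b", "c", "d", "e"]
true_sets = [{"a", "b"}, {"c"}, {"d", "e"}]
pred_sets = [{"a"}, {"c", "d"}, {"e"}]

print(samplewise_jaccard(true_sets, pred_sets))              # 0.5
print(classwise_mean_iou(true_sets, pred_sets, vocabulary))  # 0.625
```

On this toy data the class-averaged value is higher because correctly predicted absent labels count toward the score. Under the stated assumption, that would be consistent with the direction of the gap in the report (0.6661 versus 0.5587), but it is not a confirmed root cause.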
