ValueError: Found input variables with inconsistent numbers of samples: [27321, 27223]

The number of predictions returned by model.eval_model does not match the number of input rows. The same applies to model.predict (which may be where the error originates). That is, I pass a pandas DataFrame of, say, length 12 to model.eval_model (or text data to model.predict) and receive an output of length 10, which is pretty weird. I do set classification_report in the args, but the ‘O’ label is missing from that report, so I wanted to calculate the report myself. I am using the most recent version of the library. Also, verbose=True does not get the classification_report printed in a Jupyter Notebook. Here is the code:

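import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from simpletransformers.ner import NERModel

# lis is a list in the form required by the library's NER class: [[101, 'word1', 'label-1'], [...]]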
df_prepared = pd.DataFrame(lis, columns=['sentence_id', 'words', 'labels'])
print(len(df_prepared))
df_prepared.head()

#%%

train_df, eval_df = train_test_split(df_prepared, test_size=0.1, shuffle=False)

#%%

# Create a NERModel
model = NERModel('bert', 'bert-base-german-cased', args={'overwrite_output_dir': True, 'reprocess_input_data': True,
                 'num_train_epochs': 5, 'classification_report' : True, 'use_cached_eval_features' : False},
                 labels=list(set(train_df.labels)))

# Train the model
model.train_model(train_df, eval_df=eval_df)

#%%

# Evaluate the model
result, model_outputs, predictions = model.eval_model(eval_df, verbose=True)

print(result)

predictions_flat = [item for sublist in predictions for item in sublist]
print(classification_report(eval_df['labels'].tolist(), predictions_flat))


#%%
# Another way to build the input for classification_report; it fails the same way.
# The error must be somewhere in model.eval_model or model.predict.
current_id = eval_df.iloc[0].sentence_id
sent_lis = []
sents = []
for row in eval_df.itertuples():
    if row.sentence_id != current_id:
        # A new sentence starts: store the finished one and reset the buffer
        current_id = row.sentence_id
        sents.append(' '.join(sent_lis))
        sent_lis = []
    sent_lis.append(row.words)
# Store the last sentence, which the loop itself never appends
sents.append(' '.join(sent_lis))

preds, model_outputs = model.predict(sents)
# Each prediction is a one-entry dict {word: label}; keep only the label
predictions_flat = [list(item.values())[0] for sublist in preds for item in sublist]

Error:

-------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-19-cc4cd7440094> in <module>
----> 1 print(classification_report(eval_df.labels.tolist(), predictions_flat))
      2 
      3 

~/anaconda3/envs/pytorch_1.3/lib/python3.7/site-packages/sklearn/metrics/_classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict, zero_division)
   1965     """
   1966 
-> 1967     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1968 
   1969     labels_given = True

~/anaconda3/envs/pytorch_1.3/lib/python3.7/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     78     y_pred : array or indicator matrix
     79     """
---> 80     check_consistent_length(y_true, y_pred)
     81     type_true = type_of_target(y_true)
     82     type_pred = type_of_target(y_pred)

~/anaconda3/envs/pytorch_1.3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    210     if len(uniques) > 1:
    211         raise ValueError("Found input variables with inconsistent numbers of"
--> 212                          " samples: %r" % [int(l) for l in lengths])
    213 
    214 

ValueError: Found input variables with inconsistent numbers of samples: [27321, 27223]
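
The error only reports the two total lengths, not which sentences lost predictions. A minimal sketch of one way to locate them follows. It compares the label count and the prediction count sentence by sentence and keeps only the sentences where both agree. The helper names group_labels_by_sentence and align_or_skip are hypothetical, and the sketch assumes that predictions[i] belongs to the i-th distinct sentence_id in order of first appearance, which has not been verified against the library.

```python
from collections import OrderedDict


def group_labels_by_sentence(sentence_ids, labels):
    """Collect labels per sentence, keeping sentences in order of first appearance."""
    grouped = OrderedDict()
    for sid, label in zip(sentence_ids, labels):
        grouped.setdefault(sid, []).append(label)
    return grouped


def align_or_skip(sentence_ids, labels, predictions):
    """Return flat (y_true, y_pred) for sentences whose lengths match,
    plus a list of (sentence_id, n_labels, n_predictions) for the rest.

    Assumption: predictions[i] belongs to the i-th distinct sentence_id.
    """
    grouped = group_labels_by_sentence(sentence_ids, labels)
    if len(grouped) != len(predictions):
        raise ValueError(
            "Got %d sentences but %d prediction lists"
            % (len(grouped), len(predictions))
        )
    y_true, y_pred, skipped = [], [], []
    for (sid, true_labels), pred_labels in zip(grouped.items(), predictions):
        if len(true_labels) == len(pred_labels):
            y_true.extend(true_labels)
            y_pred.extend(pred_labels)
        else:
            skipped.append((sid, len(true_labels), len(pred_labels)))
    return y_true, y_pred, skipped


# Toy data: the second sentence is one prediction short
sentence_ids = [101, 101, 101, 102, 102]
labels = ["O", "B-PER", "O", "B-LOC", "O"]
predictions = [["O", "B-PER", "O"], ["B-LOC"]]

y_true, y_pred, skipped = align_or_skip(sentence_ids, labels, predictions)
print(skipped)
```

With the variables from the issue, the call would be align_or_skip(eval_df['sentence_id'], eval_df['labels'], predictions), using the unflattened predictions returned by model.eval_model; y_true and y_pred can then be passed to classification_report. Skipping the mismatched sentences only sidesteps the error. The entries in skipped are the ones worth inspecting for the actual cause.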

Issue Analytics

Created 3 years ago
Comments: 8 (3 by maintainers)

Top GitHub Comments

ThilinaRajapakse commented, May 6, 2020

There does seem to be an issue with certain characters like these when using the NERModel. It’s possibly related to how the tokenization happens. I’ll see if I can do something about this.
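
The characters this comment refers to are not shown here. As one possibility only, the sketch below checks for words that would change the token count if sentences are joined with spaces and later split on whitespace, as the second approach in the issue does before calling model.predict. The helper name words_changing_token_count is hypothetical, and whether the library splits input in this way is an assumption, not something confirmed in this thread.

```python
def words_changing_token_count(words):
    """Return (index, word) pairs that do not survive a whitespace
    join/split round trip as exactly one token."""
    return [
        (i, word)
        for i, word in enumerate(words)
        if len(str(word).split()) != 1
    ]


# Toy data: an empty string, a word with a regular space,
# and a word containing a non-breaking space (U+00A0)
words = ["Hallo", "", "New York", "a\u00a0b", "Welt"]

suspects = words_changing_token_count(words)
print(suspects)

# The round trip yields a different number of tokens than the input had
print(len(words), len(" ".join(words).split()))
```

Running the check on eval_df['words'].tolist() would list rows whose content alone can shift the number of tokens. An empty result would suggest the cause lies elsewhere, for example in the model's own subword tokenization.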

Jefffish09 commented, Apr 24, 2021

Same problem, any updates now?