IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (9,) (7,)

Expected behavior

I created the following objective function for a TF-IDF vectorizer in combination with a decision tree from sklearn:

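from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

def dt_objective(trial):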
  
  cv_inputs, test_inputs, cv_labels, test_labels, cv_df, test_df = train_test_split(STUDY_INPUTS, label_array, job_ad_df,
                                                                                    train_size = 0.7,
                                                                                    stratify = label_array, #ensure that all labels are present in train test
                                                                                    shuffle = True, 
                                                                                    random_state=42)

  #Which parameters to tune?

  ngram_range = trial.suggest_categorical('tfidf ngram range', [(1,1), (2,2), (1,2), (1,3)])
  max_df = trial.suggest_float('tfidf max df', 0.80, 1.0)                                                                     
  min_df = trial.suggest_int('tfidf min df', 2, 100) #if 1 then it should occur in 100% of the documents // it should occur in at least one document
  max_features = trial.suggest_int('tfidf max features', 2, 100_000)
  

  vectorizer = TfidfVectorizer(input = 'content',
                               encoding = 'utf-8',
                               lowercase = True,
                               tokenizer = tokenize,
                               ngram_range = ngram_range,
                               max_df = max_df,
                               min_df = min_df,
                               max_features = max_features)
  
  cv_matrix = vectorizer.fit_transform(cv_inputs)

  # https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
  # https://medium.com/@mohtedibf/indepth-parameter-tuning-for-decision-tree-6753118a03c3
  # https://stats.stackexchange.com/questions/65893/maximal-depth-of-a-decision-tree

  criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
  splitter = trial.suggest_categorical('splitter',  ['best', 'random'])
  #max_depth = trial.suggest_int('max depth', 1, cv_matrix.shape[0]-1)
  #min_samples_split = trial.suggest_float('min samples split', 0.01, 1.0)
  #min_samples_leaf = trial.suggest_int('min samples leaf', 1, cv_matrix.shape[0])
  max_features = trial.suggest_categorical('max features', ['auto', 'sqrt', 'log2', None])
  
  dt = DecisionTreeClassifier(criterion = criterion,
                              splitter = splitter,
                              #max_depth = max_depth,
                              #min_samples_split = min_samples_split,
                              #min_samples_leaf = min_samples_leaf,
                              max_features = max_features,
                              random_state = 42)


  return cross_val_score(dt, cv_matrix, cv_labels, scoring= 'f1_macro', cv = 5, n_jobs=-1).mean()

When I run the study for 20 trials, the first trials work as expected, but then I suddenly get an IndexError that I cannot explain:

study = optuna.create_study(study_name = f'{MODEL_NAME} {approach} Study {N_TRIALS} Trails', sampler = SAMPLER, direction='maximize')
study.optimize(dt_objective, n_trials = N_TRIALS, show_progress_bar = True)
[I 2022-03-26 09:04:26,256] Trial 8 finished with value: 0.4777081336796484 and parameters: {'tfidf ngram range': (1, 1), 'tfidf max df': 0.8849587686384248, 'tfidf min df': 21, 'tfidf max features': 33862, 'criterion': 'gini', 'splitter': 'random', 'max features': None}. Best is trial 2 with value: 0.48253011811114294.
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 1) which is of type tuple.
  warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (2, 2) which is of type tuple.
  warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 2) which is of type tuple.
  warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 3) which is of type tuple.
  warnings.warn(message)
[I 2022-03-26 09:04:29,995] Trial 9 finished with value: 0.3722128805977377 and parameters: {'tfidf ngram range': (1, 2), 'tfidf max df': 0.9520693534583604, 'tfidf min df': 59, 'tfidf max features': 91996, 'criterion': 'gini', 'splitter': 'best', 'max features': None}. Best is trial 2 with value: 0.48253011811114294.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4f9d7fce66b8> in <module>()
     13 
     14 study = optuna.create_study(study_name = f'{MODEL_NAME} {approach} Study {N_TRIALS} Trails', sampler = SAMPLER, direction='maximize')
---> 15 study.optimize(dt_objective, n_trials = N_TRIALS, show_progress_bar = True)
     16 
     17 print(f'Best Trial No.: {study.best_trial.number} // Mean F1 Score: {study.best_trial.value}')

10 frames
/usr/local/lib/python3.7/dist-packages/optuna/samplers/_tpe/parzen_estimator.py in _calculate_categorical_params(self, observations, param_name)
    364             value = prior_weight / n_observations
    365         weights = np.full(shape, fill_value=value)
--> 366         weights[np.arange(n_observations), observations] += 1
    367         weights /= weights.sum(axis=1, keepdims=True)
    368         return weights

IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (9,) (6,)
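
To see why the line marked in the traceback raises, here is a minimal NumPy-only sketch. It is not Optuna's code: the two lengths (9 and 6) are taken from the error message, while n_choices and the contents of observations are made-up values for illustration. It only shows that indexing with a 9-element row index and a 6-element column index cannot be broadcast.

```python
import numpy as np

n_observations = 9  # number of rows, taken from the "(9,)" in the error message
n_choices = 4       # hypothetical number of categorical choices
weights = np.full((n_observations, n_choices), fill_value=0.1)

# Hypothetical: only 6 category indices are available for the 9 rows,
# matching the "(6,)" in the error message.
observations = np.array([0, 1, 2, 0, 1, 3])

try:
    # Same indexing pattern as line 366 in the traceback.
    weights[np.arange(n_observations), observations] += 1
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

If observations had 9 entries, one per row, the same statement would run without error.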

Environment

  • Optuna version: 2.10.0
  • Python version: 3.10.3
  • OS: Linux-5.4.0-100-generic-x86_64-with-glibc2.31
  • (Optional) Other libraries and their versions: Sklearn 1.0.2

Error messages, stack traces, or logs

See above
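
The UserWarning lines in the log point at the tuple choices used for 'tfidf ngram range'; according to the warning text, they concern persistent storage. One possible way to avoid them, sketched under the assumption that string labels are acceptable as parameter values, is to suggest a string key and look the tuple up afterwards. NGRAM_CHOICES and suggest_ngram_range are hypothetical names, not part of Optuna; only trial.suggest_categorical is the real API.

```python
# Hypothetical string labels for the n-gram ranges used in the objective above.
NGRAM_CHOICES = {
    "1-1": (1, 1),
    "2-2": (2, 2),
    "1-2": (1, 2),
    "1-3": (1, 3),
}


def suggest_ngram_range(trial):
    """Suggest a string key (a type the warning lists as supported) and map it to a tuple."""
    key = trial.suggest_categorical("tfidf ngram range", list(NGRAM_CHOICES))
    return NGRAM_CHOICES[key]
```

Inside dt_objective, ngram_range = suggest_ngram_range(trial) would then take the place of the original suggest_categorical call.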

Steps to reproduce

See above

Additional context (optional)

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

himkt commented, Mar 26, 2022

I’m happy to hear that. Sorry for the inconvenience, but please wait for our next release. Note that you can also avoid the problem by not using None.

https://github.com/optuna/optuna/issues/3129#issuecomment-981104459

himkt commented, Mar 27, 2022

The problem is caused by wrong handling of None objects. Please see https://github.com/optuna/optuna/issues/3129#issuecomment-981264696 for more detail.

So we can mitigate the problem by introducing a special value and treating it as None (for example, the string “None” in place of None: https://github.com/optuna/optuna/issues/3129#issuecomment-981104459).
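
A minimal sketch of that workaround, assuming the string "None" is used as the stand-in value. NONE_SENTINEL, decode_choice and suggest_max_features are hypothetical helper names, not part of Optuna; only trial.suggest_categorical is the real API.

```python
# Hypothetical sentinel: a plain string that stands in for None in the search space.
NONE_SENTINEL = "None"


def decode_choice(value):
    """Map the string sentinel back to a real None; pass other values through."""
    return None if value == NONE_SENTINEL else value


def suggest_max_features(trial):
    """Suggest max_features without passing a None object to Optuna."""
    raw = trial.suggest_categorical(
        "max features", ["auto", "sqrt", "log2", NONE_SENTINEL]
    )
    return decode_choice(raw)
```

Inside dt_objective, max_features = suggest_max_features(trial) would replace the original suggest_categorical call, so DecisionTreeClassifier still receives a real None when the sentinel is sampled.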
