
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (9,) (7,)


Expected behavior

I created the following objective function for a TF-IDF vectorizer combined with a decision tree from sklearn:

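import optuna
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

def dt_objective(trial):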
  
  cv_inputs, test_inputs, cv_labels, test_labels, cv_df, test_df = train_test_split(STUDY_INPUTS, label_array, job_ad_df,
                                                                                    train_size = 0.7,
                                                                                    stratify = label_array, #ensure that all labels are present in train test
                                                                                    shuffle = True, 
                                                                                    random_state=42)

  #Which parameters to tune?

  ngram_range = trial.suggest_categorical('tfidf ngram range', [(1,1), (2,2), (1,2), (1,3)])
  max_df = trial.suggest_float('tfidf max df', 0.80, 1.0)                                                                     
  min_df = trial.suggest_int('tfidf min df', 2, 100) #if 1, it should occur in 100% of the documents // it should occur in at least one doc
  max_features = trial.suggest_int('tfidf max features', 2, 100_000)
  

  vectorizer = TfidfVectorizer(input = 'content',
                               encoding = 'utf-8',
                               lowercase = True,
                               tokenizer = tokenize,
                               ngram_range = ngram_range,
                               max_df = max_df,
                               min_df = min_df,
                               max_features = max_features)
  
  cv_matrix = vectorizer.fit_transform(cv_inputs)

  # https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
  # https://medium.com/@mohtedibf/indepth-parameter-tuning-for-decision-tree-6753118a03c3
  # https://stats.stackexchange.com/questions/65893/maximal-depth-of-a-decision-tree

  criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
  splitter = trial.suggest_categorical('splitter',  ['best', 'random'])
  #max_depth = trial.suggest_int('max depth', 1, cv_matrix.shape[0]-1)
  #min_samples_split = trial.suggest_float('min samples split', 0.01, 1.0)
  #min_samples_leaf = trial.suggest_int('min samples leaf', 1, cv_matrix.shape[0])
  max_features = trial.suggest_categorical('max features', ['auto', 'sqrt', 'log2', None])
  
  dt = DecisionTreeClassifier(criterion = criterion,
                              splitter = splitter,
                              #max_depth = max_depth,
                              #min_samples_split = min_samples_split,
                              #min_samples_leaf = min_samples_leaf,
                              max_features = max_features,
                              random_state = 42)


  return cross_val_score(dt, cv_matrix, cv_labels, scoring= 'f1_macro', cv = 5, n_jobs=-1).mean()

When I run the study for 20 trials, the first trials work as expected, but then I suddenly get an IndexError that I cannot relate to my code:

study = optuna.create_study(study_name = f'{MODEL_NAME} {approach} Study {N_TRIALS} Trails', sampler = SAMPLER, direction='maximize')
study.optimize(dt_objective, n_trials = N_TRIALS, show_progress_bar = True)

[I 2022-03-26 09:04:26,256] Trial 8 finished with value: 0.4777081336796484 and parameters: {'tfidf ngram range': (1, 1), 'tfidf max df': 0.8849587686384248, 'tfidf min df': 21, 'tfidf max features': 33862, 'criterion': 'gini', 'splitter': 'random', 'max features': None}. Best is trial 2 with value: 0.48253011811114294.
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 1) which is of type tuple.
  warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (2, 2) which is of type tuple.
  warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 2) which is of type tuple.
  warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 3) which is of type tuple.
  warnings.warn(message)
[I 2022-03-26 09:04:29,995] Trial 9 finished with value: 0.3722128805977377 and parameters: {'tfidf ngram range': (1, 2), 'tfidf max df': 0.9520693534583604, 'tfidf min df': 59, 'tfidf max features': 91996, 'criterion': 'gini', 'splitter': 'best', 'max features': None}. Best is trial 2 with value: 0.48253011811114294.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4f9d7fce66b8> in <module>()
     13 
     14 study = optuna.create_study(study_name = f'{MODEL_NAME} {approach} Study {N_TRIALS} Trails', sampler = SAMPLER, direction='maximize')
---> 15 study.optimize(dt_objective, n_trials = N_TRIALS, show_progress_bar = True)
     16 
     17 print(f'Best Trial No.: {study.best_trial.number} // Mean F1 Score: {study.best_trial.value}')

10 frames
/usr/local/lib/python3.7/dist-packages/optuna/samplers/_tpe/parzen_estimator.py in _calculate_categorical_params(self, observations, param_name)
    364             value = prior_weight / n_observations
    365         weights = np.full(shape, fill_value=value)
--> 366         weights[np.arange(n_observations), observations] += 1
    367         weights /= weights.sum(axis=1, keepdims=True)
    368         return weights

IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (9,) (6,)
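
The failing line indexes `weights` with two integer arrays, and NumPy requires such index arrays to be broadcastable to a common shape. The sketch below reproduces the same message outside Optuna. The array sizes and values are made up for illustration (four categorical choices are assumed), and it does not show why Optuna ends up with fewer observations than rows, only what the mismatch looks like:

```python
import numpy as np

# Illustrative sizes only: 9 recorded observations, 4 categorical choices.
n_observations, n_choices = 9, 4
weights = np.full((n_observations, n_choices), fill_value=0.25)

row_index = np.arange(n_observations)          # shape (9,)
observations = np.array([0, 1, 2, 3, 0, 1])    # shape (6,): three entries short

try:
    # Same indexing pattern as the line in the traceback.
    weights[row_index, observations] += 1
except IndexError as exc:
    message = str(exc)
    print(message)
```

When both index arrays have the same length, the statement adds 1 to one cell per row, so the error indicates that some observations went missing before this step.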

Environment

Optuna version: 2.10.0
Python version: 3.10.3
OS: Linux-5.4.0-100-generic-x86_64-with-glibc2.31
(Optional) Other libraries and their versions: Sklearn 1.0.2

Error messages, stack traces, or logs

See the log output and traceback above.

Steps to reproduce

See above

Additional context (optional)

No response


Top GitHub Comments

himkt commented, Mar 26, 2022

I’m happy to hear that. Sorry for the inconvenience, but please wait for our next release. Note that you can also avoid the problem by not using None.

https://github.com/optuna/optuna/issues/3129#issuecomment-981104459

himkt commented, Mar 27, 2022

The problem is caused by incorrect handling of None objects. Please see https://github.com/optuna/optuna/issues/3129#issuecomment-981264696 for more detail.

So we can mitigate the problem by introducing a special value and treating it as None (for example, the string "None" in place of None: https://github.com/optuna/optuna/issues/3129#issuecomment-981104459).
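
A minimal sketch of that workaround, assuming the parameter name from the objective function above: the search space contains only strings, and the sentinel is decoded to None before it reaches the classifier. `NONE_SENTINEL`, `suggest_max_features`, and `StubTrial` are hypothetical names introduced here for illustration; `StubTrial` only stands in for a real Optuna trial so the snippet runs without Optuna installed.

```python
# Sentinel string that stands in for None in the search space (illustrative name).
NONE_SENTINEL = "None"

# Only strings are offered to the sampler; None itself never appears as a choice.
MAX_FEATURES_CHOICES = ["auto", "sqrt", "log2", NONE_SENTINEL]


def suggest_max_features(trial):
    """Sample 'max features' from string choices, then decode the sentinel."""
    choice = trial.suggest_categorical("max features", MAX_FEATURES_CHOICES)
    return None if choice == NONE_SENTINEL else choice


class StubTrial:
    """Hypothetical stand-in for an Optuna trial, returning preset values."""

    def __init__(self, params):
        self.params = params

    def suggest_categorical(self, name, choices):
        value = self.params[name]
        assert value in choices
        return value


# Usage: the decoded value is what would be passed to DecisionTreeClassifier.
print(suggest_max_features(StubTrial({"max features": "None"})))
print(suggest_max_features(StubTrial({"max features": "sqrt"})))
```

Inside `dt_objective`, the call `max_features = suggest_max_features(trial)` would replace the original `suggest_categorical` line that lists None directly.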
