IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (9,) (7,)
Expected behavior
I created the following objective function for a TF-IDF vectorizer combined with a decision tree from sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# STUDY_INPUTS, label_array, job_ad_df and tokenize are defined elsewhere in the notebook.
def dt_objective(trial):
    cv_inputs, test_inputs, cv_labels, test_labels, cv_df, test_df = train_test_split(STUDY_INPUTS, label_array, job_ad_df,
                                                                                      train_size = 0.7,
                                                                                      stratify = label_array, # ensure that all labels are present in train and test
                                                                                      shuffle = True,
                                                                                      random_state = 42)
    # Which parameters to tune?
    ngram_range = trial.suggest_categorical('tfidf ngram range', [(1,1), (2,2), (1,2), (1,3)])
    max_df = trial.suggest_float('tfidf max df', 0.80, 1.0)
    min_df = trial.suggest_int('tfidf min df', 2, 100) # if 1, it should occur in 100% of the documents // it should occur in at least one doc
    max_features = trial.suggest_int('tfidf max features', 2, 100_000)
    vectorizer = TfidfVectorizer(input = 'content',
                                 encoding = 'utf-8',
                                 lowercase = True,
                                 tokenizer = tokenize,
                                 ngram_range = ngram_range,
                                 max_df = max_df,
                                 min_df = min_df,
                                 max_features = max_features)
    cv_matrix = vectorizer.fit_transform(cv_inputs)
    # https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
    # https://medium.com/@mohtedibf/indepth-parameter-tuning-for-decision-tree-6753118a03c3
    # https://stats.stackexchange.com/questions/65893/maximal-depth-of-a-decision-tree
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    splitter = trial.suggest_categorical('splitter', ['best', 'random'])
    #max_depth = trial.suggest_int('max depth', 1, cv_matrix.shape[0]-1)
    #min_samples_split = trial.suggest_float('min samples split', 0.01, 1.0)
    #min_samples_leaf = trial.suggest_int('min samples leaf', 1, cv_matrix.shape[0])
    # Note: from here on, max_features refers to the decision tree parameter.
    max_features = trial.suggest_categorical('max features', ['auto', 'sqrt', 'log2', None])
    dt = DecisionTreeClassifier(criterion = criterion,
                                splitter = splitter,
                                #max_depth = max_depth,
                                #min_samples_split = min_samples_split,
                                #min_samples_leaf = min_samples_leaf,
                                max_features = max_features,
                                random_state = 42)
    return cross_val_score(dt, cv_matrix, cv_labels, scoring = 'f1_macro', cv = 5, n_jobs = -1).mean()
When I run the study for 20 trials, the first trials work as expected, but then I suddenly get an IndexError that I cannot explain:
study = optuna.create_study(study_name = f'{MODEL_NAME} {approach} Study {N_TRIALS} Trails', sampler = SAMPLER, direction='maximize')
study.optimize(dt_objective, n_trials = N_TRIALS, show_progress_bar = True)
[I 2022-03-26 09:04:26,256] Trial 8 finished with value: 0.4777081336796484 and parameters: {'tfidf ngram range': (1, 1), 'tfidf max df': 0.8849587686384248, 'tfidf min df': 21, 'tfidf max features': 33862, 'criterion': 'gini', 'splitter': 'random', 'max features': None}. Best is trial 2 with value: 0.48253011811114294.
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 1) which is of type tuple.
warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (2, 2) which is of type tuple.
warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 2) which is of type tuple.
warnings.warn(message)
/usr/local/lib/python3.7/dist-packages/optuna/distributions.py:427: UserWarning: Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains (1, 3) which is of type tuple.
warnings.warn(message)
[I 2022-03-26 09:04:29,995] Trial 9 finished with value: 0.3722128805977377 and parameters: {'tfidf ngram range': (1, 2), 'tfidf max df': 0.9520693534583604, 'tfidf min df': 59, 'tfidf max features': 91996, 'criterion': 'gini', 'splitter': 'best', 'max features': None}. Best is trial 2 with value: 0.48253011811114294.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4f9d7fce66b8> in <module>()
     13
     14 study = optuna.create_study(study_name = f'{MODEL_NAME} {approach} Study {N_TRIALS} Trails', sampler = SAMPLER, direction='maximize')
---> 15 study.optimize(dt_objective, n_trials = N_TRIALS, show_progress_bar = True)
     16
     17 print(f'Best Trial No.: {study.best_trial.number} // Mean F1 Score: {study.best_trial.value}')
10 frames
/usr/local/lib/python3.7/dist-packages/optuna/samplers/_tpe/parzen_estimator.py in _calculate_categorical_params(self, observations, param_name)
    364         value = prior_weight / n_observations
    365         weights = np.full(shape, fill_value=value)
--> 366         weights[np.arange(n_observations), observations] += 1
    367         weights /= weights.sum(axis=1, keepdims=True)
    368         return weights
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (9,) (6,)
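The failing line in the traceback indexes a 2D weights array with one array of row indices and one array of observed category indices. As a minimal sketch using plain NumPy (the sizes and values below are made up for illustration, and this does not reproduce Optuna's internals), the same IndexError appears as soon as the two index arrays differ in length:

```python
import numpy as np

# Illustrative sizes only: 9 observations over 4 categorical choices.
n_observations, n_choices = 9, 4
weights = np.full((n_observations, n_choices), 0.1)

# Matching lengths: one column index per row, so each row gets one increment.
observations = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0])
weights[np.arange(n_observations), observations] += 1

# Mismatched lengths: 9 row indices and 6 column indices cannot be broadcast.
short_observations = observations[:6]
try:
    weights[np.arange(n_observations), short_observations] += 1
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```

In other words, the error suggests that the observations array held fewer entries than the number of observations the sampler expected.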
Environment
- Optuna version: 2.10.0
- Python version: 3.10.3
- OS: Linux-5.4.0-100-generic-x86_64-with-glibc2.31
- (Optional) Other libraries and their versions: Sklearn 1.0.2
Error messages, stack traces, or logs
See above
Steps to reproduce
See above
Additional context (optional)
No response
Issue Analytics
- State:
- Created a year ago
- Comments:9 (5 by maintainers)
I'm happy to hear that. Sorry for the inconvenience, but please wait for our next release. Note that you can also avoid the problem by not using None.
https://github.com/optuna/optuna/issues/3129#issuecomment-981104459
The problem is caused by incorrect handling of None objects. Please see https://github.com/optuna/optuna/issues/3129#issuecomment-981264696 for more detail.
So we can mitigate the problem by introducing a special value and treating it as None (for example, the string "None" in place of None: https://github.com/optuna/optuna/issues/3129#issuecomment-981104459).
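A minimal sketch of that mitigation, assuming the objective from the report: the categorical choice list carries the string "None" instead of the None object, and the sampled value is decoded before it reaches the classifier. The names NONE_SENTINEL, decode_choice and suggest_dt_max_features are hypothetical helpers for illustration, not part of Optuna.

```python
# Hypothetical sentinel: a plain string that stands in for None in the search space.
NONE_SENTINEL = "None"


def decode_choice(choice):
    """Map the sentinel string back to the real None; pass other choices through."""
    return None if choice == NONE_SENTINEL else choice


def suggest_dt_max_features(trial):
    """Suggest the decision tree's max_features without a raw None choice.

    `trial` is expected to be an optuna.trial.Trial; any object with a
    compatible suggest_categorical(name, choices) method works as well.
    """
    # Same choices as in the reported objective, with None replaced by the sentinel.
    choice = trial.suggest_categorical(
        "max features", ["auto", "sqrt", "log2", NONE_SENTINEL]
    )
    return decode_choice(choice)
```

Inside dt_objective, the original suggest_categorical call for 'max features' could then be replaced with max_features = suggest_dt_max_features(trial), so the sampler only ever sees strings while DecisionTreeClassifier still receives None.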