Getting only TIMEOUT for PredefinedSplit

Describe the bug

When PredefinedSplit is passed as the resampling strategy, every trial ends in TIMEOUT, even on a small dataset. With the default configuration, auto-sklearn produces successful trials within a couple of seconds.

To Reproduce

This is the minimal code I can come up with, based on the example here.

import numpy as np
import pandas as pd
import autosklearn.classification
import autosklearn.metrics
from sklearn.model_selection import PredefinedSplit, train_test_split

# Using credit card public dataset to demonstrate the problem
df = pd.read_csv("https://raw.githubusercontent.com/irenebenedetto/default-of-credit-card-clients/master/dataset/credit_cards_dataset.csv")
X_train, X_test = train_test_split(
    df, test_size=0.2, random_state=42
)
y_train = X_train.pop(X_train.columns[-1])

# Using an arbitrary column to create the validation set; the split itself is meaningless and only demonstrates the problem
resampling_strategy = PredefinedSplit(
    test_fold=np.where(X_train.to_numpy()[:, 4] < np.mean(X_train.to_numpy()[:, 4]))[0]
)

autosk = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,
    per_run_time_limit=200,
    tmp_folder="./tmp/autosklearn",
    disable_evaluator_output=False,
    resampling_strategy=resampling_strategy,
    metric=autosklearn.metrics.accuracy,
    delete_tmp_folder_after_terminate=False,
    seed=42
)

autosk.fit(X_train, y_train)

By commenting out the resampling_strategy line, the trials run successfully.

I also tried increasing both time_left_for_this_task and per_run_time_limit to 6000, but still got only TIMEOUT.

I also ran the example code, and it generated successful trials.

I’m not sure whether the issue is the dataset, how I’m using PredefinedSplit, or something else.
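
One way to narrow this down is to inspect the array passed as test_fold before calling fit. The sketch below does that on made-up toy values rather than the credit card dataset. scikit-learn's PredefinedSplit expects one fold label per sample, whereas np.where called with only a condition returns the positions where the condition holds. The helper name describe_test_fold is hypothetical and not part of scikit-learn or auto-sklearn.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit


def describe_test_fold(test_fold, n_samples):
    """Summarise a candidate test_fold array (hypothetical helper)."""
    test_fold = np.asarray(test_fold)
    splitter = PredefinedSplit(test_fold=test_fold)
    return {
        # PredefinedSplit needs exactly one entry per sample.
        "length_matches": len(test_fold) == n_samples,
        # Every distinct value other than -1 becomes its own fold.
        "n_splits": splitter.get_n_splits(),
    }


# Toy stand-in for one feature column (values are made up).
feature = np.array([5.0, 1.0, 7.0, 2.0, 9.0, 3.0])

# Same pattern as in the report: np.where(cond)[0] yields row positions.
positions = np.where(feature < feature.mean())[0]

report = describe_test_fold(positions, n_samples=len(feature))
print(positions, report)
```

On this toy input the array is shorter than the data, and each position is read as a separate fold label. On a real training set the same pattern would therefore produce a very large number of folds.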

Expected behavior

Generate multiple successful trials.

Actual behavior, stacktrace or logfile

Result from sprint_statistics:

auto-sklearn results:
  Dataset name: 1e6334d4-3831-11ec-9a9c-0255ac100090
  Metric: accuracy
  Number of target algorithm runs: 12
  Number of successful target algorithm runs: 0
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 12
  Number of target algorithms that exceeded the memory limit: 0

Logfile uploaded.

Environment and installation:

Please give details about your installation:

  • OS: Ubuntu 20.04.2 LTS (Focal Fossa) - a pod inside Kubeflow cluster
  • Is your installation in a virtual environment or conda environment: Normal python in a Kubeflow notebook
  • Python version: 3.7.1
  • Auto-sklearn version: 0.13.0

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
eddiebergman commented, Nov 17, 2021

Hi @mereldawu,

Sorry to read about this issue. I meant to reply earlier, and I apologize that I did not. I’m not sure what would cause this, since a failure should not show up as a TIMEOUT. Thank you for the full reproducible code example; we will look at this soon.

0 reactions
eddiebergman commented, Dec 13, 2021

Hi @mereldawu, this was resolved by PR #1340, which updates our example on how to use PredefinedSplit; the old example is what led to the errors you saw. There is nothing we can do to automatically detect bad splits returned by a custom splitter.
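
The contents of PR #1340 are not reproduced here. As a minimal sketch based only on scikit-learn's documented PredefinedSplit semantics, a single validation fold can be described with one label per training row: -1 for rows that always stay in training and 0 for rows in the validation fold. The toy values below are made up, and treating the below-mean rows as the validation set is just an assumption carried over from the report.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy stand-in for the feature column used in the report (made-up values).
feature = np.array([5.0, 1.0, 7.0, 2.0, 9.0, 3.0])

# Three-argument np.where returns an array with one entry per row:
# 0 marks the validation fold, -1 marks rows that are never held out.
test_fold = np.where(feature < feature.mean(), 0, -1)

splitter = PredefinedSplit(test_fold=test_fold)

# With a single fold label there is exactly one train/validation split.
train_idx, valid_idx = next(splitter.split())
print(splitter.get_n_splits(), train_idx, valid_idx)
```

A splitter built this way would then be passed as resampling_strategy in place of the one from the report. Checking get_n_splits() and the index arrays first is a cheap guard, since auto-sklearn does not detect bad splits on its own.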
