Change default copy value from True to None

A fair number of estimators currently have copy=True (or copy_X=True) by default. In practice, this means that the code looks something like

X = check_array(X, copy=copy)

and then some other calculations that may or may not change X in place. When the subsequent operations are not done in place, we have just made a wasteful copy for no good reason.
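To make the cost concrete, here is a minimal NumPy-only sketch. The `validate` helper is a hypothetical stand-in for `check_array` (it is not scikit-learn code), and `column_norms` is an invented read-only computation used purely for illustration.

```python
import numpy as np


def validate(X, copy=True):
    # Hypothetical stand-in for check_array: returns a float64 array
    # and copies it when copy=True.
    X = np.asarray(X, dtype=np.float64)
    return X.copy() if copy else X


def column_norms(X, copy=True):
    # Read-only computation: X is never modified after validation.
    X = validate(X, copy=copy)
    return np.sqrt((X ** 2).sum(axis=0))


X = np.arange(6, dtype=np.float64).reshape(3, 2)
assert not np.shares_memory(X, validate(X, copy=True))  # extra allocation
assert np.shares_memory(X, validate(X, copy=False))     # same buffer
# Both settings give the same result, so the copy bought nothing here.
assert np.allclose(column_norms(X, copy=True), column_norms(X, copy=False))
```

Since `column_norms` only reads `X`, the default `copy=True` doubles the memory use of the input without changing the result.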

As discussed in https://github.com/scikit-learn/scikit-learn/issues/13923, one example is Ridge(fit_intercept=False), which copies X although this is not needed. Actually, I can’t find any in-place operations on X in Ridge even with fit_intercept=True, but maybe I am missing something. (found it)

I think in general it would be better to avoid the

X = check_array(X, copy=copy)

pattern, and instead make a copy explicitly where it is needed. It might also be acceptable to skip the copy under copy=True when no copy is needed. Alternatively, we could introduce copy=None as the default.
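One possible reading of the copy=None idea is sketched below. The `center_columns` helper and its three-valued `copy` semantics are assumptions made for illustration; they are not an existing scikit-learn API.

```python
import numpy as np


def center_columns(X, fit_intercept=True, copy=None):
    """Hypothetical helper: subtract column means when fit_intercept=True.

    copy=None  -> copy only if X is about to be modified in place
    copy=True  -> always copy
    copy=False -> never copy (X may be overwritten)
    """
    X = np.asarray(X, dtype=np.float64)
    will_modify = fit_intercept
    if copy is True or (copy is None and will_modify):
        X = X.copy()
    if fit_intercept:
        X -= X.mean(axis=0)  # in-place operation
    return X


X = np.array([[1.0, 2.0], [3.0, 4.0]])

out = center_columns(X, fit_intercept=False)  # nothing to modify: no copy
assert np.shares_memory(out, X)

out = center_columns(X, fit_intercept=True)   # modified: copied first
assert not np.shares_memory(out, X)
assert X[0, 0] == 1.0                         # caller's data is untouched
```

The copy decision sits next to the in-place operation, so it is made only on the code path that actually needs it.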

It would also help to add a common test that checks that Estimator(copy=True).fit(X, y) doesn’t change X.
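Such a check could look roughly like the following sketch. `ToyScaler` and `check_fit_does_not_modify_input` are hypothetical names introduced here; a real common test would run over scikit-learn's registered estimators instead.

```python
import numpy as np


class ToyScaler:
    """Hypothetical estimator used only to exercise the check below."""

    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=np.float64)
        if self.copy:
            X = X.copy()
        X -= X.mean(axis=0)  # in-place centering
        self.scale_ = X.std(axis=0)
        return self


def check_fit_does_not_modify_input(estimator, X, y=None):
    # Snapshot X, fit, then require that X is bit-for-bit unchanged.
    X_before = X.copy()
    estimator.fit(X, y)
    np.testing.assert_array_equal(X, X_before)


X = np.array([[1.0, 2.0], [3.0, 5.0]])
check_fit_does_not_modify_input(ToyScaler(copy=True), X)  # passes
```

With `ToyScaler(copy=False)` the same check raises an `AssertionError`, because the centering step overwrites the caller's array.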

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

2 reactions
jnothman commented, Jun 18, 2019

I think I had just assumed that copy=True meant copy=None. I’d prefer copy='on-write' or something…

2 reactions
rth commented, May 30, 2019

Actually we have a test with read-only X, which will probably enforce that we don’t change X by default.

Very good point. I made #13987 to address this issue in preprocessing (e.g. StandardScaler).

For future reference, to find estimators that potentially have this issue, one can use a common test that checks that an exception is raised when one tries to use an estimator with copy=False on a read-only array. If it is not raised, it is likely that a copy is not actually necessary with copy=True (though there are false positives).

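# Import paths are an assumption based on scikit-learn releases from the
# time of this comment; later releases moved the testing helpers.
from inspect import signature

from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.utils.testing import (assert_raise_message,
                                   create_memmap_backed_data, ignore_warnings)


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_transformer_extra_copy(name, transformer):
    X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                      random_state=0, n_features=2, cluster_std=0.1)
    X = StandardScaler().fit_transform(X)
    X -= X.min()

    # Back X and y with read-only memmaps so that in-place writes raise.
    X, y = create_memmap_backed_data([X, y])

    estimator = clone(transformer)
    sig = signature(transformer.__class__.__init__)
    if "copy" in sig.parameters:
        estimator.set_params(copy=False)

        # If fit does not raise, copy=True likely makes an unneeded copy.
        assert_raise_message(ValueError, "is read-only", estimator.fit, X, y)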

It’s not reliable enough to add it to common tests, but as a detection method, it works reasonably well. The same could be done for classifiers etc.
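The detection idea can also be tried without scikit-learn's test helpers. The sketch below assumes a plain NumPy array marked read-only with `setflags(write=False)` in place of memmap-backed data; `needs_writable_input`, `center_inplace`, and `column_means` are hypothetical names used only for illustration.

```python
import numpy as np


def needs_writable_input(func, X):
    """Return True if func fails because its input array is read-only."""
    X_ro = X.copy()
    X_ro.setflags(write=False)
    try:
        func(X_ro)
    except ValueError as exc:
        if "read-only" in str(exc):
            return True
        raise  # unrelated ValueError: do not hide it
    return False


def center_inplace(X):
    X -= X.mean(axis=0)  # writes into X
    return X


def column_means(X):
    return X.mean(axis=0)  # only reads X


X = np.array([[1.0, 2.0], [3.0, 4.0]])
assert needs_writable_input(center_inplace, X)    # a copy is required
assert not needs_writable_input(column_means, X)  # a copy would be wasted
```

As with the memmap-based check, a function that does not raise is only a candidate for dropping the copy. It may still write to X on a code path that this particular input never reaches.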

Also, it should be noted that for more complex estimators with numerous options, it is sometimes hard to decide whether a copy is needed (e.g. Birch.fit). In that case, it’s probably better to keep the copy to be safe, particularly when the performance gained by avoiding the copy is negligible with respect to the fit or transform time.
