fetch_openml() does not use the local cache

Description

fetch_openml() does not use the local data cached in ~/scikit_learn_data.

Steps/Code to Reproduce

Run the following code twice: the first time with a working Internet connection, the second time without one.

from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1)
print(mnist.data.shape)

Expected Results

First run (with Internet access): (70000, 784)

Second run (without Internet access): (70000, 784)
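To make the expected behaviour concrete, here is a minimal cache-first sketch. It is not scikit-learn's implementation: `cache_path_for`, `fetch_with_cache` and the `download` parameter are hypothetical names, and the host/path layout with a `.gz` suffix is only an assumption inferred from the directory listing further down in this report.

```python
import gzip
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def _download(url):
    # Default downloader: a plain HTTP(S) GET returning the raw bytes.
    with urlopen(url) as response:
        return response.read()


def cache_path_for(url, data_home):
    # Assumed layout: <data_home>/openml/<host>/<url path>.gz
    parsed = urlparse(url)
    relative = parsed.path.lstrip("/") + ".gz"
    return Path(data_home) / "openml" / parsed.netloc / relative


def fetch_with_cache(url, data_home, download=_download):
    """Return the payload for `url`, using the network only on a cache miss."""
    local = cache_path_for(url, data_home)
    if local.exists():
        # Cache hit: nothing is requested, so this also works offline.
        with gzip.open(local, "rb") as f:
            return f.read()
    # Cache miss: download once, then store a gzip-compressed copy.
    payload = download(url)
    local.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(local, "wb") as f:
        f.write(payload)
    return payload
```

With this logic, a second call for the same URL never reaches `download`, which is the behaviour expected from the second run above.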

Actual Results

First run (with Internet access): (70000, 784)

Second run (without Internet access):

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1392, in connect
    super().connect()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 704, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ageron/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/datasets/openml.py", line 466, in fetch_openml
    data_info = _get_data_info_by_name(name, version, data_home)
  File "/Users/ageron/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/datasets/openml.py", line 275, in _get_data_info_by_name
    data_home)
  File "/Users/ageron/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/datasets/openml.py", line 119, in _get_json_content_from_openml_api
    response = _open_openml_url(url, data_home)
  File "/Users/ageron/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/datasets/openml.py", line 55, in _open_openml_url
    fsrc = urlopen(req)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
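The two chained tracebacks show that `urllib` wraps the low-level `socket.gaierror` in a `URLError`. As a rough illustration of how calling code could recognise this "no network" case, here is a small sketch; `is_name_resolution_failure` is a hypothetical helper, not part of scikit-learn or the standard library.

```python
import socket
from urllib.error import URLError


def is_name_resolution_failure(exc):
    """Return True if `exc` is a URLError caused by a failed DNS lookup."""
    # URLError keeps the wrapped exception (or message) in its `reason` attribute.
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.gaierror)
```

A caller could wrap the download in `try`/`except URLError` and fall back to cached files when this helper returns `True`.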

AFAICT, the data is correctly cached in the ~/scikit_learn_data directory, but simply ignored by fetch_openml():

$ tree ~/scikit_learn_data/openml/
/Users/ageron/scikit_learn_data/openml/
└── openml.org
    ├── api
    │   └── v1
    │       └── json
    │           └── data
    │               ├── 554.gz
    │               ├── features
    │               │   └── 554.gz
    │               └── list
    │                   └── data_name
    │                       └── mnist_784
    │                           └── limit
    │                               └── 2
    │                                   └── data_version
    │                                       └── 1.gz
    └── data
        └── v1
            └── download
                └── 52667.gz

15 directories, 4 files
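For readers without the `tree` command, the same check can be done from Python. This is only a sketch: `list_cached_files` is a hypothetical helper, and it assumes the cache lives in an `openml` subdirectory of the data home, as in the listing above.

```python
from pathlib import Path


def list_cached_files(data_home):
    """Return the .gz files found under <data_home>/openml, as relative paths."""
    root = Path(data_home).expanduser() / "openml"
    if not root.is_dir():
        # Nothing has been cached yet (or the data home is elsewhere).
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.gz")
        if p.is_file()
    )
```

Calling `list_cached_files("~/scikit_learn_data")` after the first run should list the four `.gz` files shown above if the cache was written as reported.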

Versions

Darwin-17.7.0-x86_64-i386-64bit
Python 3.6.6 (default, Jun 28 2018, 05:43:53)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0

Same results as well on:

Linux-4.15.0-36-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0]
NumPy 1.15.1
SciPy 1.1.0
Scikit-Learn 0.20.0

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 11
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

4 reactions
elissavetv commented, May 28, 2020

Hello, I am getting the exact same error as above, is there a fix? Many thanks.

1 reaction
shasherazi commented, Jun 9, 2022

Here’s a workaround tho
