SURFRAD site & date-range download

Is your feature request related to a problem? Please describe.

The current SURFRAD iotools function only reads a single-day .dat file from either a URL or the filesystem, e.g.:

# read from url
pvlib.iotools.read_surfrad('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/2021/bon21001.dat')
# read from file
pvlib.iotools.read_surfrad('bon21001.dat')

Unfortunately, I can’t quickly read an entire year or any arbitrarily large date range. I can call pvlib.iotools.read_surfrad in a loop, but reading an entire year serially takes a long time. Maybe it would be faster if I already had the files downloaded. It takes about 1 second to read a single 111 kB file, so 10,000 files would take about 3 hours, which is too long if I have to read 7 sites.

%%timeit
bon95 = [
    pvlib.iotools.read_surfrad(r'ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/1995/bon95%03d.dat' % (x+1))
    for x in range(16)]  # read in 16 files

## -- End pasted text --
14.4 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s 14.4 s / 16 files = 0.9 s per file. I tried to use threading, but then I got connection errors. I think there’s a limit of 5 connections to the NOAA FTP server from one computer. Five parallel connections should bring the time for 10,000 files down to about 30 minutes; hmm, maybe I didn’t try hard enough? Anyway, I went a different way.
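For reference, the back-of-the-envelope estimate can be written out as below. This is only a sketch of the arithmetic: the timing and file count are the figures quoted in this issue, and the 5-connection limit is the apparent limit described here, not a documented one. It uses the measured 0.9 s per file rather than the rounded 1 s, so the serial estimate comes out slightly under 3 hours.

```python
# Figures quoted in this issue; none are independently verified.
seconds_per_file = 14.4 / 16   # timeit result: 16 files in 14.4 s
n_files = 10_000               # rough number of daily files per site
max_connections = 5            # apparent per-host FTP connection limit

# Serial download: one file after another.
serial_hours = n_files * seconds_per_file / 3600

# Ideal threaded download: work split evenly over the allowed connections.
threaded_minutes = n_files * seconds_per_file / max_connections / 60

print(f"serial: {serial_hours:.1f} h, threaded: {threaded_minutes:.0f} min")
# prints "serial: 2.5 h, threaded: 30 min"
```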

Describe the solution you’d like

The current read_surfrad uses Python’s urllib.request.urlopen for each connection. I have found that opening a long-lasting FTP connection with Python’s ftplib allows downloading many more files by reusing the same connection. That download is still serial, however, so I additionally use Python threading, which lets me open up to 5 simultaneous connections; any more and I get a 421 FTP connection error (too many connections).
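A minimal sketch of that approach, assuming anonymous FTP login and the host and directory layout seen in the URLs above (the server may no longer be reachable at that address). The helper names surfrad_paths, download_all, and connect_noaa are hypothetical and are not part of pvlib; this is not the code from the gist. Each worker thread opens one connection and reuses it for every file it fetches, and the pool size caps the number of simultaneous connections at 5.

```python
import ftplib
import io
import threading
from concurrent.futures import ThreadPoolExecutor

HOST = "aftp.cmdl.noaa.gov"  # assumption: host taken from the URLs in this issue
MAX_CONNECTIONS = 5          # more than this reportedly triggers a 421 error


def surfrad_paths(site_dir, prefix, year, days):
    """Build remote paths such as .../Bondville_IL/2021/bon21001.dat."""
    yy = year % 100
    return [
        f"data/radiation/surfrad/{site_dir}/{year}/{prefix}{yy:02d}{doy:03d}.dat"
        for doy in days
    ]


def connect_noaa():
    """Open one anonymous FTP connection (hypothetical helper)."""
    ftp = ftplib.FTP(HOST, timeout=60)
    ftp.login()  # anonymous login
    return ftp


def download_all(paths, connect=connect_noaa, max_connections=MAX_CONNECTIONS):
    """Download every path, reusing one FTP connection per worker thread.

    Returns a dict mapping each remote path to the file's raw bytes.
    """
    local = threading.local()  # holds each thread's own connection
    connections = []           # every connection opened, for cleanup
    lock = threading.Lock()

    def fetch(path):
        ftp = getattr(local, "ftp", None)
        if ftp is None:
            # First file handled by this thread: open its connection once.
            ftp = local.ftp = connect()
            with lock:
                connections.append(ftp)
        buf = io.BytesIO()
        ftp.retrbinary(f"RETR {path}", buf.write)
        return path, buf.getvalue()

    try:
        # The pool never runs more than max_connections threads, so it
        # never holds more than max_connections FTP connections.
        with ThreadPoolExecutor(max_workers=max_connections) as pool:
            return dict(pool.map(fetch, paths))
    finally:
        for ftp in connections:
            ftp.quit()
```

For example, download_all(surfrad_paths("Bondville_IL", "bon", 2021, range(1, 366))) would request one year of Bondville files. The connect argument is injectable so the logic can be exercised without network access, and the returned bytes could be written to disk and then parsed with pvlib.iotools.read_surfrad.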

Describe alternatives you’ve considered

I was able to open the FTP site directly in Windows, but that was also a serial connection, so about 10,000 files (about 1 GB) would have taken 4 hours. By contrast, using ftplib and threading, I can download all of the data from a single site in about 25 minutes.

Additional context

#590, #595. Gist of my working script: https://gist.github.com/mikofski/30455056b88a5d161598856cc4eedb2c

Issue Analytics

  • State:open
  • Created 3 years ago
  • Comments:9 (9 by maintainers)

Top GitHub Comments

1 reaction
mikofski commented, Aug 5, 2022

Wow, that’s 3 times faster, but still over a day for 25 years of data. @AdamRJensen can you ask your contact how many HTTPS connections are allowed from the same host? I still think threading this request is the way to go? But maybe we leave that to the user?

Any complaints if I close this issue now? I don’t think I’ll work on it, and the funny thing is that you only need to download the SURFRAD data once. Maybe this is better as a gallery example?

1 reaction
AdamRJensen commented, Aug 5, 2022

@mikofski As discussed in #1459, SURFRAD files are available via both FTP and, more recently, HTTPS. It seems there is a significant performance gain (at least a factor of two) to be had by using the HTTPS links (see test below). I figured this might be relevant information for this issue.

[Image: test comparing FTP and HTTPS performance]
