Support creating DataSet from streaming object


The use case is netCDF files stored on S3 or other generic cloud storage:

import requests
import xarray as xr

fp = 'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_MPI-ESM-LR_2029.nc'

data = requests.get(fp, stream=True)
ds = xr.open_dataset(data.content)  # raises TypeError: embedded NUL character
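
The error message suggests that the raw bytes were treated as a file path rather than as file contents. As a minimal sketch, one way to tell a netCDF file image apart from a path is to check its leading signature bytes. The helper name `sniff_netcdf` is hypothetical and not part of xarray, and the sketch assumes the signature sits at offset 0 of the buffer:

```python
# Leading signature bytes that identify a netCDF file image.
NETCDF3_MAGIC = (b"CDF\x01", b"CDF\x02", b"CDF\x05")  # classic, 64-bit offset, CDF-5
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # netCDF-4 files are stored as HDF5


def sniff_netcdf(data):
    """Return 'netcdf3', 'netcdf4', or None for a bytes-like object.

    Hypothetical helper for illustration only; not an xarray API.
    """
    if not isinstance(data, (bytes, bytearray, memoryview)):
        return None  # e.g. a str path, which this sketch does not inspect
    head = bytes(data[:8])  # the longest signature checked is 8 bytes
    if head.startswith(NETCDF3_MAGIC):
        return "netcdf3"
    if head.startswith(HDF5_MAGIC):
        return "netcdf4"
    return None
```

Called on `data.content` from the snippet above, a helper like this would indicate whether the response body looks like a netCDF file image before it is handed to a reader.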

Ideal would be integration with the (hopefully) soon-to-be implemented dask.distributed features discussed in #798.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 4
  • Comments: 16 (11 by maintainers)

Top GitHub Comments

delgadom commented, Jan 12, 2018 (4 reactions)

Yes! Thanks @jhamman and @shoyer. I hadn’t tried it yet, but I just did, and it worked great!

In  [1]: import xarray as xr
    ...: import requests
    ...: import netCDF4
    ...: 
    ...: %matplotlib inline

In  [2]: res = requests.get(
    ...:     'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmin/' +
    ...:     'r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073.nc')

In  [3]: res.status_code
Out [3]: 200

In  [4]: res.headers['content-type']
Out [4]: 'application/x-netcdf'

In  [5]: nc4_ds = netCDF4.Dataset('tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073', memory=res.content)

In  [6]: store = xr.backends.NetCDF4DataStore(nc4_ds)

In  [7]: ds = xr.open_dataset(store)

In  [8]: ds.tasmin.isel(time=0).plot()
    /global/home/users/mdelgado/git/public/xarray/xarray/plot/utils.py:51: FutureWarning: 'pandas.tseries.converter.register' has been moved and renamed to 'pandas.plotting.register_matplotlib_converters'. 
      converter.register()
Out [8]: <matplotlib.collections.QuadMesh at 0x2aede3c922b0>

[Figure: plot of tasmin at the first time step (time=0)]

In  [9]: ds
Out [9]:
    <xarray.Dataset>
    Dimensions:  (lat: 720, lon: 1440, time: 365)
    Coordinates:
      * time     (time) datetime64[ns] 2073-01-01T12:00:00 2073-01-02T12:00:00 ...
      * lat      (lat) float32 -89.875 -89.625 -89.375 -89.125 -88.875 -88.625 ...
      * lon      (lon) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 1.875 ...
    Data variables:
        tasmin   (time, lat, lon) float64 ...
    Attributes:
        parent_experiment:              historical
        parent_experiment_id:           historical
        parent_experiment_rip:          r1i1p1
        Conventions:                    CF-1.4
        institution:                    NASA Earth Exchange, NASA Ames Research C...
        institute_id:                   NASA-Ames
        realm:                          atmos
        modeling_realm:                 atmos
        version:                        1.0
        downscalingModel:               BCSD
        experiment_id:                  rcp45
        frequency:                      day
        realization:                    1
        initialization_method:          1
        physics_version:                1
        tracking_id:                    1865ff49-b20c-4268-852a-a9503efec72c
        driving_data_tracking_ids:      N/A
        driving_model_ensemble_member:  r1i1p1
        driving_experiment_name:        historical
        driving_experiment:             historical
        model_id:                       BCSD
        references:                     BCSD method: Thrasher et al., 2012, Hydro...
        DOI:                            http://dx.doi.org/10.7292/W0MW2F2G
        experiment:                     RCP4.5
        title:                          CESM1-BGC global downscaled NEX CMIP5 Cli...
        contact:                        Dr. Rama Nemani: rama.nemani@nasa.gov, Dr...
        disclaimer:                     This data is considered provisional and s...
        resolution_id:                  0.25 degree
        project_id:                     NEXGDDP
        table_id:                       Table day (12 November 2010)
        source:                         BCSD 2014
        creation_date:                  2015-01-07T19:18:31Z
        forcing:                        N/A
        product:                        output

shoyer commented, Nov 22, 2017 (3 reactions)

Just to clarify: I wrote above that we could support initializing a Dataset from a netCDF4 file image. But this wouldn’t help yet for streaming access.

Initializing a Dataset from a netCDF4 file image should actually work with the latest versions of xarray and netCDF4-python:

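import netCDF4
import xarray
# netcdf_bytes: the complete contents of a netCDF file as bytes (e.g. an HTTP response body)
nc4_ds = netCDF4.Dataset('arbitrary-name', memory=netcdf_bytes)
store = xarray.backends.NetCDF4DataStore(nc4_ds)
ds = xarray.open_dataset(store)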
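
Because the `memory=` approach needs the complete file image, a response requested with `stream=True` still has to be buffered in full before it can be opened. A minimal sketch of that buffering step follows; `collect_chunks` and its `max_bytes` cap are hypothetical names introduced here for illustration, not part of requests, netCDF4, or xarray:

```python
def collect_chunks(chunks, max_bytes=None):
    """Join an iterable of byte chunks into a single bytes object.

    Hypothetical helper for illustration. If max_bytes is given, raise
    ValueError as soon as the buffered size exceeds it, so an unexpectedly
    large download does not exhaust memory.
    """
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if max_bytes is not None and len(buf) > max_bytes:
            raise ValueError("file image exceeds max_bytes")
    return bytes(buf)


# Possible usage with the calls from the comments above (not executed here):
#   res = requests.get(url, stream=True)
#   netcdf_bytes = collect_chunks(res.iter_content(chunk_size=1 << 20))
#   nc4_ds = netCDF4.Dataset('arbitrary-name', memory=netcdf_bytes)
```

This only bounds memory use during the download. It does not provide lazy or partial reads, which is the streaming access the comment says is not yet supported.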