NaN values for variables when converting from a pandas DataFrame to xarray.Dataset

See original GitHub issue

Code Sample, a copy-pastable example if possible

                                                         wind_surface  hurs   bui       fwi
lat       lon       time                      
34.511383 16.467664 1971-01-10 12:00:00     29.658546  70.481293  ...  8.134300  7.409146
34.515558 16.723973 1971-01-10 12:00:00     30.896049  71.356644  ...  8.874528  8.399877
34.517359 16.852138 1971-01-10 12:00:00     31.514799  71.708603  ...  8.789351  8.763743
34.518970 16.980310 1971-01-10 12:00:00     32.105423  72.023773  ...  8.962551  9.125644
34.520391 17.108487 1971-01-10 12:00:00     32.724174  72.106110  ...  8.725038  9.249104

[5 rows x 10 columns]

In [81]: df.to_xarray()                                                                         
Out[81]: 
<xarray.Dataset>
Dimensions:       (lat: 5, lon: 5, time: 1)
Coordinates:
  * lat           (lat) float64 34.51 34.52 34.52 34.52 34.52
  * lon           (lon) float64 16.47 16.72 16.85 16.98 17.11
  * time          (time) object '1971-01-10 12:00:00'
Data variables:
    wind_surface  (lat, lon, time) float64 29.658546 nan nan ... nan 32.724174
    hurs          (lat, lon, time) float64 70.48129 nan nan ... nan nan 72.10611
    precip        (lat, lon, time) float64 0.0 nan nan nan ... nan nan nan 0.0
    tmax          (lat, lon, time) float64 16.060822 nan nan ... nan 16.185822
    ffmc          (lat, lon, time) float64 83.58528 nan nan ... nan nan 84.05673
    isi           (lat, lon, time) float64 7.7641253 nan nan ... nan nan 9.64494
    dmc           (lat, lon, time) float64 6.797345 nan nan ... nan nan 7.90833
    dc            (lat, lon, time) float64 25.314878 nan nan ... nan 24.324644
    bui           (lat, lon, time) float64 8.1343 nan nan ... nan nan 8.725038
    fwi           (lat, lon, time) float64 7.409146 nan nan ... nan 9.2491045

Problem description

Hi, I get these NaN values for the variables when I try to convert a pandas.DataFrame with a MultiIndex to an xarray.Dataset. The same happens if I build an xarray.Dataset and then unstack the MultiIndex, as shown below:

ds = xr.Dataset(df)
ds.unstack('dim_0')                                                                    
<xarray.Dataset>
Dimensions:       (lat: 5, lon: 5, time: 1)
Coordinates:
  * lat           (lat) float64 34.51 34.52 34.52 34.52 34.52
  * lon           (lon) float64 16.47 16.72 16.85 16.98 17.11
  * time          (time) object '1971-01-10 12:00:00'
Data variables:
    wind_surface  (lat, lon, time) float32 29.658546 nan nan ... nan 32.724174
    hurs          (lat, lon, time) float32 70.48129 nan nan ... nan nan 72.10611
    precip        (lat, lon, time) float32 0.0 nan nan nan ... nan nan nan 0.0

Maybe it’s not an issue. I don’t know. I’m lost. Any help is welcome.

Regards

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, May 9 2019, 11:55:04) [GCC 8.3.0]
python-bits: 64
OS: Linux
OS-release: 5.0.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.3
scipy: 1.3.0
netCDF4: 1.5.2
pydap: installed
h5netcdf: 0.7.3
h5py: 2.9.0
Nio: None
zarr: 2.3.1
cftime: 1.0.1
nc_time_axis: 1.1.0
PseudonetCDF: None
rasterio: 1.0.23
cfgrib: None
iris: 2.3.0dev0
bottleneck: 1.2.1
dask: 1.2.2
distributed: None
matplotlib: 3.1.0
cartopy: 0.17.1.dev168+
seaborn: 0.9.0
setuptools: 40.8.0
pip: 19.1.1
conda: None
pytest: None
IPython: 7.5.0
sphinx: 2.0.1

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

sjvrijn commented, Mar 23, 2020

I recently had a similar issue and found out the cause: when converting a dataframe to xarray, xarray allocates an array covering every possible combination of the coordinate values. In this particular case, your five rows contain 5 unique latitudes and 5 unique longitudes, which means there are 5*5=25 possible lat/long combinations. Every combination that has no row in the dataframe is filled in as NaN.
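The arithmetic above can be sketched in plain Python, without xarray. This is only an illustration of the idea, not how xarray implements it; the names `rows`, `grid`, `n_cells`, `n_filled` and `n_missing` are made up for the example, and the values are the lat/lon/wind_surface numbers from the question.

```python
from itertools import product

# (lat, lon) -> wind_surface, copied from the five rows in the question
rows = {
    (34.511383, 16.467664): 29.658546,
    (34.515558, 16.723973): 30.896049,
    (34.517359, 16.852138): 31.514799,
    (34.518970, 16.980310): 32.105423,
    (34.520391, 17.108487): 32.724174,
}

# Unique coordinate values along each axis
lats = sorted({lat for lat, _ in rows})
lons = sorted({lon for _, lon in rows})

# Dense grid over every (lat, lon) pair; pairs without a row become NaN
grid = {pair: rows.get(pair, float("nan")) for pair in product(lats, lons)}

n_cells = len(grid)
n_filled = sum(1 for pair in grid if pair in rows)
n_missing = n_cells - n_filled
print(n_cells, n_filled, n_missing)  # 25 5 20
```

Only 5 of the 25 grid cells have a value, so 20 of them end up as NaN.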

Let me illustrate by recreating just your data on latitude, longitude, wind_surface and hurs:

In [3]: data = [
    ...:     [34.511383, 16.467664, 29.658546, 70.481293],
    ...:     [34.515558, 16.723973, 30.896049, 71.356644],
    ...:     [34.517359, 16.852138, 31.514799, 71.708603],
    ...:     [34.518970, 16.980310, 32.105423, 72.023773],
    ...:     [34.520391, 17.108487, 32.724174, 72.106110],
    ...: ]
In [4]: df = pd.DataFrame(data=data, columns=['lat', 'long', 'wind_surface', 'hurs']).set_index(['lat', 'long'])
In [5]: df
Out[5]:
                     wind_surface       hurs
lat       long
34.511383 16.467664     29.658546  70.481293
34.515558 16.723973     30.896049  71.356644
34.517359 16.852138     31.514799  71.708603
34.518970 16.980310     32.105423  72.023773
34.520391 17.108487     32.724174  72.106110

But for the xarray, this means it will end up creating a 5x5 array, of which only 5 values are given along the diagonal. This is very clearly visible when showing just the DataArray for a single column:

In [6]: df.to_xarray()['wind_surface']
Out[6]:
<xarray.DataArray 'wind_surface' (lat: 5, long: 5)>
array([[29.658546,       nan,       nan,       nan,       nan],
       [      nan, 30.896049,       nan,       nan,       nan],
       [      nan,       nan, 31.514799,       nan,       nan],
       [      nan,       nan,       nan, 32.105423,       nan],
       [      nan,       nan,       nan,       nan, 32.724174]])
Coordinates:
  * lat      (lat) float64 34.51 34.52 34.52 34.52 34.52
  * long     (long) float64 16.47 16.72 16.85 16.98 17.11

However, because to_xarray() returns a Dataset, the summary shows each DataArray (i.e. each column of the dataframe) as a flattened, truncated preview of its values, which makes it look as if a lot of data is simply ‘missing’:

In [7]: df.to_xarray()
Out[7]:
<xarray.Dataset>
Dimensions:       (lat: 5, long: 5)
Coordinates:
  * lat           (lat) float64 34.51 34.52 34.52 34.52 34.52
  * long          (long) float64 16.47 16.72 16.85 16.98 17.11
Data variables:
    wind_surface  (lat, long) float64 29.66 nan nan nan ... nan nan nan 32.72
    hurs          (lat, long) float64 70.48 nan nan nan ... nan nan nan 72.11

So it works as intended, but can throw you for a loop if you don’t realize it’s creating an array the size of all possible index combinations.
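If the dense grid is not what you want, here are two possible ways around it, sketched under the assumption that each row is one scattered observation rather than a cell on a regular lat/long grid. The names `points`, `dense` and `recovered` are only illustrative, and the data is the same subset as above.

```python
import pandas as pd
import xarray as xr

data = [
    [34.511383, 16.467664, 29.658546, 70.481293],
    [34.515558, 16.723973, 30.896049, 71.356644],
    [34.517359, 16.852138, 31.514799, 71.708603],
    [34.518970, 16.980310, 32.105423, 72.023773],
    [34.520391, 17.108487, 32.724174, 72.106110],
]
df = pd.DataFrame(data, columns=["lat", "long", "wind_surface", "hurs"])

# Option 1: keep one entry per observation. With a plain (non-Multi) index,
# to_xarray() yields a single dimension; lat/long become ordinary coordinates.
points = df.rename_axis("point").to_xarray().set_coords(["lat", "long"])

# Option 2: build the dense 5x5 grid, then go back to a dataframe and
# drop the rows that contain nothing but the NaN fill values.
dense = df.set_index(["lat", "long"]).to_xarray()
recovered = dense.to_dataframe().dropna(how="all")

print(points.sizes["point"])                     # 5
print(int(dense["wind_surface"].isnull().sum())) # 20
print(len(recovered))                            # 5
```

Option 1 avoids allocating the NaN cells in the first place; option 2 shows that none of the original values are lost in the dense version.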

@shoyer can you close this issue?

dcherian commented, Mar 23, 2020

Thanks @sjvrijn
