GeoDataset: avoid unnecessary reprojection

This issue is meant to serve as a proposal for how to fix #278. Please try to poke holes in this proposal and find ways in which it won’t work, or suggest alternative solutions.

Rationale

Geospatial data files are stored in a particular CRS. In order to build a single R-tree index of a geospatial dataset, the bounds of all files must be reprojected to the same CRS. During sampling, we also reproject the file itself to a common CRS so that we can stitch together multiple files from the same layer, or combine multiple layers in an IntersectionDataset.

Unfortunately, reprojection/resampling is very slow, and can negatively affect data loader performance (see section 4.2 of https://arxiv.org/abs/2111.08872). It can also lead to large distortions of images in polar regions. We would like to be able to avoid reprojection if possible.

Proposal

I propose to reproject files during sampling only when absolutely necessary. This covers any situation in which we sample from multiple files at once and not all of those files are in the same CRS, such as:

  • a single GeoDataset containing overlapping files on the border of two UTM zones
  • an IntersectionDataset with datasets in different CRSes
  • a UnionDataset with overlapping files and datasets in different CRSes
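The "only when necessary" rule above can be sketched as a check on the files a single query touches. This is a minimal illustration, not TorchGeo code: the function name `needs_reprojection`, the `(filename, crs)` pair layout, and the file names and EPSG codes are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def needs_reprojection(hits: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the files hit by one query do not all share a CRS.

    `hits` is a hypothetical list of (filename, crs) pairs, one per file
    that overlaps the sampled region.
    """
    crses = {crs for _, crs in hits}
    return len(crses) > 1


# Two files in the same UTM zone: they can be read in their native CRS.
same_zone = [("a.tif", "EPSG:32610"), ("b.tif", "EPSG:32610")]

# Overlapping files on the border of two UTM zones: one must be reprojected.
zone_border = [("a.tif", "EPSG:32610"), ("c.tif", "EPSG:32611")]

print(needs_reprojection(same_zone))
print(needs_reprojection(zone_border))
```

A query that touches a single file never needs reprojection under this check, which is the common case the proposal is trying to make fast.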

The details of how this can be done are left to the “Implementation” section below, but the basic idea is to add a parameter to __getitem__ that contains the CRS to project to. The sampler is then responsible for setting this CRS in a way that minimizes reprojection.

There is one issue that arises with this implementation. If we choose our bounding box in one CRS and reproject it to a different CRS, the bounding box will be rotated, and the enclosing bounding box may be larger than the original. This can result in tensors of different size that cannot be concatenated without padding. The “Implementation” section describes how this problem can be avoided.
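To make the size mismatch concrete, here is a toy sketch that stands in for reprojection with a plain rotation about the box center. Real CRS transforms are not pure rotations, so this is only an assumption chosen to keep the example dependency-free; the function name and the 3 degree angle are hypothetical.

```python
import math


def enclosing_bbox_after_rotation(minx, maxx, miny, maxy, angle_deg):
    """Rotate the four corners about the box center and return the
    axis-aligned box (minx, maxx, miny, maxy) that encloses them."""
    cx, cy = (minx + maxx) / 2, (miny + maxy) / 2
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    xs, ys = [], []
    for x, y in [(minx, miny), (minx, maxy), (maxx, miny), (maxx, maxy)]:
        dx, dy = x - cx, y - cy
        xs.append(cx + dx * cos_t - dy * sin_t)
        ys.append(cy + dx * sin_t + dy * cos_t)
    return min(xs), max(xs), min(ys), max(ys)


# A 256 x 256 box, "reprojected" by a small rotation (toy stand-in).
minx, maxx, miny, maxy = enclosing_bbox_after_rotation(0, 256, 0, 256, angle_deg=3)
width, height = maxx - minx, maxy - miny
print(width, height)  # both are larger than the original 256
```

Because the enclosing box depends on the rotation, two files with different native CRSes would yield differently sized reads for the same query.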

Implementation

The details of what will need to be changed are as follows:

  1. GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}
  2. {Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}
  3. GeoDataset.__getitem__: instead of taking a BoundingBox(minx, maxx, miny, maxy, mint, maxt), accept a BoundingBox(x, y, t, width, height, time) and CRS, then project (x, y) to that CRS before sampling
  4. GeoSampler: look at the native CRS of all files in the R-tree and select the mode as the CRS to return
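Items 1 and 4 can be sketched together with a plain dictionary standing in for the R-tree. This reads item 4 as "take the mode over the files a query touches", which is only one possible interpretation; the index contents, `intersects`, and `select_crs` are hypothetical names for illustration, not TorchGeo APIs.

```python
from collections import Counter

# Hypothetical index: (minx, maxx, miny, maxy) -> [(filename, crs), ...]
index = {
    (0, 10, 0, 10): [("a.tif", "EPSG:32610")],
    (8, 18, 0, 10): [("b.tif", "EPSG:32611"), ("c.tif", "EPSG:32611")],
}


def intersects(a, b):
    """True if two (minx, maxx, miny, maxy) boxes overlap or touch."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def select_crs(index, query):
    """Return the most common native CRS among files intersecting `query`.

    Ties go to the CRS encountered first, since Counter.most_common keeps
    insertion order for equal counts.
    """
    hits = [
        crs
        for bbox, files in index.items()
        if intersects(bbox, query)
        for _, crs in files
    ]
    if not hits:
        raise ValueError("query does not intersect any file in the index")
    return Counter(hits).most_common(1)[0][0]


# The query overlaps both boxes: one file in 32610, two in 32611.
print(select_crs(index, (9, 12, 2, 5)))  # EPSG:32611
```

Choosing the mode means the majority of files are read in their native CRS, and only the minority need to be reprojected.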

By reprojecting only (x, y) instead of (minx, maxx, miny, maxy), we avoid issues with inconsistent tensor sizes during concatenation. The width/height can be applied directly in the new CRS without distortion.
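A minimal sketch of the center-plus-size idea follows. The `toy_project` function is an assumed stand-in for a real CRS transform (a rotation plus an offset), and `query_box` is a hypothetical helper; neither is part of TorchGeo.

```python
import math


def toy_project(x, y, angle_deg=3.0, dx=500000.0, dy=0.0):
    """Toy stand-in for a CRS transform: rotate, then shift."""
    t = math.radians(angle_deg)
    return (
        x * math.cos(t) - y * math.sin(t) + dx,
        x * math.sin(t) + y * math.cos(t) + dy,
    )


def query_box(x, y, width, height, project):
    """Project only the center (x, y), then apply width and height
    directly in the target CRS. Returns (minx, maxx, miny, maxy)."""
    px, py = project(x, y)
    return (px - width / 2, px + width / 2, py - height / 2, py + height / 2)


# The same 256 x 256 request, expressed in two different target CRSes.
native = query_box(1000.0, 2000.0, 256, 256, project=lambda x, y: (x, y))
other = query_box(1000.0, 2000.0, 256, 256, project=toy_project)

for minx, maxx, miny, maxy in (native, other):
    print(maxx - minx, maxy - miny)
```

The box lands at a different location in each CRS, but its extent is set by `width` and `height` alone, so every read covers the same nominal size.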

Note that the R-tree index will still use (minx, maxx, miny, maxy, mint, maxt) as its index, while __getitem__ will take a BoundingBox(x, y, t, width, height, time). This is quite different from the previous approach that used the former for both.

Alternatives

There have been a few other alternatives proposed, each with their own pros/cons:

  1. Always project to shared CRS—This is our current implementation. It can be slow and cause spatial distortion, but it’s simple and straightforward.
  2. Always sample in native CRS—This is the implementation we use for ChesapeakeCVPR. It is the fastest possible solution with the least distortion, but it disallows datasets with overlapping images in different CRS, so it’s somewhere between VisionDataset and GeoDataset. It also suffers from the aforementioned bug with tensor concatenation shape mismatches, although we could employ the same solution as proposed here to fix that.
  3. Add flag to control reprojection—We could add a flag that tells TorchGeo whether or not to reproject files. Unfortunately, this just punts the problem to the user and allows them to shoot themselves in the foot. This also results in more code paths to test and debug.
  4. Require single CRS—We could require all users to reproject their data themselves and ensure that all files are in the same CRS. Then we never have to worry about reprojection. This is similar to what Raster Vision does (from my understanding). Fast at loading time, but removes most of what makes TorchGeo cool. Also increases storage demands.

@RitwikGupta

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
adamjstewart commented, Mar 6, 2022

When we make this switch, we should also switch from interleaved=True to interleaved=False in our rtree index. The former is known to be buggy and actually discouraged by the rtree maintainers: https://github.com/Toblerity/rtree/issues/204#issuecomment-1060013098
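For context, rtree's `interleaved` flag controls coordinate order: with interleaved=True, coordinates are given as (minx, miny, ..., maxx, maxy, ...), while with interleaved=False they are given as (minx, maxx, miny, maxy, ...), which matches the (minx, maxx, miny, maxy, mint, maxt) layout used elsewhere in this issue. The helpers below are a dependency-free sketch of converting between the two orders; their names are hypothetical and they do not call rtree itself.

```python
def to_non_interleaved(coords):
    """(minx, miny, ..., maxx, maxy, ...) -> (minx, maxx, miny, maxy, ...)."""
    if len(coords) % 2:
        raise ValueError("expected an even number of coordinates")
    n = len(coords) // 2
    mins, maxs = coords[:n], coords[n:]
    out = []
    for lo, hi in zip(mins, maxs):
        out.extend((lo, hi))
    return tuple(out)


def to_interleaved(coords):
    """(minx, maxx, miny, maxy, ...) -> (minx, miny, ..., maxx, maxy, ...)."""
    if len(coords) % 2:
        raise ValueError("expected an even number of coordinates")
    return tuple(coords[0::2]) + tuple(coords[1::2])


# 3D example with x, y, and t dimensions.
interleaved = (0, 1, 2, 10, 11, 12)  # mins first, then maxes
print(to_non_interleaved(interleaved))  # (0, 10, 1, 11, 2, 12)
```

Any code that still builds coordinates in the interleaved order would need a conversion like this at the boundary when the index is switched over.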

1 reaction
adamjstewart commented, Feb 18, 2022

The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (requiring the user to specify the CRS to reproject to) means not changing anything. Let’s decide whether this proposal is the behavior we want before we worry about making any changes. I’m still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3.
