GeoDataset: avoid unnecessary reprojection
This issue is meant to serve as a proposal for how to fix #278. Please try to poke holes in this proposal and find ways in which it won’t work, or suggest alternative solutions.
Rationale
Geospatial data files are stored in a particular CRS. In order to build a single R-tree index of a geospatial dataset, the bounds of all files must be reprojected to the same CRS. During sampling, we also reproject the file itself to a common CRS so that we can stitch together multiple files from the same layer, or combine multiple layers in an IntersectionDataset.
Unfortunately, reprojection/resampling is very slow, and can negatively affect data loader performance (see section 4.2 of https://arxiv.org/abs/2111.08872). It can also lead to large distortions of images in polar regions. We would like to be able to avoid reprojection if possible.
Proposal
I propose to only reproject files during sampling when absolutely necessary. This includes any situation in which we sample from multiple files at once and not all files are in the same CRS, such as:
- a single GeoDataset containing overlapping files on the border of two UTM zones
- an IntersectionDataset with datasets in different CRSes
- a UnionDataset with overlapping files and datasets in different CRSes
The details of how this can be done are left to the “Implementation” section below, but the basic idea is to add an additional parameter to __getitem__ that contains the CRS to project to. The sampler is then responsible for setting this CRS in a way that minimizes reprojection.
There is one issue that arises with this implementation. If we choose our bounding box in one CRS and reproject it to a different CRS, the bounding box will be rotated, and the enclosing bounding box may be larger than the original. This can result in tensors of different size that cannot be concatenated without padding. The “Implementation” section describes how this problem can be avoided.
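The size mismatch described above can be illustrated with a small sketch. It uses a plain rotation as a crude stand-in for reprojection (a real reprojection is not a rigid rotation and would go through a projection library), and the function names, the 3 degree angle, and the 10 m resolution are all assumptions chosen for illustration.

```python
import math


def enclosing_bbox_after_rotation(minx, maxx, miny, maxy, angle_deg):
    """Rotate a box about its center and return the axis-aligned box that encloses it.

    Rotation is only a stand-in for reprojection here (assumption for illustration).
    """
    cx, cy = (minx + maxx) / 2, (miny + maxy) / 2
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    xs, ys = [], []
    for x, y in [(minx, miny), (minx, maxy), (maxx, miny), (maxx, maxy)]:
        dx, dy = x - cx, y - cy
        xs.append(cx + dx * cos_t - dy * sin_t)
        ys.append(cy + dx * sin_t + dy * cos_t)
    return min(xs), max(xs), min(ys), max(ys)


def pixel_shape(minx, maxx, miny, maxy, res):
    """Return (rows, cols) of a raster covering the box at the given resolution."""
    return round((maxy - miny) / res), round((maxx - minx) / res)


# A 2560 m x 2560 m query at 10 m resolution is 256 x 256 pixels.
original = (0.0, 2560.0, 0.0, 2560.0)
rotated = enclosing_bbox_after_rotation(*original, angle_deg=3.0)

shape_a = pixel_shape(*original, res=10.0)
shape_b = pixel_shape(*rotated, res=10.0)
print(shape_a, shape_b)  # the second shape is larger than 256 x 256
```

Two samples with shapes like these cannot be concatenated into one tensor without padding or cropping, which is the problem the “Implementation” section is meant to avoid.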
Implementation
The details of what will need to be changed are as follows:
- GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}
- {Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}
- GeoDataset.__getitem__: instead of taking a BoundingBox(minx, maxx, miny, maxy, mint, maxt), accept a BoundingBox(x, y, t, width, height, time) and CRS, then project (x, y) to that CRS before sampling
- GeoSampler: look at the native CRS of all files in the R-tree and select the mode as the CRS to return
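The index layout and the sampler's mode selection listed above could be sketched roughly as follows. This is not TorchGeo code: a plain dict stands in for the R-tree, and the file names, EPSG codes, and helper names (overlaps, files_hit, choose_crs) are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the proposed index layout: {bbox: [(filename, CRS), ...]}.
# Each bbox is (minx, maxx, miny, maxy, mint, maxt) in the shared index CRS.
index = {
    (0, 10, 0, 10, 0, 1): [("a.tif", "EPSG:32610")],
    (8, 18, 0, 10, 0, 1): [("b.tif", "EPSG:32610"), ("c.tif", "EPSG:32611")],
}


def overlaps(a, b):
    """True if two (minx, maxx, miny, maxy, mint, maxt) boxes intersect on every axis."""
    return all(a[i] <= b[i + 1] and b[i] <= a[i + 1] for i in (0, 2, 4))


def files_hit(index, query):
    """Collect the (filename, CRS) pairs of every index entry that intersects the query."""
    return [entry for bbox, entries in index.items() if overlaps(bbox, query) for entry in entries]


def choose_crs(hits):
    """Pick the most common native CRS (the mode) among the hit files."""
    counts = Counter(crs for _, crs in hits)
    return counts.most_common(1)[0][0]


query = (9, 12, 2, 4, 0, 1)
hits = files_hit(index, query)
target_crs = choose_crs(hits)
# Only files whose native CRS differs from the chosen one would need reprojection.
to_reproject = [name for name, crs in hits if crs != target_crs]
print(target_crs, to_reproject)
```

Choosing the mode means the majority of files are read in their native CRS, so only the minority pays the reprojection cost. Whether the mode should be computed per query or once over the whole index is left open in this sketch.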
By reprojecting only (x, y) instead of (minx, maxx, miny, maxy), we avoid issues with inconsistent tensor sizes during concatenation. The width/height can be applied directly in the new CRS without distortion.
Note that the R-tree index will still use (minx, maxx, miny, maxy, mint, maxt) as its index, while __getitem__ will take a BoundingBox(x, y, t, width, height, time). This is quite different from the previous approach that used the former for both.
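A minimal sketch of how the two representations might relate is shown below. It assumes that (x, y, t) marks the center of the query, which the proposal does not state explicitly, and it uses a hypothetical transform callable in place of a real CRS transformation. CenterQuery and to_bounds are illustrative names, not existing TorchGeo APIs.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Bounds = Tuple[float, float, float, float, float, float]


@dataclass(frozen=True)
class CenterQuery:
    """Query in the proposed (x, y, t, width, height, time) form.

    Assumption: (x, y, t) is the center of the query, not a corner.
    """
    x: float
    y: float
    t: float
    width: float
    height: float
    time: float


def to_bounds(
    query: CenterQuery,
    transform: Callable[[float, float], Tuple[float, float]] = lambda x, y: (x, y),
) -> Bounds:
    """Project only the center point, then apply width/height in the target CRS.

    Returns (minx, maxx, miny, maxy, mint, maxt), the form used by the index.
    """
    x, y = transform(query.x, query.y)
    return (
        x - query.width / 2,
        x + query.width / 2,
        y - query.height / 2,
        y + query.height / 2,
        query.t - query.time / 2,
        query.t + query.time / 2,
    )


def fake_transform(x: float, y: float) -> Tuple[float, float]:
    """Hypothetical stand-in for a real CRS transformation (a simple shift)."""
    return x + 1000.0, y - 500.0


q = CenterQuery(x=100.0, y=200.0, t=5.0, width=64.0, height=32.0, time=2.0)
native = to_bounds(q)
projected = to_bounds(q, fake_transform)
print(native)
print(projected)
```

Because only a single point passes through the transform, the resulting box has the same width and height in every target CRS, so samples drawn this way have a consistent shape. This sketch also assumes the source and target CRS share the same units; mixing meters and degrees would need extra handling.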
Alternatives
There have been a few other alternatives proposed, each with their own pros/cons:
- Always project to shared CRS: This is our current implementation. It can be slow and cause spatial distortion, but it’s simple and straightforward.
- Always sample in native CRS: This is the implementation we use for ChesapeakeCVPR. It is the fastest possible solution with the least distortion, but it disallows datasets with overlapping images in different CRSes, so it’s somewhere between VisionDataset and GeoDataset. It also suffers from the aforementioned bug with tensor concatenation shape mismatches, although we could employ the same solution as proposed here to fix that.
- Add flag to control reprojection: We could add a flag that tells TorchGeo whether or not to reproject files. Unfortunately, this just punts the problem to the user and allows them to shoot themselves in the foot. This also results in more code paths to test and debug.
- Require single CRS: We could require all users to reproject their data themselves and ensure that all files are in the same CRS. Then we never have to worry about reprojection. This is similar to what Raster Vision does (from my understanding). It is fast at loading time, but it removes most of what makes TorchGeo cool and increases storage demands.
Top GitHub Comments
When we make this switch, we should also switch from interleaved=True to interleaved=False in our rtree index. The former is known to be buggy and actually discouraged by the rtree maintainers: https://github.com/Toblerity/rtree/issues/204#issuecomment-1060013098
The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (not changing anything, requiring the user to specify the CRS to reproject to) means not changing anything. Let’s decide if this proposal is the behavior we want to achieve before we worry about making any changes. I’m still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3.