You can run this notebook in a live session or view it on Github.
Indexing / Selecting Large Data¶
Note: this feature is experimental!
When the dataset have chunked coordinates (dask arrays), Xoak may build the index and/or performs the selection in-parallel.
[1]:
import dask
import dask.array as da
import numpy as np
import xarray as xr
import xoak
xr.set_options(display_style='text');
Let’s first create an xarray.Dataset
of latitude / longitude points located randomly on the sphere, forming a 2-dimensional (x, y) model mesh. The array coordinates are split into 4 chunks of 250x250 elements each.
[2]:
shape = (500, 500)
chunk_shape = (250, 250)
lat = da.random.uniform(-90, 90, size=shape, chunks=chunk_shape)
lon = da.random.uniform(-180, 180, size=shape, chunks=chunk_shape)
field = lat + lon
ds_mesh = xr.Dataset(
coords={'lat': (('x', 'y'), lat), 'lon': (('x', 'y'), lon)},
data_vars={'field': (('x', 'y'), field)},
)
ds_mesh
[2]:
<xarray.Dataset> Dimensions: (x: 500, y: 500) Coordinates: lat (x, y) float64 dask.array<chunksize=(250, 250), meta=np.ndarray> lon (x, y) float64 dask.array<chunksize=(250, 250), meta=np.ndarray> Dimensions without coordinates: x, y Data variables: field (x, y) float64 dask.array<chunksize=(250, 250), meta=np.ndarray>
Xoak builds an index structure for each chunk (all coordinates must have the same chunks):
[3]:
ds_mesh.xoak.set_index(['lat', 'lon'], 'sklearn_geo_balltree')
# here returns a list of BallTree objects, one for each chunk
ds_mesh.xoak.index
[3]:
[<sklearn.neighbors._ball_tree.BallTree at 0x7f2234002150>,
<sklearn.neighbors._ball_tree.BallTree at 0x7f223c0027d0>,
<sklearn.neighbors._ball_tree.BallTree at 0x7f223401b250>,
<sklearn.neighbors._ball_tree.BallTree at 0x7f223c0953a0>]
Let’s create some query data points, which may also be chunked (here 2 chunks).
[4]:
shape = (100, 10)
chunk_shape = (50, 10)
ds_data = xr.Dataset({
'lat': (('y', 'x'), da.random.uniform(-90, 90, size=shape, chunks=chunk_shape)),
'lon': (('y', 'x'), da.random.uniform(-180, 180, size=shape, chunks=chunk_shape))
})
ds_data
[4]:
<xarray.Dataset> Dimensions: (y: 100, x: 10) Dimensions without coordinates: y, x Data variables: lat (y, x) float64 dask.array<chunksize=(50, 10), meta=np.ndarray> lon (y, x) float64 dask.array<chunksize=(50, 10), meta=np.ndarray>
Queries may be perfomed in parallel using Dask. Please note, however, that some indexes may not support multi-threads and/or multi-process parallelism.
[5]:
from dask.diagnostics import ProgressBar
with ProgressBar(), dask.config.set(scheduler='processes'):
ds_selection = ds_mesh.xoak.sel(lat=ds_data.lat, lon=ds_data.lon)
ds_selection
[########################################] | 100% Completed | 2.1s
[5]:
<xarray.Dataset> Dimensions: (y: 100, x: 10) Coordinates: lat (y, x) float64 dask.array<chunksize=(100, 10), meta=np.ndarray> lon (y, x) float64 dask.array<chunksize=(100, 10), meta=np.ndarray> Dimensions without coordinates: y, x Data variables: field (y, x) float64 dask.array<chunksize=(100, 10), meta=np.ndarray>
[ ]: