Binder

You can run this notebook in a live session or view it on Github.

Indexing / Selecting Large Data

Note: this feature is experimental!

When the dataset have chunked coordinates (dask arrays), Xoak may build the index and/or performs the selection in-parallel.

[1]:
import dask
import dask.array as da
import numpy as np
import xarray as xr
import xoak

xr.set_options(display_style='text');

Let’s first create an xarray.Dataset of latitude / longitude points located randomly on the sphere, forming a 2-dimensional (x, y) model mesh. The array coordinates are split into 4 chunks of 250x250 elements each.

[2]:
shape = (500, 500)
chunk_shape = (250, 250)

lat = da.random.uniform(-90, 90, size=shape, chunks=chunk_shape)
lon = da.random.uniform(-180, 180, size=shape, chunks=chunk_shape)

field = lat + lon

ds_mesh = xr.Dataset(
    coords={'lat': (('x', 'y'), lat), 'lon': (('x', 'y'), lon)},
    data_vars={'field': (('x', 'y'), field)},
)

ds_mesh
[2]:
<xarray.Dataset>
Dimensions:  (x: 500, y: 500)
Coordinates:
    lat      (x, y) float64 dask.array<chunksize=(250, 250), meta=np.ndarray>
    lon      (x, y) float64 dask.array<chunksize=(250, 250), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    field    (x, y) float64 dask.array<chunksize=(250, 250), meta=np.ndarray>

Xoak builds an index structure for each chunk (all coordinates must have the same chunks):

[3]:
ds_mesh.xoak.set_index(['lat', 'lon'], 'sklearn_geo_balltree')

# here returns a list of BallTree objects, one for each chunk
ds_mesh.xoak.index
[3]:
[<sklearn.neighbors._ball_tree.BallTree at 0x7f2234002150>,
 <sklearn.neighbors._ball_tree.BallTree at 0x7f223c0027d0>,
 <sklearn.neighbors._ball_tree.BallTree at 0x7f223401b250>,
 <sklearn.neighbors._ball_tree.BallTree at 0x7f223c0953a0>]

Let’s create some query data points, which may also be chunked (here 2 chunks).

[4]:
shape = (100, 10)
chunk_shape = (50, 10)

ds_data = xr.Dataset({
    'lat': (('y', 'x'), da.random.uniform(-90, 90, size=shape, chunks=chunk_shape)),
    'lon': (('y', 'x'), da.random.uniform(-180, 180, size=shape, chunks=chunk_shape))
})

ds_data
[4]:
<xarray.Dataset>
Dimensions:  (y: 100, x: 10)
Dimensions without coordinates: y, x
Data variables:
    lat      (y, x) float64 dask.array<chunksize=(50, 10), meta=np.ndarray>
    lon      (y, x) float64 dask.array<chunksize=(50, 10), meta=np.ndarray>

Queries may be perfomed in parallel using Dask. Please note, however, that some indexes may not support multi-threads and/or multi-process parallelism.

[5]:
from dask.diagnostics import ProgressBar

with ProgressBar(), dask.config.set(scheduler='processes'):
    ds_selection = ds_mesh.xoak.sel(lat=ds_data.lat, lon=ds_data.lon)

ds_selection
[########################################] | 100% Completed |  2.1s
[5]:
<xarray.Dataset>
Dimensions:  (y: 100, x: 10)
Coordinates:
    lat      (y, x) float64 dask.array<chunksize=(100, 10), meta=np.ndarray>
    lon      (y, x) float64 dask.array<chunksize=(100, 10), meta=np.ndarray>
Dimensions without coordinates: y, x
Data variables:
    field    (y, x) float64 dask.array<chunksize=(100, 10), meta=np.ndarray>
[ ]: