User Manual

Introduction

dCacheFS provides a file-system interface for a dCache storage system, such as the instance provided at SURF. dCacheFS builds on the Filesystem Spec (fsspec) library and can be used either as a standalone library or through the more general fsspec functions.

We consider here a dCache instance with a project space set up as follows (test sits in the project root directory):

test
├── empty_testdir
├── testdir_1
│   ├── file_1.txt
│   └── file_2.txt
└── testdir_2
    ├── file_1.txt
    └── file_2.txt

dCacheFileSystem

The main access point to the functionality of dCacheFS is its dCacheFileSystem class. Here, we instantiate a dCacheFileSystem object, providing authentication credentials (a macaroon for bearer-token authentication) and the URL where the dCache API can be reached (https://dcacheview.grid.surfsara.nl:22880/api/v1 for the SURF dCache instance):

[1]:
from dcachefs import dCacheFileSystem

api_url = 'https://dcacheview.grid.surfsara.nl:22880/api/v1'

# read authentication token
with open('macaroon.dat') as f:
    token = f.read().strip()

fs = dCacheFileSystem(api_url=api_url, token=token)

dCacheFS implements the following methods via the dCache API:

[2]:
fs.ls('/test/testdir_1', detail=False)
[2]:
['/test/testdir_1/file_2.txt', '/test/testdir_1/file_1.txt']
[3]:
fs.ls('/test/testdir_1/file_1.txt')
[3]:
[{'name': '/test/testdir_1/file_1.txt',
  'size': 12,
  'type': 'file',
  'created': datetime.datetime(2022, 9, 14, 9, 28, 55, 841000),
  'modified': datetime.datetime(2022, 9, 14, 9, 28, 55, 881000)}]
[4]:
fs.ls('/test/empty_testdir')
[4]:
[]
[5]:
fs.info('/test/testdir_1/')
[5]:
{'name': '/test/testdir_1/',
 'size': 512,
 'type': 'directory',
 'created': datetime.datetime(2022, 9, 14, 9, 28, 55, 660000),
 'modified': datetime.datetime(2022, 9, 14, 9, 28, 56, 62000)}
[6]:
fs.mv('/test/testdir_2/file_1.txt', '/test/testdir_2/file_renamed.txt')
fs.ls('/test/testdir_2/', detail=False)
[6]:
['/test/testdir_2/file_2.txt', '/test/testdir_2/file_renamed.txt']
[7]:
fs.rm('/test/testdir_2/file_2.txt')
fs.ls('/test/testdir_2/', detail=False)
[7]:
['/test/testdir_2/file_renamed.txt']
[8]:
fs.exists('/test/testdir_1/file_1.txt')
[8]:
True
[9]:
fs.exists('/test/testdir_1/nonexistent_file.txt')
[9]:
False
[10]:
fs.isfile('/test/testdir_1/file_1.txt')
[10]:
True
[11]:
fs.isdir('/test/testdir_1/')
[11]:
True
[12]:
fs.created('/test/testdir_1/file_1.txt')
[12]:
datetime.datetime(2022, 9, 14, 9, 28, 55, 841000)
[13]:
fs.modified('/test/testdir_1/file_1.txt')
[13]:
datetime.datetime(2022, 9, 14, 9, 28, 55, 881000)
[14]:
fs.size('/test/testdir_1/file_1.txt')
[14]:
12
[15]:
fs.glob('/test/testdir_*/file_*.txt')
[15]:
['/test/testdir_1/file_1.txt',
 '/test/testdir_1/file_2.txt',
 '/test/testdir_2/file_renamed.txt']
[16]:
for root, _, _ in fs.walk('/test'):
    print(root)
/test
/test/testdir_2
/test/empty_testdir
/test/testdir_1
[17]:
fs.find('/test')
[17]:
['/test/testdir_1/file_1.txt',
 '/test/testdir_1/file_2.txt',
 '/test/testdir_2/file_renamed.txt']
[18]:
fs.du('/test') # bytes
[18]:
36
[19]:
fs.checksum('/test/testdir_1/file_1.txt')
[19]:
124669634392728841921114151018813876011
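
To make the pattern used in the fs.glob call above more concrete, the following sketch reproduces its selection offline, on a plain list of the paths present in the example project space at this point. The matches helper is hypothetical and only illustrates shell-style matching in which a * never crosses a directory separator; it is not the dCacheFS or fsspec implementation.

```python
from fnmatch import fnmatchcase

# Paths present in the example project space after the mv and rm calls above
paths = [
    "/test/empty_testdir",
    "/test/testdir_1",
    "/test/testdir_1/file_1.txt",
    "/test/testdir_1/file_2.txt",
    "/test/testdir_2",
    "/test/testdir_2/file_renamed.txt",
]


def matches(path, pattern):
    """Match segment by segment, so that '*' never crosses a '/'."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(p, q) for p, q in zip(path_parts, pattern_parts))


selected = [p for p in paths if matches(p, "/test/testdir_*/file_*.txt")]
print(selected)
```

Directories and the empty directory are left out because they have fewer path segments than the pattern.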

The following methods read files from or write files to dCache, which requires communication via dCache’s WebDAV door. Thus, the WebDAV door URL needs to be specified via a separate input argument when instantiating the dCacheFileSystem object:

[20]:
webdav_url = 'https://webdav.grid.surfsara.nl:2880'

fs = dCacheFileSystem(api_url=api_url, webdav_url=webdav_url, token=token)

fs.cat('/test/testdir_1/file_1.txt')
[20]:
b'Hello world!'
[21]:
local_path = './file.txt'
fs.download('/test/testdir_1/file_1.txt', local_path)

# check local copy
!cat $local_path
Hello world!
[22]:
remote_path = '/test/testdir_2/file_uploaded.txt'
fs.upload(local_path, remote_path)

# check remote copy
fs.cat(remote_path)
[22]:
b'Hello world!'
[23]:
with fs.open('/test/testdir_1/file_1.txt', 'rb') as f:
    content = f.read()
content
[23]:
b'Hello world!'
[24]:
path = '/test/testdir_2/file_written.txt'
with fs.open(path, 'wb') as f:
    f.write(b'Hello world!')
fs.cat(path)
[24]:
b'Hello world!'

Usage via fsspec

Once imported, dcachefs registers itself as the fsspec implementation for the “dcache” protocol. This means that all fsspec functions called on URL-paths of the following form will be handled by dCacheFileSystem:

dcache://path/to/file/or/dir

Parameters such as the authentication token (macaroon) and the API and WebDAV door URLs can be specified as input arguments to all fsspec functions:

[25]:
import fsspec

uri = 'dcache://test/testdir_1/file_1.txt'

with fsspec.open(uri, token=token, api_url=api_url, webdav_url=webdav_url) as f:
    content = f.read()
content
[25]:
b'Hello world!'
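
Judging from the examples in this manual, the URI in the cell above addresses the same file as the path /test/testdir_1/file_1.txt used with the dCacheFileSystem object earlier. The sketch below spells out that mapping. The split_url helper is hypothetical and written only for illustration; it should not be read as the way dcachefs or fsspec actually parse URLs.

```python
def split_url(url):
    """Split 'protocol://rest' into the protocol and an absolute path.

    Hypothetical helper: it mirrors the mapping seen in this manual,
    not the actual dcachefs code.
    """
    protocol, sep, rest = url.partition("://")
    if not sep:
        # no protocol prefix: treat the input as a plain path
        return None, url
    return protocol, "/" + rest.lstrip("/")


protocol, path = split_url("dcache://test/testdir_1/file_1.txt")
print(protocol, path)
```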

It can be handy, however, to save these input parameters in an fsspec configuration file, by default located in the directory ${HOME}/.config/fsspec/, so that they are always passed as input arguments when the “dcache” protocol is encountered (see the section of the fsspec documentation on configuration).
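
As a minimal sketch of what such a configuration file could contain, the snippet below writes a JSON file whose top-level key is the protocol name and whose values are the keyword arguments to pass along. The file name dcache.json and the token placeholder are assumptions made for illustration, and the file is written to a temporary directory rather than to the real configuration directory.

```python
import json
import tempfile
from pathlib import Path

# Keyword arguments to associate with the "dcache" protocol.
# The token value is a placeholder, not a real macaroon.
config = {
    "dcache": {
        "api_url": "https://dcacheview.grid.surfsara.nl:22880/api/v1",
        "webdav_url": "https://webdav.grid.surfsara.nl:2880",
        "token": "<contents of macaroon.dat>",
    }
}

# Written to a temporary directory here so that nothing in ${HOME} is touched;
# for real use the target would be ${HOME}/.config/fsspec/.
config_dir = Path(tempfile.mkdtemp())
config_file = config_dir / "dcache.json"  # hypothetical file name
config_file.write_text(json.dumps(config, indent=2))

print(config_file.read_text())
```

With a file of this shape in the configuration directory, calls on “dcache” URL-paths should no longer need the token and URL arguments spelled out each time.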

One can also temporarily register the dCacheFileSystem class as the default implementation for handling URL-paths that include a different protocol. This is useful, for instance, when dealing with file paths that include the WebDAV door URL. The register_implementation context manager serves this purpose:

[26]:
from dcachefs import register_implementation

# URI including the WebDAV door
uri = 'https://webdav.grid.surfsara.nl:2880/test/testdir_1/file_1.txt'

with register_implementation(protocol='https'):
    # no need to specify here the webdav_url - it's already part of the URI
    with fsspec.open(uri, token=token, api_url=api_url) as f:
        content = f.read()
content
[26]:
b'Hello world!'
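
The word "temporarily" matters here: outside the with block, the “https” protocol is expected to go back to whatever implementation handled it before. The following sketch illustrates that pattern with a plain dictionary standing in for a protocol registry. The registry, the temporary_implementation function and the string values are all hypothetical; this is not how dcachefs implements register_implementation.

```python
from contextlib import contextmanager

# Stand-in for a protocol -> implementation registry (hypothetical, not fsspec's own)
registry = {"https": "HTTPFileSystem"}


@contextmanager
def temporary_implementation(protocol, implementation):
    """Register `implementation` for `protocol`, restoring the old state on exit."""
    missing = object()
    previous = registry.get(protocol, missing)
    registry[protocol] = implementation
    try:
        yield
    finally:
        # undo the registration even if the body raised an exception
        if previous is missing:
            del registry[protocol]
        else:
            registry[protocol] = previous


with temporary_implementation("https", "dCacheFileSystem"):
    inside = registry["https"]
after = registry["https"]
print(inside, after)
```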

Dask, Zarr and Xarray

The fsspec library is used internally by Dask, which allows it to read and write data from a variety of data stores. After importing dCacheFS, Dask functions can also read and write data from a dCache storage instance:

[27]:
import dask.bag as db

bag = db.read_text(
    'dcache://test/testdir*/file*.txt',
    storage_options=dict(
        api_url=api_url,
        token=token,
        webdav_url=webdav_url,
    )
)

bag.take(1)
[27]:
('Hello world!',)

Note that the storage options passed here as input arguments to .read_text() can be provided via a fsspec configuration file.

fsspec also provides the functionality to create an interface that is compatible with the Zarr library:

[28]:
import zarr

fs_map = fsspec.get_mapper(
    'dcache://test/store.zarr',
    token=token,
    api_url=api_url,
    webdav_url=webdav_url
)

root = zarr.open(fs_map, mode='w')
myarray = root.zeros('myarray', shape=(1000, 1000), chunks=(100, 100))
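
Zarr typically stores each chunk of an array under its own key in the store, so the chunk shape chosen above determines how many objects can end up in dCache once data is written. The short calculation below, with a hypothetical chunk_grid helper, works this out for the shape and chunks of myarray.

```python
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


grid = chunk_grid(shape=(1000, 1000), chunks=(100, 100))
n_chunks = math.prod(grid)
print(grid, n_chunks)
```

Larger chunks mean fewer but bigger objects, which is a trade-off worth considering for a remote store.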

The same interface can also be used to read and write labeled multi-dimensional arrays in Zarr format via Xarray:

[29]:
import dask.array as da
import pandas as pd
import xarray as xr
import numpy as np

# coordinates
x = np.arange(1000)
y = np.arange(1000)
time = pd.date_range('2022-01-01', periods=5)

# variable
temperature = da.random.random((1000, 1000, 5), chunks=(500, 500, 1))

# create dataset
ds = xr.Dataset(
    data_vars=dict(
        temperature=(['x', 'y', 'time'], temperature),
    ),
    coords=dict(x=x, y=y, time=time)
)
print(ds)

# create interface for Zarr
fs_map = fsspec.get_mapper(
    'dcache://test/temperature.zarr',
    token=token,
    api_url=api_url,
    webdav_url=webdav_url
)

# save dataset to dCache
store = ds.to_zarr(fs_map)
<xarray.Dataset>
Dimensions:      (x: 1000, y: 1000, time: 5)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * y            (y) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * time         (time) datetime64[ns] 2022-01-01 2022-01-02 ... 2022-01-05
Data variables:
    temperature  (x, y, time) float64 dask.array<chunksize=(500, 500, 1), meta=np.ndarray>

Other (geospatial) libraries

dCacheFS can be used in combination with many other libraries that do not directly interface with fsspec, provided that they support reading from and writing to file-like objects. In the field of geospatial data analysis, libraries like rasterio, geopandas, h5netcdf, and laspy all support reading from this kind of source, so that one can open and load data from dCache as shown in the following pseudo-code cell:

[ ]:
import rasterio

with fsspec.open('dcache://path/to/file', ...) as f:
    with rasterio.open(f) as fi:
        ...
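
The same pattern can be tried without access to a dCache instance. In the sketch below, an in-memory io.BytesIO buffer stands in for the binary file object that fsspec.open would yield, and the standard-library json module plays the role of the consuming library. Both substitutions are assumptions made for illustration only.

```python
import io
import json

# In-memory stand-in for the binary file object yielded by fsspec.open(...)
remote_file = io.BytesIO(b'{"greeting": "Hello world!"}')

# json.load only needs an object with a .read() method, not a local path
with remote_file as f:
    record = json.load(f)

print(record["greeting"])
```

Any library that accepts an object with read (and, where needed, seek) methods instead of a file name can be combined with dCacheFS in this way.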