User Manual
Introduction
dCacheFS provides a file-system interface for a dCache storage system, such as the instance provided at SURF. dCacheFS builds on the Filesystem Spec (fsspec) library, and it can be used either as an independent library or via the more general fsspec functions.
We consider here a dCache instance with a project space set up in the following way (test is in the project root directory):
test
├── empty_testdir
├── testdir_1
│   ├── file_1.txt
│   └── file_2.txt
└── testdir_2
    ├── file_1.txt
    └── file_2.txt
dCacheFileSystem
The main access point to the functionality of dCacheFS is its dCacheFileSystem class. Here, we instantiate a dCacheFileSystem object, providing authentication credentials (a macaroon for bearer-token authentication) and the URL where the dCache API can be reached (https://dcacheview.grid.surfsara.nl:22880/api/v1 for the SURF dCache instance):
[1]:
from dcachefs import dCacheFileSystem
api_url = 'https://dcacheview.grid.surfsara.nl:22880/api/v1'
# read authentication token
with open('macaroon.dat') as f:
    token = f.read().strip()
fs = dCacheFileSystem(api_url=api_url, token=token)
dCacheFS implements the following methods via the dCache API:
[2]:
fs.ls('/test/testdir_1', detail=False)
[2]:
['/test/testdir_1/file_2.txt', '/test/testdir_1/file_1.txt']
[3]:
fs.ls('/test/testdir_1/file_1.txt')
[3]:
[{'name': '/test/testdir_1/file_1.txt',
'size': 12,
'type': 'file',
'created': datetime.datetime(2022, 9, 14, 9, 28, 55, 841000),
'modified': datetime.datetime(2022, 9, 14, 9, 28, 55, 881000)}]
[4]:
fs.ls('/test/empty_testdir')
[4]:
[]
[5]:
fs.info('/test/testdir_1/')
[5]:
{'name': '/test/testdir_1/',
'size': 512,
'type': 'directory',
'created': datetime.datetime(2022, 9, 14, 9, 28, 55, 660000),
'modified': datetime.datetime(2022, 9, 14, 9, 28, 56, 62000)}
[6]:
fs.mv('/test/testdir_2/file_1.txt', '/test/testdir_2/file_renamed.txt')
fs.ls('/test/testdir_2/', detail=False)
[6]:
['/test/testdir_2/file_2.txt', '/test/testdir_2/file_renamed.txt']
[7]:
fs.rm('/test/testdir_2/file_2.txt')
fs.ls('/test/testdir_2/', detail=False)
[7]:
['/test/testdir_2/file_renamed.txt']
[8]:
fs.exists('/test/testdir_1/file_1.txt')
[8]:
True
[9]:
fs.exists('/test/testdir_1/nonexistent_file.txt')
[9]:
False
[10]:
fs.isfile('/test/testdir_1/file_1.txt')
[10]:
True
[11]:
fs.isdir('/test/testdir_1/')
[11]:
True
[12]:
fs.created('/test/testdir_1/file_1.txt')
[12]:
datetime.datetime(2022, 9, 14, 9, 28, 55, 841000)
[13]:
fs.modified('/test/testdir_1/file_1.txt')
[13]:
datetime.datetime(2022, 9, 14, 9, 28, 55, 881000)
[14]:
fs.size('/test/testdir_1/file_1.txt')
[14]:
12
[15]:
fs.glob('/test/testdir_*/file_*.txt')
[15]:
['/test/testdir_1/file_1.txt',
'/test/testdir_1/file_2.txt',
'/test/testdir_2/file_renamed.txt']
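The matching rule behind the pattern above can be tried offline. The following is a minimal sketch, assuming that (as in fsspec's glob) a single * does not cross a directory separator. The helper matches_glob and the candidate list are hypothetical and only for illustration; this is not how dCacheFS implements glob.

```python
from fnmatch import fnmatchcase


def matches_glob(path, pattern):
    """Segment-wise wildcard match: '*' never matches across '/'.

    Hypothetical helper for illustration only; '**' is not supported.
    """
    path_parts = path.split('/')
    pattern_parts = pattern.split('/')
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(p, q) for p, q in zip(path_parts, pattern_parts))


# paths present in the example project space after the mv/rm calls above
candidates = [
    '/test/empty_testdir',
    '/test/testdir_1/file_1.txt',
    '/test/testdir_1/file_2.txt',
    '/test/testdir_2/file_renamed.txt',
]
pattern = '/test/testdir_*/file_*.txt'
matched = [p for p in candidates if matches_glob(p, pattern)]
```

With these candidates, matched contains the same three file paths returned by fs.glob, while the empty directory is left out because it has fewer path segments than the pattern.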
[16]:
for root, _, _ in fs.walk('/test'):
    print(root)
/test
/test/testdir_2
/test/empty_testdir
/test/testdir_1
[17]:
fs.find('/test')
[17]:
['/test/testdir_1/file_1.txt',
'/test/testdir_1/file_2.txt',
'/test/testdir_2/file_renamed.txt']
[18]:
fs.du('/test') # bytes
[18]:
36
[19]:
fs.checksum('/test/testdir_1/file_1.txt')
[19]:
124669634392728841921114151018813876011
The following methods read files from, or write files to, dCache, which requires communication via dCache’s WebDAV door. Thus, the WebDAV door URL needs to be specified via a separate input argument when instantiating the dCacheFileSystem object:
[20]:
webdav_url = 'https://webdav.grid.surfsara.nl:2880'
fs = dCacheFileSystem(api_url=api_url, webdav_url=webdav_url, token=token)
fs.cat('/test/testdir_1/file_1.txt')
[20]:
b'Hello world!'
[21]:
local_path = './file.txt'
fs.download('/test/testdir_1/file_1.txt', local_path)
# check local copy
!cat $local_path
Hello world!
[22]:
remote_path = '/test/testdir_2/file_uploaded.txt'
fs.upload(local_path, remote_path)
# check remote copy
fs.cat(remote_path)
[22]:
b'Hello world!'
[23]:
with fs.open('/test/testdir_1/file_1.txt', 'rb') as f:
    content = f.read()
content
[23]:
b'Hello world!'
[24]:
path = '/test/testdir_2/file_written.txt'
with fs.open(path, 'wb') as f:
    f.write(b'Hello world!')
fs.cat(path)
[24]:
b'Hello world!'
Usage via fsspec
Once imported, dcachefs registers itself as the fsspec implementation for the “dcache” protocol. This means that all fsspec functions called on URL-paths of the following form are handled by dCacheFileSystem:
dcache://path/to/file/or/dir
Parameters such as the authentication token (macaroon) and the API and WebDAV door URLs can be specified as input arguments to all fsspec functions:
[25]:
import fsspec
uri = 'dcache://test/testdir_1/file_1.txt'
with fsspec.open(uri, token=token, api_url=api_url, webdav_url=webdav_url) as f:
    content = f.read()
content
[25]:
b'Hello world!'
It can be handy, however, to save these input parameters in an fsspec configuration file, by default in the directory ${HOME}/.config/fsspec/, so that they are always passed as input arguments when the “dcache” protocol is encountered (see the section of the fsspec documentation on configuration).
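As a minimal sketch of such a configuration file, the cell below writes a JSON file in which the top-level key is the protocol name and its value holds the keyword arguments, which is the layout described in the fsspec documentation. The helper write_fsspec_config, the file name dcache.json, and the placeholder token are assumptions made for illustration; the file is written to a temporary directory rather than to the real configuration directory.

```python
import json
import tempfile
from pathlib import Path


def write_fsspec_config(config_dir, api_url, webdav_url, token):
    """Write a JSON file with default arguments for the "dcache" protocol.

    Hypothetical helper for illustration only.
    """
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    config = {
        'dcache': {
            'api_url': api_url,
            'webdav_url': webdav_url,
            'token': token,
        }
    }
    config_path = config_dir / 'dcache.json'  # file name chosen for illustration
    config_path.write_text(json.dumps(config, indent=2))
    config_path.chmod(0o600)  # the file holds a secret: restrict access to the owner
    return config_path


# written to a temporary directory here; the default location would be
# Path.home() / '.config' / 'fsspec'
config_path = write_fsspec_config(
    tempfile.mkdtemp(),
    api_url='https://dcacheview.grid.surfsara.nl:22880/api/v1',
    webdav_url='https://webdav.grid.surfsara.nl:2880',
    token='<macaroon>',  # placeholder, not a real token
)
print(config_path.read_text())
```

Since the macaroon grants access to the storage, a configuration file that contains it should be protected like any other credential.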
One can also temporarily register the dCacheFileSystem class as the default implementation for handling URL-paths that include a different protocol. This is useful, for instance, when dealing with file paths that include the WebDAV door URL. For this purpose, one can use the register_implementation context manager:
[26]:
from dcachefs import register_implementation
# URI including the WebDAV door
uri = 'https://webdav.grid.surfsara.nl:2880/test/testdir_1/file_1.txt'
with register_implementation(protocol='https'):
    # no need to specify the webdav_url here - it's already part of the URI
    with fsspec.open(uri, token=token, api_url=api_url) as f:
        content = f.read()
content
[26]:
b'Hello world!'
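To see why the WebDAV door URL does not need to be passed separately in this case, one can split such a URI with the Python standard library. This is only an illustrative sketch: split_webdav_uri is a hypothetical helper, and dCacheFS may extract these parts differently internally.

```python
from urllib.parse import urlsplit


def split_webdav_uri(uri):
    """Split a full WebDAV URI into (door URL, path on dCache).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    door_url = f'{parts.scheme}://{parts.netloc}'
    return door_url, parts.path


door_url, path = split_webdav_uri(
    'https://webdav.grid.surfsara.nl:2880/test/testdir_1/file_1.txt'
)
print(door_url)
print(path)
```

The first part corresponds to the webdav_url argument used earlier, and the second to the path that was passed to the dCacheFileSystem methods.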
Dask, Zarr and Xarray
The fsspec library is used internally by Dask, which enables it to read and write data from a variety of data stores. After importing dCacheFS, Dask functions can also read and write data from a dCache storage instance:
[27]:
import dask.bag as db
bag = db.read_text(
    'dcache://test/testdir*/file*.txt',
    storage_options=dict(
        api_url=api_url,
        token=token,
        webdav_url=webdav_url,
    )
)
bag.take(1)
[27]:
('Hello world!',)
Note that the storage options passed here as input arguments to .read_text() can be provided via an fsspec configuration file.
fsspec also provides the functionality to create an interface that is compatible with the Zarr library:
[28]:
import zarr
fs_map = fsspec.get_mapper(
    'dcache://test/store.zarr',
    token=token,
    api_url=api_url,
    webdav_url=webdav_url
)
root = zarr.open(fs_map, mode='w')
myarray = root.zeros('myarray', shape=(1000, 1000), chunks=(100, 100))
The same interface can also be used to read and write labeled multi-dimensional arrays in Zarr format via Xarray:
[29]:
import dask.array as da
import pandas as pd
import xarray as xr
import numpy as np
# coordinates
x = np.arange(1000)
y = np.arange(1000)
time = pd.date_range('2022-01-01', periods=5)
# variable
temperature = da.random.random((1000, 1000, 5), chunks=(500, 500, 1))
# create dataset
ds = xr.Dataset(
    data_vars=dict(
        temperature=(['x', 'y', 'time'], temperature),
    ),
    coords=dict(x=x, y=y, time=time)
)
print(ds)
# create interface for Zarr
fs_map = fsspec.get_mapper(
    'dcache://test/temperature.zarr',
    token=token,
    api_url=api_url,
    webdav_url=webdav_url
)
# save dataset to dCache
store = ds.to_zarr(fs_map)
<xarray.Dataset>
Dimensions:      (x: 1000, y: 1000, time: 5)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * y            (y) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * time         (time) datetime64[ns] 2022-01-01 2022-01-02 ... 2022-01-05
Data variables:
    temperature  (x, y, time) float64 dask.array<chunksize=(500, 500, 1), meta=np.ndarray>
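Since a Zarr store keeps each chunk of an array as a separate object, the chunking determines how many objects are written to dCache. The following sketch estimates this for the temperature variable, assuming that the Zarr chunks follow the Dask chunking of the array; chunk_grid is a hypothetical helper, and the byte count refers to uncompressed data.

```python
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


shape, chunks = (1000, 1000, 5), (500, 500, 1)
grid = chunk_grid(shape, chunks)        # chunks per dimension
n_chunks = math.prod(grid)              # total number of chunk objects
chunk_nbytes = math.prod(chunks) * 8    # float64: 8 bytes per value, uncompressed
print(grid, n_chunks, chunk_nbytes)
```

Larger chunks mean fewer but bigger transfers through the WebDAV door, while smaller chunks allow finer-grained parallel access; a suitable balance depends on the access pattern.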
Other (geospatial) libraries
dCacheFS can be used in combination with many other libraries that do not directly interface with fsspec, provided that they support reading and writing from file-like objects. In the field of geospatial data analysis, libraries like rasterio, geopandas, h5-netcdf, and laspy all support reading from this kind of source, so that one can open and load data from dCache as shown in the following pseudo-code cell:
[ ]:
import rasterio
with fsspec.open('dcache://path/to/file', ...) as f:
    with rasterio.open(f) as fi:
        ...
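The pattern can be tried without access to dCache or to any geospatial library. In the sketch below, which uses only the Python standard library, zipfile stands in for a library that accepts file-like objects, and an in-memory io.BytesIO buffer stands in for the object returned by fsspec.open. Both substitutions are assumptions made for illustration.

```python
import io
import zipfile

# build a small archive in memory; the BytesIO buffer stands in for a remote file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('file_1.txt', 'Hello world!')
buffer.seek(0)

# the consuming library only needs read() and seek(), not a path on local disk
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    content = archive.read('file_1.txt')
print(names, content)
```

Any library that reads from such an object, rather than requiring a local file path, can be combined with fsspec.open in the same way.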