{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# User Manual" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dCacheFS provides a file-system interface to a [dCache storage system](https://www.dcache.org), such as the [instance provided at SURF](http://doc.grid.surfsara.nl/en/stable/Pages/Service/system_specifications/dcache_specs.html). dCacheFS builds on the [Filesystem Spec](https://filesystem-spec.readthedocs.io) (`fsspec`) library, and it can be used either as a standalone library or through the more general `fsspec` functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We consider here a dCache instance whose project space is set up as follows (`test` is in the project root directory):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "test\n", "├── empty_testdir\n", "├── testdir_1\n", "│   ├── file_1.txt\n", "│   └── file_2.txt\n", "└── testdir_2\n", "    ├── file_1.txt\n", "    └── file_2.txt\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `dCacheFileSystem`\n", "\n", "The main access point to the functionality of dCacheFS is its `dCacheFileSystem` class. 
Here, we instantiate a `dCacheFileSystem` object providing authentication credentials (a macaroon for bearer-token authentication) and the URL where the dCache API can be reached (https://dcacheview.grid.surfsara.nl:22880/api/v1 for the SURF dCache instance):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from dcachefs import dCacheFileSystem\n", "\n", "api_url = \"https://dcacheview.grid.surfsara.nl:22880/api/v1\"\n", "\n", "# read authentication token\n", "with open(\"macaroon.dat\") as f:\n", " token = f.read().strip()\n", "\n", "fs = dCacheFileSystem(api_url=api_url, token=token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dCacheFS implements the following methods via the [dCache API](https://dcache.org/old/manuals/UserGuide-6.2/frontend.shtml):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/test/testdir_1/file_2.txt', '/test/testdir_1/file_1.txt']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.ls(\"/test/testdir_1\", detail=False)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'name': '/test/testdir_1/file_1.txt',\n", " 'size': 12,\n", " 'type': 'file',\n", " 'created': datetime.datetime(2023, 5, 4, 22, 47, 35, 736000),\n", " 'modified': datetime.datetime(2023, 5, 4, 22, 47, 35, 775000)}]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.ls(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.ls(\"/test/empty_testdir\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': '/test/testdir_1/',\n", " 'size': 512,\n", " 'type': 
'directory',\n", " 'created': datetime.datetime(2023, 5, 4, 22, 47, 35, 622000),\n", " 'modified': datetime.datetime(2023, 5, 4, 22, 47, 35, 896000)}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.info(\"/test/testdir_1/\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/test/testdir_2/file_2.txt', '/test/testdir_2/file_renamed.txt']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.mv(\"/test/testdir_2/file_1.txt\", \"/test/testdir_2/file_renamed.txt\")\n", "fs.ls(\"/test/testdir_2/\", detail=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/test/testdir_2/file_renamed.txt']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.rm(\"/test/testdir_2/file_2.txt\")\n", "fs.ls(\"/test/testdir_2/\", detail=False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.exists(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.exists(\"/test/testdir_1/nonexistent_file.txt\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.isfile(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.isdir(\"/test/testdir_1/\")" ] 
}, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime.datetime(2023, 5, 4, 22, 47, 35, 736000)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.created(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime.datetime(2023, 5, 4, 22, 47, 35, 775000)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.modified(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.size(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/test/testdir_1/file_1.txt',\n", " '/test/testdir_1/file_2.txt',\n", " '/test/testdir_2/file_renamed.txt']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.glob(\"/test/testdir_*/file_*.txt\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/test\n", "/test/testdir_1\n", "/test/empty_testdir\n", "/test/testdir_2\n" ] } ], "source": [ "for root, _, _ in fs.walk(\"/test\"):\n", " print(root)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/test/testdir_1/file_1.txt',\n", " '/test/testdir_1/file_2.txt',\n", " '/test/testdir_2/file_renamed.txt']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.find(\"/test\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 18, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "fs.du(\"/test\") # bytes" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "47296560241525359883320580310787008204" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fs.checksum(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following methods, which involve reading/writing files from/to dCache, involve communication via dCache's WebDAV door. Thus, the WebDAV door URL needs to be specified via a separate input argument when instantiating the `dCacheFileSystem` object:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'Hello world!'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "webdav_url = \"https://webdav.grid.surfsara.nl:2880\"\n", "\n", "fs = dCacheFileSystem(api_url=api_url, webdav_url=webdav_url, token=token)\n", "\n", "fs.cat(\"/test/testdir_1/file_1.txt\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello world!" 
] } ], "source": [ "local_path = \"./file.txt\"\n", "fs.download(\"/test/testdir_1/file_1.txt\", local_path)\n", "\n", "# check local copy\n", "!cat $local_path" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'Hello world!'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "remote_path = \"/test/testdir_2/file_uploaded.txt\"\n", "fs.upload(local_path, remote_path)\n", "\n", "# check remote copy\n", "fs.cat(remote_path)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'Hello world!'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with fs.open(\"/test/testdir_1/file_1.txt\", \"rb\") as f:\n", "    content = f.read()\n", "content" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'Hello world!'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = \"/test/testdir_2/file_written.txt\"\n", "with fs.open(path, \"wb\") as f:\n", "    f.write(b\"Hello world!\")\n", "fs.cat(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage via `fsspec`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once imported, `dcachefs` registers itself as the `fsspec` implementation for the \"dcache\" protocol. 
This means that all `fsspec` methods called on URL-paths of the following form are handled by `dCacheFileSystem`:\n", "```\n", "dcache://path/to/file/or/dir\n", "```\n", "Parameters such as the authentication token (macaroon) and the API and WebDAV door URLs can be specified as input arguments to all `fsspec` functions:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'Hello world!'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fsspec\n", "\n", "uri = \"dcache://test/testdir_1/file_1.txt\"\n", "\n", "with fsspec.open(uri, token=token, api_url=api_url, webdav_url=webdav_url) as f:\n", "    content = f.read()\n", "content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can be handy, however, to save these input parameters in an `fsspec` configuration file (by default located in the directory `${HOME}/.config/fsspec/`), so that they are always passed as input arguments when the \"dcache\" protocol is encountered (see the [section of the fsspec documentation on configuration](https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask, Zarr and Xarray" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fsspec` library is used internally by [Dask](https://dask.org), which allows it to read and write data [from a variety of data stores](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html). 
After importing dCacheFS, Dask internal functions can also read and write data from a dCache storage instance:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Hello world!',)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.bag as db\n", "\n", "bag = db.read_text(\n", " \"dcache://test/testdir*/file*.txt\",\n", " storage_options=dict(\n", " api_url=api_url,\n", " token=token,\n", " webdav_url=webdav_url,\n", " ),\n", ")\n", "\n", "bag.take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the storage options passed here as input arguments to `.read_text()` can be provided via a `fsspec` configuration file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`fsspec` also provides the functionality to create an interface that is compatible with the [Zarr](https://zarr.readthedocs.io/en/stable/) library:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import zarr\n", "\n", "fs_map = fsspec.get_mapper(\n", " \"dcache://test/store.zarr\", token=token, api_url=api_url, webdav_url=webdav_url\n", ")\n", "\n", "root = zarr.open(fs_map, mode=\"w\")\n", "myarray = root.zeros(\"myarray\", shape=(1000, 1000), chunks=(100, 100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same interface can also be used to read and write labeled multi-dimensional arrays in Zarr format via [Xarray](http://xarray.pydata.org):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Dimensions: (x: 1000, y: 1000, time: 5)\n", "Coordinates:\n", " * x (x) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999\n", " * y (y) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999\n", " * time (time) datetime64[ns] 2022-01-01 2022-01-02 ... 
2022-01-05\n", "Data variables:\n", " temperature (x, y, time) float64 dask.array\n" ] } ], "source": [ "import dask.array as da\n", "import numpy as np\n", "import pandas as pd\n", "import xarray as xr\n", "\n", "# coordinates\n", "x = np.arange(1000)\n", "y = np.arange(1000)\n", "time = pd.date_range(\"2022-01-01\", periods=5)\n", "\n", "# variable\n", "temperature = da.random.random((1000, 1000, 5), chunks=(500, 500, 1))\n", "\n", "# create dataset\n", "ds = xr.Dataset(\n", "    data_vars=dict(\n", "        temperature=([\"x\", \"y\", \"time\"], temperature),\n", "    ),\n", "    coords=dict(x=x, y=y, time=time),\n", ")\n", "print(ds)\n", "\n", "# create interface for Zarr\n", "fs_map = fsspec.get_mapper(\n", "    \"dcache://test/temperature.zarr\",\n", "    token=token,\n", "    api_url=api_url,\n", "    webdav_url=webdav_url,\n", ")\n", "\n", "# save dataset to dCache\n", "store = ds.to_zarr(fs_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other (geospatial) libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dCacheFS can be used in combination with many other libraries that do not directly interface with `fsspec`, provided that they support reading from and writing to file-like objects. In the field of geospatial data analysis, libraries like [rasterio](https://rasterio.readthedocs.io), [geopandas](https://geopandas.org), [h5netcdf](https://h5netcdf.org), and [laspy](https://laspy.readthedocs.io) all support reading from file-like objects, so that one can open and load data from dCache as sketched in the following pseudo-code cell (the path is a placeholder):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fsspec\n", "import rasterio\n", "\n", "# open the remote file via dCacheFS, then hand the file-like object to rasterio\n", "with fsspec.open(\n", "    \"dcache://path/to/file\", token=token, api_url=api_url, webdav_url=webdav_url\n", ") as f:\n", "    with rasterio.open(f) as fi:\n", "        ..."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }