I’ve seen that there is a package “PyBIS” that can be used to interact with the system. We have essentially two use cases: one group would use more the “classic” electronic notebook via web-interface when they prepare samples, do experiments, etc.
The other group is more geared towards simulation and machine learning. Here, we would have a large number of compute jobs that each produce many large(-ish) files (these would need to go to an S3 bucket, see the other question). Each of these compute jobs would also come with a range of relevant data: which job was run when, which parameters / meta-data the job took as input, which data came out of the job and where it is stored (on S3), as well as results from the job (say, the classification accuracy of a machine learning job, things we would store in, say, MLflow).
Can openBIS cover both use cases? If not, how could one link two databases that contain the different sets of details?
With pyBIS you can create entries and register metadata in openBIS the same way as you can do with the ELN UI.
If data are not stored in the openBIS storage, but stay somewhere else like you mention, you may want to look at the BigDataLink, which allows you to create a link to the data in openBIS: Big Data Link - openBIS Documentation Rel. 20.10 - Confluence
Hello, many thanks for the pointer.
I’ve looked at the documentation but I’m not sure if I understand how it works and/or how it matches our use case.
Do I understand correctly that this assumes that the files we want to store or process are in one initial directory and that the set of files in that directory define the dataset?
One of our use-cases would be, for example: a compute job creates the simulation for a single setting, i.e. for one specific setting of the simulated experimental setup, we obtain the output of this simulation. We then have meta-data on the input side (details about the simulation), the configuration files, the files the simulation produces, as well as the log-files.
For simplicity we can maybe assume that we put all configuration and log-files into a zip-archive that then needs to go to a data-store, each output of the simulation will also go to a (separate) data-store (in the simplest case we would only have one simulated experimental output, but it may be a multi-stage process with intermediate files).
We then push, say, 10,000 of these jobs to a big cluster, each job registers the files - so we have maybe 50,000 or so files to manage, some are associated with another (those from the same simulation job), others aren’t.
In a later analysis we would then first need to define the dataset we want to run on. This needs to be based on the meta-data. Say, for example, we want to analyse all simulated experiments that match a set of criteria. Hence the definition of a data-set depends on the query the respective user does. For convenience it would be nice to ‘tag’ datasets, so that we don’t have to repeat the query that leads to this dataset.
The analysis job(s) would then run over this data-set defined in this way, the next analysis job will probably define a different data-set based on the same files, etc.
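To make the query-then-tag idea above concrete, here is a minimal sketch in plain Python, independent of openBIS. It assumes each file carries a metadata record, a dataset is simply the result of a filter over those records, and a "tag" freezes the matching file keys so that files added later do not change it. All names (`select`, `freeze`, the metadata fields, the S3 keys) are hypothetical.

```python
import hashlib
import json

# Hypothetical catalogue: one metadata record per file in the data lake.
catalogue = [
    {"key": "s3://sim-bucket/job-0001/output.h5", "job": "job-0001", "energy": 10, "stage": "final"},
    {"key": "s3://sim-bucket/job-0001/logs.zip", "job": "job-0001", "energy": 10, "stage": "logs"},
    {"key": "s3://sim-bucket/job-0002/output.h5", "job": "job-0002", "energy": 20, "stage": "final"},
]


def select(records, **criteria):
    """Return the records whose metadata matches every key=value criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def freeze(name, records):
    """Freeze a query result under a tag: sorted file keys plus a checksum."""
    keys = sorted(r["key"] for r in records)
    digest = hashlib.sha256(json.dumps(keys).encode("utf-8")).hexdigest()
    return {"tag": name, "keys": keys, "sha256": digest}


snapshot = freeze("final-outputs-v1", select(catalogue, stage="final"))

# A file registered later matches the same query but is not part of the snapshot.
catalogue.append(
    {"key": "s3://sim-bucket/job-0003/output.h5", "job": "job-0003", "energy": 10, "stage": "final"}
)
print(snapshot["keys"])
```

Storing the checksum next to the tag lets a later analysis job verify that it runs on exactly the same list of files as an earlier one.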
Looking at the documentation in the link you kindly provided, it seems that the “Big Data Link” is more focused on a more-or-less static list of files in some directory, where the files may be updated at some point in time?
many thanks for your help.
I only logged in to this forum today for the first time, but since I am the main developer of pyBIS, I might be able to help.
As Caterina pointed out correctly, pyBIS is able to upload (i.e. register) your datasets together with your metadata. It can of course be used to manipulate the metadata, but not the data itself, because as a general principle, datasets cannot be changed. You can use pyBIS to download the data. The documentation is here: PyBIS · PyPI
The latest pyBIS version comes with a small command-line interface which you might want to use. Unfortunately, it does not provide the dataset upload yet, but soon it will.
To «tag» your datasets, you might want to use a property (or even several). For example, you first query the datasets you want to use and then add or update a certain property on them («tagging» them), using a transaction because that’s much faster. Maybe that is a solution?
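A minimal sketch of this query-then-tag approach could look as follows. It assumes `o` is a logged-in `pybis.Openbis` instance and that the property (here the hypothetical `analysis_tag`) is already defined on the dataset type. Whether a transaction accepts dataset updates may depend on the pyBIS version, so the sketch can also fall back to saving each dataset individually; please check against the pyBIS documentation before relying on it.

```python
def tag_datasets(o, prop_name, tag_value, use_transaction=True, **search):
    """Set one property on every dataset matched by the search.

    `o` is assumed to be a logged-in pybis.Openbis instance; `search` is
    passed unchanged to o.get_datasets(), e.g. type="SIM_OUTPUT".
    Returns the permIds of the tagged datasets.
    """
    trans = o.new_transaction() if use_transaction else None
    tagged = []
    for ds in o.get_datasets(**search):
        # The property must already exist on the dataset type in openBIS.
        ds.set_props({prop_name: tag_value})
        if trans is not None:
            trans.add(ds)   # collected and sent in one request on commit
        else:
            ds.save()       # slower: one request per dataset
        tagged.append(ds.permId)
    if trans is not None:
        trans.commit()
    return tagged


# Usage (not executed here; URL, credentials and type are placeholders):
#   from pybis import Openbis
#   o = Openbis("https://openbis.example.org")
#   o.login("username", "password")
#   tag_datasets(o, "analysis_tag", "final-outputs-v1", type="SIM_OUTPUT")
```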
many thanks for your answer.
I guess we will have to experiment a bit once we have our instance up and running. One of the constraints is likely going to be that a lot of the data will be stored outside the openBIS system on an S3 store (maybe we can integrate this in the future), i.e. the data will be written to / read from the S3 store while the meta-data needs to be managed elsewhere (hopefully in openBIS?). The datasets are also “fluid” in the sense that data may be added (or deprecated) all the time. In the end, we’ll likely have a large data-lake, where each data file has its relevant meta-data and the various analysis jobs work on different but well-defined snapshots, where such “tags” should make the analyses reproducible (as we don’t actually want to delete data, we can, in principle, reproduce “wrong” results as well).
Maybe it becomes a bit clearer when we have done our first attempts.
Many thanks for the help so far.
You have also mentioned the Big Data Link tool… Yes, it is a utility intended to create a link in openBIS to a bigger, external dataset, e.g. at the end of a data processing workflow. It is included in the obis tool and uses PyBIS inside (described above by Swen). The dependencies include git and git-annex.
See an example session below. It can be part of an automation, e.g. I can imagine adding the last part, which creates the links, to a snakemake rule.
#installing required software
# see for full documentation
# and https://pypi.org/project/PyBIS/
pip3 install obis
pip3 install pybis
conda install -c conda-forge git-annex
#module load git
# run the configuration
obis config -g set openbis_url=https://openbis-biomed-demo.ethz.ch
obis config -g set user=michalo
obis config -g set verify_certificates=false # in case of https
obis config -g set allow_only_https=false # in case of http
obis data_set -g set type=UNKNOWN
# check the settings
obis settings -g get
# choose openBIS object and commit the link
obis init obis_test && cd obis_test
#obis object set id=/DEFAULT/DEFAULT/DEFAULT
obis object set id=/PUBLISHED_DATA/PROJECT1_PUB/S1
obis commit -m 'commit message'
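For automation, the last three commands of the session above could be wrapped in a small Python helper, e.g. to be called from a workflow step. This is only a sketch: the function names are hypothetical, it uses only the `obis` subcommands shown in the session, and it defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def obis_link_commands(repo_dir, object_id, message):
    """Return the obis calls from the session above as argument lists."""
    return [
        ["obis", "init", repo_dir],
        ["obis", "object", "set", f"id={object_id}"],
        ["obis", "commit", "-m", message],
    ]


def run_obis_link(repo_dir, object_id, message, dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for i, cmd in enumerate(obis_link_commands(repo_dir, object_id, message)):
        print(shlex.join(cmd))
        if not dry_run:
            # `obis init` runs in the current directory; the later calls run
            # inside the repository, mirroring `obis init obis_test && cd obis_test`.
            subprocess.run(cmd, check=True, cwd=None if i == 0 else repo_dir)


run_obis_link("obis_test", "/PUBLISHED_DATA/PROJECT1_PUB/S1", "commit message")
```

Keeping the command construction separate from the execution makes it easy to inspect what would be run before pointing it at a real openBIS instance.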
Dear Michal, Swen,
many thanks again - I’m trying to understand how it will work (and/or, how it will relate to our setup we have).
Maybe I’m a bit stuck at understanding how the BigDataLink / obis setup works in general, in particular where the actual data is stored…
I’m not sure if I understand the git-annex approach correctly that is underneath the obis/Big Data Link setup - it seems as if the files are local to some computer (on their filesystem) and then git-annex is used to make these “repositories” available to others?
Our setup is based on a central openBIS installation across a range of groups, some of which are at the same institutions, some are not. Unfortunately, we won’t have the capacity to store all data inside the DSS/openBIS setup. We will need to store the bigger files / datasets on the central S3-based data store at the computing centre. As our collaboration is spread over a number of institutions, this will also be the only way of making the data available to everyone.
Going through the BigDataLink and obis examples, I’m not quite sure where, well, the actual data ends up…?
Or are we, alternatively, better off moving the data to S3 manually and then register the location in a custom dataset template/object with pyBIS?
many thanks again, and in particular for your patience with my newbie questions
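For reference, the manual route asked about above (upload to S3 separately, then register only the location and metadata through pyBIS) might be sketched like this. It assumes `o` is a logged-in `pybis.Openbis` instance and that an object type with matching property types has been defined in openBIS beforehand. The property names, the object type `SIM_JOB`, the bucket and the paths are all hypothetical, and the upload to S3 itself is not shown.

```python
import json


def job_properties(job_id, bucket, keys, parameters, metrics):
    """Flatten one compute job into string-valued properties."""
    return {
        "job_id": job_id,
        "s3_uris": "\n".join(f"s3://{bucket}/{k}" for k in keys),
        "parameters": json.dumps(parameters, sort_keys=True),
        "metrics": json.dumps(metrics, sort_keys=True),
    }


def register_job(o, object_type, space, collection, props):
    """Create one openBIS object that carries the job metadata.

    `o` is assumed to be a logged-in pybis.Openbis instance; the object type
    and its property types are assumed to exist in openBIS already.
    """
    sample = o.new_sample(type=object_type, space=space, experiment=collection, props=props)
    sample.save()
    return sample


props = job_properties(
    "job-0001",
    "sim-bucket",
    ["job-0001/output.h5", "job-0001/logs.zip"],
    parameters={"energy": 10},
    metrics={"accuracy": 0.93},
)
print(props["s3_uris"])

# Usage (not executed here; identifiers are placeholders):
#   register_job(o, "SIM_JOB", "SIM", "/SIM/PROJECT1/RUNS", props)
```

With this layout the files never pass through the openBIS data store; openBIS only holds the searchable metadata and the pointers to S3.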