Interact with openBIS via Python from scripts

I've seen that there is a package "PyBIS" that can be used to interact with the system. We have essentially two use cases: one group would mostly use the "classic" electronic lab notebook via the web interface when they prepare samples, do experiments, etc.
The other group is more geared towards simulation and machine learning. Here, we would have a large number of compute jobs, each resulting in a large number of large(-ish) files (these would need to go to an S3 bucket, see the other question). Each of these compute jobs would also come with a range of relevant data: which job was run when, which parameters / metadata the job took as input, which data comes out of the job and where it is stored (on S3), as well as results from the job (say, the classification accuracy of a machine learning job, things we would store in, say, MLFlow).

Can openBIS cover both use cases, or if not, how could one link two databases that contain the different sets of details?
many thanks

With pyBIS you can create entries and register metadata in openBIS in the same way as with the ELN UI.
If the data are not stored in the openBIS storage but stay somewhere else, as you mention, you may want to look at the Big Data Link, which allows you to create a link to the data in openBIS: Big Data Link - openBIS Documentation Rel. 20.10 - Confluence
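
For example, a minimal sketch of registering an object together with its metadata from a script could look like this (the URL, object type and property names are placeholders for whatever is defined in your instance):

from pybis import Openbis

# connect to the openBIS instance and log in
o = Openbis('https://my-openbis-instance.example.org')
o.login('username', 'password', save_token=True)

# create an object (sample) with metadata, just as you would in the ELN UI;
# type, space, collection and properties are placeholders for your own data model
sample = o.new_sample(
    type='EXPERIMENTAL_STEP',
    space='MY_SPACE',
    experiment='/MY_SPACE/MY_PROJECT/MY_COLLECTION',
    props={'$name': 'simulation run 0001'}
)
sample.save()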

Hello, many thanks for the pointer.
I’ve looked at the documentation but I’m not sure if I understand how it works and/or how it matches our use case.

Do I understand correctly that this assumes that the files we want to store or process are in one initial directory and that the set of files in that directory defines the dataset?

One of our use cases would be, for example: a compute job creates the simulation for a single setting, i.e. for one specific setting of the simulated experimental setup we obtain the output of this simulation. We then have metadata on the input side (details about the simulation), the configuration files, the files the simulation produces, as well as the log files.
For simplicity we can maybe assume that we put all configuration and log files into a zip archive that then needs to go to a data store; each output of the simulation will also go to a (separate) data store (in the simplest case we would only have one simulated experimental output, but it may be a multi-stage process with intermediate files).

We then push, say, 10,000 of these jobs to a big cluster, and each job registers its files - so we have maybe 50,000 or so files to manage; some are associated with one another (those from the same simulation job), others aren't.
In a later analysis we would then first need to define the dataset we want to run on. This needs to be based on the metadata. Say, for example, we want to analyse all simulated experiments that match a set of criteria. Hence the definition of a dataset depends on the query the respective user runs. For convenience it would be nice to 'tag' datasets, so that we don't have to repeat the query that leads to a given dataset.
The analysis job(s) would then run over the dataset defined in this way; the next analysis job will probably define a different dataset based on the same files, etc.

Looking at the documentation in the link you kindly provided, the "Big Data Link" seems more focused on a more-or-less static list of files in some directory, where the files may be updated at some point in time?

many thanks for your help.

Hi Kerzel,

I only logged in to this forum today for the first time, but since I am the main developer of pyBIS, I might be able to help.

As Caterina pointed out correctly, pyBIS is able to upload (i.e. register) your datasets together with your metadata. It can of course be used to manipulate the metadata, but not the data itself, because as a general principle, datasets cannot be changed. You can use pyBIS to download the data. The documentation is here: PyBIS · PyPI
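
A rough sketch of both directions (the dataset type, object identifier and property names below are only placeholders, not something pyBIS prescribes):

# 'o' is a logged-in pybis.Openbis session, as in the snippet further up

# register (upload) a dataset with metadata, attached to an existing object
ds = o.new_dataset(
    type='RAW_DATA',
    sample='/MY_SPACE/MY_SAMPLE',
    files=['results/output_0001.zip'],
    props={'$name': 'simulation output 0001'}
)
ds.save()

# the metadata can still be changed later, the data itself cannot
ds.props['$name'] = 'simulation output 0001 (renamed)'
ds.save()

# download the data again
ds = o.get_dataset(ds.permId)
ds.download(destination='downloads')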

The latest pyBIS version comes with a small command-line interface which you might want to use. Unfortunately, it does not provide dataset upload yet, but it will soon.

To «tag» your dataset, you might want to use a property (or even several). For example, you first query the datasets you want to use and then add/update a certain property («tagging» them), using a transaction because that's much faster. Maybe that is a solution?
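
As a sketch of that idea (dataset_tag is a placeholder property that would have to exist on your dataset type; the exact transaction API depends on the pyBIS version, so this loop simply saves each dataset individually):

# 'o' is a logged-in pybis.Openbis session

# 1. query the datasets of interest (the filter criteria are placeholders)
datasets = o.get_datasets(sample='/MY_SPACE/MY_SAMPLE', type='RAW_DATA')

# 2. "tag" them by writing the same value into a property of the dataset type,
#    here called dataset_tag (a placeholder); with many datasets a transaction
#    would be much faster than saving one by one
for ds in datasets:
    ds.props['dataset_tag'] = 'analysis-campaign-1'
    ds.save()

# 3. later analyses can then re-select exactly the same set by querying
#    for that property value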

Regards, Swen

Dear Swen,

many thanks for your answer.
I guess we will have to experiment a bit once we have our instance up and running. One of the constraints is likely going to be that a lot of the data will be stored outside the openBIS system on an S3 store (maybe we can integrate this in the future), i.e. the data will be written to / read from the S3 store while the metadata needs to be managed (in openBIS, hopefully?). The datasets are also "fluid" in the sense that data may be added all the time (or deprecated). In the end, we'll likely have a large data lake, where each data file has its relevant metadata and the various analysis jobs will work on different but well-defined snapshots at any given time, where such "tags" should make it reproducible (we don't actually want to delete data, so we can, in principle, reproduce "wrong" results as well).

Maybe it becomes a bit clearer when we have done our first attempts.
Many thanks for the help so far.

You have also mentioned the Big Data Link tool… Yes, it is a utility intended to create a link in openBIS to a bigger, external dataset, e.g. at the end of a data processing workflow. It is included in the obis tool and uses pyBIS internally (described above by Swen). The dependencies include git and git-annex.
See an example session below. It can be part of an automated workflow; e.g. I can imagine adding the last part, which creates the links, to a Snakemake rule.

# install the required software
# see the full documentation at
# https://pypi.org/project/obis/
# and https://pypi.org/project/PyBIS/
#
pip3 install obis
pip3 install pybis
conda install -c conda-forge git-annex

#module load git

# run the configuration
obis config -g set openbis_url=https://openbis-biomed-demo.ethz.ch
obis config -g set user=michalo
obis config -g set verify_certificates=false # in case of https
obis config -g set allow_only_https=false # in case of http
obis data_set -g set type=UNKNOWN

# check the settings
obis settings -g get

# choose openBIS object and commit the link
obis init obis_test && cd obis_test

#obis object set id=/DEFAULT/DEFAULT/DEFAULT
obis object set id=/PUBLISHED_DATA/PROJECT1_PUB/S1
obis commit -m 'commit message'
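
After the commit, the resulting link dataset should be attached to the object chosen above; as a quick check, a small pyBIS sketch like this should list it (assuming a logged-in pybis.Openbis session 'o'):

# the dataset created by "obis commit" should appear among the datasets
# attached to the object that was set with "obis object set id=..."
for ds in o.get_datasets(sample='/PUBLISHED_DATA/PROJECT1_PUB/S1'):
    print(ds.permId, ds.type)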

Dear Michal, Swen,

many thanks again - I'm trying to understand how it will work (and/or how it relates to the setup we have).
Maybe I'm a bit stuck on understanding how the BigDataLink / obis setup works in general, in particular where the actual data is stored…

I'm not sure if I correctly understand the git-annex approach that underlies the obis/Big Data Link setup - it seems as if the files are local to some computer (on its filesystem) and git-annex is then used to make these "repositories" available to others?

Our setup is based on a central openBIS installation shared across a range of groups, some of which are at the same institution, some are not. Unfortunately, we won't have the capacity to store all data inside the DSS/openBIS setup. We will need to store the bigger files / datasets on the central S3-based data store at the computing centre. As our collaboration is spread over a number of institutions, this will also be the only way of making the data available to everyone.
Going through the BigDataLink and obis examples, I’m not quite sure where, well, the actual data ends up…?
Or are we, alternatively, better off moving the data to S3 manually and then registering the location in a custom dataset template/object with pyBIS?
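
What I have in mind for that alternative is roughly the following sketch (the object type SIMULATION_RUN and the properties are placeholders we would first have to define in our own data model):

from pybis import Openbis

# connect to our openBIS instance (placeholder URL)
oBis = Openbis('https://our-openbis-instance.example.org')
oBis.login('username', 'password', save_token=True)

# we copy the data to S3 ourselves and only register the metadata,
# including the S3 location, in openBIS
run = oBis.new_sample(
    type='SIMULATION_RUN',
    space='MY_SPACE',
    experiment='/MY_SPACE/SIMULATIONS/CAMPAIGN_2023',
    props={
        's3_location': 's3://our-bucket/simulations/job_0001/output.h5',
        'simulation_parameters': 'temperature=300K, steps=1e6',
    }
)
run.save()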

many thanks again, and in particular for your patience with my newbie questions
Ulrich

Dear Michal,

I was trying to use obis on our new test server and tried to follow the steps.
The config is given below, where I created a sample to which the data are attached (do I understand correctly that the data are attached to the sample, not the experiment?)
I created a directory with a small (image) file in it but then got the following error:

(openbis-py3.10) kerzel@LT20220902A:~/Repositories/openbis/openbis/data1$ obis commit -m "test SEM image obis"
10:42:28 Could not commit:
Invalid URL '/datastore_server/rmi-data-store-server-v3.json': No scheme supplied. Perhaps you meant https:///datastore_server/rmi-data-store-server-v3.json?

Do I suspect correctly that something else needs to be configured first, such as an external datastore as described on the webpage for Link Data Sets?
That page does list example scripts; I have to admit, however, that I'm not quite sure where to start - for example, where would I get the "service" object from to call tr = service.transaction()?

Many thanks again
best wishes
Ulrich

(openbis-py3.10) kerzel@LT20220902A:~/Repositories/openbis/openbis/data1$ obis settings -g get
{
    "collection": {
        "id": null,
        "permId": null
    },
    "config": {
        "allow_only_https": true,
        "fileservice_url": null,
        "git_annex_backend": null,
        "git_annex_hash_as_checksum": true,
        "hostname": "LT20220902A",
        "obis_metadata_folder": null,
        "openbis_token": null,
        "openbis_url": "https://openbis-t.imm.rwth-aachen.de/",
        "session_name": null,
        "user": "ulrichkerzel",
        "verify_certificates": true
    },
    "data_set": {
        "properties": null,
        "type": "UNKNOWN"
    },
    "object": {
        "id": "/ULRICHKERZEL/TEST_0001/TEST_0001_EXP_1/SAMPLE5",
        "permId": null
    },
    "repository": {
        "data_set_id": null,
        "external_dms_id": null,
        "id": null
    }
}

Dear all,
it seems to work now (need to enquire with IT what exactly has changed)

However, that leads me to another question; it seems I don't quite understand the data model.

What is the difference between “path” and “identifier”, e.g. in

PermId:
20230213103629985-73

Identifier:
/ULRICHKERZEL/TEST_0001/UKSAMPLE_001

Path:
/ULRICHKERZEL/TEST_0001/TEST_0001_EXP_1/UKSAMPLE_001

many thanks again
Ulrich

Dear Ulrich,

Happy to hear that it works on your side! The Identifier is likely an arbitrary ID of the big data link, assigned before the commit like in my example, while Path is the path in the remote/non-openBIS system.
I will double check in my test system anyway and try to write you more.

If you would like to, e.g., test it together or discuss your use case over Zoom one day, please let me know via email (Dr. Michal Okoniewski | ETH Zurich)

all the best,
Michal

Dear Michal, Swen,

I'm still getting familiar with the linkedData functionality, using parts of the code from pyBIS.

I can register a linked dataset, it shows up in the openBIS ELN, and I can also retrieve the location of the content copy via ds_0.data['linkedData']['contentCopies'][0]['path'], where ds_0 is the first entry in the list of datasets attached to the object: my_datasets = oBis.get_datasets(sample='/MATERIALS/EBSDSIM/EBSDSIMMASTER22')
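
For reference, the retrieval roughly looks like this on my side (a sketch of what I am doing):

# oBis is the logged-in pybis.Openbis session used above
my_datasets = oBis.get_datasets(sample='/MATERIALS/EBSDSIM/EBSDSIMMASTER22')

# first dataset attached to the object; for a linked dataset the path of the
# registered content copy is available in the linkedData part of the response
ds_0 = my_datasets[0]
path = ds_0.data['linkedData']['contentCopies'][0]['path']
print(path)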

I was wondering two things:

  1. How can I register more than one content copy?

  2. I (naively) tried to register the same linkedData again, under a different path but otherwise identical, in particular with the same name.
    I then observed the following behaviour, which, admittedly, I don't understand.

In the ELN GUI, I only see one linked dataset - so it seems that if the name exists, the newer object is shown, even if an older one exists with a different PermID?

If I use pyBIS, I see both (based on the permID?)

So I guess the first take-home message is not to use the same name, but I wonder whether it is the intended behaviour to show only one object with that name in the ELN?
How would I add another content copy?

Many thanks again
best wishes
Ulrich

Dear Ulrich,

It's interesting. Hard to say for sure right now whether it's "a bug or a feature"; I will try to reproduce what you have done, at least the link-ELN part, and will then ask the developers if needed.

all the best,
Michal

Dear Ulrich,

(Sorry, it took me some time to revive my test openBIS instance…)

For 1. it is probably a constraint that a dataset (link) can be registered only once. For me it says

$ obis init obis_test2 && cd obis_test2
15:13:45 init_data obis_test2
15:13:45 Could not init_data:
Folder is already an obis repository.

I will try to discuss with the developers what the logic behind this constraint is, whether it can be changed, and whether it is different in the case of big data links. But yes, you can always do it like in your point 2.

And as for 2., in my ELN display I can see the three datasets/big data links which I've added. I also see the 3 datasets in the old GUI, likely in the same way you see the two using pyBIS.
The 3 datasets are in the bottom-left panel after choosing the object to which they are attached - is it also like that in your case? Your screenshot is likely showing the info of a single dataset, so clicking back to the object should help, I hope.

Dear @MichalO ,

many thanks for your answer.
Hm, no, I only see the latest linked dataset in openBIS (are your data also linked datasets? For me, I get a little chain-like symbol after the folder symbol in this case.)

The same happens if I change the dataset type:

only the latest one appears

If I query the object using the Python API, I do get to see all of them though:

They go to the same object/sample:

sample = '/MATERIALS/EBSDSIM/EBSDSIMMASTER22'
if sample is not None:
    sample_id = oBis.sample_to_sample_id(sample)
    data_set_creation["sampleId"] = sample_id

I use data_set_creation["autoGeneratedCode"] = True - is it maybe related to this?

Many thanks
Ulrich

Dear @MichalO

for the other question: Ebsdsimmaster is a collection:

then if I click on that element, I get:

(maybe I’m doing it wrong?)

All the best
Ulrich

Dear @MichalO

as an addendum, if I upload the file manually within the ELN GUI, it shows up in addition, also several times.

And if I use the Python interface, I can see them all, either "PHYSICAL" or "LINK"…

OK Ulrich… If only the latest link appears, then it may be a bug in the GUI display.
What’s your openBIS version? See “about” at the end of the left-panel tree…
Mine is 20.10.5

Dear @MichalO

thanks - us too, 20.10.5.

All the best
Ulrich

Thank you Ulrich…! And is the browser Chrome or something else?
I will check with our developers what it can be; it seems like a bug to me…

Dear @MichalO

yes, I use the latest version of Chrome on Win11

Many thanks for your help
Ulrich