Dropbox: possible to automatically convert to "linked data"?

Dear all,
I was wondering if the following was possible:

The dropbox feature allows the convenient import of data with automatic extraction of the metadata - which is indeed very helpful, and we (almost) have it running, at least for one instrument.
However, it is not clear how much file storage we can attach to the DSS. Alternatively, we could store the files elsewhere and register them in openBIS via linked data.

However, I was wondering if we could move this decision “behind the scenes” to hide the complexity from the user and do something like: if the file is smaller than some threshold, use the dropbox feature as-is and let the DSS store the file within the openBIS installation; once we pass the threshold, extract the metadata, move the file to the external storage system, and register a linked dataset instead?
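To make the idea a bit more concrete, here is a rough, untested sketch of what I had in mind for the dropbox script (assuming a single incoming file). The `process(transaction)` entry point, `createNewDataSet()`, `setPropertyValue()` and `moveFile()` are from the standard Jython V2 dropbox API; the size threshold, dataset types, property names and the `upload_to_external_storage()` helper are made up, and I don't actually know how (or whether) a proper linked dataset can be registered from within a dropbox - that is exactly my question:

```python
# Untested sketch (Jython V2 dropbox). SIZE_THRESHOLD, dataset types, property
# names and upload_to_external_storage() are placeholders, not real API.
import os

SIZE_THRESHOLD = 1024 ** 3  # e.g. 1 GB - value to be decided


def extract_metadata(path):
    # placeholder for our instrument-specific metadata extraction
    return {"FILE_NAME": os.path.basename(path)}


def upload_to_external_storage(path):
    # hypothetical helper: push the file to S3 (or similar) and return its address
    return "s3://our-bucket/" + os.path.basename(path)


def process(transaction):
    incoming = transaction.getIncoming()
    path = incoming.getAbsolutePath()
    metadata = extract_metadata(path)

    if os.path.getsize(path) < SIZE_THRESHOLD:
        # small file: regular dropbox behaviour, the DSS stores the file
        data_set = transaction.createNewDataSet()
        data_set.setDataSetType("RAW_DATA")
        for key, value in metadata.items():
            data_set.setPropertyValue(key, value)
        transaction.moveFile(path, data_set)
    else:
        # large file: move it out and register only the metadata plus a pointer;
        # how to properly create a *linked* dataset here is exactly what is unclear
        address = upload_to_external_storage(path)
        link = transaction.createNewDataSet()
        link.setDataSetType("EXTERNAL_DATA")
        for key, value in metadata.items():
            link.setPropertyValue(key, value)
        link.setPropertyValue("EXTERNAL_ADDRESS", address)
```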

Many thanks
Ulrich

Dear Ulrich,

There is no limit on how much data you can manage with the DSS.

We do indeed have customers with petabytes of data; not sure about exabytes, I would need to check.

Instead of thinking about using some unsupported workflow, allow me to explain what we do at ETH.

Archivers:
The data is either HOT or COLD.

  • HOT: new data that is actively being used; it makes sense to keep it on hard drives.

  • COLD: data that is not used very actively, or not at all. It gets archived using the openBIS archiving functionality, which can integrate with, for example, a tape storage system. openBIS will still keep a copy of the metadata so you can browse and search filenames.

I can imagine that you could use the archiving functionality to move the data out to some system of your choice while keeping the metadata. You could write your own archiver.

This has the benefit that you could also implement the unarchive functionality, so you can get the data back any time you want.
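Just to illustrate the round trip such a custom archiver would implement, here is a rough Python/boto3 sketch of the archive and unarchive steps against S3. The real archiver plugins are Java classes running inside the DSS, and the bucket name and key layout below are made up; this only shows the idea:

```python
# Sketch only: the core copy steps a custom archiver would perform against S3.
# The real openBIS archiver plugins are Java classes inside the DSS; bucket name
# and key layout here are made up.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "openbis-archive"  # hypothetical bucket


def archive_dataset(dataset_code, dataset_dir):
    """Copy every file of a dataset to S3; openBIS keeps the metadata."""
    for root, _dirs, files in os.walk(dataset_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = dataset_code + "/" + os.path.relpath(local_path, dataset_dir)
            s3.upload_file(local_path, BUCKET, key)


def unarchive_dataset(dataset_code, target_dir):
    """Bring the files back so the DSS can serve them again."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=dataset_code + "/"):
        for obj in page.get("Contents", []):
            local_path = os.path.join(target_dir,
                                      os.path.relpath(obj["Key"], dataset_code))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], local_path)
```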

Big Data Link:
You could keep the data on an external system from the beginning and only ingest the metadata. Sadly, you will not be able to download the data from the UIs or APIs if you do this. Most of the time I don't recommend this approach; it only makes sense when you are generating and using BIG amounts of data on an external system like a cluster.

Hope this makes sense and provides you with some inspiration.

I would strongly suggest that you don't tinker with the internals of the dropboxes. They are due for a rework, and when it comes (and it will), we would not be able to support your custom workflow.

Best,
Juan

Dear Juan,

many thanks for the detailed answer.
I guess we are mainly limited by the boundary constraints that we have: we cannot have a large filesystem to use with a DSS, at least not one that works reliably and/or has a backup - it's not a problem to just put a few TB of hard disks somewhere, but if/when they fail we need to be able to recover from it, and acquiring anything that goes beyond a few HDDs from the local retailer would likely be a problem… Even if we could, it would not work across institutions. The only large file storage we do have access to that we can also use across institutions (and that comes with some guarantees on data integrity etc.) is S3, and there is no plan to offer anything other than S3.
Archiving may indeed be an option for “cold” data, at least for the future, thank you - we’ll keep that in mind.
Following your advice, we’ll stay clear of using DropBoxes in “creative” ways.

Many thanks again,
all the best
Ulrich

Dear Ulrich,

Do you have an estimate of how much data you plan to store, and of what kind of data?

  • ELN-LIMS attachments created by users are typically small in size.
  • What takes BIG amounts of space is data created by acquisition machines.

Depending on your use case, data size estimate and budget, I may be able to suggest some options.

Best,
Juan

Dear Juan,

many thanks, that would be great.
We have a collaboration of ~30 groups that are spread out across a number of institutions, some at the same university, others at different universities. One group (us) will host the openBIS instance for everyone, but we cannot expose our filesystems or provide storage for everyone.

A number of groups will generate “small” data that we (at least in the current plan) can attach to the ELN/LIMS, and/or essentially create an object to represent a measurement and then use parent/child relationships in the ELN to represent samples, measurement machines, the measured values themselves, and some additional information.
This will all fit into the openBIS instance (we think).

A smaller number of groups (around 5 of the 30) will generate a larger number of datasets; these can be simulations, as well as data created by, for example, electron microscopes with several dedicated detectors, or other machines.
The upper limit that at least one group says they can reach is 1 TB of data per day, although I would guess that most of the groups that produce more data will, at least initially and for the near future, be somewhere around 10-100 TB of data in total for the foreseeable future.
The various groups will need to exchange data, at least to some extent, and since most groups will not have the compute capacity to handle these data locally, we will need to do that on the central university cluster. It would be good to avoid copying data around and instead serve it from the central data store, which is, in our case, only this S3 storage we have access to…
(The cluster also has storage, but that is considered volatile, so we cannot store data there longer term and expect it to be there.)
Some of the data should be “pipelined” across the groups, i.e. group A records a large-ish amount of data, which will be analysed by group B at a different institution and then passed on to group C, etc.
(ideally, or so we hope, we can automate at least part of the process).

Budget is, for all intents and purposes, effectively 0, since all funding is going to different RDM initiatives that, at least for now and the near future, do not meet the requirements of the collaboration. Maybe that will change in the future, but for now we don't see an alternative to using openBIS.

Many thanks in advance and all the best
Ulrich

Dear Ulrich,

You may end up with a hybrid solution:

  • On the one hand, your central openBIS instance providing ELN-LIMS capabilities and “small” storage that all groups can use - those 100 TB you were talking about. Does your IT provide you with the infrastructure (electricity, network, cooling) to host your own server(s) in a rack?

  • On the other hand, S3 storage and some way to LINK it to openBIS datasets if needed. There are different ways of doing this, and since you are dealing with remote groups I would suggest having a look at the “Big Data Link” solution, which can create large datasets remotely without dropboxes or sending the data.

We just updated the Big Data Link command-line tool obis, and the documentation was improved: obis · PyPI
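To give you a feeling for the workflow on the external system (e.g. your cluster), and since you mention wanting to automate parts of the pipeline, here is a rough Python sketch driving obis via subprocess. The command names follow the obis documentation as I remember it, but the exact options and configuration keys may differ, so please double-check against the current docs; the folder name and commit message are made up:

```python
# Rough sketch, assuming obis is installed on the external system (pip install obis).
# Command names follow the obis documentation as I remember it; the exact options
# and configuration keys may differ, so double-check against the current docs.
import subprocess

REPO = "big_dataset_folder"  # hypothetical data folder on the cluster


def obis(*args, cwd=None):
    # run an obis command and fail loudly if it does not succeed
    subprocess.run(["obis"] + list(args), check=True, cwd=cwd)


# one-time setup: turn the data folder into an obis repository
obis("init", REPO)
# ... point it at your openBIS server and the owning object here (see `obis config`
# and `obis object` in the docs; I am not listing the exact keys from memory) ...

# after each acquisition/analysis step: register the current state in openBIS;
# the data itself stays on the external system, openBIS only stores the link
obis("commit", "-m", "raw data from run 2024-05-01", cwd=REPO)
obis("status", cwd=REPO)
```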

Best,
Juan

Dear Juan,

many thanks, yes we also came to the conclusion that such a hybrid strategy will likely be the most workable solution.
We don’t have the capacity to host our own racks/19" servers, etc.; at the moment we use a VM, and we’ll see how far we get with it…

The main reason why I asked this question is that, well, a number of users will not be familiar or comfortable with using the command line… (or know how to open a terminal). I suppose we can “hide” this by building a Flask/Django webpage around it, although this would mean that the users would have to use two web interfaces… Hence the idea to (mis-)use the dropbox feature in “creative” ways - which, as you say, is not a good approach…
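Just to sketch what I mean by hiding the command line behind a web page: something like the toy Flask endpoint below, which would receive an upload and hand it over to obis (or an S3 upload) behind the scenes. The route, staging folder and commit call are all made up for illustration, and things like authentication and filename sanitisation are left out:

```python
# Toy sketch of the "hide the CLI behind a web page" idea. Route, staging folder
# and the obis hand-off are made up; authentication, filename sanitisation and
# error handling are left out on purpose.
import os
import subprocess
from flask import Flask, request

app = Flask(__name__)
STAGING_DIR = "/data/staging"  # hypothetical folder that is also an obis repository


@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files["file"]
    target = os.path.join(STAGING_DIR, uploaded.filename)
    uploaded.save(target)
    # hand over to obis (or an S3 upload) behind the scenes,
    # so the user never has to open a terminal
    subprocess.run(["obis", "commit", "-m", "web upload: " + uploaded.filename],
                   cwd=STAGING_DIR, check=True)
    return "registered " + uploaded.filename, 200


if __name__ == "__main__":
    app.run(debug=True)
```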

Many thanks again and all the best
Ulrich