Data Storage Server with S3 Data Lakes

Is it possible to integrate S3-based data-lakes?
Or would the alternative be to write the data to the S3 buckets and then use openBIS to keep track of the location of the bucket and the file within?
many thanks

Hi Kerzel,

You should have a look at the Big Data Link concept and the obis command-line application.

It lets you keep data on any remote system where you have command-line access to install obis, and send the metadata of the dataset to openBIS.

That said, we don’t have any specific integration for the Amazon S3 APIs.

Hope this clarifies enough.

Hello,

many thanks - would it be possible to extend the DSS to include S3 (not necessarily Amazon, but also MinIO, Dell ECS, and others) as a storage option instead of a local filesystem?
I suppose FUSE might be an option that should “just work”, but it might be better to use more S3-native approaches.
(Obviously not out of the box but should we be able to hire someone who can spend time on programming…?)
Apologies - I’m quite new to openBIS, it seems to be a great tool - we’re currently figuring out how to use it within the constraints from both the intended users and the services the university offers…

Hi Kerzel,

The current data store “requires” a POSIX file system.

That said, the next major version of openBIS, openBIS 7, will include a new data store with additional features, such as mutable data, and a new, modern code base.

With this new code base, adding more storage backends compatible with the S3 API is not out of the question. It is just “currently” not a goal for the release.

Hope this helps,
Juan

Dear Juan,

great - many thanks for this.
We’re sort of in a “pinch” between what we can use and set up short-term (until summer) and what can then be developed longer term over the course of the next few years.
If I understand you right, we could use FUSE to mount S3 as a local filesystem (with the fallback of using an actual local filesystem) and with the next major release, development work to include S3 more natively could be attempted?

best wishes
Ulrich

Exactly, but bear in mind that the S3 API doesn’t really create hierarchical folders; it keeps paths as keys in a map.
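To make the "keys in a map" point concrete, here is a minimal sketch in plain Python. It does not talk to any real S3 service: the dict `bucket`, the example keys, and the helper `list_level` are all made up for illustration. It only mimics how a delimiter-based listing makes flat keys look like folders.

```python
# Toy model of a bucket: a flat mapping from key to object content.
# There are no folder objects here, only keys that happen to contain "/".
bucket = {
    "experiments/2024/run1/data.h5": b"raw bytes",
    "experiments/2024/run1/notes.txt": b"raw bytes",
    "experiments/2024/run2/data.h5": b"raw bytes",
    "readme.txt": b"raw bytes",
}


def list_level(keys, prefix="", delimiter="/"):
    """Return (files, folders) that appear directly under `prefix`.

    "Folders" are not stored anywhere; they are derived from the keys
    by cutting each key at the first delimiter after the prefix.
    """
    files, folders = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            folders.add(prefix + head + delimiter)
        elif rest:
            files.add(key)
    return sorted(files), sorted(folders)


files, folders = list_level(bucket, "experiments/2024/")
print(files, folders)

# Removing the last object under run2 makes the "folder" vanish as well,
# because it only ever existed as a shared key prefix.
del bucket["experiments/2024/run2/data.h5"]
print(list_level(bucket, "experiments/2024/"))
```

One consequence is that an empty directory has no natural representation in this model, so a layer that offers a POSIX view on top has to pick its own convention for it.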

How a particular FUSE integration maps a POSIX hierarchy onto S3 map keys is not trivial, and it is defined by that integration.

To move to a different integration, you will need to convert the S3 map keys to be compatible with the new integration.
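As a rough illustration of what such a conversion could involve, the sketch below plans key renames between two directory-marker conventions. Both conventions are assumptions chosen for the example (an empty-directory marker key ending in `/` versus one ending in `_$folder$`), and the helpers `convert_marker` and `plan_migration` are hypothetical. A real integration may differ in other ways too, for example in how it stores permissions or symlinks.

```python
def convert_marker(key, old_suffix, new_suffix):
    """Rewrite a directory-marker key from one convention to the other.

    Keys that do not end with `old_suffix` are regular objects and are
    returned unchanged.
    """
    if key.endswith(old_suffix):
        return key[: -len(old_suffix)] + new_suffix
    return key


def plan_migration(keys, old_suffix="/", new_suffix="_$folder$"):
    """Return {old_key: new_key} for every key that would need renaming."""
    plan = {}
    for key in keys:
        new_key = convert_marker(key, old_suffix, new_suffix)
        if new_key != key:
            plan[key] = new_key
    return plan


# Hypothetical bucket contents written by the "old" integration.
keys = [
    "data/run1/file.bin",  # regular object, untouched
    "data/empty/",         # assumed marker for an empty directory
]
print(plan_migration(keys))
```

Note that S3 has no rename operation, so each entry in such a plan means copying the object to the new key and deleting the old one, which is part of why a migration on a large bucket can take considerable effort.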

Long story short: if you start to use a particular FUSE integration, I expect you will get locked into it, and moving away from it can take considerable effort. So please make these kinds of choices carefully.
