We need to store large machine learning models as datasets. Sometimes this means storing a tarball of up to 1 TB.
Have you tried storing large volumes in one dataset, or do you recommend another way to store them in openBIS?
My idea is to use the Datamover service on our servers to rsync the data to the openBIS generic dropbox. Please let me know if you have a fancier solution in place!
Please clarify a bit more here.
Are you asking whether openBIS can correctly handle single files of 1 TB?
Or are you asking about our experience and recommendations for designing a regular data flow of 1 TB files into openBIS dropboxes?
I ask because your question can be read either way. Yes, we have experience storing large volumes in one dataset, but does that mean large volumes made up of several 1 TB files? Yes, we have designed several data flows for large data sets, with thousands of different files reaching up to 1 TB as well as single large files per import.
Datamover is meant more for complex connectivity situations where the data source cannot be reached directly from the openBIS instance. It does not specifically address problems with small or large files.
When designing the data flow, the choice to make is whether the openBIS dropbox uses “auto-detection” or “marker-file” as its “incoming-data-completeness-condition”.
Fast bandwidth and fewer, smaller files favour “auto-detection”, while slow bandwidth and more, larger files favour “marker-file”.
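To illustrate the “marker-file” option, here is a minimal Python sketch of the delivering side: copy the data into the dropbox incoming directory first, and only create the marker once the copy has finished. The function name deliver_to_dropbox and the paths are hypothetical, and the ".MARKER_is_finished_" prefix is my assumption about the marker naming, so please check it against your own dropbox configuration before relying on it.

```python
import shutil
from pathlib import Path

# Assumption: the dropbox expects a marker named ".MARKER_is_finished_<name>".
# Verify this prefix against your openBIS dropbox setup.
MARKER_PREFIX = ".MARKER_is_finished_"


def deliver_to_dropbox(source: Path, incoming_dir: Path) -> Path:
    """Copy `source` into `incoming_dir`, then create the marker file.

    The marker is written only after the copy has returned, so a dropbox
    configured with "marker-file" does not pick up a half-transferred file.
    """
    target = incoming_dir / source.name
    if source.is_dir():
        shutil.copytree(source, target)
    else:
        shutil.copyfile(source, target)

    # An empty file is enough: its presence signals that the data is complete.
    marker = incoming_dir / (MARKER_PREFIX + source.name)
    marker.touch()
    return marker
```

Usage would look like deliver_to_dropbox(Path("/data/model.tar"), Path("/path/to/dropbox/incoming")), with both paths being placeholders. If the transfer is done with rsync instead, the same ordering applies: create the marker only after rsync has exited successfully.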
Thank you Artur for an elaborate answer.
I was most interested in knowing whether openBIS can handle a single file of 1 TB in a dataset. For now, these larger files will probably be 10-20 single 1 TB files, each connected to its own dataset.
We’d use the dropbox importer with a marker-file as “incoming-data-completeness-condition”.
Based on your answer, I feel confident we can proceed with the strategy we have in mind.