[qudi-dev] Re: Managing saved measurement data

From: Sami Koho <sami.koho@xxxxxxxxx>
To: qudi-dev@xxxxxxxxxxxxx
Date: Tue, 5 May 2020 19:12:22 +0200

Hi,

for me the main advantages of HDF5 are:

1. no clutter
2. support for very large datasets and on-the-fly compression (if desired)
3. the ability to manipulate those large datasets without needing a lot of
memory, as you can access a dataset in an HDF5 file as if it were a NumPy
array: slicing and the like all work, and you only ever need enough memory
to hold the slice that you requested
4. very flexible attributes (different attributes can be defined at the file,
group ("folder") and dataset level)

I would mainly use it when you have large multi-dimensional arrays that are
hard to handle otherwise.
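
As a rough illustration of points 2 to 4, here is a minimal sketch using the
h5py package together with NumPy. The file name, group path, attribute names
and array shape are all made up for the example; none of them is something
qudi defines.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location and layout, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "experiment.h5")

with h5py.File(path, "w") as f:
    f.attrs["sample"] = "diamond_A"                # file-level attribute
    grp = f.create_group("2020-05-05/confocal")    # nested groups ("folders")
    grp.attrs["operator"] = "hypothetical_user"    # group-level attribute
    dset = grp.create_dataset(
        "cube",
        shape=(64, 48, 256),
        dtype="uint16",
        chunks=(8, 8, 256),      # stored and compressed chunk by chunk
        compression="gzip",
    )
    dset.attrs["units"] = "counts"                 # dataset-level attribute
    # Write one block; the rest of the dataset stays at the fill value.
    block = np.arange(8 * 8 * 256, dtype="uint16").reshape(8, 8, 256)
    dset[0:8, 0:8, :] = block

with h5py.File(path, "r") as f:
    dset = f["2020-05-05/confocal/cube"]
    # NumPy-style slicing: only the requested slice is read into memory.
    spectrum = dset[3, 5, :]
    sample = f.attrs["sample"]
```

Here `spectrum` is an ordinary NumPy array holding a single 256-point slice,
while the full dataset stays on disk.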

S

On 5. May 2020, at 18.59, Alrik Durand <alrik.durand@xxxxxxxxx> wrote:

Hi,

We had a look at HDF5 when we first started using Qudi in our lab, but this
(old) article kind of dissuaded us:
https://cyrille.rossant.net/moving-away-hdf5/
It seems it has gotten better since then, but it's a bit late for us.
I don't know what the difference is between searching in a large data folder
and searching in a large HDF5 database.

I like the ease of managing files directly, it helps me have control over my
data. This strategy is not optimized but it's user friendly.

One reason we would be interested in new saving tools is that for now the
memory efficiency is quite bad. We are going to start taking spectral images
with Qudi soon and I'm worried about the size of a 1080 x 730 x 2048 data
cube (I like high definition format...) when saved in plain text.
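
To put a number on that worry, here is a back-of-the-envelope estimate. The
bytes-per-value figures are assumptions: roughly 25 bytes per value for text
written with a format like '%.18e' plus one delimiter, and 2 or 8 bytes per
value for raw 16-bit or 64-bit binary storage. The real numbers depend on the
detector and on the save format actually used.

```python
# Size estimate for a 1080 x 730 x 2048 data cube.
n_values = 1080 * 730 * 2048

# Assumption: about 25 bytes per value in plain text
# (24 characters for '%.18e' plus one delimiter).
text_bytes = n_values * 25

# Assumption: raw binary storage as 64-bit floats or 16-bit integers.
float64_bytes = n_values * 8
uint16_bytes = n_values * 2

for label, size in [
    ("plain text", text_bytes),
    ("float64 binary", float64_bytes),
    ("uint16 binary", uint16_bytes),
]:
    print(f"{label:>15}: {size / 1e9:6.1f} GB")
```

Under these assumptions the text file comes out at tens of gigabytes, several
times larger than any of the binary variants, before compression.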

Regarding the database saving logic, is it something that some people use in
production? I've never heard of it before.

Cheers,
Alrik

On Tue, 5 May 2020 at 18:54, Nikolas Tomek <nikolas.tomek@xxxxxxxxxx> wrote:

Hi,

well I have worked a bit with HDF5 before.

Since it’s a hierarchical data format (hdf) you end up with exactly the same
problems as Kay mentioned with planning a rigid storage system for your
personal needs. The most simplistic and (probably) most general system would
be again date, time and experiment type but then you just end up the same as
now except instead of a folder hierarchy with many files you have that same
hierarchy in a single HDF5 file.

Thus I see no benefit from HDF5 for our needs over a proper database approach
(along with all the problems mentioned by Kay) except maybe a slightly lower
initial hurdle for newbies.

Cheers

Niko

From: Dr. Kay Jahnke <kay.jahnke@xxxxxxxxxxxxxxxxx>
Sent: Tuesday, May 5, 2020 5:32 PM
To: Sami Koho <sami.koho@xxxxxxxxx>
Cc: qudi-dev@xxxxxxxxxxxxx; Alrik Durand <alrik.durand@xxxxxxxxx>
Subject: [qudi-dev] Re: Managing saved measurement data

Hi Sami,

thanks for this pointer. I had not looked into HDF5 so far, but reading
through the documentation of the Python bindings, it seems very interesting.
It should indeed solve your problem of making the data somewhat "searchable"
and also store it in an efficient manner. You will, however, still have
problems when parameter names change or parameters are added or deleted over
time. You also need to define clearly in which hierarchical groups the data
should be stored and at which hierarchical level the attributes should live
(all directly on the datasets, or group the datasets and store the attributes
on the groups, ...). So there is no way around defining what you want
ahead of time.

Implementing it in qudi would result in a new save-logic and would actually
open up a third option, next to the existing folder-based storage and the
database storage I hinted at earlier.

As far as I can see, there is even a smooth integration of HDF5 into pandas,
so implementing this in a save-logic could interface well with Alrik's
analysis scripts.

Cheers,
Kay

On 05.05.20 13:01, Sami Koho wrote:

Hi everyone,

Just a thought: it might make sense to save the data in HDF5 files. That way
you would get a single file for all the data in an experiment. Also you can
add the specific metadata that you are talking about, as “attributes” to each
dataset/collection etc. The good thing with HDF is that it can support
datasets of arbitrary size, as basically you only load into memory the
section of data that you are currently working on — hence you do not get in
trouble with the memory.

Best,

S

On 5. May 2020, at 12.31, Dr. Kay Jahnke <kay.jahnke@xxxxxxxxxxxxxxxxx> wrote:

Hi Alrik,

thanks for your reply.

The GUI module you are talking about would essentially be the GUI to your
own save logic. This is because the GUI should have no functionality by
itself, so you will need an additional logic module anyway that defines your
"required" additional parameters and handles them. This could, for example,
be done as an Interfuse on top of the existing save logic.
Nevertheless, any parameters you define will be quite specific to your
experiment, so the logic will only be valid for you. E.g. you might work with
a single diamond sample and single centers, while others might work with
different materials or diamond batches and look for regions defined by
position and size, just to take one example. Therefore I would not define a
general parameter framework that everybody has to use, because qudi is quite
flexible in its use.

We could put your analysis notebooks into the qudi code, e.g. in the
notebooks folder. Your code would then need to be able to at least analyse
data created by the default config, which generates dummy data. This way other
people can test it and then adapt it for their own purposes. Also, someone
(Dan maybe) should please review the code and make sure that it is
understandable, has a good structure and works for them. I would leave it at
the notebook level and not include the analysis scripts in qudi itself, as
this is again very specific to the user.

These are just my thoughts which I discussed a bit with Jan and Niko, but
maybe there are better ideas out there.

Cheers,
Kay

On 04.05.20 19:55, Alrik Durand wrote:

Hi,

We have been using Qudi on most of our setups in our lab in Montpellier
(France) for about a year.

We are using additional parameters a lot via notebook scripts, and it works
well for us. One problem that I have seen with it comes from changes in the
names of the parameters over time. Maybe the typical fields used by people
(e.g. 'sample') could be fixed by a new GUI that would interact with savelogic.
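
Independent of any GUI, the name drift could also be contained by mapping
parameter names onto a fixed set of fields before saving. The sketch below is
only an illustration: the alias table, the required-field list and the
function name are invented and are not part of qudi.

```python
# Hypothetical alias table: maps names that drifted over time onto one
# canonical field name. The entries are invented for illustration.
ALIASES = {
    "sample": "sample",
    "sample_name": "sample",
    "Sample": "sample",
    "nv": "center",
    "center": "center",
}
# Hypothetical list of fields that every saved measurement must carry.
REQUIRED = ("sample",)


def normalize_parameters(params):
    """Return a copy of params with canonical keys; reject unknown names."""
    normalized = {}
    for key, value in params.items():
        if key not in ALIASES:
            raise KeyError(f"unknown parameter name: {key!r}")
        normalized[ALIASES[key]] = value
    missing = [name for name in REQUIRED if name not in normalized]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return normalized


clean = normalize_parameters({"Sample": "diamond_A", "nv": 12})
```

Rejecting unknown names outright is a deliberate choice here: a typo fails at
save time instead of creating yet another spelling in the data files.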

I personally like the fact that Qudi uses a centralized saving scheme; it
helps when you are trying to explore data from someone else or share analysis
code.

Concerning the analysis, we also developed some basic tools to load data
files into pandas DataFrames, and some other tools to do some very general
things with the pandas objects. Maybe such code could be shared as part of
the Qudi project in some way. I know, for example, that we have a method that
looks at the parameters before parsing the files, which is nice to have in
the "Load all" strategy.

In the end, all this does not prevent the use of a notebook. We are
experimenting with a digital notebook in Markdown. This has a few advantages
and looks quite good in the Typora editor.

I would love feedback on the two ideas mentioned, as we have some free time
lately...

Best,

Alrik Durand

PhD student @L2C Montpellier

On Mon, 4 May 2020 at 19:20, Dr. Kay Jahnke <kay.jahnke@xxxxxxxxxxxxxxxxx> wrote:

Hi Dan,

you are correct, the current save logic only saves hierarchically by date and
then by the module in which the saved data was created.
It was designed that way because qudi is not dictating what modules there
are, which parameters they have or which and how the data is saved. So going
by date was the best thing to get any structure.
There are some ways to enhance this structure and give you more convenient
access to your data: the current save_logic has additional_parameters
(https://github.com/Ulm-IQO/qudi/blob/master/logic/save_logic.py#L633).
These are global and can be set from any module or even from a notebook
script. So here you could save sample, center or any other parameters you
like into the data files. It is best not to save parameters in file names,
as this can lead to massive problems afterwards.
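
The idea of keeping parameters inside the file instead of in its name can be
sketched as follows. This is a stand-in written for illustration, not qudi's
save_logic code: the module-level dictionary, the function name and the
'#key: value' header format are all assumptions.

```python
import os
import tempfile

# Stand-in for a global parameter store; this is NOT qudi's save_logic API.
additional_parameters = {}


def save_with_header(path, rows, parameters):
    """Write parameters as '#key: value' header lines, then tab-separated rows."""
    merged = {**additional_parameters, **parameters}
    with open(path, "w") as f:
        for key, value in merged.items():
            f.write(f"#{key}: {value}\n")
        for row in rows:
            f.write("\t".join(str(x) for x in row) + "\n")


# Set once, e.g. from a notebook, and reused by every later save.
additional_parameters.update(sample="diamond_A", center="NV12")

# The file name carries only a timestamp and a label, no parameters.
path = os.path.join(tempfile.mkdtemp(), "20200504-1920-00_odmr.dat")
save_with_header(path, [(2.87e9, 1500), (2.88e9, 1420)], {"mw_power_dBm": -10})

with open(path) as f:
    header = [line.rstrip("\n") for line in f if line.startswith("#")]
```

Because the parameters sit in the header, renaming or moving the file loses no
information, and a script can read them back without parsing file names.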

Some of my colleagues then query the whole data directory automatically with
a script and load ALL the data files. These data files can then be pushed
into, for example, a "pandas" data set, and this lets you query the specific
parameters per experiment.
The disadvantage is that querying all the data takes a while and you
might run out of RAM, as the data might get big (just imagine you want to
look at all the data from one PhD). Also, in principle this has nothing really
to do with qudi, because at this point you are writing your analysis scripts
and should not need the qudi core functionality.
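
One way to soften the RAM problem is to read only the parameter header of each
file first and load the data only for the files that match. The sketch below
assumes a hypothetical file format in which parameters are stored as leading
'#key: value' lines; the directory layout and all names are invented for the
example.

```python
import os
import tempfile


def read_header(path):
    """Read only the leading '#key: value' lines of a data file."""
    params = {}
    with open(path) as f:
        for line in f:
            if not line.startswith("#"):
                break  # header is over; the data rows are never read
            key, _, value = line[1:].partition(":")
            params[key.strip()] = value.strip()
    return params


def find_files(root, **wanted):
    """Return the paths under root whose header matches all wanted values."""
    matches = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            header = read_header(path)
            if all(header.get(k) == v for k, v in wanted.items()):
                matches.append(path)
    return matches


# Build a tiny made-up data directory to run the scan against.
root = tempfile.mkdtemp()
for day, sample in [("2020-05-04", "diamond_A"), ("2020-05-05", "diamond_B")]:
    os.makedirs(os.path.join(root, day, "odmr"))
    with open(os.path.join(root, day, "odmr", "data.dat"), "w") as f:
        f.write(f"#sample: {sample}\n2.87e9\t1500\n")

hits = find_files(root, sample="diamond_B")
```

Only the files listed in `hits` would then need to be parsed in full, for
example into a pandas DataFrame.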

Therefore the much more elegant way of solving your problem would be to write
your own save-logic. The current module is just the default qudi logic module
and can be replaced at any time. You will just need to support the same
functions, but you can freely change what happens in the background.
For your case it would probably be best to set up an (elastic) database
and write a new save-logic that connects to that database and saves the data
in a clever way. A word of caution: be sure to put a lot of thought into the
design of the database beforehand and define explicitly what you want to save
and how (e.g. which parameters, which modules, pictures or only data sets,
dimensionality of the data).
I tried to write a save-logic for a database connection once, but the users
could not agree on any standardized parameters and structure. So the database
became maximally flexible and was therefore extremely complicated to query.
It never got used productively. Therefore, define what you want beforehand,
put thought into it and then keep to your structure.
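
To make "define what you want beforehand" concrete, here is a minimal sketch
of a fixed schema. It uses SQLite from the Python standard library purely as a
stand-in for whatever database is actually chosen, and the table name, columns
and example rows are hypothetical.

```python
import sqlite3

# In-memory database for illustration; a real setup would use a server
# or a file. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurement (
        id        INTEGER PRIMARY KEY,
        taken_at  TEXT NOT NULL,   -- ISO 8601 timestamp
        module    TEXT NOT NULL,   -- e.g. 'odmr', 'confocal'
        sample    TEXT NOT NULL,
        center    TEXT,            -- optional
        data_path TEXT NOT NULL    -- where the bulk data lives
    )
    """
)

rows = [
    ("2020-05-04T17:59:00", "odmr", "diamond_A", "NV12", "/data/a.dat"),
    ("2020-05-04T19:20:00", "confocal", "diamond_A", None, "/data/b.dat"),
    ("2020-05-05T13:01:00", "odmr", "diamond_B", "NV3", "/data/c.dat"),
]
conn.executemany(
    "INSERT INTO measurement (taken_at, module, sample, center, data_path) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# Because the columns are fixed, queries stay simple.
odmr_on_a = conn.execute(
    "SELECT data_path FROM measurement WHERE sample = ? AND module = ?",
    ("diamond_A", "odmr"),
).fetchall()
```

The point of the fixed columns is the last query: with an agreed structure,
finding all data for one sample and one module is a single short statement.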

And finally, if something good comes out, other people might also want to use
something similar. So please then open a pull request and contribute it back
upstream. This also has the advantage that other people might fix bugs for
you or enhance the project.

An additional general remark: it is always very useful to keep some kind of
notebook, even if the data is saved automatically. We have had very good
experiences going fully digital with the lab notebooks and using a wiki
system for that (https://www.dokuwiki.org/dokuwiki). The most used plug-in,
it turns out, is the one that lets you paste screenshots directly into the
wiki page.

If there are more questions, please don't hesitate to ask.

Cheers,
Kay

On 04.05.2020 at 17:59, Dan Yudilevich wrote:

Hi everyone,

We are a new group out of the Weizmann Institute (Israel), and we are slowly
but surely getting to know qudi, with most of the important features running
smoothly. The software is impressive, so kudos to the developers.

One thing I am struggling with is data management. Although we are only
beginning to acquire data, I already feel it is getting quite cluttered. The
apparent organization hierarchy of date/modules makes it challenging to trace
specific data later on. I would like, for example, a convenient way to find
data related to a specific sample (or defect); data from specific types of
pulse sequences, etc.

By managing a lab notebook I can, of course, refer to the specific dates and
files, but I feel it somewhat defeats the purpose.

So, I wanted to ask if someone has any recommendation –

Does anyone have a particularly elegant way of organizing the information?

Am I missing something in the save logic, so that I’m under-utilizing this
feature?

Thank you all, and stay healthy,

Dan Yudilevich

Finkler Group | Dept. of Chemical and Biological Physics

Weizmann Institute of Science

--
Dr. Kay Daniel Jahnke

Küfergasse 1
89073 Ulm
[T] +49 176 444 346 51
[@] kay.jahnke@xxxxxxxxxxxxxxxxx <mailto:kay.jahnke@xxxxxxxxxxxxxxxxx>

