Hello all,
Thanks, Robert, for your comments. Just to clarify, the Bodleian aren't
sponsoring the package of work described by Evelyn and Ross last week - this is
100% Wellcome-supported :-)
Our upcoming implementation project (for which we have just advertised for a
developer) is related, though, and will benefit from what's being accomplished
by the Wellcome-sponsored work. As currently planned, the most substantial
component of our project is to develop a new metadata output which captures
what is needed for preservation as succinctly as possible, with the aim of
making significant performance gains. The key difference is that we had
intended to move metadata output to LOD as much as possible, based on
discussion with Artefactual re. our Archivematica pilot findings.
We feel in a slightly uncertain position as things stand. We can't implement
Archivematica with its current metadata/performance issues, so we need to make
a change. But how do we ensure that what we pursue will become integral to
Archivematica and adopted by the community? This is a core reason for our
membership of the PSP - we see it as a forum for coordinating development to
achieve more together and minimise risk.
The sorts of questions that are bubbling away for me as we reach a decision
point around how to proceed with our project are:
* It seems clear that LOD is a good choice for the future, but is it wise
for the Bodleian to go ahead with implementing a LOD approach now? Our
development work is slated for 2020/21, with exact timing dependent on hiring
our developer.
* Is the presence of multiple metadata outputs for Archivematica plausible?
We might find ourselves with three options to toggle: current METS, reduced
METS, and LOD. How many can the community sustain?
* If we don't implement LOD now, then when is a good time?
* Is there any drive/resource in the community to re-implement important
functionality around a new LOD model? What is the scope of work?
* If the work Wellcome has sponsored provides adequate efficiency gains and
captures the metadata we need, would we do better to adopt this instead, and
redirect Bodleian development resource elsewhere? (But can we rely on the
reduced METS form being 'supported' longer-term? To what degree does this also
suffer from a need to persuade the community to adopt a new metadata format,
and to re-implement functionality? With increasing data bloat, would this be a
sticking-plaster solution?)
Robert’s questions around larger investment are very useful. It feels to me
like it’s time for a step-change. I would like to see the Bodleian’s upcoming
project as part of the investment towards bringing Archivematica to the next
level, but I’m unsure whether that’s realistic, and what more might be required.
Grateful for people’s thoughts!
Best,
Susan
From: archivematica-psp-bounce@xxxxxxxxxxxxx
<archivematica-psp-bounce@xxxxxxxxxxxxx> On Behalf Of Evelyn McLellan
Sent: 31 August 2020 18:20
To: Sarah Romkey <sromkey@xxxxxxxxxxxxxxx>
Cc: archivematica-psp@xxxxxxxxxxxxx; Ross Spencer <rspencer@xxxxxxxxxxxxxxx>;
Joel Simpson <joelsimpson@xxxxxxxxxxxxxxx>
Subject: [archivematica-psp] Re: A few questions remaining after yesterday's
METS reduction session
Hi Robert,
These are all great questions! I'm just going to tackle a couple of the METS
ones.
Events
We can make reductions in the size of the Events file by having only one set of
Agents for the entire file, instead of one set for every PREMIS object. We can
also make certain Events aggregate, so that we don't have to repeat an Event
like Ingestion thousands of times. And finally, we can remove empty PREMIS
containers whose presence is not required by the METS schema. Ross has started
on all of these things in the main METS file. However, at the end of the day,
if you have tens of thousands of digital objects in an AIP, you're going to
have a lot of Event metadata - otherwise, you lose the audit trail which is
generally considered important for digital preservation purposes. It would be
interesting to see how far we could go with reducing the size of the Events
METS file before we started to risk losing meaningful information. If we were
to move to Linked Data then we would get a further substantial reduction in
verbosity, since LD is inherently more succinct than XML. The PREMIS OWL
ontology that came out a couple of years ago is designed to capture the same
information as the XML serialization but much more succinctly, and we wouldn't
use METS at all.
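[To make the aggregation idea concrete, here is a rough, hypothetical sketch of what a shared Agent plus a single aggregate Event might look like in PREMIS 3 XML. The element names come from the PREMIS schema, but the structure, identifiers, and values below are illustrative only - this is not the actual output of Ross's work.]

```xml
<premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0">
  <!-- Agent declared once for the whole file, instead of once per PREMIS object -->
  <premis:agent>
    <premis:agentIdentifier>
      <premis:agentIdentifierType>preservation system</premis:agentIdentifierType>
      <premis:agentIdentifierValue>Archivematica</premis:agentIdentifierValue>
    </premis:agentIdentifier>
    <premis:agentName>Archivematica</premis:agentName>
    <premis:agentType>software</premis:agentType>
  </premis:agent>

  <!-- One aggregate ingestion Event instead of thousands of identical ones -->
  <premis:event>
    <premis:eventIdentifier>
      <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
      <!-- placeholder value for illustration -->
      <premis:eventIdentifierValue>00000000-0000-0000-0000-000000000001</premis:eventIdentifierValue>
    </premis:eventIdentifier>
    <premis:eventType>ingestion</premis:eventType>
    <premis:eventDateTime>2020-08-31T00:00:00Z</premis:eventDateTime>
    <premis:linkingAgentIdentifier>
      <premis:linkingAgentIdentifierType>preservation system</premis:linkingAgentIdentifierType>
      <premis:linkingAgentIdentifierValue>Archivematica</premis:linkingAgentIdentifierValue>
    </premis:linkingAgentIdentifier>
    <!-- Per-object linkingObjectIdentifier elements would still be needed to
         preserve the file-level audit trail; with tens of thousands of objects,
         that is where the remaining bulk would live -->
  </premis:event>
</premis:premis>
```

[In a Linked Data serialization using the PREMIS OWL ontology, the same information could be expressed as a handful of triples linking the event resource to the agent and object resources, which is where the further reduction in verbosity would come from.]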
Tool outputs
Yeah, those metadata are really verbose. It's a great idea to move them to
their own file, but at some point it might also make sense to figure out how to
summarize or otherwise truncate certain types of output. We've seen cases where
FITS doesn't provide meaningful data, for example, and have suggested people
turn the tool off. But turning off is easier than
truncating/summarizing/selecting only certain data elements from a diverse
array of tools with outputs that vary depending on file format. I don't have a
clear answer right now on how that could be approached. Moving to Linked Data
wouldn't necessarily help us, because if the tool outputs are XML we're not
going to serialize those outputs to LD - the mapping would be very complicated
and the outputs could change over time. One thing I wonder about is whether
tool outputs are always needed at all - maybe if you have an AIP with 50,000
files in the same format or small set of formats, it's not necessary from a
preservation standpoint to extract technical metadata in the first place? Maybe
metadata extraction could be run sometime in the future if needed? Throwing
that out there as an idea.
I hope that's helpful! Happy to continue the conversation any time.
Regards,
--
Evelyn McLellan | Systems Archivist & Metadata Specialist |
www.artefactual.com
On Fri, Aug 28, 2020 at 10:00 AM Sarah Romkey
<sromkey@xxxxxxxxxxxxxxx> wrote:
Hi Robert,
I'll let Evelyn or Ross jump in with any addendums to this (I have also cc'd
Joel who is managing the project) but I think what we're saying is, the METS
file could be made into three instead of one, to make the "primary" AIP and
file information (premis:object and AIP structure, etc) easier to index and
access. It's one of the options we're exploring. Whichever idea or combination
of ideas we land on, the first iteration will be released as a toggle, a beta
feature, for users to try and provide feedback on.
I'd like the rest of the PSP members to have a chance to weigh in on your
suggestion of a big investment in changing Archivematica's infrastructure,
because I'd like to hear those opinions before expressing my own, but also
because I only have a couple hours left before I leave for vacation for a week
;) I look forward to catching up on the conversation when I return!
A couple of links for you all:
Slides from yesterday's presentation:
https://docs.google.com/presentation/d/1DnpKWmUF0NTmCk1HaCfbxSI8qijoUtRLKmmnHQjC8vk/edit?usp=sharing
Recording from yesterday:
https://drive.google.com/file/d/1waWoPu7sbaMcswIIrZXvWrwP0XWJEGMQ/view?usp=sharing
And just a reminder to please fill out the Doodle poll to discuss the terms of
reference: https://doodle.com/edit/qycnnpvankwq6t7u/options
And finally, a reminder also for final comments on the Archivematica Core
document!
https://docs.google.com/document/d/1Pn5cpK73nSK-xelUm5O9YGmkycKgqZB69Y9Bt7DgkvM/edit#heading=h.26w3lc5ec2a2
(I think I gave you until the end of this month but as you have gathered I am
away next week so you have a one week extension if you haven't had time!)
Thank you Robert and thank you everyone for your ideas and contributions!
Cheers,
Sarah
Sarah Romkey, MAS,MLIS
Archivematica Program Manager
She/her
Artefactual Systems <http://artefactual.com>
604-527-2056
@archivematica <http://www.twitter.com/archivematica> /
@accesstomemory <http://www.twitter.com/accesstomemory>
On Fri, Aug 28, 2020 at 11:12 AM Robert Gillesse
<robert.gillesse@xxxxxxx> wrote:
Hi Sarah, Evelyn, Ross and others,
Thanks for yesterday’s very informative session on the progress of the
METS/PREMIS verbosity reduction work (and thanks, Wellcome Collection and
Bodleian for sponsoring it). Processing all that was said yesterday I have a
few questions left I would like to share with you.
If I understood correctly (and it was quite ‘heady’ yesterday, so excuse me if
I’m wrong), there will be three METS files instead of one: one ‘main’ file, one
file containing the PREMIS Events and one file containing the tool output. If
so, could it be that in the case of an AIP with a huge number of (small) files
(we have archives of 100,000+ files) the latter two files will still be huge
and thus unwieldy? Presumably testing is needed to see how that will pan out?
And looking at our own use case: would this mean we can ingest our larger
archives? Or are more things needed?
Which brings me to the linked data solution: as I understood yesterday, this is
still seen as the ideal solution to this problem, but implementing it would mean
a total restructuring of Archivematica and is therefore a project of a totally
different order. Two questions here: how big would such a project (roughly) be?
And, related: how much better would Archivematica operate if this investment
were made? I realize these are really hard questions to answer, but what I
really want to say is this: if a big investment would make Archivematica a
significantly better product, maybe we could find a larger group of sponsors
who would be willing to invest and bring Archivematica to another level.
Coming back to my earlier email about the role of the PSP group: I think that
by contributing to answering questions like these, the PSP group can make a
difference.
All the best and have a nice weekend,
Robert