
Friday, 12 May 2017

AtoM Camp takeaways

The view from the window at AtoM Camp ...not that there was
any time to gaze out of the window of course...
I’ve spent the last three days in Cambridge at AtoM Camp. This was the second ever AtoM Camp, and the first in Europe. A big thanks to St John’s College for hosting it and to Artefactual Systems for putting it on.

It really has been an interesting few days, with a packed programme and an engaged group of attendees from across Europe and beyond bringing different levels of experience with AtoM.

As a ‘camp counsellor’ I was able to take to the floor at regular intervals to share some of our experiences of implementing AtoM at the Borthwick, covering topics such as system selection, querying the MySQL database, building the community and overcoming implementation challenges.

However, I was also there to learn!

Here are some bits and pieces that I’ve taken away.

My first real takeaway is that I now have a working copy of the soon-to-be-released AtoM 2.4 on my MacBook - this is really quite cool. I'll never again be bored on a train - I can just fire up Ubuntu and have a play!

Walk to Camp takes you over Cambridge's Bridge of Sighs
During the camp it was great to be able to hear about some of the new features that will be available in this latest release.

At the Borthwick Institute our catalogue is still running on AtoM 2.2 so we are pretty excited about moving to 2.4 and being able to take advantage of all of this new functionality.

Just some of the new features I learnt about that I can see an immediate use case for are:

  • Being able to generate slugs (the end bit of the URL to a record in AtoM) from archival reference numbers rather than titles - this makes perfect sense to me and would make for neater links
  • A modification of the re-indexing script which allows you to specify which elements you want to re-index. I like this one as it means I will not need to get out of bed so early to carry out re-indexes if, for example, it is only the (non-public facing) accessions records that need re-indexing.
  • Some really helpful changes to the search results - The default operator in an AtoM search has now been changed from ‘OR’ to ‘AND’. This is a change we already made to our local instance (as have several others) but it is good to see that AtoM now has this built in. Another change focuses on weighting of results and ensures that the most relevant results appear first. This relevance ranking is related to the fields in which the search terms appear - thus, a hit in the title field would appear higher than a hit in scope and content.
  • Importing data through the interface will now be carried out via the job scheduler, so large imports won't time out. This is great news as it will give colleagues the ability to do all imports themselves rather than having to wait until someone can do this through the command line.


On day two of camp I enjoyed the implementation tours, seeing how other institutions have implemented AtoM and the tweaks and modifications they have made. For example it was interesting to see the shopping cart feature developed for the Mennonite Archival Image Database and the most popular image carousel feature on the front page of the Chinese Canadian Artifacts Project. I was also interested in some of the modifications the National Library of Wales have made to meet their own needs.

It was also nice to hear the Borthwick Catalogue described by Dan as “elegant”!


There was a great session on community and governance at the end of day two which was one of the highlights of the camp for me. It gave attendees the chance to really understand the business model of Artefactual (as well as alternatives to the bounty model in use by other open source projects). We also got a full history of the evolution of AtoM and saw the very first project logo and vision.

The AtoM vision hasn't changed too much but the name and logo have!

Dan Gillean from Artefactual articulated the problem of trying to get funding for essential and ongoing tasks, such as code modernisation. Two examples he used were updating AtoM to work with the latest version of Symfony and Elasticsearch - both of these tasks need to happen in order to keep AtoM moving in the right direction but both require a substantial amount of work and are not likely to be picked up and funded by the community.

I was interested to see Artefactual’s vision for a new AtoM 3.0 which would see some fundamental changes to the way AtoM works and a more up-to-date, modular and scalable architecture designed to meet the future use cases of the growing AtoM community.

Artefactual's proposed modular architecture for AtoM 3.0

There is no timeline for AtoM 3.0, and whether it goes ahead or not is entirely dependent on a substantial source of funding being available. It was great to see Artefactual sharing their vision and encouraging feedback from the community at this early stage though.

Another highlight of Camp:
a tour of the archives of St John's College from Tracy Deakin
A session on data migrations on day three included a demo of OpenRefine from Sara Allain from Artefactual. I’d heard of this tool before but wasn’t entirely sure what it did and whether it would be of use to me. Sara demonstrated how it could be used to bash data into shape before import into AtoM. It seemed to be capable of doing all the things that I’ve previously done in Excel (and more) ...but without so much pain. I’ll definitely be looking to try this out when I next have some data to clean up.

Dan Gillean and Pete Vox from IMAGIZ talked through the process of importing data into AtoM. Pete focused on an example from Croydon Museum Service whose data needed to be migrated from CALM. He talked through some of the challenges of the task and how he would approach this differently in future. It is clear that the complexities of data migration may be one of the biggest barriers to institutions moving to AtoM from an alternative system, but it was encouraging to hear that none of these challenges are insurmountable.

My final takeaway from AtoM Camp is a long list of actions - new things I have learnt that I want to read up on or try out for myself ...I'd best crack on!




Friday, 28 April 2017

How can we preserve Google Documents?

Last month I asked (and tried to answer) the question How can we preserve our wiki pages?

This month I am investigating the slightly more challenging issue of how to preserve native Google Drive files, specifically documents*.

Why?

At the University of York we work a lot with Google Drive. We have the G Suite for Education (formerly known as Google Apps for Education) and as part of this we have embraced Google Drive and it is now widely used across the University. For many (me included) it has become the tool of choice for creating documents, spreadsheets and presentations. The ability to share documents and directly collaborate are key.

So of course it is inevitable that at some point we will need to think about how to preserve them.

How hard can it be?

Quite hard actually.

The basic problem is that documents created in Google Drive are not really "files" at all.

The majority of the techniques and models that we use in digital preservation are based around the fact that you have a digital object that you can see in your file system, copy from place to place and package up into an Archival Information Package (AIP).

In the digital preservation community we're all pretty comfortable with that way of working.

The key challenge with stuff created in Google Drive is that it doesn't really exist as a file.

Always living in hope that someone has already solved the problem, I asked the question on Twitter and that really helped with my research.

Isn't the digital preservation community great?

Exporting Documents from Google Drive

I started off testing the different download options available within Google docs. For my tests I used two native Google documents. One was the working version of our Phase 1 Filling the Digital Preservation Gap report. This report was originally authored as a Google doc, was 56 pages long and consisted of text, tables, images, footnotes, links, formatted text, page numbers, colours etc (i.e. lots of significant properties I could assess). I also used another simpler document for testing - this one was just basic text and tables but also included comments by several contributors.

I exported both of these documents into all of the different export formats that Google supports and assessed the results, looking at each characteristic of the document in turn and establishing whether or not I felt it was adequately retained.
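
As an aside, if you wanted to script this process rather than use the download menus, the Google Drive API can carry out the same conversions. Below is a minimal sketch (not the method I used) using the google-api-python-client library; it assumes you already have authorised credentials, the file ID is a placeholder, and the export MIME types are the standard ones Google lists for Docs conversions.

# A minimal sketch: exporting a native Google Doc to several formats via the
# Drive API (v3). Assumes 'creds' is an authorised Credentials object and
# that google-api-python-client is installed; the file ID is a placeholder.
from googleapiclient.discovery import build

FILE_ID = 'your-google-doc-id'  # placeholder

EXPORT_FORMATS = {
    'docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'odt': 'application/vnd.oasis.opendocument.text',
    'rtf': 'application/rtf',
    'txt': 'text/plain',
    'pdf': 'application/pdf',
    'epub': 'application/epub+zip',
    'html': 'text/html',
}

def export_all(creds, file_id=FILE_ID):
    service = build('drive', 'v3', credentials=creds)
    for extension, mime_type in EXPORT_FORMATS.items():
        content = service.files().export(fileId=file_id, mimeType=mime_type).execute()
        with open('report.' + extension, 'wb') as out:
            out.write(content)
        print('Saved report.' + extension, '-', len(content), 'bytes')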

Here is a summary of my findings, looking specifically at the Filling the Digital Preservation Gap phase 1 report document:

  • docx - This was a pretty good copy of the original. It retained all of the key features of the report that I was looking for (images, tables, footnotes, links, colours, formatting etc), however, the 56 page report was now only 55 pages (in the original, page 48 was blank, but in the docx version this blank page wasn't there).
  • odt - Again, this was a good copy of the original and much like the docx version in terms of the features it retained. However, the 56 page report was now only 54 pages long. Again it omitted page 48 which was blank in the Google version, but also slightly more words were squeezed on to each page which meant that it comprised fewer pages. Initially I thought the quality of the images was degraded slightly but this turned out to be just the way they appeared to render in LibreOffice. Looking inside the actual odt file structure and viewing the images as files demonstrated to me that they were actually OK.
  • rtf - First of all it is worth saying that the Rich Text Format file was *massive*. The key features of the document were retained, although the report document was now 60 pages long instead of 56!
  • txt - Not surprisingly this produces a tiny file that retains only the text of the original document. Obviously the original images, tables, colours, formatting etc were all lost. About the only other notable feature that was retained was the footnotes, and these appeared together right at the end of the document. Also a txt file does not have a number of 'pages'... not until you print it at least.
  • pdf - This was a good copy of the original report and retained all the formatting and features that I was looking for. This was also the only copy of the report that had the right number of pages. However, it seems that this is not something we can rely on. A close comparison of the pages of the pdf compared with the original shows that there are some differences regarding which words fall on to which page - it isn't exact!
  • epub - Many features of the report were retained but like the text file it was not paginated and the footnotes were all at the end of the document. The formatting was partially retained - the images were there, but were not always placed in the same positions as in the original. For example on the title page, the logos were not aligned correctly. Similarly, the title on the front page was not central.
  • html - This was very similar to the epub file regarding what was and wasn't retained. It included footnotes at the end and had the same issues with inconsistent formatting.

...but what about the comments?

My second test document was chosen so I could look specifically at the comments feature and how these were retained (or not) in the exported version.

  • docx - Comments are exported. On first inspection they appear to be anonymised, however this seems to be just how they are rendered in Microsoft Word. Having unzipped and dug into the actual docx file and looked at the XML file that holds the comments, it is clear that a more detailed level of information is retained - see images below. The placement of the comments is not always accurate. In one instance the reply to a comment is assigned to text within a subsequent row of the table rather than to the same row as the original comment.
  • odt - Comments are included, are attributed to individuals and have a date and time. Again, matching up of comments with the right section of text is not always accurate - in one instance a comment and its reply are linked to the table cell underneath the one that they referenced in the original document.
  • rtf - Comments are included but appear to be anonymised when displayed in MS Word...I haven't dug around enough to establish whether or not this is just a rendering issue.
  • txt - Comments are retained but appear at the end of the document with a [a], [b] etc prefix - these letters appear in the main body text to show where the comments appeared. No information about who made the comment is preserved.
  • pdf - Comments not exported
  • epub - Comments not exported
  • html - Comments are present but appear at the end of the document with a code which also acts as a placeholder in the text where the comment appeared. References to the comments in the text are hyperlinks which take you to the right comment at the bottom of the document. There is no indication of who made the comment (not even hidden within the html tags).

A comment in original Google doc

The same comment in docx as rendered by MS Word

...but in the XML buried deep within the docx file structure - we do have attribution and date/time
(though clearly in a different time zone)
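
If you would rather not dig through the XML by hand, the same comment metadata can be pulled out with a few lines of code, since a docx file is just a zip archive with the comments stored in word/comments.xml. A rough sketch using only the Python standard library (the filename is a placeholder):

# Rough sketch: list comment authors, dates and text from a docx export.
# A docx file is a zip archive; comments (if any) live in word/comments.xml.
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def list_comments(docx_path):
    with zipfile.ZipFile(docx_path) as zf:
        with zf.open('word/comments.xml') as comments_xml:
            root = ET.parse(comments_xml).getroot()
    for comment in root.iter(W + 'comment'):
        author = comment.get(W + 'author')
        date = comment.get(W + 'date')
        text = ''.join(t.text or '' for t in comment.iter(W + 't'))
        print(date, author + ':', text)

list_comments('test-document.docx')  # placeholder filename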

What about bulk export options?

Ed Pinsent pointed me to the Google Takeout Service which allows you to:
"Create an archive with your data from Google products"
[Google's words not mine - and perhaps this is a good time to point you to Ed's blog post on the meaning of the term 'Archive']

This is really useful. It allows you to download Google Drive files in bulk and to select which formats you want to export them as.

I tested this a couple of times and was surprised to discover that if you select pdf or docx (and perhaps other formats that I didn't test) as your export format of choice, the takeout service creates the file in the format requested and an html file which includes all comments within the document (even those that have been resolved). The content of the comments/responses including dates and times is all included within the html file, as are names of individuals.

The downside of the Google Takeout Service is that it only allows you to select folders and not individual files. There is another incentive for us to organise our files better! The other issue is that it will only export documents that you are the owner of - and you may not own everything that you want to archive!

What's missing?

Quite a lot actually.

The owner, creation and last modified dates of a document in Google Drive are visible when you click on Document details... within the File menu. Obviously this is really useful information for the archive but is lost as soon as you download it into one of the available export formats.

Creation and last modified dates as visible in Document details

Update: I was pleased to see that if using the Google Takeout Service to bulk export files from Drive, the last modified dates are retained, however on single file export/download these dates are lost and the last modified date of the file becomes the date that you carried out the export. 

Part of the revision history of my Google doc
But of course in a Google document there is more metadata. Similar to the 'Page History' that I mentioned when talking about preserving wiki pages, a Google document has a 'Revision history'.

Again, this *could* be useful to the archive. Perhaps not so much so for my document which I worked on by myself in March, but I could see more of a use case for mapping and recording the creative process of writing a novel for example. 

Having this revision history would also allow you to do some pretty cool stuff such as that described in this blog post: How I reverse engineered Google Docs to play back any document's keystrokes (thanks to Nick Krabbenhoft for the link).
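
The Drive API will also hand back at least some revision metadata programmatically (dates and who made each revision), though not the full keystroke-level history described in that post. Another hedged sketch, again assuming authorised credentials and using a placeholder file ID; note that only the revisions Drive chooses to keep are listed, not every individual edit.

# Sketch: list revision metadata for a Google Doc via the Drive API (v3).
# 'creds' is assumed to be an authorised Credentials object; the file ID is
# a placeholder. Only the revisions Drive retains are returned.
from googleapiclient.discovery import build

def list_revisions(creds, file_id='your-google-doc-id'):
    service = build('drive', 'v3', credentials=creds)
    response = service.revisions().list(
        fileId=file_id,
        fields='revisions(id,modifiedTime,lastModifyingUser/displayName)'
    ).execute()
    for revision in response.get('revisions', []):
        user = revision.get('lastModifyingUser', {}).get('displayName', 'unknown')
        print(revision['id'], revision['modifiedTime'], user)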

It would seem that the only obvious way to retain this information would be to keep the documents in their original native Google format within Google Drive but how much confidence do we have that it will be safe there for the long term?

Conclusions

If you want to preserve a Google Drive document there are several options but no one-size-fits-all solution.

As always it boils down to what the significant properties of the document are. What is it we are actually trying to preserve?

  • If we want a fairly accurate but non interactive digital 'print' of the document, pdf might be the most accurate representation though even the pdf export can't be relied on to retain the exact pagination. Note that I didn't try and validate the pdf files that I exported and sadly there is no pdf/a export option.
  • If comments are seen to be a key feature of the document then docx or odt will be a good option but again this is not perfect. With the test document I used, comments were not always linked to the correct point within the document.
  • If it is possible to get the owner of the files to export them, the Google Takeout Service could be used. Perhaps creating a pdf version of the static document along with a separate html file to capture the comments.

A key point to note is that all export options are imperfect so it would be important to check the exported document against the original to ensure it accurately retains the important features.

Another option would be simply keeping them in their native format but trying to get some level of control over them - taking ownership and managing sharing and edit permissions so that they can't be changed. I've been speaking to one of our Google Drive experts in IT about the logistics of this. A Google Team Drive belonging to the Archives could be used to temporarily store and lock down Google documents of archival value whilst we wait and see what happens next. 

...I live in hope that export options will improve in the future.

This is a work in progress and I'd love to find out what others think.




* note, I've also been looking at Google Sheets and that may be the subject of another blog post

Friday, 7 April 2017

Archivematica Camp York: Some thoughts from the lake

Well, that was a busy week!

Yesterday was the last day of Archivematica Camp York - an event organised by Artefactual Systems and hosted here at the University of York. The camp's intention was to provide a space for anyone interested in or currently using Archivematica to come together, learn about the platform from other users, and share their experiences. I think it succeeded in this, bringing together 30+ 'campers' from across the UK, Europe and as far afield as Brazil for three days of sessions covering different aspects of Archivematica.

Our pod on the lake (definitely a lake - not a pond!)
My main goal at camp was to ensure everyone found their way to the rooms (including the lakeside pod) and that we were suitably fuelled with coffee, popcorn and cake. Alongside these vital tasks I also managed to partake in the sessions, have a play with the new version of Archivematica (1.6) and learn a lot in the process.

I can't possibly capture everything in this brief blog post so if you want to know more, have a look back at all the #AMCampYork tweets.

What I've focused on below are some of the recurring themes that came up over the three days.

Workflows

Archivematica is just one part of a bigger picture for institutions that are carrying out digital preservation, so it is always very helpful to see how others are implementing it and what systems they will be integrating with. A session on workflows in which participants were invited to talk about their own implementations was really interesting. 

Other sessions also helped highlight the variety of different configurations and workflows that are possible using Archivematica. I hadn't quite realised there were so many different ways you could carry out a transfer!

In a session on specialised workflows, Sara Allain talked us through the different options. One workflow I hadn't been aware of before was the ability to include checksums as part of your transfer. This sounds like something I need to take advantage of when I get Archivematica into production for the Borthwick. 

Justin talking about Automation Tools
A session on Automation Tools with Justin Simpson highlighted other possibilities - using Archivematica in a more automated fashion. 

We already have some experience of using Automation Tools at York as part of the work we carried out during phase 3 of Filling the Digital Preservation Gap, however I was struck by how many different ways these can be applied. Hearing examples from other institutions and for a variety of different use cases was really helpful.


Appraisal

The camp included a chance to play with Archivematica version 1.6 (which was only released a couple of weeks ago) as well as an introduction to the new Appraisal and Arrangement tab.

A session in progress at Archivematica Camp York
I'd been following this project with interest so it was great to be able to finally test out the new features (including the rather pleasing pie charts showing what file formats you have in your transfer). It was clear that there were a few improvements that could be made to the tab to make it more intuitive to use and to deal with things such as the ability to edit or delete tags, but it is certainly an interesting feature and one that I would like to explore more using some real data from our digital archive.

Throughout camp there was a fair bit of discussion around digital appraisal and at what point in your workflow this would be carried out. This was of particular interest to me being a topic I had recently raised with colleagues back at base.

The Bentley Historical Library, who funded the work to create the new tab within Archivematica, are clearly keen to get their digital archives into Archivematica as soon as possible and then carry out the work there after transfer. The addition of this new tab now makes this workflow possible.

Kirsty Lee from the University of Edinburgh described her own pre-ingest methodology and the tools she uses to help her appraise material before transfer to Archivematica. She talked about some tools (such as TreeSize Pro) that I'm really keen to follow up on.

At the moment I'm undecided about exactly where and how this appraisal work will be carried out at York, and in particular how this will work for hybrid collections, so as always it is interesting to hear from others about what works for them.


Metadata and reporting

Evelyn admitting she loves PREMIS and METS
Evelyn McLellan from Artefactual led a 'Metadata Deep Dive' on day 2 and despite the title, this was actually a pretty interesting session!

We got into the details of METS and PREMIS and how they are implemented within Archivematica. Although I generally try not to look too closely at METS and PREMIS it was good to have them demystified. On the first day, through a series of exercises, we had been encouraged to look at a METS file created by Archivematica and try to pick out some information from it, so these sessions in combination were really useful.

Across various sessions of the camp there was also a running discussion around reporting. Given that Archivematica stores such a detailed range of metadata in the METS file, how do we actually make use of this? Being able to report on how many AIPs have been created, how many files they contain and how big they are is useful. These are statistics that I currently collect (manually) on a quarterly basis and share with colleagues. Once Archivematica is in place at York, digging further into those rich METS files to find out which file formats are in the digital archive would be really helpful for preservation planning (among other things). There was discussion about whether reporting should be a feature of Archivematica or a job that should be done outside Archivematica.
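
As a very rough illustration of the 'outside Archivematica' option, much of that format information can be pulled straight out of an AIP's METS file with a short script. This is only a sketch: the PREMIS namespace below is the version 2 one I have seen in Archivematica METS files, so do check what your own AIPs use before relying on it.

# Sketch: count the file formats recorded in an Archivematica AIP METS file.
# Assumes the PREMIS v2 namespace ('info:lc/xmlns/premis-v2'); newer releases
# may embed PREMIS 3, so check your own METS files.
import sys
from collections import Counter
import xml.etree.ElementTree as ET

PREMIS = '{info:lc/xmlns/premis-v2}'

def format_counts(mets_path):
    counts = Counter()
    for format_name in ET.parse(mets_path).getroot().iter(PREMIS + 'formatName'):
        counts[format_name.text] += 1
    return counts

if __name__ == '__main__':
    for name, count in format_counts(sys.argv[1]).most_common():
        print(count, name)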

In relation to the latter option - I described in one session how some of our phase 2 work of Filling the Digital Preservation Gap was designed to help expose metadata from Archivematica to a third party reporting system. The Jisc Research Data Shared Service was also mentioned in this context as reporting outside of Archivematica will need to be addressed as part of this project.

Community

As with most open source software, community is important. This was touched on throughout the camp and was the focus of the last session on the last day.

There was a discussion about the role of Artefactual Systems and the role of Archivematica users. Obviously we are all encouraged to engage and help sustain the project in whatever way we are able. This could be by sharing successes and failures (I was pleased that my blog got a mention here!), submitting code and bug reports, sponsoring new features (perhaps something listed on the development roadmap) or helping others by responding to queries on the mailing list. It doesn't matter - just get involved!

I was also able to highlight the UK Archivematica group and talk about what we do and what we get out of it. As well as encouraging new members to the group, there was also discussion about the potential for forming other regional groups like this in other countries.

Some of the Archivematica community - class of Archivematica Camp York 2017

...and finally

Another real success for us at York was having the opportunity to get technical staff at York working with Artefactual to resolve some problems we had with getting our first Archivematica implementation into production. Real progress was made and I'm hoping we can finally start using Archivematica for real at the end of next month.

So, that was Archivematica Camp!

A big thanks to all who came to York and to Artefactual for organising the programme. As promised, the sun shone and there were ducks on the lake - what more could you ask for?



Thanks to Paul Shields for the photos

Monday, 13 March 2017

Want to learn about Archivematica whilst watching the ducks?

We are really excited to be hosting the first European Archivematica Camp here at the University of York next month, on 4-6 April.

Don't worry - there will be no tents or campfires...but there may be some wildlife on the lake.


The Ron Cooke Hub on a frosty morning - hoping for some warmer weather for Camp!

The event is taking place at the Ron Cooke Hub over on our Heslington East campus. If you want to visit the beautiful City of York (OK, I'm biased!) and meet other European Archivematica users (or Archivematica explorers) this event is for you. Artefactual Systems will be leading the event and the agenda is looking very full and interesting.

I'm most looking forward to learning more about the workflows that other Archivematica users have in place or are planning to implement.


One of these lakeside 'pods' will be our breakout room


There are still places left and you can register for Camp here or contact the organisers at info@artefactual.com.

...and if you are not able to attend in person, do watch this blog in early April as you can guarantee I'll be blogging after the event!


Friday, 10 March 2017

How can we preserve our wiki pages?

I was recently prompted by a colleague to investigate options for preserving institutional wiki pages. At the University of York we use the Confluence wiki and this is available for all staff to use for a variety of purposes. In the Archives we have our own wiki space on Confluence which we use primarily for our meeting agendas and minutes. The question asked of me was how can we best capture content on the wiki that needs to be preserved for the long term? 

Good question and just the sort of thing I like to investigate. Here are my findings...

Space export

The most sensible way to approach the transfer of a set of wiki pages to the digital archive would be to export them using the export options available within the Space Tools.

The main problem with this approach is that a user will need to have the necessary permissions on the wiki space in order to be able to use these tools ...I found that I only had the necessary permissions on those wiki spaces that I administer myself.

There are three export options as illustrated below:


Space export options - available if you have the right permissions!


HTML

Once you select HTML, there are two options - a standard export (which exports the whole space) or a custom export (which allows you to select the pages you would like included within the export).

I went for a custom export and selected just one section of meeting papers. Each wiki page is saved as an HTML file. DROID identifies these as HTML version 5. All relevant attachments are included in the download in their original format.

There are some really good things about this export option:
  • The inclusion of attachments in the export - these are often going to be as valuable to us as the wiki page content itself. Note that they were all renamed with a number that tied them to the page that they were associated with. It seemed that the original file name was however preserved in the linking wiki page text 
  • The metadata at the top of a wiki page is present in the HTML pages: ie Created by Jenny Mitcham, last modified by Jenny Mitcham on 31, Oct, 2016 - this is really important to us from an archival point of view
  • The links work - including links to the downloaded attachments, other wiki pages and external websites or Google Docs
  • The export includes an index page which can act as a table of contents for the exported files - this also includes some basic metadata about the wiki space

XML

Again, there are two options here - either a standard export (of the whole space) or a custom export, which allows you to select whether or not you want comments to be exported and choose exactly which pages you want to export.

I tried the custom export. It seemed to work and also did export all the relevant attachments. The attachments were all renamed as '1' (with no file extension), and the wiki page content was all bundled up into one huge XML file.

On the plus side, this export option may contain more metadata than the other options (for example the page history) but it is difficult to tell as the XML file is so big and unwieldy and hard to interpret. Really it isn't designed to be usable. The main function of this export option is to move wiki pages into another instance of Confluence.

PDF

Again you have the option to export whole space or choose your pages. There are also other configurations you can make to the output but these are mostly cosmetic.

I chose the same batch of meeting papers to export as PDF and this produced a 111-page PDF document. The first page is a contents page which lists all the other pages alphabetically with hyperlinks to the right section of the document. It is hard to use the document as the wiki pages seem to run into each other without adequate spacing and because of the linear nature of a pdf document you feel drawn to read it in the order it is presented (which in this case is not a logical order for the content). Attachments are not included in the download though links to the attachments are maintained in the PDF file and they do continue to resolve to the right place on the wiki. Creation and last modified metadata is also not included in the export.

Single page export

As well as the Space Export options in Confluence there are also single page export options. These are available to anyone who can access the wiki page so may be useful if people do not have necessary permissions for a space export.

I exported a range of test pages using the 'Export to PDF' and 'Export to Word' options.

Export to PDF

The PDF files created in this manner are version 1.4. Sadly no option to export as PDF/A, but at least version 1.4 is closer to the PDF/A standard than some, so perhaps a subsequent migration to PDF/A would be successful.

Export to Word

Surprisingly the 'Word' files produced by Confluence appear not to be Word files at all!

Double click on the files in Windows Explorer and they open in Microsoft Word no problem, but DROID identifies the files as HTML (with no version number) and reports a file extension mismatch (because the files have a .doc extension).

If you view the files in a text application you can clearly see the Content-Type marked as text/html and <html> tags within the document. Quick View Plus, however, views them as an Internet Mail Message with the following text displayed at the top of each page:


Subject: Exported From Confluence
1024x640 72 Print 90

All very confusing and certainly not giving me a lot of faith in this particular export format!
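
One way to confirm what these exports really are, without relying on how any particular application chooses to render them, is to look at the first few bytes of the file. A genuine binary Word file starts with the OLE2 signature (D0 CF 11 E0 ...), whereas these Confluence 'Word' exports start with HTML. A small sketch (the filename is a placeholder):

# Sketch: check whether a '.doc' export is a real binary Word file or HTML.
# Genuine OLE2/Word binaries start with the signature D0 CF 11 E0 A1 B1 1A E1.
OLE2_SIGNATURE = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'

def sniff(path):
    with open(path, 'rb') as f:
        header = f.read(512)
    if header.startswith(OLE2_SIGNATURE):
        return 'binary Word (OLE2) file'
    if b'<html' in header.lower() or b'text/html' in header.lower():
        return 'HTML masquerading as .doc'
    return 'something else - investigate further'

print(sniff('exported-page.doc'))  # placeholder filename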


Comparison

Both of these single page export formats do a reasonable job of retaining the basic content of the wiki pages - both versions include many of the key features I was looking for - text, images, tables, bullet points, colours. 

Where advanced formatting has been used to lay out a page using coloured boxes, the PDF version does a better job at replicating this than the 'Word' version. Whilst the PDF attempts to retain the original formatting, the 'Word' version displays the information in a much more linear fashion.

Links were also more usefully replicated in the PDF version. The absolute URLs of all links, whether internal, external or to attachments, were included within the PDF file so that it is possible to follow them to their original location (if you have the necessary permissions to view the pages). On the 'Word' versions, only external links worked in this way. Internal wiki links and links to attachments were exported as relative links which become 'broken' once that page is taken out of its original context.

The naming of the files that were produced is also worthy of comment. The 'Word' versions are given a name which mirrors the name of the page within the wiki space, but the naming of the PDF versions is much more useful, including the name of the wiki space itself, the page name and a date and timestamp showing when the page was exported.


Neither of these single page export formats retained the creation and last modified metadata for each page and this is something that it would be very helpful to retain.

Conclusions

So, if we want to preserve pages from our institutional wiki, what is the best approach?

The Space Export in HTML format is a clear winner. It reproduces the wiki pages in a reusable form that replicates the page content well. As HTML is essentially just ASCII text it is also a good format for long term preservation.

What impressed me about the HTML export was the fact that it retained the content, included basic creation and last modified metadata for each page and downloaded all relevant attachments, updating the links to point to these local copies.

What if someone does not have the necessary permissions to do a space export? My first suggestion would be that they ask for their permissions to be upgraded. If not, perhaps someone who does have necessary permissions could carry out the export?

If all else fails, the export of a single page using the 'Export as PDF' option could be used to provide ad hoc content for the digital archive. PDF is not the best preservation format but may be able to be converted to PDF/A. Note that any attachments would have to be exported separately and manually if this option was selected.

Final thoughts

A wiki space is a dynamic thing which can involve several different types of content - blog posts, labels/tags and comments can all be added to wiki spaces and pages. If these elements are thought to be significant then more work is required to see how they can be captured. It was apparent that comments could be captured using the HTML and XML exports and I believe blog posts can be captured individually as PDF files.

What is also available within the wiki platform itself is a very detailed Page History. Within each wiki page it is possible to view the Page History and see how a page has evolved over time - who has edited it and when those edits occurred. As far as I could see, none of the export formats included this level of information. The only exception may be the XML export but this was so difficult to view that I could not be sure either way.

So, there are limitations to all these approaches and as ever this goes back to the age old discussion about Significant Properties. What is significant about the wiki pages? What is it that we are trying to preserve? None of the export options preserve everything. All are compromises, but perhaps some are compromises we could live with.

Tuesday, 7 March 2017

Thumbs.db – what are they for and why should I care?

Recent work I’ve been doing on the digital archive has made me think a bit more about those seemingly innocuous files that Windows (XP, Vista, 7 and 8) puts into any directory that has images in – Thumbs.db.

Getting your folder options right helps!
Windows uses a file called Thumbs.db to create little thumbnail images of any images within a directory. It stores one of these files in each directory that contains images and it is amazing how quickly they proliferate. Until recently I wasn’t aware I had any in my digital archive at all. This is because although my preferences in Windows Explorer were set to display hidden files, the "Hide protected operating system files" option also needs to be disabled in order to see files such as these.

The reason I knew I had all these Thumbs.db files was through a piece of DROID analysis work published last month. Thumbs.db ranked at number 12 in my list of the most frequently occurring file formats in the digital archive. I had 210 of these files in total. I mentioned at the time that I could write a whole blog post about this, so here it is!

Do I really want these in the digital archive? In my mind, what is in the ‘original’ folders within the digital archive should be what OAIS would call the Submission Information Package (SIP). Just those files that were given to us by a donor or depositor. Not files that were created subsequently by my own operating system.

Though they are harmless enough, they can be a bit irritating. Firstly, when I'm trying to run reports on the contents of the archive, the number of files for each archive is skewed by the Thumbs.db files that are not really a part of the archive. Secondly, and perhaps more importantly, I was trying to create a profile of the dates of files within the digital archive (admittedly not an exact science when using last modified dates) and the span of dates for each individual archive that we hold. The presence of Thumbs.db files in each archive that contained images gave the false impression that all of the archives had had content added relatively recently, when in fact all that had happened was that a Thumbs.db file had automatically been added when I had transferred the data to the digital archive filestore. It took me a while to realise this - gah!

So, what to do? First I needed to work out how to stop them being created.

After a bit of googling I quickly established the fact that I didn’t have the necessary permissions to be able to disable this default behaviour within Windows so I called in the help of IT Services.

IT clearly thought this was a slightly unusual request, but made a change to my account which now stops these thumbnail images being created by me. Given that I am the only person who has direct access to the born digital material within the archive, this should solve that problem.

Now I can systematically remove the files. This means that they won’t skew any future reports I run on numbers of files and last modified dates.
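
Tracking them all down by hand would be tedious, so a short script can do the sweep. A minimal sketch (the archive path is a placeholder, and it is worth running it in report-only mode first to see what would be removed):

# Sketch: find (and optionally delete) Thumbs.db files below a directory.
# Run with delete=False first and check the list before deleting anything.
import os

def sweep_thumbs(root, delete=False):
    found = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() == 'thumbs.db':
                path = os.path.join(dirpath, name)
                found += 1
                print(('deleting ' if delete else 'found    ') + path)
                if delete:
                    os.remove(path)
    print(found, 'Thumbs.db file(s) in total')

sweep_thumbs(r'X:\digital-archive\original', delete=False)  # placeholder path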

Perhaps once we get a proper digital archiving system in place here at the Borthwick we won’t need to worry about these issues as we won’t directly interact with the archive filestore? Archivematica will package up the data into an AIP and put it on the filestore for me.

However, I will say that now IT have stopped the use of Thumbs.db from my account I am starting to miss them. This setting applies to my own working filestore as well as the digital archive. It turns out that it is actually incredibly useful to be able to see thumbnails of your image files before double clicking on them! Perhaps I need to get better at practicing what I preach and make some improvements to how I name my own image files – without a preview thumbnail, an image file *really* does benefit from a descriptive filename!

As always, I'm interested to hear how other people tackle Thumbs.db and any other system files within their digital archives.

Monday, 13 February 2017

What have we got in our digital archive?

Do other digital archivists find that the work of a digital archivist rarely involves doing hands on stuff with digital archives? When you have to think about establishing your infrastructure, writing policies and plans and attending meetings it leaves little time for activities at the coal face. This makes it all the more satisfying when we do actually get the opportunity to work with our digital holdings.

In the past I've called for more open sharing of profiles of digital archive collections but I am aware that I had not yet done this for the contents of our born digital collections here at the Borthwick Institute for Archives. So here I try to address that gap.

I ran DROID (v 6.1.5, signature file v 88, container signature 20160927) over the deposited files in our digital archive and have spent a couple of days crunching the results. Note that this just covers the original files as they have been given to us. It does not include administrative files that I have added, or dissemination or preservation versions of files that have subsequently been created.
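
For anyone wanting to do similar crunching, most of the statistics below can be pulled out of DROID's CSV export with a short script rather than a spreadsheet. A rough sketch; the column names (TYPE, PUID, FORMAT_NAME, METHOD, LAST_MODIFIED) are those in the DROID CSV exports I have worked with, so check that yours match:

# Sketch: summarise a DROID CSV export - identification rate, identification
# methods, top formats and a rough last-modified-year profile. Column names
# are those seen in recent DROID exports; check your own export's headers.
import csv
from collections import Counter

def profile(droid_csv):
    total = identified = 0
    formats, methods, years = Counter(), Counter(), Counter()
    with open(droid_csv, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            if row.get('TYPE') == 'Folder':
                continue  # only interested in files
            total += 1
            if row.get('PUID'):
                identified += 1
                formats[row.get('FORMAT_NAME') or row['PUID']] += 1
                methods[row.get('METHOD') or 'unknown'] += 1
            if row.get('LAST_MODIFIED'):
                years[row['LAST_MODIFIED'][:4]] += 1
    print(identified, 'of', total, 'files identified')
    print('Identification methods:', dict(methods))
    for name, count in formats.most_common(10):
        print(count, name)
    print('Last modified years:', dict(sorted(years.items())))

profile('droid-export.csv')  # placeholder filename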

I was keen to see:
  • How many files could be automatically identified by DROID
  • What the current distribution of file formats looks like
  • Which collections contain the most unidentified files
...and also use these results to:
  • Inform future preservation planning and priorities
  • Feed further information to the PRONOM team at The National Archives
  • Get us to Level 2 of the NDSA Levels of Digital Preservation which asks for "an inventory of file formats in use" and which until now I haven't been collating!

Digital data has been deposited with us since before I started at the Borthwick in 2012 and continues to be deposited with us today. We do not have huge quantities of digital archives here as yet (about 100GB) and digital deposits are still the exception rather than the norm. We will be looking to chase digital archives more proactively once we have Archivematica in place and appropriate workflows established.

Last modified dates (as recorded by DROID) appear to range from 1984 to 2017 with a peak at 2008. This distribution is illustrated below. Note however, that this data is not always to be trusted (that could be another whole blog post in itself...). One thing that it is fair to say though is that the archive stretches back right to the early days of personal computers and up to the present day.

Last modified dates on files in the Borthwick digital archive

Here are some of the findings of this profiling exercise:

Summary statistics

  • DROID reported that 10005 individual files were present
  • 9431 (94%) of the files were given a file format identification by DROID. This is a really good result ...or at least it seems so in comparison to my previous data profiling efforts which have focused on research data. This result is also comparable with those found within other digital archives, for example 90% at Bentley Historical Library, 96% at Norfolk Record Office and 98% at Hull University Archives
  • 9326 (99%) of those files that were identified were given just one possible identification. 1 file was given 2 different identifications (an xlsx file) and 104 files (with a .DOC extension) were given 8 identifications. In all these cases of multiple identifications, identification was done by file extension rather than signature - which perhaps explains the uncertainty

Files that were identified

  • Of the 9431 files that were identified:
    • 6441 (68%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. Last year I was inducted into the magic ways this happens - see My First File Format Signature!)
    • 2546 (27%) were identified by container (which again suggests a high level of accuracy). The vast majority of these were Microsoft Office files 
    • 444 (5%) were identified by extension alone (which implies a less accurate identification)


  • Only 86 (1%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. There are all sorts of different examples here including:
    • files with a tmp or dot extension which are identified as Microsoft Word
    • files with a doc extension which are identified as Rich Text Format
    • files with an hmt extension identifying as JPEG files
    • and as in my previous research data example, a bunch of Extensible Markup Language files which had extensions other than XML
So perhaps these are things I'll look into in a bit more detail if I have time in the future.

  • 90 different file formats were identified within this collection of data

  • Of the identified files 1764 (19%) were identified as Microsoft Word Document 97-2003. This was followed very closely by JPEG File Interchange Format version 1.01 with 1675 (18%) occurrences. The top 10 identified files are illustrated below:

  • This top 10 is in many ways comparable to other similar profiles that have been published recently from Bentley Historical Library, Hull University Archive and Norfolk Records Office with high occurrences of Microsoft Word, PDF and JPEG images. In contrast, what is not so common in this profile are HTML files and GIF image files - these only just make it into the top 50.

  • Also notable in our top ten are the Sibelius files which haven't appeared in other recently published profiles. Sibelius is musical notation software and these files appear frequently in one of our archives.


Files that weren't identified

  • Of the 574 files that weren't identified by DROID, 125 different file extensions were represented. For most of these there was just a single example of each.

  • 160 (28%) of the unidentified files had no file extension at all. Perhaps not surprisingly it is the earlier files in our born digital collection (files from the mid-80s) that are most likely to fall into this category. These were created at a time when operating systems seemed to be a little less rigorous about enforcing the use of file extensions! Approximately 80 of these files are believed to be WordStar 4.0 (PUID: x-fmt/260) which DROID would only be able to recognise by file extension. Of course, if no extension is included, DROID has little chance of being able to identify them!

  • The most common file extensions of those files that weren't identified are visible in the graph below. I need to do some more investigation into these but most come from 2 of our archives that relate to electronic music composition:


I'm really pleased to see that the vast majority of the files that we hold can be identified using current tools. This is a much better result than for our research data. Obviously there is still room for improvement so I hope to find some time to do further investigations and provide information to help extend PRONOM.

Other follow on work involves looking at system files that have been highlighted in this exercise. See for example the AppleDouble Resource Fork files that appear in the top ten identified formats. Also appearing quite high up (at number 12) were Thumbs.db files but perhaps that is the topic of another blog post. In the meantime I'd be really interested to hear from anyone who thinks that system files such as these should be retained.


Friday, 10 February 2017

Harvesting EAD from AtoM: a collaborative approach

In a previous blog post AtoM harvesting (part 1) - it works! I described how archival descriptions within AtoM are being harvested as Dublin Core for inclusion within our University Library Catalogue.* I also hinted that this wouldn’t be the last you would hear from me on AtoM harvesting and that plans were afoot to enable much richer metadata in EAD 2002 XML (Encoded Archival Description) format to be harvested via OAI-PMH.

I’m pleased to be able to report that this work is now underway.

The University of York, along with five other organisations in the UK, has clubbed together to sponsor Artefactual Systems to carry out the necessary development work to make EAD harvesting possible. This work is scheduled for release in AtoM version 2.4 (due out in the Spring).

The work is being jointly sponsored by:



We are also receiving much needed support in this project from The Archives Hub who are providing advice on the AtoM EAD and will be helping us test the EAD harvesting when it is ready. While the sponsoring institutions are all producers of AtoM EAD, The Archives Hub is a consumer of that EAD. We are keen to ensure that the archival descriptions that we enter into AtoM can move smoothly to The Archives Hub (and potentially to other data aggregators in the future), allowing the richness of our collections to be signposted as widely as possible.

Adding this harvesting functionality to AtoM will enable The Archives Hub to gather data direct from us on a regular schedule or as and when updates occur, ensuring that:


  • Our data within the Archives Hub doesn’t stagnate
  • We manage our own master copy of the data and only need to edit this in one place
  • A minimum of human interaction is needed to incorporate our data into the Hub
  • It is easier for researchers to find information about the archives that we hold without having to search all of our individual catalogues
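
For anyone unfamiliar with OAI-PMH, a harvest is just a series of simple HTTP requests against the repository's OAI endpoint. The sketch below is entirely hypothetical: the endpoint path and the 'oai_ead' metadataPrefix are my assumptions about what the AtoM 2.4 work will expose and may well differ in the released version.

# Hypothetical sketch of an OAI-PMH ListRecords request for EAD from AtoM.
# The endpoint URL and metadataPrefix are assumptions about the 2.4 work and
# may differ in the released version. A real harvester would also follow any
# resumptionToken in the response to page through the full set of records.
import requests

BASE_URL = 'https://atom.example.org/;oai'  # replace with your AtoM OAI endpoint
params = {'verb': 'ListRecords', 'metadataPrefix': 'oai_ead'}

response = requests.get(BASE_URL, params=params, timeout=60)
response.raise_for_status()
print(response.text[:2000])  # OAI-PMH XML with EAD wrapped in <record> elements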


So, what are we doing at the moment?


  • Developers at Artefactual Systems are beavering away working on the initial development and getting the test site ready for us to play with.
  • The sponsoring institutions have been getting samples of their own AtoM data ready for loading up into the test deployment. It is always better when testing something to have some of your own data to mess around with.
  • The Borthwick have been having discussions with The Archives Hub for some time about AtoM EAD (from version 2.2) but we’ve picked up these discussions again and other institutions have joined in by supplying their own EAD samples. This allows staff at the Hub to see how EAD has changed in version 2.3 of AtoM (it hasn’t very much) and also to see how consistent the EAD from AtoM is from different institutions. We have been having some pretty detailed discussions about how we can make the EAD better, cleaner, fuller - either by data entry at the institutions, automated data cleaning at The Hub prior to display online or by further developments in AtoM.


What we are doing at the moment is good and a huge step in the right direction, but perhaps not perfect. As we work together on this project we are coming across areas where future work would be beneficial in order to improve the quality of the EAD that AtoM produces or to expand the scope of what can be harvested from AtoM. I hope to report on this in more detail at the end of the project, but in the meantime, do get in touch if you are interested in finding out more.







* It is great to see that this is working well and our Library Catalogue is now appearing in the referrer reports for the Borthwick Catalogue on Google Analytics. People are clearly following these new signposts to our archives!

Tuesday, 24 January 2017

Creating an annual accessions report using AtoM

So, it is that time of year where we need to complete our annual report on accessions for the National Archives. Along with lots of other archives across the UK we send The National Archives summary information about all the accessions we have received over the course of the previous year. This information is collated and provided online on the Accessions to Repositories website for all to see.

The creation of this report has always been a bit time consuming for our archivists, involving a lot of manual steps and some re-typing but since we have started using AtoM as our Archival Management System the process has become much more straightforward.

As I've reported in a previous blog post, AtoM does not do all that we want to do in the way of reporting via its front end.

However, AtoM has an underlying MySQL database and there is nothing to stop you bypassing the interface, looking at the data behind the scenes and pulling out all the information you need.

One of the things we got set up fairly early in our AtoM implementation project was a free MySQL client called Squirrel. Using Squirrel or another similar tool, you can view the database that stores all your AtoM data, browse the data and run queries to pull out the information you need. It is also possible to update the data using these SQL clients (very handy if you need to make any global changes to your data). All you need initially is a basic knowledge of SQL and you can start pulling some interesting reports from AtoM.

The downside of playing with the AtoM database is of course that it isn't nearly as user friendly as the front end.

It is always a bit of an adventure navigating the database structure and trying to work out how the tables are linked. Even with the help of an Entity Relationship Diagram from Artefactual, creating more complex queries is ...well ....complex!

AtoM's database tables - there are a lot of them!


However, on a positive note, the AtoM user forum is always a good place to ask stupid questions and Artefactual staff are happy to dive in and offer advice on how to formulate queries. I'm also lucky to have help from more technical colleagues here in Information Services (who were able to help me get Squirrel set up and talking to the right database and can troubleshoot my queries) so what follows is very much a joint effort.

So for those AtoM users in the UK who are wrestling with their annual accessions report, here is a query that will pull out the information you need:

SELECT accession.identifier, accession.date, accession_i18n.title,
       accession_i18n.scope_and_content, accession_i18n.received_extent_units,
       accession_i18n.location_information,
       CASE WHEN CAST(event.start_date AS CHAR) LIKE '%-00-00'
            THEN LEFT(CAST(event.start_date AS CHAR), 4)
            ELSE CAST(event.start_date AS CHAR)
       END AS start_date,
       CASE WHEN CAST(event.end_date AS CHAR) LIKE '%-00-00'
            THEN LEFT(CAST(event.end_date AS CHAR), 4)
            ELSE CAST(event.end_date AS CHAR)
       END AS end_date,
       event_i18n.date
FROM accession
LEFT JOIN event ON event.object_id = accession.id
LEFT JOIN event_i18n ON event.id = event_i18n.id
JOIN accession_i18n ON accession.id = accession_i18n.id
WHERE accession.date LIKE '2016%'
ORDER BY identifier

A couple of points to make here:

  • In a previous version of the query, we included some other tables so we could also capture information about the creator of the archive. The addition of the relation, actor and actor_i18n tables made the query much more complicated and for some reason it didn't work this year. I have not attempted to troubleshoot this in any great depth for the time being as it turns out we are no longer recording creator information in our accessions records. Adding a creator record to an accessions entry creates an authority record for the creator that is automatically made public within the AtoM interface and this ends up looking a bit messy (as we rarely have time at this point in the process to work this into a full authority record that is worthy of publication). Thus as we leave this field blank in our accession record there is no benefit in trying to extract this bit of the database.
  • In an earlier version of this query there was something strange going on with the dates that were being pulled out of the event table. This seemed to be a quirk that was specific to Squirrel. A clever colleague solved this by casting the date to char format and including a case statement that will list the year when there's only a year and the full date when fuller information has been entered. This is useful because in our accession records we enter dates to different levels. 
So, once I've exported the results of this query, put them in an Excel spreadsheet and sent them to one of our archivists, all that remains for her to do is to check through the data, do a bit of tidying up, ensure the column headings match what is required by The National Archives and the spreadsheet is ready to go!
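
If you wanted to take the manual export step out of the process as well, the query can be run and written straight to a CSV file from a short script. A sketch, assuming the mysql-connector-python package, placeholder connection details and that the query above has been saved to a file:

# Sketch: run the accessions query against the AtoM database and write the
# results (with column headings) to CSV. Connection details are placeholders.
import csv
import mysql.connector  # assumes the mysql-connector-python package

with open('accessions_report.sql') as sql_file:  # the query shown above
    query = sql_file.read()

connection = mysql.connector.connect(
    host='localhost', user='report_user', password='********', database='atom')
cursor = connection.cursor()
cursor.execute(query)

with open('accessions_2016.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)

cursor.close()
connection.close()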

Wednesday, 4 January 2017

Hello 2017

Looking back


2016 was a busy year.

I can tell that from just looking at my untidy desk...I was going to include a photo at this point but that would be too embarrassing.

The highlights of 2016 for me were getting our AtoM catalogue released and available to the world in April, completing Filling the Digital Preservation Gap (and seeing the project move from the early 'thinking' phases to actual implementation) and of course having our work on this project shortlisted in the Research and Innovation category of the Digital Preservation Awards.

...but other things happened too. Blogging really is a great way of keeping track of what I've been working on and of course what people are most interested to read about.

The top 5 most viewed posts from 2016 on this blog have been as follows:

  • Research Data - what does it *really* look like? - A post describing my (not entirely successful) efforts to automatically identify the file formats of research data deposited with Research Data York using DROID. This post spawned other similar posts profiling data using DROID and the cumulative value of all of these profiles is gradually increasing over time. I'm still keen to follow this up with a comparison using the born digital data that we hold at the Borthwick Institute so hopefully that is something for 2017.
  • A is for AtoM - An A-Z (actually I only got to 'Y'!) of implementing AtoM at the Borthwick. This post covers some of the problems and issues we have had to address and decisions we have made as we have gone through the process of getting our new archival management system up and running.
  • Modelling Research Data with PCDM - A guest post by Julie Allinson on some thinking carried out as part of the implementation work for Filling the Digital Preservation Gap project. The post describes some preliminary work to define a data model for datasets using the Portland Common Data Model.
  • Why AtoM? - A look back at why we selected AtoM for our archival management system and how it meets our requirements. This post was in response to a question I was frequently asked and hopefully is useful to others who are going through a similar selection process.
  • From Old York to New York: PASIG 2016 - Quite a long summary of the highlights of the PASIG conference that I attended in New York in October 2016. There was some fantastic content at this event and my post really just scrapes the surface of this!


Looking forward


So what is on the horizon for 2017?

Here are some of the things I'm going to be working on - expect blog posts on some or all of these things as the year progresses.

AtoM

I blogged about AtoM a fair bit last year as we prepared our new catalogue for release in the wild! I expect I'll be talking less about AtoM this year as it becomes business as usual at the Borthwick, but don't expect me to be completely silent on this topic.

A group of AtoM users in the UK is sponsoring some development work within AtoM to enable EAD to be harvested via OAI-PMH. This is a very exciting new collaboration and will see us being able to expose our catalogue entries to the wider world, enabling them to be harvested by aggregators such as the Archives Hub. I'm very much looking forward to seeing this take shape.

This year I'm also keen to explore the Locations functionality of AtoM to see whether it is fit for our purposes.

Archivematica

Work with Archivematica is of course continuing. 

Post Filling the Digital Preservation Gap at York we are working on moving our proof of concept into production. We are also continuing our work with Jisc on the Research Data Shared Service. York is a pilot institution for this project so we will be improving and refining our processes and workflows for the management and preservation of research data through this collaboration.

Another priority for the year is to make progress with the preservation of the born digital data that is held by the Borthwick Institute for Archives. Over the year we will be planning a different set of Archivematica workflows specifically for the archives. I'm really excited about seeing this take shape.

We are also thrilled to be hosting the first European Archivematica Camp here in York in the Spring. This will be a great opportunity to get current and potential Archivematica users across the UK and the rest of Europe together to share experiences and find out more about the system. There will no doubt be announcements about this over the next couple of months once the details are finalised so watch this space.

Ingest processes

Last year a new ingest PC arrived on my desk. I haven't yet had much chance to play with this but the plan is to get this set up for digital ingest work.

I'm keen to get BitCurator installed and to refine our current digital ingest procedures. After some useful chats about BitCurator with colleagues in the UK and the US over 2016 I'm very much looking forward to getting stuck into this.




...but really the first challenge of 2017 is to tidy my desk!