Tag Archives: preservation

PDF for digital preservation and delivery #ndf2012

PDF for digital preservation and delivery
John Laurie, University of Auckland Library
PDF is ubiquitous on the web and many organisations in New Zealand are using it as a document storage format. It has been an open standard since 2008, and has been endorsed by key organisations around the world. It is a complex format with many different versions. This paper will look at differences between PDF/A archival formats and other PDF formats, methods for handling born-digital PDFs and PDFs created by scanning, problems with dirty OCR (optical character recognition) and text extraction for indexing, and issues around file sizes for preservation and online display. It will also look at usage of Adobe’s RDF and Dublin Core-based XMP metadata and compare PDF with METS-Alto as a format for different types of digitisation.

Doubts about PDF as a format – he has sometimes used it and then changed to TEI – but with all its faults it’s here to stay.

Issues

  • Is PDF good enough?
  • What’s the maximum file size?
  • PDF/A or plain PDF?
  • Searchable text or ClearScan?
  • OCR?
  • etc.

Various local PDF collections at the University of Auckland – past exam papers, Journal of the Polynesian Society, New Zealand Journal of History, early NZ statutes, theses, working papers, course materials.

The B-engine platform displays documents as PDF, extracts the text, and makes it available for cross-site search.

PDF is continually improving – read-aloud versions; now working with citations[1]. But it’s hard to edit.

Focusing on digitising PDFs. There’s a choice between using Adobe’s own scanning/OCR and other specialised OCR engines – you need to look at the outputs you want, and there are many variables to consider. Do you want to save the PDF as the preservation master copy, or keep the FineReader TIFFs? He has only scanned at 300–400 dpi for text and hasn’t seen advantages to going higher for his purposes. Greyscale is needed for OCR. FineReader is better than Adobe but doesn’t offer ClearScan. It is trainable – useful for fractions – and has spellchecking options.
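
(As a rough illustration of what the specialised-engine route can look like outside FineReader or Acrobat, here is a minimal sketch using the open-source Tesseract engine via pytesseract – Tesseract comes up in the comments below, but the exact workflow, file names and settings here are my assumptions, not the speaker’s.)

```python
# Minimal sketch: OCR a scanned page and wrap it in a searchable PDF with
# Tesseract via pytesseract. File names and settings are illustrative only;
# the workflow described in the talk uses FineReader/Acrobat instead.
from PIL import Image
import pytesseract

page = Image.open("scan_page_001.tif")                 # 300-400 dpi greyscale scan
text = pytesseract.image_to_string(page)               # plain "dirty OCR" text
pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")

with open("page_001_searchable.pdf", "wb") as out:
    out.write(pdf_bytes)                               # image with hidden text layer

print(text[:200])                                      # quick check of OCR quality
```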

Tables are a particular problem. OCR confuses vertical lines with text, and tables can’t be extracted from the PDF into Excel. Some OCR training could be done to recognise the two dots of a “blank field” and the vertical lines. He’s thinking of using dirty OCR and making it available as a link from the PDF page.

There’s a compromise between quality and file size. Born-digital PDFs (usually Word -> PDF) are usually very small because they use fonts. PDFs from scanning balloon out a lot because they are images. If the text is clear you can go black and white. Working with 5–10 MB TIFF files as preservation masters (FineReader creates these automatically).

PDF/A is the archival version – ISO-standardised and supposed to be self-contained, including embedded fonts. But often if you use “reduce file size” you can’t save as PDF/A, because it substitutes non-embedded fonts. Many files from big publishers aren’t PDF/A. But will the smarter computers of the future really need embedded fonts? “As we all get smarter and technology improves the acute concerns about format obsolescence may diminish” – Butch Lazorchak, The Signal.
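
PDF/A files declare their conformance level in the document-level XMP packet under the pdfaid namespace (pdfaid:part and pdfaid:conformance). A minimal sketch of reading that claim back out – note this only reports what the file says about itself, it is not a validator, and it assumes the XMP packet is stored uncompressed (which PDF/A requires):

```python
# Sketch: read the PDF/A identification (pdfaid:part / pdfaid:conformance)
# from a PDF's embedded XMP packet. It only reports what the file *claims*;
# proper conformance checking needs a real PDF/A validator. Assumes the
# document-level XMP is stored uncompressed, as PDF/A requires.
import re

def pdfa_claim(path):
    with open(path, "rb") as f:
        data = f.read()
    packet = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    if not packet:
        return None
    xmp = packet.group(0).decode("utf-8", errors="replace")
    part = re.search(r'pdfaid:part(?:="|>)(\d)', xmp)
    conf = re.search(r'pdfaid:conformance(?:="|>)([ABab])', xmp)
    if not part:
        return None
    return "PDF/A-%s%s" % (part.group(1), conf.group(1).lower() if conf else "")

print(pdfa_claim("thesis.pdf"))   # e.g. "PDF/A-1b", or None if no claim found
```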

PDF/A-1a, A-1b, A-2… Can get quite complicated!

ClearScan vs searchable image – ClearScan files are just over half the size. It substitutes a new font matched on shape rather than on the OCR’d text. Much clearer and less blurry than the searchable-image version.

Problems with text extraction using the pdftotext applet, which pre-indexes results. With particular fonts/books you get extra spaces between characters (he finds examples by searching for “t h e”). Problems with macrons won’t ruin display but will ruin search.
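
Both faults can be partially patched after extraction: spurious letter spacing can be collapsed heuristically, and macrons can be folded out of the copy used for indexing so that searches typed without them still match. A hedged sketch of that kind of clean-up – the heuristics are mine, not part of the pdftotext setup described:

```python
# Sketch: post-process text extracted with pdftotext so it indexes better.
# The space-collapsing heuristic and the macron folding are illustrative
# guesses, not the clean-up actually used on the collections above.
import re
import unicodedata

def collapse_letter_spacing(text):
    # Join runs like "t h e" back into words (3+ single-spaced letters).
    return re.sub(r"\b(?:\w ){2,}\w\b", lambda m: m.group(0).replace(" ", ""), text)

def fold_macrons(text):
    # "Maori" with a macron -> "Maori" without, for the search index only,
    # so queries typed without macrons still match; display copies keep them.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

raw = "T h e  M \u0101 o r i  t e x t"            # double space between words
cleaned = re.sub(r" {2,}", " ", collapse_letter_spacing(raw))
print(fold_macrons(cleaned))                       # -> "The Maori text"
```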

PDF XMP metadata – he has made attempts at adding Dublin Core metadata. Acrobat automatically extracts a lot of its own, and you can add elements from any metadata scheme via File > Properties > Additional metadata. Setting up a custom file info panel lets you populate a whole group of documents. The Advanced view shows it with Dublin Core elements.
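
For anyone scripting this rather than clicking through Acrobat, a minimal sketch of the same idea using the pikepdf library (my choice of tool, with illustrative file name and field values – not what was shown in the talk):

```python
# Sketch: write Dublin Core elements into a PDF's XMP packet with pikepdf.
# Library choice, file name and field values are illustrative; the talk does
# the same thing interactively through Acrobat's Additional Metadata panel.
import pikepdf

with pikepdf.open("article.pdf", allow_overwriting_input=True) as pdf:
    with pdf.open_metadata() as meta:
        meta["dc:title"] = "Example article title"
        meta["dc:creator"] = ["A. Author"]        # dc:creator takes a list
        meta["dc:language"] = ["en"]
        meta["dc:rights"] = "Copyright the author"
    pdf.save("article.pdf")
```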

METS-ALTO looks a lot like PDF – it has the image in front with the text / dirty OCR hidden behind it, which you can search on to get either the text or the image. METS (Metadata Encoding and Transmission Standard) is structural metadata linking things together; ALTO (Analyzed Layout and Text Object) stores the layout information and OCR text. It can be used to create derivatives, e.g. PDF, TEI, XML, EPUB.
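
Because ALTO keeps the OCR text word by word alongside its layout, pulling a plain-text derivative out of an ALTO file is straightforward. A minimal sketch, assuming files that use the standard ALTO v2 namespace (the file name and namespace version are assumptions):

```python
# Sketch: pull plain OCR text out of an ALTO XML file by reading the CONTENT
# attribute of each String element. The file name and the namespace version
# are assumptions; adjust to whatever schema your ALTO files declare.
import xml.etree.ElementTree as ET

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def alto_to_text(path):
    root = ET.parse(path).getroot()
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", NS):
        words = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", NS)]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(alto_to_text("page_001.alto.xml"))
```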


[1] The allusion relates to an article I came across last night, Refurbishing the Camelot of Scholarship: How to Improve the Digital Contribution of the PDF Research Article. -Deborah


Comment: With a budget of zero, they upload the PDF to Google Docs and let its settings OCR it. Little success with older material though.

Comment: Someone at the Access conference (Art Rhyno from the University of Windsor) has had good luck with the open-source Tesseract.

Comment: Experimented with Tesseract and ABBYY – problems with the latter.
A: Tried writing to ABBYY about the problems but no luck.

Comment: Option of using multiple search engines to increase chance of getting a hit. Can render marvellously different results. So training package very valuable because it’s in context of your collection.
A: Then can use trained package on new documents.

Q: How does file size impact the decision on format?
A: Often split it up to keep files to 10 MB – per chapter or per 50 pages – otherwise you risk compromising quality. Best to do this within FineReader so you can target dpi/quality, because this is just the delivery file – we keep the preservation masters.
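
As a rough sketch of the “per 50 pages” option using the pypdf library (illustrative only – splitting inside FineReader, as described above, also lets you re-target dpi and image quality, which a plain page split like this does not):

```python
# Sketch: split a long delivery PDF into ~50-page chunks with pypdf.
# Illustrative only - splitting inside FineReader (as described above) also
# lets you target dpi/quality; this just cuts the existing pages apart.
from pypdf import PdfReader, PdfWriter

CHUNK = 50
reader = PdfReader("thesis_delivery.pdf")
total = len(reader.pages)

for start in range(0, total, CHUNK):
    writer = PdfWriter()
    for i in range(start, min(start + CHUNK, total)):
        writer.add_page(reader.pages[i])
    with open(f"thesis_part_{start // CHUNK + 1:02d}.pdf", "wb") as out:
        writer.write(out)
```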

Q: When do you decide the OCR’s not good enough and it’s better to transcribe?
A: Outsourced transcription on one project to India – an excellent job, but expensive, and the text was dense, not in English, and hard to proofread. Now uses OCR only and provides warnings if the quality isn’t good.

Comment: Anyone transcribing? Crowdsourcing transcribing?
Comment: Would need automated software
Comment: Like Trove / National Library of Australia
Comment: This proves there are keen people out there
Comment: Also Project Gutenberg Distributed Proofreaders – volunteers proofread a page at a time and each page is proofread multiple times
Comment: Can add layers of rigour

Q: Is anyone collecting PDF as born-digital?
A: Yes, the Journal of the Polynesian Society comes born-digital; his job is just to split it as appropriate. Once, with the New Zealand Journal of History, an author wanted him to add a section that the journal had missed out. He did it, but marked very carefully that he’d done it!

A: Has anyone used XMP metadata?
Comment: We did for Flickr – it works but it’s not fun. Software support for it is incomplete.

A new equity emerges

citizen-created content powering the knowledge economy
Penny Carnaby
abstract (pdf)

Just when we thought we had the web 2.0 environment sussed, it’s about to get more exciting for librarians world-wide. A new equity is emerging which puts individual citizens in the driving seat for the first time.

Every day someone is deleting something on the web. We’re all part of the delete generation. Hana and Sir Tipene O’Regan talked about the loss of indigenous languages.

As librarians we need to take responsibility for preserving information.

Building blocks

  • Roll-out of broadband
  • National Digital Heritage Archive
  • Aotearoa People’s Network Kaharoa
  • Digital New Zealand
  • Data and information reuse
  • NLNZ New Generation Strategy

The new government has endorsed the digital content strategy, which talks about the life of an asset from creation to access to sharing to managing and preserving.

Information sits on two axes: from private to public, and from formal to informal.

National Digital Heritage Archive: if we’re taking citizen-created content as seriously as formally created content, how do we go about preserving it? What do we curate – porn and hate sites too?

DigitalNZ has put over 1 million NZ digital assets online in one year.

Aotearoa People’s Network Kaharoa – cornerstone of allowing citizen-created content. Allows local kete to emerge all over, through libraries and marae. Extraordinary emergence of citizen-created information collections.

Idea of creating a virtual learning environment in every school, founded on govt-supplied broadband. Ministry of Education looking at how APNK works and thinking about how that could work if it was in every New Zealand school. (Me: Whee!)

International colleagues see New Zealand as an “incubator country”.

Announcement: Will be digitising the Appendices to the Journals of the House of Representatives. (Me: Whee again! This has been much-requested and will be a very valuable asset.)

As of February this year, with digital heritage archive, “we refuse to be part of the delete generation”.

A new equity is emerging. Kiwis from all walks of life are creating solutions to harness and preserve. Each of us has contributed to New Zealand emerging as a digital democracy.

Non-English blog roundup #10

Bibliobsession has posted a set of slides on Towards Library Ecosystems (French). It begins with an introduction to web 2.0 then points out, “A collection doesn’t exist without its users and its uses.” (slide 61) It goes on to discuss the library as an ecosystem: “creating links with other ecosystems in order to benefit from network effects which guarantee it a social utility”.

Bobobiblioblog (French)

  • asks medical students if they’ve used Wikipedia – pretty much all have. Have they edited it? None – “Ah, no, once, a timid young woman whispered that she’d corrected a spelling mistake in one article.” Bobobiblioblog wonders whether “the general rule is perhaps to have a consumerist attitude towards Wikipedia – using it without participating in it”. [I don’t think it’s necessarily as bad as that – remember the general 90-9-1 theory: 90% use it, 9% contribute occasionally, 1% contribute regularly.]
  • writes about adding an institutional filter to PubMed so that users of MyNCBI can filter their results to those that their institution holds. [Alas, when I try to register for MyNCBI I get 404 file not found, so I can’t play with this myself.]

Vagabondages (French) points to “liquid bookmarks” (Japanese).

Kotkot writes about sustainable libraries (French), asking what sustainable development might mean in a library. The post includes a list of ideas like turning off screens overnight, using rechargeable batteries, reducing tape consumption on books, double-sided printing, creating a comfortable bike shelter, etc.

Bib-log (Danish) announces the Roskilde public library mobile site.

Benobis lists French genealogy resources (French).

Via Klog come the steps of digital preservation in 1 slide (French).

De tout sur rien (French) suggests getting our users to scan book covers to go into a cross-library pool, particularly if vendors put restrictions on our using theirs.

Disintegrating glue, photos, and old theses

Another team in my library is digitising one of our older theses but had a problem with a couple of pages, so they asked me to scan those pages from our deposit copy. Unfortunately we had the same problem – the glue used 50 years ago to attach the photographs to the thesis has lost any and all adhesive properties it once had.

The problem was exacerbated by the fact that the pages in question held several photos of oscillographs – and I had no idea where each one went or which way up it went.

Fortunately someone in the other team had the bright idea of matching the back of each photo to the indentation in the page. I had another look at our copy – there was no indentation, but the old glue had left a browning stain, so the back of each photo had an individual pattern (fingerprints, brush strokes, dappling, or at least differently shaped corners) which was the mirror image of the pattern on the page.

And then I used an OHT (overhead transparency) to hold the photos in place while I scanned them, since I didn’t want to use any glue before talking to our conservation people. Mission accomplished!