Tag Archives: digital humanities

Digital humanities’ use of cultural data #theta2015

How will digital humanities in the future use cultural data?
Ingrid Mason @1n9r1d

[Presentation basically takes the approach of giving an overview of digital humanities and cultural data by throwing lots of examples at us – fascinating but not conducive to notes.]

Cultural data is generated through all research – seems to be more through humanities, but many others too.
RDS is building a national collection, pulling together statistical data, manuscripts, documents, artefacts, and AV recordings from an array of unconnected repositories.

New challenge: people wanting access to collections in bulk, not just borrowing a couple of items. Need to look at developing a wholesale interface on top of our existing retail interface.

Close reading vs distant reading. Computation + arrangement + distance. Researchers interested in immersion; in moving images (eg change over time); pattern analysis; opening up the archive (eg @TroveNewsBot). Text mining/linguistic computing methods to look at World Trade Center first-responder interviews. Digital paleography – recognising the writing of medieval scripts. Linked Jazz.

A dream from when she was an undergrad: she would have loved to have been in the Matrix. Have a novel surrounding you and then turn it immediately into a concordance.
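
A toy version of that dream in Python – a keyword-in-context concordance over plain text (a minimal sketch; the sample text is mine, not from the talk):

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context lines: each occurrence of `keyword`
    with up to `width` characters of context on either side."""
    rows = []
    for m in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        rows.append(text[start:end].replace("\n", " "))
    return rows

sample = "It was the best of times, it was the worst of times."
for row in concordance(sample, "times"):
    print(row)
```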

Things digital humanities researchers need: Visualisation hours. Digitisation and OCR. Project managers. Multimedia from various institutions. High-performance computing experts.

~”Undigitised data is like dark matter” (Maltby)

What we can do:

  • Talk to researchers about materials they need
  • Learn about APIs
  • Provide training
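
On the "learn about APIs" point, a minimal sketch of what querying a collection API programmatically looks like – the endpoint, parameters, and response shape here are entirely made up for illustration, not the real Trove (or any other) API:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters -- consult the real API's
# documentation (eg Trove's) for actual URLs, keys, and responses.
BASE = "https://api.example.org/collection/search"

def build_query(terms, page=1, per_page=20):
    """Build a search URL for a (hypothetical) cultural-collection API."""
    return BASE + "?" + urlencode({"q": terms, "page": page, "n": per_page})

def titles(response_text):
    """Pull item titles out of a (hypothetical) JSON response."""
    data = json.loads(response_text)
    return [item["title"] for item in data.get("items", [])]

print(build_query("gold rush diaries"))
```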

Q: Indigenous cultural data
A: Some material is very sensitive and it's a challenge to get it to the appropriate researchers/communities, so there could be opportunities to work together.

Q: Any work on standardisation of cultural data?
A: At a high level (collection description) we can, but between fields it's harder.

HuNI; NZ humanities eResearch; flux in scientific knowledge #nzes

Humanities Networked Infrastructure (HuNI) Virtual Laboratory: Discover | Analyse | Share
Deb Verhoeven, Deakin University
Conal Tuohy and Richard Rothwell, VeRSI
Ingrid Mason, Intersect Australia

Richard Rothwell presenting. I’ve previously heard Ingrid Mason talk about HuNI at NDF2012.

Idea of a virtual laboratory as a container for data (from variety of disciplines) and a number of tools. But many existing tools are like virtual laboratories themselves, often specific to disciplines.

Have a 0.9 FTE ontologist. Also a project manager, technical coordinator, web page designer, tools coordinator and software developer.

Defined the project as a linked open data project. Humanities data goes into the HuNI triple store (using RDF), embedded in the HuNI virtual lab to create a user interface. Embellishments include providing linked open data via a SPARQL endpoint, publishing via OAI-PMH, using AAF (Shibboleth) authentication, and using a Solr search server for the virtual lab.
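
For a sense of what a triple store plus SPARQL-style querying amounts to, here's a toy sketch in pure Python – the identifiers are illustrative, not the actual HuNI vocabulary:

```python
# A toy triple store: SPARQL-style basic graph pattern matching over
# (subject, predicate, object) tuples. The names are made up.
triples = {
    ("hu:Verhoeven", "rdf:type", "foaf:Person"),
    ("hu:HuNI", "dc:creator", "hu:Verhoeven"),
    ("hu:HuNI", "rdf:type", "hu:VirtualLab"),
}

def match(pattern, store):
    """Return triples matching a single pattern; None means wildcard."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "SELECT ?s WHERE { ?s rdf:type foaf:Person }" in miniature:
people = [t[0] for t in match((None, "rdf:type", "foaf:Person"), triples)]
print(people)  # ['hu:Verhoeven']
```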

Have ideas of research use-cases (basic and advanced eg SPARQL queries) and desired features, eg custom analysis tools. The challenge is to get internal bridging relationships between datasets and global interoperability. Aggregating doesn’t solve siloisation.

“Technology-driven projects don’t make for good client outcomes.”

Q: What response from broader humanities community?
A: Did some user research, not as much as wanted. An impediment is that when building a database you tend to have more contact with the people creating collections than the people using them. Trying to build the framework/container first; the idea is that researchers will come to them and say “We want this tool” and they'll build it. Funding set aside for further development.

Q: You compared this to Galaxy, but you've built from the ground up where Galaxy is more fluid. A person with command-line skills can create tools in Galaxy, but with HuNI you'd have to do it yourself.
A: Bioinformatics folk tend to be competent with Python – but we’re not sure what competencies our researchers will have, less likely to be able to develop for themselves.

Requirements for a New Zealand Humanities eResearch Infrastructure
James Smithies, University of Canterbury
Vast amounts of cultural heritage being digitised or being born online. Humanities researchers will never be engineers but need to work through the issues.

International context:
Humanities computing's been around for decades but is still in its infancy. US, UK, even Aus have ongoing strategic conversations, which helps build roadmaps. NZ is quite far behind these (though we have used punchcards where necessary). “Digging into Data Challenge” overseas, but we're missing out because of lack of infrastructure and lack of awareness.

Fundamentals of humanities eresearch:
HuNI provides a good model. Need a shift from thinking of sources as objects to viewing them as data. Big paradigm shift. Not all will work like this. But programmatic access will become more important.

National context:
19th century ships' logs, medical records from leper colonies. Hard to read, incomplete, possibly inaccurate. Have traditional methods to deal with these, but the problems multiply when ported into digital formats. The big problem is lack of awareness of what opportunities exist, so capabilities and infrastructure are low. Decisions often outsourced to social sciences.
At the same time, DigitalNZ, National Digital Heritage Archive, Timeframes archive, AJHR, PapersPast, etc are fantastic resources that could be leveraged if we come up with a central strategy.


  • Need to develop training schemes
  • Capability building. Lots of ideas out there but people don’t know where to start. Need to look at peer review, PBRF – how to measure quality and reward it.
  • International collaboration
  • Requirements elicitation and definition
  • Funding for all of the above including experimentation

Q: Data isn’t just data, it’s situated in a context. Being technology-led and using RDF is one thing. But how do we give richness to a collection?
A: A classic example would be a researcher wanting access to an object properly marked up, able to contribute to the conversation by adding scholarly comments and engaging with other marginalia. Eg an Ancient Greek text corpus (I think describing the Perseus Digital Library). Want both a simple interface and programmatic access.

Q: Need to make explicit the value of an NZ corpus. Have some pieces but need to join up. Need to work with DigitalNZ. Once we have corpus can look at tools.
A: Yes, need to get key stakeholders around table and talk about what we need.

Capturing the flux in Scientific Knowledge
Prashant Gupta & Mark Gahegan, The University of Auckland
Everything changes – whether the physical world itself or our understanding of the world:
* new observation or data
* new understanding
* societal drivers
How can we deal with change and make our tools and systems more dynamic to deal with change?

Ontology evolution – lots of work has been done on this. Researchers have updated knowledge structures and incorporated the changes in forms of provenance or change logs. This tells us “knowledge that”: eg what the change is, when it happened, who did it, to what, etc. But we still don't capture “knowledge how” or “knowledge why”.

Life cycle of a category:
Processes, context, researchers’ knowledge are involved in birth of a category – but these tend to be lost when the category’s formed. We’re left with the category’s intension, extension, and place in the conceptual hierarchy. Lots of information not captured.

“We focus on products of science and ignore process of science”.

Proposes connecting static categories and the process of science to get a better understanding. Could act as a fourth facet to a category’s representation. Can help address interoperability problem and help track evolution of categories.

Process model:
The process of science gives birth to conceptual change, which modifies scientific artifacts; these are connected as linked science, which in turn improves the process of science.

If change not captured, network of artifacts will become inconsistent and linked science will fail.

Proposes building a computational framework that captures and analyses changes, creating a category-versioning system.
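
One way to picture such a category-versioning system – a sketch in Python where each change record carries the “how” and “why” alongside the “what” (the field names and API are my invention, not the proposed framework):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CategoryChange:
    what: str   # "knowledge that": what changed
    who: str
    when: date
    how: str    # "knowledge how": the process/method used
    why: str    # "knowledge why": the rationale

@dataclass
class Category:
    name: str
    history: list = field(default_factory=list)

    def revise(self, change: CategoryChange):
        """Record a change without discarding its context."""
        self.history.append(change)

# Example: the IAU's 2006 redefinition of "planet".
planet = Category("Planet")
planet.revise(CategoryChange(
    what="Excluded Pluto from the category's extension",
    who="IAU", when=date(2006, 8, 24),
    how="Vote on a revised definition requiring orbital dominance",
    why="New observations of trans-Neptunian objects"))
print(len(planet.history))
```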

Comment from James Smithies: would fit well in humanities context.
Comment: drawing parallel with software development changeset management.

Introducing the HathiTrust Research Center #nzes

Unlocking the Secrets of 3 Billion Pages: Introducing the HathiTrust Research Center
Keynote from J. Stephen Downie, Associate Dean for Research and a Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign.

HathiTrust is a membership organisation – mostly top-tier US unis, plus three non-US members.
“Wow” numbers:
* 10 million volumes, including 3.4 million volumes in the US public domain
* 3.7 billion pages
* 482 TB of data
* 127 miles of books

Of the 3.4 million volumes in the public domain, about a third are public domain only in the US; the rest are public domain worldwide (4% are US government documents, so public domain from the point of publication).

48% English, 9% German (probably scientific publications from pre-WWII).

Services to member unis:
* long term preservation
* full text search
* print on demand
* datasets for research

Bundles contain, for each page, a JPG, the OCR text, and XML giving the location of words on the page.
METS holds the book together – it points to each image/text/XML file. And built into the METS file is structural information, eg table of contents, chapter start, bibliography, title, etc.
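
A rough idea of what walking a METS file map looks like – this is a heavily simplified sketch with invented element names; real HathiTrust METS uses XML namespaces and a structMap linking files to pages, which I've omitted:

```python
import xml.etree.ElementTree as ET

# Toy METS-like fragment: file groups for page images and OCR text.
mets = """<mets>
  <fileSec>
    <fileGrp USE="image"><file ID="IMG1"><FLocat href="00000001.jp2"/></file></fileGrp>
    <fileGrp USE="ocr"><file ID="TXT1"><FLocat href="00000001.txt"/></file></fileGrp>
  </fileSec>
</mets>"""

root = ET.fromstring(mets)
# Map each group's USE attribute to the file locations it points at.
files = {grp.get("USE"): [f.find("FLocat").get("href")
                          for f in grp.findall("file")]
         for grp in root.iter("fileGrp")}
print(files)  # {'image': ['00000001.jp2'], 'ocr': ['00000001.txt']}
```
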
Public domain data available through web interfaces, APIs, data feeds

“Public-domain” datasets still require a signed researcher statement. Stuff digitised by Google has copyright asserted over it by Google. And anything from 1872–1923 is still considered potentially under copyright outside of the US. Working on manual rights determination – they have a whole taxonomy for what the status is and how they assessed it.

Non-consumptive research paradigm – no one action by one user, or set of actions by a group of users, can be used to reconstruct works and republish them. So users submit requests, Hathi does the compute and sends the results back to them. [This reminds me of old Dialog sessions where you had to pay per search, so researchers would have to get the librarian to perform the search to find bibliographic data. Kind of clunky but better than nothing I guess…]

Meandre lets researchers set up the processing flow they want to get their results. Includes all the common text processing tasks, eg Dunning log-likelihood (which can be further improved by removing proper nouns). Doesn't replace a close reading – it answers new questions. A correlation n-gram viewer can track the use of words across time.
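
Dunning log-likelihood compares a word's frequency in two corpora against what chance would predict. A sketch of the standard G2 formula (the example counts are invented, not from the talk):

```python
from math import log

def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning log-likelihood (G2) for one word's frequency in two corpora.
    Higher scores mean the frequencies differ more than chance predicts."""
    # Expected counts if the word were equally likely in both corpora.
    e_a = total_a * (count_a + count_b) / (total_a + total_b)
    e_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, e_a), (count_b, e_b)):
        if observed > 0:  # 0 * log(0) is taken as 0
            g2 += observed * log(observed / expected)
    return 2 * g2

# A word appearing 150 times per 100,000 tokens in one corpus vs 30 in another:
print(round(dunning_g2(150, 100_000, 30, 100_000), 1))
```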

OCR noise is a major limitation.

Downie wants to engage in more collaborative projects, more international partnerships, and move beyond text and beyond humanities. Just been awarded a grant for “Work-set Creation for Scholarly Analysis: Prototyping Project”. It's non-trivial to find a 10,000-work subset of 10 million works to do research on – the project aims to solve this problem. Also going to be doing some user-needs assessments, and in 2014 will be awarding grants for four sub-projects to create tools. Eg it would be great if there was a tool to find which pages have music on them.

Ongoing challenges:
How do we unlock the potential of this data?
* Need to improve quality of data; improve metadata. Even just to know what’s in what language!
* Need to reconcile various data structure schemes
* May need to accrete metadata (there’s no perfect metadata scheme)
* Overcoming copyright barriers
* Moving beyond text
* Building community

Humanities Informatics #ndf2012

Humanities Informatics:
Ingrid Mason (@1n9r1d), Intersect Australia
The Humanities Networked Infrastructure project is a virtual laboratory project funded by the NeCTAR programme in Australia. The project has several significant scholarly humanities datasets to bring together and map across. The immediate goal is to enable researchers to explore and interpret the commonalities.
The initial design challenge is to select description schema and use linked data and controlled vocabularies to align the data. This approach tests the assumption that configuring and building on the knowledge of available schema, methods and datasets will provide a standards-based and curated foundation layer to support research requirements.
This ‘prefabricated’ approach has been the basis by which the digital humanities and GLAM sectors have provided access to data. Observing how researchers shape and use this prefabricated environment will inform the value of that approach and the architectural modelling, and inform next steps to building infrastructure where the ‘researcher query’ is the lens that defines the schema.

Anonymous quote: “Gah semantic web is frying my brain!”

Intersect Australia is an eResearch org. Working on a virtual lab project in the humanities informatics field. Talking and dreaming and living data… Have become conversant in RDF; even taking the step to ontology development. Talking about linked data, data as graph. Interested in the overlap between humanities informatics and GLAM digital cultural heritage.

Wants to provoke thinking – data sharing across GLAMs and scholarly datasets? Who has authority, truth, encoding consensus or contradiction? Doing something with HuNI data? Etc.

Digital humanities sits within eResearch (which has been dominated by science). HuNI (@hunivl – Humanities Networked Infrastructure) is a distributed project that wants to explore commonalities/divergences in data. Bringing together datasets means dealing with multiple standards and needing to build an ontology. User-centred design.

Assumptions – they’re “prefabricating” but talking to researchers all the way through. Building foundation layer. Fascinated by idea of a researcher query. Work to help researchers ask the questions they need.

Project to integrate 28 cultural datasets (using linked open data) into a virtual laboratory. Want to break down barriers between disciplines. Want it to be available to all but licensing comes into it.

Data – AusStage, bonza, CAARP, AustLit, CircusOz, Australian Dictionary of Biography, PARADISEC, Australian Women's Register…
Tools – eg Omeka, Neatline, LORE

Informatics, from Wikipedia: “studying how to design a system that delivers the right information, to the right person in the right place and time, in the right way”

(Skimming through – slides will be online.)

“Data” an ineffective word to describe all the kinds of data there are.

Linked Data on Wikipedia.

RDF – resource description framework. Statements known as “triples” – subject, predicate, object. In different formats eg RDF/XML, RDF/JSON
SPARQL – query language

“Ingrid is a Kiwi. Conal is a Kiwi. But what is a Kiwi?”

Ontologies have concepts, relations, instances, and axioms. A set of entities within a domain are related by a concept.
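
The Kiwi question is really an ontology question: once “Kiwi” is a concept with a place in a hierarchy, a machine can answer it. A toy sketch in Python – the classes and subclass links are my own illustration, not from the talk:

```python
# Toy ontology: concepts, relations, instances, and one axiom
# (subclass transitivity) -- enough to answer "what is a Kiwi?"
subclass_of = {"Kiwi": "Person", "Person": "Agent"}  # concept hierarchy
instances = {"Ingrid": "Kiwi", "Conal": "Kiwi"}      # instance assertions

def types_of(individual):
    """All classes an individual belongs to, following subclass links."""
    t = instances[individual]
    result = [t]
    while t in subclass_of:
        t = subclass_of[t]
        result.append(t)
    return result

print(types_of("Ingrid"))  # ['Kiwi', 'Person', 'Agent']
```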

Connections between people within Australian Biographies, and between a group of datasets.


  • Need to help researchers go from above the forest through the canopy into the trees and branches.
  • Unlock data, value in controlled vocabularies.