Monthly Archives: June 2017

Beyond Repositories: Problem-solving-oriented #or2017

Beyond Repositories: From Resource-oriented towards Problem-solving-oriented by Dr Xiaolin Zhang, National Science Library, Chinese Academy of Sciences

With the ubiquitous deployment of digital ecosystems, developing repositories to meet next-generation needs and functions becomes imperative and an increasingly active effort. However, a paradigmatic shift may be needed for repositories to go outside the resource-orientation box. As the JISC report “The future of data-driven decision making” puts it, “[I]t is not sufficient simply to focus on exposing, collecting, storing, and sharing data in the raw. It is what you do with it (and when) that counts”.

The presentation first discusses the emerging digital ecosystems in research, learning, publishing, smart campuses/cities, knowledge analytics, etc., where traditional content and repositories are just a small part of the story.

Then an exploration is made about making repositories embedded into, integrated with, and proactively contributing to user problem-solving workflows in digital ecosystems such as scholar hub, research informatics, open science, learning analytics, research management, and other situations.

Further effort is made to understand (admittedly preliminarily) strategies for repositories to be transformed into part of problem-solving-oriented services, including, but not limited to, 1) enhancing interoperability to be re-usable by third-party “users”, 2) developing repositories into smart content with application contexts, and 3) developing smart contextualization capabilities to better serve multiple, varied, and dynamically integrating problem-solving processes.

[I’ve previously blogged a keynote by Dr Zhang at THETA 2015.] He has a new perspective since moving jobs two years ago.

CAS has 104 research institutes and 55,000 researchers. Various repositories, eg the NSFC Repository for Basic Research and the CALIS IR portal of 40+ universities. A research data sharing platform, and the Chinese Academy of Sciences distributed research data management and integrative service platform.

  1. Changes in the digital ecosystems
    • Steady progress of repositories, but numbers don’t tell the story – better to look at how users use them. Most are still collection-based, and local applications are the main service. What if we move away from the repository-based approach? Imagine new scenarios out in society: what do they need?
    • All media and content can be data (including processes, relations, IoT devices, tweets). Can be smart – and semantic publishing will be the new normal. Knowledge as a Service.
    • Transformation from subscription to open access. Born digital = born linkable.
    • eScience is the knowledge system – opening up data-intensive scientific discovery. Not just about access, it’s a different way of doing science
    • Open Science again more than open access, but open evaluation, open process, open collaboration. (Displays open science taxonomy). Even social science now incorporating computational methods.
    • eLearning is creating a new knowledge ecosystem. Things change quickly: in the classroom everyone (200 students) uploads content and the system goes down, even though it was planned for only 2 years ago. Flipped classrooms where students do work before the class in digitally collaborative environments; multimedia-rich laboratories so students can interact with each other. Requires an intelligent campus and services. An eStudent Center brings a student’s whole learning life together to be analysed; the university center can look at trends etc.
    • Knowledge analytics – converging data science, computer science, information science. Open source tools for data visualisation and analysis. Data analytics can become new infrastructure
    • Moving into the Machine Learning Age? 7.5 million university graduates every year in China
  2. Explorations to re-orient repositories
    • Towards working labs: Elsevier Knowledge Platform; WDCM
    • From resources to problem solving, eg digital healthcare needing knowledge from literature but also from wearables and other devices; eg intelligent cities with data, linking, analysis, to answer questions.
  3. Challenges in re-developing repositories
    • Re-purpose and reposition repositories? but outside the scholarly communication environment? Eg using big data in smart cities – scholarly knowledge plays a huge role here. Eg learning analytics where we combine data on students (grades, interactions on Moodle).
    • Cycle: environmental scanning -> idea/design/testing -> R&D -> data management -> Data analysis -> dissemination -> preservation/reuse -> evaluation -> environmental scanning
    • Interoperability cf W3C recommendations
    • Identify/select/develop/integrate value-added services (not everything works together, but some aren’t meant to). How to turn content into computable data? How to develop rich and smart media resources? Eg how to turn PowerPoints into actionable data?
    • Working on automatic translation, domain interaction dynamics, scientometrics tools, social network metrics, automatic thesaurus/k-graph development. Hard for students to select a topic when there’s open-source tools already out there about it! Calculations and results become objects to be reused.
    • Representing knowledge with knowledge graphs, which can enable intelligent applications. Text analytics, RDF data management. Eg SpringerNature SciGraph – turning all papers into a semantic network of knowledge.
    • Too many vocabularies! Some used by many people, some very common and general – but also very specific ones, eg a neuron ontology; the Internet of Things is developing its own. Ontology mapping tools? Cross-language linking of knowledge graphs and smart data, eg Chinese/English Wikipedia pages.
    • What about when live machines join the integration and we put our data into real-life processes? Geospatial/temporal/event/methods/workflows-identifiable.

Are these real life scenarios really relevant to our repositories? If not we’ve got a problem! Is what we’re doing now getting us into these scenarios? Are we talking/collaborating with people in these scenarios? They’re not necessarily going to approach us! Time for us to think and act before it’s too late.

Ideas Challenge presentations #or2017

Challenge to solve an existing problem with emerging technologies.

Data Pickle

Researchers wanted to upload data but didn’t know how to wrap it up. So, cf ThisToThat for gluing thing A to thing B – let’s make this for data.

Package shapefiles for preservation: click “PICKLE!” and it recommends a) the best practice and b) the minimum requirements.

Crowdsourced but curated information for various options.

Technology handshake to achieve Australasia PMC

Right now we have EuropePMC and CanadaPMC (child nodes of US PubMed Central, which has 27 million references). So create AustralasiaPMC so PMC can link to OA articles. PMC can be populated with clumsy markup, so we need clever handshake technology to make full-text available in child and parent nodes simultaneously.


A museum is an interface for scientific information to the general public. But it takes too long for simplified explanations of science (from eg journals) to reach the general public, and journalists don’t always guard scientific integrity.

Want to do a better job of spreading info through social media. Natural language processing to create automated simplified summaries from technical abstracts; push notification to proposal pages so they can create or add to articles; Google translate for other languages.

Put it all together and you get communication immediately after acceptance, being picked up correctly by major news outlets.

(In Q&A: hard to contextualise. Audience notes researchers want to say ‘further research needed’ while lay people want to know what the answer is.)


The technology we’ll use in future repositories has already been written – GitHub is full of work in progress – some people know about it but not all of us. Pull code automatically from everywhere, put it together, throw data in, see if it works.

Plan A – artificial intelligence – most advanced AI right now is self-driving car, so jump in front of one with the repository and the car can evaluate it and then run you over.

Plan B – use humans

(In Q&A: Kim Shepherd suggests looking at the number of forks on GitHub projects – what percentage might be active, what percentage should we have merged in.)

Global Connections

Deep learning for repository deposit – use existing repository PDFs and metadata to train an AI to a) create structured metadata for unstructured content (ie articles), b) find relevant articles, and c) add structured metadata.

Slice ‘n’ Dice: API-X + XProc-Z

XProc-Z is a simple web server framework: HTTP request -> pipeline of steps -> HTTP response (especially useful for proxies)

API-X for plumbing together microservices.

A GET request for info on a resource – API-X intercepts/proxies it, tweaks it, makes the request to the server, retrieves the result, wraps it in a header, tweaks it, and returns it to the user.

You don’t need to develop code, just write a text file in the XProc language, so you can test out what it looks like and you don’t need to wait for repository support. Signposting; generating IIIF manifests; adding OAuth authentication; adding CSS.
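The intercept/tweak/return flow above can be sketched in a few lines. This is a toy illustration only (hypothetical function names; API-X itself routes real HTTP traffic between Fedora and microservices), showing the Signposting use case: add a Link header to a response without touching the repository.

```python
# Minimal sketch of the intercept/tweak/forward pattern described above.
# (Hypothetical names; not the actual API-X or XProc-Z code.)

def add_signposting(fetch, url, author_uri):
    """Proxy a GET: fetch the upstream resource, then tweak the
    response by adding a Signposting 'Link' header before returning."""
    status, headers, body = fetch(url)          # request to the real server
    headers = dict(headers)                     # copy so we can tweak safely
    headers["Link"] = f'<{author_uri}>; rel="author"'  # the added header
    return status, headers, body                # wrapped result back to the user

# Stub standing in for the repository's HTTP endpoint:
def fake_fetch(url):
    return 200, {"Content-Type": "text/html"}, b"<html>item page</html>"

status, headers, body = add_signposting(
    fake_fetch, "http://repo/item/1", "https://orcid.org/0000-0002-1825-0097")
```

The point of the pattern is that the upstream response is untouched except for the one tweak, so the repository needs no changes at all.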

Brisbane Declaration ON the Elimination Of Keywords (B-DONEOK)

Keywords can’t express the complexity of language the way full-text can. We spend time assigning them anyway. So let’s stop. Instead just use sophisticated full-text search and indexing. Sign on to the declaration at

(In Q&A audience asks if there’s evidence keywords aren’t useful; team asks in return if there’s evidence keywords are useful.)
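The alternative the declaration proposes can be illustrated with a toy inverted index: every word of the full text becomes findable, with no manual keyword-assignment step. (A sketch only; real repositories would use Solr or Elasticsearch with stemming and relevance ranking.)

```python
# Toy full-text inverted index: maps each word to the documents containing it.
from collections import defaultdict
import re

def build_index(docs):
    """Map every lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

docs = {
    "a": "Keywords cannot express the complexity of language",
    "b": "Full-text search indexes every word of the language",
}
index = build_index(docs)
# index["language"] finds both documents - nobody had to assign a keyword
```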


Institutional Publications Repositories and beyond #or2017


Curating, But Still Not Mediating by Jim Ottaviani, Amy Neeser

aka “don’t let anyone get away with 6accdæ13eff7i3l9n4o4qrr4s8t12ux” (Isaac Newton establishing priority on calculus in code)

Chinese proverb: “The best time to plant a tree is 20 years ago; the second best time is now.”

Curation starts immediately: don’t wait or people will forget. You have to open every data file and check it’s usable. They assume if it’s intended for humans it’s a “document” but if it’s intended for machines it’s “data”.

Acknowledge/thank the deposit (signing your name so they know you’re human not bot). Then you can ask for a README.txt or offer to help write it.

Home and Away: Exploring the use of metrics in Australia and the UK, with a focus on impact by Jo Lambert, Robin Burgess

Sydney measures metrics through: Atmire tools embedded in the repository; PlumX integration for altmetrics; ERA requirements; CAUL stats; exploring UK methods. Employing FAIR principles. Researchers provide context for impact.

JISC OA services support the article lifecycle: submission (SHERPA/RoMEO etc), acceptance (Monitor UK, Jisc Collections, OpenDOAR), publication, etc. Stats are collected via aggregation then made available in COUNTER format; raw download data comes from UK IRs using DSpace, Fedora, PURE, EPrints…. Collaboration is important – working with OpenAIRE, and the concept of creating other IRUS instances, eg IRUS-ANZ?

Set up Australian Repository Working Group. Looking at standards and collaboration. “We dream the same dream, we want the same thing” – Belinda Carlisle

Uniform metadata for Finnish repositories by Jyrki Ilva, Esa-Pekka Keskitalo, Päivi Rosenström, Tanja Vienonen, Samu Viita

Open Scientific Publishing project (Tajua). 60 organisations in Finland have an IR, mostly DSpace – some are shared, so there are 17 IRs in total.

Challenges: heterogeneous metadata practices, ad hoc solutions, no general metadata guidelines so repository managers have to fend for themselves.

80 experts got together and formed a smaller working group including National Library experts. A goal-oriented approach to develop a “good-enough” metadata format, with semantics prioritised over correct Dublin Core. Compiled the most-used metadata fields and suggested fields closer to standard DC.

Spreadsheets were collected into Google Drive, meetings held online and in person – then done! The final version was published on the National Library’s public wiki. 62 core DC fields, and 11 extras if needed. 6 fields labelled important: title, author, date, persistent ID, rights (pref Creative Commons). Guidelines at:

Isomorphic Pressures on Institutional Repositories in Japan by Jennifer Beamer

Comparing US and Japanese repositories, interested in how institutions interact with repositories. In Japan repositories have been exploding from 2015–17 and she wants to know why. There’s more of a national push in Japan (whereas in the US it’s more grassroots). Previous work hasn’t looked at research in Japan, especially at a big-picture scale.

Isomorphism – regulatory/coercive pressures; cognitive/mimetic pressures; normative pressures.

Collecting data from OpenDOAR and ROARMAP – content analysis of themes, mandates, core beliefs. Then interviews with SPARC in Tokyo, librarians, faculties. The National Institute of Informatics has a shared cloud server with IR architecture, so limited resources are not a barrier. OA policies have started very recently, but librarians play a major role in getting deposits – even though they’re only in that role briefly – and assist directly. You don’t have to have a PhD to work in faculty, so tenure and promotion are completely different – publishing isn’t connected to tenure.

The role of the repository in increasing the reach and influence of research by Belinda Tiffen, Kate Byrne

(Acknowledging the work of Catherine Williams.) The repository enables reporting and assessment but is also a shopfront to sell research to the world. These roles don’t always sit well together – it’s hard to explain to researchers why we want two versions of their papers.

What role does the repository play in sharing research? Data from last year: 2,609 UTS publications from Scopus; 33% are also in the repository. Looked at Altmetrics for engagement. 1,000 (of the 2,609) have an Altmetric score. 47% of those in both Scopus and the repository have an Altmetric score. 63% of those also in other repositories have an Altmetric score. But only 34% of outputs only in Scopus have an Altmetric score.

What will UTS do with this data? They’ve had an OA policy since 2014, which has increased IR content – but still only 35%. In 2015 rolled out a new user interface. Active training to get authors to deposit. Want to find out which interventions are having an impact.

(In Q&A: the redesigned theme was done by a part-time graphic designer in the library, and an in-house DSpace developer.)

Scholarly Identity and Author Rights: guiding scholars as they make choices with their scholarly identities in a messy world by Jen Green

A project to manage scholarly/research identity. The schol comm team, with a wider working group, works with the research community to focus outreach efforts. Workshop attendees were mostly faculty and postgrads – but this changed when they started talking about online identity.

Workshops on ORCID – in the absence of a repository this seemed a good place to start. Short pop-up sessions with ice-cream worked well: they chatted, and attendees created an ORCID.

They thought researchers needed help creating ORCIDs. They learned they needed that, plus managing professional identity online and helping their students too. Scholars have limited time and want to spend it on their own goals, which may not match the institution’s.

“Your Research Identity” – covered Twitter, Facebook, etc – so they’d know their online identity exists whether they manage it or not, and here are tools to manage it. Started with Google search on their names, discussed results. When results come up with other people with same name bring up ORCID. Suggested creating one place everything else can link back to (eg her own website).

Outcomes: after this workshop, workshops began to fill up. Once they accidentally sent the invite to the whole campus and 30 seats filled in 10 minutes. The audience was mostly support staff – an audience they didn’t know existed.

(In Q&A: many faculty had never googled themselves, or didn’t know that results differ with different IP addresses.)

The University of the Philippines Baguio Faculty Research Database: starting a university repository by Cristina Borja Villanueva, Jay Mendoza Mapalo

The Cordillera region is home to the country’s second largest concentration of indigenous people, with 7 major ethnolinguistic groups. At the University of the Philippines Baguio research is a priority, and the library needs to collect and make outputs available to the wider community.

The Faculty Research Database was started in June 2012 and launched in 2013, with 500+ entries, to document and disseminate outputs, increase citations, and advance knowledge. Uses Joomla. Search by author (dropdown menu). The results page shows the number of visits for each item – stats are available to show the most viewed. An item page may have full-text or may say “available on request from author”.

It has accomplished its availability objectives. They hope to continue improving the repository.

Crosswalks, mapping tables, and normalisation rules: when we don’t even share the same vocabulary for authority control by Deborah Fitchett

That’s me! So I didn’t summarise; instead see my full slides and notes.

Integrating DSpace #or2017


Harvesting a Rich Crop: Research Publications and Cultural Collections in DSpace by Andrew Veal, Jane Miller

Currently DSpace v3.2, Repository Tools 1.7.1; upgrading to DSpace v5.6, RT 1.7.4

Wanted an independent identity for each major collection area, especially research publications and cultural collections, and to avoid weirdly mixed search results – so decided on a multi-tenancy approach: four repositories on four domains. This meant customisations appropriate to specific collections could be made.

  • research publications (via Elements and self-deposit for theses)
  • cultural collections (digitised; populated by OAI from archives collection and by bulk ingests via csv)
    • 77,000 records: PDFs, images, architectural drawings, complete books, audio, video – which require specific display options. Collections are based on ownership/subject. Files are stored in an external archive, with metadata stored in DSpace linking back to the file; thumbnails are generated from the file.
    • AusStage pilot project – relational index (contributors, productions) linked with digital assets (reviews, photos, video). So eg an event record has a “digital assets” link which brings back a search based on an id shared by related records.
    • Created custom “melbourne.sorting.key” field to enable different sort orders eg for maps where date of accession is irrelevant.
  • coursework resources (eg past exams; architectural drawings for a specific course) – no sitemap or OAI feed
  • admin collections (for ERA)

Couldn’t have done it without their service provider (Atmire). Have done lots of business analysis to say what they want, for Atmire to set up. The downside of success is that stakeholders now think it’s easy to fix anything!

Future plans:

  • develop gallery/lightbox interface
  • upgrade to 5.6; improve Google Scholar exposure
  • OAI harvesting of additional cultural collections
  • look at thesis preservation via Archivematica

DSpace in the centre by Peter Matthew Sutton-Long

Acknowledges Dr Agustina Martinez-Garcia, who did much of the integration work.

[Follows up a bit on Arthur Smith’s presentation earlier so I won’t repeat too much background from there.] Before integration, they had separate systems for OA publication and research dataset submissions, e-thesis submissions, the Apollo repository, and a CRIS system for the REF. This meant a lot of copy-pasting for admins from the manual submission form into the repository submission form. And researchers had to enter data in the CRIS (Elements) as well as submitting to the repository! Also hard to report on eg author collaborations.

Approved in June 2016 to integrate things to meet OA requirements, monitor compliance, help researchers share data, allow electronic deposit of theses, and integrate systems with community-driven standards for the dissemination of research activities data.

Items deposited in Elements go to the repository via the Repository Tools connector (though not all files are passed through). An e-theses system feeds into the repository too. Zendesk is also integrated – any deposit creates a Zendesk ticket, which can be used for communication with researchers.

Researchers can work with a single system. They can add grants and links to publications, link to their ORCID profiles (though they don’t seem to want to), obtain DOIs for every dataset and publication (so some people submit old data just to get this DOI; or submit data early, or submit a placeholder to get a DOI they can cite in their article).

Fewer systems for team to access and manage, enhanced internal workflows.

In future want to integrate VIVO.

DSpace for Cultural Heritage: adding support for images visualization, audio/video streaming and enhancing the data model by Andrea Bollini, Claudio Cortese, Giuseppe Digilio, Riccardo Fazio, Emilia Groppo, Susanna Mornati, Luigi Andrea Pascarelli, Matteo Perelli

DSpace-GLAM, built by 4Science as an extension to DSpace, started from discussions around challenges faced by digital humanities. They have to deal with different typologies, formats, structures, scales – and that’s only the first level of complexity. In addition, most data are created/collected by people (not instruments), so they are affected by personality, place, and time, and may be fragmentary or biased. They have to be analysed with contextual information.

How to do this in a digital library management system? Need tools for:

  • modelling, visualising, analysing – quantitatively and qualitatively, and collaboratively
  • highlighting relationships between data
  • explaining interpretations
  • entering the workflow/network scholars are working in

DSpace-GLAM built on top of DSpace and DSpace-CRIS.

  • Flexible/extensible data model – persons, families, events, places, concepts. When you create a “creator-of” relationship, it automatically creates the inverse “created-by” relationship. Can be extended to work with special metadata standards. By setting these up you can see relationships between people, events, etc.
  • with various add-ons
    • IIIF-compliant image viewer add-on with presentation API, image API, search API, and authentication API coming soon. Gives a “See online” option (instead of just downloading) which shows the image, or PDF, or… in an integrated Universal Viewer player: smooth interaction with the image, alongside metadata about the object, and linking with the OCR/transcription (including right-to-left writing systems). Sharing and reusing with proper attribution.
    • Audio/video streaming with an open source stack: transcoding, adaptive streaming, the MPEG-DASH standard. The DASH protocol lets you share video, with server-side access to provide zoom, while making sure the content stays in the digital library – so you keep complete access to stats and ensure people see ownership.
    • Visualising and analysing datasets by integrating with CKAN to use grids, graphs, maps.
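The automatic inverse-relationship behaviour of the data model can be sketched as below. This is a hypothetical illustration (invented class and predicate table), not the actual DSpace-CRIS implementation, but it shows the idea: record “creator-of” once and “created-by” exists for free.

```python
# Sketch of automatic inverse relationships in an entity graph,
# as described for the DSpace-GLAM data model. (Hypothetical code.)

INVERSES = {"creator-of": "created-by", "created-by": "creator-of"}

class Graph:
    def __init__(self):
        self.relations = set()  # (subject, predicate, object) triples

    def add(self, subject, predicate, obj):
        """Adding one relationship also records its inverse automatically."""
        self.relations.add((subject, predicate, obj))
        inverse = INVERSES.get(predicate)
        if inverse:
            self.relations.add((obj, inverse, subject))

g = Graph()
g.add("Jane Miller", "creator-of", "Map of Melbourne")
# ("Map of Melbourne", "created-by", "Jane Miller") now exists as well
```

Extending the data model to a new metadata standard then amounts to extending the predicate table, which is why relationships between people, events, places etc can be browsed from either end.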

Evaluation and assessment #or2017


Cambridge’s journey towards Open Access: what we’ve learnt by Arthur Smith [slides]


  • Complicated – lots of funding and govt policies eg HEFCE, Research Councils UK, Wellcome, European Research Council, Cancer Research UK
  • HEFCE runs the REF with a green OA policy (12-month embargo allowed in STEM, 24-month in other fields). Accepted manuscripts must be deposited within 3 months of acceptance. They don’t know what will be used in the REF, so they have to make them all OA. Goal of 10,000 mss/year.
  • Research Councils UK: green OA (embargo 6month STEM / 12 month others) or gold OA – CC BY immediate availability
  • Engineering and Physical Sciences Research Council – all articles must have a data availability statement. Data has to be stored for 10 years, or 10 years from the last request for access.
  • Wellcome Trust – gold OA preferred with immediate availability under CC BY and deposited in PMC; or green OA with up to 6month embargo in Europe PMC. Sanctions for non-compliance

Hugely complicated flowchart for decision-making; and another one for depositing.

The uni’s position: committed to disseminating its research but not fond of APCs (especially hybrid), so prefers green.


Promoted message: “accepted for publication? upload all manuscripts to be eligible for REF” and they take care of the rest. An explosion of manuscripts uploaded when the HEFCE policy was announced; similarly when data requirements started being enforced.

Have spent a lot of money (£7 million in APCs just centrally – 75% for hybrid) – average APC, and lots of paperwork. No reduction in subscription spend. 1,000 submissions a month. A spike whenever uni-wide emails go out, but it’s hard to cope with all the mss deposited in response! Have had to add lots of staff to deal with these and with datasets, coordinate outreach, and provide support. Lots with research backgrounds, and/or with library backgrounds.


Early on, set up a simple upload website with a form – but it wasn’t connected to anything, so lots of cutting and pasting. Had lots of sites but nothing connected.

Upgraded DSpace to 5.6, rethemed it nicely, and rebranded as “Apollo”. Did lots of work to integrate systems: items (publications and datasets) are deposited through Elements (includes OA Monitor), which is linked to the repository, which mints DOIs automatically using DataCite. Zendesk links to DSpace; ORCID integration.


Since 2015 have added 10,000 articles, 1,700 images, 1,600 theses, 800 datasets, 500 working papers, 1,100 ORCIDs. 52% of Cambridge’s output is in an OA source of some kind (may include embargoed items): 26% in Apollo; 13% in other repositories; 26% in EPMC; 13% in DOAJ; 2% on publisher websites.

Papers published in 2015 show a normal falling citation curve. Things that are OA get more citations – even counting less than a year – and are less likely to never be cited. Possibly authors are self-selecting and depositing their best manuscripts. So they do need new strategies to capture 100% of outputs.

“Request a copy” button – 8 requests per day for embargoed content (3000 requests total); >20% of requests occur before publication (due to ‘deposit on acceptance’ policy).


Soon upgrading to DSpace 6 or 7, and Elements to Repository Tools 2

HEFCE and the Research Councils UK are soon forming UK Research & Innovation, and they’re hoping for better alignment of policies.

Researchers equate OA with gold [and the presenter equates gold with APCs], so we need cultural change from researchers and policy changes from funders: stop paying for hybrid! Development of the Springer Compact has been great, but otherwise hybrid is pretty problematic.

Open Access policy 3 years in, has researcher behaviour changed? by Avonne Newton, Kate Sergeant

The Uni of South Australia’s OA policy launched in 2014. Relevant OA funder policies come from the NHMRC and ARC. Two key points:

  • post-print has to be deposited within 1 month of acceptance
  • embargo only up to 12 months post-publication

Appointed a research connections librarian: workshops and training on OA compliance, ORCIDs etc. Developed an OA research guide and a publishing research guide – online, plus paper handouts on submitting, and on postprints and OA. The uni’s copyright officer has also been a helpful advocate, with content on their website.

System and process improvements

  • Developed a system where mss are deposited to the library then exposed in the repository, via OAI-PMH, and into reporting systems. Did some prepopulation on submission using DOI lookup against CrossRef. Also harvest UniSA-authored outputs automatically. Migrated the repository to the library management system – some weaknesses, but extensive APIs for direct deposit from automated systems, and good searching, workflow management and reporting.
  • Developers have used the APIs to generate handles and DOIs and to automate generation of emails requesting accepted manuscripts.
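The DOI prepopulation step above can be sketched as follows. The live service is a GET to the CrossRef REST API (`https://api.crossref.org/works/<doi>`); here a trimmed, hard-coded response stands in for the network call, and the field names follow CrossRef’s JSON schema. The helper name `prepopulate` is invented for illustration.

```python
# Sketch of prepopulating a deposit form from a CrossRef DOI lookup.
# A real implementation would fetch https://api.crossref.org/works/<doi>
# and read the "message" object of the JSON response.

def prepopulate(message):
    """Pull title, authors and journal out of a CrossRef 'message' object."""
    return {
        "title": message["title"][0],
        "authors": [f'{a["given"]} {a["family"]}' for a in message["author"]],
        "journal": message["container-title"][0],
    }

sample = {  # trimmed example of the shape api.crossref.org returns
    "title": ["Open access and citations"],
    "author": [{"given": "A.", "family": "Newton"}],
    "container-title": ["Journal of Scholarly Publishing"],
}
record = prepopulate(sample)
```

One lookup like this saves the depositor typing the title, author list and journal by hand, which is presumably why it reduced friction at submission time.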

Data and findings

  • Of traditional research outputs 2008-17, journal articles are 69%.
  • Open access: books and chapters are 90%+ restricted; conference papers have 25% and articles 15% open content; reports >50%.
  • Funding: ARC-funded publications are 30% open (now or after embargo); NHMRC-funded 38% open; all publications 20%.
  • Since 2014 they’re getting more content (and higher percentages of embargoed content, per funder requirements). Hard to determine the cause among funder policy, uni policy, and a general increase in OA awareness – probably all three.

Researcher feedback and final conclusion

  • Researchers are aware of the policy, but it’s more important to some than others. They like disseminating research, gaining readership, increasing citations.

Self-Auditing as a Trusted Digital Repository – evaluating the Australian Data Archive as a trusted repository for Australian social science by Steven McEachern, Heather Leasor

The Australian Data Archive (ADA) started as the SSDA in 1981; it now holds 5,000 datasets from 1,500 studies from the academic, government and private sectors. A wide variety of subject areas, broadly social science. Serviced by the National Computational Infrastructure.

Were going to do the Data Seal of Approval but then waited for the new joint DSA/WDS standard, per the November 2016 criteria. Still waiting for the review to come back. The purpose is to give you a seal to show you hold trusted data and are a trusted repository.

Process and Outcomes

  • the change to DSA/WDS enlarged the focus and breadth dramatically. This meant there were no organisations to reference for guidance
  • More emphasis on IT, security, preservation, risk management; governance, expert guidance, business plans; data quality assurance; whether outsourcing is appropriate
    • eg for “The repository has adequate funding and sufficient numbers of qualified staff managed through a clear system of governance to effectively carry out the mission” you need to self-assess with stars and write an explanation
  • It’s not clear what the minimum requirements are. Guidance sometimes didn’t quite match the question, and it was unclear whether to answer the top-level question or the questions in the supporting guidance. Most items should be in the public domain, so they had to work out how to provide evidence from confidential items. In review the assessor wants timelines for items “in process”, though the docs didn’t explain this.

Identified 4 guidelines at level 3 and 12 guidelines at level 4. The assessment isn’t complete yet, though there’s some feedback from one of two reviewers.

Assessing the seal

103 Australian research organisations could benefit from the seal. But there are challenges:

  • does the data seal support the variation of repository types? or the fact that there isn’t a one-to-one match between organisations and repositories?
  • variety of national/funding/infrastructure/governance frameworks

DSA needs to specify what things need to be in the public domain (many unis won’t make their whole budget public!) or how to explain items out of the org’s direct control (eg funding). Risk management standards per ISO (which require paying) seem overboard for a free self-assessment.

Helped identify things that needed to be updated; info to make public; policies to polish in order to make public; changes to make in practice.

Would like to look at other levels of certification, whether this affords other levels of trust, and whether it makes a difference to their user community.

Extending DSpace #or2017


Archiving Sensitive Data by Bram Luyten, Tom Desair

Unfortunately not all repository is equally open – different risks and considerations.

Have set up metadata-based access control. DSpace authorisation works okay for dozens of groups, but doesn’t scale well to an entire country where each person is an authorisation group – as they needed. They needed an exact match between a social security number/email address on the eperson and on the item metadata.

  • Advantages: scales up massively – no identifiable limits on the number of epeople/items, no identifiable effect on performance. Can be managed outside of DSpace, so both people and items can be sourced externally.
  • Disadvantages: your metadata becomes even more sensitive. And the right to modify metadata also gives you the right to edit authorisations.
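The exact-match check at the heart of this approach can be sketched in a few lines. The field names below are invented for illustration; the real implementation is a DSpace authorisation plugin, and the identifiers live in (sensitive) item metadata.

```python
# Sketch of metadata-based access control: a user may read an item only
# when an identifier on their eperson record exactly matches one in the
# item's access metadata. (Hypothetical field names.)

def can_read(eperson, item):
    """Exact match between the eperson's email or social security number
    and the item's access-control metadata grants access; else deny."""
    allowed = set(item.get("access.allowed-ids", []))
    return eperson.get("email") in allowed or eperson.get("ssn") in allowed

item = {"access.allowed-ids": ["123-45-6789", "jane@example.org"]}
can_read({"email": "jane@example.org"}, item)   # matching email: access
can_read({"email": "eve@example.org"}, item)    # no match: denied
```

This makes the disadvantage above concrete: whoever can edit `access.allowed-ids` in the metadata is effectively editing the authorisations themselves.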

Strategies for dealing with sensitive data:

  • Consider probability and impact of each risk eg
    • Data breach
      • Impact – higher if you’re dealing with sensitive data
      • Probability – lower if it’s harder for people to access your system and security updates are frequent
    • Losing all data
      • Impact – high if dealing with data that only exists in one place
      • Probability – depends on how you define “losing” and “all”. Different scenarios have different probabilities
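The probability-vs-impact weighing above is the classic risk formula; a minimal sketch (the numeric scales are illustrative, not from the talk):

```python
def risk_score(probability, impact):
    """Classic risk = probability x impact, both on a 0-1 scale."""
    return probability * impact

# Data breach: high impact (sensitive data), lowered probability
# (hardened access, frequent security updates)
breach = risk_score(probability=0.1, impact=0.9)

# Losing all data: high impact if it only exists in one place;
# probability depends on how you define "losing" and "all"
loss = risk_score(probability=0.05, impact=1.0)

print(f"breach={breach:.3f} loss={loss:.3f}")
```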

Code on (but documentation in Dutch)

(In Q&A: DSpace used basically as back-end, users would only access it through another front-end.)

Full integration of Piwik analytics platform into DSpace by Jozef Misutka

User statistics really important if defined reasonably and interpreted correctly. Difference between lines in your access logs, item views, bitstream views (which includes community logos), excluding bots, identifying repeat visitors.

DSpace stats

  • internal based on logs
  • SOLR (time, ip, useragent, epersonid, (isBot), geo, etc) – specifics for workflow events and search queries; visits vs views confusion
  • ElasticSearch (time, ip, user agent, geo, (isBot), dso) – deprecated in v6
  • Google Analytics, plugins

DSpace’s built-in stats are good for local purposes but lack a lot of functionality for analysing user behaviour.

Had various requirements – reasonable, comparable statistics; separate stats for machine interfaces; ability to analyse user behaviour, search engine reports; export and display for 3rd parties; custom reports for specific time/user slices.

Used Piwik Open Analytics Platform (similar to Google Analytics but you own it – of course means one more web app to maintain). Integrate via DSpace REST API or via lindat piwik proxy. Users can sign up to monthly reports (but needs more work, and users need to understand the definitions more.)

In DSpace, when user visits a webpage, java code is executed on the server, which triggers a statistics update. But with Piwik, the java code is executed and returns html with images/javascript and when that’s executed it triggers statistics update – potentially including a lot more information; this excludes most bots who don’t execute js or view images.
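For non-browser clients (or a transition period), Piwik also accepts hits server-side via its HTTP Tracking API (`piwik.php`, with `idsite`, `rec` and `url` parameters); a sketch of building such a request – the site id and URLs here are made up:

```python
from urllib.parse import urlencode

def piwik_tracking_url(piwik_base, site_id, page_url, action_name):
    """Build a Piwik/Matomo HTTP Tracking API request URL.
    rec=1 and apiv=1 are required by the tracking API."""
    params = {"idsite": site_id, "rec": 1, "apiv": 1,
              "url": page_url, "action_name": action_name}
    return f"{piwik_base}/piwik.php?{urlencode(params)}"

url = piwik_tracking_url("https://stats.example.org", 3,
                         "https://repo.example.org/handle/123/456",
                         "Item view")
print(url)
```

Note the trade-off from the paragraph above: server-side tracking counts everything (including bots that don’t execute JS), while client-side JS tracking filters most of them out automatically.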

If you change how you count stats, include a transition period where you report both the old method and the new method.

Code at:

Beyond simple usage of handles – PIDs in Dspace by Ondřej Košarko

Handles are often used as a metadata field for the reader to refer back to the item, or for relations between items. But while a human can click on the handle link, a machine doesn’t know what it’s for or what’s on the other end of it.

Ideally a PID is all you need: You can ask ‘the authority’ for information of not just where to go but what’s on the other end.

Handles can do more things:

  • split up the id and location of a resource
  • content negotiation – different resource representation based on Accept header (which is passed by proxy)
  • template/parts handles – register one base handle but use it with extensions eg or – can refer to different bitstreams, or to different points in audio/video
  • get metadata based on handle without going right to the landing page – eg to get info in json, or generate citations, or show a generated landing page on error, or…
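The “ask the authority” step can be done over the Handle.Net proxy’s REST API (`GET https://hdl.handle.net/api/handles/<prefix>/<suffix>`), which returns the handle record as JSON instead of redirecting. A sketch of pulling the location out of such a record – the sample record below is fabricated:

```python
import json

def resolve_url(handle_record):
    """Return the URL value from a Handle.Net REST API response,
    i.e. split the id of the resource from its location."""
    for value in handle_record.get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None

sample = json.loads("""{
  "responseCode": 1,
  "handle": "11234/5678",
  "values": [
    {"index": 1, "type": "URL",
     "data": {"format": "string",
              "value": "https://repo.example.org/handle/11234/5678"}}
  ]
}""")
print(resolve_url(sample))
```

The same record can carry further typed values, which is what makes “get metadata based on the handle without going to the landing page” possible in principle.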

The Request a Copy Button: Hype and Reality by Steve Donaldson, Rui Guo, Kate Miller, Andrea Schweer

Trying to keep IR as close to 100% full-text as possible; DSpace 5 XMLUI, Mirage 2, now fed from CRIS Elements.

Request a copy button designed to give researchers an alternative way to access restricted (eg embargoed) content. Reader clicks button, initiates request to author; if author grants request, files are emailed to requester. Idea is that authors sharing work on one-to-one is covered by a) tradition of sharing and b) fair dealing under copyright.

Why? Embargoes auto-lift but no indication showing when that’d happen. Working on indicating auto-lift date but wanted to embrace other ways. Did need to make some tweaks to default functionality.

Out-of-the-box there are two variants: a) request the author directly (but all submissions come from Elements!) or b) email an admin who can review the request, overwrite the author email address, and so on. They used this latter ‘helpdesk’ variant, but cut out some steps (and copied in admins).

So admins can use local knowledge of who to contact, intercept spam requests, aren’t responsible for granting access (to avoid legal issues).

Other tweaks:  to show file release date; wording tweaks; some behaviour tweaks (don’t ask if they want all files if there’s only 1; don’t ask the author to make the file OA because in our case the embargo can’t be overridden).

Went live – and got lots of very dodgy requests, even for items that didn’t exist. So put in tweaks to cut down on spam: ensure requests actually come via the form, not a crawler; respond appropriately to nonsense requests (for non-existent/deleted files); avoid oversharing, eg files of withdrawn items or files not in the ORIGINAL bundle.

Went live-live. Informed of requests as made and approved, and added counts to admin statistics (requests made/approved/rejected).

Live a year

  • 9% of publications (49 items) have the button
  • 18 requests made – mostly local subject matter, majority from personal e-mail addresses
  • 8 approved, 0 denied (presumably ignored – though once author misread email and manually sent the item) – seemed like personal messages may have had a higher chance of success. One item was requested twice

So hasn’t revolutionised service but food for thought.

  • Add reminders for outstanding requests
  • Find out why authors didn’t grant access, maybe redesign email
  • Extend feature to facilitate control over sensitive items
  • Extend to restricted theses (currently completely hidden)

One good comment from academics “great service, really useful”.

(ANU also implemented – some love it, some hate it – if hate it change email address to repository admin.)

(Haven’t yet added it to metadata-only files – as it’d mean the author would have to find the file. OTOH would be a great prompt ‘Oh I should find the file’. Another uni has done this but had to disable for some types due to a flood of requests.)

Repository admin and integration #or2017


Auditing your digital repository(ies): the U-M Library migration experience by Kat Hagedorn

Have over 275 collections to handle: needed to audit prior to migrating. Also needed to know how the repositories interact and how the systems work.

Determined minimum and maximum factors for the audit, and divided into qualitative and quantitative. Quantitative easy to fill in, harder to figure out relevance.

Ran a pilot – including some problematic ones. Findings:

  • Even “number of objects” needs conversation in some collections. May need carefully thought out words – do you count the object or the record?
  • “Collection staleness” not ‘last updated’ but ‘last used’. And ‘update dates’ changed once to date of migration…
  • Some information about collections is only in email. Data often unclear even when locatable.
  • Technical issues – broken script meant collection usage appeared to be nil. Options for format types didn’t originally include XML.
  • Sometimes a stakeholder has been non-responsive in providing better images.

Mind the gap! Reflections on the state of repository data harvesting by Simeon Warner

A long time ago, when 10GB was a lot, OAI-PMH was formed. It works, scales, is easy, and is widely deployed. Harvested into aggregators, discovery layers etc. But it’s not RESTful, it’s clunky, focused on metadata, and pull-based.

So we hate it, but don’t know what to do instead!

New approach has to meet existing use cases; support content as well as metadata, scale better, follow standards, make developers happy. Need to be able to push for more frequent updates.

Wants to use ResourceSync – ANSI/NISO Z39.99-2017 – which has WebSub (was PubSubHubbub) as a companion standard.

CORE is looking at replacing OAI-PMH with ResourceSync. Work with Hyku & DPLA. Samvera (was Hydra) is building native ResourceSync support.

The community should agree on ResourceSync as a new shared approach, supporting it as the primary harvesting mechanism with OAI-PMH as secondary during the transition.
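ResourceSync documents are ordinary sitemaps extended with elements in the `http://www.openarchives.org/rs/terms/` namespace, so a harvester can parse them with a plain XML parser. A minimal sketch (the sample resource list is fabricated):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "rs": "http://www.openarchives.org/rs/terms/"}

def parse_resource_list(xml_text):
    """Extract (loc, lastmod) pairs from a ResourceSync resource
    list - a sitemap plus <rs:md> capability metadata."""
    root = ET.fromstring(xml_text)
    return [{"loc": url.findtext("sm:loc", namespaces=NS),
             "lastmod": url.findtext("sm:lastmod", namespaces=NS)}
            for url in root.findall("sm:url", NS)]

sample = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"/>
  <url>
    <loc>https://repo.example.org/record/1/metadata.xml</loc>
    <lastmod>2017-06-28T09:00:00Z</lastmod>
  </url>
</urlset>"""
print(parse_resource_list(sample))
```

This sitemap grounding is also why the Trove example below (switching to plain SiteMaps) counts as a basic level of ResourceSync.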

(In Q&A: Haven’t yet talked to discovery layers – need to get consensus in community first. Trove have switched some to SiteMaps (basic level of ResourceSync).

Audience suggestion to create user stories of migration.)

Does Curation Impact the Quality of Metadata in Data Repositories? by Amy Elizabeth Neeser, Linda Newman

Various research questions about metadata options and curation. Compared 4 institutions, looking at the most recent 20 datasets in each. Each institution’s metadata is very different, eg ‘author’ vs ‘creator’; some automatically generated (which they were discounting). Findings:

  • Sizable variation of metadata ‘options’ per institution
  • Choice to curate doesn’t necessarily guarantee more metadata. More than minimum is available regardless
  • Documentation is far less common in self-submission repositories, usually only a readme
  • Institutions who curate can ensure that each dataset gets a DOI, others currently leave the choice to the user. (May just be related to policy though)
  • Not sure whether placement of input form or curation is the bigger factor in number of keywords

(In Q&A: As a community we think curation is good but want some proof of this to justify all the hours! Also as a result of the study have made changes to own practices eg better input forms.)

Leading the Charge: Supporting Staff and Repository Development at the University of Glasgow by Susan Ashworth

What does repository success look like? Enlighten is a recognised brand that covers all the services. Multiplying repositories as users keep asking for them. Recognition of value of data – populates web pages, research evaluation, benchmarking, KPIs.

How did they get there? Early and ongoing engagement with deans of research and research admins. Surface data publicly on researcher profiles. Say yes (and panic afterwards), eg improving reporting from the repository. Adapt quickly to external forces (funding and govt requirements).

What does this mean for libraries? Cross-library services, appointing new staff. Have 6 OA staff in various teams. Developing new skills in data management, licensing, metrics, etc. Lots to do leading to lots of opportunities for staff. Staff can easily see the contribution they make to institution. Clear when service has to deliver on high expectations.

UK-wide adoption of the “UK Scholarly Communications Licence”, used/adapted from Harvard – unis retain some rights over outputs so they can be made available at the point of publication instead of after embargoes. [Seems to be adapted from CC-BY??]

(In Q&A: 70 UK institutions discussing adoptions of this license – may be declarations in Open Access Week. Some pushback from publishers but Harvard have been able to work with it for 10 years!)

Towards an understanding of Open Access impact: beyond academia by Dr Pauline Zardo, Associate Professor Nicolas Suzor

Most research into OA usefulness is focused on use for researchers. Worked in govt and wanted to use research evidence but couldn’t get access. Did PhD on how to help govt use research evidence in decision-making. Increasingly important in context of govt impact assessments eg REF, PBRF, ERA. Access (and knowing it exists!) is the biggest, structural barrier to using academic information.

“The Conversation” website aims to help researchers communicate research in a way that’s easy for lay people to understand. Free to read and share under Creative Commons. 4 sectors reading it: research and academia; teaching and education; govt and policy; health and medicine together represented 50% of survey participants. Value academic expertise, research finding, clarity of writing, no commercial agenda, editorial independence. Discuss it with friends or colleagues afterwards, many share on social media. Used in discussions and debate; may change behaviour; some use it to inform decision-making (or to support an existing decision….)

(In Q&A: To do more research with pop-up surveys on downloads from IRs to find out why people are using the content.)

Batch processes and outreach for faculty work by Colleen Elizabeth Lyon

UT at Austin – research-intensive. IR on DSpace. Big campus, competing priorities = lots of missing content. Wanted to increase access to content and improve outreach skills using existing repository staff (2 full time plus a few grad students to upload content).

Use CC licenses and publisher policies to identify which publishers allow it without having to ask faculty permissions, and create automated process:

Export content from WoS to EndNote -> csv -> Google Sheets -> check against SHERPA/RoMEO via API -> download articles -> use SAFCreator to get into the right format for batch import to DSpace (followed by generating usage reports for faculty). Results:

  • Filtering and deduping took more time than expected.
  • Faculty didn’t respond to notifications – not sure if they just ignored, didn’t think it needed a response – but at least no complaints!
  • Added almost 2500 items between other projects.
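The SHERPA/RoMEO check in the pipeline above came back as XML from the legacy RoMEO API; a sketch of reading the archiving-policy “colour” out of such a response (the response fragment below is fabricated, and the green/blue filter is an illustrative policy choice, not necessarily UT Austin’s):

```python
import xml.etree.ElementTree as ET

def romeo_colour(api_response_xml):
    """Read the RoMEO 'colour' from a SHERPA/RoMEO XML response,
    used to decide which publishers permit deposit without
    having to ask faculty for permission."""
    root = ET.fromstring(api_response_xml)
    return root.findtext(".//romeocolour")

# Fabricated fragment shaped like a legacy RoMEO API response
sample = """<romeoapi>
  <publishers>
    <publisher><name>Example Press</name>
      <romeocolour>green</romeocolour>
    </publisher>
  </publishers>
</romeoapi>"""

colour = romeo_colour(sample)
deposit_ok = colour in {"green", "blue"}   # hypothetical local rule
print(colour, deposit_ok)
```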

Under the DuraSpace Umbrella: A Framework for Open Repository Project Support by Carol Minton Morris, Valorie Hollister, Debra Hanken Kurtz, Andrew Woods, David Wilcox

DuraSpace is a not-for-profit org so mission to support open tech projects to provide long-term durable access and discovery. Fosters projects eg DSpace, Fedora, Vivo. Offers services eg tech/business expertise, membership/governance framework, infrastructure, marketing/comms. Affiliate project Samvera.

Ecosystem is larger so want to expand support to ensure community/financial/technical sustainability. Criteria include: philosophical alignment; strategic importance to community; financially viable; technical pieces in place. So if you know of a project needing support, contact them.

A Simple Method for Exposing Repository Content on Institutional Websites by Gonzalo Luján Villarreal, Paula Salamone Lacunza, María Marta Vila, Marisa Raquel De Giusti, Ezequiel Manzur

2 institutions with active IRs but questionable web dissemination practices – lack of interest/staff/time/money to maintain web presence for staff profiles.

  • Improved existing sites with Choique CMS, WordPress, Joomla
  • Designed and developed new sites with WordPress multisite – hosted if research centre didn’t have their own site
  • Gave advice about web publishing
  • Use IRs to boost websites by using OpenSearch to retrieve contents, then software library to fetch/filter/organise/deliver and share results, then created software addons for each CMS
    • Easy configuration
    • flexible usage eg “all research centre publications” or “theses from the last 5 years” or “all researcher A’s content”
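OpenSearch results typically come back as an Atom feed, so the CMS add-ons can reuse a generic feed parser. A minimal sketch of the fetch-and-filter step (the sample feed is fabricated):

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}

def titles_from_opensearch(atom_xml):
    """List entry titles from an OpenSearch Atom result, e.g. a
    repository query like "all researcher A's content"."""
    root = ET.fromstring(atom_xml)
    return [entry.findtext("a:title", namespaces=ATOM)
            for entry in root.findall("a:entry", ATOM)]

sample = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Search results</title>
  <entry><title>Thesis on X</title></entry>
  <entry><title>Dataset Y</title></entry>
</feed>"""
print(titles_from_opensearch(sample))
```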

Results: 7 new websites published, 14 in development. Researchers want to deposit in IRs to keep the website updated. Dev work to continue eg flushing cache; multi-repository retrieval?

Demonstrating impact #or2017


A new approach for measuring the value of big data and big data repositories by Clare Richards, Lesley Wyborn, Ben Evans, Jon Smillie, Jingbo Wang, Claire Trenham, Kelsey Druken

Australian govt has invested $50 million++ in data infrastructure, so we need to know how effectively we’re achieving development goals, and what’s important for the users to inform future developments.

National Computational Infrastructure manages 10+PB of datasets in geophysics, genomics, astronomy etc. FAIR principles are driving forces:

  • Findable: describe in data catalogues using community standards, and federating with international collections
  • Accessible: for download or programmatically, for usage in virtual labs, high performance applications
  • Interoperable: using a transdisciplinary approach (data is born connected across the discipline boundaries and beyond academia to address societal needs) and applying international data standards
  • Reusable: demonstrating data works across different domains and applications

What does ‘value’ mean to our stakeholders/investors? Cost vs basic/expected/desired/unanticipated value. Looked at ARC definition of research impact and identified the areas where they add value.

Ways to demonstrate value of investment include: case studies of research impact (helpful but time-consuming) or quantitative stats eg hits/visits (easy but tend to use easy factors not meaningful factors, plus context varies between disciplines). Looking at:

  • tracking data usage – which datasets are open / partially open? quality of entries on catalogue? mine usage logs to track who’s using it; track what it’s being used for
  • accessibility and usability – what datasets are compliant with FAIR principles?
  • research outcome measures –  case studies but also publications, citations

Need to then convert these metrics to estimates of return on investment.

“Data only has value if it is used – otherwise it’s just a cost.”

Making an Impact: The recording and capture of Impact-generating activity and evidence at the University of Glasgow by William J. Nixon, Rose-Marie Barbeau, Justin Bradley, Carlos Galan-Diaz, Michael Eadie

In the UK impact (eg to economy, society, culture, public policy or services, health, environment or quality of life, beyond academia) has been enshrined as worth 20% towards REF assessment. So how do we best support activity and evidence? Central pipeline development; individual performance reviews; collective impact strategies.

IR service at Glasgow has publications repository (where they started), theses, research data – and impact as a separate repository. Working in partnership among library, research office, ePrints services, colleges. Challenges include confidentiality, visibility, reporting: need to be able to capture it but assure researchers of confidentiality while they’re still working on their project.

Tried storing impact as a “knowledge exchange and impact” field in their publications repository – allowed entering description of activity or evidence. Soon became clear this was too simplistic. Developed an Impact Repository which is very locked down and closed. You have to log in and then you can only access your own slice of evidence.


  • impact snapshot
    • fill out basic information
    • identify the kinds of impact your work might have
    • external collaborators (example of something that may need to be confidential)
    • engagement activities (eg consultation, event with schools, pop sci event, committee involvement)
    • indicators of esteem (eg prizes, advisory panel)
  • upload documents
  • link to documents
  • public information
    • optionally put some info on your public profile
  • click “deposit”

Q: Why build new infrastructure instead of using a CRIS/RIMS?
A: Don’t have a commercial CRIS. Already had a strong repository ecosystem.

How to Speak Business Case by Mike Lynch

Easy for people with domain knowledge to get defensive when asked to get involved in corporate IT side of things. Cringe factor at idea of research having a monetary value. Research often unique to institution; Word and email general across all institutions. But ultimately we’re all nerds.

Core nerd values:

  • standards are good
  • explicit is better than implicit
  • planning and estimating are good, even if impossible

Provisioner was a big data project to create a research app service catalogue, involving OMERO, GitLab, Stash, and linked data. All stuff that sounds fantastic but hard to sell to project managers. A few rounds of trying to sell the project:

  1. They wanted to know what the tangible benefits would be. Did research on research impact and open data but this was still too high-level, too long a timeline.
  2. “How much money will this save the organisation in the next five years?”
    • will automate data description for researchers
    • reduce admin work for faculty staff
    • provide higher quality metadata for data librarians
    • easier management of storage for tech support
    • good design means less effort for developers
      All real but don’t translate into much $$$
  3. So looked at what it costs the university to generate the data that they’re proposing to reuse – mostly this is researcher salaries, so easy to calculate. And can guess the likelihood that data will be reused as a result of the project. Effectively the university will produce more research for the same amount of money.

Business cases are useful. Focus on immediate researcher benefits.

(There are still people anti-business case especially as it starts seeming repetitive. He tells them it’s like an orchestra rehearsal: it’s repetitive but you get better at it. Also requirements-gathering is a way of building relationship with users – you find out things they need that improve the project. Setting up a governance structure means if you have to make changes you don’t have to go all the way up the org structure.)

Scholarly workflows #or2017


Supporting Tools in Institutional Repositories as Part of the Research Data Lifecycle by Malcolm Wolski, Joanna Richardson

Have been working on research data management in context of the whole research data lifecycle. Started asking question: once research data management is under control, what will be the next focus? Their answer was research tools. Produced two journal articles:

  • Wolski, M., Howard, L., & Richardson, J. (2017). The importance of tools in the data lifecycle. Digital Library Perspectives, 33(3), in press
  • Wolski, M., Howard, L., & Richardson, J. (2017). A trust framework for online research data services. Publications, 5(2), article 14

Research life cycle: Data creation and deposit (plan and design, collect and capture) -> Managing active data (Collect and capture, collaborate and analyse) -> Data repositories and archives (manage, store, preserve; share and publish) -> Data catalogues and registries

Research data repositories vary a lot. Collection or ecosystem? Open or closed? End point or part of workflow? Why is it hard to build them? Push-and-pull between re-usability and preservation:

  • technical aspects
  • interoperability
  • legal/regulatory/ethical constraints
  • one-off activity or continuous
  • diversity of accessibility issues
  • diversity of re-usability issues

The average number of research tools was 22 per person (from Word, ResearchGate and email, through SurveyMonkey, Dropbox and Figshare, to R and really specialised ones). Kramer and Bosman (2016) divided tools into assessment, outreach, publication, writing, analysis, discovery, preparation phases. Tools exploding as research activity scales up, collaboration increases. Large-capacity projects being funded. Data science courses upskilling researchers.

Researchers use lots of tools as part of the data workflow. The institution may manage data, but have no ownership of the workflow. Since data has to move seamlessly between tools, interoperability is key – but how do we build these interoperable workflows and infrastructures?

Need to remember repository is only part of the research ecosystem. Need to take an institutional approach – or approaches rather than a single design solution. Look at main workflows and tools used – check out research communities who may already have the solutions – focus must be meeting the researchers’ needs.

Q: Will we see researchers use fewer tools as disciplinary workflows develop?
A: Probably not, but will see more integration between them, eg Qualtrics adding an R connector.

Research Offices As Vital Factors In The Implementation Of Research Data Management Strategies by Reingis Hauck

Have a full-text repository on DSpace, building data repository on CKAN. What if we build something (at great expense) and they don’t come? We need cultural change. Eg UK seems far ahead but only 16% of respondents are accessing university RDM support services in 2016.

They have data repository, and provide support service by research office, library and IT services.

Research offices provides support in grant writing; advocates on policies; helps with internal research funding; report to senior leadership. Their toolkit:

  • need to win research managers over – explain how important it is
  • embedded an RDM-expert
  • upskilled research office staff about data management planning and how to make a case for data management.

Look out for game changers:

  • eg large collaborative research projects – produce lots of data and need to share it to be successful so more likely to listen
  • DMP preview as standard procedure for proposal review and training on proposal writing. (Want data management planning to be like brushing your teeth: you do it every day and if you forget you can’t sleep.)
  • adapt incentives – eg internal funding for early career researchers requires data management plans
  • use existing networks – researchers go to lots of boards and meetings already so feed this as a topic like any other topic
  • engage with members of the DFG [German science foundation] review board – to get them to draw up criteria to reward researchers doing it

Cultural change towards open science can be supported by your research office. Let’s team up more!

Towards Researcher Participation in Research Information Management Systems by Dong Joon Lee, Besiki Stvilia, Shuheng Wu

RIMS – include ResearchGate, Academia, Google Scholar; ORCID, ImpactStory; PURE, Elements

ResearchGate sends out a flood of emails – good for some, a put-off for others. How can we improve our RIMS to improve researcher engagement?

Interviewed 15 researchers then expanded to survey 412 participants; also analysed metadata on 126 ResearchGate profiles of participants. Preliminary findings:

  • Variety of different researcher activities in RIMS eg write manuscripts, interact with peers, curate, evaluate, look for jobs, monitor literature, identify  collaborators, disseminate research, find relevant literature.
  • Different levels of participation: readers may have a profile but don’t maintain it or interact with people; record managers maintain their profile, but don’t interact with others; community members maintain profiles but also interact with others etc.
  • Different motivations to maintain profile: to share scholarship (most popular); improve status, enjoyment, support evaluation, quality of recommendations, external pressure (least popular)
  • Different use of metadata categories: people tend to use the person, publication, and research subject categories. Maybe research experience, but rarely education, award, teaching experience, or other.
    • In Person most people put in first name, last name, affiliation, dept;
    • Publication: most use most of these fields, except only 30% of readers share the file – vs about 80% of record managers and community members

Want to develop design recommendations to enable RIMS to increase participation.

Research and non-publications repositories, Open Science #or2017


OpenAIRE-Connect: Open Science as a Service for repositories and research communities by Paolo Manghi, Pedro Principe, Anthony Ross-Hellauer, Natalia Manola

Project 2017-19 with 11 partners (technical, research communities, content providers) to extend technological services and networking bridges – creating open science services and building communities. Want to support reuse/reproducibility and transparent evaluation around research literature and research data, during the scientific process and in publishing artefacts and packages of artefacts.

Barriers – repositories lack support (eg integration, links between repositories). OpenAIRE want to facilitate new vision so providing “Open Science as a Service” – research community dashboard with variety of functions and catch-all broker service.

RDM skills training at the University of Oslo by Elin Stangeland

Researchers using random storage solutions and don’t really know what they’re doing. Need to improve their skills. Have been setting up training for various groups in organisation. Software Carpentry for young researchers to make their work more productive and reliable. 2-day workshops which are discipline-specific and well-attended. Now running their own instructor training which allows expanding service. Author carpentry, data carpentry, etc.

Training for research support staff who are first port of call on data management plans, data protection, basic data management. Recently made mandatory by Dept of GeoSciences to attend DMP training.

Expanding library carpentry to national level.

IIIF Community Activities and Open Invitation by Sheila Rabun

Global community that develops shared APIs for web-based image delivery, implements them in software, and exposes interoperable image content.

Many image repositories are effectively silos. The IIIF APIs add a layer that lets servers talk to each other, allowing easier management and better functionality for end-users. Lots of image servers and clients around now, so you can mix-and-match your front and back-ends. Can have deep zoom, compare images, and more.
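The mix-and-match works because every IIIF image server answers the same URL pattern, `{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}` from the IIIF Image API. A sketch of building such requests (the server and identifier are hypothetical):

```python
def iiif_image_url(server, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL:
    {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    ("max" size is Image API 2.1+; earlier versions use "full")."""
    return f"{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Deep zoom: request just a 500x500px detail, scaled to 250px wide
print(iiif_image_url("https://iiif.example.org/iiif", "page-42",
                     region="1000,1000,500,500", size="250,"))
```

Any IIIF-compliant viewer can point at any IIIF-compliant server with no custom glue, which is the interoperability the talk describes.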

Everything created by global community so always looking for more participants. Community groups, technical specification groups eg extending to AV resources, discovery, text granularity (in text annotations). Also a consortium to provide leadership and communication channels.

Data Management and Archival Needs of the Patagonian Right Whale Program Data by Harish Maringanti, Daureen Nesdill, Victoria Rowntree

Importance of curating legacy datasets. World’s longest continuous study of large whale species: 47 years and counting of data. Two problems:

  • to identify whales – found the callosities of right whales were unique (number, position, shape) and pattern remained same despite slight change over time. So can take aerial photos when they surface. Data analysed with complicated computer system and compared with existing photos.
  • to gather data over a period of time – where to find whales regularly. Discovered whales gather in three places: 1) mothers and calves; 2) males and females; 3) free-for-all.

Collection has tens of thousands of b&w negatives; color slides; analysis notebooks; field notes; Access 1996 database records; sightings maps.

Challenges: heterogeneity of data; metadata – including how much can be displayed publicly; outdated databases.

Why should libraries care? We can provide continuity beyond life of individual researchers. Legacy data is as important as current data in biodiversity type fields and generally isn’t digitised yet.

Repository driven by the data journal: real practices from China Scientific Data by Lili Zhang, Jianhui Li, Yanfei Hou

China Scientific Data is a multidisciplinary journal publishing data papers – raw data and derived datasets. Submission (of paper and dataset), review (paper review and curation check), peer review, editorial voting.

How to publish:

  • Massive data? – on-demand sample data publication: can’t publish the whole set, but publishes a sample (typical, minimum sized) to announce the dataset’s existence
  • Complex data? – publish data and supplementary materials together eg background info, software, vocabulary, code, etc. Eg selected font collections for minority languages
  • Dynamic data? – eg when updating with new data using same methodology and data quality control. Could publish as new paper but it’s duplicative so published instead as another version with same DOI. Can be good for your citations!

Encourage authors to store data in their repository so its long-term availability is more reliable.

RDM and the IR: Don’t Reuse and Recycle – Reimplement by Dermot Frost, Rebecca Grant

We all have IRs and they’re designed for PDF publications. Research Data Management is largely driven by funder mandates; some disciplines are very good at it, some less so (eg historians claiming “I have no data” – having just finished a large project including land ownership registries from 17th century, georectified etc!)

FAIR (findable, accessible, interoperable, reusable) data is primarily machine-oriented, ie findable by machines. IRs can’t do this well enough: technically, uploading a zip file is FAIR, but it’s time-costly for the user.
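The zip-file point can be made concrete: a bare file deposit is findable by humans but opaque to machines, whereas described, identified, licensed content can be harvested and reused automatically. A minimal sketch follows – the field names are loosely modelled on DataCite-style metadata and are assumptions, not any particular IR’s API.

```python
import json

# Illustrative contrast: a bare IR deposit vs one carrying machine-actionable
# metadata (persistent identifier, licence, file-level descriptions).
opaque_deposit = {"file": "dataset.zip"}  # findable by a human, not by a machine

fair_deposit = {
    "identifier": "https://doi.org/10.5072/example",  # F: persistent identifier
    "license": "CC-BY-4.0",                           # R: clear reuse terms
    "files": [                                        # A/I: described contents
        {"name": "readings.csv", "format": "text/csv",
         "description": "sensor readings, one row per observation"},
    ],
}

print(json.dumps(fair_deposit, indent=2))
```

A harvester can act on the second record without a person unpacking anything; the first requires manual inspection.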

Instead, find a domain-specific repository (and read the terms and conditions carefully, especially around preservation!), or implement your own institutional data repository (but a different scale of data storage can take serious engineering effort). Follow the Research Data Alliance.

Developing a university wide integrated Data Management Planning system by Rebecca Deuble, Andrew Janke, Helen Morgan, Nigel Ward

Need to help researchers across the life-cycle. University of Queensland identified an opportunity to support researchers around funding/journal requirements. Used DMPOnline, but uptake was poor due to lack of mandate. UQ Research Data Manager system:

  • Record developed by the researcher – an active record (not a plan, though it includes project info) which can change over the course of the project. Simple dynamic form, tailored to researchers, with guidance for each field.
  • Storage auto-allocated by storage providers for working data – researchers are given a mapped drive accessible by national collaborators (hopefully international soon) using a code provided on completing the form.
  • [Working on this part] Publish and manage selected data as a managed collection (UQ eSpace). Currently a manual process of filling in a form with metadata fields in eSpace; potential to transfer metadata from the RDM system to eSpace.
  • Developing procedures to support the system.

Benefits include uni oversight of research in progress, a researcher-centric approach, improved impact/citation, and public access to data.

Preserving and reusing high-energy-physics data analyses by Sünje Dallmeier-Tiessen, Robin Lynnette Dasler, Pamfilos Fokianos, Jiří Kunčar, Artemis Lavasa, Annemarie Mattmann, Diego Rodríguez Rodríguez, Tibor Šimko, Anna Trzcinska, Ioannis Tsanaktsidis

Data is very valuable – papers are published even 15 years after funding stopped, and although CERN is always building new and bigger colliders, data remains relevant decades after it was collected.

Projects involve 3000 people, including a high turnover of young researchers. CERN needs to capture everything needed to understand and rerun an analysis years later – data, software, environment, workflow, context, documentation.

  • Invenio (JSON schema with lots of domain-specific fields) to describe analysis
  • Capture and preserve analysis elements
  • Reusing – need to reinstantiate the environment and execute the analysis on the cloud.
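The capture step above might record an analysis as a structured JSON document covering each element the talk lists. The sketch below is a rough approximation with made-up field names, repository URLs, and image names; CERN’s actual Invenio schemas are far richer and domain-specific.

```python
import json

# Hypothetical analysis-preservation record: enough to reinstantiate the
# environment and rerun the analysis later. All values are illustrative.
analysis = {
    "title": "Dilepton mass spectrum study",
    "data": ["root://eos.example/datasets/2015/dilepton.root"],
    "software": {"repo": "https://git.example/analysis-code", "tag": "v1.2"},
    "environment": {"image": "example/slc6-base", "arch": "x86_64"},
    "workflow": {"engine": "yadage", "spec": "workflow.yml"},
    "documentation": "internal note ABC-2015-042",
}

record = json.dumps(analysis)   # serialise for deposit in the repository
restored = json.loads(record)   # a later reader recovers the same structure
```

The point is that every element needed for a rerun – data location, pinned software version, execution environment, workflow spec – travels together as one record, rather than living in someone’s head or home directory.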

REANA (REusable ANAlyses) supports collaboration and multiple reuse scenarios.