A new approach for measuring the value of big data and big data repositories by Clare Richards, Lesley Wyborn, Ben Evans, Jon Smillie, Jingbo Wang, Claire Trenham, Kelsey Druken
Australian govt has invested $50million++ in data infrastructure so we need to know how effectively we’re achieving development goals, and what’s important to users, to inform future developments.
National Computational Infrastructure manages 10+PB of datasets in geophysics, genomics, astronomy etc. FAIR principles are driving forces:
- Findable: describe in data catalogues using community standards, and federating with international collections
- Accessible: for download or programmatically, for usage in virtual labs, high performance applications
- Interoperable: using a transdisciplinary approach (data is born connected across the discipline boundaries and beyond academia to address societal needs) and applying international data standards
- Reusable: demonstrating data works across different domains and applications
What does ‘value’ mean to our stakeholders/investors? Cost vs basic/expected/desired/unanticipated value. Looked at ARC definition of research impact and identified the areas where they add value.
Ways to demonstrate value of investment include: case studies of research impact (helpful but time-consuming) or quantitative stats eg hits/visits (easy, but tend to capture what’s easy to count rather than what’s meaningful, plus context varies between disciplines). Looking at:
- tracking data usage – which datasets are open / partially open? quality of entries on catalogue? mine usage logs to track who’s using it; track what it’s being used for
- accessibility and usability – what datasets are compliant with FAIR principles?
- research outcome measures – case studies but also publications, citations
Need to then convert these metrics to estimates of return on investment.
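A minimal sketch of the "mine usage logs" idea, not NCI's actual method: count accesses and distinct users per dataset. The record layout (dataset id, user id) and all names below are hypothetical; real access logs would need parsing first.

```python
from collections import defaultdict

# Hypothetical log records as (dataset id, user id) pairs; illustration only.
log_records = [
    ("geophysics-01", "alice"),
    ("geophysics-01", "bob"),
    ("geophysics-01", "alice"),
    ("genomics-07", "carol"),
]


def summarise_usage(records):
    """Return access count and distinct-user count for each dataset."""
    accesses = defaultdict(int)
    users = defaultdict(set)
    for dataset, user in records:
        accesses[dataset] += 1
        users[dataset].add(user)
    return {
        dataset: {
            "accesses": accesses[dataset],
            "distinct_users": len(users[dataset]),
        }
        for dataset in accesses
    }


usage = summarise_usage(log_records)
```

Counts like these are the "easy" stats; they would still need context (discipline, purpose of use) before feeding any return-on-investment estimate.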
“Data only has value if it is used – otherwise it’s just a cost.”
In the UK impact (eg to economy, society, culture, public policy or services, health, environment or quality of life, beyond academia) has been enshrined as worth 20% towards REF assessment. So how do we best support activity and evidence? Central pipeline development; individual performance reviews; collective impact strategies
IR service at Glasgow has publications repository (where they started), theses, research data – and impact as a separate repository. Working in partnership among library, research office, ePrints services, colleges. Challenges include confidentiality, visibility, reporting: need to be able to capture it but assure researchers of confidentiality while they’re still working on their project.
Tried storing impact as a “knowledge exchange and impact” field in their publications repository – allowed entering description of activity or evidence. Soon became clear this was too simplistic. Developed an Impact Repository which is very locked down and closed. You have to log in and then you can only access your own slice of evidence.
- impact snapshot
- fill out basic information
- identify the kinds of impact your work might have
- external collaborators (example of something that may need to be confidential)
- engagement activities (eg consultation, event with schools, pop sci event, committee involvement)
- indicators of esteem (eg prizes, advisory panel)
- upload documents
- link to documents
- public information
- optionally put some info on your public profile
- click “deposit”
Q: Why build new infrastructure instead of using a CRIS/RIMS?
A: Don’t have a commercial CRIS. Already had a strong repository ecosystem.
How to Speak Business Case by Mike Lynch
Easy for people with domain knowledge to get defensive when asked to get involved in corporate IT side of things. Cringe factor at idea of research having a monetary value. Research often unique to institution; Word and email general across all institutions. But ultimately we’re all nerds.
Core nerd values:
- standards are good
- explicit is better than implicit
- planning and estimating are good, even if impossible
Provisioner was a big data project to create a research app service catalogue, involving OMERO, GitLab, Stash, and linked data. All stuff that sounds fantastic but is hard to sell to project managers. A few rounds of trying to sell the project:
- They wanted to know what the tangible benefits would be. Did research on research impact and open data but this was still too high-level, too long a timeline.
- “How much money will this save the organisation in the next five years?”
  - will automate data description for researchers
  - reduce admin work for faculty staff
  - provide higher quality metadata for data librarians
  - easier management of storage for tech support
  - good design means less effort for developers
  All real but don’t translate into much $$$
- So looked at what it costs the university to generate the data that they’re proposing to reuse – mostly this is researcher salaries, so easy to calculate. And can guess the likelihood that data will be reused as a result of the project. Effectively the university will produce more research for the same amount of money.
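The arithmetic behind that last round can be sketched roughly as follows: value of reuse is taken as the cost of generating the data (mostly researcher salaries) multiplied by the guessed likelihood of reuse. This is an interpretation of the notes, not the speaker's actual model; the function name and all figures are hypothetical.

```python
def expected_reuse_value(salary_cost, reuse_probability):
    """Expected value of reuse: cost to generate the data times chance of reuse."""
    if not 0.0 <= reuse_probability <= 1.0:
        raise ValueError("reuse_probability must be between 0 and 1")
    return salary_cost * reuse_probability


# Hypothetical datasets and figures, for illustration only.
datasets = [
    {"name": "dataset_a", "salary_cost": 200_000, "reuse_probability": 0.25},
    {"name": "dataset_b", "salary_cost": 50_000, "reuse_probability": 0.5},
]

# Total research value the university gets without paying to regenerate the data.
total = sum(
    expected_reuse_value(d["salary_cost"], d["reuse_probability"])
    for d in datasets
)
```

The reuse probability is a guess by construction, so a business case would probably present a range of values rather than a single figure.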
Business cases are useful. Focus on immediate researcher benefits.
(There are still people anti-business case especially as it starts seeming repetitive. He tells them it’s like an orchestra rehearsal: it’s repetitive but you get better at it. Also requirements-gathering is a way of building a relationship with users – you find out things they need that improve the project. Setting up a governance structure means if you have to make changes you don’t have to go all the way up the org structure.)