Big data, little data, no data: scholarship in the networked world
Technological advances in mediated communication have gone from writing to computers to social media, and these are cumulative: we use all of them concurrently. And we're increasingly thinking of these in terms of data. Need to think about new infrastructures because these will determine what will be there for tomorrow’s students/librarians/archivists.
Australia is notable for ANDS, and for movements towards open access policies: the only place she’s found where managing data is part of (ARC’s) Code for the Responsible Conduct of Research.
Book coming out late 2014/early 2015 – data and scholarship; case studies in data scholarship; data policy and practice. Organised around “provocations”:
- How do rights, responsibilities, and risks around research data vary by disciplines and stakeholders?
- How can data be exchanged across domains, contexts, time?
- How do publication and data differ?
- What are scholars’ motivations to share?
- What expertise is needed to manage research data?
- How can knowledge infrastructures adapt to the needs of scholars and demands of stakeholders?
Until the first journals in the 17th century, scholars communicated by private letters. Journals were the beginning of peer review, of opening up knowledge beyond those privileged to exchange letters. However, things began much earlier: a brick from the 5th-6th century inscribed with the Sutra on Dependent Origination. Now we have complete open access in PLOS One. (Shows "If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology".) Lots of journals, preprint servers, institutional repositories to submit to.
Publishing (including peer review) serves to legitimise knowledge; to disseminate it; and to provide access, preservation and curation.
Open access means many things – uses Suber’s “digital, online, free of charge, and free of most copyright and licensing restrictions” definition.
ANDS model of “more Australian researchers reusing research data more often”. Moving from unmanaged, disconnected, invisible, single-use data to managed, connected, findable, reusable data.
Open data has even more definitions: Open Data Commons “free to use, reuse and redistribute”; Royal Society says “accessible, useable, assessable, intelligible”. OECD has 13 conditions. People don’t agree because data’s really messy!
Data aren’t publications
When data’s created it’s not clear who owns it: field researcher, funder, instrument, principal investigator?
Papers are arguments – data are evidence.
Few journals try to peer review data. Some repositories do but most just check the format.
Data aren’t natural objects
What are data? Most places list possibilities; few define what is and isn’t data. Marie Curie’s notebook? A mouse? A map or figure? An astronomical photo – which the public loves, but astronomers don’t agree on what the colours actually mean… 3D figure in PDF (if you have the exact right version of Adobe Acrobat). Social science data where even when specifically designed to share it’s full of footnotes telling you which appendices to read to understand how the questions/methods changed over time…
Data are representations
“Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.”
You think you have problems on catalogue interoperability, try looking at open ontologies intersecting different communities.
Data sharing and reuse depend on infrastructure
You don’t just build an infrastructure and you’re done. They’re complex, interact with communities. Huge amount of provenance important to make sense of data down the line.
Data management is difficult – scholars have a hard enough time managing it for their own reuse let alone someone else’s reuse. Need to think about provenance, property rights, different methods, different theoretical perspectives, “the wonderful thing about standards is there’s so many to choose from”.
Ways to release data:
- contribute to archive
- attach to journal article
- post on local website
- license on request
- release on request
These last ones are very effective because people are talking to each other and can exchange tacit knowledge, but this doesn’t scale. The first scales but only works for well-structured and organised data.
So what are we trying to do? Reuse by investigator, collaborators, colleagues, unaffiliated others, future generations/millennia? These are very different purposes and commitments.
Traditional economics (1950s) was based on physical goods – supply and demand. But this doesn’t work with data. Public/private goods distinction doesn’t work with information. There’s no rivalry around the sunset or general knowledge in the way there is around a table or book. So concept of “common pool resources” – libraries, data archives – where goods must be governed.
| | Low subtractability/rivalry | High subtractability/rivalry |
|---|---|---|
| Exclusion difficult | public goods | common pool resources |
| Exclusion easy | toll or club goods | private goods |
While data are unstructured and hard to use they’re private goods. Are we investing to make them toll goods, common pool resources or public goods?
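As a rough sketch of the two-by-two typology above, the four categories fall out of two yes/no questions: is the good rivalrous (high subtractability), and is exclusion easy? The function name and boolean flags below are illustrative assumptions, not something from the talk.

```python
# Illustrative only: classify_good and its parameters are hypothetical names
# used to restate the goods typology, not part of the talk.
def classify_good(rivalrous: bool, excludable: bool) -> str:
    """Return the category for a good in the two-by-two typology.

    rivalrous  -- True if one person's use subtracts from another's (high rivalry)
    excludable -- True if it is easy to exclude people from using the good
    """
    if excludable:
        return "private goods" if rivalrous else "toll or club goods"
    return "common pool resources" if rivalrous else "public goods"


if __name__ == "__main__":
    # Walk all four cells of the table.
    for rivalrous in (False, True):
        for excludable in (False, True):
            print(rivalrous, excludable, classify_good(rivalrous, excludable))
```

Which cell a dataset sits in is a judgement call rather than a fixed property, which is the point of the question above: investment in curation and access is what would move it from one cell to another.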
Need to make sustainability decisions – what to keep, why, how, how long, who will govern them, what expertise required?
—
Q: Health sciences doing well
A: Yes, but there are representation issues. An attempt to outsource mammogram readings fell foul of the huge amounts of tacit knowledge required. In genomics there are attempts to get scientists and drug companies to work together in the open, but the situation with journals is complicated: some say that because the data is out there it’s prior publication, when in fact the paper is explaining the science behind it. There are also issues around (misleading) partial release of data; recommends Goldacre’s Bad Pharma.
Q: Scientists want to know who they’re giving data to. But maybe data citation a way to get scientists on board?
A: Citing data as incentive is a hypothesis. Really sharing data is a gift – if you put it on a repository you don’t have it available to trade to collaborators, funders, new universities. Data as dowry: people getting hired because they have the data.
Agreeing on the citable unit is hard: some people would have a DOI on every cell, others would have a footnote “OECD”. Citation isn’t just about APA vs Bluebook, it’s about citable unit and who gets credit and….