Tag Archives: data

Flexible, Secure and Sharable Storage for Researchers #theta2015

Flexible, Secure and Sharable Storage for Researchers (abstract)
Andrew Nielson and Stephen McGregor

Talked a lot to researchers. Quarter of researchers didn’t know how much storage needed. Few needed more than 10TB. Built http://research-storage.griffith.edu.au/

Found existing services were uni-focused – hard to give access to external collaborators. Need to be competitive with cloud services. Want to let people collaborate with everyone, but not everyone. So there’s a form that lets researchers invite other users to sign in using a uni, Google, or LinkedIn account.

Needed multiple ways to share. Internal sharing – share with people by name. External sharing – provide a web URL with password protection / expiration date.

Device support: web interface plus apps including desktop sync apps.

Project spaces – you get 5GB storage by default but set up a project and storage space is unlimited. Space is a folder / “logical grouping of data”. When creating, have to include metadata for admin purposes (owner, project name, funder, backup contact). Instant approval and provision – don’t want to get in the way. Unless told to delete old / unaccessed data, just move to cheaper storage – effectively archiving off.

Block level deduplication (basically store a reference to previously stored data) better than single-instancing and lower overhead than compression. Have managed to save 46% space this way. This is needed because software stores entire new version, instead of a diff. “Don’t keep backups” but do replicate/sync between their geographically separated datacenters.
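
[A minimal sketch of what block-level deduplication means, not the storage vendor's implementation: the tiny block size, the SHA-256 hashing and the `store`/`restore` names are all assumptions for illustration. Each file version is cut into fixed-size blocks, every unique block is kept once, and a version is just a list of references to blocks.]

```python
import hashlib

BLOCK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def store(data: bytes, block_store: dict) -> list:
    """Split data into fixed-size blocks and keep each unique block once.

    Returns the list of block hashes (references) needed to rebuild data.
    """
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # stored only if not seen before
        refs.append(digest)
    return refs


def restore(refs: list, block_store: dict) -> bytes:
    """Rebuild the original bytes from a list of block references."""
    return b"".join(block_store[r] for r in refs)


block_store = {}
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # a new version of the file with one block changed
refs1 = store(v1, block_store)
refs2 = store(v2, block_store)

logical_bytes = len(v1) + len(v2)                            # what the user sees
physical_bytes = sum(len(b) for b in block_store.values())   # what is kept
print(logical_bytes, physical_bytes)
```

[Saving a whole new version only costs the blocks that actually changed, which is why this helps when the software stores entire versions rather than diffs.]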

Used by Sciences but also Arts/Ed/Law, Business, and Health.
30% of projects (18 researchers) unfunded – data that would otherwise be on hard drives and uni wouldn’t even know it exists.

Developing and piloting more services including storage for use by instruments.
Currently administrators need to be hands-on to set up service – want to automate.

Q: Mandate?
A: If you force it people get annoyed. Providing option.

Q: Funding going forward given that new data probably bigger?
A: Yeah… basically want to build it well, get data off hard drives, show popularity, and then write business case if/when new space needed. Nowhere near this need yet.

Audience comment that fantastic usability for researchers.
A: Getting feedback from researchers has helped this.

Q: Any data publication service in development?
A: Project focused on working storage. eResearch Services department are working on a system for post-publication storage.

Q: Is it accessible to computational services?
A: Another project in early stages working on computational needs. Data in this format isn’t ideal for putting on servers – technically possible but usually when people are doing stuff on a server they want their storage there too.

Intersection of big data and learning #theta2015 #bright-dark

Creating Connections in Complexity: discussions at the intersection of big data & learning (abstract and bibliography)
Theresa Anderson @uts_mdsi and Simon Buckingham Shum

“Data is explosive, evolving and infinite” – connecting the dots is important but happens at the expense of things that aren’t connected. Ubiquitous technologies often grab the spotlight, but the ‘invisible hand’ of big data and analytics is important. “datapoints in a graph are tiny portholes onto a rich human world” (Buckingham Shum 2015)

Risk of assumptions and values getting baked into data. Tools don’t just provide access to reality but can shape reality. [Yet] “Raw data is both an oxymoron and a bad idea” (Bowker 2015)

[Flipped classroom presentation – here we start playing with picture cards and post-its to brainstorm and discuss:]

  • where is the ‘human’ in analytics?
  • what human/machine partnerships can/should we enable in computationally intensive work?
  • can the analytics of curation help us support creativity and learning?

[My brainstormed image]

Chat about Research Repository #theta2015

Elements Integration – lets chat about Research Repository and populating Researcher Profiles (abstract)
Leonie Hayes and Anne Harvey

[Facilitated audience discussion of various questions only loosely related. Probably unintended that it largely drew an audience of people more interested in learning about Elements than people who had already implemented it.]

Discussion of data – Creative Commons licenses not very appropriate to datasets because immediately locks down opportunities for reuse. Creative Commons Zero is better here.

“Sunshine cleaning” – when you hang your data out to dry and everyone sees how dirty it is so you quickly clean it. [Very effective but terrifying for many researchers so I suggest an alternative might be to put the data, like the journal article, out for peer review.]

Looking at impact for Creative Works – altmetrics. Many don’t see themselves as researchers but as practitioners. Uptake of workshops is low as often working from home. The institution needs to focus on areas outside STEM and traditional metrics – these alienate scholars in other fields.

Open Access policy. Many have ideals but doesn’t translate into practice. Especially license issues. Difficulties when managing a PBRF version vs an open access version.

B(uild)YO skilled Data Librarian #theta2015

B(uild)YO skilled Data Librarian (abstract)
Karen Visser, Natasha Simons and Kathryn Unsworth

[Flipped classroom approach: 1) looked at recent data librarian job ads to work out what skills we/librarians would need to develop; 2) shared ideas previously generated of how to upskill (eg reading D-Lib articles; attending IASSIST conferences; data ‘bootcamp’ workshop); 3) discussed topics in teams – eg what skills does a data librarian need (we came up with subject ontology expertise, ethical/cultural understanding, knowledge of legal issues, then unfortunately time was up).]

[Pretty chaotic but got through a lot especially since multiple ‘streamed’ discussions went on at once – organisers aim to distribute notes to attendees post-conference.]

Digital humanities’ use of cultural data #theta2015

How will digital humanities in the future use cultural data?
Ingrid Mason @1n9r1d

[Presentation basically takes the approach of giving an overview of digital humanities and cultural data by throwing lots of examples at us – fascinating but not conducive to notes.]

Cultural data is generated through all research – seems to be more through humanities, but many others too.
RDS building national collection pulling together statistical data, manuscripts, documents, artefacts, av recordings from an array of unconnected repositories.

New challenge: people wanting access to collections in bulk, not just borrowing a couple of items. Need to look at developing a wholesale interface on top of our existing retail interface.

Close reading vs distant reading. Computation + arrangement + distance. Researchers interested in immersion; in moving images (eg change over time); pattern analysis; opening up the archive (eg @TroveNewsBot). Text mining/linguistic computing methods to look at World Trade Centre first-responder interviews. Digital paleography – recognising writing of medieval scripts. Linked Jazz.

A dream: when an undergrad would have loved to have been in the Matrix. Have a novel surrounding you and then turn it immediately into a concordance.

Things digital humanities researchers need: Visualisation hours. Digitisation and OCR. Project managers. Multimedia from various institutions. High-performance computing experts.

~“Undigitised data is like dark matter” (Maltby)

What we can do:

  • Talk to researchers about materials they need
  • Learn about APIs
  • Provide training

Q: Indigenous cultural data
A: Some material is very sensitive and challenges to get it to appropriate researchers/communities so could be opportunities to work together.

Q: Any work on standardisation of cultural data?
A: At a high level (collection description) we can but between fields harder.

Towards 2020: Making the Most of Your Research Data #DSShowcase

Summary: One of the themes coming out of the day was that the New Zealand research sector is in a unique position to collaborate to develop the policies, infrastructure and best practices we need to treat our research data as the valuable asset it is. It’s really noticeable how each time we get together for talks like this, there’s been more progress made towards this ideal.

(Day hosted by Digital Science. Notes below are ‘live-blog’-style; my occasional notes are added in [square brackets].)

Daniel Hook – Perspectives on Data Management
Remarkably little has changed in how we store research data since the days of wax tablets. Data is highly volatile: much thrown out or just lost.

Currently all kudos in publication; none in producing a good reusable dataset. Will never get to be a professor because you’ve curated a beautiful dataset.

UK – all scoring and allocation of money based on journal publication – terrible for engineers who are used to publishing in conferences [and presumably humanities who publish in books]. Engineers tried to game this by switching to publishing in journals, but got terrible citation counts because it split the publication sites and didn’t suit global publication norms for the field.

“The Impact Agenda” – various governments trying to quantify (socioeconomic) impact. Sometimes of course this impact occurs years/decades later. And the impact agenda forces us down some odd routes.

REF Impact Case Studies – http://impact.ref.ac.uk/CaseStudies/search1.aspx – includes API and mapped to NZ subjects.

Publish a paper, want to track:

  • Academic Attention (job; paper, public talk)
  • Popular Attention (Twitter, social media, news)
  • Enablers (equipment, clinical trials, funding)
  • Routes to Impact (patent, drug discovery, policy)

Ages of research: individual; institutional; national; international. [citation] This makes it difficult to know who owns the research/data.
Daniel’s take: unregulated era; era of evaluation; era of collaboration; era of impact. This last is actually a step backwards, especially because typically driven by national governments so at odds with globalisation of research community.

Big data – a good way to get grants… However while it’s the biggest buzzword, it’s not the biggest problem. It’s challenging but a known problem and a funded problem.

Small data is the bigger problem. (aka “the long tail” problem). Anyone can create small data, and it’s mostly stored in a pen drive in someone’s desk, rarely well backed up. Some attempted solutions (figshare, DataDryad, etc) but thin end of the wedge.

Three university types:

  • just want to tick the box of government mandates – so data still locked away
  • archival university – want to preserve in archival format. Expensive infrastructure, especially to curate
  • aggressively open – making data openly available at point of creation.


  • capture early
  • capture everything
  • share early
  • structure the data

Nick Jones (NeSI)
Want to pool resources nationally to support researchers

Elephants in the room:

  • local and rare (unique datasets; funded by research grants)
  • shared but rare (facilities: Hubble/CERN; national and international funding)
  • local but ubiquitous (offices, desktops: institutional funding)
  • shared and ubiquitous (TelCos: commercial investment)

Challenge to build a connecting e-infrastructure. [from diagram by Rhys Francis]

Various ways of breaking down the research lifecycle and translating this into desirable services. Often no or little connectivity between such services. Cf EUDAT

Three Vs: Volume (issues include bandwidth, storage space; cleaning; computationally intensive); velocity (ages quickly); variety. [from NCDS Roadmap July 2012]

Europe doing a fair bit. US just starting to think about a “The National Data Service”. Canada has Research Data Canada which has done a gap analysis. 7 recommendations from the Research Data Alliance including “Do require a data plan” but also “Don’t regulate what we don’t yet understand” [The Data Harvest].

eResearch challenges in New Zealand

  1. skills lag
  2. research communities – strategy needs to fit with needs of different disciplines
  3. aligning incentives
  4. future infrastructure

NeSI is addressing this last by putting in a case with govt. Want a consultative process over this year, case to govt by Q3. Others also stepping up, including CONZUL – good to see developing around this.

Mark Costello: Benefits of Data Publication

People once got angry about being asked to put their stuff up on the internet [still do!] Now get angry when stuff they put up gets used in others’ research – don’t see it as having published but having put it up in a kind of shop window.

Why make data available?

  • better science (independent validation; supplementation with additional data; re-use for novel purposes)
  • save costs (don’t need to collect again; some data can’t be collected again; don’t need to deal with data requests)
  • discourage misconduct
  • [later mentions increasing profile of researcher / institution, encouraging others to come work there]

Cf system when discover new species – you have to deposit specimen to a museum and can’t publish without an accession number.

Talks about data publishing, not data sharing: people understand the word ‘publication’ and its attendant rights. Well understood process; provides access, archiving and citation; deals with IP; no liability for misuse; increases quality assurance; meritorious for scientists.

Could publish on website, institutional repository, journal, data centre. Considerations: permanence, quality checks, standards, peer review. Journals do peer-review but don’t necessarily follow standards. Data centres follow standards but don’t necessarily have peer review.

What about publication models? Need an editorial management system; archiving/DOIs; access/discovery tools; open access but who pays cost?

Need a convergence between people with IT skills to manage a data centre and people with editorial/publishing process skills to get a viable data publication process.

Q: What would incentivise data publication?
A: DOIs, peer review. Peer review isn’t actually that difficult: look at the metadata, run statistics to check columns add up, etc. Much can be automated.
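
[A rough sketch of the kind of automated check mentioned in that answer, under my own assumptions: the required metadata fields, the column names and the sample data are invented for illustration and don't come from the talk.]

```python
import csv
import io

REQUIRED_METADATA = ["title", "creator", "licence"]  # hypothetical minimum set


def missing_metadata(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_METADATA if not metadata.get(f)]


def rows_where_total_is_wrong(csv_text: str, parts: list, total: str) -> list:
    """Return 1-based data row numbers where the part columns do not sum to the total column."""
    bad = []
    for n, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        if sum(int(row[p]) for p in parts) != int(row[total]):
            bad.append(n)
    return bad


# Invented sample: the second data row does not add up (4 + 1 != 6).
sample = "site,adults,juveniles,total\nA,3,2,5\nB,4,1,6\n"
metadata_problems = missing_metadata({"title": "Bird counts", "creator": ""})
bad_rows = rows_where_total_is_wrong(sample, ["adults", "juveniles"], "total")
print(metadata_problems, bad_rows)
```

[Checks like these only catch mechanical problems; whether the data is any good for its purpose still needs a human reviewer.]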

Ross Wilkinson (ANDS)
Trends in research data:

  • becoming a first class research output – sometimes even pre-eminent
  • valuable
  • a commodity in international research (cf rise of Research Data Alliance)
  • can be made more valuable eg moving from data (unmanaged, disconnected, invisible, single use) to structured collections (managed, connected, findable, reusable)

Funders are seeing data as publishable and expect it to be managed.

Role for research institutions: data used to be considered a researcher problem, dealt with as project costs, now increasingly seen as institutional assets. Why should unis care? Reputation is important to research institutions. Libraries can contribute – well known for collection so creating world-class data collections can help a library build institution’s reputation. Institution responsibilities in policy, management support, infrastructure, asset management.

National plans – national consensus is being developed – need to make sure we also develop national coherence. National licensing schemes; preservation as a service. Partnerships between researchers and information professionals [and ITS].

Need institutional answer to what support is available


  • Data identification – DataCite
  • Researcher identification – ORCID
  • Publication identification
  • Project id
  • Institution id
  • Funder id

Discovery, eg Research Data Australia

Data use/reuse – need high reliability data services and data computation. Need partnerships – including internationally as no one jurisdiction can afford all needed.

Q: How much are shared infrastructures a going concern internationally?
A: Would need to be a government-to-government deal. Aus/NZ makes sense – close in culture and miles and would save costs.

Penny Carnaby: Creating a shared research data infrastructure in New Zealand

Data deluge, digital landfill => unacceptable loss: “digital dark ages”, “digital amnesia”: it doesn’t make sense to invest in creating data without a strategy to make sure we don’t lose it.

Let’s work with Australians rather than reinventing the wheel.

In last year have produced four great reports:

  • Lincoln Hub Data and Information Architecture Project
  • Digital assets – mitigating the risk
  • Harnessing the economic and social power of data
  • eResearch challenges in New Zealand discussion document

All saying similar things and we don’t need to analyse problem anymore – need to get on and do something. First we need leadership and direction. Have been trying to gather people across the sector towards a case for government. Need to engage both Science New Zealand and Universities New Zealand.

Lincoln Hub DATA2 project’s goal to ensure: “Data kept over the long term, is easily discoverable, is available, as appropriate, for reuse and replication and that our infrastructure makes it easy for researchers to collaborate and share data for mutual benefit”. Project is collaboration (among others) between land-based CRIs and Lincoln University. Involves development of shared facilities but also provides catalyst for jointly tackling data management issues.

Need to invest in future researchers too – eg including data literacy in curriculum.

Steve Knight: Digital preservation as a service
Digital preservation is “despite the obsolescence of everything”. Not backup and recovery – these are only short-term concerns. Not about (open) access. Not an afterthought.

Loss stories – BBC Domesday project on 12″ videodiscs which now can’t be read. Engineer getting request for data stored on 7″ floppy, drive no longer exists. Footage stored on videos that can’t be read so needs to be reshot.

NZ: National Library was development partner for Rosetta, now leveraged by Archives New Zealand. Content being preserved includes images, cartoons, Paper Plus, Sibelius music files, etc etc. Permanent repository has 136TB; half a billion files of web data. An object could be an image or a 3000pg website – lots of variety.

But we don’t know how much stuff is being created in NZ and how much might be of long-term value. Need an audit to start working out how we can make decisions re value. Given economies of scale, NZ has opportunity for digital preservation at a national level.

Copyright and privacy are both broken, which is a blocker for a high-functioning digital environment in NZ. Even bigger issues are policy frameworks, national research infrastructure (we have some but being told to monetise and being told to be open access…)

Q: What about research data?
A: They deal in formats. 40-60% is out in Excel spreadsheet. Also need to decide who does what? Institutional responsibility?

Alison Stringer:
Steps for civil servants to open data:

  • Find the data
  • judicial check
  • process data
  • publish

Creative Commons requirement in NZGOAL means need to get permission from every agency involved (regional councils, universities) – could take half a year with back-and-forths. Would love a system where they don’t have to get all these permissions… Great thing about NZGOAL is don’t have to get a legal opinion for every dataset. Can use CC only – not CC0 or PD mark so not ideal.

MBIE contestable funding – wants its investments to meet minimum expectations of good data management and availability to public; contractors receiving new funding should provide open access to all copyright works developed using MBIE funding. One thing to have a policy, another to enable/guide/monitor/enforce.

Marsden contract terms not publicly available but have in past required all findings to be made public including data, metadata, samples, publications. No guidance/monitoring/enforcement.

NZ is in top 5 for government open data surveys.
No similar research on research data. 24 countries include data in open access mandates.
Open data leads to higher expectations – in organisation, with users, with stakeholders. Eg sharing code.

Hopes that improving things at MfE drives improvements across the system.

Once it’s public, what next?

  • quality
  • standards
  • systems
  • collaborations

[A couple of compare/contrast quotes]
“Metadata is a love note to the future.”
“Never trust anyone who’s enthusiastic about metadata.”

[Here my battery died so notes get skimpier]

Open standards – if there are 5 methods for gathering data, agree on 1
“Systems built for open data look a lot like reproducible research systems”

At the Lincoln meeting people came up with principles:

  • managed as assets
  • agreed open standards
  • treated as research outputs (eg for PBRF)
  • treated as research artefacts
  • shared infrastructure, tools, expertise (but don’t wait for everyone to get together on something or nothing will get started)
  • publicly available
  • open license
  • open format

Discussion follows re carrots/sticks and who will take the lead – MBIE seems to be stepping up a bit.

Fabian commented that NSF required a data management plan – it could be “We’ll store the data on a thumb drive in our desk drawer” but at least this then acted as an audit of the issues.

5-10 ministries are doing lots with data. Others might have more confidential data. Also you often don’t hear what’s being done until after it’s been done.

Panel discussion
Penny asked what we think is needed. Answers included:

  • policy of mandate
  • fix PBRF to include datasets as research outputs. Discussion followed. Fabiana thought that datasets can be included but that panels don’t value them [and researchers/admin staff don’t know they can be included]. Someone else [apologies for forgetting who] thought that a data journal article could be included, but a dataset qua dataset couldn’t [which matches my impression].
  • resource and reward data management
  • UK’s Ref2020 policy (analogous to PBRF) is that if a journal article isn’t OA, it can’t be included
  • make citation easier
  • registries are needed – we need to take an all-of-NZ approach
  • solve the NZGOAL / commercialisation dichotomy where institutions are told they have to make research open access but they also have to commercialise it and make a financial return
  • lock a percentage of overhead to data management (or give a certain percentage more if the project has a data management plan, or withhold a percentage until data management performance has been proved)
  • define ‘datasets of national significance’ / establish a mechanism to identify new such datasets
  • leadership: eResearch2020 is stepping into this area – joint governance group comprised of NeSI, REANNZ, NZGL

Social media as an agent of socio-economic change #vala14 p2

Johan Bollen Social media as an agent of socio-economic change: analytics and applications

World we live in increasingly about online connections. First computer had 1KB RAM and programmable by BASIC. Now can wake up parents in Belgium by FaceTime. Data from 2012 2.4billion internet users worldwide (15.6% Africa to 78.6% North America, 67.6% Oceania/Australia). Amount of online content staggering.

Facebook, LiveJournal, Twitter… We’re not using these networks to broadcast – they’re to collaborate socially. Many-to-many. Generates content and establishes social relations — collaboratively.

Displays xkcd cartoon re ubiquity of phones and map of usage of Twitter and Flickr. Visualising languages spoken; what things are being downloaded. Using Twitter to map discussion of beer vs church. And using it to monitor outbreaks of flu.

Wikipedia using collaboration to create content. Estimize using it to predict markets.
“Prevailing pessimism about large groups collaborating in a productive manner, absent central authority, may not be justified.” From the “madness of crowds” (wacky ideas) to “the wisdom of crowds”. On “Who wants to be a millionaire”, asking an expert gets it right 65%, asking the audience 91% right. When you ask people questions they have to guesstimate an answer to, “the average of two guesses from one individual was more accurate than either guess alone”.

Galton (1907), Nature, 75(1949):450-451 – aggregating people’s judgements of the weight of a dressed ox got within 1% of the true weight.
Condorcet Jury Theorem (1785) – even if jurors are individually only slightly more likely to be right than wrong, with a majority vote the chance of being right approaches unity as the jury grows.
Collective intelligence – birds flocking, ants finding food.
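
[A small worked sketch of the Condorcet Jury Theorem mentioned above, under the theorem's usual assumptions: jurors vote independently and each is right with the same probability p. The function name and the values of p are my own choices for illustration.]

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a majority of n independent jurors (n odd) is right,
    when each juror is right with probability p."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))


# Jurors only slightly better than a coin toss: the majority improves with size.
for n in (1, 11, 101, 501):
    print(n, round(majority_correct(0.55, n), 3))

# Jurors slightly worse than a coin toss: the majority gets worse with size.
for n in (1, 11, 101, 501):
    print(n, round(majority_correct(0.45, n), 3))
```

[The catch is in the assumptions: the result only helps when individuals are better than chance and their errors are independent, which is exactly what feedback loops on social media can break.]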

We have telescopes to look at huge things, microscopes to look at tiny things – we need a macroscope to look at really complex things: this is computational social science studying data generated by social media. Network analysis. Natural language processing.

Epictetus “Men are disturbed, not by things, but by the principles and notions which they form concerning things”.

Sentiment analysis. eg “Affective Norms for English Words” rated along valence, arousal and dominance, OpinionFinder, SentiWordnet. We understand individual emotions well, not so much collective emotions. Diagram charting fluctuations in collective mood based on Twitter feeds; correlating with market fluctuations – discovered that the Twitter ‘calm’ mood correlated with increase in DOW three days in advance 85%. Other results have largely confirmed this using Google trends, using dataset from LiveJournal posts.
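
[A toy sketch of lexicon-based sentiment scoring, to show the general idea rather than the method used in the research described. The word scores below are invented and are NOT real ANEW values; the function names and sample posts are also my own.]

```python
# Toy valence lexicon with invented scores on a 1-9 scale (not real ANEW values).
TOY_VALENCE = {"happy": 8.0, "calm": 7.0, "worried": 3.0, "angry": 2.0}


def mean_valence(text: str):
    """Average valence of the lexicon words found in text, or None if none match."""
    scores = [TOY_VALENCE[w] for w in text.lower().split() if w in TOY_VALENCE]
    return sum(scores) / len(scores) if scores else None


def daily_mood(posts_by_day: dict) -> dict:
    """Average the per-post scores for each day, skipping posts with no lexicon words."""
    mood = {}
    for day, posts in posts_by_day.items():
        scores = [s for s in map(mean_valence, posts) if s is not None]
        mood[day] = sum(scores) / len(scores) if scores else None
    return mood


posts = {
    "mon": ["feeling calm and happy today", "nothing to report"],
    "tue": ["worried about the markets", "angry and worried"],
}
mood = daily_mood(posts)
print(mood)
```

[A daily series like this is what would then be compared against something else, such as market movements a few days later.]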

Where does collective emotion come from? Is it more than the sum of individual emotions? Do sad people flock together or do they make each other sad? Homophily (birds of a feather) prevalent in social networks. People connected to lots of people tend to be connected to other people who are connected to lots of people. (Ie the popular kids hang out with each other.) Image of political homophily on Twitter. So does mood act in the same way? Looked at reciprocal following on Twitter. Found small cluster of negative-emotion users, and larger cluster of positive-emotion users. (Don’t know where causation is.) The closer the friendship, the more reliable this was.

Application to bibliometrics: got rejected from journals so published on arXiv and got massively read and within a month cited. So looked at arXiv papers and found a weak correlation between Twitter mentions and early citations. But the problem with altmetrics: the biggest nodes are the media, big blogs etc. The number of mentions doesn’t matter as much as who is mentioning.

Radical proposal for funding science (developed over alcohol-fueled Christmas party grumps about writing funding proposals). (Motto: “What would the aliens say?”) Fund people not projects. Science as gift-economy. Encourage innovation. Change scholarly incentives for the better. Congress should give money to scientific community – every scientist gets an equal chunk, but you have to donate a certain percentage to anyone you want (who have to donate a percentage of what they’ve received). Would lead to an uneven “but fair” distribution. [My criticism: would be susceptible to issues of implicit bias against women, people of colour, etc. However don’t know if it’d be more or less susceptible to these problems than the current system is.] Ran a simulation using network data: when F=0.5 it matches the distribution by the NSF and NIH.
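
[A rough sketch of how the proposal's mechanics could be simulated. This is not the speaker's simulation, which used network data: here recipients are picked at random using invented "esteem" weights, each scientist gives their whole donation to a single recipient, and all parameter names and values are assumptions.]

```python
import random


def simulate(n_scientists=100, base=100.0, fraction=0.5, years=20, seed=0):
    """Return each scientist's funding in the final simulated year."""
    rng = random.Random(seed)
    # Hypothetical "esteem" weights deciding who tends to receive donations.
    weights = [rng.paretovariate(1.5) for _ in range(n_scientists)]
    received = [base] * n_scientists  # year 0: everyone gets only the equal chunk
    for _ in range(years):
        incoming = [0.0] * n_scientists
        for donor in range(n_scientists):
            # Each scientist must pass on a fixed fraction of last year's funding.
            pot = fraction * received[donor]
            recipient = rng.choices(range(n_scientists), weights=weights)[0]
            while recipient == donor:  # you cannot donate to yourself
                recipient = rng.choices(range(n_scientists), weights=weights)[0]
            incoming[recipient] += pot
        # This year's funding: the equal chunk plus whatever colleagues passed on.
        received = [base + incoming[i] for i in range(n_scientists)]
    return received


funding = simulate()
print(min(funding), max(funding))
```

[Nobody ever drops below the equal chunk, while the donated share piles up unevenly on whoever the community favours. That is the uneven “but fair” shape, and also where my worry about implicit bias comes in.]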

Q: Risk of feedback loops?
A: Yes – citing hacking of Twitter account to post about bombs in White House leading to massive market shorting – not just people getting freaked out, algorithms getting freaked out. Positive feedback loops bad news – hopefully can set up things so instead you’ll get negative feedback loops that lead to homeostasis. Can only mitigate problems by understanding how things work.

Big data, little data, no data – Christine Borgman #vala14 #p1

Big data, little data, no data: scholarship in the networked world

Technological advances in mediated communication – have gone from writing to computers to social media and these are cumulative: we use all of these concurrently. And increasingly thinking of these in terms of data. Need to think about new infrastructures because this will determine what will be there for tomorrow’s students/librarians/archivists.

Australia notable for ANDS, and for movements to open access policies – only place she’s found where managing data is part of (ARC’s) Code for the Responsible Conduct of Research.

Book coming out late 2014/early 2015 – data and scholarship; case studies in data scholarship; data policy and practice. Organised around “provocations”:

  • How do rights, responsibilities, and risks around research data vary by disciplines and stakeholders?
  • How can data be exchanged across domains, contexts, time?
  • How do publication and data differ?
  • What are scholars’ motivations to share?
  • What expertise is needed to manage research data?
  • How can knowledge infrastructures adapt to the needs of scholars and demands of stakeholders?

Until the first journal in 17th century, scholars communicated by private letters. Journals were the beginning of peer review, of opening up knowledge beyond those privileged to exchange letters. –However things began much earlier: brick from 5th-6th century inscribed with Sutra on Dependent Origination. Now we have complete open access in PLOS One. (Shows If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology.) Lots of journals, preprint servers, institutional repositories to submit to.

Publishing (including peer review) serves to legitimise knowledge; to disseminate it; and to provide access, preservation and curation.

Open access means many things – uses Suber’s “digital, online, free of charge, and free of most copyright and licensing restrictions” definition.

ANDS model of “more Australian researchers reusing research data more often”. Moving from unmanaged, disconnected, invisible, single-use data to managed, connected, findable, reusable data.

Open data has even more definitions: Open Data Commons “free to use, reuse and redistribute”; Royal Society says “accessible, useable, assessable, intelligible”. OECD has 13 conditions. People don’t agree because data’s really messy!

Data aren’t publications
When data’s created it’s not clear who owns it – field researcher, funder, instrument, principal investigator?
Papers are arguments – data are evidence.
Few journals try to peer review data. Some repositories do but most just check the format.

Data aren’t natural objects
What are data? Most places list possibilities; few define what is and isn’t data. Marie Curie’s notebook? A mouse? A map or figure? An astronomical photo – which the public loves, but astronomers don’t agree on what the colours actually mean… 3D figure in PDF (if you have the exact right version of Adobe Acrobat). Social science data where even when specifically designed to share it’s full of footnotes telling you which appendices to read to understand how the questions/methods changed over time…

Data are representations
“Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.”

You think you have problems on catalogue interoperability, try looking at open ontologies intersecting different communities.

Data sharing and reuse depends on infrastructure
You don’t just build an infrastructure and you’re done. They’re complex, interact with communities. Huge amount of provenance important to make sense of data down the line.

Data management is difficult – scholars have a hard enough time managing it for their own reuse let alone someone else’s reuse. Need to think about provenance, property rights, different methods, different theoretical perspectives, “the wonderful thing about standards is there’s so many to choose from”.

Ways to release data:

  • contribute to archive
  • attach to journal article
  • post on local website
  • license on request
  • release on request

These last ones are very effective because people are talking to each other and can exchange tacit knowledge — but it doesn’t scale. The first scales but only works for well-structured and organised data.

So what are we trying to do? Reuse by investigator, collaborators, colleagues, unaffiliated others, future generations/millennia? These are very different purposes and commitments.

Traditional economics (1950s) was based on physical goods – supply and demand. But this doesn’t work with data. Public/private goods distinction doesn’t work with information. There’s no rivalry around the sunset or general knowledge in the way there is around a table or book. So concept of “common pool resources” – libraries, data archives – where goods must be governed.

                      | Low subtractability/rivalry | High subtractability/rivalry
Exclusion difficult   | public goods                | common pool resources
Exclusion easy        | toll or club goods          | private goods

While data are unstructured and hard to use they’re private goods. Are we investing to make them toll goods, common pool resources or public goods?

Need to make sustainability decisions – what to keep, why, how, how long, who will govern them, what expertise required?

Q: Health sciences doing well
A: Yes but representation issues. Attempt to outsource mammogram readings fell foul of huge amounts of tacit knowledge required. In genomics attempts to get scientists and drug companies to work together in the open, but complicated situation with journals who say that because the data is out there it’s prior publication when in fact the paper is explaining the science behind it; and issues around (misleading) partial release of data – recommends Goldacre’s Bad Pharma.

Q: Scientists want to know who they’re giving data to. But maybe data citation a way to get scientists on board?
A: Citing data as incentive is a hypothesis. Really sharing data is a gift – if you put it on a repository you don’t have it available to trade to collaborators, funders, new universities. Data as dowry: people getting hired because they have the data.
Agreeing on the citable unit is hard – some people would have a DOI on every cell, others would have a footnote “OECD”. Citation isn’t just about APA vs Blue Book, it’s about citable unit and who gets credit and….