Towards 2020: Making the Most of Your Research Data #DSShowcase

Summary: One of the themes coming out of the day was that the New Zealand research sector is in a unique position to collaborate to develop the policies, infrastructure and best practices we need to treat our research data as the valuable asset it is. It’s really noticeable how each time we get together for talks like this, there’s been more progress made towards this ideal.

(Day hosted by Digital Science. Notes below are ‘live-blog’-style; my occasional notes are added in [square brackets].)


Daniel Hook – Perspectives on Data Management
Remarkably little has changed in how we store research data since the days of wax tablets. Data is highly volatile: much thrown out or just lost.

Currently all kudos in publication; none in producing a good reusable dataset. Will never get to be a professor because you’ve curated a beautiful dataset.

UK – all scoring and allocation of money based on journal publication – terrible for engineers who are used to publishing in conferences [and presumably humanities who publish in books]. Engineers tried to game this by switching to publishing in journals, but got terrible citation counts because it split the publication sites and didn’t suit global publication norms for the field.

“The Impact Agenda” – various governments trying to quantify (socioeconomic) impact. Sometimes of course this impact occurs years/decades later. And the impact agenda forces us down some odd routes.

REF Impact Case Studies – http://impact.ref.ac.uk/CaseStudies/search1.aspx – includes API and mapped to NZ subjects.

Publish a paper, want to track:

  • Academic Attention (job; paper, public talk)
  • Popular Attention (Twitter, social media, news)
  • Enablers (equipment, clinical trials, funding)
  • Routes to Impact (patent, drug discovery, policy)

Ages of research: individual; institutional; national; international. [citation] This makes it difficult to know who owns the research/data.
Daniel’s take: unregulated era; era of evaluation; era of collaboration; era of impact. This last is actually a step backwards, especially because typically driven by national governments so at odds with globalisation of research community.

Big data – a good way to get grants… However while it’s the biggest buzzword, it’s not the biggest problem. It’s challenging but a known problem and a funded problem.

Small data is the bigger problem (aka “the long tail” problem). Anyone can create small data, and it’s mostly stored on a pen drive in someone’s desk, rarely well backed up. Some attempted solutions (figshare, DataDryad, etc) but thin end of the wedge.

Three university types:

  • just want to tick the box of government mandates – so data still locked away
  • archival university – want to preserve in archival format. Expensive infrastructure, especially to curate
  • aggressively open – making data openly available at point of creation.

Manifesto:

  • capture early
  • capture everything
  • share early
  • structure the data

Nick Jones (NeSI)
Want to pool resources nationally to support researchers

Elephants in the room:

  • local and rare (unique datasets; funded by research grants)
  • shared but rare (facilities: Hubble/CERN; national and international funding)
  • local but ubiquitous (offices, desktops: institutional funding)
  • shared and ubiquitous (TelCos: commercial investment)

Challenge to build a connecting e-infrastructure. [from diagram by Rhys Francis]

Various ways of breaking down the research lifecycle and translating this into desirable services. Often no or little connectivity between such services. Cf EUDAT

Three Vs: Volume (issues include bandwidth, storage space; cleaning; computationally intensive); velocity (ages quickly); variety. [from NCDS Roadmap July 2012]

Europe doing a fair bit. US just starting to think about a “National Data Service”. Canada has Research Data Canada which has done a gap analysis. 7 recommendations from the Research Data Alliance including “Do require a data plan” but also “Don’t regulate what we don’t yet understand” [The Data Harvest].

eResearch challenges in New Zealand

  1. skills lag
  2. research communities – strategy needs to fit with needs of different disciplines
  3. aligning incentives
  4. future infrastructure

NeSI is addressing this last by putting in a case with govt. Want a consultative process over this year, case to govt by Q3. Others also stepping up, including CONZUL – good to see developing around this.


Mark Costello: Benefits of Data Publication

People once got angry about being asked to put their stuff up on the internet [still do!] Now get angry when stuff they put up gets used in others’ research – don’t see it as having published but having put it up in a kind of shop window.

Why make data available?

  • better science (independent validation; supplementation with additional data; re-use for novel purposes)
  • save costs (don’t need to collect again; some data can’t be collected again; don’t need to deal with data requests)
  • discourage misconduct
  • [later mentions increasing profile of researcher / institution, encouraging others to come work there]

Cf system when discover new species – you have to deposit specimen to a museum and can’t publish without an accession number.

Talks about data publishing, not data sharing: people understand the word ‘publication’ and its attendant rights. Well understood process; provides access, archiving and citation; deals with IP; no liability for misuse; increases quality assurance; meritorious for scientists.

Could publish on website, institutional repository, journal, data centre. Considerations: permanence, quality checks, standards, peer review. Journals do peer-review but don’t necessarily follow standards. Data centres follow standards but don’t necessarily have peer review.

What about publication models? Need an editorial management system; archiving/DOIs; access/discovery tools; open access but who pays cost?

Need a convergence between people with IT skills to manage a data centre and people with editorial/publishing process skills to get a viable data publication process.

Q: What would incentivise data publication?
A: DOIs, peer review. Peer review isn’t actually that difficult: look at the metadata, run statistics to check columns add up, etc. Much can be automated.
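
A rough illustration of the kind of check that could be automated, as a minimal sketch rather than anything shown in the talk. The function name `check_dataset`, the metadata field names and the column names are all made up for the example; it assumes the dataset is a small CSV whose part columns should sum to a total column.

```python
import csv
import io


def check_dataset(csv_text, required_metadata, metadata,
                  total_column=None, part_columns=()):
    """Return a list of human-readable problems found in a small CSV dataset.

    All names here are illustrative; a real data centre would check against
    its own metadata standard.
    """
    problems = []

    # 1. Look at the metadata: every required field must be present and non-empty.
    for field in required_metadata:
        if not metadata.get(field):
            problems.append(f"missing metadata field: {field}")

    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        problems.append("dataset has no rows")
        return problems

    parts_label = " + ".join(part_columns)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        # 2. Flag empty cells and rows that do not match the header.
        for name, value in row.items():
            if name is None:
                problems.append(f"line {line_no}: more cells than header columns")
            elif value is None or value.strip() == "":
                problems.append(f"line {line_no}: empty value in column '{name}'")

        # 3. Check that the part columns add up to the total column.
        if total_column:
            try:
                parts = sum(float(row[c]) for c in part_columns)
                total = float(row[total_column])
            except (TypeError, ValueError, KeyError):
                problems.append(f"line {line_no}: missing or non-numeric count")
                continue
            if abs(parts - total) > 1e-9:
                problems.append(
                    f"line {line_no}: {parts_label} = {parts:g}, "
                    f"but {total_column} = {total:g}"
                )
    return problems


if __name__ == "__main__":
    sample = "site,juveniles,adults,total\nA,3,4,7\nB,2,5,8\n"
    info = {"title": "Example survey", "creator": ""}
    for problem in check_dataset(sample, ["title", "creator", "licence"], info,
                                 total_column="total",
                                 part_columns=["juveniles", "adults"]):
        print(problem)
```

Checks like these only catch mechanical problems; whether the data is scientifically sound still needs a human reviewer.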


Ross Wilkinson (ANDS)
Trends in research data:

  • becoming a first class research output – sometimes even pre-eminent
  • valuable
  • a commodity in international research (cf rise of Research Data Alliance)
  • can be made more valuable eg moving from data (unmanaged, disconnected, invisible, single use) to structured collections (managed, connected, findable, reusable)

Funders are seeing data as publishable and expect it to be managed.

Role for research institutions: data used to be considered a researcher problem, dealt with as project costs, now increasingly seen as institutional assets. Why should unis care? Reputation is important to research institutions. Libraries can contribute – well known for collection so creating world-class data collections can help a library build institution’s reputation. Institution responsibilities in policy, management support, infrastructure, asset management.

National plans – national consensus is being developed – need to make sure we also develop national coherence. National licensing schemes; preservation as a service. Partnerships between researchers and information professionals [and ITS].

Need institutional answer to what support is available

Connectivity:

  • Data identification – DataCite
  • Researcher identification – ORCID
  • Publication identification
  • Project id
  • Institution id
  • Funder id
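
Identifiers like these are only useful for connecting records if they are recorded correctly. As a small sketch (not something presented in the talk), an ORCID iD can be sanity-checked before it is stored: its final character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The function names below are my own, and the check only confirms the iD is well formed, not that it is registered to anyone.

```python
import re

# Sixteen characters in four hyphen-separated groups; the last may be 'X'.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string is a well-formed ORCID iD with a correct check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


if __name__ == "__main__":
    for candidate in ["0000-0002-1825-0097", "0000-0002-1825-0098"]:
        print(candidate, is_valid_orcid(candidate))
```

A check like this catches typos at the point of data entry, which is cheaper than untangling mismatched researcher records later.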

Discovery, eg Research Data Australia

Data use/reuse – need high reliability data services and data computation. Need partnerships – including internationally as no one jurisdiction can afford all needed.

Q: How much are shared infrastructures a going concern internationally?
A: Would need to be a government-to-government deal. Aus/NZ makes sense – close in culture and miles and would save costs.


Penny Carnaby: Creating a shared research data infrastructure in New Zealand

Data deluge, digital landfill => unacceptable loss: “digital dark ages”, “digital amnesia”: it doesn’t make sense to invest in creating data without a strategy to make sure we don’t lose it.

Let’s work with Australians rather than reinventing the wheel.

In last year have produced four great reports:

  • Lincoln Hub Data and Information Architecture Project
  • Digital assets – mitigating the risk
  • Harnessing the economic and social power of data
  • eResearch challenges in New Zealand discussion document

All saying similar things and we don’t need to analyse problem anymore – need to get on and do something. First we need leadership and direction. Have been trying to gather people across the sector towards a case for government. Need to engage both Science New Zealand and Universities New Zealand.

Lincoln Hub DATA2 project’s goal to ensure: “Data kept over the long term, is easily discoverable, is available, as appropriate, for reuse and replication and that our infrastructure makes it easy for researchers to collaborate and share data for mutual benefit”. Project is collaboration (among others) between land-based CRIs and Lincoln University. Involves development of shared facilities but also provides catalyst for jointly tackling data management issues.

Need to invest in future researchers too – eg including data literacy in curriculum.


Steve Knight: Digital preservation as a service
Digital preservation is “despite the obsolescence of everything”. Not backup and recovery – these are only short-term concerns. Not about (open) access. Not an afterthought.

Loss stories – BBC Domesday project on 12″ videodiscs which now can’t be read. Engineer getting request for data stored on 7″ floppy, drive no longer exists. Footage stored on videos that can’t be read so needs to be reshot.

NZ: National Library was development partner for Rosetta, now leveraged by Archives New Zealand. Content being preserved includes images, cartoons, Paper Plus, Sibelius music files, etc etc. Permanent repository has 136TB; half a billion files of web data. An object could be an image or a 3000pg website – lots of variety.

But we don’t know how much stuff is being created in NZ and how much might be of long-term value. Need an audit to start working out how we can make decisions re value. Given economies of scale, NZ has opportunity for digital preservation at a national level.

Copyright and privacy are both broken, which is a blocker for a high-functioning digital environment in NZ. Even bigger issues are policy frameworks, national research infrastructure (we have some but being told to monetise and being told to be open access…)

Q: What about research data?
A: They deal in formats. 40-60% is in Excel spreadsheets. Also need to decide who does what? Institutional responsibility?


Alison Stringer:
Steps for civil servants to open data:

  • Find the data
  • judicial check
  • process data
  • publish

Creative Commons requirement in NZGOAL means need to get permission from every agency involved (regional councils, universities) – could take half a year with back-and-forths. Would love a system where they don’t have to get all these permissions… Great thing about NZGOAL is don’t have to get a legal opinion for every dataset. Can use CC only – not CC0 or PD mark so not ideal.

MBIE contestable funding – wants its investments to meet minimum expectations of good data management and availability to public; contractors receiving new funding should provide open access to all copyright works developed using MBIE funding. One thing to have a policy, another to enable/guide/monitor/enforce.

Marsden contract terms not publicly available but have in past required all findings to be made public including data, metadata, samples, publications. No guidance/monitoring/enforcement.

NZ is in top 5 for government open data surveys.
No similar research on research data. 24 countries include data in open access mandates.
Open data leads to higher expectations – in organisation, with users, with stakeholders. Eg sharing code.

Hopes that improving things at MfE drives improvements across the system.

Once it’s public, what next?

  • quality
  • standards
  • systems
  • collaborations

[A couple of compare/contrast quotes]
“Metadata is a love note to the future.”
“Never trust anyone who’s enthusiastic about metadata.”

[Here my battery died so notes get skimpier]

Open standards – if there are 5 methods for gathering data, agree on 1
“Systems built for open data look a lot like reproducible research systems”

At the Lincoln meeting people came up with principles:

  • managed as assets
  • agreed open standards
  • treated as research outputs (eg for PBRF)
  • treated as research artefacts
  • shared infrastructure, tools, expertise (but don’t wait for everyone to get together on something or nothing will get started)
  • publicly available
  • open license
  • open format

Discussion follows re carrots/sticks and who will take the lead – MBIE seems to be stepping up a bit.

Fabiana commented that NSF required a data management plan – it could be “We’ll store the data on a thumb drive in our desk drawer” but at least this then acted as an audit of the issues.

5-10 ministries are doing lots with data. Others might have more confidential data. Also you often don’t hear what’s being done until after it’s been done.


Panel discussion
Penny asked what we think is needed. Answers included:

  • policy or mandate
  • fix PBRF to include datasets as research outputs. Discussion followed. Fabiana thought that datasets can be included but that panels don’t value them [and researchers/admin staff don’t know they can be included]. Someone else [apologies for forgetting who] thought that a data journal article could be included, but a dataset qua dataset couldn’t [which matches my impression].
  • resource and reward data management
  • UK’s Ref2020 policy (analogous to PBRF) is that if a journal article isn’t OA, it can’t be included
  • make citation easier
  • registries are needed – we need to take an all-of-NZ approach
  • solve the NZGOAL / commercialisation dichotomy where institutions are told they have to make research open access but they also have to commercialise it and make a financial return
  • lock a percentage of overhead to data management (or give a certain percentage more if the project has a data management plan, or withhold a percentage until data management performance has been proved)
  • define ‘datasets of national significance’ / establish a mechanism to identify new such datasets
  • leadership: eResearch2020 is stepping into this area – joint governance group comprised of NeSI, REANNZ, NZGL