
Removing barriers to sharing for the benefit of Māori #FigshareFestNZ

Data for whom? Removing barriers to sharing for the benefit of Māori
Dr Kiri Dell; Ngapera Riley (CEO, Figure.NZ)

Ref Decolonising Methodologies by Linda Tuhiwai Smith

The academy privileges a certain type of knowing, but indigenous people have other ways as well (which we all use to some degree) eg

  • Sense perception – I felt it
  • Imagination – I envisioned it
  • Memory – I remembered it
  • Inherited – My nanny told me
  • Faith – God told me

Example of using data badly: Motu (an economic research centre) put out research a few years ago comparing Māori and other ethnicities and concluding that collectivist beliefs were holding Māori back from economic success. This was not taken well by Māori… The researchers made sweeping statements about Māori culture in areas where they had no research; compared Māori to completely different groups (eg African Americans, Chinese) with different histories and belief systems; and interpreted everything through a white male lens.

Basho haiku
Pull the wings off a dragonfly and look – you get a red pepperpod!
vs
Add wings to a pepperpod, and look – you get a red dragonfly

Figure.nz set up as charity to democratise data – aims to provide valid and ethical data. Data is important but so is context and people behind it. Draws licensed data from over a hundred sources. Exists for the benefit of Aotearoa so believe if they can get data right for Māori they can get it right for all. Kaupapa that data is for everyone not just experts.

Partnering with Te Mana Raraunga – the Māori Data Sovereignty Network; and with nine government agencies who've got a lot of data that was never meant to be shared, so navigating the benefits and dangers of sharing. How will data be used, for whom, and why?

Data is never perfect – it’s just one tool alongside experience and connections. Māori data has traditionally not been collected well – cf especially the latest census – so have to be careful about conclusions drawn.

Figure.NZ –  over 44,000 charts and datasets (CSVs and images) around people, travel, health, education, employment, economy, environment, social welfare, technology, broken down by geographic area. Very careful to publish metadata around sources etc.

The original data was a mess so they have been working hard to tidy it up. They check sources and make sure data is statistically valid (no small datasets) – there's a robust process of working with each source to make sure the metadata explains methodology and context.
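[Aside, not from the talk: as a rough illustration of the kind of small-count screening described above, a minimal sketch in Python – the threshold and column name are hypothetical, since Figure.NZ's actual rules weren't given.]

```python
import csv

MIN_COUNT = 20  # hypothetical suppression threshold; Figure.NZ's real rules weren't described

def screen_rows(path, value_column="count"):
    """Yield rows whose counts are large enough to publish; flag the rest for review."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if float(row[value_column]) >= MIN_COUNT:
                yield row
            else:
                print("Needs review (small cell):", row)

# usage: publishable = list(screen_rows("source_agency_table.csv"))
```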

Focused on public aggregate data but starting to use other sources. Wondering how to safely share research data. Excited to see people have started publishing theses etc with CC licensing.

 

Rolling out research for public good #FigshareFestNZ

The bumpy road to rolling out the Index of Multiple Deprivation
Assoc Prof Daniel Exeter

IMD looks at deprivation in different domains eg employment, education, income, housing, access to services. Most rural areas in the South Island are not very deprived, but the east and north of the North Island are. In Christchurch there are also various quintile 5 areas surrounded by quintile 3 areas – a very sharp north-west/south-east divide. Can drill down to see what's driving deprivation in each area.

Want to maximise the impact of the public good money, so put up a website and made all datasets and reports freely available, easy to download, with full attribution. But the first hurdle was getting data out of the StatsNZ IDI, who are very cautious – had to work closely with them, pointing out the outputs were highly aggregated. Then published papers with a journal which wanted the data deposited but wouldn't accept the institutional site they'd painstakingly created. [In fairness, institutions do mess up their websites frequently so it's not good for longterm storage!] So quickly created a Figshare website – one advantage is that other institutions can use the API to access this data.
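[Aside: the API they mention is the standard public Figshare API, so pulling the IMD files programmatically might look roughly like this – the search term is my assumption, not their documented query.]

```python
import requests

API = "https://api.figshare.com/v2"

# Search the public Figshare API (the search term is an assumption).
hits = requests.post(f"{API}/articles/search",
                     json={"search_for": "Index of Multiple Deprivation New Zealand"}).json()

for hit in hits[:5]:
    # Fetch full metadata, including file download URLs, for each matching item.
    article = requests.get(f"{API}/articles/{hit['id']}").json()
    for f in article.get("files", []):
        print(article["title"], "->", f["download_url"])
```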

Bumpy ride = mostly interesting challenge. Eg

  • someone creating a survey wanted to know respondents' geographical area without asking for an actual address, to make the data easier to manage. Created a tool where users could enter an address and it would be converted to the area, along with its IMD data (see the sketch after this list)
  • looking at poor oral health of children per deprivation per ethnicity
  • getting it into policy – mostly from getting emails from people asking how to interpret something. One DHB in the North Island interacts with 5 local govt authorities and asked for one IMD per authority – lots of work to recreate it at that level, but achieved
  • someone using it for climate change – heat vulnerability index including IMD, social isolation, age bands
  • noted some areas ended up not included in census which affects accuracy
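[Aside: the address-to-IMD tool in the first bullet wasn't shown, but the general shape is an address → data-zone lookup joined to the IMD table. A sketch under that assumption, with invented file and column names and a toy geocoder standing in for the real service:]

```python
import csv

def load_imd_by_zone(path="imd_by_datazone.csv"):
    """Hypothetical lookup table: data-zone ID -> IMD quintile and domain scores."""
    with open(path, newline="") as f:
        return {row["datazone"]: row for row in csv.DictReader(f)}

def imd_for_address(address, geocode, table):
    """Resolve an address to a data zone (via whatever geocoder is available) and join the IMD row."""
    zone = geocode(address)
    return zone, table.get(zone)

# usage, with a toy geocoder in place of a real one:
# table = load_imd_by_zone()
# print(imd_for_address("1 Example St, Ōtautahi", lambda addr: "7601234", table))
```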

Funded by the Health Research Council but most uptake has been through government, eg the Canterbury Wellbeing Index and the Ministry of Education; Alcohol Healthwatch lets you type in an address and get a report to support a submission against a liquor licence

Are big data and data sharing the panacea? They're fantastic but there are big issues too: attribution, ethics, the risk of data gerrymandering, the need to use theory to inform methods. Sometimes people use the IMD in ways that are useful but potentially misleading. Institutions are playing catch-up, not yet sure how to deal with this. Hopes the community being built around Figshare etc will help develop solutions and best practices.

 

Developing research data services #FigshareFestNZ

I talked about our implementation and various integrations, including glitches along the road, eg

  • being “piggy in the middle” between Figshare and our ITS trying to troubleshoot CNAME issues without knowing what a CNAME is
  • setting up emails so that Figshare system emails come from a Lincoln address – while our ITS is cracking down on phishing and Figshare emails look to our system like they’re spoofed.
  • our first attempt at an HR feed only sent through academics, not postgrads; now redoing this from scratch with new extract scripts, and Figshare is working out how to deal with the ~ we use to signify someone doesn’t have a first name
  • indexing in LibrarySearch (Primo) using the OAI-PMH feed – also harvested into data.govt.nz and DigitalNZ (a harvesting sketch follows this list)
  • Next year implementing Repository Tools 2 in Elements which will let researchers deposit datasets from there – doing pre-work to make sure we can have two user IDs in the HR feed: one to integrate with the login system, the other one to integrate with Elements
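[Aside: the Primo indexing in the list above runs off Figshare's OAI-PMH feed; a minimal harvesting sketch, assuming the usual Figshare OAI-PMH endpoint and Dublin Core metadata:]

```python
import requests
import xml.etree.ElementTree as ET

OAI = "https://api.figshare.com/v2/oai"  # assumed endpoint; set/portal filtering omitted
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

resp = requests.get(OAI, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

# Print a title and identifier for each harvested record.
for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    titles = record.findall(".//dc:title", NS)
    ids = record.findall(".//dc:identifier", NS)
    if titles and ids:
        print(titles[0].text, "->", ids[0].text)
```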

Salila Bryant from ESR talked about the background behind implementing Figshare as an institutional repository, and some of the challenges (including getting support from the other side of the world; getting researchers on board with publishing data; covering privacy, metadata requirements, etc).

Katy Miller from Victoria University of Wellington: they used to have Rosetta for everything, but researchers couldn't interact with it. Ended up choosing Figshare as it interacts with Elements; easy for academic staff to deposit; modern interface; 'just right' – looks good, well indexed, etc; a proven solution. Lots of decisions needed to be made to set it up: they thought it'd all be simple, but it turns out not to be as simple as it looks.

  • Groups – most places use academic units, but this is greatly in flux at VUW so are probably going to use publication types as the groups (eg a group for theses etc)
  • Metadata – keeping it light to focus on their key aims
  • Mediation – trying for no mediation between deposit and availability
  • Scope – starting with journal articles and conference papers as this is mainly to support OA mandate. Theses will be next (and will need to add postgrads into the system) – then think about research data down the track.
  • DOIs – grappling with how to handle situations where a preprint is published with a DOI and the published version later becomes available with a different DOI. Currently planning to use the publisher DOI field in Figshare when the published DOI becomes available (see the sketch after this list).
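[Aside: a sketch of what using that publisher DOI field might look like via the Figshare API – assuming the resource_doi/resource_title fields on the private articles endpoint; IDs and DOIs are placeholders.]

```python
import requests

API = "https://api.figshare.com/v2"
HEADERS = {"Authorization": "token YOUR_API_TOKEN"}  # placeholder personal/institutional token

def link_published_version(article_id, publisher_doi, publisher_title):
    """Record the publisher's DOI against an existing item; the preprint keeps its own Figshare DOI."""
    payload = {"resource_doi": publisher_doi, "resource_title": publisher_title}
    r = requests.put(f"{API}/account/articles/{article_id}", headers=HEADERS, json=payload)
    r.raise_for_status()

# usage (placeholders): link_published_version(1234567, "10.1234/example.2019.001", "Version of record")
```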

Laura Armstrong, University of Auckland, talking about engaging with researchers. They talk about their "data publishing and discovery service" but it still lives at https://auckland.figshare.com – currently 236 datasets in data.govt.nz; 666 DOIs minted year-to-date.

  • Researchers want DOIs (often not knowing why); discoverability (via Google); browser preview; branding/match website; metrics
  • Have group sites for different use cases eg conferences, research groups
  • Process – usually researcher requests; then discuss use case, what they want to achieve, who’s the owner, etc; then create a group site based on this, and figshare uploads branding overnight; they make it live but not public; once at least one item is published, the site is made public.
  • User types – most people only assigned to one group and can only publish there. Can set people up as admins for multiple groups, they can then publish to any of those groups. For external users, an internal person in group X can create an unpublished project and invite the external person to it: project items then get published to group X.

Simon Porter and Jared Watts, Digital Science, looking at visualisation of collaboration networks in a Jupyter Notebook using the Figshare API. Making available the raw data as CSVs, the visualisations, and a technical report, "This Poster is Reproducible".
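[Aside: their notebook isn't reproduced here, but the general approach – pull author lists from Figshare article metadata and build a co-authorship graph – might look like this, using the public API and networkx; the search term and page size are assumptions.]

```python
import itertools
import requests
import networkx as nx

API = "https://api.figshare.com/v2"

# Pull one page of public articles for an assumed search and build a co-authorship graph.
hits = requests.post(f"{API}/articles/search",
                     json={"search_for": "New Zealand", "page_size": 50}).json()

G = nx.Graph()
for hit in hits:
    article = requests.get(f"{API}/articles/{hit['id']}").json()
    authors = [a["full_name"] for a in article.get("authors", [])]
    for a, b in itertools.combinations(authors, 2):
        # Each shared item adds weight to the edge between two authors.
        weight = G.get_edge_data(a, b, default={}).get("weight", 0)
        G.add_edge(a, b, weight=weight + 1)

print(G.number_of_nodes(), "authors,", G.number_of_edges(), "collaboration links")
```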

Building research data services at NeSI #FigshareFestNZ

Building research data services at NeSI
Brian Flaherty, NeSI

NeSI is a collaboration supporting researchers to tackle large problems (ie with supercomputers). Core services around HPC, consultancy, and training. Two supercomputers, Mahuika and Māui; a couple of dozen staff covering a range of disciplines.

2011-2014 mostly about computer infrastructure with a little storage and consultancy

2014-2019 shifting

2019-future looking at research platforms, virtual labs, scientific gateways

Data management – mostly want to deal with the active data: collection, pre-processing, analysis and modelling, repurposing pre-publication.

Refreshing its offering around transferring data – looking at a national data transfer platform. Nodes at UoA, NIWA in Wellington, AgResearch in Christchurch, and Dunedin. (There’s a big need for an Australasian platform as lots of data gets sent back eg to the Garvan Institute sequencing laboratory. Lots of hard drives still being shipped.) transfer.nesi.org.nz has a point-and-click transfer interface.
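[Aside: the talk didn't say what the transfer platform is built on, but purely as an illustration of scripting a node-to-node transfer, here's a sketch using the Globus Python SDK – an assumption on my part; endpoint IDs and paths are placeholders.]

```python
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"   # placeholder
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"     # eg an institutional node (placeholder)
DST_ENDPOINT = "DEST-ENDPOINT-UUID"       # eg a NeSI node (placeholder)

# Interactive login to obtain transfer tokens.
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Visit:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

# Queue an asynchronous transfer of one directory between the two nodes.
task = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="sequencing run 42")
task.add_item("/data/run42/", "/project/genomics/run42/", recursive=True)
print("Submitted task:", tc.submit_transfer(task)["task_id"])
```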

Automated workflows: Genomics Aotearoa is doing a project sequencing taonga species, eg the kākāpō. DoC was storing the data in the cloud in Australia; Ngāi Tahu weren't happy, so it has been brought back into NZ and stored with NeSI. The Genomics Aotearoa Data Repository is being developed at NeSI – starting small ("don't try to boil the ocean"), with downloading, storing, and sharing data with group-based access control and group membership management. FAIR – so far at findable (just) and accessible (in that it's sharable) but still working towards interoperable and reusable.
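[Aside: as a trivial illustration of the group-based access control and group membership management described, a sketch with invented group and user names – the actual repository implementation wasn't shown.]

```python
# Datasets belong to groups; only members of a dataset's group may download or share it.
groups = {"kakapo-assembly": {"alice@nesi", "bob@ga"}}          # invented members
datasets = {"kakapo_ref_v1.fasta.gz": "kakapo-assembly"}        # dataset -> owning group

def can_access(user, dataset):
    group = datasets.get(dataset)
    return group is not None and user in groups.get(group, set())

def add_member(group, user):
    groups.setdefault(group, set()).add(user)

add_member("kakapo-assembly", "carol@doc")
print(can_access("carol@doc", "kakapo_ref_v1.fasta.gz"))        # True
print(can_access("mallory@example", "kakapo_ref_v1.fasta.gz"))  # False
```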

Indigenous data: a kāhui Māori is working to make sure Māori data is managed within a Māori context, so this needs to be mapped into security, auditing, permissions processes, etc. Work in progress. Underpinned by the "Te Mata Ira" guidelines for genomic research with Māori.

Increasingly researchers want the data to be stored in the same place as the compute so it doesn’t have to be transferred backwards and forwards.

Security around sensitive data: have firewalls, multifactor authentication, need to look more at privacy policy and standards around health information security frameworks etc.

Curation: “maintaining, preserving, and adding value to digital data/object through its life cycle”. Eg transforming formats, evaluating for FAIRness. Knowing what to get rid of when storage is stretched.

Metadata: creating README files – getting stuff out of people's heads. RO-Crate as a way to package it in a human- and computer-readable way.
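[Aside: a minimal sketch of what packaging a README and one data file as an RO-Crate can look like, written by hand against the RO-Crate 1.1 layout; the dataset details are invented, and in practice a library like ro-crate-py would do this more robustly.]

```python
import json
from pathlib import Path

crate = Path("my_dataset_crate")
crate.mkdir(exist_ok=True)

# Minimal RO-Crate 1.1 metadata: the descriptor entity, the root dataset, and one file.
metadata = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}, "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset",
         "name": "Example sequencing run",                    # invented
         "description": "See README.md for methods and contacts.",
         "hasPart": [{"@id": "README.md"}]},
        {"@id": "README.md", "@type": "File", "encodingFormat": "text/markdown"},
    ],
}

(crate / "README.md").write_text("# Example dataset\nCollected 2019; contact the PI for details.\n")
(crate / "ro-crate-metadata.json").write_text(json.dumps(metadata, indent=2))
```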

Open data? Perceptions of barriers to research data-sharing – Jo Simons #open17

Many aspects of open data – today focusing on research data, ie created by research projects at an institution.

The research workflow is very complex but, to really simplify: researchers start a project, get lots of data, and summarise results in journals. But what's published is not the data – it's a summary of the data with maybe a few key examples. The rest goes to places where only the researcher can access it.

Why do we care?

  • for the good of all
  • expensive to generate so want to maximise use eg validate, meta-analyses, used in different ways
  • much of it is funded by government, and therefore the taxpayer – so they should be able to access it

Used to work in a group which shared greenhouse space but had no idea what else was in there. Proposed sharing basic information about what was there and what to do in case of emergency – and was shocked when some said no. Supervisor said that'll happen, but don't let it stop you asking the question.

When requesting data, the odds of it still being extant decrease by 17% each year (Vines 2013, doi:10.1016/j.cub.2013.11.014).
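[Aside: to put that 17% in perspective, a quick back-of-envelope calculation, assuming the decline compounds yearly:]

```python
# If the odds of a dataset still being extant fall 17% per year,
# after n years they are (1 - 0.17) ** n of the original.
for years in (2, 5, 10, 20):
    print(f"after {years:2d} years: {0.83 ** years:.0%} of the original odds")
# roughly 69%, 39%, 16% and 2% respectively
```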

This is where academic libraries come in – getting the data off the USB drives. So we need to understand why researchers might not want to share. Did interviews to inform survey construction, then a survey to get info from more people: 102 responses from researchers across 10 disciplines; 18 from librarians (about a 20% response rate).

Do librarians and researchers agree on the major drivers that determine whether researchers choose to share their data?

Is data-sharing part of the research culture? Librarians: 7% said common/essential; researchers 26%

Factors influencing data-sharing

  • agreement in some areas eg ability to publish, inappropriate use, copyright and IP pretty high; then resources, interest to others, system structure and data access
  • differences: librarians thought institutional policy, system integration very important; funder policy, system usability somewhat important – all very low for researchers. What was important for researchers were: ethics (>40%); culture, research quality (10-15%); data preservation, publisher policy (5-10%)

Are there differences across major disciplines in what those drivers are?

5 disciplines with 10+ responses: business, medicine/health, phys/chem/earth; life sci/bio; soc sci/education. Ethics important for most but not a high-ranking factor for phys/chem/earth due to nature of their data. Whereas data preservation/archiving is more important for them (and med/health), somewhat important for life sci and soc sci, while business barely cared.

Take home

So consult with your community to find out what’s worrying them. Target those concerns in promotion and training. Eg we know system usability is important so definitely fix it – but don’t waste your communication opportunities talking about it when they’re worried about other things.

Scholarly workflows #or2017

Abstracts

Supporting Tools in Institutional Repositories as Part of the Research Data Lifecycle by Malcolm Wolski, Joanna Richardson

Have been working on research data management in context of the whole research data lifecycle. Started asking question: once research data management is under control, what will be the next focus? Their answer was research tools. Produced two journal articles:

  • Wolski, M., Howard, L., & Richardson, J. (2017). The importance of tools in the data lifecycle. Digital Library Perspectives, 33(3), in press
  • Wolski, M., Howard, L., & Richardson, J. (2017). A trust framework for online research data services. Publications, 5(2), article 14 https://doi.org/10.3390/publications5020014

Research life cycle: Data creation and deposit (plan and design, collect and capture) -> Managing active data (Collect and capture, collaborate and analyse) -> Data repositories and archives (manage, store, preserve; share and publish) -> Data catalogues and registries

Research data repositories vary a lot. Collection or ecosystem? Open or closed? End point or part of workflow? Why is it hard to build them? Push-and-pull between re-usability and preservation:

  • technical aspects
  • interoperability
  • legal/regulatory/ethical constraints
  • one-off activity or continuous
  • diversity of accessibility issues
  • diversity of re-usability issues

The average number of research tools per person was 22 (ranging from Word, ResearchGate and email, through SurveyMonkey, Dropbox and Figshare, to R and really specialised ones). Kramer and Bosman (2016) divided tools into assessment, outreach, publication, writing, analysis, discovery, and preparation phases. Tools are exploding as research activity scales up and collaboration increases. Large-capacity projects are being funded. Data science courses are upskilling researchers.

Researchers use lots of tools as part of the data workflow. The institution may manage data, but has no ownership of the workflow. Since data has to move seamlessly between tools, interoperability is key – but how do we build these interoperable workflows and infrastructures?

Need to remember repository is only part of the research ecosystem. Need to take an institutional approach – or approaches rather than a single design solution. Look at main workflows and tools used – check out research communities who may already have the solutions – focus must be meeting the researchers’ needs.

Q: Will we see researchers use fewer tools as disciplinary workflows develop?
A: Probably not, but we will see more integration between them, eg Qualtrics adding an R connector.

Research Offices As Vital Factors In The Implementation Of Research Data Management Strategies by Reingis Hauck

Have a full-text repository on DSpace, and are building a data repository on CKAN. What if we build something (at great expense) and they don't come? We need cultural change. Eg the UK seems far ahead, but only 16% of respondents were accessing university RDM support services in 2016.

They have a data repository, with a support service provided by the research office, library, and IT services.

The research office provides support in grant writing; advocates on policies; helps with internal research funding; and reports to senior leadership. Their toolkit:

  • need to win research managers over – explain how important it is
  • embedded an RDM expert
  • upskilled research office staff about data management planning and how to make a case for data management.

Look out for game changers:

  • eg large collaborative research projects – produce lots of data and need to share it to be successful so more likely to listen
  • DMP preview as standard procedure for proposal review and training on proposal writing. (Want data management planning to be like brushing your teeth: you do it every day and if you forget you can’t sleep.)
  • adapt incentives – eg internal funding for early career researchers requires data management plans
  • use existing networks – researchers go to lots of boards and meetings already so feed this as a topic like any other topic
  • engage with members of the DFG [German Research Foundation] review board – to get them to draw up criteria to reward researchers doing it

Cultural change towards open science can be supported by your research office. Let’s team up more!

Towards Researcher Participation in Research Information Management Systems by Dong Joon Lee, Besiki Stvilia, Shuheng Wu

RIMS – include ResearchGate, Academia, Google Scholar; ORCID, ImpactStory; PURE, Elements

ResearchGate sends out a flood of emails – good for some, a put-off for others. How can we improve our RIMS to improve researcher engagement?

Interviewed 15 researchers then expanded to survey 412 participants; also analysed metadata on 126 ResearchGate profiles of participants. Preliminary findings:

  • Variety of different researcher activities in RIMS eg write manuscripts, interact with peers, curate, evaluate, look for jobs, monitor literature, identify  collaborators, disseminate research, find relevant literature.
  • Different levels of participation: readers may have a profile but don’t maintain it or interact with people; record managers maintain their profile, but don’t interact with others; community members maintain profiles but also interact with others etc.
  • Different motivations to maintain profile: to share scholarship (most popular); improve status, enjoyment, support evaluation, quality of recommendations, external pressure (least popular)
  • Different use of metadata categories: people tend to use the person, publication, and research subject categories. Maybe research experience, but rarely education, award, teaching experience, or other.
    • In Person most people put in first, last name, affiliation, dept;
    • Publication: most use most of these fields, except only 30% of readers share the file itself – compared with about 80% of record managers and community members

Want to develop design recommendations to enable RIMS to increase participation.

Research and non-publications repositories, Open Science #or2017

Abstracts

OpenAIRE-Connect: Open Science as a Service for repositories and research communities by Paolo Manghi, Pedro Principe, Anthony Ross-Hellauer, Natalia Manola

Project 2017-19 with 11 partners (technical, research communities, content providers) to extend technological services and networking bridges – creating open science services and building communities. Want to support reuse/reproducibility and transparent evaluation around research literature and research data, during the scientific process and in publishing artefacts and packages of artefacts.

Barriers – repositories lack support (eg integration, links between repositories). OpenAIRE wants to facilitate a new vision, so is providing "Open Science as a Service": a research community dashboard with a variety of functions and a catch-all broker service.

RDM skills training at the University of Oslo by Elin Stangeland

Researchers are using random storage solutions and don't really know what they're doing. Need to improve their skills. Have been setting up training for various groups in the organisation. Software Carpentry for young researchers to make their work more productive and reliable: 2-day workshops which are discipline-specific and well-attended. Now running their own instructor training, which allows them to expand the service. Author Carpentry, Data Carpentry, etc.

Also training for research support staff, who are the first port of call on data management plans, data protection, and basic data management. The Dept of Geosciences recently made attending DMP training mandatory.

Expanding library carpentry to national level.

IIIF Community Activities and Open Invitation by Sheila Rabun

A global community that develops shared APIs for web-based image delivery, implements them in software, and exposes interoperable image content.

Many image repositories are effectively silos. The IIIF APIs add a layer that lets servers talk to each other, allowing easier management and better functionality for end users. Lots of image servers and clients around now, so you can mix and match your front and back ends. Can have deep zoom, compare images, and more.
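[Aside: the mix-and-match works because the Image API URL pattern is fixed – {id}/{region}/{size}/{rotation}/{quality}.{format}. A small sketch building request URLs against a hypothetical image server:]

```python
BASE = "https://images.example.org/iiif"   # hypothetical IIIF image server prefix
IMAGE_ID = "ms-0042_f001r"                 # hypothetical image identifier

def iiif_url(region="full", size="full", rotation=0, quality="default", fmt="jpg"):
    """Build an Image API request: {base}/{id}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{BASE}/{IMAGE_ID}/{region}/{size}/{rotation}/{quality}.{fmt}"

print(iiif_url())                            # the full image
print(iiif_url(size="512,"))                 # scaled to 512px wide (deep-zoom tiles work the same way)
print(iiif_url(region="100,100,1000,1000"))  # a 1000x1000 crop starting at (100,100)
print(f"{BASE}/{IMAGE_ID}/info.json")        # machine-readable description of what the server supports
```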

Everything created by global community so always looking for more participants. Community groups, technical specification groups eg extending to AV resources, discovery, text granularity (in text annotations). Also a consortium to provide leadership and communication channels.

Data Management and Archival Needs of the Patagonian Right Whale Program Data by Harish Maringanti, Daureen Nesdill, Victoria Rowntree

Importance of curating legacy datasets. This is the world's longest continuous study of a large whale species: 47 years of data and counting. Two problems:

  • to identify whales – found the callosities of right whales are unique (number, position, shape) and the pattern remains the same despite slight changes over time. So they can take aerial photos when the whales surface. Data is analysed with a complicated computer system and compared with existing photos.
  • to gather data over a period of time – where to find whales regularly. Discovered whales gather in three areas: 1) mothers and calves; 2) males and females; 3) a free-for-all.

The collection has tens of thousands of b&w negatives; colour slides; analysis notebooks; field notes; Access 1996 database records; sightings maps.

Challenges: heterogeneity of data; metadata – including how much can be displayed publicly; outdated databases.

Why should libraries care? We can provide continuity beyond life of individual researchers. Legacy data is as important as current data in biodiversity type fields and generally isn’t digitised yet.

Repository driven by the data journal: real practices from China Scientific Data by Lili Zhang, Jianhui Li, Yanfei Hou

China Scientific Data is a multidisciplinary journal publishing data papers – raw data and derived datasets. Workflow: submission (of paper and dataset), review (paper review and curation check), peer review, editorial voting.

How to publish:

  • Massive data? – on-demand sample data publication: can’t publish the whole set, but publishes a sample (typical, minimum sized) to announce the dataset’s existence
  • Complex data? – publish data and supplementary materials together eg background info, software, vocabulary, code, etc. Eg selected font collections for minority languages
  • Dynamic data? – eg when updating with new data using same methodology and data quality control. Could publish as new paper but it’s duplicative so published instead as another version with same DOI. Can be good for your citations!

Encourage authors to store data in their repository so its long-term availability is more reliable.

RDM and the IR: Don’t Reuse and Recycle – Reimplement by Dermot Frost, Rebecca Grant

We all have IRs and they’re designed for PDF publications. Research Data Management is largely driven by funder mandates; some disciplines are very good at it, some less so (eg historians claiming “I have no data” – having just finished a large project including land ownership registries from 17th century, georectified etc!)

FAIR (findable, accessible, interoperable, reusable) data concept (primarily machine-oriented, ie findable by machines). IRs can't do this well enough: technically uploading a zip file is FAIR, but it's time-costly for the user.

Instead should find a domain-specific repository (and read the terms and conditions carefully especially around preservation!) Or implement your own institutional data repository (but different scale of data storage can take serious engineering efforts). Follow the Research Data Alliance.

Developing a university wide integrated Data Management Planning system by Rebecca Deuble, Andrew Janke, Helen Morgan, Nigel Ward

Need to help researchers across the life-cycle. The University of Queensland identified an opportunity to support researchers around funding/journal requirements. Used DMPonline but uptake was poor due to lack of a mandate. The UQ Research Data Manager system:

  • Record developed by the researcher – an active record (not a plan, though it includes project info) which can change over the course of the project. A simple dynamic form, tailored to researchers, with guidance for each field.
  • Storage auto-allocated by storage providers for working data – given a mapped drive accessible by national collaborators (hopefully international soon) using code provided in completing the form.
  • [Working on this part] Publish and manage selected data in a managed collection (UQ eSpace). Currently a manual process of filling in a form with metadata fields in eSpace. Potential to transfer metadata from the RDM system to eSpace (see the sketch after this list).
  • Developing procedures to support the system.
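[Aside: the planned metadata transfer wasn't specified; as an illustration of the kind of crosswalk involved, a sketch mapping hypothetical RDM record fields onto hypothetical eSpace fields – neither schema was described in the talk.]

```python
# Hypothetical field names on both sides.
CROSSWALK = {
    "project_title": "title",
    "project_summary": "description",
    "chief_investigator": "creator",
    "collection_start": "date_created",
}

def rdm_to_espace(rdm_record):
    """Translate an RDM record dict into an eSpace-style metadata dict, dropping unmapped fields."""
    return {espace_field: rdm_record[rdm_field]
            for rdm_field, espace_field in CROSSWALK.items()
            if rdm_field in rdm_record}

print(rdm_to_espace({"project_title": "Reef microbiome survey", "chief_investigator": "Dr A. Example"}))
```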

Benefits include uni oversight of research in progress; it's researcher-centric; improves impact/citation; and provides public access to data.

Preserving and reusing high-energy-physics data analyses by Sünje Dallmeier-Tiessen, Robin Lynnette Dasler, Pamfilos Fokianos, Jiří Kunčar, Artemis Lavasa, Annemarie Mattmann, Diego Rodríguez Rodríguez, Tibor Šimko, Anna Trzcinska, Ioannis Tsanaktsidis

Data is very valuable – results are still being published even 15 years after funding stopped, and although they are always building new and bigger colliders, data remains relevant decades after it was collected.

Projects involve 3000 people, including a high turnover of young researchers. CERN needs to capture everything needed to understand and rerun an analysis years later – data, software, environment, workflow, context, documentation.

  • Invenio (JSON schema with lots of domain-specific fields) to describe analysis
  • Capture and preserve analysis elements
  • Reusing – need to reinstantiate the environment and execute the analysis on the cloud.

REANA = REusable ANAlyses supports collaboration and multiple scenarios.
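[Aside: REANA's specifics weren't shown; as a sketch of the kind of declarative description it consumes, here's Python writing a minimal reana.yaml in the shape of REANA's serial-workflow examples – the input file, image and command are invented placeholders.]

```python
import yaml  # pip install pyyaml

spec = {
    "inputs": {
        "files": ["code/fit_mass_peak.py", "data/events.csv"],   # invented
        "parameters": {"nbins": 50},
    },
    "workflow": {
        "type": "serial",
        "specification": {
            "steps": [
                {"environment": "python:3.8",
                 "commands": ["python code/fit_mass_peak.py --bins ${nbins}"]},
            ],
        },
    },
    "outputs": {"files": ["results/mass_peak.png"]},
}

# Write the workflow description that a REANA-style service would pick up alongside the code and data.
with open("reana.yaml", "w") as f:
    yaml.safe_dump(spec, f, sort_keys=False)
```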