State of Open Data 2019 #FigshareFestNZ

State of Open Data 2019 report
Dr Mark Hahnel, Figshare

The EU reckons it could save €10.2 billion a year using FAIR data. In the US the OPEN Government Data Act has been signed. NIH has said it requires everyone to make all their data available – but where are they going to put it?

FAIR – at least start with F, and don’t name your file “dataset.xlsx”, because that’s hardly findable.

Figshare/Zenodo/Dryad – increasing uptake over time 20-30% year on year. Effects of EPSRC mandate in UK, NSFC mandate in China are significant. NZ is actually following the same trend even without mandates – global culture seems to be having an effect.

arXiv:1907.02565 found that linking a paper to its dataset is associated with a 25% increase in citations

Annual survey – 8,500 responses (roughly three times as many as in previous years)

  • 74.5% responses were extremely/somewhat likely to use other people’s data
  • 2/3 say funders should mandate data sharing, and 2/3 say that funders should withhold funding if people don’t
  • 66.4% think they don’t get enough credit for datasets – they want full citation, co-authorship (of paper based on their data – technically against norms but many do get it), consideration in job reviews, financial reward
  • Awareness of FAIR – slightly up but still <20% familiar, <30% heard of them

South African data repository (for compliance with a funder there)

NZ Research Information System #FigshareFestNZ

NZ Research Information System (NZRIS) update.
Chris Dangerfield, MBIE

Concept model covering goal/purpose, resources, requests for resources, asset pools (eg funds), awards, projects, activities, people, teams, proposals, organisations…

Data model divided into two sides “asset pool managers” (funders) and “research, science and innovation managers” (eg unis, CRIs)

Implementation Phase 1 – plan to release in March with publicly available historic data (from 2009 to present), with mostly mandatory data resulting from HRC proposals, Marsden Fund, Endeavour Fund, Sustainable Farming Fund, Partnerships, etc.

Phase 2 – bringing on more funding organisations and bringing in optional data

Review of phase 1 – data sovereignty, privacy, engaging with Māori stakeholders, creating engaging visualisations, improving data quality

Phase 3 – bring on unis, CRIs, etc in 2021-2022 – looking at projects, outputs (including datasets)

More information

Preparing the healthcare workforce #FigshareFestNZ

Preparing the healthcare workforce to deliver the digital future – the good, the bad and the ugly.
Dr Ehsan Vaghefi

Lots of lessons learned through commercialisation involving AI.

The Good

  • Great for science – IBM’s Watson Oncology can provide evidence-based treatment options, generate notes and reports, etc: the oncologist then audits this. Enables the hospital to increase capacity as AI is doing heavy lifting. Would it replace radiologists? Some yes; but other jobs have been created to work with the AI.
    • linking diseases to different genetic profiles
    • predicting possible treatments/vaccines for testing
    • AI-assisted cardiac imaging system
  • Gift of time – clinicians will have more time to focus on interacting with patients
  • Good read: “The Patient Will See You Now”
  • Ophthalmology/optometry rely heavily on pattern recognition, eg AI is often more accurate at detecting cataracts; can match human accuracy detecting glaucoma (which you otherwise don’t know you have until too late); and can match accuracy in diabetic retinopathy screening

The Bad

  • Implementation – customer request, design, documentation, customer actual needs often all very different!
    • Eg in one case where they provided more information to clinicians, it slowed them down and made them less accurate. Clinicians are scared of AI so start second-guessing themselves. They do get faster with more practice using the AI – but never reach their unassisted screening rates! Similar study in Thailand – when gathering data, clinicians only passed on the good data that they were confident about, so when the AI tried to deal with ambiguous situations it didn’t cope.

The Ugly

  • DeepMind Health got more than 1 million anonymised eye scans with related medical records – then sold themselves to Google. (In 2017 the UK ruled that the NHS had broken the law in providing the medical records.) Microsoft partnering with Prasad Eye Institute in India. IBM acquiring Merge Healthcare and IBM Watson analysing glaucoma images for deep learning. Streams medical diagnosis app to help you self-manage your health – and provides the results to the hospital and your insurance company… Zorgprisma Publiek “pre-clinical decision-maker” helps “avoid unnecessary hospitalizations” – in practice the hospital can see in advance that you’ll be a costly patient and not admit you.
  • Re-identification – based on a single photo you can guess so much about a person you can start to work out who the person is.
  • AI bias – racism – based on incomplete datasets. Eg police using AI to assign risk factor based on risk background and face but because it’s got lots of racially biased data, it produces racially-biased risk factors. Eg a health-care algorithm where only 17.7% of patients receiving extra care were black (should have been 46.5%). Vital to be very careful about data collection – who’s contributing and not contributing – and invest more in diversifying the AI field itself.

Is ethical AI an oxymoron? Need to work out data ownership, governance, custodianship, security, impact on future.

Five pillars ethical AI

  • Transparency (informed consent etc)
  • Justice and fairness (make sure you’re not missing parts of the community)
  • Non-maleficence
  • Responsibility
  • Privacy

Is ethical AI a bargain/contract? A bargain struck between data sources and data users. Science needs data so it must be shared – but what benefit does the data source receive? Next evolution of big data in healthcare is “learning health systems” so instead of just holding your information the system can learn about you and give you better treatment.

Is privacy always beneficial? Sometimes sharing the data with an AI lets you get a better treatment plan.

A roadmap: “First do no harm”. Choose the right problem rather than going fishing for data, and make sure, when gathering data, that the population understands everything about the research.

Removing barriers to sharing for the benefit of Māori #FigshareFestNZ

Data for whom? Removing barriers to sharing for the benefit of Māori
Dr Kiri Dell; Ngapera Riley, CEO of Figure.NZ

Ref Decolonising Methodologies by Linda Tuhiwai Smith

The academy privileges a certain type of knowing, but indigenous people have other ways as well (which we all use to some degree) eg

  • Sense perception – I felt it
  • Imagination – I envisioned it
  • Memory – I remembered it
  • Inherited – My nanny told me
  • Faith – God told me

Example of using data badly: MOTU (economic research centre) put out research a few years ago comparing Māori and other ethnicities and concluding that collectivist beliefs were holding Māori back from economic success. This was not taken well by Māori…. Researchers made sweeping statements about Māori culture where they had no research; compared Māori to completely different groups (eg African Americans, Chinese) with different histories and belief systems; interpreted through white male lens.

Basho haiku
Pull the wings off a dragonfly and look – you get a red pepperpod!
vs
Add wings to a pepperpod, and look – you get a red dragonfly

Figure.nz set up as charity to democratise data – aims to provide valid and ethical data. Data is important but so is context and people behind it. Draws licensed data from over a hundred sources. Exists for the benefit of Aotearoa so believe if they can get data right for Māori they can get it right for all. Kaupapa that data is for everyone not just experts.

Partnering with Te Mana Raraunga – the Māori Data Sovereignty Network – and with nine government agencies who’ve got a lot of data that was never meant to be shared, so navigating the benefits and dangers of sharing. How will data be used, for whom, and why?

Data is never perfect – it’s just one tool alongside experience and connections. Māori data has traditionally not been collected well – cf especially the latest census – so have to be careful about conclusions drawn.

Figure.NZ –  over 44,000 charts and datasets (CSVs and images) around people, travel, health, education, employment, economy, environment, social welfare, technology, broken down by geographic area. Very careful to publish metadata around sources etc.

Original data was a mess so have been working hard to tidy it up.  Check sources, make sure it’s statistically valid (no small datasets) – have a robust process to work with source to make sure the metadata explains methodology and context.

Focused on public aggregate data but starting to use other sources. Wondering how to safely share research data. Excited to see people have started publishing theses etc with CC licensing.


Rolling out research for public good #FigshareFestNZ

The bumpy road to rolling out the Index of Multiple Deprivation
Assoc Prof Daniel Exeter

IMD looks at deprivation in different domains, eg employment, education, income, housing, access to services. Most rural areas in the South Island are not very deprived, but the east and north of the North Island are. In Christchurch there are also various quintile 5 areas surrounded by quintile 3 areas – a very sharp north-west/south-east divide. Can drill down to see what’s driving deprivation in each area.

Want to maximise the impact of the public-good money, so put up a website and made all datasets and reports freely available, easy to download, with full attribution. But the first hurdle was getting data out of the StatsNZ IDI, who are very cautious – had to work closely with them, pointing out the data was very aggregated. Published papers with a journal that wanted the data and wouldn’t accept the institutional site they’d painstakingly created. [In fairness, institutions do mess up their websites frequently, so they’re not good for long-term storage!] So quickly created a Figshare website – one advantage is other institutions can use the API to access this data.

Bumpy ride = mostly interesting challenge. Eg

  • someone creating a survey wanting to know people’s geographical area without asking for actual address to make it easier to manage data. Created something where users could enter address and it’d convert it to the area along with IMD data
  • looking at poor oral health of children per deprivation per ethnicity
  • getting it into policy – mostly from getting emails from people asking how to interpret something. One DHB in North Island interacts with 5 local govt authorities and was asked to create one per authority – lots of work to recreate this at that level but achieved
  • someone using it for climate change – heat vulnerability index including IMD, social isolation, age bands
  • noted some areas ended up not included in census which affects accuracy

Funded by Health Research Council but most uptake through govt, eg also Canterbury Wellbeing Index; Ministry of Education, Alcohol Health Watch lets you type in an address and get a report to make a submission against liquor licenses

Are big data and data sharing the panacea? It’s fantastic but there are big issues too: attribution, ethics, risk of data gerrymandering, the need to use theory to inform methods. Sometimes people use the IMD in ways that are useful but potentially misleading. Institutions are playing catch-up, not yet sure how to deal with this. Hopes the community being built around Figshare etc will help develop solutions and best practices.


Developing research data services #FigshareFestNZ

I talked about our implementation and various integrations, including glitches along the road, eg

  • being “piggy in the middle” between Figshare and our ITS trying to troubleshoot CNAME issues without knowing what a CNAME is
  • setting up emails so that Figshare system emails come from a Lincoln address – while our ITS is cracking down on phishing and Figshare emails look to our system like they’re spoofed.
  • our first attempt at an HR feed only sent through academics, not postgrads; now redoing this from scratch with new extract scripts, and Figshare is working out how to deal with the ~ we use to signify someone doesn’t have a first name
  • indexing in LibrarySearch (Primo) using the OAI-PMH feed – also in data.govt.nz and DigitalNZ
  • Next year implementing Repository Tools 2 in Elements which will let researchers deposit datasets from there – doing pre-work to make sure we can have two user IDs in the HR feed: one to integrate with the login system, the other one to integrate with Elements
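The “~” placeholder mentioned above could be normalised at extract time. A minimal sketch, assuming a CSV-style HR feed – the field names here are hypothetical, not the actual feed schema:

```python
import csv
import io

def normalise_names(rows):
    """Replace the '~' placeholder (used when someone has no first name)
    with an empty string so downstream systems don't display it."""
    for row in rows:
        if row.get("first_name", "").strip() == "~":
            row["first_name"] = ""
        yield row

# Example feed containing a person with no first name
feed = "user_id,first_name,last_name\nab123,~,Ngata\ncd456,Jane,Smith\n"
rows = list(normalise_names(csv.DictReader(io.StringIO(feed))))
```

Doing this in the extract script keeps the marker convention internal to HR, rather than pushing it out to every integrated system.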

Salila Bryant from ESR talked about the background behind implementing Figshare as an institutional repository, and some of the challenges (including getting support from the other side of the world; getting researchers on board with publishing data; covering privacy, metadata requirements, etc).

Katy Miller from Victoria University of Wellington: they used to have Rosetta for everything, but researchers couldn’t interact with it. Ended up choosing Figshare as it interacts with Elements; easy for academic staff to deposit; modern interface; ‘just right’ – good looks, well indexed, etc; a proven solution. Lots of decisions needed to be made to set it up: thought it’d all be simple, but it turns out not to be as simple as it looks.

  • Groups – most places use academic units, but this is greatly in flux at VUW so are probably going to use publication types as the groups (eg a group for theses etc)
  • Metadata – keeping it light to focus on their key aims
  • Mediation – trying for no mediation between deposit and availability
  • Scope – starting with journal articles and conference papers as this is mainly to support OA mandate. Theses will be next (and will need to add postgrads into the system) – then think about research data down the track.
  • DOIs – grappling with how to handle situations where a preprint is published with a DOI and later the published version becomes available with a different DOI. Currently planning to use the publisher DOI field in Figshare when the published DOI becomes available.

Laura Armstrong, University of Auckland, talking about engaging with researchers. They talk about their “data publishing and discovery service”, but it’s still at https://auckland.figshare.com Currently 236 datasets in data.govt.nz; 666 DOIs minted year-to-date.

  • Researchers want DOIs (often not knowing why); discoverability (via Google); browser preview; branding/match website; metrics
  • Have group sites for different use cases eg conferences, research groups
  • Process – usually researcher requests; then discuss use case, what they want to achieve, who’s the owner, etc; then create a group site based on this, and figshare uploads branding overnight; they make it live but not public; once at least one item is published, the site is made public.
  • User types – most people only assigned to one group and can only publish there. Can set people up as admins for multiple groups, they can then publish to any of those groups. For external users, an internal person in group X can create an unpublished project and invite the external person to it: project items then get published to group X.

Simon Porter and Jared Watts, Digital Science looking at visualisation of collaboration networks based on Jupyter Notebook using Figshare API. Making available raw data in CSVs, visualisations, and technical report “This Poster is Reproducible”.

National DOI consortium #FigshareFestNZ

National DOI consortium
Andrea Goethals, National Library of New Zealand

DOIs: persistent and unique identifiers for digital objects. Support the sharing of research as part of creating FAIR data (eg principle F1 “(meta)data are assigned a globally unique and persistent identifier”). Mostly assigned by either DataCite (especially for data and grey literature) or Crossref (especially for journal articles etc).

DataCite’s model requires being a member – either a direct member or a consortium member – and New Zealand has done the latter. Consortial membership is cheaper; members can leverage each other’s skills and expertise, have more influence, and share strategies. Each consortium member might have one or more repositories. Responsibilities: creating your own DOIs (with full autonomy) using DataCite’s tools; communicating with the institution’s researchers; and paying annual fees.

Consortial fees:

  • Membership fee: €2,000, split across the number of consortium members
  • Repository fee: €500 per repository
  • DOI fee: €500, covering up to 10,000 DOIs

Minting DOIs – at UoA you can either create them manually, through a form or by uploading a file, or automatically via the API.
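For the API route, DataCite’s REST API accepts a JSON:API document POSTed to its `/dois` endpoint. A minimal sketch of building that payload – the prefix, names and URL below are hypothetical examples, and an actual mint needs repository credentials:

```python
import json

def datacite_payload(prefix, title, creators, publisher, year, url):
    """Build a JSON:API payload for DataCite's REST API (POST /dois).
    "event": "publish" asks DataCite to register the DOI immediately;
    supplying only a prefix lets DataCite auto-generate the suffix."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "event": "publish",
                "prefix": prefix,
                "titles": [{"title": title}],
                "creators": [{"name": c} for c in creators],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": url,
            },
        }
    }

payload = datacite_payload(
    "10.12345",  # hypothetical repository prefix
    "Example dataset", ["Doe, Jane"], "Example University", 2019,
    "https://example.org/dataset/1",
)
# An actual mint would POST json.dumps(payload), with HTTP basic auth and
# Content-Type application/vnd.api+json, to https://api.datacite.org/dois
body = json.dumps(payload)
```

Repositories typically wire this into their deposit workflow so a DOI is minted the moment an item is published.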

Also have an NZDOI Interest Group – open to anyone to join.

Have more information at National Library site.

Building research data services at NeSI #FigshareFestNZ

Building research data services at NeSI
Brian Flaherty, NeSI

NeSI is a collaboration supporting researchers to tackle large problems (ie supercomputers). Core services around HPC, consultancy, training. Two supercomputers, Mahuika and Māui; a couple of dozen staff covering a range of disciplines.

2011-2014 mostly about computer infrastructure with a little storage and consultancy

2014-2019 shifting

2019-future looking at research platforms, virtual labs, scientific gateways

Data management – mostly want to deal with the active data: collection, pre-processing, analysis and modelling, repurposing pre-publication.

Refreshing its offering around transferring data – looking at a national data transfer platform. Nodes at UoA, NIWA in Wellington, AgResearch in Christchurch, and Dunedin. (There’s a big need for an Australasian platform as lots of data gets sent back and forth, eg to the Garvan Institute sequencing laboratory. Lots of hard drives are still being shipped.) transfer.nesi.org.nz has a point-and-click transfer interface.

Automated workflows: Genomics Aotearoa doing a project sequencing taonga species eg the kākāpō. DoC was storing data in the cloud in Australia; Ngāi Tahu weren’t happy so have brought it back into NZ stored with NeSI. Genomics Aotearoa Data Repository being developed at NeSI – starting small (“don’t try to boil the ocean”), with downloading, storing, sharing data with group-based access control and group membership management. FAIR – so far at findable (just) and accessible (in that it’s sharable) but still working towards interoperable and reusable.

Indigenous data: a kāhui Māori is working to make sure Māori data is managed within a Māori context, so this needs to be mapped into security, auditing, permissions processes, etc. Work in progress. Underpinned by the “Te Mata Ira” guidelines for genomic research with Māori.

Increasingly researchers want the data to be stored in the same place as the compute so it doesn’t have to be transferred backwards and forwards.

Security around sensitive data: have firewalls, multifactor authentication, need to look more at privacy policy and standards around health information security frameworks etc.

Curation: “maintaining, preserving, and adding value to digital data/object through its life cycle”. Eg transforming formats, evaluating for FAIRness. Knowing what to get rid of when storage is stretched.

Metadata: creating README files – getting stuff out of people’s heads. RO-Crate as a way to package it in a human- and computer-readable way.
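The RO-Crate idea above boils down to a single JSON-LD file, `ro-crate-metadata.json`, sitting alongside the data. A minimal sketch of generating one (the dataset name and description are hypothetical examples; real crates usually describe individual files too):

```python
import json

def minimal_ro_crate(name, description):
    """Build a minimal RO-Crate 1.1 metadata document: a metadata
    descriptor entity plus the root dataset entity it describes."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
            },
        ],
    }

crate = minimal_ro_crate("Kākāpō sequencing run",  # hypothetical example
                         "Raw reads plus processing notes")
# Writing it next to the data makes the directory a valid (if sparse) crate:
# json.dump(crate, open("ro-crate-metadata.json", "w"), indent=2)
```

Because it’s plain JSON-LD, the same file serves both the human reading the README and a harvester crawling the repository.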

Round-up of #anzreg2019 sessions

ANZREG = the Australia / New Zealand Ex Libris User Group (the acronym is historic). This covers topics related to Alma, Primo, Leganto, Esploro, etc etc.

I was (not heavily) involved in organising the conference, and moderated the developers’ day, and my main takeaway from this is that if you have the option to pay $$$ for AV support during a conference, pay it: it’s worth every single cent to have someone there who’s responsible for the mics and livestreaming and remote presentations, and lets you focus on the people and timekeeping and stuff.

Day 1

  • I made a terrible strategic decision not to liveblog the keynote “Libraries at the Edge of Reality”. Keynotes are often hard to liveblog and this would have been too but I regret not writing down the first point of Jeff Brand’s “Manifesto for Civilising Digitalisation”. It was – after talking about the respect people have for physical libraries and other spaces; about the grief people feel when eg Notre Dame burnt because they’ve got an emotional connection to it – about making a virtual/digital space that would deserve that same feeling and respect. It left me wondering what kind of website does this? The closest I can think of is Wikipedia maybe?
  • Predicting Student Success with Leganto – library joined an Ex Libris pilot project to see if it’d be possible to predict student success/failure based on reading list interactions. Some limited success but lots of false positives/false negatives. Would need lots more data, and lots of discretion if planning any intervention based on the results.
  • Understanding user behaviour and motivations – turned on “expand my results” by default and got a large increase in interloan requests, especially from first-time users/undergraduates. Big usability improvement.
  • Aligning project milestones to development schedules – introduced Leganto in multiphase project, making various bugfix/enhancement requests along the way
  • Exploring Esploro – had a very unintegrated repository/CRIS system built on manual processes. Esploro eliminates much of this double-handling, has automagic harvesting etc. Researcher still needs to upload full-text themselves but system sends emails.
  • A national library perspective on Alma – lots of original cataloguing, which Alma isn’t strong in. Numerous challenges around this and born-digital items; various workarounds found. Make heavy use of templates.
  • “It should just work”: access to library resources – sponsor presentation on LibKey products which is essentially a redesigned link resolver plugin thing. Possibly a bit heavy reliance on DOIs and PDFs which limits how often it’ll be successful but it’s early days for the product and they seem keen to expand the cases where it’ll work.
  • A briefing from Ex Libris – upcoming improvements to MetaData Editor, CDI, COUNTER 5, Provider Zone content loading, next gen resource sharing, next gen analytics

Day 2

  • “Primo is broken, can you fix it?” – linking issues from Primo. Lots to do with EBSCOhost (partly including a move from EZproxy to SSO for authentication). Also discussed the infamous “No full text error” problem which Ex Libris apparently says is in development.
  • What do users want from Primo?  – very detailed talk on getting evidence on how users use Primo, and what improvements to make as a result. Includes links to survey kits and dataset of analytics.
  • Achieving self-management using Leganto  – Very successful implementation. Started with a small pilot project which helped finetune how they sold it, built their own confidence, and created champions among their userbase. But ultimately seems like their faculty just really like the product (even if they’re not yet using all the functionality). Library is retaining some functions in their control eg rollover.
  • Creating actionable data using Alma Analytics – using various dashboard visualisations to inform a large weeding project. Will share reports in community area.
  • Central Discovery Index – update on CDI from the libraries testing it. Testing only partway through. Some issues found, Ex Libris investigating these. Switchover is planned by July for all customers.

Developers’ Day

  • Primo Workflow Testing with Cypress – I’ve long liked the idea of automated testing, but figured I didn’t have the skills to set it up. With Cypress, which uses JavaScript… I just might. The time is another matter but I think I want to explore it as it could be useful for a lot more systems than just Primo, and give us early warning when things break (instead of us finding out days later when someone gets around to using and/or reporting it).
  • Using APIs to enhance the user experience – using the APIs to create their own user interface over the top of their various Ex Libris products for consistency, usability, robustness (by caching so it covers downtime better). Big investment of time! But makes sense in their context.
  • Harnessing Alma Analytics and R/RShiny for Insights – RShiny for interactive visualisation. Learning curve but powerful (and free!) Their talk showed some cool use cases.
  • You are what you count – another really detailed talk, basic theme being to be strategic about what you count – make metrics fit your strategy, not dictate it.
  • The fight against academic piracy – Splunk with EZproxy data to automate blocking users who fit a pattern of excessive/abnormal downloads. Some false positives but easily resolved and generally results in positive and constructive conversations.
  • rss2oai for harvesting WordPress into Primo – this was my talk, slides not yet live and I obviously didn’t liveblog 🙂 but the code is at https://github.com/LincolnUniLTL/rss2oai At the last minute this morning I realised that I hadn’t included a section on what it actually looks like for users as a result, so hurriedly edited that in; during the session someone asked if we had analytics on how it was used which is another massive oversight I should rectify sometime When I Have Time (and can overcome my hatred of Google Analytics).

The fight against academic piracy #anzreg2019

UniSA Library and the fight against academic piracy
Sam Germein, University of South Australia

Previous method for monitoring abuse of EZproxy was cumbersome and prone to error.

Next used Splunk. Could get a top-10 downloaders list; do a lookup on usernames, etc. Reduced the time to look for unauthorised access, but vendors would still contact them outside of business hours and block access to the EZproxy server for potentially the whole weekend.

Splunk has a notification function – looking into how to use this.

Eg a report if a username logged in from three or more countries. (Two countries turned up lots of false positives due to VPNs.) Alerts got sent to Sam by email. Could then block the username.
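The three-countries rule is simple to express outside Splunk too. A minimal sketch in Python, assuming the EZproxy logs have already been geolocated into (username, country) login events – the event data below is invented for illustration:

```python
from collections import defaultdict

def flag_multicountry_users(events, threshold=3):
    """Given (username, country) login events, return the usernames seen
    from `threshold` or more distinct countries. Two countries produced
    too many VPN-related false positives, hence the default of three."""
    countries = defaultdict(set)
    for username, country in events:
        countries[username].add(country)
    return {user for user, seen in countries.items() if len(seen) >= threshold}

events = [
    ("alice", "NZ"), ("alice", "NZ"),
    ("mallory", "NZ"), ("mallory", "RU"), ("mallory", "CN"),
    ("bob", "NZ"), ("bob", "AU"),  # VPN-style two-country pair: not flagged
]
flagged = flag_multicountry_users(events)  # only "mallory"
```

Using distinct countries rather than raw login counts is what keeps heavy-but-legitimate users like "alice" out of the alert.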

Looked into other ways it might be more accurate. There’s still the potential situation of a student in a country where access was blocked and a VPN was needed. Added database info to see if they’re hopping between lots of databases, and how much content they’re downloading. All this info was built into dashboards, so needed to reverse-engineer them and get the info into his report.

Another issue – in the weekend, getting alerts on a phone where he couldn’t view the spreadsheet. But Splunk could embed the info in the email.

Extended emails to other team members and to their help desk software to log a formal job and make it part of the business workflow. Got IT Helpdesk involved.

Still getting false positives, so looked into only sending the alert if more than 25MB had been downloaded. Refined how the info is displayed for the wider range of people managing it.

Increased frequency to every 6 hours.

Using the API they could directly write the username to the EZproxy deny file – fully automating the block process. Still getting some false positives, but much more on the front foot – they see alerts and contact the vendor rather than vice versa.
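The final automation step – writing the username into the deny file – is a small idempotent file append. A sketch, with the caveat that the one-username-per-line format (plus an optional comment, as mentioned in the Q&A) is an assumption; the exact syntax depends on how the EZproxy config reads the file:

```python
import os
import tempfile

def block_user(deny_path, username, note=""):
    """Idempotently append a username to the deny file EZproxy consults.
    Returns True if the user was newly blocked, False if already present."""
    if os.path.exists(deny_path):
        with open(deny_path) as f:
            existing = {line.split()[0] for line in f if line.strip()}
        if username in existing:
            return False  # already blocked; don't duplicate the entry
    entry = f"{username}  # {note}" if note else username
    with open(deny_path, "a") as f:
        f.write(entry + "\n")
    return True

# demo against a temporary deny file
deny = os.path.join(tempfile.mkdtemp(), "deny.txt")
first = block_user(deny, "mallory", "3 countries in 6h")
second = block_user(deny, "mallory")  # no-op: already blocked
```

The idempotence check matters once alerts fire every six hours: the same offender shouldn’t accumulate duplicate entries, and the note doubles as the audit trail mentioned in the Q&A.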

Still lots more to do. Still implementing EZproxy 6.5 and experimenting with the EZproxy blacklist which helps.

Q: How did you decide the parameters?
A: Mostly trial and error, trying to strike a balance between legitimate blocks and false positives. Decided to be reasonably strict.

Q: Have you had any feedback from vendors?
A: Not specifically, but have had a reduction of contacts from vendors about issues.

Q: Have you had feedback from false positives blocked?
A: No, put a note in the deny file. [Another audience member’s had some conversations, students are usually good and good opportunity to hear how they’re using resources.]