Removing barriers to sharing for the benefit of Māori #FigshareFestNZ

Data for whom? Removing barriers to sharing for the benefit of Māori
Dr Kiri Dell, Ngapera Riley, CEO figure.nz

Ref Decolonising Methodologies by Linda Tuhiwai Smith

The academy privileges a certain type of knowing, but indigenous people have other ways as well (which we all use to some degree) eg

  • Sense perception – I felt it
  • Imagination – I envisioned it
  • Memory – I remembered it
  • Inherited – My nanny told me
  • Faith – God told me

Example of using data badly: MOTU (economic research centre) put out research a few years ago comparing Māori and other ethnicities and concluding that collectivist beliefs were holding Māori back from economic success. This was not taken well by Māori…. Researchers made sweeping statements about Māori culture where they had no research; compared Māori to completely different groups (eg African Americans, Chinese) with different histories and belief systems; interpreted through white male lens.

Basho haiku
Pull the wings off a dragonfly and look – you get a red pepperpod!
vs
Add wings to a pepperpod, and look – you get a red dragonfly

Figure.nz set up as charity to democratise data – aims to provide valid and ethical data. Data is important but so is context and people behind it. Draws licensed data from over a hundred sources. Exists for the benefit of Aotearoa so believe if they can get data right for Māori they can get it right for all. Kaupapa that data is for everyone not just experts.

Partnering with Te Mana Raraunga – Māori Data Sovereignty Network; and with nine government agencies who’ve got a lot of data that was never meant to be shared so navigating benefits and dangers of sharing. How will data be used, for whom, why?

Data is never perfect – it’s just one tool alongside experience and connections. Māori data has traditionally not been collected well – cf especially the latest census – so have to be careful about conclusions drawn.

Figure.NZ –  over 44,000 charts and datasets (CSVs and images) around people, travel, health, education, employment, economy, environment, social welfare, technology, broken down by geographic area. Very careful to publish metadata around sources etc.

Original data was a mess so have been working hard to tidy it up.  Check sources, make sure it’s statistically valid (no small datasets) – have a robust process to work with source to make sure the metadata explains methodology and context.

Focused on public aggregate data but starting to use other sources. Wondering how to safely share research data. Excited to see people have started publishing theses etc with CC licensing.

 

Rolling out research for public good #FigshareFestNZ

The bumpy road to rolling out the Index of Multiple Deprivation
Assoc Prof Daniel Exeter

IMD looks at deprivation in different areas eg employment, education, income, housing, access to services. Most rural areas in South Island not very deprived, but the east and north of North Island. In Christchurch also various quintile 5 areas surrounded by quintile 3 areas – very sharp north-west/south-east divide. Can drill down to see what’s driving deprivation in each area.

Want to maximise the impact of the public good money so put up website and made all datasets and reports freely available, easy to download, with full attribution. But first hurdle was getting data out of StatsNZ IDI who are very cautious – had to work closely with them pointing out their data was very aggregated. Published papers with journal who wanted data and wouldn’t accept institutional site they’d painstakingly created. [In fairness, institutions do mess up their websites frequently so it’s not good for longterm storage!] So quickly created a Figshare website – one advantage is other institutions can use API to access this data.

Bumpy ride = mostly interesting challenge. Eg

  • someone creating a survey wanting to know people’s geographical area without asking for actual address to make it easier to manage data. Created something where users could enter address and it’d convert it to the area along with IMD data
  • looking at poor oral health of children per deprivation per ethnicity
  • getting it into policy – mostly from getting emails from people asking how to interpret something. One DHB in North Island interacts with 5 local govt authorities and was asked to create one per authority – lots of work to recreate this at that level but achieved
  • someone using it for climate change – heat vulnerability index including IMD, social isolation, age bands
  • noted some areas ended up not included in census which affects accuracy

Funded by Health Research Council but most uptake through govt, eg also Canterbury Wellbeing Index; Ministry of Education; and Alcohol Health Watch, which lets you type in an address and get a report to make a submission against liquor licenses.

Are big data and data sharing the panacea? It’s fantastic but there are big issues around too: attribution, ethics, risk of data gerrymandering, need to use theory to inform methods. Sometimes people use the IMD in ways that are useful but potentially misleading. Institutions are playing catchup, not yet sure how to deal with this. Hopes the community being built around Figshare etc will help develop solutions and best practices.

 

Developing research data services #FigshareFestNZ

I talked about our implementation and various integrations, including glitches along the road, eg

  • being “piggy in the middle” between Figshare and our ITS trying to troubleshoot CNAME issues without knowing what a CNAME is
  • setting up emails so that Figshare system emails come from a Lincoln address – while our ITS is cracking down on phishing and Figshare emails look to our system like they’re spoofed.
  • our first attempt at an HR feed only sent through academics, not postgrads; now redoing this from scratch with new extract scripts, and Figshare is working out how to deal with the ~ we use to signify someone doesn’t have a first name
  • indexing in LibrarySearch (Primo) using the OAI-PMH feed – also in data.govt.nz and DigitalNZ
  • Next year implementing Repository Tools 2 in Elements which will let researchers deposit datasets from there – doing pre-work to make sure we can have two user IDs in the HR feed: one to integrate with the login system, the other one to integrate with Elements
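For anyone who hasn't met OAI-PMH: harvesting is just an HTTP GET with a `verb` parameter, and the response is XML with Dublin Core inside each record. A minimal sketch of what a harvester does, assuming a hypothetical endpoint at `https://repository.example.org/oai` (not our real feed URL) and the standard `oai_dc` metadata prefix; the sample response is made up for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


# Made-up response in the shape the protocol defines.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also follow the `resumptionToken` for paging, which this sketch leaves out.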

Salila Bryant from ESR talked about the background behind implementing Figshare as an institutional repository, and some of the challenges (including getting support from the other side of the world; getting researchers on board with publishing data; covering privacy, metadata requirements etc)

Katy Miller from Victoria University of Wellington used to have Rosetta for everything but researchers couldn’t interact with it. Ended up choosing Figshare as it interacts with Elements; easy for academic staff to deposit; modern interface; ‘just right’ – looks, well indexed etc; proven solution. Lots of decisions needed to be made to set it up: thought it’d all be simple but turns out not to be as simple as it looks.

  • Groups – most places use academic units, but this is greatly in flux at VUW so are probably going to use publication types as the groups (eg a group for theses etc)
  • Metadata – keeping it light to focus on their key aims
  • Mediation – trying for no mediation between deposit and availability
  • Scope – starting with journal articles and conference papers as this is mainly to support OA mandate. Theses will be next (and will need to add postgrads into the system) – then think about research data down the track.
  • DOIs – grappling with how to handle situations where a preprint is published with a DOI and the published version later becomes available with a different DOI. Currently planning to use the publisher DOI field in Figshare when the published DOI becomes available.

Laura Armstrong, UofAuckland talking about engaging with researchers. Talk about their “data publishing and discovery service” but still at https://auckland.figshare.com. Currently 236 datasets in data.govt.nz; 666 DOIs minted year-to-date.

  • Researchers want DOIs (often not knowing why); discoverability (via Google); browser preview; branding/match website; metrics
  • Have group sites for different use cases eg conferences, research groups
  • Process – usually researcher requests; then discuss use case, what they want to achieve, who’s the owner, etc; then create a group site based on this, and figshare uploads branding overnight; they make it live but not public; once at least one item is published, the site is made public.
  • User types – most people only assigned to one group and can only publish there. Can set people up as admins for multiple groups, they can then publish to any of those groups. For external users, an internal person in group X can create an unpublished project and invite the external person to it: project items then get published to group X.

Simon Porter and Jared Watts, Digital Science looking at visualisation of collaboration networks based on Jupyter Notebook using Figshare API. Making available raw data in CSVs, visualisations, and technical report “This Poster is Reproducible”.

National DOI consortium #FigshareFestNZ

National DOI consortium
Andrea Goethals, National Library of New Zealand

DOIs: persistent and unique identifiers for digital objects. Support the sharing of research as part of creating FAIR data (eg principle F1 “(meta)data are assigned a globally unique and persistent identifier”). Mostly assigned by either DataCite (especially for data and grey literature) or Crossref (especially for journal articles etc).

DataCite’s model requires being a member – either direct member, or consortial – New Zealand has done the latter. Consortial is cheaper, can leverage each others’ skills and expertise, have more influence, and share strategies. Each consortium member might have one or more repositories. Responsibilities: full autonomy creating your own DOIs using their tools; communicating with institution’s researchers; and paying annual fees.

Consortial fees:
Membership fee: 2000 Euro / #members
Repository fee: 500 Euro per repository
DOI fee: 500 Euro up to 10,000
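
To see what that adds up to for one institution, here's a rough worked calculation. It's only a sketch of my notes: I'm assuming the DOI fee is a flat 500 Euro per organisation for anything up to 10,000 DOIs, and the notes don't say what happens above that, so the function refuses to guess. The member and repository counts in the example are made up.

```python
def annual_cost_eur(consortium_members, repositories, dois_minted):
    """Estimate one member's annual cost from the fee notes above.

    Assumptions (not confirmed by the talk): the DOI fee is a flat
    500 Euro per organisation, and the tier covers 0 to 10,000 DOIs.
    """
    if consortium_members < 1:
        raise ValueError("a consortium needs at least one member")
    if dois_minted > 10_000:
        raise ValueError("the notes only give the fee for up to 10,000 DOIs")
    membership_share = 2000 / consortium_members  # shared equally
    repository_fees = 500 * repositories
    doi_fee = 500
    return membership_share + repository_fees + doi_fee


if __name__ == "__main__":
    # Hypothetical: 10 consortium members, 1 repository, 666 DOIs.
    print(annual_cost_eur(10, 1, 666))
```

The point of the consortial model shows up in the first term: the more members join, the smaller each share of the 2000 Euro membership fee.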

Minting DOIs – at UoA can either create manually through a form or uploading a file; or by API automatically.
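
For the API route, my understanding of the DataCite REST API is that you POST a JSON:API document of type `dois` with the metadata as attributes. This sketch only builds that payload (it doesn't send anything), and everything in the example call is a placeholder: the prefix `10.12345`, the URL, the names. It's not UoA's actual integration, and field names should be checked against the DataCite documentation before use.

```python
import json


def datacite_doi_payload(prefix, url, title, creator_names, publisher,
                         publication_year, resource_type_general="Dataset",
                         publish=False):
    """Build a request body for creating a DOI (assumed DataCite shape)."""
    attributes = {
        "prefix": prefix,  # repository's DOI prefix; suffix left to the service
        "url": url,        # landing page the DOI should resolve to
        "titles": [{"title": title}],
        "creators": [{"name": name} for name in creator_names],
        "publisher": publisher,
        "publicationYear": publication_year,
        "types": {"resourceTypeGeneral": resource_type_general},
    }
    if publish:
        # As I understand it, omitting this leaves the DOI as a draft.
        attributes["event"] = "publish"
    return {"data": {"type": "dois", "attributes": attributes}}


if __name__ == "__main__":
    payload = datacite_doi_payload(
        prefix="10.12345",  # placeholder prefix
        url="https://repository.example.org/items/1",
        title="Example dataset",
        creator_names=["Researcher, Example"],
        publisher="Example University",
        publication_year=2019,
    )
    print(json.dumps(payload, indent=2))
```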

Also have an NZDOI Interest Group – open to anyone to join.

Have more information at National Library site.

Building research data services at NeSI #FigshareFestNZ

Building research data services at NeSI
Brian Flaherty, NeSI

NeSI is a collaboration supporting researchers to tackle large problems (ie super-computers). Core services around HPC, consultancy, training. Two supercomputers Mahuika and Māui, a couple dozen staff, covering a range of disciplines.

2011-2014 mostly about computer infrastructure with a little storage and consultancy

2014-2019 shifting

2019-future looking at research platforms, virtual labs, scientific gateways

Data management – mostly want to deal with the active data: collection, pre-processing, analysis and modelling, repurposing pre-publication.

Refreshing its offering around transferring data – looking at a national data transfer platform. Nodes at UoA, NIWA in Wellington, AgResearch in Christchurch, and Dunedin. (There’s a big need for an Australasian platform as lots of data gets sent back eg to the Garvan Institute sequencing laboratory. Lots of hard drives still being shipped.) transfer.nesi.org.nz has a point-and-click transfer interface.

Automated workflows: Genomics Aotearoa doing a project sequencing taonga species eg the kākāpō. DoC was storing data in the cloud in Australia; Ngāi Tahu weren’t happy so have brought it back into NZ stored with NeSI. Genomics Aotearoa Data Repository being developed at NeSI – starting small (“don’t try to boil the ocean”), with downloading, storing, sharing data with group-based access control and group membership management. FAIR – so far at findable (just) and accessible (in that it’s sharable) but still working towards interoperable and reusable.

Indigenous data: a kāhui Māori is working to make sure Māori data is managed within a Māori context so need to map this into security, auditing, permissions process, etc. Work in progress. Underpinned by “Te Mata Ira” guidelines for genomic research with Māori.

Increasingly researchers want the data to be stored in the same place as the compute so it doesn’t have to be transferred backwards and forwards.

Security around sensitive data: have firewalls, multifactor authentication, need to look more at privacy policy and standards around health information security frameworks etc.

Curation: “maintaining, preserving, and adding value to digital data/object through its life cycle”. Eg transforming formats, evaluating for FAIRness. Knowing what to get rid of when storage is stretched.

Metadata: Creating README files – getting stuff out of people’s heads. RO-Crate as a way to package it in a human- and computer-readable way.
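
To make the RO-Crate idea concrete: the package is just a folder with a `ro-crate-metadata.json` file in it, written as JSON-LD. Below is a minimal sketch based on my reading of the RO-Crate 1.1 spec, not anything NeSI showed. The dataset name, file names, and licence URL are invented for illustration.

```python
import json


def minimal_ro_crate(name, description, date_published, license_url, files):
    """Build the contents of ro-crate-metadata.json for a small dataset.

    `files` maps a relative file path to a one-line description.
    """
    graph = [
        {   # Metadata descriptor: says this file describes the root dataset.
            "@type": "CreativeWork",
            "@id": "ro-crate-metadata.json",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the folder itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    for path, file_description in files.items():
        graph.append({"@id": path, "@type": "File",
                      "description": file_description})
    return {"@context": "https://w3id.org/ro/crate/1.1/context",
            "@graph": graph}


if __name__ == "__main__":
    crate = minimal_ro_crate(
        name="Example sequencing run",  # hypothetical dataset
        description="Raw reads plus the README explaining how they were made.",
        date_published="2019-11-01",
        license_url="https://creativecommons.org/licenses/by/4.0/",
        files={"README.md": "How the data was collected",
               "reads.fastq": "Raw reads"},
    )
    print(json.dumps(crate, indent=2))
```

The README still does the human-readable job; the JSON file just repeats the key facts in a form a machine can index.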

Round-up of #anzreg2019 sessions

ANZREG = the Australia / New Zealand Ex Libris User Group (the acronym is historic). This covers topics related to Alma, Primo, Leganto, Esploro, etc etc.

I was (not heavily) involved in organising the conference, and moderated the developers’ day, and my main takeaway from this is that if you have the option to pay $$$ for AV support during a conference, pay it: it’s worth every single cent to have someone there who’s responsible for the mics and livestreaming and remote presentations, and let you focus on the people and timekeeping and stuff.

Day 1

  • I made a terrible strategic decision not to liveblog the keynote “Libraries at the Edge of Reality”. Keynotes are often hard to liveblog and this would have been too but I regret not writing down the first point of Jeff Brand’s “Manifesto for Civilising Digitalisation”. It was – after talking about the respect people have for physical libraries and other spaces; about the grief people feel when eg Notre Dame burnt because they’ve got an emotional connection to it – about making a virtual/digital space that would deserve that same feeling and respect. It left me wondering what kind of website does this? The closest I can think of is Wikipedia maybe?
  • Predicting Student Success with Leganto – library joined an Ex Libris pilot project to see if it’d be possible to predict student success/failure based on reading list interactions. Some limited success but lots of false positives/false negatives. Would need lots more data, and lots of discretion if planning any intervention based on the results.
  • Understanding user behaviour and motivations – turned on “expand my results” by default and got a large increase in interloan requests, especially from first-time users/undergraduates. Big usability improvement.
  • Aligning project milestones to development schedules – introduced Leganto in multiphase project, making various bugfix/enhancement requests along the way
  • Exploring Esploro – had a very unintegrated repository/CRIS system built on manual processes. Esploro eliminates much of this double-handling, has automagic harvesting etc. Researcher still needs to upload full-text themselves but system sends emails.
  • A national library perspective on Alma – lots of original cataloguing, which Alma isn’t strong in. Numerous challenges around this and born-digital items; various workarounds found. Make heavy use of templates.
  • “It should just work”: access to library resources – sponsor presentation on LibKey products which is essentially a redesigned link resolver plugin thing. Possibly a bit heavy reliance on DOIs and PDFs which limits how often it’ll be successful but it’s early days for the product and they seem keen to expand the cases where it’ll work.
  • A briefing from Ex Libris – upcoming improvements to MetaData Editor, CDI, COUNTER 5, Provider Zone content loading, next gen resource sharing, next gen analytics

Day 2

  • “Primo is broken, can you fix it?” – linking issues from Primo. Lots to do with EBSCOhost (partly including a move from EZproxy to SSO for authentication). Also discussed the infamous “No full text error” problem which Ex Libris apparently says is in development.
  • What do users want from Primo?  – very detailed talk on getting evidence on how users use Primo, and what improvements to make as a result. Includes links to survey kits and dataset of analytics.
  • Achieving self-management using Leganto  – Very successful implementation. Started with a small pilot project which helped finetune how they sold it, built their own confidence, and created champions among their userbase. But ultimately seems like their faculty just really like the product (even if they’re not yet using all the functionality). Library is retaining some functions in their control eg rollover.
  • Creating actionable data using Alma Analytics – using various dashboard visualisations to inform a large weeding project. Will share reports in community area.
  • Central Discovery Index – update on CDI from the libraries testing it. Testing only partway through. Some issues found, Ex Libris investigating these. Switchover is planned by July for all customers.

Developers’ Day

  • Primo Workflow Testing with Cypress – I’ve long liked the idea of automated testing, but figured I didn’t have the skills to set it up. With Cypress, which uses JavaScript… I just might. The time is another matter but I think I want to explore it as it could be useful for a lot more systems than just Primo, and give us early warning when things break (instead of us finding out days later when someone gets around to using and/or reporting it).
  • Using APIs to enhance the user experience – using the APIs to create their own user interface over the top of their various Ex Libris products for consistency, usability, robustness (by caching so it covers downtime better). Big investment of time! But makes sense in their context.
  • Harnessing Alma Analytics and R/RShiny for Insights – RShiny for interactive visualisation. Learning curve but powerful (and free!) Their talk showed some cool use cases.
  • You are what you count – another really detailed talk, basic theme being to be strategic about what you count – make metrics fit your strategy, not dictate it.
  • The fight against academic piracy – Splunk with EZproxy data to automate blocking users who fit a pattern of excessive/abnormal downloads. Some false positives but easily resolved and generally results in positive and constructive conversations.
  • rss2oai for harvesting WordPress into Primo – this was my talk, slides not yet live and I obviously didn’t liveblog 🙂 but the code is at https://github.com/LincolnUniLTL/rss2oai At the last minute this morning I realised that I hadn’t included a section on what it actually looks like for users as a result, so hurriedly edited that in; during the session someone asked if we had analytics on how it was used which is another massive oversight I should rectify sometime When I Have Time (and can overcome my hatred of Google Analytics).

The fight against academic piracy #anzreg2019

UniSA Library and the fight against academic piracy
Sam Germein, University of South Australia

Previous method for monitoring abuse of EZproxy was cumbersome and prone to error.

Next used Splunk. Could get a top 10 downloaders; do a lookup on usernames etc. Reduced time to look for unauthorised access, but vendors would still contact them outside of business hours, and block access for the EZproxy server for potentially the whole weekend.

Splunk has a notification function – looking into how to use this.

Eg a report if a username is logging in from three countries or more. (Two countries turned up lots of false positives due to VPNs.) Alerts got sent to Sam by email. Could then block the username.

Looked into other ways it might be more accurate. Still potential situation where student in a country where access was blocked and VPN needed. Added database info to see if they’re hopping between lots of databases, and how much content they’re downloading. All this info built into dashboards so needed to reverse engineer them and get the info into his report.

Another issue – in the weekend getting alerts on phone where couldn’t view spreadsheet. But Splunk could embed the info in the email.

Extended emails to other team members and to their help desk software to log a formal job and make it part of the business workflow. Got IT Helpdesk involved.

Still getting false positives, so looked into only sending the alert if downloaded more than 25MB. Refine how info displayed for wider range of people managing it.
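
The rule as described so far boils down to two thresholds. This isn't their Splunk search (I didn't capture that), just a plain Python sketch of the same logic over already-parsed log events. The `ProxyEvent` fields are hypothetical, and I'm assuming both conditions must hold together and that 25MB means 25 × 1024 × 1024 bytes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyEvent:
    """One parsed proxy log line (hypothetical field names)."""
    username: str
    country: str
    bytes_sent: int


def flag_usernames(events, min_countries=3, min_bytes=25 * 1024 * 1024):
    """Return usernames seen in >= min_countries AND downloading > min_bytes."""
    countries = defaultdict(set)
    downloaded = defaultdict(int)
    for event in events:
        countries[event.username].add(event.country)
        downloaded[event.username] += event.bytes_sent
    return sorted(
        user for user in countries
        if len(countries[user]) >= min_countries
        and downloaded[user] > min_bytes
    )


if __name__ == "__main__":
    mb = 1024 * 1024
    sample = [
        ProxyEvent("alice", "AU", 10 * mb),
        ProxyEvent("alice", "NZ", 10 * mb),
        ProxyEvent("alice", "US", 10 * mb),  # 3 countries, 30MB
        ProxyEvent("bob", "AU", 50 * mb),
        ProxyEvent("bob", "NZ", 50 * mb),    # only 2 countries (VPN?)
        ProxyEvent("carol", "AU", 1),
        ProxyEvent("carol", "NZ", 1),
        ProxyEvent("carol", "US", 1),        # 3 countries, tiny download
    ]
    print(flag_usernames(sample))
```

Both thresholds are parameters because, per the Q&A, the values were found by trial and error.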

Increased frequency to every 6 hours.

Using API could directly write the username to the EZproxy deny file – fully automating the block process. Still getting some false positives but much more on the front foot – they see alerts and contact vendor rather than vice versa.

Still lots more to do. Still implementing EZproxy 6.5 and experimenting with the EZproxy blacklist which helps.

Q: How did you decide the parameters?
A: Mostly trial and error, trying to strike a balance between legitimate blocks and false positives. Decided to be reasonably strict.

Q: Have you had any feedback from vendors?
A: Not specifically, but have had a reduction of contacts from vendors about issues.

Q: Have you had feedback from false positives blocked?
A: No, put a note in the deny file. [Another audience member’s had some conversations, students are usually good and good opportunity to hear how they’re using resources.]

You are what you count #anzreg2019

You are what you count
Rachelle Orodio & Megan Lee, Monash University

Very often we count what’s easy to count, rather than what’s meaningful. Created a project starting with identifying what metrics they should collect.

Principles: metrics should be strategic, purposeful, attributable, systematic, consistent, accurate, secure and accessible, efficient, integrated. Wanted to reflect key library activities.

Identified 35 metrics – 18 were manually recorded into Google Forms, Qualtrics and other temporary storage. All needed to be pulled into one place so it could be cross-referenced, and data visualisations created. Data only valuable if it can be used and shared.

Looked at Tableau, Splunk, Power BI (uni-preferred for use with data warehouse), Excel, OpenRefine, Google Data Studio.

Data sources: Alma/Primo analytics, Google analytics, EZproxy, Figshare, Libcal/LibGuides, the people counter, and custom software, spreadsheets, forms, manual recording. Quarterly email for collection of manual data.

Dashboard in Tableau with eg number of searches in Primo, how many searches produce zero results. Usage of discussion rooms vs availability. Tableau provides sophisticated visualisations, integrates with lots of sources and is great for large datasets. But expensive annual fees, needs a server environment to share reports securely, and not as easy to use as PowerBI.

Power BI example showing reference queries. Easy to learn and most functionality available in free version; full control over the layout; changes reflected immediately from one graph to another eg when you filter to one library. Sharing interactive version, the other person needs a license – or thousands of dollars for a cloud computing license.

Alma Analytics FTP – used for new titles list. Create report, schedule a job, FTP, then process files, upload to LibraryThing to get bookcovers in a carousel.

Project is ongoing. Scoping is important. Lots of info you could present, have to select the key data based on target audience, their needs etc.

Harnessing Alma Analytics and R/RShiny for Insights #anzreg2019

Harnessing Alma Analytics and R/RShiny for Insights
David Lewis & Drew Fordham, Curtin University

Interactive visualisation tools are useful as they let the user choose (within parameters) what they want to see. Alma Analytics was a bit limited. Looked at products like Tableau, which is easy to use but mostly for visualisation (and expensive). R/RShiny is free to install on desktop, more of a learning curve but worth it.

Early successes:

  • in exporting Analytics -> CSV -> clean with R -> reimport into Alma. Weeding project with printouts of the whole collection was highly manual, lots of errors, seemingly endless. With R, ran logic over entire collection and could print targeted pick lists for closer investigation. Massively accelerated deselection.
  • Could also fine-tune the shelving roster over the semester, which saved money.
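
The pick-list idea is simple once the export is a CSV. They did this in R; the sketch below is the same kind of thing in Python, with a deselection rule I've invented for illustration (never borrowed, or not borrowed since a cutoff date). The column names are hypothetical, not actual Alma Analytics field names.

```python
import csv
import io
from datetime import date


def build_pick_list(csv_text, not_borrowed_since):
    """Return rows for items with no loan on or after the cutoff date."""
    picks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        last_loan = row["last_loan_date"].strip()
        # An empty date means the item has never been borrowed.
        if not last_loan or date.fromisoformat(last_loan) < not_borrowed_since:
            picks.append(row)
    # Plain string sort: close enough for a sketch, but not a true
    # call-number sort.
    return sorted(picks, key=lambda r: r["call_number"])


# Made-up export with hypothetical column names.
SAMPLE_EXPORT = """barcode,call_number,title,last_loan_date
1001,QA76.73 .P98,Old programming book,2009-03-14
1002,HD30.2 .D37,Never borrowed,
1003,Z699 .L53,Popular title,2019-06-01
"""

if __name__ == "__main__":
    for item in build_pick_list(SAMPLE_EXPORT, date(2015, 1, 1)):
        print(item["call_number"], item["barcode"], item["title"])
```

Staff then only check the shelves for what's on the list, rather than working through a printout of the whole collection.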

Refurbishment modelling needed to create a low-use compactus collection. Created model of previous semester as if the collection had been shelved that way, to see what would actually need to be moved back and forth. Let people explore parameters. Ended up deciding that there’d be a lot of movement in and out of the open access collection and would still require a lot of staff effort – so needed to make the compactus open access, not closed access.

Getting started with Alma Analytics and Trove API. Started with documentation then experimenting. Found the only match point was the ISBN number. Record structures complex so needed to know which substructures were relevant. Created test SQL schema and started trying test queries. Next phase: took 3-4 days to get all their holdings in Trove. Then started importing into SQL database, Views were cumbersome so created a table from the view and indexed that – which proved a lot faster.
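
The view-to-table trick is worth spelling out, since it's a general one. A sketch using SQLite purely because it ships with Python (they didn't say which database they used), with invented table and column names. The ISBN is the join key, as in the talk.

```python
import sqlite3


def build_shared_holdings(conn):
    """Materialise a slow view into an indexed table for fast lookups."""
    conn.executescript("""
        CREATE TABLE holdings (isbn TEXT, title TEXT);
        CREATE TABLE trove_holdings (isbn TEXT, library TEXT);

        -- The view recomputes the join and aggregate on every query.
        CREATE VIEW shared_view AS
            SELECT h.isbn, h.title,
                   COUNT(DISTINCT t.library) AS other_libraries
            FROM holdings h
            LEFT JOIN trove_holdings t ON t.isbn = h.isbn
            GROUP BY h.isbn, h.title;
    """)
    conn.executemany("INSERT INTO holdings VALUES (?, ?)", [
        ("9780000000001", "Widely held title"),
        ("9780000000002", "Unique title"),
    ])
    conn.executemany("INSERT INTO trove_holdings VALUES (?, ?)", [
        ("9780000000001", "Library A"),
        ("9780000000001", "Library B"),
    ])
    conn.executescript("""
        -- Snapshot the view once, then index the snapshot.
        CREATE TABLE shared_holdings AS SELECT * FROM shared_view;
        CREATE INDEX idx_shared_isbn ON shared_holdings (isbn);
    """)


def other_libraries(conn, isbn):
    """Look up how many other libraries hold this ISBN."""
    row = conn.execute(
        "SELECT other_libraries FROM shared_holdings WHERE isbn = ?", (isbn,)
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_shared_holdings(connection)
    print(other_libraries(connection, "9780000000001"))
```

The trade-off is that the table is a snapshot: it has to be rebuilt whenever the underlying holdings are re-imported.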

Visualisation example with

  • number of libraries with shared holdings – in WA, interstate, or both; at university libraries, other libraries, or both; not borrowed since [date slider input].
  • usage by call number – user can select call number range, not borrowed since, etc.

Expanded their professional networks in the process, and making a lot of impact with their analyses.

Using APIs to enhance the user experience #anzreg2019

Using APIs to enhance the user experience
Euwe Ermita

Live with Primo and Alma in 2017, and Rosetta and Adlib 2017. Trying to customise interfaces to fit user needs and reach parity with previous system.

Adlib (manuscripts, oral history and pictures catalogue) with thumbnails pointing back to Rosetta. Primo doesn’t do hierarchies well but Adlib can show collection in context. But different technology stack – dotnet while their developers were used to other techs, so had to bring in skills.

Still getting lots of feedback that experience is inconsistent between website, catalogue, collection viewer, etc. Viewers would get lost. System performance slow for large collections; downtime for many release dates.

Options:

  • do nothing (and hide from users)
  • configure out of box – but hitting diminishing returns
  • decouple user interfaces (where user interface is separate from the application, connected via web services)

Application portfolio management strategy

  • systems of record – I know exactly what I want and it doesn’t have to be unique (eg Rosetta, Alma) – longer lifespan, maintain tight control
  • systems of differentiation – I know what I want but it needs to be different from competitors (eg Primo, their own website)
  • systems of innovation – I don’t know what I want, I need to experiment (developing their own new interfaces) – shorter lifespan, disruptive thinking

But most important is having a good service layer in the middle.

Lots of caching so even if Alma/Primo go down can still serve a lot of content.

Apigee API management layer – an important feature is the response cache so API responses get stored ‘forever’ – cuts response time to 1/180, and cuts load on back-end systems, avoiding hitting the API limit. Also handy to have this layer if you want to make your data open as whatever system you have behind the scenes, the links you give users don’t change; can also give customised API to users (rather than giving them a key to your backend system).
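
The caching idea doesn't depend on Apigee. Here's a generic sketch of a response cache that also serves stale content when the back end is down, which is how I understood the "still serve content during downtime" point. The class and parameter names are mine, not anything from Apigee or from the talk.

```python
import time


class CachingGateway:
    """Sits between clients and a back-end fetch function."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch      # callable: key -> response
        self._ttl = ttl_seconds  # how long a response counts as fresh
        self._clock = clock
        self._cache = {}         # key -> (stored_at, response)

    def get(self, key):
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # fresh hit: back end not touched
        try:
            response = self._fetch(key)
        except Exception:
            if cached is not None:
                return cached[1]  # back end down: serve the stale copy
            raise                 # nothing cached, so pass the error on
        self._cache[key] = (now, response)
        return response


if __name__ == "__main__":
    backend_up = {"value": True}

    def slow_backend(record_id):
        if not backend_up["value"]:
            raise ConnectionError("back end is down")
        return f"record:{record_id}"

    gateway = CachingGateway(slow_backend, ttl_seconds=0)
    print(gateway.get("42"))     # fetched from the back end
    backend_up["value"] = False
    print(gateway.get("42"))     # served from cache despite the outage
```

Because clients only ever talk to the gateway, the back-end system can be swapped without the URLs they use changing, which was the other benefit mentioned.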

MASA – Mesh App and Service Architecture. Want to get rid of point-to-point integrations as if one point changes, you have to update all your integrations. Instead just update the single point-to-mesh connection.

Have done an internal prototype release, looking at pushing out to public end of this year/early next year.

Takeaways:

  • Important to have an application strategy – use systems for their strengths (whether that’s data or usability)
  • Don’t over-customise systems of record: it creates technical debt. Every time there’s an upgrade you have to re-test, re-customise
  • Play with API mediation/management – lots of free tools out there
  • Align technology with business strategy