Integrating DSpace #or2017

Abstracts

Harvesting a Rich Crop: Research Publications and Cultural Collections in DSpace by Andrew Veal, Jane Miller

Currently DSpace v3.2, Repository Tools 1.7.1; upgrading to DSpace v5.6, RT 1.7.4

Wanted an independent identity for each major collection area, especially research publications and cultural collections, and to avoid weirdly mixed search results – so decided on a multi-tenancy approach: four repositories on four domains, allowing customisations appropriate to specific collections.

  • research publications (via Elements and self-deposit for theses)
  • cultural collections (digitised; populated by OAI from archives collection and by bulk ingests via csv)
    • 77,000 records: PDFs, images, architectural drawings, complete books, audio, video – which require specific display options. Collections are based on ownership/subject. Files are stored in an external archive, with metadata stored in DSpace linking back to the file; a thumbnail is generated from the file.
    • AusStage pilot project – relational index (contributors, productions) linked with digital assets (reviews, photos, video). So eg an event record has a “digital assets” link which brings back a search based on an id shared by related records.
    • Created custom “melbourne.sorting.key” field to enable different sort orders eg for maps where date of accession is irrelevant.
  • coursework resources (eg past exams; architectural drawings for a specific course) – no sitemap or OAI feed
  • admin collections (for ERA)
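The bulk ingests via CSV mentioned above presumably use DSpace’s batch metadata editing format. A minimal sketch (the handle, collection ID and values here are made up for illustration; `+` marks a new item and `||` separates multiple values):

```csv
id,collection,dc.title,dc.date.issued,dc.contributor.author
+,123456789/10,"Plan of the Quadrangle",1854,"Smith, John||Jones, Mary"
```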

Couldn’t have done it without a service provider (Atmire). Have done lots of business analysis to specify what they want, for Atmire to set up. The downside of success is that stakeholders now think it’s easy to fix anything!

Future:

  • develop gallery/lightbox interface
  • upgrade to 5.6; improve Google Scholar exposure
  • OAI harvesting of additional cultural collections
  • look at thesis preservation via Archivematica

DSpace in the centre by Peter Matthew Sutton-Long

Acknowledges Dr Agustina Martinez-Garcia who did much of the integrations work

[Follows up a bit on Arthur Smith’s presentation earlier so I won’t repeat too much background from there.] Before integration, they had separate systems for OA publication and research dataset submissions, e-thesis submissions, the Apollo repository, and a CRIS system for REF. This meant a lot of copy-pasting for admins from the manual submission form into the repository submission form. And researchers had to enter data in the CRIS (Elements) as well as submitting to the repository! It was also hard to report on, e.g., author collaborations.

Approved June 2016 to integrate things to meet OA requirements, monitor compliance, help researchers share data, allow electronic deposit of theses, and integrate systems with community-driven standards for the dissemination of research activities data.

Items deposited in Elements go to the repository via the Repository Tools connector (though not all files are passed through). An e-theses system feeds into the repository too. Zendesk is also integrated – any deposit creates a Zendesk ticket, which can be used for communication with researchers.
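The Zendesk side of an integration like this could be as simple as creating a ticket through Zendesk’s Tickets API (`POST /api/v2/tickets.json`) whenever a deposit lands. A minimal sketch – the item fields and the payload shape around them are my own illustration, not Cambridge’s actual code:

```python
import json

def build_deposit_ticket(item):
    """Build a Zendesk ticket payload for a new repository deposit.
    The `item` keys (title, handle, depositor_email) are hypothetical."""
    return {
        "ticket": {
            "subject": f"New deposit: {item['title']}",
            "comment": {"body": f"Item {item['handle']} was deposited and awaits review."},
            "requester": {"email": item["depositor_email"]},
            "tags": ["repository", "deposit"],
        }
    }

payload = build_deposit_ticket({
    "title": "Example dataset",
    "handle": "123456789/42",
    "depositor_email": "researcher@example.ac.uk",
})
# The payload would then be POSTed as JSON to /api/v2/tickets.json.
print(json.dumps(payload, indent=2))
```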

Researchers can work with a single system. They can add grants and links to publications, link to their ORCID profiles (though they don’t seem to want to), obtain DOIs for every dataset and publication (so some people submit old data just to get this DOI; or submit data early, or submit a placeholder to get a DOI they can cite in their article).

Fewer systems for team to access and manage, enhanced internal workflows.

In future want to integrate VIVO.

DSpace for Cultural Heritage: adding support for images visualization, audio/video streaming and enhancing the data model by Andrea Bollini, Claudio Cortese, Giuseppe Digilio, Riccardo Fazio, Emilia Groppo, Susanna Mornati, Luigi Andrea Pascarelli, Matteo Perelli

DSpace-GLAM built by 4Science as an extension to DSpace, which started from discussions around challenges faced by digital humanities. Have to deal with different typologies, formats, structures, scales – and that’s only the first level of complexity. In addition, most data are created/collected by people (not instruments) so affected by personality, place, time, and may be fragmentary, biased. Has to be analysed with contextual information.

How to do this in a digital library management system? Need tools for:

  • modelling, visualising, analysing – quantitatively and qualitatively, and collaboratively
  • highlighting relationships between data
  • explaining interpretations
  • entering the workflow/network scholars are working in

DSpace-GLAM built on top of DSpace and DSpace-CRIS.

  • Flexible/extensible data model – persons, families, events, places, concepts. When you create a “creator-of” relationship, it automatically creates the inverse “created-by” relationship. Can be extended to work with special metadata standards. By setting these up you can see relationships between people, events, etc.
  • with various add-ons
    • IIIF compliant image viewer addon with presentation API, image API, search API, authentication API coming soon. Gives a “See online” option (instead of just downloading) which shows the image, or PDF, or… in an integrated Universal Viewer player: a smooth interaction with the image, alongside metadata about the object, and linking with the OCR/transcription (including right-to-left writing systems). Sharing and reusing with proper attribution.
    • Audio/video streaming with an open source stack: transcoding, adaptive streaming, the MPEG-DASH standard. The DASH standard protocol lets you share video (with server-side support for features like zooming) while the content itself stays in the digital library – so you keep complete access to statistics and ensure ownership stays visible.
    • Visualising and analysing datasets by integrating with CKAN to use grids, graphs, maps.
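The automatic inverse relationships in the data model can be sketched like this – the storage and the inverse-predicate table below are my own simplification for illustration, not 4Science’s implementation, and the entity names are made up:

```python
# Map each predicate to its inverse, so creating one direction
# automatically creates the other ("creator-of" <-> "created-by").
INVERSE = {
    "creator-of": "created-by",
    "created-by": "creator-of",
    "member-of": "has-member",
    "has-member": "member-of",
}

relations = set()  # simplistic store of (subject, predicate, object) triples

def add_relation(subject, predicate, obj):
    """Record a relation and its inverse so both directions are queryable."""
    relations.add((subject, predicate, obj))
    relations.add((obj, INVERSE[predicate], subject))

add_relation("Jane Architect", "creator-of", "College drawings")
```

Querying either entity now surfaces the link, which is what lets the UI show relationships between people, events, etc. from a single declaration.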

Extending DSpace #or2017

Abstracts

Archiving Sensitive Data by Bram Luyten, Tom Desair

Unfortunately not all repository content is equally open – there are different risks and considerations.

They have set up metadata-based access control. DSpace authorisation works okay for dozens of groups, but doesn’t scale to an entire country where each person is effectively their own authorisation group – which is what they needed. They needed an exact match between a social security number/email address on the ePerson and in the item metadata.

  • Advantages: Scales up massively – no identifiable limits on number of ePeople/items, no identifiable effect on performance. Can be managed outside of DSpace so both people and items can be sourced externally.
  • Disadvantages: your metadata becomes even more sensitive. And the rights to modify metadata also gives you the rights to edit authorisations.
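The core of metadata-based access control as described can be sketched in a few lines – the field name `local.access.identifier` and the ePerson attributes below are illustrative assumptions, not Atmire’s actual schema:

```python
def can_access(eperson, item_metadata):
    """Grant access iff an identifier on the ePerson exactly matches a
    value in the item's access-control metadata field."""
    allowed = item_metadata.get("local.access.identifier", [])
    return eperson.get("email") in allowed or eperson.get("ssn") in allowed

item = {"local.access.identifier": ["alice@example.be"]}
print(can_access({"email": "alice@example.be"}, item))  # matching identifier
print(can_access({"email": "bob@example.be"}, item))    # no match
```

This also makes the trade-off above concrete: whoever can edit `local.access.identifier` values is effectively editing authorisations.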

Strategies for dealing with sensitive data:

  • Consider probability and impact of each risk eg
    • Data breach
      • Impact – higher if you’re dealing with sensitive data
      • Probability – lower if it’s harder for people to access your system and security updates are frequent
    • Losing all data
      • Impact – high if dealing with data that only exists in one place
      • Probability – depends on how you define “losing” and “all”. Different scenarios have different probabilities

Code on https://github.com/milieuinfo/dspace54-atmire/ (but documentation in Dutch)

(In Q&A: DSpace used basically as back-end, users would only access it through another front-end.)

Full integration of Piwik analytics platform into DSpace by Jozef Misutka

User statistics are really important if defined reasonably and interpreted correctly. There are differences between lines in your access logs, item views, and bitstream views (which include community logos) – and between excluding bots and identifying repeat visitors.

DSpace stats

  • internal based on logs
  • SOLR (time, ip, useragent, epersonid, (isBot), geo, etc.) – specifics for workflow events and search queries; visits vs views confusion
  • ElasticSearch (time, ip, user agent, geo, (isBot), dso) – deprecated in v6
  • Google Analytics, plugins

DSpace’s stats are good for local purposes but lack a lot of functionality for understanding user behaviour.

Had various requirements – reasonable, comparable statistics; separate stats for machine interfaces; ability to analyse user behaviour, search engine reports; export and display for 3rd parties; custom reports for specific time/user slices.

Used the Piwik Open Analytics Platform (similar to Google Analytics but you own it – which of course means one more web app to maintain). Integrated via the DSpace REST API or via the lindat piwik proxy. Users can sign up for monthly reports (but this needs more work, and users need to understand the definitions better).

In DSpace, when a user visits a webpage, Java code is executed on the server, which triggers a statistics update. With Piwik, the Java code executes and returns HTML with images/JavaScript, and when that is executed in the browser it triggers the statistics update – potentially capturing a lot more information. This also excludes most bots, which don’t execute JavaScript or load images.
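For reference, the client-side tracking being described is the standard Piwik snippet embedded in the returned HTML – roughly the following, with the server URL and site ID as placeholders. Because it only fires when the browser executes the JavaScript, most bots never register:

```html
<script>
  var _paq = window._paq = window._paq || [];
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u = "//piwik.example.org/";  // placeholder: your Piwik server
    _paq.push(['setTrackerUrl', u + 'piwik.php']);
    _paq.push(['setSiteId', '1']);   // placeholder: your site ID
    var d = document, g = d.createElement('script'),
        s = d.getElementsByTagName('script')[0];
    g.async = true; g.src = u + 'piwik.js';
    s.parentNode.insertBefore(g, s);
  })();
</script>
```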

If you change how you count stats, include a transition period where you report both the old method and the new method.

Code at: https://github.com/ufal/clarin-dspace

Beyond simple usage of handles – PIDs in DSpace by Ondřej Košarko

Handles are often used as a metadata field for the reader to refer back to the item, or for relations between items. But while a human can click on the handle link, a machine doesn’t know what it’s for or what’s on the other end of it.

Ideally a PID is all you need: you can ask ‘the authority’ not just where to go but what’s on the other end.

Handles can do more things:

  • split up the id and location of a resource
  • content negotiation – different resource representation based on Accept header (which is passed by hdl.handle.net proxy)
  • template/parts handles – register one base handle but use it with extensions eg http://hdl.handle.net/11372/LRT-22822@XML or http://hdl.handle.net/11372/LRT-22822@format=cmdi – can refer to different bitstreams, or to different points in audio/video
  • get metadata based on handle without going right to the landing page – eg to get info in json, or generate citations, or show a generated landing page on error, or…
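Getting information from a handle without visiting the landing page can go through the Handle proxy’s REST API (`GET https://hdl.handle.net/api/handles/<prefix>/<suffix>`), which returns the handle’s value records as JSON. A sketch of parsing such a response – the sample below is abbreviated and the target URL in it is made up:

```python
import json

# Abbreviated, made-up sample of a Handle API response for 11372/LRT-22822.
sample = json.loads("""
{
  "responseCode": 1,
  "handle": "11372/LRT-22822",
  "values": [
    {"index": 1, "type": "URL",
     "data": {"format": "string",
              "value": "https://repository.example.org/handle/11372/LRT-22822"}}
  ]
}
""")

def resolve_url(record):
    """Pull the URL value record out of a Handle API response."""
    for v in record["values"]:
        if v["type"] == "URL":
            return v["data"]["value"]
    return None
```

In a real call you would fetch the JSON over HTTPS first; the point is that the machine gets structured data about the handle rather than just a redirect.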

The Request a Copy Button: Hype and Reality by Steve Donaldson, Rui Guo, Kate Miller, Andrea Schweer

Trying to keep the IR as close to 100% full-text as possible; DSpace 5 XMLUI, Mirage 2, now fed from the CRIS (Elements).

The request-a-copy button is designed to give researchers an alternative way to access restricted (e.g. embargoed) content. The reader clicks the button, which initiates a request to the author; if the author grants the request, the files are emailed to the requester. The idea is that authors sharing work one-to-one is covered by a) the tradition of scholarly sharing and b) fair dealing under copyright.

Why? Embargoes auto-lift but there was no indication of when that would happen. They are working on displaying the auto-lift date but wanted to embrace other options too. They did need to make some tweaks to the default functionality.

Out of the box there are two variants: a) request the author directly (but all submissions come from Elements!) or b) email an admin, who can review the request, override the author email address, and so on. They used this latter ‘helpdesk’ variant, but with some steps cut out (and admins copied in).

So admins can use local knowledge of who to contact, intercept spam requests, aren’t responsible for granting access (to avoid legal issues).

Other tweaks: show the file release date; wording tweaks; some behaviour tweaks (don’t ask whether they want all files if there’s only one; don’t ask the author to make the file OA, because in this case the embargo can’t be overridden).

Went live – and got lots of very dodgy requests, even for items that didn’t exist. So put in tweaks to cut down on spam: ensure requests actually come via the form, not a crawler; respond appropriately to nonsense requests (for non-existent/deleted files); avoid oversharing, e.g. files of withdrawn items or files not in the ORIGINAL bundle.
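One way to ensure requests come via the form rather than a crawler is a signed, time-limited token embedded in the form and verified on submission. This is purely my illustration of the idea, not the implementation described in the talk, and the item ID is a made-up handle:

```python
import hmac
import hashlib
import time

SECRET = b"change-me"  # placeholder server-side secret

def make_token(item_id, issued=None):
    """Issue a token tied to the item and the time the form was rendered."""
    issued = int(time.time()) if issued is None else issued
    sig = hmac.new(SECRET, f"{item_id}:{issued}".encode(), hashlib.sha256).hexdigest()
    return f"{issued}:{sig}"

def check_token(item_id, token, max_age=3600, now=None):
    """Accept only tokens that verify for this item and aren't stale."""
    try:
        issued_s, sig = token.split(":")
        issued = int(issued_s)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{item_id}:{issued}".encode(), hashlib.sha256).hexdigest()
    now = int(time.time()) if now is None else now
    return hmac.compare_digest(sig, expected) and 0 <= now - issued <= max_age

tok = make_token("123456789/49")
```

A crawler POSTing to the endpoint without rendering the form has no valid token, so its request can be rejected before any email is sent.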

Went fully live. Admins are informed of requests as they’re made and approved, and counts of requests made/approved/rejected were added to admin statistics.

Live a year

  • 9% of publications (49 items) have the button
  • 18 requests made – mostly local subject matter, majority from personal e-mail addresses
  • 8 approved, 0 denied (the rest presumably ignored – though in one case the author misread the email and sent the item manually) – personal messages seemed to have a higher chance of success. One item was requested twice

So hasn’t revolutionised service but food for thought.

  • Add reminders for outstanding requests
  • Find out why authors didn’t grant access, maybe redesign email
  • Extend feature to facilitate control over sensitive items
  • Extend to restricted theses (currently completely hidden)

One nice comment from an academic: “great service, really useful”.

(ANU also implemented it – some authors love it, some hate it; those who hate it can have the contact email address changed to the repository admin’s.)

(Haven’t yet added it to metadata-only records, as it’d mean the author would have to find the file. On the other hand it would be a great prompt: ‘Oh, I should find the file’. Another university has done this but had to disable it for some item types due to a flood of requests.)