
Extending DSpace #or2017


Archiving Sensitive Data by Bram Luyten, Tom Desair

Unfortunately not all repository content is equally open – different risks and considerations apply.

They set up metadata-based access control. DSpace authorisation works okay for dozens of groups, but doesn’t scale to an entire country where each person is effectively their own authorisation group – which is what they needed. Their approach requires an exact match between a social security number/email address on the eperson and on the item metadata.

  • Advantages: Scales up massively – no identifiable limits on number of ePeople/items, no identifiable effect on performance. Can be managed outside of DSpace so both people and items can be sourced externally.
  • Disadvantages: your metadata becomes even more sensitive. And the rights to modify metadata also give you the rights to edit authorisations.
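As a rough sketch of the matching rule (illustrative only – the real implementation is Java inside DSpace/Atmire code, and these field names are invented):

```typescript
// Illustrative sketch of metadata-based access control – NOT the actual
// DSpace/Atmire Java code; the field names here are hypothetical.
interface EPerson { email: string; ssn?: string; }
interface Item { accessEmails: string[]; accessSsns: string[]; }

// Grant access only on an exact match between a value on the eperson
// and a value in the item's (now sensitive!) access metadata.
function canRead(user: EPerson, item: Item): boolean {
  return item.accessEmails.includes(user.email) ||
         (user.ssn !== undefined && item.accessSsns.includes(user.ssn));
}
```

Because the grant lives in item metadata, anyone with metadata-edit rights can change authorisations – exactly the disadvantage noted above.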

Strategies for dealing with sensitive data:

  • Consider probability and impact of each risk eg
    • Data breach
      • Impact – higher if you’re dealing with sensitive data
      • Probability – lower if it’s harder for people to access your system and security updates are frequent
    • Losing all data
      • Impact – high if dealing with data that only exists in one place
      • Probability – depends on how you define “losing” and “all”. Different scenarios have different probabilities

Code on https://github.com/milieuinfo/dspace54-atmire/ (but documentation in Dutch)

(In Q&A: DSpace used basically as back-end, users would only access it through another front-end.)

Full integration of Piwik analytics platform into DSpace by Jozef Misutka

User statistics are really important if defined reasonably and interpreted correctly. There are differences between lines in your access logs, item views, and bitstream views (which include community logos) – and questions of excluding bots and identifying repeat visitors.

DSpace stats

  • internal based on logs
  • SOLR (time, ip, useragent, epersonid, (isBot), geo, etc) – specifics for workflow events and search queries; visits vs views confusion
  • ElasticSearch (time, ip, user agent, geo, (isBot), dso) – deprecated in v6
  • Google Analytics, plugins

DSpace stats are good for local purposes but lack a lot of functionality for analysing user behaviour.

Had various requirements – reasonable, comparable statistics; separate stats for machine interfaces; ability to analyse user behaviour, search engine reports; export and display for 3rd parties; custom reports for specific time/user slices.

Used the Piwik Open Analytics Platform (similar to Google Analytics but you own it – which of course means one more web app to maintain). Integrate via the DSpace REST API or via the lindat piwik proxy. Users can sign up for monthly reports (but it needs more work, and users need to understand the definitions better).

In DSpace, when a user visits a webpage, Java code is executed on the server, which triggers a statistics update. With Piwik, the Java code executes and returns HTML with images/JavaScript, and when that’s executed in the browser it triggers the statistics update – potentially including a lot more information. This also excludes most bots, which don’t execute JS or view images.
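The mechanism above boils down to the browser firing a tracking request once the page’s JavaScript actually runs. A minimal sketch of building such a request, using parameter names from Piwik’s documented HTTP tracking API (the Piwik host and site id are placeholders, and this is not the lindat proxy code):

```typescript
// Sketch: the kind of tracking request piwik.js fires once the page's
// JavaScript actually executes. Parameter names are from Piwik's HTTP
// tracking API; the host and site id are made-up placeholders.
function piwikTrackingUrl(piwikBase: string, siteId: number,
                          pageUrl: string, pageTitle: string): string {
  const params = new URLSearchParams({
    idsite: String(siteId),   // which site in the Piwik instance
    rec: "1",                 // record this hit
    url: pageUrl,             // page being viewed
    action_name: pageTitle,   // human-readable page title
  });
  return `${piwikBase}/piwik.php?${params}`;
}
```

Since this request is only ever sent by a real browser executing JavaScript, most bots simply never show up in the counts.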

If you change how you count stats, include a transition period where you report both the old method and the new method.

Code at: https://github.com/ufal/clarin-dspace

Beyond simple usage of handles – PIDs in DSpace by Ondřej Košarko

Handles are often used as a metadata field for the reader to use to refer back to the item, or for relations between items. But while a human can click on the handle link, a machine doesn’t know what it’s for or what’s on the other end of it.

Ideally a PID is all you need: you can ask ‘the authority’ not just where to go but what’s on the other end.

Handles can do more things:

  • split up the id and location of a resource
  • content negotiation – different resource representation based on Accept header (which is passed by hdl.handle.net proxy)
  • template/parts handles – register one base handle but use it with extensions eg http://hdl.handle.net/11372/LRT-22822@XML or http://hdl.handle.net/11372/LRT-22822@format=cmdi – can refer to different bitstreams, or to different points in audio/video
  • get metadata based on handle without going right to the landing page – eg to get info in json, or generate citations, or show a generated landing page on error, or…
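A sketch of what template/parts handles look like from the client side – plain string handling, nothing DSpace-specific, with the content-negotiation call shown only as a comment:

```typescript
// Sketch: expanding a template handle with a part/extension, as in the
// LRT-22822 examples above. Plain string handling, nothing DSpace-specific.
const HDL_PROXY = "http://hdl.handle.net";

function handleUrl(handle: string, part?: string): string {
  // One base handle is registered; extensions are appended after '@'.
  return part ? `${HDL_PROXY}/${handle}@${part}` : `${HDL_PROXY}/${handle}`;
}

// Content negotiation: the hdl.handle.net proxy passes the Accept header
// through, so a client could do e.g.
//   fetch(handleUrl("11372/LRT-22822"), { headers: { Accept: "application/json" } })
// and get metadata instead of the human landing page.
```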

The Request a Copy Button: Hype and Reality by Steve Donaldson, Rui Guo, Kate Miller, Andrea Schweer

Trying to keep the IR as close to 100% full-text as possible; DSpace 5 XMLUI, Mirage 2, now fed from CRIS Elements.

Request a copy button designed to give researchers an alternative way to access restricted (eg embargoed) content. Reader clicks button, initiates request to author; if author grants request, files are emailed to requester. Idea is that authors sharing work one-to-one is covered by a) tradition of sharing and b) fair dealing under copyright.

Why? Embargoes auto-lift but there’s no indication of when that’ll happen. Working on indicating the auto-lift date but wanted to embrace other ways too. Did need to make some tweaks to the default functionality.

Out of the box there are two variants: a) request the author directly (but all submissions come from Elements!) or b) email to an admin who can review it, overwrite the author email address, and so on. They used this latter ‘helpdesk’ variant but cut out some steps (and copied in admins).

So admins can use local knowledge of who to contact, intercept spam requests, aren’t responsible for granting access (to avoid legal issues).

Other tweaks: show the file release date; wording tweaks; some behaviour tweaks (don’t ask if they want all files if there’s only one; don’t ask the author to make the file OA because in our case the embargo can’t be overridden).

Went live – and got lots of very dodgy requests, even for items that didn’t exist. So they put in tweaks to cut down on spam: ensure requests actually come via the form, not a crawler; respond appropriately to nonsense requests (for non-existent/deleted files); avoid oversharing, eg files of withdrawn items or files not in the ORIGINAL bundle.

Went live-live. Admins are informed of requests as they’re made and approved, and counts were added to admin statistics (requests made/approved/rejected).

Live a year

  • 9% of publications (49 items) have the button
  • 18 requests made – mostly local subject matter, majority from personal e-mail addresses
  • 8 approved, 0 denied (presumably ignored – though once an author misread the email and manually sent the item) – it seemed like personal messages may have had a higher chance of success. One item was requested twice

So hasn’t revolutionised service but food for thought.

  • Add reminders for outstanding requests
  • Find out why authors didn’t grant access, maybe redesign email
  • Extend feature to facilitate control over sensitive items
  • Extend to restricted theses (currently completely hidden)

One good comment from an academic: “great service, really useful”.

(ANU also implemented it – some authors love it, some hate it; for those who hate it, the contact email address is changed to the repository admin’s.)

(Haven’t yet added it to metadata-only records – as it’d mean the author would have to find the file. OTOH it would be a great prompt: ‘Oh, I should find the file’. Another uni has done this but had to disable it for some types due to a flood of requests.)

Getting started with Angular UI development for DSpace #OR2017

by Tim Donohue and Art Lowel; session overview; Wiki with instructions for setup; demo site

[So my experience started off inauspiciously because I have an ancient version of MacOS so installing Node.js and yarn ran into issues with Homebrew and Xcode developer tools and I don’t know what else, but after five and a half hours I got it working. I then left all my browser and Terminal windows open for the next several days on the “if it ain’t broke” principle….]

Angular in DSpace

Angular 4.0 came out March 2017 – straight after conference there’ll be a sprint to get that into DSpace. (Angular tutorial) How Angular works with DSpace:

  • user goes to website
  • server returns first page in pre-compiled html, and javascript
  • user requests data via REST
  • API (could be hosted on different server than Angular) returns JSON data

With Angular Universal (available in Angular 2, packaged in Angular 4), it can still work in a browser that doesn’t have JavaScript (search engine crawler, screen reader, etc). Essentially, if the Angular app doesn’t load, your browser requests the page instead of the JSON, so the server returns the pre-compiled HTML again.

It caches API replies and objects client-side in your browser, so it’s very quick to return to previously seen pages.

Building/running Angular apps

  • node.js – server-side JS platform (can provide pre-compiled html)
  • npm – Node’s package manager (pulls in dependencies from registry)
  • yarn – third-party Node package manager (same config, faster)
  • TypeScript language – extension of ES6 (latest javascript – adds types instead of generic ‘var’) – gets compiled down by Angular to ES5 javascript before it gets sent to the browser
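A quick illustration of what those types buy you over a generic ES5 ‘var’ (names invented for the example):

```typescript
// ES5:         var title = "Extending DSpace";   // type is anyone's guess
// TypeScript:
const title: string = "Extending DSpace";

interface Bitstream { name: string; sizeBytes: number; }

function totalSize(files: Bitstream[]): number {
  // The compiler checks sizeBytes really is a number – before any of
  // this is compiled down to plain ES5 for the browser.
  return files.reduce((sum, f) => sum + f.sizeBytes, 0);
}
```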

You write angular applications by

  • composing html templates with angularized markup – almost all html is valid; can load other components via their selector; components have their own templates
  • writing component classes to manage those templates – lets you create new html tags that come with their own code and styling; consist of view (template) and controller. Implements interfaces eg onInit; extends another component; has a <selector>; has a constructor defining inputs; has a template. Essentially a component has a class and a template.
  • adding app logic in services – retrieve data for components, or operations to add or modify data – created once, used globally by injecting into component
  • boxing component(s) and optionally service(s) in modules – useful for organising app into blocks of functionality – would use this for supporting 3rd-party DSpace extensions (however business logic would be dealt with in REST API not in the Angular UI)
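Putting the template/class/selector pieces together, a component is roughly this shape. This sketch inlines the metadata as static fields so it stands alone; in real Angular code they’d live in a @Component({…}) decorator imported from @angular/core, and the names here are invented:

```typescript
// Roughly the shape of an Angular component as described above.
// In real code, selector and template go in a @Component({...})
// decorator; they're inlined here so the sketch is self-contained.
class ItemTitleComponent {
  static selector = "ds-item-title";          // usable as <ds-item-title>
  static template = "<h1>{{ title }}</h1>";   // the component's own view

  title: string;

  constructor(title: string) {                // "a constructor defining inputs"
    this.title = title;
  }
}
```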

DSpace-angular folder structure

  • config/
  • resources/ – static files eg i18n, images
  • src/app/ – each feature in its own subfolder
    • .ts – component class
    • .html – template
    • .scss – component style
    • .spec.ts – component specs/test
    • .module.ts – module definition
    • .service.ts – service
  • src/backend/ – mock REST data
  • src/platform/ – root modules for client/server
  • src/styles/ – global stylesheet
  • dist/ – compiled code


[Here we got into the first couple of steps from the Wiki/gitHub project linked from there.]

Random XMPPHP note with autobiographical footnote

If you happen, for Sekrit Reasons, to be playing with XMPPHP and you get the error message:

Fatal error: Cannot access protected property XMPPHP_XMPP::$disconnected

What you need to do is go into XMPPHP/XMLStream.php and change line 85 from

protected $disconnected = false;

to

public $disconnected = false;

It’s possible if not likely that people who know more PHP than I(1) could have figured this out for themselves. But I had to quack it, and the only answer was in a Spanish-language forum (¡muchas gracias!), so I figured it may be worth putting a translation in front of the search engines for folk who didn’t happen to study Spanish for a few years there.

(1) I went through a stage of learning whatever human languages I could get my hands on, and every few years make an attempt on a new Latin grammar, but I was always most successful when there was a purpose to the learning, like reading novels or translating a Star Trek episode into French or talking to the girl in Mongolia who alerted me to my pocket having just been picked.

When however it comes to computer languages, I’ve never been sufficiently motivated to learn something just for the heck of it, or to embark on a sufficiently gigantic project that my ordinary task-oriented learning methods have accrued much more knowledge than the basics. If I knew more I’d probably use it a little more, but with the kinds of things I tackle, what I know is generally enough to either:

  1. solve the problem;
  2. help me figure out how to quack the problem; or
  3. decide that I didn’t care that much about the problem anyway.

Which I feel is a valid solution, given how many other things there are to do in the world than just code.

My first foray into coding with open data

My first foray into coding with someone else’s data would probably have been when I created some php and a cron job to automatically block-and-report-for-spam any Twitter account that tweeted one of three specific texts that spammers were flooding on the #eqnz channel. I really don’t want to work with Twitter’s API, or more specifically with their OAuth stuff, ever again.

So my first foray that I enjoyed was with the Christchurch Metroinfo (bus) data – specifically the real-time bus arrival data (link requires filling out a short terms and conditions thing but then the data’s free under a CC-BY license). For a long time I’ve used this real-time information to keep an eye on when I need to leave the house to reach my stop in time for my bus. But if I’m working in another window and get distracted, or traffic suddenly speeds up, I can still miss it. I wanted a web app that’d give me an audio alert when the bus came in range.

Working with the data turned out to be wonderfully easy. A bit of googling yielded information about SimpleXML, and I knew enough PHP to use it. There was an odd glitch when I uploaded my code – which worked perfectly fine on my computer – to my webserver, where a slightly older version of PHP for some reason required an extra step to parse the attributes ECan use in their XML. But once I worked out what was going on, that was an easy fix too.

Then I did a whole bunch of fiddling with the CSS and HTML5, and the SQL is a whole nother story; and then I uploaded the source code to GitHub; and eventually even remembered to cite the data, whoops.

So now I have:

online, and I’m already starting to think about what other open data projects might be out there waiting for me.

(And now that the development phase is over and I’m using the thing live, I think my cat is starting to recognise that when this particular bird song plays, I’m about to leave the house.)

My reo Māori dictionary lookup bookmarklet

So sometimes (especially during Te Wiki o te Reo Māori) I’m reading stuff on the web and come across a kupu hou I don’t recognise and want to look up. I used to select it, open a new tab, type in http://www.maoridictionary.co.nz/, wait for it to load, and paste the word in.

Then I went on a javascript bookmarklet spree and among the simple bookmarklets I made (aka created in a Frankensteinian mashup of at least two other people’s unrelated bookmarklets) was this one:

He aha?

Click and drag the link into your browser bookmarks bar. Then you can just select any word(s) and click the bookmarklet. It will pop up a new window which looks up the word for you; when you’re done, close it and you’re back to your reading.
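Under the hood such a bookmarklet just grabs the selection, builds a dictionary URL, and opens a window. A sketch of the URL-building part – note the ‘keywords’ query parameter is my assumption about the site’s search form, not taken from the original bookmarklet:

```typescript
// Build the lookup URL a bookmarklet would open for the selected word(s).
// NOTE: the "keywords" parameter name is an assumption about
// maoridictionary.co.nz's search form, not from the original post.
function lookupUrl(selection: string): string {
  return "http://www.maoridictionary.co.nz/search?keywords=" +
         encodeURIComponent(selection.trim());
}

// In the bookmarklet itself this would be wrapped up as roughly:
//   javascript:window.open(lookupUrl(String(window.getSelection())))
```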

Converting a plaintext bibliography to Endnote/RIS format with help from Linux/Terminal

[Update 16/7/2011: See my more recent post on the topic, Launching Ref2RIS – convert your typed bibliography to Endnote format, which makes things even easier.]

You won’t want to do this unless you’ve got literally hundreds of references. Any less, and these suggestions are way easier.

1. Format references so they’re each on their own line – no blank lines.

2. Use Word’s “Find Special” capabilities to replace a phrase in italics with {it}a phrase in italics{endit} and a phrase in bold with {b}a phrase in bold{endb}.  (Similarly if the citations contain underlines.)

3. Save as plaintext – say, source.txt.  Now the fun begins…  My own source text contains 600-odd lines in ACS style, like this:

Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}. 
House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185

4. Open up Terminal or some other Linux command line.

5. Endnote records are separated by a line

ER  - 

– that’s two spaces before the hyphen and one after.  (All these details come from Endnote’s help pages.) This is the easy part: type in

sed -e 's/$/\nER  - /' source.txt > source1.txt

(The \n in the replacement is a GNU sed feature – it inserts an actual new line, so each ER  -  lands on its own line.)

6. The start of each Endnote record tells you what kind of citation it is – eg a book, journal etc.  To find every line that includes a colon (ie separating the publisher from the city published in) type in

sed -e 's/^\(.*:\)/TY  - BOOK@@\1/' source1.txt > source2.txt

Note 1: The “@@” is in there as a sign that you’ll need to replace this with a new line later; but we want to keep everything on one line for now.
Note 2: This is a good example of why this whole method is highly suspect, because it’ll also catch citations which have a colon in the article title or in a typo or whatever.  So if you can think of a better sign that a citation is a book then use that instead of the colon.

Alternatively, you could type in

sed -e 's/^\(.*{it}[0-9]*{endit}\)/TY  - JOUR@@\1/' source1.txt > source2.txt

to find every line that contains {it}[some number]{endit} which, in my source, is the best indicator that I’m dealing with a journal.  The same caveats apply – you’ll get both false positives and false negatives.

Anyway, keep doing what seems best given your source, and fix up the inevitable mistakes by hand until each line starts with TY  - something.  If you want to give up and just assume that everything that isn’t already assigned must be a journal, then try

sed -e '/^TY\|^ER/! s/^/TY  - JOUR@@/' source2.txt > source3.txt

I now have source looking like:

TY  - BOOK@@Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}. 
ER  -
TY  - JOUR@@House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185
ER  -

7. Now we keep playing with patterns.  (You may be able to do large chunks of this with regular find/replace, but for illustrative purposes I’ll keep using Terminal.)

For example, in my source the authors are nicely set off: they come after “@@” and before the first “{it}” (or “in {it}”), and if there’s more than one of them they’re separated by “;”.  So a few commands:

sed -e 's/@@\(.* in {it}\)/@@A1  - \1/' source3.txt > source4.txt
sed -e 's/@@\(.* {it}\)/@@A1  - \1/' source4.txt > source5.txt
sed -e 's/;\(.*;\)/@@A1  - \1/' source5.txt > source6.txt (This one I had to repeat a few times depending how many authors could be cited in one reference; there's supposed to be a way to do it globally but my unix fu is not strong.)
sed -e 's/;\(.*{it}\)/@@A1  - \1/' source8.txt > source9.txt

Journal titles:

sed -e 's/^\(TY  - JOUR.*\)\({it}.*{endit} {b}\)/\1@@JO  - \2/' source9.txt > source10.txt


Years:

sed -e 's/\({b}[0-9]*{endb}\)/@@Y1  - \1/' source10.txt > source11.txt

And so forth.  You pretty soon start to see why the first suggestion on most lists of ways to convert plaintext citations into RIS format is always “Just type it in / search for it again by hand”.  The method above is really only suitable if you’ve got literally hundreds of citations. (I have 639, plus or minus.)

8. Eventually you’ll be at a point where you can do a simple find/replace to change @@ to a new line and nuke all the {it} and so forth.  This will be a great relief.

9. Rename your final saved file from source12.txt to source12.ris and open with Endnote.

10. Bonus material:  if this was a bibliography to a paper using numbered citations in order using eg [1], then in that paper you can do a find/replace on [ -> { and ] -> }, then tell the Endnote plugin to format citations, and voila, the best magic ever.  (If the paper uses author/date citations then you’ll have to link them by hand, sorry.)