Innovate #vala14 #s13 #s14 #s15

Hue Thi Pham and Kerry Tanner Influences of technology on collaboration between academics and librarians

Interrelationships between collaboration, institutional structure, and technology.
Things like Google Apps tend to be used within departments – less used on smaller campuses, where there's more casual face-to-face interaction. Level of use varies by discipline, faculty, and campus.
Social technologies like Twitter used in lectures
Learning management system (eg Moodle) most important technology mentioned in interviews.
Institutional repository common space for depositing resources

Technology facilitating transition from traditional to digital library – more electronic resources, communicating over telephone, email, Skype. But purely online interaction means a reduced mutual understanding of partners’ contributions, and an old perception of librarians’ roles.

Divide between library system and learning management system leads to a divide between the two communities around these. Librarians complain they can’t do a workshop about an assignment without Moodle access to see the assignment. Academics say they think librarians could have a role but they don’t understand why they would need access or what they would do with it. Lack of coordination can be a problem – means LMS people and library people make decisions that each other isn’t aware of. Siloisation.

Library staff need to consider the interplay of interpersonal interaction and technology – the value of tech, the value of face-to-face interaction, the importance of space design / architecture. Get automatic access to the learning management system but avoid the resulting workload. Need to find ways to integrate the library management system with the learning management system.

Audience comment: Involvement of librarian in discussion boards can be useful – some topics the academics are relieved to leave to librarian. But important to have awareness of mutual roles.

Lisa Ogle and Kai Jin Chen Just accept it! Increasing researcher input into the business of research outputs

Implementing Symplectic Elements at UoNewcastle (37,000 students, 1000 academics plus 1500 professional staff). HERDC is a reporting exercise to the Australian government to secure funding – sounds similar to New Zealand's PBRF. Work managed by the research division but most data entry done by admin folk. Issues include duplicate data entry, variance in data quality, and many publications never reported – funding missed out on. Library asked to assist from 2005 – centralised model addresses many issues.

Various identification mechanisms: scholarly databases, researchers, conference lists, uni website, library orders. All put manually into Endnote library, then manually copy/pasted into Callista database. Labour-intensive and would often be a 2-6 month delay for researchers, very frustrating.

Getting Elements. Loved harvesting from databases (based on search settings: “We think this is your publication, please log in to claim or reject it”). Originally not keen on opening up to researchers, but after demos got convinced researchers could add manual entry without compromising data quality as library/research staff can verify and lock it.

Benefits: database searches can be customised to minimise false positives/negatives. Can delegate others to act on researchers’ behalf. Publications appear on profile within 48 hours. Can upload Endnote libraries. Can include ‘in press’ publications without messing up workflow. Easily generate publication lists. Capture of bibliometric data. Pretty graphs on user’s dashboard.

Have been running 4 months; two-thirds of publishing academics have logged in and interacted with the system (800 in the first two weeks, and a lull over summer). 2900 publications in the system from the current collection year (usually 3500).

Challenges: early adopter in the Australian market. Development of the module took longer than expected – learned that everyone does HERDC differently.

Most negative feedback so far is from people who haven’t yet logged into the system. Someone complaining it was too hard – talked her through it over the phone and now fine.

Need to investigate further repository integration.

Malcolm Wolski and Joanna Richardson Terra Nova: a new land for librarians?
Big issues emerging around vast amounts of data and trying to connect it. Global connectedness another impact.

Researchers needing a “dry lab” to work with data instead of hands-on wet-lab. Seeing this in many areas.
Researchers can't afford to work solo any more. Much infrastructure costs beyond the reach of an individual researcher or an individual centre. The problems are too big for one person.
Can get storage and computing power – but may need to work with data for ten years so need to be able to retain it and keep working on it through changing technology. Lots of outputs are governmental reports not journal articles.
Most large research projects these days involve communities – even incorporated bodies.
80% of papers in the EU involve collaboration with people outside the authors' institution.

NeCTAR have invested heavily in virtual laboratories because it’s not just about creating data but using it – of course this creates more data.
In theory nothing stops a researcher going to Research Data Storage Infrastructure for storage without their university knowing.
Various community solutions like Tropical Data Hub, Australian National Corpus – slide lists a pile and he points out that for each of these, some institution has put their hand up to take responsibility for maintenance.

Approach of institutions keeping their own data but having to share metadata. Requires lots of discussion around data schemas – what you expect to find in data descriptions. Eg Research Data Australia from 85 participating organisations and growing. Goal to get more data, better connected data, more findable/usable.

Two impacts around:
Research tools: New suite from NeCTAR and ANDS eg virtual laboratories, discipline-specific tools. Need to choose which we’ll support, which data collection schemes we’ll be involved in. May need to develop our own tools for specific disciplines.
Library/research collaboration: Moving more to a partnership model.

Libraries provide support for data management plans and citing data, but there’s huge demand for archiving/preserving data.

Impact on university libraries:

  • New jobs coming out for the “databrarian”.
  • Need research services to help develop common data structures
  • Participation in cross-disciplinary teams bringing librarian skills
  • Development of legal frameworks for acquiring, generating, storing and sharing data
  • Assisting with development of tools – lots of disciplines have different ways of exploring/analysing data so national collections/communities may have specific search (eg maps, chemical structure, vs facets) or visualisation tools.
  • Archiving and preservation services

Librarian support roles

  • Sourcing relevant data sets
  • Consultancy – identify faculty needs, refer back to experts
  • Targeted outreach services re data citation or data repositories
  • New support service tools and processes

Want to be able to offer a service to researchers and them not have to worry about where it’s stored, whether on campus or Amazon Web Services or whatever.

Cloud gazing #vala14 #s8 and #s9

Michelle McLean, Residing in the cloud: looking at the forecast now and into the future
Service models:
Software as a service (LibGuides, Office365, HathiTrust)
Platform as a service (eg Yahoo Pipes, OCLC Web Services, Google App Engine)
Infrastructure as a service (British Library, Library of Congress, My Kansas Library)

Deployment models:
Private cloud
community cloud
hybrid cloud
public cloud

Essential characteristics:
Resource pooling
rapid elasticity
on-demand self-service
measured service
broad network access

Pros

  • Scale and cost
  • Change management done for you – you don’t have to worry about upgrades
  • Choice and agility – if you want something new just pay and you get it
  • Next-generation architecture
  • IT isn’t a library core business – let the experts do it. Better security, better sustainability, better reliability

Cons

  • Security – when people leave need to remove their access right away because access through the web. All big companies have had failures
  • Lock-in. Need to be sure you can take your data with you if you leave
  • Lack of control. If the website is down where is the problem?
  • Financial savings mightn’t be as good as predicted.
  • You lose your IT expertise if you outsource, but then you lose your first point of trouble-shooting.

Preparing for the cloud
Consider security, privacy, access, law, lock-in, whether it’s right for your business.
Cloud computing services are marginally more reliable than IT departments (99% vs 98% uptime). So make sure you have backup systems.
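That one percentage point sounds small but compounds over a year. A quick bit of arithmetic on those figures (my own illustration, not from the talk):

```python
# Translate uptime percentages into expected downtime per year.
# The 99% / 98% figures are from the talk; the rest is just arithmetic.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(uptime_percent):
    """Expected hours of downtime per year at a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)

print(round(downtime_hours(99)))  # cloud: ~88 hours/year
print(round(downtime_hours(98)))  # in-house IT: ~175 hours/year
```

So "marginally more reliable" is still roughly half the downtime – but 88 hours a year is plenty of reason to have those backup systems.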

Derek Whitehead All on the ground: there is no cloud
Metaphor of cloud as fluffy, friendly, faraway – slideshows never show stormclouds!
Behind the metaphor nothing’s actually in the cloud, they’re in servers in a building on the ground in a legal jurisdiction (not always ours).

There are four basic perspectives on the cloud:

  • Technology
  • Content – “information located remotely” but information is rarely independent of computation
  • Personal – companies want us to locate our info elsewhere than our own computers so they can ‘develop a relationship’ with us [lovely euphemism there! -Deborah]
  • Legal – jurisdiction makes a difference though not quite as simple as “in Australia = free of PATRIOT Act”. Frequently mirrored, moved around, using redundancy to safeguard info. People mostly concerned about privacy legislation – strong in Australia and Europe.

Swinburne’s policy is to externally host/manage most where possible – “opportunistic vendor hosting”. Student email; HR; learning management system, library system, OJS, etc.

What do we want the cloud people to do for us? Vendor cloud hosting vs service aggregator provision. Huge range of hybrid or multisource options. But services have to be efficient, reliable, high quality, fast to access, and cost-effective.

Why would we do it? When a kid, generated own electricity – not a great way to live. Thinks IT will one day look back at the idea of having your own server in your basement in the same way. Cost minimisation, efficiency, economies of scale — all of these issues. Security is an issue because bigger targets for hackers, but also have bigger resources to defend against them.

Will need a realignment in skillsets. Getting ability to read/write/negotiate contracts is vital.
But libraries are leaders. Remember when we moved from print to CD-ROMs? (Okay, this was the wrong direction…)
Exit strategies where possible – harder in monopoly situations.
Helped by clear customer benefits and freeing up buildings. Libraries have access to economies of scale, we’re comfortable with automation, it benefits collaboration.

Q: What’s the customer experience of change to the cloud?
A: Infrastructure/management should be invisible to customers. But having info in the cloud brings huge benefits: eg huge increase in number of articles used by academics when they can get them from their desktop.

Q: What if things go wrong?
A: With an external host you’ll have remedies in the contract if things go wrong – no such remedy if you stuff up yourself!

Big data, little data, no data – Christine Borgman #vala14 #p1

Big data, little data, no data: scholarship in the networked world

Technological advances in mediated communication have gone from writing to computers to social media, and these are cumulative: we use all of them concurrently. And increasingly we think of them in terms of data. Need to think about new infrastructures because this will determine what will be there for tomorrow's students/librarians/archivists.

Australia notable for ANDS, and for movements to open access policies – the only place she's found where managing data is part of (the ARC's) Code for the Responsible Conduct of Research.

Book coming out late 2014/early 2015 – data and scholarship; case studies in data scholarship; data policy and practice. Organised around “provocations”:

  • How do rights, responsibilities, and risks around research data vary by disciplines and stakeholders?
  • How can data be exchanged across domains, contexts, time?
  • How do publication and data differ?
  • What are scholars’ motivations to share?
  • What expertise is needed to manage research data?
  • How can knowledge infrastructures adapt to the needs of scholars and demands of stakeholders?

Until the first journal in the 17th century, scholars communicated by private letters. Journals were the beginning of peer review, of opening up knowledge beyond those privileged to exchange letters. However, things began much earlier: a brick from the 5th-6th century inscribed with the Sutra on Dependent Origination. Now we have complete open access in PLOS One. (Shows If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology.) Lots of journals, preprint servers, institutional repositories to submit to.

Publishing (including peer review) serves to legitimise knowledge; to disseminate it; and to provide access, preservation and curation.

Open access means many things – uses Suber’s “digital, online, free of charge, and free of most copyright and licensing restrictions” definition.

ANDS model of “more Australian researchers reusing research data more often”. Moving from unmanaged, disconnected, invisible, single-use data to managed, connected, findable, reusable data.

Open data has even more definitions: Open Data Commons “free to use, reuse and redistribute”; Royal Society says “accessible, useable, assessable, intelligible”. OECD has 13 conditions. People don’t agree because data’s really messy!

Data aren’t publications
When data's created it's not clear who owns it – field researcher, funder, instrument, principal investigator?
Papers are arguments – data are evidence.
Few journals try to peer review data. Some repositories do but most just check the format.

Data aren’t natural objects
What are data? Most places list possibilities; few define what is and isn’t data. Marie Curie’s notebook? A mouse? A map or figure? An astronomical photo – which the public loves, but astronomers don’t agree on what the colours actually mean… 3D figure in PDF (if you have the exact right version of Adobe Acrobat). Social science data where even when specifically designed to share it’s full of footnotes telling you which appendices to read to understand how the questions/methods changed over time…

Data are representations
“Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.”

If you think you have problems with catalogue interoperability, try looking at open ontologies intersecting different communities.

Data sharing and reuse depends on infrastructure
You don’t just build an infrastructure and you’re done. They’re complex, interact with communities. Huge amount of provenance important to make sense of data down the line.

Data management is difficult – scholars have a hard enough time managing it for their own reuse let alone someone else’s reuse. Need to think about provenance, property rights, different methods, different theoretical perspectives, “the wonderful thing about standards is there’s so many to choose from”.

Ways to release data:

  • contribute to archive
  • attach to journal article
  • post on local website
  • license on request
  • release on request

These last ones are very effective because people are talking to each other and can exchange tacit knowledge — but it doesn’t scale. The first scales but only works for well-structured and organised data.

So what are we trying to do? Reuse by investigator, collaborators, colleagues, unaffiliated others, future generations/millennia? These are very different purposes and commitments.

Traditional economics (1950s) was based on physical goods – supply and demand. But this doesn’t work with data. Public/private goods distinction doesn’t work with information. There’s no rivalry around the sunset or general knowledge in the way there is around a table or book. So concept of “common pool resources” – libraries, data archives – where goods must be governed.

                      Low subtractability/rivalry    High subtractability/rivalry
Exclusion difficult   public goods                   common pool resources
Exclusion easy        toll or club goods             private goods

While data are unstructured and hard to use they're private goods. Are we investing to make them toll goods, common pool resources or public goods?

Need to make sustainability decisions – what to keep, why, how, how long, who will govern them, what expertise required?

Q: Health sciences doing well
A: Yes, but representation issues. An attempt to outsource mammogram readings fell foul of the huge amounts of tacit knowledge required. In genomics there are attempts to get scientists and drug companies to work together in the open, but a complicated situation with journals who say that because the data is out there it's prior publication, when in fact the paper is explaining the science behind it; and issues around (misleading) partial release of data – recommends Goldacre's Bad Pharma.

Q: Scientists want to know who they’re giving data to. But maybe data citation a way to get scientists on board?
A: Citing data as incentive is a hypothesis. Really sharing data is a gift – if you put it on a repository you don’t have it available to trade to collaborators, funders, new universities. Data as dowry: people getting hired because they have the data.
Agreeing on the citable unit is hard – some people would have a DOI on every cell, others would have a footnote “OECD”. Citation isn’t just about APA vs Blue Book, it’s about citable unit and who gets credit and….

Collaboratively solving the research data management problems in Australia

Talk by @markhahnel @figshare at #SymUCOZ14

In a study in Nature, 67% of researchers say lack of access to others' data is a major impediment to scientific progress. 36% say they'd share their data. (31% therefore admitting they're impeding scientific progress!)

Lots of funders require data and outputs to be available – but repositories not helping them do it.
Australia much further forward than UK, US, Europe (who now have mandates but no infrastructure)
ANDS – great website, guides for students, etc
Victoria University Research Data Management libguide

Being able to cite data is vital – elevates it to equality with papers
Currently people happily cite papers, but only 25% cite data in the reference list (as opposed to in the data availability section etc). Hitting people with a white paper doesn't make them listen. So how can we report on impact? What use will ThomsonReuters' data citation package be if they miss 75% of actual impact?

Taylor and Francis now providing datasets (in fact paying figshare to convert them back to the format they were originally submitted to them in…)

People don't care about research integrity, funders' data policies, or legislative change. Not even moral and ethical obligations. Somewhat about raising the profile of yourself and your research. What really engages them are:
* increased citation rate
* simplicity
* visualisation is cool

Institutions:
* How much data are we generating?
* Where the hell is it?
* What’s happened to the old stuff?
* How can we get our research re-used more than any others?
* I want to out-perform other institutions in league tables.
Receiving all this money and no idea where it’s going.

figshare for institutions – provides statistics, tracking impact.

Loyalty cards for scholarly publishing

Two things I’ve come across recently which I don’t think I’ve seen before:

“Each article published in ACS journals during 2014 will qualify the corresponding author for ACS Author Rewards article credit. Credits issued under this program, at a total value of $1,500 per publication, may be used to offset article publishing charges and any ACS open access publishing services of the author’s choosing, and will be redeemable over the next three years (2015-2017).”
American Chemical Society extends new open access program designed to assist authors

“Under [IOP’s] new programme, referees will be offered a 10% credit towards the cost of publishing on a gold open access basis when they review an article.”
Changing the way referees are rewarded

(I’m presuming, though it’s not explicit, that these credits are additive, so if you published 2 toll-access articles with ACS you’d get $3,000 credit, and if you refereed 10 IOP articles you’d get to publish 1 article on a gold open access basis for free.)
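To make the arithmetic behind that presumption concrete – remembering that the additive reading is my assumption, not something either publisher states:

```python
# Sanity-check the additive reading of the two reward schemes.
# The $1,500-per-article (ACS) and 10%-per-review (IOP) figures are from the
# announcements quoted above; treating the credits as additive is my assumption.
def acs_credit(articles_published, per_article=1500):
    """Total ACS article-credit dollars for articles published in 2014."""
    return articles_published * per_article

def iop_free_oa_articles(reviews_done, credit_per_review=0.10):
    """Whole gold-OA articles covered by accumulated 10% referee credits."""
    return int(reviews_done * credit_per_review)

print(acs_credit(2))             # 3000
print(iop_free_oa_articles(10))  # 1
```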

I find this fascinating. The obvious catch for scientists is the same as any loyalty card: in order to use it you’ve got to keep shopping at the same company. It’s great psychology, because humans are notoriously reluctant to ignore the opportunity for a discount, so:

  • Someone who’s got credit owing will be less likely to publish in some other journal even if the final cost-to-author is equal and even if that other journal is a better fit for the particular article. (How much less likely I don’t know, but I do think it’d be a factor.)
  • Someone who’s got credit owing for OA publication would probably be more likely to pay the extra to publish OA rather than to publish toll-access for free but not get to use that tempting credit. (This might at least have a small side-effect of getting more people experience with the benefits of publishing open access.)

Both of these are obviously what the companies in question are banking on. I’m a bit concerned about what this pressure to publish with the same old big companies will mean for science – partly about competition, as in the world of supermarkets, but also partly the journals where articles should be finding their best fit. (Perhaps the whole ‘impact factor’ issue has meant that no-one’s ever considered only subject scope in that regard, but this definitely adds another confounding factor.) But given the clear financial benefits to the companies, I expect to be seeing more scholarly publishing reward cards popping up in future.

Reporting a crime to the police (aka my #roastbusters post)

This post is not about my normal subjects, to which I’ll return another day.

Trigger Warning: Roast Busters, and reporting sexual assault  
Certain people love comparing rape and burglary. “I’m not blaming the victims,” they’ll say. “I’m just saying, you can’t expect your insurance company to pay out if you haven’t installed a deadbolt and burglar alarm on your vagina.” Or something remarkably similar to that.

And in the news certain other people have been talking about victims being or not being “brave enough” to report – mostly people who seem to have never experienced let alone tried to report a sexual assault. So. Okay, I’m going to do this: here are my stories about reporting a burglary and reporting a sexual assault.

A couple of years ago someone tried to rob my house and was scared off by the alarm. When I came home I called the police who were nice and professional and unmemorable, as were the afterhours alarm repair company (the would-be burglar had tried to stop the alarm by ripping it off the wall), the carpenter who fixed my back door (no dead-bolt, they just kicked it in and splintered the frame), and my insurance company (who didn’t even ask if I had a dead-bolt). The police dusted for fingerprints but the burglar had worn gloves so that was that and life went on. It’s an easy story to tell, no-one ever questions it, everything’s cool.

On the 6th of September 1997, someone stopped across the street from my bus stop, exposed himself and masturbated in a way designed to get my attention. (What do you even call this? All the terms I can think of carry a connotation of victimless crimes. He didn’t touch me, approach me, or speak to me, but I was nevertheless very much his target. So for the purposes of this post I’m going with ‘telepathic sexual assault’.)

I followed all society’s rules for how a woman should behave in order to not be a victim, and how a victim should behave in order to be taken seriously. To start with I was white, cis, and middle-class. I’d been working, not drinking. I was wearing ‘modest’ clothes. My assailant fit the conventional narrative of a stranger lurking in the bushes, not the uncomfortable truth that over 90% of rapes are committed by victims’ acquaintances, friends and family. I watched him leave so I could try and get a description. As soon as possible I went to the police kiosk in town and reported it. I was visibly and audibly shaken but forthright and articulate. I knew I wasn’t giving them much to go on, but I wanted it on the record in case he did it to someone else.

The police were nice and professional and told me that guys like this were cowards, so if anything like it ever happened again I should shout at or walk towards him.

When was the last time you heard the police say that if you come home to a burglary in progress you should confront the cowardly burglar?

The first time I told this story was three years later, on a mailing list, and doing it gave me an adrenaline reaction as if it’d just happened. Fortunately I was among friends (one of whom told me with authority that the police’s advice was balderdash) and it was cathartic and ever since then it’s just been a thing that happened one time.

So I thought. At lunch yesterday, thinking about Roast Busters and the perennial burglary comparison, I suddenly thought: after the burglary, the police dusted for fingerprints. Did they look for evidence after the telepathic sexual assault? I remember the mood at the time was very matter-of-factly that nothing could be done. Maybe I’m now forgetting a perfectly good reason for this. But. But. Suddenly there’s this question in my mind – Did they even think about looking? – and boom, adrenaline reaction. What had been a fantastic day was suddenly crap because of psychic residue from something that happened sixteen years ago.

I ended up writing to the police to ask what information I’d be able to access relating to that report. I expected there’d be some bureaucratic hoops to jump through. Instead, within a few hours I got an email saying:

I have checked and the only file I can see is a Burglary report you made on [date redacted].

So. I guess that answers my question. And honestly, having heard the far worse stories I’ve heard sixteen years on, I wasn’t surprised. It’s just one on the long, long list of reasons different people have for not reporting sexual assault: sometimes we do report it, but the police simply don’t keep any records of that report.


Administrivia:

  • I’m happy for this post to be linked to or, per my CC-BY license, to be quoted or reposted with attribution back to this url.
  • I welcome comments. That said, I won’t tolerate any kind of victim-blaming or rape apologia. Wishes for, or jokes about, rapists being raped in prison count as both of these things.
  • If you want to do something.


My first foray into coding with open data

My first foray into coding with someone else’s data would probably have been when I created some PHP and a cron job to automatically block-and-report-for-spam any Twitter account that tweeted one of three specific texts that spammers were flooding onto the #eqnz channel. I really don’t want to work with Twitter’s API, or more specifically with their OAuth stuff, ever again.

So my first foray that I enjoyed was with the Christchurch Metroinfo (bus) data – specifically the real-time bus arrival data (link requires filling out a short terms and conditions thing but then the data’s free under CC-BY license). For a long time I’ve used this real-time information to keep an eye out on when I need to leave the house to reach my stop in time for my bus. But if I’m working in another window and get distracted, or traffic suddenly speeds up, I can still miss it. I wanted a web app that’d give me an audio alert when the bus came in range.

Working with the data turned out to be wonderfully easy. A bit of googling yielded information about SimpleXML, and I knew enough PHP to use it. There was an odd glitch when I uploaded my code, which worked perfectly fine on my computer, to my webserver: its slightly older version of PHP for some reason required an extra step in parsing the attributes ECan uses in their XML. But once I worked out what was going on, that was an easy fix too.
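The parsing step really is that simple. I used PHP’s SimpleXML; here’s the same idea sketched in Python with the standard library, against a made-up feed – the element and attribute names below are hypothetical stand-ins, not Metroinfo’s actual schema:

```python
# A sketch of the parse-and-alert logic, using Python's stdlib XML parser.
# SAMPLE_FEED is invented for illustration; the real Metroinfo feed's
# element/attribute names differ.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<Arrivals>
  <Route number="29">
    <Trip eta="12"/>
    <Trip eta="27"/>
  </Route>
  <Route number="100">
    <Trip eta="4"/>
  </Route>
</Arrivals>"""

def minutes_until_bus(xml_text, route_number):
    """Return the soonest ETA (in minutes) for a route, or None if absent."""
    root = ET.fromstring(xml_text)
    etas = [
        int(trip.get("eta"))                 # attributes, as in the real feed
        for route in root.findall("Route")
        if route.get("number") == route_number
        for trip in route.findall("Trip")
    ]
    return min(etas) if etas else None

def should_alert(xml_text, route_number, walk_minutes=10):
    """True when the next bus is within walking distance of being missed."""
    eta = minutes_until_bus(xml_text, route_number)
    return eta is not None and eta <= walk_minutes

print(minutes_until_bus(SAMPLE_FEED, "100"))  # 4
print(should_alert(SAMPLE_FEED, "29"))        # False (12 minutes > 10)
```

Poll that check on a timer and play a sound when it flips to true, and you have the audio alert.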

Then I did a whole bunch of fiddling with the CSS and HTML5, and the SQL is a whole nother story; and then I uploaded the source code to GitHub; and eventually even remembered to cite the data, whoops.

So now I have:

online, and I’m already starting to think about what other open data projects might be out there waiting for me.

(And now that the development phase is over and I’m using the thing live, I think my cat is starting to recognise that when this particular bird song plays, I’m about to leave the house.)

Open Access cookies

Creative Commons Aotearoa New Zealand are running a series of blogposts for Open Access Week, and I’ve contributed Levelling up to open research data.

I also, for Reasons, had an urge tonight to make Open Access biscuits. (I know my title says ‘cookies’, but the real word is of course ‘biscuits’, and I shall use it throughout the rest of this post along with real measurements and real temperatures. Google can convert for you, should you need it to.) The following instructions I hereby license as Creative Commons Zero, which should not be taken as a reflection on their calorie count.

First I started with a standard biscuit base recipe. You could use your own. I used the base for my family’s recipe for chocolate chip biscuits, which probably means it ultimately derives from Alison Holst, but I think I’ve modified it sufficiently that it’s okay to include here:

  1. Cream 125 grams of butter and 125 grams of sugar. The longer you beat it, the lighter and crisper the biscuits will be.
  2. Beat in 2 tablespoons sweetened condensed milk (or just milk will do, at a pinch) and 1 teaspoon vanilla essence.
  3. Sift in 1.5 cups of flour and 1 teaspoon of baking powder and mix to a dough.

Now we diverge from the chocolate chip recipe by not adding 90 grams of chocolate chips. We also divide the mixture in half, dyeing one half orange by using a few drops of red colouring and three times as many drops of yellow colouring:

Open Access biscuits step 1

The plain lot should then be divided into halves, each half rolled long and flat.
The orange lot should have just a small portion taken off and rolled into a fat spaghetto (a bit thinner than I did would be ideal), and the rest rolled into a large rectangle.

Then start rolling it together into our shape. The orange spaghetto gets rolled up into one of the plain rectangles. In this photo I’m doing two steps at once – most of the orange hasn’t been properly rolled out yet:

Open Access biscuits step 2

Then roll the rest of the orange around that with enough hanging off the top that you can fit some more plain stuff in to keep the lock open:

Open Access biscuits step 3

The ends will be raggedy. Don’t worry, this is all part of the plan.

At this point, put your roll of dough into the fridge to firm up a bit while you do the dishes. You could also consider feeding the cat, cooking dinner, etc. Or you can skip this step (or shorten it as I did) and it won’t hurt the biscuits, you’ll just have to do more shaping with your fingers because cutting the slices squashes them into rectangles:

Open Access biscuits step 4

These slices are about half a centimetre thick. I got about 38 off this roll, plus the raggedy ends. Remember I said those were part of the plan? Right, now – listen carefully, because this is very important – what you need to do is dispose of all the raggedy ends that won’t make pretty biscuits by eating the raw dough. I know, I know, but somebody’s got to do it.

The rest of the biscuits go on a tray in the oven at a slightly low setting, say 150°C, while you do the dishes that you missed last time because they were under things, and generally tidy up. Bake for 10 minutes or so, but whatever you do, don't go and start reading blogs, because once these start to burn they burn quickly. Take them out when the ones in the hottest part of the oven are just starting to brown, and turn out onto a cooling rack.

Et voilà, open access biscuits:

Open Access biscuits step 5

Open access and peer review

We’re likely to be hearing about John Bohannon’s new article in Science, “Who’s afraid of peer review?” Essentially the author created 304 fake papers describing bad science and submitted one to each of 304 ‘author-pays’ open access journals to test their peer review. 157 of the journals accepted the paper, 98 rejected it; the remaining journals were abandoned websites or still had the paper under review at the time of analysis. (Some details are interesting. PLOS ONE provided some of the most rigorous peer review and rejected it; OA titles from Sage and Elsevier and some scholarly societies accepted it.)
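The arithmetic behind those figures can be tallied in a couple of lines (a quick sketch; the count of undecided journals is just the remainder of the reported numbers, not a figure quoted directly from the article):

```python
# Outcomes reported for the Bohannon sting: 304 submissions,
# 157 acceptances, 98 rejections; the rest were abandoned
# websites or still had the paper under review.
submitted = 304
accepted = 157
rejected = 98
undecided = submitted - accepted - rejected  # 49

# Acceptance rate among journals that actually reached a decision.
rate = accepted / (accepted + rejected)
print(undecided)       # 49
print(round(rate, 3))  # 0.616
```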

Sounds pretty damning, except…

Peter Suber and Martin Eve each write a takedown of the study, both well worth reading. They list many problems with the methodology and conclusions. (For example, over two-thirds of open access journals listed on DOAJ aren’t “author-pays” so it’s odd to exclude them.)

But the key flaw is even more obvious than the flaws in the fake articles: his experiment was done without any kind of control. He only submitted to open access journals, not to traditionally-published journals, so we don’t know whether their peer review would have performed any better. As Mike Taylor and Michael Eisen point out, this isn’t the first paper with egregiously bad science that’s slipped through Science’s peer review process either.

Institutional repositories for data?

Via my Twitter feed:

University researcher sites lack of “institutional repositories” where data can be published as a reason more data isn’t online. #nethui

— Jonathan Brewer (@kiwibrew) July 8, 2013

(And discussion ensuing.)

I’m not an expert in data management. A year ago it was top of my list of Things That Are Clearly Very Important But Also Extremely Scary, Can Someone Else Please Handle It? But then I got a cool job which includes (among other things) investigating what this data management stuff is all about, so I set about investigating.

Sometime in the last half year I dropped the assumption that we needed to be working towards an institutional data repository. In fact, I now believe we need to be working away from that idea. Instead, I think we should be encouraging researchers to deposit their datasets in the discipline-specific (or generalist) data repositories that already exist.

I have a number of reasons for this:

  • My colleague and I, with a certain amount of outsourcing, already have to run a catalogue, the whole rickety edifice of databases and federated searching and link resolving and proxy authentication, library website and social media account, institutional repository, community archive, open journal system, etc etc. Do we look like we need another system to maintain?
  • An institutional archive is kind of serviceable for pdfs. But datasets come in xls, csv, txt, doc, html, xml, mp3, mp4, and a thousand more formats (no, I’m not exaggerating). They can be maps, interviews, 3D models, spectral images, anything. They can be a few kilobytes or a few petabytes. Yeah, you can throw this stuff into DSpace, but that doesn’t mean you should. That’s like throwing your textbooks, volumes of abstracts, Kindles, Betamax, newspapers, murals, jigsaw puzzles, mustard seeds, and Broadway musicals (not a recording, the actual theatre performance) onto a single shelf in a locked glass display cabinet and making people browse by the spine labels.
  • If you want a system that can do justice to the variety of datasets out there, you’d better have the resources of UC3 or DCC or Australia or PRISM. Because you’re either going to have to build it or you’re going to have to pay someone to build it, and then you’re going to have to maintain it. And you’re going to have to pay for storage and you’re going to have to run checksums for data integrity and you’re going to have to think about migrating the datasets as time marches on and people forget what the current shiny formats are. And you’re going to have to wonder if and how Google Scholar indexes it (and hope Google Scholar lasts longer than Google Reader did) or no-one will ever find it. And a whole lot more else.
  • If anything’s in it. Do you know how hard it is to get researchers to put their conference papers into institutional repositories? My own brother flatly refuses. He points out that his papers are already available via his discipline’s open access repository. That’s where people in his discipline will look for it. It’s indexed by Google. Why put it anywhere else? I conceded the point for the sake of our family dinner, and I haven’t brought it up again because on reflection he’s right. (He’s ten years younger than me; he has no business being right, dammit.) And because it’s hard enough to get researchers to put their conference papers into institutional repositories even when their copy is the only one in existence.
  • Do you know how hard it is to convince most researchers that they should put their datasets anywhere online other than a private Dropbox account? (Shameless plug: Last week another colleague and I did a talk responding to 8 ‘myths’ or reasons why many researchers hesitate – slides and semi-transcript here. That’s summarised from a list we made of 23 reasons, and other people have come up with more objections and necessary counters.) The lack of an institutional repository for data doesn’t even rate.
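The fixity checking mentioned above (running checksums for data integrity) is one of those ongoing costs that's easy to underestimate. A minimal sketch of what it involves, reading in chunks so even very large datasets don't have to fit in memory (the file path is hypothetical):

```python
import hashlib

def sha256_checksum(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, streaming it in
    chunks so multi-gigabyte datasets don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# On deposit you'd record the checksum alongside the dataset,
# then recompute it periodically to catch silent corruption.
# (Hypothetical path for illustration:)
# stored = sha256_checksum("datasets/survey-2013.csv")
```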

No, forget creating institutional data repositories. What we need to be doing is getting familiar with the discipline data repositories and data practices that already exist, so when we talk to a researcher we can say “Look at what other researchers in your discipline are doing!”

This makes it way easier to prove that this data publishing thing isn’t just for Those Other Disciplines, and that there are ways for them to deal with [confidentiality|IP issues|credibility|credit]. And it makes sure the dataset is where other researchers in that discipline are searching for it. And it makes sure the datasets are deposited according to that discipline’s standards and that discipline’s needs, not according to the standards and needs of whoever was foremost in the mind of the developer who created the generic institutional data repository – so the search interface will be more likely to work reasonably for that discipline. And it means the types of data will be at least a little more homogeneous (in some cases a lot more) so there’s more potential for someone to do cool stuff with linked open data.

And it means we can focus on what we do best, which is helping people find and search and understand and use and cite and publish these resources. Trust me, there is plenty more to do in data management than just setting up an institutional data repository.