Tag Archives: open data

Big data, little data, no data – Christine Borgman #vala14 #p1

Big data, little data, no data: scholarship in the networked world

Technological advances in mediated communication have gone from writing to computers to social media, and these are cumulative: we use all of them concurrently. And increasingly we think of these in terms of data. We need to think about new infrastructures, because this will determine what will be there for tomorrow’s students/librarians/archivists.

Australia is notable for ANDS, and for movement towards open access policies – it’s the only place she’s found where managing data is part of the (ARC’s) Code for the Responsible Conduct of Research.

Book coming out late 2014/early 2015 – data and scholarship; case studies in data scholarship; data policy and practice. Organised around “provocations”:

  • How do rights, responsibilities, and risks around research data vary by disciplines and stakeholders?
  • How can data be exchanged across domains, contexts, time?
  • How do publication and data differ?
  • What are scholars’ motivations to share?
  • What expertise is needed to manage research data?
  • How can knowledge infrastructures adapt to the needs of scholars and demands of stakeholders?

Until the first journal in the 17th century, scholars communicated by private letters. Journals were the beginning of peer review, of opening up knowledge beyond those privileged enough to exchange letters. However, things began much earlier: a brick from the 5th–6th century inscribed with the Sutra on Dependent Origination. Now we have complete open access in PLOS ONE. (Shows If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology.) Lots of journals, preprint servers, and institutional repositories to submit to.

Publishing (including peer review) serves to legitimise knowledge; to disseminate it; and to provide access, preservation and curation.

Open access means many things – uses Suber’s “digital, online, free of charge, and free of most copyright and licensing restrictions” definition.

ANDS model of “more Australian researchers reusing research data more often”. Moving from unmanaged, disconnected, invisible, single-use data to managed, connected, findable, reusable data.

Open data has even more definitions: Open Data Commons “free to use, reuse and redistribute”; Royal Society says “accessible, useable, assessable, intelligible”. OECD has 13 conditions. People don’t agree because data’s really messy!

Data aren’t publications
When data’s created it’s not clear who owns it – field researcher, funder, instrument, principle investigator?
Papers are arguments – data are evidence.
Few journals try to peer review data. Some repositories do but most just check the format.

Data aren’t natural objects
What are data? Most places list possibilities; few define what is and isn’t data. Marie Curie’s notebook? A mouse? A map or figure? An astronomical photo – which the public loves, but astronomers don’t agree on what the colours actually mean… A 3D figure in a PDF (if you have exactly the right version of Adobe Acrobat). Social science data where, even when specifically designed for sharing, it’s full of footnotes telling you which appendices to read to understand how the questions/methods changed over time…

Data are representations
“Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.”

If you think you have problems with catalogue interoperability, try looking at open ontologies intersecting different communities.

Data sharing and reuse depends on infrastructure
You don’t just build an infrastructure and you’re done. They’re complex, interact with communities. Huge amount of provenance important to make sense of data down the line.

Data management is difficult – scholars have a hard enough time managing it for their own reuse let alone someone else’s reuse. Need to think about provenance, property rights, different methods, different theoretical perspectives, “the wonderful thing about standards is there’s so many to choose from”.

Ways to release data:

  • contribute to archive
  • attach to journal article
  • post on local website
  • license on request
  • release on request

These last ones are very effective because people are talking to each other and can exchange tacit knowledge — but it doesn’t scale. The first scales but only works for well-structured and organised data.

So what are we trying to do? Reuse by investigator, collaborators, colleagues, unaffiliated others, future generations/millennia? These are very different purposes and commitments.

Traditional economics (1950s) was based on physical goods – supply and demand. But this doesn’t work with data. Public/private goods distinction doesn’t work with information. There’s no rivalry around the sunset or general knowledge in the way there is around a table or book. So concept of “common pool resources” – libraries, data archives – where goods must be governed.

  • Exclusion difficult, low subtractability/rivalry: public goods
  • Exclusion difficult, high subtractability/rivalry: common pool resources
  • Exclusion easy, low subtractability/rivalry: toll or club goods
  • Exclusion easy, high subtractability/rivalry: private goods

While data are unstructured and hard to use they’re private goods. Are we investing to make them toll goods, common pool resources, or public goods?

Need to make sustainability decisions – what to keep, why, how, how long, who will govern them, what expertise required?

Q: Health sciences doing well
A: Yes, but there are representation issues. An attempt to outsource mammogram readings fell foul of the huge amount of tacit knowledge required. In genomics there are attempts to get scientists and drug companies to work together in the open, but it’s a complicated situation with journals, who say that because the data is out there it’s prior publication, when in fact the paper is explaining the science behind it; and there are issues around (misleading) partial release of data – she recommends Goldacre’s Bad Pharma.

Q: Scientists want to know who they’re giving data to. But maybe data citation a way to get scientists on board?
A: Citing data as an incentive is a hypothesis. Really, sharing data is a gift – if you put it in a repository you don’t have it available to trade to collaborators, funders, new universities. Data as dowry: people get hired because they have the data.
Agreeing on the citable unit is hard – some people would put a DOI on every cell, others would have a footnote “OECD”. Citation isn’t just about APA vs Blue Book, it’s about the citable unit and who gets credit and…

Collaboratively solving the research data management problems in Australia

Talk by @markhahnel @figshare at #SymUCOZ14

In a study in Nature, 67% of researchers say lack of access to others’ data is a major impediment to scientific progress. 36% say they’d share their data. (31% therefore admitting they’re impeding scientific progress!)

Lots of funders require data and outputs to be available – but repositories aren’t helping them do it.
Australia is much further forward than the UK, US, and Europe (who now have mandates but no infrastructure).
ANDS – great website, guides for students, etc
Victoria University Research Data Management libguide

Being able to cite data is vital – it elevates data to equality with papers.
Currently people happily cite papers, but only 25% cite data in the reference list (as opposed to the data availability section etc). Hitting people with a white paper doesn’t make them listen. So how can we report on impact? What use will Thomson Reuters’ data citation package be if they miss 75% of actual impact?

Taylor and Francis are now providing datasets (in fact paying figshare to convert them back to the format they were originally submitted in…)

People don’t care about research integrity, funders’ data policies, legislative change. Not even moral and ethical oblications. Somewhat about raising profile of yourself and research. What really engages them are:
* increased citation rate
* simplicity
* visualisation is cool

Institutions, meanwhile, ask:
* How much data are we generating?
* Where the hell is it?
* What’s happened to the old stuff?
* How can we get our research re-used more than anyone else’s?
* How do we out-perform other institutions in league tables?
They’re receiving all this money and have no idea where it’s going.

figshare for institutions – provides statistics, tracking impact.

My first foray into coding with open data

My first foray into coding with someone else’s data would probably have been when I created some php and a cron job to automatically block-and-report-for-spam any Twitter account that tweeted one of three specific texts that spammers were flooding on the #eqnz channel. I really don’t want to work with Twitter’s API, or more specifically with their OAuth stuff, ever again.

So my first foray that I enjoyed was with the Christchurch Metroinfo (bus) data – specifically the real-time bus arrival data (the link requires filling out a short terms and conditions thing, but then the data’s free under a CC-BY license). For a long time I’ve used this real-time information to keep an eye on when I need to leave the house to reach my stop in time for my bus. But if I’m working in another window and get distracted, or traffic suddenly speeds up, I can still miss it. I wanted a web app that’d give me an audio alert when the bus came in range.

Working with the data turned out to be wonderfully easy. A bit of googling yielded information about SimpleXML, and I knew enough PHP to use it. There was an odd glitch when I uploaded my code – which worked perfectly fine on my computer – to my webserver, which ran a slightly older version of PHP that for some reason required an extra step to parse the attributes ECan use in their XML. But once I worked out what was going on, that was an easy fix too.
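For flavour, here’s a minimal sketch of the same kind of parsing in Python rather than PHP/SimpleXML. The element and attribute names below are invented for illustration – the real Metroinfo feed’s structure will differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of a real-time arrivals feed.
SAMPLE = """<Content>
  <Platform PlatformTag="123" Name="Example Stop">
    <Route RouteNo="7" Name="Example Route">
      <Destination Name="City">
        <Trip ETA="4"/>
        <Trip ETA="19"/>
      </Destination>
    </Route>
  </Platform>
</Content>"""

def next_arrivals(xml_text):
    """Return the ETAs (minutes) of upcoming trips, soonest first."""
    root = ET.fromstring(xml_text)
    # Attribute access is explicit here; the PHP version tripped over
    # differences in how older SimpleXML versions exposed attributes.
    etas = [int(trip.get("ETA")) for trip in root.iter("Trip")]
    return sorted(etas)

print(next_arrivals(SAMPLE))  # → [4, 19]
```

In the real app you’d fetch the XML over HTTP on a timer and trigger the audio alert when the smallest ETA drops below your walking time to the stop.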

Then I did a whole bunch of fiddling with the CSS and HTML5 (and the SQL is a whole nother story); then I uploaded the source code to GitHub; and eventually I even remembered to cite the data, whoops.

So now I have:

online, and I’m already starting to think about what other open data projects might be out there waiting for me.

(And now that the development phase is over and I’m using the thing live, I think my cat is starting to recognise that when this particular bird song plays, I’m about to leave the house.)

Open Access cookies

Creative Commons Aotearoa New Zealand are running a series of blogposts for Open Access Week, and I’ve contributed Levelling up to open research data.

I also, for Reasons, had an urge tonight to make Open Access biscuits. (I know my title says ‘cookies’, but the real word is of course ‘biscuits’, and I shall use it throughout the rest of this post along with real measurements and real temperatures. Google can convert for you, should you need it to.) The following instructions I hereby license as Creative Commons Zero, which should not be taken as a reflection on their calorie count.

First I started with a standard biscuit base recipe. You could use your own. I used the base for my family’s recipe for chocolate chip biscuits, which probably means it ultimately derives from Alison Holst, but I think I’ve modified it sufficiently that it’s okay to include here:

  1. Cream 125 grams of butter and 125 grams of sugar. The longer you beat it, the lighter and crisper the biscuits will be.
  2. Beat in 2 tablespoons sweetened condensed milk (or just milk will do, at a pinch) and 1 teaspoon vanilla essence.
  3. Sift in 1.5 cups of flour and 1 teaspoon of baking powder and mix to a dough.

Now we diverge from the chocolate chip recipe by not adding 90 grams of chocolate chips. We also divide the mixture in half, dyeing one half orange using a few drops of red colouring and three times as many drops of yellow colouring:

Open Access biscuits step 1

The plain lot should then be divided into halves, each half rolled long and flat.
The orange lot should have just a small portion taken off and rolled into a fat spaghetto (a bit thinner than I did would be ideal), and the rest rolled into a large rectangle.

Then start rolling it together into our shape. The orange spaghetto gets rolled up into one of the plain rectangles. In this photo I’m doing two steps at once – most of the orange hasn’t been properly rolled out yet:

Open Access biscuits step 2

Then roll the rest of the orange around that with enough hanging off the top that you can fit some more plain stuff in to keep the lock open:

Open Access biscuits step 3

The ends will be raggedy. Don’t worry, this is all part of the plan.

At this point, put your roll of dough into the fridge to firm up a bit while you do the dishes. You could also consider feeding the cat, cooking dinner, etc. Or you can skip this step (or shorten it as I did) and it won’t hurt the biscuits, you’ll just have to do more shaping with your fingers because cutting the slices squashes them into rectangles:

Open Access biscuits step 4

These slices are about half a centimetre thick. I got about 38 off this roll, plus the raggedy ends. Remember I said those were part of the plan? Right, now – listen carefully, because this is very important – what you need to do is dispose of all the raggedy ends that won’t make pretty biscuits by eating the raw dough. I know, I know, but somebody’s got to do it.

The rest of the biscuits you put on a tray in the oven on a slightly low setting, say 150 Celsius, while you do the dishes that you missed last time because they were under things, and generally tidy up. 10 minutes or so, but whatever you do don’t go and start reading blogs because once these start to burn they burn quickly. Take them out when the ones in the hottest part of the oven are just starting to brown, and turn out onto a cooling rack.

Et voilà, open access biscuits:

Open Access biscuits step 5

Institutional repositories for data?

Via my Twitter feed:

University researcher sites lack of “institutional repositories” where data can be published as a reason more data isn’t online. #nethui

— Jonathan Brewer (@kiwibrew) July 8, 2013

(And discussion ensuing.)

I’m not an expert in data management. A year ago it was top of my list of Things That Are Clearly Very Important But Also Extremely Scary, Can Someone Else Please Handle It? But then I got a cool job which includes (among other things) investigating what this data management stuff is all about, so I set about investigating.

Sometime in the last half year I dropped the assumption that we needed to be working towards an institutional data repository. In fact, I now believe we need to be working away from that idea. Instead, I think we should be encouraging researchers to deposit their datasets in the discipline-specific (or generalist) data repositories that already exist.

I have a number of reasons for this:

  • My colleague and I, with a certain amount of outsourcing, already have to run a catalogue, the whole rickety edifice of databases and federated searching and link resolving and proxy authentication, library website and social media account, institutional repository, community archive, open journal system, etc etc. Do we look like we need another system to maintain?
  • An institutional archive is kind of serviceable for pdfs. But datasets come in xls, csv, txt, doc, html, xml, mp3, mp4, and a thousand more formats – no, I’m not exaggerating. They can be maps, interviews, 3D models, spectral images, anything. They can be a few kilobytes or a few petabytes. Yeah, you can throw this stuff into DSpace, but that doesn’t mean you should. That’s like throwing your textbooks, volumes of abstracts, Kindles, Betamax, newspapers, murals, jigsaw puzzles, mustard seeds, and Broadway musicals (not a recording, the actual theatre performance) onto a single shelf in a locked glass display cabinet and making people browse by the spine labels.
  • If you want a system that can do justice to the variety of datasets out there, you’d better have the resources of UC3 or DCC or Australia or PRISM. Because you’re either going to have to build it or you’re going to have to pay someone to build it, and then you’re going to have to maintain it. And you’re going to have to pay for storage and you’re going to have to run checksums for data integrity and you’re going to have to think about migrating the datasets as time marches on and people forget what the current shiny formats are. And you’re going to have to wonder if and how Google Scholar indexes it (and hope Google Scholar lasts longer than Google Reader did) or no-one will ever find it. And a whole lot more else.
  • If anything’s in it. Do you know how hard it is to get researchers to put their conference papers into institutional repositories? My own brother flatly refuses. He points out that his papers are already available via his discipline’s open access repository. That’s where people in his discipline will look for it. It’s indexed by Google. Why put it anywhere else? I conceded the point for the sake of our family dinner, and I haven’t brought it up again because on reflection he’s right. (He’s ten years younger than me; he has no business being right, dammit.) And because it’s hard enough to get researchers to put their conference papers into institutional repositories even when their copy is the only one in existence.
  • Do you know how hard it is to convince most researchers that they should put their datasets anywhere online other than a private Dropbox account? (Shameless plug: Last week another colleague and I did a talk responding to 8 ‘myths’ or reasons why many researchers hesitate – slides and semi-transcript here. That’s summarised from a list we made of 23 reasons, and other people have come up with more objections and necessary counters.) The lack of an institutional repository for data doesn’t even rate.

No, forget creating institutional data repositories. What we need to be doing is getting familiar with the discipline data repositories and data practices that already exist, so when we talk to a researcher we can say “Look at what other researchers in your discipline are doing!”

This makes it way easier to prove that this data publishing thing isn’t just for Those Other Disciplines, and that there are ways for them to deal with [confidentiality|IP issues|credibility|credit]. And it makes sure the dataset is where other researchers in that discipline are searching for it. And it makes sure the datasets are deposited according to that discipline’s standards and that discipline’s needs, not according to the standards and needs of whoever was foremost in the mind of the developer who created the generic institutional data repository – so the search interface will be more likely to work reasonably well for that discipline. And it means the types of data will be at least a little more homogeneous (in some cases a lot more) so there’s more potential for someone to do cool stuff with linked open data.

And it means we can focus on what we do best, which is helping people find and search and understand and use and cite and publish these resources. Trust me, there is plenty more to do in data management than just setting up an institutional data repository.

NeSI; publishing data; open licenses #nzes

Connecting Genetics Researchers to NeSI
James Boocock & David Eyers, University of Otago
Phil Wilcox, Tony Merriman & Mik Black, Virtual Institute of Statistical Genetics (VISG) & University of Otago

The theme of the conference is “eResearch as an enabler” – showing researchers that eresearch can benefit them, and enabling them.
There’s been a genomic data explosion – genomic, microarray, sequencing data. Genetics researchers need to use computers more and more. Computational cost is increasing; need to use shared resources. “Compute first, ask questions later”.

Galaxy aims to be web-based platform for computational biomedical research – accessible, reproducible, transparent. Has a bunch of interfaces. Recommends shared file system and splitting jobs into smaller tasks to take advantage of HPC.

Goal: create an interface between NeSI and Galaxy. Galaxy job > a job splitter > subtasks performed at NeSI, then ‘zipped up’ and returned to Galaxy. Not just splitting files by lines, but by genetic distance – which gives different sized files.
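A rough sketch of what distance-based (rather than line-based) splitting might look like – purely illustrative, since the talk doesn’t describe the splitter’s actual interface:

```python
def split_by_distance(markers, max_span):
    """Split (name, genetic_distance_cM) markers into chunks so that
    no chunk spans more than max_span centimorgans.

    Hypothetical sketch: real tools would split on genetic map
    coordinates from the input files themselves.
    """
    chunks, current, start = [], [], None
    for name, pos in markers:
        if start is None:
            start = pos
        if pos - start > max_span:
            # Current chunk has grown past max_span: start a new one.
            chunks.append(current)
            current, start = [], pos
        current.append((name, pos))
    if current:
        chunks.append(current)
    return chunks

markers = [("m1", 0.0), ("m2", 0.4), ("m3", 1.1), ("m4", 2.6), ("m5", 2.9)]
# → three chunks of different sizes: [m1, m2], [m3], [m4, m5]
print(split_by_distance(markers, 1.0))
```

Note how the chunks come out different sizes – exactly the point made above about distance-based splitting giving different sized files.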

Used git/GitHub to track changes, and Sphinx for Python documentation. Investigating Shibboleth for authentication. Some bugs they’re working on. Further looking at efficiency measures for parallelization, and building a machine-learning approach to doing this.

Myths vs Realities: the truth about open data
Deborah Fitchett & Erin-Talia Skinner, Lincoln University
Our slides and notes available at the Lincoln University Research Archive

Some rights reserved: Copyright Licensing on our Scholarly record
Richard Hosking & Mark Gahegan, The University of Auckland

Copyright law has effect on reuse of data. Copyright = bundle of exclusive rights you get for creating work, to prevent others using it. Licensing is legal tool to transfer rights. Variety of licensing approaches, not created equal.

Linked data, combining sources with different licenses, makes licensing unclear – interoperability challenges.

* Lack of license – obvious problem
* Copyleft clauses (sharealike) – makes interoperability hard
* Proliferation of semi-custom terms – difficulties of interpretation
* Non-open public licenses (eg noncommercial) – more difficulties of interpretation

Technical, semantic, and legal challenges.
Research aims to capture semantics of licenses in a machine-readable format to align with, and interpret in context of, research practice. Need to go beyond natural language legal text. License metadata: RDF is a useful tool – allows sharing and reasoning over implications. Lets us work out whether you can combine sources.

Mapping terminology in licenses to research jargon.
Eg:
* “reproduce” → “making an exact Copy”
* “collaborators” → “other Parties”

This won’t help if there’s no license, or legally vague, or for novel use cases where we’re waiting for precedent (eg text mining over large corpuses)

Compatibility chart of Creative Commons licenses – some very restricted. “Pathological combinations of licenses”. Computing this can help measure combinability of data, degree of openness. Help understanding of propagation of rights and obligations.
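As a toy illustration of “computing combinability”: the compatibility table below is hand-made, covers only a few CC licenses, and stands in for the RDF-encoded semantics the actual research uses.

```python
# Hand-made, partial compatibility table (keys are sorted pairs).
# The real work encodes license semantics in RDF and reasons over them.
COMPATIBLE = {
    ("CC-BY", "CC-BY"): "CC-BY",
    ("CC-BY", "CC-BY-SA"): "CC-BY-SA",  # sharealike propagates
    ("CC-BY-SA", "CC-BY-SA"): "CC-BY-SA",
    ("CC-BY", "CC-BY-NC"): "CC-BY-NC",  # NC restriction propagates
    # CC-BY-SA + CC-BY-NC is absent: a "pathological combination" -
    # no single license satisfies both sharealike and the added NC term.
}

def combine(a, b):
    """Return the license a combined dataset could carry, or None."""
    return COMPATIBLE.get(tuple(sorted((a, b))))

print(combine("CC-BY", "CC-BY-SA"))     # → CC-BY-SA
print(combine("CC-BY-SA", "CC-BY-NC"))  # → None (pathological)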

Discussion of licensing choices should go beyond personal/institutional policies.

Comment: A PhD student writing his thesis and reusing figures from publications found that, for anything published by IEEE, he legally had to ask permission to reuse figures he’d created himself. It’s not just about datasets but anything you put out.

Comment: “Best way to hide data is to publish a PhD thesis”.

Q: Have you started implementing?
A: Yes but still early on coding as RDF structure and asking simple questions. Want to dig deeper.

Q: Get in trouble with practicing law – always told by institution to send questions to IP lawyers etc. Has anyone got mad at you yet?
A: I do want to talk to a lawyer at some point. Can get complex fast especially pulling in cross-jurisdiction.
Comment: This will save time (=$$$) when talking to lawyer.
A: There’s a lot of situations where you don’t need a lawyer – that’s more for fringe cases.

U of Washington eScience Institute #nzes

eScience and Data Science at the University of Washington eScience Institute
“Hangover” Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science & Engineering, University of Washington

The scientific process is getting reduced to a database problem – instead of querying the world we download the world and query the database…

The UoW eScience Institute aims to be at the forefront of research in eScience techniques/technology, and in the fields that depend on them.

3 Vs of big data:
* volume – this gets lots of attention, but
* variety – this is the bigger challenge

Sources a long-tail image from Carole Goble showing that lots of data in Excel spreadsheets, lab books, etc, is just lost.
Types of data stored – especially data and some text. 87% store it on “my computer”; 66% on a hard drive…
Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes).
No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.

Problem – how much time do you spend handling data as opposed to doing science? The general answer is 90%.
People may spend a week doing manual copy-paste to match data because they’re not familiar with tools that would do it with a simple SQL JOIN query in seconds.
The Sloan Digital Sky Survey was incredibly productive because they put the data online in database format and thousands of other people could run queries against it.
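The “week of copy-paste vs one JOIN” point, sketched with Python’s built-in sqlite3 – the table and column names here are invented for illustration:

```python
import sqlite3

# Matching two datasets with a JOIN instead of manual copy-paste.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE samples (site TEXT, species TEXT);
    CREATE TABLE sites (site TEXT, latitude REAL);
    INSERT INTO samples VALUES ('A', 'kea'), ('B', 'tui');
    INSERT INTO sites VALUES ('A', -43.5), ('B', -41.3);
""")
# One query replaces a week of lining rows up by hand.
rows = db.execute("""
    SELECT samples.species, sites.latitude
    FROM samples JOIN sites ON samples.site = sites.site
""").fetchall()
print(sorted(rows))  # → [('kea', -43.5), ('tui', -41.3)]
```

The same JOIN works unchanged whether the tables hold two rows or two million – which is the productivity point being made.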

SQLShare: Query as a service
Want people to upload data “as is”. Cloud-hosted. Immediately start writing queries, share results, others write their queries on top of your queries. Various access methods – REST API -> R, Python, Excel Addin, Spreadsheet crawler, VizDeck, App on EC2.

Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world, but at the frontier of research this shared consensus by definition doesn’t exist, or changes frequently, and data found in the wild will typically not conform to standards. So he modifies Maslow’s hierarchy of needs:
Usually storage > sharing > curation > query > analytics
Recommends: storage > sharing > query > analytics > curation
Everything can be done in views – cleaning, renaming columns, integrating data from different sources while retaining provenance.
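A sketch of the “clean in views, keep the raw upload untouched” idea, again using sqlite3 with invented names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Raw data uploaded "as is": cryptic column names, messy values.
# Cleaning and renaming live in a view; raw_upload is never altered,
# which is how provenance is retained.
db.executescript("""
    CREATE TABLE raw_upload (c1 TEXT, c2 TEXT);
    INSERT INTO raw_upload VALUES ('stationA', ' 12.5 '), ('stationB', '7');
    CREATE VIEW readings AS
        SELECT c1 AS station, CAST(TRIM(c2) AS REAL) AS temperature_c
        FROM raw_upload;
""")
print(sorted(db.execute("SELECT * FROM readings").fetchall()))
# → [('stationA', 12.5), ('stationB', 7.0)]
```

Others can then layer their own views on top of `readings` – queries on queries, with the original upload always recoverable underneath.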

Bring the computation to the data. We don’t want just fetch-and-retrieve – we need a rich query service, not a data cemetery. “Share the soup and curate incrementally as a side-effect of using the data”.

Convert scripts to SQL and lots of problems go away. Tested this by sending a postdoc to a meeting and doing “SQL stenography” – real-time analytics as the discussion went on. Not a controlled study – they didn’t have someone trying to do it in Python or R at the same time – but he’d challenge anyone to do it as quickly! Quotes (a student?): “Now we can accomplish a 10-minute, 100-line script in 1 line of SQL.” Non-programmers can write very complex queries rather than relying on staff programmers and feeling ‘locked out’.

Data science
Taught an intro to data science MOOC with tens of thousands of students. (The power of the discussion forum to fix a sloppy assignment!)

Lots of students more interested in building things than publishing, and are lost to industry. So working on ‘incubator’ projects, reverse internships pulling people back in from industry.

Q: Have you experimented with auto-generating views to cleanup?
A: Yes, but less with cleaning and more deriving schemas and recommending likely queries people will want. Google tool “Data wrangler”.

Q: Once again people using this will think of themselves as ‘not programmers’ – isn’t this actually a downside?
A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there’s no good support for development in SQL. Risk that giving people power but not teaching programming. But mostly trying to get people more productive right now.

French open data and historic monuments

Way back I had this idea I’d keep up with library blogs in French (and another couple of languages I was semi-competent at, at least with the help of Google Translate) and feed back the occasional roundup of interesting stuff into the anglophone world. I did it a bit, then ran out of steam, then the rss feeds I followed got out of date so nowadays I’ve no idea where the really interesting conversations are happening.

But I still see the occasional tidbit, such as (via Des Bibliothèques 2.0) the launch of the official French Open Data website data.gouv.fr. (A nice touch is that down the bottom of the page they link to Open Data initiatives in a bunch of other countries too.)

And even more cool, (via the same), Monuments historiques, a mashup of data from data.gouv.fr, OpenStreetMap, INSEE, Wikipedia and DBpedia, and Yahoo! (see more on the sources and process) which lets you search or browse for nearly 44,000 monuments in France by type, historic period, region, Metro stop… and gives you data, description, and images about each monument in a really pretty interface.