Monthly Archives: July 2013

Institutional repositories for data?

Via my Twitter feed:

University researcher sites lack of “institutional repositories” where data can be published as a reason more data isn’t online. #nethui

— Jonathan Brewer (@kiwibrew) July 8, 2013

(And discussion ensuing.)

I’m not an expert in data management. A year ago it was top of my list of Things That Are Clearly Very Important But Also Extremely Scary, Can Someone Else Please Handle It? But then I got a cool job which includes (among other things) investigating what this data management stuff is all about, so I set about investigating.

Sometime in the last half year I dropped the assumption that we needed to be working towards an institutional data repository. In fact, I now believe we need to be working away from that idea. Instead, I think we should be encouraging researchers to deposit their datasets in the discipline-specific (or generalist) data repositories that already exist.

I have a number of reasons for this:

  • My colleague and I, with a certain amount of outsourcing, already have to run a catalogue, the whole rickety edifice of databases and federated searching and link resolving and proxy authentication, library website and social media account, institutional repository, community archive, open journal system, etc etc. Do we look like we need another system to maintain?
  • An institutional archive is kind of serviceable for pdfs. But datasets come in xls, csv, txt, doc, html, xml, mp3, mp4, and a thousand more formats, no, I’m not exaggerating. They can be maps, interviews, 3D models, spectral images, anything. They can be a few kilobytes or a few petabytes. Yeah, you can throw this stuff into DSpace, but that doesn’t mean you should. That’s like throwing your textbooks, volumes of abstracts, Kindles, Betamax, newspapers, murals, jigsaw puzzles, mustard seeds, and Broadway musicals (not a recording, the actual theatre performance) onto a single shelf in a locked glass display cabinet and making people browse by the spine labels.
  • If you want a system that can do justice to the variety of datasets out there, you’d better have the resources of UC3 or DCC or Australia or PRISM. Because you’re either going to have to build it or you’re going to have to pay someone to build it, and then you’re going to have to maintain it. And you’re going to have to pay for storage and you’re going to have to run checksums for data integrity and you’re going to have to think about migrating the datasets as time marches on and people forget what the current shiny formats are. And you’re going to have to wonder if and how Google Scholar indexes it (and hope Google Scholar lasts longer than Google Reader did) or no-one will ever find it. And a whole lot more else.
  • If anything’s in it. Do you know how hard it is to get researchers to put their conference papers into institutional repositories? My own brother flatly refuses. He points out that his papers are already available via his discipline’s open access repository. That’s where people in his discipline will look for it. It’s indexed by Google. Why put it anywhere else? I conceded the point for the sake of our family dinner, and I haven’t brought it up again because on reflection he’s right. (He’s ten years younger than me; he has no business being right, dammit.) And because it’s hard enough to get researchers to put their conference papers into institutional repositories even when their copy is the only one in existence.
  • Do you know how hard it is to convince most researchers that they should put their datasets anywhere online other than a private Dropbox account? (Shameless plug: Last week another colleague and I did a talk responding to 8 ‘myths’ or reasons why many researchers hesitate – slides and semi-transcript here. That’s summarised from a list we made of 23 reasons, and other people have come up with more objections and necessary counters.) The lack of an institutional repository for data doesn’t even rate.

No, forget creating institutional data repositories. What we need to be doing is getting familiar with the discipline data repositories and data practices that already exist, so when we talk to a researcher we can say “Look at what other researchers in your discipline are doing!”

This makes it way easier to prove that this data publishing thing isn’t just for Those Other Disciplines, and that there are ways for them to deal with [confidentiality|IP issues|credibility|credit]. And it makes sure the dataset is where other researchers in that discipline are searching for it. And it makes sure the datasets are deposited according to that discipline’s standards and that discipline’s needs, not according to the standards and needs of whoever was foremost in mind of the developer who created the generic institutional data repository – so the search interface will be more likely to work reasonably for that discipline. And it means the types of data will be at least a little more homogeneous (in some cases a lot more) so there’s more potential for someone to do cool stuff with linked open data.

And it means we can focus on what we do best, which is helping people find and search and understand and use and cite and publish these resources. Trust me, there is plenty more to do in data management than just setting up an institutional data repository.

NeSI; publishing data; open licenses #nzes

Connecting Genetics Researchers to NeSI
James Boocock & David Eyers, University of Otago
Phil Wilcox, Tony Merriman & Mik Black, Virtual Institute of Statistical Genetics (VISG) & University of Otago

Theme of conference “eResearch as an enabler” – show researchers that eresearch can benefit and enable them.
There’s been a genomic data explosion – genomic, microarray, sequencing data. Genetics researchers need to use computers more and more. Computational cost increasing, need to use shared resources. “Compute first, ask questions later”.

Galaxy aims to be web-based platform for computational biomedical research – accessible, reproducible, transparent. Has a bunch of interfaces. Recommends shared file system and splitting jobs into smaller tasks to take advantage of HPC.

Goal to create an interface between NeSI and Galaxy. Galaxy job > a job splitter > subtasks performed at NeSI then ‘zipped up’ and returned to Galaxy. Not just file splitting by lines, but by genetic distance. Gives different sized files.
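[To picture why splitting by genetic distance gives different sized files, here is a minimal sketch. It assumes each marker is a (name, position in centimorgans) pair already sorted by position, and that a chunk may span at most a fixed genetic distance. The function name, the data layout and the 1.0 cM threshold are my own illustrative assumptions, not the splitter the presenters built.]

```python
def split_by_genetic_distance(markers, max_span_cm=1.0):
    """Group position-sorted (name, cM) markers into chunks spanning at most max_span_cm."""
    chunks = []
    current = []
    for name, pos in markers:
        # Start a new chunk once this marker is too far from the chunk's first marker.
        if current and pos - current[0][1] > max_span_cm:
            chunks.append(current)
            current = []
        current.append((name, pos))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical markers: dense in one region, sparse in another.
markers = [("m1", 0.0), ("m2", 0.1), ("m3", 0.2), ("m4", 0.9),
           ("m5", 2.5), ("m6", 4.0), ("m7", 4.3)]
chunks = split_by_genetic_distance(markers)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Dense regions end up in large chunks and sparse regions in small ones, which is the uneven file size the talk mentioned.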

Used git/github to track changes, and Sphinx for Python documentation. Investigating Shibboleth for authentication. Some bugs they’re working on. Further looking at efficiency measures for parallelization, building machine-learning approach to doing this.

Myths vs Realities: the truth about open data
Deborah Fitchett & Erin-Talia Skinner, Lincoln University
Our slides and notes available at the Lincoln University Research Archive

Some rights reserved: Copyright Licensing on our Scholarly record
Richard Hosking & Mark Gahegan, The University of Auckland

Copyright law has effect on reuse of data. Copyright = bundle of exclusive rights you get for creating work, to prevent others using it. Licensing is legal tool to transfer rights. Variety of licensing approaches, not created equal.

Linked data, combining sources with different licenses, makes licensing unclear – interoperability challenges.

* Lack of license – obvious problem
* Copyleft clauses (sharealike) – makes interoperability hard
* Proliferation of semi-custom terms – difficulties of interpretation
* Non-open public licenses (eg noncommercial) – more difficulties of interpretation

Technical, semantic, and legal challenges.
Research aims to capture semantics of licenses in a machine-readable format to align with, and interpret in context of, research practice. Need to go beyond natural language legal text. License metadata: RDF is a useful tool – allows sharing and reasoning over implications. Lets us work out whether you can combine sources.

Mapping terminology in licenses to research jargon.
Eg “reproduce” “making an exact Copy”
“collaborators” “other Parties”

This won’t help if there’s no license, or legally vague, or for novel use cases where we’re waiting for precedent (eg text mining over large corpuses)

Compatibility chart of Creative Commons licenses – some very restricted. “Pathological combinations of licenses”. Computing this can help measure combinability of data, degree of openness. Help understanding of propagation of rights and obligations.
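[A rough idea of what “computing combinability” could look like. This is a simplified reading of the Creative Commons compatibility chart, not the presenters’ RDF model: each licence is reduced to the set of restrictions it adds beyond attribution, and the licence names, the `can_combine` function and the rules in it are my own assumptions for illustration.]

```python
from itertools import combinations

# Simplified model: the restrictions each licence imposes beyond attribution.
# NC = NonCommercial, SA = ShareAlike, ND = NoDerivatives.
LICENSES = {
    "CC0": frozenset(),
    "BY": frozenset(),
    "BY-SA": frozenset({"SA"}),
    "BY-NC": frozenset({"NC"}),
    "BY-NC-SA": frozenset({"NC", "SA"}),
    "BY-ND": frozenset({"ND"}),
    "BY-NC-ND": frozenset({"NC", "ND"}),
}


def can_combine(a, b):
    """Return True if works under licences a and b can be remixed into one work."""
    ra, rb = LICENSES[a], LICENSES[b]
    # NoDerivatives material cannot be remixed at all.
    if "ND" in ra or "ND" in rb:
        return False
    # A ShareAlike licence must carry over to the result, so the other
    # work's restrictions have to be covered by it already.
    if "SA" in ra and not rb <= ra:
        return False
    if "SA" in rb and not ra <= rb:
        return False
    return True


# Pairs of different licences that cannot be combined under this model.
pathological = [(a, b) for a, b in combinations(LICENSES, 2)
                if not can_combine(a, b)]
# A crude "degree of openness": how many licences each one can be combined with.
openness = {a: sum(can_combine(a, b) for b in LICENSES) for a in LICENSES}
print(pathological)
print(openness)
```

Even this toy version shows the copyleft problem from the list above: BY-SA and BY-NC-SA are both “open-ish” but cannot be mixed with each other.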

Discussion of licensing choices should go beyond personal/institutional policies.

Comment: PhD student writing thesis and reusing figures from publications. For anything published by IEEE legally had to ask for permission to reuse figures he’d created himself. Not just about datasets but anything you put out.

Comment: “Best way to hide data is to publish a PhD thesis”.

Q: Have you started implementing?
A: Yes but still early on coding as RDF structure and asking simple questions. Want to dig deeper.

Q: Get in trouble with practicing law – always told by institution to send questions to IP lawyers etc. Has anyone got mad at you yet?
A: I do want to talk to a lawyer at some point. Can get complex fast especially pulling in cross-jurisdiction.
Comment: This will save time (=$$$) when talking to lawyer.
A: There’s a lot of situations where you don’t need a lawyer – that’s more for fringe cases.

U of Washington eScience Institute #nzes

eScience and Data Science at the University of Washington eScience Institute
“Hangover” Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science & Engineering, University of Washington

Scientific process getting reduced to database problem – instead of querying the world we download the world and query the database…

UoW eScience Inst to get in the forefront of research in eScience techniques/technology, and in fields that depend on them.

3Vs of big data:
volume – this gets lots of attention but
variety – this is the bigger challenge

Sources a longtail image from Carol Goble showing lots of data in Excel spreadsheets, lab books, etc, is just lost.
Types of data stored – especially data data and some text. 87% of time is on “my computer”; 66% a hard drive…
Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes).
No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.

Problem – how much time do you spend handling data as opposed to doing science? General answer is 90%.
May be spending a week doing manual copy-paste to match data because not familiar with tools that would allow a simple SQL JOIN query in seconds.
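[For anyone who hasn’t met a JOIN: a minimal sketch using Python’s built-in sqlite3 module. The two tables and their contents are invented for illustration; the point is that matching rows on a shared id is one query rather than a week of copy-paste.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (sample_id TEXT, site TEXT);
    CREATE TABLE readings (sample_id TEXT, temperature REAL);
    INSERT INTO samples VALUES ('s1', 'north'), ('s2', 'south'), ('s3', 'north');
    INSERT INTO readings VALUES ('s1', 11.5), ('s2', 14.0), ('s3', 12.5);
""")

# Match each reading to its sample's site via the shared sample_id column.
rows = conn.execute("""
    SELECT s.site, r.sample_id, r.temperature
    FROM samples AS s
    JOIN readings AS r ON r.sample_id = s.sample_id
    ORDER BY r.sample_id
""").fetchall()
print(rows)
```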
Sloan Digital Sky Survey incredibly productive because they put the data online in database format and thousands of other people could run queries against it.

SQLShare: Query as a service
Want people to upload data “as is”. Cloud-hosted. Immediately start writing queries, share results, others write their queries on top of your queries. Various access methods – REST API -> R, Python, Excel Addin, Spreadsheet crawler, VizDeck, App on EC2.

Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world but at the frontier of research this shared consensus by definition doesn’t exist, or will change frequently, and data found in the wild will typically not conform to standards. So modifies Maslow’s Needs Hierarchy:
Usually storage > sharing > curation > query > analytics
Recommends: storage > sharing > query > analytics > curation
Everything can be done in views – cleaning, renaming columns, integrating data from different sources while retaining provenance.
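[A small sketch of “cleaning in views”, again with Python’s sqlite3 rather than SQLShare itself. The table, the cryptic column names and the -999 missing-value sentinel are made up for the example. The raw upload stays untouched and the view does the renaming, filtering and provenance tagging.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw upload kept "as is": cryptic column names, a sentinel for missing values.
    CREATE TABLE raw_upload (col1 TEXT, col2 REAL);
    INSERT INTO raw_upload VALUES ('s1', 11.5), ('s2', -999), ('s3', 12.5);

    -- Cleaning happens in a view; the raw table is never modified.
    CREATE VIEW clean_readings AS
    SELECT col1 AS sample_id,
           col2 AS temperature,
           'raw_upload' AS source_table
    FROM raw_upload
    WHERE col2 <> -999;
""")

clean = conn.execute("SELECT * FROM clean_readings ORDER BY sample_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_upload").fetchone()[0]
print(clean, raw_count)
```

Anyone who disagrees with the cleaning rule can write a different view over the same raw table, which is roughly the “curate incrementally” idea.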

Bring the computation to the data. Don’t want just fetch-and-retrieve – need a rich query service, not a data cemetery. “Share the soup and curate incrementally as a side-effect of using the data”.

Convert scripts to SQL and lots of problems go away. Tested this by sending postdoc to a meeting and doing “SQL stenography” – real-time analytics as discussion went on. Not a controlled study – didn’t have someone trying to do it in Python or R at same time – but would challenge someone to do it as quickly! Quotes (a student?) “Now we can accomplish a 10minute 100line script in 1 line of SQL.” Non-programmers can write very complex queries rather than relying on staff programmers and feeling ‘locked out’.

Data science
Taught an intro to data science MOOC with tens of thousands of students. (Power of discussion forum to fix sloppy assignment!)

Lots of students more interested in building things than publishing, and are lost to industry. So working on ‘incubator’ projects, reverse internships pulling people back in from industry.

Q: Have you experimented with auto-generating views to cleanup?
A: Yes, but less with cleaning and more deriving schemas and recommending likely queries people will want. Google tool “Data wrangler”.

Q: Once again people using this will think of themselves as ‘not programmers’ – isn’t this actually a downside?
A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there’s no good support for development in SQL. Risk that giving people power but not teaching programming. But mostly trying to get people more productive right now.

HuNI; NZ humanities eResearch; flux in scientific knowledge #nzes

Humanities Networked Infrastructure (HuNI) Virtual Laboratory: Discover | Analyse | Share
Deb Verhoeven, Deakin University
Conal Tuohy and Richard Rothwell, VeRSI
Ingrid Mason, Intersect Australia

Richard Rothwell presenting. I’ve previously heard Ingrid Mason talk about HuNI at NDF2012.

Idea of a virtual laboratory as a container for data (from variety of disciplines) and a number of tools. But many existing tools are like virtual laboratories themselves, often specific to disciplines.

Have a .9EFTS ontologist. Also project manager, technical coordinator, web page designer, tools coordinator and software developer.

Defined project as linked open data project. Humanities data goes into HuNI triple store (using RDF), embedded in HuNI virtual lab to create user interface. Embellishments include providing linked open data via SPARQL and publishing via OAI-PMH; using AAF (Shibboleth) authentication; and using a Solr search server for the virtual lab.
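[If “triple store” is unfamiliar: data is held as (subject, predicate, object) statements and queried by pattern. The sketch below is plain Python, not RDF or SPARQL proper, and the identifiers and predicates are invented rather than taken from HuNI. It only shows the shape of the idea.]

```python
# Each triple is (subject, predicate, object).
triples = {
    ("person:1", "name", "Example Person"),
    ("film:1", "title", "Example Film"),
    ("film:1", "directedBy", "person:1"),
    ("film:2", "title", "Another Film"),
    ("film:2", "directedBy", "person:1"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    much like a variable in a SPARQL triple pattern."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Which films did person:1 direct, and what are they called?"
films = [s for s, _, _ in match(triples, p="directedBy", o="person:1")]
titles = [o for film in films for _, _, o in match(triples, s=film, p="title")]
print(titles)
```

Because every dataset is reduced to the same three-part statements, records from different collections can be linked just by sharing an identifier.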

Have ideas of research use-cases (basic and advanced eg SPARQL queries) and desired features, eg custom analysis tools. The challenge is to get internal bridging relationships between datasets and global interoperability. Aggregating doesn’t solve siloisation.

“Technology-driven projects don’t make for good client outcomes.”

Q: What response from broader humanities community?
A: Did some user research, not as much as wanted. Impediment is that when building database tend to have more contact with people creating collections than people using them. Trying to build framework/container first and idea is that researchers will come to them and say “We want this tool” and they’ll build it. Funding set aside for further development.

Q: You compared this to Galaxy, but you’ve built from ground-up where Galaxy is more fluid. A person with command-line can create tools in Galaxy but with HuNI you’d have to do it yourself.
A: Bioinformatics folk tend to be competent with Python – but we’re not sure what competencies our researchers will have, less likely to be able to develop for themselves.

Requirements for a New Zealand Humanities eResearch Infrastructure
James Smithies, University of Canterbury
Vast amounts of cultural heritage being digitised or being born online. Humanities researchers will never be engineers but need to work through the issues.

International context:
Humanities computing’s been around for decades but still in its infancy. US, UK, even Aus have ongoing strategic conversations, which helps build roadmaps. NZ is quite far behind these (though have used punchcards where necessary). “Digging into Data Challenge” overseas but we’re missing out because of lack of infrastructure and lack of awareness.

Fundamentals of humanities eresearch:
HuNI provides a good model. Need a shift from thinking of sources as objects to viewing them as data. Big paradigm shift. Not all will work like this. But programmatic access will become more important.

National context:
19th century ship’s logs, medical records from leper colonies. Hard to read, incomplete, possibly accurate. Have traditional methods to deal with these but problems multiply when ported into digital formats. Big problem is lack of awareness of what opportunities exist. So capabilities and infrastructure are low. Decisions often outsourced to social sciences.
At the same time, DigitalNZ, National Digital Heritage Archive, Timeframes archive, AJHR, PapersPast, etc are fantastic resources that could be leveraged if we come up with a central strategy.


  • Need to develop training schemes
  • Capability building. Lots of ideas out there but people don’t know where to start. Need to look at peer review, PBRF – how to measure quality and reward it.
  • International collaboration
  • Requirements elicitation and definition
  • Funding for all of the above including experimentation

Q: Data isn’t just data, it’s situated in a context. Being technology-led and using RDF is one thing. But how do we give richness to a collection?
A: Classic example would be researcher wanting access to object properly marked up and contribute to the conversation by adding scholarly comments, engage with other marginalia. Eg ancient Greek text corpus (is I think describing the Perseus Digital Library). Want both a simple interface and programmatic access.

Q: Need to make explicit the value of an NZ corpus. Have some pieces but need to join up. Need to work with DigitalNZ. Once we have corpus can look at tools.
A: Yes, need to get key stakeholders around table and talk about what we need.

Capturing the flux in Scientific Knowledge
Prashant Gupta & Mark Gahegan, The University of Auckland
Everything changes – whether the physical world itself or our understanding of the world:
* new observation or data
* new understanding
* societal drivers
How can we deal with change and make our tools and systems more dynamic to deal with change?

Ontology evolution – have done lots of work on this. Researchers have updated knowledge structure and incorporated in forms of provenance or change logs. Tells us “knowledge that” eg What is the change, when it happened, who did it, to what, etc. But we still don’t capture “knowledge how” or “knowledge why”.

Life cycle of a category:
Processes, context, researchers’ knowledge are involved in birth of a category – but these tend to be lost when the category’s formed. We’re left with the category’s intension, extension, and place in the conceptual hierarchy. Lots of information not captured.

“We focus on products of science and ignore process of science”.

Proposes connecting static categories and the process of science to get a better understanding. Could act as a fourth facet to a category’s representation. Can help address interoperability problem and help track evolution of categories.

Process model:
Process of science > gives birth to conceptual change > modifies scientific artifacts > connected as linked science > improves process of science.

If change not captured, network of artifacts will become inconsistent and linked science will fail.

Proposes building a computational framework that captures and analyses changes, creating a category-versioning system.

Comment from James Smithies: would fit well in humanities context.
Comment: drawing parallel with software development changeset management.

NZ e-Infrastructures Panel #nzes

NZ e-Infrastructures Panel
Nick Jones, New Zealand eScience Infrastructure
Steve Cotter, REANNZ
Andrew Rohl, Curtin University, ex ED iVEC
Tony Lough, NZ Genomics Ltd
Don Smith, NZ Synchrotron Group Ltd
Rhys Francis, eResearch Coordination Project

How are we doing and how can we work better with Australia?
* NJ: Have been working closer recently, but big gaps in data especially, and unevenness in various disciplines.
* SC: Working to identify gaps and work across organisations. REANNZ working closer with AARnet than have in the past which is bearing fruit re bandwidth.
* Political overlay – need to be able to say we’ve got the scientific partnership working.
* RF: Fair amount of partnership. But have found that governance separates things. “I don’t believe in uninterpreted data.” Need to figure out combo of data and tools to get results.
* Plenty of opportunity to work with Australia. Useful to look at infrastructures and what they’ve done right and haven’t done right – lessons to be learnt.
* AR: Problems faced here are not unique so you can avoid our mistakes and make your own instead. 🙂

National Science Challenge signals government would like to roll framework out further. How do researchers engage with this?
* NJ: At many workshops people already know what they want to work on; at others there’s range of possibilities. Need to build networks so not everyone has to be at table.
* RF: eResearch and IT isn’t mentioned in challenges – but these are embedded in everything. If you want to be world-class at X, you need to be good at computer science.

How would you benchmark and measure return on investment?
* AR: Instance where in early days govt felt that if people wanted to keep investing, it must be valuable. This is changing now that investments are bigger. Hesitant about benchmarking because don’t really want to be doing the same as anyone else.
* RF: How do you go from 0 to world’s best supercomputer overnight? No idea how to measure that. It’s a commitment to the advancement of knowledge but the govt doesn’t have a KPI about that…

NZ had to set up Tuakiri because differences in law meant we couldn’t use Australia’s system. What other things the two countries might have to do to overcome differences in legislation?
* (Other audience member) – Yes there are differences so have needed to build systems that deal with both privacy acts and have been successful.
* (Anne Berryman) – Have started conversation with counterparts overseas and chief science advisors in Aus/NZ have a line of communication. There are platforms and issues we can deal with.

One goal is to achieve self-sustainability, eg user charging, member contributions. What’s the Australian experience in user-pays and sustainability?
* RF: Financial benefits are overwhelming. If went to commercial provider it’d cost more and do less. Sustainability needs constant flow of funds to keep supercomputing running. There is a sustainability cliff. Govt keeps putting money in.
* SC: MBIE have removed self-sustainability requirement. Charging to make sure researchers have skin in the game does prove that service is needed; but not everyone can participate who should be.

Introducing the HathiTrust Research Center #nzes

Unlocking the Secrets of 3 Billion Pages: Introducing the HathiTrust Research Center
Keynote from J. Stephen Downie, Associate Dean for Research and a Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign.

Hathi a membership organisation – mostly top-tier US unis, plus three non-US.
“Wow” numbers:
* 10million volumes including 3.4million volumes in the US public domain
* 3.7 billion pages
* 482 TB of data
* 127 miles of books

Of the 3.4 million volumes in the public domain, about a third are in public domain only in the US; the rest are public domain worldwide (4% US govt documents so public domain from point of publication).

48% English, 9% German (probably scientific publications from pre-WWII).

Services to member unis:
* long term preservation
* full text search
* print on demand
* datasets for research

Bundles have for each page a jpg, OCR text, xml which provides location of words on each page.
METS holds the book together – points to each image/text/xml file. And built into the METS file is structure information eg table of contents, chapter start, bibliography, title, etc.
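[To make the “METS holds the book together” idea concrete, a sketch that reads a tiny METS-style file with Python’s standard library. The element names and namespaces are standard METS and XLink, but the sample document, its file names, IDs and USE/LABEL values are invented and much simpler than a real HathiTrust file.]

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical one-page volume: an image, its OCR text, and word coordinates.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="image">
      <mets:file ID="IMG1"><mets:FLocat LOCTYPE="URL" xlink:href="00000001.jpg"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="ocr">
      <mets:file ID="OCR1"><mets:FLocat LOCTYPE="URL" xlink:href="00000001.txt"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="coordinates">
      <mets:file ID="XML1"><mets:FLocat LOCTYPE="URL" xlink:href="00000001.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="volume">
      <mets:div TYPE="page" ORDER="1" LABEL="TITLE">
        <mets:fptr FILEID="IMG1"/>
        <mets:fptr FILEID="OCR1"/>
        <mets:fptr FILEID="XML1"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

root = ET.fromstring(SAMPLE)

# The file section maps each file ID to where the file lives.
files = {}
for file_el in root.findall(".//mets:file", NS):
    files[file_el.get("ID")] = file_el.find("mets:FLocat", NS).get(XLINK_HREF)

# The structural map lists pages in order and points at their files by ID.
pages = []
for div in root.findall(".//mets:div[@TYPE='page']", NS):
    hrefs = [files[ptr.get("FILEID")] for ptr in div.findall("mets:fptr", NS)]
    pages.append((int(div.get("ORDER")), div.get("LABEL"), hrefs))
print(pages)
```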
Public domain data available through web interfaces, APIs, data feeds

“Public-domain” datasets still require a signed researcher statement. Stuff digitised by Google has copyright asserted over it by Google. And anything from 1872-1923 is still considered potentially under copyright outside of the US. Working on manual rights determination – have a whole taxonomy for what the status is and how they assessed it that way.

Non-consumptive research paradigm – so no one action by one user, or set of actions by a group of users, could be used to reconstruct works and publish. So users submit requests, Hathi does the compute, and sends results back to them. [This reminds me of old Dialog sessions where you had to pay per search so researchers would have to get the librarian to perform the search to find bibliographic data. Kind of clunky but better than nothing I guess…]

Meandre lets researcher set up the processing flow they want to get their results. Includes all the common text processing tasks eg Dunning Loglikelihood (which can be further improved by removing proper nouns). Doesn’t replace a close-reading – answers new questions. Correlation-Ngram viewer so can track use of words across time.
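[For reference, Dunning’s log-likelihood compares how often a word turns up in two sets of text against what you’d expect if there were no difference between them. A minimal sketch of the usual 2×2 calculation follows. The function name and the example counts are mine, and this is not Meandre’s implementation.]

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Dunning log-likelihood (G2) for one word's frequency in corpora A and B.

    count_* is how often the word occurs; total_* is the corpus size in words.
    """
    n = total_a + total_b
    word_total = count_a + count_b
    other_total = n - word_total
    # (observed, expected) for the four cells of the contingency table.
    cells = [
        (count_a, total_a * word_total / n),
        (count_b, total_b * word_total / n),
        (total_a - count_a, total_a * other_total / n),
        (total_b - count_b, total_b * other_total / n),
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(obs * math.log(obs / exp) for obs, exp in cells if obs > 0)


# Hypothetical counts: same relative frequency versus a marked difference.
same_rate = log_likelihood(10, 1000, 20, 2000)
big_gap = log_likelihood(50, 1000, 10, 1000)
small_gap = log_likelihood(12, 1000, 10, 1000)
print(same_rate, big_gap, small_gap)
```

A score near zero means the word is used at about the same rate in both sets; larger scores flag words that are distinctive of one or the other.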

OCR noise is a major limitation.

Downie wants to engage in more collaborative projects, more international partnerships, and move beyond text and beyond humanities. Just been awarded a grant for “Work-set Creation for Scholarly Analysis: Prototyping Project”. Non-trivial to find a 10,000-work subset of 10million works to do research on – project aims to solve this problem. Also going to be doing some user-needs assessments, and in 2014 will be awarding grants for four sub-projects to create tools. Eg would be great if there was a tool to find what pages have music on.

Ongoing challenges:
How do we unlock the potential of this data?
* Need to improve quality of data; improve metadata. Even just to know what’s in what language!
* Need to reconcile various data structure schemes
* May need to accrete metadata (there’s no perfect metadata scheme)
* Overcoming copyright barriers
* Moving beyond text
* Building community

eResearchNZ2013 Day 1 Wrap-up #nzes

Selected notes from the audience inspired by today’s sessions:

  • Synergies between sectors, between Australia/New Zealand. Ability to move to researcher-centric rather than infrastructure-centric.
  • No connections apparent to government systems which are needed by digital humanities.
  • From experience researchers need lots of help. Australian ideal seems to be it’s all there and easy-to-use on desktop. Nice ideal but how practical?
  • Data management and data curation are still “dragons in a swamp. We know there’s dragons there, don’t know what they look like, but we’re planning to kill them anyway.”
  • Need data management policy and a national solution. And if going to invest all this money in research don’t want to delete all the data so need to work on preservation too.
  • Good to see REANNZ looking at service level and tools. Lots to learn from Australia about where we need to put our efforts.
  • There is a policy direction from government around access and reuse of data. Challenge is around how to most effectively implement this. Especially re publicly funded research (cf commercially sensitive) there’s an expectation that there’d be access to the results and, where possible, the data. But still work to do.
  • Users who don’t get help can get something out of the system; but users who do get help can do a whole lot more. Hence software carpentry sessions. [Cf this blog post about software carpentry I coincidentally read today.]
  • Peer instruction becomes very important – need someone who’s doing similar things to come in and teach researchers and students.
  • Can embed slides, photos, etc into ‘abstract’ pages linked from the conference programme.
  • Many tools and skills great to instill in people but don’t always fit with packages – eg version control doesn’t really work with MATLAB. 🙁
  • Therefore “the less software researchers write, the better”. There’s a limit to how much we can afford to maintain.
  • Benefit to software carpentry is so people can collaborate on software rather than write your own. The best software is what lots of people work on.

All about Data Network Performance #nzes

[This post just covers the first half of the session]

All about Data Network Performance
Alessandra Scicchitano, SWITCH
Domenico Vicinanza, DANTE
James Wix & Sam Russell, REANNZ

1. Networking background – the models and layers
Two major models in networking:
Open Systems Interconnection (more conceptual – to define what you’re trying to do) got overtaken by:
TCP/IP – aims at robustness and end-to-end functionality

OSI’s layers are Physical (optics/interfaces), Data Link, Network, Transport, Session, Presentation, Application.
TCP/IP runs the world. Layers: Link, Internet, Transport, Application.

2. The tools and what they can do
Ping sends echo request, gets echo reply and tells us roundtrip time

Traceroute sends packets with time-to-live (TTL) increasing in increments, so as each one expires along the path and an error is sent back you can see where they’re going.

OWAMP – one-way active measurement protocol aka one-way ping – because standard ping doesn’t show delay direction. This can handle asymmetry but requires NTP for reliable useful results.

Iperf – commandline, measures bandwidth (TCP) and quality (UDP) of a network path. Client/server architecture (client is sender, server is receiver). Jperf is a java graphical front-end. Most useful to watch in second-intervals. Can send parallel and bidirectional tests.

Nuttcp – can show where dropped packets are.

Bwctl – wrapper for iperf, nuttcp, thrulay. Command line.

UDPmon – sends stream of carefully spaced UDP packets and records set of metrics (timestamps, packets received, lost, arrived in bad order, lost in network; bytes received and bytes/frame rate; elapsed time; time per received packet; receiver data rate and wire rate (Mb/s)). Histogram option outputs data you can paste straight into Excel to create graph.

3. Network transfer
Simple network transfers: Sender sends packet, receiver receives, receiver sends acknowledgement, sender receives acknowledgement. Wastes lots of time and bandwidth.

Windowing: So send more than one packet at a time and see which stuff arrived and which didn’t. With selective acknowledgements can say “Am missing #4 but got the rest so don’t resend that.” Send enough data to fill the whole pipe. Send it again while waiting for acknowledgements. Amount depends on round trip time. Eg sending 360Mb/s with a 280ms round trip latency then BDP (bandwidth-delay product) = 360Mb/s × 0.28s = 100.8Mb = 100.8/8 MB, so need a 12.6MB window.
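[The window arithmetic as a small sketch, reading the example as 360 megabits per second over a 280ms round trip. The function names are mine. The second function turns the sum around to show what a small window costs: 65,535 bytes is the largest window TCP can advertise without the window scaling option, and I use it here only as an illustrative untuned case.]

```python
def bdp_bytes(bandwidth_mbps, rtt_ms):
    """Bandwidth-delay product in bytes: how much data is in flight on a full pipe."""
    bits_in_flight = bandwidth_mbps * 1_000_000 * rtt_ms / 1000
    return bits_in_flight / 8


def window_limited_mbps(window_bytes, rtt_ms):
    """Highest throughput (megabits/s) one flow can reach with the given window,
    since at most one window of data can be sent per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000


needed_window = bdp_bytes(360, 280)
untuned_rate = window_limited_mbps(65_535, 280)
print(needed_window / 1_000_000)  # window needed, in MB
print(untuned_rate)               # Mb/s achievable with a 65,535-byte window
```

With the small window the flow manages under 2Mb/s on a path that could carry 360Mb/s, which is the gap that tuning is meant to close.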

Default OS settings slow you down. To get the best out of REANNZ you need to tune for “elephant” flows (Long Fat Network aka LFN). Tuning your TCP can massively improve performance.

Design patterns for lab #labpatterns; Research cloud – #nzes

A pattern language for organising laboratory knowledge on the web #labpatterns
Cameron McLean, Mark Gahegan & Fabiana Kubke, The University of Auckland
Google Site

Lots of lab data hard to find/reuse – big consequences for efficiency, reproducibility, quality.
Want to help researchers locate, understand, reuse, and design data. Particularly focused on describing semantics of the experiment (rather than semantics of data).

Design pattern concept originated in field of architecture. Describes a solution to a problem in a context. Interested in idea of forces – essential/invariant concepts in a domain.

Kitchen recipe as example of design pattern for lab protocol.
What are recurring features and elements? Forces for a cake-like structure include: structure providers (flour), structure modifier (egg), flavours; aeration and heat transfer.

Apply this to lab science, in a linked science setting. Take a “Photons alive” design pattern (using light to visualise biological processes in an animal). See example paper. Can take a sentence re methodology and annotate eg “imaging” as diagnostic procedure. Using current ontologies, this gives you the What but not the Why. Need to tag with a “Force” concept eg “immobilisation”. Deeper understanding of process – with role of steps. And can start thinking about what other methods of immobilisation there may be.

So how can we make these patterns? Need to use semantic web methods.
A wiki for lab semantics. (Wants to implement this.) Semantic form on wiki – a template. Wiki serves for attribution, peer review, publication – and endpoint to RDF store.
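To make the What/Why distinction concrete for myself, here's a rough sketch of the annotations as subject-predicate-object triples in plain Python. Everything in it is hypothetical: the step names, the "ex:" and "force:" prefixes, the "addressesForce" predicate, and the two immobilisation methods are my own illustrations, not the speakers' vocabulary or a real ontology.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical annotations for a protocol. "rdf:type" gives the What;
# the invented "ex:addressesForce" predicate records the Why.
triples = [
    Triple("step:imaging", "rdf:type", "ex:DiagnosticProcedure"),
    Triple("step:anaesthesia", "rdf:type", "ex:Procedure"),
    Triple("step:anaesthesia", "ex:addressesForce", "force:Immobilisation"),
    Triple("step:restraint", "rdf:type", "ex:Procedure"),
    Triple("step:restraint", "ex:addressesForce", "force:Immobilisation"),
]


def steps_addressing(force: str) -> list[str]:
    """Find every step tagged as a response to the given force."""
    return [
        t.subject
        for t in triples
        if t.predicate == "ex:addressesForce" and t.obj == force
    ]


print(steps_addressing("force:Immobilisation"))
```

Querying by force rather than by type is what lets you ask "what other methods of immobilisation are there?", which the type annotations alone can't answer.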

Q: How easy is this to use for a domain expert?
A: Semantic modeling is iterative process and not easy. But semantic wiki can hide complexity from enduser so domain expert can just enter data.

Q: We spend lots of time pleading with researchers to fill out webforms. How else can we motivate them, eg to do it during process rather than at end?
A: Certain types of people are motivated to use wiki. This is first step, proof of concept. Need a critical mass before self-sustaining.

Q: How much use would this actually be for domain experts? Would people without implicit knowledge gain from it?
A: Need to survey this and evaluate. It’s valuable as a democratising process.

Q: What about patent/commercial knowledge?
A: Personally taking Open science / linked science approach – intended for research that’s intended to be maximally shared.

A “Science Distribution Network” – Hadoop/ownCloud synchronised across the Tasman
Guido Aben, AARNet; Martin Feller, The University of Auckland; Andrew Farrell, New Zealand eScience Infrastructure; Sam Russell, REANNZ

Have preferred to do one-to-few applications rather than google-style one-to-billions. Now changing. Because themselves experiencing trouble sending large files. Scraped up own file transfer system, marketed as cloudstor though not in the cloud and doesn’t store things. Expected couple hundred uses, got 6838 users over the last use. Why linear growth? “Apparently word of mouth is a linear thing…” Seem to be known by everyone who has file-sharing issues.

Can we keep files permanently?
Can I upload multiple files?
Why called cloudstor when it’s really for sending?

“cloudstor+ beta” – looks like dropbox so why doing this if already there? They’re slow (hosted in Singapore or US). Cloudstor+ 30MB/s cf 0.75MB/s as a maximum for other systems. Pricing models not geared towards large datasets. And subject to PRISM etc.

Built on a stack:
Anycast | AARNet
ownCloud – best OSS they’ve seen/tested so far – has plugin system and defined APIs
Hadoop – but looking at substituting with XtreemFS which seems to work with latencies.

Distributed architecture – can be extended internationally. Would like one in NZ, Europe, US, then scale up.

Bottleneck is from desktop to local node. Only way they can address this is to get as close to researcher as possible – want to build local nodes on campus.

Official statistics; NeSI; REANNZ; Australian eResearch infrastructure – #eResearchNZ2013

Don’t know how long I’ll be live-blogging, but here’s the start of the eResearchNZ 2013 conference:

Some thoughts on what eResearch might glean from official statistics
Len Cook,
* Research-based info competes with other sources of info people use to make decisions
Politicians like weathercocks – have to respond to wind. Sources of info include: official stats, case studies, anecdote, and ideology/policy framework. More likely to hear anecdotes than research. NZ’s data-rich but poor at getting access to existing data. Confidentiality issues: “Statisticians spend half the time collecting data and the other preventing people from accessing it.” Need to shift ideas – recent shifts in legislation a step to this.

* Official statistics has evolved over the last few centuries
19th century: measurement developed to challenge policy. Florence Nightingale wanted to measure wellbeing in military hospitals because it was like taking hundreds of young men, lining them up and shooting them. Mass computation and ingenuity of graphical presentation – all by hand.
20th century: development of sampling, reliability, meshblocks. Common classifications, frameworks.
1990s and beyond: mass monitoring of transactions. Politics of info access/ownership important. Obligations created when data collected. Registers and identifiers now central. Importance of investing in metadata to categorise and integrate information.

* Managing data not just about technology – probably the reverse.

* Structural limitations. Need strong sectoral leadership. Need a chief information officer for a sector not for government as a whole.

NeSI’s Experience as National Research Infrastructure
Nick Jones, New Zealand eScience Infrastructure
NZ is very good at scientific software. Also significant national investments in data (GeoNet, NZSSDS, StatsNZ, DigitalNZ, cellML, LRIS, LERNZ, CEISMIC, OBIS). But also significant (unintended) siloisation and no investment to break down barriers and integrate. However do have good capability. NeSI wants to enhance existing capabilities but also help people meet each other. Build up shared mission, collegiality.

Heterogeneous systems improve ability to deal with specific datasets, but increasingly need ability to adapt software. NeSI gives capability to support maturing of existing scientific computing capabilities.

CRIs are widespread. So are research universities. All connected by REANNZ (KAREN). Research becoming more highly connected, collaborative. National Science Challenges targeted to building collaboration too. But sector still fragmented and small-scale. “Each project creates, and destroys, its own infrastructure.”

Research eInfrastructure roadmap 2012 includes NZ Genomics Ltd (->Bioinformatics Cloud); BeSTGRID, BlueFern, NIWA (->NeSI); BeSTGRID Federation (Tuakiri); KAREN->REANNZ. Is a big gap in area of research data infrastructure.

Need government investment to overcome coordination failure. Institutions should support national infrastructure. NeSI to create scalable computing infrastructure; provide middleware and user-support; encourage cooperation; contribute to high quality research outputs. In addition to infrastructure have team of experts to support researchers.

REANNZ: An Instrument for Data-intensive Science
Steve Cotter, REANNZ
Move from experimental -> theoretical -> computational sciences, and now to data-intensive science (see “The Fourth Paradigm“). Exponential data growth. Global collaboration and requirement for data mobility. “Science productivity is directly proportional to the ease with which we can move data.” Trend towards cloud-based services.

And trend to need for lossless networking. Easy to predict capacity for youtube etc. But when simulating global weather patterns, datasets are giant and unpredictable – big peaks and troughs in traffic. TCP good at handling loss for small packets, but can be crushed by a large packet loss – 80x reduction in data transfer rates for NZ-type distances. So can’t rely on commercial networks.

Higgs boson work is an example of network as part of the scientific instrument and workflow.

Working on customisation, flexibility. Optimising end-to-end: Data transfer node; Science DMZ (REANNZ working with NZ unis, CRIs etc to deploy); perfSONAR.

Firewalls are harmful to large data flows and unnecessary. Not as effective as they once were.

If you can’t move 1TB in 20 minutes, talk to REANNZ – they’ll raise your expectations.
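For a sense of what that benchmark means as a sustained rate, a back-of-envelope sketch follows. It assumes a decimal terabyte (10^12 bytes) and ignores protocol overhead; the function name is mine.

```python
def required_gigabits_per_s(size_bytes: float, seconds: float) -> float:
    """Sustained rate needed to move size_bytes within the given time."""
    return size_bytes * 8 / seconds / 1_000_000_000


# 1TB (assumed 10**12 bytes) in 20 minutes.
rate = required_gigabits_per_s(10**12, 20 * 60)
print(f"{rate:.2f} Gb/s sustained")  # 6.67 Gb/s sustained
```

So the target works out to roughly two thirds of a 10Gb/s link held for the whole transfer.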

Progressing to work with services above the network.

Australian Research Informatics Infrastructure
Rhys Francis, eResearch Coordination Project
Sustained strategic investment over a decade into tools, data, computation, networks, and buildings (for computation). (Personnel hidden in all of these.) Tools are mission critical, data volumes explode, systems grow to exascale, global bandwidth scales up. High ministerial turnover; each one takes about six months then realises we need this infrastructure. Breaking it down into these areas helps explain it to people.

OTOH volume of well-curated data is not exploding.

National capabilities: Want extended bandwidth, better HPC modelling, larger data collections. Shared access highly desirable but very hard to get agreement on how.
Research integration: Want research data commons and problem-oriented digital laboratories.

Hard to explain top, and when you chop it up into bits people think “Any university could have done that bit.” But need expertise and need to share it.

In last 7 years added fibre and super-computing infrastructure. Many software tools and lab integration projects. Hundreds of data and data flow improvement projects. Single sign-on. Data commons for publication/discovery. Recruit overseas but still only so much they can resource.

These things are hard, and it was data slowing it down because didn’t know where collections would physically be. If you’re dealing with petabytes, the only way to move it is by forklift.

eResearch infrastructure brings capabilities to the researcher.
NCI and Pawsey: do computational modeling, data analysis, visualise results
NeCTAR: use new tools, apps, work remotely and collaborate in the cloud
ANDS and RDSI: keep data and observations, describe, collect, share, etc.

Current status (I’m handpicking data-related bulletpoints):
* 50,000 collections published in research data commons
* coordination project to work with implementation projects to deploy data and tools as service for key national data holdings

Looking to 2014:
* data publication delivered
* Australian data discoverable
* 50 petabytes of quality research data online
* colocation of data and computing delivered

Need to focus on content (including people/knowledge, data, tools) as infrastructure. Datasets and skillsets. Less and less bespoke tools; more and more open-source or commercial products.

Need to support and fund infrastructure as business-as-usual.