Tag Archives: eresearch

University Helpdesk for Digital Research Skills #theta2015

Reimagining the University Helpdesk for the Next Generation of Digital Research Skills (abstract)
Dr. Steven Manos, David F. Flanders and Dr. Fiona Tweedie

Can’t hope to offer one-to-one support to all the researchers they need to support (especially in the context of the “digital native researcher”) so want to reimagine how they offer support.

Asked researchers what tools they use:
eg python, git, chrome, WebGL, OpenGL, Data-Driven Documents
eg ArcGIS, Google Maps, SPSS
eg Terminal, Matlab, Dropbox, Evernote, iPhone camera
eg Anaconda, R, PsychoPy, iPython, Markdown
Often have an enormous array of tools in their toolbox but still want to add more tools, so how can we hope to help them?

“Community: it’s what makes digital research possible”. Instead of supporting researchers with tools, encourage/facilitate users of these tools to support each other. [Ooh so much potential here.] Build community. Researchers already often learn from each other. All training done by researchers. Research networks tend to be self-sustaining and ongoing.

“A helpdesk is reactive. A training community is proactive.”

Sometimes run into “I have books, leave me alone” and “I don’t computer”. But many excited by being able to flash up a paper by adding a customised map. Workshop on this, very popular, researchers coming back, had 3-4 papers come out.

Software carpentry – teaching coding to non-coders. Teaching them enough coding to be able to make use of Python, R, Matlab in their work (eg a for loop) to make their lives easier without trying to turn them into computer scientists. Taught by researchers for researchers. Intensive, hands-on, many helpers. Every 15min stop talking and they do a challenge to put into practice. Code breaks – important for people to see how this works: you google the error message, the answer is on StackOverflow and you patch it up and continue.
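As a rough illustration of the sort of "for loop" a non-coder might pick up in such a workshop, here is a minimal Python sketch. The readings and variable names are made up for this example and are not taken from the talk or from Software Carpentry's materials.

```python
# Hypothetical temperature readings in Celsius (values invented for illustration).
readings_celsius = [15.0, 20.0, 25.0]

# Convert every reading to Fahrenheit in one go instead of doing each by hand.
readings_fahrenheit = []
for celsius in readings_celsius:
    fahrenheit = celsius * 9 / 5 + 32
    readings_fahrenheit.append(fahrenheit)

print(readings_fahrenheit)
```

The same few lines work unchanged whether the list holds three readings or three thousand, which is the "make their lives easier" point.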

Data carpentry assumes no coding experience. Teaching text mining/analysis for humanities.

How do we get people involved in 3D printing? Throw a grant at them. [Ah to be in an organisation where a few thousand dollars is spare change. 🙂 ]

Research Tool Speed Dating: set up tools on workstations around the room and rotate researchers around the room – if they like it they can set up a second ‘date’ ie training.

HackyHour: come to a bar and people can come, have a drink, ask questions.

Research Bazaar: pulled 19 courses together over a 3-day event.

Different people engage in different ways so having all these methods is really important.

Why would a university want to invest/engage in something like this? [Why wouldn’t it?!] Often IT shops are enterprise-focused, not researcher-focused. Take a user-driven approach.

Asked researchers to cite them if skills help produce articles, and 2 articles have been published citing ResBaz (Research Bazaar). Much social media engagement.

ResBaz going international – Mozilla Science taking over the community. 1st week of Feb next year if you want to do it at your university.

Takeaways

  • open and collaborative platforms
  • some fanatical community engagement
  • cost-effective

Introducing the ResBaz Cookbook (in development)

Integrating user support for eResearch services within institutions (AeRO) #theta2015

Integrating user support for eResearch services within institutions. Lessons learned from AeRO Stage 2 User Support Project (abstract)
Hamish Holewa and Loretta Davis

AeRO = Australian eResearch Organisations – cooperative to deliver national services to researchers. Set up a user support project.

User support is often fragmented. Wanted to make a joined-up network. How-to guides, who to go to if there’s an issue, how to support services being designed to provide service to end-users.

Presented a maturity model to app service providers for increasing maturity. For app developers at first it’s about getting the app up and running (many have only been released in beta). AeRO user support say they understand this but here are some things you need to think about moving forward.

Outcomes – service maturity model and the “AeRO Tick” for service maturity in practice (self-assessment tool that shows you where you can grow). Incident management and ticket transfer framework. Uni IT Research Support Expert Group formed. Service catalogue definitions – how do you describe these services?

Lessons learned:

  • Maturity models work well: non-accusatory, acknowledges efforts already taken, and gives people a clear pathway forward
  • Sector-wide progress is gaining momentum: approx 65% of services have level 2 maturity (of 3 levels)
  • Value of cross-sector initiatives: wanted a central ticket system, but though this sounds like a good idea it’s heaps of work and doesn’t work well. Better to invest in templates, protocols and communication so tickets can be transferred to other providers in a standard way.
  • Enable representative groups to inform/change project
  • eResearch Uni IT Expert Group: informed outputs of the maturity model and ways to integrate into institutions. Pointed out not all institutions work the same or even care.

Future: Possibilities to expand the maturity model, expand services in the service catalogue and further engage institutions.

The librarian in the context of eResearch #vala14 #s43

Natasha Simons and Sam Searle Redefining ‘the librarian’ in the context of emerging eResearch services

Thinking about kinds of skills and knowledge that they’ve found useful and how other libraries moving into this area could gain skills / support transition. Various ways working in eResearch is quite different from traditional research support.

eResearch Services
* technical solutions – promoting tools already available, like an onsite survey manager (has just reached 1000th survey); adapting existing solutions; building custom solutions
* advice, referrals and consulting
* partnership with other providers like ANDS, Nectar, RDSI, QCIF, NCI

Apply for lots of internal and external grants which has helped grow teams to 30 people but most on term contracts.

Similarities between eresearch services and research support
* directly working with researchers
* providing advice
* supporting compliance with policies/mandates
* seeking funding
* metadata support

Differences
* combine client and technical services
* organised around projects in flat structure. Focus on project management, change management
* often have to challenge stakeholders’ assumptions, promoting change, convincing people of long-term benefits to short-term pain.
* primary working relationships aren’t with other librarians – mostly with IT who don’t always understand/respect their skills/experience

Need to be able to use your knowledge of your lack of knowledge to fill the gaps in your knowledge.

What we bring as librarians:
* being brokers and boundary spanners – babelfish translators between different groups that have their own languages (software vs metadata) and cultures
* understanding the broader policy environment and working well with different stakeholders
* promoting standards – legislative, ethical and technical. Software developers often focus on user needs above compliance/reporting/interoperability requirements.

Paper identifies eight topics; talk concentrating on three:
* metadata skills – might need to focus on collection level instead of item level, or on admin/preservation/rights management instead of just subject-based. Much has to be learned on the job.
* scholarly communication – awareness of developments in open access, research methods (someone reads lots of research not for the research but for the discussion of methodology)
* project management

More in paper about development pathways too (self learning, workplace learning, education, training).

Curious whether there are personality traits that have a bearing. 2008 study suggests different librarianship specialities attract different personalities. Technique-oriented vs people-oriented. Sam and Natasha think “adaptive archivists and systems librarians” and “adaptive academic reference librarians” best fit librarians moving into eResearch support.

For people who want to move into eResearch support, some will find it easier than others. Need to be aware of your preferences and able to assess how well they fit with the area you’re moving into. No clear pathway into it – or through/beyond it either.

For organisations there are implications for recruitment and training – instead of focusing on skills need to develop traits like resilience and assertiveness. Managers can support transition through professional development on both skills and traits.

eResearch teams can benefit from librarian involvement but much work will go on whether we’re involved or not so need to sell our value to researchers and IT.

Q: Looking 2 years in the future, are we looking at a multidisciplinary thing rather than silos of library vs departments?
A: Good way of looking at it. Need teams of people with range of skills. They do consultations as a team (software developer, data specialist, etc) instead of one librarian going out. Resource intensive but better outcome and get more respect of what they can bring to it.

Q: Have tried this but danger of overwhelming the researcher. [Nobody expects the Spanish Inquisition!]
A: Need to balance it and would avoid taking more people from ‘our’ side than their side. Not just descending on one researcher, trying to work with a group of researchers – even a research team with multiple projects to improve their practice generally.

Q: More on the eResearch Hub?
A: Research Hub pulls together stuff from grants system, data repository, etc etc – researcher profile system on speed.

Developing eResearch@Flinders #vala14 #s35

Amanda Nixon, Liz Walkley Hall, Ian McBain, Richard Constantine and Colin Carati We built it and they are coming: the development of eResearch@Flinders

“eResearch” – use of info/comm technologies in a research space:
* in data management
* high performance computing
* collaboration tools
* visualisation / haptics (tactile sense of using computing)

Operating at Flinders Uni since April 2012, started with ANDS/uni funding and longstanding relationship with academics. Using core library skills:
* liaison with researchers
* liaison with service providers
* metadata creation
* service ethic

Structure is partnership between library/ICT. Includes statistical consultant, metadata stores project officers, eresearch support librarian, open scholarship and data management librarian. Because they’re new do a lot of reporting: to library senior staff committee, info services executive, eResearch advisory committee.

Primarily dealing with data storage (big or small, complex or simple), high performance computing, collaboration skills. Identify tools and services, refer researchers to service providers, prepare info on return on investment, do outreach to researchers.

ReDBox software (had a ReDBox community day for all institutions using/developing this)
Planning, coordination of data management services – set up ReDBox, got it running, and right on cue ARC are requiring data management plans. Have done lots of outreach but now as result of ARC rule changes researchers are coming to them.

Statistical consultant – individual consultations and workshops, covering use of SPSS and NVivo; also handles licensing

Mapping old skills:
staff management -> staff management
researcher liaison -> researcher liaison
vendor relationships -> eResearch service provider relationships
assessing value of resources -> assessing value of eResearch tools
referral to services -> referral to eResearch tools
metadata creation re publications -> metadata creation re research data

New skills:
business analysis
social media
event management
managing software development
having an ear on the ground to make connections
matchmaking

Why does it work?
* we come from the library which is well-respected so good PR
* we do good liaison
* building on existing skills
* building on institutional knowledge
* don’t know all the answers but can find them
* most importantly: there was an unfulfilled need

Launch by vice-chancellor, 8-session staff development programme to introduce library staff to what they do. Since ARC rule change haven’t had to do any coldcalling because people are calling them. Brokering more access to federally funded data storage. In uni Research Strategic Plan and Info Services Strategic Plan.

Q: How to show you’re successful?
A: Want to collate list of new relationships built because of matchmaking, successful grant applications where they’ve given advice, publications coming out of things. Don’t know how to pull it together yet but probably a matter of following up and keeping relationships going.

Q: What KPIs do you have?
A: Strategic plans very high-level – getting people involved in things. Usage stats of data storage. Further down the track as business model changes, more cost, might be harder to create useful KPIs.

Innovate #vala14 #s13 #s14 #s15

Hue Thi Pham and Kerry Tanner Influences of technology on collaboration between academics and librarians

Interrelationships between collaboration, institutional structure, and technology.
Things like Google Apps tend to be used within departments – less use on smaller campuses because more casual face-to-face interaction. Level of use varies by discipline, faculty, campus.
Social technologies like Twitter used in lectures
Learning management system (eg Moodle) most important technology mentioned in interviews.
Institutional repository common space for depositing resources

Technology facilitating transition from traditional to digital library – more electronic resources, communicating over telephone, email, Skype. But purely online interaction means a reduced mutual understanding of partners’ contributions, and an old perception of librarians’ roles.

Divide between library system and learning management system leads to a divide between the two communities around these. Librarians complain they can’t do a workshop about an assignment without Moodle access to see the assignment. Academics say they think librarians could have a role but they don’t understand why they would need access or what they would do with it. Lack of coordination can be a problem – means LMS people and library people make decisions that each other isn’t aware of. Siloisation.

Library staff need to consider roles of interpersonal interaction with technology – value of tech, value of face-to-face interaction, importance of space design / architecture. Get automatic access to learning management system but avoid resulting workload. Need to find ways to integrate library management system with learning management system.

Audience comment: Involvement of librarian in discussion boards can be useful – some topics the academics are relieved to leave to librarian. But important to have awareness of mutual roles.

Lisa Ogle and Kai Jin Chen Just accept it! Increasing researcher input into the business of research outputs

Implementing Symplectic Elements at UoNewcastle. (37,000 students, 1000 academics plus 1500 professional staff) HERDC is reporting exercise to Australian government to secure funding – sounds similar to New Zealand’s PBRF. Work managed by research division but most data entry done by admin folk. Issues include duplicate data entry, variance in data quality, many publications never reported – funding missed out on. Library asked to assist from 2005 – centralised model addresses many issues.

Various identification mechanisms: scholarly databases, researchers, conference lists, uni website, library orders. All put manually into Endnote library, then manually copy/pasted into Callista database. Labour-intensive and would often be a 2-6 month delay for researchers, very frustrating.

Getting Elements. Loved harvesting from databases (based on search settings: “We think this is your publication, please log in to claim or reject it”). Originally not keen on opening up to researchers, but after demos got convinced researchers could add manual entry without compromising data quality as library/research staff can verify and lock it.

Benefits: database searches can be customised to minimise false positives/negatives. Can delegate others to act on researchers’ behalf. Publications appear on profile within 48 hours. Can upload Endnote libraries. Can include ‘in press’ publications without messing up workflow. Easily generate publication lists. Capture of bibliometric data. Pretty graphs on user’s dashboard.

Have been running 4 months; two-thirds of publishing academics have logged in and interacted with system. (800 in first two weeks, and a lull over summer). 2900 publications in the system from current collection year (usually 3500).

Challenges: early adopter in Australian market. Development module took longer than expected – learned that everyone does HERDC differently.

Most negative feedback so far is from people who haven’t yet logged into the system. Someone complaining it was too hard – talked her through it over the phone and now fine.

Need to investigate further repository integration.

Malcolm Wolski and Joanna Richardson Terra Nova: a new land for librarians?
Big issues emerging around vast amounts of data and trying to connect it. Global connectedness another impact.

Researchers needing a “dry lab” to work with data instead of hands-on wet-lab. Seeing this in many areas.
Researchers can’t afford to work solo any more. Much infrastructure costs beyond reach of individual researcher or individual centre. Problems are too much for one person.
Can get storage and computing power – but may need to work with data for ten years so need to be able to retain it and keep working on it through changing technology. Lots of outputs are governmental reports not journal articles.
Most large research projects these days involve communities – even incorporated bodies.
80% of papers in the EU are of people collaborating with people outside their institution.

NeCTAR have invested heavily in virtual laboratories because it’s not just about creating data but using it – of course this creates more data.
In theory nothing stops a researcher going to Research Data Storage Infrastructure for storage without their university knowing.
Various community solutions like Tropical Data Hub, Australian National Corpus – slide lists a pile and he points out that for each of these, some institution has put their hand up to take responsibility for maintenance.

Approach of institutions keeping their own data but having to share metadata. Requires lots of discussion around data schemas – what you expect to find in data descriptions. Eg Research Data Australia from 85 participating organisations and growing. Goal to get more data, better connected data, more findable/usable.

Two impacts around:
Research tools: New suite from NeCTAR and ANDS eg virtual laboratories, discipline-specific tools. Need to choose which we’ll support, which data collection schemes we’ll be involved in. May need to develop our own tools for specific disciplines.
Library/research collaboration: Moving more to a partnership model.

Libraries provide support for data management plans and citing data, but there’s huge demand for archiving/preserving data.

Impact on university libraries:

  • New jobs coming out for the “databrarian”.
  • Need research services to help develop common data structures
  • Participation in cross-disciplinary teams bringing librarian skills
  • Development of legal frameworks for acquiring, generating, storing and sharing data
  • Assisting with development of tools – lots of disciplines have different ways of exploring/analysing data so national collections/communities may have specific search (eg maps, chemical structure, vs facets) or visualisation tools.
  • Archiving and preservation services

Librarian support roles

  • Sourcing relevant data sets
  • Consultancy – identify faculty needs, refer back to experts
  • Targeted outreach services re data citation or data repositories
  • New support service tools and processes

Want to be able to offer a service to researchers and them not have to worry about where it’s stored, whether on campus or Amazon Web Services or whatever.

NeSI; publishing data; open licenses #nzes

Connecting Genetics Researchers to NeSI
James Boocock & David Eyers, University of Otago
Phil Wilcox, Tony Merriman & Mik Black, Virtual Institute of Statistical Genetics (VISG) & University of Otago

Theme of conference “eResearch as an enabler” – show researchers that eresearch can benefit and enable them.
There’s been a genomic data explosion – genomic, microarray, sequencing data. Genetics researchers need to use computers more and more. Computational cost increasing, need to use shared resources. “Compute first, ask questions later”.

Galaxy aims to be web-based platform for computational biomedical research – accessible, reproducible, transparent. Has a bunch of interfaces. Recommends shared file system and splitting jobs into smaller tasks to take advantage of HPC.

Goal to create an interface between NeSI and Galaxy. Galaxy job > a job splitter > subtasks performed at NeSI then ‘zipped up’ and returned to Galaxy. Not just file splitting by lines, but by genetic distance. Gives different sized files.
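The talk didn't spell out the splitting rule, so the following is only a guess at the general idea: start a new chunk wherever the gap between neighbouring markers exceeds a threshold. The marker format, the function name `split_by_genetic_distance` and the `max_gap_cm` parameter are all assumptions made for this sketch, not the project's actual code.

```python
from typing import List, Tuple

# Hypothetical record format: (marker name, position in centimorgans).
Marker = Tuple[str, float]


def split_by_genetic_distance(markers: List[Marker], max_gap_cm: float) -> List[List[Marker]]:
    """Group markers into chunks, starting a new chunk whenever the gap
    to the previous marker (by position) exceeds max_gap_cm."""
    chunks: List[List[Marker]] = []
    for marker in sorted(markers, key=lambda m: m[1]):
        if chunks and marker[1] - chunks[-1][-1][1] <= max_gap_cm:
            chunks[-1].append(marker)  # close enough: same chunk
        else:
            chunks.append([marker])  # first marker or big gap: new chunk
    return chunks


# Invented positions, chosen so the chunks come out different sizes.
markers = [("m1", 0.0), ("m2", 0.5), ("m3", 0.75), ("m4", 5.0), ("m5", 5.25), ("m6", 12.0)]
chunks = split_by_genetic_distance(markers, max_gap_cm=1.0)
for chunk in chunks:
    print([name for name, _ in chunk])
```

Unlike splitting every N lines, the chunk boundaries here follow the data, which is why the resulting files differ in size.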

Used git/github to track changes, and Sphinx for python documentation. Investigating Shibboleth for authentication. Some bugs they’re working on. Further looking at efficiency measures for parallelization, building machine-learning approach to doing this.

Myths vs Realities: the truth about open data
Deborah Fitchett & Erin-Talia Skinner, Lincoln University
Our slides and notes available at the Lincoln University Research Archive

Some rights reserved: Copyright Licensing on our Scholarly record
Richard Hosking & Mark Gahegan, The University of Auckland

Copyright law has effect on reuse of data. Copyright = bundle of exclusive rights you get for creating work, to prevent others using it. Licensing is legal tool to transfer rights. Variety of licensing approaches, not created equal.

Linked data, combining sources with different licenses, makes licensing unclear – interoperability challenges.

* Lack of license – obvious problem
* Copyleft clauses (sharealike) – makes interoperability hard
* Proliferation of semi-custom terms – difficulties of interpretation
* Non-open public licenses (eg noncommercial) – more difficulties of interpretation

Technical, semantic, and legal challenges.
Research aims to capture semantics of licenses in a machine-readable format to align with, and interpret in context of, research practice. Need to go beyond natural language legal text. License metadata: RDF is a useful tool – allows sharing and reasoning over implications. Lets us work out whether you can combine sources.

Mapping terminology in licenses to research jargon.
Eg “reproduce” “making an exact Copy”
“collaborators” “other Parties”

This won’t help if there’s no license, or legally vague, or for novel use cases where we’re waiting for precedent (eg text mining over large corpuses)

Compatibility chart of Creative Commons licenses – some very restricted. “Pathological combinations of licenses”. Computing this can help measure combinability of data, degree of openness. Help understanding of propagation of rights and obligations.
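To make the "can these sources be combined?" question concrete, here is a deliberately simplified sketch. It uses plain Python sets rather than the RDF the speakers described, ignores licence versions and jurisdictions, and treats each licence as just the set of conditions it imposes. The function name `combined_licence` and the rules encoded are my own reading of the familiar Creative Commons remix chart, not the authors' model.

```python
from typing import Optional

# Each licence is modelled as the set of conditions it imposes (a simplification).
LICENCES = {
    "CC0": frozenset(),
    "CC BY": frozenset({"BY"}),
    "CC BY-SA": frozenset({"BY", "SA"}),
    "CC BY-NC": frozenset({"BY", "NC"}),
    "CC BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "CC BY-ND": frozenset({"BY", "ND"}),
    "CC BY-NC-ND": frozenset({"BY", "NC", "ND"}),
}
NAME_BY_CONDITIONS = {conds: name for name, conds in LICENCES.items()}


def combined_licence(a: str, b: str) -> Optional[str]:
    """Return a licence a remix of works under a and b could carry,
    or None if this simplified model says they cannot be combined."""
    ca, cb = LICENCES[a], LICENCES[b]
    if "ND" in ca or "ND" in cb:
        return None  # NoDerivatives material cannot be remixed at all
    sharealike = {c for c in (ca, cb) if "SA" in c}
    if len(sharealike) > 1:
        return None  # two different ShareAlike licences pull in different directions
    if sharealike:
        target = next(iter(sharealike))
        # The other input must not carry a condition the ShareAlike licence lacks.
        if ca <= target and cb <= target:
            return NAME_BY_CONDITIONS[target]
        return None
    # No ShareAlike involved: the remix carries the union of conditions.
    return NAME_BY_CONDITIONS[ca | cb]


print(combined_licence("CC BY", "CC BY-NC"))
print(combined_licence("CC BY-SA", "CC BY-NC"))
```

Counting how many pairs come back as `None` is one crude way to see how a few restrictive licences reduce the combinability of a whole collection.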

Discussion of licensing choices should go beyond personal/institutional policies.

Comment: PhD student writing thesis and reusing figures from publications. For anything published by IEEE legally had to ask for permission to reuse figures he’d created himself. Not just about datasets but anything you put out.

Comment: “Best way to hide data is to publish a PhD thesis”.

Q: Have you started implementing?
A: Yes but still early on coding as RDF structure and asking simple questions. Want to dig deeper.

Q: Get in trouble with practicing law – always told by institution to send questions to IP lawyers etc. Has anyone got mad at you yet?
A: I do want to talk to a lawyer at some point. Can get complex fast especially pulling in cross-jurisdiction.
Comment: This will save time (=$$$) when talking to lawyer.
A: There’s a lot of situations where you don’t need a lawyer – that’s more for fringe cases.

U of Washington eScience Institute #nzes

eScience and Data Science at the University of Washington eScience Institute
“Hangover” Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science & Engineering, University of Washington

Scientific process getting reduced to database problem – instead of querying the world we download the world and query the database…

UoW eScience Inst to get in the forefront of research in eScience techniques/technology, and in fields that depend on them.

3Vs of big data:
volume – this gets lots of attention but
variety – this is the bigger challenge
velocity

Sources a longtail image from Carol Goble showing that lots of data in Excel spreadsheets, lab books, etc, is just lost.
Types of data stored – especially data data and some text. 87% of time is on “my computer”; 66% a hard drive…
Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes).
No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.

Problem – how much time do you spend handling data as opposed to doing science? General answer is 90%.
May be spending a week doing manual copy-paste to match data because not familiar with tools that would allow a simple SQL JOIN query in seconds.
Sloan Digital Sky Survey incredibly productive because they put the data online in database format and thousands of other people could run queries against it.
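For anyone who hasn't met a JOIN, here is a minimal sketch of the copy-paste matching job done as one query, using Python's built-in sqlite3 module. The tables, column names and values are invented for illustration and have nothing to do with the datasets mentioned in the talk.

```python
import sqlite3

join_db = sqlite3.connect(":memory:")
join_db.executescript("""
    -- Two hypothetical spreadsheets loaded as tables.
    CREATE TABLE samples (sample_id TEXT, site TEXT);
    CREATE TABLE measurements (sample_id TEXT, salinity REAL);
    INSERT INTO samples VALUES ('s1', 'estuary'), ('s2', 'open ocean');
    INSERT INTO measurements VALUES ('s1', 12.5), ('s2', 35.0), ('s3', 33.0);
""")

# Match rows from both tables on the shared sample_id column.
joined_rows = join_db.execute("""
    SELECT s.sample_id, s.site, m.salinity
    FROM samples AS s
    JOIN measurements AS m ON m.sample_id = s.sample_id
    ORDER BY s.sample_id
""").fetchall()

for row in joined_rows:
    print(row)
```

The measurement for 's3' has no matching sample, so an inner join like this one silently leaves it out; a LEFT JOIN from the measurements side would keep it.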

SQLShare: Query as a service
Want people to upload data “as is”. Cloud-hosted. Immediately start writing queries, share results, others write their queries on top of your queries. Various access methods – REST API -> R, Python, Excel Addin, Spreadsheet crawler, VizDeck, App on EC2.

Metadata
Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world but at the frontier of research this shared consensus by definition doesn’t exist, or will change frequently, and data found in the wild will typically not conform to standards. So modifies Maslow’s Needs Hierarchy:
Usually storage > sharing > curation > query > analytics
Recommends: storage > sharing > query > analytics > curation
Everything can be done in views – cleaning, renaming columns, integrating data from different sources while retaining provenance.
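A small sketch of what "cleaning in views" could look like, again with sqlite3 rather than SQLShare itself (whose interface I haven't used). The table `raw_upload`, the view `cleaned` and the messy values are all hypothetical.

```python
import sqlite3

view_db = sqlite3.connect(":memory:")
view_db.executescript("""
    -- Data uploaded "as is": unhelpful column names, stray spaces, a text placeholder.
    CREATE TABLE raw_upload (col1 TEXT, col2 TEXT);
    INSERT INTO raw_upload VALUES (' Site A ', '12.5'), ('site b', 'n/a'), ('SITE C', '7');

    -- The view renames columns, tidies values and drops placeholders
    -- without touching the uploaded rows.
    CREATE VIEW cleaned AS
    SELECT lower(trim(col1)) AS site,
           CAST(col2 AS REAL) AS depth_m
    FROM raw_upload
    WHERE col2 <> 'n/a';
""")

cleaned_rows = view_db.execute("SELECT site, depth_m FROM cleaned ORDER BY site").fetchall()
raw_count = view_db.execute("SELECT COUNT(*) FROM raw_upload").fetchone()[0]

print(cleaned_rows)
print(raw_count)
```

Because the raw table is never modified, the view definition doubles as a record of exactly what cleaning was done, which is one way to read the point about retaining provenance.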

Bring the computation to the data. Don’t want just fetch-and-retrieve – need a rich query service, not a data cemetery. “Share the soup and curate incrementally as a side-effect of using the data”.

Convert scripts to SQL and lots of problems go away. Tested this by sending postdoc to a meeting and doing “SQL stenography” – real-time analytics as discussion went on. Not a controlled study – didn’t have someone trying to do it in Python or R at same time – but would challenge someone to do it as quickly! Quotes (a student?) “Now we can accomplish a 10minute 100line script in 1 line of SQL.” Non-programmers can write very complex queries rather than relying on staff programmers and feeling ‘locked out’.

Data science
Taught an intro to data science MOOC with tens of thousands of students. (Power of discussion forum to fix sloppy assignment!)

Lots of students more interested in building things than publishing, and are lost to industry. So working on ‘incubator’ projects, reverse internships pulling people back in from industry.

Q: Have you experimented with auto-generating views to cleanup?
A: Yes, but less with cleaning and more deriving schemas and recommending likely queries people will want. Google tool “Data wrangler”.

Q: Once again people using this will think of themselves as ‘not programmers’ – isn’t this actually a downside?
A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there’s no good support for development in SQL. Risk that giving people power but not teaching programming. But mostly trying to get people more productive right now.

HuNI; NZ humanities eResearch; flux in scientific knowledge #nzes

Humanities Networked Infrastructure (HuNI) Virtual Laboratory: Discover | Analyse | Share
Deb Verhoeven, Deakin University
Conal Tuohy and Richard Rothwell, VeRSI
Ingrid Mason, Intersect Australia

Richard Rothwell presenting. I’ve previously heard Ingrid Mason talk about HuNI at NDF2012.

Idea of a virtual laboratory as a container for data (from variety of disciplines) and a number of tools. But many existing tools are like virtual laboratories themselves, often specific to disciplines.

Have a .9EFTS ontologist. Also project manager, technical coordinator, web page designer, tools coordinator and software developer.

Defined project as linked open data project. Humanities data into HuNI triple store (using RDF), embedded in HuNI virtual lab to create user interface. Embellishments include to provide linked open data in SPARQL, and publish via OAI-PMH; and to use AAF (Shibboleth) authentication; to use SOLR search server for virtual lab.
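As a toy stand-in for the triple store idea (a real deployment would use an RDF store queried with SPARQL, which this is not), here is a pure-Python sketch of subject/predicate/object triples with wildcard matching. Every identifier and value below, including the `filmdb:` and `theatredb:` prefixes, is invented for illustration and is not HuNI data.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical triples drawn from two separate source datasets.
triples: List[Triple] = [
    ("filmdb:person/42", "name", "Jane Example"),
    ("theatredb:artist/7", "name", "Jane Example"),
    ("filmdb:person/42", "sameAs", "theatredb:artist/7"),
    ("theatredb:artist/7", "performedIn", "theatredb:production/3"),
]


def match(pattern: Tuple[Optional[str], Optional[str], Optional[str]]) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples if all(p is None or p == v for p, v in zip(pattern, t))]


# Follow the bridging link from the film record to the theatre record,
# then ask what that theatre record performed in.
linked = [obj for _, _, obj in match(("filmdb:person/42", "sameAs", None))]
productions = [obj for subj in linked for _, _, obj in match((subj, "performedIn", None))]

print(productions)
```

Without the single "sameAs" triple the two records just sit side by side, which is a small-scale version of aggregation on its own not solving siloisation.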

Have ideas of research use-cases (basic and advanced eg SPARQL queries) and desired features, eg custom analysis tools. The challenge is to get internal bridging relationships between datasets and global interoperability. Aggregating doesn’t solve siloisation.

“Technology-driven projects don’t make for good client outcomes.”

Q: What response from broader humanities community?
A: Did some user research, not as much as wanted. Impediment is that when building database tend to have more contact with people creating collections than people using them. Trying to build framework/container first and idea is that researchers will come to them and say “We want this tool” and they’ll build it. Funding set aside for further development.

Q: You compared this to Galaxy, but you’ve built from ground-up where Galaxy is more fluid. A person with command-line can create tools in Galaxy but with HuNI you’d have to do it yourself.
A: Bioinformatics folk tend to be competent with Python – but we’re not sure what competencies our researchers will have, less likely to be able to develop for themselves.

Requirements for a New Zealand Humanities eResearch Infrastructure
James Smithies, University of Canterbury
Vast amounts of cultural heritage being digitised or being born online. Humanities researchers will never be engineers but need to work through the issues.

International context:
Humanities computing’s been around for decades but still in its infancy. US, UK, even Aus have ongoing strategic conversations, which helps build roadmaps. NZ is quite far behind these (though have used punchcards where necessary). “Digging into Data Challenge” overseas but we’re missing out because of lack of infrastructure and lack of awareness.

Fundamentals of humanities eresearch:
HuNI provides a good model. Need a shift from thinking of sources as objects to viewing them as data. Big paradigm shift. Not all will work like this. But programmatic access will become more important.

National context:
19th century ship’s logs, medical records from leper colonies. Hard to read, incomplete, possibly accurate. Have traditional methods to deal with these but problems multiply when ported into digital formats. Big problem is lack of awareness of what opportunities exist. So capabilities and infrastructure are low. Decisions often outsourced to social sciences.
At the same time, DigitalNZ, National Digital Heritage Archive, Timeframes archive, AJHR, PapersPast, etc are fantastic resources that could be leveraged if we come up with a central strategy.

Requirements:

  • Need to develop training schemes
  • Capability building. Lots of ideas out there but people don’t know where to start. Need to look at peer review, PBRF – how to measure quality and reward it.
  • International collaboration
  • Requirements elicitation and definition
  • Funding for all of the above including experimentation

Q: Data isn’t just data, it’s situated in a context. Being technology-led and using RDF is one thing. But how do we give richness to a collection?
A: Classic example would be researcher wanting access to object properly marked up and contribute to the conversation by adding scholarly comments, engage with other marginalia. Eg ancient greek text corpus (is I think describing the Perseus Digital Library). Want both a simple interface and programmatic access.

Q: Need to make explicit the value of an NZ corpus. Have some pieces but need to join up. Need to work with DigitalNZ. Once we have corpus can look at tools.
A: Yes, need to get key stakeholders around table and talk about what we need.

Capturing the flux in Scientific Knowledge
Prashant Gupta & Mark Gahegan, The University of Auckland
Everything changes – whether the physical world itself or our understanding of the world:
* new observation or data
* new understanding
* societal drivers
How can we deal with change and make our tools and systems more dynamic to deal with change?

Ontology evolution – have done lots of work on this. Researchers have updated knowledge structure and incorporated in forms of provenance or change logs. Tells us “knowledge that” eg What is the change, when it happened, who did it, to what, etc. But we still don’t capture “knowledge how” or “knowledge why”.

Life cycle of a category:
Processes, context, researchers’ knowledge are involved in birth of a category – but these tend to be lost when the category’s formed. We’re left with the category’s intension, extension, and place in the conceptual hierarchy. Lots of information not captured.

“We focus on products of science and ignore process of science”.

Proposes connecting static categories and the process of science to get a better understanding. Could act as a fourth facet to a category’s representation. Can help address interoperability problem and help track evolution of categories.

Process model:
Process of science → gives birth to conceptual change → modifies scientific artifacts → connected as linked science → improves process of science.

If change not captured, network of artifacts will become inconsistent and linked science will fail.

Proposes building a computational framework that captures and analyses changes, creating a category-versioning system.
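
A minimal sketch of what a category-versioning record might hold, assuming the three existing facets (intension, extension, place in hierarchy) plus the who/when/how/why of each change. This is not the speakers' framework; the class names, fields and the "wetland" example data are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CategoryVersion:
    """One snapshot of a category plus the context of the change (hypothetical schema)."""
    intension: str          # definition of the category
    extension: frozenset    # members currently placed in it
    parent: Optional[str]   # place in the conceptual hierarchy
    who: str                # "knowledge that": who made the change
    when: date              # "knowledge that": when it happened
    how: str                # "knowledge how": process that produced it
    why: str                # "knowledge why": rationale for the change


@dataclass
class Category:
    name: str
    versions: List[CategoryVersion] = field(default_factory=list)

    def revise(self, version: CategoryVersion) -> None:
        """Append a new version; earlier versions are never overwritten."""
        self.versions.append(version)

    def diff(self, old: int, new: int) -> dict:
        """Summarise what changed between two version indexes."""
        a, b = self.versions[old], self.versions[new]
        return {
            "added": sorted(b.extension - a.extension),
            "removed": sorted(a.extension - b.extension),
            "intension_changed": a.intension != b.intension,
            "why": b.why,
        }


# Invented example data: a land-cover category that gets redefined.
wetland = Category("wetland")
wetland.revise(CategoryVersion(
    intension="land that is flooded for part of the year",
    extension=frozenset({"site_a", "site_b"}),
    parent="land cover",
    who="researcher 1", when=date(2013, 1, 1),
    how="field survey", why="initial classification",
))
wetland.revise(CategoryVersion(
    intension="land with saturated soil for most of the year",
    extension=frozenset({"site_a", "site_c"}),
    parent="land cover",
    who="researcher 2", when=date(2013, 6, 1),
    how="re-analysis of soil moisture data",
    why="new observations showed site_b dries out",
))
changes = wetland.diff(0, 1)
print(changes)
```

Keeping every version rather than overwriting is what makes the comparison with software changesets apt: the "why" travels with the change instead of being lost once the category is formed.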

Comment from James Smithies: would fit well in humanities context.
Comment: drawing parallel with software development changeset management.

NZ e-Infrastructures Panel #nzes

NZ e-Infrastructures Panel
Nick Jones, New Zealand eScience Infrastructure
Steve Cotter, REANNZ
Andrew Rohl, Curtin University, ex ED iVEC
Tony Lough, NZ Genomics Ltd
Don Smith, NZ Synchrotron Group Ltd
Rhys Francis, eResearch Coordination Project

How are we doing and how can we work better with Australia?
* NJ: Have been working closer recently, but big gaps in data especially, and unevenness in various disciplines.
* SC: Working to identify gaps and work across organisations. REANNZ working closer with AARnet than have in the past which is bearing fruit re bandwidth.
* Political overlay – need to be able to say we’ve got the scientific partnership working.
* RF: Fair amount of partnership. But have found that governance separates things. “I don’t believe in uninterpreted data.” Need to figure out combo of data and tools to get results.
* Plenty of opportunity to work with Australia. Useful to look at infrastructures and what they’ve done right and haven’t done right – lessons to be learned.
* AR: Problems faced here are not unique so you can avoid our mistakes and make your own instead. 🙂

National Science Challenge signals government would like to roll framework out further. How do researchers engage with this?
* NJ: At many workshops people already know what they want to work on; at others there’s range of possibilities. Need to build networks so not everyone has to be at table.
* RF: eResearch and IT aren’t mentioned in challenges – but these are embedded in everything. If you want to be world-class at X, you need to be good at computer science.

How would you benchmark and measure return on investment?
* AR: Instance where in early days govt felt that if people wanted to keep investing, it must be valuable. This is changing now that investments are bigger. Hesitant about benchmarking because don’t really want to be doing the same as anyone else.
* RF: How do you go from 0 to world’s best supercomputer overnight? No idea how to measure that. It’s a commitment to the advancement of knowledge but the govt doesn’t have a KPI about that…

NZ had to set up Tuakiri because differences in law meant we couldn’t use Australia’s system. What other things might the two countries have to do to overcome differences in legislation?
* (Other audience member) – Yes there are differences so have needed to build systems that deal with both privacy acts and have been successful.
* (Anne Berryman) – Have started conversation with counterparts overseas and chief science advisors in Aus/NZ have a line of communication. There are platforms and issues we can deal with.

One goal is to achieve self-sustainability, eg user charging, member contributions. What’s the Australian experience in user-pays and sustainability?
* RF: Financial benefits are overwhelming. If went to commercial provider it’d cost more and do less. Sustainability needs constant flow of funds to keep supercomputing running. There is a sustainability cliff. Govt keeps putting money in.
* SC: MBIE have removed self-sustainability requirement. Charging to make sure researchers have skin in the game does prove that service is needed; but not everyone can participate who should be.

Introducing the HathiTrust Research Center #nzes

Unlocking the Secrets of 3 Billion Pages: Introducing the HathiTrust Research Center
Keynote from J. Stephen Downie, Associate Dean for Research and a Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign.

Hathi a membership organisation – mostly top-tier US unis, plus three non-US.
“Wow” numbers:
* 10million volumes including 3.4million volumes in the US public domain
* 3.7 billion pages
* 482 TB of data
* 127 miles of books

Of the 3.4 million volumes in the public domain, about a third are in public domain only in the US; the rest are public domain worldwide (4% US govt documents so public domain from point of publication).

48% English, 9% German (probably scientific publications from pre-WWII).

Services to member unis:
* long term preservation
* full text search
* print on demand
* datasets for research

Data:
Bundles have for each page a jpg, OCR text, xml which provides location of words on each page.
METS holds the book together – points to each image/text/xml file. And built into the METS file is structure information eg table of contents, chapter start, bibliography, title, etc.
Public domain data available through web interfaces, APIs, data feeds
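
To make the bundle layout concrete, here is a rough sketch of reading a METS file with Python's standard library and listing, per page, its structural label and the files it points to. The sample XML is invented for illustration (file names, IDs and labels are assumptions, not HathiTrust's actual output); only the METS element names and namespace are standard.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Invented one-page example; a real volume would list every page.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="image">
      <mets:file ID="IMG1"><mets:FLocat LOCTYPE="URL" xlink:href="00000001.jpg"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="ocr">
      <mets:file ID="OCR1"><mets:FLocat LOCTYPE="URL" xlink:href="00000001.txt"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="coordinates">
      <mets:file ID="XML1"><mets:FLocat LOCTYPE="URL" xlink:href="00000001.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="volume">
      <mets:div TYPE="page" ORDER="1" LABEL="TITLE">
        <mets:fptr FILEID="IMG1"/>
        <mets:fptr FILEID="OCR1"/>
        <mets:fptr FILEID="XML1"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def list_pages(mets_xml: str) -> list:
    """Return one dict per page: its order, structural label and file names."""
    root = ET.fromstring(mets_xml)
    # Map each file ID to the location it points at.
    locations = {
        f.get("ID"): f.find("mets:FLocat", NS).get(XLINK_HREF)
        for f in root.iterfind(".//mets:file", NS)
    }
    pages = []
    for div in root.iterfind(".//mets:div[@TYPE='page']", NS):
        pages.append({
            "order": int(div.get("ORDER")),
            "label": div.get("LABEL"),
            "files": [locations[p.get("FILEID")] for p in div.iterfind("mets:fptr", NS)],
        })
    return pages


pages = list_pages(SAMPLE_METS)
print(pages)
```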

“Public-domain” datasets still require a signed researcher statement. Stuff digitised by Google has copyright asserted over it by Google. And anything from 1872-1923 is still considered potentially under copyright outside of the US. Working on manual rights determination – have a whole taxonomy for what the status is and how they assessed it that way.

Non-consumptive research paradigm – so no one action by one user, or set of actions by a group of users, could be used to reconstruct works and publish. So users submit requests, Hathi does the compute, and sends results back to them. [This reminds me of old Dialog sessions where you had to pay per search so researchers would have to get the librarian to perform the search to find bibliographic data. Kind of clunky but better than nothing I guess…]

Meandre lets researchers set up the processing flow they want to get their results. Includes all the common text processing tasks eg Dunning log-likelihood (which can be further improved by removing proper nouns). Doesn’t replace a close-reading – answers new questions. Correlation-Ngram viewer so can track use of words across time.
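
For readers who haven't met it, Dunning log-likelihood scores how surprising a word's frequency is in one corpus compared with another. A small sketch follows, using the standard G² statistic over a 2×2 table (word vs all other words, corpus A vs corpus B). It is not Meandre's implementation; the function names and the toy token lists are my own assumptions.

```python
import math
from collections import Counter


def dunning_g2(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """G2 = 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    observed = [
        [count_a, total_a - count_a],   # corpus A: this word, all other words
        [count_b, total_b - count_b],   # corpus B: this word, all other words
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, n - count_a - count_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / e)
    return 2 * g2


def keyness(tokens_a: list, tokens_b: list) -> dict:
    """Score every word seen in either token list; higher means more distinctive."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    return {
        word: dunning_g2(freq_a[word], len(tokens_a), freq_b[word], len(tokens_b))
        for word in set(freq_a) | set(freq_b)
    }


# Toy data: "the" is equally common in both, "whale" appears only in the first.
scores = keyness(
    "the whale the whale the whale sea".split(),
    "the ship the sea the sea sea".split(),
)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 2))
```

The score alone doesn't say which corpus overuses the word, so in practice you'd also compare the two relative frequencies. Dropping proper nouns before counting, as mentioned in the talk, would happen at the tokenising step.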

OCR noise is a major limitation.

Downie wants to engage in more collaborative projects, more international partnerships, and move beyond text and beyond humanities. Just been awarded a grant for “Work-set Creation for Scholarly Analysis: Prototyping Project”. Non-trivial to find a 10,000-work subset of 10million works to do research on – project aims to solve this problem. Also going to be doing some user-needs assessments, and in 2014 will be awarding grants for four sub-projects to create tools. Eg would be great if there was a tool to find what pages have music on.

Ongoing challenges:
How do we unlock the potential of this data?
* Need to improve quality of data; improve metadata. Even just to know what’s in what language!
* Need to reconcile various data structure schemes
* May need to accrete metadata (there’s no perfect metadata scheme)
* Overcoming copyright barriers
* Moving beyond text
* Building community