Tag Archives: datasets

The future of metadata #lianza11 #keynote6

Karen Coyle
Five steps to the future of metadata

Everyone on Facebook has created a webpage. We expect to be able to comment on news stories. Still have the Powers That Be – but also Wikileaks. Can’t do anything without expecting user interaction.

Devices and interfaces still very crude to the point that libraries have to help users, though users expect to be able to Just Use It.

Access means getting a copy – and hard drives get cluttered and messy. We don’t have good means for helping manage that.

Communication is increasingly remote and faster. The “slow conversation of books” cf IM and SMS.

Much training is in video form.

Everything is becoming part of the record. Every cat has a webcam. Email is used as evidence in court.

What are libraries doing about this?
Linked data – this year the concept of linked data has become mainstream in the library world (though we may not have heard about it…). The internet was developed (before the web) for sharing documents. About 12 years ago came the idea of the semantic web – instead of putting documents on the web, put data on the web and let it link.

Linked data is a simple concept but the technology can be complex. Data can be linked to more data – a web of data. The link itself has meaning – doesn’t just link between Melville and Moby Dick, but says “he’s the author”.
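The "link with meaning" idea can be sketched as subject–predicate–object triples. A minimal illustration in Python (the names and predicates here are invented for illustration, not a real vocabulary):

```python
# A linked-data statement is a triple: (subject, predicate, object).
# The predicate is what gives the link its meaning: not just
# "Melville <-> Moby Dick" but "Melville is the author of Moby Dick".
triples = [
    ("Herman Melville", "is_author_of", "Moby Dick"),
    ("Moby Dick", "has_subject", "Whaling"),
]

# Anyone can add their own triple pointing at the same subject;
# the existing data remains intact.
triples.append(("Herman Melville", "born_in", "New York City"))

# Following the links: everything we now know about Melville.
facts = [(p, o) for s, p, o in triples if s == "Herman Melville"]
print(facts)
```

The append is the point: linking grows the web of data without disrupting what's already there.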

Plus anyone can link to me. Data remains intact, but the linking leads to knowledge creation. See http://linkeddata.org. Shows a link cloud full of sets of data from various organisations. Many scientific data sets – everyone works in narrow environment but know it probably connects with other people’s data. Government data – big efforts in UK and EU to get data out for people (and other agencies!) to use.

Some library data (though not a complete picture) is starting to appear. The W3C wants to get more on the web – there's huge interest in library data. People are begging for us to get our data on the web!

Five steps
* Data, not text
* Identifiers for things
* Machine-readable schema
* Machine-readable lists
* Open access on the web

Web of data only functions when people can make free use of what they find. Some organisations have a hard time with this. Hence the Open Data movement, and the concept that bibliographic data should not be considered proprietary.

LCSH, BnF RAMEAU subject headings, and Dewey Online (just the summary) are available online in linked data format, with LC classification coming soon. MARC geographic and language codes are out, but not MARC itself. All RDA Elements and RDA controlled vocabularies are out there – though no applications are using them yet.

FRBR and ISBD. Virtual International Authority File (merged name records – access via MARC and linked data formats).

Getting open access to citation data would be great; friend-of-a-friend data.

Linked data format more flexible – can add into existing network without disrupting what’s there.

When we try to meet everyone’s needs we build something so awkward no-one will use it.

Expressing library data as linked data isn't rocket science. The British National Bibliography is put out as linked data; the Swedish catalogue and German libraries have done this too. We can do this – the question is, is this what we want to do?

What might this let us do? Open Library does this. Lets you have different views. Page for author doesn’t just give list of titles, but information about author. Page for work gives general info and list of manifestations/blurbs.

Much of our current metadata is useless – xii, 356 p. ; 23cm – it's like the secret language of twins, and yet this is our face to the users.

Our classification schemes are incredibly rich. Bing, Google, etc do keyword search not because it's effective but because it's easy. You can't say broader or narrower. No categories. It's up to the user to turn a complex query into a simple search – all the intelligence is on the user's side, so results depend on the user's skills.

Keyword search is good for nouns, especially proper nouns. It doesn't work for concepts, and it's terrible if you're searching for common terms. You can't ask specific questions. Linked data can let you ask and answer this type of question – cf WolframAlpha.
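The broader/narrower relationships that keyword search lacks are exactly what structured subject data provides. A toy sketch (the terms and the hierarchy are invented for illustration, not taken from any real scheme) of how a query can expand to narrower terms:

```python
# A tiny subject thesaurus: each heading maps to its narrower terms.
# Keyword search has no notion of these links; structured data does.
narrower = {
    "Marine animals": ["Whales", "Squid"],
    "Whales": ["Sperm whales"],
}

def expand(term):
    """Return the term plus all of its narrower terms, recursively,
    so a search on a broad heading can also retrieve specific ones."""
    result = [term]
    for child in narrower.get(term, []):
        result.extend(expand(child))
    return result

print(expand("Marine animals"))
```

With this, a search on "Marine animals" can retrieve items catalogued under "Sperm whales" – the kind of question a plain keyword index can't answer.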

Why is Wikipedia always near the top? Because it’s organised info and people love it.

When we get results that don’t help us we forget it – we use our human intelligence to ignore everything that isn’t helpful. Keyword searching is like dumpster diving, trying to find that one sandwich among the trash.

Tagging is okay but it’s not knowledge organisation. Miscellany has its role but puts a great burden on the user.

Need to change our concept of what the library catalogue is. Need an inventory for librarians, but this inventory is not what users should see! Need to link to circulation too. But need something users can access and use because OCLC report shows only 2% of users start with the library catalogue. Our data needs to be elsewhere, where the users are. Must be willing to free our data.

Need to focus on knowledge organisation – have rewritten our rules but haven’t looked at classification. Finding books by title or author isn’t the most exciting thing people can do! Should assume people looking for something are doing so because they don’t have the information.

W3C Library Linked Data group – has a good discussion list
LOD-LAM forum in Wellington, December – where people talk about what we can do
The Data Hub

Karen Coyle’s site will have links

Breaking news: this morning got an email that LC has just released Future of Bibliographic Control report.

The death of organised data

I’ve been hearing rumours that the big IT companies may be giving up on organised data. Which is kind of a big thing, and yet it makes perfect sense: there are terabytes upon terabytes of data pouring onto computers and servers all the time, and organising all of that into a useful format takes a heck of a lot of time.

Especially because data organised to suit one need isn’t necessarily going to suit most actual needs. If you’re a reference librarian (either academic or, I suspect, public) you’ll have had the student coming to your desk who can’t quite understand why typing their assignment topic into a database doesn’t return the single perfect article that explicitly answers all their questions.

So I think there are two ways of organising data:

  • “pre-organising” it – eg a dictionary, which is organised alphabetically, assuming you want to find out about a given word. It has information about which words are nouns and what dates they derive from (to a best guess, obviously), but there’s no way to search for nouns that were used in the 16th century because the dictionary creator never imagined someone might want to know such a thing.
  • organising it at point of need – eg a database which had all this same information but allowed you to tell it you want only nouns deriving from the 16th century or earlier; or only pronunciations that end in a certain phonetic pattern; or only words that include a certain other word in the definition.
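A minimal sketch of the second approach, with an invented mini-dictionary (the words, dates, and function name are all made up for illustration):

```python
# A toy "dictionary as data": each entry carries its part of speech
# and a best-guess year of first recorded use.
entries = [
    {"word": "astrolabe", "pos": "noun", "first_use": 1400},
    {"word": "telescope", "pos": "noun", "first_use": 1600},
    {"word": "gravity",   "pos": "noun", "first_use": 1500},
    {"word": "observe",   "pos": "verb", "first_use": 1400},
]

def nouns_from_or_before(entries, year):
    """Organise at point of need: nouns first used in or before `year`."""
    return [e["word"] for e in entries
            if e["pos"] == "noun" and e["first_use"] <= year]

# Nouns deriving from the 16th century or earlier:
print(nouns_from_or_before(entries, 1599))
```

The data itself is the same as the printed dictionary's; the difference is that nothing forces one fixed organisation (alphabetical order) onto every query.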

Organising data at point of need solves one problem (it’s much more flexible) but it doesn’t actually save time on the organising end. In fact, it’s likely to take quite a lot more time.

So is humanity doomed to be swimming in yottabytes of undifferentiated, unorganised, and thus useless data? I frowned over this for a while, and after some time I remembered the alternative to organising data: parsing it. (This is just what humans do when we skim a text looking for the information we want.) So, for example, a computer could take an existing dictionary as input, look for the pattern of a line which includes “n.” (or s.b. or however the dictionary indicates a noun) and a date matching certain criteria, and return to the user all the lines that match what was asked for.
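A rough sketch of that kind of parsing, assuming an invented line format in which “n.” marks a noun followed by a date in parentheses:

```python
import re

# Toy dictionary lines in a flat, print-era format: word, part-of-speech
# abbreviation, date of first use, definition.
lines = [
    "astrolabe, n. (1400) an instrument for observing the stars",
    "observe, v. (1400) to watch carefully",
    "telescope, n. (1608) an optical instrument",
]

# Look for the noun marker "n." followed by a four-digit date.
pattern = re.compile(r"^(\w+), n\. \((\d{4})\)")

def nouns_before(lines, year):
    """Return the words on lines marked as nouns dated before `year`."""
    hits = []
    for line in lines:
        m = pattern.match(line)
        if m and int(m.group(2)) < year:
            hits.append(m.group(1))
    return hits

print(nouns_before(lines, 1600))  # -> ['astrolabe']
```

The fragility the post goes on to describe is visible even here: a dictionary that writes “s.b.” instead of “n.”, or puts the date elsewhere, would defeat this pattern, so a real parser has to account for many variant formats.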

Parsing is hard, and computers have historically been bad at it. (Bear in mind though that for a long time humans beat computers at chess.) This is not because computers aren’t good at pattern-matching; it’s because humans are so good at making typos, or rephrasing things in ways that don’t fit the criteria. (One dictionary says “noun”, one says “n.”, one says “s.b.”, one uses “n.” but it refers to something else entirely…) A computer parsing data has to account for all the myriad ways something might be said, and all the myriad things a given text might mean.

But if you look around, you’ll see parsing is already emerging. One of the things the LibX plugin does is look for the pattern of an ISBN and provide a link to your library’s catalogue search. You may have an email program that, when your friend writes “Want to meet at 12:30 tomorrow at the Honeypot Cafe?”, gives you a one-click option to put this appointment into your calendar. Machine transcription from videos, recognition of subjects in images, machine translation – none of it’s anywhere near perfect, but it’s all improving, and all these are important steps in the emergence of parsing as a major player in the field of managing data.
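Pattern-spotting of the kind LibX does can be sketched with a regular expression. This is a loose illustration, not LibX's actual code; the catalogue URL and the ISBN pattern are simplified assumptions (a real matcher would also verify the check digit):

```python
import re

# Rough matcher for hyphenated/spaced ISBN-10 and ISBN-13 strings.
ISBN_RE = re.compile(
    r"\b(?:97[89][- ]?)?\d{1,5}[- ]?\d{1,7}[- ]?\d{1,6}[- ]?[\dX]\b"
)

def isbn_links(text, base="https://catalogue.example.org/search?isbn="):
    """Find ISBN-shaped strings in free text and turn each into a
    catalogue-search link, as a LibX-style tool might."""
    links = []
    for m in ISBN_RE.finditer(text):
        digits = m.group(0).replace("-", "").replace(" ", "")
        if len(digits) in (10, 13):  # keep only plausible ISBN lengths
            links.append(base + digits)
    return links

print(isbn_links("See Moby-Dick, ISBN 978-0-14-243724-7, for details."))
```

The same shape of code – spot a pattern, extract it, act on it – underlies the email-to-calendar example too, just with dates and times instead of ISBNs.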

So yes, if I were a big IT company I might want to get out of the dead-end that is organising data, too – and get into the potentially much more productive field of parsing it.

Links of interest 21/1/2011

Library instruction
I’ve recently been pondering the idea of database searches as an experiment – hypothesis, experiment, evaluate, modify the hypothesis and try again. This might make a useful way to introduce sci/tech students in particular to the idea that you’re not necessarily going to get your best results from your first search; I’ll have to see how they receive it when I’ve actually got a class to test it on.

Incorporating Failure Into Library Instruction (from ACRLog) discusses the pedagogy of learning by failure and talks about times when it’s more or less suitable for library instruction.

Anne Pemberton’s super-awesome paper From friending to research: Using Facebook as a teaching tool (January 2011, College & Research Libraries News, vol. 72 no. 1 28-30) discusses Facebook as a useful teaching metaphor for databases.

Don’t Make It Easy For Them (from ACRLog) – with caveats in the comments that I think are at least as important as the main post.

Heads they win, tales we lose: Discovery tools will never deliver on their promise – and don’t miss the comment thread at the bottom of the page, which segues into the dilemma of increasingly expensive journal bundles and possible (vs viable) solutions.

Research data
There’s a whole D-Lib Magazine issue devoted to this topic this month.

Web services
The Web Is a Customer Service Medium discusses the idea that “the fundamental question of the web” is “Why wasn’t I consulted?” – that is, each medium has its niche of what it’s good at and why people use it, and webpages need to consider how to answer this question.

Library Day in the Life
Round 6 begins next week, in which librarians from all walks of librarianship share a day (or week) in the life.

Links of Interest 4/11/09

New Zealand Electronic Text Centre has posted a list of online texts for current courses at VUW.

The Dept of Internal Affairs has launched Government datasets online, a directory of publicly-available NZ government datasets (especially but not exclusively machine-readable datasets).

Complementary Twitter accounts:

  • APStylebook (Sample: Election voting: Use figures for totals and separate the large totals with “to” instead of hyphen.)
  • FakeAPStylebook (Sample: To describe more than one octopus, use sixteentopus, twentyfourtopus, thirtytwotopus, and so on.)

Information Literacy
There was a lot of interest at and after LIANZA09 about the Cephalonia Method of library instruction (basically, handing out pre-written questions on cards to students to ask at appropriate times during the tutorial). A recent blogpost by a librarian worn out from too many tutorials wonders “what if the entire class session consisted of me asking students questions? What if I asked them to demonstrate searching the library catalog and databases?”

Scandal du jour
A document by Stephen Abram (SirsiDynix) on open source library management systems (pdf, 424KB) appeared on WikiLeaks. The biblioblogosphere saw this as evidence of SirsiDynix secretly spreading FUD (fear, uncertainty and doubt) against their open-source competition. Stephen Abram replied on his blog that it was never a secret paper and he’s not against open source software but it’s not ready for most libraries. Much discussion followed in his blog comments and on blogs elsewhere; Library Journal has also picked up the story.

For fun
Also at Library Journal, The Card Catalog Makes a Graceful Departure at the University of South Carolina – rather than just dumping it the library is hosting events such as a Catalog Card Boat Race and What Can You Make With Catalog Cards?

Things Librarians Fancy.