Tag Archives: datasets

I’ve been hearing rumours that the big IT companies may be giving up on organised data. Which is kind of a big thing for the same reason that it makes perfect sense: there are terabytes upon terabytes of data pouring onto computers and servers all the time, and organising all of that into a useful format takes a heck of a lot of time.

Especially because data organised to suit one need isn’t necessarily going to suit most actual needs. If you’re a reference librarian (either academic or, I suspect, public) you’ll have had the student coming to your desk who can’t quite understand why typing their assignment topic into a database doesn’t return the single perfect article that explicitly answers all their questions.

So I think there’s two ways of organising data:

“pre-organising” it – eg a dictionary, which is organised alphabetically, assuming you want to find out about a given word. It has information about which are nouns and what dates they derive from (to a best guess, obviously) but there’s no way to search for nouns that were used in the 16th century because the dictionary creator never imagined someone might want to know such a thing.
organising it at point of need – eg a database which had all this same information but allowed you to tell it you want only nouns deriving from the 16th century or earlier; or only pronunciations that end in a certain phonetic pattern; or only words that include a certain other word in the definition.

Organising data at point of need solves one problem (it’s much more flexible) but it doesn’t actually save time on the organising end. In fact, it’s likely to take quite a lot more time.

So is humanity doomed to be swimming in yottabytes of undifferentiated, unorganised, and thus useless data? I frowned over this for a while, and after some time I remembered the alternative to organising data: parsing it. (This is just what humans do when we skim a text looking for the information we want.) So, for example, a computer could take an existing dictionary as input and look for the pattern of a line which includes “n.” (or s.b. or however the dictionary indicates a noun), and a date matching certain criteria, and returns to the user all the lines that match what was asked for.

Parsing is hard, and computers have historically been bad at it. (Bear in mind though that for a long time humans beat computers at chess.) This is not because computers aren’t good at pattern-matching; it’s because humans are so good at making typos, or rephrasing things in ways that don’t fit the criteria. (One dictionary says “noun”, one says “n.”, one says “s.b.”, one uses “n.” but it refers to something else entirely…) A computer parsing data has to account for all the myriad ways something might be said, and all the myriad things a given text might mean.

But if you look around, you’ll see parsing is already emerging. One of the things the LibX plugin does is look for the pattern of an ISBN and provide a link to your library’s catalogue search. You may have an email program that, when your friend writes “Want to meet at 12:30 tomorrow at the Honeypot Cafe?”, gives you a one-click option to put this appointment into your calendar. Machine transcription from videos, recognition of subjects in images, machine translation – none of it’s anywhere near perfect, but it’s all improving, and all these are important steps in the emergence of parsing as a major player in the field of managing data.

So yes, if I was a big IT company I might want to get out of the dead-end that is organising data, too – and get into the potentially much more productive field of parsing it.

Links of interest 21/1/2011

Links of Interest 4/11/09

1 Reply

Resources
New Zealand Electronic Text Centre has posted a list of online texts for current courses at VUW.

The Dept of Internal Affairs has launched Government datasets online, a directory of publicly-available NZ government datasets (especially but not exclusively machine-readable datasets).

Complementary Twitter accounts:

APStylebook (Sample: Election voting: Use figures for totals and separate the large totals with “to” instead of hyphen.)
FakeAPStylebook (Sample: To describe more than one octopus, use sixteentopus, twentyfourtopus, thirtytwotopus, and so on.)

Information Literacy
There was a lot of interest at and after LIANZA09 about the Cephalonia Method of library instruction (basically, handing out pre-written questions on cards to students to ask at appropriate times during the tutorial). A recent blogpost by a librarian worn out from too many tutorials wonders “what if the entire class session consisted of me asking students questions? What if I asked them to demonstrate searching the library catalog and databases?”

Scandal du jour
A document by Stephen Abram (SirsiDynix) on open source library management systems (pdf, 424KB) appeared on WikiLeaks. The biblioblogosphere saw this as evidence of SirsiDynix secretly spreading FUD (fear, uncertainty and doubt) against their open-source competition. Stephen Abram replied on his blog that it was never a secret paper and he’s not against open source software but it’s not ready for most libraries. Much discussion followed in his blog comments and on blogs elsewhere; Library Journal has also picked up the story.

For fun
Also at Library Journal, The Card Catalog Makes a Graceful Departure at the University of South Carolina – rather than just dumping it the library is hosting events such as a Catalog Card Boat Race and What Can You Make With Catalog Cards?

Things Librarians Fancy.

Deborah

Deborah Fitchett

Tag Archives: datasets

The future of metadata #lianza11 #keynote6

The death of organised data

Links of interest 21/1/2011

Links of Interest 4/11/09