eScience and Data Science at the University of Washington eScience Institute
“Hangover” Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science & Engineering, University of Washington
Scientific process getting reduced to database problem – instead of querying the world we download the world and query the database…
The University of Washington eScience Institute aims to be at the forefront of research in eScience techniques/technology, and in the fields that depend on them.
3Vs of big data:
volume – this gets lots of attention but
variety – this is the bigger challenge
Shows a long-tail image sourced from Carole Goble: lots of data sits in Excel spreadsheets, lab books, etc., and is just lost.
Types of data stored – especially data data and some text. 87% of time is on “my computer”; 66% a hard drive…
Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes).
No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.
Problem – how much time do you spend handling data as opposed to doing science? General answer is 90%.
May be spending a week doing manual copy-paste to match data because not familiar with tools that would allow a simple SQL JOIN query in seconds.
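To illustrate the kind of matching a JOIN replaces, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and values (samples, readings, nitrate) are hypothetical and are not from the talk.

```python
import sqlite3

# In-memory database with two hypothetical tables that share a sample_id key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (sample_id TEXT, site TEXT);
    CREATE TABLE readings (sample_id TEXT, nitrate REAL);
    INSERT INTO samples VALUES ('s1', 'north'), ('s2', 'south'), ('s3', 'north');
    INSERT INTO readings VALUES ('s1', 0.5), ('s2', 1.25), ('s3', 0.75);
""")

# One JOIN does the row matching that would otherwise be manual copy-paste.
matched = conn.execute("""
    SELECT s.sample_id, s.site, r.nitrate
    FROM samples AS s
    JOIN readings AS r ON r.sample_id = s.sample_id
    ORDER BY s.sample_id
""").fetchall()

for row in matched:
    print(row)
```

The same query works unchanged whether the tables hold three rows or three million, which is the point being made about tool familiarity.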
Sloan Digital Sky Survey incredibly productive because they put the data online in database format and thousands of other people could run queries against it.
SQLShare: Query as a service
Want people to upload data “as is”. Cloud-hosted. Immediately start writing queries, share results, others write their queries on top of your queries. Various access methods – REST API -> R, Python, Excel Addin, Spreadsheet crawler, VizDeck, App on EC2.
Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world but at the frontier of research this shared consensus by definition doesn’t exist, or will change frequently, and data found in the wild will typically not conform to standards. So modifies Maslow’s Needs Hierarchy:
Usually storage > sharing > curation > query > analytics
Recommends: storage > sharing > query > analytics > curation
Everything can be done in views – cleaning, renaming columns, integrating data from different sources while retaining provenance.
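A rough sketch of the views idea, assuming nothing about how SQLShare itself is implemented: it uses Python's sqlite3 module, and all table, column and view names (lab_a_raw, lab_b_raw, lab_a_clean, all_temps) are made up for illustration. The raw uploads are never modified. Cleaning, renaming and integration live in views, and a source column records where each row came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw uploads, kept exactly as they arrived (hypothetical data).
    CREATE TABLE lab_a_raw (col1 TEXT, col2 TEXT);
    CREATE TABLE lab_b_raw (station TEXT, temp_c REAL);
    INSERT INTO lab_a_raw VALUES (' P1 ', '12.5'), ('P2', 'n/a'), ('P3', '14.0');
    INSERT INTO lab_b_raw VALUES ('P4', 11.0);

    -- Cleaning and renaming as a view: trim names, drop 'n/a', cast to numbers.
    CREATE VIEW lab_a_clean AS
        SELECT TRIM(col1) AS station, CAST(col2 AS REAL) AS temp_c
        FROM lab_a_raw
        WHERE col2 <> 'n/a';

    -- Integration as a view on top of a view, tagging each row with its origin.
    CREATE VIEW all_temps AS
        SELECT station, temp_c, 'lab_a_raw' AS source FROM lab_a_clean
        UNION ALL
        SELECT station, temp_c, 'lab_b_raw' AS source FROM lab_b_raw;
""")

rows = conn.execute(
    "SELECT station, temp_c, source FROM all_temps ORDER BY station"
).fetchall()

# The raw table still holds every original row, including the 'n/a' one.
raw_count = conn.execute("SELECT COUNT(*) FROM lab_a_raw").fetchone()[0]

for row in rows:
    print(row)
```

Because each step is a named view rather than an edited copy, anyone can inspect the view definitions to see how a cleaned value was derived.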
Bring the computation to the data. Don’t want just fetch-and-retrieve – need a rich query service, not a data cemetery. “Share the soup and curate incrementally as a side-effect of using the data”.
Convert scripts to SQL and lots of problems go away. Tested this by sending a postdoc to a meeting to do “SQL stenography” – real-time analytics as the discussion went on. Not a controlled study – didn’t have someone trying to do it in Python or R at the same time – but would challenge someone to do it as quickly! Quotes (a student?) “Now we can accomplish a 10-minute 100-line script in 1 line of SQL.” Non-programmers can write very complex queries rather than relying on staff programmers and feeling ‘locked out’.
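As a toy illustration of the script-versus-SQL contrast (not the script from the quote, which wasn't shown), the sketch below computes per-group totals twice: once with an explicit Python loop and once with a single SQL statement run through sqlite3. The data and the names obs, species and n are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical observations: (species, count).
observations = [("cod", 3), ("herring", 10), ("cod", 5), ("herring", 2), ("eel", 1)]

# Script style: loop over the rows, accumulate totals, then sort.
totals = defaultdict(int)
for species, count in observations:
    totals[species] += count
script_result = sorted(totals.items())

# SQL style: load the same rows and express the whole analysis as one query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (species TEXT, n INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", observations)
sql_result = conn.execute(
    "SELECT species, SUM(n) FROM obs GROUP BY species ORDER BY species"
).fetchall()

print(script_result)
print(sql_result)
```

The gap is small here; it widens as the script has to handle file parsing, joins across sources and intermediate files, all of which the query engine takes over.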
Taught an intro to data science MOOC with tens of thousands of students. (Power of discussion forum to fix sloppy assignment!)
Lots of students more interested in building things than publishing, and are lost to industry. So working on ‘incubator’ projects, reverse internships pulling people back in from industry.
Q: Have you experimented with auto-generating views to clean up data?
A: Yes, but less with cleaning and more deriving schemas and recommending likely queries people will want. Google tool “Data wrangler”.
Q: Once again people using this will think of themselves as ‘not programmers’ – isn’t this actually a downside?
A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there’s no good support for development in SQL. There is a risk of giving people power without teaching them programming. But mostly trying to get people more productive right now.