{"id":24,"date":"2013-07-04T10:56:00","date_gmt":"2013-07-03T22:56:00","guid":{"rendered":"http:\/\/deborahfitchett.com\/blog\/?p=24"},"modified":"2013-07-04T10:56:00","modified_gmt":"2013-07-03T22:56:00","slug":"u-of-washington-escience-institute-nzes","status":"publish","type":"post","link":"https:\/\/deborahfitchett.com\/blog\/2013\/07\/u-of-washington-escience-institute-nzes\/","title":{"rendered":"U of Washington eScience Institute #nzes"},"content":{"rendered":"<p><strong>eScience and Data Science at the University of Washington eScience Institute<\/strong><br \/><em>&#8220;Hangover&#8221; Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science &#038; Engineering, University of Washington<\/em><\/p>\n<p>Scientific process getting reduced to database problem &#8211; instead of querying the world we download the world and query the database&#8230;<\/p>\n<p>UoW eScience Institute aims to be at the forefront of research in eScience techniques\/technology, and in fields that depend on them.<\/p>\n<p>3Vs of big data:<br \/>volume &#8211; this gets lots of attention but <br \/>variety &#8211; this is the bigger challenge<br \/>velocity<\/p>\n<p>Cites a long-tail image from Carole Goble showing that lots of data in Excel spreadsheets, lab books, etc. is just lost.<br \/>Types of data stored &#8211; especially data data and some text. 87% of time is on &#8220;my computer&#8221;; 66% a hard drive&#8230;<br \/>Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes).<br \/>No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.<\/p>\n<p>Problem &#8211; how much time do you spend handling data as opposed to doing science? 
General answer is 90%.<br \/>People may spend a week doing manual copy-paste to match data because they&#8217;re not familiar with tools that would do it with a simple SQL JOIN query in seconds.<br \/>Sloan Digital Sky Survey incredibly productive because they put the data online in database format and thousands of other people could run queries against it.<\/p>\n<p><a href=\"http:\/\/escience.washington.edu\/sqlshare\">SQLShare<\/a>: Query as a service<br \/>Want people to upload data &#8220;as is&#8221;. Cloud-hosted. Immediately start writing queries, share results, others write their queries on top of your queries. Various access methods &#8211; REST API -> R, Python, Excel add-in, spreadsheet crawler, VizDeck, app on EC2.<\/p>\n<p>Metadata<br \/>Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world but at the frontier of research this shared consensus by definition doesn&#8217;t exist, or will change frequently, and data found in the wild will typically not conform to standards. So modifies Maslow&#8217;s Needs Hierarchy:<br \/>Usually storage > sharing > curation > query > analytics<br \/>Recommends: storage > sharing > query > analytics > curation<br \/>Everything can be done in views &#8211; cleaning, renaming columns, integrating data from different sources while retaining provenance.<\/p>\n<p>Bring the computation to the data. Don&#8217;t want just fetch-and-retrieve &#8211; need a rich query service, not a data cemetery. &#8220;Share the soup and curate incrementally as a side-effect of using the data&#8221;.<\/p>\n<p>Convert scripts to SQL and lots of problems go away. Tested this by sending postdoc to a meeting and doing &#8220;SQL stenography&#8221; &#8211; real-time analytics as discussion went on. Not a controlled study &#8211; didn&#8217;t have someone trying to do it in Python or R at same time &#8211; but would challenge someone to do it as quickly! Quotes (a student?) 
&#8220;Now we can accomplish a 10-minute 100-line script in 1 line of SQL.&#8221; Non-programmers can write very complex queries rather than relying on staff programmers and feeling &#8216;locked out&#8217;.<\/p>\n<p>Data science<br \/>Taught an intro to data science MOOC with tens of thousands of students. (Power of discussion forum to fix sloppy assignment!)<\/p>\n<p>Lots of students more interested in building things than publishing, and are lost to industry. So working on &#8216;incubator&#8217; projects, reverse internships pulling people back in from industry.<\/p>\n<p>Q: Have you experimented with auto-generating views to clean up?<br \/>A: Yes, but less with cleaning and more deriving schemas and recommending likely queries people will want. Google tool &#8220;Data wrangler&#8221;.<\/p>\n<p>Q: Once again people using this will think of themselves as &#8216;not programmers&#8217; &#8211; isn&#8217;t this actually a downside?<br \/>A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there&#8217;s no good support for development in SQL. Risk of giving people power but not teaching programming. 
But mostly trying to get people more productive right now.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>eScience and Data Science at the University of Washington eScience Institute&#8220;Hangover&#8221; Keynote by Bill Howe, Director of Research, Scalable Data Analytics, eScience Institute Affiliate Assistant Professor, Department of Computer Science &#038; Engineering, University of Washington Scientific process getting reduced to database problem &#8211; instead of querying the world we download the world and query the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[11,10,9,6],"_links":{"self":[{"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/posts\/24"}],"collection":[{"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/comments?post=24"}],"version-history":[{"count":0,"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/posts\/24\/revisions"}],"wp:attachment":[{"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/media?parent=24"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/categories?post=24"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/deborahfitchett.com\/blog\/wp-json\/wp\/v2\/tags?post=24"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}