Hadoop Overview - Doug Cutting and Eric Baldeschwieler
Doug Cutting - pretty much the father of Hadoop gave an overview of Hadoop history. Interesting comment was that Hadoop has achieved web-scale in early 2008...
Eric14 (Eric B...): Grid computing at Yahoo. 500M unique users per month, billions of interesting events per day. 'Data analysis is the inner loop" at Yahoo.
Y's vision and focus: On-demand shared access to vast pool of resources, support for massively parallel execution. , Data Intensive Super Computer (DISC), centrally provisioned and managed, service oriented...Y's focus is not grid computing in in terms of Globus, etc., not focused on external usage ala Amazon EC2/S3. Biggest grid is about 2,000 nodes.
Open Source Stack: Commitment to open source developent, Yahoo is an Apache Platinum Sponsor
Tools used to implement Yahoo's grid vision: Hadoop, Pig, Zookeeper (high avail directory and config sevices), Simon (cluster and app monitoring).
Simon: Very early days, internal to Yahoo right now...similar to Ganglia "but more configurable". Highly configurable aggregation system - gathering data from various nodes to produce (customizable?) reports.
HOG - Hadoop On Demand. Current Hadoop scheduler currently FIFO - jobs will run in parallel to the extent that the previous job doesn't saturate the node. HOG is built on Torque (www.clusterresources.com) to build virtual clusters, separate file systems, etc. Yahoo has taken development about as far as they want...cleaning up code, etc. Future direction for Yahoo is to invest more heavily in the Hadoop scheduler. Does HOG disrupt data locality - yup, it does. Good news: Future Hadoop work will improve rack locality handling significantly.
Hadoop, HOG, Pig all part of Apache today,
Multiple grids inside Yahoo: tens of thousands of nodes, hundreds of thousands of cores, TBs of memory, PBs of disk...ingests TBs of data daily.
M45 Project: Open Academic Cluster in collaboration with CMU: 500 nodes, 3TB RAM, 1.5P disk, high bandwith located conveniently in a semi-truck trailer
Open source project and Apache: Goal is for Hadoop to remain a viable open source project. Yahoo has invested heavily...very excited to see additional contributors and commiters. "Yahoo is very proud of what we've done with Hadoop over the past few years." Interesting metric: Megawatts of Hadoop
Hadoop Interest Growing: 400 people expressed interest in today's conference, 28 organizations registered their Hadoop usage/cluster, in use in universities on multple continents, Y is now started hiring employees with Hadoop experience.
GET INVOLVED: Fix a bug, submit a test case, write some docs, help out!
Random notes: More than 50% conf attendees running Hadoop, many with grids more than 20 nodes, and several with grids > 100 nodes.
Yahoo just announced collaboration with Computational Research Labs (CRL) in India to "jointly support cloud computing research"...CRL runs EKA - the 4th fastest supercomputer on the planet.