Daniel Abadi (Yale University)
Tradeoffs Between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
Abstract: As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen companies have launched in the past few years that sell parallel database products to meet this market demand. At the same time, MapReduce-based options, such as the open-source Hadoop framework, are becoming increasingly popular, and a plethora of research publications in the past two years have demonstrated how MapReduce can be used to accelerate and scale various data analysis tasks.
Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I describe some experiences in using these systems, and the advantages and disadvantages of their popular implementations. I then discuss a hybrid system that we are building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, I discuss our experience in using HadoopDB both for traditional decision support workloads (i.e., TPC-H) and for scientific data management (analyzing the UniProt protein sequence, function, and annotation data).
Speaker's bio: Daniel Abadi is an Assistant Professor at Yale University. His research interests are in database system architecture and implementation, scalable data management, and cloud computing. He received his Ph.D. from the Massachusetts Institute of Technology, where his work on query execution in column-oriented database systems resulted in the SIGMOD Jim Gray Doctoral Dissertation Award. Abadi has also been a recipient of a Churchill Scholarship, an NSF CAREER Award, and the 2007 VLDB Best Paper Award.
Roger Barga (Microsoft Research)
Emerging Trends and Converging Technologies in Data Intensive Scalable Computing
Abstract: There is today wide agreement that data-intensive scalable computing methods are essential to advancing research in many disciplines. Such methods are expected to play an increasingly important role in providing support for well-informed technical decisions and policies. They are therefore of great scientific and social importance.
The growing wealth of data is manifest as increasing numbers of data collections, varying from curated databases to assemblies of files. The former provide reference resources, preservation, and computational access, whilst the latter are often structured as spreadsheets or CSV files and stored on individual researchers' computers. Many of these collections are growing both in size and complexity. As computer technology and laboratory automation increase in speed and decrease in cost, more and more primary sources of data are deployed, and the flow of data from each one grows.
At the same time, a growing number of researchers and decision makers are both contributing to the data and expecting to exploit this abundance of data for their own work. They require new combinations of data, new and ever more sophisticated data analysis methods, and substantial improvements in the ways in which results are presented. And it is not just the volume of information that matters but also its scope. It is becoming more important for different fields of science to work collaboratively to drive new discoveries. While cross-disciplinary collaboration is helping drive new understanding, it also imposes even greater levels of complexity.
This pervasive change is part of a research revolution that introduces a wave of data-driven approaches termed "The Fourth Paradigm" by Jim Gray, as it is so transformative. Current strategies for supporting it demonstrate the power and potential of these new methods. However, they are not sustainable, as they demand far too much expertise and assistance for each new task. In order to carry out these tasks effectively and repeatedly, we must tease out the principles that underpinned their success and, through clarification, articulation, and tools, make it possible to replicate that success widely with fewer demands on exceptional talent. This will allow researchers to spend more of their time on research.
This talk will take an opinionated look at the past, present, and future of data intensive scalable computing. I will outline trends that have recently emerged in the computer industry to cope with data intensive scalable computing, show why existing software systems are ill-equipped to handle this new reality, and point towards some bright spots on the horizon and share predictions of technology convergence.