Simplifying Big Data Analysis with Apache Spark
Matei Zaharia is an assistant professor in computer science at MIT and CTO of Databricks, the company commercializing Apache Spark. He started the Spark project at UC Berkeley in 2009 when he was a PhD student and continues to serve as its vice president at Apache. Matei has also contributed to other open source projects in the large-scale data processing space, including Apache Mesos and Apache Hadoop.
Fireside Chats: Big Data at Scale
Attend the Fireside Chats for three technical discussions around the theme of "Big Data at Scale," which we will explore from various angles. Along with scaling the technologies, how do experts in the industry deal with scaling open-source engineering teams? Scaling the user base? And scaling their startups? We hope you will join us in these vendor-agnostic discussions as we bring some of the top minds in big data together to understand how they tackle some of the industry's toughest technical challenges.
Cassandra, Neo4j and Spark are the best-of-breed technologies in their respective categories. In three separate 20-minute discussions, we will take up each technology in turn with one of the world's leading experts on it.
Moderator: Sameer Farooqui
Cassandra with Ben Coverston
First up, we take a look at Apache Cassandra with Ben Coverston. Cassandra is a column-family database that has seen a meteoric rise in the past few years. For example, Apple currently has 75,000 machines running Cassandra, storing more than 10 PB of data. Cassandra stores structured data as rows and columns in geographically distributed clusters. Ben is a Cassandra Architect at DataStax, where he has worked since late 2010. Currently, he helps with training and support efforts at DataStax, but he has also helped design and implement new features and products based on Cassandra. Ben's experience includes big data, NoSQL, Hadoop, Spark, search and analytics.
Neo4j with Max De Marzi
In terms of user adoption, Neo4j is the clear winner in the field of open-source graph databases. Graph databases store data in nodes (each point) and edges (a connection between two nodes). Neo4j can scale to billions of nodes and billions of edges. Some of the world's most sophisticated challenges involve exploiting graphs, such as counter-terrorism, Web search, bioinformatics, risk analysis and product/friend recommenders. Max De Marzi is a software field engineer at Neo4j, where he has worked for about three years. He has studied graph databases in depth and has been exposed to a wide array of use cases, which we hope to learn more about tonight.
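To make the node-and-edge model concrete, here is a minimal sketch of a friend recommender in plain Python. It is only an illustration of the idea and does not use Neo4j or its query language; the sample names, `build_graph` and `recommend_friends` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: each pair is an undirected "friend" edge between two nodes.
edges = [("ann", "bob"), ("bob", "cara"), ("ann", "dev"), ("dev", "cara"), ("cara", "eli")]

def build_graph(edge_list):
    """Return an adjacency map: node -> set of directly connected nodes."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def recommend_friends(graph, person):
    """Rank people who are not yet friends of `person` by number of shared friends."""
    current_friends = graph[person]
    counts = defaultdict(int)
    for friend in current_friends:
        for candidate in graph[friend]:
            if candidate != person and candidate not in current_friends:
                counts[candidate] += 1
    # Most shared friends first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

graph = build_graph(edges)
recommendations = recommend_friends(graph, "ann")
print(recommendations)
```

A graph database answers the same kind of question by walking edges directly, which is why traversals like this can stay practical as the graph grows.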
Spark with Matei Zaharia
Spark is the new big data processing engine that is taking the industry by storm. It can read from any data source (relational, NoSQL, file systems, etc.) and offers one unified API for batch analytics, SQL queries, real-time analysis, machine learning and graph processing. Developers no longer have to learn separate processing engines for different tasks. Tonight, we will separate the hype from the truth with the creator of Apache Spark, Matei Zaharia. Matei is currently CTO of Databricks, a company he created with the original UC Berkeley Spark team, and a professor at MIT.