Big Data TechCon San Francisco Tutorials


 Monday, October 27

Full Day Tutorial
8:30am - 5:00pm

Clear Data Graphics with Illustrations in R (New; Hands-On)

Richard M. Heiberger and Naomi B. Robbins

This full-day tutorial will help you rank elementary graphical perception tasks to identify those that you do best, and will show the limitations of many common graphical constructions. We will demonstrate newer, more effective graphical forms developed on the basis of that ranking, and will describe these forms and when to use them. This part is software-agnostic and applies to any software you might use. We will then introduce R graphics as an important example of a graphical system and show how to use R to produce all of the graphs discussed, including bar charts, dot plots, formula specification, axis notation, mosaic plots for categorical data, and many more. We will also cover combining software systems such as R and Excel.

Additionally, you will learn general principles for creating effective graphs, and compare the same data plotted with different graph forms so you can see how understanding depends on the graphical construction used. Since scales (the rulers along which we graph the data) have a profound effect on our interpretation of graphs, the section on general principles contains a detailed discussion of scales, including: When should zero be included? When do logarithmic scales improve clarity? Are two scales better than one? And how can we distinguish informative from deceptive double axes? The tutorial will conclude with before-and-after examples.
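
To make the scale question concrete, here is a toy rule of thumb sketched in Python; the threshold and data are illustrative assumptions, not a recommendation from the tutorial. A logarithmic scale tends to help when the data span several orders of magnitude:

```python
import math

def spans_orders_of_magnitude(values, threshold=2):
    """Heuristic: suggest a log scale when positive data span more than
    `threshold` orders of magnitude (an illustrative rule of thumb)."""
    positive = [v for v in values if v > 0]
    span = math.log10(max(positive)) - math.log10(min(positive))
    return span > threshold

revenues = [120, 4500, 98000, 2_300_000]  # hypothetical values
print(spans_orders_of_magnitude(revenues))  # True: ~4.3 orders of magnitude
```

On a linear axis the three smaller values above would be indistinguishable near zero; on a log axis all four are legible.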

Attend this tutorial to:

  • Learn how to present data more effectively in all media
  • Have a hands-on experience drawing graphs in R
  • Understand principles of effective simple graphs to build upon when creating interactive or dynamic displays
  • Become more critical and analytical when viewing graphs.
  • Recognize misleading and deceptive graphs.

LEVEL: Overview

TRACK: Analysis

Hadoop: A One-Day, Hands-On Crash Course (Most Popular; Hands-On)

Sameer Farooqui

This full-day tutorial is a fast-paced, vendor-agnostic technical overview of the Hadoop landscape, and is targeted at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop. You will be introduced to the core concepts of Hadoop, and dive deep into the critical paths of HDFS, Map/Reduce and HBase. You will also learn the basics of how to effectively write Pig and Hive scripts, and how to choose the correct use cases for Hadoop. During the tutorial, you will have access to an individual one-node Hadoop cluster in Rackspace to run through some hands-on labs for the five software components: HDFS, Map/Reduce, Pig, Hive and HBase.
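
To preview the Map/Reduce model mentioned above, here is a minimal word-count sketch in plain Python. It mimics the mapper, shuffle, and reducer phases of a Hadoop job; the real framework distributes these steps across a cluster, and nothing here uses actual Hadoop APIs:

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs, as a Map/Reduce mapper would
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum each key's values, as a reducer would
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big ideas", "big data"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'ideas': 1}
```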

In each sub-topic, you will be provided links and resource recommendations for further exploration. You will also be given a 100-page PDF slide deck, which can be used as reference material after the course. PDFs will also be given out for the five short, hands-on labs. No prior knowledge of databases or programming is assumed.

Note: You are required to bring a laptop. If you run into an issue during the hands-on portions, the instructor may not be available to help you troubleshoot.

LEVEL: Overview

TRACK: Hadoop

Hands-on Intro to Apache Spark (New; Hands-On)

Paco Nathan

In this full-day, hands-on programming tutorial, we will cover Spark essentials and best practices; APIs in Python, Java, and Scala; and example apps. We will also build and deploy across the most popular Hadoop distros, and look at case studies of production deployments at enterprise scale.

This is an introductory tutorial, intended primarily for developers who are new to Spark and want to learn to create apps; for Java developers who are familiar with Hadoop and are just learning Scala; or for Python developers (PySpark). The tutorial also teaches some basics of Scala.
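
As a taste of the chained-transformation style the tutorial teaches, here is a plain-Python stand-in for Spark's RDD API. The `MiniRDD` class is invented for illustration; real PySpark code would start from a `SparkContext` and run distributed:

```python
class MiniRDD:
    """A toy, single-machine analogue of a Spark RDD: transformations
    are chained, and collect() materializes the result."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, fn):
        return MiniRDD(x for x in self.data if fn(x))

    def collect(self):
        return self.data

# Keep the even numbers, square them, gather the results
result = (MiniRDD(range(10))
          .filter(lambda x: x % 2 == 0)
          .map(lambda x: x * x)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```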

By the end of the day, you will be able to:

  • Open a Spark shell and run apps
  • Use some ML algorithms
  • Explore data sets loaded from HDFS and other sources
  • Install and deploy Spark on major Hadoop distros
  • Review Spark SQL, Spark Streaming, and Shark
  • Discuss case studies of Spark production deployments
  • Survey advanced topics and BDAS projects
  • Find developer community resources and events
  • Identify follow-up courses and certification options
  • Return to your workplace and demo Spark!

NOTE: You are required to have the following:

LEVEL: Overview

TRACK: Hadoop

Half-Day Tutorial
8:30am - 12:15pm

Big Data Architecture Approach (New)

Tony Shan

This tutorial presents the best practices in architecting Big Data solutions pragmatically. We will first review the major architecture methods and frameworks to understand the landscape and current state. Then we look into the key challenges in the Big Data space, which leads to the conceptualization of Big Data Architecture. We will further characterize the core attributes of Big Data architecture. Afterwards, we will introduce the Big Data Architecture Framework (BDAF), which defines a comprehensive architecture approach to help manage a set of discrete artifacts and implement a collection of specific design elements in Big Data applications and systems.

BDAF establishes a common practice of systematic architecture modeling and design disciplines for big data practitioners. We will dive deep into the four integral parts of BDAF in great detail: Domain-specific, Enablement-dependent, Platform-agnostic, and Technology-neutral (DEPT). Subsequently, we will walk through some real-world work examples of NoSQL-based implementations to illustrate the pragmatic use of BDAF to improve design consistency, simplify system complexity, maximize loose-coupling, and increase productivity. Emerging practices and patterns, along with lessons learned and antipatterns in real-life projects, will be discussed as well.

LEVEL: Advanced

TRACK: NoSQL, Hadoop

Introduction to Neo4j (New; Hands-On)

Max De Marzi

Get a hands-on introduction to Neo4j. We'll install the database on your laptop, load the sample data, and learn the basics of Neo4j's query language, "Cypher." You'll learn the Neo4j "secret sauce": how to model data in a graph, what kinds of queries and use cases Neo4j excels at, and how to get started building your own proof of concept.
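
To give a flavor of what a graph query does, here is a hypothetical friends-of-friends lookup over a dictionary-based graph in plain Python; the Cypher in the comment is an approximate equivalent, and the data is made up:

```python
# Hypothetical social graph. A roughly equivalent Cypher query might be:
#   MATCH (a:Person {name: 'Alice'})-[:KNOWS*2]->(f) RETURN DISTINCT f
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Eve"],
    "Dave": [],
    "Eve": [],
}

def friends_of_friends(graph, person):
    # Two hops out, excluding direct friends and the person themselves
    direct = set(graph[person])
    two_hops = {f2 for f1 in direct for f2 in graph[f1]}
    return two_hops - direct - {person}

print(sorted(friends_of_friends(graph, "Alice")))  # ['Dave', 'Eve']
```

A graph database makes this kind of relationship traversal a first-class, indexed operation rather than a nested loop.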

LEVEL: Overview


Half-Day Tutorial
1:15pm - 5:00pm

Data Structures and Algorithms for Big Databases (Most Popular)

Michael A. Bender and Bradley C. Kuszmaul

This tutorial will explore data structures and algorithms for big databases. The topics include:

  • Data structures including B-trees, Log Structured Merge Trees, Streaming B-trees, and Fractal Trees.
  • Bloom filters and Bloom filter alternatives designed for SSDs.
  • Index design, including covering indexes.
  • Getting good performance in memory.
  • Cache efficiency including both cache-aware and cache-oblivious data structures and algorithms.

These algorithms and data structures are used both in NoSQL implementations such as HBase, LevelDB, MongoDB, and TokuMX, and in SQL-oriented implementations such as MySQL and TokuDB. We will explain the underlying data structures and algorithms that drive big databases, with an emphasis on write-optimized data structures such as Log-Structured Merge Trees and Fractal Trees.
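
As a taste of the Bloom filter material, here is a minimal Python sketch; the bit count and hash count are arbitrary illustrative choices. The key property: membership tests can return false positives, but never false negatives:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: compact set-membership tests with no
    false negatives and a tunable false-positive rate."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting one hash function
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-42")
print(bf.might_contain("row-key-42"))   # True (never a false negative)
print(bf.might_contain("row-key-999"))  # almost certainly False
```

Databases use such filters to skip disk reads for keys that are definitely absent.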

This tutorial includes explaining and analyzing data structures, so it might not be aimed at someone who hates seeing O(N log N); however, the content will be accessible so that anyone who can tolerate some math will benefit from attending.

LEVEL: Advanced

TRACK: Analysis, NoSQL

The Building Blocks of Storm Trident Deployment (New)

Ron Bodkin

This half-day tutorial will cover Storm Trident, complete with a step-by-step walkthrough of a stock ticker use case. We’ll demonstrate how to build, debug, deploy, and develop your own custom reusable stock statistics and aggregations running on the Storm Trident abstraction, and how to query them in real time using DRPC. We’ll walk you through the following points:

  • What is Storm?
  • What is Trident?
  • Why Trident and Storm?
  • Stock ticker - stock statistics use case
  • Code
  • Building
  • Deploying
  • Running your topology
  • Monitoring your topology
  • Building custom aggregation and statistics operators
  • Querying your topology
  • Next Steps
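
To suggest the shape of the stock-statistics aggregation above, here is a hypothetical windowed-statistics sketch in plain Python. It uses no Storm or Trident APIs; it simply illustrates the kind of per-symbol state a topology might maintain as ticks stream in:

```python
from collections import deque

class RollingStats:
    """Toy per-symbol windowed aggregation, in the spirit of what a
    Trident topology might maintain (illustrative, not Storm code)."""
    def __init__(self, window=3):
        self.prices = deque(maxlen=window)  # keeps only the last N ticks

    def update(self, price):
        self.prices.append(price)
        return self.snapshot()

    def snapshot(self):
        return {
            "count": len(self.prices),
            "mean": sum(self.prices) / len(self.prices),
            "min": min(self.prices),
            "max": max(self.prices),
        }

ticker = RollingStats(window=3)
for price in [101.0, 102.5, 99.5, 103.0]:
    stats = ticker.update(price)
print(stats)  # window holds only the last 3 ticks: 102.5, 99.5, 103.0
```

In a real topology this state would be partitioned by symbol across workers and queried on demand via DRPC.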

LEVEL: Intermediate

TRACK: Hadoop

 Tuesday, October 28

Half-Day Tutorial
8:00am - 11:30am

Apache Cassandra, An Introduction (New)

Ben Coverston

This tutorial is an introduction to Apache Cassandra. It will be broken up into two general topics. First is an introduction to the internals of an Apache Cassandra cluster: How data is partitioned, replicated, and managed from an operational perspective. The second part of the tutorial will discuss how to use Apache Cassandra from applications using the Cassandra Query Language (CQL). Examples will be approachable to beginners and the audience will be able to follow along if they desire. We will discuss patterns and anti-patterns for Cassandra applications. After this tutorial, you will have the knowledge necessary to bring up a Cassandra cluster, store data, and query it effectively while avoiding some of the most typical mistakes that are made by new users.
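
To illustrate how partitioning and replication interact, here is a simplified Python sketch of token-based placement in the spirit of Cassandra's ring; the node names, hash choice, and ring logic are all illustrative simplifications, not Cassandra's actual implementation:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def token(key):
    # Cassandra hashes the partition key to a token; sha256 stands in here
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def replicas(key, replication_factor=3):
    """Pick a primary node from the key's token, then take the next
    nodes around the ring, mimicking a simple replication strategy."""
    start = token(key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

# Every key deterministically maps to the same set of replica nodes
print(replicas("user:1234"))
```

Because placement is a pure function of the key, any node can route a request without a central coordinator.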

LEVEL: Overview


Big Data Programming with Spark (New; Hands-On)

Dean Wampler

Spark is a data computation engine that is replacing MapReduce as the preferred programming tool for Hadoop jobs. Written in Scala, with APIs in Java, Python, and soon R, Spark combines high performance through intelligent caching of data in memory with a set of powerful operations for writing advanced data workflows using relatively little code. Spark also supports a subset of SQL through a new Spark SQL API, which lets you mix SQL with programming logic, and a port of Hive to Spark, called Shark. Other Spark-based libraries support event-stream processing, graph algorithms, and machine learning.

This hands-on tutorial will introduce you to the Spark Scala APIs for batch-mode and event-stream processing, including the SQL support. We’ll also learn how to use Spark within Hadoop or standalone. You’ll see that Spark is an ideal tool when a full-featured and flexible toolset for Big Data applications is needed, beyond what Hive, Pig, and the original MapReduce API can provide.

LEVEL: Intermediate

TRACK: Hadoop

Introduction to Machine Learning in Scalding (New; Hands-On)

Krishnan Raman

In this tutorial, we explore supervised and unsupervised learning techniques from scratch, using the newly introduced Iterative Stats API in Scalding, and construct a series of ML algorithms.

Each algorithm is introduced and implemented in Scalding, and its performance is tested on publicly available datasets. Finally, we briefly introduce production-ready, Scalding-based Machine Learning suites used in industry, such as Etsy's Conjecture.

NOTE: This tutorial will involve a lot of code in Scalding, so prior knowledge of Scalding and some background in Machine Learning will be beneficial. You should also have a fair amount of knowledge of linear algebra, and you must have a working copy of the Scalding repo set up on your machine. All code will be run in HDFS local mode.
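
As a flavor of "ML from scratch" (sketched here in Python rather than Scalding, with toy data), ordinary least squares for a single feature reduces to two closed-form statistics, which is exactly the kind of per-dataset aggregation an iterative stats API computes:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: fit y = a*x + b
    from the covariance and variance of the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(round(a, 6), round(b, 6))  # 2.0 1.0
```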

LEVEL: Advanced

TRACK: Analysis

NoSQL for SQL Professionals

Dipti Borkar

With all of the buzz around Big Data and NoSQL (non-relational) database technology, what actually matters for today's SQL professional? Learn more in this tutorial about Big Data and NoSQL in the context of the SQL world, and get to what's truly important for data professionals today. In this tutorial, we will discuss:

  • The main characteristics of NoSQL databases
  • High-level architectural overviews of the most popular NoSQL databases
  • Differences between distributed NoSQL and relational databases
  • Use cases for NoSQL technologies, with real-world examples from organizations in production today

Finally, we will drill down into Couchbase Server and its underlying distributed architecture, with a hands-on tour of how NoSQL databases like Couchbase work in a production environment, including online rebalancing while adding nodes to a cluster, indexing and querying, and cross-data-center replication.

LEVEL: Intermediate


R, Data Wrangling and Data Science Competitions (New; Hands-On)

Krishna Sankar

Let us mix R, add a dash of Machine Learning algorithmics, and work on Data Science competitions hosted by Kaggle, a platform for predictive modeling and analytics competitions that offers interesting data science challenges as well as scoring mechanisms and a leaderboard. This tutorial introduces the intersection of Data, Inference and Machine Learning in a progressive, hands-on structure, so that attendees learn by wrangling data for interesting inferences using R and libraries like dplyr and ggplot2. First, we will quickly look at the classes of algorithms (classification, clustering, regression and SVM) and their internals through simple competition problems and datasets. Then we will dig deeper into two Kaggle competitions: working through the datasets, choosing the right features, programming the appropriate algorithms hands-on, and submitting a few entries. Along the way, we will encounter knowledge accumulated by data scientists, model evaluation and performance mechanisms, feature selection methods, and interesting datasets.
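
As a tiny taste of the classification algorithms mentioned, here is a nearest-centroid classifier in plain Python with made-up data; competition entries would of course use richer models and real feature engineering:

```python
import math

def centroid(points):
    # Per-dimension average of a list of equally sized tuples
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def nearest_centroid_predict(train, label_a, label_b, query):
    """Toy two-class nearest-centroid classifier: predict whichever
    class centroid the query point is closer to (illustrative only)."""
    ca = centroid(train[label_a])
    cb = centroid(train[label_b])
    return label_a if math.dist(query, ca) <= math.dist(query, cb) else label_b

train = {
    "cat": [(1.0, 1.0), (1.5, 1.2)],  # hypothetical feature vectors
    "dog": [(4.0, 4.2), (4.5, 3.8)],
}
print(nearest_centroid_predict(train, "cat", "dog", (1.2, 1.1)))  # cat
```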

LEVEL: Advanced

TRACK: Analysis