Tutorials | Big Data TechCon 2013, Boston

Choose from 10 full-day and half-day Big Data tutorials to kick-start your Big Data training experience at Big Data TechCon. Whether you need introductory Hadoop tutorials or a deep dive into Scalding, there's a how-to Big Data tutorial waiting for you at Big Data TechCon.

Sessions marked “[Code]” will include code shown during the session.

Monday, April 8
Half-Day Tutorial
8:30 AM - 12:15 PM

Crash Course in Machine Learning [Code]

Machine learning is a broad topic with many categories of problems and approaches, as well as lots of special-purpose tricks of the trade. We can’t cover all that, of course, but we can give you a taste of the most common problems addressed by machine learning and some of the techniques used to address them.

We’ll discuss supervised learning, where a system is trained on data that has already been classified and is then used to classify new data. Classification applies a set of labels to data; our running example will be distinguishing spam from non-spam e-mails, and we’ll experiment with a spam classifier. Finally, we’ll finish this section with a brief discussion of regression, where a continuous value is assigned to data, such as predicting housing prices from historical data.
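To give a flavor of what such a classifier looks like, here is a minimal illustrative sketch in Python of a naive Bayes spam classifier (this is not the tutorial’s actual code; the tiny training set and word-based scoring are invented for illustration):

```python
import math
from collections import Counter

def train(messages):
    """Count word frequencies per label ("spam"/"ham") and label frequencies."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Naive Bayes with add-one smoothing: pick the label with the higher log-probability."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration.
training = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch at noon tomorrow"),
]
wc, lc = train(training)
print(classify("claim your free prize", wc, lc))  # "spam"
```

The same train-then-predict shape carries over to real classifiers; only the model and the feature extraction get more sophisticated.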

How does Netflix determine what movies to recommend to you? How does Amazon know what products you might want? We’ll examine recommendation engines that compare either user preferences or item features to make recommendations to you. We’ll experiment with a movie-recommendation engine.
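The core of a preference-comparing recommender is a similarity measure between users. A minimal Python sketch (movie titles and ratings invented; real engines add normalization, weighting and scale):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {movie: rating}."""
    common = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in common)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def recommend(target, others, top_n=1):
    """Recommend the highest-rated movies the most similar user saw that the target has not."""
    best = max(others, key=lambda o: cosine(target, o))
    unseen = {m: r for m, r in best.items() if m not in target}
    return sorted(unseen, key=unseen.get, reverse=True)[:top_n]

# Invented ratings on a 1-5 scale.
alice = {"Alien": 5, "Blade Runner": 4}
users = [
    {"Alien": 5, "Blade Runner": 5, "Solaris": 4},
    {"Titanic": 5, "The Notebook": 4},
]
print(recommend(alice, users))  # ["Solaris"]
```

Item-based variants apply the same similarity idea to columns (movies) instead of rows (users).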

Clustering is an example of unsupervised learning, where we don’t train the system in advance. Instead, the system finds structure in the data "on its own." We’ll look at two examples: k-means clustering, where you find k clusters and their centers in a data set; and nearest-neighbors, where you find the nearest neighbors to a given data point in an efficient way. We’ll experiment with k-means.
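As a taste of the k-means experiment, here is an illustrative pure-Python version of Lloyd’s algorithm on 2-D points (the data is invented; the tutorial’s own exercises may use different tooling):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(coord) / len(cluster)
                                         for coord in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's old center
        centers = new_centers
    return centers

# Two well-separated blobs, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(points, k=2))
print(centers)  # one center near each blob
```

Note that k must be chosen up front and the result depends on the random initialization; both caveats come up repeatedly in practice.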

We’ll finish with a brief discussion of topics you might pursue on your own; the importance of data preparation, probabilities and statistics; and more advanced machine-learning concepts, such as probabilistic graphical models and neural networks.

Side notes: This tutorial is good for intermediate to advanced developers and data analysts. Some programming ability will be assumed, such as using a text editor, writing simple scripts in some language, and using simple Linux "shell" commands.

Level: Advanced
Getting Started with Cassandra [Code]

Unless you have experience with Google BigTable, HBase or Cassandra, column-oriented databases are probably an enigma. Cassandra's data model is both simple and powerful.

It takes some time to get used to the differences between the relational model and Cassandra's column-based model.

Cassandra is not schema-less, but we do not model relationships in Cassandra either. Data modeling in Cassandra usually consists of finding the best way to denormalize data as you write it, so that you can retrieve it quickly and efficiently. This workshop will prepare you to model your data successfully. It dives into Cassandra from a developer's perspective and gives you the tools you need to get started with Cassandra today.

This tutorial will cover:
    •    An introduction to Cassandra in the context of relational databases and non-relational alternatives
    •    Best practices for modeling your data in Cassandra
    •    Cassandra Query Language (CQL version 3)
    •    Wide and composite columns
    •    Practical examples
    •    Anti-patterns (things to avoid)
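To make the denormalize-on-write idea concrete, here is a small Python sketch of the classic “timeline” pattern: each follower gets a wide row keyed by user, with columns sorted by timestamp, so a read is one partition lookup instead of a join. (This is an in-memory stand-in, not CQL; the names are invented for illustration.)

```python
from collections import defaultdict

# A toy in-memory stand-in for a Cassandra column family: one "wide row"
# per partition key, with columns kept sorted by a clustering key (timestamp).
timeline = defaultdict(list)  # follower -> [(timestamp, tweet), ...]

def post_tweet(author, timestamp, body, followers):
    """Denormalize on write: copy the tweet into every follower's timeline row."""
    for follower in followers:
        timeline[follower].append((timestamp, author + ": " + body))

def read_timeline(user, limit=10):
    """A read is one partition lookup -- no joins; newest entries first."""
    return [body for _, body in sorted(timeline[user], reverse=True)[:limit]]

post_tweet("alice", 1, "hello", followers=["bob", "carol"])
post_tweet("alice", 2, "cassandra!", followers=["bob"])
print(read_timeline("bob"))  # ["alice: cassandra!", "alice: hello"]
```

The trade-off is extra writes and duplicated data in exchange for cheap, predictable reads, which is exactly the bargain Cassandra’s data model encourages.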

For a more advanced look at Cassandra, attend the "Apache Cassandra -- A Deep Dive" class.
Level: Overview
Hands-on NoSQL for the DBA [Code]

In this tutorial you will learn the what, why, how, when and where of non-relational (NoSQL) database technologies. You will come to understand the different types of NoSQL data-storage options in popular use today. These include graph databases such as Neo4j and Freebase, key-value stores like DynamoDB and Cassandra, document data stores such as MongoDB, and column data stores like Hadoop. You will learn about and work with both locally hosted and cloud-based versions of these data stores. Importantly, you'll understand which types of data stores will best suit your business and technical requirements. The tutorial's objectives include:
  • Understanding the capabilities (and limitations) of NoSQL databases

  • Seeing an example of each type of NoSQL database

  • Understanding basic query technologies for NoSQL data stores by working with Hadoop (working with MapReduce, Pig and Hive)

  • Getting hands-on experience installing, querying and performing basic administration with MongoDB

Prerequisite knowledge: RDBMS development or administration (SQL Server references will be used in this talk for comparison).

Prerequisite: The class will be taught on Windows; students can use a Mac, but step-by-step instructions will differ slightly.

Level: Intermediate
Introduction and Best Practices for Storing and Analyzing Your Data with Apache Hive [Code]

This tutorial will introduce Apache Hive and best practices for storing and analyzing data with it. Hive is an open-source data-warehousing system built on top of Apache Hadoop that lets you query, mine and analyze the data stored in Hadoop clusters using familiar SQL-like queries.

This tutorial will go through a hands-on exercise showing how users can use Hive queries to perform data analysis. Because not all analysis can be expressed using SQL-like queries, the workshop will also cover how to write, test and use User Defined Functions and User Defined Aggregate Functions in Hive. The tutorial will then go through some of the best practices related to partitioning, bucketing and joining various datasets in Hive.

You will also learn how to leverage other technologies in the Hadoop ecosystem, such as plugging MapReduce scripts from Hadoop directly into your Hive queries, and how to integrate HBase with Hive to share data across the two systems. The tutorial will wrap up with a question-and-answer session.
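As a flavor of plugging a script into a Hive query: Hive’s TRANSFORM clause pipes each row to an external script as tab-separated fields on stdin and reads transformed rows back from stdout. A minimal sketch (the table, column names and transformation are invented for illustration, not from the tutorial):

```python
#!/usr/bin/env python
# A minimal Hive TRANSFORM-style script. Example Hive usage (names invented):
#   SELECT TRANSFORM (url, hits) USING 'python normalize.py' AS (host, hits)
#   FROM weblogs;
import sys

def transform_line(line):
    """Keep only the hostname from a URL column; pass the hit count through."""
    url, hits = line.rstrip("\n").split("\t")
    host = url.split("/")[2] if "://" in url else url
    return host + "\t" + hits

def main():
    # Hive feeds rows on stdin and reads our stdout when the query runs.
    for line in sys.stdin:
        print(transform_line(line))

if __name__ == "__main__":
    main()
```

Because the contract is just tab-separated text over stdin/stdout, any language works here; Python is simply a common choice.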

Note: For this tutorial, you are required to bring a laptop with Apache Hadoop and Apache Hive installed on it. The best and easiest way to get started is to download a demo VM with Hadoop and Hive installed and configured. You may download such a demo VM from https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4. VMWare, KVM and VirtualBox images are available at that link.

Also, please clone the git repository at https://github.com/markgrover/bdtc-hive *on the demo VM* before you come to the tutorial.

Level: Intermediate
Introduction to Hadoop, Map/Reduce and HDFS for Big Data Applications [Code]

This tutorial will teach you how to solve Big Data problems using Hadoop in a fast, scalable and cost-effective way. It is designed for technical personnel and managers who are evaluating and considering using Hadoop to solve data-scalability problems.

We will start with Hadoop basics and discuss best practices for using Hadoop in enterprises dealing with large datasets. We will look into the current data problems you are dealing with and potential use cases of using Hadoop in your infrastructure. The presentation covers the Hadoop architecture and its main components: Hadoop Distributed File System (HDFS) and Map/Reduce. We will present case studies on how other enterprises are using Hadoop, and look into what it takes to get Hadoop up and running in your environment.

Two case studies will cover near-real-time data-processing scenarios and Hadoop implementations for large clusters (2,000 to 4,000 nodes). The near-real-time case study can serve as guidance for building your own near-real-time architecture. All components used in the architecture are open source under the Apache license, providing a cost-effective solution for Big Data problems.

By attending this tutorial, you will:
  • Understand Hadoop's main components and architecture
  • Be comfortable working with the Hadoop Distributed File System (HDFS)
  • Understand the Map/Reduce abstraction and how it works
  • Understand the components of a Map/Reduce job
  • Know best practices for using Hadoop in the enterprise

We will demonstrate real-life code for basic Map/Reduce jobs, as well as code for working with HDFS, during the tutorial.
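The shape of a basic Map/Reduce job can be sketched with the canonical word-count example, written here in the Hadoop Streaming style (a mapper emits (word, 1) pairs, the framework sorts by key, and a reducer sums each group); this is an illustrative local simulation, not the tutorial’s own code:

```python
# Word count, Hadoop Streaming style: mapper emits (word, 1) pairs, the
# reducer sums counts per word. Locally, a sort stands in for the
# framework's shuffle/sort phase between the two.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Pairs arrive grouped by key, as after Hadoop's shuffle/sort phase.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reducer(sorted(mapper(lines))))
print(counts["the"])  # 2
```

On a real cluster, the mapper and reducer run on many nodes in parallel against blocks of an HDFS file, but the two-function contract is the same.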

Level: Intermediate
1:15 PM - 5:00 PM

A Case Study: Data Visualizations and Insight Into How America Eats at Restaurants

This tutorial will discuss the pains and lessons learned from building out a database that collects data from thousands of restaurants across North America on a nightly basis. The challenges of integrating data from disparate Point of Sale (POS) systems, data cleaning, data normalization, ETL and generating insight will all be discussed.

Additional topics include:
  • Why choose open source?
  • Why pick Mongo?
  • What's up with ZFS?
We'll wrap up by covering lessons learned and "What's next?"

Level: Overview
Hadoop Data Warehousing with Hive [Code]

In this hands-on tutorial, you’ll learn how to use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.

We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop. We’ll also contrast Hive with relational and non-relational database options.

Hive is very flexible about table schemas, file formats, and where the files are stored. We’ll discuss real-world scenarios for the different options. We’ll briefly examine how you can write Java user-defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.

You’ll learn Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss data organization and configuration topics that ensure best performance and ease of use in production environments.

Side notes: This tutorial is suitable for beginning data analysts and software developers. Bring your laptop pre-installed with a suitable secure shell (ssh) client, such as PuTTY for Windows. Mac OS X and Linux systems come preconfigured with ssh.

Prerequisite knowledge: Some prior SQL experience will be assumed.

Level: Overview
Hands-on Google BigQuery and Google Predictive API [Code]

In this tutorial you will learn the what, why, how, when and where around two of Google’s cloud data offerings. Here you will learn about Google BigQuery and the Google Predictive API and why you might consider using either of these cloud services. You will learn how to write effective, performant and useful queries when using either of these services. You’ll also get to wire up the services via their exposed APIs. The course will be taught in Java.

The tutorial's objectives include:
  • Understanding the capabilities (and limitations) of Google BigQuery and the Google Predictive API
  • Writing and executing queries against BigQuery and the Predictive API
  • Understanding how to wire up the services into your application
  • Understanding costs involved in using these cloud services from Google

Prerequisite knowledge: RDBMS administration or development; data-mining experience is helpful but not required; you should know Java.

Prerequisite software: You must have a Gmail account to sign up for these services from Google.

Level: Intermediate
Hive, Pig, Cascading and Codd: A Crash Course in Map/Reduce Relational Languages via an Appeal to History [Code]

This tutorial will teach you how to simultaneously implement the relational operators as defined by famed computer scientist E.F. Codd in Hive, Pig and Cascading. You, the developer, will focus on the abstract concepts of the Relational Algebra in order to learn ALL the languages simultaneously. Theory can sometimes be dry, but, in this case, revisiting Codd’s original intentions — and his seminal papers — can accelerate our learning of these new Big Data Relational Languages:

• What are the Relational Algebra operators?
    - Select
    - Project
    - Join
    - Cross, union, intersect and divide
• What is the Pig, Hive and Cascading syntax for these operators?
• How do high-level languages and libraries like Pig and Hive compile these operators to MapReduce?
• Why does the syntax of Pig, Hive and Cascading differ, and what is each trying to emphasize or deemphasize?
• Running Exercises: Practice HiveQL, Pig and Cascading/Cascalog/Scalding concepts as they are introduced.
• When to use Pig, Hive or Cascading over the others
• Code and examples of each language
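The operators themselves are language-neutral, which is the whole point of Codd’s algebra. As an illustrative warm-up (not one of the class exercises), here is a Python sketch treating relations as lists of dicts, with invented sample data:

```python
def select(relation, predicate):
    """Codd's SELECT (restriction): keep rows matching a predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, *columns):
    """PROJECT: keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in relation]

def join(left, right, key):
    """Natural equi-JOIN on a shared key column (naive nested loop)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

employees = [{"name": "Ada", "dept": 1}, {"name": "Alan", "dept": 2}]
depts = [{"dept": 1, "dept_name": "Research"}, {"dept": 2, "dept_name": "Ops"}]
print(project(join(employees, depts, "dept"), "name", "dept_name"))
```

HiveQL’s WHERE, SELECT-list and JOIN, Pig’s FILTER, FOREACH…GENERATE and JOIN, and Cascading’s filters, Each pipes and CoGroup are all surface syntax for these same three operations.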
The following prerequisites ensure that you will gain the maximum benefit from the class:
• Programming experience: This is a developer’s course. We will write Hive, Pig, Cascading/Scalding/Cascalog applications. Prior programming experience is recommended.
• Linux shell experience: Basic Linux shell (bash) commands will be used extensively. Some prior experience is recommended.
• Experience with SQL databases: SQL experience is helpful for learning these languages, but not essential.

The main format of this tutorial will be follow-along, although all code will be provided in case you want to code simultaneously. You may log into remote EMR instances to build, test and run the applications if you wish, but it will be on your account, and the instructor may not be able to help you with debugging. You will also be provided with all the exercise software, so you can view it on your laptop if desired.

Level: Advanced
Programming with Scalding and Algebird [Code]

This is a hands-on coding tutorial. We will code up a few Scalding programs in different domains: portfolio optimization, healthcare, cosine similarity and random forests. While Scalding looks like a thin Scala API atop Cascading, this appearance is deceptive. The power of Scala combined with the mapping, grouping and joining primitives in Scalding, along with the Algebird abstract-algebra library, allows for a whole new level of flexibility with Big Data. Matrix operations in Scalding are powered by Algebird, and by using large-dimension matrices as a primitive, we can tackle problems in diverse domains that employ linear algebra over very large datasets in batch mode.
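The “abstract algebra” in Algebird mostly means aggregating with a monoid: an associative combine operation plus an identity element, which is exactly the property that lets partial results be computed on separate machines and merged in any order. A Python sketch of the idea (not Algebird’s Scala API; names invented):

```python
from functools import reduce

class Monoid:
    """An associative combine plus an identity element -- the property that
    lets Algebird-style aggregations be split across workers arbitrarily."""
    def __init__(self, zero, plus):
        self.zero, self.plus = zero, plus

    def sum(self, values):
        return reduce(self.plus, values, self.zero)

# Element-wise vector addition is a monoid, so partial sums computed on
# different workers can be combined in any grouping.
vec_add = Monoid((0, 0), lambda a, b: (a[0] + b[0], a[1] + b[1]))
shard1 = vec_add.sum([(1, 2), (3, 4)])   # one worker's partial result
shard2 = vec_add.sum([(5, 6)])           # another worker's
print(vec_add.sum([shard1, shard2]))     # (9, 12)
```

Matrices under addition form the same structure, which is why large-dimension matrix operations distribute so naturally in Scalding.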

Note: You are expected to have installed Scala, Scalding and Algebird on your laptop before the tutorial commences.

Access the slides and Scala code here: https://github.com/krishnanraman/bigdata.

Level: Advanced

A BZ Media Production