Classes | Big Data TechCon 2013, Boston | The HOW-TO Big Data Conference | Hadoop, training and more!
Big Data TechCon


Choose from more than 45 classes on Big Data and put together your own custom Big Data training experience. Whether you need beginning Hadoop training classes or the most hardcore deep dives into predictive analysis, you will find the training you need at Big Data TechCon. Our 41 speakers are Big Data experts and particularly skilled at communicating their knowledge effectively.

Classes marked "Has code" will include code shown in the session.

Tuesday, April 9
8:30 AM - 9:30 AM

Apache Cassandra -- A Deep Dive Has code image

Recently, there has been some discussion about what Big Data is. The definition of Big Data continues to evolve. Along with variety, volume and velocity (which the usual suspects handle well), other facets have been introduced, namely complexity and distribution. Complexity and distribution are facets that require a different type of solution.

While you can manually shard your data (Oracle, MySQL) or extend the master-slave paradigm to handle data distribution, a modern big data solution should solve the problem of distribution in a straightforward and elegant manner without manual intervention or external sharding. Apache Cassandra was designed to solve the problem of data distribution. It remains the best database for low latency access to large volumes of data while still allowing for multi-region replication. We will discuss how Cassandra solves the problem of data distribution and availability at scale.

This class will cover:
    •    Replication
    •    Data Partitioning
    •    Local Storage Model
    •    The Write Path
    •    The Read Path
    •    Multi-Datacenter Deployments
    •    Upcoming Features (1.2 and beyond)

To get the most benefit from this class, attend the "Getting Started with Cassandra" workshop.
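As a small taste of the material, here is a minimal sketch of how replication is declared per keyspace, issued from the DataStax Python driver; the driver, contact point, keyspace and datacenter names are illustrative assumptions, not taken from the class.

    # Minimal sketch: multi-datacenter replication in Cassandra via CQL,
    # issued from the DataStax Python driver. Contact point, datacenter
    # names and schema are assumptions for illustration.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])   # one contact point; the driver discovers the rest
    session = cluster.connect()

    # Replication is set per keyspace; NetworkTopologyStrategy places a
    # configurable number of replicas in each datacenter.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'us_east': 3, 'eu_west': 2}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.users (
            user_id text PRIMARY KEY,   -- the partition key drives data distribution
            name    text
        )
    """)

    # Writes fan out to replicas; reads trade consistency for latency via
    # tunable consistency levels.
    session.execute("INSERT INTO demo.users (user_id, name) VALUES (%s, %s)",
                    ('u1', 'Ada'))
    for row in session.execute("SELECT user_id, name FROM demo.users WHERE user_id = %s",
                               ('u1',)):
        print(row.user_id, row.name)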

Level: Intermediate
Building an Impenetrable ZooKeeper Has code image

Apache ZooKeeper is a project that provides reliable and timely coordination of processes. Given the many cluster resources leveraged by distributed ZooKeeper, it’s frequently the first to notice issues affecting cluster health, which explains its moniker: “The canary in the Hadoop coal mine.”

Come to this class and you will learn:
    •    How to configure ZooKeeper reliably
    •    How to monitor ZooKeeper closely
    •    How to resolve ZooKeeper errors efficiently

Drawing on the diverse environments we’ve supported, we will share what it takes to set up an impenetrable ZooKeeper environment, what parts of your infrastructure specifically to monitor, and which ZooKeeper errors and alerts indicate something seriously amiss with your hardware, network or HBase configuration.
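For a flavor of the monitoring side, here is a minimal sketch that polls a ZooKeeper server with its built-in four-letter commands; the host and port are assumptions, and a production setup would feed these metrics into your alerting system.

    # Minimal sketch: check ZooKeeper health with the built-in four-letter
    # commands ('ruok' for liveness, 'mntr' for metrics). Host/port assumed.
    import socket

    def four_letter(host, port, cmd, timeout=5.0):
        """Send a four-letter command and return the raw text response."""
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(cmd)
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b''.join(chunks).decode('utf-8', 'replace')

    if __name__ == '__main__':
        host, port = 'localhost', 2181
        print('liveness:', four_letter(host, port, b'ruok'))   # 'imok' when healthy
        # 'mntr' reports latency, outstanding requests, follower counts and
        # more -- the numbers worth graphing and alerting on.
        print(four_letter(host, port, b'mntr'))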

Level: Intermediate
Data Modeling and Relational Analysis in a NoSQL World Has code image

The new wave of NoSQL technology is built to provide the flexibility and scalability required by agile Web, mobile and enterprise applications. Interestingly, any system that supports chained MapReduce processing (specifically MapReduce-Map) fulfills the basic query requirements of a SQL engine. Therefore, we will work to help you bridge the gap between SQL, relational (big) data, and the brave new world of NoSQL.

In this class, you will learn how to model real-world relational data in a modern document database. We next go on to compile various SQL operations (SELECT, SUM, AVG, JOIN, etc.) into exceptionally simple MapReduce programs. We finish with a study demonstrating the performance, scalability and "time-to-value" benefits of this approach, specifically the pre-computation of materialized views. The class will be a mix of chalkboard and interactive demonstrations.
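To make the SQL-to-MapReduce idea concrete, here is a minimal, framework-free sketch in Python of how a GROUP BY with SUM and AVG compiles into a map step and a reduce step; the document fields are illustrative assumptions.

    # Minimal sketch (pure Python, no framework): how a SQL aggregate such as
    #   SELECT region, SUM(amount), AVG(amount) FROM orders GROUP BY region
    # compiles into a map step and a reduce step.
    from itertools import groupby
    from operator import itemgetter

    orders = [
        {'region': 'east', 'amount': 10.0},
        {'region': 'west', 'amount': 4.0},
        {'region': 'east', 'amount': 6.0},
    ]

    def map_phase(doc):
        # Emit (key, value) pairs, just as a document database's map view would.
        yield doc['region'], doc['amount']

    def reduce_phase(key, values):
        # Combine all values for one key; SUM and AVG fall out of the same pass.
        values = list(values)
        return {'region': key, 'sum': sum(values), 'avg': sum(values) / len(values)}

    pairs = sorted((kv for doc in orders for kv in map_phase(doc)), key=itemgetter(0))
    results = [reduce_phase(k, (v for _, v in g))
               for k, g in groupby(pairs, key=itemgetter(0))]
    print(results)   # [{'region': 'east', 'sum': 16.0, 'avg': 8.0}, {'region': 'west', ...}]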

Prerequisites:  Bring a laptop with a modern Web browser (Chrome, Safari or Firefox). Previous experience with basic scripting languages (e.g. JavaScript) is an advantage but not a requirement. All data and code samples will be provided at the beginning of the class.

Level: Advanced
Distributed Search and Real Time Analytics, Parts I & II Has code image

In this hands-on, two-part class, you will learn the importance of distributed search from the instructors' industry experience and knowledge of real-world use cases. We’ll introduce different architectures that incorporate distributed search techniques, and share pain points experienced and lessons learned. Building on that, we’ll depict the landscape of distributed search tools and their future directions.

For the hands-on part, you will learn how to install and use Apache Solr for real-time Big Data analytics, search, and reporting on popular NoSQL databases. You’ll also learn some tricks of the trade and how to handle known issues.

We will spend around 30 minutes providing some background and use-case information on distributed search. We'll then go through finer technical details with exercises every 10 minutes or so. We will go through the code on stage, but you can follow along with the EC2 instances that will be set up.
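As a rough preview of the hands-on portion, the sketch below indexes and queries Solr from Python using the pysolr client; the URL, core name and documents are assumptions, and the class itself may use different tooling.

    # Minimal sketch: index and search documents in Apache Solr with pysolr.
    # The Solr URL, core name ('collection1') and fields are assumptions.
    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/collection1', timeout=10)

    # Index a couple of documents, then commit so they become searchable.
    solr.add([
        {'id': 'doc1', 'title': 'Distributed search with Solr'},
        {'id': 'doc2', 'title': 'Real-time analytics on NoSQL data'},
    ])
    solr.commit()

    # Query; extra request parameters (rows, facets, filters) are keyword args.
    for doc in solr.search('title:search', rows=10):
        print(doc['id'], doc['title'])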

Prerequisites: You should be familiar with Java and Unix shell commands. Some prior familiarity with SQL and Big Data solutions like Hadoop/Hbase/Cassandra/Solr would be helpful but not required.

Level: Intermediate
How to Integrate Structured and Unstructured Data with Avro Has code image

This class is designed to demonstrate how to use the Apache Avro Java serialization libraries with Hadoop frameworks to speed up integration of big volumes of unstructured data sets.

The class will teach you about the Avro framework. You will learn about:
• Avro schema definitions and data types
• Avro record creation
• How to create Avro schemas programmatically
• How to sort records by setting a property in schema
• How to read records from a file
• Hadoop MapReduce integration with Avro data
• Advantages of using Avro data over flat files or map files
• Specifics of the integrations with Mapper and Reducer code
• Avro format for MapReduce result output
• Cascading MapReduce jobs
• MapReduce for converting flat files into Avro data

We’ll use two real-life examples to demonstrate the advantage of using Avro versus regular files.
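For orientation, here is a minimal sketch of the same round trip in Python rather than Java (a substitution for brevity): define a schema, write records to an Avro container file, and read them back. Field names are illustrative assumptions.

    # Minimal sketch of the Avro round trip with the Python 'avro' package.
    # (The class uses the Java libraries; newer Python 3 releases spell the
    # schema parser 'avro.schema.Parse'.)
    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    SCHEMA = avro.schema.parse("""
    {
      "type": "record",
      "name": "Click",
      "fields": [
        {"name": "user", "type": "string"},
        {"name": "url",  "type": "string"},
        {"name": "ts",   "type": "long"}
      ]
    }
    """)

    # Write: the schema is embedded in the file, which is what makes Avro
    # data self-describing and easy to hand to MapReduce jobs later.
    writer = DataFileWriter(open("clicks.avro", "wb"), DatumWriter(), SCHEMA)
    writer.append({"user": "u1", "url": "/home", "ts": 1365500000})
    writer.append({"user": "u2", "url": "/cart", "ts": 1365500042})
    writer.close()

    # Read: no schema argument needed; it is read from the file header.
    reader = DataFileReader(open("clicks.avro", "rb"), DatumReader())
    for record in reader:
        print(record)
    reader.close()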

Level: Advanced
Taming Elephants, Bees and Pigs – The Big Data Circus

This class will discuss the reasons and motivations behind the Big Data revolution and how it has evolved from previous data-processing technologies. Hadoop's technical advantages and its emphasis on scale over raw performance, for example, are driven primarily by the growth in the variety of data sources.

Based on real-world experience while at Facebook, the instructor will talk about some of the key challenges of scale and the evolution of these technologies out of necessity, starting with Hadoop, expanding with SQL on top, and adding MicroStrategy and business-intelligence layers. This class will cover specific issues and solutions at Facebook, such as latency gaps in the infrastructure solved by caching results in the MySQL tier, and the investment made to build low-latency query engines on HDFS. This will lead to a discussion of business demands and the technical responses.

Finally, we will discuss the future of Big Data and how technology continues to simplify the process and become accessible to all. We can now address noise in the data and use the cloud to hide these technologies behind a comprehensive data platform, simplifying decisions about what to use and what not to use.

Attend this class to learn about the issues that were encountered in the trenches at Facebook. They were addressed by the trailblazers and can now be handled even by smaller, leaner companies.

Level: Intermediate
Visualizing Your Graph Has code image

Attend this class to learn how to see graphs conceptually and visually. Learn to think of your domain as a graph, and visualize it. See just one node, a handful of relationships, a sub-graph, or everything all at once. You'll learn how to create a pretty picture of your friends' friendships, see clusters of nodes congregate, and get lost in a 3D graph in this class.
Level: Overview
11:00 AM - 12:00 PM

Analytics Maturity Model Has code image

Every company is at a different stage in leveraging analytics to improve their operational efficiency and product offerings. In this class, you will learn an eight-stage analytics maturity model that companies can use to determine how far they are from the most analytical companies.

Level: Intermediate
Beyond Map/Reduce

Apache Hadoop is the current darling of the Big Data world. At its core is the Map/Reduce computing model for decomposing large data-analysis jobs into smaller tasks, and distributing those tasks around a cluster. Map/Reduce itself was pioneered at Google for indexing the Web and other computations over massive data sets.

The strengths of Map/Reduce are cost-effective scalability and relative maturity. Its weaknesses are its batch orientation, which makes it unsuitable for real-time event processing, and the difficulty of implementing many data-analysis idioms in the Map/Reduce computing model.

We can address the weaknesses in several ways. First, higher-level programming languages, which provide common query and manipulation abstractions, make it easier to implement Map/Reduce programs. However, longer term, we need new distributed computing models that are more flexible for different problems and which provide better real-time performance.

We’ll review these strengths and weaknesses of Map/Reduce and the Hadoop implementation, then discuss several emerging alternatives, such as Cloudera’s Impala system for analytics, Google’s Pregel system for graph processing, and Storm for event processing. We’ll finish with some speculation about the longer-term future of Big Data.

This class is good for developers, data analysts and managers, but people with Hadoop and/or programming experience will get the most out of it.

Level: Overview
Big Data Science: Extracting Truth from Large, Multi-structured Data, Parts I & II Has code image

Come to this class for an overview of Big Data Science from planning to execution. We will start by covering the processes and best practices for successfully doing Data Science in a business setting. We will then review a use case and walk through the exploratory to confirmatory modeling stages. Each use case includes source code in R and Pig, with the goal of showing how to parallelize analysis in R over Hadoop.

The use case will highlight the advantages (and difficulties) of working with multi-structured data through text analysis, linear models and more. A theme throughout will be the importance of triangulating around truth when problems are intractable.

The following prerequisites ensure that you will gain the maximum benefit from the class:
    •    Programming experience: Big Data is still the Wild West of technologies, and programming skills are required to wrangle and analyze Big Data.
    •    Analysis experience: Although this is not a statistics course, understanding the principles of analysis as well as research methods will help in reapplying the lessons.

Level: Advanced
Distributed Search and Real Time Analytics, Parts I & II Has code image

In this hands-on, two-part class, you will learn the importance of distributed search from the instructors' industry experience and knowledge of real-world use cases. We’ll introduce different architectures that incorporate distributed search techniques, and share pain points experienced and lessons learned. Building on that, we’ll depict the landscape of distributed search tools and their future directions.

For the hands-on part, you will learn how to install and use Apache Solr for real-time Big Data analytics, search, and reporting on popular NoSQL databases. You’ll also learn some tricks of the trade and how to handle known issues.

We will spend around 30 minutes providing some background and use-case information on distributed search. We'll then go through finer technical details with exercises every 10 minutes or so. We will go through the code on stage, but you can follow along with the EC2 instances that will be set up.

Prerequisites: You should be familiar with Java and Unix shell commands. Some prior familiarity with SQL and Big Data solutions like Hadoop/Hbase/Cassandra/Solr would be helpful but not required.

Level: Intermediate
Getting Started with R and Hadoop, Parts I & II Has code image

Increasingly viewed as the lingua franca of statistics, R is a natural choice for many Data Scientists seeking to perform Big Data Analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will teach you options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, and will test our installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis and how to write mappers, reducers and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from the HDFS. We will explore the interaction of the Hadoop infrastructure with your code by tracing the input and output data for each step. Examples will include the canonical "word-count" example, as well as the analysis of structured data from the airline industry.

No specific prerequisite knowledge is required, but a familiarity with R and Hadoop or Map/Reduce is helpful.

Level: Advanced
Hands-on MongoDB Has code image

Come to this class to learn how to use MongoDB, one of the most popular NoSQL databases. Together we will install and use MongoDB so that you can start understanding which business scenarios are a fit for this type of data storage. After setup, you’ll learn basic administration, such as data inserts, indexing and more. Then you’ll get experience with the various ways to query a MongoDB database, using GUI tools such as MongoVUE, via the native query language, MapReduce, and the recently released aggregation framework. After this class, you will know how to use MongoDB in your projects.
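As a preview of the kinds of operations covered, here is a minimal sketch using the pymongo client; the connection string, database, collection and fields are assumptions for illustration.

    # Minimal sketch: basic MongoDB operations from Python with pymongo.
    # Connection string, database and collection names are assumptions.
    from pymongo import MongoClient, ASCENDING

    client = MongoClient('mongodb://localhost:27017')
    orders = client['shop']['orders']

    # Inserts and an index (the basic administration covered in the class).
    orders.insert_one({'customer': 'ada', 'total': 42.0, 'status': 'open'})
    orders.create_index([('customer', ASCENDING)])

    # Native query language: find documents matching a predicate.
    for doc in orders.find({'status': 'open'}, {'_id': 0}):
        print(doc)

    # Aggregation framework: group totals per customer (roughly GROUP BY + SUM).
    pipeline = [
        {'$match': {'status': 'open'}},
        {'$group': {'_id': '$customer', 'total': {'$sum': '$total'}}},
    ]
    for row in orders.aggregate(pipeline):
        print(row)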

Prerequisite knowledge: Some RDBMS admin or development helpful, but not required.

Setup software: The class will be taught on Macs, but you can follow along on Windows if you wish.

Level: Intermediate
12:15 PM - 12:45 PM

An Overview of the HP Vertica Analytics Platform – Purpose Built for Big Data Analytics at Massive Scale and Extreme Performance

The HP Vertica Analytics Platform has disrupted the analytics industry with a purpose-built platform designed to solve the toughest Big Data challenges. Faster than any analytics platform you’ve ever experienced, the HP Vertica Analytics Platform offers simple, easy-to-use Big Data analytics. The HP Vertica Analytics Platform was built from the ground up with speed, scalability, simplicity and openness at its core, delivered at the lowest total cost of ownership. Today, hundreds of organizations rely on the HP Vertica Analytics Platform for data analytics, with results that include 50–1,000 times faster performance at 30% of the cost and 10–30 times more data stored per server than traditional databases.

This class is sponsored by HP Vertica.

Level: Overview
Large-scale Entity Extraction, Disambiguation and Linkage

Large-scale entity extraction, disambiguation and linkage in Big Data can challenge the traditional methodologies developed over the last three decades. Entity linkage in particular is the cornerstone for a wide spectrum of applications, such as Master Data Management, Data Warehousing, Social Graph Analytics, Fraud Detection, and Identity Management. Traditional rules-based heuristic methods usually don't scale properly, are language-specific and require significant maintenance over time.

We will introduce the audience to the use of probabilistic record linkage (also known as specificity-based linkage) on Big Data, to perform language-independent large-scale entity extraction, resolution and linkage across diverse sources.

The benefit of specificity-based linkage is that it does not use hand-coded user rules. Instead, it determines the relevance/weight of a particular field in the scope of the linking process using a mathematical model based on the input data, which is key to the overall efficiency of the method.
 
We will also present a live demonstration reviewing the different steps required during the data-integration process (ingestion, profiling, parsing, cleansing, standardization and normalization), and show the basic concepts behind probabilistic record linkage on a real-world application.
 
Record linking fits into a general class of data processing known as data integration, which can be defined as the problem of combining information from multiple heterogeneous data sources. Data integration can include data preparation steps such as parsing, profiling, cleansing, normalization and standardization of the raw input data prior to record linkage to improve the quality of the input data and to make the data more consistent and comparable. (These data preparation steps are sometimes referred to as ETL, or Extract, Transform, Load.)

This class is sponsored by LexisNexis.

Level: Overview
1:45 PM - 2:45 PM

Big Data Science: Extracting Truth from Large, Multi-structured Data, Parts I & II Has code image

Come to this class for an overview of Big Data Science from planning to execution. We will start by covering the processes and best practices for successfully doing Data Science in a business setting. We will then review a use case and walk through the exploratory to confirmatory modeling stages. Each use case includes source code in R and Pig, with the goal of showing how to parallelize analysis in R over Hadoop.

The use case will highlight the advantages (and difficulties) of working with multi-structured data through text analysis, linear models and more. A theme throughout will be the importance of triangulating around truth when problems are intractable.

The following prerequisites ensure that you will gain the maximum benefit from the class:
    •    Programming experience: Big Data is still the Wild West of technologies, and programming skills are required to wrangle and analyze Big Data.
    •    Analysis experience: Although this is not a statistics course, understanding the principles of analysis as well as research methods will help in reapplying the lessons.

Level: Advanced
Getting Started with Predictive Modeling: Simple Models and Basic Evaluation

This class presents the basic steps of building a predictive model, and encourages you to get your feet wet on a small problem using Excel and Weka to build a model.

You will learn about selecting training examples, constructing an appropriate feature vector, performing basic feature selection, building models using three to four different machine-learning techniques (logistic regression, decision trees and rule induction), and evaluating the resulting models using cross-validation and a separate test set.
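The evaluation workflow itself looks roughly like the sketch below, written with scikit-learn rather than Excel or Weka purely for illustration; the dataset and model settings are assumptions.

    # Minimal sketch of the evaluation workflow: hold out a test set, fit two
    # model families, compare them with cross-validation, then confirm on the
    # untouched test set. (scikit-learn stands in for the Excel/Weka tooling.)
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    models = {
        'logistic regression': LogisticRegression(max_iter=5000),
        'decision tree': DecisionTreeClassifier(max_depth=4),
    }
    for name, model in models.items():
        # Cross-validation estimates generalization using only the training data...
        cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
        # ...and the held-out test set gives a final, independent check.
        test_acc = model.fit(X_train, y_train).score(X_test, y_test)
        print(f'{name}: cv={cv_acc:.3f} test={test_acc:.3f}')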

Level: Intermediate
Getting Started with R and Hadoop, Parts I & II Has code image

Increasingly viewed as the lingua franca of statistics, R is a natural choice for many Data Scientists seeking to perform Big Data Analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will teach you options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, and will test our installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis and how to write mappers, reducers and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from the HDFS. We will explore the interaction of the Hadoop infrastructure with your code by tracing the input and output data for each step. Examples will include the canonical "word-count" example, as well as the analysis of structured data from the airline industry.

No specific prerequisite knowledge is required, but a familiarity with R and Hadoop or Map/Reduce is helpful.

Level: Advanced
Hadoop Programming with Scalding Has code image

Scalding is a Scala API for writing advanced data workflows for Hadoop. Unlike low-level APIs, it provides intuitive pipes and filters idioms, while hiding the complexities of MapReduce programming. Scalding wraps the Java-based Cascading framework in Functional Programming concepts that are ideal for data problems, especially the mathematical algorithms for machine learning.

This class will use examples to demonstrate these points. Developers will see that Scalding is an ideal tool when they need a more full-featured and flexible tool set for Big Data applications, beyond what Hive or Pig can provide.

Level: Intermediate
Mastering Sqoop for Data Transfer for Big Data Has code image

Apache Sqoop is a tool for efficiently transferring bulk data between Hadoop-related systems (such as HDFS, Hive and HBase) and structured data stores (such as relational databases, data warehouses and NoSQL systems).

Sqoop works very well in production environments, but many users still face bulk-transfer challenges; come to this class to learn how to handle them.

Also – come to this class to learn about Sqoop 2, the next generation of the tool which is currently in development! Sqoop 2 will address ease of use, ease of extension and security. We’ll talk about Sqoop 2 from both the development and operations perspectives.

Level: Intermediate
Real-Time Hadoop Has code image

We will start with a short introduction of different approaches to using Hadoop in the real-time environment, including real-time queries, streaming and real-time data processing and delivery. Then we'll describe the most common use cases for real-time queries and products implementing these capabilities. We will also describe the role of streaming, common use cases, and products in the space.

The majority of time will be dedicated to the usage of HBase as a foundation for real-time data processing. We will describe several architectures for such an implementation, along with a high-level design and implementation for two examples: a system for storing and retrieving images, and HBase as a back end for Lucene.

Level: Intermediate
Untangling the Relationship Hairball with a Graph Database Has code image

Not only has data gotten bigger, it's gotten more connected. Make sense of it all and discover what these Big Data connections can tell you about your users and your business. Come to this class to learn some of the different use cases for graph databases, and how to spot the non-obvious opportunities in your data.

Level: Overview
3:15 PM - 4:15 PM

Hadoop by Example Has code image

This class is designed to demonstrate the most commonly used MapReduce design patterns for various problems. Performance and scalability will be taken into consideration.

The class will present a general overview of the problems that can be solved using MapReduce, scalability and performance tuning for clusters of different sizes. The techniques described here can be used on all Hadoop distributions.

The following technical problems will be covered:
•    “Hello world!” of the MapReduce universe—a word count example
•    Map-only MapReduce jobs and their usage for ETL-type jobs
•    Global sorting techniques
•    Sequence files and their usage in MapReduce jobs
•    Map files and their usage in MapReduce jobs
•    Reduce-side join and its advantages and limitations
•    Map-side join and its advantages and limitations

Each technique will be provided with a code example that can be used as a template. No prior knowledge about the topic is required; however, some Java knowledge is recommended.
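For reference, the word-count "Hello world!" can be expressed as a single Hadoop Streaming script; the Python sketch below is a stand-in for the Java templates shown in class, and the jar path and HDFS paths are placeholders.

    # Minimal sketch: word count as Hadoop Streaming mapper/reducer roles in
    # one Python file. Run with something like:
    #   hadoop jar hadoop-streaming.jar \
    #       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' \
    #       -input /in -output /out
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print(f'{word}\t1')

    def reducer():
        # Hadoop sorts mapper output by key, so all counts for a word arrive together.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip('\n').split('\t')
            if word != current:
                if current is not None:
                    print(f'{current}\t{count}')
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f'{current}\t{count}')

    if __name__ == '__main__':
        # Invoked as 'wordcount.py map' or 'wordcount.py reduce'.
        mapper() if sys.argv[1:] == ['map'] else reducer()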

Level: Intermediate
Hands-on Hadoop for Developers—Understanding Map/Reduce Has code image

In this class, you’ll get hands-on experience working with a leading implementation of Hadoop: Cloudera’s CDH4 release. We’ll work with the company’s virtual machine and concentrate on learning how to get business value from data in a Hadoop cluster by writing increasingly complex queries. These queries are known as Map/Reduce jobs, and we will write them in Java. Even if you aren’t a Java pro, you can still get value from this class, as we will also work with higher-level languages, such as Hive’s HQL and Pig. You’ll finish this class understanding when and how to get value from a Hadoop cluster.

Prerequisite knowledge: database development using T-SQL; Java or C# helpful, but not required

Setup software: Cloudera Hadoop VM v CDH4 (free download - https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH3u4)

Level: Intermediate
How to See and Understand Big Data

Visual analysis is an iterative process that exploits the power of the human visual system to help people work with all kinds of data. When data is big, people must overcome the challenges of wide data, tall data, and data from multiple sources, often coming in fast and furiously. Attend this class to learn how people working with data can address these challenges. The key technique is to use multiple coordinated views of data during visual analysis and storytelling with data.
You'll learn:
  • What research and practice have taught us about designing great visualizations and dashboards.
  • Fundamental principles for designing effective coordinated views for yourself and others.
  • How to systematically analyze data from multiple databases using your visual system.

The instructor works for Tableau Software, a provider of data visualization solutions.

Level: Overview
In-Database Predictive Analytics Has code image

Predictive analytics have long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle to deal with exploding volumes of data not easily analyzed by small data tools, they are looking at ways of doing predictive analytics directly inside the primary data store.

This approach, called in-database predictive analytics, eliminates the need to sample data and perform a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of doing in-database predictive analytics, get highlights of its limitations, and survey the tools and technologies necessary to head down the path.

Level: Advanced
NoSQL for SQL professionals

With all of the buzz around Big Data and NoSQL (non-relational) database technology, what actually matters for today's SQL professional? Learn more in this talk about Big Data and NoSQL in the context of the SQL world, and get to what's truly important for data professionals today.

In this presentation you'll learn:
    •    The main characteristics of NoSQL databases
    •    Differences between distributed NoSQL and relational databases
    •    Use cases for NoSQL technologies, with real-world examples from organizations in production today

Level: Intermediate
Seven Deadly Hadoop Misconfigurations Has code image

Misconfigurations and bugs break more Hadoop clusters than anything else, and fixing misconfigurations is up to you!

Attend this session to learn how to get your Hadoop configuration right the first time. In some support contexts, a handful of common problems account for a large fraction of cases. That is not the case for Hadoop, where even the most common specific issues account for no more than 2% of support cases. Hadoop errors also tend to show up far from the misconfiguration that caused them, making it hard to know which log files to analyze. It pays to be proactive. Come to this class!

Level: Intermediate
The Dark Sides of Predictive Modeling: Tricks and Pitfalls Has code image

This class will teach advanced strategies to improve model performance, including multi-state modeling and advanced feature engineering. In addition, we will look at some of the most common pitfalls: over-fitting and leakage.

With the increasing set of tools available for data storage, management, consolidation and preprocessing, the art of exploratory data analysis is getting lost. Too often, we try to make sense of data that has been “cleaned” beyond recognition. It makes a major difference whether missing values have been removed or replaced by something else.
In the end, data analysts work with data that are not what they think they are; they are unaware of sampling biases or information “from the future” that really should not be allowed in the model. As a result, the models initially look very good, but fail when used in reality. We will also cover some strategies for managing really large amounts of data, including strategic sampling and grouping examples.

Prerequisites: You should have some experience in predictive modeling.

Level: Advanced
Wednesday, April 10
8:45 AM - 9:45 AM

Building Applications Using HBase Has code image

Apache HBase is one of the popular new-age scalable NoSQL databases, and has seen a significant increase in production use cases over the last couple of years. HBase is an open-source implementation of the Google Bigtable design.

This class will teach you the basics of HBase, including its architecture, data model, and how HBase is different from traditional database systems in terms of design assumptions and also with respect to building applications.


Level: Intermediate
Building Successful Data Science Teams

Come to this class and join a lively conversation on Big Analytics. We’ll discuss current trends in the field, including the new role of Data Science beyond analytics, and how companies are embracing new technologies and approaches.

We will frame the discussion with a client use case, covering objectives, approach and challenges encountered. Through this lens, we’ll explore how organizations and individuals can transition their skills into the Big Data space from more traditional roles, along with hiring strategies and team structures. Finally, we will discuss the patterns that define a mature Data Science team and how firms can assess and evaluate these teams.

While our conversation will be technology-agnostic, we will pay particular attention to data science within the Hadoop ecosystem.

What you'll learn:
    •    Best practices in Data Science projects
    •    Common technologies and tools
    •    Training and hiring strategies
    •    Best team structures
    •    The current state of the field
    •    KPIs that define a mature Data Science practice

Level: Overview
Extending Your Data Infrastructure with Hadoop Has code image

Hadoop provides significant value when integrated with an existing data infrastructure, but even among Hadoop experts there's still confusion about options for data integration and business intelligence with Hadoop. This class will help clear up the confusion.

You will learn:
    •    How can I use Hadoop to complement and extend my data infrastructure?
    •    How can Hadoop complement my data warehouse?
    •    What are the capabilities and limitations of available tools?
    •    How do I get data into and out of Hadoop?
    •    How can I use my existing data integration and business intelligence tools with Hadoop?
    •    How can I use Hadoop to make my ETL processing more scalable and agile?

We'll illustrate this with an end-to-end example data flow using open-source and commercial tools, showing how data can be imported into and exported out of Hadoop, how ETL processing can be done in Hadoop, and how data in Hadoop can be reported on and visualized. You will also learn about recent advancements that make Hadoop an even more powerful platform for data processing and analysis.

Level: Intermediate
Matrix Methods with Hadoop Has code image

Get a brief introduction to thinking about data problems as matrices, and then learn how to implement many of these algorithms in the Hadoop streaming framework. The data-as-matrix paradigm has had a rich history, and the point of this talk is to give folks some idea of which statistical algorithms are likely to be reasonably efficient in MapReduce, and which are probably not going to be so reasonable.

This will involve a few ideas:
• How to store matrix data, and the performance tradeoffs. The idea is to take natural ways of storing data and describe them as ways of storing a matrix, which gives some insight into why a method can be fast.
• How to implement some basic matrix operations that form the basis of many numerical and statistical algorithms.
• Problems we've run into working with matrix data, and how we've solved some of them.
• Ideas for future platforms that are ideal for this case
• Concerns about numerical accuracy for Big Data.

Most of the code samples will use Python interfaces to Hadoop streaming, such as Dumbo, mrjob or Hadoopy.
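To illustrate the row/column key-value layout, here is a minimal, framework-free sketch (an assumed example, not the instructor's code) of computing y = A * x when the matrix is stored as (row, column, value) records and the vector is small enough to broadcast.

    # Minimal sketch: a sparse matrix stored as (row, col, value) records, and
    # a map step plus a reduce step that compute y = A * x.
    from collections import defaultdict

    A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]   # matrix as key-value data
    x = {0: 4.0, 1: 5.0}                           # broadcast to every mapper

    def map_phase(record):
        i, j, a_ij = record
        # Each mapper needs only its record plus the broadcast vector entry.
        yield i, a_ij * x[j]

    def reduce_phase(partials):
        y = defaultdict(float)
        for i, p in partials:
            y[i] += p            # sum the partial products per output row
        return dict(y)

    y = reduce_phase(p for rec in A for p in map_phase(rec))
    print(y)   # {0: 13.0, 1: 15.0}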
Prerequisites:
• You ought to have a rough handle on what a matrix represents.
• Those with a linear algebra background will probably get more out of this talk, but I'll explain any linear algebra property with a statistical or data-oriented analog.
• You should know how to use MapReduce or Hadoop; most of the code examples will use Hadoop streaming via Dumbo.

Example slides: www.slideshare.net/dgleich/mapreduce-for-scientific-simulation-analysis

Example code: github.com/dgleich/mrmatrix

Level: Advanced
Microsoft’s Big Data Story

Big Data is hot, but with so much press covering open-source projects, what does Microsoft have to offer in this area? More than you might expect! From extensions to core products, such as the addition of columnstore indexes in SQL Server 2012, to entirely new products, such as Data Explorer, you will learn about the “Microsoftification” of current Big Data offerings, such as “Hadoop on Azure.” You’ll leave this class understanding how you can best leverage your existing investment in Microsoft technologies to take advantage of Big Data opportunities.

Level: Overview
Oozie: A Workflow Scheduler for Hadoop Has code image

This class will start with a discussion of Oozie’s role in the Apache Hadoop ecosystem and its relationship to other Hadoop platform components. We will then describe the most straightforward use cases and examples where Oozie can be helpful or even necessary. From there, we will move on to the Oozie workflow specification and its components, illustrated with examples of various Oozie actions (java, map-reduce, hive, pig, sqoop).

Then we will describe Oozie architecture. We will present the main Oozie components, job life cycle, retry and recovery. We will demonstrate the Oozie management console in integration/complementary usage with other Hadoop GUI tools (MapReduce Administration, Task Tracker, NameNode viewer, Fair Scheduler Administration, and Log File View).

We will then discuss Oozie job parameterization, expression language, job configuration, runtime artifacts placement, and job submission using the Oozie command-line utility (CLI). We will demonstrate how to check statuses or stop Oozie jobs from CLI. We will also briefly describe Oozie APIs (java and REST), and provide fragments of code. After that, we will present Oozie coordinator and discuss how Oozie allows expressing dependencies between jobs and groups of jobs using “Synchronous Datasets.” We will touch on Oozie SLA support.

Then we will describe Oozie bundles and demonstrate the whole hierarchy of Oozie artifacts: from bundles to coordinator to workflows to actions to processes running on Hadoop clusters. In conclusion, we will discuss Oozie limitations and ways to overcome some of them (via extension and customization). We will also talk about new features in future versions.

Level: Intermediate
11:30 AM - 12:30 PM

Building Your Own Facebook Graph Search with Cypher and Neo4j Has code image

Learn how to create your own Facebook Graph Search or build a similar system with your own company data. You will also learn how to interpret natural language with a grammar and use it to build Cypher queries that retrieve data from a graph. Knowledge of Natural Language Processing is not required. The instructor works for Neo Technology, creators of the Neo4j Graph Database.
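As a flavor of what a translated query might look like, here is a minimal sketch that runs a "friends-of-friends who like sushi" Cypher pattern from Python with the official neo4j driver; the driver, labels, properties and credentials are assumptions, not the instructor's code.

    # Minimal sketch: a graph-search style Cypher query issued from Python.
    # Connection details and the data model are illustrative assumptions.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))

    # "Friends of my friends who like sushi," expressed as a graph pattern.
    QUERY = """
    MATCH (me:Person {name: $name})-[:FRIEND]-(:Person)-[:FRIEND]-(fof:Person)
    MATCH (fof)-[:LIKES]->(:Food {name: 'sushi'})
    WHERE fof <> me
    RETURN DISTINCT fof.name AS name
    """

    with driver.session() as session:
        for record in session.run(QUERY, name='Ada'):
            print(record['name'])
    driver.close()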

Level: Intermediate
Hadoop Backup and Disaster Recovery 101 Has code image

Any production-level implementation of Hadoop must have its data protected from threats. Threats to data integrity can be human-generated (malicious/unintentional) or site-level (power outage, flood, etc.). As soon as you start to identify these threats, it’s important  to develop a backup or disaster-recovery solution for Hadoop!

In this class, you will learn the unique considerations for Hadoop backup and disaster recovery, as well as how to navigate the common issues that arise when architects and developers look to protect the data.

We’ll cover:
•    How to model your backup/disaster-recovery solution, considering your threat model and specifics around data integrity, business continuity, and load balancing.
•    Best practices and recommendations, highlighting Hadoop in contrast to traditional SAN/DB systems; replication versus "teeing" models for ensuring DR; replication scheduling; Hive; HBase; managing bandwidth; monitoring replication; using one's secondary beyond replication; and a survey of existing tools and products that can be used for backup and DR

After taking this class, you should be able to explain to your organization the right way to implement a backup or disaster-recovery solution for Hadoop.

Level: Intermediate
HBase Schema and Table Design Principles Has code image

More and more companies are adopting the open-source Apache HBase as the back-end database for Big Data applications that have very large tables – billions of rows, millions of columns. Come to this class to learn how to design tables to leverage the design principles and features of HBase.

HBase tables are very different from relational database tables, and there are several concepts at play that you need to keep in mind while designing them. This class will introduce you to the basics of HBase schema and table design and how to think about tables when working with this popular database.

Level: Intermediate
High-Speed Data Ingestion with Sharded NewSQL Databases Has code image

Websites and back-end data applications have grown to the point where incoming data arrives too fast for traditional databases to handle without significant latency for clients. In this session, we will look at how to take advantage of NewSQL sharding to consume data as fast as possible, and then extract it to a traditional database in a transactionally secure manner.

We'll combine traditional slides with real code, demonstrating how to quickly import data into a high-speed NewSQL database and extract it to a slower traditional SQL database. The instructor works for VoltDB, which uses the term NewSQL to describe its VoltDB database.

Level: Advanced
How to Fit a Petabyte in Apache HBase Has code image

There are many ways to load a petabyte of data in HBase, and this class will show you the best approaches! We will first review the solutions that are commonly adopted instinctively, and show why they fail. This will help you understand the practicalities of HBase's architecture.

The first technique that will be taught is better schema designs—that is, how to create keys that won't inflate the size of your data set. The second technique that will be presented is better management of the loading of the data through pre-splitting of the regions and tuning the cluster for that type of workload.
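Two of those ideas, compact keys and pre-splitting, come down to how you build your row keys. The sketch below shows one common salting approach in plain Python; the bucket count and key layout are assumptions for illustration.

    # Minimal sketch: salted, fixed-width HBase row keys plus matching split
    # points, so sequential writes spread across pre-split regions instead of
    # hot-spotting a single region.
    import hashlib

    NUM_BUCKETS = 16   # should match the number of pre-split regions

    def salted_key(user_id, timestamp):
        # A short, deterministic prefix spreads sequential writes across regions.
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
        # Fixed-width fields keep keys short and lexicographically sortable.
        return f'{bucket:02d}|{user_id}|{timestamp:013d}'

    def split_points():
        # Pre-split the table on the bucket prefixes so load is balanced from day one.
        return [f'{b:02d}' for b in range(1, NUM_BUCKETS)]

    print(salted_key('user42', 1365500000000))
    print(split_points()[:4])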

Finally, we’ll show you how to use bulk loading, the best way to populate the database. The code that will be used for the demonstrations will be available on GitHub.

Level: Advanced
What You Can Do with a Tall-and-Skinny QR Factorization on Hadoop Has code image

A common Big Data style dataset is one with a large number (millions to billions) of samples with up to a few thousand features. One way to view these datasets from an algorithmic perspective is to treat them as a tall-and-skinny matrix.

In this class you will learn one of the most widely used matrix decomposition techniques: the QR factorization.
First, you’ll see how to use the QR factorization to solve various statistical analysis problems on datasets with many samples. For instance, we'll see how to compute a linear regression for such a problem, how to compute principal components, and how easily we can implement a new scalable Kernel K-Means clustering method.

Next, you will learn how to compute these QR factorizations in Hadoop. We use an algorithm that maps beautifully onto the MapReduce distributed computational engine. It boils down to computing independent QR factorizations in each mapper and reducer stage. One challenge we faced was in improving the numerical stability of the routine in order to get a good estimate of the orthogonal matrix factor, which is most important for applications that need a tall-and-skinny matrix singular value decomposition.
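The core of the algorithm fits in a few lines of numpy, sketched below outside Hadoop for clarity; the matrix sizes are illustrative, and the class runs the same idea as streaming jobs.

    # Minimal sketch of the map/reduce idea behind tall-and-skinny QR: each
    # "mapper" factors its block of rows, a "reducer" stacks the small R
    # factors and factors them once more.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10000, 8))      # tall and skinny: many rows, few columns

    # Map: factor each block independently; only the small 8x8 R is emitted.
    blocks = np.array_split(A, 4)
    local_Rs = [np.linalg.qr(block)[1] for block in blocks]

    # Reduce: stack the small R factors and factor once more.
    R = np.linalg.qr(np.vstack(local_Rs))[1]

    # R'R equals A'A (up to sign conventions), which is what regression,
    # PCA and related statistics need from the factorization.
    print(np.allclose(R.T @ R, A.T @ A))     # True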

Prerequisites:
• You ought to have a rough handle on what a matrix represents.
• Those with a linear algebra background will probably get more out of this talk, but I'll motivate any linear algebra property with a statistical or data-oriented analog.
• You should know about how to use MapReduce or Hadoop; most of the code examples will use Hadoop streaming via Dumbo in Python.
• If you attend my other talk "Matrix Methods with Hadoop," you'll get more out of this class, but it isn't required.
• If you are familiar with linear regression, principal components or Kernel matrices, you'll get more about of this class.

Example slides: www.slideshare.net/dgleich/tallandskinny-qr-factorizations-in-mapreduce-architectures

Example code: github.com/dgleich/mrtsqr

Level: Advanced
1:45 PM - 2:45 PM

Implementing a Real-Time Data Platform Using HBase Has code image

Apache HBase is a distributed, column-oriented data store for processing large amounts of data in a scalable and cost-effective way. Most developers, designers and architects have deep backgrounds in working with relational databases and SQL; the transition to NoSQL is challenging and oftentimes confusing.

Come to this class to learn what and what not to consider when using HBase while implementing a real-time data-management system for your organization. The focus will be on sharing knowledge and real-world experiences to help the audience understand and address the full spectrum of technical and business challenges. We will offer recommendations and lessons learned in terms of scalability, performance, reliability and transitioning to the new platform.

You will learn the major considerations for infrastructure, schema design, implementation, deployment and tuning of HBase solutions. At the end of the class, you will be in a better position to make the right choices on applicability, design, tuning and infrastructure selection for a real-time HBase-enabled data platform.

Level: Intermediate
Implementing a Simple Mongo Application

One simple app, three different technologies, deployed locally and to two different clouds. The application is executed, the source is examined, the approach is compared, the tools are demonstrated, and your questions are answered.

Level: Overview
Introduction to Apache Pig, Parts I & II Has code image

This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover what Pig is and why you would use it, the basic concepts of data structures in Pig, and the basic language constructs in Pig. We'll also create basic Pig scripts.

Prerequisites: This class will be taught in a Linux environment, using the Hive command-line interface (CLI). Please come prepared with the following:
•  Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
•  Basic experience connecting to an Amazon EC2/EMR cluster via SSH
•  Windows users should have a knowledge of CYGWIN and Putty
•  A basic knowledge of Vi would be helpful but not necessary

Also, bring your laptop with the following software installed in advance:
•  Putty (Windows only): You will log into a remote cluster for this class. Mac OS X and Linux environments include ssh (secure shell) support. Windows users will need to install Putty; download putty.zip from the PuTTY download page.
•  A Text Editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) or NotePad++ (but not Notepad) are suitable.

Level: Overview
Map/Reduce Tips and Tricks Has code image

This class will start with a short Map/Reduce architectural refresher, showing how it executes in Hadoop and what the main Map/Reduce components and classes are, which can be used for customizing an execution. We will then describe the most common possible Map/Reduce customizations and the reasons for their implementation.

The majority of time will be dedicated to going through the code examples, showing how to design and implement custom input/output formats, readers and writers, and partitioners. We will also show what to expect out of any customization.

Time permitting, we will also talk about high-level Map/Reduce frameworks, namely Apache Crunch.

Level: Intermediate
Selecting the Right Big Data Tool for the Right Job, and Making It Work Has code image

This class will focus on a wide range of Big Data solutions, from open-source to commercial, and the specific selection criteria and profiles of each. As in all technology areas, each solution has its own sweet spots and challenges, whether in CAP-theorem tradeoffs, ACID compliance, performance or scalability. This class will provide an overview of the technical tradeoffs for these solutions in technical terminology. Once the technical tradeoffs are reviewed, we will review the cost and value of open-source solutions versus commercial software, and the tradeoffs involved in choosing one over the other.

The next phase will go into great detail on use cases for specific solutions based on real-world experience. All of the specific use cases have been seen firsthand from our customers. The solutions will be reviewed to the level of specific technical architecture and deployment details. This is intended for a highly technical audience and will not provide any high-level material on the solutions discussed. It is assumed that you will have working knowledge of solutions such as SQL, NoSQL, distributed file systems and time series indexes. You should also have an understanding of CAP theorem, ACID compliance and general performance characteristics of systems.

Level: Advanced
Using Apache HBase's API for Complex Real-Time Queries Has code image

Apache HBase is generally viewed as an augmented key-value store, but what does that mean, really? In this class, you will explore the basic functionalities (get, put, delete and scan), and then fast-forward to the good parts. You will see how to use filters to run more efficient scans, such as lightweight counts over millions of rows.

The class will also cover the different compare-and-swap (CAS) operations that HBase enables by being strongly consistent, like incrementing counters or appending data without having to bring the data back to the client. We’ll also cover more advanced features like using coprocessors. The code that will be used for the demonstrations will be available on GitHub.
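For orientation, here is a minimal sketch of the basic operations using the happybase Thrift client from Python rather than the Java API (a substitution for brevity); the host, table and column names are assumptions.

    # Minimal sketch: basic HBase operations via the happybase Thrift client.
    # Host, table, column family and row-key layout are assumptions.
    import happybase

    connection = happybase.Connection('localhost')
    table = connection.table('events')

    # Put and Get: values are bytes, addressed by row key and family:qualifier.
    table.put(b'user1|20130409', {b'd:page': b'/home', b'd:ms': b'42'})
    print(table.row(b'user1|20130409'))

    # Scan a row-key range instead of fetching the whole table.
    for key, data in table.scan(row_prefix=b'user1|', limit=10):
        print(key, data)

    # Atomic counter increment: one of the server-side modify operations HBase
    # can offer because it is strongly consistent per row.
    table.counter_inc(b'user1|20130409', b'd:visits')
    connection.close()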

Level: Advanced
3:30 PM - 4:30 PM

A Survey of Probabilistic Data Structures Has code image

Big Data requires Big Resources, which cost Big Money. But if you only need answers that are good enough, rather than precisely right, probabilistic data structures can be a way to get those answers with a fraction of the resources and cost.

This class will teach you about different data structures, give some theory behind them and point out some use cases. You will learn how those structures can be used for common tasks, including counting the unique items in a collection, determining if an item is in a collection and counting the occurrences of items in a collection. If you want Big Data without spending Big Money, this class is for you!
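As a concrete example of the membership question, here is a minimal Bloom filter sketch in Python; the sizes and hash choices are illustrative, not tuned.

    # Minimal sketch: a tiny Bloom filter, the classic probabilistic answer to
    # "is this item in the collection?" using a fraction of the memory of a set.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            # Derive several bit positions per item from salted hashes.
            for i in range(self.num_hashes):
                digest = hashlib.md5(f'{i}:{item}'.encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # May return a false positive, never a false negative.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add('user42')
    print('user42' in bf, 'user99' in bf)   # True, (almost certainly) False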

Level: Intermediate
Building High-Volume Web Applications Has code image

Keeping a Web application fast under highly variable user load presents website developers with a frustrating problem: What happens when load spikes and the site can't scale out? This often requires using a variety of techniques to ensure high performance. We'll cover how to manage fast, slow and very slow-moving data, as well as methods for ensuring fast client responses.

Using Spring, Ehcache, jQuery, JSON and VoltDB, you will learn how to write high-throughput, low-latency Web applications that can keep up with heavy, unexpected load from the front to the back end. The instructor works for VoltDB.

Level: Advanced
Introduction to Apache Pig, Parts I & II Has code image

This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover what Pig is and why you would use it, the basic concepts of data structures in Pig, and the basic language constructs in Pig. We'll also create basic Pig scripts.

Prerequisites: This class will be taught in a Linux environment, using the Hive command-line interface (CLI). Please come prepared with the following:
•  Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
•  Basic experience connecting to an Amazon EC2/EMR cluster via SSH
•  Windows users should have a knowledge of CYGWIN and Putty
•  A basic knowledge of Vi would be helpful but not necessary

Also, bring your laptop with the following software installed in advance:
•  Putty (Windows only): You will log into a remote cluster for this class. Mac OS X and Linux environments include ssh (secure shell) support. Windows users will need to install Putty; download putty.zip from the PuTTY download page.
•  A Text Editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) or NotePad++ (but not Notepad) are suitable.

Level: Overview
Running Mission-Critical Applications on Hadoop

This class will look at what is involved when you move Hadoop from a lab environment to actual deployment in production. We will cover the critical enterprise-grade features like data integration, data protection, business continuity, and high availability, and discuss the ways you can accomplish this in your environment. We will also identify potential stumbling blocks, identify what a platform can or can’t provide, and help determine the scope and level of customization necessary to make your deployment successful.

At the end of the class, you will better understand how to move Hadoop from the test bed to production deployment, what is involved in the process, and how to run a mission-critical Hadoop environment. Where appropriate, there will be real-world examples.

Level: Intermediate
Visualizing Big Data in the Browser Has code image

Whether for exploratory investigation or final presentation, visualization and data go hand in hand. We posit that visualization is in fact the vehicle by which data scientists, analysts and developers identify and disseminate the true value in data. Is visualization technology keeping pace with recent explosive innovations in data management, storage and processing? In a word: yes.

In this class, you will learn how to create custom, agile, flyweight in-browser applications for data analysis and visualization. We will leverage modern browser tools such as D3, Cubism and Crossfilter to do client-side data processing and visualization in concert with server-side distributed MapReduce processing in a scalable document database.  You will create a powerful, two-tier application that is served directly to the browser from a distributed database, allowing you to deploy on cloud hosting providers with no server install required.

Requirements: The class will be a mix of lecture and demonstration. Come prepared with a laptop, Wi-Fi and a modern Web browser (Safari, Chrome or Firefox) and you too will make data beautiful.

Level: Advanced

A BZ Media Production