Big Data TechCon

SPONSORS
DIAMOND
HP Vertica
PLATINUM
Dataguise
Informatica
MindStream Analytics
Pentaho
GOLD
Actian Corporation
Actuate
BlueMetal Architects
Hitachi Data Systems
Intel
OdinText
SAP
Sqrrl
USP Lab
SILVER
Darwin Ecosystem
FujiFilm Recording Media
Sripathi Solutions
OTHER EXHIBITING COMPANIES
Aerospike
Arista Networks, Inc.
Basho Technologies
Brandeis University
FairCom
NuoDB
Super Micro Computer, Inc.
Talend
Texifter
Tokutek
COFFEE BREAK SPONSOR
Think Big Analytics


Classes

Tuesday, April 1
1:00 PM - 2:15 PM

Apache Cassandra: A Deep Dive

Recently, there has been much discussion about what Big Data is, and the definition continues to evolve. Along with variety, volume and velocity (which the usual suspects handle well), other facets have been introduced, namely complexity and distribution, and these two require a different type of solution.

While you can manually shard your data (Oracle, MySQL) or extend the master-slave paradigm to handle data distribution, a modern big data solution should solve the problem of distribution in a straightforward and elegant manner without manual intervention or external sharding. Apache Cassandra was designed to solve the problem of data distribution. It remains the best database for low latency access to large volumes of data while still allowing for multi-region replication. We will discuss how Cassandra solves the problem of data distribution and availability at scale.
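
To make the distribution story concrete, here is a minimal, illustrative sketch using the DataStax Python driver; the contact points, datacenter names, keyspace and table are assumptions for illustration, not material from the class.

```python
# Minimal sketch: declaring multi-datacenter replication and doing a simple
# write/read with the DataStax Python driver (pip install cassandra-driver).
# Contact points, datacenter names, and the schema are illustrative assumptions.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # any reachable nodes; the driver discovers the rest
session = cluster.connect()

# Replication is declared per keyspace; each datacenter keeps its own replicas.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 2}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id text, event_time timestamp, payload text,
        PRIMARY KEY (user_id, event_time)
    )
""")

# The partition key (user_id) determines which nodes own the row; there is no
# manual sharding and no external router to maintain.
session.execute(
    "INSERT INTO demo.events (user_id, event_time, payload) VALUES (%s, %s, %s)",
    ("alice", datetime(2014, 4, 1, 13, 0), "logged in"),
)
for row in session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("alice",)):
    print(row.user_id, row.event_time, row.payload)
```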

This class will cover:
    •    Replication
    •    Data Partitioning
    •    Local Storage Model
    •    The Write Path
    •    The Read Path
    •    Multi-Datacenter Deployments
    •    Upcoming Features (1.2 and beyond)

For the most benefit from this class, attend the "Getting Started with Cassandra" workshop.

Level: Intermediate
Extending Your Data Infrastructure with Hadoop

Hadoop provides significant value when integrated with an existing data infrastructure, but even among Hadoop experts there is still confusion about options for data integration and analysis of data in Hadoop. This class will help clear up the confusion by providing answers to the following:
  • How can Hadoop be used to complement and extend a data infrastructure?
  • How can Hadoop complement my data warehouse?
  • What are the capabilities and limitations of available tools?
  • How do I get data into and out of Hadoop?
  • Can I use my existing data integration and business intelligence tools with Hadoop?
  • How can I use Hadoop to make my ETL processing more scalable and agile?
We'll illustrate this with an example end-to-end processing flow, using available tools showing how data can be imported and exported with Hadoop, ETL processing in Hadoop, and reporting and visualization of data in Hadoop. We'll also cover recent advancements that make Hadoop an even more powerful platform for data processing and analysis.

Level: Intermediate
Getting Started with R and Hadoop, Parts I and II

Increasingly viewed as the lingua franca of statistics, R is a natural choice for many data scientists seeking to perform Big Data analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will teach you options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, and we will test our installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis, and how to write mappers, reducers and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from the HDFS. We will explore the interaction of the Hadoop infrastructure with your code by tracing the input and output data for each step. Examples will include the canonical "word count" example, as well as the analysis of structured data from the airline industry.

No specific prerequisite knowledge is required, but a familiarity with R and Hadoop or Map/Reduce is helpful.

Level: Advanced
Graph Database Use Cases

Learn from existing open-source projects how to build proof-of-concept solutions and how to add a new tool to your development toolkit. Social Networks, Recommendation Engines, Personalization, Dating Sites, Job Boards, Permission Resolution and Access Control, Routing and Pathfinding, and Disambiguation are just a few of the use cases that lend themselves well to graph databases.

Level: Overview
Intro to Apache Hive and Impala

This class gives an introduction to Apache Hive and Impala, both of which are open-source SQL engines for Hadoop. These engines let Hadoop users move away from writing MapReduce jobs and instead write familiar SQL-like queries against data in Hadoop. We will go into architectural details and use cases for both engines, detail which use cases are suited to which SQL engine, and conclude with a feature and performance comparison of the two.
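
For a taste of what querying Hadoop through SQL looks like from client code, here is a minimal, hedged sketch against HiveServer2 using the PyHive library; the host, credentials and `page_views` table are assumptions, and Impala can be queried through a similar cursor interface (for example via the impyla package).

```python
# Minimal sketch: issuing a SQL query against Hive from Python via HiveServer2.
# Requires the PyHive package (pip install pyhive); the host, username and the
# `page_views` table are illustrative assumptions, not part of the class.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# A familiar SQL-like query instead of a hand-written MapReduce job.
cursor.execute("""
    SELECT url, COUNT(*) AS hits
    FROM page_views
    WHERE dt = '2014-04-01'
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
for url, hits in cursor.fetchall():
    print(url, hits)

# Impala speaks a compatible SQL dialect; the impyla package exposes a similar
# cursor-based interface for low-latency queries against the same tables.
```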

Level: Overview
Introduction to MapReduce

MapReduce is a programming framework introduced by Google in the early 2000s. It is targeted at solving problems that have to work on huge datasets. Rather than devising an algorithm that works on the entire dataset, the MapReduce framework works on several chunks of the same dataset in parallel during the Map phase and combines the results together during the Reduce phase. MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease transmission of data. In this class, we will cover:
  • An introduction to the MapReduce paradigm in Hadoop
  • A code walkthrough in Java. This part will cover everything from building your first MapReduce application in Java to writing custom MapReduce applications that perform complex sorts and joins (a small language-neutral sketch of the paradigm follows this list).
  • How does MapReduce work in Hadoop? This part will explain the internal workings of MapReduce (shuffle and sort), job scheduling and failure handling in classical MapReduce.
  • Use cases for MapReduce. Are there scenarios that fit the MapReduce paradigm?
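
The class's code walkthrough is in Java; purely to illustrate the paradigm itself, here is a minimal word-count sketch written for Hadoop Streaming in Python. The file name, input/output paths and the streaming-jar location are assumptions.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- a minimal Hadoop Streaming word count.
# The same file acts as mapper or reducer depending on its argument, e.g.:
#   hadoop jar $STREAMING_JAR \
#     -input /data/text -output /data/wordcount \
#     -mapper "python wordcount_streaming.py map" \
#     -reducer "python wordcount_streaming.py reduce" \
#     -file wordcount_streaming.py
# (The streaming-jar path and the input/output directories are assumptions.)
import sys


def do_map():
    # Emit one tab-separated (word, 1) pair per token on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())


def do_reduce():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))


if __name__ == "__main__":
    do_reduce() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else do_map()
```
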
Level: Intermediate
MongoDB: A Gentle Introduction to the World of NoSQL

We will start with a gentle introduction to the history and common uses for MongoDB. We will follow up the introduction with simple ingestion of data and querying. We will finish off with aggregation operations, and if time permits, we will begin to discuss MapReduce in the context of MongoDB. If you have not ventured significantly from the SQL world, this is the class for you.
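
For those who would like a preview, here is a minimal sketch of those three steps (ingest, query, aggregate) using the PyMongo driver; the connection URI, database and collection names are illustrative assumptions.

```python
# Minimal sketch of the three steps the class walks through: ingest, query,
# aggregate. Requires pymongo; the URI and collection names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Ingestion: documents are just dictionaries, no schema migration required.
orders.insert_many([
    {"customer": "alice", "status": "shipped", "total": 42.50},
    {"customer": "bob",   "status": "pending", "total": 17.00},
    {"customer": "alice", "status": "shipped", "total": 8.25},
])

# Querying: find documents by field value, roughly a WHERE clause.
for doc in orders.find({"customer": "alice"}):
    print(doc["status"], doc["total"])

# Aggregation: group and sum, roughly a GROUP BY in SQL terms.
pipeline = [{"$group": {"_id": "$customer", "spend": {"$sum": "$total"}}}]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["spend"])
```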

If you would like to follow along in an identical environment, please bring a workstation-class laptop (8+ GB RAM) with VirtualBox installed.

Level: Overview
2:45 PM - 3:15 PM

360-Degree Customer View: Integration & Analytics Empower Big Data

Companies are generating more and more information today, filling an ever-growing collection of data warehouses. The information may be about customers, Web sites, security, or company logistics. As this information grows, so does the need for businesses to sift through the information for insights that will lead to increased sales, better security, lower costs and more. 

The instructor, who wrote a book on analytics with Mongo, will discuss how integrating big data environments enables customers to transform and analyze vast amounts of structured and unstructured information. Through loading and blending data stored in big data environments, companies are able to get a more complete understanding of customer behavior. Pentaho’s integration, visualization and analytics platform helps companies transition from loading big data environments to getting a 360-degree view of consumer behavior.

This class is sponsored by Pentaho.

Level: Overview
Building Big Apps with Sqrrl Enterprise

Many folks are still struggling to find that "killer app" – the big data use case that will produce a significant, measurable impact on their line of business. Other than the obvious obstacle (coming up with the right idea), there are a wide variety of tactical challenges that are easy to take for granted. Early adopters have found that rolling out a multi-tenant Big Data system comes with a fair share of obstacles, and security is more important now than it has ever been. To prevent massive data breaches such as those at Target, or insider leaks like Snowden's, you need to adopt a data-centric security mindset. Attend this class for an overview of building big apps with Sqrrl Enterprise to learn about our philosophy for approaching Big Data challenges.

This class is sponsored by Sqrrl.

Level: Overview
Can Big Data Change Your World? An Open Panel, Real-World Discussion about Big Data Best Practices

Big Data is changing our world and impacting business like never before, and technological advances are enabling previously unavailable insights and transforming the way we do business, work with others, and live our lives.  

To be competitive you need to leverage Big Data and the business value it brings. Yet many are daunted by this, wrongly anticipating the need for large complex solutions. The SAP HANA platform along with the Intel Data Platform (with the Intel Distribution for Apache Hadoop) and technologies from Hitachi Data Systems provide a powerful, highly distributed, elastic solution that can simplify your IT architecture for Big Data. The joint solution amplifies the value of Big Data across your data fabric and helps your organization gain unprecedented business value, on premises and in the cloud.

Attend this class to learn how you can gain actionable real-time insights combined with deep analytics, and see how three powerful and complementary technologies, the SAP HANA platform, the Intel Data Platform (with the Intel Distribution for Apache Hadoop) and Hitachi Data Systems' converged infrastructure, empower enterprises with a future-proof, modern data architecture for end-user access anywhere and on any device.
During this class, you'll learn: 

  • How Hadoop and SAP HANA are complementary technologies engineered to work together.
  • Where and how an organization can obtain the ability to gain deep insights.
  • Why Hitachi’s converged infrastructure makes Hadoop and SAP HANA superior solutions.
  • How to enable your mobile workforce with secure access to information anywhere.
  • Insights from Hitachi, Intel and SAP Big Data experts, and your questions too!
This class is sponsored by Hitachi Data Systems, Intel and SAP.

Level: Intermediate
How to Connect to the Internet of Things Using Informatica

The boundaries between the physical and virtual worlds are being erased as every device becomes connected to the Internet, a trend many refer to as "the Internet of Things." This has the power to transform global industries by combining human insight with machine intelligence. Modern-day applications now depend on data originating from multiple sources, including application databases, social media, and machine data (e.g. sensor devices, log files, mobile, call detail records, etc.). Discover how to integrate massive amounts of real-time machine data using Informatica so you can connect enterprise data to the "Internet of Things" and achieve your information potential.
 
In this class, you will learn:
  • About a reference architecture to process machine data
  • How to stream machine data into Hadoop
  • How to parse and integrate machine data with RDBMS data
  • How to combine and correlate real-time data with historical data
This class is sponsored by Informatica.

Level: Intermediate
Know Thy Customer: Enhancing Customer Experiences with Big Data Analytics

Competitive advantages are changing. The days of differentiating on product and competing on price alone are winding down. The new competitive advantage is to differentiate effectively on the customer experience. Customers today have higher expectations than they did just a few short years ago, and firms that want to be proactive in reaching out to customers must analyze a great deal more information than ever before. They must be able to listen to the ‘voice of the customer’ in real time when possible, and then make real-time decisions and act on them.

The voice of the customer comes in different forms and ends up generating volumes of quantitative and qualitative data. Today, there may be real-time or near-real-time availability in, say, three different systems, but they may not talk to each other. Thus, important customer signals are being missed in the ever-increasing amount of data. Gaining visibility and reducing complexities transcends merely applying technology in silos.
 
Attend this class to learn how you can head off such situations by using technology-enabled customer analytics to obtain a real-time view of the customer. Making real-time decisions requires an optimal blend of predictive analytics from multiple data sources. Learn key principles for moving beyond the data silos so that the overall ‘voice of the customer’ is heard.
 
This class is sponsored by MindStream Analytics.

Level: Overview
Sensitive Data Discovery and Protection for Big Data

Increasingly, enterprises, organizations and governments are turning to Hadoop for business and intelligence applications that incorporate sensitive data. Whether they are driving new revenue streams and customer retention programs, reducing costs for existing service delivery, or optimizing supply chains and data sharing, these customers have a burgeoning data security challenge that, if left unaddressed, creates new compliance, audit and data-exposure risks for organizations that handle private personal data, financial data or health records.
 
This class will highlight data protection techniques that combine with Hadoop security services such as Knox to let customers deliver fast, powerful, timely insights that require consistent, unique values, while protecting, removing and de-risking sensitive data inside Hadoop. Removing that risk while letting the business, R&D, engineering and test developers do their jobs will be the principal focus of this presentation.

This class is sponsored by Dataguise.

Level: Overview
The Ultimate Selfie: Picture Yourself with the Fastest Analytics on Hadoop with HP Vertica and Pentaho

Big Data is both a marketing and a technical term that refers to a valuable asset: Information. With such a variety of complex data, this class will talk about how Vertica and Pentaho are working together to support organizations in making better business decisions. Understand how big data is not just about acquiring and storing information. It's also about using the data in a Big Data analytics platform to help business users interact with the valuable data. Picture yourself creating stunning visuals worthy of sharing with your colleagues and fans.

This class is sponsored by HP Vertica.

Level: Overview
3:30 PM - 4:45 PM

Application Architectures with Hadoop: Putting the Pieces Together Through Example

Numerous companies are undertaking efforts to explore how Hadoop can be applied to optimize their data management and processing, as well as address challenges with ever-growing data volumes. Although there's currently a large amount of material on how to use Hadoop and related components in the Hadoop ecosystem, there's a scarcity of information on how to optimally tie those components together into complete applications, as well as how to integrate Hadoop with existing data management systems – a requirement to make Hadoop truly useful in enterprise environments.

This class will help developers and architects who are already familiar with Hadoop to put the pieces together through a real-world example architecture, implementing an end-to-end application using Hadoop and components in the Hadoop ecosystem. This example will be used to illustrate important topics such as:
  • Modeling data in Hadoop and selecting optimal storage formats for data stored in Hadoop
  • Moving data between Hadoop and external systems such as relational databases and logs
  • Accessing and processing data in Hadoop
  • Orchestrating and scheduling workflows on Hadoop
Throughout the example, best practices and considerations for architecting applications on Hadoop will be covered. This class will be valuable for developers who are already knowledgeable about Hadoop, and are now looking for more insight into how it can be leveraged to implement real-world applications.

Note: You will need an understanding of Hadoop concepts and components in the Hadoop ecosystem, as well as an understanding of traditional data management systems (e.g. relational databases), and knowledge of programming languages and concepts.

Level: Advanced
Building Your Own Facebook Graph Search with Cypher and Neo4j

Learn how to create your own Facebook Graph Search or how to build a similar system with your own company data. Also learn how to interpret natural language into a grammar and use it to build Cypher queries to retrieve data from a graph. Knowledge of Natural Language Processing is not required. The instructor works for Neo Technology, creators of the Neo4j Graph Database.
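
As a minimal illustration of the last step (running a Cypher query from application code), here is a hedged sketch using the official Neo4j Python driver; the connection details and the Person/FRIEND/LIKES data model are assumptions, not the instructor's material.

```python
# Minimal sketch: a "graph search" style question answered with Cypher via the
# official Neo4j Python driver (pip install neo4j). The connection details and
# the Person/FRIEND/LIKES data model are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "Restaurants my friends like" -- the kind of question Graph Search answers.
QUERY = """
MATCH (me:Person {name: $name})-[:FRIEND]->(friend)-[:LIKES]->(place:Restaurant)
RETURN place.name AS restaurant, count(friend) AS friends_who_like
ORDER BY friends_who_like DESC
"""

with driver.session() as session:
    for record in session.run(QUERY, name="Alice"):
        print(record["restaurant"], record["friends_who_like"])

driver.close()
```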

Level: Intermediate
Enterprise-Grade Security for Apache Hive

Hive provides a flexible SQL-like interface to access data stored in the Hadoop platform and is increasingly used to process sensitive data. Securing access to such data becomes critical for users, especially for those that are required to comply with security regulations. While a secure system appears simple when it’s working, there are significant challenges involved in building one. This is especially true for a system such as Hive because of the flexibility and different deployment options it offers.

This class will cover enterprise-grade authentication and authorization for Apache Hive. We will discuss various Hive deployment options, such as HiveServer and the Metastore, and ways to configure security for these deployments. We will also cover in depth different authentication options such as Kerberos and LDAP, message encryption, and fine-grained authorization using Apache Sentry, a new incubating Apache project that enables role-based, multi-tenant authorization and includes out-of-the-box support for Hive.

Level: Intermediate
Getting Started with R and Hadoop, Parts I and II

Increasingly viewed as the lingua franca of statistics, R is a natural choice for many data scientists seeking to perform Big Data analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will teach you options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, and we will test our installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis, and how to write mappers, reducers and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from the HDFS. We will explore the interaction of the Hadoop infrastructure with your code by tracing the input and output data for each step. Examples will include the canonical "word count" example, as well as the analysis of structured data from the airline industry.

No specific prerequisite knowledge is required, but a familiarity with R and Hadoop or Map/Reduce is helpful.

Level: Advanced
H2O for Small Data

H2O is a high-performance clustered math engine for data sets in the hundreds-of-gigabytes to several-terabytes range. It fills the gap between single-machine tools such as R and Hadoop, which is inefficient at this relatively small scale. In this class, you will learn how H2O works and see demonstrations of example data problems, including several machine learning algorithms, using the Java API and alternatives.

Level: Advanced
Selecting the Right Big Data Tool for the Right Job, and Making It Work for You

This class will focus on the various types of Big Data solutions, from open source to commercial, and the specific selection criteria and profiles of each. As in all technology areas, each solution has its own sweet spots and challenges, whether in CAP-theorem trade-offs, ACID compliance, performance or scalability. This class will provide an overview of the technical trade-offs for these solutions in technical terminology. Once the technical trade-offs are reviewed, we will review the cost and value of open-source solutions versus commercial software, and the trade-offs involved in choosing one over the other.

The next phase will go into great detail on use cases for specific solutions based on real-world experience. All of the specific use cases have been seen first-hand from our customers. The solutions will be reviewed to the level of specific technical architecture and deployment details. This is intended for a highly technical audience and will not provide any high-level material on the solutions discussed. It is assumed you have working knowledge of solutions such as SQL, NoSQL, distributed file systems and time-series indexes. You should also have an understanding of CAP theorem, ACID compliance and general performance characteristics of systems.

Level: Advanced
YARN and MapReduce 2: An Overview

At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.

This class offers an overview of YARN. We will cover YARN's motivation, its architecture, its implementation and its usage. We will also discuss MapReduce 2 and how it fits in with YARN.

Level: Overview
Wednesday, April 2
8:30 AM - 9:45 AM

Analyzing Tweets with HBase, Parts I and II

This two-hour class will cover how to use the Twitter API to download and model tweets in HBase, and then run natural-language processing against them. We will first cover the architecture fundamentals of HBase, including log-structured merge trees, data models, memstores, HFiles and Bloom filters. Next, tweets will be populated into HBase. Finally, we will explore some of the more interesting analysis that can be done with the tweets and NLP. All code for this class is publicly released under a Creative Commons license.
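
To make the storage model concrete, here is a small, hedged sketch of writing and reading tweet rows with the happybase client over HBase's Thrift gateway; the table name, row-key scheme and column family are assumptions and are not the code released with the class.

```python
# Minimal sketch: writing and reading tweet rows in HBase through the Thrift
# gateway with happybase (pip install happybase). Assumes a 'tweets' table with
# column family 'd' already exists (e.g. created via the HBase shell); the
# row-key scheme is an illustrative assumption.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("tweets")

# Row key: user id + tweet id keeps one user's tweets adjacent on disk.
table.put(b"alice|000000451", {
    b"d:text": b"Learning HBase at Big Data TechCon",
    b"d:lang": b"en",
    b"d:created_at": b"2014-04-02T08:30:00Z",
})

# Point read by row key, then a prefix scan over all of one user's tweets.
print(table.row(b"alice|000000451")[b"d:text"])
for key, data in table.scan(row_prefix=b"alice|"):
    print(key, data[b"d:text"])
```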

Level: Intermediate
Building a Big Data Recommendation Engine on Hadoop

Recommendations have become an expected part of the customer experience. Internet leaders such as Amazon and Netflix have set this bar very high. Quality recommendations help individuals find things they may not have thought of themselves. This positive buyer experience has become a critical tool for online retailers, the hospitality industry and countless other service-driven organizations.

Hadoop provides the scalable, flexible and cost-effective platform needed to build a Big Data recommendation engine. Leveraging the collaborative filtering and clustering libraries within Mahout, development of a basic yet powerful recommender is very straightforward.

In this class, we will discuss some of the key algorithms used in common recommenders. We will then walk through the critical design aspects of building a recommendation engine and share tips and lessons learned from real-world implementations. We will then walk through the code and operation of a functioning Big Data recommender, available as a fully functional virtual machine distributed during the lecture.

Note: We will interactively demonstrate a few key components of the sample application. Although not a requirement, we encourage attendees to download our Hadoop VM and clone the latest version of our sample project. The Hadoop VM is over 1GB and is a stable release; please download it prior to the class.

Download the Hadoop Virtual Machine and Sample Recommender Github repository

Level: Intermediate
Hardcore Data Science Explained

As the amount of data in the world grows dramatically, humanity does not have enough resources to review and label it. Algorithms and other tools for working with large amounts of unlabeled data will be game-changers for businesses in the near future. One of the most powerful algorithms known today for analyzing unlabeled data is the deep learning neural network. Inspired by brain architecture and the idea of a single learning algorithm for any recognition problem, deep learning brings state-of-the-art performance to a wide variety of analytical problems, such as video analysis, image analysis, audio analysis and text analysis.

To design deep learning neural networks, you need to understand how they work, and you need a fair amount of processing power for training. One way to get this power is with parallel processing on GPUs. Training deep learning neural networks on GPUs is currently the most efficient and cheapest way of doing state-of-the-art Big Data analytics.

This class provides insight into deep learning networks and breaks down the process of designing, training and fine-tuning these models. It also provides guidelines and a technology overview for simplifying training on GPUs.

Note: You will need an understanding of Machine Learning.

Level: Advanced
Introduction to Apache Pig, Parts I & II

This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover:
  • What is Pig and why would I use it?
  • Understanding the basic concepts of data structures in Pig
  • Understanding the basic language constructs in Pig. We'll also create basic Pig scripts.
Prerequisites: This class will be taught in a Linux environment, using the Pig command-line interface (CLI). Please come prepared with the following:
  • Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
  • Basic experience connecting to an Amazon EC2/EMR cluster via SSH
  • Windows users should have a knowledge of Cygwin and PuTTY
  • A basic knowledge of Vi would be helpful but not necessary
Also, bring your laptop with the following software installed in advance:
  • PuTTY (Windows only): You will log into a remote cluster for this class. Mac OS X and Linux environments include SSH (Secure Shell) support. Windows users will need to install PuTTY; download putty.zip from the PuTTY website.
  • A text editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) or Notepad++ (but not Notepad) are suitable.
Level: Overview
Introduction to Topological Data Analysis

Topological Data Analysis (TDA) is a framework for data analysis and machine learning and represents a breakthrough in how to effectively use geometric and topological information to solve Big Data problems. TDA provides meaningful summaries (in a technical sense to be described) and insights into complex data problems.

In this class, we will start with an overview of TDA and describe the core algorithm, "Mapper." We will do this concretely through a carefully worked out example that we will return to as the material gets more complex so that the technique is transparent and concrete. The class will include both the theory and real-world problems that have been solved using TDA. After taking this class, you will understand:
  • How the Mapper algorithm works and how it improves on existing 'classical' data analysis techniques (a rough code sketch of the pipeline follows this list).
  • How Mapper forms a framework for many machine learning algorithms and tasks.
  • How to interpret the algorithmic output (called an extended Reeb graph, or ERG), with concrete examples of how it has been applied in real-world applications.
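
Purely as an illustration of the pipeline named above (filter, overlapping cover, per-bin clustering, nerve graph), here is a rough, self-contained Python sketch; the lens, cover parameters and choice of DBSCAN for clustering are assumptions, not the instructor's implementation.

```python
# Rough sketch of the Mapper pipeline: lens -> overlapping cover -> clustering
# per bin -> nerve graph. Parameters and the toy data set are assumptions.
import numpy as np
from itertools import combinations
from sklearn.cluster import DBSCAN

def mapper_graph(X, lens, n_bins=10, overlap=0.3, eps=0.3, min_samples=3):
    values = lens(X)                                   # 1-D filter value per point
    lo, hi = values.min(), values.max()
    width = (hi - lo) / n_bins
    nodes, membership = [], []                         # one node per cluster

    for i in range(n_bins):
        # Overlapping interval of the cover.
        start = lo + i * width - overlap * width
        end = lo + (i + 1) * width + overlap * width
        idx = np.where((values >= start) & (values <= end))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for label in set(labels) - {-1}:               # -1 is DBSCAN noise
            nodes.append((i, label))
            membership.append(set(idx[labels == label].tolist()))

    # Nerve: connect clusters that share at least one data point.
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if membership[a] & membership[b]]
    return nodes, edges

# Toy usage: a noisy circle, with the x-coordinate as the lens.
theta = np.random.uniform(0, 2 * np.pi, 500)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(500, 2)
nodes, edges = mapper_graph(X, lens=lambda pts: pts[:, 0])
print(len(nodes), "clusters,", len(edges), "edges")
```
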
Note: You will need knowledge of Calculus and basic statistics. A Machine Learning background is also preferred to give context to the examples.

Level: Advanced
Organizing Big Data from the Internet

While many of us make extensive use of Big Data already, arguably the biggest and most useful data set is the public Internet itself.  Methods for harnessing this data can be cumbersome, but with the right framework in mind, it doesn't have to be. This class will present a framework for connecting to public data sets, organizing discovered information into a structured format, and using this structured data to identify entities and resolve them to those in your own data set. The result is that your original data set is enriched by the Internet, with more entries, and richer data on each entry.

This class will provide a clear view of the steps and implementation needed to orchestrate data feeds for a category of interest, and how to use big data processes to organize and resolve incoming data to existing data sets. By utilizing the public Internet as a data source for filling in sparse data, big data becomes more complete and actionable.

Level: Overview
Untangling the Relationship Hairball with a Graph Database

Not only has data gotten bigger, it's gotten more connected. Make sense of it all and discover what these Big Data connections can tell you about your users and your business. Come to this class to learn some of the different use cases for graph databases, and how to spot the non-obvious opportunities in your data.

Level: Overview
10:00 AM - 11:15 AM

Analyzing Tweets with HBase, Parts I and II

This two-hour class will cover how to use the Twitter API to download and model tweets in HBase, and then run natural-language processing against them. We will first cover the architecture fundamentals of HBase, including log-structured merge trees, data models, memstores, HFiles and Bloom filters. Next, tweets will be populated into HBase. Finally, we will explore some of the more interesting analysis that can be done with the tweets and NLP. All code for this class is publicly released under a Creative Commons license.

Level: Intermediate
From Big Data to Smart Data

Tools for Big Data management typically concentrate on the volume or velocity side of the equation, but give short shrift to the variety and veracity aspects, despite the fact that these often determine the value of the processed data. A quiet revolution is taking place on the fringes of the Big Data movement as Hadoop and other Big Data tools have begun to be used to make data contextual—for resources to know not only about themselves (self-awareness) but also to know about how they relate to the rest of the world (external awareness).

This class looks at the collision of Big Data and Semantic Technologies, showing how Hadoop and similar tools can be used to work with RDF triple stores and SPARQL to describe, relate and infer the information that you need. SPARQL can be thought of as SQL for the Web, providing a way to query distributed and federated data systems, to discover the inherent data models of complex information and to use this to help these systems learn new things. In this class, you will dive into semantic technologies (RDF, Turtle, OWL, SPARQL) to see how such data works and what benefits it provides. The class will then look at MapReduce solutions both for ingesting RDF content and for using it to drive inference over that information.
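
To make the RDF/SPARQL side tangible before the MapReduce discussion, here is a minimal sketch using the rdflib Python library; the tiny Turtle snippet and the example vocabulary are invented purely for illustration.

```python
# Minimal sketch: load a few RDF triples (Turtle) and query them with SPARQL
# using rdflib (pip install rdflib). The example vocabulary is an assumption.
import rdflib

TURTLE = """
@prefix ex: <http://example.org/> .
ex:alice ex:worksFor ex:acme ;
         ex:knows    ex:bob .
ex:bob   ex:worksFor ex:acme .
"""

g = rdflib.Graph()
g.parse(data=TURTLE, format="turtle")

# SPARQL reads much like SQL over a web of subject-predicate-object facts:
# "who knows a colleague at the same employer?"
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?colleague WHERE {
        ?person    ex:knows    ?colleague .
        ?person    ex:worksFor ?org .
        ?colleague ex:worksFor ?org .
    }
""")
for person, colleague in results:
    print(person, "knows colleague", colleague)
```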

Note: This class requires an understanding of MapReduce principles, data modeling, NoSQL databases and common industry interchange protocols (XML, JSON, RESTful Architectures). Knowledge of XQuery or XSLT2 is useful but not required. Given the volume of material to cover, this will be given as a lecture, but you will be able to test against a live database.

Level: Advanced
Genetic Programming with Hadoop

In addition to being an excellent Big Data tool, Hadoop is a powerful platform for distributed computing and lends itself well to Monte Carlo simulations. This class will focus on leveraging concepts of genetic algorithms and genetic programming in the Hadoop environment. Concepts, tools and strategies will be discussed in general form, including program representation, population creation, crossbreeding, and fitness selection. Specific applications for both graph routing and trading algorithms will be detailed and demonstrated running in a Hadoop cluster.

Genetic Programming is a subset of evolutionary computation – programs that write themselves. The programs converge on a solution by following biological principles to propagate and crossbreed successful program characteristics. Genetic programs have been widely used in scientific and engineering applications and are beginning to find their way into widespread computing applications, including routing, pricing, trading strategies and vision systems.

Hadoop provides the opportunity to scale genetic programming into more complex areas by allowing the developer to focus on the problem space and not on the intricacies of distributed computing. This class will introduce genetic programming and show two applied examples: a simple ant/search program and a currency trading model. The genetic representations of these programs and their runtime behavior on Hadoop will be examined and compared to other approaches such as MPI, HPC and GPU.
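
For readers new to the idea, here is a toy, single-machine genetic algorithm showing the building blocks mentioned above (representation, population creation, crossbreeding, mutation and fitness selection). It is an invented illustration; on Hadoop, the fitness evaluation of each candidate or sub-population would typically become a map task.

```python
# Toy genetic algorithm: evolve a string toward a target. Representation is a
# character string, selection is truncation selection, and the parameters are
# arbitrary choices made for illustration only.
import random

TARGET = "BIG DATA TECHCON"
GENES = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def fitness(candidate):
    # Number of characters that already match the target.
    return sum(c == t for c, t in zip(candidate, TARGET))

def crossover(a, b):
    cut = random.randrange(1, len(TARGET))
    return a[:cut] + b[cut:]

def mutate(candidate, rate=0.02):
    return "".join(random.choice(GENES) if random.random() < rate else c
                   for c in candidate)

population = ["".join(random.choice(GENES) for _ in TARGET) for _ in range(200)]
for generation in range(1000):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    parents = population[:50]                       # truncation (fitness) selection
    population = parents + [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(150)]

best = max(population, key=fitness)
print("generation", generation, "best:", best)
```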

Level: Advanced
Getting Started in Big Data Science Consulting

Getting started in Big Data Science (BDS) consulting can be a daunting task. The primary focus of this class will be to walk through the process of going from concept to completion of a BDS consulting engagement. We will start by discussing the professional landscape and various ways to get started as an aspiring consultant. Next, we will cover available open-source tools and other resources such as Massively Open Online Courses. The majority of our time will be spent discussing the technical side of scoping, executing, and delivering your project. If time permits, we will discuss ancillary topics such as trends that we are seeing with our clients and how we approach new domains.

We will discuss:
  • Joining an old firm vs. a start-up vs. working independently
  • Scoping your project and risks that early BDS consultants should avoid
  • Choosing a modeling method based on associated trade-offs
  • Using feature generation to improve models
  • Using models to improve feature generation
  • Variable Selection
  • Dimensionality reduction
  • Overtraining
  • Model delivery and presentation
  • Project Management for Data Scientists
Note: Having an introductory understanding of p-values, information gain, receiver operating characteristic curves, Logistic Regression, and classification in general will be useful.

Level: Intermediate
How to Build and Operate Big Data Applications

Hadoop is a tremendously powerful piece of technology but it’s low-level infrastructure and not targeted at application developers. Not only is the development itself difficult, but debugging, deploying, and managing an app on Hadoop is complex, with much responsibility left to the developer. In this class, the instructor will use his real-world experience of developing apps on top of Hadoop and HBase to highlight some of the key aspects and best practices of building and operating big data applications.

Level: Intermediate
Introduction to Apache Pig, Parts I & II

This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover:
  • What is Pig and why would I use it?
  • Understanding the basic concepts of data structures in Pig
  • Understanding the basic language constructs in Pig. We'll also create basic Pig scripts.
Prerequisites: This class will be taught in a Linux environment, using the Pig command-line interface (CLI). Please come prepared with the following:
  • Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
  • Basic experience connecting to an Amazon EC2/EMR cluster via SSH
  • Windows users should have a knowledge of Cygwin and PuTTY
  • A basic knowledge of Vi would be helpful but not necessary
Also, bring your laptop with the following software installed in advance:
  • PuTTY (Windows only): You will log into a remote cluster for this class. Mac OS X and Linux environments include SSH (Secure Shell) support. Windows users will need to install PuTTY; download putty.zip from the PuTTY website.
  • A text editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) or Notepad++ (but not Notepad) are suitable.
Level: Overview
11:45 AM - 1:00 PM

Building APIs with NodeJS and MongoDB

Imagine a system where the user interface, server backend and persistence layer share a common language. Goodbye to programming-language context switching! It seems impossible, but it's not. You can now build a fully functional, high-performance Web API (or Web app) with NodeJS and MongoDB, all of which share one common programming language: JavaScript. ExpressJS, HapiJS and many other frameworks for NodeJS provide simple and easy-to-use tooling to build Web APIs or Web apps. MongooseJS is a MongoDB object-modeling tool that gives you the means to create domain models that match the documents in MongoDB.

Using these three tools, you can now develop in an environment that shares one common programming language: JavaScript. In this class, you will build a small Web API or Web app using NodeJS, ExpressJS and MongooseJS to expose the data in a MongoDB database.

Level: Intermediate
From Insight to Impact: Lessons Learned from Advanced Analytics Projects

In the last two years, the world has generated as much data as was generated in all of human history prior to then. Winning organizations are ones that harness the power of this data to make more intelligent decisions. Many advanced analytics departments are delivering game-changing insights. Unfortunately, most of these insights do not generate game-changing results.
                                                                 
In this class, we will discuss best practices that can set the stage for successful execution of your analytics engagements.  These practices include:
  • Setting stage gates for the analytics process
  • Engaging critical stakeholder support early in the process
  • Recruiting and mentoring the right “culture” of advanced analytics talent
  • Common pitfalls and how to avoid them
Level: Overview
Implementing Real-Time Analytics with Apache Storm

Companies are pushing to grow revenue by personalizing customer experiences in real time based on what they care about, where they are, and what they are doing now. Hadoop systems provide valuable batch analytics, but applications also require a component that can analyze Web cookies, mobile device location, and other up-to-the-moment user data. Increasingly, Apache Storm is the component that developers are using.

This class will discuss the design considerations for using the Apache Storm distributed real-time computation system, as well as how to install and use it. Key topics will include:
  • A technical overview of Storm for real-time analysis of streaming data.
  • A solution architecture overview of how Storm’s role in real-time analysis complements Hadoop and NoSQL.
  • Examples of common use cases, including trending words, recommendations, and simple fraud counts
  • Technical requirements for implementing Storm
  • Live demonstrations of how to install and run Storm
Throughout the discussion, concepts will be illustrated by examples of businesses that have implemented real-time applications using Apache Storm.
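
As a framework-free taste of the "trending words" use case listed above, here is a toy rolling-count sketch in plain Python; in a real deployment this logic would live inside a Storm bolt fed by a spout, and the window size and simulated stream are assumptions.

```python
# Framework-free sketch of the computation a "trending words" topology performs:
# consume a stream, keep rolling counts inside a time window, and report the
# current leaders. In Storm this would run inside a bolt; here it is in-process.
import time
from collections import deque, Counter

WINDOW_SECONDS = 60
events = deque()          # (timestamp, word) pairs currently inside the window
counts = Counter()

def process(word, now=None):
    now = now or time.time()
    events.append((now, word))
    counts[word] += 1
    # Expire events that have fallen out of the rolling window.
    while events and events[0][0] < now - WINDOW_SECONDS:
        _, old = events.popleft()
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]
    return counts.most_common(3)   # current "trending" words

# Simulated stream of incoming tokens.
for word in ["hadoop", "storm", "storm", "hbase", "storm", "hadoop"]:
    print(process(word))
```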

For this advanced-level technical class, it is recommended that you have a fundamental understanding of Hadoop and NoSQL since it will provide a contextual baseline for how Storm complements these technologies in providing a 360-degree view of users.

Level: Advanced
Metadata Management: The Emerging Big Data Discipline

In many organizations, there is a nuclear pile lurking, quietly building up heat, which, if left unchecked, will ultimately react explosively. Content/asset-management systems, built with an eye toward tracking hundreds or even thousands of individual assets for reuse, are now straining as their users begin dealing with millions of assets, content fragments and media files. Add to this the increasingly tenuous ability of archivists and digital librarians to keep up with the classification of these resources, the often inadequate approaches to building effective taxonomies, and the challenge of maintaining these systems for users who are not technically savvy, and you have a recipe for disaster that will likely have repercussions throughout the organization.

Metadata management is an emerging discipline that takes a different tack toward solving this Big Data problem. It combines Big Data tools such as Hadoop MapReduce with NoSQL databases, highly indexed textual content, document enrichment, and semantic processing to build a framework for managing, combining and curating these content fragments. Such solutions have applications primarily in the publishing and media spaces, but as many larger organizations are now increasingly publishing-oriented (especially as a byproduct of marketing), this means that metadata management has applicability everywhere.

This class dives into what metadata is, what the discipline of metadata management looks like, and explores some examples with live tools illustrating how metadata-management systems can be built to tackle the needs of organizations today and tomorrow.

This class is intended to be a deep dive into the tools and techniques of metadata management, so we will be looking at code and architecture for building such systems. It would also be good for users to have attended "From Big Data to Smart Data" as many of the core concepts will be first presented there.

Level: Advanced
Practical Natural Language Processing with Hadoop

Hadoop is a natural choice for storing large volumes of unstructured and semi-structured data, but what is done with this data is still largely relational or set-based analysis. To gain deep insight from this data, it is often necessary to leverage natural language processing tools to understand the meaning within the data. This class will focus on using the Python Natural Language Toolkit, an open-source NLP framework, on top of Hadoop to perform natural language processing on consumer sentiment from social media and email interactions.

Natural language processing is a complex problem space that promises to enable deep understanding of human actions and decisions. The promise of Big Data is largely predicated on the ability to understand unstructured data quickly, which has proven to be beyond the scope of most solutions. This class will introduce the tools (NLTK, HDP, and Excel) needed to perform real-world NLP, as well as strategies for dealing with the richness of language, such as noun-phrase chunking and context-free grammars. An insurance-claims filtering example will be presented from recent research, showing how these tools help companies reduce costs by automating normally manual claims-adjudication tasks.
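
For attendees who want to see the basic NLTK machinery before class, here is a minimal noun-phrase chunking sketch; the regular-expression grammar and the sample sentence are illustrative assumptions, and at scale the same function would run per record inside a Hadoop Streaming or MapReduce job.

```python
# Minimal sketch of noun-phrase chunking with the Python Natural Language
# Toolkit (pip install nltk, plus the tokenizer and tagger models via
# nltk.download). The grammar and sample sentence are illustrative assumptions.
import nltk

GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"     # optional determiner, adjectives, nouns
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = chunker.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

print(noun_phrases("The frustrated customer filed an insurance claim after the storm."))
```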

Level: Advanced
2:00 PM - 3:15 PM

Advanced Schema Design Using HBase

The first thing to consider when selecting a database is the characteristics of the data you are looking to leverage. If the data has a simple tabular structure, with a set of rows and columns like a spreadsheet, then the relational model is sufficient. If, on the other hand, the data represents something like geospatial information or engineering parts, it tends to be more complex: it may have multiple levels of nesting, and the complete data model can be complicated. In these cases, one should consider NoSQL databases as an option. The next big question to ask concerns the volatility of the data model: is it likely to change and evolve, or will it most likely stay the same? In other words, how much flexibility will be needed in the future?

HBase is a distributed, column-oriented data store for processing large amounts of data in a scalable and cost-effective way. Most developers, designers and architects have a lot of background or experience working with relational databases using SQL, which makes the transition to NoSQL challenging and oftentimes confusing. And since HBase runs on top of Hadoop, HDFS and MapReduce (the core components of Hadoop) add further complexity to the transition to this new paradigm of data processing.

This class will help you understand the features available in HBase for designing your schema, and the shortcomings of the technology, so that you can address them in your design at an early stage. The focus will be on sharing knowledge and real-world experiences to help you understand the full spectrum of technical and business challenges. The class will highlight recommendations and lessons learned in terms of schema design and transitioning to the new platform. At the end of the class, you will be in a better position to make the right choices on schema design using HBase.

Level: Advanced
Bridging the Gap: OLTP and Real-Time Analytics in a Big Data World

Big Data is often created by high-volume feeds of incoming state that remains mutable, possibly indefinitely. It's a poor fit for traditional OLTP databases due to the volume involved and it's a poor fit for analytic databases due to the emphasis on mutable state and point queries. Both types of databases are ultimately necessary because the ideal storage layout for a given data set is determined by the queries you need to execute efficiently.

Over the past few years, several solutions have cropped up to solve this problem either by bridging specialized systems together or creating new systems that attempt to be a one-size-fits-all solution to OLTP and analytics.

This class will give an overview of OLTP-oriented databases (Postgres, MySQL, VoltDB), hybrid OLTP and OLAP (HBase, Cassandra, MongoDB), and OLAP (Vertica, Netezza, Hadoop), and how they can be used exclusively or together to satisfy mixed OLTP/OLAP workloads at "Big Data" scale.

Level: Advanced
How to Build Enterprise Data Apps with Cascading

Cascading is a popular Java-based application development framework for building Big Data applications on Apache Hadoop. This open source, enterprise development framework allows developers to leverage their existing skill sets such as Java and SQL to create enterprise-grade applications without having to think in MapReduce. This comprehensive framework separates business logic from integration logic so that developers can quickly build and test data applications locally on their laptop and then deploy them on Hadoop. While typical enterprise data applications must cross through multiple departments and frameworks, Cascading allows multiple departments to seamlessly integrate their application components into one single data processing application.

In this class, you will get a brief introduction to Cascading, see how it works, and then dive into building applications with Cascading. You will discover what types of use cases exist for data-driven businesses and how to approach them with Cascading and its vast ecosystem (Lingual, Pattern, Scalding, Cascalog, etc.). You will also learn how to get started in building enterprise-grade applications with Cascading, all without having to think in MapReduce. You will be guided through the Cascading framework, seeing code, examples, and best practices for Cascading application development.

This class is highly beneficial for application developers and data scientists looking to build data-oriented applications on top of Hadoop. Because Cascading is a Java library, Java skills will be helpful in understanding the code and examples. Come and see how companies like Twitter, eBay, Etsy, and other data-driven companies are taking advantage of the Cascading framework and how Cascading is changing the business of Big Data in the enterprise!

Level: Advanced
Large-Scale Data Analysis with MongoDB

In traditional academic and medical research, SQL- and XML-based (gasp!) data stores are very commonly used for managing large data sets. These systems often store large amounts of computationally intensive data, but traditionally have poor overall performance for these data sets. In this class, we will cover how to use MongoDB running on EC2 for large-scale, efficient DNA genome analysis at AgileMedicine. You will see benchmarked numbers and compare them against a comparable MySQL database running on the same EC2 configuration. We will conclude with how MongoDB has helped to import large DNA genome data sets (such as from 23andMe) and speed up analysis time while lowering costs for the academic and medical research fields.

Level: Overview
Predictive Analytics: Turning Big Data into Smart Data

Conquering the throughput, capture, and storage challenges of Big Data is only the beginning. Once you have a handle on the input, how do you turn that into actionable output? How do you change your mindset from reviewing your past to charting your future?

Simply put: How do you turn your Big Data into smart data?

In this class, we’ll begin by seeing that the critical step in using data analysis to find the right answers is to ask the right questions. We will explore the kinds of questions that people commonly ask, and propose a different style of questions that focus on what will happen, instead of what has already happened. Using real-world examples of tracking customer churn and social media sentiment, you will learn how to use predictive analytics models to forecast which high-value customers you are most likely to lose, or what sentiments might gain momentum in the future. From there, we will see how that information can be used to best prepare resources and develop strategies to retain customers and drive positive sentiment.

We will also review tips and tricks that resolve common issues that can skew reports, visualizations and recommendations, including false positives, false negatives, and outliers in the data. With the ever-increasing size of available data sets, issues like these present more of a challenge if they are not quickly identified and handled; addressing them allows for more accurate predictive models.
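
As a small, hedged illustration of the churn-prediction idea discussed above (not the instructor's tooling), here is a toy logistic-regression sketch with scikit-learn on synthetic customer features.

```python
# Toy churn-prediction sketch with scikit-learn logistic regression. The
# feature set and the synthetic data are invented purely for illustration;
# a real model would be trained on historical customer records.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Features: months as a customer, support tickets filed, monthly spend.
X = np.column_stack([
    rng.integers(1, 60, n),
    rng.poisson(2, n),
    rng.normal(50, 15, n),
])
# Synthetic ground truth: short-tenure, high-ticket customers churn more often.
churn_prob = 1 / (1 + np.exp(0.05 * X[:, 0] - 0.8 * X[:, 1] + 1.0))
y = (rng.random(n) < churn_prob).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rank the held-out customers most likely to churn so retention can act first.
scores = model.predict_proba(X_test)[:, 1]
print("accuracy:", model.score(X_test, y_test))
print("top-5 at-risk indices:", np.argsort(scores)[::-1][:5])
```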

Note: Prior knowledge of some data analysis is helpful, but no experience in predictive modeling is assumed or required.

Level: Intermediate
3:45 PM - 5:00 PM

A Raw Data Journal: A History of Everything

Every time you run an 'UPDATE' on your database, you lose information. If you want to analyze how your data is changing through time, or undo a mistake, you’re out of luck. Several next-generation database systems have been designed from the ground up to solve this problem by placing an emphasis on data versioning, but it can be done without living on the bleeding edge. In fact, you can start today with your existing database, without impacting performance, changing your application, or even going down for maintenance.

This class will introduce the concept of a "Raw Data Journal," which is a log of every state your data has ever been in, and share how you can start aggregating this journal in HDFS directly from your MySQL or PostgreSQL database. We'll then work through some examples of how to use the journal for analytics and disaster recovery.

Level: Intermediate
Building an Intelligent Real-Time Notification System That Scales

Building a scalable notification system requires at least three parts: a queue to receive and persist incoming events; a fast analytics and decisioning platform to process those events and generate real-time decisions; and a scalable notifications service to deliver the derived messages. Notification systems often come into play with event streams, which are a common way to handle data that arrives quickly and needs to be processed and acted on intelligently in real time.

The approach is to look at arriving data as a stream of events, to process those events in context, and to react to exceptional inputs with notifications or alerts. One example would be generating an in-game notification if your high score is eclipsed.

This class shows a simple and straightforward solution to this common problem using standard Big Data tooling. We will examine, assemble and demonstrate (with working code) a real-time notification application using open-source, NewSQL and AWS technologies.
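
Here is a toy, framework-free sketch of the three moving parts described above (a queue of incoming events, a decisioning step with per-player context, and a notification step); the in-process queue and print statement stand in for the durable queue, NewSQL store and notification service the class actually uses, and the high-score scenario follows the in-game example mentioned earlier.

```python
# Toy sketch: queue of incoming events, a decisioning step that keeps per-player
# context (high scores), and a notification step. Everything in-process here;
# durable queueing, storage and delivery are left to real infrastructure.
import queue

events = queue.Queue()
high_scores = {}                      # player -> best score seen so far

def decide(event):
    """Return a notification message if the event is exceptional."""
    player, score = event["player"], event["score"]
    best = high_scores.get(player, 0)
    if score > best:
        high_scores[player] = score
        return "%s set a new high score: %d (previous %d)" % (player, score, best)
    return None

def notify(message):
    print("NOTIFY:", message)         # stand-in for push/email/SMS delivery

# Simulated event stream.
for e in [{"player": "alice", "score": 100},
          {"player": "alice", "score": 90},
          {"player": "alice", "score": 140},
          {"player": "bob",   "score": 250}]:
    events.put(e)

while not events.empty():
    message = decide(events.get())
    if message:
        notify(message)
```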

Note: You will need basic knowledge of work queues, SQL, distributed applications and AWS.

Level: Advanced
Cascading Lingual Shows How SQL Can Save Your On-and-Off Relationship with Hadoop

With Hadoop, life is complicated. It’s a give and take relationship where you constantly want to migrate workloads onto Hadoop for processing, but also want to get data off for reporting and analysis. These scenarios can be tricky and will often complicate your relationship with Hadoop; however, don’t worry - SQL is the therapy that will help.

SQL has been the critical centerpiece for describing data in the enterprise for nearly three decades. It speaks with relational databases and is the lifeblood of business intelligence (BI) tools. Lingual, an open-source extension to the Cascading application development framework, allows you to use familiar SQL to connect disparate data sources, whether on-premises or in the cloud, and leverage Hadoop for data processing.

This class introduces Cascading Lingual, an open-source project that provides ANSI-compatible SQL, enabling fast and simple Big Data application development on Hadoop. Enterprises that have invested millions of dollars in BI tools such as Pentaho, Jaspersoft and Cognos, and in training, can now access their data on Hadoop in a matter of hours rather than weeks. By leveraging the broad platform support of the Cascading application framework, Cascading Lingual lowers the barrier to enterprise application development on Hadoop. You will learn how to:
  • Utilize Cascading and Lingual for Big Data app development
  • Integrate Hadoop with heterogeneous data sources via a single SQL statement
  • Quickly integrate BI tools to Hadoop
  • Migrate costly SQL workloads onto Hadoop
  • Create scalable applications with ANSI SQL
Come and see how Cascading Lingual can be used with various tools like R and desktop applications to drive improved execution on Big Data strategies by leveraging existing in-house resources, skill sets and product investments. Data analysts, scientists and developers can now easily work with data stored on Hadoop using their favorite BI tool.

Level: Advanced
Debugging Hive with Hadoop in the Cloud

Anyone who has used Hadoop knows that jobs sometimes get stuck. Hadoop is powerful, and it’s experiencing a tremendous rate of innovation, but it also has many rough edges. As Hadoop practitioners, we all spend a lot of effort dealing with these rough edges in order to keep Hadoop and Hadoop jobs running well for our customers and organizations. In this class, we will look at a typical problem encountered by a Hadoop user, and discuss its implications for the future of Hadoop development. We will also go through the solution to this kind of problem using step-by-step instructions and the specific code we used to identify the issue.

One of the instructor's customers asked why his Hive job seemed to be stuck. The job had been running for more than a day, with 43 out of 44 mappers successfully completed. To figure out what was happening with the job, he used the Hadoop 2 Resource Manager user interface, which provides status and diagnostic information about jobs running on Hadoop. Stepping back from the details, Hadoop did provide the information required to understand the problem with this Hive job. On the negative side, the job kept running new map attempts after it hit an error that should have been terminal. To make matters worse, the relevant error message was hard to find – even after finding the relevant page five levels deep in the user interface.

As a community, we need to work together to improve this kind of experience for our industry. Now that Hadoop 2 has been shipped, we believe the Hadoop community will be able to focus its energies on rounding off rough edges like these, and this class will provide advanced users with some tools and strategies to identify issues with jobs and how to keep these running smoothly. Attend this class to:
  • Learn techniques for identifying the source of specific problems within Hive jobs
  • Have a broad discussion on the portions of Hadoop that can cause the most issues for data scientists
  • Showcase how to use the Hadoop 2 Resource Manager user interface to get the info you need to solve specific issues
  • Discuss next steps we must take as an industry in order to make Hadoop easier for everyone
Level: Advanced
Root Causes: Fixing Big Data Quality Problems Before They Start

Often, large changes in complex systems can be traced back to small, seemingly harmless incidents.  This is most apparent in Big Data, where inaccuracies and inconsistencies quickly grow into data quality problems that can threaten the success of your entire project.

Low-quality data leads to confusion, wasted resources, poor performance, and bad business value. The success of your project relies on strong governance practices, backed by a confident understanding of the root causes of your data quality problems.

In this class, we will discuss the consequences of poor data quality in your project, and will demonstrate how small tweaks to people, process and technology can lead to an agile approach to data management.

Level: Intermediate
Using Cooperative Analytic Processing to Unleash Your Analytics

The ability to do complex ad-hoc queries and analytics at scale is a Big Data dream that many are talking about, but very few are actually achieving. SMBs are stifled by budget, while companies that have invested in trendy new in-memory databases are settling for queries that take hours or sometimes days in order to get the answers they want from their data. This could hardly be considered "ad-hoc." It would be easy to assume that we simply don't have access to the technology to unlock this vision yet, but that would be incorrect.

Cooperative Analytic Processing is an analytics technique designed to make the most out of your entire Hadoop ecosystem, driven by the principle that the processing should be with the most appropriate tools available as close to the data as possible. Companies are successfully utilizing Cooperative Analytic Processing to get their analysts time on the system so that they can produce business value.

In this class, you will be given a summary of the underlying principles that drive Cooperative Analytic Processing, as well as a short overview of its successful implementations and benefits. You will then be shown in-depth examples of how to successfully move processing closer to the data in order to speed analytic queries in your own organization.

Level: Advanced

 


A BZ Media Production