Big Data TechCon
Classes
Wednesday, October 16
8:30 AM - 9:30 AM

Analytics Maturity Model

Every company is at a different stage in leveraging analytics to improve its operational efficiency and product offerings. In this class, you will learn an eight-stage analytics maturity model that companies can use to determine how far they are from the most analytical companies.

Level: Intermediate
Apache Cassandra--A Deep Dive

Recently, there has been some discussion about what Big Data is. The definition of Big Data continues to evolve. Along with variety, volume and velocity (which the usual suspects handle well), other facets have been introduced, namely complexity and distribution. Complexity and distribution are facets that require a different type of solution.

While you can manually shard your data (Oracle, MySQL) or extend the master-slave paradigm to handle data distribution, a modern big data solution should solve the problem of distribution in a straightforward and elegant manner without manual intervention or external sharding. Apache Cassandra was designed to solve the problem of data distribution. It remains the best database for low latency access to large volumes of data while still allowing for multi-region replication. We will discuss how Cassandra solves the problem of data distribution and availability at scale.
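
As a purely illustrative sketch (not part of the class materials), the snippet below uses the Python cassandra-driver to declare a keyspace replicated across two data centers; the node address, keyspace name and data-center names are assumptions for the example.

    from cassandra.cluster import Cluster

    # Connect to a local node (address is an assumption for illustration).
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # NetworkTopologyStrategy replicates each row to the named data centers,
    # which is how Cassandra supports multi-region deployments without sharding.
    # 'us_east' and 'eu_west' must match the data-center names your snitch reports.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 2}
    """)

    # Rows in this table are partitioned across nodes by primary key and
    # replicated automatically according to the keyspace definition above.
    session.execute("CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY, payload text)")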

This class will cover:
    •    Replication
    •    Data Partitioning
    •    Local Storage Model
    •    The Write Path
    •    The Read Path
    •    Multi-Datacenter Deployments
    •    Upcoming Features (1.2 and beyond)

To get the most benefit from this class, attend the "Getting Started with Cassandra" workshop.

Level: Intermediate
Extending Your Data Infrastructure with Hadoop

Hadoop provides significant value when integrated with an existing data infrastructure, but even among Hadoop experts there's still confusion about options for data integration and business intelligence with Hadoop. This class will help clear up the confusion.

You will learn:
    •    How can I use Hadoop to complement and extend my data infrastructure?
    •    How can Hadoop complement my data warehouse?
    •    What are the capabilities and limitations of available tools?
    •    How do I get data into and out of Hadoop?
    •    How can I use my existing data integration and business intelligence tools with Hadoop?
    •    How can I use Hadoop to make my ETL processing more scalable and agile?

We'll illustrate this with an end-to-end example data flow using open-source and commercial tools, showing how data can be imported into and exported from Hadoop, how ETL processing can be done in Hadoop, and how data in Hadoop can be reported on and visualized. You will also learn about recent advancements that make Hadoop an even more powerful platform for data processing and analysis.

Level: Intermediate
HBase Use Cases

The class will be an overview of the use cases for HBase. We will begin with an overview of HBase and its architecture, recapping key concepts such as column families, region servers and the master. From this point, we will move on to the use cases HBase is best suited to, such as high write loads with fast lookups, and the types of application that can be developed on HBase, such as time-series databases. We will also describe some of the use cases that HBase is not suited to, which helps avoid a mismatch between technology choice and solution requirements. This includes a discussion of why it isn't suitable for relational analytics or OLTP.

Finally, we will recap things to consider before embarking on an HBase implementation project, including design activities, deployment, tuning and administration. You will leave with a good overview of HBase and its use cases, which will be useful for making informed investigations of technology selection and for ensuring that, if HBase is chosen, it will satisfy business and technical requirements.

Level: Intermediate
Pattern: A Machine-Learning Library for Cascading, Migrating PMML Models to Hadoop

Pattern is an open-source project that takes models trained in popular analytics frameworks, such as SAS, R, SPSS, MicroStrategy, etc., and runs them at scale on Apache Hadoop. This machine-learning library works by translating PMML—an established XML standard for predictive model markup—into data workflows based on the Cascading API in Java.

PMML models can be run in a pre-defined JAR file with no coding required. PMML can also be combined with other flows based on ANSI SQL (Lingual), Scala (Scalding), Clojure (Cascalog), etc. Multiple companies have collaborated to implement parallelized algorithms: Random Forest, Logistic Regression, SVM, K-Means, Hierarchical Clustering, etc., with more machine-learning support being added. Benefits include greatly reduced development costs and lower licensing costs at scale, while leveraging a combination of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff.

Sample code in the class will show apps using predictive models built in R for anti-fraud classifiers. In addition, examples will show how to compare variations of models for large-scale customer experiments. Portions of this material come from the book "Enterprise Data Workflows with Cascading."

You will learn how to migrate predictive models to run on Hadoop clusters at scale, how to leverage PMML for customer experiments, and how the notion of "ensembles" has enhanced predictive power: the Netflix Prize, Kaggle, KDD, etc.

Level: Overview
Product Logs Ingestion into HDFS Using Spring Technologies

Hadoop is everywhere due to the core functionalities of Hadoop HDFS and Hadoop Map/Reduce. The ability to ingest millions of product logs into a Hadoop ecosystem helps us unlock the business value hidden in the log files. This class provides an overview of the architecture to ingest device data into Hadoop using Spring Batch and Spring Integration Technologies.

Level: Intermediate
11:00 AM - 12:00 PM

Graph Database Use Cases

Learn from existing open-source projects how to build proof-of-concept solutions and how to add a new tool to your development toolkit. Social Networks, Recommendation Engines, Personalization, Dating Sites, Job Boards, Permission Resolution and Access Control, Routing and Pathfinding, and Disambiguation are just a few of the use cases that lend themselves well to graph databases.

Level: Overview
HBase Schema Design Done Right

Schema design is an area that is often overlooked, yet it can play a critical role in overall application performance when working with HBase. This class is based on practical experience and lessons learned about the importance of breaking away from the traditional relational-model approach to schema design, and it is appropriate for anyone who is looking to improve their knowledge of HBase and of designing effective schemas.

We will cover the following:
  • Trade-offs between a flattened schema vs. storing complex structures in cells
  • The use of column families
  • Different key designs, including hashing, salting and composite keys
  • Secondary Indexing
It is assumed that you have a basic understanding of the fundamentals of HBase and Hadoop.
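
As a small, hypothetical illustration of the key designs listed above (not the class's own code), this Python snippet builds a salted, hashed composite row key; the field names and bucket count are invented.

    import hashlib

    def composite_key(user_id, timestamp_ms, buckets=16):
        """Build a salted, hashed composite HBase row key.

        A hash-derived salt bucket spreads sequential writes across regions,
        and a reversed timestamp keeps the newest rows first within a user's slice.
        """
        salt = int(hashlib.md5(user_id.encode('utf-8')).hexdigest(), 16) % buckets
        reverse_ts = (2 ** 63 - 1) - timestamp_ms
        return '{:02d}|{}|{:019d}'.format(salt, user_id, reverse_ts)

    print(composite_key('user42', 1381905000000))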

Level: Intermediate
Implementing a Simple Mongo Application

One simple app, three different technologies deployed locally to two different clouds. The application is executed, the source is examined, the approach is compared, the tools are demonstrated, and your questions are answered.

Level: Overview
In-Database Predictive Analytics

Predictive analytics have long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle to deal with exploding volumes of data not easily analyzed by small data tools, they are looking at ways of doing predictive analytics directly inside the primary data store.

This approach, called in-database predictive analytics, eliminates the need to sample data and perform a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of doing in-database predictive analytics, highlights of its limitations, and the tools and technologies necessary to head down the path.

Level: Advanced
Large-Scale, High-Accuracy Entity Extraction Made Easy

Big Data is a great opportunity to make smarter decisions. But it is also a great challenge, in particular where Big Data comes as huge collections of raw text, logs, tweets, etc. Entity and relation extraction are crucial components in turning such collections of unstructured text into more meaningful, “smart” data. There exists a plethora of commercial and open-source services or tools for extracting entities such as cities, company names, or prices from documents. Unfortunately, traditional services have suffered from a trifecta of challenges: low coverage, inconsistent accuracy, and complex, tool-specific APIs.

In this class, we will introduce a recent open-source API, ROSEAnn, which provides a simple, uniform interface for most of the existing extraction services and tools out there. We will walk through several scenarios for using ROSEAnn, from detecting mentions of a company to more complex cases combining the detection of several entity types. In addition to providing a uniform interface, ROSEAnn also allows you to easily “scale up” the accuracy and coverage of your entity extraction by a smart integration of an arbitrary number of extraction services. On entity types where the underlying services overlap, accuracy is improved (by reconciling the different results); where they don’t overlap, coverage is increased. At the end of this class, you will be able to deploy automatic entity and relation extraction easily, and make use of the integration features of ROSEAnn to achieve entity extraction with unparalleled coverage and accuracy.

Level: Advanced
The Hadoop Ecosystem: Putting the Pieces Together

Everybody's talking about Hadoop and Big Data, and a number of companies are undertaking efforts to explore how Hadoop can be applied to optimize their data-management and processing workflows, as well as address challenges with ever-growing data volumes. Unfortunately, there's still a lack of understanding of how Hadoop can be leveraged, not to mention how the tools in the Hadoop ecosystem can be used together to implement data-processing pipelines.

This class will seek to provide clarity by first discussing some typical real-world use cases for Hadoop that are allowing companies to address challenges and derive tangible value. We'll then dive deeper to discuss specific tools in the Hadoop ecosystem such as Hive, Pig, Oozie, Flume, Sqoop and Mahout. More importantly, we'll discuss some example architectures to understand how these tools can be used together to create processing pipelines that implement some of these use cases. Since Hadoop isn't a panacea, we'll also discuss criteria for determining when Hadoop is a suitable fit and when it isn't, as well as some suggestions for getting started with a Hadoop pilot project.

Level: Intermediate
Tools in the Wild - Tips and Tricks for Pig, Python and R in Data Science

This class will give practical tips and tricks for scripting in Pig, Python and R. It is aimed at practitioners who have entry-level knowledge of these languages, but want to learn the ropes for using them in real applications. The class will include extensive code walk-throughs, emphasizing both language-specific tricks and larger-scale patterns. Example topics include:
  • How and when to control parallelism in Pig
  • Using streaming to combine Pig with Python and R
  • Performance tips for core Python data structures
  • Architectural conventions we often use
Level: Intermediate
12:15 PM - 12:45 PM

Crowd Sourcing Reflective Intelligence

Search has quickly evolved from being an extension of the data warehouse to being a real-time decision-processing system. It is also increasingly being used to gather intelligence on multi-structured data, leveraging distributed platforms such as Hadoop in the background.

This class will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like Solr, making the system easy to build and deploy.

The class will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in real time.

This class is sponsored by LucidWorks.

Level: Intermediate
HP Vertica in the Big Data Ecosystem

With the rapidly growing excitement around Apache Hadoop and related open-source projects, the terms "Big Data" and Hadoop are becoming synonymous in the minds of many. The reality is that there are many technologies that fall under the heading of Big Data. Hadoop certainly is an important tool for addressing the Big Data challenge, but it is not the only tool in the toolbox.

HP Vertica is designed to address the demands of the ever-growing quantity of data and the need to make sense of it, and is therefore a centerpiece in the Big Data toolbox. Both technologies address parts of the Big Data challenge, and it's important to understand what each tool provides and how to use them together effectively. This class addresses the complementary roles that Hadoop and Vertica play, how they connect to each other, and how they can be used together to produce analytical results at the speed of business.

This class is sponsored by HP Vertica.

Level: Overview
The Need for Speed: Taking Big Data Real-Time

There is no denying the considerable value that Big Data can provide. Yet despite the hype, a recent Gartner survey found that only 42% of respondents had invested in Big Data technology or were planning to do so within a year. Behind this stunted adoption: costly, legacy solutions that have been too slow and inflexible to deliver on the promise of real-time Big Data analytics.
 
In-memory databases are emerging to solve the performance problem. With terabytes of addressable memory, in-memory solutions are allowing companies to seize the real-time opportunity by gaining insights quicker than their prehistoric counterparts. In-memory databases are providing faster and more predictable performance than disk – delivering unprecedented speed complemented by emerging technologies that combine structured, semi-structured and unstructured data into a unified interface.
 
In this class, we will discuss how in-memory technologies are changing the Big Data landscape, including customer anecdotes from organizations that have reaped the ROI of real-time Big Data analytics.

This class is sponsored by MemSQL.

Level: Overview
Understanding Analytic Workloads – Learn When to Turn the Tables on Your Big Data

"Big Data analytics" is a widely over-subscribed term, used by vendors with wildly different architectures to describe their technology. Matching the right tools to your analytic workloads can take the difficulty out of Big Data analytics.
 
Analytic Workloads vary based on some critical dimensions:
  • Type of Analysis: Does the analysis focus on records that describe single entities, or finding patterns between and within records?
  • Flexibility of Analysis: What are the flexibility/extensibility requirements? Do you have a handful of analysis patterns that are fixed, or are the patterns diverse, dynamic, and growing?
  • Schema: Reliance on schema on demand or pre-defined schema, or both? 
  • Scope of Analysis: Related to size of data.  How many records are referenced in support of some analytics operation?
Pure columnar storage combined with linear scalability when distributing work can deliver game-changing performance for Analytic Workloads. Learn where different Big Data technologies (including columnar) best fit to solve specific workload challenges in today’s Big Data landscape.

This class is sponsored by Calpont.

Level: Intermediate
Windows-Centric Data Management for the Hybrid Environment

Attend this class for a discussion and live demonstration of best practices for data management in the heterogeneous environment. Special attention will be paid to Microsoft Exchange, SharePoint, Active Directory, SQL, Lync and file data. CommVault will also cover E-Discovery and Search for compliance, as well as tiered storage practices with regard to DeDuplication and Azure Cloud Storage.

This class is sponsored by CommVault.

Level: Overview
1:45 PM - 2:45 PM

Analyzing Tweets with HBase, Parts I and II

This two-hour class will cover how to use the Twitter API to download and model tweets in HBase, and then run natural-language processing against them. We will first cover the architecture fundamentals of HBase, including log-structured merge trees, data models, memstores, HFiles and Bloom filters. Next, tweets will be populated into HBase. Finally, we will explore some of the more interesting analysis that can be done with the tweets and NLP. All code for this class is publicly released under a Creative Commons license.
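
For a feel of the population step, here is a minimal sketch using the Python happybase client (the class itself may use the Java API); the table name, column family, host and sample tweet are assumptions.

    import happybase

    # Connect to an HBase Thrift gateway (host is an assumption).
    connection = happybase.Connection('localhost')

    # Assumes a 'tweets' table with a 'd' column family already exists.
    table = connection.table('tweets')

    tweet = {'id': '123456789', 'user': 'bigdatatechcon', 'text': 'Hello, HBase!'}

    # Row key groups a user's tweets together; values are stored as bytes.
    row_key = ('%s|%s' % (tweet['user'], tweet['id'])).encode('utf-8')
    table.put(row_key, {
        b'd:user': tweet['user'].encode('utf-8'),
        b'd:text': tweet['text'].encode('utf-8'),
    })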

Level: Intermediate
Introduction to Apache Pig, Parts I & II

This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover:
  • What is Pig and why would I use it?
  • Understanding the basic concepts of data structures in Pig
  • Understanding the basic language constructs in Pig. We'll also create basic Pig scripts.
Prerequisites: This class will be taught in a Linux environment, using the Pig command-line interface (CLI). Please come prepared with the following:
• Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
• Basic experience connecting to an Amazon EC2/EMR cluster via SSH
• Windows users should have a knowledge of Cygwin and Putty
• A basic knowledge of Vi would be helpful but not necessary

Also, bring your laptop with the following software installed in advance:
• Putty (Windows only): You will log into a remote cluster for this class. Mac OS X and Linux environments include SSH (Secure Shell) support. Windows users will need to install Putty and download putty.zip in advance.
• A text editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) or Notepad++ (but not Notepad) are suitable.

Level: Overview
Running, Managing and Operating Hadoop at Sears

High ETL complexity and costs, data latency and redundancy, and batch window limits are just some of the IT challenges caused by traditional data warehouses. This class covers the use cases and technology that enable Sears to solve the problems of the traditional enterprise data warehouse approach. Learn how Sears significantly reduced data-architecture complexity with Hadoop, cutting time to insight by 30% to 70%, and discover "quick wins" such as mainframe MIPS reduction.
You will learn:
  • What is HDFS and MapReduce
  • Traditional Database vs Hadoop
  • Logical and Physical View of Hadoop
  • Hadoop as an enterprise data hub
Level: Overview
Understanding MongoDB: New Features Explored Through Code

This class is geared toward understanding some of the new features and enhancements that were released in MongoDB 2.4. We'll explore these concepts by building an application that uses text search, geospatial queries and the aggregation framework. The application will be built entirely in JavaScript utilizing Node.js and jQuery. However, the emphasis will be on MongoDB, so those who are not experts in JavaScript need not worry.
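
As a rough illustration of the kinds of queries involved, here is a sketch using PyMongo rather than the Node.js stack used in class; the collection names and sample data are invented.

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    db = client.demo

    # Geospatial query: find places near a point (requires a 2dsphere index).
    db.places.create_index([('loc', '2dsphere')])
    db.places.insert_one({'name': 'Cafe',
                          'loc': {'type': 'Point', 'coordinates': [-71.06, 42.36]}})
    nearby = db.places.find({'loc': {'$near': {
        '$geometry': {'type': 'Point', 'coordinates': [-71.05, 42.35]},
        '$maxDistance': 2000}}})

    # Aggregation framework: count check-ins per place, busiest first.
    per_place = db.checkins.aggregate([
        {'$group': {'_id': '$place', 'visits': {'$sum': 1}}},
        {'$sort': {'visits': -1}},
    ])

    print(list(nearby), list(per_place))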


Level: Intermediate
Untangling the Relationship Hairball with a Graph Database

Not only has data gotten bigger, it's gotten more connected. Make sense of it all and discover what these Big Data connections can tell you about your users and your business. Come to this class to learn some of the different use cases for graph databases, and how to spot the non-obvious opportunities in your data.

Level: Overview
Using Hadoop to Lower the Cost of Data Warehousing: The Paradigm Shift Underway

Data warehouses are bursting from increased data volume, and new sources of data are making traditional approaches to data analysis costly and slow. Typically, analysts define the problem, identify data samples and pull the data through an ETL (extract, transform and load) process. But now, Hadoop is changing the data-warehousing landscape by improving data archiving and lowering costs by offloading data warehouse processing. A Hadoop platform enables companies to easily scale as the volume, velocity and variety of data continue to increase, while providing even higher-quality results.

This class will cover the operational cost of deploying Hadoop relative to more traditional data-warehousing implementations. We will cover real-world customer use cases and demonstrate how dramatic cost savings (often an order of magnitude) were achieved through properly deployed Hadoop implementations.

Level: Overview
3:15 PM - 4:15 PM

Analyzing Tweets with HBase, Parts I and II

This two-hour class will cover how to use the Twitter API to download and model tweets in HBase, and then run natural-language processing against them. We will first cover the architecture fundamentals of HBase, including log-structured merge trees, data models, memstores, HFiles and Bloom filters. Next, tweets will be populated into HBase. Finally, we will explore some of the more interesting analysis that can be done with the tweets and NLP. All code for this class is publicly released under a Creative Commons license.

Level: Intermediate
Data Modeling and Relational Analysis in a NoSQL World

The new wave of NoSQL technology is built to provide the flexibility and scalability required by agile Web, mobile and enterprise applications. Interestingly, any system that supports chained Map/Reduce processing (specifically Map → Reduce → Map) fulfills the basic query requirements of a SQL engine. Therefore, we will work to help you bridge the gap between SQL, relational (big) data, and the brave new world of NoSQL.

In this class, you will learn how to model real-world relational data in a modern document database. We next go on to compile various SQL operations (SELECT, SUM, AVG, JOIN, etc.) into exceptionally simple Map/Reduce programs. We finish with a study demonstrating the performance, scalability and "time-to-value" benefits of this approach, specifically the pre-computation of materialized views. The class will be a mix of chalkboard and interactive demonstrations.
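
To make the idea concrete, here is a toy, framework-free Python sketch (our illustration, not the class's code) that expresses a SQL-style GROUP BY with SUM and AVG as a map step, a shuffle and a reduce step:

    from collections import defaultdict

    orders = [
        {'customer': 'alice', 'total': 30.0},
        {'customer': 'bob',   'total': 10.0},
        {'customer': 'alice', 'total': 50.0},
    ]

    # Map: emit (key, value) pairs, like SELECT customer, total FROM orders.
    mapped = [(o['customer'], o['total']) for o in orders]

    # Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: per-key aggregation, like GROUP BY customer with SUM and AVG.
    result = {k: {'sum': sum(v), 'avg': sum(v) / len(v)} for k, v in groups.items()}
    print(result)   # {'alice': {'sum': 80.0, 'avg': 40.0}, 'bob': {'sum': 10.0, 'avg': 10.0}}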

Prerequisites: Bring a laptop with a modern browser (Chrome, Safari or Firefox). Previous experience with basic scripting languages (e.g., JavaScript) is an advantage but not a requirement. All data and code samples will be provided at the beginning of the class.

Level: Advanced
Introduction to Apache Pig, Parts I & II

This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover:
  • What is Pig and why would I use it?
  • Understanding the basic concepts of data structures in Pig
  • Understanding the basic language constructs in Pig. We'll also create basic Pig scripts.
Prerequisites: This class will be taught in a Linux environment, using the Pig command-line interface (CLI). Please come prepared with the following:
• Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
• Basic experience connecting to an Amazon EC2/EMR cluster via SSH
• Windows users should have a knowledge of Cygwin and Putty
• A basic knowledge of Vi would be helpful but not necessary

Also, bring your laptop with the following software installed in advance:
• Putty (Windows only): You will log into a remote cluster for this class. Mac OS X and Linux environments include SSH (Secure Shell) support. Windows users will need to install Putty and download putty.zip in advance.
• A text editor: An editor suitable for editing source code, such as SQL queries. On Windows, WordPad (but not Word) or Notepad++ (but not Notepad) are suitable.

Level: Overview
Running Mission-Critical Applications on Hadoop

This class will look at what is involved when you move Hadoop from a lab environment to actual deployment in production. We will cover the critical enterprise-grade features like data integration, data protection, business continuity and high availability, and discuss the ways you can accomplish these in your environment. We will also identify potential stumbling blocks, examine what a platform can or can’t provide, and help determine the scope and level of customization necessary to make your deployment successful.

At the end of the class, you will better understand how to move Hadoop from the test bed to production deployment, what is involved in the process, and how to run a mission-critical Hadoop environment. Where appropriate, there will be real-world examples.

Level: Intermediate
Staying Alive: Ensuring Service Availability in Hadoop

All sorts of things can go wrong in your data center. Ensuring that your Hadoop systems stay up through various types of threats—from node failures to site failures—is vital toward meeting SLAs and ensuring a high quality of experience for clients of your production-level distributed system. We will discuss the various threat models that need to be handled and the elements of how to build highly available architectures for key system services.

In this class, we will shed light on the complexity of what it takes to keep different system services alive, starting from core Hadoop services of HDFS and Map/Reduce, and extending to higher-level applications such as Hive. This understanding will help you evaluate the risk profile and cost of providing different SLAs for your entire Hadoop system. In this class you will learn about:
  • Worker node versus master node failure characteristics and tolerance levels of various services in the Hadoop ecosystem
  • Vital components of each service as they pertain to availability (metadata, data, databases, ZooKeeper quorum, etc.)
  •    What it takes to set up backup nodes versus backup clusters

Level: Advanced
Storm: Real-Time Data Processing at a Massive Scale

Using stream processing and the Hadoop ecosystem, Sears is developing a platform to collect, process and publish data in real time. By integrating Storm into the enterprise data flow and deploying scalable topologies that source from multiple data sources, this platform makes it possible to process hundreds of thousands of messages per second. This processing is essential for many Sears real-time initiatives, including gamification, individualized customer offers, inventory management, sales metrics and fraud detection. Attend this class to learn:
  • An overview of real-time computation
  • Components and operational features of Storm
  • A proposed real-time processing platform
Level: Intermediate
Thursday, October 17
8:45 AM - 9:45 AM

Building an Impenetrable ZooKeeper

Apache ZooKeeper is a project that provides reliable and timely coordination of processes. Given the many cluster resources leveraged by distributed ZooKeeper, it’s frequently the first to notice issues affecting cluster health, which explains its moniker: “The canary in the Hadoop coal mine.”

Come to this class and you will learn:
    •    How to configure ZooKeeper reliably
    •    How to monitor ZooKeeper closely
    •    How to resolve ZooKeeper errors efficiently

Drawing on the diverse environments we’ve supported, we will share what it takes to set up an impenetrable ZooKeeper environment, which parts of your infrastructure specifically to monitor, and which ZooKeeper errors and alerts indicate something seriously amiss with your hardware, network or HBase configuration.
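
As one small, hedged example of monitoring ZooKeeper closely, the snippet below uses ZooKeeper's standard four-letter-word commands (ruok, mntr) over a plain socket; the host and port are assumptions, and some deployments restrict which of these commands are enabled.

    import socket

    def four_letter_word(cmd, host='127.0.0.1', port=2181):
        """Send a ZooKeeper four-letter-word command and return its reply."""
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(cmd.encode('ascii'))
            data = b''
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        return data.decode('utf-8', 'replace')

    print(four_letter_word('ruok'))   # a healthy server answers 'imok'
    print(four_letter_word('mntr'))   # key/value metrics such as latency and outstanding requests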

Level: Intermediate
Data Modeling for Chat Messages with Cassandra

Cassandra excels at many aspects of database design, such as fast ingestion, replication and distributed architecture. However, there are other database and data-modeling features at which Cassandra shines and which lend themselves to interesting use cases. This class will focus on some of these features, specifically automatic ordering and automatic grouping of related data elements.

In this hands-on class, we will explore using Cassandra for a chat-message storage and retrieval application. The class will set up a Cassandra instance in a VM, demonstrate schema design using the Cassandra CLI, and show interaction with it using a Java client. You will gain an appreciation for Cassandra, its application design in Java, and its data-modeling intricacies.

Prior experience in Java will be helpful; however, even without it, watching the demonstration will allow you to gain from the exercises.

Please note: You will need a laptop with VMware, CentOS and Eclipse installed.

Level: Intermediate
First Steps to Big Data from MySQL

MySQL is the ubiquitous database on the Web, and just about every organization has a copy running someplace. But how do you get the information in your MySQL instances into a Big Data store? This class covers basic data warehousing that can be done with the community edition, columnar storage engines, and finally moving data into Hadoop (more than 80% of all Hadoop sites feed data from MySQL). So if you need to plunge into deep data but need some guidance, please attend this class.

Level: Overview
Hadoop by Example

This class is designed to demonstrate the most commonly used Map/Reduce design patterns for various problems. Performance and scalability will be taken into consideration.

The class will present a general overview of the problems that can be solved using Map/Reduce, as well as scalability and performance tuning for clusters of different sizes. The techniques described here can be used on all Hadoop distributions.

The following technical problems will be covered:
• “Hello world!” of the Map/Reduce universe—a word count example
• Map-only Map/Reduce jobs and their usage for ETL-type jobs
• Global sorting techniques
• Sequence files and their usage in Map/Reduce jobs
• Map files and their usage in Map/Reduce jobs
• Reduce-side join and its advantages and limitations
• Map-side join and its advantages and limitations

Each technique will be provided with a code example that can be used as a template. No prior knowledge about the topic is required; however, some Java knowledge is recommended.
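
For readers who want a preview of the word-count pattern, here is an illustrative version written for Hadoop Streaming in Python; it is our own sketch, and the class's templates may well use the Java API instead.

    #!/usr/bin/env python
    # wordcount_streaming.py -- a word-count mapper/reducer for Hadoop Streaming.
    # Approximate invocation (exact flags vary by Hadoop version):
    #   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
    #     -mapper "python wordcount_streaming.py map" \
    #     -reducer "python wordcount_streaming.py reduce" \
    #     -file wordcount_streaming.py
    import sys

    def mapper():
        # Emit (word, 1) for every word on stdin.
        for line in sys.stdin:
            for word in line.strip().split():
                print('%s\t1' % word.lower())

    def reducer():
        # Input arrives sorted by key, so counts can be summed per run of equal words.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip('\n').split('\t')
            if word != current:
                if current is not None:
                    print('%s\t%d' % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print('%s\t%d' % (current, count))

    if __name__ == '__main__':
        mapper() if sys.argv[1] == 'map' else reducer()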

Level: Intermediate
Implementing Real-time Analytics with Hadoop, NoSQL and Storm

From financial services to digital advertising, gaming and retail, companies are pushing to grow revenue by interacting with consumers in real time, based on a 360-degree picture of what they care about, where they are, and what they are doing now. For growing numbers of these businesses, this means developing applications that combine the historical analysis provided by Hadoop with real-time analysis through Storm and within NoSQL databases themselves.
 
This class will examine the design considerations and development approaches for successfully delivering interactive applications based on real-time analysis using a combination of Hadoop, Storm and NoSQL. Key topics will include:
  • A quick review of the respective roles that Hadoop, Storm and NoSQL databases play in real-time analysis.
  • Considerations in choosing which technology to use in areas where their capabilities overlap.
  • Strategies for addressing the diverse data types required for providing a complete view of the customers.
  • Approaches to managing large data types to ensure reliable real-time responses.
Throughout the discussion, concepts will be illustrated by use cases of businesses that have implemented real-time applications using Hadoop, Storm and NoSQL, which are in production today.

For this advanced-level technical class, it is recommended that participants have a fundamental understanding of Hadoop, Storm and NoSQL.

Level: Advanced
Managing a World of Data: Geospatial Best Practices

Everything happens somewhere, and organizations are realizing how important a role location, surroundings and time play in decision-making, including:
  • Which driving route is least congested?
  • Where's the closest top-rated French restaurant?
  • Are field equipment malfunctions related to altitude or time of day?
  • What people will be in a particular location given their current activity?
The challenge is how to store, index and query the massive quantities of spatial and temporal data generated via mobile phones, cameras, computers, sensors and Internet-enabled appliances.
This class will present the best practices emerging from the field of geospatial data and the specialized database systems developed to manage it. We will cover:
  • Apps that inspire: novel ways geospatial data is being applied in government and industry
  • Geospatial data management standards such as GeoJSON and GML; which should you follow?
  • Geospatial basics: bounding box and proximity searches
  • Advanced geospatial operations, including storing complex geometries, temporal, and geo-metadata; performing bounding polygon and radius, intersections, and buffering; and best practices for scaling and partitioning geospatial data
  • Comparison of specific SQL and NoSQL geo-indexing libraries
Attend this class to learn how to best store, index, and query the spatial and temporal data generated by mobile devices, sensor networks, and Internet-enabled appliances. Fundamentals on geospatial data-management standards (GeoJSON, GML), bounding box and proximity searches, and the storage of geometries and geo-metadata will also be covered. Databases will include PostGIS and CouchDB (running on Cloudant).
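
To ground the bounding-box and proximity-search basics, here is a small self-contained Python sketch (ours, not the instructors') using GeoJSON-style points and a haversine distance; the sample points are invented.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lon1, lat1, lon2, lat2):
        """Great-circle distance between two (lon, lat) points in kilometers."""
        lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    # GeoJSON-style features: coordinates are [longitude, latitude].
    places = [
        {'name': 'cafe',   'geometry': {'type': 'Point', 'coordinates': [-71.06, 42.36]}},
        {'name': 'garage', 'geometry': {'type': 'Point', 'coordinates': [-71.20, 42.40]}},
    ]

    # Bounding-box search: a cheap rectangular filter.
    bbox = (-71.10, 42.30, -71.00, 42.40)  # (min_lon, min_lat, max_lon, max_lat)
    in_box = [p for p in places
              if bbox[0] <= p['geometry']['coordinates'][0] <= bbox[2]
              and bbox[1] <= p['geometry']['coordinates'][1] <= bbox[3]]

    # Proximity search: everything within 5 km of a query point.
    qlon, qlat = -71.05, 42.35
    nearby = [p for p in places
              if haversine_km(qlon, qlat, *p['geometry']['coordinates']) <= 5.0]

    print(in_box, nearby)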

Level: Overview
10:00 AM - 11:00 AM

A/B Testing in a Big Data Environment

Big Data has allowed organizations to make significantly better evidence-based decisions than ever before. However, with so much information, a new challenge for us is to rank all the possible actionable hypotheses. There is good news on this front because many organizations have shown that it is possible to introduce controlled (A/B) experiments into business intelligence processes.
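
As a reminder of the statistics underneath a controlled experiment, here is a small pure-Python sketch (illustrative only, with made-up numbers) that compares the conversion rates of two variants with a two-proportion z-test:

    from math import sqrt, erf

    def two_proportion_z(conv_a, n_a, conv_b, n_b):
        """Z statistic and two-sided p-value for the difference between two conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
        return z, p_value

    # Made-up experiment: variant B converts 1,150 of 20,000 users vs. 1,000 of 20,000 for A.
    z, p = two_proportion_z(1000, 20000, 1150, 20000)
    print('z=%.2f  p=%.4f' % (z, p))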

But enhancing your organization's business intelligence process maturity level to include A/B-based testing, particularly in a Big Data environment, requires new underlying technical and business capabilities – and cultural shifts. This class reviews the foundations of A/B testing and presents several case studies of successful and failed applications of A/B testing in a Big Data setting. This will help you introduce sound A/B testing into your data-driven organization. Attend this class to learn:
  • How other organizations are utilizing A/B testing
  •    Best practices for applying A/B testing
  • Where to place your organization in an A/B testing inclusive maturity model
  • Methods to introduce A/B testing into your environment
Level: Overview
How to Find Pairwise Dependencies in Your Data with Map/Reduce

Do your customers’ browser choices relate to the amount of money they spend in your online store? Are people who come to your site through Pinterest more likely to download your trial than those who come from Facebook? In this class, you will learn how to compute such correlations as a pairwise dependency value across all columns of your data, applying Map/Reduce on a data stream. Based on mutual information, this measure is derived from two-dimensional histograms, no matter whether the columns are numerical or categorical. The final result is a heat-map matrix that compares all columns with each other, visualizing their pairwise dependency values.
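
To make the measure concrete, here is a small NumPy sketch (our illustration, not the speaker's implementation) that estimates mutual information between two columns from a two-dimensional histogram; real data would be binned per column, whether numeric or categorical.

    import numpy as np

    def mutual_information(x, y, bins=10):
        """Estimate mutual information (in nats) from a 2-D histogram of two columns."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()                  # joint probability
        px = pxy.sum(axis=1, keepdims=True)        # marginal of x
        py = pxy.sum(axis=0, keepdims=True)        # marginal of y
        nz = pxy > 0                               # avoid log(0)
        return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

    rng = np.random.default_rng(0)
    spend = rng.normal(50, 10, 5000)
    time_on_site = 0.5 * spend + rng.normal(0, 5, 5000)   # correlated column
    noise = rng.normal(0, 1, 5000)                        # unrelated column

    print(mutual_information(spend, time_on_site))  # clearly above zero
    print(mutual_information(spend, noise))         # close to zero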

Level: Intermediate
Introduction to Parallel Iterative Machine-Learning Algorithms on Hadoop’s Next-Generation YARN Framework

Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop Map/Reduce. In this class, we will take a look at how we parallelize parameter estimation for linear models using IterativeReduce, a framework built on Hadoop's next-generation YARN, and the parallel machine-learning library Metronome.

Level: Advanced
MongoDB on AWS: Operational Best Practices

MongoDB and AWS are a powerful combination in terms of flexibility and scalability, but not always operationally trivial to maintain. At Parse, we developed an extensive set of best practices for minimizing the pain associated with provisioning and maintaining many MongoDB clusters in the cloud.

We will discuss tips and tools for capacity planning; fully scripted provisioning using Chef, Puppet and other systems-management tools; and snapshotting your data safely, as well as using replica sets for high availability across availability zones. We will also cover the good, the bad and the ugly of disk performance options on EC2, as well as several filesystem tricks for wringing more performance out of your block devices.

As your cluster grows, maintenance operations become ever more critical and sensitive to operator error. We will share some best practices for warming up secondaries, compacting collections and repairing databases, and effectively monitoring the health of your clusters. And we will talk about what not to do in the middle of a Mongo disaster, because it is very easy to make everything much worse for yourself. Instead, we will talk about some ways to protect yourself from disaster spirals and minimize your downtime.

The speaker works for Parse.

Level: Advanced
Proper Care and Feeding of HBase Coprocessors

Coprocessors are a relatively new feature within HBase. While they are capable of providing useful and powerful performance improvements, if they are not designed properly, they can have extremely detrimental effects. Like the Tribbles of “Star Trek” or the Gremlins, they demand extreme caution and good practices, or else bad things can happen...

This class provides an introduction to Coprocessors, focusing on the potential problems that can arise from poor design as well as issues with the current implementation. This class is geared towards the more experienced HBase users and will provide some common examples of how Coprocessors are being used today, as well as some potential future use cases.

Level: Intermediate
Querying Google’s BigQuery

Google uses Big Data, big compute and "big smarts" for a huge business advantage. With its BigQuery product, it aims to provide Google-scale data analytics on billion-row data sets to customers. The power of such a tool affords many businesses the ability to access powerful data analytics for the first time. In this class, hear about Google's plans for the product and the impacts and opportunities it will create in the marketplace, and learn effective ways to use BigQuery to your advantage.

Level: Overview
11:30 AM - 12:30 PM

Building Your Own Facebook Graph Search with Cypher and Neo4j

Learn how to create your own Facebook Graph Search, or how to build a similar system with your own company data. Also learn how to interpret natural language into a grammar and use it to build Cypher queries to retrieve data from a graph. Knowledge of Natural Language Processing is not required. The instructor works for Neo Technology, creators of the Neo4j Graph Database.
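
Purely as a flavor of what a Cypher query behind such a search might look like, here is a hypothetical sketch; the graph model, endpoint URL and credentials are assumptions, and the class itself works directly with Neo4j.

    import requests

    # A Cypher query approximating "friends of my friends who like hiking".
    cypher = """
    MATCH (me:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof:Person)-[:LIKES]->(i:Interest {name: 'hiking'})
    WHERE fof <> me
    RETURN DISTINCT fof.name AS suggestion
    """

    # Neo4j's HTTP transactional endpoint (URL, database name and auth are assumptions).
    resp = requests.post(
        'http://localhost:7474/db/neo4j/tx/commit',
        json={'statements': [{'statement': cypher, 'parameters': {'name': 'Alice'}}]},
        auth=('neo4j', 'password'),
    )
    print(resp.json())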

Level: Intermediate
Dive into Twitter's Observability Stack

Twitter's Observability stack collects, processes, monitors and visualizes over 170 million real-time time series from all service and system components. This class covers how the stack is built and scales to enable developers and reliability engineers to build fault-tolerant distributed services. In this class, you will learn what works and what doesn’t, from architecture to implementation, for both small and Twitter-scale monitoring systems.

We will cover the end-to-end architecture of the stack, starting from instrumenting services and applications, through the storage technologies for managing millions of time series, and finally to the analytic and monitoring tools that enable users to use data to build and maintain large-scale distributed systems. We will also share lessons on how to make many classes of users happy, including developers and operations staff, and on how to build and scale monitoring stacks for your own application, whether big or small.

Level: Intermediate
Real-Time Hadoop

We will start with a short introduction to different approaches to using Hadoop in the real-time environment, including real-time queries, streaming, and real-time data processing and delivery. Then we'll describe the most common use cases for real-time queries and products implementing these capabilities. We will also describe the role of streaming, common use cases, and products in the space.

The majority of the time will be dedicated to using HBase as a foundation for real-time data processing. We will describe several architectures for such an implementation, along with a high-level design and implementation for two examples: a system for storing and retrieving images, and the use of HBase as a back end for Lucene.

Level: Intermediate
Seven Deadly Hadoop Misconfigurations

Misconfigurations and bugs break more Hadoop clusters than anything else, and fixing misconfigurations is up to you!

Attend this session to learn how to get your Hadoop configuration right the first time. In some support contexts, a handful of common issues account for a large fraction of cases. That is not the case for Hadoop, where even the most common specific issues account for no more than 2% of support cases. Hadoop errors often show up far from where the misconfiguration was made, making it hard to know which log files to analyze. It pays to be proactive. Come to this class!

Level: Intermediate
Time + Space + Transaction: Multidimensional Geospatial Analysis with Hadoop

Most organizations do a good job of reporting on the past, some do a good job of estimating the future, and few have a total understanding of their environment, customers, or business based on multidimensional analytics that account for time, space and transactions. This class will tie these three areas of analytics together to teach attendees how to use commonly available information and tools to achieve deep insights.

The arrival of Big Data blurs the boundaries between reporting, software development and analytics. This deeply technical class will walk through moving beyond the most common sources of transactional reporting and temporal analytics to include geospatial dimensions that can unlock the true potential of our data. By examining CRM (transactional), Web logs (temporal) and IP Geolocation (spatial), this deep-dive will walk developers and data scientists through a real example of identifying trends across transactions, time and space to discover deeper insights and actionable intelligence.  

From data procurement, transformation, loading and analysis, to display and interpretation, this class will teach attendees how to dig deeper into their data by reaching across paradigms and using machine-learning algorithms and rich presentation platforms to explore and visualize Big Data. 

Level: Advanced
1:45 PM - 2:45 PM

Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

One of the biggest challenges for data engineers building real-world predictive applications with Big Data is the steep learning curve of multiple data-processing frameworks, learning algorithms and scalable programming.

In this class, you will get hands-on instruction for data engineers on adding predictive features, such as personalization, recommendation and content discovery, to your applications using Big Data. The class will begin with a brief overview of scalable machine learning for Big Data. You will then see demonstrations of open-source technologies such as Hadoop, Cascading, Scalding and PredictionIO with live sample code. A number of collaborative filtering algorithms will be explained.

You will also see the use of open-source, user-friendly control interfaces to evaluate, compare, select and deploy learning algorithms; tune hyperparameters of algorithms manually or automatically; and review the predictive model training status. By the end of the class, you will master the core concepts of Machine Learning and be able to apply scalable algorithms in a real software production environment.
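
As a pocket-sized reminder of what one collaborative-filtering approach does, here is a toy pure-Python sketch with invented ratings; it is far simpler than the scalable versions covered in class.

    from math import sqrt

    # user -> {item: rating}; invented data
    ratings = {
        'ann': {'hadoop-book': 5, 'pig-book': 4, 'ml-book': 1},
        'bob': {'hadoop-book': 4, 'pig-book': 5},
        'cat': {'ml-book': 5, 'pig-book': 2},
    }

    def cosine(u, v):
        """Cosine similarity between two users' rating dictionaries."""
        common = set(u) & set(v)
        if not common:
            return 0.0
        dot = sum(u[i] * v[i] for i in common)
        return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

    def recommend(user):
        """Score unseen items by the similarity-weighted ratings of other users."""
        scores = {}
        for other, theirs in ratings.items():
            if other == user:
                continue
            sim = cosine(ratings[user], theirs)
            for item, r in theirs.items():
                if item not in ratings[user]:
                    scores[item] = scores.get(item, 0.0) + sim * r
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(recommend('bob'))   # ranks the items bob has not rated yet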

Level: Advanced
Getting Started with R and Hadoop, Parts I & II

Increasingly viewed as the lingua franca of statistics, R is a natural choice for many data scientists seeking to perform Big Data analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will teach you options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, and we will test our installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis, and how to write mappers, reducers and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from the HDFS. We will explore the interaction of the Hadoop infrastructure with your code by tracing the input and output data for each step. Examples will include the canonical "word count" example, as well as the analysis of structured data from the airline industry.

No specific prerequisite knowledge is required, but a familiarity with R and Hadoop or Map/Reduce is helpful.

Level: Advanced
Hadoop Backup and Disaster Recovery 101

Any production-level implementation of Hadoop must have its data protected from threats. Threats to data integrity can be human-generated (malicious/unintentional) or site-level (power outage, flood, etc.). As soon as you start to identify these threats, it’s important  to develop a backup or disaster-recovery solution for Hadoop!

In this class, you will learn the unique considerations for Hadoop backup and disaster recovery, as well as how to navigate the common issues that arise when architects and developers look to protect the data.

We’ll cover:
•    How to model your backup/disaster-recovery solution, considering your threat model and specifics around data integrity, business continuity, and load balancing.
•    Best practices and recommendations, highlighting Hadoop in contrast to traditional SAN/DB systems; replication versus "teeing" models for ensuring DR; replication scheduling; Hive; HBase; managing bandwidth; monitoring replication; using one's secondary beyond replication; and a survey of existing tools and products that can be used for backup and DR

After taking this class, you should be able to explain to your organization the right way to effect a backup or data recovery solution for Hadoop.

Level: Intermediate
Intro to Machine Learning: A Crash Course, Parts I and II

This two-part, 120-minute class provides a crash-course introduction to Machine Learning. We'll start by defining the terminology, making comparisons with the related fields of statistical inference and optimization theory, and then review some history of ML, from early neural nets onward. We'll consider a process for feature engineering, with emphasis on using tools for data prep and visualization, plus how to grapple with dimensionality reduction.

The remainder of the class will be divided into three parts. Representation: a survey of useful algorithms, including probabilistic data structures, text analytics and NLP, plus issues to consider. Evaluation: distinguishing how some methods work better for given use cases, including issues of overfitting, bias, etc., and the use of quantitative measures. Optimization: methods for improving on a good thing, including how to move from graph theory to sparse matrices and ensemble models, plus a look at ML competition platforms. We'll conclude with suggestions for where to continue further studies.

Prerequisites: some familiarity with programming, probability, statistics, linear algebra, and calculus. We will be programming in R and Python, along with some bits of Hadoop and Spark.

Note: This class is part lecture and part hands-on; you are required to bring a laptop.

Level: Intermediate
Simple Yet Efficient Web Extraction with OXPath, Part I

Big Data has already changed how we make decisions, whether on pricing, recommendations or investment. However, access to such Big Data is often expensive or limited to large organizations that collect it. Though much is available on the Web, it is often only available through Web Forms and HTML pages.

In this two-part class, we give a thorough overview of large-scale data extraction from the Web, as well as its challenges. The first part gives an overview of existing tools and walks through real-life examples of manual wrappers. In the second part, we delve deeper into data extraction and discuss common patterns and fallacies when creating, maintaining, and running large-scale data-extraction systems.

In the first part, we start with an overview of traditional approaches, outlining their strengths and limitations to help you decide which tools are most appropriate for your needs. We walk through real-life examples of manual wrapper creation with XPath and WebDriver, the emerging W3C standard for programmatic browser control.
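
As a minimal example of such a manual wrapper (our sketch; the target URL and XPath expressions are placeholders), plain XPath extraction in Python might look like this:

    import requests
    from lxml import html

    # Fetch a result page and extract fields with plain XPath (a hand-written wrapper).
    page = requests.get('http://example.com/products?q=laptop')
    tree = html.fromstring(page.content)

    products = []
    for row in tree.xpath('//div[@class="product"]'):
        products.append({
            'title': row.xpath('string(.//h2/a)').strip(),
            'price': row.xpath('string(.//span[@class="price"])').strip(),
        })

    print(products)
    # Such wrappers break when the page structure changes and cannot click or fill
    # forms, which is the gap that WebDriver-based and OXPath-based approaches address.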

Finally, we show that extracting Big Data from the Web doesn’t have to be hard or costly. We will show you how to extract data with just a little knowledge of XPath. That’s all you need to get started with OXPath, a high-level, high-performance extension of XPath for efficient data extraction from any website. OXPath extends XPath with four well-defined extensions, including the ability to simulate user actions and to select elements of a Web page through their appearance. This allows for easy navigation through complex Web applications, and reduces maintenance in the face of structural page changes.

Level: Intermediate
Windows Azure HDInsight and PowerPivot: Cloud-Based Data Analysis with Familiar and Friendly Tools

This class will explore the features of Windows Azure HDInsight and Excel PowerPivot, a powerful combination of a cloud-based Big Data platform and an Excel front end that allows data scientists and analysts to explore semi-structured data in the familiar Excel tools that many are already experienced with. From provisioning and loading data, to structuring data for Hive queries and connecting to it with Excel’s ODBC driver for Hive, this session will be a hands-on walkthrough of leading Hadoop tools on the Windows platform. Also included will be data exploration using Excel tools, particularly PowerPivot, to visualize and explore data in a graphical Excel environment.

The cloud-based experience of Windows Azure HDInsight allows for a pay-as-you-go model for processing data on 100% Apache Hadoop-compatible clusters with zero configuration time. This class will be hands-on and requires Excel 2013 and access to an HDInsight (or other Apache Hadoop-compatible) installation, including HDInsight for Windows. Detailed software requirements will be sent with the presentation ahead of time, and you are expected to be familiar with SQL, Hadoop and Excel. Active participation is strongly encouraged but not required, as is working in pairs.

Level: Advanced
3:30 PM - 4:30 PM

Getting Started with R and Hadoop, Parts I & II

Increasingly viewed as the lingua franca of statistics, R is a natural choice for many data scientists seeking to perform Big Data analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will teach you options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, and we will test our installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis, and how to write mappers, reducers and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from the HDFS. We will explore the interaction of the Hadoop infrastructure with your code by tracing the input and output data for each step. Examples will include the canonical "word count" example, as well as the analysis of structured data from the airline industry.

No specific prerequisite knowledge is required, but a familiarity with R and Hadoop or Map/Reduce is helpful.

Level: Advanced
Hadoop Design Patterns

This class is designed to demonstrate how to solve the most common problems with Map/Reduce technologies and to optimize your Map/Reduce jobs to run efficiently on a given Hadoop cluster. The differences among types of joins will be described with real code examples. After this class, you will understand:
  • Different types of joins in Map/Reduce
  • ETL design patterns
  • Sort and secondary sort
  • Data-driven design patterns
Level: Intermediate
Intro to Machine Learning: A Crash Course, Parts I and II

This two-part, 120-minute class provides a crash-course introduction to Machine Learning. We'll start by defining the terminology, making comparisons with the related fields of statistical inference and optimization theory, and then review some history of ML, from early neural nets onward. We'll consider a process for feature engineering, with emphasis on using tools for data prep and visualization, plus how to grapple with dimensionality reduction.

The remainder of the class will be divided into three parts. Representation: a survey of useful algorithms, including probabilistic data structures, text analytics and NLP, plus issues to consider. Evaluation: distinguishing how some methods work better for given use cases, including issues of overfitting, bias, etc., and the use of quantitative measures. Optimization: methods for improving on a good thing, including how to move from graph theory to sparse matrices and ensemble models, plus a look at ML competition platforms. We'll conclude with suggestions for where to continue further studies.

Prerequisites: some familiarity with programming, probability, statistics, linear algebra, and calculus. We will be programming in R and Python, along with some bits of Hadoop and Spark.

Note: This class is part lecture and part hands-on; you are required to bring a laptop.

Level: Intermediate
Large-Scale Distributed Queries with Apache Drill

This class will discuss the architecture behind a full ANSI-SQL large-query solution for Big Data. Apache Drill, like Hadoop, was inspired by a Google white paper; it is architected to work with nested data structures such as JSON, and it can process queries against a variety of databases, including MongoDB, HBase and Oracle. The class will give an overview of the use cases and present the design of some of the critical architectural components.

Level: Advanced
Selecting the Right Big Data Tool for the Right Job, and Making It Work for You

This class will focus on the various types of Big Data solutions—from open-source to commercial solutions—and the specific selection criteria and profiles of each. As in all technology areas, each solution has its own sweet spots and challenges, whether in CAP-theorem trade-offs, ACID compliance, performance or scalability. This class will provide an overview of the technical trade-offs for these solutions in technical terms. Once the technical trade-offs are reviewed, we will compare the cost and value of open-source solutions versus commercial software, and the trade-offs involved in choosing one over the other.

The next part of the class will go into great detail on use cases for specific solutions based on real-world experience. All of the specific use cases have been seen first-hand with our customers. The solutions will be reviewed down to the level of specific technical architecture and deployment details. This class is intended for a highly technical audience and will not provide any high-level material on the solutions discussed. It is assumed you have working knowledge of solutions such as SQL, NoSQL, distributed file systems and time-series indexes. You should also have an understanding of the CAP theorem, ACID compliance and general performance characteristics of systems.

Level: Advanced
Simple Yet Efficient Web Extraction with OXPath, Part II

In the second part of this class, we will look at more complex wrappers, as well as the maintenance and management of a large-scale extraction infrastructure. We’ll walk you through several examples of how to create wrappers, driven by real use cases from finance and competitive pricing. For these examples, we’ll use the OXPath Firefox IDE, which allows for the development of OXPath wrappers using familiar Firefox developer tools. We will discuss how to make wrappers robust and maintainable through a small set of wrapper design patterns. OXPath’s open-source engine is able to deal with many of the issues that make Web scraping a pain, from buffer management to auto-complete fields.

However, we will also show the limits of the engine and how to deal with them. We will conclude the presentation with best practices for deploying and scheduling the resulting wrappers, e.g., for repeated extraction to keep extracted data up to date.

Level: Advanced
 
 
Sponsors and Exhibitors
PLATINUM
Calpont
CommVault
HP Vertica
Lucidworks
MemSQL
GOLD
Actian Corporation
Actuate
OdinText
SILVER
Syncsort
Think Big Analytics
OTHER EXHIBITING COMPANIES
Aerospike


 