
Big Data TechCon San Francisco Classes


 Tuesday, October 28

1:30pm - 2:45pm

Apache Spark for Event Stream Processing new

Dean Wampler

Apache Spark is the leading contender to replace MapReduce for batch-mode processing in Hadoop. Spark also provides a Streaming API for processing events in near-real time, as well as SQL abstractions, including a port of Hive called Shark, for writing concise queries. In this class, you’ll be introduced to Spark and see a demonstration that simulates an Internet of Things network of devices, with analysis of the events performed using Spark Streaming and Spark SQL. This will be a code-heavy talk, with many code samples in Scala and SQL, plus a few in Java and Python.
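
The demos in this class are in Scala; purely as a hedged illustration of the same ideas, a minimal PySpark Streaming sketch might count events per device over five-second micro-batches (the socket source, port, and "device_id,reading" line format are assumptions invented for this sketch, not taken from the class):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="IoTEventSketch")
    ssc = StreamingContext(sc, batchDuration=5)  # five-second micro-batches

    # Assume each line arriving on the socket looks like "device_id,reading".
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.map(lambda line: (line.split(",")[0], 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print each batch's per-device event counts

    ssc.start()
    ssc.awaitTermination()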

LEVEL: Overview

TRACK: Analysis, Hadoop


Big Data and the New Data Warehouse Paradigm new

Ken Krupa

For decades, the Enterprise Data Warehouse (EDW) was at the heart of data analytics and decision support. Built on relational technologies and a star schema model, the EDW was considered the gold standard for business intelligence. But given the data explosion of recent years, coupled with a huge diversification of content to analyze (from unstructured to structured and everything in between), enterprises are concluding that the traditional EDW alone is no longer suited to the task. Partially as a result, Big Data and NoSQL technologies have entered the marketplace, forever disrupting the status quo and producing newer reference architectures for the enterprise.

In this class, we will examine best practices around the use of Big Data technologies, especially NoSQL and Hadoop, and their respective roles in newer reference architectures for data analysis and information discovery.

What you’ll learn:

  • How various architecture patterns are being applied to data warehousing, such as data consolidation, metadata management, and data virtualization.
  • How unstructured data fits into the landscape for both real-time and batch analytics.
  • Alternatives to the traditional waterfall approach of model-ETL-query for EDW delivery.
  • How multi-variant data models play an important role in today's data-centered environment.

LEVEL: Intermediate

TRACK: Hadoop, NoSQL, Use Case


Building a Recommendation Engine with Neo4j new

Max De Marzi

Bring your laptop and favorite Java IDE to this class for a hands-on walk-through of how to build a recommendation engine using Neo4j. Get a quick introduction to the core concepts and then dive right into the Neo4j API. We'll make a mistake at first and then figure out how to find it and fix it. We'll test performance and talk about how to build more complex examples.
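
To give a feel for the kind of query involved (a sketch only, not the class’s code; the Person/LIKES data model and the Neo4j 2.x REST endpoint are illustrative assumptions), a simple collaborative-filtering recommendation in Cypher can be issued from Python:

    import requests

    # "People who liked what Alice liked also liked..." over a hypothetical LIKES graph.
    cypher = """
    MATCH (me:Person {name: {name}})-[:LIKES]->(item)<-[:LIKES]-(other)-[:LIKES]->(rec)
    WHERE NOT (me)-[:LIKES]->(rec)
    RETURN rec.name AS recommendation, count(*) AS score
    ORDER BY score DESC LIMIT 5
    """
    resp = requests.post(
        "http://localhost:7474/db/data/transaction/commit",  # Neo4j 2.x transactional endpoint
        json={"statements": [{"statement": cypher, "parameters": {"name": "Alice"}}]})
    print(resp.json())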

LEVEL: Intermediate

TRACK: NoSQL


The Architecture Behind Real-Time Big Data new

Shane K. Johnson

Apache Hadoop is the foundation of enterprise big data solutions. However, it is only one component, not the complete solution. A complete solution requires databases, message brokers, stream processors, and more. The flow of data starts at the client and ends with Apache Hadoop, but it passes through multiple systems along the way. The key to a real-time big data solution is architecture: it has to fill the big data/fast data gap, and it has to meet both operational and analytical requirements. This requires integrating multiple systems to enable a flow of data.
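
For instance, the client end of that flow is often a message broker; as a hedged sketch (the broker address, topic name, and event shape are assumptions for illustration), publishing an event into Apache Kafka with the kafka-python client looks like this:

    import json
    from kafka import KafkaProducer

    # A client-side event enters the pipeline here; downstream systems
    # (a stream processor, then Hadoop) consume it from the topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    producer.send("user-events", {"user": 42, "action": "click", "page": "/home"})
    producer.flush()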

In this class, we will discuss real-time big data architecture by evaluating the architecture of enterprise production deployments. These industry solutions range from messaging to finance to advertising. In addition, we will highlight several components of real-time big data solutions, including Apache Hadoop, Apache Storm, Apache Sqoop, Apache Kafka and Couchbase Server.

Finally, we will examine the requirements for extending a real-time big data solution to mobile devices and the Internet of Things.

LEVEL: Intermediate

TRACK: NoSQL


Predictive Analytics: Turning Big Data into Smart Data new

Todd Cioffi

Conquering the throughput, capture, and storage challenges of Big Data is only the beginning. Once you have a handle on the input, how do you turn that into actionable output? How do you change your mindset from reviewing your past to charting your future?

Simply put: How do you turn your Big Data into smart data?

In this class, we’ll begin by seeing that the critical step in using data analysis to find the right answers is to ask the right questions. We will explore the kinds of questions that people commonly ask, and propose a different style of question that focuses on what will happen instead of what has already happened. Using real-world examples of tracking customer churn and social media sentiment, you will learn how to use predictive analytics models to forecast which high-value customers you are most likely to lose, or which sentiments might gain momentum in the future. From there, we will see how that information can be used to best prepare resources and develop strategies to retain customers and drive positive sentiment.
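
The class does not prescribe a particular toolkit; as one hedged illustration of the churn idea, a minimal scikit-learn sketch (the features and numbers are invented) might look like this:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented features per customer: [months_active, support_tickets, monthly_spend]
    X = np.array([[24, 1, 80], [3, 6, 20], [36, 0, 120], [2, 9, 15]], dtype=float)
    y = np.array([0, 1, 0, 1])  # 1 = churned

    model = LogisticRegression().fit(X, y)

    # Rank current high-value customers by predicted churn probability.
    current = np.array([[12, 4, 95], [30, 0, 110]], dtype=float)
    print(model.predict_proba(current)[:, 1])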

We will also review tips and tricks that resolve common issues that can skew reports, visualizations, and recommendations, including false positives, false negatives, and outliers in the data. With the ever-increasing size of available data sets, issues like these present more of a challenge if not quickly identified and handled; catching them early allows for more accurate predictive models.

Note: Prior knowledge of some data analysis is helpful, but no experience in predictive modeling is assumed or required.

LEVEL: Intermediate

TRACK: Analysis


3:45pm - 5:00pm

Application Architectures with Hadoop: Putting the Pieces Together Through Example Most Popular

Mark Grover and Jonathan Seidman

Numerous companies are undertaking efforts to explore how Hadoop can be applied to optimize their data management and processing, as well as address challenges with ever-growing data volumes. Although there's currently a large amount of material on how to use Hadoop and related components in the Hadoop ecosystem, there's a scarcity of information on how to optimally tie those components together into complete applications, as well as how to integrate Hadoop with existing data management systems – a requirement to make Hadoop truly useful in enterprise environments.

This class will help developers and architects who are already familiar with Hadoop to put the pieces together through a real-world example architecture, implementing an end-to-end application using Hadoop and components in the Hadoop ecosystem. This example will be used to illustrate important topics such as:

  • Modeling data in Hadoop and selecting optimal storage formats for data stored in Hadoop
  • Moving data between Hadoop and external systems such as relational databases and logs
  • Accessing and processing data in Hadoop
  • Orchestrating and scheduling workflows on Hadoop

Throughout the example, best practices and considerations for architecting applications on Hadoop will be covered. This class will be valuable for developers who are already knowledgeable about Hadoop and are now looking for more insight into how it can be leveraged to implement real-world applications.

Note: You will need an understanding of Hadoop concepts and components in the Hadoop ecosystem, as well as an understanding of traditional data management systems (e.g. relational databases), and knowledge of programming languages and concepts.

LEVEL: Advanced

TRACK: Hadoop


Graph Analysis of Enterprise Data with Spark new

Patrick White

In this class, we will explain why large-scale, near-real-time graph analytics and graph search will be a game changer for companies. This will be a technical talk about graph analysis of enterprise data using Spark, and about how near-real-time graph analysis and graph search technology are being deployed in enterprise data settings.

In the last 10 years, advances in natural language processing, server-side computing, machine learning, and tools like Spark and Titan have paved the way for significant improvements in the quality of graph analysis and allowed for more accurate and relevant graph search results. And because of today’s massively scalable infrastructure platforms, API ubiquity, and graph analysis capabilities, the rapid gains in query understanding and information retrieval are having a huge impact on enterprise search.

One of the primary points of this class will be to expose you to graph analysis with a deep focus on Spark for enterprise data. This class will also highlight tools like Titan and how to use them to do massive scale graph analysis.

We will highlight just how vast the applications of graph search and real-time data analysis using Spark are, and will demonstrate the ways in which cutting-edge companies are already leveraging the techniques to build recommendation engines, content management systems, fraud detection and more with search-based applications of enterprise data.

You will learn:

  • The basics of graph analysis and graph DBs
  • How Spark can be leveraged to speed up processing and development time
  • How to set up Spark and write/run a basic graph analysis algorithm (see the sketch after this list)
  • How graph search is already being leveraged to build things like content management systems and fraud detection
  • The technical underpinnings of successful graph analysis and graph search, including the infrastructure, algorithms, and the data science that make it all possible

Note that you will need:

  • Basic knowledge of graph theory and graph analysis
  • Understanding of enterprise data and metadata
  • Basics of big data processing (map-reduce, etc.)
  • Knowledge of big data infrastructures and distributed systems
  • Basic understanding of NoSQL databases
  • Familiarity with tools like Spark and Titan
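
For a flavor of the hands-on portion, here is a minimal PySpark sketch of one classic graph algorithm, PageRank over a toy edge list (the data, damping factor, and iteration count are illustrative; the class itself may use different algorithms or Titan-backed graphs):

    from pyspark import SparkContext

    sc = SparkContext(appName="GraphSketch")

    # Toy directed edge list standing in for an enterprise data graph.
    edges = sc.parallelize([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
    links = edges.groupByKey().cache()      # node -> outgoing neighbors
    ranks = links.mapValues(lambda _: 1.0)  # start every node at rank 1.0

    for _ in range(10):
        contribs = links.join(ranks).flatMap(
            lambda nd: [(dst, nd[1][1] / len(nd[1][0])) for dst in nd[1][0]])
        ranks = (contribs.reduceByKey(lambda a, b: a + b)
                         .mapValues(lambda s: 0.15 + 0.85 * s))

    print(sorted(ranks.collect(), key=lambda kv: -kv[1]))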

LEVEL: Intermediate

TRACK: Analysis, Hadoop, NoSQL


Inside Google Cloud Dataflow new

Eric Schmidt

Cloud Dataflow is a fully managed Google Cloud service and programming model for building massively parallelized batch and stream-based data processing pipelines. It is based on a highly efficient and popular model used internally at Google, one that originated with MapReduce and then evolved into Flume and MillWheel. This class will review how Cloud Dataflow can be used to perform classic ETL, continuous computation, and real-time analysis over arbitrarily large datasets, covering concepts including:

  • Programming model, e.g., PCollections and PTransforms
  • Input and output sources including BigQuery, Cloud Pub/Sub and Cloud Storage
  • Optimizing pipeline performance
  • Deployment and management

Developers who are interested in learning about data pipelining, or who are currently using Apache Storm, Apache Spark, or Amazon Kinesis, should find this class useful in understanding the power of Cloud Dataflow.
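
Google later released this programming model as open source (Apache Beam); purely as a rough analogue of the PCollection/PTransform model described above, and not the Cloud Dataflow SDK the class targets, a minimal Beam Python pipeline looks like this:

    import apache_beam as beam

    # Count events per device over a small in-memory PCollection.
    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create(["device-1,42", "device-2,17", "device-1,9"])
         | "Key"    >> beam.Map(lambda line: line.split(",")[0])
         | "Count"  >> beam.combiners.Count.PerElement()
         | "Print"  >> beam.Map(print))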

LEVEL: Intermediate

TRACK: Hadoop


Semantic Technology in the Real World new

Stephen Buxton

Semantic Technologies such as RDF and SPARQL have become mainstream ways to manage and model graph-like relationships. Unlike generic graph databases, RDF triple stores treat the links between nodes as first-class citizens with their own unique identities. They also provide advanced capabilities such as inferencing and ontology management.
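
For a concrete feel of RDF and SPARQL, here is a self-contained sketch using the rdflib Python library (illustrative only; not necessarily the stack covered in class):

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
    @prefix ex: <http://example.org/> .
    ex:alice ex:knows ex:bob .
    ex:bob ex:knows ex:carol .
    """, format="turtle")

    # Friends-of-friends: a two-hop pattern over the triple store.
    q = """
    PREFIX ex: <http://example.org/>
    SELECT ?a ?c WHERE { ?a ex:knows ?b . ?b ex:knows ?c . }
    """
    for row in g.query(q):
        print(row.a, row.c)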

This class will explore the Semantic technology stack, including where and how to use it effectively as part of your information access solution, using real-world examples from several industries.

What you’ll learn:

  • The semantic technology stack and its components
  • Usage of semantic technology within different industries

LEVEL: Intermediate

TRACK: NoSQL


Techniques for Driving Truly Personal Information Experiences new

YY Lee

Personalization is becoming an assumed part of technology UX. Rapid advances—and increased expectations set by flagship consumer apps—create both a need and an opportunity for enterprise software to deliver analogous experiences on a per-person basis, in the context of traditional applications and workflows.

Many semantic and data science techniques focus on characterizing and understanding a target data set. To close the last-mile gap towards a high-impact information experience for each person, we need techniques to discern individual care-abouts, which present the familiar challenges of being heterogeneous, unstructured, overly sparse or noisy, and sometimes self-conflicting.

This class will cover data and analytics techniques for building user profiles, leveraging explicit vs. implicit factors, and the challenges of ambiguity—and how to discern what a user truly cares about in order to create a highly adaptive, individualized information experience.

Topics to be discussed include:

  • What factors can you leverage to build profiles and a nuanced understanding of users (implicit vs. explicit data, positive indicators vs. false positives, rationally derived characteristics, etc.)?
  • What information is easy (and what is hard) to get?
  • What can we do with sparse or changing information?
  • What techniques do you use to understand what a user cares about and create an adaptive experience for them?

LEVEL: Advanced

TRACK: Analysis


 Wednesday, October 29

8:30am - 9:45am

Advanced Implementation of Big Data Analytics with Graph Processing, Part I new

Stephen Brobst and Tom Fastner

There are a significant number of big data analytics opportunities where graph processing is an effective model of computation for problem solving. In this 150-minute class, we present a programming model for implementing graph algorithms and explain how the execution model works. We explain how bulk synchronous parallel (BSP) processing can be used to obtain extreme scalability on top of massively parallel processing architectures. We will also describe how to calculate graph metrics such as centrality, closeness, “betweenness,” and triangles, and show how to use these calculations for advanced analytics. Additionally, we will show best practices in community detection within graphs using max clique detection algorithms. Characterization of event diffusion using viral path analysis will be discussed, along with algorithms for predicting such phenomena. We will also provide example applications in the areas of social network analysis, product cross-selling, and fraud detection.

  • Learn about big data analytics with graph processing.
  • Learn about parallel processing with graph algorithms.
  • Learn about business applications of graph processing in areas of marketing and fraud.
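
The metrics themselves are easy to preview on a toy graph; here is a single-machine sketch with networkx (illustrative only; the class is about computing these at massively parallel scale):

    import networkx as nx

    # Toy undirected graph standing in for a massively parallel workload.
    G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

    print(nx.degree_centrality(G))
    print(nx.closeness_centrality(G))
    print(nx.betweenness_centrality(G))
    print(sum(nx.triangles(G).values()) // 3)  # 1 triangle: a-b-c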

LEVEL: Advanced

TRACK: Analysis, NoSQL


Application Development in an Eventually Consistent World new

Ben Coverston

Building high availability applications is an evolving art. The biggest impediment to high availability in recent history has been the database. With the advent of truly AP (in CAP terms) databases, high availability is not only possible, but a practical reality. With that problem solved, this class will focus on ways to achieve availability in the rest of the application stack. In other words, how do we build a resilient stack using both emerging and established technologies while still allowing for arbitrary failure, ease of management, and recovery? In this class, we will discuss real and emerging technology stacks, and talk about how applications like Spark and Cassandra are changing the way we use data in real-time and analytical applications.

LEVEL: Advanced

TRACK: Hadoop


Big Data Use Cases in Banking new

Masoud Charkhabi

The modern bank is not just a depot of deposits, loans, and investments, but also of data. This class will focus on Big Data use cases in banking at a technical level of detail that makes the work reproducible. A handful of tips, best practices, and common mistakes will be presented with each of the use cases below. Special attention will be given to how Big Data changes traditional approaches to these business problems. We will cover:

Text Mining customer feedback and employee memos
The majority of unstructured information in a bank comes from logged interactions between clients and employees. Merely structuring this data does not generate value, but integrating the unstructured data into established analytical work streams is extremely powerful. This class will focus on machine learning and regular expression methods to process text data.
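
As a tiny hedged example of the regular-expression side (the memo text and patterns are invented for illustration):

    import re

    memo = "Client called 2014-10-29 about mortgage rate; waive the $35.00 fee?"
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", memo)
    amounts = re.findall(r"\$\d+(?:\.\d{2})?", memo)
    print(dates, amounts)  # ['2014-10-29'] ['$35.00']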

Segmenting customers and locations
Merging customer segments with location segments is a must for banks that spread across diverse geographies. The good news is that Big Data has made more and more geospatial data available. Statistical learning and geospatial analytical methods are used in this use case.

Economic Linkages associated with segment specific pain points
Determining the financial costs and benefits through cross-analysis of segments and sentiment allows the insights generated from the previous two use cases to facilitate A/B testing and business case building. This use case merges statistical experiment methodology with accounting and project finance principles.

All examples will be with simulated data that resembles real-world scenarios. Examples will leverage various commonly used software packages; however, an appendix of alternatives will be presented. This class will focus on helping you find ways to generate a return on your big data investments. It is targeted at analytics and management consulting professionals, as well as data scientists and statisticians exposed to the banking industry.

LEVEL: Intermediate

TRACK: Analysis


Copious Data, the “Killer App” for Functional Programming new

Dean Wampler

The world of Copious Data (whether it’s Big or not) is currently dominated by Apache Hadoop, a clean-room version of the MapReduce computing model and a distributed file system invented at Google. But the MapReduce computing model is hard to program, suffers from performance issues, and it doesn’t support event-stream processing. Translating many straightforward algorithms to MapReduce requires specialized expertise. So, the industry is looking for a replacement. Apache Spark is the leading contender.

The very name MapReduce tells us its roots: the core concepts of mapping and reducing from Functional Programming (FP). We’ll see how to return data-centric computing to its ideal place, rooted in FP. We’ll discuss the “combinators” (core operations) of FP and their analogs in SQL, which is actually the most widespread FP language in use today. We’ll see how the combinators provide the right granularity for composable units of computation, and we’ll look at the trends that are already moving us in the right direction.
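
To make the analogy concrete, here is a small Python sketch of the three core combinators alongside the SQL each one mirrors (the toy records are invented):

    from functools import reduce

    # Toy records: (channel, clicks).
    records = [("ad", 3), ("search", 7), ("ad", 2)]

    mapped   = map(lambda kv: (kv[0], kv[1] * 2), records)       # SELECT k, n * 2
    filtered = filter(lambda kv: kv[0] == "ad", mapped)          # WHERE k = 'ad'
    total    = reduce(lambda acc, kv: acc + kv[1], filtered, 0)  # SUM(n * 2)

    print(total)  # 10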

LEVEL: Advanced

TRACK: Hadoop


How to build (and use) a Text Mining Platform new

Lutz Finger and Yongzheng Zhang

Today, more than ever before, we have access to raw data in the form of text. Businesses around the world store text from their market research, customer care discussions, and brand-relevant conversations on social media. While it is clear that texts contain valuable information, it is often less clear how best to analyze them at scale.

In this class, we will share how we at LinkedIn built a scalable text-mining platform to uncover insights from text data. We will focus on two important components: THEME DISCOVERY of new content and how to CLASSIFY existing text. Using both features, we can detect emerging trends within reviews, customer care discussions and market research data. You will learn:

THEME DISCOVERY - information extraction

Theme recognition is a highly complex task due to the multi-faceted nature of our language. Theme recognition (without requiring manual reviews) is, however, the main component of any text-mining platform. We will introduce an innovation in information extraction using part-of-speech tagging (currently patent pending) to uncover themes within textual data.

TEXT CLASSIFICATIONS - Supervised Machine Learning

Another important component of our NLP platform is the ability to classify text via supervised machine learning algorithms such as support vector machines (SVMs). The ability to classify serves many business use cases, ranging from sentiment analytics to product identification. You will learn how to cater to those different requirements via a flexible platform setup.
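
As a hedged, miniature analogue of that component (using scikit-learn rather than LinkedIn's platform; the corpus is invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Tiny labeled corpus; a production platform trains on far more data.
    texts = ["love this product", "terrible support experience",
             "great value for money", "awful bug ruined my day"]
    labels = ["pos", "neg", "pos", "neg"]

    vec = TfidfVectorizer()
    clf = LinearSVC().fit(vec.fit_transform(texts), labels)
    print(clf.predict(vec.transform(["great product, love it"])))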

VALUE of DATA - Member Feedback

The combined ability of Theme Discovery (new content and ideas) and Classification (standard measures) creates a very effective framework for getting business insights out of text data. We will demonstrate this with the use case of classifying and responding to member feedback.

LEVEL: Advanced

TRACK: Analysis, Use Case


Predictive Analytics at Scale: The Impact of Big Data new

Todd Cioffi

As Big Data and predictive analytics are each utilized more and more to drive better decision making, people will increasingly look to use them together. But do they collaborate or collide – and when? How does the scale of Big Data impact the predictive analytics algorithms often used at smaller scales?

In this class, we will discuss key components that factor into predictive analytics and reason through the influence of scale.  We will explore how predictive analytics works with a step-by-step walkthrough of a sample analytics process. We will drill down to discover the role of the learning algorithm in this process and review different types of algorithms. Drilling down further, we will reveal how a sampling of specific algorithms works, and learn why this point in the analytics process could be the most sensitive for Big Data – so choose wisely!

Depending on where audience participation and questions guide us, we might include any of these model generators: k-Nearest Neighbor, k-Means Clustering, Normalization, Naïve Bayes, Forward Selection, Backward Elimination, Support Vector Machines, or others.
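
One scale-sensitive example from that list: k-means can group the same rows differently depending on whether features are normalized first, because an un-normalized large-magnitude feature dominates the distance metric. A hedged scikit-learn sketch with invented data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Invented features: [visits, spend_in_cents]; spend has a much larger scale.
    X = np.array([[0, 0], [1, 50000], [100, 25000], [101, 75000]], dtype=float)

    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(km.fit_predict(X))                                  # raw: spend dominates
    print(km.fit_predict(StandardScaler().fit_transform(X)))  # scaled: visits matter too
    # Note that the two runs partition the rows differently.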

Note: Prior knowledge of some data analysis is helpful, but no experience in predictive modeling is assumed or required.

LEVEL: Intermediate

TRACK: Analysis


10:00am - 11:15am

Advanced Implementation of Big Data Analytics with Graph Processing, Part II new

Stephen Brobst and Tom Fastner

There are a significant number of big data analytics opportunities where graph processing is an effective model of computation for problem solving. In this 150-minute class, we present a programming model for implementing graph algorithms and explain how the execution model works. We explain how bulk synchronous parallel (BSP) processing can be used to obtain extreme scalability on top of massively parallel processing architectures. We will also describe how to calculate graph metrics such as centrality, closeness, “betweenness,” and triangles, and show how to use these calculations for advanced analytics. Additionally, we will show best practices in community detection within graphs using max clique detection algorithms. Characterization of event diffusion using viral path analysis will be discussed, along with algorithms for predicting such phenomena. We will also provide example applications in the areas of social network analysis, product cross-selling, and fraud detection.

  • Learn about big data analytics with graph processing.
  • Learn about parallel processing with graph algorithms.
  • Learn about business applications of graph processing in areas of marketing and fraud.

LEVEL: Advanced

TRACK: Analysis, NoSQL


Data Aggregation at Scale Using Apache Flume new

Arvind Prabhakar

This class will first introduce you to Flume at a high level and then take you through sizing and capacity planning for your deployment. We will specifically discuss a converging flow configuration, aka fan-in flow, and highlight how you can work out the topology details and configuration details, and evaluate the hardware capacity needed to meet your specific requirements.

At the end of this class, you will have a clear understanding of how to plan your Flume deployment in a manner that enables you to address all your current and projected needs for the immediate future. You will also develop a deeper understanding of Flume as a distributed system and be able to build on this knowledge to plan bigger topologies or modify existing topologies where necessary.

LEVEL: Overview

TRACK: Hadoop


Deep Machine Reading: Taming Unstructured, Natural Language Data new

Naveen Ashish

This class will cover recent advances in making sense of what comprises 95% of the ocean of big data today: unstructured data! The primary focus will be on our work centered on Cognie, an R&D platform for developing deep text analytics engines. It has been developed using state-of-the-art technologies in Natural Language Processing (NLP), Machine Learning, and Information Semantics.

Text analytics engines developed using Cognie have pushed the state of the art in information distillation from text across a number of features. These include the accurate distillation of Expressions: abstract notions like endorsements, advice, or personal experience that people express implicitly in text; Associations, where we not only identify individual concepts in the text but also establish meaningful associations across extracted instances of such concepts; Custom Signal Distillation, where we extract and classify highly domain- and application-specific signals from text; and Application-Oriented Topic Modeling, where we overlay and "normalize" text in preparation for topic modeling to achieve more meaningful mining of topics from the text.

The class will comprise:

Core Cognie technology
Here we will cover a technical description of how the core Cognie platform is realized. This will include a description of key technical algorithms for features such as expression distillation, association establishment, and topic modeling. We will also discuss how we have leveraged recent technologies, such as "deep learning"-based sentiment classifiers, in Cognie's information distillation tasks.

Applications
In the second part, we will cover case studies of applications developed using text analytics modules and engines built with Cognie. These applications span key verticals including retail and survey analytics, risk assessment, and consumer health analytics. They also cover a varied gamut of unstructured data, including social media conversations, abstracts of (medical) research papers, and consumer reviews.

LEVEL: Advanced

TRACK: Analysis, Use Case


Implementing Real-Time Analytics with Apache Storm Most Popular

Brian Bulkowski

Companies are pushing to grow revenue by personalizing customer experiences in real time based on what they care about, where they are, and what they are doing now. Hadoop systems provide valuable batch analytics, but applications also require a component that can analyze Web cookies, mobile device location, and other up-to-the-moment user data. Increasingly, Apache Storm is the component that developers are using.

This class will discuss the design considerations for using the Apache Storm distributed real-time computation system, as well as how to install and use it. Key topics will include:

  • A technical overview of Storm for real-time analysis of streaming data.
  • A solution architecture overview of how Storm’s role in real-time analysis complements Hadoop and NoSQL.
  • Examples of common use cases, including trending words, recommendations, and simple fraud counts (see the sketch below).
  • Technical requirements for implementing Storm.
  • Live demonstrations of how to install and run Storm.

Throughout the discussion, concepts will be illustrated by examples of businesses that have implemented real-time applications using Apache Storm.

For this advanced-level technical class, it is recommended that you have a fundamental understanding of Hadoop and NoSQL since it will provide a contextual baseline for how Storm complements these technologies in providing a 360-degree view of users.
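
As a single-process analogue of the trending-words use case (a sketch of the counting logic one Storm bolt might perform; Storm itself distributes this across a topology):

    from collections import Counter, deque

    window = deque(maxlen=1000)  # the last 1,000 words seen on the stream
    counts = Counter()

    def on_word(word):
        if len(window) == window.maxlen:
            counts[window[0]] -= 1  # evict the word about to fall out
        window.append(word)
        counts[word] += 1

    for w in "storm spark storm kafka storm".split():
        on_word(w)

    print(counts.most_common(2))  # [('storm', 3), ('spark', 1)]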

LEVEL: Advanced

TRACK: Analysis, Hadoop


Hadoop Puzzlers new

Aaron T. Meyers and Daniel Templeton

The intention of this class is to take a light-hearted look at a very technical topic. We will dissect a series of code samples that look innocuous, but whose behavior is anything but obvious. After presenting the code and explaining the apparent function, we will present a multiple choice list of possible results and ask you to vote for the right answer by a show of hands. After the audience has ruled, we'll reveal the actual behavior, talk through why it happens, and draw from the example lessons that can be put to practical use. The target audience is Hadoop developers who have at least a basic understanding of HDFS, MapReduce, and how to develop Hadoop jobs. You will learn a series of best practices that will hopefully save you hours of debugging time and frustration further down the road.

LEVEL: Intermediate

TRACK: Hadoop


11:45am - 1:00pm

Increase Hadoop Utilization With Your Data Warehouse new

Supreet Oberoi

The Hadoop ecosystem has already proven to be an effective platform for conducting cleansing, validation, enrichment, and other data engineering tasks over large volumes of data. However, mission-critical and business intelligence applications continue to run on traditional data warehouse platforms. In such enterprise use cases, Hadoop is emerging as a technology that augments existing platforms for processing data-intensive tasks, requiring you not only to constantly migrate workloads onto Hadoop for processing, but also to move data off for reporting and analysis.

Lingual is an open-source extension to the popular Cascading application development framework for Big Data applications. It allows ETL and other developers to use familiar SQL constructs not only to connect to and extract data from your data warehouses, whether on-premise or in the cloud, but also to leverage Hadoop for data processing without requiring any knowledge of MapReduce concepts. By using Lingual, enterprise architects can extend the life of their existing IT systems by conducting the computation-intensive tasks of their applications using their existing SQL-based skill sets.

This class will show how to address some common scenarios by leveraging SQL to improve your utilization of Hadoop with your other data sources. You will learn how to:

  • Utilize Cascading and Lingual for Big Data app development.
  • Integrate Hadoop with an enterprise data warehouse via a single SQL statement.
  • Quickly integrate BI tools with Hadoop.
  • Migrate costly SQL workloads onto Hadoop.
  • Create scalable applications with ANSI SQL.

LEVEL: Advanced

TRACK: Hadoop


Industrial Internet Using Big Data  new

Diwakar Kasibhotla and Samir Lad

The Industrial Internet is igniting the next big data revolution. Airline engines are generating millions of data points per second, power generators are streaming sensor data and communicating with each other, and hospital equipment is sending patient data to medical specialists for immediate response.

We at GE are building the Industrial Internet by processing all this machine sensor data on a big data platform. In this class, we will go over how the big data infrastructure was set up and will introduce the big data technology stack being used at GE.

From there, we will dive deep into how data is loaded into the big data platform. The class will also describe some of the use cases in which the captured sensor data is consumed by users and machines to make smart decisions.

LEVEL: Intermediate

TRACK: Use Case


Lessons Learned and Best Practices for Running Hadoop in AWS  new

Amandeep Khurana

Enterprises are starting to deploy large-scale Hadoop clusters to extract value out of the data that they are generating. These clusters often span hundreds of nodes. To speed up the time to value, a lot of the newer deployments are happening in AWS, moving from the traditional on-premise bare metal world. In this class, you'll hear lessons learned and best practices for deploying multi-tenant Hadoop clusters in AWS. We'll cover what reference deployments look like, what services are relevant for Hadoop deployments, network configurations, and instance types as well as backup, disaster recovery, and security considerations. We’ll also talk about what works well, what doesn’t and what we have to do going forward to improve the operability of Hadoop in AWS.

LEVEL: Intermediate

TRACK: Hadoop


Moving Cassandra from Bare Metal to the Cloud with OSv new

Don Marti

NoSQL data stores such as Cassandra are memory- and CPU-intensive workloads that can have unacceptable performance when run under virtualization in production environments. The OSv open-source project has profiled some of the important bottlenecks for NoSQL applications and designed a new open-source cloud environment to improve performance in key areas such as JVM memory allocation and network throughput. A side effect of this performance work has been to move some difficult tuning tasks off the administrator's to-do list and into the guest OS, where they can be handled automatically. You will learn to connect cloud deployment to your existing management processes, create entire VM images as small as 12MB to facilitate rapid deployment, expand your options for where Cassandra can run, improve cost-effectiveness, and minimize administrative overhead. The talk will cover running on Amazon EC2, on Google Compute Engine, and on private clouds.

NOTE: You should have a working knowledge of NoSQL concepts and Cassandra. To try the technology, you will need basic skills with a hypervisor such as VMware, VirtualBox, or KVM.

LEVEL: Advanced

TRACK: NoSQL


Reducing Cost in Big Data Using Statistics and In-Memory Technology new

Mayur Rustagi

The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from the hands of IT to the developers. As your data sizes start to rise, computing cost grows linearly with them. This class will show you how improving computation speed, using statistical techniques and the in-memory technology Apache Spark, helped us cut one customer's cost from $1000/TB down to $100/TB on the cloud. We will also do a hands-on demo of how several statistical techniques, such as HyperLogLog, Count-Min Sketch, and Bloom filters, can be applied to solve everyday problems and save as much as 10x in cost and machines on your existing workloads.
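
Of the three, the Bloom filter is the simplest to sketch; here is a toy pure-Python version (parameters are illustrative, not tuned for real workloads):

    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, hashes=3):
            self.size, self.hashes = size, hashes
            self.bits = bytearray(size)  # one byte per bit position, for clarity

        def _positions(self, item):
            for i in range(self.hashes):
                digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p] = 1

        def __contains__(self, item):  # false positives possible, false negatives not
            return all(self.bits[p] for p in self._positions(item))

    bf = BloomFilter()
    bf.add("user-42")
    print("user-42" in bf, "user-99" in bf)  # True, (almost certainly) False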

LEVEL: Advanced

TRACK: Analysis


2:00pm - 3:15pm

Optimizing Your Big Data Ecosystem, Part I new

Stephen Brobst and Tom Fastner

Big Data exploitation has the potential to revolutionize the analytic value proposition for organizations that are able to successfully harness these capabilities. However, the architectural components necessary for success in Big Data analytics are different than those used in traditional data warehousing. This 150-minute class will provide a framework for Big Data exploitation along with recommendations for architectural deployment of Big Data solutions.

We will examine three critical technologies necessary for full exploitation of Big Data: multi-temperature data management, polymorphic file systems, and late binding. We will explore why these technologies are important and give examples of effective use for each. We will also discuss tools and frameworks such as Hadoop 2.0 and SQL++, along with approaches to successfully deploy advanced technologies in this arena. The importance of the late binding paradigm will be described, along with implementation techniques for realizing the advantages of these capabilities. We will explain which use cases are appropriate for which components of a Big Data ecosystem, and how to architect interoperability among the various technology platforms used in a best-practices implementation. Our approach emphasizes the ability to quickly take advantage of emerging Big Data technologies with maximum leverage of existing skill sets and assets within an enterprise.

  • Learn about the critical components of a Big Data ecosystem.
  • Learn about the use cases that are most appropriate for Big Data solutions.
  • Learn how to architect an optimized ecosystem for Big Data exploitation.

LEVEL: Intermediate

TRACK: Analysis, Hadoop, Use Case


Privilege Isolation in Docker Containers new

Dinesh Subhraveti

While still in the early stages of adoption, containers represent what many consider to be the next generation of virtualization. Containers are a fundamentally different form of virtualization, designed to virtualize applications directly rather than the operating system. The absence of a guest OS layer makes containers extremely lightweight, leading to almost imperceptible runtime overhead and startup latencies, orders-of-magnitude higher scalability, and simplified management. These attributes make containers ideally suited to provide the isolation required to support a diverse ecosystem of applications on Hadoop/YARN.

Docker is an ambitious program with a charter to make the container primitives of the kernel trivially accessible to end users. Docker achieves the goal in part through a highly intuitive user interface that hides the complexity of kernel configuration by choosing the most appropriate defaults. It also provides a community-curated repository of self-contained application images that can be portably run on any host, regardless of its underlying state and configuration.

One of the current shortcomings of Docker containers is that the root user within these virtualized environments automatically acquires root privileges on the host system. This challenge has been a stumbling block for use of Docker as the container substrate in YARN. A major step towards integrating Docker into YARN is the support for "user namespaces" feature in Docker, which prevents containerized applications from compromising the security of the host or other containers by isolating the scope of their privilege to the container in which they run. In addition to providing an introduction to Containers and how they are used by YARN, we will discuss the design of the “user namespaces” feature along with several other Docker and YARN features required to make them a perfect mutual fit.

LEVEL: Intermediate

TRACK: Hadoop


Sync is the Future of Mobile (Big) Data new

Chris Anderson

Mobile devices are the preferred means of data access today, but databases are stuck in the mainframe era. Learn how the NoSQL document model can be leveraged for off-line synchronization.

Which data structures provide robust conflict detection and management for multiuser interactive applications? How can your database insulate you from error-prone mobile networks? What performance lessons can we learn from the Web world?

See open-source examples so that you can quickly get up to speed with off-line capable synchronizing applications for major mobile platforms. Learn how you can contribute to the open source databases behind this movement.

You will walk away from the class with an understanding of concrete steps you can take in your development environment to build faster, more reliable, and more user-friendly mobile applications.

LEVEL: Advanced

TRACK: NoSQL


The Hitchhiker's Guide to Machine Learning with Python and Apache Spark, Part I new Hands On

Krishna Sankar

The Apache Spark framework is gaining popularity as a versatile Big Data fabric across varying levels of the data pipeline, covering data management (collect-store-transform) and data science (model, reason, machine learning). The Python layer (PySpark) is particularly interesting because it enables data scientists and data engineers to explore models and algorithms at scale, and to deploy interesting data products to production using the same technologies, without any scale limitations.

In this hands-on two-part class, we will explore the class of Machine Learning problems and associated datasets that are solvable with Python and Spark. We will cover data wrangling with algorithms, including regression, classification, recommendation, tf/idf and clustering as well as model evaluation primitives.

NOTE: You should have a local installation of Spark 1.1 and Python. The slides, notes, and datasets will be available in a GitHub repository.
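
For a taste of the hands-on material, here is a minimal classification example against the MLlib API that shipped with Spark 1.x (the toy data is invented; the class's own notebooks live in its repository):

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="MLlibSketch")

    # Invented two-feature data: label 1.0 when the first feature dominates.
    points = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(0.0, [0.1, 0.8]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [0.9, 0.1]),
    ])

    model = LogisticRegressionWithSGD.train(points, iterations=100)
    print(model.predict([1.0, 0.0]))  # expect 1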

LEVEL: Advanced

TRACK: Analysis, Hadoop


3:45pm - 5:00pm

Asynchronous Full-Stack Big Data: A How-To  new

Brian Bulkowski

As developers move to create applications that engage in real-time interactions, they are increasingly adopting an open-source technology stack, including Node.js, a JavaScript MVC framework, asynchronous messaging, and an in-memory NoSQL database.

The class will begin with a brief introduction to next-generation application stacks, including the new ASEAN (/a-shawn/) stack, and explore best practices for achieving better time to market, high reliability, and a compelling, interactive experience. It will then explore asynchronous back-end programming techniques, mobile-first designs and data models, and two-way networking, as well as the importance of low-latency NoSQL.

The class will illustrate the concepts presented by walking through a Web application that has been built using AngularJS, Express, Socket.io, and Node.js on top of the Aerospike database.

Developers who are interested in learning how to quickly get started writing Web applications using the new ASEAN stack are encouraged to attend.

NOTE: The speaker works for Aerospike; however, the lessons learned in this class can be applied to other in-memory NoSQL databases.

LEVEL: Intermediate

TRACK: Analysis, Hadoop, Use Case


Discovering Novel Segments in Big Data with Statistical Learning new

Masoud Charkhabi

Big Data only amplifies the value of sophisticated segmentation. In order to take advantage of the volume and variety of big data sources, segmentation methods on big data must evolve well beyond classical approaches. In this class, we will cover:

Critical data preparation steps
Segmentation algorithms are very sensitive to the format of the data. This class will cover the advantages and disadvantages of applying various Clustering and Dimension Reduction algorithms to specific data. These views come from nearly a decade of professional hands-on experience with mountains of data and a variety of business problems.

Segment diagnostics
Determining the optimal number of segments is a problem of the past. With modern software, the creation of layers and topologies of segments is more relevant. Analysts should be able to dynamically adjust the number of segments, understand the relationships between segments, and discover the appearance of new segments as time passes. Data visualization techniques and methods to interpret segments, including presentation templates, will be covered.

Reverse Engineering
In real-world scenarios, full authority is rarely granted to the analytics professional. At times, setting aside a few strategic segments is a must. Other times, segmentation has a narrow objective to match predetermined alternatives to various segments. We will cover techniques to prune existing segments with statistical learning tools in the first case and to reverse engineer actions to segments using collaborative filtering in the latter.

All examples will use simulated data that resembles real-world scenarios. The examples will leverage commonly used software packages and an appendix of alternatives will also be presented. This class will focus on helping you find ways to generate a return on your big data investments. It is targeted at analytics and marketing professionals as well as data scientists and statisticians.

LEVEL: Advanced

TRACK: Analysis, Use Case


Future-Proof Your Big Data Investments With Cascading new

Supreet Oberoi

As enterprises engage with big data technologies to develop applications intended to meet strict operational demands, new computation fabrics are continually being introduced into the market. These new fabrics, while providing distinct analytical advantages for some use cases, also have different interfaces and design patterns for optimal use. What this means is that to leverage new innovations in big data platforms, organizations would have to continually rewrite their application code, retrain their staff, and squander time-to-market opportunities while gaining expertise in new systems – essentially breaking and rebuilding their operating model.

This class will discuss how organizations can leverage existing infrastructure and big data investments and future-proof them against emerging technologies with Cascading. Cascading is the most widely used application-development framework for building data-centric applications and allows developers to easily build enterprise-grade applications on emerging fabrics like Tez and Spark with existing skill sets such as Java, SQL, and Scala. The community around Cascading also supports a robust ecosystem that extends the power of Cascading applications with various integrations, dynamic programming languages, and data source connections – all capable of connecting existing data infrastructure with rich data analytics and data management applications.

LEVEL: Advanced

TRACK: Analysis, Use Case


Optimizing Your Big Data Ecosystem, Part II new

Stephen Brobst and Tom Fastner

Big Data exploitation has the potential to revolutionize the analytic value proposition for organizations that are able to successfully harness these capabilities. However, the architectural components necessary for success in Big Data analytics are different than those used in traditional data warehousing. This 150-minute class will provide a framework for Big Data exploitation along with recommendations for architectural deployment of Big Data solutions.

We will examine three critical technologies necessary for full exploitation of Big Data: multi-temperature data management, polymorphic file systems, and late binding. We will explore why these technologies are important and give examples of effective use for each. We will also discuss tools and frameworks such as Hadoop 2.0 and SQL++, along with approaches to successfully deploy advanced technologies in this arena. The importance of the late binding paradigm will be described, along with implementation techniques for realizing the advantages of these capabilities. We will explain which use cases are appropriate for which components of a Big Data ecosystem, and how to architect interoperability among the various technology platforms used in a best-practices implementation. Our approach emphasizes the ability to quickly take advantage of emerging Big Data technologies with maximum leverage of existing skill sets and assets within an enterprise.

  • Learn about the critical components of a Big Data ecosystem.
  • Learn about the use cases that are most appropriate for Big Data solutions.
  • Learn how to architect an optimized ecosystem for Big Data exploitation.

LEVEL: Intermediate

TRACK: Analysis, Hadoop, Use Case


The Hitchhiker's Guide to Machine Learning with Python and Apache Spark, Part II new Hands On

Krishna Sankar

The Apache Spark framework is gaining popularity as a versatile Big Data fabric across varying levels of the data pipeline, covering data management (collect-store-transform) and data science (model, reason, machine learning). The Python layer (PySpark) is particularly interesting because it enables data scientists and data engineers to explore models and algorithms at scale, and to deploy interesting data products to production using the same technologies, without any scale limitations.

In this hands-on two-part class, we will explore the class of Machine Learning problems and associated datasets that are solvable with Python and Spark. We will cover data wrangling with algorithms, including regression, classification, recommendation, tf/idf and clustering as well as model evaluation primitives.

NOTE: You should have a local installation of Spark 1.1 and Python. The slides, notes, and datasets will be available in a GitHub repository.

LEVEL: Advanced

TRACK: Analysis, Hadoop