Big Data TechCon | April 26-28, 2015 | Boston, MA


Public Data Sets: Education

An investment in education is an investment in the future, but many factors contribute to the quality of the education delivered. Governments around the world spend trillions of dollars aggregating data about test scores, student populations, and educational institutions, all in the name of improving education. Sharpen your pencils before thumbing through these data sets, which assess the state of education across the globe.

U.S. Census Bureau Public Elementary–Secondary Education Finance Data 2012

Education finance data include revenues, expenditures, debt, and assets [cash and security holdings] of elementary and secondary public school systems. Statistics cover school systems in all states, and include the District of Columbia. Data are available in viewable tables and downloadable files.

Jan 20, 2015 11:30:00 AM

Topics: Big Data analytics

Deep learning is making an impact in artificial intelligence

You could be forgiven for thinking that this past week yielded the discovery of some new type of AI; watching Elon Musk donate US$10 million to the Future of Life Institute would certainly lead one to believe so. Musk donated these funds specifically to push for research designed to keep artificial intelligence friendly.

The Future of Life Institute was founded by one of the creators of Skype, who said that building AI is like controlling a rocket: You concentrate on acceleration first, but once that’s accomplished, steering becomes the real focus.

But Musk wasn’t the only one focusing on futuristic AI this week. Deep learning got a big boost from bloggers and developers, as well. Two larger pieces on the state of deep learning went live.

Particularly interesting is the use of AI and data visualization to imagine higher-dimensional data. The latter link, a podcast on machine learning, specifically tackles the ethical questions again.

Does this mean we need three rules for AI, like Asimov’s robots? 2015 might be the year we finally start laying some ethical guidelines for AI. But looking more deeply into what machine learning actually entails, Christopher Olah’s blog postings on the topic are truly enlightening. He dives deeply into the ways neural networks function to try and help us understand what, exactly, is going on inside these mysterious learning programs.

While reading his pieces on the topic, all of which are magnificent, I was struck by just how much learning algorithms can be understood by outputting visuals of their inner workings. Olah does a great job of this, representing the hidden intermediate steps of a neural network’s reasoning in ways that show how it is going about its rationalizations.

This phenomenon is exemplified in some interesting ways on the Web, though not quite as obliquely. My favorite example is Genetic Cars. While this is an evolutionary algorithm rather than an AI algorithm, it’s still of similar importance. The program generates possible solutions to a simple question in a confined problem space. Here, instead of performing image recognition, it’s just building cars and seeing how far they can get on a random track.
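The loop behind a program of this kind can be sketched in a few lines of Python. This is not the code behind Genetic Cars; it is a minimal illustration of the generate, score, and cull cycle. The function names, the population size, the mutation step, and the stand-in fitness function are all assumptions made for the example.

```python
import random

def evolve(fitness, genome_length=8, population_size=20, generations=40, seed=0):
    """Tiny evolutionary loop: score candidates, keep the best, mutate them."""
    rng = random.Random(seed)
    # Start from random candidate solutions (each a list of numbers in [0, 1]).
    population = [[rng.random() for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Rank by fitness and keep the better half (ruling out the inferior).
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        # Refill the population with slightly mutated copies of the survivors.
        children = [
            [min(1.0, max(0.0, gene + rng.gauss(0, 0.1))) for gene in parent]
            for parent in survivors
        ]
        population = survivors + children
    return max(population, key=fitness)

# Hypothetical stand-in for "how far the car gets on the track".
def distance_travelled(genome):
    return sum(genome)

best = evolve(distance_travelled)
print(round(distance_travelled(best), 2))
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next; a real simulation would replace `distance_travelled` with a physics run.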

One has to wonder if human reasoning works in a similar way, or in a completely different way. The scientific method would dictate that we’re relatively similar: trial and error, ruling out the inferior, and eventually yielding a solution.

Jan 16, 2015 3:47:00 PM

Topics: Big Data, Artificial Intelligence, AI, Deep Learning

Public Data Sets: Oceanographic Data

Jan 14, 2015 1:14:40 PM

Topics: Big Data analytics

Inside the Apache Software Foundation’s Flink

The Apache Software Foundation has announced Apache Flink as a Top-Level Project (TLP).

Flink is an open-source Big Data system that fuses processing and analysis of both batch and streaming data. The data-processing engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to Hadoop’s MapReduce component with its own runtime. Yet the system still provides access to Hadoop’s distributed file system and YARN resource manager.

The open-source community around Flink has steadily grown since the project’s inception at the Technical University of Berlin in 2009. Now at version 0.7.0, Flink lists more than 70 contributors and sponsors, including representatives from Hortonworks, Spotify and Data Artisans (a German startup devoted primarily to the development of Flink).

The project’s graduation to TLP after only nine months in incubation raises questions not only about its future, but also about the potential it may have for the evolution of Big Data processing. SD Times spoke with Flink vice president (and CTO of Data Artisans) Stephan Ewen and Flink Project Management Committee member Kostas Tzoumas (also the CEO of Data Artisans) about where Flink came from, what makes the Big Data engine unique, and where the newly minted TLP is going.

SD Times: Flink is an open-source distributed data analysis engine for batch and streaming data with Java and Scala APIs. In your own words, describe what Flink is as a platform.
Ewen: Flink, as a platform, is a new approach for unifying flexible analytics in streaming and batch data sources. Flink’s technology draws inspiration from Hadoop, MPP databases and data streaming systems, but fuses those in a unique way. For example, Flink uses a data streaming engine to execute both batch and streaming analytics. Flink contains a lot of compiler technology, using Scala macros, Java reflection, and code generation together with database optimizer technology in order to holistically compile and optimize user code similarly to what relational databases do.

Tzoumas: To the user, this technology brings easy and powerful programming APIs and a system with state-of-the-art performance, which also performs reliably well in a variety of use cases and hardware settings.
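The idea of running batch work on a streaming engine can be illustrated without any Flink code at all. The Python sketch below is not Flink's API; the function names are invented for the example. It treats a batch as a stream that happens to end, so one operator serves both cases.

```python
from collections import Counter
from itertools import islice

def word_counts(lines):
    """Streaming operator: emit the running counts after each incoming line."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        yield dict(counts)

def batch_word_count(lines):
    """Batch job: run the same operator over a bounded input, keep the last result."""
    result = {}
    for result in word_counts(lines):
        pass
    return result

def endless(lines):
    """Hypothetical unbounded source that replays the given lines forever."""
    while True:
        yield from lines

# Bounded input: the final emission is the batch answer.
print(batch_word_count(["to be", "or not to be"]))

# Unbounded input: results arrive continuously; here we look at the first three.
for snapshot in islice(word_counts(endless(["tick"])), 3):
    print(snapshot)
```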

Talk about the origins of Flink and the inspiration for its creation. How, if at all, has the project’s core and focus evolved over the past several years?
Tzoumas: In 2009, researchers at TU Berlin and other universities started playing with Hadoop and asked themselves, how can we bring together knowledge from the database systems community and the Hadoop community in a hybrid system?

Back then, Hadoop was fairly new—only MapReduce and quite hard to use—while SQL databases were more established but clearly could not cover some new use cases. During the years, the team built a system that intelligently fuses concepts from the Hadoop and the SQL database worlds—without either being a SQL database or being based on Hadoop—called Stratosphere. The project gained momentum as an open-source GitHub project, and in April 2014 the community decided to submit a proposal to the Apache Incubator. During the years, the scope of the project of course expanded somewhat to cover both streaming and batch data processing, and to provide a smooth user experience to developers that use it, both for data exploration and for production use.

What makes Flink unique with regard to data streaming and pipelining? What does the technology offer, or what use cases does it enable that makes it stand out among similar Big Data technologies?
Tzoumas: The combination of batch and true low-latency streaming is unique in Flink. Combining these styles of processing, often termed “lambda architecture,” is becoming increasingly popular. Unlike pure batch or streaming engines, through its hybrid engine Flink can support both high-performance and sophisticated batch programs as well as real-time streaming programs with low latency and complex streaming semantics.

Ewen: In addition, Flink is unique among Big Data systems (and perhaps unique in the open-source Java world) in how it uses memory. Flink was designed to run very robustly in both clusters with plenty of RAM, and in clusters with limited memory resources, or where data vastly exceeds the available memory. To that end, the system manages its own memory and contains sophisticated Java implementations of database-style query processing algorithms, traditionally written in C/C++, that can work both on in-heap and off-heap memory.

How has the open-source community around Flink grown and developed over the past several years?
Tzoumas: The Flink community came initially from the academic world: [the] Technical University of Berlin, Humboldt University of Berlin, [the] Hasso Plattner Institute. As the project gained more exposure, more people from industry joined the project. Recently, a group of Flink committers started Data Artisans, a Berlin-based startup that is committed to developing Flink as open source. We have been very excited about how the community has been developing. The number of contributors to the project has almost doubled since the project went to the Apache Incubator.

What does this milestone of ascension to Top-Level Project mean for Flink? Where does Flink go from here?
Ewen: Graduating to a Top-Level Project is a very important milestone for Flink as it reflects the maturity of Flink both as a system and as a community. The Flink community is currently in its public developer mailing list discussing a developer road map for 2015, which includes several exciting new features in the APIs, optimizer and runtime of the system. In addition, the community has started to put a lot more focus on building libraries, e.g., a Machine Learning library and a Graph processing library, on top of Flink. This work will make the system accessible to a wider audience that will in turn feed back to the community.

Jan 12, 2015 8:48:00 AM

Topics: Big Data, hadoop, Apache, Apache Flink, HDFS

A blended approach to managing data

Making the case for version control, testing environments and continuous integration when it comes to software development these days is a no-brainer. But when it comes to the data behind your important applications, life-cycle management and data flow automation are still new ideas struggling to find their place in the market.

That doesn’t mean, however, that managing your data is an impossible task, devoid of vendors and best practices. But it does mean that most enterprises may not yet fully comprehend what exactly data automation entails.

It can refer to a number of different things, or to all of them together in a single workflow. These include things like data scrubbing to remove sensitive data before use in testing environments. It can also refer to the flow of data from production systems, back to data warehouses, then forward again into analytics data stores. Data automation can even refer to the actual change management of data as it comes into the system and evolves over time.
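As a rough illustration of the scrubbing step, the Python sketch below replaces sensitive values with stable tokens before a record is copied into a test environment. The field names, the salt, and the token format are assumptions made for the example, and salted hashing on its own should not be mistaken for a reviewed masking policy.

```python
import hashlib

# Hypothetical list of fields that must never reach a test environment as-is.
SENSITIVE_FIELDS = {"name", "email", "card_number"}

def scrub(record, salt="test-env"):
    """Return a copy of record with sensitive values replaced by stable tokens."""
    cleaned = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            # Same input always yields the same token, so joins still line up.
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            cleaned[field] = "masked-" + digest[:12]
        else:
            cleaned[field] = value
    return cleaned

production_row = {"name": "Ada Lovelace", "email": "ada@example.com", "amount": 12.5}
print(scrub(production_row))
```

Keeping the tokens deterministic means two tables scrubbed separately can still be joined on the masked column, which is often what a test suite needs.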

With such a broad space to cover, the term data automation has a lot of heavy lifting to do when it comes to being advocated for inside your organization. But it doesn’t have to be an impossible struggle, thanks to numerous companies and tools that make managing that data life cycle much easier.

Seb Taveau, senior business leader and technical evangelist at MasterCard for its Open APIs, said that the management of data in the software development life cycle is extremely important for a heavily regulated organization like his.

(Related: Scaling agile in databases)

MasterCard has a particularly interesting set of problems when it comes to the data life cycle. Because it is a credit card company, financial regulations weigh heavily on what types of data MasterCard can collect, let alone share. And yet, the company has still been able to build up a developer network of APIs and services based on its data. This data includes information like purchases made at specific locations, as well as the time stamps associated with them. As a result, using MasterCard APIs, developers can determine which restaurants in their town are the most popular among locals, or when certain businesses are open.

“MasterCard’s been working on taking some of its private APIs and making them public for the past four years,” said Taveau. “The developer program is not new inside MasterCard. This became a key project for MasterCard about a year ago. They wanted to make sure MasterCard was a tech companion, not just a payment network.”

But having all that data publicly available means there’s a lot of work to be done before an API can go live. Depending on the data source, the project to open the data behind an API can sometimes be very easy, and sometimes require a lot of hands-on work.

“Sometimes it works great; sometimes it requires manual work to make sure it works,” said Taveau. “It depends on the complexity of the API and the data you’re trying to reach. Location is very easy to automate: It doesn’t require the security. The data is not critical—just a pinpoint to a merchant location or restaurant location or ATM location. That’s not critical.”

When it comes to more delicate information, said Taveau, the effort requires more coordination and care. “When you talk about a merchant identifier API, that’s a completely different story. As you expect, the review process for the location API and the tokenization API or risk management are very different.”

Shifting tools
Yaniv Yehuda, cofounder and CTO of DBmaestro, said that managing software is almost second nature to developers, but managing the data is another story entirely. “Tools are important, as well as human processes. But what we found out is that people are really challenged when they get to the database. Databases don’t follow the same processes the code does. They require special attention,” he said.

The reason for the difference in managing them is that they are created in different ways and have a different workflow, said Yehuda. “When you deal with traditional code, you have your development people working with version control. Then it gets built by a build server and pushed to the next environment,” he said.

“When you deal with a database, a database is not compiled in one area and then pushed to another. The database actually holds its internal structure of the data: the schema in each environment. If you have a development environment, it holds the structure code content as the application, but it still hosts the same data in the Q/A environment. In order to deal with that, you create transition code that changes the database from one state to the next.

“The database, in order to be promoted from one version to the next, it requires additional steps in order to deal with that transition code. Because people are really doing stuff with traditional tools, developers have to do this stuff manually. This code is really static, so if you have some changes in your development environment and you want them to push to another environment, then you write the code to deal with that transition. You have code override, and people are not using the database itself to manage version control because they have to extract the objects from the database and put them in a different version control from the code.

“What DBmaestro does first is to create a bulletproof version-control system. We created an enforced version control to the database. You cannot change the database unless you check out the object. You go to the database and say ‘add column,’ but only when you check it out. When you check it back in, it gets version-controlled. The second thing DBmaestro does is it safely automates deployments from one environment to another.”
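The “transition code” Yehuda describes is often kept as an ordered list of migration scripts that live in version control next to the application. The Python sketch below shows the general pattern using SQLite from the standard library; it is not DBmaestro's product or method, and the table, the column names and the use of SQLite's `user_version` pragma as the version marker are assumptions made for the example.

```python
import sqlite3

# Hypothetical transition scripts, ordered by the schema version they produce.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]

def migrate(conn):
    """Apply every migration newer than the version recorded in the database."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in MIGRATIONS:
        if version > current:
            conn.execute(statement)
            # PRAGMA does not accept bound parameters, so format the integer in.
            conn.execute(f"PRAGMA user_version = {int(version)}")
            conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]

conn = sqlite3.connect(":memory:")
print(migrate(conn))  # schema version after applying pending migrations
print(migrate(conn))  # running it again applies nothing new
```

Each environment records which version it is at, so promoting a change means running the same script list against the next database rather than hand-writing a one-off change.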

Red Gate offers similar tools for data automation and management. Ben Rees, general manager at Red Gate, said that the company has been building tools around this problem for 15 years now. “People were using them to do database life-cycle management, but we didn’t call it that back then. Back then we called it agile database delivery and agile database development,” he said.

“The essence of it is that there are developers and DBAs who want to make changes to their databases, and more importantly, they want to make those changes live. They’ve been doing that using our tools, and have been following this process. We’ve only realized recently what they’re doing is application life-cycle management, except for the database.”

(Related: A better way to look at databases)

Rees said that Red Gate’s tools can help a development organization get control of data management within its application life cycle. “It’s about source controlling the database. It’s about automating the updates. It’s about properly managing the release process, and having proper release management to get changes into production. Then it’s about monitoring what happens after the event.”

Thus, as developers have awakened to database management, Red Gate has awakened to how its customers are using its tools. “What we’re realizing now, over the last year or so, is that we have this complete story that our customers have been telling us over and over for years,” said Rees. “We used to sell these 1,500-point tools to the end user. Now we’re selling something that’s about changes in process, changes in how you work.”

That means Red Gate can no longer rely on single developers using a credit card to buy a tool. Instead, data management and the database life cycle have moved up the stack to become CIO-level concerns. Getting buy-in from that far up the chain can ensure that a new process and life cycle can be pushed out across the organization, not just into small pockets.

Ron Huizenga, data architect and product manager at Embarcadero Technologies, said that change management in databases is a great way to ensure changes don’t destroy essential information.

“We have database change-management tools, from the metadata approach, from the modeling approach, and the metadata artifacts,” he said. “It’s extremely important to be able to map out all of what that is.

“There’s also a difference in terms of the data content in those stores. And that’s where we really get into an enterprise data lineage. You may have a company that has employee data scattered across a number of different systems. How do we map that together, and how do we know if info is changing in its journey through the system?”

Embarcadero, he said, has tools that can tie different data store terms together. If one database refers to employees with the field “EMP” while another uses the field title “PEOP,” Embarcadero’s tools can be used to define these two different data columns as meaning the same thing, thus allowing for quick integration of data from disparate data sources.
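In code, the underlying idea amounts to a shared vocabulary that each source's column names are translated into. The Python sketch below is a generic illustration, not Embarcadero's tooling; apart from “EMP” and “PEOP,” every system name, column name and value is hypothetical.

```python
# Hypothetical mapping from each source system's column names to shared terms.
FIELD_ALIASES = {
    "hr_db": {"EMP": "employee", "DEPT": "department"},
    "payroll_db": {"PEOP": "employee", "SAL": "salary"},
}

def normalize(source, row):
    """Rename a row's columns to the shared vocabulary for its source system."""
    aliases = FIELD_ALIASES[source]
    return {aliases.get(column, column): value for column, value in row.items()}

def merge_by(key, rows):
    """Combine normalized rows that share the same value for key."""
    merged = {}
    for row in rows:
        merged.setdefault(row[key], {}).update(row)
    return merged

rows = [
    normalize("hr_db", {"EMP": "A. Jones", "DEPT": "Ops"}),
    normalize("payroll_db", {"PEOP": "A. Jones", "SAL": 50000}),
]
print(merge_by("employee", rows))
```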

New stores, new processes
When it comes to the actual database, NoSQL data stores are offering another wrinkle. Couchbase, for example, is a NoSQL database that behaves similarly to Lotus Notes. For mobile applications, this means that Couchbase can be used to embed a datastore onto a mobile device, then sync that datastore with the remotely hosted version of Couchbase.

Wayne Carter, chief architect of mobile at Couchbase, said that database management has changed drastically in recent years.

“I come from Oracle, where I spent 14 years,” he said. “I grew up in CRM from Siebel, and the models in CRM are extremely complicated. We went through several variances, and even a simple thing like a contact… the model associated with that becomes monstrous over time.

“At the application logic level, some things you’d do in the database, like validation into your application code, and things like joins and queries just to get simple objects. That just doesn’t work in today’s mobile space. Applications move faster and need to evolve a lot faster. These are completely dependent on databases, from the change tracking perspective.”

He went on to say that, “For our database, because it’s moved to the application for management, it’s managed at the code level. The same code can be used for archiving [and] versioning, and could be stored on GitHub… rather than within the database. The actual migration of the database is being managed on the change management [system].” Thus, NoSQL data stores allow the application to define the layout of the data. Because of this, simply doing version control within the application code itself can help to manage the data.
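One common way to manage a document's layout in application code is to stamp each document with a schema version and upgrade old documents as they are read. The Python sketch below illustrates that pattern in general terms; it is not Couchbase's API, and the field names and the `schema` marker are assumptions made for the example.

```python
CURRENT_SCHEMA = 2

def upgrade_v1_to_v2(doc):
    """v1 stored a single "name"; v2 splits it into first and last name."""
    first, _, last = doc.pop("name", "").partition(" ")
    doc["first_name"], doc["last_name"] = first, last
    doc["schema"] = 2
    return doc

# One upgrade function per old version, kept in version control with the app.
UPGRADES = {1: upgrade_v1_to_v2}

def load(doc):
    """Return a copy of doc upgraded to the current schema version."""
    doc = dict(doc)
    version = doc.get("schema", 1)  # documents without a marker count as v1
    while version < CURRENT_SCHEMA:
        doc = UPGRADES[version](doc)
        version = doc["schema"]
    return doc

print(load({"name": "Ada Lovelace"}))
```

Because the upgrade functions are ordinary code, they can be reviewed, tested and tagged alongside the release that needs them.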

MongoDB extends this concept even further with its MongoDB Management Service. This enterprise tool for managing MongoDB instances can quickly spin up test environments and replicate data from one area to another, ensuring testers and developers can access the data during development.

That doesn’t mean it’s the entire solution to the data management problem yet, however. Kelly Stirman, director of products at MongoDB, said, “What it doesn’t do is things like data masking and sampling. It doesn’t generate different distributions of fields and attributes or create randomized samples. There’s a lot of interesting stuff you could do to load test your applications based on the distribution of certain types of values on a known data set. We’re looking at that and thinking about [adding these capabilities to future versions].”

The business of data
Sean Poulley, vice president of databases and data warehousing at IBM, said that understanding the data life cycle requires an understanding of how a database works in a services environment. “One of the things people often misunderstand is that having a database is one thing, but running a data service is another. There’s a world of difference between having a database-management product and actually delivering it as a service,” he said.

To this end, IBM has been focusing on CouchDB as its platform for the future. IBM acquired Cloudant in February 2014, and has since been building enterprise products around CouchDB, the database originally created by ex-Lotus developer Damien Katz. CouchDB is, essentially, an attempt to recreate the database system inside of Lotus Notes.

“CouchDB has built a lot of clever management tools around how they load balance and around the database capabilities,” said Poulley. “Cloudant did a really nice job. At one stage, they looked like they were going to fork CouchDB, but they decided to contribute that back to Apache CouchDB. We’re seeing tremendous uptake; it’s one of those teams with all the ingredients for success. I actually spent a period of my career inside Lotus. Not everybody recognizes what CouchDB is. The first time I saw CouchDB and Cloudant I thought, ‘That’s Lotus!’ ”

IBM’s view of the database-management problem boils down to the enterprise network that includes data zones, said Poulley. “We talk about data zones, and we came up with these different zones based on thousands of customer engagements. What we observed is that customers had four or five data zones. One would be a relational, operational store. Then we would find, classically, the data warehouse zone, and the data-mart zone, typically with the conditional relational model. Then we’re seeing this emergence, particularly with Hadoop, of an exploration zone, where data is being brought in raw and dumped into Hadoop.”

That’s a change, said Poulley, from where data has traditionally lived and for how it’s being managed. “Before, people would do preprocessing before they moved it into a relational data warehouse,” he said.

“Similarly, we’ve seen with the explosion of data analytics, and the growth of data, we’re seeing a tremendous growth in our relational data warehouse as well. Understanding the true value is not in the physical storage itself, it’s in being able to process the info you need at the speed you need it, for the period you need it. This idea, which is not new, is becoming increasingly interesting as a logical data warehouse where data flows to and from a high-performance environment like a Netezza data warehouse, into something like Big Insights, and vice versa.

“If you move info from the relational world into the Hadoop world, you can still introduce the queries. That’s what we describe as an actionable archive: You can run the same queries on Hadoop, keep the hot data in the performant environment, and move the longer-term data into a colder environment.”

Thus, Poulley advocates the reuse of queries instead of the replication of data. Rather than moving data around from data warehouses to relational data stores and into NoSQLs or test environments, he advocates keeping the data where it is and writing queries that can run on any type of data store.
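A toy version of that idea: the same query text is sent to a “hot” store and a “cold” store, and only the small result sets are combined. In the Python sketch below, two in-memory SQLite databases stand in for the warehouse and the archive; nothing here reflects IBM's products, and the table and column names are hypothetical.

```python
import sqlite3

# One query, written once, that every store is expected to understand.
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region"

def make_store(rows):
    """Create an in-memory database holding one slice of the sales data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def federated_totals(stores, query=QUERY):
    """Run the same query on every store and combine the partial sums."""
    totals = {}
    for conn in stores:
        for region, subtotal in conn.execute(query):
            totals[region] = totals.get(region, 0.0) + subtotal
    return totals

hot = make_store([("east", 10.0), ("west", 5.0)])   # recent data
cold = make_store([("east", 2.5)])                  # archived data
print(federated_totals([hot, cold]))
```

The raw rows never leave their store; only per-region subtotals travel, which is the point of moving the query rather than the data.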

So, perhaps data management and automation isn’t just about moving around the data and masking it. Perhaps it’s more about moving as much code out of the database as possible so that it can be managed in the same way as software. Then, instead of bringing the data to the software, it’s perhaps easier and more efficient to bring the software to the data. Getting there, however, still requires some intermediate steps to ensure basic capabilities around the data, such as masking.

“Once you know you can mask the data, the idea of capturing the workflows is almost like a VCR form,” said Poulley. “To really capture the workloads that are happening on a day-to-day basis is something most companies don’t do, which they should. They put it in a staging environment, but it never really experiences true workflows until it experiences the workflows of live data. Being able to capture sample data and being able to mask the data is something all companies should be doing.”

Thus, Poulley advocates testing applications with real-world data traffic as much as possible. Doing so, he said, will give you a better idea of what the application will do in production. And doing all of this will allow developers to “bring the analytics to the data, rather than the other way around,” he said.

“The innovation in the next few years won’t be about data movement; it’ll be about query movement. It’s much easier to move the queries and federate the queries.”

Back to the master
No matter how you manage your data, one thing is certain: dealing with the data itself requires discussions with the controller of that data. At MasterCard, every API in the company is controlled by one group or another. That means when MasterCard’s Taveau and his team are beginning the work of taking an API public, their first stop is in the offices of the team that manages the API and data they’re working on.

“When you’re looking at an API, we won’t be exposing single data points through an API; we create an aggregate of the data, and that’s what the API will look at,” he said. “This requires a lot of discussion internally on the whys and the why nots, where we discuss the value that can be brought out of these type of packages. How do you do it? There’s this filter, then it’s reviewed, then we check the data we’re putting in the package. We have the information security team to make sure the code is made properly, and we have friends in sales to see if there’s value in it.

“At the end of the day, the value is in the data and the quality of the data.”
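Exposing aggregates rather than single data points can be sketched as follows. This Python example is not MasterCard's implementation: the record layout is invented, and the minimum group size is an added assumption, a common precaution that keeps very small groups from revealing individual purchases.

```python
from collections import Counter

MIN_GROUP_SIZE = 3  # hypothetical threshold below which a group is withheld

def popularity_by_merchant(transactions, min_group_size=MIN_GROUP_SIZE):
    """Return purchase counts per merchant, omitting groups that are too small."""
    counts = Counter(t["merchant"] for t in transactions)
    return {merchant: n for merchant, n in counts.items() if n >= min_group_size}

# Hypothetical raw records; the API would only ever return the aggregate.
transactions = [
    {"merchant": "Cafe A", "card": "c1"},
    {"merchant": "Cafe A", "card": "c2"},
    {"merchant": "Cafe A", "card": "c3"},
    {"merchant": "Diner B", "card": "c1"},
]
print(popularity_by_merchant(transactions))
```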

Jan 8, 2015 8:53:21 AM

Topics: Big Data

Taking Big Data to the street in 2014

This was the last year in which Big Data was an off-the-street term. As the year ended, Hortonworks was preparing to go public, with Cloudera preparing for the same outcome. Big Data will officially transition into being a big business this year.

Jan 2, 2015 9:02:00 AM

Topics: Big Data

Why Do So Many Big Data Projects Fail?

In a Fireside Chat at Big Data TechCon San Francisco, Big Data consultant Tony Shan explains why failing with Big Data projects is not always a bad thing, as long as we learn from our mistakes.

Dec 18, 2014 2:18:35 PM

Topics: Big Data

Public Data Sets: Resources

Resources are the lifeblood of a country. Tracing them from their production to their transportation and eventual use forms the backbone of countless industries. With better understanding comes more efficient management and conservation. These data sets feature natural resources in their various states and uses.

Dec 4, 2014 2:08:46 PM

Topics: Data Sets, Big Data analytics

HP reinvents IT operations with Big Data-powered software suite

HP today unveiled a full range of new IT Operations Management solutions built around big data analytics. These solutions are designed to help organizations automate and optimize their applications and infrastructure technology, resulting in exceptional customer experiences, faster time to market, and greater cost efficiencies.

Dec 2, 2014 11:38:52 AM

Topics: Big Data analytics

Public Data Sets: Biology

Nov 25, 2014 11:45:00 AM

Topics: Big Data, Data Sets