Biz & IT —

The hot new technology in Big Data is decades old: SQL

The decades-old database technology is staging a comeback.

A slide from a presentation at Hadoop Summit describes HortonWorks' Stinger initiative, an effort to make SQL work better with Hadoop through Apache Hive.
A slide from a presentation at Hadoop Summit describes HortonWorks' Stinger initiative, an effort to make SQL work better with Hadoop through Apache Hive.
Jason Levitt

Despite the growth of "NoSQL" databases over the past few years, SQL is going nowhere isn't going anywhere. In fact, it seems Structured Query Language is in ascendance in a realm that once seemed bent on excluding it: "Big Data."

At the recent Hadoop Summit, among all of the announcements of new products and partnerships around "big data" analytics, one of the surprising trends was the apparent resurgence of a technology that has been around for decades. Many of the announcements coming from companies at the Summit centered on using SQL as the primary interface for Big Data analytics.

"It looks as if there's not a Hadoop-related vendor here who isn't promoting an SQL solution," said Paco Nathan, former director of data science at Concurrent and now Chief Scientist at Mesosphere, a speaker at Hadoop Summit. “And a few of them sound too good to be true."

Built on Hadoop

Hadoop is the open source batch-processing storage and analysis engine based on research papers published by Google about its MapReduce and Google File System technology. It's the underlying technology behind many "big data" analysis tools being used to sift through huge volumes of information created by Web visits, server logs, and all other sorts of data streams. Facebook, for example, has over 30 petabytes of data in Hadoop clusters, and it created the Hive query front-end for Hadoop (which is now an Apache open-source project). The NSA's Accumulo database, used by the agency to do real-time analysis of intelligence data, is also built on top of Hadoop.

But Hadoop can be a challenging system to learn, since it requires that users understand both its problem solving strategy, called MapReduce, and a programming language that supports MapReduce tasks. MapReduce uses batches of parallel processing jobs to sort through large volumes of data. SQL, on the other hand, is used with nearly every relational database system and huge numbers of people who know how to use it effectively to mine and analyze data. While the Facebook-created Hive provides an SQL-like front-end for Hadoop, it neither implements full SQL semantics nor is it particularly fast—since it simply translates queries into batch-processed MapReduce jobs for Hadoop.

Over the past six months, vendors have responded to the demand for more corporate-friendly analytics by announcing a slew of systems that offer full SQL query capabilities with significant performance improvements over existing Hive/Hadoop systems. These systems are designed to allow full SQL queries over warehouse-size data sets, and in most cases they bypass Hadoop entirely (although some are hybrid approaches). Allowing much faster SQL queries at scale makes big data analytics accessible by many more people in the enterprise and fits in with existing workflows.

Here's a sampling of some of the SQL-for-Big Data initiatives underway:

  • Facebook's Presto, a real-time query engine that provides a direct SQL interface to Facebook's Hadoop data warehouse. Facebook plans on releasing Presto as an open-source project this fall.
  • Amazon Web Services' RedShift. The service provides an SQL-based data warehouse service that can handle queries against databases of up to 1.6 petabytes.
  • HortonWorks' Stinger initiative, an effort to improve the SQL interface of Hive and make Hive 100 times faster.
  • IBM's BigSQL, an SQL query engine for Hadoop. BigSQL bypasses MapReduce and runs against the Hadoop Distributed File System for read-only queries and HBase (the Hadoop database engine) for transactional queries that perform reads and writes of data.
  • EMC's HAWQ, an SQL query engine for the company's Pivotal HD version of Hadoop.
  • Cloudera's Impala, a real-time ad-hoc query interface to Hadoop introduced last October.

There are also upcoming changes to Hadoop itself that will make SQL queries of Hadoop data easier. Hadoop 2.0, which will be released later this year, replaces the MapReduce code in Hadoop with a modular architecture called YARN (Yet Another Resource Negotiator) that permits multiple analysis systems to co-exist with MapReduce.

Listing image by James Lumb

Channel Ars Technica