Wednesday, 17 July 2013

Pivotal HD - Talking true SQL on HDFS

Pivotal HD is an Apache Hadoop distribution that natively integrates the industry-leading Pivotal Greenplum massively parallel processing (MPP) database technology with the Apache Hadoop framework. See this link for its release information: http://www.gopivotal.com/pivotal-products/pivotal-data-fabric/pivotal-hd

In this blog entry I want to detail how we bring fully ANSI-compliant standard SQL to Hadoop. The two components of Pivotal HD critical to this are as follows.

HAWQ - SQL query processor based on GPDB running on HDFS

HAWQ is a parallel SQL query engine that combines the merits of the Greenplum Database Massively Parallel Processing (MPP) relational database engine and the Hadoop parallel processing framework. HAWQ supports SQL and native querying capability against various data sources in different popular formats. It provides a linearly scalable storage solution for managing terabytes or petabytes of data at low cost.

GPXF - Extension Framework component of HAWQ to create external tables

GPXF is the external table interface to HAWQ that enables querying data stored in HDFS, HBase, and Hive. In addition, it exposes an API that enables a parallel connection to additional data sources.

Example

So here is a very simple example of how this works. We already have an existing Pivotal HD installation with the HAWQ and GPXF components.

1. First, let's generate some data using a simple script (saved as data2.sh) as follows

#!/bin/bash
for i in {1..1000000}
do
   echo "$i|person$i"
done
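
As an aside, a shell loop over a million-item brace expansion can be slow; an equivalent one-liner in awk (a sketch, assuming any POSIX awk is available) produces the same pipe-delimited rows:

```shell
# Sketch: generate the same "id|person<id>" rows with awk instead of a shell loop
awk 'BEGIN { for (i = 1; i <= 1000000; i++) print i "|person" i }' > person.txt
```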


2. Run the script to create the file person.txt as shown below and verify we have 1 million person entries

[gpadmin@pivhdsne pas]$ ./data2.sh > person.txt
[gpadmin@pivhdsne pas]$ cat person.txt | wc -l
1000000
[gpadmin@pivhdsne pas]$ tail person.txt 
999991|person999991
999992|person999992
999993|person999993
999994|person999994
999995|person999995
999996|person999996
999997|person999997
999998|person999998
999999|person999999
1000000|person1000000
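
Before copying the file to HDFS, it can be worth a quick sanity check that every row matches the integer-and-text layout the external table will declare. A sketch with awk (person.txt is the file generated above):

```shell
# Sketch: count rows that do NOT match the "integer|text" layout; prints 0 for a clean file
awk -F'|' 'NF != 2 || $1 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' person.txt
```
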

3. Now let's copy the file person.txt onto HDFS as shown below

hadoop fs -put person.txt /

4. At this point the data file is on HDFS, so we can create an external table with HAWQ and query the data using SQL. To create the external table we use SQL as follows

demo=# CREATE EXTERNAL TABLE person ( id int, name text)
demo-# LOCATION ('gpxf://pivhdsne:50070/person.txt?Fragmenter=HdfsDataFragmenter') FORMAT 'TEXT' (DELIMITER = '|');
CREATE EXTERNAL TABLE
Time: 3.843 ms

GPXF provides the external table interface inside HAWQ for reading data stored in the Hadoop ecosystem, which enables loading and querying data stored in
  • HDFS
  • HBase
  • Hive
The supported data formats for GPXF include the following.
  • Text
  • Avro
  • Hive - Text, Sequence and RCFile formats
  • HBase

Note: GPXF has an extensible framework API to enable custom connector development for other data sources and custom formats.

5. At this point we can query the data sitting on our HDFS using SQL. Some examples are below.

- Getting a count(*)

demo=# select count(*) from person;
  count  
---------
 1000000
(1 row)

Time: 1396.451 ms

- Querying the first 10 records only

demo=# select * from person limit 10;
 id |   name   
----+----------
  1 | person1
  2 | person2
  3 | person3
  4 | person4
  5 | person5
  6 | person6
  7 | person7
  8 | person8
  9 | person9
 10 | person10
(10 rows)

Time: 263.435 ms
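
Since HAWQ's query engine is derived from the Greenplum Database, ordinary filters and aggregates work against the external table as well. The queries below are illustrative sketches only (the predicates are made up for this example and were not run against the demo instance above):

```sql
-- Illustrative sketches: filter and aggregate over the external table
select * from person where id between 100 and 105;
select count(*) from person where name like 'person9999%';
```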

More Information

For more information on Pivotal HD see the links below.

http://www.gopivotal.com/pivotal-products/pivotal-data-fabric/pivotal-hd

http://www.greenplum.com/products/pivotal-hd

http://pivotalhd.cloudfoundry.com/index.html

1 comment:

Anonymous said...

To work on Pivotal HD 1.0.1 Single Node (VM), the SQL in step 4 should be:

CREATE EXTERNAL TABLE person ( id int, name text) LOCATION ('pxf://pivhdsne:50070/person.txt?Fragmenter=HdfsDataFragmenter&Accessor=TextFileAccessor&Resolver=TextResolver') FORMAT 'TEXT' (DELIMITER = '|');