In this blog entry I want to detail how we bring fully ANSI-standard-compliant SQL to Hadoop. The two components of Pivotal HD critical for this are as follows.
HAWQ - SQL query processor based on GPDB running on HDFS
HAWQ is a parallel SQL query engine that combines the merits of the Greenplum Database Massively Parallel Processing (MPP) relational database engine and the Hadoop parallel processing framework. HAWQ supports SQL and native querying capability against various data sources in different popular formats. It provides a linearly scalable storage solution for managing terabytes or petabytes of data at low cost.
GPXF - Extension Framework component of HAWQ to create external tables
GPXF is the external table interface to HAWQ that enables querying data stored in HDFS, HBase, and HIVE. In addition, it exposes an API that enables a parallel connection to additional data sources.
Example
So here is a very simple example of how this works. We already have an existing Pivotal HD install with the HAWQ and GPXF components.
1. First, let's generate some data using a simple script as follows
#!/bin/bash
for i in {1..1000000}
do
  echo "$i|person$i"
done
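If you prefer, the same generator can be written in Python; bash brace expansion materializes all one million numbers in memory before the loop starts, while this version streams them one at a time. This is just an equivalent sketch of the script above, not part of the Pivotal HD install.

```python
# Generate 1,000,000 pipe-delimited person records, one per line,
# matching the output of the bash script above (e.g. "1|person1").
def person_lines(n=1_000_000):
    for i in range(1, n + 1):
        yield f"{i}|person{i}"

if __name__ == "__main__":
    with open("person.txt", "w") as f:
        for line in person_lines():
            f.write(line + "\n")
```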
2. Run the script to create the file person.txt as shown below and verify we have 1 million person entries
[gpadmin@pivhdsne pas]$ ./data2.sh > person.txt
[gpadmin@pivhdsne pas]$ cat person.txt | wc -l
1000000
[gpadmin@pivhdsne pas]$ tail person.txt
999991|person999991
999992|person999992
999993|person999993
999994|person999994
999995|person999995
999996|person999996
999997|person999997
999998|person999998
999999|person999999
1000000|person1000000

3. Now let's copy the file person.txt onto HDFS as shown below
hadoop fs -put person.txt /
4. At this point we have the data file on HDFS, so we can create an external table with HAWQ and query the data using SQL. To create an external table we use SQL as follows
demo=# CREATE EXTERNAL TABLE person ( id int, name text)
demo-# LOCATION ('gpxf://pivhdsne:50070/person.txt?Fragmenter=HdfsDataFragmenter')
demo-# FORMAT 'TEXT' (DELIMITER = '|');
CREATE EXTERNAL TABLE
Time: 3.843 ms
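The gpxf:// LOCATION above is just a URL: the host and port of the HDFS NameNode, the path of the file on HDFS, and query parameters naming the GPXF plugin classes to use. A quick way to see the pieces, sketched in Python using only the LOCATION string from the example above:

```python
from urllib.parse import urlparse, parse_qs

# The LOCATION string from the CREATE EXTERNAL TABLE example above.
location = "gpxf://pivhdsne:50070/person.txt?Fragmenter=HdfsDataFragmenter"

parts = urlparse(location)
params = parse_qs(parts.query)

print(parts.hostname)           # NameNode host: pivhdsne
print(parts.port)               # NameNode port: 50070
print(parts.path)               # HDFS path: /person.txt
print(params["Fragmenter"][0])  # GPXF plugin: HdfsDataFragmenter
```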
GPXF provides the external table interface inside HAWQ to read data stored in the Hadoop ecosystem, which enables loading and querying data stored in:
- HDFS - Text and Avro formats
- Hive - Text, Sequence and RCFile formats
- HBase
Note: GPXF has an extensible framework API to enable custom connector development for other data sources and custom formats
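To make the division of labor concrete, here is a toy sketch of what a resolver-style plugin does conceptually: turn raw delimited text from a fragment of the file into typed rows matching the table definition. This is purely illustrative; the real GPXF plugin API is Java and its class names and signatures differ.

```python
# Illustrative only: mimics what a GPXF text resolver does conceptually,
# turning raw '|'-delimited lines into typed (id, name) rows that match
# the external table definition (id int, name text).
def resolve_rows(raw_lines, delimiter="|"):
    rows = []
    for line in raw_lines:
        id_field, name_field = line.rstrip("\n").split(delimiter)
        rows.append((int(id_field), name_field))
    return rows

print(resolve_rows(["1|person1", "2|person2"]))
# [(1, 'person1'), (2, 'person2')]
```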
5. At this point we can query the data sitting on our HDFS using SQL; some examples are below.
- Getting a count(*)
demo=# select count(*) from person;
  count
---------
 1000000
(1 row)
Time: 1396.451 ms
- Querying the first 10 records only
demo=# select * from person limit 10;
 id |   name
----+----------
  1 | person1
  2 | person2
  3 | person3
  4 | person4
  5 | person5
  6 | person6
  7 | person7
  8 | person8
  9 | person9
 10 | person10
(10 rows)
Time: 263.435 ms
More Information
For more information on Pivotal HD see the links below.
http://www.gopivotal.com/pivotal-products/pivotal-data-fabric/pivotal-hd
http://www.greenplum.com/products/pivotal-hd
http://pivotalhd.cloudfoundry.com/index.html
1 comment:
To work on Pivotal HD 1.0.1 Single Node (VM), the SQL in step 4 should be:
CREATE EXTERNAL TABLE person ( id int, name text)
LOCATION ('pxf://pivhdsne:50070/person.txt?Fragmenter=HdfsDataFragmenter&Accessor=TextFileAccessor&Resolver=TextResolver')
FORMAT 'TEXT' (DELIMITER = '|');