Tuesday, 25 June 2013

Greenplum GPLOAD with updates/merge ability

gpload” is a data loading utility that acts as an interface to Greenplum Database’s external table parallel loading feature. The Greenplum EXTERNAL TABLE feature allows us to define network data sources as tables that we can query to speed up the data loading process. Using a load specification defined in a YAML formatted control file, “gpload” executes a load by invoking the Greenplum parallel file server (gpdist) – Greenplum’s parallel file distribution program, creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

In this example we show how we can load 500,000 records and then update those 500,000 records with GPLOAD.

1. Create a table to load the data into as shown below.

drop table if exists apples.people cascade;

create table apples.people (id int, name text)
DISTRIBUTED BY (id);

2. The data we are loading is defined as follows

[Tue Jun 25 22:39:24 gpadmin@:~/demos/gpload/inserts ] $ head person.txt
1,person1
2,person2
3,person3
4,person4
5,person5
6,person6
7,person7
8,person8
9,person9
10,person10


3. Create a YAML formatted control file to load the data into the table PEOPLE as shown below.

test.yml
  
VERSION: 1.0.0.1
DATABASE: gpadmin
USER: pas
HOST: 127.0.0.1
PORT: 5432
GPLOAD:
    INPUT:
     - SOURCE:
          LOCAL_HOSTNAME:
            - 127.0.0.1 
          PORT: 8100
          FILE: [ /Users/gpadmin/demos/gpload/inserts/person.txt ]
     - COLUMNS:
         - "id":
         - "name": 
     - FORMAT: text
     - DELIMITER: ','
     - ENCODING: 'UTF8'
     - NULL_AS: ''
     - ERROR_LIMIT: 100000
     - ERROR_TABLE: apples.err_people
    OUTPUT:
     - TABLE: apples.people 
     - MODE: INSERT  

4. Run a script to load the data as shown below.

Script:

gpload -f test.yml

Output:

[Tue Jun 25 22:44:29 gpadmin@:~/demos/gpload/inserts ] $ ./do-gpload.sh 
2013-06-25 22:44:35|INFO|gpload session started 2013-06-25 22:44:35
2013-06-25 22:44:35|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/inserts/person.txt" -t 30
2013-06-25 22:44:35|INFO|running time: 0.73 seconds
2013-06-25 22:44:35|INFO|rows Inserted          = 500000
2013-06-25 22:44:35|INFO|rows Updated           = 0
2013-06-25 22:44:35|INFO|data formatting errors = 0
2013-06-25 22:44:35|INFO|gpload succeeded

Now at this point lets update the same data set with new updated data , once again using gpload BUT this time the YAML control file is a little different here.

5. The data we are UPDATING is defined as follows

[Tue Jun 25 22:54:39 gpadmin@:~/demos/gpload/upserts ] $ head people.txt 
1,person1-updated
2,person2-updated
3,person3-updated
4,person4-updated
5,person5-updated
6,person6-updated
7,person7-updated
8,person8-updated
9,person9-updated
10,person10-updated

6. Create a YAML formatted control file to merge/update the data into the table PEOPLE as shown below.

test.yml
  
VERSION: 1.0.0.1
DATABASE: gpadmin
USER: pas
HOST: 127.0.0.1
PORT: 5432
GPLOAD:
    INPUT:
     - SOURCE:
          LOCAL_HOSTNAME:
            - 127.0.0.1 
          PORT: 8100
          FILE: [ /Users/gpadmin/demos/gpload/upserts/people.txt ]
     - COLUMNS:
         - id: int 
         - name: text 
     - FORMAT: text
     - DELIMITER: ','
     - ENCODING: 'UTF8'
     - NULL_AS: ''
     - ERROR_LIMIT: 100000
     - ERROR_TABLE: apples.err_people
    OUTPUT:
     - TABLE: apples.people 
     - MODE: MERGE
     - MATCH_COLUMNS:
           - id
     - UPDATE_COLUMNS:
           - name 
     - MAPPING:
         id: id 
         name: name  

7.  Run a script to update the POEPLE table data as shown below.

Script:

gpload -f test.yml

Output:

[Tue Jun 25 22:54:21 gpadmin@:~/demos/gpload/upserts ] $ ./do-gpload.sh    
2013-06-25 22:54:34|INFO|gpload session started 2013-06-25 22:54:34
2013-06-25 22:54:34|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/upserts/people.txt" -t 30
2013-06-25 22:54:39|INFO|running time: 5.13 seconds
2013-06-25 22:54:39|INFO|rows Inserted          = 0
2013-06-25 22:54:39|INFO|rows Updated           = 500000
2013-06-25 22:54:39|INFO|data formatting errors = 0
2013-06-25 22:54:39|INFO|gpload succeeded

8. Finally verified we indeed have updated the data in the PEOPLE table.
  
gpadmin=# select * from apples.people limit 10;
 id |       name       
----+------------------
  1 | person1-updated
  2 | person2-updated
  3 | person3-updated
  4 | person4-updated
  5 | person5-updated
  6 | person6-updated
  7 | person7-updated
  8 | person8-updated
  9 | person9-updated
 10 | person10-updated
(10 rows)

2 comments:

Tomoko said...

This is cool!

greenhorn said...

Hi, this is a really useful post! Thanks for publishing it. I was wondering whether you could also demonstrate how to use the update condition, or mapping with an expression? If I want to only update records which do not have a specific string, how do I incorporate it into this update condition? I have not able to find many resources to help me understand this. I would really appreciate if you could elaborate. Thanks!