In this example we show how we can load 500,000 records and then update those 500,000 records with GPLOAD.
1. Create a table to load the data into as shown below.
drop table if exists apples.people cascade;
create table apples.people (id int, name text)
DISTRIBUTED BY (id);
2. The data we are loading is defined as follows
[Tue Jun 25 22:39:24 gpadmin@:~/demos/gpload/inserts ] $ head person.txt
1,person1
2,person2
3,person3
4,person4
5,person5
6,person6
7,person7
8,person8
9,person9
10,person10
3. Create a YAML formatted control file to load the data into the table PEOPLE as shown below.
test.yml
VERSION: 1.0.0.1
DATABASE: gpadmin
USER: pas
HOST: 127.0.0.1
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- 127.0.0.1
PORT: 8100
FILE: [ /Users/gpadmin/demos/gpload/inserts/person.txt ]
- COLUMNS:
- "id":
- "name":
- FORMAT: text
- DELIMITER: ','
- ENCODING: 'UTF8'
- NULL_AS: ''
- ERROR_LIMIT: 100000
- ERROR_TABLE: apples.err_people
OUTPUT:
- TABLE: apples.people
- MODE: INSERT
4. Run a script to load the data as shown below.
Script:
gpload -f test.yml
Output:
[Tue Jun 25 22:44:29 gpadmin@:~/demos/gpload/inserts ] $ ./do-gpload.sh
2013-06-25 22:44:35|INFO|gpload session started 2013-06-25 22:44:35
2013-06-25 22:44:35|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/inserts/person.txt" -t 30
2013-06-25 22:44:35|INFO|running time: 0.73 seconds
2013-06-25 22:44:35|INFO|rows Inserted = 500000
2013-06-25 22:44:35|INFO|rows Updated = 0
2013-06-25 22:44:35|INFO|data formatting errors = 0
2013-06-25 22:44:35|INFO|gpload succeeded
Script:
gpload -f test.yml
Output:
[Tue Jun 25 22:44:29 gpadmin@:~/demos/gpload/inserts ] $ ./do-gpload.sh
2013-06-25 22:44:35|INFO|gpload session started 2013-06-25 22:44:35
2013-06-25 22:44:35|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/inserts/person.txt" -t 30
2013-06-25 22:44:35|INFO|running time: 0.73 seconds
2013-06-25 22:44:35|INFO|rows Inserted = 500000
2013-06-25 22:44:35|INFO|rows Updated = 0
2013-06-25 22:44:35|INFO|data formatting errors = 0
2013-06-25 22:44:35|INFO|gpload succeeded
Now at this point lets update the same data set with new updated data , once again using gpload BUT this time the YAML control file is a little different here.
5. The data we are UPDATING is defined as follows
[Tue Jun 25 22:54:39 gpadmin@:~/demos/gpload/upserts ] $ head people.txt
1,person1-updated
2,person2-updated
3,person3-updated
4,person4-updated
5,person5-updated
6,person6-updated
7,person7-updated
8,person8-updated
9,person9-updated
10,person10-updated
6. Create a YAML formatted control file to merge/update the data into the table PEOPLE as shown below.5. The data we are UPDATING is defined as follows
[Tue Jun 25 22:54:39 gpadmin@:~/demos/gpload/upserts ] $ head people.txt
1,person1-updated
2,person2-updated
3,person3-updated
4,person4-updated
5,person5-updated
6,person6-updated
7,person7-updated
8,person8-updated
9,person9-updated
10,person10-updated
test.yml
VERSION: 1.0.0.1
DATABASE: gpadmin
USER: pas
HOST: 127.0.0.1
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- 127.0.0.1
PORT: 8100
FILE: [ /Users/gpadmin/demos/gpload/upserts/people.txt ]
- COLUMNS:
- id: int
- name: text
- FORMAT: text
- DELIMITER: ','
- ENCODING: 'UTF8'
- NULL_AS: ''
- ERROR_LIMIT: 100000
- ERROR_TABLE: apples.err_people
OUTPUT:
- TABLE: apples.people
- MODE: MERGE
- MATCH_COLUMNS:
- id
- UPDATE_COLUMNS:
- name
- MAPPING:
id: id
name: name
7. Run a script to update the POEPLE table data as shown below.
Script:
gpload -f test.yml
Output:
[Tue Jun 25 22:54:21 gpadmin@:~/demos/gpload/upserts ] $ ./do-gpload.sh
2013-06-25 22:54:34|INFO|gpload session started 2013-06-25 22:54:34
2013-06-25 22:54:34|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/upserts/people.txt" -t 30
2013-06-25 22:54:39|INFO|running time: 5.13 seconds
2013-06-25 22:54:39|INFO|rows Inserted = 0
2013-06-25 22:54:39|INFO|rows Updated = 500000
2013-06-25 22:54:39|INFO|data formatting errors = 0
2013-06-25 22:54:39|INFO|gpload succeeded
8. Finally verified we indeed have updated the data in the PEOPLE table.
gpadmin=# select * from apples.people limit 10; id | name ----+------------------ 1 | person1-updated 2 | person2-updated 3 | person3-updated 4 | person4-updated 5 | person5-updated 6 | person6-updated 7 | person7-updated 8 | person8-updated 9 | person9-updated 10 | person10-updated (10 rows)