In this example we show how we can load 500,000 records and then update those 500,000 records with GPLOAD.
1. Create a table to load the data into as shown below.
drop table if exists apples.people cascade;
create table apples.people (id int, name text)
DISTRIBUTED BY (id);
2. The data we are loading is defined as follows
[Tue Jun 25 22:39:24 gpadmin@:~/demos/gpload/inserts ] $ head person.txt
1,person1
2,person2
3,person3
4,person4
5,person5
6,person6
7,person7
8,person8
9,person9
10,person10
3. Create a YAML formatted control file to load the data into the table PEOPLE as shown below.
test.yml
VERSION: 1.0.0.1 DATABASE: gpadmin USER: pas HOST: 127.0.0.1 PORT: 5432 GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - 127.0.0.1 PORT: 8100 FILE: [ /Users/gpadmin/demos/gpload/inserts/person.txt ] - COLUMNS: - "id": - "name": - FORMAT: text - DELIMITER: ',' - ENCODING: 'UTF8' - NULL_AS: '' - ERROR_LIMIT: 100000 - ERROR_TABLE: apples.err_people OUTPUT: - TABLE: apples.people - MODE: INSERT
4. Run a script to load the data as shown below.
Script:
gpload -f test.yml
Output:
[Tue Jun 25 22:44:29 gpadmin@:~/demos/gpload/inserts ] $ ./do-gpload.sh
2013-06-25 22:44:35|INFO|gpload session started 2013-06-25 22:44:35
2013-06-25 22:44:35|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/inserts/person.txt" -t 30
2013-06-25 22:44:35|INFO|running time: 0.73 seconds
2013-06-25 22:44:35|INFO|rows Inserted = 500000
2013-06-25 22:44:35|INFO|rows Updated = 0
2013-06-25 22:44:35|INFO|data formatting errors = 0
2013-06-25 22:44:35|INFO|gpload succeeded
Script:
gpload -f test.yml
Output:
[Tue Jun 25 22:44:29 gpadmin@:~/demos/gpload/inserts ] $ ./do-gpload.sh
2013-06-25 22:44:35|INFO|gpload session started 2013-06-25 22:44:35
2013-06-25 22:44:35|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/inserts/person.txt" -t 30
2013-06-25 22:44:35|INFO|running time: 0.73 seconds
2013-06-25 22:44:35|INFO|rows Inserted = 500000
2013-06-25 22:44:35|INFO|rows Updated = 0
2013-06-25 22:44:35|INFO|data formatting errors = 0
2013-06-25 22:44:35|INFO|gpload succeeded
Now at this point lets update the same data set with new updated data , once again using gpload BUT this time the YAML control file is a little different here.
5. The data we are UPDATING is defined as follows
[Tue Jun 25 22:54:39 gpadmin@:~/demos/gpload/upserts ] $ head people.txt
1,person1-updated
2,person2-updated
3,person3-updated
4,person4-updated
5,person5-updated
6,person6-updated
7,person7-updated
8,person8-updated
9,person9-updated
10,person10-updated
6. Create a YAML formatted control file to merge/update the data into the table PEOPLE as shown below.5. The data we are UPDATING is defined as follows
[Tue Jun 25 22:54:39 gpadmin@:~/demos/gpload/upserts ] $ head people.txt
1,person1-updated
2,person2-updated
3,person3-updated
4,person4-updated
5,person5-updated
6,person6-updated
7,person7-updated
8,person8-updated
9,person9-updated
10,person10-updated
test.yml
VERSION: 1.0.0.1 DATABASE: gpadmin USER: pas HOST: 127.0.0.1 PORT: 5432 GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - 127.0.0.1 PORT: 8100 FILE: [ /Users/gpadmin/demos/gpload/upserts/people.txt ] - COLUMNS: - id: int - name: text - FORMAT: text - DELIMITER: ',' - ENCODING: 'UTF8' - NULL_AS: '' - ERROR_LIMIT: 100000 - ERROR_TABLE: apples.err_people OUTPUT: - TABLE: apples.people - MODE: MERGE - MATCH_COLUMNS: - id - UPDATE_COLUMNS: - name - MAPPING: id: id name: name
7. Run a script to update the POEPLE table data as shown below.
Script:
gpload -f test.yml
Output:
[Tue Jun 25 22:54:21 gpadmin@:~/demos/gpload/upserts ] $ ./do-gpload.sh
2013-06-25 22:54:34|INFO|gpload session started 2013-06-25 22:54:34
2013-06-25 22:54:34|INFO|started gpfdist -p 8100 -P 8101 -f "/Users/gpadmin/demos/gpload/upserts/people.txt" -t 30
2013-06-25 22:54:39|INFO|running time: 5.13 seconds
2013-06-25 22:54:39|INFO|rows Inserted = 0
2013-06-25 22:54:39|INFO|rows Updated = 500000
2013-06-25 22:54:39|INFO|data formatting errors = 0
2013-06-25 22:54:39|INFO|gpload succeeded
8. Finally verified we indeed have updated the data in the PEOPLE table.
gpadmin=# select * from apples.people limit 10; id | name ----+------------------ 1 | person1-updated 2 | person2-updated 3 | person3-updated 4 | person4-updated 5 | person5-updated 6 | person6-updated 7 | person7-updated 8 | person8-updated 9 | person9-updated 10 | person10-updated (10 rows)
2 comments:
This is cool!
Hi, this is a really useful post! Thanks for publishing it. I was wondering whether you could also demonstrate how to use the update condition, or mapping with an expression? If I want to only update records which do not have a specific string, how do I incorporate it into this update condition? I have not able to find many resources to help me understand this. I would really appreciate if you could elaborate. Thanks!
Post a Comment