I recently needed to set up a cron script that loads CSV files from a directory into Greenplum every 2 minutes. Once loaded, the files are moved onto Hadoop for archival purposes. The config below shows how to use the GPLOAD data load utility, which utilises GPFDIST.
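For reference, the crontab entry driving the load looked something like the line below; the wrapper script name and log path are illustrative assumptions, not part of the original setup.
# Run the load every 2 minutes (script and log paths are hypothetical)
*/2 * * * * /home/rtiadmin/bin/rti_load.sh >> /home/rtiadmin/logs/rti_load.log 2>&1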
1. Create a load table. In this example the data is then moved to the FACT table once the load is complete
drop table rtiadmin.rtitrans_etl4;
CREATE TABLE rtiadmin.rtitrans_etl4 (
imsi character varying(82),
subscriber_mccmnc character varying(10),
msisdn character varying(82),
imei character varying(50),
called_digits character varying(50),
start_datetime integer,
end_datetime integer,
first_cell_lac integer,
first_cell_idsac integer,
current_cell_lac integer,
current_cell_idsac integer,
dr_type integer,
status character varying(50),
ingest_time bigint,
processed_time bigint,
export_time bigint,
extra_col text,
gploaded_time timestamp without time zone
)
WITH (appendonly=true) DISTRIBUTED BY (imsi);
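The "AFTER" SQL in step 2 copies the loaded rows into the main FACT table rtitrans, which is assumed to share the load table's column layout. A minimal sketch, ignoring any partitioning the real FACT table may have:
-- Hypothetical FACT table with the same columns as the load table
CREATE TABLE rtiadmin.rtitrans (LIKE rtiadmin.rtitrans_etl4)
WITH (appendonly=true) DISTRIBUTED BY (imsi);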
2. Define the GPLOAD YAML control file as follows
VERSION: 1.0.0.1
DATABASE: mydb
USER: rtiadmin
HOST: 172.1.1.1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - loadhost
         PORT: 8100
         FILE:
           - /data/rti/stage/run/*.csv
    - COLUMNS:
        - imsi: text
        - subscriber_mccmnc: text
        - msisdn: text
        - imei: text
        - called_digits: text
        - start_datetime: text
        - end_datetime: text
        - first_cell_lac: integer
        - first_cell_idsac: integer
        - current_cell_lac: integer
        - current_cell_idsac: integer
        - dr_type: integer
        - status: text
        - ingest_time: bigint
        - processed_time: bigint
        - export_time: bigint
        - extra_col: text
    - FORMAT: text
    - HEADER: false
    - DELIMITER: ','
    - NULL_AS: ''
    - ERROR_LIMIT: 999999
    - ERROR_TABLE: rtiadmin.rtitrans_etl4_err
   OUTPUT:
    - TABLE: rtiadmin.rtitrans_etl4
    - MODE: INSERT
    - MAPPING:
        imsi: imsi
        subscriber_mccmnc: subscriber_mccmnc
        msisdn: msisdn
        imei: imei
        called_digits: called_digits
        start_datetime: 'substr(start_datetime, 1, 10)::int'
        end_datetime: 'substr(end_datetime, 1, 10)::int'
        first_cell_lac: first_cell_lac
        first_cell_idsac: first_cell_idsac
        current_cell_lac: current_cell_lac
        current_cell_idsac: current_cell_idsac
        dr_type: dr_type
        status: status
        ingest_time: ingest_time
        processed_time: processed_time
        export_time: export_time
        extra_col: extra_col
        gploaded_time: 'current_timestamp'
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
   SQL:
    - AFTER: "insert into rtitrans select * from rtitrans_etl4"
3. Call GPLOAD as follows
source $HOME/.bash_profile
gpload -f rtidata.yml
Note: We set the $PGPASSWORD environment variable, which is used during the load if a password is required, as it was in this demo.
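Since the script sources $HOME/.bash_profile first, the relevant line in that file would be something like the following (the value shown is a placeholder, not a real password):
export PGPASSWORD='mypassword'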
A few things are worth noting here.
REUSE_TABLES: This ensures the external tables created during the load are kept and re-used on the next load.
TRUNCATE: This clears the load table prior to each load. We use it because the data is copied into the main FACT table by the "AFTER" SQL statement once the load is finished.
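Putting it together, the cron script mentioned at the start might look roughly like the sketch below; the script name, the HDFS archive path and the hdfs dfs commands are assumptions used for illustration, not part of the original setup.
#!/bin/bash
# Hypothetical cron wrapper: load staged CSVs with gpload, then archive them to Hadoop
source $HOME/.bash_profile

STAGE_DIR=/data/rti/stage/run                    # matches FILE: in the YAML
ARCHIVE_DIR=/archive/rti/$(date +%Y%m%d%H%M)     # hypothetical HDFS target directory

# Run the load; gpload returns a non-zero exit code on failure
if gpload -f rtidata.yml; then
    # Archive the loaded files onto Hadoop, then clear them from the stage area
    hdfs dfs -mkdir -p "$ARCHIVE_DIR"
    hdfs dfs -put "$STAGE_DIR"/*.csv "$ARCHIVE_DIR"/ && rm -f "$STAGE_DIR"/*.csv
else
    echo "gpload failed; leaving files in $STAGE_DIR" >&2
    exit 1
fi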