Wednesday, 9 April 2014

Pivotal Greenplum GPLOAD with multiple CSV files

I recently needed to setup a cron script which loaded CSV files from a directory into Greenplum every 2 minutes. Once loaded the files are moved onto Hadoop for archive purposes. The config below shows how to use GPLOAD data load utility which utilises GPFDIST.

1. Create a load table. In this example the data is then moved to the FACT table once the load is complete

  
drop table rtiadmin.rtitrans_etl4;

CREATE TABLE rtiadmin.rtitrans_etl4 (
    imsi character varying(82),
    subscriber_mccmnc character varying(10),
    msisdn character varying(82),
    imei character varying(50),
    called_digits character varying(50),
    start_datetime integer,
    end_datetime integer,
    first_cell_lac integer,
    first_cell_idsac integer,
    current_cell_lac integer,
    current_cell_idsac integer,
    dr_type integer,
    status character varying(50),
    ingest_time bigint,
    processed_time bigint,
    export_time bigint,
    extra_col text,
    gploaded_time timestamp without time zone
)
WITH (appendonly=true) DISTRIBUTED BY (imsi); 

2. GPLOAD yaml file defined as follows

VERSION: 1.0.0.1
DATABASE: mydb
USER: rtiadmin
HOST: 172.1.1.1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
            - loadhost
         PORT: 8100
         FILE:
          - /data/rti/stage/run/*.csv
    - COLUMNS:
          - imsi : text
          - subscriber_mccmnc : text
          - msisdn : text
          - imei : text
          - called_digits : text
          - start_datetime : text
          - end_datetime : text
          - first_cell_lac : integer
          - first_cell_idsac : integer
          - current_cell_lac : integer
          - current_cell_idsac : integer
          - dr_type : integer
          - status : text
          - ingest_time : bigint
          - processed_time : bigint
          - export_time : bigint
          - extra_col : text
    - FORMAT: text
    - HEADER: false
    - DELIMITER: ','
    - NULL_AS : ''
    - ERROR_LIMIT: 999999
    - ERROR_TABLE: rtiadmin.rtitrans_etl4_err
   OUTPUT:
    - TABLE: rtiadmin.rtitrans_etl4
    - MODE: INSERT
    - MAPPING:
           imsi : imsi
           subscriber_mccmnc : subscriber_mccmnc
           msisdn : msisdn
           imei : imei
           called_digits : called_digits
           start_datetime : substr(start_datetime, 1, 10)::int
           end_datetime : substr(end_datetime, 1, 10)::int
           first_cell_lac : first_cell_lac
           first_cell_idsac : first_cell_idsac
           current_cell_lac : current_cell_lac
           current_cell_idsac : current_cell_idsac
           dr_type : dr_type
           status : status
           ingest_time : ingest_time
           processed_time : processed_time
           export_time : export_time
           extra_col : extra_col
           gploaded_time : current_timestamp
   PRELOAD:
    - TRUNCATE : true 
    - REUSE_TABLES : true
   SQL:
    - AFTER : "insert into rtitrans select * from rtitrans_etl4" 

3. Call GPLOAD as follows

source $HOME/.bash_profile
gpload -f rtidata.yml


Note: We use the ENV variable as $PGPASSWORD which is used during the load if a password is required which was in this demo

Few things worth noting here.

REUSE_TABLES : This ensures the external tables created during the load are maintained and re-used on next load.

TRUNCATE: This clears the load table prior to load and we use this as we COPY the data once the load is finished into the main FACT table using the "AFTER"

No comments: