After installing PHD add on for Pivotal Cloud Foundry 1.1 I quickly created some development services for PHD using the CLI as shown below.
[Tue Apr 15 22:40:08 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hawq-cf free dev-hawq
Creating service dev-hawq in org pivotal / space development as pas...
OK
[Tue Apr 15 22:42:31 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hbase-cf free dev-hbase
Creating service dev-hbase in org pivotal / space development as pas...
OK
[Tue Apr 15 22:44:10 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hive-cf free dev-hive
Creating service dev-hive in org pivotal / space development as pas...
OK
[Tue Apr 15 22:44:22 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-yarn-cf free dev-yarn
Creating service dev-yarn in org pivotal / space development as pas...
OK
Finally using the web console to brow the services in the "Development" space
Tuesday, 15 April 2014
Wednesday, 9 April 2014
Pivotal Greenplum GPLOAD with multiple CSV files
I recently needed to setup a cron script which loaded CSV files from a directory into Greenplum every 2 minutes. Once loaded the files are moved onto Hadoop for archive purposes. The config below shows how to use GPLOAD data load utility which utilises GPFDIST.
1. Create a load table. In this example the data is then moved to the FACT table once the load is complete
2. GPLOAD yaml file defined as follows
VERSION: 1.0.0.1
DATABASE: mydb
USER: rtiadmin
HOST: 172.1.1.1
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- loadhost
PORT: 8100
FILE:
- /data/rti/stage/run/*.csv
- COLUMNS:
- imsi : text
- subscriber_mccmnc : text
- msisdn : text
- imei : text
- called_digits : text
- start_datetime : text
- end_datetime : text
- first_cell_lac : integer
- first_cell_idsac : integer
- current_cell_lac : integer
- current_cell_idsac : integer
- dr_type : integer
- status : text
- ingest_time : bigint
- processed_time : bigint
- export_time : bigint
- extra_col : text
- FORMAT: text
- HEADER: false
- DELIMITER: ','
- NULL_AS : ''
- ERROR_LIMIT: 999999
- ERROR_TABLE: rtiadmin.rtitrans_etl4_err
OUTPUT:
- TABLE: rtiadmin.rtitrans_etl4
- MODE: INSERT
- MAPPING:
imsi : imsi
subscriber_mccmnc : subscriber_mccmnc
msisdn : msisdn
imei : imei
called_digits : called_digits
start_datetime : substr(start_datetime, 1, 10)::int
end_datetime : substr(end_datetime, 1, 10)::int
first_cell_lac : first_cell_lac
first_cell_idsac : first_cell_idsac
current_cell_lac : current_cell_lac
current_cell_idsac : current_cell_idsac
dr_type : dr_type
status : status
ingest_time : ingest_time
processed_time : processed_time
export_time : export_time
extra_col : extra_col
gploaded_time : current_timestamp
PRELOAD:
- TRUNCATE : true
- REUSE_TABLES : true
SQL:
- AFTER : "insert into rtitrans select * from rtitrans_etl4"
3. Call GPLOAD as follows
source $HOME/.bash_profile
gpload -f rtidata.yml
1. Create a load table. In this example the data is then moved to the FACT table once the load is complete
drop table rtiadmin.rtitrans_etl4; CREATE TABLE rtiadmin.rtitrans_etl4 ( imsi character varying(82), subscriber_mccmnc character varying(10), msisdn character varying(82), imei character varying(50), called_digits character varying(50), start_datetime integer, end_datetime integer, first_cell_lac integer, first_cell_idsac integer, current_cell_lac integer, current_cell_idsac integer, dr_type integer, status character varying(50), ingest_time bigint, processed_time bigint, export_time bigint, extra_col text, gploaded_time timestamp without time zone ) WITH (appendonly=true) DISTRIBUTED BY (imsi);
2. GPLOAD yaml file defined as follows
VERSION: 1.0.0.1
DATABASE: mydb
USER: rtiadmin
HOST: 172.1.1.1
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- loadhost
PORT: 8100
FILE:
- /data/rti/stage/run/*.csv
- COLUMNS:
- imsi : text
- subscriber_mccmnc : text
- msisdn : text
- imei : text
- called_digits : text
- start_datetime : text
- end_datetime : text
- first_cell_lac : integer
- first_cell_idsac : integer
- current_cell_lac : integer
- current_cell_idsac : integer
- dr_type : integer
- status : text
- ingest_time : bigint
- processed_time : bigint
- export_time : bigint
- extra_col : text
- FORMAT: text
- HEADER: false
- DELIMITER: ','
- NULL_AS : ''
- ERROR_LIMIT: 999999
- ERROR_TABLE: rtiadmin.rtitrans_etl4_err
OUTPUT:
- TABLE: rtiadmin.rtitrans_etl4
- MODE: INSERT
- MAPPING:
imsi : imsi
subscriber_mccmnc : subscriber_mccmnc
msisdn : msisdn
imei : imei
called_digits : called_digits
start_datetime : substr(start_datetime, 1, 10)::int
end_datetime : substr(end_datetime, 1, 10)::int
first_cell_lac : first_cell_lac
first_cell_idsac : first_cell_idsac
current_cell_lac : current_cell_lac
current_cell_idsac : current_cell_idsac
dr_type : dr_type
status : status
ingest_time : ingest_time
processed_time : processed_time
export_time : export_time
extra_col : extra_col
gploaded_time : current_timestamp
PRELOAD:
- TRUNCATE : true
- REUSE_TABLES : true
SQL:
- AFTER : "insert into rtitrans select * from rtitrans_etl4"
source $HOME/.bash_profile
gpload -f rtidata.yml
Note: We use the ENV variable as $PGPASSWORD which is used during the load if a password is required which was in this demo
Few things worth noting here.
REUSE_TABLES : This ensures the external tables created during the load are maintained and re-used on next load.
TRUNCATE: This clears the load table prior to load and we use this as we COPY the data once the load is finished into the main FACT table using the "AFTER"