After installing the PHD (Pivotal HD) add-on for Pivotal Cloud Foundry 1.1, I quickly created some development services for PHD using the CLI as shown below.
[Tue Apr 15 22:40:08 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hawq-cf free dev-hawq
Creating service dev-hawq in org pivotal / space development as pas...
OK
[Tue Apr 15 22:42:31 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hbase-cf free dev-hbase
Creating service dev-hbase in org pivotal / space development as pas...
OK
[Tue Apr 15 22:44:10 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-hive-cf free dev-hive
Creating service dev-hive in org pivotal / space development as pas...
OK
[Tue Apr 15 22:44:22 papicella@:~/vmware/pivotal/products/cloud-foundry ] $ cf create-service p-hd-yarn-cf free dev-yarn
Creating service dev-yarn in org pivotal / space development as pas...
OK
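The four create-service calls above follow one pattern, so they can be scripted. A minimal sketch (echoing the commands as a dry run; drop the leading echo to run them against a real CF target):

```shell
# Dry run: print the four create-service commands used above.
# Remove the leading "echo" to actually create the services.
for svc in hawq hbase hive yarn; do
  echo cf create-service "p-hd-${svc}-cf" free "dev-${svc}"
done
```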
Finally, I used the web console to browse the services in the "Development" space.
Tuesday, 15 April 2014
Wednesday, 9 April 2014
Pivotal Greenplum GPLOAD with multiple CSV files
I recently needed to set up a cron script that loads CSV files from a directory into Greenplum every 2 minutes. Once loaded, the files are moved onto Hadoop for archival purposes. The config below shows how to use the GPLOAD data load utility, which utilises GPFDIST.
1. Create a load table. In this example the data is then moved to the FACT table once the load is complete

drop table rtiadmin.rtitrans_etl4;

CREATE TABLE rtiadmin.rtitrans_etl4
(
    imsi character varying(82),
    subscriber_mccmnc character varying(10),
    msisdn character varying(82),
    imei character varying(50),
    called_digits character varying(50),
    start_datetime integer,
    end_datetime integer,
    first_cell_lac integer,
    first_cell_idsac integer,
    current_cell_lac integer,
    current_cell_idsac integer,
    dr_type integer,
    status character varying(50),
    ingest_time bigint,
    processed_time bigint,
    export_time bigint,
    extra_col text,
    gploaded_time timestamp without time zone
)
WITH (appendonly=true)
DISTRIBUTED BY (imsi);
2. GPLOAD yaml file defined as follows
VERSION: 1.0.0.1
DATABASE: mydb
USER: rtiadmin
HOST: 172.1.1.1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - loadhost
        PORT: 8100
        FILE:
          - /data/rti/stage/run/*.csv
    - COLUMNS:
        - imsi: text
        - subscriber_mccmnc: text
        - msisdn: text
        - imei: text
        - called_digits: text
        - start_datetime: text
        - end_datetime: text
        - first_cell_lac: integer
        - first_cell_idsac: integer
        - current_cell_lac: integer
        - current_cell_idsac: integer
        - dr_type: integer
        - status: text
        - ingest_time: bigint
        - processed_time: bigint
        - export_time: bigint
        - extra_col: text
    - FORMAT: text
    - HEADER: false
    - DELIMITER: ','
    - NULL_AS: ''
    - ERROR_LIMIT: 999999
    - ERROR_TABLE: rtiadmin.rtitrans_etl4_err
   OUTPUT:
    - TABLE: rtiadmin.rtitrans_etl4
    - MODE: INSERT
    - MAPPING:
        imsi: imsi
        subscriber_mccmnc: subscriber_mccmnc
        msisdn: msisdn
        imei: imei
        called_digits: called_digits
        start_datetime: substr(start_datetime, 1, 10)::int
        end_datetime: substr(end_datetime, 1, 10)::int
        first_cell_lac: first_cell_lac
        first_cell_idsac: first_cell_idsac
        current_cell_lac: current_cell_lac
        current_cell_idsac: current_cell_idsac
        dr_type: dr_type
        status: status
        ingest_time: ingest_time
        processed_time: processed_time
        export_time: export_time
        extra_col: extra_col
        gploaded_time: current_timestamp
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
   SQL:
    - AFTER: "insert into rtitrans select * from rtitrans_etl4"
3. Call GPLOAD as follows
source $HOME/.bash_profile
gpload -f rtidata.yml
Note: We set the environment variable $PGPASSWORD, which gpload uses during the load if a password is required, as it was in this demo.
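Since gpload picks the password up from the environment, the variable is exported in the profile sourced above, along these lines (the value is obviously a placeholder):

```shell
# In $HOME/.bash_profile: lets gpload authenticate non-interactively.
# "changeme" stands in for the real rtiadmin password.
export PGPASSWORD=changeme
```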
A few things worth noting here:

REUSE_TABLES: ensures the external tables created during the load are kept and re-used on the next load, rather than dropped and re-created each time.

TRUNCATE: clears the load table prior to each load. We use this because, once the load is finished, the data is copied into the main FACT table by the "AFTER" SQL hook.
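To sanity-check that the AFTER hook actually copied the rows, a quick count of the load table versus the FACT table can be run after a cycle. This is a sketch only; the connection details mirror the YAML control file above:

```shell
# Compare staged rows vs. total FACT rows after a load cycle
# (host/port/db/user taken from the YAML control file above)
psql -h 172.1.1.1 -p 5432 -U rtiadmin -d mydb -c \
  "select (select count(*) from rtiadmin.rtitrans_etl4) as staged,
          (select count(*) from rtitrans) as fact_total;"
```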