How to correctly use HP Vertica Update statement in Talend Open Studio

Feb 16, 2015 | BlogPosts, Tech Tips

The Issue

HP Vertica is an analytic database platform specialized in OLAP workloads. We implemented it in one of our Big Data Management projects in the past and it proved to be a great solution. During the development phase it came to our attention that many people complain about data insert speeds, especially UPDATE statements.

This article will explain how to correctly use HP Vertica Analytic Database in order to achieve great data import and update performance.

Talend Open Studio

Below I will give a brief description of the different HP Vertica components present in Talend Open Studio. The table not only describes each component but also gives some valuable information on when to use it.

Component Name: Description

tVerticaConnection: The Advanced setting "Auto Commit" should not be selected, because a commit would then be performed after every SQL statement, significantly reducing the performance of INSERT and UPDATE statements (a rough JDBC illustration of the difference follows this table).

tVerticaOutputBulk: Normally used together with tVerticaBulkExec (described below), but the pair can be substituted with the tVerticaOutputBulkExec component. The advantage of using two components over one is that the data can be transformed before it is loaded into the database; additionally, an "if" trigger can be implemented between them.
Important: The Advanced settings make it possible to change the "Row Separator" and "Field Separator", but bear in mind that no additional CSV options such as "Escape char" and "Text enclosure" can be selected, so make sure that the "Field Separator" you are using is not present in the data being loaded into the tVerticaOutputBulk component.

tVerticaBulkExec: Should be used together with the tVerticaOutputBulk component described above. The "Write to ROS (Read Optimized Store)" option (located in Advanced settings) is active by default, but should only be active if the output file is greater than 100MB, so consider an "if" trigger that checks the file size reported by a tFileProperties component.
Important: When using it together with tVerticaOutputBulk, make sure that the "Field Separator" value located in the Advanced settings of this component is identical to the "Field Separator" value in tVerticaOutputBulk. By default they do not match.

tVerticaOutput: Should be used for small inserts or updates. The "Commit every 100" default option set in the Advanced settings should give you an idea of how small those inserts and updates should be.

tVerticaOutputBulkExec: Combines both tVerticaOutputBulk and tVerticaBulkExec into a single component. Should be used if no additional data manipulation is required. The "Write to ROS (Read Optimized Store)" option (located in Advanced settings) is active by default, but should only be active if the output file is greater than 100MB.

tVerticaCommit: Can be used to commit explicitly when Auto Commit is not active on the connection.
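To make the Auto Commit and batch-size advice above more concrete, here is a minimal JDBC sketch (an illustration, not the code Talend actually generates) of loading rows with auto-commit disabled and batched statements. The connection URL, credentials and column names are assumptions for the example; only the target table name is taken from the example job further down.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class VerticaBatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust host, database and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host:5433/exampledb", "dbadmin", "secret")) {

            // Same effect as leaving "Auto Commit" unticked on tVerticaConnection:
            // without this, every single INSERT is committed on its own.
            conn.setAutoCommit(false);

            // Column names are made up for the illustration.
            String sql = "INSERT INTO blogPost_Test.DummyTestTable (id, payload) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int batchSize = 10000;           // same idea as "Use batch mode" = 10000
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row " + i);
                    ps.addBatch();               // queue the row instead of executing it
                    if ((i + 1) % batchSize == 0) {
                        ps.executeBatch();       // send 10000 rows in one round trip
                    }
                }
                ps.executeBatch();               // flush the remainder
                conn.commit();                   // one commit for the whole load
            }
        }
    }
}

Committing after every statement forces the database to do far more work per row, which is exactly what the first example job below runs into.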

 

Examples

Below you will find two jobs that showcase how much impact using the correct components has.

Incorrect way

The first example showcases an INSERT and an UPDATE using the tVerticaOutput component.

The data for the job is created using a tRowGenerator. 10 million rows of dummy data are generated, producing a 700MB file. This file is then fed into a tMap, where a lookup is made against randomly generated IDs from the same sequence. 5 million IDs are generated in the lookup that match the main input file; all rows that do not match go to the Rejects output and are treated as updates.


The first run is executed using tVerticaOutput without "Use batch mode" enabled in the Advanced settings of the component.

Below you will see the results:

 

Starting job Simple_HP_vertica at 10:10 05/02/2015.

[statistics] connecting to socket on port 3817

[statistics] connected

Job Simple_HP_vertica ended at 10:40 05/02/2015. [exit code=0]

 

The job was killed after 30 minutes. It had not reached the UPDATE part of the job and had only finished half of the INSERTs.

The job was then run again, but this time "Use batch mode" was activated and the batch size was set to 10000.

 

Starting job Simple_HP_vertica at 10:53 05/02/2015.

[statistics] connecting to socket on port 3392

[statistics] connected

Job Simple_HP_vertica ended at 11:11 05/02/2015. [exit code=0]

 

After 20 minutes the job had finished the INSERTs but had only managed to UPDATE 10000 rows. Better, but not perfect.

Correct way

Now let's make a small change to the job and replace the tVerticaOutput component with two components: tVerticaOutputBulk and tVerticaBulkExec.

Correct update

With this change implemented the job finished in 72 seconds.

 

Starting job HP_Vertica at 11:12 05/02/2015.

[statistics] connecting to socket on port 3870

[statistics] connected

Input File Processed.

   Commencing UPSERT

Number of records inserted into table blogPost_Test.DummyTestTable by tVerticaBulkExec_1: 5000000

Number of records inserted into table blogPost_Test.tmp_tVerticaBulkExec_3bMY2XU1 by tVerticaBulkExec_3: 5000000

Job Execution Time: 72661 milliseconds (72 seconds)

[statistics] disconnected

Job HP_Vertica ended at 11:13 05/02/2015. [exit code=0]
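The temporary table in the log above (tmp_tVerticaBulkExec_3bMY2XU1) hints at how the bulk components deal with the UPDATE half of the job: the update rows are bulk-loaded into a staging table with COPY, and the target table is then updated from that staging table in one set-based statement instead of one UPDATE per row. The sketch below walks through that pattern by hand over JDBC; the staging column names, the file path and the use of MERGE are assumptions for the illustration, not necessarily the exact statements Talend issues.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VerticaBulkUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and file path; adjust for your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host:5433/exampledb", "dbadmin", "secret");
             Statement st = conn.createStatement()) {

            conn.setAutoCommit(false);

            // Stage the update rows; column names are made up for the illustration.
            st.execute("CREATE LOCAL TEMPORARY TABLE tmp_updates (id INT, payload VARCHAR(255)) "
                     + "ON COMMIT PRESERVE ROWS");

            // COPY is the kind of bulk load these components rely on; the DIRECT keyword
            // corresponds to the "Write to ROS" option and is only worth enabling for
            // files larger than roughly 100MB.
            st.execute("COPY tmp_updates FROM LOCAL '/tmp/updates.csv' DELIMITER ';' DIRECT");

            // Apply all updates in a single set-based statement instead of row-by-row UPDATEs.
            st.execute("MERGE INTO blogPost_Test.DummyTestTable tgt USING tmp_updates src "
                     + "ON (tgt.id = src.id) "
                     + "WHEN MATCHED THEN UPDATE SET payload = src.payload");

            conn.commit();
        }
    }
}

Whether the set-based step uses MERGE or an UPDATE ... FROM is a detail; the point is that one bulk load plus one statement is dramatically cheaper than millions of individual UPDATEs, which is why this version of the job finishes in 72 seconds rather than tens of minutes.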

Conclusion

As you can see, depending on which components are used in the job, the overall performance can change drastically. Make sure that you use the correct components, and the correct settings for those components, to achieve the best results, instead of going on forums and proclaiming that the database is slow or that a Talend component is not working properly.
