Talend Studio: Tips and Tricks (Part 4 – Big Data Edition)

by | Jul 5, 2016 | BlogPosts, Talend, Tech Tips

Introduction

Talend Studio provides a great number of configuration tools for Hadoop clusters. It also gives you the ability to run Spark jobs in your local environment. However, you might still run into issues when running code in distributed mode across the entire cluster. Here are a few tips you can apply to work around some of these issues.

Note: All of these issues can be mitigated if you have admin rights on the Hadoop cluster.

MapReduce

One of the issues you might run into is the job failing because it can’t resolve “mapreduce.application.framework.path”. This happens on HDP 2.2, where the HDP version is not directly specified in the cluster configuration.

For example:

<property>
  <name>mapreduce.application.framework.path</name>
  <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
  <source>mapred-site.xml</source>
</property>

The job will fail because “${hdp.version}” is unknown to the job. To fix this, add “-Dhdp.version=<version>” as a JVM parameter, where <version> is your cluster’s HDP version.
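For example, in Talend Studio this can go into the JVM arguments table under the Run view’s Advanced settings. The version string below is purely illustrative; check the actual version installed on your cluster, for instance with hdp-select on a cluster node:

-Dhdp.version=2.2.4.2-2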

Spark Batch (Local Mode)
Tip #1 – Web UI Port

If you are running Spark in local mode and the default port for the Web UI is already in use, change it to another port that’s free. The job will detect that the default port is taken and switch to another one on its own, but if you are running with log4j you don’t want the unnecessary log entries. Changing it will also make the job start a tiny bit faster, since it won’t need to resolve the port conflict.
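As a sketch, assuming your job’s Spark Configuration tab exposes an Advanced properties table, you could pin the UI to a known free port (4040 is Spark’s default; 4041 below is just an arbitrary alternative):

spark.ui.port    4041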

Tip #2 – Tuning properties

If the job fails with a warning along the lines of “OutOfMemory” and you know you are dealing with large files, or you are performing type conversions, then it’s a good idea to experiment with the driver memory and executor memory. Just don’t go overboard, as some developers have suggested on forum threads; allocating too much memory may starve other processes and cause the exact problem you were trying to solve in the first place.
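A minimal sketch using Spark’s standard tuning properties, assuming the same Advanced properties table as above. The values are illustrative starting points, not recommendations; size them against your data and the RAM actually available on your machine:

spark.driver.memory      2g
spark.executor.memory    4g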

Tip #3 – ulimit restrictions resulting in no output being generated

A local mode Spark job might fail when executing in a Linux environment where the ulimit is set too low. To work around this, increase the value of the “spark.reducer.maxMbInFlight” property from its default to 128 or more. Be aware that you need a machine with plenty of RAM to be able to increase this parameter.
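For example (in Spark 1.x the default for this property is 48, i.e. 48 MB of in-flight map output per reducer, so the line below roughly triples it):

spark.reducer.maxMbInFlight    128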

Conclusion

Thanks for reading this article; I hope it was helpful. If you’d like to read more Talend tips and tricks, you can find Part 1, Part 2, and Part 3 of this blog series here. If you have any other queries or insights, just let me know at info@datalytyx.com. Otherwise, you can subscribe to our blog for a monthly update on the latest articles.
