The following article is based on true events; only the client, data set and solution have been altered; all other details remain exactly the same.
My client has a 16GB csv file that his team want to analyse. The problem is that their only analysis tool is Microsoft Excel and there are several times too many rows in the file for Excel to open. The solution… To split the file into smaller files based on dates in a given date field.
In this first blog post, we shall explore a simple solution to this issue. Part 2 shall offer a more complex – and more suitable – solution.
So, you have some data in a file (or table) that you want to split into multiple, smaller files (or tables) based on the date in a particular column. In this example, we have a simple table with an ID column and a column.
We want to save each row to a file based on the month and year of the date. For example, at the end of the process, one of the files ought to be called “filesplit_201401” with 2 rows in it (ID’s 1 and 8) because they both feature dates in January 2014.
How It’s Done
Below is a screenshot of the Talend job I put together to do this.
The tForeach is a looping component that will iterate on each value in a list. In the screenshot below you can see that I have a list of dates that I want to loop through.
The tFileinputDelimited_1 component is going to read the rows from my csv file into the job (see first screenshot for the contents of my csv file).
The tFilterRow_1 will filter out any rows that do not have a date which is the same year and month as the current value in the tForEach_1. For example, when you run the job the first value in the list is “01/01/2014”, so when the rows are passed through the tFilterRow_1 component, only dates in January 2014 will be allowed to pass through to the next component.
The tFileOutputDelimted_1 component writes the rows to a delimited file.
When the job runs for each value in the tForeach list, the corresponding rows from the file will filter through and are written to a file by the tFileoutputDelimited component; the name of the file is based on the value in the list. The following piece of code in the file name field is responsible for appending the filename with the correct year and month:
Below is a screenshot of the folder where the output files were written and as you can see there is a file for each month of 2014.
And here is the filesplit_201401 file. As expected, we have only 2 rows from the original output both of which are in January of 2014.
We have just gone through how to split a file up into several smaller files based on the date value of a given field. The solution does have one major flaw however; I already knew what dates would be in the file and I specified which dates to split the file into before I ran the job… What if we had a huge file with an unknown range of dates but we still wanted to split the file? This challenge shall be addressed in the second part of this blog.
How to split a file based on a date value – Part 2: coming soon.