One thing that always and again gets my attention is date handling in ETL jobs, especially in Talend and when using file input. It seems to me that sometimes there is no clarity about the concept of date fields and formatting. This post tries to clear things up.
The first question to ask is where in the ETL process are we. Lets assume we are going to do this for one locale only. Usually an ETL process could look like this:
[Input] Data source (csv) --> [Processing] Talend job --> [Output] Database
Now, lets assume the date column would be formatted like this in the csv file:
Mon, 16 Jan 2017
And we need to store this date as a Date field for later usage. So first, we need to parse the input from String to Date. For this we can use a date pattern in the input file component. It could look like this:
"EEE, dd MMM yyyy"
All the letters stand for different parts of the date (see the docs for Java 7 SimpleDateFormat for more info):
- EEE: „Day name in week, Text“, three letters
- , : The comma and the space in the date string
- dd: „Day in month, Number“, two letters
- MMM: „Month in year, Text“, three letters
- yyyy: „Year, Number“, four letters
So there you go. With this simple instruction Talend (or better: Java) knows how to interpret this date string. This is important: the input is a date string, not yet a date. Only after the parsing has happened, it will become a Date. If something goes wrong (for example, the date would be formatted differently in a row), an exception will happen and the job will stop.
The processing magic will happen afterwards. We will not dive further into the ETL job, there is only one thing to keep in mind: what if I want to use a tLogRow component to peek into the data and need to set up the date format? Wouldn’t it be easier if it stayed a String?
No. Absolutely not. You would always want to make sure that a date which is just a string is getting parsed to become a Date field. Otherwise you wouldn’t be able to use date calculations or compare earlier / later dates. They, depending on the formatting, cannot be sorted well if they stay a string.
If you need to format the date, there is always the date format option in the schema in Talend. Just press Ctrl + Space to bring up some patterns. Again, those are formatted after the rules in the SimpleDateFormat.
It doesn’t matter if you need the day first or last or whatever formatting is needed. This has nothing to do with the Date field itself. The Date field is just a representation of a „current second in time“, you could say. Deep inside the system, it is just a huge number representing the seconds from the 01.01.1970. This is also known as Unix Time.
Now, with this knowledge it does not matter how the date will be formatted – as long as it is a Date type.
The same is the case for the database. Make sure to always use the appropriate date field. In MySQL for example, there are two different fields called DATE and TIMESTAMP. DATE only holds a date, no time can be stored in this field. This is a bit of a stumble point for developers who come from Oracle, since the behaviour is different there.
Still, storage is again usually done in the Unix Time format. So this means, again, if you read the value later out of the database, then you should think about formatting, not earlier. MySQL does this for example with the DATE_FORMAT function, Oracle with the TO_CHAR function. Make sure to understand the formatting patterns of the functions since they are different to the ones used in Talend.
For example, for MySQL you need to write
SELECT DATE_FORMAT("2017-02-04", "%d.%m.%Y") FROM DUAL
to get 04.02.2017
This also is different for Oracle.
So just make sure to understand where you are in your ETL process. Are you reading the data from the source? Is it text based? Does it need to be parsed into a date? Is the field a date field in the database? And finally: how should the formatting look like?
Oh and one word of advice – make sure you never read a date field in Oracle from the database and, just to make sure it really is a date use the TO_DATE function on that field – Oracle explicitly warns against doing this:
Do not use the TO_DATE function with a DATE value for the char argument. The first two digits of the returned DATE value can differ from the original char, depending on fmt or the default date format.
I have seen this and it showed me that there was no understanding of Date fields. Make sure to understand this to elimiate unneccessary bugs. It is worth the effort.
Picture by Basti93