Why CSV is a bad idea and should not be considered
Yes, I can already hear your shoutings.
„What is he talking about? CSV is a great format, and the easiest to use by far! You just String.split() by ‚;‘ and are done!!!“
Well, I’ll try to be honest: You are very wrong.
First of all: CSV is not a format. It is a bunch of values, separated by a character (this is what CSV stands for: Character Separated Values). There might be two of them. Or twenty. Or twohundred. We don’t know until we parse it. But even then we cannot say for sure that the column we are currently processing actually contains the data we are expecting. We rely on hope. Therefore we write down some expectations on the CSV file and call this format. Sometimes this works.
Second: If you are talking about the easyness of handling CSV, you are talking about that colleague that exports her Excel sheets to a .csv file. Yeah, that’s pretty easy. So, let’s be honest on this one, too: We decide to use CSV, because it is the easiest thing to do for the customer. She does not need a specialized application to create structured data with a strict format. She just needs to open Excel, fill in a few columns and then click on „Save as“.
As nice as it is to our beloved customer: This is where every implementer’s nightmare starts. Because the customer does not care how this CSV file looks like. What she does care is what her pretty Excel Sheet looks like, because this is what will (probably) be seen by someone (important). So she will be using all her Excel Skills and every aestethic sense to create a great looking Workbook, including carriage returns and almost every obscure UTF-8 character the codepage has to offer. And I cannot even be mad at her: If the CSV shall include, let’s say, a column for a description of something, I do not want to put that 2000 character description in one single line. If you have ever tried to do that with Excel will will know that you almost are forced to use carriage returns!
But this leads us to
Third: You cannot just String.split() by ‚;‘ and are done. You need to check if the columns are, by chance, also surrounded by quotes, because if so, you need to be aware of the fact that each column might include at least one ocurrence of your separator ‚;‘, which belongs to the value and shall not be considered as your separator. You also have to consider carriage returns, therefore you need to implement readahead of your CSV file. You also can never be sure if the customer did not move a column to another or splits one into two because „it looks more pretty“.
The list goes on and on, and over the time you will be implementing a monster in form of your very own CSV parser. By this time you will have also created your very own „format“ of a CSV file, because after several months of bugfixing „that crappy parser thing“ (that, by the way, has destroyed some of your reputation) you will have implemented at least some basic boundary checking on the supplied file. You will also be very tired at this point. Even if you were clever enough to use one of the several CSV parsers out there, you will still have a hard time.
- Because you are fighting human creativity. You won’t believe what people did to their Excel Sheet just to make it look „perfect“.
- Because the customer is not aware that her creativity is causing problems. She will be even mad at you because you force her to make her Excel Sheet look ugly.
- Because you are forgetting one of the most important rules: Fix the sender, not the receiver.
Here we are, looking at Excel. And Excel does not do ANY checks for you. It just exports CSV. And most of the time it is capable of importing its own CSV export. Most of the time…
So, what to do? Well, I think it depends on the complexity of the data. Of course a CSV processor that processes data created by another machine that also follows strict patterns will be implemented very fast. But as soon as Excel Sheets are involved that transfer more than, let’s say, 10 Columns and 20 lines, you better think about doing something else. From my personal experience, implementing a specialized GUI will help you check the boundaries as your customer starts to input data. Of course it will take more time to deliver. But I bet you will save several weeks of work fixing another issue that came up with the latest CSV file. Also, you have absolute control over the to-be-used export format. By using annotations on Java classes you are able to marshall data to XML and unmarshall it back to objects without even thinking about how to parse this format.
I mean, how cool is that?!?