Do you ever work with data?
Do you ever work with datasets that have multiple rows and columns?
Did you ever encounter one of the following issues?
- Weird characters show up in the middle of a word: Ã©â‚¬Âµ.
- Rows suddenly have more or less columns, or values end up in the wrong column.
- Dates or numbers fail to get loaded in the way you expect them to.
If you recognize any of these issues, you are deforming your data in ways that might impact anything you use it for at a later stage. There are a few technicalities on how data is stored that you should be aware of in order to work around these issues. If you work with data, you need to know this. Not being a computer scientist is not an excuse.
To stress the importance of this point, I would like to borrow a piece of text from Joel Spolsky’s post on character encoding, and make some changes to it:
´So I have an announcement to make:
if you are a programmer working in 2003if you are working with datasets and you don’t know the basics of characters, character sets, encodings, and Unicodefile types, encodings, delimiters and data types, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.´
File types – and why CSV files are not Excel files
- ‘My .csv data looks weird when I open it in Excel, I even changed the extension to .xls, but the data still doesn’t show up properly.’
- ‘I’ve received data as a text file (.txt), now I can’t open it with Excel.’
All files have a name. In most operating systems (e.g. Windows) it is common to make the names end with an extension. An extension is simply what follows behind the last dot in a filename. The extension does not define what the file type is, it simply suggests it. Your operating system uses the extension to know what application to use to open the file. Hence, changing the extension, does nothing to the file type, it simply changes what application gets to open the file. Microsoft has decided to hide the extensions of files from view by default in Windows, because it might confuse the user, causing confusion while doing so.
File types can be grouped into two very distinct groups:
Text files, and binary files.
Anything you open up with a text editor, and looks slightly legible, is a text file.
Anything you open up with a text editor, and looks like your computer exploded, is a binary file.
Text files that contain data often have extensions like .csv, .txt or .tsv, but could have any other extension. Files with extensions .xls or .xlsx should be Excel files, and are binary files (yes, .xlsx files are actually compressed XML archives). When you install Office, Excel is set up to be the default application to open .xls and .xlsx files, but also .csv files. Hence, a lot of people falsely assume .csv files are Excel files. Unfortunately CSV is not a single, well-defined format, and Excel starts guessing how to interpret the data. Thanks, but no thanks.
- Set up your operating system to always show the full filename, including the extension.
- Learn how to open any file with a text editor, so you can check if it is a text or a binary file.
(Some easy methods are: using the “open with…” option if it is available, rename the file to .txt so that your default text editor opens the file, or open the file from within the text editor you would like to use.)
- Don’t rename .csv files to .xls, and never, ever open .csv files in Excel. Learn to import data into Excel instead, and mind the next issues while doing so.
Encoding – and why text does not exist
- ‘I’ve imported my data into Excel, and characters like Ã©â‚¬Âµ are showing up.’
- ‘When looking at my data in a text editor, some characters are replaced by question marks or black rectangles.’
Remember I said earlier all files are text, or binary?
Unfortunately, the rabbit hole goes deeper: there ain’t no such thing as plain text.
Joel Spolsky does an amazing job in his article to explain this, and I would highly suggest reading it, even for non-developers.
What it basically boils down to is that a computer has no idea what a letter is, it only knows 0’s and 1’s, called bits, and is used of working with 8 bits at once, called bytes. A byte has a certain value based on what bits are set to 0 or 1, resulting in 256 possible values. In a text file, each of these values used to represent one character (e.g. a letter, number or symbol). The 256 possible characters quickly became a limitation once Asian, Arabic, mathematical and a whole range of other characters needed to be added. This was ‘solved’ by creating countless systems where two, three, or a variable number of bytes represented one character. These systems are called character encodings, and we are now stuck with a whole bunch of them.
When you receive a text file you are not receiving a row of letters in a certain order, you are receiving a row of bytes in a certain order (if you dare, download a free hex editor and have a look). Without knowing what encoding is being used, there is no way of knowing what characters those bytes represent. The weird characters, question marks and black rectangles start showing up when one or more bytes are interpreted as being in an encoding that they are not. When you open a text file, even your text editor simply guesses what encoding is being used. Thanks, but no thanks.
- When exporting data into a text file, make sure you know what encoding is being used. When receiving data, always ask what encoding is being used. If you get a blank stare, forward this article.
- Be aware that when you are loading data into any application, a certain encoding is being used. A lot of programs do this without even mentioning it (e.g. STATA, Microsoft Notepad), causing users to think the encoding ghost does not live there. Look for the default encoding used by your applications, if you can’t find it in the manual, try searching in online forums.
- When importing data into an application that is encoding-aware (e.g. Excel, MySQL), make sure to select the correct encoding, and don’t rely on the defaults.
When importing data into an application that only supports one specific encoding, learn to convert the encoding beforehand using a text editor that supports this (e.g. Notepad++). There is a subtle difference between displaying bytes using a certain encoding, and converting characters into a certain encoding. The former does not alter the source text file, the latter does.
Byte Order Mark – and why text exists
- ‘Some weird looking characters appear at the first column of the first row of my data.’
- ‘Suddenly there are a few weird characters at the start of my text file.’
Remember I said earlier there is no way of knowing what encoding is being used?
That used to be true, and then came the Byte Order Mark (BOM).
A BOM is made up of a few bytes at the start of a text file indicating what encoding is being used. Convenient, yes, but what if your application doesn’t know about BOMs? You might have guessed it: The bytes of the BOM are shown as if they are normal characters.
When using a certain encoding, like UTF-8, a BOM can or can not be used. So remember “The file is in UTF-8.” does not tell you if there is a BOM or not. Also, a file with an UTF-8 BOM can still contain non-UTF-8 byte combinations. The BOM does not work as a safety mechanism to prevent invalid byte combinations.
- When receiving data, ask if it contains a BOM. You will get a blank stare. Forward this article, or if you’re tech-savvy, try to find out for yourself by opening the file in a text editor that can encode files with and without BOM (e.g. Notepad++), or open the file with a HEX-editor to look for a BOM.
- If the file contains a BOM, and the application you want to import it to does not know about BOM, use a text editor that supports it to remove the BOM (e.g. Notepad++).
Newlines – and why everything is text
- ‘Some of my rows get cut in half.’
- ‘There are weird characters at the end of my rows.’
- ‘All my data shows up on one long line.’
When you look at a text file, words are separated by spaces and tabs, and lines are separated by newlines. Tabs and newlines belong to a group of characters called “control characters” (spaces are just ’empty characters’). These control characters usually don’t show up as actual characters, but rather change how the surrounding text is displayed. Yet, they are still bytes, just like the bytes making up the other characters (e.g. 01000001 is usually represented as “A”, and 00001001 as a tab).
The most common control characters you should know about are tabs, newlines, newlines… and newlines. Yes, there are quite a few different flavors of newlines, the newline doesn’t really exist as such. There’s the carriage return, the line feed, and the combination of both (and some less common ones). The line feed is commonly used on UNIX systems and on OS X; the carriage return is used in Mac OS up to version 9. Windows uses both, a carriage return, followed by a line feed. This makes a lot of sense on a dot-matrix printer, but not so much on a computer screen.
When your application or text editor reads your data and is not familiar with the type of newline being used, they might not get interpreted at all, causing all data to end up on the same row, or some of the control characters might show up as a weird character. This kind of problem commonly does not arise as long as the same platform is being used (Windows, UNIX, Mac). But once you start exchanging data with people using a different platform, this can become an issue.
Because even newlines are text, it is possible that some of the text values in your data contain newlines. When this happens, a tricky problem arises: When an application reads the file and stumbles upon the newline, it will think it has reached the end of a row, causing it to process what follows next, as being the next row.
- When creating your own datasets, always check if your data contains newline characters. If it does, consider removing them, replacing them with another character, or continue reading the section on encapsulation.
- Use a text editor that supports visualization of control characters (e.g. Sublime Text) to check what type of newline is being used.
- Check if the application you are loading the data into allows you to select the type of newline that is being used.
- If your application uses a different type of newline by default, use a text editor (e.g. Notepad++) to change the line endings into whichever style is required.
Delimiters – and why columns don’t exist
- ‘Some values suddenly show up together in the same column.’
- ‘Some values get split over multiple columns.’
- ‘Some values show up in the wrong column.’
Tabular data is often handed around as plain text files, with each line being a different record, and each column being separated by a specific character, called the delimiter (another method: fixed width). The extensions of these files is often .csv, which stands for “Comma-Separated Values”, but the extension might as well be “.txt”, “.dat”, or “.whyevenuseextentions”. For .csv files, you would guess this implies that the columns are separated by commas. Wrong.
As mentioned earlier, CSV is not an official standard. The delimiter is allowed to be anything: a comma, a semicolon, a tab, or even the letter “a”. Later you will find out why this is a good thing, but let us first find out why this is a bad thing. If the delimiter is allowed to be anything, how can your application know what is being used as a delimiter? It can’t, and some applications (e.g. Excel when it uses the defaults) just take a guess, again. Thanks, but no thanks.
As with the newlines, a tricky problem arises if the delimiter also appears in one of the text values. How can an application know if, e.g. a semicolon, is part of the text value, and not the end of a column? Commas, semicolons and tabs are quite common in text values, and as delimiters, further adding to this problem.
- The good thing about being able to choose the delimiter is that when you create your own datafile, you can choose a delimiter that does not appear anywhere in your data.
- When receiving data, open the file with a text editor that can also visualize control characters like tabs (e.g. UltraEdit), and check what delimiter is being used. Using a text editor that can not display tabs (e.g. Microsoft Notepad and WordPad) is a bad idea, because tabs can sometimes be indistinguishable from spaces.
- When loading your data into an application (e.g. Access), select the delimiter explicitly.
- If the application you want to load the data into expects a certain delimiter by default, you will have to convert the delimiter to the required character. When doing so, first make sure the character does not appear anywhere in the data.
Encapsulation – just… don’t
- ‘I’ve imported my data into [PowerfulDataTool v4.4] , and now all my values are surrounded by quotes.’
- ‘I’ve imported by data into [StatisticsPro v3], and some fields break in the middle, and a part of the value ends up in the next column.’
Encapsulation, also known as text qualifiers, is used as a solution to a problem that in most cases is not even there, and its use can usually be avoided without causing the problems that it introduces.
That should be the official Wikipedia definition.
The problem it tries to solve is the following: What happens if one of your values contains the character that you are also using as a delimiter? As we have seen earlier, your data ends up in the wrong column, since your application has no idea it is not intended as a delimiter, but just part of your text.
But wait, someone said: What if we just put double quotes around all of our text values? Then we know that if we see a delimiter character between the double quotes, it’s part of the text. Encapsulation was born, and all was well…
Or was it? It seems the problem was just pushed further down the road. What happens if your value contains a double quote? It is wrongly interpreted as the end of the value, and the rest of the value ends up in the next column.
But wait, someone said: What if we just put a special character in front of the double quote, so that our program knows that the next double quote is part of the text. Escape characters were born, and the people were getting confused, rightfully so…
- Then a more pragmatic developer came along, and said: What are you guys doing? Why not just use a delimiter that does not appear in your text?
- If you do get data that uses encapsulation, ask what escape character is being used, or, if it is certain the encapsulation character will never show up in the values. If you get a blank stare, forward this article.
The two rare exceptions
- The application you are loading the data into only allows you to work with a delimiter that also appears in your text values, and simply removing that character from your data is not an option. Then, using encapsulation and escape characters is your only way out. Unfortunately, it is unlikely that an unsophisticated application like that will support encapsulation and escape characters.
- Your data contains newlines within text values, and you can not simply replace them with spaces. Also then, you are forced to use encapsulation. When doing so, make sure the application you use to load the data also supports newlines appearing in encapsulated values.
Data types – and why numbers and dates don’t exist
- ‘We were using numbers as ID’s, but now they appear without zero’s in front of them (e.g. 0005 -> 5).’
- ‘Excel completely messes up the dates in my data.’
- ‘This text is formatted as a number.’
- ‘NULL, 0 or -9999 starts showing up in fields that were empty before.’
When we, bright and intelligent humans, see a number, a date or a piece of text, we can easily tell them apart. But how does an application that works with your data see the difference? We’ve been here so many times by now that it should be obvious: it takes a guess.
There are countless ways of writing dates, even for humans it’s often hard to tell if, e.g. 1/2/2014, is referring to the first of February, or the second of January. Hence “date” is hardly a data type, you have to know the formatting of the date for the date to make sense.
But numbers are just numbers, right? Well, yes, in 1.315 out of 1,526 cases…
Depending on where you live, a comma or a dot is used as the decimal separator. The comma or dot is sometimes also used as thousands separators. You can imagine the challenges this poses when you are letting users from all over the world insert numerical values into a form.
Another issue arises with text that looks like a number. Things like “0005”. Many tools, like Excel, will interpret this as a number, and save it as “5”. If the zeros were a relevant part of the data, this effectively breaks the data.
And then, the value that you would expect to be the simplest of all, turns out to be the most complex: nothing, empty, no value. In databases this is often visualized as “NULL“, for numbers the default “0” is sometimes used, and some might even decide to store it as “-9999”, to set it apart from the number zero. When any of these visualizations of the empty value get exported into a data file it causes a problem, since you can not tell them apart from their non-empty equivalents.
- Make sure you know the data type, and format of each column. Keep in mind that in a million rows of data, one row could be using a different format. A quick visual check is not sufficient, you will need to automate this.
- Once you know the format, check if the application you want to load the data into supports this format. If it doesn’t, you’ll first have to change the format. This is where Regex might come in handy, but that’s a whole different story.
- When creating your own datafiles, make sure you are exporting empty values as such. Just, two delimiters with nothing in between.
- If you decide to copy-paste data into Excel (I would suggest using Import), make sure to first set the datatype of the column, and then paste the data.
Headers – so close, yet so far
- ‘When I load my data, the first row contains the column names.’
Most people are familiar with this. The first row of the data sometimes contains the column names so you know what is what. Unfortunately, if your application is not expecting the header it simply loads it as a record, et voilà, the column names appear as a record. The intent is great, trying to add documentation to what is being passed around, but as we’ve seen, there is far more we need to know about the data before it can be interpreted properly.
The solution – and the conclusion to this rather long post
I would like to propose the following: Whenever plain text ‘CSV’ data is created, add a file alongside that file describing the Encoding, BOM, newlines, delimiters, encapsulation, data types, and formatting being used. That information is what turns data into data, otherwise it’s just a pile of bytes. (Or, delve into the depths of XML)
And because a good acronym always works to remember stuff, I would call this…
The FEB ‘N DEDH file.
Granted, that must be the worst acronym in the history of acronyms,
but it beats life in a submarine.