The Absolute Minimum Everyone Working With Data Absolutely, Positively Must Know About File Types, Encoding, Delimiters and Data Types (No Excuses!)

Do you ever work with data?
Do you ever work with datasets that have multiple rows and columns?
Did you ever encounter one of the following issues?

If you recognize any of these issues, you are deforming your data in ways that might impact anything you use it for at a later stage. There are a few technicalities on how data is stored that you should be aware of in order to work around these issues. If you work with data, you need to know this. Not being a computer scientist is not an excuse.

To stress the importance of this point, I would like to borrow a piece of text from Joel Spolsky’s post on character encoding, and make some changes to it:

“So I have an announcement to make: if you are working with datasets and you don’t know the basics of file types, encodings, delimiters and data types, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.”

That being said, let’s start with the basics…

File types – and why CSV files are not Excel files

The problem
  • ‘My .csv data looks weird when I open it in Excel, I even changed the extension to .xls, but the data still doesn’t show up properly.’
  • ‘I’ve received data as a text file (.txt), now I can’t open it with Excel.’
The explanation

All files have a name. In most operating systems (e.g. Windows) it is common for the name to end with an extension. An extension is simply whatever follows the last dot in a filename. The extension does not define what the file type is, it merely suggests it. Your operating system uses the extension to decide which application to use to open the file. Hence, changing the extension does nothing to the file type, it simply changes which application gets to open the file. Microsoft decided to hide file extensions from view by default in Windows because they might confuse the user, causing plenty of confusion by doing so.


File types can be grouped into two very distinct groups: text files and binary files.
Anything you open with a text editor that looks even slightly legible is a text file.
Anything you open with a text editor that looks like your computer exploded is a binary file.

Text files that contain data often have extensions like .csv, .txt or .tsv, but could have any other extension. Files with extensions .xls or .xlsx should be Excel files, and are binary files (yes, .xlsx files are actually compressed XML archives). When you install Office, Excel is set up to be the default application to open .xls and .xlsx files, but also .csv files. Hence, a lot of people falsely assume .csv files are Excel files. Unfortunately CSV is not a single, well-defined format, and Excel starts guessing how to interpret the data. Thanks, but no thanks.
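
If you’re comfortable with a bit of Python, here is a minimal sketch of a quick text-versus-binary check you can script yourself. It is only a heuristic (based on the fact that NUL bytes almost never occur in text files), and the filename is just a placeholder:

    # Rough heuristic: a NUL byte almost never occurs in a text file,
    # so finding one is a strong hint that the file is binary.
    def looks_binary(path, sample_size=4096):
        with open(path, "rb") as f:     # read raw bytes, no decoding
            sample = f.read(sample_size)
        return b"\x00" in sample

    print("binary" if looks_binary("mystery_file") else "probably text")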

The solution
  • Set up your operating system to always show the full filename, including the extension.
  • Learn how to open any file with a text editor, so you can check if it is a text or a binary file.
    (Some easy methods are: using the “open with…” option if it is available, rename the file to .txt so that your default text editor opens the file, or open the file from within the text editor you would like to use.)
  • Don’t rename .csv files to .xls, and never, ever open .csv files in Excel. Learn to import data into Excel instead, and mind the next issues while doing so.

Encoding – and why text does not exist

The problem
  • ‘I’ve imported my data into Excel, and characters like 逵 are showing up.’
  • ‘When looking at my data in a text editor, some characters are replaced by question marks or black rectangles.’
The explanation

Remember I said earlier all files are text, or binary?
Unfortunately, the rabbit hole goes deeper: there ain’t no such thing as plain text.
Joel Spolsky does an amazing job in his article to explain this, and I would highly suggest reading it, even for non-developers.

What it basically boils down to is that a computer has no idea what a letter is. It only knows 0’s and 1’s, called bits, and is used to working with 8 bits at once, called bytes. A byte has a certain value based on which bits are set to 0 or 1, resulting in 256 possible values. In a text file, each of these values originally represented one character (e.g. a letter, number or symbol). Those 256 possible characters quickly became a limitation once Asian, Arabic, mathematical and a whole range of other characters needed to be added. This was ‘solved’ by creating countless systems where two, three, or a variable number of bytes represent one character. These systems are called character encodings, and we are now stuck with a whole bunch of them.

When you receive a text file you are not receiving a sequence of letters, you are receiving a sequence of bytes (if you dare, download a free hex editor and have a look). Without knowing which encoding is being used, there is no way of knowing what characters those bytes represent. The weird characters, question marks and black rectangles start showing up when one or more bytes are interpreted as being in an encoding that they are not. When you open a text file, even your text editor simply guesses what encoding is being used. Thanks, but no thanks.
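
To make this concrete, here is a tiny Python sketch showing the same handful of bytes read with two different encodings; the second guess produces exactly the kind of weird characters described above:

    # The same bytes, decoded with two different encodings.
    raw = "café".encode("utf-8")    # b'caf\xc3\xa9' - just bytes on disk
    print(raw.decode("utf-8"))      # café   (the right guess)
    print(raw.decode("latin-1"))    # cafÃ©  (the wrong guess: mojibake)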

The solution
  • When exporting data into a text file, make sure you know what encoding is being used. When receiving data, always ask what encoding is being used. If you get a blank stare, forward this article.
  • Be aware that when you are loading data into any application, a certain encoding is being used. A lot of programs do this without even mentioning it (e.g. STATA, Microsoft Notepad), causing users to think the encoding ghost does not live there. Look for the default encoding used by your applications, if you can’t find it in the manual, try searching in online forums.
  • When importing data into an application that is encoding-aware (e.g. Excel, MySQL), make sure to select the correct encoding, and don’t rely on the defaults.
  • When importing data into an application that only supports one specific encoding, learn to convert the encoding beforehand using a text editor that supports this (e.g. Notepad++). There is a subtle difference between displaying bytes using a certain encoding, and converting characters into a certain encoding. The former does not alter the source text file, the latter does (see the sketch below).
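
As promised, a minimal conversion sketch in Python. The filenames are hypothetical, and the source encoding is assumed to be known (here latin-1):

    # Convert a file from latin-1 to UTF-8. Unlike merely *displaying* the
    # file with a different encoding, this rewrites the bytes on disk.
    with open("data_latin1.txt", encoding="latin-1") as src:
        text = src.read()                        # bytes -> characters
    with open("data_utf8.txt", "w", encoding="utf-8") as dst:
        dst.write(text)                          # characters -> new bytes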

Byte Order Mark – and why text exists

The problem
  • ‘Some weird looking characters appear at the first column of the first row of my data.’
  • ‘Suddenly there are a few weird characters at the start of my text file.’
The explanation

Remember I said earlier there is no way of knowing what encoding is being used?
That used to be true, and then came the Byte Order Mark (BOM).

A BOM is made up of a few bytes at the start of a text file indicating what encoding is being used. Convenient, yes, but what if your application doesn’t know about BOMs? You might have guessed it: The bytes of the BOM are shown as if they are normal characters.

When using a certain encoding, like UTF-8, a BOM may or may not be present. So remember: “the file is in UTF-8” does not tell you whether there is a BOM or not. Also, a file with a UTF-8 BOM can still contain non-UTF-8 byte combinations; the BOM does not work as a safety mechanism preventing invalid byte combinations.
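
For the tech-savvy, here is a small Python sketch that checks a (hypothetical) file for a UTF-8 BOM and strips it, since that is usually what an application that doesn’t know about BOMs needs:

    import codecs

    with open("data.csv", "rb") as f:
        content = f.read()

    if content.startswith(codecs.BOM_UTF8):          # b'\xef\xbb\xbf'
        with open("data_nobom.csv", "wb") as f:
            f.write(content[len(codecs.BOM_UTF8):])  # same bytes, minus BOM

    # (When reading text in Python, encoding="utf-8-sig" skips a BOM for you.)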

The solution
  • When receiving data, ask if it contains a BOM. You will get a blank stare. Forward this article, or if you’re tech-savvy, try to find out for yourself by opening the file in a text editor that can encode files with and without a BOM (e.g. Notepad++), or open the file with a hex editor and look for a BOM.
  • If the file contains a BOM, and the application you want to import it to does not know about BOM, use a text editor that supports it to remove the BOM (e.g. Notepad++).

Newlines – and why everything is text

The problem
  • ‘Some of my rows get cut in half.’
  • ‘There are weird characters at the end of my rows.’
  • ‘All my data shows up on one long line.’
The explanation

When you look at a text file, words are separated by spaces and tabs, and lines are separated by newlines. Tabs and newlines belong to a group of characters called “control characters” (spaces are just ‘empty characters’). These control characters usually don’t show up as actual characters, but rather change how the surrounding text is displayed. Yet, they are still bytes, just like the bytes making up the other characters (e.g. 01000001 is usually represented as “A”, and 00001001 as a tab).

The most common control characters you should know about are tabs, newlines, newlines… and newlines. Yes, there are quite a few different flavors of newlines; the newline doesn’t really exist as such. There’s the carriage return, the line feed, and the combination of both (and some less common ones). The line feed is commonly used on UNIX systems and on OS X; the carriage return was used in Mac OS up to version 9. Windows uses both: a carriage return followed by a line feed. This makes a lot of sense on a dot-matrix printer, but not so much on a computer screen.

When your application or text editor reads your data and is not familiar with the type of newline being used, the newlines might not get interpreted at all, causing all data to end up on the same row, or some of the control characters might show up as weird characters. This kind of problem usually does not arise as long as the same platform is being used (Windows, UNIX, Mac). But once you start exchanging data with people using a different platform, it can become an issue.

Because even newlines are just text, it is possible that some of the text values in your data contain newlines. When this happens, a tricky problem arises: when an application reads the file and stumbles upon the newline, it will think it has reached the end of a row, and process whatever follows as the next row.
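
If you want to see the different newline flavors for yourself, here is a short Python sketch; the byte strings stand in for the raw contents of a file:

    # The three common newline flavours, written out as raw bytes.
    unix    = b"row1\nrow2\n"      # line feed (LF): UNIX, Linux, OS X
    old_mac = b"row1\rrow2\r"      # carriage return (CR): Mac OS 9 and older
    windows = b"row1\r\nrow2\r\n"  # CR followed by LF: Windows

    # A reader that only understands LF sees a stray CR at the end of each
    # Windows row, and sees the old Mac file as one single row.
    print(windows.split(b"\n"))    # [b'row1\r', b'row2\r', b'']
    print(old_mac.split(b"\n"))    # [b'row1\rrow2\r']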

The solution
  • When creating your own datasets, always check if your data contains newline characters. If it does, consider removing them, replacing them with another character, or continue reading the section on encapsulation.
  • Use a text editor that supports visualization of control characters (e.g. Sublime Text) to check what type of newline is being used.
  • Check if the application you are loading the data into allows you to select the type of newline that is being used.
  • If your application uses a different type of newline by default, use a text editor (e.g. Notepad++) to change the line endings into whichever style is required.

Delimiters – and why columns don’t exist

The problem
  • ‘Some values suddenly show up together in the same column.’
  • ‘Some values get split over multiple columns.’
  • ‘Some values show up in the wrong column.’
The explanation

Tabular data is often handed around as plain text files, with each line being a record and the columns separated by a specific character, called the delimiter (another method is fixed-width columns). The extension of these files is often .csv, which stands for “Comma-Separated Values”, but it might as well be “.txt”, “.dat”, or “.whyevenuseextensions”. For .csv files, you would guess this implies that the columns are separated by commas. Wrong.

As mentioned earlier, CSV is not an official standard. The delimiter is allowed to be anything: a comma, a semicolon, a tab, or even the letter “a”. Later you will find out why this is a good thing, but let us first find out why this is a bad thing. If the delimiter is allowed to be anything, how can your application know what is being used as a delimiter? It can’t, and some applications (e.g. Excel when it uses the defaults) just take a guess, again. Thanks, but no thanks.

As with the newlines, a tricky problem arises if the delimiter also appears in one of the text values. How can an application know whether, e.g., a semicolon is part of the text value, and not the end of a column? Commas, semicolons and tabs are quite common both in text values and as delimiters, which only adds to the problem.
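
The cure is simply to state the delimiter instead of letting the application guess. A minimal Python sketch, assuming a hypothetical semicolon-delimited file called data.txt:

    import csv

    # The delimiter (and the encoding) are stated explicitly, not guessed.
    with open("data.txt", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter=";"):
            print(row)             # each row comes back as a list of values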

The solution
  • The good thing about being able to choose the delimiter is that when you create your own datafile, you can choose a delimiter that does not appear anywhere in your data.
  • When receiving data, open the file with a text editor that can visualize control characters like tabs (e.g. UltraEdit), and check what delimiter is being used. Using a text editor that cannot visualize tabs (e.g. Microsoft Notepad or WordPad) is a bad idea, because a tab can be indistinguishable from spaces.
  • When loading your data into an application (e.g. Access), select the delimiter explicitly.
  • If the application you want to load the data into expects a certain delimiter by default, you will have to convert the delimiter to the required character. When doing so, first make sure the character does not appear anywhere in the data.

Encapsulation – just… don’t

The problem
  • ‘I’ve imported my data into [PowerfulDataTool v4.4], and now all my values are surrounded by quotes.’
  • ‘I’ve imported my data into [StatisticsPro v3], and some fields break in the middle, and part of the value ends up in the next column.’
The explanation

Encapsulation, also known as text qualifiers, is used as a solution to a problem that in most cases is not even there, and its use can usually be avoided without causing the problems that it introduces.
That should be the official Wikipedia definition.

The problem it tries to solve is the following: What happens if one of your values contains the character that you are also using as a delimiter? As we have seen earlier, your data ends up in the wrong column, since your application has no idea it is not intended as a delimiter, but just part of your text.

But wait, someone said: What if we just put double quotes around all of our text values? Then we know that if we see a delimiter character between the double quotes, it’s part of the text. Encapsulation was born, and all was well…

Or was it? It seems the problem was just pushed further down the road. What happens if your value contains a double quote? It is wrongly interpreted as the end of the value, and the rest of the value ends up in the next column.

But wait, someone said: What if we just put a special character in front of the double quote, so that our program knows that the next double quote is part of the text. Escape characters were born, and the people were getting confused, rightfully so…
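
To see what encapsulation and escaping look like in practice, here is a small Python sketch using the csv module, which escapes a double quote by doubling it (one of several conventions in the wild):

    import csv, io

    # The value contains both the delimiter (,) and the quote character (").
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, doublequote=True)
    writer.writerow(["id1", 'He said "hello, world"'])
    print(buffer.getvalue())       # id1,"He said ""hello, world"""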

The solution
  • Then a more pragmatic developer came along, and said: What are you guys doing? Why not just use a delimiter that does not appear in your text?
  • If you do get data that uses encapsulation, ask which escape character is being used, or whether it is certain that the encapsulation character will never show up in the values. If you get a blank stare, forward this article.
The two rare exceptions
  • The application you are loading the data into only allows you to work with a delimiter that also appears in your text values, and simply removing that character from your data is not an option. Then, using encapsulation and escape characters is your only way out. Unfortunately, it is unlikely that an unsophisticated application like that will support encapsulation and escape characters.
  • Your data contains newlines within text values, and you can not simply replace them with spaces. Also then, you are forced to use encapsulation. When doing so, make sure the application you use to load the data also supports newlines appearing in encapsulated values.

Data types – and why numbers and dates don’t exist

The problem
  • ‘We were using numbers as IDs, but now they appear without the zeros in front of them (e.g. 0005 -> 5).’
  • ‘Excel completely messes up the dates in my data.’
  • ‘This text is formatted as a number.’
  • ‘NULL, 0 or -9999 starts showing up in fields that were empty before.’
The explanation

When we, bright and intelligent humans, see a number, a date or a piece of text, we can easily tell them apart. But how does an application that works with your data see the difference? We’ve been here so many times by now that it should be obvious: it takes a guess.

There are countless ways of writing dates; even for humans it’s often hard to tell whether, e.g., 1/2/2014 refers to the first of February or the second of January. Hence “date” is hardly a data type; you have to know the formatting of the date for the date to make sense.

But numbers are just numbers, right? Well, yes, in 1.315 out of 1,526 cases…
Depending on where you live, a comma or a dot is used as the decimal separator. The comma or dot is sometimes also used as a thousands separator. You can imagine the challenges this poses when you are letting users from all over the world insert numerical values into a form.

Another issue arises with text that looks like a number. Things like “0005”. Many tools, like Excel, will interpret this as a number, and save it as “5”. If the zeros were a relevant part of the data, this effectively breaks the data.

And then, the value that you would expect to be the simplest of all turns out to be the most complex: nothing, empty, no value. In databases this is often represented as “NULL”, for numbers the default “0” is sometimes used, and some might even decide to store it as “-9999” to set it apart from the number zero. When any of these representations of the empty value gets exported into a data file it causes a problem, since you can no longer tell them apart from their non-empty equivalents.
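
A short Python sketch to make the date and leading-zero problems concrete; the input values are made up:

    from datetime import datetime

    # Be explicit about the date format instead of letting a tool guess.
    raw_date = "1/2/2014"
    print(datetime.strptime(raw_date, "%d/%m/%Y"))  # 2014-02-01 00:00:00
    print(datetime.strptime(raw_date, "%m/%d/%Y"))  # 2014-01-02 00:00:00

    # Keep identifiers as text so leading zeros survive.
    record_id = "0005"     # as a string; int("0005") would turn it into 5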

The solution
  • Make sure you know the data type, and format of each column. Keep in mind that in a million rows of data, one row could be using a different format. A quick visual check is not sufficient, you will need to automate this.
  • Once you know the format, check if the application you want to load the data into supports this format. If it doesn’t, you’ll first have to change the format. This is where Regex might come in handy, but that’s a whole different story.
  • When creating your own datafiles, make sure you are exporting empty values as such: just two delimiters with nothing in between.
  • If you decide to copy-paste data into Excel (I would suggest using Import), make sure to first set the datatype of the column, and then paste the data.

Headers – so close, yet so far

The problem
  • ‘When I load my data, the first row contains the column names.’
The explanation

Most people are familiar with this. The first row of the data sometimes contains the column names so you know what is what. Unfortunately, if your application is not expecting the header it simply loads it as a record, et voilà, the column names appear as a record. The intent is great, trying to add documentation to what is being passed around, but as we’ve seen, there is far more we need to know about the data before it can be interpreted properly.
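
If you are scripting the import yourself, the fix is one line: tell the reader that the first row is a header. A Python sketch, assuming a hypothetical comma-delimited data.csv with a header row:

    import csv

    # DictReader consumes the first row as column names instead of data.
    with open("data.csv", newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f, delimiter=","):
            print(record)          # e.g. {'id': '0001', 'name': 'Alice'}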

The solution – and the conclusion to this rather long post

I would like to propose the following: whenever plain text ‘CSV’ data is created, add a file alongside it describing the formatting, encoding, BOM, newlines, delimiters, encapsulation, data types, and headers being used. That information is what turns bytes into data; otherwise it’s just a pile of bytes. (Or, delve into the depths of XML.)
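
What such a companion file could look like is up to you; here is one possible sketch in Python that writes the description as JSON (all field names are made up, not any kind of standard):

    import json

    description = {
        "encoding": "utf-8",
        "bom": False,
        "newline": "LF",
        "delimiter": ";",
        "encapsulation": None,
        "data_types": [
            {"column": "id",   "type": "text"},
            {"column": "date", "type": "date", "format": "YYYY-MM-DD"},
        ],
        "header_row": True,
    }

    with open("data.csv.description.json", "w", encoding="utf-8") as f:
        json.dump(description, f, indent=2)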

And because a good acronym always works to remember stuff, I would call this…
The FEB ‘N DEDH file.
Granted, that must be the worst acronym in the history of acronyms,
but it beats life in a submarine.

HP LaserJet 400 M475dn scan quality review

So we needed a new printer and scanner at the office…

Being a small office, a multifunction seemed convenient; the fewer devices to manage the better. Based on the requirements we had, the HP LaserJet M475dn seemed like a good choice. The sales pitch on HP’s website also convinced us of the quality to expect:

Sounds perfect: professional quality at around €900 in Europe ($600 in the US).

To be honest, this is not a full product review, but only an overview of the two issues we’ve experienced with the scanner of this device. The printer works fine, it’s just the “professional-quality color” of the scanner that was disappointing. I’ve come to the conclusion that if you want a high-quality scanner, you’ll have to buy a ScanJet. Unless you don’t mind rainbows and brightness issues.


The main shortcoming we’ve experienced with the scanner is that it fails to pick up light shades of gray. To test this, I’ve used the following printer test chart. Below I’ll discuss the results of the different tests, and further down you can see the corresponding scan-results.

  1. The first column is the section of the test chart I’ll use to compare the scan results against.
  2. First, I made a color scan on the flatbed with the default settings. As you can see, 221 is the last gradient that is still clearly visible. In a real-world setting, this means that if you have a document with annotations in pencil, they would be partly gone. If you have a document signed with a bad pen, the signature could be incomplete.
  3. Then I did the obvious and tried to lower the brightness of the scan through the interface on the printer. Notice that this doesn’t make the scanner more sensitive to lighter shades; it simply seems to post-process the scan, making it darker overall, including the white areas.
  4. After finding this HP support forum post, I installed the scanner software and lowered the brightness to -30 hoping the software would work out the brightness differently, but it seems to be doing more of the same. (I’ve uploaded some of my scan results on that forum, so you can download the entire scanned pages there.)
  5. Now I was thinking: is it wrong to expect a scanner to be able to do this? So I took out our old OfficeJet 9120, a device from 2009. The result is a bit better, but not by much. The color consistency is a lot better, though.
  6. Still not convinced, I repeated the test on a single-function ScanJet 5550c. The color-consistency is not optimal, but it does manage to pick up the lightest gray shades. So, do we own low quality multifunction devices, or are there simply no multifunction devices with high-quality scanners? More on this later …


Who doesn’t like rainbows?

With the brightness issue thoroughly analyzed, I could look into something else I had noticed: on the black areas of the test chart, rainbow patterns were showing up.

  1. As with the previous tests, on the left, you can see the original section of the test chart.
  2. And again, I’ve made a default color scan on the flatbed. Notice the … creative interpretation … of the black area made by the M475dn.
  3. After hours of testing, I came to the conclusion that this only happened on the flatbed scanner, and only when the test chart was printed at low quality on a Dell 3110cn. When it is printed at high quality, the scans come out fine. Of course, when you receive a document, you have no control over the quality used while printing it, nor over the printer used to print it. We also noticed this color distortion on some other scans of documents we received, so the issue seems to show up with documents printed on certain other devices as well.
  4. So again I was thinking: is it wrong to expect a scanner to be able to scan this? Is there something intrinsically unscannable about low-quality prints? Let’s try the old ScanJet 5550c again. In the resulting scan you might still detect a -slight- rainbow effect, but far less than with the M475dn. And remember that the 5550c was developed around 2004 …


HP’s explanation

When I contacted HP about this, they quickly dismissed the brightness issue as something you could not expect a “low-end” device to pick up. When they saw the rainbow effect, they did confirm that this was unusual and opened a support case.

  • At first, HP’s support team was very responsive and helpful. Within a few days they sent me a new firmware version that was not publicly released yet. To no avail…
  • Then, they sent a technician to come and replace the entire scanner assembly. That didn’t solve it …
  • As a last resort, HP offered to send a completely new device. I suggested just sending them the low-quality print-out of the test page so they could test it themselves first. This showed that -all- these devices have this issue.
  • Having run out of options, I asked them if I could upgrade to a device that didn’t have this issue, and even pay for the added price of the new device. They told me they would “look into it and call me back”, and that was that …

Persistent as I am, I called them a few weeks later. They actually did “look into it”, and had a reply waiting. They told me that

“You have to understand that this is not a high-end device, you cannot expect professional quality of a low-end device like this.”

Those words, exactly. Ironic, if you look at the sales-website again.
It’s interesting to note that they first accepted the issue as a bug and tried to fix it, and once they found out it’s an issue showing up in all of their devices, they started claiming you can simply not expect this kind of quality.
Why didn’t they say this in the first place? Seems clear to me …

The grand finale

Now that I knew I bought a low-end device, I was curious what high-end multifunction device -did- meet my super-professional requirements. The low-end/high-end distinction is mentioned nowhere on the website, so I called support to clarify. First line support, still having my low-quality test sheet, was kind enough to run some more tests.

  • The first test showed, not so surprisingly, that a ScanJet N6350 was able to scan the document without rainbows.
  • The test on an OfficeJet was a bit more surprising: no rainbows.
  • The test that gave me a good laugh was on a “high-end, top of the line” multifunction LaserJet; unfortunately he didn’t mention the model. The rainbow effect was still there!

The support agent explained that this was to be expected, since all LaserJets use the same scan technology, and the OfficeJet and ScanJet use different scanner assemblies. Oh and, to get back to the brightness issue, this would mean that that problem is to be expected on the “high-end” LaserJets, too.

The lesson learned here seems to be that -all- scanners in HP’s LaserJet devices can be considered “low-end” by their own definition, since they are all plagued by these issues. If I were running a hardware company, I would have escalated an issue like this to my R&D department. HP doesn’t seem to be that proactive about improving the quality of its products. Or is there simply no market for high-quality multifunction devices?