Avoiding “Invalid byte sequence in UTF-8” with Ruby and CSV files

If you’re having trouble reading, say, an ISO-8859-1 encoded CSV file into your (probably UTF-8) Ruby or Rails application, and the error you get is “Invalid byte sequence in UTF-8” even though you’re giving CSV.open the correct encoding options, here’s a solution.

The example CSV file is a tab-separated, ISO-8859-1 encoded file with CRLF line endings. You’d expect the following to work:

CSV.open(@infile, "r:ISO-8859-15:UTF-8", :col_sep => "\t", :headers => :first_row)

But it fails mysteriously! Even though the conversion to UTF-8 goes through without problems, you get an ArgumentError complaining about an invalid byte sequence. If you dig deeper, you might find (in this case) a complaint about \r\n. The solution is very, very non-obvious: you need to specify the row separator in addition to your encodings!

mjtko from the #rubyonrails channel on Freenode discovered this. If we change the line to the following:

CSV.open(@infile, "r:ISO-8859-15:UTF-8", :col_sep => "\t", :row_sep => "\n", :headers => :first_row)

Boom, there’s your working CSV object, with working encodings.
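
If you want to try this without a real data file, here is a minimal, self-contained sketch. It is an illustration under assumptions, not the exact setup described above: the sample rows, the temporary file, and the variable names are made up, it uses ISO-8859-1 as the source encoding, and it sets the row separator to "\r\n" to match the CRLF endings of the generated sample. Adjust both to whatever your own file actually uses.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample data: tab-separated, CRLF line endings,
# containing characters that differ between ISO-8859-1 and UTF-8.
sample = "name\tcity\r\nJürgen\tZürich\r\nRené\tGenève\r\n"

rows = []
Tempfile.create(["sample", ".csv"]) do |file|
  # Write the raw ISO-8859-1 bytes, untouched by any IO transcoding.
  file.binmode
  file.write(sample.encode("ISO-8859-1"))
  file.close

  # "r:ISO-8859-1:UTF-8" means: read as ISO-8859-1, hand out UTF-8 strings.
  # The row separator is given explicitly instead of being auto-detected.
  CSV.open(file.path, "r:ISO-8859-1:UTF-8",
           col_sep: "\t", row_sep: "\r\n", headers: :first_row) do |csv|
    csv.each { |row| rows << row.to_h }
  end
end

rows.each { |row| puts row.inspect }
```

Each element of rows is a hash keyed by the header names, and the values come back as UTF-8 strings. The options are passed as plain keyword arguments rather than a hash literal in braces, which newer Ruby versions require.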
