03 March 2011

Convert Your MYSQL Database from Latin-1 to UTF-8

It all started when I tried upgrading to ruby 1.9.2 and learned more than I ever wanted to know about character encodings. All of a sudden, my site was showing text humans were never supposed to read, with gibberish in place of recognisably foreign accented letters.

I tried using the mysql2 gem, and setting Encoding.default_external = 'UTF-8' in my environment.rb, these steps were necessary but not enough.

After much googling, it became evident that I had to go through each text field in each row in each table, and convert each latin-1 character to utf-8.

You would think that alter table #{table} convert to character set utf8 would do the trick, but no. You would be wrong. At least, I was.

Many authors have chimed in on this topic, but my hat goes off to Derek Sivers who showed the light in an O'Reilly article on converting latin1 to utf-8 in mysql.

I didn't want to do all the work he did, and figured a rails/activerecord migration might ease the pain somewhat. Below you'll find what I came up with. Re-use as you please. You'll need to specify the table/column names that need converting, and you might want to make sure I've covered all the characters that matter to you.

Basically, all this does is iterate over the tables and columns you specify, and then iterates over all the shady latin-1 characters you need to fix, and asks mysql to replace them with the utf-8 equivalent. Someone with stronger mysql-fu might find a cleverer way to do this; in the meantime, here goes:


# encoding: UTF-8

class ConvertMySqlLatin1ColumnsToUtf8 < ActiveRecord::Migration
  def self.up

    execute("set names utf8")

    # change this hash for your application. This example here is for a
    # totally original blog application concept.
    keys = {
        :authors   => %w{first_name last_name},
        :blogs => %w{name description},
        :entries => %w{title content tags},
        :comments => %w{content}
    }

    conversions = {
      'C383C2A1'         => 'á', 'C383C2A0'       => 'à', 'C383C2A4'       => 'ä', 'C383C2A2'   => 'â',
      'C383C2A9'         => 'é', 'C383C2A8'       => 'è', 'C383C2AB'       => 'ë', 'C383C2AA'   => 'ê',
      'C383C2AD'         => 'í', 'C383C2AC'       => 'ì', 'C383C2AF'       => 'ï', 'C383C2AE'   => 'î',
      'C383C2B3'         => 'ó', 'C383C2B2'       => 'ò', 'C383C2B6'       => 'ö', 'C383C2B4'   => 'ô',
      'C383C2BA'         => 'ú', 'C383C2B9'       => 'ù', 'C383C2BC'       => 'ü', 'C383C2BB'   => 'û',
      'C383C281'         => 'Á', 'C383E282AC'     => 'À', 'C383E2809E'     => 'Ä', 'C383E2809A' => 'Â',
      'C383E280B0'       => 'É', 'C383CB86'       => 'È', 'C383E280B9'     => 'Ë', 'C383C5A0'   => 'Ê',
      'C383C28D'         => 'Í', 'C383C592'       => 'Ì', 'C383C28F'       => 'Ï', 'C383C5BD'   => 'Î',
      'C383E2809C'       => 'Ó', 'C383E28099'     => 'Ò', 'C383E28093'     => 'Ö', 'C383E2809D' => 'Ô',
      'C383C5A1'         => 'Ú', 'C383E284A2'     => 'Ù', 'C383C593'       => 'Ü', 'C383E280BA' => 'Û',
      'C385C2B8'         => 'Ÿ', 'C385E2809C'     => 'œ', 'C383C2B8'       => 'ø', 'C383C2BF'   => 'ÿ',
      'C3A2E282ACC593'   => '“', 'C3A2E282ACC29D' => '”', 'C3A2E282ACCB9C' => '‘',
      'C3A2E282ACE284A2' => '’', 'C382C2AB'       => '«', 'C382C2BB'       => '»',
      'C383C2A5'         => 'å', 'C383E280A6'     => 'Å', 'C383C5B8'       => 'ß', 'C383E280A0' => 'Æ', 
      'C383C2A7'         => 'ç', 'C383E280A1'     => 'Ç', 'C383C2B1'       => 'ñ', 'C383E28098' => 'Ñ', 
      'C383C2A3'         => 'ã', 'C383C2B5'       => 'õ', 'C383C692'       => 'Ã', 'C383E280A2' => 'Õ'
    }

    keys.each { |table, columns|
      execute "alter table #{table} convert to character set utf8"
      columns.each { |column|
        conversions.each { |hex, utf8|
          execute("update #{table} set #{column} = replace(#{column}, unhex('#{hex}'), '#{utf8}') where #{column} regexp unhex('#{hex}');")
        }
      }
    }
  end

  def self.down
    # left as an exercise for the reader :)
  end
end

The # encoding comment at the beginning is important, don't leave it out or ruby 1.9.2 will complain.

Use this to check you've covered all the relevant text columns:


mysql> use information_schema
mysql> select table_name, column_name from columns where table_schema = '__YOUR_DB_NAME__' and (data_type = 'varchar' or data_type = 'text');

(There might be other relevant data types, like 'mediumtext' that you have to deal with; don't just take my word for it)

Here are some other places covering latin1/utf8 conversion:
http://stackoverflow.com/questions/4188677
http://groups.google.com/group/rails-i18n/browse_thread/thread/0146440dbf7d23ab
http://www.ruby-forum.com/topic/212452
http://stackoverflow.com/questions/4453842

Enjoy!

6 comments:

  1. I'm always amazed to see that in 2011 there are still softwares that are not using a unicode character set by default.
    Looks like some people/developers are stuck in 2nd millennium…

    ReplyDelete
  2. Maybe their code base comes from the 2nd millenium : I am working on such a project and we do have these encoding problems. So, you can be amazed...

    ReplyDelete
  3. Yo @Laurent, @JP ... terrifying, isn't it? The scary thing in my case was I *thought* I was being all moral and stuff with a DB in UTF8, except ... when I switched to ruby 1.9 ... it wasn't!

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Your MySQL code that trolls information_schema.columns for conversion targets will miss some in some databases.

    Instead of data_type = 'text'

    you need data_type like '%text'

    ReplyDelete
  6. Lets take a more subtle example, however. Lets consider what happens to a Agent in a Workgroup, located at a remote site, behind a distributed voice mail server. create mysql dashboard

    ReplyDelete