03 March 2011

Convert Your MYSQL Database from Latin-1 to UTF-8

It all started when I tried upgrading to ruby 1.9.2 and learned more than I ever wanted to know about character encodings. All of a sudden, my site was showing text humans were never supposed to read, with gibberish in place of recognisably foreign accented letters.

I tried using the mysql2 gem, and setting Encoding.default_external = 'UTF-8' in my environment.rb, these steps were necessary but not enough.

After much googling, it became evident that I had to go through each text field in each row in each table, and convert each latin-1 character to utf-8.

You would think that alter table #{table} convert to character set utf8 would do the trick, but no. You would be wrong. At least, I was.

Many authors have chimed in on this topic, but my hat goes off to Derek Sivers who showed the light in an O'Reilly article on converting latin1 to utf-8 in mysql.

I didn't want to do all the work he did, and figured a rails/activerecord migration might ease the pain somewhat. Below you'll find what I came up with. Re-use as you please. You'll need to specify the table/column names that need converting, and you might want to make sure I've covered all the characters that matter to you.

Basically, all this does is iterate over the tables and columns you specify, and then iterates over all the shady latin-1 characters you need to fix, and asks mysql to replace them with the utf-8 equivalent. Someone with stronger mysql-fu might find a cleverer way to do this; in the meantime, here goes:

# encoding: UTF-8

class ConvertMySqlLatin1ColumnsToUtf8 < ActiveRecord::Migration
  def self.up

    execute("set names utf8")

    # change this hash for your application. This example here is for a
    # totally original blog application concept.
    keys = {
        :authors   => %w{first_name last_name},
        :blogs => %w{name description},
        :entries => %w{title content tags},
        :comments => %w{content}

    conversions = {
      'C383C2A1'         => 'á', 'C383C2A0'       => 'à', 'C383C2A4'       => 'ä', 'C383C2A2'   => 'â',
      'C383C2A9'         => 'é', 'C383C2A8'       => 'è', 'C383C2AB'       => 'ë', 'C383C2AA'   => 'ê',
      'C383C2AD'         => 'í', 'C383C2AC'       => 'ì', 'C383C2AF'       => 'ï', 'C383C2AE'   => 'î',
      'C383C2B3'         => 'ó', 'C383C2B2'       => 'ò', 'C383C2B6'       => 'ö', 'C383C2B4'   => 'ô',
      'C383C2BA'         => 'ú', 'C383C2B9'       => 'ù', 'C383C2BC'       => 'ü', 'C383C2BB'   => 'û',
      'C383C281'         => 'Á', 'C383E282AC'     => 'À', 'C383E2809E'     => 'Ä', 'C383E2809A' => 'Â',
      'C383E280B0'       => 'É', 'C383CB86'       => 'È', 'C383E280B9'     => 'Ë', 'C383C5A0'   => 'Ê',
      'C383C28D'         => 'Í', 'C383C592'       => 'Ì', 'C383C28F'       => 'Ï', 'C383C5BD'   => 'Î',
      'C383E2809C'       => 'Ó', 'C383E28099'     => 'Ò', 'C383E28093'     => 'Ö', 'C383E2809D' => 'Ô',
      'C383C5A1'         => 'Ú', 'C383E284A2'     => 'Ù', 'C383C593'       => 'Ü', 'C383E280BA' => 'Û',
      'C385C2B8'         => 'Ÿ', 'C385E2809C'     => 'œ', 'C383C2B8'       => 'ø', 'C383C2BF'   => 'ÿ',
      'C3A2E282ACC593'   => '“', 'C3A2E282ACC29D' => '”', 'C3A2E282ACCB9C' => '‘',
      'C3A2E282ACE284A2' => '’', 'C382C2AB'       => '«', 'C382C2BB'       => '»',
      'C383C2A5'         => 'å', 'C383E280A6'     => 'Å', 'C383C5B8'       => 'ß', 'C383E280A0' => 'Æ', 
      'C383C2A7'         => 'ç', 'C383E280A1'     => 'Ç', 'C383C2B1'       => 'ñ', 'C383E28098' => 'Ñ', 
      'C383C2A3'         => 'ã', 'C383C2B5'       => 'õ', 'C383C692'       => 'Ã', 'C383E280A2' => 'Õ'

    keys.each { |table, columns|
      execute "alter table #{table} convert to character set utf8"
      columns.each { |column|
        conversions.each { |hex, utf8|
          execute("update #{table} set #{column} = replace(#{column}, unhex('#{hex}'), '#{utf8}') where #{column} regexp unhex('#{hex}');")

  def self.down
    # left as an exercise for the reader :)

The # encoding comment at the beginning is important, don't leave it out or ruby 1.9.2 will complain.

Use this to check you've covered all the relevant text columns:

mysql> use information_schema
mysql> select table_name, column_name from columns where table_schema = '__YOUR_DB_NAME__' and (data_type = 'varchar' or data_type = 'text');

(There might be other relevant data types, like 'mediumtext' that you have to deal with; don't just take my word for it)

Here are some other places covering latin1/utf8 conversion: