Character encoding issues upgrading MySQL to 4.1

This night the host of a client upgraded from MySQL 3 to MySQL 4.1. It is a known issue that his might bring problems regarding character enconding, resulting in special characters (like acute accent or umlaut) or getting defaced. Some of these issues can be resolved quite easily, for instance the user error "Illegal mix of collations". We'll first look into what happens behind the scenes when upgrading from MySQL 3 or MysQL 4.0 to MySQL 4.1.

The MySQL manual explains what happens, and how to properly prepare for a migration. "The MySQL 3 and 4.0 character set contains both character set and collation information in one single entity. Beginning in MySQL 4.1, character sets and collations are separate entities. Though each collation corresponds to a particular character set, the two are not bundled together. If you want to start mysqld from a 4.1 distribution with data created by MySQL 4.0, you should start the server with the same character set and collation or compile it to do this by default. In this case, you won't need to reindex your data."

If it is not the case, and you don't have a backup of your data, you're in trouble. Note that this encoding applies to the database, tables and fields. This means that, without proper preparation, your fields will not be updated to the new collcation, but your data will use the new character set. In other words: your data will look messed up.

Most of the time a simple ALTER TABLE (cfr link at the top) will fix your data. If it does not, here is a last resort. Before I continue I would like to offer this piece of advise: create a backup. Not creating one got you in this situation in the first place so do yourself a favor and create one now. Also note that you should only continue if you have at least some knowledge of what you're doing. I am not responsible for data loss in any way and will not be able to help you if things go wrong so create a backup which you can restore if things do go wrong.

What you're going to do is make a copy of the update.php file that comes with your Drupal installation. Copy this code (by Heine) and put it in the new file below these lines:

  update_fix_schema_version();
  update_fix_watchdog_115();
  update_fix_watchdog();
  update_fix_sessions();

This piece of code will convert all data to binary first, and then back to the UTF-8 encoding, getting lost of all giberish characters in the process. All you need to do now is run the file from your browser; don't follow any of the normal steps to update your Drupal installation. Also make sure to clean your cache ("DELETE FROM cache"). Your data has now been restored.

Many thanks to RedLED and Heine in providing a solution for this issue.

Comments

It is important to

It is important to understand why this happens: MySQL 4.0 and earlier do not support UTF-8. So, we tell MySQL to use an old character set (like Latin1), but we really pass in UTF-8 data. But, because Latin1 is just a simple mapping of bytes to characters, the data is accepted and returned exactly as before. Only things like case-insensitive string comparison will not work for Unicode characters.

When converting to MySQL 4.1, you need to tell MySQL to keep all your data as is (it is already UTF-8) and simply re-interpret it as UTF-8. This is why we convert to binary in the process: to tell MySQL to forget about Latin1.

If you were to do normal ALTER TABLE ... CHANGE CHARACTER SET ... commands, MySQL would take the faux-Latin1 data and 'convert' it to UTF-8. Any Unicode characters would break in this process.

I went through similar

I went through similar problems some time ago. I've used this small module to script the changes.
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/jareyero/modu...

Works great!

Thanks for the help. Worked perfectly for all my Drupal installs.

--Bill

reply

I tried that (Drupal 5.1) with no avail. I even used phpMyAdmin to manually set collations to general. Now I just get the error with general instead of unicode. "user warning: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE)..."

thanks

Seems to have worked perfectly -- thanks!

reply

I used MySQL Standard 4.0.24 and upgraded it to 4.1.10. I did try the script before and after. Thanks for the response.

Add extra tables to the array

It's important to note that if you use contributed modules that handle (translated/translatable) strings, you should add their tables to the array in the snippet provided.

Hey Jose, This module does

Hey Jose,

This module does exactly the same in fact. It calls the Drupal function system_update_169. Do note that in this case, the access table is not updated. It's a good thing there's a comment linking to the PostgreSQL documentation about Character Set Support.

Thanks for the link!

More specifications

Could you tell me what exactly caused this error? What version of MySQL did you upgrade form, and what version did you upgrade to for instance? Do you have a backup? Did you try the script before setting the collations to general manually?

Extra tables?

Did you add all table names in the array as mentioned in a comment of mine? Do you have the site online somewhere?

Life saver.

I had the same issues and Heine pointed me to this article. I had to add all my custom content types to the list to get everything updated. Wim: Would be helpful if you added this instruction (like in #comment-10526) to the article itself. Thank you!

Post new comment

The content of this field is kept private and will not be shown publicly.