Unicode support via Perl DBD::ODBC and MS SQL Server - common problems

DBD::ODBC has had increasing support for unicode since version 1.16. However, unicode seems to be an issue that causes a lot of confusion and especially when it comes to DBI and DBDs. The mantra of just DWIM, is complicated because most DBDs were originally written with no unicode support. Changing them to automatically do the right thing when adding support for unicode inevitably breaks a lot of existing code e.g., people storing UTF-8 in varchars who decode the data coming back from the database would suddenly find their code breaks if their UTF-8 encoded perl scalars suddenly ended up in the database as database-encoded-unicode and did not need decoding when selected back. This is why a lot of DBDs have an attribute to enable unicode support.

DBD::ODBC was rather fortunate in that the ODBC specification was enhanced to add new unicode APIs and new data types called SQL_WCHARs which meant there appeared to be a clear division between the old ANSI APIs and the newer unicode (so called Wide) APIs. However, nothing is a simple as it first seems and some of the problems I ran in to were:

  • The ODBC unicode API means calling lots of different functions from the ANSI one.
  • Although Windows had reasonable support for the ODBC unicode APIs in drivers, Unix drivers lagged behind.
  • The original wide APIs described unicode as UCS-2 which only supports unicode up to U+0FFFF (the Basic Multilingual Plane).
  • As UTF-8 was more widely used on Unix some drivers added support for unicode by returning UTF-8 encoded data via the ANSI API but for various reasons too complicated to go in to here (read about how SQLGetData returns column data in chunks and how you know there is more data to read) it does not work well with ODBC.
  • Some (perhaps it was only one) DBI XS interfaces did not support Perl scalars and required C char * (changed some time ago) e.g., login6_sv.
  • Windows support for unicode in DBD::ODBC had an advantage that Windows has WideCharToMultiByte and MultiByteToWideChar but there is no equivalent in the runtime C library.

Mostly, all these issues are resolved now although, even now on Unix, DBD::ODBC builds by default without unicode support mostly because so many people seem to have old code they depend on so switching to a unicode build needs to be a positive decision.

So, sitting back quite happy about the current support for unicode in DBD::ODBC I was surprised to be seeing increasing questions and confusion about it. It seems:

  • a lot of people still just don't get unicode
  • people still don't decode data they read into Perl
  • people confuse getting the data back from a database as unicode correctly and printing/writing it somewhere else and just expecting it to still be correct
  • Windows confuses things with code pages
  • MS SQL Server does some non intuitive things with char/varchar
  • surrogate issues
  • collations
  • lots of other issues

There isn't a whole lot I can do about the first 3 of those above but I thought a few minor faqs in DBD::ODBC might help. So I collected together the issues I'd seen and started writing a few FAQs and the more I wrote down the more I realised what I knew was so "every day" for me I'd lost sight that it is quite complicated and confusing. So I'm attempting to put together a more thorough document and so far I have a common problems document and some examples.

I'd be really pleased if you could read over these and suggest improvements. I'd also be happy to receive any ideas for areas I've not convered that have confused you or even issues you still have and have not resolved. Of course, I'm not really opening this up to any DBD right now and would prefer to keep it DBD::ODBC for the moment but I'm happy to compile any problems/misconceptions if they include solutions/explanations for other DBDs too.