Inserting unicode characters > 0xFFFF (surrogate pairs) into MS SQL Server with Perl DBD::ODBC

Until recently it has been difficult to insert unicode characters above 0xFFFF into MS SQL Server. DBD::ODBC could do it in such a way that you can select them back correctly but the built in functions (like length, sorting and upper/lower etc) did not treat the surrogate pairs as such so it was limited.

Microsoft SQL Server 2012 introduces a new collation suffix (_SC) and it supports surrogate pairs (although there is an indication that the UTF-16 encoded data must be sent little endian and I've not managed to test on a big endian machine as yet). Here is some test code:

use strict;
use warnings;
use DBI;
use Unicode::UCD 'charinfo';
use Data::Dumper;
#use charnames ':full';
use Test::More;
use Test::More::UTF8;

binmode(STDOUT, ":encoding(UTF-8)");
binmode(STDERR, ":encoding(UTF-8)");

# unicode chr above FFFF meaning it needs a surrogate pair
my $char = "\x{2317F}";
my $charinfo   = charinfo(0x2317F);
print Dumper($charinfo);

my $h = DBI->connect() or die "Failed to connect to db";

$h->{ChopBlanks} = 1;
$h->{RaiseError} = 1;

eval {
    $h->do('drop table mje');
};

# create table ensuring collation specifieds _SC
# for supplementary characters.
$h->do(q/create table mje (a nchar(20) collate Latin1_General_100_CI_AI_SC)/);

my $s = $h->prepare(q/insert into mje values(?)/);
$s->execute("\x{2317F}");

my $r = $h->selectall_arrayref(q/select a, len(a), unicode(a), datalength(a) from mje/);
print Dumper($r);
print "Ordinals of received/sent: ", ord($r->[0][0]), ", ", ord($char), "\n";
print DBI::data_diff($r->[0][0], $char);
is($r->[0][0], $char);
is($r->[0][1], 1);
is($r->[0][2], 143743);

done_testing;

which outputs:

$VAR1 = {
          'digit' => '',
          'bidi' => 'L',
          'category' => 'Lo',
          'code' => '2317F',
          'script' => 'Han',
          'combining' => '0',
          'upper' => '',
          'name' => 'CJK UNIFIED IDEOGRAPH-2317F',
          'unicode10' => '',
          'decomposition' => '',
          'comment' => '',
          'mirrored' => 'N',
          'lower' => '',
          'numeric' => '',
          'decimal' => '',
          'title' => '',
          'block' => 'CJK Unified Ideographs Extension B'
        };
$VAR1 = [
          [
            "\x{2317f}",
            1,
            143743,
            40
          ]
        ];
Ordinals of received/sent: 143743, 143743
ok 1
ok 2
ok 3
1..3

Note that I set the collation on the table but it can be set on the database.

The basic thing to notice that differs from previous MS SQL Server versions if the length function returns 1 (for 1 character) and not as before 2.

I've tried this with the Easysoft SQL Server ODBC Driver on Linux and it seems to work fine. I've still to test big endian machines. If the data really needs to be sent little endian I may need to add an attribute to DBD::ODBC as the current implementation converts UTF-8 encoded Perl data to UTF-16 in native endianess.

Some references:

Collation and Unicode Support
International Features in Microsoft SQL Server 2005

Comments

native endianess

If the data really needs to be sent little endian I may need to add an attribute to DBD::ODBC as the current implementation converts UTF-8 encoded Perl data to UTF-16 in native endianess.

I've no idea what I was thinking when I wrote that. It is up to the ODBC Driver to sort this out.