The Perl UTF-8 and utf8 Encoding Mess
from Jeremy Zawodny's blog, posted 4 months ago
I've been hacking on some Perl code that extracts data that came from web users around the world and was stored in MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.
Now, I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers who, I suspect, are representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.
But at the same time I know it's not.
Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.
Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.
A little searching around managed to jog my memory and I updated my code to include something like this:
use Encode;
...
my $data = Encode::decode('utf8', $row->{'Stuff'});
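For readers who haven't used Encode before, here is a minimal sketch of what that decode call does: it turns a string of raw bytes into a string of characters. The sample value is made up for illustration (it is not data from the post), and the sketch uses the strict 'UTF-8' encoding name rather than the 'utf8' spelling shown above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: the UTF-8 bytes for "café" (é is the two bytes 0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Decoding turns the byte string into a character string.
my $chars = Encode::decode('UTF-8', $bytes);

# prints: bytes: 5, characters: 4
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

The two lengths differ because the decoded string counts é as one character instead of two bytes.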
And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:
Malformed UTF-8 character (fatal) ...
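One way to find out up front whether a value really is valid UTF-8, instead of discovering it later, is to decode strictly and trap the failure. The following is only a sketch under stated assumptions, not the fix from the post: the helper name decode_or_fallback is hypothetical, and falling back to ISO-8859-1 is a guess about what legacy data might contain.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the original post): decode strictly, and
# fall back to treating the bytes as Latin-1 if they are not valid UTF-8.
sub decode_or_fallback {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars if defined $chars;

    # Assumption: Latin-1 is a plausible legacy encoding for this data.
    return Encode::decode('ISO-8859-1', $bytes);
}

my $good = decode_or_fallback("caf\xC3\xA9");         # valid UTF-8
my $bad  = decode_or_fallback("caf\xE9 au lait");     # 0xE9 + space is not valid UTF-8

printf "good: %d characters, bad: %d characters\n", length($good), length($bad);
```

Every byte sequence is valid ISO-8859-1, so the fallback never fails, which also means it can silently produce the wrong characters if the data was really in some other encoding.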
My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?
After much swearing, (more)...
