Perl utf8 binmode unexpected results

Question

Why does binmode with :raw produce the correct umlaut, while :utf8 produces a garbled one? Could someone explain how the string 'Zürich' is stored internally in Perl? I'm a little lost.

use strict;
use warnings;

my $filename = "result-test-encoding-raw.xml";
open(my $fh, '>', $filename) or die "Can't create $filename: $!";
#binmode $fh, ':utf8'; #bad umlaut
binmode $fh, ':raw'; #good umlaut

print $fh '<?xml version="1.0" encoding="UTF-8"?>';
print $fh '<node>';

my $line_text =  'Zürich';
print $fh $line_text;
print $fh '   next   ';
$line_text = 'Z&#252;rich';
print $fh $line_text;

print $fh '</node>';

close($fh);

| xml   | utf-8   | perl   2017-09-05 22:09 2 Answers

Answers to Perl utf8 binmode unexpected results ( 2 )

  1. 2017-09-05 23:09

    Strings in Perl can hold either bytes (byte strings) or Unicode characters (character strings). In your case, you are defining byte strings.

    Question: In which encoding is your program source saved?

    Your first assignment to $line_text is a byte string in your program source's encoding. When you print this byte string to the file using :raw, it is written exactly as it was stored in your source. If you print an already-encoded byte string through an encoding layer such as :utf8, you get a doubly encoded string, which is rarely what you want. If your program is saved as UTF-8, you can add use utf8; to decode that string literal into a character string. When you print a properly decoded character string through :utf8, the characters are encoded correctly as UTF-8.

    Moral of the story: While passing raw bytes can work in some situations, it's generally a better idea to decode your inputs (and string literals) and encode your outputs.
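The double encoding described above can be reproduced in isolation. This is a minimal sketch, assuming the source file was saved as UTF-8; the literal is written with byte escapes so the result does not depend on how the script itself is saved, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The bytes of "Zürich" as they sit in a UTF-8 source file that is
# compiled without `use utf8;` (assumption: the file is saved as UTF-8).
my $bytes = "Z\xC3\xBCrich";

# Decoding turns the 7 bytes into 6 characters.
my $chars = decode('UTF-8', $bytes);

# Encoding the character string gives the original bytes back.
my $encoded_once = encode('UTF-8', $chars);

# Encoding the still-undecoded byte string treats C3 and BC as the
# characters U+00C3 and U+00BC, so each of them becomes two bytes.
my $encoded_twice = encode('UTF-8', $bytes);

printf "byte string:      %d units  %vX\n", length($bytes),         $bytes;
printf "character string: %d units  %vX\n", length($chars),         $chars;
printf "encoded once:     %d units  %vX\n", length($encoded_once),  $encoded_once;
printf "encoded twice:    %d units  %vX\n", length($encoded_twice), $encoded_twice;
```

The last line is the "bad umlaut" case: nine bytes where seven were intended.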

  2. 2017-09-05 23:09

    You're missing use utf8;, which tells Perl your source code is encoded using UTF-8.


    By default, source files are expected to be encoded using US-ASCII.

    • If you encoded your source file using UTF-8, but you didn't tell this to Perl (by using use utf8;), Perl will treat it as encoded using US-ASCII. For string literals, Perl will simply map the bytes to string characters (rather than rejecting non-ASCII chars). This means that $line_text contains 5A.C3.BC.72.69.63.68.

      When you pass these characters to a file handle with an encoding layer, the encoding layer will treat those characters as Unicode Code Points (ZÃ¼rich) and produce the appropriate bytes to represent those characters, which is why the umlaut comes out garbled.

    • If you encoded your source file using UTF-8, and if you told this to Perl (by using use utf8;), Perl will treat it as encoded using UTF-8 (decoding it accordingly). This means that $line_text contains 5A.FC.72.69.63.68.

      When you pass these characters to a file handle with an encoding layer, the encoding layer will treat those characters as Unicode Code Points (Zürich) and produce the appropriate bytes to represent those characters.
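The two cases can be compared side by side by writing to an in-memory file handle and dumping the resulting bytes in the same dotted-hex notation. This is only a sketch: bytes_written is a hypothetical helper, and the effect of use utf8; is simulated with decode so that the script does not depend on its own file encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: write $text through $layer into an in-memory
# file and return the bytes that were written, in dotted hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "Can't open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Can't close in-memory handle: $!";
    return sprintf '%vX', $buffer;
}

# What the literal holds without `use utf8;` (the raw UTF-8 bytes).
my $without_pragma = "Z\xC3\xBCrich";
# What the literal holds with `use utf8;` (decoded characters).
my $with_pragma = decode('UTF-8', $without_pragma);

printf "without use utf8: %vX\n", $without_pragma;   # 5A.C3.BC.72.69.63.68
printf "with use utf8:    %vX\n", $with_pragma;      # 5A.FC.72.69.63.68

my $raw_without = bytes_written(':raw',             $without_pragma);
my $enc_without = bytes_written(':encoding(UTF-8)', $without_pragma);
my $enc_with    = bytes_written(':encoding(UTF-8)', $with_pragma);
my $raw_with    = bytes_written(':raw',             $with_pragma);

print "no pragma, :raw              -> $raw_without\n";  # correct only by accident
print "no pragma, :encoding(UTF-8)  -> $enc_without\n";  # double encoded
print "pragma,    :encoding(UTF-8)  -> $enc_with\n";     # correct UTF-8
print "pragma,    :raw              -> $raw_with\n";     # a single FC byte, not UTF-8
```

Only the combination of a decoded string and an encoding layer yields valid UTF-8 on purpose; the :raw result in the question matches it only because the source bytes happened to be UTF-8 already.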


    use strict;
    use warnings;
    use utf8;                             # Source code is encoded using UTF-8.
    use open ':std', ':encoding(UTF-8)';  # Terminal expects UTF-8. Default encoding for files.
    
    my $filename = "result-test-encoding-raw.xml";
    
    open(my $fh, '>', $filename)
       or die("Can't create \"$filename\": $!\n");
    
    ...    
    print $fh 'Zürich';
    ...
    

    Note the use of :encoding(UTF-8) instead of :utf8. The latter is incorrect even though both appear equivalent in this example.
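One way to see that the two spellings are not the same thing is to ask the Encode module what each name resolves to. This sketch only inspects Encode's encoding names, as described in its documentation; it does not exercise the :utf8 PerlIO layer itself, so treat it as an illustration of the strict versus lax distinction rather than a full comparison of the layers.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# "UTF-8" (with the hyphen) is the strict, validating encoding.
my $strict = find_encoding('UTF-8')->name;

# "utf8" (no hyphen) is Perl's older, more permissive variant.
my $lax = find_encoding('utf8')->name;

print "UTF-8 resolves to: $strict\n";
print "utf8  resolves to: $lax\n";
```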
