Perl’s unpack can work with character data in two ways. The C template handles single-octet characters and the U template handles Unicode code numbers, which Perl stores as UTF-8. That they exist doesn’t mean that you should use them, but I inadvertently overlooked some of their behavior for Programming Perl. Both focus a bit too much on Perl’s internal representation of a string, which is something we shouldn’t have to think about.
First, let’s look at some pack examples. I use the Devel::Peek module to show the underlying scalar record so I can see what is actually stored:
use strict;
use utf8;
use Devel::Peek;

select STDERR; # because Dump uses that by default
binmode STDERR, ':utf8';

my $c_packed = pack 'C*', 0xE9;
print "packed with C: $c_packed\n";
Dump( $c_packed );
print "\n";

my $alpha_c_packed = pack 'C*', 0x3B1;
print "α packed with C: $alpha_c_packed\n";
Dump( $alpha_c_packed );
print "\n";

my $u_packed = pack 'U*', 0x065, 0x301;
print "packed with U: $u_packed\n";
Dump( $u_packed );
print "\n";

my $alpha_u_packed = pack 'U*', 0x3B1;
print "α packed with U: $alpha_u_packed\n";
Dump( $alpha_u_packed );
print "\n";
The C works with an octet. In the first example I pack 0xE9, the character é, whose code number fits in one octet. In the second example, I pack 0x3B1, the character α, whose code number needs two octets. The output shows that the second case doesn’t work out correctly. It only packs the low octet, 0xB1, which is \261 in octal. As a single-octet character, that’s ±.
packed with C: é
SV = PV(0x7ffc0a801310) at 0x7ffc0c001cd8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7ffc0a501670 "\351"\0
  CUR = 1
  LEN = 16

α packed with C: ±
SV = PV(0x7ffc0a801390) at 0x7ffc0a810f28
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7ffc0a501690 "\261"\0
  CUR = 1
  LEN = 16
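The truncation described above amounts to masking off everything but the low eight bits. Here is a minimal sketch of that arithmetic; the variable names are illustrative and not taken from the examples in this post:

```perl
use strict;
use warnings;

my $code = 0x3B1;           # code number for α
my $low  = $code & 0xFF;    # keep only the low octet, as C does

printf "low octet: 0x%02X, octal \\%o\n", $low, $low;

# Pack the masked value so the result is explicit about what is stored
my $packed = pack 'C', $low;
printf "length %d, ordinal 0x%02X\n", length $packed, ord $packed;
```

This should report the low octet as 0xB1 (octal \261), matching the PV line in the Dump output.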
The next two parts of the output show the result of packing with the U. In the first example, I pack 0x065 and 0x301, the code numbers for e and the combining acute accent, ´. Notice that the packed string has the UTF8 flag and that the actual storage isn’t the code numbers. It’s the UTF-8 representation for those characters. The PV line shows both the UTF-8 encoded octets and the character string itself (curiously labeled as UTF8).
packed with U: é
SV = PV(0x7ffc0a8013c0) at 0x7ffc0a810b08
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7ffc0a5015d0 "e\314\201"\0 [UTF8 "e\x{301}"]
  CUR = 3
  LEN = 16

α packed with U: α
SV = PV(0x7ffc0a801440) at 0x7ffc0a827380
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7ffc0a5014b0 "\316\261"\0 [UTF8 "\x{3b1}"]
  CUR = 2
  LEN = 16
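The octets in the PV line can also be seen without Devel::Peek. This is a small sketch, assuming that utf8::encode on a copy of the string is an acceptable way to look at its UTF-8 form; the variable names $octets and $hex are illustrative:

```perl
use strict;
use warnings;

my $u_packed = pack 'U*', 0x65, 0x301;    # e plus combining acute accent

# Copy the string, then turn the copy into its UTF-8 octets
my $octets = $u_packed;
utf8::encode( $octets );

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;

printf "characters: %d\n", length $u_packed;   # 2
printf "octets:     %d\n", length $octets;     # 3
print  "UTF-8:      $hex\n";                   # 65 CC 81
```

The hex values CC and 81 are the same octets that Dump shows as \314 and \201 in octal.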
Now let’s go the other way. If I unpack the same strings, I should get back the values that I put into them.
use utf8;
binmode STDOUT, ':utf8';

my $c_packed = pack 'C*', 0xE9;
my( $number ) = unpack 'C*', $c_packed;
printf "%04X\n", $number;

my $alpha_c_packed = pack 'C*', 0x3B1;
my( $number ) = unpack 'C*', $alpha_c_packed;
printf "%04X\n", $number;

my $u_packed = pack 'U*', 0x065, 0x301;
my( @numbers ) = unpack 'C*', $u_packed;
printf "%04X %04X\n", @numbers;

my $alpha_u_packed = pack 'U*', 0x3B1;
my( @numbers ) = unpack 'C*', $alpha_u_packed;
printf "%04X\n", @numbers;
The output shows that every group but the second comes back the same. The second differs because the C only packed the low octet of 0x3B1:
00E9
00B1
0065 0301
03B1
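For strings packed with U, the same round trip can be sketched with the U template or with ord on each character. This is an illustration with my own variable names, not a replacement for the program above:

```perl
use strict;
use warnings;

my @codes  = ( 0x65, 0x301 );
my $string = pack 'U*', @codes;

# Two ways to get the code numbers back from the character string
my @from_unpack = unpack 'U*', $string;
my @from_ord    = map { ord($_) } split //, $string;

printf "unpack U*: %s\n", join ' ', map { sprintf '%04X', $_ } @from_unpack;
printf "ord:       %s\n", join ' ', map { sprintf '%04X', $_ } @from_ord;
```

Both lines should show 0065 0301, the values that went into pack.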
That’s not all the C and U formats do, though. They affect later uses of the A format, which is supposed to deal with ASCII characters but goes beyond that.
use utf8;
binmode STDOUT, ':utf8';

my $c_packed = pack 'U*', 0xE9, 0xA9, 0xBC; # 驼

my @chars1 = unpack '(A)*', $c_packed;
print "chars1 are @chars1\n";

my @chars2 = unpack 'C0(A)*', $c_packed;
print "chars2 are @chars2\n";

my @uchars = map { sprintf '%04x', ord } unpack 'U0(A)*', $c_packed;
print "uchars are @uchars\n";
The first unpack uses an explicit A and nothing else. Although $c_packed contains 8-bit characters, they show up as I expect in @chars1. In the second unpack, I use C0, which specifies the C template zero times. It also sets subsequent A templates to work on characters. That behavior is also the default. In the third unpack, I start with U0, which specifies the U template zero times and causes subsequent A templates to return the octets of the UTF-8 representation of the characters. To show those, I convert them to their ordinal values instead of printing the gibberish à ©  ©  ¼:
chars1 are é © ¼
chars2 are é © ¼
uchars are 00c3 00a9 00c2 00a9 00c2 00bc
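The difference between the two modes is easier to see with characters above 0xFF. This sketch follows the C0 and U0 example in the perlfunc entry for pack, using the string αω; the variable names are illustrative:

```perl
use strict;
use warnings;

my $string = "\x{3B1}\x{3C9}";    # αω

# %vX prints the ordinal of each character in hex, joined with dots
my $as_characters = sprintf '%vX', unpack 'C0A*', $string;
my $as_octets     = sprintf '%vX', unpack 'U0A*', $string;

print "C0A*: $as_characters\n";   # 3B1.3C9
print "U0A*: $as_octets\n";       # CE.B1.CF.89
```

With C0 the A template sees two characters. With U0 it sees the four octets of their UTF-8 encoding.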
I can mix the two, which doesn’t seem useful or desirable, but works.
use utf8;
binmode STDOUT, ':utf8';

my $c_packed = pack 'U*', 0xE9, 0xA7, 0xBC; # 駼

# show the octets
printf "%v02X\n", unpack 'U0A*', $c_packed;

# show the characters
my @chars = unpack 'U0(A)*', $c_packed;
print "@chars\n";

my @uchars;

@uchars = unpack '(A)2(U0A)*', $c_packed;
print "(A)2(U0A)* uchars are @uchars\n";

@uchars = unpack 'A(U0A)2A', $c_packed;
print "A(U0A)2A uchars are @uchars\n";

@uchars = unpack '(U0A)2C0(A)2', $c_packed;
print "(U0A)2C0(A)2 uchars are @uchars\n";
The first line shows the raw octets and the next line shows the characters that have those code numbers. As a raw string, Perl interprets those according to the default encoding. I’ll use that line to help me see what’s happening next.
C3.A9.C2.A7.C2.BC
à ©  §  ¼
(A)2(U0A)* uchars are é §  ¼
A(U0A)2A uchars are é  § ¼
(U0A)2C0(A)2 uchars are à © § ¼
The next line, with (A)2(U0A)*, unpacks two characters in the default mode and the rest of the string as UTF-8 octets. The (A)2 extracts é §. After that, the internal representation of the string has two octets left, C2.BC. As octets, (U0A)* extracts those as  ¼. I get the ¼ because the code number is the same as the last octet. The first octet gives me the Â.
It’s the same thing when I move the octet template (U0A)2 to the middle. I get the  § for the same reason as I got the weird output last time. The template after that reverts to the default mode, so the final A gives me a plain ¼ without the Â.
The last line shows much the same thing, shifted to the left. I don’t know why anyone would want to do this, but it’s how it works.
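The way the octet mode ends with its group can be sketched on a shorter string. This example assumes the same αω string used in the perlfunc documentation, and the names @parts and $shown are illustrative:

```perl
use strict;
use warnings;

my $string = "\x{3B1}\x{3C9}";    # αω, stored as the octets CE B1 CF 89

# U0 applies only inside the group, so the trailing A is back
# in the default character mode
my @parts = unpack '(U0A)2A', $string;

my $shown = join ' ', map { sprintf '%X', ord($_) } @parts;
print "$shown\n";                 # CE B1 3C9
```

The first two values are the octets of α and the third is ω as a whole character.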
In short, C0 unpacks characters and U0 unpacks bytes from the internal representation. It’s a feature that’s been around since at least v5.6, but Perl’s Unicode support has improved far beyond our need for these features. Maybe it’s a good thing that I didn’t include them in Programming Perl.
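As one illustration of that newer support, here is a minimal sketch that gets the same two views of a string from the core Encode module instead of C0 and U0. Treating Encode as the substitute is an assumption on my part, not something the examples above rely on:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\x{3B1}\x{3C9}";            # αω as a character string

my $octets = encode( 'UTF-8', $string );  # explicit UTF-8 octets
my $back   = decode( 'UTF-8', $octets );  # back to characters

printf "octets:     %vX\n", $octets;      # CE.B1.CF.89
printf "characters: %vX\n", $back;        # 3B1.3C9
```

This keeps the encoding step explicit and leaves the internal representation out of the program.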