pack’s C0 and U0

Perl’s pack and unpack can work with character data in two ways. The C template works with octets and the U template works with Unicode code points, which Perl stores as UTF-8. That they exist doesn’t mean that you should use them, but they do exist, and I inadvertently overlooked some of their behavior for Programming Perl. They focus a bit too much on Perl’s internal representation of a string, which is something we shouldn’t do.

First, let’s look at some pack examples. I use the Devel::Peek module to show the underlying scalar record so I can see what is actually stored:

use strict;
use utf8;
use Devel::Peek;

select STDERR;  # because Dump uses that by default
binmode STDERR, ':utf8';

my $c_packed = pack 'C*', 0xE9;
print "packed with C: $c_packed\n";
Dump( $c_packed );
print "\n";

my $alpha_c_packed = pack 'C*', 0x3B1;
print "α packed with C: $alpha_c_packed\n";
Dump( $alpha_c_packed );
print "\n";

my $u_packed = pack 'U*', 0x065, 0x301;
print "packed with U: $u_packed\n";
Dump( $u_packed );
print "\n";

my $alpha_u_packed = pack 'U*', 0x3B1;
print "α packed with U: $alpha_u_packed\n";
Dump( $alpha_u_packed );
print "\n";

The C works with an octet. In the first example I pack 0xE9, the character é, whose code number fits in one octet. In the second example, I pack 0x3B1, the character α, whose code number needs two octets. The output shows that the second case doesn’t work out correctly. It only packs the low octet, 0xB1, which is \261 in octal. As a single-octet character, that’s ±.

packed with C: é
SV = PV(0x7ffc0a801310) at 0x7ffc0c001cd8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7ffc0a501670 "\351"\0
  CUR = 1
  LEN = 16

α packed with C: ±
SV = PV(0x7ffc0a801390) at 0x7ffc0a810f28
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7ffc0a501690 "\261"\0
  CUR = 1
  LEN = 16
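
To see the wrapping without reading a dump, here is a minimal sketch that compares what C stores against the low octet of the number. The pack_c_octet name is my own helper for illustration, not anything from Perl itself, and I turn off the pack warning category inside it because Perl otherwise complains that the character wrapped:

```perl
use strict;
use warnings;

# Hypothetical helper: return the single octet that pack 'C' stores
sub pack_c_octet {
    my( $number ) = @_;
    no warnings 'pack';  # silence "Character in 'C' format wrapped in pack"
    return ord pack 'C', $number;
}

for my $number ( 0xE9, 0x3B1 ) {
    printf "%04X packs as %02X (low octet is %02X)\n",
        $number, pack_c_octet( $number ), $number & 0xFF;
}
```

For 0x3B1 both of the last two columns should be B1, which is the ± from the dump.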

The next two parts of the output show the result of packing with the U. In the first example, I pack 0x065 and 0x301, the code numbers for e and the combining acute accent, ´. Notice that the packed string has the UTF8 flag and that the actual storage isn’t the code numbers. It’s the UTF-8 representation for those characters. The PV line shows both the UTF-8 encoded octets and the character string itself (curiously labeled as UTF8).

packed with U: é
SV = PV(0x7ffc0a8013c0) at 0x7ffc0a810b08
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7ffc0a5015d0 "e\314\201"\0 [UTF8 "e\x{301}"]
  CUR = 3
  LEN = 16

α packed with U: α
SV = PV(0x7ffc0a801440) at 0x7ffc0a827380
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7ffc0a5014b0 "\316\261"\0 [UTF8 "\x{3b1}"]
  CUR = 2
  LEN = 16
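
The same difference between characters and storage shows up without Devel::Peek. This is only a sketch: it counts the characters with length, then encodes a copy with utf8::encode to count the octets that the CUR line reports. The variable names are mine:

```perl
use strict;
use warnings;

my $u_packed = pack 'U*', 0x065, 0x301;

# Character view: what length() sees
my $char_count = length $u_packed;

# Octet view: encode a copy so the original string is untouched
my $octets = $u_packed;
utf8::encode( $octets );
my $octet_count = length $octets;

printf "characters: %d, UTF-8 octets: %d, UTF8 flag: %s\n",
    $char_count, $octet_count,
    utf8::is_utf8( $u_packed ) ? 'on' : 'off';
```

The two characters take three octets, matching CUR = 3 in the dump above.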

Now let’s go the other way. If I want to unpack the same strings, I’ll get out the values that I put into them.

use utf8;

binmode STDOUT, ':utf8';

my $c_packed = pack 'C*', 0xE9;
my( $number ) = unpack 'C*', $c_packed;
printf "%04X\n", $number;

my $alpha_c_packed = pack 'C*', 0x3B1;
( $number ) = unpack 'C*', $alpha_c_packed;
printf "%04X\n", $number;

# unpack with the same template that packed the string
my $u_packed = pack 'U*', 0x065, 0x301;
my( @numbers ) = unpack 'U*', $u_packed;
printf "%04X %04X\n", @numbers;

my $alpha_u_packed = pack 'U*', 0x3B1;
( @numbers ) = unpack 'U*', $alpha_u_packed;
printf "%04X\n", @numbers;

The output shows that all but the second group come back the same, because the C only packed the low octet of 0x3B1:

00E9
00B1
0065 0301
03B1

That’s not all the C and U formats do, though. They also affect later uses of the A format in the same template. The A format is supposed to deal with ASCII characters, but it goes beyond that.

use utf8;

binmode STDOUT, ':utf8';

my $c_packed = pack 'U*', 0xE9, 0xA9, 0xBC;  # é©¼

my @chars1 = unpack '(A)*', $c_packed;
print "chars1 are @chars1\n";

my @chars2 = unpack 'C0(A)*', $c_packed;
print "chars2 are @chars2\n";

my @uchars = map { sprintf '%04x', ord } 
	unpack 'U0(A)*', $c_packed;
print "uchars are @uchars\n";

The first unpack uses an explicit A and nothing else. Although $c_packed contains 8-bit characters, they show up as I expect in @chars1. In the second unpack, I use C0, which specifies the C template zero times. It also sets subsequent A templates to work on characters. That behavior is also the default. In the third unpack, I start with U0, which specifies the U template zero times and causes subsequent A templates to return the octets of the UTF-8 representation of the characters. To show those, I convert them to their ordinal values instead of printing the gibberish Ã © Â © Â ¼:

chars1 are é © ¼
chars2 are é © ¼
uchars are 00c3 00a9 00c2 00a9 00c2 00bc
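
If all I want is those octets, I don’t need U0 to get them. As a sketch of the alternative, this encodes the string with the core Encode module and splits the result into single octets. I’m assuming a perl from v5.10 or later, where U0 has the behavior shown above:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = pack 'U*', 0xE9, 0xA9, 0xBC;

# The octets as seen through the U0 template
my @via_u0 = map { sprintf '%04x', ord } unpack 'U0(A)*', $string;

# The same octets from an explicit encoding step
my @via_encode = map { sprintf '%04x', ord }
    split //, encode( 'UTF-8', $string );

print "U0:     @via_u0\n";
print "Encode: @via_encode\n";
```

Both lines should match the uchars line in the output above. The explicit version says which encoding I want instead of relying on whatever Perl happens to use internally.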

I can mix the two, which doesn’t seem useful or desirable, but works.

use utf8;

binmode STDOUT, ':utf8';

my $c_packed = pack 'U*', 0xE9, 0xA7, 0xBC;  # é§¼

# show the octets
printf "%v02X\n", unpack 'U0A*', $c_packed;

# show each octet as a character
my @chars = unpack 'U0(A)*', $c_packed;
print "@chars\n";

my @uchars;
@uchars = unpack '(A)2(U0A)*', $c_packed;
print "(A)2(U0A)* uchars are @uchars\n";

@uchars = unpack 'A(U0A)2A', $c_packed;
print "A(U0A)2A uchars are @uchars\n";

@uchars = unpack '(U0A)2C0(A)2', $c_packed;
print "(U0A)2C0(A)2 uchars are @uchars\n";

The first line shows the raw octets and the next line shows the characters that have those code numbers. As a raw string, Perl interprets those according to the default encoding. I’ll use that line to help me see what’s happening next.

C3.A9.C2.A7.C2.BC
Ã © Â § Â ¼
(A)2(U0A)* uchars are é § Â ¼
A(U0A)2A uchars are é Â § ¼
(U0A)2C0(A)2 uchars are Ã © § ¼

The next line, with (A)2(U0A)*, unpacks as two characters in the default mode and the rest of the string as UTF-8 octets. The (A)2 extracts é §. After that, the internal representation of the string has two octets left, C2.BC. As octets, (U0A)* extracts those as Â ¼. I get the ¼ because the code number is the same as the last octet. The first octet gives me the Â.
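
To check the bookkeeping, this sketch lists the UTF-8 octets that belong to each character, so it’s easier to see which ones are left after the first two characters are gone. It encodes a copy of each character with utf8::encode; the variable names are mine:

```perl
use strict;
use warnings;

my $string = pack 'U*', 0xE9, 0xA7, 0xBC;

# For each character, the octets of its UTF-8 encoding joined with dots
my @octets_per_char = map {
    my $copy = $_;
    utf8::encode( $copy );
    join '.', map { sprintf '%02X', ord } split //, $copy;
} split //, $string;

print "@octets_per_char\n";
```

Each of the three characters takes two octets, so after (A)2 consumes the first two characters only the last pair, C2.BC, remains.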

It’s the same thing when I move the octet template (U0A)2 to the middle. I get the Â § for the same reason that I got the weird output last time. This time, though, the ¼ shows up without the Â because the template after the group reverts to the default mode.

The last line shows much the same thing, shifted to the left. I don’t know why anyone would want to do this, but it’s how it works.

In short, the C0 unpacks characters and U0 unpacks bytes from the internal representation. It’s a feature that’s been around since at least v5.6, but Perl’s Unicode support has improved far beyond our need for these features. Maybe it’s a good thing that I didn’t include them in Programming Perl.