pack’s C0 and U0

Perl’s unpack can work with character data in two ways. The C template works with single-octet characters and the U template works with Unicode code numbers, which Perl stores as UTF-8. That they exist doesn’t mean that you should use them, but they exist and I inadvertently overlooked some of their behavior for Programming Perl. They focus a bit too much on Perl’s internal representation of a string, which we shouldn’t do.

First, let’s look at some pack examples. I use the Dump function from the Devel::Peek module to show the underlying scalar record so I can see what is actually stored:

use strict;
use utf8;
use Devel::Peek;

select STDERR;  # because Dump uses that by default
binmode STDERR, ':utf8';

my $c_packed = pack 'C*', 0xE9;
print "packed with C: $c_packed\n";
Dump( $c_packed );
print "\n";

my $alpha_c_packed = pack 'C*', 0x3B1;
print "α packed with C: $alpha_c_packed\n";
Dump( $alpha_c_packed );
print "\n";

my $u_packed = pack 'U*', 0x065, 0x301;
print "packed with U: $u_packed\n";
Dump( $u_packed );
print "\n";

my $alpha_u_packed = pack 'U*', 0x3B1;
print "α packed with U: $alpha_u_packed\n";
Dump( $alpha_u_packed );
print "\n";

The C works with an octet. In the first example I pack 0xE9, the character é, whose code number fits in one octet. In the second example, I pack 0x3B1, the character α, whose code number needs two octets. The output shows that the second case doesn’t work out correctly. It only packs the low octet, 0xB1, which is \261 in octal. As a single-octet character, that’s ±.

packed with C: é
SV = PV(0x7ffc0a801310) at 0x7ffc0c001cd8
  REFCNT = 1
  PV = 0x7ffc0a501670 "\351"\0
  CUR = 1
  LEN = 16

α packed with C: ±
SV = PV(0x7ffc0a801390) at 0x7ffc0a810f28
  REFCNT = 1
  PV = 0x7ffc0a501690 "\261"\0
  CUR = 1
  LEN = 16
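
Here’s a minimal sketch of that wrapping without Devel::Peek. I’m assuming that pack’s C behaves like a mask with 0xFF when the value is too big, which is what the output above suggests; the variable names are mine and only for illustration:

```perl
use strict;
use warnings;

my $code_number = 0x3B1;               # the code number for α
my $low_octet   = $code_number & 0xFF; # what I expect C to keep

my $packed = do {
    no warnings 'pack';                # C warns when the value wraps
    pack 'C', $code_number;
};

printf "low octet:   0x%02X (octal %o)\n", $low_octet, $low_octet;
printf "packed char: 0x%02X\n", ord $packed;
```

Both lines should report the same octet, which is the ± that showed up instead of the α.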

The next two parts of the output show the result of packing with the U. In the first example, I pack 0x065 and 0x301, the code numbers for e and the combining acute accent, ´. Notice that the packed string has the UTF8 flag and that the actual storage isn’t the code numbers. It’s the UTF-8 representation for those characters. The PV line shows both the UTF-8 encoded octets and the character string itself (curiously labeled as UTF8).

packed with U: é
SV = PV(0x7ffc0a8013c0) at 0x7ffc0a810b08
  REFCNT = 1
  PV = 0x7ffc0a5015d0 "e\314\201"\0 [UTF8 "e\x{301}"]
  CUR = 3
  LEN = 16

α packed with U: α
SV = PV(0x7ffc0a801440) at 0x7ffc0a827380
  REFCNT = 1
  PV = 0x7ffc0a5014b0 "\316\261"\0 [UTF8 "\x{3b1}"]
  CUR = 2
  LEN = 16
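
If I only want to see those octets, I don’t have to look inside the scalar. This is a small sketch that assumes the core Encode module; it encodes the same packed string to UTF-8 and prints the octets in hex, which should line up with the octal escapes in the PV line (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $u_packed = pack 'U*', 0x065, 0x301;   # e and the combining acute accent

my $char_count = length $u_packed;         # counts characters, not octets
my $octets     = encode( 'UTF-8', $u_packed );
my $octet_dump = join ' ', map { sprintf '%02X', ord } split //, $octets;

print "characters: $char_count\n";
print "octets:     $octet_dump\n";
```

The string has two characters but three octets, matching the CUR = 3 in the dump.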

Now let’s go the other way. If I want to unpack the same strings, I’ll get out the values that I put into them.

use utf8;

binmode STDOUT, ':utf8';

my $c_packed = pack 'C*', 0xE9;
my( $number ) = unpack 'C*', $c_packed;
printf "%04X\n", $number;

my $alpha_c_packed = pack 'C*', 0x3B1;
my( $number ) = unpack 'C*', $alpha_c_packed;
printf "%04X\n", $number;

my $u_packed = pack 'U*', 0x065, 0x301;
my( @numbers ) = unpack 'C*', $u_packed;
printf "%04X %04X\n", @numbers;

my $alpha_u_packed = pack 'U*', 0x3B1;
my( @numbers ) = unpack 'C*', $alpha_u_packed;
printf "%04X\n", @numbers;

The output shows that all but the second group come back the same because the C only packed the low octet of 0x3B1:

00E9
00B1
0065 0301
03B1

That’s not all the C and U formats do though. They affect future uses of the A format, which is supposed to deal with ASCII characters but goes beyond that.

use utf8;

binmode STDOUT, ':utf8';

my $c_packed = pack 'U*', 0xE9, 0xA9, 0xBC;  # 驼

my @chars1 = unpack '(A)*', $c_packed;
print "chars1 are @chars1\n";

my @chars2 = unpack 'C0(A)*', $c_packed;
print "chars2 are @chars2\n";

my @uchars = map { sprintf '%04x', ord } 
	unpack 'U0(A)*', $c_packed;
print "uchars are @uchars\n";

The first unpack uses an explicit A and nothing else. Although $c_packed contains 8-bit characters, they show up as I expect in @chars1. In the second unpack, I use C0, which specifies the C template zero times. It also sets subsequent A templates to work on characters, which is also the default behavior. In the third unpack, I start with U0, which specifies the U template zero times and causes subsequent A templates to return the octets of the UTF-8 representation of the characters. To show those, I convert them to their ordinal values instead of printing the gibberish Ã © Â © Â ¼:

chars1 are é © ¼
chars2 are é © ¼
uchars are 00c3 00a9 00c2 00a9 00c2 00bc
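
To compare the two modes without my terminal’s encoding getting in the way, I can print the ordinal values for both. This is only a sketch and it assumes the character-by-default behavior of perl v5.10 and later; the array names are mine:

```perl
use strict;
use warnings;

my $packed = pack 'U*', 0xE9, 0xA9, 0xBC;

# C0 switches to character mode, so A sees the three characters
my @char_mode = map { sprintf '%02X', ord } unpack 'C0(A)*', $packed;

# U0 switches to UTF-8 mode, so A sees the six octets of the internal storage
my @byte_mode = map { sprintf '%02X', ord } unpack 'U0(A)*', $packed;

print "C0: @char_mode\n";
print "U0: @byte_mode\n";
```

The C0 line should have the three code numbers I packed and the U0 line should have the same six octets as the uchars line above.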

I can mix the two, which doesn’t seem useful or desirable, but works.

use utf8;

binmode STDOUT, ':utf8';

my $c_packed = pack 'U*', 0xE9, 0xA7, 0xBC;  # 駼

# show the octets
printf "%v02X\n", unpack 'U0A*', $c_packed;

# show the characters
my @chars = unpack 'U0(A)*', $c_packed;
print "@chars\n";

my @uchars;
@uchars = unpack '(A)2(U0A)*', $c_packed;
print "(A)2(U0A)* uchars are @uchars\n";

@uchars = unpack 'A(U0A)2A', $c_packed;
print "A(U0A)2A uchars are @uchars\n";

@uchars = unpack '(U0A)2C0(A)2', $c_packed;
print "(U0A)2C0(A)2 uchars are @uchars\n";

The first line shows the raw octets and the next line shows the characters that have those code numbers. As a raw string, Perl interprets those according to the default encoding. I’ll use that line to help me see what’s happening next.

C3.A9.C2.A7.C2.BC
Ã © Â § Â ¼
(A)2(U0A)* uchars are é § Â ¼
A(U0A)2A uchars are é Â § ¼
(U0A)2C0(A)2 uchars are à © § ¼

The next line, with (A)2(U0A)*, unpacks as two characters in the default mode and the rest of the string as UTF-8 octets. The (A)2 extracts é §. After that, the internal representation of the string has two octets left, C2.BC. As octets, (U0A)* extracts those as  ¼. I get the ¼ because the code number is the same as the last octet. The first octet gives me the Â.

It’s the same thing when I move the octet template (U0A)2 to the middle. I get the Â § for the same reason as I got the weird output last time, but the final A gives me a plain ¼ because the template after the group reverts to the default mode again.

The last line shows much the same thing, shifted to the left. I don’t know why anyone would want to do this, but it’s how it works.

In short, the C0 unpacks characters and U0 unpacks bytes from the internal representation. It’s a feature that’s been around since at least v5.6, but Perl’s Unicode support has improved far beyond our need for these features. Maybe it’s a good thing that I didn’t include them in Programming Perl.
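
If all I want is the UTF-8 octets of a string, I’d probably reach for Encode instead of U0. This is a hedged sketch of that alternative, not something from the book; it assumes the core Encode module and uses the same three characters as the last example:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\x{E9}\x{A7}\x{BC}";         # three characters

# characters to octets, without peeking at the internal representation
my $octets = encode( 'UTF-8', $string );
my @hex    = map { sprintf '%02X', ord } split //, $octets;
print "octets: @hex\n";

# and back again
my $round_trip = decode( 'UTF-8', $octets );
print "round trip matches\n" if $round_trip eq $string;
```

This says what I mean, octets in a named encoding, and it doesn’t depend on how perl happens to store the string.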

Programming Perl in popular culture

While watching Brooklyn 99, the new Andy Samberg show, I noticed in the background of the police psychologist’s office a curious blue-spined book. The trade dress of O’Reilly books makes them instantly identifiable from a distance, even when blurry, which means they probably got it right. I had to look at it on pause for a few seconds to convince myself I wasn’t seeing things.

Nope, that’s definitely Programming Perl, Third Edition. Miyagawa then pointed me toward a much more in focus appearance in the TV show Chuck. This time the book is bedside and by itself. Some light bedtime reading?

You can tell that it’s the third edition because each has a distinctive look, which I showed in “20 Years of Programming Perl”. Seeing a specific book I’ve worked on, even in an edition prior to my involvement, is almost as good as seeing the Camel in The IT Crowd.

Camel Perl::Critic Policies

Update: This project has been absorbed into the main Perl::Critic project.

Chapter 21 of Programming Perl recommends several programming practices and styles. Tom meditated on PerlMonks that he’d like to have Perl::Critic policies for those.

Some of the policies already exist and there are many recommendations that still need policies. To start this effort, I created the perl-critic-policy-camel GitHub project. If you’d like to take part, let me know. Or don’t: just fork and work and send pull requests.

Scott Hildreth wins the Programming Perl cover

Scott Hildreth, wearing his YAPC::NA::Madison shirt, shows off the framed and signed Programming Perl cover he won from this blog. I never keep this swag for very long, so you might get some too if you follow along.

He’s also standing in front of a Chicago White Sox banner, which almost cancels out the benefit that he gets from using Perl.

OSCON discounts for the Camel book

During OSCON, the Programming Perl ebook is $19.99 (50% off the list price) when you buy it through the open source geeks promotion or by using offer code CFOSCON.

Besides Programming Perl, you can also get Randal Schwartz’s Learning Perl Video Series for $74.99 (also 50% off).

Win a framed cover of Programming Perl

A framed cover of Programming Perl showed up in my mail today. Our publisher, O’Reilly Media, has been doling these out to authors in the past couple of months. Tom got one about a week ago and I was slightly jealous, even though I already have one for Learning Perl. So what am I going to do with this one? I’ve signed the front glass, not wanting to disturb the very nice framing job, which also means that if you hate my signature, a little rubbing alcohol should remove that easily.

A framed book cover, with my signature

I’ve come up with some creative giveaways for the books, but I wonder what I should ask our readers to do for a chance to get the framed cover. It’s not enough to simply put your name into a raffle. I wouldn’t mind auctioning it on eBay, but that locks out the people who can’t get an account. But, I want the recipient to give something to get something. So, this time, I’ll let people decide what they’d like to exchange. You don’t have to give something to me; maybe you fix a bug in perl, donate a bunch of stuff to your local hacker space, or something else that makes the universe a slightly better place. Leave a comment pointing to what you’ve done.

I still have some full books (cover and all the other pages) to give away too.

Re: What is Modern Perl?

Dave Cross reviewed Programming Perl for josetteorama. It’s no surprise to us that he likes the latest edition.

He spent most of his review distinguishing between different definitions of “modern Perl”. Most people use that term in conjunction with the module and book of the same name, which means they are programming in a particular fashion with a particular set of modules. The Camel book has never cared what you are doing though. It’s a reference for the Perl language, not a guide to Perl application development.

As Dave points out, “modern Perl” can also mean “up to date”. He says:

The Perl of 2012 is substantially different to the Perl of 2000.


The definitive Perl book is now up to date with the way that the best Perl programmers now program Perl.

That was always the plan. We talked about the other meaning of “modern Perl”, fully realizing that “modern” isn’t the same thing as “contemporary”. “Modern history” started right after the Middle Ages. Indeed, much of “modern Perl” started in 1995 when we got Perl 5. The “modern Perl” movement isn’t so much about getting people to program with the stuff released today as about getting them to stop programming in the Perl 3 fashion. That’s what chromatic says on the back cover of Modern Perl:

[M]ost Perl 5 programs in the world take far too little advantage of the language. You can write Perl 5 programs as if they were Perl 4 programs (or Perl 3 or 2 or 1), but programs written to take advantage of everything amazing the worldwide Perl 5 community has invented…

“Modern Perl” is really Perl 5, and that’s evolving because Larry specifically created Perl 5 to be user extensible (and he went one further to make Perl 6 user-mutable). How we extended and used Perl 5 ten years ago is different than five years ago is different from today, and we fully expect that in five years we’ll be doing it in another way. However, the core language will be what it is, and that’s what Programming Perl is about.

More Programming Perls to give away

Just when I’d mailed off my last Programming Perl, O’Reilly sends me five more. I think I was supposed to get these before YAPC, but it’s too late for that. Don’t they realize these are big, heavy books? Now I have to figure out how to give away five of these. Although we are arranging for the second printing, having sold out the first, these are still the first printing.

18 pounds of more books

To get these books, you’re going to have to do more than send me a postcard. This time, you have to support a charity where I was recently elected to the Board of Directors. Fractured Atlas liberates artists by doing the back office bits for them, including donor management, project insurance, health insurance, space rental assistance, and many other boring, non-arty bits. There’s a lot of web programming involved in the tools that artists use, and a lot of tools to transform data in various ways. Make a $50 or higher donation to their general fund and send me your receipt (minus details I shouldn’t see). I’ll take everyone who does that before July 19th and randomly select five people to get a signed copy of the Camel, or pass them out to those who go beyond my challenge. Even if you don’t want a Camel, consider a gift that helps artists make our world better.

Recipients of my thanks

  1. Demain, who supported The Debate Society with a monthly donation through Fractured Atlas.

My first Camel royalties

I just made my first $4 on Programming Perl. O’Reilly Media just moved to a monthly (instead of quarterly) royalty report and payment. Now I get the reports at the end of each month, although each one covers sales from three months back. Royalties come three months after so booksellers have time to track inventory, send money back, and more troubling, return books they don’t want anymore.

Programming Perl was officially available in March, so I won’t see any significant money until the end of June. However, it looks like some people bought the ebook version directly from O’Reilly so they’ve already delivered those and taken the money. I can get those royalties now. That’s $4.20.

Things are looking good for the Camel.

Get your Camel signed, remotely

You can almost always get a signature from an author if you find them in person at a conference, Perl mongers event, or on the street, but sometimes you don’t have those opportunities (especially since all of the authors are from the United States).

You can still get the signatures even without those opportunities. I have a limited number of “Authentic Author’s Signature” stickers from O’Reilly. If you send me a postcard with your address (here’s mine), I’ll send you one of these signature plates.