My trouble is…

My trouble, as a geek, is that I'm generally three quarters of the way towards working out how to do something from first principles before it even occurs to me to see if anyone else has done it already.

Clear case in point yesterday: I wanted to extract a picture from a PDF into a stand-alone JPEG. Now I happen to know quite a lot about what PDFs look like on the inside, because I've written a program which writes them. So my first thought was to poke at the inside of the PDF in question and see what I could see. And indeed, I found my way quite quickly to the picture data, and discovered it was listed as requiring the DCTDecode filter.

My home-grown PDF-prodding program didn't contain code to untangle this, so I went looking for something that did. Reading the PDF specification revealed that the DCTDecode filter had a lot to do with JPEG; so first I dumped the raw DCT-encoded data into a .jpeg file just to see what would happen. It actually seemed half-way plausible: file(1) said it was a JPEG and viewing it showed recognisable fragments of image, but they were jumbled up in a strange way, from which I inferred that PDF's DCT encoding is very similar to JPEG but not exactly identical. (Seems odd, but there we go.)
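The ‘dump the raw data into a .jpeg file’ experiment might look something like the following Python sketch. It is an illustration only: the function name extract_dct_streams is made up, and it assumes an unencrypted PDF in which the image stream sits directly in the file (not inside an object stream), DCTDecode is the only filter applied to it, and the bytes ‘endstream’ never occur inside the image data. A real PDF parser would use the /Length entry instead of searching.

```python
import sys


def extract_dct_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the raw bytes of every stream whose dictionary mentions /DCTDecode.

    Naive scan, under the assumptions stated above; not a real PDF parser.
    """
    streams = []
    pos = 0
    while True:
        tag = pdf_bytes.find(b"/DCTDecode", pos)
        if tag == -1:
            break
        # The stream data begins after the next "stream" keyword and its line ending.
        start = pdf_bytes.find(b"stream", tag)
        if start == -1:
            break
        start += len(b"stream")
        if pdf_bytes[start:start + 2] == b"\r\n":
            start += 2
        elif pdf_bytes[start:start + 1] == b"\n":
            start += 1
        end = pdf_bytes.find(b"endstream", start)
        if end == -1:
            break
        data = pdf_bytes[start:end]
        # Drop the single end-of-line marker that precedes "endstream", if any.
        if data.endswith(b"\r\n"):
            data = data[:-2]
        elif data.endswith((b"\n", b"\r")):
            data = data[:-1]
        streams.append(data)
        pos = end
    return streams


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        found = extract_dct_streams(f.read())
    for i, data in enumerate(found):
        # Hypothetical output naming scheme: dump-0.jpeg, dump-1.jpeg, ...
        with open(f"dump-{i}.jpeg", "wb") as out:
            out.write(data)
```

Running file(1) on each resulting dump-N.jpeg is then a quick way to see whether the bytes at least look like a JPEG.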

Googling for a Python library (my PDF-prodder is in Python) showed up some code which claimed to be able to handle DCT encoding in PDFs. On closer inspection, however, it only handled encoding, and it didn't even really do that: it just dumped a JPEG file into a PDF and tagged it as requiring DCTDecode. (Odder still; that hints that PDF's DCT encoding might be a superset of conventional JPEG.)

Well, so much for short cuts. Next step: the xpdf source code will definitely have a DCT decoder, because I've seen it successfully decode this image! Downloaded it, grubbed around the source for a while, found a DCT decoder sitting in the middle of a C++ stackable-streams architecture. Fair enough. I identified a spot where I could plausibly stick a hacky ‘now dump the decoded stream to disk’ code fragment, then attempted to compile xpdf. I wasn't entirely sure what format of data I would get out of this exercise, but I was reasonably confident that it would be some kind of recognisable raw image data which I ought to be able to massage into a simple image format without too much Perl, and that ImageMagick could probably take it from there.

xpdf's configure script failed to find some libraries, and reported ‘WARNING: You will be able to compile pdftops, pdftotext, pdfinfo, pdffonts, and pdfimages, but not xpdf or pdftoppm’.

Er, hang on. ‘pdfimages’? The realisation suddenly struck me that I'd wasted an hour on trying to reinvent what in retrospect was a fairly obviously useful wheel. The problem is not that I thought ‘will someone already have done this?’ and decided they wouldn't have; the problem is that it didn't really occur to me in the first place, because I jumped straight from ‘I want to extract a picture from this document’ to ‘a binary data stream in a PDF, I know something about those’, and from there I was off down the implementation process without stopping to look around me first.

Oh well. pdfimages turned out to be exactly what I needed, so I got there in the end without writing too much code.
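
For anyone wanting to drive pdfimages from a script, a minimal Python sketch follows. The exact command line used above isn't recorded, so this is an assumption about a typical invocation: the -j option asks xpdf's pdfimages to write DCT-format images out as JPEG files instead of converting them, and the helper names pdfimages_command and extract_images are invented for illustration.

```python
import shutil
import subprocess


def pdfimages_command(pdf_path: str, image_root: str) -> list[str]:
    """Build the argument list for pdfimages (hypothetical helper).

    -j saves DCT-encoded images as JPEG files; image_root is the prefix
    pdfimages uses when naming its output files.
    """
    return ["pdfimages", "-j", pdf_path, image_root]


def extract_images(pdf_path: str, image_root: str) -> None:
    """Run pdfimages, failing loudly if it is missing or exits with an error."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    subprocess.run(pdfimages_command(pdf_path, image_root), check=True)
```

Calling extract_images("document.pdf", "picture") would then leave the extracted images in the current directory, with names starting with the given prefix.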