simont

Fri 2005-11-18 11:16
My trouble is…

My trouble, as a geek, is that I'm generally three quarters of the way towards working out how to do something from first principles before it even occurs to me to see if anyone else has done it already.

Clear case in point yesterday: I wanted to extract a picture from a PDF into a stand-alone JPEG. Now I happen to know quite a lot about what PDFs look like on the inside, because I've written a program which writes them. So my first thought was to poke at the inside of the PDF in question and see what I could see. And indeed, I found my way quite quickly to the picture data, and discovered it was listed as requiring the DCTDecode filter.

My home-grown PDF-prodding program didn't contain code to untangle this, so I went looking for something that did. Reading the PDF specification revealed that their DCT filter had a lot to do with JPEG; so first I dumped the raw DCT-encoded data into a .jpeg file just to see what would happen. It actually seemed half-way plausible: file(1) said it was a JPEG and viewing it showed recognisable fragments of image, but they were jumbled up in a strange way, from which I infer that PDF's DCT encoding is very similar to JPEG but not exactly identical. (Seems odd, but there we go.)
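For illustration, the raw dump described above can be sketched in Python roughly as follows. This is not the actual PDF-prodder, only a naive approximation: the function name `extract_dct_streams` is made up for the example, and it assumes an unencrypted PDF whose image objects are stored directly in the file (not inside object streams) and whose binary data never happens to contain the keyword `endobj`. It ignores `/Length` and any additional filters chained with DCTDecode, and it does no decoding at all, so on a file like the one in question it would presumably produce the same jumbled result.

```python
import re

def extract_dct_streams(pdf_bytes):
    """Return the raw bytes of every stream whose dictionary mentions /DCTDecode.

    Naive sketch: splits the file on the 'endobj' keyword instead of
    parsing the PDF properly, and performs no decoding of the data.
    """
    images = []
    for chunk in pdf_bytes.split(b"endobj"):
        # The stream data starts after the 'stream' keyword and its end-of-line.
        start = re.search(rb"stream\r?\n", chunk)
        if start is None:
            continue  # an object with no stream at all
        header = chunk[:start.start()]
        if b"/DCTDecode" not in header:
            continue  # a stream, but not one tagged as DCT-encoded
        end = chunk.rfind(b"endstream")
        if end == -1:
            continue  # malformed object; skip it
        # Drop the end-of-line that precedes 'endstream'. JPEG data ends
        # with the bytes FF D9, so this should not eat any image data.
        images.append(chunk[start.end():end].rstrip(b"\r\n"))
    return images
```

Writing each returned element to its own .jpeg file reproduces the experiment; anything beyond that wants a real PDF parser.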

Googling for a Python library (my PDF-prodder is in Python) showed up some code which claimed to be able to handle DCT encoding in PDFs. On closer inspection, however, it only handled encoding, and it didn't even really do that: it just dumped a JPEG file into a PDF and tagged it as requiring DCTDecode. (Odder still; that hints that PDF's DCT encoding might be a superset of conventional JPEG.)

Well, so much for short cuts. Next step: the xpdf source code will definitely have a DCT decoder, because I've seen it successfully decode this image! Downloaded it, grubbed around the source for a while, found a DCT decoder sitting in the middle of a C++ stackable-streams architecture. Fair enough. I identified a spot where I could plausibly stick a hacky ‘now dump the decoded stream to disk’ code fragment, then attempted to compile xpdf. I wasn't entirely sure what format of data I would get out of this exercise, but I was reasonably confident that it would be some kind of recognisable raw image data which I ought to be able to massage into a simple image format without too much Perl, and that ImageMagick could probably take it from there.

xpdf's configure script missed some libraries, and reported ‘WARNING: You will be able to compile pdftops, pdftotext, pdfinfo, pdffonts, and pdfimages, but not xpdf or pdftoppm’.

Er, hang on. ‘pdfimages’? The realisation suddenly struck me that I'd wasted an hour on trying to reinvent what in retrospect was a fairly obviously useful wheel. The problem is not that I thought ‘will someone already have done this?’ and decided they wouldn't have; the problem is that it didn't really occur to me in the first place, because I jumped straight from ‘I want to extract a picture from this document’ to ‘a binary data stream in a PDF, I know something about those’, and from there I was off down the implementation process without stopping to look around me first.

Oh well. pdfimages turned out to be exactly what I needed, so I got there in the end without writing too much code.
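For anyone wanting to drive pdfimages from a script, a minimal Python sketch follows. The function names and the example file names are hypothetical; the sketch assumes pdfimages is installed and on the PATH, and that its `-j` option (write DCT-encoded images out as JPEG files) behaves as documented in the xpdf man page. The exact invocation used at the time isn't recorded here, so treat this as one plausible way to call it.

```python
import subprocess

def pdfimages_command(pdf_path, output_prefix):
    """Build the argument list for: pdfimages -j <pdf> <prefix>.

    -j asks pdfimages to save DCT-encoded images as JPEG files.
    """
    return ["pdfimages", "-j", pdf_path, output_prefix]

def extract_images(pdf_path, output_prefix):
    """Run pdfimages, raising CalledProcessError if it exits with an error."""
    subprocess.run(pdfimages_command(pdf_path, output_prefix), check=True)

# Example (hypothetical file names):
#   extract_images("document.pdf", "picture")
```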

[identity profile] compilerbitch.livejournal.com Fri 2005-11-18 13:06
I have part of a graphic designer grafted to my brain stem, so I have to admit that my approach would have been to load the PDF into Illustrator, then to just save out the images.

*tries it just to make sure*

Yep, that works (and also in Photoshop).
[personal profile] simont Fri 2005-11-18 13:09
I, on the other hand, have no graphic designers anywhere near my brain stem, so I don't own a copy of either of those :-)
[identity profile] satanicsocks.livejournal.com Fri 2005-11-18 13:49
That's what I'd have done too :)
[identity profile] christhomas123.livejournal.com Fri 2005-11-18 13:18
I'm surprised at you! Re-inventing the wheel like that! ;o)

My very first reaction these days is to hit google and see if someone's done it already.
[identity profile] feanelwa.livejournal.com Fri 2005-11-18 20:46
[blanks through the IT]
You really, really have to meet Tom from my group. You would hit it off.