PDF Filter Implementation in HexaPDF Using Fibers

Part of the series on the HexaPDF implementation

07. October 2017

In the previous post about HexaPDF I introduced the basic PDF object system. This post will focus on one of the available object types, PDF streams and their filters.

If you are already familiar with the basics of PDF streams and filters, jump down to the section about their implementation in HexaPDF.

PDF Streams

As described in the previous post, a PDF stream represents a potentially unlimited sequence of bytes. Each stream has also some meta data associated with it and this meta data is represented by a PDF dictionary. Since a stream is not limited in size it is used to hold data like images, font files or content streams of pages (i.e. the instructions that tell a PDF viewer what to display and how).

If you look inside a PDF file you will find that the instructions for defining streams are just plain ASCII strings, as are the instructions for all other PDF objects. The following is a valid PDF stream:

1 0 obj
<</Length 12>>
stream
Hello World!
endstream
endobj

The first thing to notice is that the PDF stream is defined as an indirect PDF object. Since streams have to follow a certain syntax and can be arbitrarily long, they can never be direct objects!

After the object definition comes the PDF dictionary that holds the meta data of the stream. There are a few keys like /Length, /Filter and /DecodeParms that are valid for all streams. The only mandatory key is /Length since without it would be hard (sometimes impossible) to find the end of the stream (in essence, we would need to scan for the endstream keyword which might, or might not, work).

If the dictionary was followed by the endobj keyword, we would just have an indirect object pointing to a dictionary. However, it is followed by the stream keyword, telling us that stream data follows, and the stream data itself is followed by the endstream keyword.

And that’s how streams are represented at the file level. However, before we start exploring how HexaPDF handles streams there is one more thing to know of: stream filters.

Stream Filters

The example stream shown above contains the exact byte sequence that a PDF reader would get. But just dumping all streams without compression into a PDF would lead to large PDF files. Therefore streams can employ filters that need to be applied to the raw stream data to get the real stream data.

Of the 9 filters (I will leave out the Crypt filter because it is a special construct) that are defined by the PDF specification four filters deal exclusively with image data:

CCITTFaxDecode handles images encoded with group 3 and 4 CCITT fax encoding,
JBIG2Decode handles monochrome images in JBIG2 encoding,
DCTDecode handles JPEG images, and
JPXDecode handles JPEG2000 images.

Currently, none of these four filters are implemented in HexaPDF. Decoding JPEG or JPEG2000 images is currently not necessary because we can just put a whole JPEG/JPEG2000 image in a PDF stream, set the filter accordingly and it just works. However, this is not the case with the other two filters and therefore they will be implemented in a future version.

That leaves us with five remaining filters of which two, ASCIIHexDecode and ASCII85Decode, are used to ensure that streams in a PDF file are encoded using only ASCII characters, making it possible to create PDFs consisting only of ASCII characters. The problem with them is that they make the streams bigger instead of smaller (e.g. with ASCIIHexDecode each source byte gets encoded by two bytes) and are therefore seldomly used.

Finally, the last three filters deal with compressing data:

The RunLengthDecode filter employs a simple run length encoding to compress data. You will probably never see it used.
The LZWDecode filter uses the Lempel-Ziv-Welch algorithm, that is also used by the TIFF format, to compress data. You will also probably never see it used.
Finally, FlateDecode uses the zlib/deflate compression method and this is what is used most of the time since it offers better compression than the other two.

The LZWDecode and FlateDecode filters can additionally use a predictor algorithm that prepares the input stream so that higher compression rates can be achieved. This predictor algorithm is taken from the PNG specification and together with the deflate algorithm allows for the easy embedding of PNG images into a PDF.

Now that you know which filters are available, we will look at how to they are used.

If a stream has filters applied, the stream dictionary’s /Filter key needs to be set to the applied filters. You read correctly, more than one filter can be applied to a stream; however, this feature is rarely used. Additionally, the /DecodeParms key can be used to supply decoding parameters for each filter.

Going back to our earlier example, it would look like this if the ASCII85Decode and ASCIIHexDecode filters were applied in that order on encoding (note that the filters describe the decoding order):

1 0 obj
<</Length 35 /Filter [/ASCIIHexDecode /ASCII85Decode]>>
stream
3837635552445d692c2245626f38307e3e>
endstream
endobj

Implementation in HexaPDF

Since PDF streams are essentially dictionaries with a byte stream attached, they are implemented in HexaPDF as the subclass HexaPDF::Stream of HexaPDF::Dictionary. The class provides all necessary convenience methods to access, decode and encode streams.

The stream data itself can either be a simple String or a HexaPDF::StreamData object. The former is mostly used for setting the stream data when creating a PDF file or when processing the decoded stream data. The latter is used to represent the stream data without actually reading/decoding it. The last bit is important since it means that HexaPDF can load large stream objects without needing to read the stream data itself if it is not used.

HexaPDF::StreamData objects basically just store a reference to an IO object, an offset and a length. When asked for the data, i.e. when a stream needs to be read and decoded, it returns an object that reads the raw data in chunks to avoid huge memory use when possible. The raw stream data is then passed through the filters specified by the stream dictionary to get the decoded stream data. Since the raw data is read in chunks, it means that the filters need to be aware of that, too. Otherwise the benefits of reading in chunks is wasted. Finally, if the whole stream data is needed at once, it is read as described above but concatenated into one huge string.

The best way to think of this is as a filter pipeline:

The first object in the pipeline is responsible for providing the data chunks.
The middle objects then transform the data chunks according to some defined algorithms.
The last object collects the data chunks and either concatenates them into a string or does something else with them, e.g. writing the chunks to a file.

Therefore the requirements for filter objects used in such a pipeline are:

Can handle arbitrarily large chunks of source data, from 1 byte upwards
Can process the source data in chunks, i.e. it doesn’t need all the data to start processing

Thinking about all this, Ruby’s fiber objects immediately came to mind, mostly because I remembered a blog post about implementing pipelines using fibers by Dave Thomas.

The neat thing about fibers is that they allow you to interrupt an algorithm at any point and return to that exact same point later on, continuing with the algorithm. This is in stark contrast with methods, procs and the like because they always start from the top, even if interrupted in the middle. Koichi Sasada gave a great talk about fibers at this year’s Ruby Kaigi that you should definitely check out.

As you can see fibers are a perfect fit for implementing the PDF filter pipeline. I have implemented some helper methods for creating the initial, source data yielding fibers and for collecting the results.

The filters themselves (e.g.HexaPDF::Filter::ASCIIHexDecode) are implemented as modules that have the methods encoder(source, options = nil) and decoder(source, options = nil). These two methods create fibers that transform the data received via the source argument and yield the results.

If we were to manually read a PDF stream, the process would be like this (using our example stream from above):

Create fiber a that knows how to read chunks from the IO (the real code):

a = HexaPDF::Filter.source_from_io(io, pos: offset, length: length, chunk_size: chunk_size)

Check whether the stream employs filters (our example stream does) and wrap the fiber a in the necessary filter fibers (the real code):
```
b = HexaPDF::Filter::ASCIIHexDecode.decoder(a)
c = HexaPDF::Filter::ASCII85Decode.decoder(b)
```
Note that nothing has been read so far since the fibers were just created but not resumed. To get the string we retrieve the chunks by continuously resuming our fiber and concatenate the chunks (the real code):
```
HexaPDF::Filter.string_from_source(c) # => "Hello World!"
```

And that’s the whole magic!

Conclusion

This post showed you how PDF streams and filters work in general and how they are implemented in HexaPDF. Using Ruby’s fiber objects HexaPDF can lazily load PDF streams and perform chunk-wise processing on them, avoiding huge memory usage.

In a future post I will introduce you to the security features of PDF, how they work and how HexaPDF implements them.

gettalong's web home