Pisa and Reportlab pitfalls

Generating PDFs with Django, Pisa and Reportlab and what to look out for

About a week ago, an entry about generating PDFs with Django was posted on the Uswaretech Blog. It covers using Pisa, an HTML-to-PDF Python library, to generate complex PDFs from existing HTML pages. I took that as an occasion to finish this post, which had been lying around as a draft for about two months and which I originally wrote to point out some pitfalls I ran into while using Pisa in a Django project.

I'm using Reportlab for PDF generation, which is a very powerful open-source Python library. Reportlab features both a good low-level API for generating documents and a higher-level abstraction with a layout engine, which knows where to put page breaks and the like. Some documents are easy to build with the Reportlab API directly, especially text-heavy documents without much styling. For heavily styled documents I added one more tool to do the heavy lifting, so that I could stick to technologies I'm fluent in: the solution was to use Pisa and write the documents in plain old HTML+CSS.

Pisa is an open-source Python library which uses html5lib to parse HTML (with CSS) documents and then creates a PDF using Reportlab. The results are pretty good, and Pisa provides some vendor-specific CSS extensions which allow styling pages with different templates and adding static headers and footers. Additionally there are some Pisa-specific XML tags, like <pdf:pagenumber />, which allow adding page numbers, page breaks etc. to the resulting PDF.

Both Reportlab and Pisa have some documentation, but I want to document a few gotchas that I couldn't find in the docs and that took me some time to figure out. (And I hope to save someone else the time of figuring this stuff out.)

Where are my page numbers?

As mentioned above, Pisa allows adding static headers and footers via CSS. This is documented very well, so I will not repeat it here. One problem I ran into was adding a static footer which contains a <pdf:pagenumber /> tag and should show the current page number at the bottom of every page of the resulting PDF. The problem was that every page just showed the number 0. The solution was very simple, but I was only able to figure it out after studying the Pisa source code: you can only use the <pdf:pagenumber /> tag inside a paragraph, not (as I did) inside a table, for example. Even paragraphs inside tables don't work. It has to be a top-level paragraph.
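To make this concrete, here is a minimal sketch of a page template with such a footer, wrapped in a Python string so it can be handed to Pisa. The frame name `footer`, the id `footerContent`, all sizes and the helper `build_document` are assumptions made up for this example. The part that matters is that the <pdf:pagenumber /> tag sits in a plain paragraph and not inside a table.

```python
# Sketch of a page template with a static footer for pisa.
# Assumptions: the frame name, the element id and all sizes are illustrative only.
PAGE_TEMPLATE = """<html>
<head>
<style>
@page {
    margin: 2cm;
    @frame footer {
        -pdf-frame-content: footerContent;
        bottom: 1cm;
        margin-left: 2cm;
        margin-right: 2cm;
        height: 1cm;
    }
}
</style>
</head>
<body>
%(body)s
<div id="footerContent">
    <p>Page <pdf:pagenumber /></p>
</div>
</body>
</html>"""


def build_document(body_html):
    """Return the complete HTML document with the footer block appended to the body."""
    return PAGE_TEMPLATE % {"body": body_html}
```

Calling `build_document("<h1>Report</h1>")` yields the full HTML string that would then be passed to Pisa for conversion.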

Wrong Page Breaks

The documents I was generating start with a headline and a short introductory text (about 5 lines), followed by another headline and a long table (more than one page). Pisa and Reportlab know how to break tables across pages, but I had the problem that whenever the table was longer than the remaining space on page one, a page break appeared directly after the introductory text; the second headline and the table started at the top of page two, and the table was correctly split over the following pages. The page break was added by Reportlab's layout engine (platypus), which is rather smart, but I had to find out what was going on before I could understand why this was happening.

[Figure: Pisa page break issue]

The layout engine has a concept of keep-with-next, which avoids orphaned elements at the bottom of a page. Pisa assigns a default keep-with-next attribute to all HTML headings (h1-h6), which is a good thing most of the time, but had the following consequence in my case: Reportlab knows that the headline before the table and the table itself should be kept together. Because the two together don't fit into the remaining space on the first page, they are moved to the second page. They don't fit on that page either, but now the page break is made inside the table, because nowhere in the document will there be more space than on a new blank page.
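This reasoning can be reproduced with a small toy model. The sketch below is not Reportlab's actual algorithm; the `Block` class, the `paginate` and `example` functions and all heights are hypothetical and only mimic the two rules described here: a keep-with-next block is grouped with its successor, and a group that doesn't fit into the remaining space is moved to a fresh page.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    height: int                   # height in abstract "lines"
    splittable: bool = False      # e.g. a table that may break across pages
    keep_with_next: bool = False  # must not be separated from the next block


def paginate(blocks, page_height):
    """Toy layout (not Reportlab's code): return pages as lists of (name, height)."""
    pages = [[]]
    remaining = page_height
    i = 0
    while i < len(blocks):
        # Group a run of keep-with-next blocks with the block that follows them.
        group = [blocks[i]]
        while group[-1].keep_with_next and i + len(group) < len(blocks):
            group.append(blocks[i + len(group)])
        i += len(group)

        # A group that does not fit as a whole is moved to a fresh page.
        needed = sum(b.height for b in group)
        if len(group) > 1 and needed > remaining and pages[-1]:
            pages.append([])
            remaining = page_height

        for block in group:
            left = block.height
            while left > 0:
                # Start a new page when the current one is full, or when an
                # unsplittable block does not fit on a page that is already in use.
                if remaining == 0 or (
                    not block.splittable and left > remaining and pages[-1]
                ):
                    pages.append([])
                    remaining = page_height
                part = min(left, remaining) if block.splittable else left
                pages[-1].append((block.name, part))
                remaining = max(remaining - part, 0)
                left -= part
    return pages


def example(keep_with_next):
    """Lay out the document described above; return the block names on each page."""
    blocks = [
        Block("headline", 2),
        Block("intro", 5),
        Block("table headline", 2, keep_with_next=keep_with_next),
        Block("table", 60, splittable=True),
    ]
    return [[name for name, _ in page] for page in paginate(blocks, page_height=40)]
```

In this model, `example(True)` ends page one right after the intro and starts page two with the table headline, while `example(False)` keeps the table headline and the first part of the table on page one.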

The solution to avoid this page break and just get a normal break inside the table is to assign the CSS style "-pdf-keep-with-next:false;" to the headline just before the table. This makes Pisa tell Reportlab not to apply keep-with-next to the headline and the table. Reportlab will put the headline on the first page, then start the table, notice that the table doesn't fit on the page and add a page break inside the table, just as one would expect.
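As a rough sketch, the markup for such a section could be produced like this. The helper `table_section` and the choice of an h2 heading are assumptions for illustration; only the `-pdf-keep-with-next` style itself comes from the fix described above.

```python
from html import escape


def table_section(title, rows):
    """Build a headline followed by a table; the headline opts out of keep-with-next."""
    body = "".join(
        "<tr>" + "".join("<td>%s</td>" % escape(str(cell)) for cell in row) + "</tr>"
        for row in rows
    )
    return (
        '<h2 style="-pdf-keep-with-next: false;">%s</h2>\n'
        "<table>%s</table>" % (escape(title), body)
    )
```

The same effect can of course be achieved by writing the style attribute directly into the HTML template, or by putting the rule into a stylesheet class that is only applied to headlines preceding long tables.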

Adding Pictures to the PDF

This one is rather trivial and not really a pitfall, but as it fits nicely into this topic I'm going to write it down here. To get pictures that are visible on the HTML page into the PDF, you should define a link callback function which knows how to translate a src attribute from the HTML document into a local path to the image. If you are processing remote files, this callback could even fetch the image, but it has to return a path where Reportlab can find the image on the filesystem, not a file-like object or something else. A very simple link callback function which should work for most Django projects could look like this:

import os
from django.conf import settings

def fetch_resources(uri, rel):
    """
    Callback to allow pisa/reportlab to retrieve images, stylesheets, etc.

    `uri` is the value of the src or href attribute from the HTML document.
    `rel` gives a relative path, but it's not used here.
    """
    path = os.path.join(settings.MEDIA_ROOT, uri.replace(settings.MEDIA_URL, ""))
    return path
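For trying out the path mapping without a configured Django project, the same idea can be written as a small factory that takes the URL prefix and the root directory as arguments. This is only a sketch under assumptions: `make_link_callback` is a hypothetical helper, and returning non-media URIs unchanged is a design choice of this example. Unlike the version above, it strips the prefix only at the start of the URI.

```python
import os


def make_link_callback(media_url, media_root):
    """Return a link callback mapping URLs below media_url to files below media_root."""

    def link_callback(uri, rel):
        if uri.startswith(media_url):
            # Strip only the leading prefix and build a filesystem path from the rest.
            relative = uri[len(media_url):].lstrip("/")
            return os.path.join(media_root, *relative.split("/"))
        # Anything that is not a media URL is returned unchanged.
        return uri

    return link_callback
```

In a Django project one would call `make_link_callback(settings.MEDIA_URL, settings.MEDIA_ROOT)` and hand the result to Pisa as its `link_callback` argument.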
