As anyone who uses Python will tell you, it’s a wonderful language to write in, but the interpreter is slow. If you want programs that run fast, you should write them in C. Python provides an API for writing modules in C that can be imported into Python programs just like any other module. But sometimes writing in C can be a bit of a pain. It’s too low-level, and the syntax is nowhere near as nice as Python’s. That’s where Cython comes in. Cython allows you to compile C-extensions for Python without having to write so much C. In this entry, we’ll compile a C-extension to use the functionality of the command-line utility pdftotext.
First, let’s learn a bit about pdftotext. Its purpose is to extract plaintext from PDF files. It’s part of the Poppler suite of PDF rendering tools (http://poppler.freedesktop.org). It’s open-source, and it’s written in C++. Cython supports C++ extensions, as we’ll see.
Usage of the pdftotext utility looks like this:
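pdftotext [options] <PDF-file> [<text-file>]
When no <text-file> is specified, pdftotext writes the text to a file with the same name as the PDF but a .txt extension; passing - as <text-file> sends the text to stdout instead.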
After downloading the Poppler source (in this case, poppler-0.43.0), we find the pdftotext utility’s source code located in poppler-0.43.0/utils/pdftotext.cc. We will have to modify the code a little; for one thing, we notice that pdftotext wants to write its output to a file, but we want to write it to a string that we can pass back to Python:
textOut = new TextOutputDev(textFileName->getCString(), …);
Luckily for us, the class TextOutputDev has an overloaded constructor (see poppler-0.43.0/poppler/TextOutputDev.h):
// Create a TextOutputDev which will write to a generic stream. …
TextOutputDev(TextOutputFunc func, void *stream, …);
Here’s a quick modification we can make to the pdftotext source code that will allow us to pass the output text to Python. We can write a function that accepts a pointer to a stream, a C-style string of text, and a length, and returns nothing. All this function does is write the text to the stream, which we cast to a pointer to a C++ standard-library string buffer (std::stringbuf):
void outputToStream(void *outputStream, const char *text, int len) {
    ((std::stringbuf *)outputStream)->sputn(text, len);
}
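To see the callback on its own, here is a minimal sketch that does not need Poppler at all. Only outputToStream comes from the text above; the TextSink typedef, the emitText producer and the collect helper are hypothetical stand-ins for whatever code inside a library would invoke the callback while it walks a document.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

// Same callback as above: append len bytes of text to the stringbuf
// hidden behind the void pointer.
void outputToStream(void *outputStream, const char *text, int len) {
    static_cast<std::stringbuf *>(outputStream)->sputn(text, len);
}

// Hypothetical stand-in for the callback type a library would accept.
typedef void (*TextSink)(void *stream, const char *text, int len);

// Hypothetical producer: hands each chunk of text to the callback,
// one call per chunk.
void emitText(TextSink sink, void *stream) {
    const char *chunks[] = {"Hello, ", "PDF ", "world!\n"};
    for (const char *chunk : chunks) {
        sink(stream, chunk, static_cast<int>(std::strlen(chunk)));
    }
}

// Collect everything the producer emits into a single std::string.
std::string collect() {
    std::stringstream ss;
    emitText(outputToStream, ss.rdbuf());  // the stream's buffer is the sink
    return ss.str();
}
```

Because every call appends to the same buffer, the chunks arrive in order and collect() returns them joined into one string.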
Then we can write a function named extract that takes the path to the PDF file, calls a function named _extract (discussed below), and returns the extracted text:
std::string extract(std::string pdf_path) {
    std::stringstream ss;
    std::stringbuf *outputStream = ss.rdbuf();
    int exitCode = _extract(pdf_path.c_str(), outputStream);
    if (exitCode) {
        return std::string("ERROR!");
    } else {
        return ss.str();
    }
}
The function int _extract(const char *pdf_path, void *outputStream) that extract calls is a modified version of pdftotext’s main() function. For brevity, I have left out the implementation details here, but it basically just involves hard-coding some of the parameters that would normally be passed to the pdftotext utility as options from the command line (for simplicity), and passing our outputToStream function and our outputStream pointer into the constructor of TextOutputDev (rather than passing a filename). Note that the outputStream is just the pointer to the buffer used by the string stream ss.
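Since the body of _extract is left out, the following is only a rough sketch of its overall shape, with Poppler replaced by a stand-in so that it compiles on its own. FakeTextDevice, its write method, the OutputFunc typedef and fakeExtract are all invented for illustration and are not part of Poppler; the real function would open the PDF and render its pages through a TextOutputDev built from the same two arguments.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical callback type, mirroring the (stream, text, len) shape
// of the callback described above.
typedef void (*OutputFunc)(void *stream, const char *text, int len);

void outputToStream(void *outputStream, const char *text, int len) {
    static_cast<std::stringbuf *>(outputStream)->sputn(text, len);
}

// Hypothetical stand-in for a text output device: it stores the callback
// and the opaque stream pointer, and forwards any text to them.
class FakeTextDevice {
public:
    FakeTextDevice(OutputFunc func, void *stream)
        : func_(func), stream_(stream) {}

    void write(const char *text) {
        func_(stream_, text, static_cast<int>(std::strlen(text)));
    }

private:
    OutputFunc func_;
    void *stream_;
};

// Plays the role of _extract: returns 0 on success, nonzero on failure.
int fakeExtract(const char *pdf_path, void *outputStream) {
    if (pdf_path == nullptr || *pdf_path == '\0') {
        return 1;  // stand-in for "the document could not be opened"
    }
    FakeTextDevice device(outputToStream, outputStream);
    device.write("text of ");
    device.write(pdf_path);
    return 0;
}

// Same structure as extract in the text, calling the stand-in instead.
std::string extract(std::string pdf_path) {
    std::stringstream ss;
    std::stringbuf *outputStream = ss.rdbuf();
    int exitCode = fakeExtract(pdf_path.c_str(), outputStream);
    if (exitCode) {
        return std::string("ERROR!");
    }
    return ss.str();
}
```

With this stand-in, a non-empty path yields whatever the device wrote, and an empty path takes the error branch and yields the "ERROR!" string.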
Now, to be able to export this function to Cython, we have to create a header file (pdf2text_extract.hpp) with the function declaration:
#include <string>
std::string extract(std::string pdf_path);
Likewise, we should call the implementation file (our modified pdftotext.cc) by the same name:
pdf2text_extract.cpp.
Then, in a file called pdf2text.pyx we write all of our Cython code:
from libcpp.string cimport string

cdef extern from "pdf2text_extract.hpp":
    cpdef string extract(string pdf_path)
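That’s all the Cython code we need! Cython automatically handles the type conversion from Python str to C++ std::string so we can call the function from Python with a pdf_path of type str. (This holds under Python 2, where str is a byte string. Under Python 3, Cython by default converts std::string to and from bytes, so the path would need to be passed as bytes, for example with path.encode().)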
Now we just need to compile it. For this we use distutils and the following setup.py file:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

setup(
    name = 'Python PDF2Text Extension',
    ext_modules = cythonize([
        Extension(
            "pdf2text",
            ["pdf2text.pyx", "pdf2text_extract.cpp"],
            # system installation of poppler headers (Ubuntu 14.04):
            include_dirs=["/usr/include/poppler"],
            libraries=["poppler"],
            language="c++",)
    ]),
)
Even though we downloaded the Poppler source code, I ran into a problem when running the extension linked against the version 0.43.0 library on Ubuntu 14.04. Instead, we can install the libpoppler-dev package with apt-get:
sudo apt-get install libpoppler-dev
This puts the header files in /usr/include/poppler, as we see from the include_dirs parameter in setup.py.
Also, let’s make sure we have Cython installed before we try running setup.py:
pip install cython
To compile the extension we run:
python setup.py build_ext --inplace
This will create a pdf2text.so file, which is our compiled Python extension. To use it, we just need to import it into Python and run the extract function that we’ve defined:
>>> import pdf2text
>>> pdf2text.extract("path/to/pdf_file.pdf")
Check back in with Artemis’ blog The Quiver periodically for more developer quick hints.