Save time by extracting Data directly from PDFs with Tableau

Information is everywhere, and sometimes that information is in a PDF. It might be some historic reporting or something interesting from the web. Extracting data directly from PDFs with Tableau unlocks this data quickly.

There are, of course, other ways to get at the data. Retyping it by hand or copying it into Excel can both work, but they both take time and are pretty frustrating. Extracting data directly from PDFs with Tableau is a powerful way to add new data to visualizations and analysis with minimal effort. That means more potential for useful insights.

Extracting data directly from PDFs with Tableau - IMAX Example

I was discussing IMAX (the 3D/big-screen cinema business) last night over dinner with friends. Not sure why. Regardless, I thought I'd take a quick look at IMAX revenue by region just to see how it's going. I found the following annual report (10K) for Dec 2016 as a PDF:

IMAX 2016 10K

After scanning through it, I found the following on page 131.

All well and good, but not super useful. Opening up Tableau Desktop, I simply connected to a new data source, selected the IMAX 2016 10K PDF, and chose page 131 specifically.

This gave me the following (after selecting the 'Use Data Interpreter' option).

Extracting data from PDF direct into Tableau

Still not perfect, so I quickly did the following. I hid the columns with nulls, renamed the year columns to 2014, 2015, and 2016, and pivoted them. Then I renamed the pivoted columns to ‘Location’, ‘Year’, and ‘Revenue’ and, finally, filtered out the ‘Total’ rows.
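If it helps to see what the pivot step is doing to the shape of the data, here is a rough sketch of the same reshaping in plain Python. It is only an illustration, not what Tableau runs internally. The function name pivot_to_long, the location names, and all of the numbers are made-up placeholders (they are not IMAX's reported figures); only the column names ‘Location’, ‘Year’, and ‘Revenue’ come from the steps above.

```python
# Hypothetical stand-in for the table left after hiding the null columns
# and renaming the year columns. All values are placeholders, NOT IMAX data.
wide_rows = [
    {"Location": "Region A", "2014": 10.0, "2015": 12.0, "2016": 15.0},
    {"Location": "Region B", "2014": 4.0, "2015": 6.0, "2016": 9.0},
    {"Location": "Total", "2014": 14.0, "2015": 18.0, "2016": 24.0},
]

YEAR_COLUMNS = ["2014", "2015", "2016"]


def pivot_to_long(rows, year_columns):
    """Turn one row per location into one row per (location, year)."""
    long_rows = []
    for row in rows:
        # Filter out the 'Total' rows so they are not double counted later.
        if row["Location"].strip().lower() == "total":
            continue
        for year in year_columns:
            long_rows.append(
                {
                    "Location": row["Location"],
                    "Year": int(year),
                    "Revenue": row[year],
                }
            )
    return long_rows


long_rows = pivot_to_long(wide_rows, YEAR_COLUMNS)
for r in long_rows:
    print(r["Location"], r["Year"], r["Revenue"])
```

The long format (one row per location and year) is what makes it easy to put Year on one axis and colour by Location in a chart.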

It was then a simple case of creating a basic dashboard for the data.

For the purposes of this post, I recreated the PDF table in the top left of the dashboard and showed the same data in an easier-to-read format in the top right.

In the bottom left it becomes clear that China is catching up with the USA. It's also interesting to note that excluding China the Rest of the World accounts for about the same total revenue as the USA.

Interesting stuff... and easier to see when visualized.

IMAX Annual Revenue by Location in Tableau

So there you have it. Extracting data directly from PDFs with Tableau is relatively straightforward and a time saver. Once in Tableau, the possibilities are endless.

You might also be interested in using Excel VBA to Load CSV Files. This approach can also automate the preparation of CSV files into a format perfect for Tableau, Power BI, or other tools like Calumo.

EDIT: Based on a reader's comment, the following chart includes 5 years of data. The relative growth of revenue from China has been impressive.

IMAX 5 Years of Annual Revenue by Region
