Downloading and inspecting PDFs from the internet using R

2016-09-16 18:32 PDT

Some PDFs published on the internet contain useful information that is worth monitoring for updates or changes. I have not previously written code for this type of task, but I now have a legitimate reason to monitor the state of a PDF that is published online. R, among other modern languages, provides some tools to accomplish this task. A couple of options in R include:

I'll give an example by using the Environmental Resources Engineering programs catalog located here:
http://pine.humboldt.edu/registrar/catalog/documents/sections/Programs/ere.pdf

I'm using this file for no reason other than that it's published at a URL and is subject to updates.

install.packages('tm')
library(tm)
# Define the target PDF URL. Use paste to keep the code format
# manageable for this webpage.
tgt.url <- paste('http://pine.humboldt.edu/registrar/',
    'catalog/documents/sections/Programs/ere.pdf',
    sep=''
    )
# Define the local path to save the file
tgt.path <- './ere.pdf'
# Download the file to the local path
download.file(url = tgt.url, destfile = tgt.path, mode = 'wb')
# Create a TextDocument object from the PDF
pdf <- readPDF(control = list(text = "-layout"))(
    elem = list(uri = tgt.path),
    language = "en",
    id = "id1"
    )
# Observe the PDF's meta-data
pdf$meta    
# Results:
# author       : jak57    
# datetimestamp: 2016-06-22 14:04:02    
# description  : character(0)    
# heading      : Programs_Single.indd    
# id           : ere.pdf    
# language     : en    
# origin       : PScript5.dll Version 5.2.2

If the datetimestamp from a previous download has been recorded, an update to the PDF can be detected by periodically downloading the file again and comparing its datetimestamp against the recorded value.
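
A minimal sketch of that check, assuming the last seen timestamp is kept in a small text file between runs. The helper name pdf_has_changed and the stamp file are hypothetical and not part of tm; this is one way to do it, not the only one:

```r
# Hypothetical helper: compare a document's timestamp with the value
# recorded on a previous run, and record the new value when it differs.
pdf_has_changed <- function(new.stamp, stamp.file) {
    # Normalize to a plain string so the comparison is a text comparison
    new.stamp <- format(as.POSIXct(new.stamp), "%Y-%m-%d %H:%M:%S", tz = "UTC")
    if (!file.exists(stamp.file)) {
        # First run: nothing to compare against, so just record the value
        writeLines(new.stamp, stamp.file)
        return(FALSE)
    }
    old.stamp <- readLines(stamp.file, n = 1)
    changed <- !identical(old.stamp, new.stamp)
    if (changed) {
        writeLines(new.stamp, stamp.file)
    }
    changed
}
```

With the objects from the example above, this could be called after each download as pdf_has_changed(pdf$meta$datetimestamp, './ere.stamp'), where './ere.stamp' is an assumed location for the recorded value.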

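If you wanted to find the line of the file that matches a particular string, you could use something like the following. Note that match() returns only the first line that equals the string exactly once leading whitespace is removed:

   match(string.to.find, sub("^\\s+", "", pdf$content))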
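
When the string is only part of a line, match() will not find it. A small sketch using base R's grep() with fixed = TRUE returns the index of every line that contains the string. The helper name find_lines and the sample lines are made up for illustration; in practice the second argument would be pdf$content:

```r
# Hypothetical helper: indices of all lines that contain the string.
# fixed = TRUE treats the string literally rather than as a regex.
find_lines <- function(string.to.find, lines) {
    grep(string.to.find, lines, fixed = TRUE)
}

# Made-up lines standing in for pdf$content
sample.lines <- c("  ENGR 115  Intro", "Units: 3", "   ENGR 210  Mechanics")
find_lines("ENGR", sample.lines)
```

An empty result (integer(0)) means no line contained the string.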

If the datetimestamp has changed, you can then take a preferred action. Some example actions may be:

The action to take, if any, is up to the developer and subject to the requirements of the development project.

Future blog post ideas around this type of programming task include:

If at this point you are asking yourself, "Why the hell is Doug using R so much?" Well, the answer is that I'm teaching myself R right now. Python is next on the list.

Have a lot of fun!!!