Downloading and inspecting PDFs from the internet using R
2016-09-16 18:32 PDT
Some PDFs published on the internet contain useful information that is worth monitoring for updates or changes. I have not previously written code for this type of task, but I now have a legitimate reason to monitor the state of a PDF that is published online. R, among other modern languages, provides tools to accomplish this task. A couple of options in R include:
- download.file() to download the target PDF file
- the tm library to mine text within the PDF using the function readPDF
I'll give an example by using the Environmental Resources Engineering programs catalog located here:
http://pine.humboldt.edu/registrar/catalog/documents/sections/Programs/ere.pdf
I'm using this file for no reason other than that it's published at a URL and is subject to updates.
install.packages('tm')
library(tm)
# Define the target PDF URL. Use paste to keep the code format
# manageable for this webpage.
tgt.url <- paste('http://pine.humboldt.edu/registrar/',
'catalog/documents/sections/Programs/ere.pdf',
sep=''
)
# Define the local path to save the file
tgt.path <- './ere.pdf'
# Download the file to the local path
download.file(url = tgt.url, destfile = tgt.path, mode = 'wb')
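# Create a TextDocument class of the PDF. readPDF() returns a reader
# function, which is then applied to the downloaded file.
pdf.reader <- readPDF(control = list(text = "-layout"))
pdf <- pdf.reader(elem = list(uri = tgt.path),
                  language = "en",
                  id = "id1")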
# Observe the PDF's meta-data
pdf$meta
# Results:
# author : jak57
# datetimestamp: 2016-06-22 14:04:02
# description : character(0)
# heading : Programs_Single.indd
# id : ere.pdf
# language : en
# origin : PScript5.dll Version 5.2.2
If the datetimestamp was recorded at a previous download, an update to the PDF can be detected by periodically downloading the file and comparing its datetimestamp against the recorded value.
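The comparison can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper name has.changed and the state file are hypothetical, and it assumes the timestamp is a date-time value such as the one shown in pdf$meta above.

```r
# Hypothetical helper: compare a timestamp against the value saved on a
# previous run, then record the new value for the next run.
has.changed <- function(current.stamp, state.path) {
  current <- format(as.POSIXct(current.stamp), "%Y-%m-%d %H:%M:%S")

  # Read the previously recorded value, if there is one
  previous <- NA_character_
  if (file.exists(state.path)) {
    saved <- readLines(state.path, n = 1)
    if (length(saved) == 1) previous <- saved
  }

  # Record the current value for the next check
  writeLines(current, state.path)

  # The first run has nothing to compare against, so it reports no change
  !is.na(previous) && !identical(previous, current)
}

# Example with a made-up state file and a fixed timestamp
state.path <- tempfile(fileext = ".stamp")
stamp <- as.POSIXct("2016-06-22 14:04:02", tz = "UTC")
first.run  <- has.changed(stamp, state.path)       # FALSE: nothing recorded yet
second.run <- has.changed(stamp, state.path)       # FALSE: same timestamp
third.run  <- has.changed(stamp + 60, state.path)  # TRUE: timestamp moved
```

With the document read earlier, the call would look something like has.changed(pdf$meta$datetimestamp, './ere.stamp'), where './ere.stamp' is an arbitrary file name chosen for this example.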
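If you wanted to find the line of the file that matches a particular string, you could use something like the following. It returns the index of the first line that equals string.to.find once leading whitespace has been stripped:
match(string.to.find, sub("^\\s+", "", pdf$content))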
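Note that match() only finds a line that equals the string exactly. If the string may appear anywhere within a line, grep() with fixed = TRUE is one alternative. Here is a small sketch, using an invented character vector as a stand-in for pdf$content:

```r
# Stand-in for pdf$content; these lines are invented for illustration
content <- c("   Environmental Resources Engineering",
             "Requirements for the major",
             "   See an advisor for requirements")
string.to.find <- "requirements"

# Indices of every line that contains the string (case-sensitive)
hits <- grep(string.to.find, content, fixed = TRUE)

# Case-insensitive variant: lower-case both sides before searching
hits.any.case <- grep(tolower(string.to.find), tolower(content), fixed = TRUE)
```

Using fixed = TRUE treats the search string as literal text rather than a regular expression, so characters such as "." or "(" need no escaping.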
If the datetimestamp has changed, you can then take a preferred action. Some example actions may be:
- Send an e-mail with a notice that the file has changed. The e-mail can contain the URL to the file and/or the recently downloaded file.
- Attempt to parse the file's contents to extract changes in the file. Report these changes along with your e-mail. You can find the contents of the PDF in a vector of strings in pdf$content.
- yadda yadda yadda
The action to take, if any, is up to the developer and subject to the requirements of the development project.
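For the idea of extracting changes from the file's contents, one simple approach is to compare the lines of a previously saved copy against the lines of the new download. A rough sketch follows. The helper content.changes is hypothetical, and old.content and new.content are invented stand-ins for two versions of pdf$content:

```r
# Invented example data standing in for two versions of pdf$content
old.content <- c("Course A (3 units)",
                 "  Course B (3 units)",
                 "Course C (3 units)")
new.content <- c("Course A (3 units)",
                 "  Course B (4 units)",
                 "Course C (3 units)",
                 "Course D (4 units)")

# Hypothetical helper: list the lines that were added or removed.
# Leading and trailing whitespace is ignored so that layout shifts
# alone are not reported as changes.
content.changes <- function(old, new) {
  old <- trimws(old)
  new <- trimws(new)
  list(added = setdiff(new, old), removed = setdiff(old, new))
}

changes <- content.changes(old.content, new.content)
changes$added    # lines present only in the new version
changes$removed  # lines present only in the old version
```

Because setdiff() treats the lines as a set, this sketch ignores line order and repeated lines. That is probably fine for a quick notification, but not for a precise diff.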
Future blog post ideas around this type of programming task include:
- Sending e-mails with attachments using R
- How to parse the PDF contents to attempt to extract useful information using R
- Crawling a target URL to discover files of interest using R
If at this point you are asking yourself, "Why the hell is Doug using R so much?", the answer is that I'm teaching myself R right now. Python is next on the list.
Have a lot of fun!!!