pdf problems at Mary Ferrell

The Mary Ferrell Foundation is my main source for ARC documents. I have a professional membership which allows me to download as many pdfs from their huge collection as I wish. I have very seldom had problems with this service, but after a lapse of several months without downloading anything, I have run into an odd glitch.

The pdf files I have recently downloaded no longer work with the script I use to process pdfs. The reason, it turns out, is that pdfs from MFF are no longer strictly compliant with the pdf standard. What’s more, I get tiny variations in the pdfs depending on which browser I use to download them.

With the Firefox browser, I get pdfs that have two “blank spaces” (hex 20) at the beginning of the file, and are missing the last two bytes at the end of the file (hex 46 0A, which is a capital F and a linefeed).

I found this out by comparing recent downloads of several files with the same files downloaded from MFF several months ago. The new downloads all have this odd difference from the old ones, and as a result they are no longer strictly compliant with the pdf standard, which requires that a pdf file header begin with ‘%PDF…’, not ‘<20><20>%PDF’, and that the file end with ASCII ‘…EOF’ (the line feed is not in the standard).
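For anyone who wants to check their own downloads, the comparison can be sketched in a few lines of Python. This is only an illustration, not the script I use; the function name `describe_pdf_bytes` is made up for the example, and it looks at nothing beyond the first and last few bytes of the file.

```python
def describe_pdf_bytes(data: bytes) -> dict:
    """Report header and trailer oddities in the raw bytes of a pdf file."""
    # Drop any leading hex 20 bytes so we can see what follows them.
    stripped = data.lstrip(b" ")
    return {
        # Number of space bytes sitting in front of the header.
        "leading_spaces": len(data) - len(stripped),
        # True only if the file begins with the header marker itself.
        "starts_with_pdf_marker": data.startswith(b"%PDF"),
        # True if the marker is there once the spaces are ignored.
        "marker_after_spaces": stripped.startswith(b"%PDF"),
        # The final eight bytes in hex, for eyeballing the trailer.
        "last_bytes_hex": data[-8:].hex(),
        # True if the file ends with the EOF marker, ignoring a final newline.
        "ends_with_eof_marker": data.rstrip(b"\r\n").endswith(b"%%EOF"),
    }
```

To use it, read a downloaded file in binary mode (for example `open(path, "rb").read()`) and pass the bytes in. Running it on an old and a new download of the same document should show the differences side by side.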

With the Google Chrome browser, I get pdfs that have two hex 20 bytes at the beginning, but are not missing the final 46 0A bytes. Weird.

This problem is specific to Mary Ferrell. When I download an ARC pdf from NARA today, I get the same file I got last year, and it makes no difference whether I use Firefox or Chrome.

I now have a Perl script that fixes the pdf headers, but perhaps the MFF could fix the whole problem by tweaking the settings on their website.
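A repair of this kind might look something like the Python sketch below. It is a stand-in for illustration, not the Perl script itself. The function name `repair_pdf_bytes` is hypothetical, and the code assumes the only damage is what is described above: space bytes in front of ‘%PDF’ and, in some downloads, a missing final ‘F’ and linefeed.

```python
def repair_pdf_bytes(data: bytes) -> bytes:
    """Undo the two kinds of damage described in this post, if present."""
    fixed = data
    # Remove leading hex 20 bytes, but only when the header marker
    # really does follow them; otherwise leave the start untouched.
    stripped = data.lstrip(b" ")
    if stripped.startswith(b"%PDF"):
        fixed = stripped
    # Assumption: a truncated file stops right after "%%EO", so the
    # missing bytes are the capital F (hex 46) and a linefeed (hex 0A).
    if fixed.endswith(b"%%EO"):
        fixed += b"F\n"
    return fixed
```

A file that is already intact passes through unchanged, so it should be safe to run this over a mixed folder of old and new downloads. It is still wise to write the result to a new file rather than over the original.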