Matching NARA to MFF, part 1: July 24 release

This is the first note in a new series on how the Mary Ferrell Foundation (MFF) managed to match up the records from NARA’s 2017-2018 ARC releases with the appropriate metadata and plug them into their massive document archive.

This was a big, big job; my hat is off to them, particularly to Mary Ferrell Foundation President Rex Bradford, who doubles as the MFF’s website designer and programmer.

The many reasons why this was such a big job should already be clear to the handful of people who have read some of the early posts on this website. I will have to recap some of these in the coming series, but I will try to keep the repetition down to a minimum, linking to the early posts as much as possible.

In short, the many duplicates in the released files, the single pdfs combining up to a dozen documents, the complex errors that mess up your searches and counts, all these need to be fixed in order to make the most efficient use of the files posted on-line at NARA in 2017-2018.

MFF is an excellent and reliable source of original documents, and it has already done the job of integrating the 2017-2018 NARA releases into its original collection. So why am I looking at the result of their effort in nitpicking detail?

Well, first I would like to gain the benefit of MFF’s corrections and in-depth indexing without having to download all 50,000 plus documents from the MFF website. That would be exceedingly time consuming, and redundant, too, since I already downloaded all of the material from NARA two to three years ago.

If I don’t download MFF’s files, however, that leaves me in the same situation that MFF was in before they finally bit the bullet and rearranged the NARA releases to match up with the ARC metadata themselves. In the process of rearranging, they caught a number of errors in the NARA data which I missed. The errors MFF caught, which I missed, will reveal themselves as I go through their matching.

In addition, using an MFF copy of a release record, I have occasionally noted a record that was mislabeled in some way. Most of these errors represent errors in NARA’s data which MFF simply copied; a very, very small percentage were mismatches on the part of the MFF indexers. Having become an obsessive-compulsive over this sort of thing, I have to fix these errors too.

Those who do not suffer from obsessive-compulsive urges will recognize that this series of notes is guaranteed to be an incredibly boring read. For all but the obsessed reader, stop here.

Review of NARA 18

Before I describe my inspection of MFF matching, I have to describe what they were working with. NARA posted six sets of ARC records on their website in 2017, and one more set in 2018 for a total of 7. Each time NARA posted a set of records, it provided an Excel file which gave most of the standard metadata for each record.[1]

The final result of these releases is in NARA’s spreadsheet posted on April 26, 2018, which has links to pdfs of all the files released, plus the associated metadata. This is my base for comparing the files as posted and labeled by NARA with the files that MFF has added to their collection. This base spreadsheet I refer to as NARA 18.

NARA 18 has a couple of points worth noting for those who have not wrestled with it. In addition to adding in the April 2018 releases, it differs a little from the spreadsheet for the earlier releases at NARA. This earlier spreadsheet was cumulative for the six release sets from July 2017 to December 2017, with each new set of releases added at the top of the sheet. Call this spreadsheet NARA 17.

Despite what were clear errors in the metadata for various records, I’m pretty sure no corrections were made in any of the six versions of NARA 17. NARA 18, however, incorporated a couple of dozen corrections for records in this previous set.[2]

NARA 17 also had an oddity in that multiple files and links were sometimes listed in one spreadsheet cell. This was a pain in the butt for people who were trying to get all the files and data into a normalized database table.[3] This oddity was fixed in NARA 18, so that every row in that spreadsheet represented a single file.[4]
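For readers who want to see what "one file per row" means in practice, here is a minimal Python sketch. It is an illustration only, not NARA's or MFF's actual procedure. The column names `filename` and `recnum`, the separators (whitespace, commas, semicolons), and the sample values are all hypothetical, and it assumes that every file listed in a cell shares that row's metadata.

```python
import re

def explode_rows(rows, field="filename"):
    """Return a new list with one row per file named in `field`.

    Assumes (hypothetically) that a multi-file cell separates names with
    whitespace, commas, or semicolons. Other columns are copied unchanged.
    """
    normalized = []
    for row in rows:
        names = [n for n in re.split(r"[\s;,]+", row[field]) if n]
        if not names:
            normalized.append(dict(row))  # keep rows whose cell is empty
            continue
        for name in names:
            new_row = dict(row)
            new_row[field] = name
            normalized.append(new_row)
    return normalized

# Placeholder data, not real ARC filenames or record numbers.
sample = [
    {"filename": "file-a.pdf; file-b.pdf", "recnum": "record-1"},
    {"filename": "file-c.pdf", "recnum": "record-2"},
]

for row in explode_rows(sample):
    print(row["filename"], row["recnum"])
```

Returning a new list rather than editing rows in place keeps the original spreadsheet data available for comparison.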

NARA 18 also incorporated metadata from the list of records I refer to as NF18. I have discussed this list and its quirks in at least a half dozen notes. I’ll refer to these as needed.

MFF assimilates the NARA releases

MFF actually started adding files to their collections from the NARA releases as soon as these appeared on-line, but problems with file data slowed things down and it took them around a year before everything was up. The files at MFF from the 2017-2018 releases are now posted in several different sets, based on the date each set of files was posted on NARA’s website, and also based on which agency originated the files (as indicated in the file metadata).

To avoid writing one gigantic post on this vastly boring subject, I will divide my notes on MFF’s handling of the releases by the NARA set dates. I am guaranteed to miss stuff, so there will also be a summary at the end which picks up whatever stuff I find I missed. The discussion of the first release set from July 2017 begins below.

July release: CIA records

The July release covered 3810 records, the majority of which originated with the CIA. There were problems with these records which I did not catch the first time round, but MFF did. Here is the first problem, which occurs in the NARA 18 spreadsheet:

row #   filename              record number
54357   104-10231-10030.pdf   104-10231-10030
54358   104-10231-10031.pdf   104-10231-10032
54359   104-10231-10032.pdf   104-10231-10033
54360   104-10231-10033.pdf   104-10231-10035
54361   104-10231-10035.pdf   104-10231-10036
54362   104-10231-10036.pdf   104-10231-10045
54363   104-10231-10045.pdf   104-10231-10047
54364   104-10231-10047.pdf   104-10231-10049
54365   104-10231-10049.pdf   104-10231-10051
54366   104-10231-10051.pdf   104-10231-10053
54367   104-10231-10055.pdf   104-10231-10055

Row # is the row of the spreadsheet that I am citing. Record number is a unique 13 digit combination for each record with an identification aid in the ARC. This is essential for keeping track of the vast number of records in the Collection.
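The record numbers cited in this note all follow a 3-5-5 digit grouping, which adds up to the 13 digits. A small Python check along these lines can flag malformed values before any matching is attempted. The pattern is inferred from the examples in this note rather than from an official specification, and the function names are my own.

```python
import re

# 3 digits, 5 digits, 5 digits, separated by hyphens (13 digits in all).
RECORD_NUMBER = re.compile(r"\d{3}-\d{5}-\d{5}")

def is_record_number(text):
    """True if `text` is exactly one well-formed record number."""
    return RECORD_NUMBER.fullmatch(text) is not None

def record_number_from_filename(filename):
    """Strip a trailing .pdf and return the record number, or None."""
    stem = filename[:-4] if filename.lower().endswith(".pdf") else filename
    return stem if is_record_number(stem) else None

print(is_record_number("104-10231-10030"))
print(record_number_from_filename("104-10231-10030.pdf"))
```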

Looking at the filenames and record numbers shows something odd. In the first row and the last row of this excerpt, the filename and the record number are the same (filename adds the pdf extension). File name and record number are in sync. For the other 9 filenames, however, filename and record number are out of sync. The mismatch arises because there is a file, 104-10231-10031.pdf, which occurs at the beginning of the sequence but does not have a corresponding row with a record number and other metadata. As a result, the files in these nine rows have incorrect record numbers and metadata in the NARA 18 spreadsheet.

However, the filename and record number are back in sync at the end of this spreadsheet excerpt. Why? Because there is no file with the filename 104-10231-10053.pdf. This means filename and file metadata match again.

This oddity matches what was actually posted at NARA. There is a file, 104-10231-10031.pdf, that was posted on July 26, 2017, and there was NO file named 104-10231-10053.pdf that was posted on July 26, 2017. In fact, 104-10231-10053.pdf was never posted at NARA at all.
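This kind of slippage is easy to miss by eye but simple to detect mechanically. The Python sketch below uses the eleven rows quoted above and assumes, as in those rows, that a correctly labeled file is named after its record number plus ".pdf". Comparing the set of filenames with the set of record numbers exposes both ends of the problem: the file that has no metadata row and the record number that has no file. The variable and function names are mine.

```python
# (row number, filename, record number) as quoted from NARA 18.
rows = [
    (54357, "104-10231-10030.pdf", "104-10231-10030"),
    (54358, "104-10231-10031.pdf", "104-10231-10032"),
    (54359, "104-10231-10032.pdf", "104-10231-10033"),
    (54360, "104-10231-10033.pdf", "104-10231-10035"),
    (54361, "104-10231-10035.pdf", "104-10231-10036"),
    (54362, "104-10231-10036.pdf", "104-10231-10045"),
    (54363, "104-10231-10045.pdf", "104-10231-10047"),
    (54364, "104-10231-10047.pdf", "104-10231-10049"),
    (54365, "104-10231-10049.pdf", "104-10231-10051"),
    (54366, "104-10231-10051.pdf", "104-10231-10053"),
    (54367, "104-10231-10055.pdf", "104-10231-10055"),
]

def stem(filename):
    """Filename without its .pdf extension."""
    return filename[:-4] if filename.lower().endswith(".pdf") else filename

# Rows whose filename and record number disagree.
out_of_sync = [row_no for row_no, f, rec in rows if stem(f) != rec]

file_ids = {stem(f) for _, f, _ in rows}
record_ids = {rec for _, _, rec in rows}

files_without_metadata = sorted(file_ids - record_ids)
metadata_without_file = sorted(record_ids - file_ids)

print(len(out_of_sync))          # 9
print(files_without_metadata)    # ['104-10231-10031']
print(metadata_without_file)     # ['104-10231-10053']
```

Run over a whole spreadsheet rather than an excerpt, the two set differences would list every orphan file and orphan record number at once, though a full run might also surface files that are legitimately named differently.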

I missed this spreadsheet error, and as a result spent a couple of weeks pulling my hair out when I tried to write about this series of records. These files are all financial records for the Cuban Revolutionary Council.[5] The mismatching numbers totally confused me and my note on them is still a draft in the bowels of my WordPress tables.[6]

MFF caught this mistake, so their files and record numbers match up. See for example their link to 104-10231-10031.pdf, the file that threw everything out of sync.

They did, however, make one mistake: for some reason they have a link to the missing file 104-10231-10053. This link is wrong; clicking on it will take you to ARC 104-10231-10051, not 104-10231-10053. In fact, it takes you to the same file as clicking on 104-10231-10051.[7]

This is not the only time this happens in the July 2017 spreadsheet. Here is another excerpt from the July release in NARA 18:

row #   filename              record number
51339   104-10227-10080.pdf   104-10227-10080
51340   104-10227-10083.pdf   104-10228-10004
51341   104-10228-10004.pdf   104-10228-10007
51342   104-10228-10007.pdf   104-10228-10012
51343   104-10228-10012.pdf   104-10228-10105
51344   104-10228-10105.pdf   104-10228-10116
51345   104-10229-10020.pdf   104-10229-10020

In the first row and last row the filename and record number are in sync. Again, the filename and record number go out of sync, and for the same reason as above: The mismatch arises because there is a file, 104-10227-10083.pdf, which occurs at the beginning of the sequence but does not have a corresponding row with a record number and other metadata. As a result, the files in these five rows have incorrect record numbers and metadata in the NARA 18 spreadsheet.

However, the filename and record number are back in sync at the end of this spreadsheet excerpt because there is no file with the filename 104-10228-10116.pdf. Filename and file metadata then match again.

MFF did not catch this error, so all five files are mislabeled in their collection; i.e. 104-10227-10083.pdf is labeled 104-10228-10004, etc. So if you click on the link, you will see that this record, labeled 104-10228-10004 by MFF, has printed on it the record number 104-10227-10083. It is in fact 104-10227-10083. Very confusing.
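One way to untangle a run like this is to stop trusting row position and instead look up each file's metadata by the record number embedded in its filename. The Python sketch below does this for the seven rows quoted above. It assumes that the metadata in each row really belongs to the record number in that row, which is my reading of the spreadsheet and not something NARA has confirmed. The function names are mine.

```python
# (row number, filename, record number) as quoted from NARA 18.
rows = [
    (51339, "104-10227-10080.pdf", "104-10227-10080"),
    (51340, "104-10227-10083.pdf", "104-10228-10004"),
    (51341, "104-10228-10004.pdf", "104-10228-10007"),
    (51342, "104-10228-10007.pdf", "104-10228-10012"),
    (51343, "104-10228-10012.pdf", "104-10228-10105"),
    (51344, "104-10228-10105.pdf", "104-10228-10116"),
    (51345, "104-10229-10020.pdf", "104-10229-10020"),
]

def stem(filename):
    """Filename without its .pdf extension."""
    return filename[:-4] if filename.lower().endswith(".pdf") else filename

def metadata_row_for_each_file(rows):
    """Map each filename to the row that holds its own record number.

    The value is None when no row carries that record number, i.e. the
    file was posted without any metadata row of its own.
    """
    row_of_record = {rec: row_no for row_no, _, rec in rows}
    return {f: row_of_record.get(stem(f)) for _, f, _ in rows}

for filename, row_no in metadata_row_for_each_file(rows).items():
    print(filename, "->", row_no)
```

For this excerpt the lookup sends 104-10228-10004.pdf to row 51340 instead of its own row 51341, and returns None for 104-10227-10083.pdf, the file that has no metadata row.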

A third error involving mismatched filename and record number in the July release is as below:

row #   filename              record number
51481   104-10525-10001.pdf   104-10527-10001
54439   104-10527-10001.pdf   104-10525-10001

MFF did not catch this either and the two files are thus switched; i.e. 104-10527-10001 is labeled 104-10525-10001 and vice versa.
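A swap of this kind differs from the shifted runs above: no file or record number is missing, two rows have simply traded labels. A short Python sketch for spotting such pairs follows. It uses the two rows quoted above, the function names are mine, and it assumes each filename appears only once in the spreadsheet.

```python
# (row number, filename, record number) as quoted from NARA 18.
rows = [
    (51481, "104-10525-10001.pdf", "104-10527-10001"),
    (54439, "104-10527-10001.pdf", "104-10525-10001"),
]

def stem(filename):
    """Filename without its .pdf extension."""
    return filename[:-4] if filename.lower().endswith(".pdf") else filename

def find_swaps(rows):
    """Return (row_a, row_b) pairs whose record numbers are exchanged."""
    by_stem = {stem(f): (row_no, rec) for row_no, f, rec in rows}
    swaps = []
    for row_no, f, rec in rows:
        own_id = stem(f)
        if own_id == rec:
            continue  # this row is already in sync
        partner = by_stem.get(rec)
        # A swap: the partner file carries this file's record number.
        # The row_no comparison reports each pair only once.
        if partner and partner[1] == own_id and row_no < partner[0]:
            swaps.append((row_no, partner[0]))
    return swaps

print(find_swaps(rows))  # [(51481, 54439)]
```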

One final error is that NARA 18 omits the file 104-10086-10154. MFF caught this one, so it is available where it should be.[8]

Summary

Except for the last one, the missing 104-10086-10154, I did not find any of these errors the first time around. MFF caught most of the first set, and their remaining errors, like mine, came from the original NARA mismatches. As you may have noticed, this note covered only the CIA releases in July 2017. The remainder of the July releases, primarily FBI, do not seem to have had these problems. The FBI releases, however, have an important peculiarity which is worth a separate note all by itself. A subject for the next note, hopefully a touch less boring than this one.

Footnotes

  • 1
    The standard metadata is the information provided on the RIF finding aid, attached as a cover sheet to each record in the ARC.
  • 2
    See my note “NARA 18 errata corrections” (here)
  • 3
    I noted this problem back in December 2017 (here).
  • 4
    As I noted in a May 2018 post (here).
  • 5
    See here for a note on these files.
  • 6
    It is now up (here).
  • 7
    The sizes of the two copies of 104-10231-10051 on MFF differ, so somehow MFF has a second copy of this file. I do not know where the extra copy of 104-10231-10051 came from.
  • 8
    I caught this one too, see here.