Builders And Defenders

Working With the Data

Data Cleaning and Standardization

On this page, you can find information about how the database team approached the sources at the heart of the project. Here, we discuss how we cleaned data, standardized it in the system, and the methodology that guided us as we worked with the sources. You can also find details here about the types of sources we worked with, where they came from, and the software supporting the database.

Approach to the Sources: Descendants at the Center

In constructing this database, the team prioritized a social history approach to the sources. Focused on a bottom-up analysis that centered the individual Black laborers, soldiers, and their relationships, the database operates with a special interest in their personal information, lived experiences, and social networks. This focus on the social history of the Black builders and defenders of Fort Negley and Nashville helps to build the database as a detailed and broad finding aid and guide for further research by the Black Nashville community and Fort Negley descendants, genealogists, public historians and preservationists, and historical researchers.

For example, the team consulted Taneya Koonce and a handful of descendants to clarify how they would search for their ancestors. We asked what Black genealogists generally type into search boxes, what presumptions come with those searches, and what they expect to find, hope to find, and, in their wildest dreams, would like to find. Sutton also regularly attends meetings of the Nashville chapter of the African American Historical and Genealogical Society to hear how the presenting members reconstructed their family histories and where they struggled. From there it became clear that we had to restructure the data to be people-first and relational. The original Labor Rolls were used to keep track of hours, both for payment and for the possible reimbursement of Union-sympathizing enslavers who voluntarily offered up the people they enslaved for Union labor. We, however, could create our own categories, and define how they relate to one another and to the rest of the data, to help descendants find out more about their ancestors. We also changed category names to better reflect the ways in which we now use language (for example, “slave” became “enslaved,” “Owned by” became “enslaved by,” and, in other primary sources, “colored” became “Black”), while working with our developer to ensure that the outdated terms would return the same results if typed into the search bar. That way, the database website affirms the humanity of the enslaved without obscuring primary source documentation or making searches more cumbersome for those who are accustomed to seeing the offensive terminology in primary sources and using it in searches.
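As an illustration only, the search behavior described above could be approximated by rewriting outdated source terms in a query to the current category language before the search runs. The sketch below is written in Python and is an assumption on our part, not the developer's actual implementation; the names SOURCE_TO_CURRENT and normalize_query are hypothetical, and the mapping contains only the three examples given in the text.

```python
import re

# Hypothetical mapping from terms as written in the primary sources
# to the category language used in the database (examples from the text).
SOURCE_TO_CURRENT = {
    "owned by": "enslaved by",
    "slave": "enslaved",
    "colored": "Black",
}


def normalize_query(query: str) -> str:
    """Rewrite outdated source terms in a search query to current terms."""
    result = query
    # Handle longer phrases first so multi-word terms are matched whole.
    for old in sorted(SOURCE_TO_CURRENT, key=len, reverse=True):
        # \b keeps the match to whole words, so "enslaved" is left alone.
        pattern = re.compile(rf"\b{re.escape(old)}\b", flags=re.IGNORECASE)
        result = pattern.sub(SOURCE_TO_CURRENT[old], result)
    return result
```

With a step like this, a search typed in the language of the sources and a search typed in current language would reach the same records, while the stored category names stay in the current language.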

Further, Koonce instructed Sutton and the students working with the Fort Negley Descendants Project on the sources and places where most Black genealogists from Tennessee get stuck in their search, namely the “Brick Wall” of 1870: the dead end that the usual genealogical records hit prior to emancipation, when most Black Americans were not listed as people in their own right, but among the material possessions of others. Her specialist knowledge and local connections to descendants make her an invaluable advisor to the project and made it clear to Sutton that this project couldn’t be a mere transcription project where the creators remove themselves from the end product – it had to have clear subjective goals in order to be helpful for a specific audience. Therefore, the descendant community’s needs shaped what data we accepted, how we collected and cleaned it, how it will be made searchable, and what types of information from the sources become centered. To counter the disparity in access to genealogical and historical sources, the project has to be ruthless and unapologetic in centering descendant needs. This also has the side effect of restoring balance to the history of the Civil War, as the vast majority of readily available information is about its white participants, making the database a valuable tool for public historians who want fairness, accuracy, and diversity in modern depictions of the conflict.

Identifying Personal and Demographic Information

Some of the key details about people in the database include: person identifier (a unique identifier for the person); names (including given name, surname, and alternate names); birth and death dates; origin (such as birthplace); sex; color; status (identifying the free or enslaved status of individuals); and occupation. In addition, when there was information about an individual that the database did not have a specific category for, we put those details into the “Description” section for each individual. Here, researchers can find rich notes about the laborers, soldiers, and their family members summarized from primary sources. Additional personal information about the people in the database includes relationships, events, and references – discussed in more detail below.

Description Section and Notes

As mentioned above, the “Description” category for individuals is where some of the richest information lies. Written as brief notes, this is where the team summarized information about individuals from primary sources that did not fit into a pre-established category yet was important to connect to the person. Details in this section range widely. They include additional information about the laborers, such as their time worked, their rate of pay, and whether they were ever paid; notes written in sources about individuals (such as whether they were free, whether they were assigned to specific forts or jobs, or any other notes left by the writers of the sources); summary information about the context of the types of sources they appear in (these notes appear specifically in the US Army Corps of Engineers dataset, as its sources were mostly letters and military documents); notes about the rank in and rank out of USCT soldiers; and summary notes about the experiences of the 12th Regiment USCT from their compiled service records.

Dates

Dates in the database are connected to many different categories of information – such as people, primary source references, events, locations, and more. In all cases, dates are organized as a range, typically with a start date and an end date. When we have a precise date given for these date ranges, we use it in all categories; in other cases, we create a range based on available information. For example, primary sources have a date range titled “Content Start Date” and “Content End Date,” which signifies the earliest and latest dates referenced or present in a given document or collection of documents. Similarly, birth and death dates have date ranges titled “Born Before,” “Born After,” “Died Before,” and “Died After.” When we have specific birth and death dates, we use those in each of the categories. However, in cases where we don’t have a birth or death date, we can nonetheless create a birth and death date range based on the date of the documents in which individuals appeared. For example, if a person appeared in a document dated 8/1/1865, we can state that they were born before 8/1/1865 and died sometime after 8/1/1865. In more detailed cases, we have used listed ages to calculate birth and death ranges based on the publication of a document or other information. While these date ranges are not precise, they are nonetheless helpful for search functions and general age ranges. For events, we use a similar date range method – event dates are titled with earliest and latest start and end dates.

Relationships

Another especially rich and detailed category in the database is for “Relationships.” In this category, when data from the sources showed that individuals were connected in some way, we created relationships between these people. Common relationships in the database include family connections (such as parent, spouse, child, and godparent), slavery relationships (such as that between an enslaver and an enslaved person), and professional relationships (such as an attorney-client relationship, soldier relationships, and more). We were able to create these networks of relation based on the data from the sources. For example, one of the labor rolls noted the enslaver for nearly 3,000 of the laborers. In addition, documents in the US Army Corps of Engineers Correspondence had petitions filed by the family members of laborers for payment – these were filed under the full names of the laborers’ parents, spouses, and children.

Events and Shared Events

Another valuable category in the database is for events and shared events. Events and shared events are information about a single event that led to the recording of data about participants as captured in a historical document or secondary source. Examples of events in the database include the “Participation” events for the “Construction of Fort Negley,” as well as other fort construction projects; “Participation” events for the “Battle of Nashville”; and “Birth” and “Death” events for when an individual was born or died. In addition to individual events, shared events in the database represent events in which individuals collectively participated – such as the “Construction of Fort Negley,” the “Battle of Nashville,” and the “Construction of the Nashville and Northwestern Railroad.” Key details about events include: event identifier (a unique reference for each event in the dataset), event name or title, life event (the type or category that captures an event's overarching impact or purpose), description, start and end date, location, and source.

Entities

Entities are another category in the database that signifies connections. Mainly, entities are groups in the database to which people can belong – such as specific military regiments like the 12th USCT Regiment, Civil War forts such as Fort Negley and Fort Rosecrans, and Fisk University. When data in the sources specifically showed that individuals were connected to these and other groups of affiliation, the team connected them to their respective entities. For example, the USCT regiment affiliations are particularly rich – in these sections, if an individual was a soldier enrolled in the USCT, their entity connections will state that they were a member of a specific regiment and which company they belonged to.

Locations

The mapping and location functions of the database provide deep insight into the spatial dimensions of the project. The locations in the database are often denoted through the “Place” category – the “Place” record captures information about the location where an event that led to the registration or recording of data about participants happened. Events such as births and deaths, fort construction, battles, and more have locations assigned to them for each individual when that information was available. Locations are also noted for sources and datasets – for example, primary sources and archives have locations and provenances attached to them for where the source originated or where a specific archive is located in the present day. All of these locations have been embedded into the system’s mapping and GIS functions and have specific coordinates. Because of these mapping connections, the database is poised to support detailed GIS analyses based around these locations.

References and Dataset Information

All sources that the database is grounded upon are fully referenced in the system in a variety of ways. First, the sources contain all necessary citation information – such as their title, date range, page numbers, publication information, document type information (such as whether the reference is a collection or a single document, a physical manuscript or on microfilm, what language it is in, and more), and the location of the document’s creation and where it is housed (such as which archive or historical group holds it). These details about the source references are also connected to people and events – individuals and events have references connected to them based on which primary source they appeared in. Further information about the source and its larger dataset is also available – such as a brief description of the dataset, publication rights, and contributors.

Missing Information from the Sources

When data about a given individual was missing from the sources or when details were vague or contradictory, the team developed a strategy for how to properly include information. Mainly, when data was clearly present in the sources, we used it to fill out categories for the individuals, and when data was unclear or absent, we left these fields unknown or blank, or used a stand-in method. For example, if an individual’s sex, age, race, or status was missing from the primary sources, we designated that information as “Unknown.” If an individual had no first name listed with their surname, we would either leave it blank or use a “stand-in” method where we assigned them a first name of “Unknown” or another signifier. In other cases where data was unknown, such as birth and death dates, the team created a date range for calculating a general timeframe.
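A minimal Python sketch of this approach to missing information follows. It is an illustration under our own assumptions rather than the team's actual cleaning script: the field names, the UNKNOWN constant, and the functions fill_missing and _is_blank are hypothetical, and records are represented as plain dictionaries.

```python
UNKNOWN = "Unknown"

# Hypothetical field names for the details that receive "Unknown"
# when the primary source is silent.
DEMOGRAPHIC_FIELDS = ("sex", "age", "race", "status")


def _is_blank(value) -> bool:
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def fill_missing(record: dict, use_stand_in: bool = True) -> dict:
    """Return a copy of the record with missing details made explicit."""
    cleaned = dict(record)
    for field in DEMOGRAPHIC_FIELDS:
        if _is_blank(cleaned.get(field)):
            cleaned[field] = UNKNOWN
    # A surname with no first name: either use the stand-in or leave blank.
    if not _is_blank(cleaned.get("surname")) and _is_blank(cleaned.get("given_name")):
        cleaned["given_name"] = UNKNOWN if use_stand_in else ""
    return cleaned
```

Returning a copy rather than changing the record in place keeps the transcription as it came from the source, so the original reading can always be compared with the cleaned entry.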

Duplicates

After importing tens of thousands of individuals from numerous shared events, the database team found that many of the same people were likely entered more than once as duplicates in the system. In cases where we knew that two or more individuals who appeared across one or more primary sources were the same person, we merged them with confidence into a single person entry. Examples of this include people sharing the same full name and additional identifying points, such as the same relationships or the same event involvement across shared sources (such as the labor rolls). Individuals who were merged retain all information from their previously separate entries – such as each primary source reference they came from and any additional notes. However, in cases where we were uncertain whether two individuals with the same name were the same person, we left them unmerged as unique people.

These practices are especially evident in how we imported the numerous labor rolls. The original labor rolls accounted for roughly 2,800 people. When we imported the additional labor rolls, we found that a large number of people, approximately 2,140, overlapped with the laborers we had already imported based on same-name matches. Given that primary source documents from the war estimated that there were between 3,000 and 4,000 laborers, we chose to merge individuals with same-name matches within the new supplemental labor data and the original labor rolls. For example, the entries for the name “Robert Anderson” were all merged because this name appeared on page 2 of the original labor rolls, page 2 of labor index no. 3, page 3 of labor index no. 9, and page 3 of the ledger of laborers remaining unpaid.

Likewise, the entries for the name “George Ray” were merged because this name appeared on page 12 of labor index no. 1, page 16 of labor index no. 3, and page 27 of labor index no. 9. Given that we based these merges on primary source number estimates and same-name matches, it is possible that some merged individuals were in fact separate people. To be as transparent as possible, we have recorded all notes and relevant information about each individual and have listed all of their primary source references so that future researchers can further trace the laborers across their respective archival documents.

We chose not to merge individuals whom we could not match in order to avoid archival erasure – this includes names in the new labor data that did not match, names with multiple matches in the original labor rolls that we could not narrow down, and entries for the laborers listed in the remaining unpaid ledger. For example, many names from the supplemental labor data matched entries in the original labor rolls but could not be merged, because the labor rolls held two or more people of the same name whom we considered distinct based on related information (such as different enslavers). While it is likely that there are still many same-person duplicates within this group, we chose not to merge these individuals in order to preserve as much data as possible.
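The merge rules described in this section can be summarized in a short Python sketch. It is a simplified illustration under our own assumptions, not the procedure run inside Spatial Historian: the function merge_by_name and the dictionary layout are hypothetical, and matching is reduced to a case-insensitive comparison of full names. A supplemental entry is merged only when exactly one existing person carries that name; when there is no match, or when two or more existing people share the name, the entry is kept as its own person.

```python
from collections import defaultdict
from copy import deepcopy


def merge_by_name(existing: list, incoming: list) -> list:
    """Merge supplemental entries into existing people on unambiguous name matches.

    Each person is a dict with a "name" string and a "references" list.
    """
    people = deepcopy(existing)  # leave the caller's data untouched
    by_name = defaultdict(list)
    for person in people:
        by_name[person["name"].casefold()].append(person)

    for entry in incoming:
        candidates = by_name[entry["name"].casefold()]
        if len(candidates) == 1:
            # One same-name match: merge, keeping every source reference.
            candidates[0]["references"].extend(entry["references"])
        else:
            # No match, or several same-name people we cannot tell apart:
            # keep the entry as a separate person to avoid archival erasure.
            new_person = deepcopy(entry)
            people.append(new_person)
            candidates.append(new_person)
    return people
```

Applied to the “Robert Anderson” example above, an existing entry referenced to page 2 of the original labor rolls would absorb the references from labor index no. 3 and labor index no. 9, ending with all of its page references listed on a single person entry.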

Dataset Sources

The sources in this database come from a variety of venues and often have multiple layers shaping them. The primary sources themselves are housed in a variety of archival locations – including the Tennessee State Library and Archives, Nashville Metro Archives, the Fort Negley Visitor Center, and local libraries and archives. The types of documents include a range of primary sources – such as multiple labor rolls and indexes, ledgers, collections of official military letters, correspondence, personal papers, and clerical papers.

We worked both directly with the primary sources and with already transcribed and processed versions of the sources. The already processed and transcribed versions came from a variety of resources: the Fort Negley Visitor Center and the Friends of Fort Negley, through the work done there by Krista Castillo, Fletch Coke, and Natalie Goodwin; already published sources, such as the book Tennessee Convicts by Chuck Sherrill, Tennessee State Archivist; and Vanderbilt University’s “Historic Black Nashville” undergraduate seminar led by Dr. Jane Landers and Dr. Daniel Sharfstein. When already processed and transcribed sources were unavailable, the team worked at local archives and libraries to transcribe and ingest primary source material – such as at the Tennessee State Library and Archives and the Tennessee Central Railway Museum and Archive.

Database System

The database is supported by the software Spatial Historian, an emerging technology platform developed by Dr. Jim Schindling that allows the team to organize, ingest, and analyze identifying historical information, social networks, and spatial patterns of people in the database. Spatial Historian is an indispensable technology for seeing the aggregation of multiple dimensions around the relationships, archival records, and experiences of the Black laborers, soldiers, their families, and descendants throughout the database. The flexibility of the system allows us to customize Spatial Historian according to the community’s needs. For more details on the software Spatial Historian, please see:

Jim Schindling, “The Spatial Historian: Building a Tool to Extract Structured Information from the Slave Societies Digital Archive,” in Regenerated Identities: Documenting African Lives, edited by Paul E. Lovejoy, Henry B. Lovejoy, Érika Melek Delgado, and Kartikay Chadha (Trenton: Africa World Press, 2021).

Jim Schindling, “The Spatial Historian: Creating a Spatially Aware Historical Research System,” PhD dissertation, West Virginia University, 2020.