Union Army Data - Potential Collection Bias

1. Military Data Set Potential Bias

1.1 Military Data Set Selection Bias

The potential for selection bias exists whenever non-random factors influence whether a given variable is present in the data or not. In the pension files, the main source of selection bias is that a number of variables exist in the data only if the recruit made an application for a pension, and others exist only if he actually obtained a pension. Thus, factors such as having a war injury or living past 1890, the year when the pension laws were liberalized, increase the probability of being in the pension system. Data on occupation and family variables coming from affidavits and letters are subject to this bias. In addition, being in the pension system in 1898 when the family circulars were first distributed also increases the availability of family data.

1.2 Military Data Set Linkage Rates

Some sample selection bias may arise in the use of the data in this collection due to linkage failures. Of the 39,340 army veterans in the Union Army sample, 68.4% were linked to pension records, 98.4% were linked to compiled military service records (CMSR), and 67.1% were linked to carded medical records (CMR). In the case of the military service records, the linkage failure rate was very low and due to random factors.
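As a rough illustration, the linkage rates above can be converted into approximate counts of linked and unlinked recruits. This is a minimal sketch: the function name and dictionary keys are hypothetical, and because the published rates are rounded to one decimal place, the resulting counts are approximations rather than official figures.

```python
# Sample size and linkage rates as reported in the text above.
SAMPLE_SIZE = 39_340
LINKAGE_RATES = {
    "pension": 0.684,
    "CMSR": 0.984,
    "CMR": 0.671,
}

def approx_counts(sample_size, rates):
    """Return {source: (linked, unlinked)}, rounded to whole recruits."""
    counts = {}
    for source, rate in rates.items():
        linked = round(sample_size * rate)
        counts[source] = (linked, sample_size - linked)
    return counts

for source, (linked, unlinked) in approx_counts(SAMPLE_SIZE, LINKAGE_RATES).items():
    print(f"{source}: about {linked} linked, about {unlinked} not linked")
```

Given the rounding of the published percentages, each count should be read as accurate only to within about 20 recruits.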

With the CMRs, the Pension Bureau made the records only for those recruits who had notable medical experiences while in the service. Linkage failure for CMRs can therefore be predicted by factors that reduced a recruit’s odds of illness or injury in service. These factors include a short duration at risk due to early discharge, desertion, or death, and a mild military experience, as indicated by the battle and casualty history of the recruit’s company. Other company-specific effects may be explained by inadequate record keeping by the company or by the authorities that subsequently compiled and maintained the CMRs. It is also possible that some records were simply lost or destroyed.

The primary reason that individuals were not linked to pension records was that they died prior to 1890. The Disability Act of June 27, 1890, extended pension eligibility to include disabilities not directly related to wartime experience. As a result, the number of men on the pension rolls swelled.

In summary, the nature of bias for pension data depends critically on the time period as well as the research question being addressed.

2. Census Data Set Potential Bias

2.1 Census Data Set Selection Bias

The potential for selection bias exists whenever non-random factors influence whether a given variable is present in the data or not. Finding a recruit in the census is more likely if the recruit applied for a pension. Because the pension applications contain residential addresses, it is much easier to make a strong link to the census schedule if the recruit made an application. If the recruit is in the pension system as of 1900, the probability of finding him in 1900 and in later census years is high. If he is not in the system, the census linkage rate is low. Living in an urban location, or in a different state from the one he lived in at the time of military enlistment, further lowers the probability of finding him in the census and, in turn, of finding information on his family members. Whether these factors result in selection bias depends largely on whether they are correlated with factors affecting the outcome variable of interest.

Selection bias in the early census years is possible, though its causes are very different from those for later years. Many factors determine whether a recruit can be found in the 1850 and 1860 census records. Those early census years are the primary source of information on the demographic and socioeconomic characteristics of the recruit’s early life. Linkage rates for 1860 are higher because census enumeration took place not long before enlistment in the Union Army. Linkage to 1850 is harder because of factors such as migration, or because the recruit was already a head of household in 1860 and, therefore, the 1860 records contain no information on his parents that can be used to link to his census record in 1850.

Users of the census data should also take note of the differences in variables across the different census years. Some variables, such as birthplace (recbpl) or occupation (recocc), can be traced across all census years, while others, such as birth month (recbmo) or blindness (recbnd), occur only in a particular census year (in this case 1900 or 1910). Furthermore, it is possible that the quality of data differs across locations, years, and census enumerators.

As an illustration of how selection bias might affect a given analysis, consider the case of marital status. The degree of bias depends on the time period:

1910 or later: Marital status can be determined with relatively little bias, because the recruit will be in the pension system with high probability and locating his census records will also be straightforward.

1900: Slight bias exists since pension participation was not complete. However, family circulars had been collected. The pension had not become fully age-based by this point, so participation in the pension system still depended on health.

1895: Significant bias exists because the family circulars were not yet available and the 1890 census manuscripts were destroyed by fire. However, the pension system had been liberalized and participation was much higher than prior to 1890.

1865-1889: Profound selection bias exists. Marital status information will only be available haphazardly and, even then, only for recruits who qualified for pension assistance before the 1890 liberalization. During the decades before 1890, veterans were steadily entering the pension system each year, but for a large number of veterans who never entered the pension system, it is impossible to know whether they were alive, so determining marital status is usually not possible or reliable. Efforts by the Early Indicators researchers to link veterans to the 1870 and 1880 censuses will partially alleviate this problem.

1860-1864: Marital status can be inferred (though not known with certainty) from the 1860 census for recruits of marriageable age. Factors affecting linkage rates to 1860 will influence the level of selection bias during this time period.

2.2 Census Data Set Linkage Rates

Some sample selection bias may arise in the use of the data due to linkage failures, that is, failures to find a given individual from the main sample in the census records. Of the veterans in the sample, 55.3% are linked to at least one of the following censuses: 1850, 1860, 1900, and 1910. The direction and magnitude of the selection bias will depend on how closely the variables in the linked data are correlated with the factors that determine linkage to the census manuscripts. Factors that are known to influence linkage to the census data include date of death, migration from one state to another or within a state, movement into or out of different households, and socioeconomic status.
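The dependence of bias on the correlation between linkage and the outcome can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not an analysis of the actual data: the binary outcome, its 50% true prevalence, and the linkage probabilities (0.70, 0.40, and 0.55) are all hypothetical values chosen for illustration, and the function name is invented here.

```python
import random

def simulate_linkage(n, linkage_depends_on_outcome, seed=0):
    """Return (true_share, linked_share) of a hypothetical binary outcome."""
    rng = random.Random(seed)
    everyone = []      # outcome for every simulated recruit
    linked_only = []   # outcome for recruits who were "linked"
    for _ in range(n):
        has_outcome = rng.random() < 0.5  # assumed true prevalence of 50%
        if linkage_depends_on_outcome:
            # Assumed: recruits with the outcome are easier to link.
            p_link = 0.70 if has_outcome else 0.40
        else:
            # Assumed: linkage is unrelated to the outcome.
            p_link = 0.55
        everyone.append(has_outcome)
        if rng.random() < p_link:
            linked_only.append(has_outcome)
    true_share = sum(everyone) / len(everyone)
    linked_share = sum(linked_only) / len(linked_only)
    return true_share, linked_share

for depends in (False, True):
    true_share, linked_share = simulate_linkage(100_000, depends)
    print(f"linkage depends on outcome={depends}: "
          f"full sample {true_share:.3f}, linked subsample {linked_share:.3f}")
```

When linkage is unrelated to the outcome, the linked subsample reproduces the full-sample share up to sampling noise. When linkage is more likely for recruits with the outcome, the linked subsample overstates the share, even though the overall linkage rate is similar in both cases.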

3. Disease Data Set Potential Bias

3.1 Disease Data Set Error Rates

Four types of errors are routinely counted and reported in the analysis of error rates below. They are classified as follows:

A. Incorrect Entries. This type of error includes a wide variety of mistakes, such as entering information in the wrong field or on the wrong screen, entering information that should not be input at all, breaking up information that could be entered as one answer or entering as one answer information that should be broken into two or three answers. Other incorrect entries include misreading a word on the certificate. Like the errors of omission (see below), these errors are usually random and result primarily from difficulty reading the certificates.

B. Omission Errors. Errors of omission are random and are the result of an inputter inadvertently skipping over information while reading the certificate. Often it is negative information that is omitted.

C. Numerical Errors. Most numerical errors occur on the entry screen, particularly with dates, pulse rates, and respiration rates. These errors usually result from misreading a number.

D. Rating Errors. Rating errors are errors in the rating question (*_rat) on each disease screen. Sometimes ratings are omitted; this occurs most often when the rating is 0 or NO RATING. Other problems occur when a total rating is entered as a specific disease rating or vice-versa. Rating errors also include incorrect ratings, such as 2/18 instead of 3/18.

The following two types of errors are not reported because they do not reflect lost or incorrect data. They are tracked and corrected.

A. Spelling Errors. Most errors in spelling occur because of difficulty reading the certificates, especially town names and unusual words. Some of these errors are corrected in the process of checking the de novo sample. Most of the spelling errors, however, are corrected in the extensive data cleaning process which took place at the Center for Population Economics (CPE) at the University of Chicago.

B. Linking Errors. Linking errors always involve the related diseases question (*_rel) on each disease screen. There are two possibilities for error here. The first is in connecting two diseases that should not be connected. The second, and more common, error is failing to connect diseases that are rated together, particularly when they are linked by an aggregate or total rating, or failing to connect diseases that have a cause/effect relationship. Other linking errors occur in not linking to the correct screen or omitting a link. Note that because linkages involve at least two screens, if there is a linking error on one screen, there is usually also a linking error on another screen. Therefore, two errors are counted (one for each screen), which makes this error rate seem higher than it actually is.
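The double counting of linking errors described above can be made concrete with a short calculation. This is a minimal sketch: the function name and the example figures (40 counted errors on 1,000 checked screens) are hypothetical, and the adjustment assumes every linking mistake is counted on exactly two screens, which the text says is usually, but not always, the case.

```python
def linking_error_rates(counted_errors, screens_checked, screens_per_mistake=2):
    """Return (counted_rate, adjusted_rate) for linking errors.

    counted_rate treats each affected screen as a separate error.
    adjusted_rate treats each underlying mistake as a single error,
    assuming it was counted on `screens_per_mistake` screens.
    """
    counted_rate = counted_errors / screens_checked
    distinct_mistakes = counted_errors / screens_per_mistake
    adjusted_rate = distinct_mistakes / screens_checked
    return counted_rate, adjusted_rate

# Hypothetical figures, for illustration only.
counted, adjusted = linking_error_rates(counted_errors=40, screens_checked=1000)
print(f"counted rate: {counted:.1%}, adjusted rate: {adjusted:.1%}")
```

Under this assumption, the adjusted rate is half the counted rate. Since some mistakes affect only one screen, the true rate of distinct mistakes would lie between the two figures.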

Note: Errors in the data are often the result of errors in the original document that researchers are instructed not to correct. Therefore these data are input as shown on the document.

Analysis of Inputting Error

Analysis suggests that inputting errors in the data set are random and will not introduce a systematic bias into the data set. The most common errors, comprising 60% of the total errors observed, are omission errors. In most cases, these errors involve the loss of negative information or of descriptors or adjectives of a condition, not the loss of the condition itself. In the majority of such cases, the loss is remedied by information taken from subsequent certificates. The next largest category, incorrect entries, makes up almost 35% of the observed errors. These errors result largely from difficulty in reading the certificates and include entry of information into the wrong screen or incorrect division or combination of adjectives. The information is not necessarily lost and, again, would typically be remedied by information from subsequent certificates. Potentially the most serious errors, rating and numerical errors, constitute only 6 to 8% of the observed errors, and the total observed error rate itself ranges from 3.8 to 9.4%.
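To put the share of serious errors in absolute terms, the figures above can be combined. This is a minimal sketch that assumes the 6 to 8% figure is a share of all observed errors and the 3.8 to 9.4% figure is the overall error rate. The function name is hypothetical, and pairing the low share with the low rate and the high share with the high rate simply gives the widest range consistent with those figures.

```python
def serious_error_rate_bounds(share_range, total_rate_range):
    """Return (low, high) bounds on the absolute rate of serious errors.

    share_range: (min, max) share of all errors that are rating or numerical.
    total_rate_range: (min, max) overall observed error rate.
    """
    low = share_range[0] * total_rate_range[0]
    high = share_range[1] * total_rate_range[1]
    return low, high

low, high = serious_error_rate_bounds((0.06, 0.08), (0.038, 0.094))
print(f"rating and numerical errors: {low:.2%} to {high:.2%} of entries")
```

Under these assumptions, rating and numerical errors would affect roughly 0.2% to 0.8% of entries.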