4. Data Editing and Imputation

This chapter describes the data editing and imputation procedures applied to data from the Survey of Income and Program Participation (SIPP) after completion of the interviews. Three different approaches are used for dealing with missing data in SIPP:

• Weighting adjustments are used for some types of noninterviews;
• Data editing (also referred to as logical imputation) is used for some types of item nonresponse; and
• Statistical (or stochastic) imputation is used for some types of unit nonresponse and some types of item nonresponse.

Weighting is discussed in Chapter 8. The chapter begins with a brief discussion of the types of missing data and the goals of imputation in SIPP. It then presents an overview of the editing and imputation procedures used to deal with missing and inconsistent data. Next, the chapter provides a detailed description of each of the major steps used by the Census Bureau when creating its internal files and the files that are released for public use.

Prior to 1996, the development of cross-sectional wave files involved mainly cross-sectional editing and imputation; the longitudinal files involved longitudinal editing. Beginning with the 1996 Panel, the processing procedures may also include methods that use prior wave information to edit and impute a current wave (after Wave 1). The most common imputation technique, the hot-deck method, is still used in the 1996+ panels. A new procedure allows donors, when appropriate, to be chosen on the basis of similarities in reported prior wave information when that information exists for certain variables. In panels prior to the 1996 Panel, donors were chosen based only on current wave similarities.

The SIPP Web site (http://www.sipp.census.gov/sipp/) supplements the information in this chapter with detailed information about all variables on the public use files. To obtain more detailed information about imputation and editing procedures, contact the Demographic Surveys Division’s (DSD) Income Programming Surveys Branch, 301-763-5244.

Types of Missing Data

As in all surveys, there are two general types of missing data in SIPP: (1) unit nonresponse and (2) item nonresponse. Unit nonresponse occurs in SIPP when one or more of the people residing at a sample address are not interviewed and no proxy interview is obtained. This can happen for a number of reasons, described in Chapter 2. Most types of unit nonresponse are dealt with through weighting adjustments (see Chapters 2 and 8). However, the data editing and statistical imputation procedures described in this chapter are used with one type of unit nonresponse: Type Z noninterviews. Type Z noninterviews are cases in which an interview was obtained from at least one household member but interviews were not obtained from one or more other sample persons in that household.1 Prior to the 1996 Panel, and in some instances in the 1996 Panel, the method used to adjust for person-level noninterviews in the core wave files is known as Type Z imputation, which is discussed below. Chapter 2 discusses person-level (Type Z) nonresponse.

1 That can happen because people refuse to be interviewed or are unavailable and a proxy interview is not obtained.

The other type of missing data is item nonresponse. This occurs when a respondent completes most of the questionnaire but does not answer one or more individual questions.
Item nonresponse in SIPP occurs under the following circumstances:

• Respondents refuse or are unable to provide requested information;
• Interviewers fail to ask a question or incorrectly record a response;
• A response is inconsistent with related responses or is incompatible with response categories; and
• Interviewers make an error when recording or keying in the data.2

2 Prior to the 1996 Panel, errors could also occur when data-entry workers were keying in results from the paper survey.

Item nonresponse data are usually imputed for core items, as well as for many topical module items.

Goals of Imputation

Missing data cause a number of problems:

• Analyses of data sets with missing data are more problematic than analyses of complete data sets;
• There is a lack of consistency among analyses because analysts compensate for missing data in different ways, and their analyses may be based on different subsets of data; and
• In the presence of nonresponse that is unlikely to be completely random, estimates of population parameters are biased.

Because missing data are always present to some degree, analyses of survey data must be based on assumptions about patterns of missing data. When missing data are not imputed or otherwise accounted for in the model being estimated, the implicit assumption is that data are missing at random after controlling for other variables in the model. The imputation procedures used for SIPP are based on the assumption that data are missing at random within subgroups of the population (as defined by the cells of the imputation matrices described later in the chapter).

The statistical goal of imputation is to reduce the bias of survey estimates. This goal is achieved to the extent that systematic patterns of item nonresponse are correctly identified and modeled. In SIPP, the statistical goals of imputation are general rather than specific. Instead of addressing the estimation of specific parameters, SIPP procedures are designed to provide reasonable estimates for a variety of analytical purposes.

Data editing is generally preferred over statistical imputation, and it is used whenever a missing item can be logically inferred from other data that have been provided. The advantage of data editing is that it avoids the increase in variance that occurs when missing items on one record are imputed with nonmissing responses from other records.

Assessing the Influence of Imputed Data on Analysis

Users of SIPP data interested in assessing the influence of imputed data on their analyses should consider whether SIPP imputation procedures have properties that affect their specific analytical requirements. A general discussion of the treatment of missing data in sample surveys is given in Kalton and Kasprzyk (1986). Sedransk (1985), Little (1986), and Jinn and Sedransk (1987) discuss properties of commonly used imputation processes. An example of the impact of imputation procedures for the WIC program is discussed in CNSTAT (2003). A report discussing sources of error for federal data collection programs is given in Statistical Policy Working Paper 31 (June 2001). An evaluation of the effects of imputed data should include a review of rates of unit nonresponse and an assessment of the extent of item nonresponse.
Unit nonresponse tends to increase over the life of a panel, as does the likelihood that nonresponse is not a random effect. As the percentage of eligible sample members reinterviewed decreases, the pool from which donors3 are selected shrinks accordingly. This smaller pool of donors increases the likelihood that individual donors will be used more than once, which in turn increases the variance of an estimate. The effects of imputation will likely be small for items with low rates of missing data, as long as rates of item nonresponse are not high among important subclasses. Lepkowski et al. (1987), using data from a large federal survey, provide a framework for evaluating the effect of imputed values on analyses. This framework can be readily adapted to SIPP analyses.

3 Cases with complete data that are the source of the imputed values placed on the records with missing data.

Imputation Methods

SIPP primarily uses two methods to impute missing data: the hot-deck method, used for item nonresponse, and the Type Z method, used for unit nonresponse. Item nonresponse refers to missing items within an interviewed case. Unit nonresponse refers to a noninterviewed case within an interviewed household. In rare circumstances, SIPP also uses a third method, logical imputation, in which the imputed value is logically derived from other reported information.

The hot-deck method replaces individual missing data items with reported data from another person or household with similar characteristics. Initially, the input file is sorted by geographical keys: PSU, segment, and serial number; this ensures that neighboring records represent geographically proximate units. Edits and imputations are then performed sequentially by unit for each topical section: demographics, household characteristics, labor force, assets, general income, health insurance, and program participation. Each section is processed completely before the next section is begun. A hot-deck array is created for each edited variable and is stratified by selected variables such as age, race, and sex. Hot decks are first initialized with cold-deck values and are then loaded with data provided by respondents in a single pass through the data. The data are then passed a second time, with good responses contributing to the hot deck and missing responses allocated from the hot deck. Each hot-deck cell contains exactly one value at any point in the edit: either the cold-deck value or the most recently encountered good value meeting the criteria for that cell, as defined by the stratifying variables. The hot-deck imputation process as currently implemented is fully deterministic: reprocessing the same file with the same edit program will result in identical imputations.
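The two-pass logic can be sketched in Python as follows. This is a simplified illustration of the mechanism described above, not Census Bureau production code; the record layout, missing-value convention, and flag assignment are hypothetical.

    # Simplified two-pass sequential hot-deck sketch (illustrative only).
    # `records` is assumed to be a list of dicts already sorted by PSU,
    # segment, and serial number; a missing response is represented as None.

    def cell_key(record, stratifiers):
        """Assign a record to a hot-deck cell based on its stratifying variables."""
        return tuple(record[v] for v in stratifiers)

    def hot_deck_impute(records, item, stratifiers, cold_deck_value):
        # Pass 1: every cell starts at the cold-deck value and is then updated
        # with the last good (reported) value encountered for that cell.
        hot_deck = {}
        for rec in records:
            key = cell_key(rec, stratifiers)
            hot_deck.setdefault(key, cold_deck_value)
            if rec[item] is not None:
                hot_deck[key] = rec[item]

        # Pass 2: good responses keep updating the cell, and missing responses
        # are allocated the current cell value.  Because the file order and the
        # updating rule are fixed, rerunning the edit reproduces the same
        # imputations.
        for rec in records:
            key = cell_key(rec, stratifiers)
            if rec[item] is None:
                rec[item] = hot_deck[key]
                rec[item + "_allocated"] = 1   # hypothetical allocation flag
            else:
                hot_deck[key] = rec[item]
        return records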
Type Z imputation involves imputing an entire set of data from a single donor. It is used primarily for noninterviewed persons within an interviewed household. The Type Z procedure is based on a hierarchical sorting and matching operation that uses a set of variables that are nonmissing for both recipient and donor. The matching variables are age, race, sex, marital status, household relationship, education, veteran status, parent/guardian status, and income and asset sources. The match is designed to progressively broaden the ranges of these match keys until a match is found. When a match is found, all data are transferred to the recipient record except for identification variables and other variables that may not be relevant within the recipient household. A second Type Z operation is used within the labor force edit to impute a set of labor force characteristics from a single donor (this is referred to as a little Type Z). See the section Type Z Imputation for Core Items in the Core Wave Files for more information.

An Overview of the Process

The processing of SIPP data has traditionally been done cross-sectionally by wave and then longitudinally across waves once all waves were available. In 1996 this process was changed to apply selected longitudinal edits to individual wave files, and in 2004 most longitudinal edits were discontinued.

For the pre-1996 panels, there are two phases to the processing of SIPP data. The first phase occurs at the conclusion of each wave of interviewing, when the data collected during that wave are processed to create the core wave and topical module files. The second phase occurs at the conclusion of the final wave of interviews, when core data from all waves are linked and a new set of edit and imputation procedures is applied to the resulting full panel file.

For the 1996+ panels, there are also two phases; however, the second phase does not involve the creation of a full panel file. Waves 1-4 are edited cross-sectionally as they become available; this is phase one. Then, once Wave 4 is complete, a longitudinal edit of selected demographic variables is done across the four waves. These variables are then placed on each of the individual wave files. For Waves 5+, these variables are not re-edited but are simply pulled forward from the previous wave. For 2004+, most longitudinal editing was abandoned. Previous wave data are still used to fill missing data, but no attempt is made to enforce consistency across waves.

Phase 1 - Summary

There are six steps in the first phase of SIPP data processing:

1. As each wave of interviewing is completed, core data collected during the wave are edited for internal consistency.
2. Following data editing, the statistical matching and hot-deck procedures described later in this chapter are used to impute missing data in the core wave file.
3. A public-use version of the core wave file is created from the internal core wave file. The public-use file is the same as the Census Bureau’s internal file except that it has certain information suppressed or topcoded to protect the confidentiality of survey respondents (see the sections on Topcoding and Suppression of Geographic Information at the end of this chapter).
4. On a separate production track from the core data, data from the topical module administered with the wave are edited for internal consistency. The extent of data editing varies across the topical modules, and some topical modules receive almost no editing.
5. Next, hot-deck procedures are used to impute missing data in the topical module. The extent of imputation varies across the topical modules; some topical modules have no missing data imputed.
6. A public-use version of the topical module file is created from the internal file. As with the public-use core wave files, the public-use topical module files have certain information suppressed to protect the confidentiality of survey respondents.

Figure 4-1 illustrates the steps that generate the Census Bureau’s internal core wave and full panel files.
These steps are repeated at the conclusion of each wave of interviews. Prior to the 1996 Panel, each wave was processed independently of other waves of data. Thus, when multiple core wave files are linked, apparent changes in a respondent’s status could be due to different applications of data edits and imputations to the files being combined (file linkage is the subject of Chapter 13). With the 1996 data, the hot-deck procedure was redesigned to rely on historical information reported in prior waves. In addition, other forms of longitudinal imputation, such as carryover methods, were adopted.

[Figure 4-1. Sequence of Cross-Sectional Imputation and Longitudinal Editing Procedures. The figure shows the sequence repeated for each wave in a panel: imputation of sample unit characteristics (tenure, etc.) and personal demographic characteristics (age, race, marital status); Type Z imputation of person-level noninterviews; imputation of item nonresponse in core questions (labor force items and recipiency of income and assets; cash income; self-employment identification sections; asset sections (property income); and household program information); followed by longitudinal editing of demographic and household variables, employment variables, amount variables, and other variables.]

For 1996+ panels, Type Z records are handled in a separate process only if no previous wave data are available. The imputation procedure for the 1996+ panels allows an item to be imputed from the previous wave’s data if the previous wave contained valid data for that item, regardless of whether that value was itself produced by a hot-deck procedure. In these situations, an allocation flag of 3 is assigned. One advantage of using prior wave data instead of a hot-deck procedure is that the data are more consistent from wave to wave. A disadvantage is that a particular donor can continue to influence each subsequent wave.

Phase 2 Summary - Pre-1996 Panels

At the conclusion of the panel, the Census Bureau creates a full panel file containing core data from all waves. There are four steps to this process:

1. Core data from all waves are linked. Those data have already been subjected to the Phase 1 edit and imputation procedures.
2. A series of longitudinal edits is applied to the full panel file. Unlike the core wave edit procedures, these edits are designed to create longitudinally consistent records for each person. Both reported values and values that were imputed during the first phase of processing are subject to change. Thus, the data in a full panel file may differ from the data in the core wave files from which the full panel file was constructed.
3. A missing wave imputation procedure is then applied. Data are imputed when a sample member was absent for one wave but was present for the two adjacent waves. Data for the missing wave are interpolated on the basis of information from the fourth month of the prior wave and the first month of the subsequent wave. The missing wave imputation procedure was introduced with the 1991 Panel; earlier panels were not subjected to this procedure.
4. A public-use version of the full panel file is created from the internal file.
The public-use file has certain information suppressed to protect the confidentiality of survey respondents.

Phase 2 Processing – 1996 to 2001 Panels

1. Core data from Waves 1-4 are linked. Those data have already been subjected to the Phase 1 edit and imputation procedures.
2. Demographic and household composition variables are edited to ensure consistency across the four waves. Waves 1-4 are reprocessed using the longitudinally edited values.
3. Waves 5+ are processed cross-sectionally as they become available; demographic and household composition variables are pulled forward from the longitudinally edited values in the previous wave.

Note that no full panel files are created for the 1996+ panels.

Phase 2 Processing – 2004 Panel

Only cross-sectional edits and imputation procedures were applied. There were no longitudinal edits of demographic and household composition variables.

The balance of this chapter describes in greater detail the full sequence of data edit and imputation procedures applied to SIPP data files. Most of the material contained in this chapter is taken from Pennell (1993). The data processing sequence for each wave is detailed below.

Data Entry and Initial Editing

Beginning with the 1996 Panel (Chapter 2), all of the data entry and some of the initial data editing are performed by computer-assisted interviewing while the interview is in progress. Before the 1996 Panel, the first stages of data processing involved editing the paper questionnaires for completeness, reasonableness, and consistency. Those data checks were conducted first by field representatives before they submitted their questionnaires to the regional offices, and then by the regional and central offices of the Census Bureau. The next step was data entry, in which clerks keyed in the information from control cards and questionnaires. Edits were built into the data-entry program to ensure that the data were keyed in the proper sequence and that certain key identifiers, such as control number, name, and relationship to householder, were present. Following this step, the data files were transmitted electronically to Census Bureau headquarters.

Imputation for Sample Unit Characteristics and Personal Demographic Characteristics

Items in this category, including housing tenure (owned or rented), age, race, marital status, and so forth, must be present for any further data processing to take place. If these values cannot be logically derived, they are imputed. The imputation procedure is a modified version of the sequential hot-deck procedure described below.

Type Z Imputation for Core Items in the Core Wave Files

Pre-1996 Panels. Type Z imputation was the method used in the pre-1996 panels to impute core items for person-level noninterviews. There are two categories of person-level noninterviews subject to imputation for the core questions. The first category includes individuals 15 years of age and older who were members of interviewed households at the beginning of the 4-month reference period but were not original sample members or members of any SIPP-interviewed household on the date of the interview; that is, people not interviewed because they moved out of the sample household between the beginning of the reference period and the interview date. Had these people been original sample members, they would have been interviewed at their new addresses.
Rather, these are all people who entered the SIPP sample after the first wave and were in the sample because at some point they were living with an original sample member. The second category of imputed noninterviews includes people 15 years of age or older who were members of SIPP-interviewed households on the date of the interview and during all or a portion of the 4-month reference period but who were not interviewed because they refused to cooperate or were unavailable for the interview and a proxy interview was not obtained.

The Type Z imputation procedure is based on a hierarchical sorting and merging operation that matches noninterviews with respondents on socioeconomic characteristics available for both. The variables used to match noninterviews with respondents are age, race, gender, marital status, household relationship, education, veteran status, parent/guardian status, and income and asset sources. Pennell (1993, Figure C-1) provides a table of variables used to match recipients with donors. The Type Z imputation procedure is designed to always find a match. Type Z noninterviews are imputed by assigning values from the matching donor to the noninterview record. The donor values are assigned in full, except for identification variables or other variables not relevant for the household in which the noninterview occurred. Pennell (1993) gives a complete account of Type Z imputation, including detailed descriptions of the matching operations.

For the 1996+ panels, the Type Z procedure is used only where Type Z persons do not have an interview record available in the previous wave: in Wave 1, for new respondents in Wave 2+, or where the person was a noninterview in the previous wave. For all others, the general imputation procedure (the sequential hot-deck procedure described in the following pages) is used to impute core items for most person-level noninterviews.
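The hierarchical match-and-broaden idea can be sketched as follows. This is an illustrative simplification, not the Census Bureau's sort-and-merge implementation; the field names, the order in which match keys are relaxed, and the fallback rule are hypothetical.

    # Illustrative sketch of hierarchical Type Z donor matching.
    # Keys are tried from most to least detailed; if no donor matches on the
    # full set, the criteria are broadened (here, by dropping trailing keys)
    # until a match is found.

    MATCH_KEYS = ["age_group", "race", "sex", "marital_status", "hh_relationship",
                  "education", "veteran_status", "parent_guardian", "income_asset_sources"]

    def find_type_z_donor(recipient, donors):
        for level in range(len(MATCH_KEYS), 0, -1):
            keys = MATCH_KEYS[:level]
            for donor in donors:
                if all(donor[k] == recipient[k] for k in keys):
                    return donor
        return donors[0]   # stand-in for the rule that a match is always found

    def impute_type_z(recipient, donor, exclude=("person_id", "household_id")):
        # Assign the donor's values in full, except identification variables and
        # other fields not relevant within the recipient's household.
        for field, value in donor.items():
            if field not in exclude:
                recipient[field] = value
        return recipient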
Imputation of Item Nonresponse in Core Questions

SIPP core items are imputed in the following order:

1. Labor force participation, recipiency of income, and asset holdings;
2. Other cash income;
3. Wage, salary, and self-employment income amounts;
4. Asset income amounts; and
5. Program participation and benefits.

The Sequential Hot-Deck Imputation Procedure

The statistical imputation method used to impute missing items from the core questions and topical modules is known as a sequential hot-deck procedure.4 In a general sense, the sequential hot-deck procedure, like the Type Z imputation procedure, matches a record with missing data to that of a donor with similar background characteristics and uses the donor’s values. This procedure differs from data editing, which replaces missing data with inferred values based on nonmissing data from the same case.

4 The hot-deck procedure used in SIPP for the core questions and topical module items is sequential because the selection of replacement values is implemented one record at a time from an ordered file.

The sequential hot-deck procedure used in SIPP involves five key steps:

1. Specifying cold-deck or initial donor values;
2. Sorting the sample cases;
3. Identifying records with no item nonresponse and updating hot-deck values;
4. Classifying cases into subclasses of the population, referred to as imputation classes or adjustment cells, according to values on a set of classification or auxiliary variables that are nonmissing for all cases (this step is omitted in the initial processing of the key demographic items: race, gender, etc.); and
5. Selecting replacement values from donor cases to impute item-missing data on recipient records.

Two types of sequential hot-deck imputation are used to provide values for missing items. In Wave 1, and for each sample member who is new to a subsequent wave, the hot deck is cross-sectional; only values from current wave responses are used in the definition of the hot-deck cells. Beginning with Wave 2, previous wave values are included in the definition of the hot-deck cells. In both instances, however, only current wave values from selected donors are used to replace missing items (with several exceptions, described below). Longitudinal (or previous wave) hot-deck imputation was not performed prior to the 1996 Panel; each wave received only the cross-sectional hot-deck imputation.

For example, the item indicating whether a person worked part-time in the reference period for the wave (a dichotomous item) uses the longitudinal hot deck for old sample members and the cross-sectional hot deck for new sample members. The 1996 Panel cross-sectional hot-deck imputation is based on a cell structure with 288 cells defined by the cross-classification of sex (two categories), race (two categories), age (six categories), marital status (three categories), disability status (two categories), and presence of own children (two categories). On the basis of his or her current wave values for those categories, each new sample member in any later wave is assigned to a cell; then the donor’s value in that cell is used to impute a value to the new sample member. The longitudinal hot-deck imputation for the part-time work item for old sample members in Waves 2+ is based on a cell structure with 576 cells defined by the same categories described above plus one extra category: whether or not the person worked part-time in the previous wave. A donor is selected from that cell, and that value is imputed. The actual item is imputed from a donor’s value of the item in the current wave; the previous wave value is used only in the assignment of the cell. That procedure guarantees that the sample member is matched to a donor who had the same value for the item in the previous wave. Therefore, sample members who worked part-time in the previous wave will be matched only to donors who also worked part-time in the previous wave. However, the actual hot-deck imputation comes from the donor’s value in the current wave, which may or may not indicate part-time work.

Imputed values for the sample member are allowed in assigning the cell for some items. If a sample member had an imputation for part-time work in the previous wave, that imputation is used to define the cell for the longitudinal hot-deck imputation, even though it is an imputation itself. That is not done for other items, such as asset items; only a nonimputed or logically imputed value counts toward the longitudinal hot deck for those items.

The part-time item is dichotomous; its previous wave imputation matrix is essentially the current wave imputation matrix with the previous wave’s value of the item added to the matrix.
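For concreteness, the cell assignment just described could be expressed as follows. The category codings and field names are illustrative only; they are not the actual variable definitions used in processing.

    # Illustrative cell assignment for the part-time work item.
    # Cross-sectional matrix: 2 (sex) x 2 (race) x 6 (age) x 3 (marital status)
    # x 2 (disability) x 2 (own children) = 288 cells.
    def cross_sectional_cell(person):
        return (person["sex"], person["race"], person["age_group"],
                person["marital_status"], person["disability"], person["own_children"])

    # The longitudinal matrix for old sample members adds the previous wave
    # value of the item itself (2 categories), giving 576 cells.  The imputed
    # value still comes from the donor's current wave response; the previous
    # wave value is used only to pick the cell.
    def longitudinal_cell(person):
        return cross_sectional_cell(person) + (person["part_time_prev_wave"],)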
In many cases, the differences between the two imputation matrices are more pronounced, especially for items with several categories of answers. An example is the item recording the reason a person worked less than 35 hours in the reference period; there are 12 categories for that item. The previous wave imputation matrix uses the following characteristics to define cells:

• Previous wave value for the item (12 categories);
• Sex (two categories);
• Race (two categories);
• Age (six categories).

The current wave imputation matrix uses the following characteristics to define cells:

• Sex (two categories);
• Race (two categories);
• Age (six categories);
• Marital status (three categories);
• Disability status (two categories);
• Presence of own children (two categories).

A different type of example is the item gross pay in the first month of the reference period. For new SIPP sample members, cross-sectional hot-deck imputation is carried out by using the following characteristics to generate cells:

• Industry and occupation category (16 categories);
• Sex (two categories);
• Hours worked (three categories); and
• Education level (three categories).

For old sample members, a longitudinal hot-deck imputation is carried out by using the previous wave value for the item: gross pay in the fourth month of the preceding wave’s reference period.6 This continuous value is divided into 138 categories, ranging from $1-$100 up to over $50,000. Sample members are matched to donors by using the previous wave values of those categories.

6 The second month of the reference period actually uses the first month’s value as the “previous wave value,” the third month uses the second month’s value, and so forth, so that these imputations are really previous month rather than previous wave imputations.

For labor force items, the Census Bureau uses the following special imputation procedures when a person has no current wave information indicating whether or not he or she worked during the reference period. If the Census Bureau can infer from what it knows about the previous reference period whether the person had a job or business at the start of the current period, it carries out the following procedure:

1. If the person was working at the end of the prior wave, then labor force participation is imputed from a single donor for the complete current wave.
2. The Census Bureau then projects job characteristics for the person from the person’s prior wave through the current wave.
3. Finally, the Census Bureau edits the job characteristics for consistency with the imputed labor force participation variables.

This procedure is known as an EPPFLAG imputation, after the name of the variable that indicates its use. If a person was a nonworker in the prior wave, or the Census Bureau cannot infer work status on the basis of prior wave data, then the person’s work status is imputed. If the person is imputed as a worker in the reference period, the Census Bureau imputes the complete set of job/business characteristics variables and labor force participation variables to the person from one donor, in order to maintain consistency among the fields. That procedure is called a little Type Z imputation.
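The choice among these labor force procedures might be summarized schematically as follows; the function and field names are hypothetical placeholders, not actual processing variables.

    # Schematic of the choice among labor force imputation routes when current
    # wave work status is unknown (illustrative only).
    def labor_force_imputation_route(prior_wave):
        if prior_wave is not None and prior_wave.get("working_at_end_of_wave"):
            # Prior wave indicates a job or business at the start of the current
            # period: project job characteristics forward and edit them for
            # consistency (EPPFLAG imputation).
            return "EPPFLAG"
        # Otherwise work status itself is imputed; a person imputed as a worker
        # receives the full set of labor force and job/business fields from a
        # single donor ("little Type Z") to keep the fields consistent.
        return "impute work status; little Type Z if imputed as a worker"

    print(labor_force_imputation_route({"working_at_end_of_wave": True}))   # EPPFLAG
    print(labor_force_imputation_route(None))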
For some items in some cases, a direct logical or carryover imputation is made. The carryover imputation takes the previous wave’s value of the item for the sample member and imputes it to the current wave. That is done particularly for items that rarely (or never) change for a sample member across waves (such as sex and race) or for items that change in predictable ways (such as age).

SIPP hot-deck procedures are designed to preserve the univariate distribution of each variable subjected to imputation. These procedures do not, in general, preserve the covariances among variables. Although some of those interrelationships might be preserved to a certain extent, that is not the primary intent of the hot-deck imputation procedures used by the Census Bureau. One consequence is that imputation can introduce inconsistencies into the data. For example, if a respondent has reported program participation but his or her income is too high for that program, it is possible that the income data have been imputed. Whenever users detect inconsistencies, it is wise to check the allocation (imputation) flag to see whether the inconsistent data might have been imputed. The discussion of allocation (imputation) flags later in this chapter provides more information.

Starting or Cold-Deck Values

In other surveys, cold-deck values in a sequential hot-deck procedure historically served as the initial set of replacement values for missing items in the first record processed; missing items in subsequent records typically received replacement (hot-deck) values from the current data set. In SIPP, however, cold-deck values are seldom used as replacement values for either the first or subsequent records processed. During later stages of processing, as the cold-deck values are replaced with information from the current wave, the array of cells is referred to as the hot-deck matrix. The cells in the matrix are defined by the cross-classification of auxiliary variables (Pennell, 1993, Figure 3.3). Each cell in the matrix corresponds to respondent cases with the same set of values on the classification variables. Many different matrices are defined in SIPP, and each matrix corresponds to one or more variables subject to imputation.

Sorting the Sample Cases

The records in the sample file are sorted by three geographic variables prior to imputing item-missing data: primary sampling unit, segment number, and serial number. The cases are sorted prior to processing and are not re-sorted at any other time during the imputation process. The sorting operation creates a file in which neighboring records represent geographically proximate households.

Preprocessing the Sample File: Initial Updating of Cold-Deck Values

Once the cases have been sorted, they are processed through a series of programs. During the first pass through the programs, the cold-deck values are updated with information from the current wave; missing data are not imputed. The initial processing is done separately for each of the five groups of related core variables listed above. During the first pass, the first record in the sorted file with consistent and nonmissing data for a particular group of variables is identified, and the values from that case replace the cold-deck values for that section in the matrix. The values for each subsequent record with consistent and nonmissing information update the previous set of consistent and nonmissing values written to the matrix. The checking and updating operation continues until all records in the data file have been processed. The last values written to the matrix serve as the starting values in the subsequent sequential hot-deck procedure.
In this way, cold-deck values are rarely used as replacement values in SIPP because the initial processing usually replaces all starting values with values from the current wave of data.

Allocating Cases into Imputation Classes

In the next step of the imputation procedure, each respondent record or noninterview record in the sorted file is allocated to one of the imputation classes, or adjustment cells, according to its values on the set of classification, or auxiliary, variables.7

1. The auxiliary variables are chosen for each item or set of related items on the basis of their level of correlation with the item receiving the imputation (i.e., classification variables are chosen on the basis of their ability to explain the variability of the item or set of related items); Census Bureau researchers assign different sets of classification variables to different sets of items.
2. The auxiliary variables are either dichotomous or polychotomous categorical variables (e.g., sex, race); if they are continuous (e.g., income, asset levels), they are categorized into a parsimonious number of levels.
3. The levels of the auxiliary variables then define a matrix, with the number of cells in the matrix being the product of the number of levels of each auxiliary variable. For example, an imputation matrix defined by five variables, each with three levels, has a total of 243 cells (see the sketch below).

Any given item or set of related items may have imputation matrices with numbers of cells ranging from under 100 to well over 1,000, depending on the matrix. Auxiliary variables such as sex, race, and categorizations of age (with different categorizations for different items) are used frequently in the matrices, as are more specialized auxiliary variables that are relevant for particular items (such as industry and occupation category for the monthly gross pay item). Pennell (1993) gives examples of the different sets of classification variables for previous panel years.

The allocation of sample cases into imputation classes (also known as subclasses or strata) according to a set of classification variables serves several purposes. Ideally, the set of classification variables should account for a large proportion of the variance in the variable being imputed and should be associated with variations in response rates. To the extent that this is accomplished, the classification procedure creates homogeneous adjustment cells containing similar cases. In this way, donors and recipients are similar, under the assumption that the nonresponse mechanism within the imputation class is not related to the item being imputed; that is, an underlying assumption is made that item nonresponse is distributed randomly within the subclass defined by the cross-classification of the auxiliary variables. The selection of classification variables may also place bounds on the range of values that can be imputed and implicitly satisfy edit constraints. The implicit stratification created by the sort order of the file further improves the opportunity for better imputation to the extent that nearby cases are more similar to each other than cases that are farther apart in the file.

7 This step is omitted for the imputation of the primary demographic values that are imputed before the person-level noninterviews.
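The size of an imputation matrix is simply the product of the numbers of levels of its auxiliary variables, as the short sketch below shows; the second example reuses the gross pay characteristics listed earlier, and the calculation is illustrative only.

    # Number of cells in an imputation matrix = product of the numbers of
    # levels of the auxiliary (classification) variables.
    from math import prod

    def matrix_cells(levels_per_variable):
        return prod(levels_per_variable)

    print(matrix_cells([3, 3, 3, 3, 3]))     # five 3-level variables -> 243 cells
    print(matrix_cells([16, 2, 3, 3]))       # industry/occupation, sex, hours, education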
Imputing for Missing Data and Updating of Hot-Deck Values

The selection of replacement values for missing items is restricted to donor and recipient records within each particular cell; that is, records allocated to one cell never donate information to records in another cell with missing items. As the file is processed through the set of programs a second time, the imputations are performed and the set of hot-deck values is updated once again. The records are processed sequentially, according to the sort order of the file. A missing item is given the value of the last corresponding item that is nonmissing from a record in that imputation class. If the value of an item in the current record is nonmissing, it replaces the previous hot-deck value for that imputation class. In this way, the hot-deck value for each imputation class is constantly updated with the value of the last nonmissing case. The updating is done item by item. Missing items in one record receive the current set of replacement values; then the nonmissing values in that record are used to update the hot deck in preparation for the next record. At any point during the process, the donated values in the hot deck are likely to come from many different respondents, even within imputation classes. That is why this imputation procedure does not preserve covariances among the variables being imputed.

Allocation (Imputation) Flags

An allocation (imputation) flag is associated with each core item subject to imputation. When an item has been imputed, the allocation (imputation) flag for that item is set. Beginning with the 1996 Panel, allocation flags denoting either data edits or statistical imputations are included on the core wave files for all variables. For core wave files from earlier panels, imputation flags are included for most items subject to imputation.

One type of variable that does not have an allocation flag is the recode variable. SIPP produces recodes that combine several variables to produce one estimate; recoded variables do not have imputation flags. These variables assist users who are interested in summarizing related questions. For example, the Total Household Income variable is a recode of more than sixty possible income sources. Using recodes cuts down the amount of programming significantly.

The values of the allocation (imputation) flags are as follows:

• 0 indicates no imputation;
• 1 indicates a hot-deck imputation that uses only current wave values;
• 2 indicates a cold-deck value;
• 3 indicates a logical imputation (for the 1996+ panels, a value of 3 may also indicate that data from the previous wave were carried over to the current wave); and
• 4 indicates a dependent imputation. This last category includes imputations in which data have been carried over from the sample unit’s previous wave data and imputations in which previous wave data are used as control variables.

For detailed documentation about the coding of allocation (imputation) flags for specific variables, analysts can refer to the data dictionary for the data file with which they are working.

For items that receive Type Z imputations (in both the pre-1996 panels and the 1996+ panels) and items receiving EPPFLAG and little Type Z imputations in the 1996+ panels, the allocation (imputation) flag for a particular imputed item will not, by itself, indicate the imputation status of the item.
For Type Z imputations, the EPPINTVW field in the 1996+ panels and the person-level INTVW field in the pre-1996 panels indicate whether the Type Z procedure was used to impute all items for the sample person (in these cases, EPPINTVW = 3 or 4 or INTVW = 3 or 4).8, 9 The individual imputation flag for each item indicates whether or not that item was imputed during the processing of the donor’s fields.

8 The codes for EPPINTVW and INTVW differ. In the 1996+ panels, EPPINTVW is coded as follows: 1 = Interview (self), 2 = Interview (proxy), 3 = Noninterview Type Z, 4 = Noninterview pseudo Type Z (left sample during the reference period), and 5 = Children under 15 during the reference period. In the pre-1996 panels, INTVW is coded as follows: 0 = Not applicable (children under 15), 1 = Interview (self), 2 = Interview (proxy), 3 = Noninterview Type Z refusal, and 4 = Noninterview Type Z other.

9 Note that for the 1990-1993 Panels, INTVW can equal 5 on the core wave files (this value is not documented in the codebook). A value of 5 denotes persons in the sample early in the wave who were not in the sample at the time of the interview. Such persons are processed as if they were Type Z nonrespondents. Prior to the 1990 Panel, such persons are identified as those with PP-MIS5 ≠ 1 but PP-MISj = 1 for j = 1, 2, 3, or 4.

For EPPFLAG imputations, the EPPFLAG field will equal 1. When this is true, all labor force participation and job/business characteristics fields are imputed via the EPPFLAG procedure, whether or not the individual items indicate an imputation. As with the Type Z procedure, an allocation (imputation) flag with a value greater than zero for any of the labor force participation items means that the values of those items are not the original values from the donor but are processed values that are consistent with the sample person’s demographics and household composition; for the job/business characteristics fields, an allocation flag with a value of 4 indicates that the sample person’s values in those fields have been projected forward from the person’s values in the previous wave.

To find little Type Z imputations, check the allocation (imputation) flag of the variable EPDJBTHN. If (a) EPDJBTHN = 1 (indicating that the person was a worker), (b) this item’s allocation (imputation) flag is 1 or 4, and (c) EPPFLAG is not 1, then a little Type Z imputation has taken place for all of the labor force participation and job/business characteristics fields. As with the Type Z procedures, the allocation (imputation) flag for an individual item indicates only whether the item was imputed when the donor’s fields were processed.

The full panel files carry only a subset of the allocation (imputation) flags carried on the core wave files. The value of an allocation (imputation) flag is set during wave processing and usually is not modified to reflect any changes in value resulting from the longitudinal editing discussed below. The Census Bureau does reset the values of some allocation flags to indicate that a longitudinal imputation has occurred.
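The checks just described can be collected into a short analyst-side sketch, assuming each person record is available as a dictionary keyed by variable name. The allocation-flag name used here for EPDJBTHN (APDJBTHN) is an assumption for illustration; consult the data dictionary for the actual flag name.

    # Illustrative classification of whole-person imputations on a 1996+ core
    # wave file (sketch only; verify variable and flag names in the data dictionary).
    def imputation_status(rec):
        if rec.get("EPPINTVW") in (3, 4):
            return "Type Z: all items imputed from a single donor"
        if rec.get("EPPFLAG") == 1:
            return "EPPFLAG: labor force/job fields projected from the prior wave"
        if (rec.get("EPDJBTHN") == 1
                and rec.get("APDJBTHN") in (1, 4)   # assumed allocation flag name
                and rec.get("EPPFLAG") != 1):
            return "little Type Z: labor force fields imputed from one donor"
        return "no whole-person imputation; check individual allocation flags"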
Topical Module Imputation Procedures

When item-missing data in topical modules are imputed, the same sequential hot-deck procedure used to impute item-missing data in the SIPP core is used. Topical module data for Type Z noninterviews are also imputed item by item with the sequential hot deck. Those cases are not subjected to the Type Z imputation procedure that was used for core items in the pre-1996 panels.

Phase 2: Data Editing Procedures for Full Panel Files – Pre-1996 Panels Only

At the conclusion of each SIPP panel, core data from all waves are assembled into the full panel file. That assembly is done after all waves have been processed separately, producing the core wave files. Once all waves are linked, longitudinal edits are applied to the SIPP full panel files to ensure that the data for each respondent are consistent over time. Although the core wave files are edited for consistency, some types of inconsistencies become apparent only when looking at the data over multiple waves. Starting with the 1996 Panel, some longitudinal editing has been built into the CAI instrument. The ability to carry data across waves in the CAI environment is expected to result in better cross-wave consistency in the core wave files and in less need for subsequent longitudinal editing.10

10 Prior to CAI, a control file was developed at Wave 1 that contained a unique identifier for each sample person, as well as that person's age, sex, and race. In subsequent waves, the control file provided a means of detecting inconsistencies in age, sex, and race across waves. As each wave of data was received, the reported age, sex, and race of the sample person were checked against the control file and corrections were made. Also prior to CAI, income recipiency was brought forward to the subsequent wave.

Pre-1996 Full Panel Files

The following discussion refers only to pre-1996 procedures. Longitudinal edits in the pre-1996 panels were applied for selected variables. The edits were designed (1) to correct cross-wave inconsistencies, which become apparent only when multiple waves are examined together, and (2) to honor the preference to replace imputed values from one wave with reported values from another wave.

Unlike the hot-deck imputation procedures used with the core wave files, the longitudinal edits in the pre-1996 files did not replace missing data for one person with reported data from another person. When a data value was modified during longitudinal editing, the replacement value was obtained from the same record either directly (by copying a reported value from a different month) or indirectly (using some form of interpolation or extrapolation from reported values in other months). Those procedures could cause modifications both in reported and in imputed values. When a data value was modified during longitudinal editing, the associated imputation flag was not changed. In addition, the core wave files were not revised to reflect changes made during longitudinal editing. Thus, the data for any given respondent may differ between the core wave files and the full panel file, and estimates based on the full panel file may differ from those based on the core wave files.

The longitudinal edits in the pre-1996 files were performed independently on four groups of variables:

1. Demographic and household composition variables;
2. Earned income variables;
3. Other income variables, Food Stamp variables, WIC variables, and program coverage variables; and
4. Medical insurance variables.

In most cases, the values reported during Wave 1 were used as the standard against which inconsistencies were judged. Pennell (1993) provides detailed information about longitudinal consistency edits for specific variables.
Missing Wave Imputation

There are many instances in which data are missing for a person in one wave but are present for that same person in the two adjacent waves. For example, a person may be missing in Wave 5 but have complete data for Waves 4 and 6. Beginning with the 1991 Panel, the Census Bureau began imputing those missing waves in the full panel files. Missing wave imputation is performed only when a missing wave is bounded on both sides by waves in which the sample member was present. If a respondent has missing data for more than one consecutive wave, the imputation is not performed.

For missing waves that are bounded on each side by interviewed waves, data are interpolated using a random carryover procedure. A value r is randomly assigned to each nonrespondent’s household for each missing wave, where r = 0, 1, 2, 3, or 4. The first r reference months within the missing wave receive their imputed values from the fourth month of the preceding wave, and the remaining 4 - r reference months receive their imputed amounts from the first month of the subsequent wave. Although this procedure results in data suitable for many analytic purposes, the random carryover forces stability in responses for wave nonrespondents. That stability could result in underestimation of between-wave changes. The procedure also results in imputed waves that do not exhibit the seam effect common to waves of reported data (Chapter 6). Williams and Bailey (1996) provide a complete account of the handling of missing wave data in SIPP.
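A minimal sketch of this random carryover rule, assuming the monthly amounts for the adjacent waves are held in simple four-element lists (the data layout is hypothetical):

    import random

    def impute_missing_wave(prior_wave_months, next_wave_months):
        """Random carryover for a missing wave bounded by two interviewed waves."""
        r = random.randint(0, 4)                  # r = 0, 1, 2, 3, or 4
        # The first r reference months take the 4th month of the preceding wave;
        # the remaining 4 - r months take the 1st month of the subsequent wave.
        return [prior_wave_months[3]] * r + [next_wave_months[0]] * (4 - r)

    print(impute_missing_wave([100, 100, 120, 120], [150, 150, 150, 160]))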
Phase 2: 1996 & 2001 Panels

The 1996 and 2001 panels use Waves 1-4 to inform the values of selected demographic and household composition variables for the entire panel. Waves 1-4 are linked longitudinally and made consistent across the four waves. Where a disagreement exists, data from a later wave take precedence over data from an earlier wave. Once these data are edited, the cross-sectional edits are re-run for each of Waves 1-4 with the original demographics replaced by the longitudinally edited values. For Waves 5+, each wave is run only cross-sectionally, with the reported demographics replaced by the longitudinal values from Waves 1-4.

Phase 2: 2004+ Panels

There is no Phase 2 for the 2004+ panels. Each wave is edited cross-sectionally with no attempt to make the data consistent across waves, except for the normal Phase 1 editing procedures, which can use previous wave data to supplement missing data in the current wave.

Mapping: 2004+ Panels

The SIPP data collection instrument was extensively modified for the 2004 Panel. The intent of the changes was to decrease respondent burden and to increase the accuracy of the data collected, while keeping the scope and content of the survey essentially unchanged. Because of the complex nature of the edit programs, it was determined that they would not be rewritten. Instead, a mapping operation was inserted at the beginning of the processing stream which, wherever possible, translated the data from the format of the new instrument to the format of the 2001 instrument. This allowed the Census Bureau to run the remaining processing steps with limited changes and to release public use files for 2004+ in the same format and with the same variable names as the 1996 and 2001 panels. Analysis of the effect of these changes on data quality and respondent burden is beyond the scope of this guide.

Confidentiality Procedures for the Public Use Files

All of the editing and imputation procedures described in the preceding sections are part of the process of preparing the data for internal Census Bureau use. Before the files are released for public use, they undergo additional editing to protect the confidentiality of respondents. Two procedures are used: topcoding of selected variables (income, assets, and age) and suppression of geographic information. As a result of these procedures, estimates based on data from the public use files will differ slightly from the Census Bureau’s published estimates.

Topcoding

One piece of information that might reveal a respondent’s identity is a very high income. For that reason, the Census Bureau topcodes income before making that information publicly available, recoding any income amounts over a certain maximum value to that maximum. In other words, income on the public use data files has a ceiling value. Although income is the primary variable that is topcoded, other variables that may disclose a respondent’s identity, such as age, are also topcoded. A few variables, such as starting dates for employment, may be bottom-coded if they pose a disclosure risk. Chapter 10 and Appendix B provide a thorough discussion of topcoding methods and procedures in SIPP.

Suppression of Geographic Information

Geographic information that can be used to directly identify survey respondents, such as an address, is removed from the public use files. In addition, states and metropolitan areas with populations less than 250,000 are not identified. Specific nonmetropolitan areas (such as counties outside of metropolitan areas) are never identified. In certain states, when the nonmetropolitan population is small enough to present a disclosure risk, a fraction of that state’s metropolitan sample is recoded to nonmetropolitan status. For that reason, the SIPP data cannot be used to estimate characteristics of the population residing outside metropolitan areas. Chapter 10 provides details.

For the 1996 and 2001 Panels, state-level geography is shown for 45 states and the District of Columbia. The remaining five states are combined as follows:

1. Maine, Vermont; and
2. North Dakota, South Dakota, Wyoming.

For the 1984 through 1993 Panels, state-level geography is shown for 41 individual states and the District of Columbia; the nine other states are combined into three groups:

1. Maine, Vermont;
2. Iowa, North Dakota, South Dakota; and
3. Alaska, Idaho, Montana, Wyoming.

All states are identified for the 2004+ Panels.