SIPP USERS’ GUIDE    LINKING FILES

13. Linking Core Wave, Topical Module, and Longitudinal Research Files

In many situations, a single Survey of Income and Program Participation (SIPP) data file will not contain the information needed for a project. Because only limited core information is included on the topical module files, analysts often need to merge data from the core wave or longitudinal research files with topical module information. They may also need to link two or more topical module files, each containing data on a different topic and collected in a different wave. And there are situations in which it is necessary to merge data from the core wave files with data from the longitudinal research files. Those situations arise because not all of the core wave content is included [1] on the longitudinal research files (e.g., calendar month weights are only on the core wave files). This chapter describes procedures for linking core wave, topical module, and full panel data files.

[1] Even when the same variables are on the core wave and longitudinal research files, the data may not be the same. Different edit and imputation procedures are used for these two types of files. In pre-1996 panels, all edit and imputation procedures applied to the core wave files worked entirely within the given file; information from previous or later waves was not used. Beginning with the 1996 Panel, edit and imputation procedures applied to the core wave files make greater use of information from previous waves. However, because the core wave files are processed as the data become available, it is not possible to make use of information from future waves. The edit and imputation procedures applied to the longitudinal research files, by contrast, make use of each person’s full longitudinal record. There are many times when the preferred data for a study will be on the longitudinal research files but the weights will be on the core wave files.

This chapter assumes a working knowledge of the files that will be linked. Analysts who are not familiar with those files should read the following before proceeding with this chapter:

- Chapter 9 for an overview of the SIPP data files;
- Chapter 10 for a discussion of the core wave files;
- Chapter 11 for a discussion of the topical module files; and
- Chapter 12 for a discussion of the longitudinal research files.

In all cases, this chapter describes procedures for linking person records across files. It does not discuss procedures for linking households or families because those procedures become problematic [2] when working with longitudinal data.

This chapter begins with a discussion of the mechanics involved in linking SIPP data files. The procedures are straightforward and easily implemented. In each case there are three basic steps:

1. Create data extracts from each of the files to be linked;
2. Sort the files in a common order by using the variables identified as match keys; and
3. Merge the files.

There are two general formats that the final files can take. This chapter refers to these as person-month format (the format of the current core wave files) and person-record format [3] (the format of the longitudinal research files). The choice of format will be a function of the planned analysis and the software that will be used for that analysis. Where appropriate, procedures for generating each type of data file are described.

After discussing the mechanics of linking SIPP files, this chapter discusses why nonmatches occur and suggests ways to deal with them.

Most variable names changed in the 1996 Panel from those of previous panels. To aid users working with pre-1996 panel files, this chapter presents both the old and the new variable names when the text applies to both. In the main body of the text, the old names are presented in parentheses following the new names.
For example, the sample unit ID variable name, which is SSUID in the 1996 Panel, was SUID in previous panels; it is written in this chapter as SSUID (SUID). In tables, a variety of methods are used to present both the old and the new names.

[2] Difficulties arise when unit composition changes over time. In those situations, there is no unambiguous way to define longitudinal households and families, and many ad hoc procedures run the risk of introducing biases into analyses of those units. The alternative approach that has gained acceptance in the research community involves assigning to people the characteristics of the households or families to which they belong at each point in time. Subjects can then be followed over time, as can the characteristics of the households or families to which they belong. One exception to the longitudinal household problem is with program units (e.g., food stamp units), where program rules can be used to define when changing composition constitutes the formation of a new unit (as opposed to changed composition of an existing unit). For discussions of the issues involved in studying longitudinal households and families, see McMillen and Herriot (1985), Duncan and Hill (1985), Citro et al. (1986), and Kalton et al. (1987).

[3] Some software (e.g., Stata) refers to this as “wide” format, while the person-month format is referred to as “long” format.

Procedures for Linking Files

There are six types of merges that SIPP users commonly need to perform:

1. Person-month records within a core wave file can be linked, creating a single wide record for each person rather than a record for each person for each month; [4]
2. Two or more core wave files can be linked together;
3. Core wave files can be linked to longitudinal research files;
4. Two or more topical module files can be linked to each other;
5. Topical module files can be linked to core wave files; and
6. Topical module files can be linked to longitudinal research files.

This chapter addresses each of these merges in turn.

[4] This procedure transforms the current format of the core wave files into a format similar to that used prior to the 1990 Panel, a format analogous to that used for the longitudinal research files.

Linking Within a Core Wave File - Transforming the Person-Month Format into the Person-Record Format

This procedure transforms the person-month-format core wave files (with one record per person per month) into a single wide record per person (the format used for the core wave files before the 1990 Panel). As well as being useful in its own right, reformatting is often a necessary first step when merging core wave files with data from either the topical module files or the longitudinal research files.

Two approaches for this link are described. Programmers using third-generation languages, such as FORTRAN and PL/1, typically use the first approach. Programmers using fourth-generation languages, such as SAS and SPSS, typically use the second approach.

The first approach (using FORTRAN) contains four steps:

1. Sort the file by person and reference month, using the following variables: sample unit ID [SSUID (SUID)], entry address ID [EENTAID (ENTRY)], person number [EPPPNUM (PNUM)], and reference month [SREFMON (REFMTH)]. [5] This is the sort order the Census Bureau uses for the core wave files. If the file being used is in its original sort order, this step can be skipped.
2. Define monthly variable arrays and initialize them to a missing data code. Users should be careful to choose initial values outside the range of legal values for the variables of interest. For example, the variable TAGE (AGE) would be defined as an array of four elements, and each element could be initialized to -9 (an age that no one can have); the variable TPTOTINC (TOTINC) would be defined as an array of four elements, and each element could be initialized to -999999 (a negative value outside the range of the variable); and so on.
3. Read each person’s person-month records and put the information into the appropriate element of each array.
4. Write the person-based record from the information stored in the arrays.

[5] In the 1996 Panel, the entry address is no longer needed to uniquely identify people. Its continued use will not create any problems; it is simply redundant information for purposes of identifying SIPP sample members.

The second approach (using SAS) also contains four steps: [6]

1. Sort the file by person and reference month, using the following variables: sample unit ID [SSUID (SUID)], entry address ID [EENTAID (ENTRY)], person number [EPPPNUM (PNUM)], and reference month [SREFMON (REFMTH)]. This is the sort order used by the Census Bureau for the core wave files. If the file being used is in its original sort order, this step can be skipped.
2. Write out four files, each one containing the person ID variables and the variables for 1 of the 4 months. For example, file1 would have the person ID variables [SSUID (SUID), EENTAID (ENTRY), and EPPPNUM (PNUM)] and the variables for month one, file2 would have the person ID variables and the variables for month two, and so on.
3. Rename the (monthly) variables in each of the four files to unique names. For example, the variable names in file1 might be TAGE1 (AGE1) and PTOTINC1 [7] (TOTINC1); in file2 the variable names might be TAGE2 (AGE2) and PTOTINC2 (TOTINC2).
4. Merge the four files together, using SSUID (SUID), EENTAID (ENTRY), and EPPPNUM (PNUM) as the match keys.

[6] An alternative procedure that may be useful in many cases uses SAS Proc Transpose. Stata also has a procedure, reshape, that can accomplish this task.

[7] Because variable names in SAS were at the time limited to eight characters, the monthly variable name is shortened from TPTOTINC1 (nine characters) to PTOTINC1 (eight characters).

The SAS code in Figure 13-1 performs the above steps. The person-month format of the core wave files (before reformatting) is illustrated in Table 13-1. Person numbers 101 and 102 are in the sample all 4 months; as shown in the table, person numbers 201 and 202 are each in the sample for 3 months (reference months two through four). The person-record format (after reformatting) is illustrated in Table 13-2. Missing data are indicated by a single period, the default missing data code in SAS. For the FORTRAN example, the missing data would have codes of -9 and -999999.

Linking Two or More Core Wave Files

There are three reasons to link two or more core wave files:

1. To create an analysis file for one or more calendar months containing data from all four rotation groups. For example, data for March 1994 are contained in the Wave 7 file (of the 1992 Panel) for rotation groups 4 and 1, and in the Wave 8 file for rotation groups 2 and 3. (Data for the same calendar month are also in Waves 4 and 5 of the 1993 Panel.)
2. To create an analysis file containing more than 4 months of information for each person. This linkage is of primary interest to users of the 1996 Panel, because longitudinal research files for all other panels are available from the Census Bureau.
3. As preparation for merging core wave data with data from either the topical module files or the longitudinal research files.

Creating files in the person-month format is straightforward. In this instance, the extracts from each of the contributing core wave files simply need to be sorted and interleaved to create the final analysis file.
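The sort-and-interleave step can be sketched as follows. This is a minimal illustration in Python rather than SAS, not Census Bureau code: it assumes each wave extract is held in memory as a list of dictionaries, and the helper names (sort_key, interleave, make_extract) and the sample records are hypothetical. The sort keys are the person identifiers plus wave and reference month, using the 1996 Panel names.

```python
import heapq

# Sort keys taken from this chapter (1996 Panel names). Holding records as
# a list of dicts is an assumption made only for this illustration.
SORT_KEYS = ("SSUID", "EENTAID", "EPPPNUM", "SWAVE", "SREFMON")


def sort_key(record):
    """Return the tuple that orders one person-month record."""
    return tuple(record[name] for name in SORT_KEYS)


def interleave(*wave_extracts):
    """Sort each wave extract, then interleave them into one ordered list."""
    sorted_extracts = [sorted(extract, key=sort_key) for extract in wave_extracts]
    return list(heapq.merge(*sorted_extracts, key=sort_key))


def make_extract(wave, person_numbers):
    """Build hypothetical person-month records (four months per person)."""
    return [
        {"SSUID": "123456781000", "EENTAID": "011", "EPPPNUM": pnum,
         "SWAVE": wave, "SREFMON": month}
        for pnum in person_numbers
        for month in (1, 2, 3, 4)
    ]


wave1 = make_extract(1, ["0101", "0102"])
wave2 = make_extract(2, ["0101", "0102"])

# The order in which the extracts are supplied does not matter.
combined = interleave(wave2, wave1)

for record in combined[:5]:
    print(record["EPPPNUM"], record["SWAVE"], record["SREFMON"])
```

In the combined list, each person's records run in wave and reference-month order, so the four Wave 1 records for a person are followed immediately by that person's Wave 2 records.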
The final sort order would likely be based on SSUID (SUID), EENTAID (ENTRY), EPPPNUM (PNUM), SWAVE (WAVE), and SREFMON (REFMTH).

If a person-record format (with just one record per person) is desired, the first step is interleaving the files to create the person-month-format file. Then, using that as the input file, analysts can apply the procedures described in the preceding section to generate a file with a single, wide record for each person. There will be up to 4 months of data for each wave used. In the example from Tables 13-1 and 13-2, if three waves of data are being combined, the final file will have 12 values for SREFMON (REFMTH), TAGE (AGE), and TPTOTINC (TOTINC). In the SAS program code, the variable names would likely be REFMTH1-REFMTH12, TAGE1-TAGE12, and TOTINC1-TOTINC12.

Figure 13-1. Sample SAS Code to Change the Core Wave Files from Person-Month Format to Person-Record Format, from Wave 2 of the 1996 Panel

    /* this creates the initial extract from the full core wave file */
    data allmnths;
       set corewv962 (keep=ssuid eentaid epppnum srefmon tage tptotinc);
    run;

    /* sort the data; if the master file was in its original order, */
    /* this step is not needed                                      */
    proc sort data=allmnths;
       by ssuid eentaid epppnum srefmon;
    run;

    /* write out 1 file for each of the four months, renaming */
    /* variables in the process                                */
    data file1 (rename=(tage=tage1 tptotinc=ptotinc1 srefmon=srefmth1))
         file2 (rename=(tage=tage2 tptotinc=ptotinc2 srefmon=srefmth2))
         file3 (rename=(tage=tage3 tptotinc=ptotinc3 srefmon=srefmth3))
         file4 (rename=(tage=tage4 tptotinc=ptotinc4 srefmon=srefmth4));
       set allmnths;
       select (srefmon);
          when (1) output file1;
          when (2) output file2;
          when (3) output file3;
          when (4) output file4;
          otherwise;   /* ignore any unexpected reference month value */
       end;
    run;

    /* merge the 4 "monthly" files together, forming the final file */
    data newfile;
       merge file1 file2 file3 file4;
       by ssuid eentaid epppnum;
    run;

Users attempting to create their own longitudinal databases from the core wave files should proceed cautiously. The edit and imputation procedures applied to the core wave files for the pre-1996 SIPP panels were all “within wave” procedures. This means that the edits and imputations applied to a person’s records in one wave were independent of those in other waves. Imputation procedures for many core wave file variables from the 1996 Panel are different: the new procedures do make use of information from the preceding wave. When linking data across waves, apparent changes in income, program participation, labor force behavior, or most other outcomes could be due to real changes reported by the respondent, or they could be an artifact of the data editing and imputation performed by the Census Bureau. Although this problem arises primarily with the core wave files from pre-1996 panels, it is also true of the 1996 Panel. [8]

There are two ways to identify cases with edited or imputed data. In pre-1996 panels, the entire record was imputed if (1) MIS5 = 2 and MISj = 1 for j = 1, 2, 3, or 4 or (2) INTVW = 3 or 4. The record was imputed in the 1996 Panel if EPPINTVW = 3 or 4. In the 1996 Panel, persons with Type Z noninterviews with prior wave information have their items imputed with procedures that use their prior wave responses. The relatively few cases with no prior wave information (those in Wave 1 and those in Waves 2-12 who are new to the sample) have their records imputed with the Type Z procedure used in the pre-1996 files. For all panels, if the record was not imputed, it is necessary to check the allocation (imputation) flags associated with the variables of interest. Once such cases are identified, users might need to implement some form of longitudinal editing and imputation or distinguish in their analyses between “real” changes and those that may result from the core wave data processing procedures.

Basic demographic information, such as age, race, and sex, can also appear to change from one wave to the next.
In these instances, changes reflect corrections made in later interviews to information collected in earlier interviews; it is generally safe to assume the most recent data are correct.

When using the core wave files for longitudinal research, analysts should also note that the sample weights included on the core wave files are calendar month specific. These weights may not be appropriate for the planned longitudinal analyses. Chapter 8 has a detailed discussion of how to use the sample weights provided with the SIPP files.

[8] The new imputation procedures for the 1996 Panel are expected to introduce less error than procedures used for earlier panels. Thus, the number and magnitude of spurious changes (as well as falsely imputed stability) should be reduced. Even so, imputation errors will occur, and caution is advised when using the core wave files for longitudinal research.

Table 13-1. Example of the Core Wave Person-Month File Structure

    Sample Unit ID   Entry Address ID   Person Number   Reference Month   Age      Total Income
    [SSUID           [EENTAID           [EPPPNUM        [SREFMON          [TAGE    [TPTOTINC
    (SUID)]          (ENTRY)]           (PNUM)]         (REFMTH)]         (AGE)]   (TOTINC)]
    ------------------------------------------------------------------------------------------
    123456781000     011 (11)           0101 (101)      1                 42       $2,000
    123456781000     011 (11)           0101 (101)      2                 42       $2,100
    123456781000     011 (11)           0101 (101)      3                 42       $2,000
    123456781000     011 (11)           0101 (101)      4                 43       $2,000
    123456781000     011 (11)           0102 (102)      1                 41       $500
    123456781000     011 (11)           0102 (102)      2                 41       $500
    123456781000     011 (11)           0102 (102)      3                 41       $0
    123456781000     011 (11)           0102 (102)      4                 41       $0
    123456781000     011 (11)           0201 (201)      2                 18       $200
    123456781000     011 (11)           0201 (201)      3                 18       $200
    123456781000     011 (11)           0201 (201)      4                 18       $200
    123456781000     011 (11)           0202 (202)      2                 2        $0
    123456781000     011 (11)           0202 (202)      3                 2        $0
    123456781000     011 (11)           0202 (202)      4                 2        $0

Table 13-2. Example of the Core Wave Wide-Record (Person-Record) File Structure

    Sample Unit ID   Entry Address ID   Person Number   Reference Month (a)   Age (b)        Total Income (c)
    [SSUID (SUID)]   [EENTAID (ENTRY)]  [EPPPNUM (PNUM)]  1  2  3  4          1  2  3  4     1       2       3       4
    ------------------------------------------------------------------------------------------------------------------
    123456781000     011 (11)           0101 (101)        1  2  3  4          42 42 42 43    $2,000  $2,100  $2,000  $2,000
    123456781000     011 (11)           0102 (102)        1  2  3  4          41 41 41 41    $500    $500    $0      $0
    123456781000     011 (11)           0201 (201)        .  2  3  4          .  18 18 18    .       $200    $200    $200
    123456781000     011 (11)           0202 (202)        .  2  3  4          .  2  2  2     .       $0      $0      $0

    Note: . = missing.
    (a) 1 = SREFMTH1, 2 = SREFMTH2, 3 = SREFMTH3, 4 = SREFMTH4.
    (b) 1 = TAGE1, 2 = TAGE2, 3 = TAGE3, 4 = TAGE4.
    (c) 1 = PTOTINC1, 2 = PTOTINC2, 3 = PTOTINC3, 4 = PTOTINC4.

Linking Core Wave Files to Longitudinal Research Files

There are relatively few circumstances in which the core wave and full panel files need to be linked because, for the most part, they contain the same information. In general, if the same information is available from both the core wave and longitudinal research files, the information from the longitudinal research files is preferable because the edit and imputation procedures used for the longitudinal research files are believed to introduce less error than the procedures used for the core wave files. [9] However, some core information is contained only on the core wave files, and, therefore, at times it will be necessary to merge the core wave and longitudinal research files.

The following steps are necessary to link data from the core wave files with data from the full panel files:

1. Create data extracts from the core wave and longitudinal research files;
2. Put the two extracts into the same format (either person-month format or person-record format);
3. Sort the extracts into the same order; and
4. Merge the extracts, creating the final file.

The variables that uniquely identify people in the core wave and longitudinal research files have different names.
Table 13-3 shows the names for the three variables needed to match people across those files for pre-1996 panels. [10]

Table 13-3. Variables Identifying People in the Core Wave and Longitudinal Research Files for Pre-1996 Panels

    Variable           Core Wave Files                   Longitudinal Research Files
    --------------------------------------------------------------------------------
    Sample Unit ID     SUID       is matched to          PP-ID
    Entry Address ID   ENTRY      is matched to          PP-ENTRY
    Person Number      PNUM       is matched to          PP-PNUM

If the final file will be in person-record format, these are the only variables needed for the sort and merge operations (steps 3 and 4, above). If the final file will be in person-month format, then WAVE and REFMTH are also needed.

[9] See footnote 1.

[10] Current plans call for using consistent variable names across all files from the 1996 Panel. When text applies to both 1996 and pre-1996 panel files, pre-1996 variable names appear in parentheses following 1996 variable names.

Figure 13-2 shows the SAS code to transform data from the longitudinal research files in wide-record format into the person-month format used in the core wave files. The program creates a person-month format file from the 1993 longitudinal research file. Because SAS does not allow variable names with embedded dashes, the dashes in the variable names have been replaced with underscore (“_”) characters. The 1992 Panel had 10 waves, so the output file will have up to 40 monthly records for each person: no records are written for any months when pp_mis is not equal to 1. The program creates a data set with seven variables: SUID (renamed from PP_ID), ENTRY (renamed from PP_ENTRY), PNUM (renamed from PP_PNUM), REFMTH (which ranges from 1 to 4), WAVE (which ranges from 1 to 10), AGE, and TOTINC. For panel month i, the REFMTH variable is computed as the modulus of i divided by 4 if that modulus is not equal to 0, or as 4 if it is equal to 0.
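The month arithmetic used by the program in Figure 13-2 can be sketched as follows. This is an illustrative Python version rather than SAS, and it assumes 4-month waves numbered from panel month 1; the function names are hypothetical. The wave is taken as the smallest integer greater than or equal to i/4, as in the ceil(i/4) expression in Figure 13-2.

```python
import math


def reference_month(i):
    """Reference month (1-4) for panel month i, assuming 4-month waves."""
    remainder = i % 4          # same role as mod(i, 4) in Figure 13-2
    return 4 if remainder == 0 else remainder


def wave_number(i):
    """Wave for panel month i: the smallest integer >= i / 4."""
    return math.ceil(i / 4)


# Print panel month, wave, and reference month for a few months.
for panel_month in (1, 4, 6, 7, 8, 17):
    print(panel_month, wave_number(panel_month), reference_month(panel_month))
```

For example, panel month 8 is reference month 4 of Wave 2, and panel month 17 is reference month 1 of Wave 5.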
The modulus is the remainder from the division, so in month six of the panel the quantity is mod(6, 4) = 2, in month seven it is mod(7, 4) = 3, and in month eight REFMTH is set to 4 (since the remainder from the division of 8 by 4 is 0). The wave is computed as the first integer greater than or equal to i/4. For month one, i/4 = 0.25, so wave = 1. For month four, i/4 = 1, so wave = 1. For month 17, 17/4 = 4.25, so wave = 5.

The file created by the program in Figure 13-2 could be merged with an extract from the core wave files from the 1992 Panel, using SUID, ENTRY, PNUM, WAVE, and REFMTH as the match keys. If the longitudinal research file was in its original sort order, the file created by the program in Figure 13-2 will already be sorted by this set of match keys.

Values for AGE and TOTINC from the core wave and longitudinal research files will not match for all people in all months because the core wave files and the longitudinal research files are subjected to different edit and imputation procedures. In addition, beginning with the 1991 Panel, a missing wave imputation procedure has been applied to the longitudinal research files: people who had missing data from one wave but complete data from the two adjacent waves had data imputed for the missing wave in the longitudinal research files. [11] This means that some people will have data in the longitudinal research files for months in which they have no records in the associated core wave files (those who were not Type Z nonrespondents).

[11] Many of these situations arise with Type Z nonrespondents: nonresponding people who live in households with other responding sample members. Type Z nonrespondents in the pre-1996 core wave files and those in the 1996 Panel files with no prior wave information were subjected to a whole-record imputation procedure, described in Chapter 10. These people would have records in the core wave files, but different information in the longitudinal research files, because it was imputed using different procedures.

Linking Two or More Topical Module Files

At times it will be necessary to merge data from two or more topical module files. Any project that studies the relationship between subject areas covered by different topical modules will require such a merge. One example might be a study of the relationship between the use of health care services (collected in Wave 3 of the 1993 Panel) and medical expenses (collected in Wave 4 of the 1993 Panel).

The mechanical process of linking topical module files is relatively straightforward. The topical module files all have the same format (one record per person), and the names of the ID variables are consistent across the topical module files: individuals are uniquely identified by the combination of SSUID (ID), EENTAID (ENTRY), and EPPPNUM (PNUM). However, a number of cautions should be noted:

1. In pre-1996 panels, there were instances in which the same variable name was used in different topical module files for different variables. For example, in the 1990 Panel, TM8400 was used in the Wave 2 topical module for a variable that indicates whether the respondent completed 12th grade. The same variable name was used in the Wave 6 topical module to indicate whether the respondent was a parent of children under 21 years of age living in his or her household.
2. Not all people with records in one topical module file will have records in another topical module file. In the topical module files from the 1996 Panel, there will generally be a record for each person who was a responding SIPP household member in the fourth month of the wave’s core reference period. In pre-1996 panels, all household members in the interview month have topical module records for a given wave.
However, household composition changes from one wave to the next: some people leave SIPP households and others join SIPP households, and this changing composition is reflected in the topical module files. Also, in the 1996 Panel, some people who were nonrespondents in month four of one wave may have been respondents in month four of another wave. Thus, when topical module files are merged, there will be a nontrivial number of nonmatches: people with data from only one of the topical modules. Nonmatches are addressed in greater detail later in this chapter.
3. Choosing appropriate weights is complicated by the fact that there are a substantial number of nonmatches across topical modules. One solution is to use one of the weights from the longitudinal research files. Chapter 8 gives a detailed discussion of the SIPP weights.

Often it will be necessary to merge additional information (such as sample weights) from the core wave or longitudinal research files when working with multiple topical modules.

Users interested in measuring change with data from the topical module files (such as changes in asset holdings, or changes in health or disability status) should proceed with caution. First, in some instances measurement error is large relative to the actual changes that have taken place. One example is found in the topical modules that measure levels of household assets and liabilities. [12] Although the topical modules can provide estimates of aggregate-level changes in those instances, users should not attempt to measure those changes at the individual level. Also, the edit and imputation procedures applied to the topical module files are all “within wave” procedures. This means that the edits and imputations applied to a person’s records in one wave are independent of those in other waves.
When data are linked across waves, apparent changes could be due to real changes reported by the respondent or they could be artifacts of the data editing and imputation performed by the Census Bureau.

Figure 13-2. Sample SAS Code to Change the Longitudinal Research Files from Person-Record Format to Person-Month Format for Pre-1996 Panels

    data pmonth (keep=pp_id pp_entry pp_pnum refmth wave age totinc
                 rename=(pp_id=suid pp_entry=entry pp_pnum=pnum));
       /* this example works with the 1993 SIPP panel, 10 waves */
       set sipp93fp (keep=pp_id pp_entry pp_pnum pp_mis1-pp_mis40
                          age1-age40 totinc1-totinc40);
       /* define arrays to ease the programming burden */
       array ages    {40} age1-age40;
       array totincs {40} totinc1-totinc40;
       array pp_mis  {40} pp_mis1-pp_mis40;
       do i = 1 to 40;                     /* for each month */
          if (pp_mis{i} eq 1) then do;     /* if pp_mis is 1, use the data */
             age    = ages{i};             /* the age in this month */
             totinc = totincs{i};          /* total income this month */
             j = mod(i, 4);
             if (j eq 0) then refmth = 4;  /* the reference month */
             else refmth = j;
             wave = ceil(i/4);             /* the wave */
             output;                       /* write out the record */
          end;
       end;
    run;

[12] See the SIPP Quality Profile, 3rd Ed. (U.S. Census Bureau, 1998a) and the SIPP Working Paper series for discussions of this issue as it relates to this and other SIPP topical modules.

There are two ways to identify cases with edited or imputed data. In pre-1996 panels, the entire record was imputed if (1) PP-MIS5 = 2 and PP-MISj = 1 for j = 1, 2, 3, or 4 or (2) INTVW = 3 or 4. In the 1996 Panel, the record was imputed if (1) EPPMIS4 = 2 or (2) EPPINTVW = 3 or 4. In the 1996 Panel, persons with Type Z noninterviews who have prior wave information have their records imputed with procedures that use their prior wave responses. For persons with no prior wave information (those in Wave 1 and those in Waves 2-12 who are new to the sample), the Type Z imputation procedure is used.
On all panels, users should check the imputation flags associated with the variables of interest.

Linking Topical Module Files to Core Wave Files

Because the topical module files contain only limited information from the SIPP core, there will be many times when it is necessary to merge data from the topical module files with data from the SIPP core. One source of these data is the core wave files. [13]

The first decision that must be made is which core wave file to use. Special attention should be paid to the reference periods for the topical module items of interest. In the 1996 Panel, topical module questions refer either to month four of the wave’s core reference period or to a longer period in the past (such as the preceding 12 months or the prior calendar year). In those instances, information would come from the month-four records of the core wave files from the same wave (and possibly from earlier months and waves).

In pre-1996 panels, many topical module items referred to conditions in the interview month. The interview month, however, is not included as a separate record in the core wave file for the same wave as the topical module. [14] Rather, core information for the interview month of one wave is found in the month-one information from the following wave. For example, the interview month for Wave 3 is month 13 in the SIPP research panel, and core data for month 13 are collected as the first reference month of Wave 4. [15] Commonly used reference periods for topical module items are the current (interview) month (month one of the next wave), the previous month (month four of the current wave), the previous 4 months (the full reference period for the current wave), and the previous year.

The topical module files have one record per person, while the core wave files have up to four records for each person (one record per person for each month the person was a SIPP sample member).
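Because a topical module record is keyed by person while core wave records are keyed by person and month, a merge of the two is one-to-many. The following is a minimal Python sketch of such a merge, not Census Bureau code. It assumes records are held as lists of dictionaries keyed by the 1996 Panel identifiers; the function names, the sample records, and the topical module variable TMVAR are hypothetical.

```python
# Person identifiers from this chapter (1996 Panel names).
PERSON_KEYS = ("SSUID", "EENTAID", "EPPPNUM")


def person_key(record):
    """Return the tuple that identifies one person."""
    return tuple(record[name] for name in PERSON_KEYS)


def spread_topical_module(core_records, tm_records):
    """Attach each person's topical module record to every core
    person-month record for that person.

    Returns (merged, core_only, tm_only), where the last two lists hold
    the nonmatches from each side.
    """
    tm_by_person = {person_key(rec): rec for rec in tm_records}
    merged, core_only, matched = [], [], set()
    for rec in core_records:
        key = person_key(rec)
        tm_rec = tm_by_person.get(key)
        if tm_rec is None:
            core_only.append(rec)
        else:
            # If a name appears in both records, the core value is kept.
            merged.append({**tm_rec, **rec})
            matched.add(key)
    tm_only = [rec for key, rec in tm_by_person.items() if key not in matched]
    return merged, core_only, tm_only


def person(pnum):
    """Hypothetical identifiers for one sample member."""
    return {"SSUID": "123456781000", "EENTAID": "011", "EPPPNUM": pnum}


core = [dict(person("0101"), SREFMON=m, TPTOTINC=2000) for m in (1, 2, 3, 4)]
core += [dict(person("0201"), SREFMON=m, TPTOTINC=200) for m in (3, 4)]
topical = [dict(person("0101"), TMVAR=1), dict(person("0301"), TMVAR=2)]

merged, core_only, tm_only = spread_topical_module(core, topical)
print(len(merged), len(core_only), len(tm_only))
```

Returning the two nonmatch lists separately makes it easy to count and inspect people who appear on only one of the files before deciding how to treat them.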
There are at least three options available when merging topical modules with data from the 13 The next section describes procedures for merging topical module files with data from the longitudinal research files. 14 Some of the interview month information is contained on the records for the four reference months of the wave. But in the person-month-format file there is no separate record for the interview month itself. 15 Information collected during the interview month of one wave may not match the information collected about the same calendar month in the subsequent wave. In the 1996 Panel, dependent interviewing techniques and other checks made possible with CAI are used to help resolve those inconsistencies. 13-13 SIPP USERS’ GUIDE LINKING FILES SIPP core wave files:16 1. Pick a single month from the core wave files. For example, if the topical module items use the interview month as their reference period, it may make sense to use records for month one from the core wave files from the next wave. 2. Spread the topical module data across all records from the core wave file. That results in a final file in person-month format. 3. Create a single record for each person from the appropriate core wave file and merge the topical module data to that record. This results in a final file in the person-record format with the same monthly detail as in the second option described above. The steps involved are as follows: 1. Create an extract from the core wave file(s) of interest. 2. If a single record for each person is desired, apply the algorithm in Figure 13-1, which is described in the section entitled Linking Within a Core Wave File - Transforming the Person- Month Format into the Person-Record Format. 3. Sort the core wave extract using SSUID (SUID), EENTAID (ENTRY), and EPPPNUM (PNUM) as the sort keys. These three variables uniquely identify people in the core wave files. 
If the core wave extract is in the person-month format, include SREFMON (REFMTH) as the final sort key.
4. Create an extract from the topical module file of interest. Sort the topical module extract using SSUID (ID), EENTAID (ENTRY), and EPPPNUM (PNUM) as the sort keys.
5. For the 1996 Panel, merge the core wave extract with the topical module extract; use SSUID, EENTAID, and EPPPNUM as the sort keys. For pre-1996 panels, merge the core wave extract with the topical module extract; use the sort keys shown in Table 13-4.

When data from pre-1996 panels are used, there will likely be a nontrivial number of nonmatches between the core wave files and the topical module files. That will be true even when a topical module is merged with core data from the same wave, because people who were members of a SIPP household in the interview month but not during the previous 4 months will have records in the topical module files but not in the core wave files.

16 Yet another option is to create a single record from the core wave files containing aggregate measures for the reference period of interest. For example, it might make sense to create a single record from the current core wave file with total income received during all 4 months of the wave’s reference period. Or the average number of hours worked per week during the previous 4 months might be appropriate. Once the aggregate record is created, the merge step is similar to the others described in this section.

Table 13-4.
Variables Identifying People in the Topical Module and Core Wave Files for Pre-1996 Panels

Variable            Topical Module Files                     Core Wave Files
Sample Unit ID      ID                      is matched to    SUID
Entry Address ID    ENTRY                   is matched to    ENTRY
Person Number       PNUM                    is matched to    PNUM

Linking Topical Module Files to Longitudinal Research Files from Pre-1996 Panels

While topical module files can be linked with data from the core wave files, there are many times when it will be necessary or desirable to use the longitudinal research files instead. For example, if the full panel weights17 are needed for the planned analysis, they must come from the longitudinal research files. When the same core items are available from the core wave and the longitudinal research files, analysts may prefer to use the longitudinal research files because the edit and imputation procedures used for them are believed to introduce less error than the procedures used for the core wave files.

17 Chapter 8 discusses the SIPP weights, their derivation, and use.

The steps involved are as follows:

1. Create an extract from the longitudinal research file.
2. If a file in the person-month format is desired, apply the algorithm described in the section above, Linking Core Wave Files to Longitudinal Research Files. The example in Figure 13-2 can be adapted to that purpose, but the ID variables would need to be renamed to match those used in the topical module files rather than in the core wave files (Table 13-5).
3. Sort the full panel extract; use PP-ID, PP-ENTRY, and PP-PNUM as the sort keys. These three variables uniquely identify people in the longitudinal research files. If the full panel extract is in the person-month format, include WAVE and REFMTH as the final sort keys.
4. Create an extract from the topical module file of interest. Sort the extract; use ID (the variable name for the sample unit ID in the topical module files), ENTRY, and PNUM as the sort keys.
5.
Merge the full panel extract with the topical module extract based on the sort keys described here and shown in Table 13-5.

Because the longitudinal research files contain a record for every person who was ever a member of a SIPP household, every person with a record in a topical module file should have a record in the longitudinal research file. However, analysts working with a person-month-format file containing records only for months when PP-MIS = 1 may find nonmatches.

Table 13-5. Variables Identifying People in the Topical Module and Longitudinal Research Files for Pre-1996 Panels

Variable            Topical Module Files                     Research Files
Sample Unit ID      ID                      is matched to    PP-ID
Entry Address ID    ENTRY                   is matched to    PP-ENTRY
Person Number       PNUM                    is matched to    PP-PNUM

Nonmatches When Merging Files

SIPP is designed to follow a group of people over an extended period of time. This group includes only those who were interviewed in the first wave of the panel and the children subsequently born to or adopted by them.18 Over the course of the panel, these original sample members are followed and interviewed every 4 months. Secondary sample members, on the other hand, are part of the SIPP sample only for as long as they continue to reside with at least one original sample member. As long as they are part of the SIPP sample, the secondary sample members are interviewed and included in the SIPP data files.

The problem of nonmatches occurs only when users merge across waves, whatever the type of file. There is no matching problem when the same or different types of files are merged within the same wave.

As shown in Table 13-6, there are a variety of reasons why a person may be in one SIPP data file but not in another. All but one of the reasons is associated with people entering and leaving the SIPP sample:19

1.
The original sample person may have left the SIPP sample universe (e.g., died, moved abroad, moved into military barracks, or moved into an institution);
2. The original sample person may have left the sample but is still in the sample universe (sample attrition);
3. The original sample person may have just reentered the SIPP sample universe (after living abroad, etc.);
4. The person is a newborn (a special case of a person joining the sample universe);
5. The secondary sample member has just begun living with an original sample person;
6. The secondary sample member no longer lives with an original sample member;
7. The person had data for a “missing wave” imputed in the longitudinal research file and has no records in the core wave or topical module files for that wave; and
8. In pre-1996 panels, the Census Bureau may have intentionally altered the identification information of the person, thereby making it difficult to find a match for this person (in rare situations referred to as merged households).

18 In the 1993 Panel all original sample members were followed no matter what their ages. In all other panels, only original sample members aged 15 years or older are followed when they move to new addresses. In all cases, however, the SIPP data files contain a record for all people, including children, who reside in a household with at least one original panel member present.
19 The SIPP following rules are described in greater detail in Chapter 2.

A person’s reason for leaving the SIPP sample is identified in the core wave and longitudinal research files. In the former, the variable name is ULFTMAIN (REALFT). In the longitudinal research files, the name is REASLEFT, and it has a value for each wave rather than each month.
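Before deciding how to treat nonmatches, it helps to see how many there are and in which direction they run. The sketch below is one way to do that, not a Census Bureau procedure. It assumes each extract has been read into a list of dictionaries and that people are identified by the 1996 Panel variables SSUID, EENTAID, and EPPPNUM; for pre-1996 panels the names in Tables 13-4 and 13-5 would be passed instead. The function names and sample values are hypothetical.

```python
PERSON_KEYS = ("SSUID", "EENTAID", "EPPPNUM")  # 1996 Panel person identifiers


def person_id(record, keys=PERSON_KEYS):
    """Build the tuple that uniquely identifies a person in a file."""
    return tuple(record[k] for k in keys)


def classify_people(first, second, first_keys=PERSON_KEYS, second_keys=PERSON_KEYS):
    """Split people into those in both files, only the first, or only the second.

    Person-month files repeat each person, so IDs are collected into sets.
    """
    first_ids = {person_id(r, first_keys) for r in first}
    second_ids = {person_id(r, second_keys) for r in second}
    return {
        "both": sorted(first_ids & second_ids),
        "first_only": sorted(first_ids - second_ids),   # present, then not present
        "second_only": sorted(second_ids - first_ids),  # not present, then present
    }


# Hypothetical extracts: an earlier core wave file and a later topical module.
earlier = [
    {"SSUID": "019128000276", "EENTAID": "011", "EPPPNUM": "0101", "SREFMON": 4},
    {"SSUID": "019128000276", "EENTAID": "011", "EPPPNUM": "0102", "SREFMON": 4},
]
later = [
    {"SSUID": "019128000276", "EENTAID": "011", "EPPPNUM": "0101"},
    {"SSUID": "019128000276", "EENTAID": "011", "EPPPNUM": "0103"},
]
result = classify_people(earlier, later)
```

The first_only and second_only groups correspond to the Present/Not present patterns in Table 13-6; the reason-left variables can then be examined for the people in first_only.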
Figure 13-3 shows the variable values and corresponding descriptions.

Procedures for dealing with nonmatches vary, depending largely on the reasons the person entered or left the SIPP sample. A number of common scenarios are presented below.

Exiting or Entering the Population

There is a fundamental distinction between situations in which people leave the sample because they leave the SIPP sample universe and situations in which they leave the sample despite the fact that they are still part of that population. The SIPP sample universe (the population that the SIPP sample represents) is the noninstitutionalized, resident population of the United States. It includes both civilian and military people; it includes adults and children who reside in the United States and outside of institutions. People who leave this population because they die, move abroad, or move into institutions exit the SIPP sample because they are no longer a part of the population that SIPP represents.

In general, when nonmatches occur because people have entered or exited the population represented by the SIPP sample, data should not be imputed and weights should not be adjusted for the period when these people are outside of that population. From the perspective of SIPP, these people do not exist when they are outside of the population represented by the sample.

Table 13-6.
Reasons for Nonmatches

Reasons                                                       File #1          File #2
                                                              (earlier time    (later time
                                                              period)          period)

People Exiting the Sample
  Original sample people left the SIPP sample universe        Present          Not present
  (left the population of inference)
    Person died
    Moved abroad - left sample universe
    Moved into military barracks - left sample universe
    Moved into an institution - left sample universe
  Original sample person exited from the sample (still in     Present          Not present
  the sample universe but no longer in the sample)
    Refused to be interviewed
  Secondary sample person no longer lives with an original    Present          Not present
  sample member

People Entering the Sample
  Newborn                                                     Not present      Present
  Original sample person returns to SIPP sample universe      Not present      Present
  (returns to the population of inference)
    Moved from abroad - entered sample universe
    Moved from military barracks - entered sample universe
    Moved from an institution - entered sample universe
  Original sample member returns to sample                    Not present      Present
    Original sample member agrees to be interviewed and
    returns to sample
  Secondary sample person now lives with an original          Not present      Present
  sample member

Missing Wave Imputation in the Longitudinal Research File (Beginning with the 1991 Panel)
  Person has data in the longitudinal research file but no data in the
  corresponding wave in the core wave or topical module files.

Merged Households Special Case
  “Old” version of the ID information                         Present          Not present
  “New” version of the ID information                         Not present      Present

The following examples help explain why weighting adjustments and imputation are problematic in these situations:

● A person is in the SIPP sample at Time 1 but dies before Time 2. In this case, the person is not part of the population at Time 2. In computing the aggregate (total) income of the population at Time 1, this person’s income would be included.
If income were imputed to this person for the Time 2 observation, analysts would compute an aggregate income that is too high: the person had no income at Time 2, and so none should be imputed.20 If this case is dropped from the analysis file and the weights are inflated for the remaining sample, the estimate of the total population at Time 2 would be too high. Because this person was not a part of the population at Time 2, the weights for the remaining sample members should not be inflated to represent this individual.

● A person is overseas at Time 1 but at Time 2 is living with an original sample member in the United States. At Time 1, this person was not part of the population represented by the SIPP sample. Because this person was not a part of that population, the SIPP sample should not be adjusted in any way to represent this individual.

A number of strategies are possible for dealing with cases in which nonmatches result from people entering or leaving the population represented by the SIPP sample. One approach is to drop those people from the analysis sample entirely. No adjustment would be made to the weights of the remaining cases. However, the definition of the population represented by the remaining sample would change. The remaining sample represents the population that existed at both Time 1 and Time 2. It does not represent anyone who either entered or left the population.

That approach has the advantage of being simple to implement. It also results in a clearly defined population of inference. Caution is necessary, however, to the extent that people entering and leaving the population are systematically different from those who are present throughout the period being studied: the remaining sample cannot be used to draw inferences about this other part of the population.
People entering and leaving prisons and nursing homes, for example, likely have very different income profiles than the population that remains outside of these institutions over the period under study.

If event-history models are used to analyze the data, another approach is possible.21 With these models, exits from the population can be treated as competing outcomes. For example, in a study of unemployment dynamics, a competing risks model might allow for three possible outcomes: spells of unemployment can end because (1) a person becomes employed, (2) a person exits the labor force, or (3) a person exits the population.22

20 If the person had been alive with income that she or he did not report to the Census Bureau, an estimate of his or her unreported income would be imputed to the individual. Failing to impute that unreported income would mean that the income received by a member of the population is not represented anywhere in the sample. That omission would result in a sample estimate of aggregate income that was lower than the actual value in the population.
21 For a description of these methods, see Tuma and Hannan (1984).
22 In actual applications, more than three outcomes would likely be modeled. The determinants of entering a nursing home, for example, are likely quite different from the determinants of entering a prison.

Figure 13-3. Data Dictionary Entries for Variables Identifying the Reason a Person Left the SIPP Sample

Wave 23, 1996 Panel Core Wave File
D ULFTMAIN 2 606
T PE: UNEDITED VARIABLE - Main reason left Household
  What is the main reason ... left the household?
U Movers from households which contain sample persons at the time of interview, movers from a household which splits into multiple households. Note: This is an unedited field and the universe is not exact.
V 0 .Not answered
V 1 .Deceased
V 2 .Institutionalized
V 3 .On active duty in the Armed Forces
V 4 .Moved outside of U.S.
V 5 .Separation or divorce
V 6 .Marriage
V 7 .Became employed/unemployed
V 8 .Due to job change - other
V 9 .Listed in error in prior wave
V 10 .Other
V 11 .Moved to type C household

1993 Full Panel Files
D REASLEFT 9 143 9 1 Range = (0:9)
  Preedited reason for leaving the Household Control Card item 23
U Persons who left at any time during the reference period
  Subscript 1: not applicable for Observation 1
  Subscript 2-8: reason left in Observations 2-8
V 0.Not applicable or not answered or nonmatch
V 1.Left - deceased
V 2.Left - institutionalized
V 3.Left - living in armed forces barracks
V 4.Left - moved outside of country
V 5.Left - separation or divorce
V 6.Left - person #201 or greater no longer living with sample person
V 7.Left - other
V 8.Entered merged household
V 9.Interviewed in previous wave but not in sample

Figure 13-3.
Data Dictionary Entries for Variables Identifying the Reason a Person Left the SIPP Sample (continued)

1993 Core Wave Files
D REALFT 2 521
  Reason for leaving the household
  Applicable when previous wave address ID is not equal to control card address ID
  Range=(00:00,05:12,25:31,99:99)
U All persons, including children, no longer in the household
V 00.Not applicable or not answered
V 05.Left - deceased
V 06.Left - institutionalized
V 07.Left - living in Armed Forces barracks
V 08.Left - moved outside of country
V 09.Left - separation or divorce
V 10.Left - person #201+ no longer living with sample person
V 11.Left - other
V 12.Left - entered merged household
* Should have been deleted in a previous wave:
V 25.Left - deceased
V 26.Left - institutionalized
V 27.Left - living in Armed Forces barracks
V 28.Left - moved outside of country
V 29.Left - separation or divorce
V 30.Left - 201+ person no longer living with sample person
V 31.Left - other
V 99.Listed in error

Exiting the Sample but Remaining in the Population (Sample Attrition)

Sample attrition occurs when people leave the SIPP sample but remain a part of the population represented by that sample. In these instances the remaining sample generally should be adjusted to represent the full population, including the part of the population represented by those who leave the sample. There are several options for handling such cases:

● Impute the missing data and proceed. This option is appropriate for researchers familiar with the statistical literature on imputation for missing data. A full discussion of this topic is well beyond the scope of this manual. Analysts are cautioned, however, against using the common practice of “substituting the mean” for missing data. That practice can yield biased estimates of multivariate statistics (such as regression coefficients) and generally leads to downward-biased estimates of standard errors.
● Drop cases with missing data, adjust (post stratify) the weights for the retained cases, and proceed. This post stratification involves several steps.
  1. Tabulate the weighted number of cases by various socioeconomic categories before dropping any cases.
  2. Repeat the tabulation after dropping the nonmatches.
  3. Compute adjustment factors by dividing the weighted numbers from step 1 (before dropping any cases) by the weighted numbers from step 2 (after dropping cases).
  4. Create a new weight variable by multiplying the original weight variable by the appropriate post stratification factor computed in step 3.
  This situation requires caution. A user who drops records may introduce selection biases because those in the retained sample may be more stable than those who leave. For example, the fact that a (former) sample member has left may be associated with other changes in that person’s life, such as giving birth, getting married, or getting a new job. Because the person left the sample, it is not possible to know from the available data what changes actually did occur in each case. Also, when records are dropped, the procedures for computing standard errors as described in the source and accuracy statements provided with the data will no longer apply. The procedures described in Chapter 7 for the direct estimation of standard errors should, however, work without any modification. If the number of cases lacking complete information is small relative to the full analysis sample (the full sample with positive weights), the biases introduced by dropping those cases also are likely to be small and this procedure may be a viable alternative.
● If the longitudinal research file is available, use a subset of the cases with complete data for which Census Bureau-provided weights are available and proceed.
At the extreme, this procedure entails retaining only cases with positive full panel weights and using those weights for any analyses performed.23 This is a conservative approach, but one that is relatively easy to implement because the weights already exist, they have already been adjusted for the observed sample attrition, and the population of inference is clearly defined.
● Use other missing data methods to provide estimates and their standard errors. A full discussion of these methods is beyond the scope of this manual. The methods are designed to make use of all available information from the cases with complete data without (directly) imputing data to cases with incomplete information. Interested users can consult the literature on the E-M algorithm for one example of how this can be done.24 Also, Skinner et al. (1989) discuss model-based approaches to the analysis of complex surveys with missing data.

23 The calendar year weights on the longitudinal research files are also options worth exploring. Chapter 8 provides a detailed discussion of the SIPP sample weights, their derivation, and use.

Missing Wave Imputation in the Longitudinal Research Files Pre-1996

Beginning with the 1991 Panel, a missing wave imputation procedure has been applied to the longitudinal research files: persons who had missing data from one wave but complete data from the two adjacent waves had data imputed for the missing wave in the longitudinal research files.25 Some of those cases are Type Z nonrespondents and will have records with different data in the core wave files.26 Other people will have data in the longitudinal research files for months when they have no records in the associated core wave or topical module files. The correct procedure for dealing with the resulting nonmatches depends on which weight variables will be used.
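The four post stratification steps listed earlier for the drop-and-reweight option under sample attrition can be sketched as follows. This is a minimal illustration, not the Census Bureau's weighting procedure. The field names weight, stratum, and matched are hypothetical, and the input is assumed to contain only people who are still in the population (people who left the sample universe should not be reweighted). A stratum that loses all of its cases simply disappears from the output, so such strata would need to be collapsed with others beforehand.

```python
from collections import defaultdict


def poststratify(records, weight="weight", stratum="stratum", matched="matched"):
    """Return retained records with a post-stratified weight added.

    records: one dict per person who is still in the population.
    matched: True when the person has complete (matched) data.
    """
    before = defaultdict(float)  # step 1: weighted counts before dropping
    after = defaultdict(float)   # step 2: weighted counts after dropping
    for rec in records:
        before[rec[stratum]] += rec[weight]
        if rec[matched]:
            after[rec[stratum]] += rec[weight]

    # Step 3: adjustment factor = weighted count before / weighted count after.
    factors = {}
    for cell, total in after.items():
        if total <= 0:
            raise ValueError(f"stratum {cell!r} has no weighted retained cases")
        factors[cell] = before[cell] / total

    # Step 4: new weight = original weight * factor for the person's stratum.
    return [
        {**rec, "new_weight": rec[weight] * factors[rec[stratum]]}
        for rec in records
        if rec[matched]
    ]


# Hypothetical analysis file with two strata.
people = [
    {"id": 1, "stratum": "A", "weight": 100.0, "matched": True},
    {"id": 2, "stratum": "A", "weight": 100.0, "matched": False},
    {"id": 3, "stratum": "A", "weight": 200.0, "matched": True},
    {"id": 4, "stratum": "B", "weight": 50.0, "matched": True},
]
adjusted = poststratify(people)
```

Within each stratum the new weights of the retained cases sum to the weighted count before any cases were dropped, which is the property the four steps are meant to preserve.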
If the weights are coming from the core wave or topical module files, observations from the longitudinal research files not present in the cross-sectional files should be dropped. That is because the weights on the core wave and topical module files are computed for the samples in those files, samples that do not include the people who have had that wave imputed in the longitudinal research files. If the weights are coming from the longitudinal research file, then other procedures must be used to deal with the missing data from the core wave and topical module files. In those instances, the procedures described for dealing with sample attrition should be considered.

24 For example, see Little and Rubin (1987). Users should also note that some statistical packages (e.g., SPSS) have incorporated more sophisticated options for handling missing data than have generally been available in the past.
25 Imputed waves can be identified on the longitudinal research files by using the WAVFLG variable.
26 The data are different because different imputation procedures are used.

Merged Households in Panels Pre-1996

Finally, nonmatches can occur when the Census Bureau changes the ID numbers for sample members.27 In pre-1996 panels, there were two very rare occasions when this happened. The first occurred when two separate sampling units, each containing original sample members, were merged together, perhaps because of a marriage. In this situation, the people in one of the sampling units retained their identification information, while the people in the other sampling unit had their identification information changed to agree with the retained set. The person numbers of the changed set were modified to be between 180 and 199.

27 Because the Census Bureau is using new procedures in the 1996 Panel and beyond, merged households will not be an identifiable source of nonmatches when files from the 1996 Panel are merged. Rather, they will appear no different from other situations where people enter and leave the SIPP sample, such as through marriages, divorces, deaths, and sample attrition. For example, in the 1996 Panel, there will be no way to identify which (if any) of the people who appear to have entered the sample in Wave 3 were also sample members who appear to have left the sample following Wave 2. The “new” sample members will be given person numbers in the same range as others who enter the sample in Wave 3, and no previous wave information will be attached to them. The new procedures greatly simplify the handling of these rare cases for both the Census Bureau and outside data users.

The second instance occurred when a SIPP household split into two new households (in which each new household gained a new sample person), which later recombined. For example, a married couple separated in Wave 3, each moving in with a sibling. Both siblings were assigned a person number of 301, because they entered the sample in Wave 3 at different addresses. If the husband and wife reunited in Wave 6, bringing the siblings with them, one sibling’s person number was changed. In this case, one of the siblings would have a person number of 301 and the other would have a person number of 680 (or some number between 680 and 699 because the households recombined in Wave 6).

Different file types (i.e., core wave, topical, and full panel) keep track of the changed ID values differently. If the move occurred after the first month of a reference period, the core wave file contains two records for the person whose identification information changed. The first record contains the original identification information of the person before the move and identifies the person as having exited the sample at the time of the move.
The second record contains the new identification information after the move and identifies the person as having entered the sample at the time of the move. When the move occurs at the start of a reference period, only the second record is retained in the core wave file. The topical module file, however, contains only the second record, no matter when the move took place. The longitudinal research file contains both records, no matter when the move took place.

The easiest way to find these people is to search the core wave file for people with a previous wave identified as present, that is, PWSUID > 0 or PWENTRY > 0 or PWPNUM > 0. Users then need to decide how they want to handle these special cases. There are several possibilities:

● Change the identification information used in the waves before the move to the new values seen in the wave(s) after the move, and then merge the records using these ID values. This option is useful when working primarily with the person’s core wave data after the move.
● Change the identification information in the waves after the move to the original values, and then use those ID values to merge records. This option is useful when working primarily with the person’s core wave data before the move.
● Duplicate the person’s record, and use the initial identification information with one record and the new identification information with the other record; then merge those records. With this approach, the weights for the duplicated records will need to be adjusted so that the duplicated weights sum to the original (unduplicated) weights.
● Treat this person as two people: once as someone who exits the sample at the time of the move and once as someone who enters the sample at the time of the move. That is how these cases are treated in the longitudinal research files. The weighting implications of this approach depend on the planned analysis.
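As a rough illustration of the search and of the record-duplication option described above, the sketch below flags pre-1996 core wave records whose previous-wave identifiers are populated and then splits one such record in two. It rests on several assumptions that should be checked against the data dictionary: that PWSUID, PWENTRY, and PWPNUM hold the person's previous-wave ID values as numbers, that the current identifiers are named SUID, ENTRY, and PNUM, and that dividing the weight equally is acceptable (the text requires only that the duplicated weights sum to the original). The field name weight and all values are hypothetical.

```python
def has_changed_id(record):
    """True when any previous-wave ID field is populated (greater than zero)."""
    return record["PWSUID"] > 0 or record["PWENTRY"] > 0 or record["PWPNUM"] > 0


def duplicate_with_split_weight(record, weight="weight"):
    """Return two copies of a record: one under the old IDs, one under the new.

    Each copy carries half the original weight, so the pair sums to the
    original (unduplicated) weight. The equal split is an assumption.
    """
    half = record[weight] / 2
    old_copy = {**record, "SUID": record["PWSUID"], "ENTRY": record["PWENTRY"],
                "PNUM": record["PWPNUM"], weight: half}
    new_copy = {**record, weight: half}
    return [old_copy, new_copy]


# Hypothetical records; the second person's IDs changed after a household merge.
core = [
    {"SUID": 111, "ENTRY": 11, "PNUM": 101, "PWSUID": 0, "PWENTRY": 0,
     "PWPNUM": 0, "weight": 1500.0},
    {"SUID": 111, "ENTRY": 11, "PNUM": 181, "PWSUID": 222, "PWENTRY": 11,
     "PWPNUM": 101, "weight": 1800.0},
]
changed = [rec for rec in core if has_changed_id(rec)]
copies = duplicate_with_split_weight(changed[0])
```

The copy carrying the old IDs can be merged with files from waves before the move, and the copy carrying the new IDs with files from later waves.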