SIPP USERS’ GUIDE                                                         SAMPLING ERROR


 7.            Sampling Error
This chapter discusses methods for obtaining the sampling error estimates derived from the
Survey of Income and Program Participation (SIPP) panels. The sample selected for each SIPP
panel is a stratified multistage probability sample. This complex sample design needs to be taken
into account when estimating the variances of SIPP estimates. The SIPP data files contain
variables, related to the sample design, that are created for the purpose of variance estimation.
Several software packages are now available for computing variance estimates for a wide range
of statistics based on complex sample designs. Using the variables that specify the design, these
programs can calculate appropriate variances of survey estimates. The Census Bureau also
provides generalized variance functions (GVFs) that can be used to obtain approximate estimates
of sampling variance for SIPP estimates.

A common mistake in the estimation of sampling error for survey estimates is to ignore the
complex survey design and treat the sample as a simple random sample (SRS) of the population.
That mistake occurs because most standard software packages for data analyses assume simple
random sampling for variance estimation. When applied to SIPP estimates, SRS formulas for
variances typically underestimate the true variances. This chapter describes how appropriate
variance estimates, which take into account the complex sample design, can be obtained for SIPP
estimates.

The topics discussed in this chapter are:

       •        Direct variance estimation;

       •       Approximate variance estimates obtained from GVFs;

               and

       •       Variance estimation when some data are imputed.


Direct Variance Estimation
The primary sampling unit (PSU) plays a key role in variance estimation with a multistage
sample design. SIPP PSUs are mostly counties, groups of counties, or independent cities (SIPP
Quality Profile, 3rd Ed. [U.S. Census Bureau,1998a, Chapter 3]). The PSUs are sampled without
replacement so that no PSU is selected more than once for the sample. Some PSUs are sampled
with probability proportional to size within strata and usually called nonself-representing PSUs
                                              7-1 

SIPP USERS’ GUIDE                                                          SAMPLING ERROR

(NSR PSUs). Other PSUs are so large that they are included in the sample with certainty and
therefore are called self-representing PSUs (SR PSUs). Because no sampling is involved, SR
PSUs are, in fact, not PSUs but strata. The actual PSUs for those certainty selections are the
enumeration districts and other units selected within them. Although the SIPP PSUs are selected
without replacement (as is the case with most multistage designs), for the purpose of variance
estimation they are treated as if they were sampled with replacement. The with-replacement
assumption greatly facilitates variance estimation since it means that variance estimates can be
computed by taking into account only the PSUs and strata, without the need to consider the
complexities of the subsequent stages of sample selection. This widely used simplifying
assumption leads to an overestimation of variances, but the overestimation is not great.

Several software packages are available for computing variances of a wide range of survey
estimates (e.g., means and proportions for the total sample and for subclasses, for differences in
means and proportions between subclasses, and for regression and logistic regression
coefficients) from complex sample designs. Many of these packages are listed on the Web:
http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html. Lepkowski and Bowles (1996)
examined eight of the packages.

These packages use a variety of methods for variance estimation. Some use an approach based
on a Taylor-series approximation, or linearization, method. Others use a replication method, such
as jackknife repeated replications or balanced repeated replications. Although some methods
have advantages in some situations, there is generally little to recommend one method over
another. The variance estimates they produce are not identical, but the differences are usually
small. See Wolter (1985) and Rust (1985) for discussions of these methods.

Variance Units and Variance Strata, 1990–2004 Panels

For the 1990-2004 SIPP Panels, the sample member record contains information concerning the
PSU and stratum within which the member was sampled. This information is needed as input for
all of the specialized software packages. The original PSU and strata codes are not included in
the SIPP public use data files, however, to avoid potential identification of small geographic
areas and sampled individuals. Instead, sets of PSUs are combined across strata to produce
variance units and variance strata, with two variance units in each variance stratum. Variance
units and variance strata may be treated as PSUs and strata for variance estimation purposes.
Their use does not give rise to any bias in the variance estimates. The variance estimates are
somewhat less precise, however, than those obtained from the use of the PSUs and strata that
have not been combined.

Under the complex sample design, the number of degrees of freedom for variance estimation
depends on the number of variance strata. The 1984 SIPP Panel consists of 142 variance units in
71 variance strata; the panels between 1985 and 1991 have 144 variance units and 72 variance
strata; the 1992-1993 Panels have 198 variance units and 99 variance strata; the 1996-2001
Panels have 210 variance units and 105 variance strata; and the 2004 Panel has 228 variance
                                             7-2 

SIPP USERS’ GUIDE                                                        SAMPLING ERROR

units and 114 variance strata. As a rough approximation, the number of degrees of freedom for a
variance estimate is the number of variance strata. Thus, for national estimates, the variance
estimates have about 71 degrees of freedom for the 1984 Panel, 72 degrees of freedom for the
1985-1991 Panels, and 99 degrees of freedom for the 1992-1993 Panels, 105 degrees of freedom
for the 1996-2001 Panels, and 114 degrees of freedom for the 2004 Panel. Regional estimates
will have fewer degrees of freedom because such estimates include only some of the variance
strata.

Table 7-1 displays the variable names for the variance stratum and variance unit code in the SIPP
core wave files and the SIPP full panel files. These codes can be employed as stratum and PSU
codes in any of the software packages for variance estimation with complex sample designs.

        Table 7-1. Variance Stratum Code and Variance Unit Code in SIPP Files, 1990-2004

                                                SIPP Core Wave File  SIPP Longitudinal File
    Variable for Variance Estimation
                                               1990-1993   1996-2004 1990-1993 1996-2004
     Variance unit (or half-sample) code        HHSC       GHLFSAM HALFSAMP GHLFSAM
     Variance stratum code                      HSTRAT GVARSTR VARSTRAT GVARSTR


Replication Weights for the SIPP Panels
Analysts should use Fay’s method for estimating variances for the SIPP Panels. Fay’s method is
a modified balanced repeated replication (BRR) method of variance estimation. The difference
between the basic BRR method and Fay’s method is that the BRR method uses replicate factors
of 0 and 2, whereas Fay’s method uses one factor, k, which is in the range (0, 1), with the other
factor equal to 2-k. In Fay’s method, the introduction of the perturbation factor (1-k) allows the
use of both halves of the sample. Thus, Fay’s method has the advantage that no subset of the
sample units in a particular classification will be totally excluded. The variance formula for
Fay’s method is

                                                            j (θi & θ0) ,
                                                            G
                             Var (θ0) '
                             Var          {1/[G(1 & k)2]}
                                          {1/                          2                        (7&1)
                                                            i' 1
where

        G = number of replicates; 

        1 – k = perturbation factor; 

        i = replicate, i = 1 to G; 

        θi= ith estimate of the parameter θ based on the observations included in the ith replicate;
        θ0 = survey estimate of the parameter θ based on the full sample.

The 1996 SIPP Panel uses 108 replicate weights, which are calculated on the basis of a

                                                    7-3 

SIPP USERS’ GUIDE                                                            SAMPLING ERROR

perturbation factor of 0.5 (k = 0.5). Inserting those values into Equation (7-1) results in the 1996
SIPP Panel variance formula of

                                                        180
                          Var (θ0) ' {1/[180 ( 0.52]} j (θi & θ0)2.
                          Var        {1/       0.5

The 2004 SIPP Panel uses 120 replicate weights, which are calculated on the basis of a
perturbation factor of 0.5 (k = 0.5).

The Census Bureau used VPLX and SAS software to compute the replicate weights that are
available through Data FERRET and the SIPP FTP Site.


Using GVFs to Approximate Variance Estimates
The Census Bureau provides two forms for approximate variance estimation: GVFs and tables of
standard errors (the square root of the variance) for different estimated numbers and percentages.
The generalized estimates provide indications of the magnitude of the sampling error in the
survey estimates. They serve as convenient ways to summarize the sampling errors for a broad
variety of estimates.

The GVFs for SIPP were derived by modeling the standard error behavior of groups of estimates
with similar standard errors. The mathematical form of the function adopted is

                                     s ' (ax 2 % bx)1/2 ,
                                                 bx)                                      (7-2)

where s represents the standard error and x the value of an estimate. The parameters a and b are
derived on the basis of a selected group of estimates. They are updated annually and are included
in the source and accuracy statement that accompanies each SIPP data file for a panel. It is
essential to use the parameter estimates for a specific panel and to follow the instructions to
apply necessary adjustments to obtain the correct estimates for subgroups. Besides GVFs, the
Census Bureau provides summary tables of general standard errors. Those estimates are also
available in the source and accuracy statements. The following examples show how to use GVFs
to estimate the standard errors of estimated numbers and of sample means. The use of GVFs and
tables of standard errors is described in the source and accuracy statements for each panel.

Before looking at the examples, the user should note that the generalized variance estimates for
estimating the standard errors of other statistics may not be accurate for small subgroups. Using
the 1984 SIPP Panel, Bye and Gallicchio (1989) developed variance functions for participants of
Old Age, Survivors, and Disability Insurance (OASDI) and Supplemental Security Income (SSI)
programs. They found that for estimates of less than 10 million, the generalized standard error
estimates provided by the Census Bureau were 1.20 to 1.75 times larger than those obtained from
the variance functions developed specifically for that subgroup.
                                                7-4 

SIPP USERS’ GUIDE                                                           SAMPLING ERROR


Using GVFs for Standard Errors of Estimated Numbers

The approximate standard error, s, of an estimated number of persons (or households, and
families) can be obtained by the formula

                                         s ' (ax 2 % bx)1/2 ,

where a and b are the parameters associated with the estimate for the particular reference period,
and x is the weighted estimate. This equation is appropriate for the standard errors of estimated
numbers and should not be applied to estimates of dollar values.

Suppose that the number of households with monthly household income above $20,000 is
estimated from Wave 1 of the 2004 Panel to be 1,637,500. The approximate values of a and b
from Table 3 of the source and accuracy statement of the 2004 Panel are a = -0.00002809, and b
= 3,153. Then, the standard error, s, of this estimated number is given by


                    s ' [(&0.00002809 ( 1,637,5002) % (3,153 ( 1,637,500)]1/2,
                           0.00002809 637,500 (3,                637,500)]

                                                                                        (7-3)

The approximate 90 percent confidence interval for the estimated number can be computed as
x " 1.645 ∗ s, which ranges from 1,520,165 and 1,754,835. Therefore, a conclusion that the
average estimate derived from all possible samples lies within an interval computed in this way
would be correct for roughly 90 percent of all samples.


Using GVFs for the Standard Error of a Mean
A mean is defined here to be the average quantity of some characteristic (other than the number
of persons or households) per person or household. For example, a mean could be the average
monthly household income of females 25 to 54 years of age. The formula used to estimate the
standard error of a mean, ¯ is
                          x

                                                   b 2
                                          s¯ '
                                           x         s ,
                                                   y

                                                                                           (7-4)

where y is the size on which the estimate is based, s2 is the estimated population variance of the
                                               7-5 

SIPP USERS’ GUIDE                                                          SAMPLING ERROR

characteristic, and b is the parameter associated with the particular type of characteristic.
Because of the approximations used in developing this formula, an estimate of the standard error
of the mean obtained from this formula will generally underestimate the true standard error.

The estimated population mean, ¯ , and the population variance, s2, are given by the formulas:
                               x


                                                      j wi xi
                                                       n

                                                      i'1
                                                ¯
                                                x '
                                                       j wi
                                                        n


and                                                    i'1


                                    j wi (xi &x )             j wi (xi &x )
                                     n                        n
                                              ¯2                        ¯    2

                             s2 '   i'1
                                                       or     i'1


                                          j wi                    j w i &1
                                           n                        n

                                          i'1                     i'1


where there are n units with the item of interest, and wi is the final weight for the ith unit. (Note
     j w i ' y ). Suppose that, based on January of the 2004 data of the Wave 1, 2004 Panel, the
     n

that
mean monthly cash household income for females aged 25 to 54 is $5,826, the weighted number
of females in this age range is y = 62,346,000, and the population variance is estimated to be
64,900,000. When the appropriate b parameter of 3,153 from Table 3 of the Source and accuracy
statement for Panel 2004 is used, the estimated standard error of this mean is

                         s¯ ' [(3,153 ( 64,900,000) / 62,346,000]1/2 ' $57.
                          x   [(3,        ,90 ,00       ,34 ,00        $57.

Thus, the 90 percent confidence interval, computed

                                          x ± 1.645( sx.
                                          ¯ 1.645 ¯

ranges from $5,732 to $5,920. Therefore, a conclusion that the average estimate derived from all
possible samples lies within an interval computed in this way would be correct for roughly 90
percent of all samples.

Variance Estimation with Imputed Data
Imputation methods are used to fill in several types of missing data in SIPP. They are used to
complete some item nonresponse, person-level nonresponse within households (Type Z
nonresponse), and some wave nonresponse (intermittent responses bounded by two responding
waves). Imputation fills in gaps in the data set and makes data analyses easier. It also allows
                                                      7-6 

SIPP USERS’ GUIDE                                                        SAMPLING ERROR

more people to be retained as panel members for longitudinal analyses. The concern, however, is
that imputation fabricates data to some degree. Treating the imputed values as actual values in
estimating the variance of survey estimates leads to an overstatement of the precision of the
estimates (Brick and Kalton, 1996). It is important to recognize this fact when sizable
proportions of values are imputed.


                                             7-7