Data Confrontation Seminar

May 23-24, 2011 , Kaunas University of Technology

by Oliver Lipps (Swiss Foundation for Research in Social Sciences, FORS), Brita Dorer (GESIS – Leibniz-Institute for the Social Sciences)

Oliver Lipps, FORS (Lausanne, Switzerland)
REPORT “Data Quality Challenges in Social Science Surveys”


Workshop Report

0. Introduction
1. The total error approach
2. Population – Sampling – Survey Frequencies
3. Measurement and Data Collection
4. Data Quality Enhancement
5. Data analysis issues

  0. Introduction

The topic of my part of the workshop was “Data Quality Challenges in Social Science Surveys”. Students often use data from large social surveys without taking into account that the data contain error. Data, once available, are taken as the truth and are expected to represent the target population perfectly. An additional issue is that data users often do not acknowledge the complex structure of the data. Standard, default analysis models are frequently used, which can both bias the estimates and overstate their precision. The latter often leads to rejecting the null hypothesis too quickly in favour of the alternative.

The idea of this part of the workshop was to sensitize students to the fact that data are not error-free and should be used with care. A further aim was to motivate the use of some complex analysis models. Although data quality and data analysis as a general topic is too broad to be covered in a 1 1/2 day workshop, I nevertheless tried to cover the most important issues related to the quality of large social science data. The general structure followed the production of a large social survey, explaining which errors occur at which stage of survey development, followed by an example of analysing complex data (multilevel modelling).

Neither part of the workshop was intended as a lecture on Survey Methodology and Data Quality. Rather, active involvement of the participants was intended. The questions asked and the feedback given suggest that students were interested and became sensitized, to some extent, to the most important possible errors in survey data.

The high heterogeneity of the audience was something of a problem: while some students were quite familiar with the topic, others had trouble understanding simple notation and concepts. For future seminars, it would certainly help to ensure that the level of the participants is not too diverse.
Appended are the slides which I used for my presentation (May 23 and May 24 afternoon), each with a brief title.

  1. The total error approach

I started the workshop with alternatives to designing
and conducting one's own survey:

Then, the concept and basic development of the
total survey error were introduced:

Dependencies of these two error sources on
sample size and sampling efforts are given:

Next, I presented the organizational structure
necessary when conducting a survey:

The next slide distinguished the sampling error
from nonsampling error:

Error components known by the 1940s:

Deming wondered about the reliability of surveys,
given so many errors. Amazingly up-to-date …:

Processing errors - a new source of error in the
TSE concept:

Summary: Steps of a survey and errors involved:

Summary of the effects (in today’s terminology) that
were specified and those that were not:

Further development of the TSE by Groves in
his book from 1989:

The state of the art of the TSE concept at the
beginning of the 2000s:

  2. Population – Sampling – Survey Frequencies

Definition and characterisation of target population
vs. survey population:

What information should a sampling frame include:

Guidelines for sampling frame selection:
Definition and role of and examples for the sampling
frame. Discrepancies between frame listed and target

Wishlist for a sampling frame:

Distinction between probability (random) and
subjectively chosen samples. How can unbiasedness
and precision of a survey statistic (e.g., the mean)
be interpreted:
Relationship between precision and unbiasedness
of estimates and sample design and sample size.
Examples with random samples of different sizes:

Relationship between sample size and population
size for a defined precision:
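
As a rough illustration of this relationship, the following sketch computes the sample size needed to estimate a proportion within a given margin of error for populations of different sizes. It assumes simple random sampling and the standard finite population correction; the function name, the 95% z-value of 1.96, the worst-case proportion p = 0.5 and the population sizes are illustrative assumptions, not values taken from the slides.

```python
import math


def required_sample_size(population_size, margin=0.05, z=1.96, p=0.5):
    """Sample size for estimating a proportion under simple random sampling."""
    # Size needed for an (effectively) infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction: small populations need fewer cases.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


if __name__ == "__main__":
    for size in (1_000, 10_000, 100_000, 3_000_000):
        print(size, required_sample_size(size))
```

Under these assumptions the required sample size hardly grows once the population is in the tens of thousands, so precision depends mainly on the sample size and not on the population size.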

Characteristics of and examples for probability
(random) samples:

Graphical example of a simple random sample
(often the standard for sample size calculation,
e.g., in the ESS):

Easy calculation example of a stratified random
sample with four strata:
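
The slide's own numbers are not reproduced here. As a stand-in, the sketch below works through a stratified mean with four invented strata, each sampled with the same number of cases although the strata differ in population size. All figures and function names are hypothetical.

```python
# Hypothetical strata: (population size N_h, sample size n_h, sample mean).
STRATA = [
    (4000, 25, 20.0),
    (3000, 25, 30.0),
    (2000, 25, 40.0),
    (1000, 25, 50.0),
]


def stratified_mean(strata):
    """Weight each stratum mean by its population share N_h / N."""
    total_population = sum(pop for pop, _, _ in strata)
    return sum(pop / total_population * mean for pop, _, mean in strata)


def pooled_sample_mean(strata):
    """Naive mean that ignores the design and pools all sampled cases."""
    total_sample = sum(n for _, n, _ in strata)
    return sum(n * mean for _, n, mean in strata) / total_sample


if __name__ == "__main__":
    print("stratified estimate:", stratified_mean(STRATA))
    print("pooled sample mean: ", pooled_sample_mean(STRATA))
```

With these invented figures the two estimates differ because the small strata are over-represented in the sample; weighting by population shares corrects for that.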

Graphical example of a stratified random sample:

Graphical example of a proportional random sample:
Easy calculation example of a proportional random
sample versus simple random sample: different
estimation of the mean:

Graphical illustration of a repeated cross-sectional

Graphical interpretation of age, time, and cohort
effects, identification problem:

Time in surveys: who is surveyed how often?
Cross-sectional versus longitudinal:

Graphical illustration of a panel survey with
individual dynamics:

Types of and examples for longitudinal surveys:

Advantages of panel surveys
(over repeated cross-sections):

Model types specifically designed for the analysis of panel data:

Problems of panel surveys
(especially compared with repeated cross-sections):


  3. Measurement and Data Collection

Characterization of measurement error and
sources of measurement errors:

Possibilities to avoid errors by testing of different

Possible error sources of different survey modes:

Possible error sources in the questionnaire design:

Survey modes: most important differences and
Different usages of self-enumerated and
interviewer-based modes:

Difference between face-to-face and telephone

(easy) example and anticipated response rates
of a sequential mixed mode design:
Different usages of computer-assisted and
paper-based modes:

What has to be considered in mixed mode

Summary: mode effects on measurement: what
should be done about it:


  4. Data Quality Enhancement

General issues suitable to improve data quality:

Most important findings from the effects of
incentives on response:

Types of nonresponse: unit and item. Possible
solutions to nonresponse: Weighting and imputation:

Data quality enhancement examples from the ESS:

Nonresponse and common nonresponse selection
“Missing at Random” concept:

Memory and interviewer errors:

Use of paradata for nonresponse error analyses
and response enhancement:

Definition of and examples for paradata:

Use of paradata in a responsive design and to
monitor fieldwork progress:
Use of paradata for measurement error

Interviewer characteristics and job description,
burden of interviewers:

Definition of imputation and weighting. Methods
of and problems with imputation:

 Design weights example for nonresponse:

Interviewing: a hard and demanding job:

Interviewer effects: respondent satisficing and
obtaining socially desired answers. Interviewer
characteristics correlated with such errors:

Design and calibration weighting:
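
To make the two weighting steps concrete, here is a minimal sketch. It assumes that the design weight is the inverse of the inclusion probability and uses post-stratification as one simple form of calibration, rescaling the design weights so that weighted group counts match known population counts. The respondents, groups and population counts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical respondents with group membership and inclusion probability.
RESPONDENTS = [
    {"group": "women", "incl_prob": 0.002},
    {"group": "women", "incl_prob": 0.001},
    {"group": "men", "incl_prob": 0.002},
    {"group": "men", "incl_prob": 0.002},
    {"group": "men", "incl_prob": 0.001},
]

# Hypothetical known population counts used as calibration targets.
POPULATION = {"women": 2100, "men": 1900}


def design_weights(respondents):
    """Design weight: inverse of the inclusion probability."""
    return [1.0 / r["incl_prob"] for r in respondents]


def poststratified_weights(respondents, population):
    """Rescale design weights so weighted group counts hit the targets."""
    design = design_weights(respondents)
    weighted_count = defaultdict(float)
    for r, d in zip(respondents, design):
        weighted_count[r["group"]] += d
    return [
        d * population[r["group"]] / weighted_count[r["group"]]
        for r, d in zip(respondents, design)
    ]


if __name__ == "__main__":
    for r, w in zip(RESPONDENTS, poststratified_weights(RESPONDENTS, POPULATION)):
        print(r["group"], round(w, 1))
```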


  5. Data analysis issues

Software overview to manage and analyze large

Types of data analyses:

Steps of data treatment for survey research

Software used at FORS:

Level of measurement of variables
(determines methods and models):

Easiest test: contingency (cross) tabulation:
T-Test: equivalence of means of two continuous

Components and interpretation of the general
linear model:

Components of the T-Test:

Characteristics of a linear regression:

Error in a linear regression:
Empirical example of observation points from
a social survey:

Calculation of the regression coefficients:
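
For a bivariate regression, the coefficients can be computed by hand from sums of squares and cross-products. The sketch below shows this for ordinary least squares with one predictor; the data points and the function name are made up and are not the slide's example.

```python
def ols_coefficients(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Hypothetical observation points.
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]
    intercept, slope = ols_coefficients(x, y)
    print(f"y = {intercept:.2f} + {slope:.2f} * x")
```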

From bivariate to multivariate regression:
empirical example:
How can a regression line be drawn through
the observation points:

Components of a linear regression:

Assumptions for linear regression:
Endogeneity: causes for misspecification:

Inference calculation example: changes of
standard error with different sample sizes:
Inference from linear regression coefficients:
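
To illustrate how the standard error changes with sample size, the sketch below uses the simplest case, the standard error of a mean, which equals the standard deviation divided by the square root of the sample size. The standard deviation of 10 and the sample sizes are assumed values, and the slide's own example may have used a regression coefficient instead.

```python
import math


def standard_error_of_mean(std_dev, n):
    """Standard error of a sample mean under simple random sampling."""
    return std_dev / math.sqrt(n)


if __name__ == "__main__":
    std_dev = 10.0  # hypothetical standard deviation of the variable
    for n in (100, 400, 1600):
        print(n, standard_error_of_mean(std_dev, n))
```

Quadrupling the sample size only halves the standard error, which is why gains in precision become increasingly expensive.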

How can the precision of regression coefficients
be increased:

OLS assumption: homoscedasticity:

Summary: Assumptions in OLS:
OLS assumption: no autocorrelation:

Next level: nonlinear regression:
Introduction to multilevel modeling: when is
ML necessary?

Advantages of ML models:

Design effect with clustered data
(e.g., respondents in interviewer sampling point):
Examples for levels in clustered data. Differences
between levels and variables:
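
The design effect for clustered data can be sketched with Kish's approximation, deff = 1 + (m - 1) * rho, where m is the average number of respondents per cluster (for example per interviewer) and rho is the intraclass correlation. The approximation assumes clusters of roughly equal size; the sample size, cluster size and rho below are invented.

```python
def design_effect(cluster_size, intraclass_corr):
    """Kish's approximate design effect for equally sized clusters."""
    return 1 + (cluster_size - 1) * intraclass_corr


def effective_sample_size(n, cluster_size, intraclass_corr):
    """Size of a simple random sample with the same precision."""
    return n / design_effect(cluster_size, intraclass_corr)


if __name__ == "__main__":
    # Hypothetical survey: 1500 respondents, 15 per interviewer, rho = 0.05.
    print(design_effect(15, 0.05))
    print(effective_sample_size(1500, 15, 0.05))
```

Even a small intraclass correlation reduces the effective sample size noticeably when clusters are large, and a single-level model that ignores the clustering reports standard errors that are too small.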

Graphical example for differences of (wrong)
single level and (right) ML model:

Graphical example for (wrong) assumption of
between effects when within-effects are more relevant:

First step towards a ML model: variable intercept:

Graphical interpretation and formula of easiest
ML model:
Next step towards a ML model: assume the
intercepts follow a normal distribution
(units have randomly varying levels):

Easy calculation example: between and within



Campanelli, P., P. Sturgis and S. Purdon (1997). Can you hear me knocking: An investigation into the Impact of Interviewers on Survey Response Rates. The Survey Methods Centre at the Social and Community Planning Research, London.

Groves, R. (1989). Survey Errors and Survey Costs. Wiley, New York.

Groves, R., F. Fowler, M. Couper, J. Lepkowski, E. Singer and R. Tourangeau (2004). Survey Methodology. Wiley Series in Survey Methodology.

Groves, R. and L. Lyberg (2010). Total Survey Error: Past, Present, and Future. Public Opinion Quarterly 74 (5): 849–879.

Krosnick, J.A. (1991). Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys. Applied Cognitive Psychology 5: 213–236.

Nicolaas, G. (2011). Survey Paradata: A review. ESRC National Centre for Research Methods Review paper, Jan 2011.

Statistics Canada (2003). Survey methods and practices. Statistics Canada, Ottawa (also online: Statistics Canada Quality Guidelines (accessed 17/5/2011:


© KTU Policy and Public Administration Institute
Updated 2012-02-23