Multiple recent linkage efforts have produced longitudinal individual-level data from the censuses of 1850-1940. In addition, within the Census Bureau's restricted data program, respondents can be linked anonymously between the censuses of 1940, 2000, 2010, and 2020. Researchers are already using parts of these infrastructures together to conduct long-term intergenerational studies. In this paper we describe the various options for accessing, linking, and conducting research that uses both the historical and modern linked data as a centuries-long panel. We assess the sample sizes, representativeness, and cohort coverage of multigenerational panels created with existing linked census resources over the 1850-2020 period.
The U.S. Census Bureau maintains a large longitudinal research infrastructure that currently includes linked data from the 1940 census, the 2000-2010 censuses, major national surveys going back to 1973, and administrative records dating back to the 1990s. These restricted data are accessible to researchers around the U.S. via the Federal Statistical Research Data Centers (FSRDC) network. The major shortcoming of this infrastructure is that it lacks linkable files from the decennial censuses of 1950 through 1990. Full-count microdata from these censuses are available for research, but datasets from these years do not include respondent names and therefore have not been linked over time. Respondent names for these censuses are available only via the original census returns, which are stored on 258,000 reels of microfilm.
The Decennial Census Digitization and Linkage project (DCDL) is an initiative to recover names from the 1960-1990 censuses and to produce linked restricted microdata files for research use. We describe the results of a pilot project we completed on the 1990 census. For that pilot, we created digital images from census microfilm, hand-keyed "truth data" from those images, supported two teams' attempts to conduct handwriting recognition on the images, appended recovered names to existing microdata files, and linked the new 1990 census microdata records to previous and subsequent censuses. We describe our processes, the accuracy of the handwriting recognition, and the accuracy of the record linkage with the recovered names. We conclude with an update on the recently initiated project to carry out these processes at production scale for the 1960 through 1990 censuses.
When combined with existing linkages among the censuses of 1940, 2000, and 2010, the soon-to-be-public 1950 census, and the future 2020 census, DCDL will provide the final component in a massive longitudinal data infrastructure that covers most of the U.S. population since 1940. As a multi-purpose statistical tool, the DCDL will further the U.S. Census Bureau's mission to provide high-quality data on the U.S. population and support cutting-edge research in the FSRDC network. The resulting data resource will expand our understanding of population dynamics in the U.S. far beyond what is currently possible, providing transformational opportunities for research, education, and evidence-building across the social, behavioral, and economic sciences.
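The pilot above reports handwriting-recognition accuracy measured against hand-keyed truth data, but the abstract does not say which metric was used. One common choice is the character error rate (CER): the total edit distance between recognized and truth strings divided by the number of truth characters. The following is a minimal sketch under that assumption; the function names and the sample names are hypothetical and are not taken from the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """CER over (truth, recognized) string pairs.

    Assumed metric, not necessarily the one used in the pilot.
    """
    pairs = list(pairs)
    total_edits = sum(edit_distance(truth, rec) for truth, rec in pairs)
    total_chars = sum(len(truth) for truth, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical hand-keyed names paired with recognizer output.
sample = [("SMITH", "SMYTH"), ("JONES", "JONES")]
cer = character_error_rate(sample)
```

A name-level exact-match rate is a stricter alternative and may matter more when the recovered names are used as linkage keys.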
Introduction
Population data capture children, parents, relatives, and others moving in and out of households. The U.S. has seen falling marriage rates and increases in multigenerational households and complex families, in young children living with grandparents, and in adult children living with parents. Robust parent-child linkages are critical to understanding these demographic shifts.
Objectives and Approach
We construct and validate parent-child linkages over a century to observe how U.S. households are changing over time. The three largest person-based data files in the U.S. are the decennial censuses, the Social Security Administration transaction file, and individual tax returns from the Internal Revenue Service. These sources operationalize relationships differently, capture data at various frequencies, and gather the data for different purposes. We use probabilistic matching to observe and reconcile parent-child relationships across these sources. The data include a variety of personal identifiers, including name, date of birth, parents' names, address, and place of birth, that support matching and validation.
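The abstract does not specify which probabilistic matching model is used. As an illustration only, the sketch below scores a record pair in the style of the classic Fellegi-Sunter model: each compared field adds log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The field names, the m and u values, and the thresholds are all hypothetical.

```python
import math

# Hypothetical (m, u) probabilities per field; in practice these would
# be estimated from the data (for example with EM), not hard-coded.
FIELD_PARAMS = {
    "first_name": (0.95, 0.01),
    "last_name": (0.90, 0.005),
    "birth_year": (0.98, 0.02),
    "birthplace": (0.90, 0.05),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum of per-field log2 likelihood ratios for one record pair."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence
        if a == b:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Three-way decision with illustrative (assumed) thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Hypothetical records: same person, surname changed between sources.
census = {"first_name": "MARY", "last_name": "OLSEN",
          "birth_year": 1921, "birthplace": "MN"}
tax = dict(census, last_name="BERG")
decision = classify(match_weight(census, tax))
```

Because the score accumulates evidence across fields, a disagreement on a single identifier such as surname lowers the weight without necessarily ruling out a match.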
Results
We find that understanding the content, consistency, and coverage of the files before matching is critical for high-quality linkages. The representativeness of the parent-child relationship file improves over time, with the weakest coverage for the Greatest Generation and the strongest coverage for Millennials. Coverage varies by source: tax data underrepresent non-white children and have duplicate records for SSNs, while names and dates of birth are missing from Census data. Multiple match rates differ among demographic groups and over time. In the matching process, the blocking variables are drawn from variables common to all of the population datasets. Our approach provides robust entity resolution for women, despite married-maiden name changes. We describe challenges due to data problems in old census records and validation changes in Social Security data.
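Blocking restricts comparisons to record pairs that share a key built from variables present in every file, which avoids comparing all records against all others. The sketch below is an assumption-laden illustration, not the authors' blocking scheme: it keys on birth year and first initial, deliberately leaving surname out of the key so that a married-maiden name change does not push the same person into different blocks. The field names and sample records are hypothetical.

```python
from collections import defaultdict


def blocking_key(rec):
    """Hypothetical key: birth year plus first initial (no surname)."""
    return (rec["birth_year"], rec["first_name"][:1].upper())


def candidate_pairs(file_a, file_b, key=blocking_key):
    """Yield only the pairs (a, b) whose blocking keys are equal."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[key(rec_b)].append(rec_b)  # index file_b by key
    for rec_a in file_a:
        for rec_b in blocks.get(key(rec_a), []):
            yield rec_a, rec_b


# Hypothetical inputs from two sources.
file_a = [
    {"id": "a1", "first_name": "Mary", "last_name": "Olsen", "birth_year": 1921},
    {"id": "a2", "first_name": "John", "last_name": "Reed", "birth_year": 1950},
]
file_b = [
    {"id": "b1", "first_name": "Mary", "last_name": "Berg", "birth_year": 1921},
    {"id": "b2", "first_name": "Mark", "last_name": "Hale", "birth_year": 1921},
    {"id": "b3", "first_name": "Jane", "last_name": "Reed", "birth_year": 1950},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(file_a, file_b)]
```

The trade-off is that a coarse key such as this one admits more candidate pairs per block; the pair-level comparison step is then responsible for separating true matches from coincidental key collisions.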
Conclusion/Implications
We conduct a successful reconciliation of parent-child relationships in U.S. population-level files. The project supports operational and research uses, such as the 2020 Census. We will extend this work using graph matching and will expand the method to validate other relationship links, including spouses and siblings.
ICPSR is building LinkageLibrary, a repository and community space for researchers involved in linking and combining datasets, as a collaboration between social, statistical, and computer scientists. Unlike surveys or experiments, where causal and outcome variables are measured in tandem, organic, non-design data often must be linked to other measures before they can be analyzed. This makes linkage methodologies particularly important when conducting analyses using administrative data. A common benchmarking repository of linkage methodologies will propel the field to the next level of rigor by facilitating comparison of different algorithms, understanding which types of algorithms work best under different conditions and problem domains, promoting transparency and replicability of research, and encouraging proper citation of methodological contributions and their resulting datasets. It will bring together the diverse scholarly communities (e.g., computer scientists, statisticians, and social, behavioral, economic, and health (SBEH) scientists) who are currently addressing these challenges in disparate ways that do not build on one another's work. Improving linkage methodologies is critical to the production of representative samples, and thus to unbiased estimates of a wide variety of social and economic phenomena. The repository will accelerate the development of new record linkage algorithms and evaluation methods, improve the reproducibility of analyses conducted on integrated data, allow comparisons on the same and on different data, and advance the provision of privacy-aware integrated data. The presentation will focus on lessons learned while building the repository and the community, and will introduce the LinkageLibrary website.
Introduction
Access to real data with diverse attributes is critical for effective development of any data analytic algorithm. Benchmarking data repositories have been vital to the development of research communities focused on algorithm development. This work reports on the development of such a data repository for record linkage.
Objectives and Approach
Establishing a common benchmarking repository of real data can propel a field to the next level of rigor by facilitating comparison of different algorithms, understanding which types of algorithms work best under certain real data conditions and problem domains, promoting transparency and replicability of research, and creating incentives for proper citation of contributions. In addition, benchmarking repositories can bring together diverse stakeholders (e.g., computer scientists, statisticians, data custodians, and data users, including social, behavioral, economic, and health (SBEH) scientists) who together can advance the field more effectively than could researchers from any single discipline.
Results
In Fall 2016, international leaders in record linkage formed a Data Linkage Repository workgroup (DLRep) to establish a benchmarking data repository for record linkage. The workgroup is working in collaboration with the Inter-university Consortium for Political and Social Research (ICPSR) to host the data repository, which is planned for release in Summer 2018. The repository for record linkage research will house various types of real data that require linking, along with metadata, unique handles for citations, proposed algorithms for evaluation criteria, and a platform for posting, sharing, and comparing results as well as citations of relevant papers. Some datasets will have a published gold standard against which researchers can evaluate their results. For other datasets, results will be gathered to build the gold standard as a community.
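Evaluating linkage results against a published gold standard typically comes down to comparing two sets of record-pair links. The abstract does not name the repository's evaluation criteria, so the sketch below assumes the widely used pairwise precision, recall, and F1; the function name and the sample links are hypothetical.

```python
def evaluate_links(predicted, gold):
    """Pairwise precision, recall, and F1 for a set of predicted links.

    Each link is a hashable (id_in_file_a, id_in_file_b) tuple.
    These metrics are an assumed choice, not the repository's own.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # links found in both sets
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold standard and algorithm output.
gold = {(1, "a"), (2, "b"), (4, "d"), (5, "e")}
predicted = {(1, "a"), (2, "b"), (3, "x")}
scores = evaluate_links(predicted, gold)
```

Reporting all three numbers matters for comparison across algorithms, since a method can trade missed links (lower recall) for fewer false links (higher precision) simply by moving its decision threshold.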
Conclusion/Implications
Record linkage methodology is important in any domain where data need to be integrated from multiple sources, and such domains span diverse disciplines. Establishing an international, interdisciplinary research community around a benchmark data linkage repository to validate and compare linkage algorithms is crucial to fully realizing the social benefits of data about people.