Skip to Main Content

Data Cleaning

Provides support on how to clean data and suggests activities to practice

Components of Data Cleaning

Data cleaning is the process of preparing data for analysis. Cleaning will correct errors so that analysis will be smoother, less error-prone, and more accurate.

When you're data cleaning, you may encounter needing to do the following things in order to "tidy" your data:

  • spreadsheet data is machine readable (so it can be imported into other software tools)--for example, the first row is for column headers and there are no extra or merged rows or columns prior to data table; one data table per spreadsheet
  • removing out-of-date or irrelevant data ("noisy data")
  • handling corrupted files or corrupted values (sometimes data you export will "explode," causing some lengthy values to be pushed into and across into the wrong columns for that row)
  • handling blank values or line breaks
  • handling multiple values (cells should each only have 1 value)
  • checking for incomplete values
  • standardizing data type per variable ("data validation" helps ensure that numbers and text should not be in the same column)
  • correcting date formats
  • removing duplicate rows
  • removing symbols that didn't translate well when exported
  • checking for outliers
  • removing accidental blank spaces added before or after the actual intended value
  • checking for capitalization or spelling or vocabulary consistencies ("normalizing data" removes inconsistencies/mismatches)
  • re-organizing data such as reshaping it ("transforming / wrangling / munging data")
  • de-identifying data
  • creating bins for ranges/categories or creating codes for categorical data
  • merging datasets together or appending data ("harmonizing data")

Variable Harmonization

Data harmonization is the process by which you create a single integrated dataset out of two or more separate datasets, making the variables included measure the same thing (such as the same units or the same geographical boundaries) so that you can conduct comparative statistical analysis. The purpose of harmonizing data is for comparability.

If you'd like to learn about harmonization, CESSDA has created a four part recorded webinar (1 hour total). Topics include: an introduction to social science variable harmonization, ex ante and ex post harmonization strategies, examples from international social science surveys (including the European Values Survey), the use of international standard classifications, and how to use metadata to construct FAIR variable harmonization documentation.

Quality Check

Once you think you have finished cleaning your data, do a final review to check whether the data seems in order and ready for analysis.

Ask

  • Does the data make sense?
  • Is the data standardized and consistent?
  • Is the data in a format that is suitable for analysis? (Both as a file format that your analysis software can import and with meaningful columns laid out)

Here are some characteristics of quality data:

  1. Data is valid and/or reliable (complies with definitions, values would come out to be consistent over time if repeated)
  2. Data is accurate and/or credible (close to the true values)
  3. Data is complete (all required data is known)
  4. Data is consistent (dataset has considered missing, incomplete, and invalid values, as well as standardized units)
  5. Data is confirmable (data doesn't reflect researcher bias)
  6. Data is transferable (there can be applicability in other contexts beyond your own study)

Practice Data Cleaning