Purpose of Cleaning and Data Type Checks
Learn about the purpose of data cleaning and about data type checks.
Purpose of data cleaning
The purpose of data cleaning is to make sure that the data is correct. It’s rarely the case that once data is collected through the game and transferred to the server, it’s automatically ready for analysis. Often, the data is incomplete, has wrong entries, or contains outliers. Thus, it’s important to check and, when possible, correct data to prepare it for analysis.
For example, given VPAL data, each row follows a certain format with variables in a certain order and type. Position and orientation variables are all expected to be integers within certain ranges, time stamps are expected to follow a specific order with increments of 0.2 seconds, and scores and health are supposed to follow certain ranges and order. To ensure that the data is correct, it needs to go through a series of checks. These checks can be done while parsing and reading the data. When errors are encountered, we can use NaN or NA to signify an error. The checks can also be done after reading and parsing the content into a data table (or data frame). Below, we discuss different methods used for type checks, range checks, etc.
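As a rough illustration of the time stamp rule mentioned above, the sketch below checks that consecutive samples are 0.2 seconds apart. The vector time and its values are made up for the example and are not taken from VPAL data; the tolerance of 1e-8 is an assumption to absorb floating-point rounding.

```r
# Hypothetical simulation time stamps in seconds; one sample is missing
# between 0.4 and 0.8.
time <- c(0.0, 0.2, 0.4, 0.8, 1.0)

# Difference between each time stamp and the one before it.
step <- diff(time)

# Flag steps that are not (approximately) 0.2 seconds.
bad_step <- abs(step - 0.2) > 1e-8

# Index of each row that follows an irregular step.
bad_rows <- which(bad_step) + 1
print(bad_rows)
```

A comparison such as step == 0.2 would be fragile here, because decimal fractions like 0.2 are not stored exactly as floating-point numbers.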
Data type checks
There are many ways to check data format. When parsing the data, we can check the type or restrict the data to a specific type, and if that fails, we can generate an exception. In R, we can use different functions to check types after we read the data. For example, is.numeric is a function that returns TRUE or FALSE depending on whether the variable is numeric. In most cases, we would want to introduce an NA for cells that do not contain the correct format or type. Suppose we are looking at numeric data. In that case, we can use the as.numeric function, which converts a variable or a column in the data frame to numeric values (i.e., real numbers); for cells that cannot be converted, it introduces an NA and issues a warning. There are also functions to convert to other data types: as.logical, as.factor, as.character, and as.integer.
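The difference between checking a type and converting to it can be sketched as follows. The vector raw_x is a made-up example of a column that was read in as text; it does not come from the lesson's data set.

```r
# Hypothetical raw column as it might arrive from a log file: all text.
raw_x <- c("12", "7", "abc", "3.5", "")

# Type checks only report the current type; they change nothing.
print(is.numeric(raw_x))    # the vector is character, so this is FALSE
print(is.character(raw_x))  # TRUE

# as.numeric() converts; entries that cannot be parsed become NA.
# R warns about the coercion, which is silenced here on purpose.
x <- suppressWarnings(as.numeric(raw_x))

print(is.numeric(x))        # TRUE after conversion
print(which(is.na(x)))      # positions of the entries that failed to convert
```

Counting the NA values after conversion is a quick way to see how many cells in a column were not in the expected format.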
Data format checks and conversions
In addition to the type issue discussed above, we may also have dirty data, meaning that the ranges or values may not be right. This is not just an issue of a quick type check but requires more involved checking on ranges, given the measurement type and actual variable. This also requires some knowledge from designers about the ranges for the different variables represented in our data.
Categorical data
For categorical data, the problem can be as simple as a value that is not one of the expected categories. When we have categorical data that we want to constrain to a specific list of categories, we can enforce that constraint. In R, we use factors to denote that type, and we can restrict a factor variable to a specific set of categories (its levels). If a cell holds a value that isn’t among those categories, an NA is generated in that cell.
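A minimal sketch of this constraint uses the levels argument of factor(). The column raw_action and the list of allowed categories are hypothetical and chosen only for illustration.

```r
# Hypothetical column of player actions; "jmp" is a typo in the raw data.
raw_action <- c("walk", "jump", "jmp", "run", "walk")

# The only categories we accept (an assumed list for this example).
allowed <- c("walk", "run", "jump")

# Values that are not among the given levels become NA.
action <- factor(raw_action, levels = allowed)

print(action)
print(table(action, useNA = "ifany"))  # counts per category, including NA
```

Comparing raw_action with action at the positions where is.na(action) is TRUE shows which original values were rejected.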
Numeric data
For numeric data, we need to encode manual checks on values based on logical limits or the designers’ designated ranges for each variable. For example, health cannot be negative and cannot go above 100. As with categorical variables, if a value is not right, a NaN or NA is introduced in that cell, and the process continues.
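The health example can be sketched as a manual range check. The values in health are invented for the example, and the limits 0 and 100 follow the rule stated above.

```r
# Hypothetical health readings; -5 and 142 fall outside the valid range.
health <- c(100, 87, -5, 142, 60)

# TRUE where a value is present but outside [0, 100].
out_of_range <- !is.na(health) & (health < 0 | health > 100)

# Replace invalid values with NA and keep going.
health[out_of_range] <- NA

print(health)
```

Keeping the out_of_range vector around makes it possible to report how many cells were replaced for each variable.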
Time stamps
Time stamps can be represented in several ways. One way is to store simulation time, as discussed above. This is, in essence, how it is represented in VPAL. However, most other games use standard time. In R, times and dates can be read into a POSIXlt object; as with other type conversions, an NA is used if the value cannot be converted.
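As a small sketch of reading standard time into a POSIXlt object, the example below uses strptime() with an explicit format. The strings in raw_time, the format, and the UTC time zone are assumptions made for the example, not details of any particular game's log.

```r
# Hypothetical date-time strings; the last one is not a valid time.
raw_time <- c("2021-03-14 15:09:26", "2021-03-14 15:09:27", "not a time")

# strptime() returns a POSIXlt object; strings that do not match the
# format become NA.
stamp <- strptime(raw_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(class(stamp))
print(is.na(stamp))   # TRUE marks the entries that could not be converted
print(stamp$hour[1])  # POSIXlt exposes components such as the hour
```

Giving the format explicitly avoids relying on R to guess the layout of the strings.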
Once the data is checked for type and for consistency in terms of its range and values, the data is deemed technically correct and ready to be passed on to the next step.