- R Data Science Essentials
- Raja B. Koushik Sharan Kumar Ravindran
- 2025-04-04 20:23:12
Data preprocessing techniques
The first step after loading the data into R is to check for possible issues such as missing data and outliers; the preprocessing operations are then chosen depending on the analysis. In almost any dataset, missing values have to be dealt with, either by excluding them from the analysis or by replacing them with a suitable value.
To make this clearer, let's use a sample dataset to perform the various operations. We will use a dataset about India named IndiaData. You can find the dataset at https://github.com/rsharankumar/R_Data_Science_Essentials. We will perform preprocessing on the dataset:
data <- read.csv("IndiaData.csv", header = TRUE)
# 1. Check the number of NA values
sum(is.na(data))
[1] 6831
After reading the dataset, we use the is.na function to identify the presence of NA in the dataset, and then, using sum, we get the total number of NAs present in the dataset. In our case, we can see that a large number of cells contain NA. We can either replace the NAs with the mean value or remove the rows that contain them.
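Before choosing between replacement and removal, it can help to see where the NAs are concentrated. The following is a minimal sketch using only base R; the toy data frame and its values are invented for illustration and merely stand in for IndiaData.

```r
# Toy data frame standing in for IndiaData (values are invented)
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  X2011  = c(10, NA, 30, 40),
  X2012  = c(NA, NA, 35, 45)
)

# Total number of NA cells in the whole data frame
total_na <- sum(is.na(toy))

# Number of NA cells in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(toy))

print(total_na)
print(na_per_column)
print(rows_with_na)
```

Note that sum(is.na(...)) counts cells, not rows, so a per-column or per-row breakdown like this one gives a better idea of how much data each strategy would affect.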
The following loop replaces the NAs with the column mean for all the numeric columns. The numeric columns are identified by the sapply(data, is.numeric) expression. For each of these columns, we locate the cells that have the NA value and fill them with the column mean, which is computed using the mean function with the na.rm=TRUE parameter so that the NA values are excluded from the computation:
for (i in which(sapply(data, is.numeric))) {
  data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
}
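To see what this loop does, here is a small worked example on a made-up data frame (the toy values are assumptions for illustration, not taken from IndiaData). The non-numeric column is left untouched, and each NA is filled with the mean of the observed values in its column.

```r
# Toy data frame with one character column and two numeric columns (invented values)
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  X2011  = c(10, NA, 30, 50),
  X2012  = c(NA, 20, 40, NA),
  stringsAsFactors = FALSE
)

# Positions of the numeric columns; the character column is skipped
numeric_cols <- which(sapply(toy, is.numeric))

for (i in numeric_cols) {
  # Mean of the observed (non-NA) values in this column
  col_mean <- mean(toy[, i], na.rm = TRUE)
  # Fill only the NA cells of this column
  toy[is.na(toy[, i]), i] <- col_mean
}

print(toy)
```

Here both numeric columns have an observed mean of 30, so every NA becomes 30. Mean replacement keeps all rows but reduces the spread of the column, which is worth keeping in mind for later analysis.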
Alternatively, we can also remove all the NA rows from the dataset using the following code:
newdata <- na.omit(data)
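Because na.omit drops every row that has at least one NA, it can discard a lot of data. A quick sketch on an invented toy data frame shows how to check how many rows are lost and which ones:

```r
# Toy data frame (invented values); rows 2 and 3 each contain one NA
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  X2011  = c(10, NA, 30, 40),
  X2012  = c(15, 25, NA, 45)
)

cleaned <- na.omit(toy)

rows_before <- nrow(toy)
rows_after  <- nrow(cleaned)

# na.omit records the indices of the dropped rows in the "na.action" attribute
dropped <- as.integer(attr(cleaned, "na.action"))

print(rows_before - rows_after)
print(dropped)
```

If the number of dropped rows is a large share of the dataset, replacing the NAs may be the better option.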
The next major preprocessing activity is to identify the outliers and deal with them. We can identify the presence of outliers in R by making use of the outliers package. Its outlier function works only on numeric columns, so let's consider the preceding dataset, in which the NAs were replaced by the mean values, and identify the presence of an outlier using the outlier function. Then, we get the location of all the outliers using the which function and, finally, we remove the rows that had outlier values. First, we install and load the package:
install.packages("outliers")
library(outliers)
We identify the outliers in the X2012 column, which can be subsetted using the data$X2012 command:
outlier_tf <- outlier(data$X2012, logical = TRUE)
sum(outlier_tf)
[1] 1
# Which rows hold the outliers?
find_outlier <- which(outlier_tf == TRUE, arr.ind = TRUE)
# Remove the outlier rows
newdata <- data[-find_outlier, ]
nrow(newdata)
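The X2012 column considered in this example had only one value flagged as an outlier, so we remove that single row from the dataset. Since the outlier function flags the value that lies farthest from the sample mean, it is worth inspecting the flagged value before removing it.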
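The documentation of the outliers package describes outlier as finding the value with the largest difference from the sample mean. Assuming that behaviour, the idea can be sketched in base R as follows. The helper flag_farthest is hypothetical (it is not part of the package) and the data are invented; the sketch is meant to illustrate the logic, not to reproduce the package exactly.

```r
# Hypothetical helper, not part of the outliers package:
# flag the value(s) with the largest absolute deviation from the sample mean.
flag_farthest <- function(x) {
  dev <- abs(x - mean(x, na.rm = TRUE))
  !is.na(dev) & dev == max(dev, na.rm = TRUE)
}

values <- c(12, 14, 13, 15, 90, 11)  # invented data with one extreme value

flags        <- flag_farthest(values)   # logical vector, TRUE for flagged values
outlier_rows <- which(flags)            # positions of the flagged values
kept         <- values[!flags]          # drop flagged values by logical indexing

print(outlier_rows)
print(kept)
```

Two design points follow from this. A rule of this kind always flags at least one value, even in clean data, so the result is a candidate to review rather than proof of an outlier. Also, filtering with the logical vector (values[!flags], or data[!flags, ] for a data frame) avoids a pitfall of negative indexing: if the index vector were ever empty, data[-index, ] would return zero rows instead of the full dataset.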