Book: R Data Science Essentials
Authors: Raja B. Koushik, Sharan Kumar Ravindran
Last updated: 2025-04-04 20:23:12

Bringing data to a usable format

We covered reading data into R, understanding the data types, and performing various operations on the data. Now, we will look at a few concepts that are typically applied just before running an analysis or building a model. While performing an analysis, we might not need to study the entire dataset and can focus on a subset of it instead; on the other hand, we might have to combine data from multiple sources. These are the concepts covered here.

The most commonly used operation is selecting the desired columns from the dataset. While building a model, we will usually not use all the columns in the dataset but only those that are most relevant. To select columns, we can either specify their names or positions, or simply drop the columns that are not required.

# the output below shows that data holds the built-in mtcars dataset
data <- mtcars
# selecting columns by position
newdata <- data[c(1, 5:10)]
head(newdata)
# excluding columns by position
newdata <- data[c(-2, -3, -4, -11)]
head(newdata)
                   mpg drat    wt  qsec vs am gear
Mazda RX4         21.0 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag     21.0 3.90 2.875 17.02  0  1    4
Datsun 710        22.8 3.85 2.320 18.61  1  1    4
Hornet 4 Drive    21.4 3.08 3.215 19.44  1  0    3
Hornet Sportabout 18.7 3.15 3.440 17.02  0  0    3
Valiant           18.1 2.76 3.460 20.22  1  0    3

In the preceding code, the first assignment selects columns by position: the first column and then columns 5 through 10. The second assignment uses negative indices to drop columns 2, 3, 4, and 11 instead. Both commands yield the same result.
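The text mentions selecting columns by name as well as by position, so here is a minimal sketch of the name-based variant. It assumes that data holds the built-in mtcars dataset (as the output above suggests); the variable names keep, drop, by_name, and by_drop are illustrative only.

```r
# Assumption: `data` holds the built-in mtcars dataset
data <- mtcars

# Keep columns by name
keep <- c("mpg", "drat", "wt", "qsec", "vs", "am", "gear")
by_name <- data[keep]

# Drop columns by name: keep every column whose name is not listed in `drop`
drop <- c("cyl", "disp", "hp", "carb")
by_drop <- data[!(names(data) %in% drop)]

# Compare both results with the positional selection used earlier
identical(by_name, data[c(1, 5:10)])
identical(by_name, by_drop)
```

Selecting by name tends to be more robust than selecting by position, because it keeps working if the column order of the source data changes.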

We may also need to filter the data based on a condition. A single model often does not fit the whole population well, so we may need to build separate models for groups that behave differently. This can be achieved by subsetting the dataset. In the following code, we keep only the cars that have an mpg greater than 25:

# keep only the rows where mpg is greater than 25
newdata <- data[which(data$mpg > 25), ]
newdata
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
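As a possible alternative to indexing with which, base R also provides the subset function, which evaluates the condition inside the data frame. The sketch below assumes data holds the mtcars dataset; the wt < 2 threshold and the variable names are made up for illustration.

```r
# Assumption: `data` holds the built-in mtcars dataset
data <- mtcars

# which() returns the positions of rows where the condition is TRUE,
# skipping rows where the comparison is NA
via_which <- data[which(data$mpg > 25), ]

# subset() evaluates the condition inside the data frame and also drops NA rows
via_subset <- subset(data, mpg > 25)

# Conditions can be combined with & (and) or | (or);
# the weight threshold here is an arbitrary example
light_and_efficient <- subset(data, mpg > 25 & wt < 2)

nrow(via_which)
rownames(light_and_efficient)
```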

We might also need to work with a sample of the dataset. For example, while building a linear or logistic regression model, we need two datasets: one for training and the other for testing. In these cases, we need to choose a random sample. This can be done using the following code:

# draw 10 random rows without replacement; the rows differ between runs
# (the result is named sampled to avoid masking the sample() function)
sampled <- data[sample(1:nrow(data), 10, replace = FALSE), ]
sampled
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Honda Civic      30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Porsche 914-2    26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 450SLC      15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Duster 360       14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Fiat 128         32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Toyota Corona    21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Mazda RX4 Wag    21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
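The training and testing datasets mentioned above can be derived from a single random draw. The following is only a sketch under stated assumptions: data holds the mtcars dataset, the 70/30 split ratio and the seed value are arbitrary choices, and train_idx, train, and test are illustrative names.

```r
# Assumption: `data` holds the built-in mtcars dataset
data <- mtcars

# Fix the seed so the same rows are drawn on every run (the value is arbitrary)
set.seed(42)

# Draw 70% of the row positions for training (the ratio is an assumption)
train_idx <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))

train <- data[train_idx, ]   # rows used to fit the model
test  <- data[-train_idx, ]  # remaining rows, held out for evaluation

nrow(train)
nrow(test)
```

Using negative indices for the test set ensures that no row appears in both datasets.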

Here we drew a random sample of 10 rows from the dataset. In addition to sampling, we might have to merge two different datasets. Let's see how this can be achieved. We can combine data row-wise as well as column-wise. The following code combines two samples row-wise:

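# two random samples drawn from the same dataset, so they share the same columns
sample1 <- data[sample(1:nrow(data), 10, replace = FALSE), ]
sample2 <- data[sample(1:nrow(data), 5, replace = FALSE), ]
# stack the rows of sample2 below the rows of sample1
newdata <- rbind(sample1, sample2)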

The preceding code uses the rbind function to combine two datasets row-wise, which requires that they share the same columns. Alternatively, if the two datasets have the same number of rows but different columns, we can combine them column-wise using the cbind or merge functions:

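# two column subsets of the same dataset, so they have the same number of rows
newdata1 <- data[c(1, 5:7)]
newdata2 <- data[c(8:11)]
# place the columns of newdata2 next to the columns of newdata1
newdata <- cbind(newdata1, newdata2)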

When two datasets share a common column, we can use the merge function to combine them. By default, merge joins the datasets on the columns whose names they have in common.
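A small sketch of merge may help here. It assumes data holds the mtcars dataset and builds two hypothetical tables, performance and specs, that share a car column created from the row names; none of these names come from the original text.

```r
# Assumption: `data` holds the built-in mtcars dataset
data <- mtcars

# merge() matches on column values, so expose the car name as a regular column
data$car <- rownames(data)

# Two hypothetical tables that share only the `car` key and overlap partially
performance <- data[1:5, c("car", "mpg", "hp")]
specs       <- data[3:8, c("car", "wt", "gear")]

# Inner join (the default): only cars present in both tables
inner <- merge(performance, specs, by = "car")

# Left join: keep every row of `performance`; unmatched columns become NA
left <- merge(performance, specs, by = "car", all.x = TRUE)

inner
left
```

Passing by explicitly, rather than relying on the default of all shared column names, makes the join key clear to the reader.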

These are the essential concepts needed to prepare a dataset for analysis. The analysis itself will be discussed in the next few chapters.