- Advanced Python Programming
- Dr. Gabriele Lanaro Quan Nguyen Sakis Kasampalis
- 2025-04-04 14:55:55
Pandas fundamentals
While NumPy deals mostly with arrays, the main data structures in Pandas are pandas.Series and pandas.DataFrame. Older releases also provided pandas.Panel, which was deprecated in version 0.20 and removed in version 0.25. In the rest of this chapter, we will abbreviate pandas as pd.
The main difference between a pd.Series object and a NumPy array (np.ndarray) is that a pd.Series associates a key with each element of the array. Let's see how this works in practice with an example.
Let's assume that we are testing a new blood pressure drug and want to record, for each patient, whether their blood pressure improved after the drug was administered. We can encode this information by associating each subject ID (represented by an integer) with True if the drug was effective and False otherwise.
We can create a pd.Series object by associating an array of keys (the patients) with an array of values that represent the drug's effectiveness. The array of keys is passed to the Series constructor through the index argument, as shown in the following snippet:
import pandas as pd
patients = [0, 1, 2, 3]
effective = [True, True, False, False]
effective_series = pd.Series(effective, index=patients)
Associating a set of consecutive integers starting at 0 with a set of values can technically be done with a plain NumPy array, since in that case the key is simply the position of the element in the array. In Pandas, keys are not limited to integers; they can also be strings, floating-point numbers, and generic (hashable) Python objects. For example, we can turn our IDs into strings with little effort, as shown in the following code:
patients = ["a", "b", "c", "d"]
effective = [True, True, False, False]
effective_series = pd.Series(effective, index=patients)
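To make the role of the keys concrete, the following minimal sketch compares positional access on a NumPy array with key-based access on the equivalent pd.Series. The variable name effective_array is introduced here only for illustration; .loc and .iloc are the standard Pandas accessors for label-based and position-based lookup.

```python
import numpy as np
import pandas as pd

patients = ["a", "b", "c", "d"]
effective = [True, True, False, False]

# effective_array is a name used only in this sketch.
effective_array = np.array(effective)
effective_series = pd.Series(effective, index=patients)

# A NumPy array can only be addressed by position.
print(effective_array[2])

# A Series can be addressed by key (label) with .loc ...
print(effective_series.loc["c"])
# ... and still by position with .iloc.
print(effective_series.iloc[2])

# The keys themselves are stored in the index attribute.
print(list(effective_series.index))
```

Both lookups on the Series refer to the same element: patient "c" sits at position 2.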
An interesting observation is that, while NumPy arrays can be thought of as a contiguous collection of values similar to Python lists, the Pandas pd.Series object can be thought of as a structure that maps keys to values, similar to Python dictionaries.
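The dictionary analogy can be tried out directly. The sketch below, which assumes a small dictionary named effective_dict made up for illustration, builds a pd.Series from a dictionary, checks membership against the keys, and converts the result back.

```python
import pandas as pd

# effective_dict is a hypothetical name used only in this sketch.
effective_dict = {"a": True, "b": True, "c": False, "d": False}

# The dictionary keys become the index, the values become the data.
effective_series = pd.Series(effective_dict)

# As with a dictionary, the "in" operator tests the keys, not the values.
print("b" in effective_series)
print("z" in effective_series)

# Lookup by key and conversion back to a plain dictionary.
print(effective_series["b"])
print(effective_series.to_dict() == effective_dict)
```

The analogy is a loose one: unlike a dictionary, a pd.Series also supports positional access and vectorized operations on its values.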
What if you want to store the initial and final blood pressure for each patient? In Pandas, you can use a pd.DataFrame object to associate multiple values with each key.
A pd.DataFrame can be initialized, similarly to a pd.Series object, by passing a dictionary of columns and an index. The following example creates a pd.DataFrame with four columns that hold the initial and final measurements of systolic and diastolic blood pressure for our patients:
patients = ["a", "b", "c", "d"]
columns = {
    "sys_initial": [120, 126, 130, 115],
    "dia_initial": [75, 85, 90, 87],
    "sys_final": [115, 123, 130, 118],
    "dia_final": [70, 82, 92, 87],
}
df = pd.DataFrame(columns, index=patients)
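As a quick check that each key now maps to several values, the following sketch rebuilds the same table and looks up one patient's row with .loc. The variable name row is an illustrative choice and not part of the original example.

```python
import pandas as pd

patients = ["a", "b", "c", "d"]
columns = {
    "sys_initial": [120, 126, 130, 115],
    "dia_initial": [75, 85, 90, 87],
    "sys_final": [115, 123, 130, 118],
    "dia_final": [70, 82, 92, 87],
}
df = pd.DataFrame(columns, index=patients)

# Four patients (rows) and four measurements (columns).
print(df.shape)

# Selecting a row by key returns a Series indexed by the column names.
row = df.loc["b"]
print(row["sys_initial"], row["sys_final"])
```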
Equivalently, you can think of a pd.DataFrame as a collection of pd.Series objects that share an index. In fact, you can initialize a pd.DataFrame directly from a dictionary of pd.Series instances:
columns = {
    "sys_initial": pd.Series([120, 126, 130, 115], index=patients),
    "dia_initial": pd.Series([75, 85, 90, 87], index=patients),
    "sys_final": pd.Series([115, 123, 130, 118], index=patients),
    "dia_final": pd.Series([70, 82, 92, 87], index=patients),
}
df = pd.DataFrame(columns)
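The two constructions can be compared directly. The sketch below builds the table both ways and checks that the results match; the names data, df_from_lists, and df_from_series are chosen here for illustration only.

```python
import pandas as pd

patients = ["a", "b", "c", "d"]
# data is a hypothetical name for the raw measurements.
data = {
    "sys_initial": [120, 126, 130, 115],
    "dia_initial": [75, 85, 90, 87],
    "sys_final": [115, 123, 130, 118],
    "dia_final": [70, 82, 92, 87],
}

# Construction 1: dictionary of lists plus an explicit index.
df_from_lists = pd.DataFrame(data, index=patients)

# Construction 2: dictionary of Series that already carry the index.
df_from_series = pd.DataFrame(
    {name: pd.Series(values, index=patients) for name, values in data.items()}
)

# Both routes should produce the same table.
print(df_from_lists.equals(df_from_series))

# Selecting a single column gives back a pd.Series.
sys_initial = df_from_lists["sys_initial"]
print(type(sys_initial).__name__)
```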
To inspect the content of a pd.DataFrame or pd.Series object, you can use the pd.Series.head and pd.DataFrame.head methods, which return the first few rows of the dataset (five by default):
effective_series.head()
# Output:
# a     True
# b     True
# c    False
# d    False
# dtype: bool
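df.head()
# Output (Pandas 0.23+ on Python 3.6+ keeps the dictionary's column order;
# older versions sort the columns alphabetically):
#    sys_initial  dia_initial  sys_final  dia_final
# a          120           75        115         70
# b          126           85        123         82
# c          130           90        130         92
# d          115           87        118         87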
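Because our example has only four rows, head returns all of them. The following sketch uses a longer, made-up Series (the name readings and its values are assumptions for illustration) to show the default of five rows and the optional argument that sets the number of rows.

```python
import pandas as pd

# readings is a hypothetical ten-element Series used only in this sketch.
readings = pd.Series(range(10))

# With no argument, head returns the first five elements.
first_five = readings.head()

# An integer argument selects how many elements to return.
first_three = readings.head(3)

print(len(first_five), len(first_three))
print(list(first_three))
```

Note that head returns a new object rather than printing anything itself; the output is displayed only because an interactive session echoes the returned value.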
Just as a pd.DataFrame can be used to store a collection of pd.Series, a pd.Panel could be used to store a collection of pd.DataFrame objects. We will not cover pd.Panel: it was never used as often as pd.Series and pd.DataFrame, and it was deprecated in Pandas 0.20 and removed in 0.25 in favor of a pd.DataFrame with a MultiIndex. To learn more about pd.Panel, refer to the documentation of a Pandas release that still includes it; at the time of writing, it was available at http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel.
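On Pandas versions without pd.Panel, one way to keep a collection of pd.DataFrame objects together is to stack them with pd.concat and its keys argument, which produces a two-level row index. The following is only a sketch of that idea; the frames visit_1 and visit_2 and their values are made up for illustration.

```python
import pandas as pd

patients = ["a", "b"]
# visit_1 and visit_2 are hypothetical measurements from two visits.
visit_1 = pd.DataFrame({"sys": [120, 126], "dia": [75, 85]}, index=patients)
visit_2 = pd.DataFrame({"sys": [115, 123], "dia": [70, 82]}, index=patients)

# Stack the two tables; keys labels each one in the outer index level.
visits = pd.concat(
    [visit_1, visit_2],
    keys=["visit_1", "visit_2"],
    names=["visit", "patient"],
)

# Selecting an outer key gives back the rows of that table.
print(visits.loc["visit_2"])

# A (visit, patient) pair plus a column name selects a single value.
print(visits.loc[("visit_2", "a"), "sys"])
```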