# Getting to Know the Pandas DataFrame


The Pandas DataFrame is a data structure that allows us to manipulate and analyze tabular data. A “tabular” data structure can be thought of as a matrix, where rows represent observations and columns represent features that describe each observation. It’s a structure that you would find in a SQL database or Excel spreadsheet. Let’s say we have a tabular dataset about movies.

In this case, each row represents a movie and each column represents a characteristic about the movie like the genre, rating, and director. The “index” column represents a row’s position in the dataframe. By default, a Pandas DataFrame’s index starts at 0.

## Importing the Pandas package

In order to create and use a Pandas DataFrame, we need to have the `pandas` package readily available in our environment. Let’s import `pandas` and give it the alias of “pd” so that we don’t have to write out “pandas” every time we call a function.

```
import pandas as pd
```

## Creating a dataframe

There are several ways to create a Pandas DataFrame. Here, we’ll describe two approaches.

### Converting a dictionary to a dataframe

You can create a dataframe from a dictionary. Each key of the dictionary represents a column name and the value of the dictionary is a list that represents values belonging to that particular column. Each element of the list represents the value of a row in the dataframe.

Let’s create a dataframe called `df_movies`.

Note

`df` is short for “dataframe”. It’s common for data scientists to name their dataframe `df`.

```
# Each key becomes a column name; each list holds that column's values, one per row
data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron']
}
df_movies = pd.DataFrame(data)
```

We can confirm that `df_movies` is indeed a dataframe:

```
type(df_movies)
```

```
pandas.core.frame.DataFrame
```

Now let’s see how it looks 👀:

```
df_movies
```

|   | movie | genre | rating | director |
|---|---|---|---|---|
| 0 | Batman | action | 6 | Tim Burton |
| 1 | Jungle Book | kids | 9 | Wolfgang Reitherman |
| 2 | Titanic | romance | 8 | James Cameron |
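To connect the dictionary to the resulting table, here is a minimal sketch that rebuilds the same `df_movies` and inspects its shape, column names, and index. The variable names `n_rows`, `n_cols`, `column_names`, and `index_labels` are illustrative choices, not part of the original example.

```python
import pandas as pd

data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],
}
df_movies = pd.DataFrame(data)

# .shape is a (rows, columns) tuple: one row per list element, one column per key
n_rows, n_cols = df_movies.shape

# The dictionary keys become the column names, in insertion order
column_names = list(df_movies.columns)

# With no index supplied, the rows are labeled 0, 1, 2, ...
index_labels = list(df_movies.index)

print(n_rows, n_cols)
print(column_names)
print(index_labels)
```

Checking `.shape` right after building a dataframe is a quick way to catch a missing column or a list of the wrong length.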

### Loading a csv file into a dataframe

You can also create a dataframe by importing tabular data from a comma-separated values (csv) file or an Excel spreadsheet. A csv file is a plain-text file in which each line is a row and the values within a row are separated by commas.

To load a csv file into a Pandas DataFrame, we will need to use the Pandas `read_csv()` function. For data in Excel format, you can use `read_excel()`. We will also need to know the path where the csv file is located. This can be either on your local machine or in the cloud.

Let’s load the `movies_data.csv` file as a dataframe. The original file is located on my local machine in a folder called `data/`.

```
df_movies = pd.read_csv("data/movies_data.csv")
df_movies
```

|   | movie | genre | rating | director |
|---|---|---|---|---|
| 0 | Batman | action | 6 | Tim Burton |
| 1 | Jungle Book | kids | 9 | Wolfgang Reitherman |
| 2 | Titanic | romance | 8 | James Cameron |

This csv-loaded dataframe is identical to the one that was generated from a dictionary.
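If you don’t have `data/movies_data.csv` on hand, the sketch below is one way to try this out. It assumes the file contains a header row followed by the three movies shown above; the `csv_text` string is a hypothetical stand-in for the file’s contents, read through `io.StringIO` instead of a path. It then uses `DataFrame.equals()` to compare the result against the dictionary-built dataframe.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of data/movies_data.csv
csv_text = """movie,genre,rating,director
Batman,action,6,Tim Burton
Jungle Book,kids,9,Wolfgang Reitherman
Titanic,romance,8,James Cameron
"""

# read_csv accepts a file path or a file-like object such as StringIO
df_from_csv = pd.read_csv(io.StringIO(csv_text))

df_from_dict = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],
})

# equals() compares shape, values, and column dtypes
same = df_from_csv.equals(df_from_dict)
print(same)
```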

## Pandas Series

An important part of the Pandas DataFrame is the Pandas Series. While the DataFrame is a 2-dimensional structure, a Series is 1-dimensional. It can store any datatype (integers, strings, floats, timestamps, even lists). A Series represents a single column of a DataFrame. This is how you get an individual column (represented as a Pandas Series) from a dataframe:

```
dataframe['column_name']
```

Let’s say we want to pull the `rating` column from our `df_movies` dataframe.

```
df_movies['rating']
```

```
0    6
1    9
2    8
Name: rating, dtype: int64
```

The `rating` column is a Pandas Series! We can confirm its datatype:

```
type(df_movies['rating'])
```

```
pandas.core.series.Series
```
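As a small sketch of the 1-dimensional versus 2-dimensional distinction, the example below compares selecting a column with single brackets (which returns a Series) against selecting with a list of column names (which returns a one-column DataFrame). The names `rating_series` and `rating_frame` are illustrative, and the dataframe is trimmed to two columns to keep the example short.

```python
import pandas as pd

df_movies = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'rating': [6, 9, 8],
})

# A single column name returns a 1-dimensional Series
rating_series = df_movies['rating']

# A list of column names returns a 2-dimensional DataFrame, even for one column
rating_frame = df_movies[['rating']]

print(type(rating_series).__name__, rating_series.ndim)
print(type(rating_frame).__name__, rating_frame.ndim)

# The Series keeps the dataframe's row index
print(list(rating_series.index))
```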

There is a wide range of built-in functions that come with the Pandas Series. Some examples include:

- `.mean()`: if the column is numeric, it gets the average value of the column
- `.nunique()`: counts the number of unique values belonging to a particular column
- `.fillna(value='value')`: fills missing values with ‘value’ (or any other value of your choosing)
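The three functions listed above can be tried on a small Series. This is a minimal sketch: the `ratings` values come from the movies example, while `ratings_with_gap` is a made-up Series with one missing value, and the fill value of `0` is an arbitrary choice for illustration.

```python
import pandas as pd

ratings = pd.Series([6, 9, 8], name='rating')

# Average of the numeric values
average_rating = ratings.mean()

# Number of distinct values in the Series
unique_ratings = ratings.nunique()

# Hypothetical Series with a missing entry; None is stored as NaN
ratings_with_gap = pd.Series([6, None, 8], name='rating')

# fillna returns a new Series and leaves the original unchanged
filled = ratings_with_gap.fillna(value=0)

print(average_rating)
print(unique_ratings)
print(filled.tolist())
```

Note that a numeric Series containing a missing value is stored as floats, so the filled values come back as floats as well.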

The official documentation on Pandas Series provides a list of all available functions. We’ll explore the functions of Pandas Series in more detail in the upcoming chapter, Data Exploration.