Getting to Know the Pandas DataFrame

The Pandas DataFrame is a data structure that allows us to manipulate and analyze tabular data. A “tabular” data structure can be thought of as a matrix, where rows represent observations and columns represent features that describe each observation. It’s a structure that you would find in a SQL database or Excel spreadsheet. Let’s say we have a tabular dataset about movies.

In this case, each row represents a movie and each column represents a characteristic of the movie, such as its genre, rating, or director. The “index” labels each row; by default, a Pandas DataFrame’s index is the row’s position, starting at 0.

Importing the Pandas package

In order to create and use a Pandas DataFrame, we need to have the pandas package readily available in our environment. Let’s import pandas and give it the alias of “pd” so that we don’t have to write out “pandas” every time we call a function.

import pandas as pd 

Creating a dataframe

There are several ways to create a Pandas DataFrame. Here, we’ll describe two approaches.

Converting a dictionary to a dataframe

You can create a dataframe from a dictionary. Each key of the dictionary becomes a column name, and its value is a list holding the values of that column. Each element of the list is the value for one row, so all the lists must have the same length.

Let’s create a dataframe called df_movies.


df is short for “dataframe”. It’s common for data scientists to name their dataframe “df”.

data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron']
}

df_movies = pd.DataFrame(data)

We can confirm that df_movies is indeed a dataframe:
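One way to run this check is sketched below, using Python’s built-in type() and isinstance(). The dataframe is rebuilt here so the snippet runs on its own; the variable name is_dataframe is introduced only for illustration.

```python
import pandas as pd

# Same data as the df_movies example above
data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],
}
df_movies = pd.DataFrame(data)

# type() reports the class of the object
print(type(df_movies))

# isinstance() checks the object against the DataFrame class
is_dataframe = isinstance(df_movies, pd.DataFrame)
print(is_dataframe)  # True
```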


Now let’s see how it looks 👀:

         movie    genre  rating             director
0       Batman   action       6           Tim Burton
1  Jungle Book     kids       9  Wolfgang Reitherman
2      Titanic  romance       8        James Cameron
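The leftmost numbers in this output are the default index described earlier. As a small sketch, the index, shape, and column names can be inspected directly. A shortened two-column version of the data is used here to keep the example brief, and index_labels is a name chosen only for illustration.

```python
import pandas as pd

# Shortened version of the movies data (two columns only)
df_movies = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'rating': [6, 9, 8],
})

# With no index supplied, pandas labels the rows 0, 1, 2
index_labels = list(df_movies.index)
print(index_labels)             # [0, 1, 2]

# shape is (number of rows, number of columns)
print(df_movies.shape)          # (3, 2)

# Column names come from the dictionary keys
print(list(df_movies.columns))  # ['movie', 'rating']
```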

Loading a csv file into a dataframe

You can also create a dataframe by importing tabular data from a comma-separated values (csv) file or an Excel spreadsheet. A csv file looks something like this:

To load this csv file into a Pandas DataFrame, we will need to use the Pandas read_csv() function. For data in Excel format, you can use read_excel(). We will also need to know the path where the csv file is located. This can be either on your local machine or in the cloud.

Let’s load the movies_data.csv file as a dataframe. The original file is located on my local machine in a folder called data/.

df_movies = pd.read_csv("data/movies_data.csv")

         movie    genre  rating             director
0       Batman   action       6           Tim Burton
1  Jungle Book     kids       9  Wolfgang Reitherman
2      Titanic  romance       8        James Cameron

This csv-loaded dataframe is identical to the one that was generated from a dictionary.
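To try this without the original file, the sketch below writes a small csv to a temporary folder and reads it back. The file contents are an assumption reconstructed from the dataframe shown above (a header row, then one line per movie), and the temporary path stands in for data/movies_data.csv.

```python
import os
import tempfile

import pandas as pd

data = {
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'genre': ['action', 'kids', 'romance'],
    'rating': [6, 9, 8],
    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],
}

# Assumed file contents: a header row, then one comma-separated line per movie
csv_text = (
    "movie,genre,rating,director\n"
    "Batman,action,6,Tim Burton\n"
    "Jungle Book,kids,9,Wolfgang Reitherman\n"
    "Titanic,romance,8,James Cameron\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary stand-in for data/movies_data.csv
    csv_path = os.path.join(tmp_dir, "movies_data.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(csv_text)
    df_from_csv = pd.read_csv(csv_path)

df_from_dict = pd.DataFrame(data)

# equals() compares the shape, values, and column types of two dataframes
print(df_from_csv.equals(df_from_dict))
```

The first line of the file becomes the column names, and read_csv() infers that rating holds integers.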

Pandas Series

An important part of the Pandas DataFrame is the Pandas Series. While the DataFrame is a 2-dimensional structure, a Series is 1-dimensional. It can store any datatype (integers, strings, floats, timestamps, even lists). A Series represents a single column of a DataFrame. To get an individual column as a Pandas Series, index the dataframe with the column name in square brackets, as in df['column_name'].


Let’s say we want to pull the rating column from our df_movies dataframe.
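A minimal sketch of that selection follows. It rebuilds a two-column version of df_movies so the snippet runs on its own, and the variable name ratings is chosen only for illustration. Printing the result should give the output shown next.

```python
import pandas as pd

# Shortened version of the movies data (two columns only)
df_movies = pd.DataFrame({
    'movie': ['Batman', 'Jungle Book', 'Titanic'],
    'rating': [6, 9, 8],
})

# Square brackets with a column name return that column as a Series
ratings = df_movies['rating']
print(ratings)
```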

0    6
1    9
2    8
Name: rating, dtype: int64

The rating column is a Pandas Series! We can confirm its datatype:
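One way to run this check is sketched below, again with type() and isinstance(). The names ratings and is_series are chosen only for illustration, and the dataframe is reduced to the rating column to keep the example short.

```python
import pandas as pd

df_movies = pd.DataFrame({'rating': [6, 9, 8]})
ratings = df_movies['rating']

# type() reports the class of the selected column
print(type(ratings))

# isinstance() checks the column against the Series class
is_series = isinstance(ratings, pd.Series)
print(is_series)      # True

# dtype describes the values stored inside the Series
print(ratings.dtype)  # int64
```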


There is a wide range of built-in functions that come with the Pandas Series. Some examples include:

  • .mean(): if the column is numeric, it gets the average value of the column

  • .nunique(): counts number of unique values belonging to a particular column

  • .fillna(value='value'): fills missing values with ‘value’ (or any other value of your choosing)
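The three methods listed above can be tried on small Series, as in the sketch below. The genres and ratings_with_gap Series are hypothetical data made up for this example; they are not part of the movies dataset.

```python
import pandas as pd

ratings = pd.Series([6, 9, 8], name='rating')

# .mean(): average of the numeric values, here (6 + 9 + 8) / 3
average_rating = ratings.mean()
print(average_rating)

# .nunique(): number of distinct values (hypothetical genre data)
genres = pd.Series(['action', 'kids', 'action'])
n_genres = genres.nunique()
print(n_genres)         # 2

# .fillna(): replace missing values (hypothetical Series with a gap)
ratings_with_gap = pd.Series([6.0, None, 8.0])
filled = ratings_with_gap.fillna(value=0)
print(filled.tolist())  # [6.0, 0.0, 8.0]
```

Note that .fillna() returns a new Series here; ratings_with_gap itself still contains the missing value.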

The official documentation on Pandas Series provides a list of all available functions. We’ll explore the functions of Pandas Series in more detail in the upcoming chapter, Data Exploration.