{
"cells": [
{
"cell_type": "markdown",
"id": "70594b81",
"metadata": {},
"source": [
"# Getting to Know the Pandas DataFrame"
]
},
{
"cell_type": "markdown",
"id": "6fdaa7fa",
"metadata": {},
"source": [
"The [Pandas DataFrame](https://pandas.pydata.org/docs/reference/frame.html) is a data structure that allows us to manipulate and analyze tabular data. A \"tabular\" data structure can be thought of as a matrix, where rows represent observations and columns represent features that describe each observation. It's a structure that you would find in a SQL database or Excel spreadsheet. Let's say we have a tabular dataset about movies.\n",
"\n",
"<img width=\"50%\" src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/dataframe_structure.png\"/>\n",
"\n",
"In this case, each row represents a movie and each column represents a characteristic about the movie like the genre, rating, and director. The \"index\" column represents a row's position in the dataframe. By default, a Pandas DataFrame's index starts at 0."
]
},
{
"cell_type": "markdown",
"id": "cdda2b7c",
"metadata": {},
"source": [
"## Importing the Pandas package\n",
"\n",
"In order to create and use a Pandas DataFrame, we need to have the `pandas` package readily available in our environment. Let's import `pandas` and give it the alias of \"pd\" so that we don't have to write out \"pandas\" every time we call a function.\n",
"\n",
"<img src=\"https://media.giphy.com/media/nVsLCrW5iHf6E/giphy.gif\"/>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f9dd60c4",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd "
]
},
{
"cell_type": "markdown",
"id": "cd2658d2",
"metadata": {},
"source": [
"## Creating a dataframe\n",
"\n",
"There are several ways to create a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). Here, we'll describe 2 approaches. \n",
"\n",
"### Converting a dictionary to a dataframe\n",
"\n",
"You can create a dataframe from a dictionary. Each key of the dictionary represents a column name and the value of the dictionary is a list that represents values belonging to that particular column. Each element of the list represents the value of a row in the dataframe. \n",
"\n",
"<img width='70%' src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/dict_df.png\"/>\n",
"\n",
"Let's create a dataframe called `df_movies`.\n",
"\n",
"```{note}\n",
"`df` is short for \"dataframe\". It's common for data scientists to name their dataframe \"df\". \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "27f8611b",
"metadata": {},
"outputs": [],
"source": [
"data = {\n",
" 'movie': ['Batman', 'Jungle Book', 'Titanic'], \n",
" 'genre': ['action', 'kids', 'romance'], \n",
" 'rating': [6, 9, 8],\n",
" 'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron']\n",
"}\n",
"\n",
"df_movies = pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"id": "f5b2b9b9",
"metadata": {},
"source": [
"We can confirm that `df_movies` is indeed a dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c0c71ad1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_movies)"
]
},
{
"cell_type": "markdown",
"id": "2ac9bef6",
"metadata": {},
"source": [
"Now let's see how it looks 👀:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "09c7f239",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movie</th>\n",
" <th>genre</th>\n",
" <th>rating</th>\n",
" <th>director</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Batman</td>\n",
" <td>action</td>\n",
" <td>6</td>\n",
" <td>Tim Burton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jungle Book</td>\n",
" <td>kids</td>\n",
" <td>9</td>\n",
" <td>Wolfgang Reitherman</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Titanic</td>\n",
" <td>romance</td>\n",
" <td>8</td>\n",
" <td>James Cameron</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movie genre rating director\n",
"0 Batman action 6 Tim Burton\n",
"1 Jungle Book kids 9 Wolfgang Reitherman\n",
"2 Titanic romance 8 James Cameron"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_movies"
]
},
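{
"cell_type": "markdown",
"id": "b7e41c02",
"metadata": {},
"source": [
"To make the default index concrete, here is a minimal sketch that rebuilds a shortened version of the movie table and inspects its row labels, column names, and shape. The name `df_example` is only for illustration and is not used elsewhere in this chapter.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Rebuild a shortened version of the movie table from a dictionary of lists\n",
"df_example = pd.DataFrame({\n",
"    'movie': ['Batman', 'Jungle Book', 'Titanic'],\n",
"    'rating': [6, 9, 8],\n",
"})\n",
"\n",
"# With no index supplied, pandas labels the rows by position, starting at 0\n",
"print(list(df_example.index))\n",
"\n",
"# The dictionary keys become the column names\n",
"print(list(df_example.columns))\n",
"\n",
"# .shape is (number of rows, number of columns)\n",
"print(df_example.shape)\n",
"```\n",
"\n",
"Wrapping the index and columns in `list()` is just a convenience for printing plain Python lists."
]
},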
{
"cell_type": "markdown",
"id": "6afd5ec6",
"metadata": {},
"source": [
"### Loading a csv file into a dataframe\n",
"\n",
"You can also create a dataframe by importing tabular data from a comma-separated-value (csv) file, or Excel spreadsheet. A csv file looks somthing like this:\n",
"\n",
"<img width=\"30%\" src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/example_csv_file.png\"/>\n",
"\n",
"To load this csv file into a Pandas DataFrame, we will need to use the Pandas [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. For data in Excel format, you can use [`read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). We will also need to know the path where the csv file is located. This can be either on your local machine or in the cloud. \n",
"\n",
"Let's load in `movies_data.csv` file as a dataframe. The original file is located on my local machine in a folder called `data/`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "52e90654",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movie</th>\n",
" <th>genre</th>\n",
" <th>rating</th>\n",
" <th>director</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Batman</td>\n",
" <td>action</td>\n",
" <td>6</td>\n",
" <td>Tim Burton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jungle Book</td>\n",
" <td>kids</td>\n",
" <td>9</td>\n",
" <td>Wolfgang Reitherman</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Titanic</td>\n",
" <td>romance</td>\n",
" <td>8</td>\n",
" <td>James Cameron</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movie genre rating director\n",
"0 Batman action 6 Tim Burton\n",
"1 Jungle Book kids 9 Wolfgang Reitherman\n",
"2 Titanic romance 8 James Cameron"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_movies = pd.read_csv(\"data/movies_data.csv\")\n",
"\n",
"df_movies"
]
},
{
"cell_type": "markdown",
"id": "ea1b631c",
"metadata": {},
"source": [
"This csv-loaded dataframe is identical to the one that was generated from a dictionary. "
]
},
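{
"cell_type": "markdown",
"id": "c3a95d17",
"metadata": {},
"source": [
"If you want to check this equivalence without a file on disk, one option is to parse csv text held in memory. The sketch below assumes the csv file contains the same three rows shown above; `csv_text`, `df_from_csv`, and `df_from_dict` are illustrative names rather than part of the chapter's code.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Assumed csv contents: a header row followed by the three movies\n",
"csv_text = '''movie,genre,rating,director\n",
"Batman,action,6,Tim Burton\n",
"Jungle Book,kids,9,Wolfgang Reitherman\n",
"Titanic,romance,8,James Cameron\n",
"'''\n",
"\n",
"# read_csv accepts a file-like object as well as a path\n",
"df_from_csv = pd.read_csv(io.StringIO(csv_text))\n",
"\n",
"df_from_dict = pd.DataFrame({\n",
"    'movie': ['Batman', 'Jungle Book', 'Titanic'],\n",
"    'genre': ['action', 'kids', 'romance'],\n",
"    'rating': [6, 9, 8],\n",
"    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron'],\n",
"})\n",
"\n",
"# .equals() compares shape, values, and column dtypes\n",
"print(df_from_csv.equals(df_from_dict))\n",
"```\n",
"\n",
"`DataFrame.equals()` is stricter than comparing the printed tables by eye, because it also checks that each column has the same dtype."
]
},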
{
"cell_type": "markdown",
"id": "2629988a",
"metadata": {},
"source": [
"## Pandas Series\n",
"\n",
"An important part of the Pandas DataFrame is the [Pandas Series](https://pandas.pydata.org/docs/reference/series.html). While the DataFrame is a 2-dimensional structure, a Series is 1-dimensional. It can store any datatype (integers, strings, floats, timestamps, even lists). A Series represents a single column of a DataFrame. This is how you get an individual column (represented as a Pandas Series) from a dataframe:\n",
"\n",
"```\n",
"dataframe['column_name'] \n",
"```\n",
"\n",
"Let's say we want to pull the `rating` column from our `df_movies` dataframe."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f1cf12c4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 6\n",
"1 9\n",
"2 8\n",
"Name: rating, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_movies['rating']"
]
},
{
"cell_type": "markdown",
"id": "4dcb11cc",
"metadata": {},
"source": [
"The `rating` column is a Pandas Series! We can confirm its datatype:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3bc9b2c0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_movies['rating'])"
]
},
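{
"cell_type": "markdown",
"id": "d81f4a6e",
"metadata": {},
"source": [
"A related detail worth knowing: indexing with a single column name returns a Series, while indexing with a list of column names (double brackets) returns a DataFrame, even when the list holds only one name. Here is a small sketch using a stand-in dataframe named `df_demo`, which is a hypothetical name for illustration:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df_demo = pd.DataFrame({\n",
"    'movie': ['Batman', 'Jungle Book', 'Titanic'],\n",
"    'rating': [6, 9, 8],\n",
"})\n",
"\n",
"# One column name -> Series (1-dimensional)\n",
"rating_series = df_demo['rating']\n",
"\n",
"# List of column names -> DataFrame (2-dimensional)\n",
"rating_frame = df_demo[['rating']]\n",
"\n",
"print(type(rating_series).__name__, rating_series.shape)\n",
"print(type(rating_frame).__name__, rating_frame.shape)\n",
"```\n",
"\n",
"The difference shows up in `.shape`: the Series has a single dimension, while the one-column DataFrame still has rows and columns."
]
},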
{
"cell_type": "markdown",
"id": "3615b86a",
"metadata": {},
"source": [
"There is a wide range of built-in functions that come with the Pandas Series. Some examples include:\n",
"\n",
"- `.mean()`: if the column is numeric, it gets the average value of the column\n",
"- `.nunique()`: counts number of unique values belonging to a particular column \n",
"- `.fillna(value='value')`: fills missing values with 'value' (or any other value of your choosing)\n",
"\n",
"The [official documentation](https://pandas.pydata.org/docs/reference/series.html) on Pandas Series provides a list of all available functions. \n",
"We'll explore the functions of Pandas Series in more detail in the upcmoing chapter, Data Exploration. "
]
}
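,
{
"cell_type": "markdown",
"id": "e5c02b9a",
"metadata": {},
"source": [
"As a quick preview of the three methods listed above, here is a minimal sketch on a small hand-made Series. The values, the name `ratings`, and the fill value of 0 are arbitrary choices for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# None is stored as a missing value (NaN), which makes the Series float-typed\n",
"ratings = pd.Series([6, 9, 8, 9, None], name='rating')\n",
"\n",
"# Average of the non-missing values (missing values are skipped by default)\n",
"print(ratings.mean())\n",
"\n",
"# Number of distinct values (missing values are not counted by default)\n",
"print(ratings.nunique())\n",
"\n",
"# Replace missing values with 0; this returns a new Series\n",
"filled = ratings.fillna(value=0)\n",
"print(filled.tolist())\n",
"```\n",
"\n",
"Note that `.fillna()` does not modify `ratings` itself, so the result is assigned to a new variable."
]
}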
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}