{ "cells": [ { "cell_type": "markdown", "id": "70594b81", "metadata": {}, "source": [ "# Getting to Know the Pandas DataFrame" ] }, { "cell_type": "markdown", "id": "6fdaa7fa", "metadata": {}, "source": [ "The [Pandas DataFrame](https://pandas.pydata.org/docs/reference/frame.html) is a data structure that allows us to manipulate and analyze tabular data. A \"tabular\" data structure can be thought of as a matrix, where rows represent observations and columns represent features that describe each observation. It's a structure that you would find in a SQL database or Excel spreadsheet. Let's say we have a tabular dataset about movies.\n", "\n", "<img width=\"50%\" src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/dataframe_structure.png\"/>\n", "\n", "In this case, each row represents a movie and each column represents a characteristic about the movie like the genre, rating, and director. The \"index\" column represents a row's position in the dataframe. By default, a Pandas DataFrame's index starts at 0." ] }, { "cell_type": "markdown", "id": "cdda2b7c", "metadata": {}, "source": [ "## Importing the Pandas package\n", "\n", "In order to create and use a Pandas DataFrame, we need to have the `pandas` package readily available in our environment. Let's import `pandas` and give it the alias of \"pd\" so that we don't have to write out \"pandas\" every time we call a function.\n", "\n", "<img src=\"https://media.giphy.com/media/nVsLCrW5iHf6E/giphy.gif\"/>" ] }, { "cell_type": "code", "execution_count": 1, "id": "f9dd60c4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd " ] }, { "cell_type": "markdown", "id": "cd2658d2", "metadata": {}, "source": [ "## Creating a dataframe\n", "\n", "There are several ways to create a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). Here, we'll describe 2 approaches. 
\n", "\n", "### Converting a dictionary to a dataframe\n", "\n", "You can create a dataframe from a dictionary. Each key of the dictionary represents a column name and the value of the dictionary is a list that represents values belonging to that particular column. Each element of the list represents the value of a row in the dataframe. \n", "\n", "<img width='70%' src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/dict_df.png\"/>\n", "\n", "Let's create a dataframe called `df_movies`.\n", "\n", "```{note}\n", "`df` is short for \"dataframe\". It's common for data scientists to name their dataframe \"df\". \n", "```" ] }, { "cell_type": "code", "execution_count": 2, "id": "27f8611b", "metadata": {}, "outputs": [], "source": [ "data = {\n", " 'movie': ['Batman', 'Jungle Book', 'Titanic'], \n", " 'genre': ['action', 'kids', 'romance'], \n", " 'rating': [6, 9, 8],\n", " 'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron']\n", "}\n", "\n", "df_movies = pd.DataFrame(data)" ] }, { "cell_type": "markdown", "id": "f5b2b9b9", "metadata": {}, "source": [ "We can confirm that `df_movies` is indeed a dataframe:" ] }, { "cell_type": "code", "execution_count": 3, "id": "c0c71ad1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df_movies)" ] }, { "cell_type": "markdown", "id": "2ac9bef6", "metadata": {}, "source": [ "Now let's see how it looks 👀:" ] }, { "cell_type": "code", "execution_count": 4, "id": "09c7f239", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: 
\n">
right;\">\n", " <th></th>\n", " <th>movie</th>\n", " <th>genre</th>\n", " <th>rating</th>\n", " <th>director</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Batman</td>\n", " <td>action</td>\n", " <td>6</td>\n", " <td>Tim Burton</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Jungle Book</td>\n", " <td>kids</td>\n", " <td>9</td>\n", " <td>Wolfgang Reitherman</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Titanic</td>\n", " <td>romance</td>\n", " <td>8</td>\n", " <td>James Cameron</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " movie genre rating director\n", "0 Batman action 6 Tim Burton\n", "1 Jungle Book kids 9 Wolfgang Reitherman\n", "2 Titanic romance 8 James Cameron" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_movies" ] }, { "cell_type": "markdown", "id": "6afd5ec6", "metadata": {}, "source": [ "### Loading a csv file into a dataframe\n", "\n", "You can also create a dataframe by importing tabular data from a comma-separated-value (csv) file or an Excel spreadsheet. A csv file looks something like this:\n", "\n", "<img width=\"30%\" src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/example_csv_file.png\"/>\n", "\n", "To load this csv file into a Pandas DataFrame, we will need to use the Pandas [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. For data in Excel format, you can use [`read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). We will also need to know the path where the csv file is located. This can be either on your local machine or in the cloud. \n", "\n", "Let's load the `movies_data.csv` file as a dataframe. The original file is located on my local machine in a folder called `data/`."
] }, { "cell_type": "code", "execution_count": 5, "id": "52e90654", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>movie</th>\n", " <th>genre</th>\n", " <th>rating</th>\n", " <th>director</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Batman</td>\n", " <td>action</td>\n", " <td>6</td>\n", " <td>Tim Burton</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Jungle Book</td>\n", " <td>kids</td>\n", " <td>9</td>\n", " <td>Wolfgang Reitherman</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Titanic</td>\n", " <td>romance</td>\n", " <td>8</td>\n", " <td>James Cameron</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " movie genre rating director\n", "0 Batman action 6 Tim Burton\n", "1 Jungle Book kids 9 Wolfgang Reitherman\n", "2 Titanic romance 8 James Cameron" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_movies = pd.read_csv(\"data/movies_data.csv\")\n", "\n", "df_movies" ] }, { "cell_type": "markdown", "id": "ea1b631c", "metadata": {}, "source": [ "This csv-loaded dataframe is identical to the one that was generated from a dictionary. " ] }, { "cell_type": "markdown", "id": "2629988a", "metadata": {}, "source": [ "## Pandas Series\n", "\n", "An important part of the Pandas DataFrame is the [Pandas Series](https://pandas.pydata.org/docs/reference/series.html). While the DataFrame is a 2-dimensional structure, a Series is 1-dimensional. It can store any datatype (integers, strings, floats, timestamps, even lists). 
A Series represents a single column of a DataFrame. This is how you get an individual column (represented as a Pandas Series) from a dataframe:\n", "\n", "```\n", "dataframe['column_name'] \n", "```\n", "\n", "Let's say we want to pull the `rating` column from our `df_movies` dataframe." ] }, { "cell_type": "code", "execution_count": 6, "id": "f1cf12c4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 6\n", "1 9\n", "2 8\n", "Name: rating, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_movies['rating']" ] }, { "cell_type": "markdown", "id": "4dcb11cc", "metadata": {}, "source": [ "The `rating` column is a Pandas Series! We can confirm its datatype:" ] }, { "cell_type": "code", "execution_count": 7, "id": "3bc9b2c0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df_movies['rating'])" ] }, { "cell_type": "markdown", "id": "3615b86a", "metadata": {}, "source": [ "There is a wide range of built-in functions that come with the Pandas Series. Some examples include:\n", "\n", "- `.mean()`: if the column is numeric, it gets the average value of the column\n", "- `.nunique()`: counts the number of unique values in the column \n", "- `.fillna(value='value')`: fills missing values with 'value' (or any other value of your choosing)\n", "\n", "The [official documentation](https://pandas.pydata.org/docs/reference/series.html) on Pandas Series provides a list of all available functions. \n", "We'll explore the functions of Pandas Series in more detail in the upcoming chapter, Data Exploration. 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }