{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "70594b81",
   "metadata": {},
   "source": [
    "# Getting to Know the Pandas DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fdaa7fa",
   "metadata": {},
   "source": [
    "The [Pandas DataFrame](https://pandas.pydata.org/docs/reference/frame.html) is a data structure that allows us to manipulate and analyze tabular data. A \"tabular\" data structure can be thought of as a matrix, where rows represent observations and columns represent features that describe each observation. It's a structure that you would find in a SQL database or Excel spreadsheet. Let's say we have a tabular dataset about movies.\n",
    "\n",
    "<img width=\"50%\" src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/dataframe_structure.png\"/>\n",
    "\n",
    "In this case, each row represents a movie and each column represents a characteristic about the movie like the genre, rating, and director. The \"index\" column represents a row's position in the dataframe. By default, a Pandas DataFrame's index starts at 0."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdda2b7c",
   "metadata": {},
   "source": [
    "## Importing the Pandas package\n",
    "\n",
    "In order to create and use a Pandas DataFrame, we need to have the `pandas` package readily available in our environment. Let's import `pandas` and give it the alias of \"pd\" so that we don't have to write out \"pandas\" every time we call a function.\n",
    "\n",
    "<img src=\"https://media.giphy.com/media/nVsLCrW5iHf6E/giphy.gif\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f9dd60c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd2658d2",
   "metadata": {},
   "source": [
    "## Creating a dataframe\n",
    "\n",
    "There are several ways to create a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). Here, we'll describe 2 approaches. \n",
    "\n",
    "### Converting a dictionary to a dataframe\n",
    "\n",
    "You can create a dataframe from a dictionary. Each key of the dictionary represents a column name and the value of the dictionary is a list that represents values belonging to that particular column. Each element of the list represents the value of a row in the dataframe. \n",
    "\n",
    "<img width='70%' src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/dict_df.png\"/>\n",
    "\n",
    "Let's create a dataframe called `df_movies`.\n",
    "\n",
    "```{note}\n",
    "`df` is short for \"dataframe\". It's common for data scientists to name their dataframe \"df\". \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "27f8611b",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = {\n",
    "    'movie': ['Batman', 'Jungle Book', 'Titanic'], \n",
    "    'genre': ['action', 'kids', 'romance'], \n",
    "    'rating': [6, 9, 8],\n",
    "    'director': ['Tim Burton', 'Wolfgang Reitherman', 'James Cameron']\n",
    "}\n",
    "\n",
    "df_movies = pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5b2b9b9",
   "metadata": {},
   "source": [
    "We can confirm that `df_movies` is indeed a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c0c71ad1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df_movies)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ac9bef6",
   "metadata": {},
   "source": [
    "Now let's see how it looks 👀:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "09c7f239",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie</th>\n",
       "      <th>genre</th>\n",
       "      <th>rating</th>\n",
       "      <th>director</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Batman</td>\n",
       "      <td>action</td>\n",
       "      <td>6</td>\n",
       "      <td>Tim Burton</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jungle Book</td>\n",
       "      <td>kids</td>\n",
       "      <td>9</td>\n",
       "      <td>Wolfgang Reitherman</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Titanic</td>\n",
       "      <td>romance</td>\n",
       "      <td>8</td>\n",
       "      <td>James Cameron</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         movie    genre  rating             director\n",
       "0       Batman   action       6           Tim Burton\n",
       "1  Jungle Book     kids       9  Wolfgang Reitherman\n",
       "2      Titanic  romance       8        James Cameron"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6afd5ec6",
   "metadata": {},
   "source": [
    "### Loading a csv file into a dataframe\n",
    "\n",
    "You can also create a dataframe by importing tabular data from a comma-separated-value (csv) file, or Excel spreadsheet. A csv file looks somthing like this:\n",
    "\n",
    "<img width=\"30%\" src=\"https://practicalpython.s3.us-east-2.amazonaws.com/assets/example_csv_file.png\"/>\n",
    "\n",
    "To load this csv file into a Pandas DataFrame, we will need to use the Pandas [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)  function. For data in Excel format, you can use [`read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). We will also need to know the path where the csv file is located. This can be either on your local machine or in the cloud. \n",
    "\n",
    "Let's load in `movies_data.csv` file as a dataframe. The original file is located on my local machine in a folder called `data/`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "52e90654",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie</th>\n",
       "      <th>genre</th>\n",
       "      <th>rating</th>\n",
       "      <th>director</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Batman</td>\n",
       "      <td>action</td>\n",
       "      <td>6</td>\n",
       "      <td>Tim Burton</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jungle Book</td>\n",
       "      <td>kids</td>\n",
       "      <td>9</td>\n",
       "      <td>Wolfgang Reitherman</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Titanic</td>\n",
       "      <td>romance</td>\n",
       "      <td>8</td>\n",
       "      <td>James Cameron</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         movie    genre  rating             director\n",
       "0       Batman   action       6           Tim Burton\n",
       "1  Jungle Book     kids       9  Wolfgang Reitherman\n",
       "2      Titanic  romance       8        James Cameron"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies = pd.read_csv(\"data/movies_data.csv\")\n",
    "\n",
    "df_movies"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea1b631c",
   "metadata": {},
   "source": [
    "This csv-loaded dataframe is identical to the one that was generated from a dictionary. "
   ]
  },
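  {
   "cell_type": "markdown",
   "id": "b7e41d90",
   "metadata": {},
   "source": [
    "If you don't have a csv file on hand, here is a minimal sketch of the same idea that runs without one. It is an illustration only: `csv_lines` is an assumed stand-in that mirrors the rows shown above (the real contents of `movies_data.csv` may differ), and wrapping the text in `io.StringIO` lets `read_csv()` treat it like a file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c58f2a13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Assumed stand-in for the contents of movies_data.csv (illustration only)\n",
    "csv_lines = [\n",
    "    'movie,genre,rating,director',\n",
    "    'Batman,action,6,Tim Burton',\n",
    "    'Jungle Book,kids,9,Wolfgang Reitherman',\n",
    "    'Titanic,romance,8,James Cameron',\n",
    "]\n",
    "csv_text = '\\n'.join(csv_lines)\n",
    "\n",
    "# io.StringIO wraps the text in a file-like object that read_csv() can parse\n",
    "df_from_text = pd.read_csv(io.StringIO(csv_text))\n",
    "\n",
    "df_from_text"
   ]
  },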
  {
   "cell_type": "markdown",
   "id": "2629988a",
   "metadata": {},
   "source": [
    "## Pandas Series\n",
    "\n",
    "An important part of the Pandas DataFrame is the [Pandas Series](https://pandas.pydata.org/docs/reference/series.html). While the DataFrame is a 2-dimensional structure, a Series is 1-dimensional. It can store any datatype (integers, strings, floats, timestamps, even lists). A Series represents a single column of a DataFrame. This is how you get an individual column (represented as a Pandas Series) from a dataframe:\n",
    "\n",
    "```\n",
    "dataframe['column_name'] \n",
    "```\n",
    "\n",
    "Let's say we want to pull the `rating` column from our `df_movies` dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f1cf12c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    6\n",
       "1    9\n",
       "2    8\n",
       "Name: rating, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies['rating']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dcb11cc",
   "metadata": {},
   "source": [
    "The `rating` column is a Pandas Series! We can confirm its datatype:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3bc9b2c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df_movies['rating'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3615b86a",
   "metadata": {},
   "source": [
    "There is a wide range of built-in functions that come with the Pandas Series. Some examples include:\n",
    "\n",
    "- `.mean()`: if the column is numeric, it gets the average value of the column\n",
    "- `.nunique()`: counts number of unique values belonging to a particular column \n",
    "- `.fillna(value='value')`: fills missing values with 'value' (or any other value of your choosing)\n",
    "\n",
    "The [official documentation](https://pandas.pydata.org/docs/reference/series.html) on Pandas Series provides a list of all available functions. \n",
    "We'll explore the functions of Pandas Series in more detail in the upcmoing chapter, Data Exploration. "
   ]
  }
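  ,
  {
   "cell_type": "markdown",
   "id": "d92a6b04",
   "metadata": {},
   "source": [
    "As a quick preview, here is a small sketch of the three Series examples listed above. The `ratings` Series is hypothetical (it is not part of `df_movies`) and includes one missing value so that `.fillna()` has something to fill."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3f70c2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical ratings; None marks a missing value (stored as NaN)\n",
    "ratings = pd.Series([6, 9, None, 9], name='rating')\n",
    "\n",
    "average_rating = ratings.mean()           # missing values are skipped\n",
    "unique_ratings = ratings.nunique()        # missing values are not counted\n",
    "filled_ratings = ratings.fillna(value=0)  # replace the missing value with 0\n",
    "\n",
    "print(average_rating)\n",
    "print(unique_ratings)\n",
    "print(filled_ratings.tolist())"
   ]
  }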
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}