Data Inspection

In this book, we’ll be exploring the NYC real estate dataset provided by the City of New York and sourced from Kaggle.

The dataset includes properties that were sold between September 2016 and September 2017. Each row represents the sale of a property, while the columns are features that describe the property. Here is a list of the columns:

  • Borough

  • Neighborhood

  • Building class category

  • Tax class at present

  • Block

  • Lot

  • Ease-ment

  • Building class at present

  • Address

  • Apartment number

  • Zip code

  • Residential units

  • Commercial units

  • Total units

  • Land square feet

  • Gross square feet

  • Year built

  • Tax class at time of sale

  • Building class at time of sale

  • Sale price

  • Sale date

For the official definitions of each of these columns, refer to the Data Definitions in the Appendix.

1) Loading in the data

Before we get started, the very first thing we need to do is import the pandas package into our environment so that it’s readily available. Let’s give it the alias of “pd” so that we don’t have to write out “pandas” every time we call a function.

import pandas as pd

In order to explore the NYC real estate dataset, we need to load it into our environment. We can do this using the Pandas read_csv() function. Let’s call our new dataframe "df" (short for dataframe).

df = pd.read_csv("data/nyc_real_estate.csv")
df.columns
Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
       'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
       'SALE PRICE', 'SALE DATE'],
      dtype='object')

2) Checking the shape and size of our dataset

As a data scientist, one of the first things we want to know about a dataset is how big it is. How many rows and columns are we dealing with? Let’s figure out the “shape” of our dataframe:

df.shape
(84548, 21)

df.shape returns the number of rows and number of columns of a dataframe:

(n_rows, n_columns)

Our dataset has ~84.5K rows, which means that we have data on ~84.5K property sales in NYC.

Number of rows

The length of a dataset is its number of rows, so if you’re only interested in knowing how many rows we’re dealing with, you can simply apply the len() function:

len(df)
84548

You can also get the number of rows from df.shape by accessing the first element of df.shape’s output:

df.shape[0]
84548

Number of columns

To get the number of columns of a dataset, you can first get the list of a dataset’s columns:

df.columns
Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
       'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
       'SALE PRICE', 'SALE DATE'],
      dtype='object')

Then, get the length of the column list:

len(df.columns)
21

Alternatively, you can also get the number of columns by getting the second element of df.shape’s output:

df.shape[1]
21

Size

What if we want to know how many data points (“cells”) are in the dataset? This is considered the size of a dataframe.

\(\text{number of data points} = \text{size} = \text{n_rows} \times \text{n_cols}\)

You can get a dataframe’s size using .size.

df.size
1775508

There are 1,775,508 data points in our dataset!

Let’s check that df.size equals n_rows * n_cols:

df.size == df.shape[0] * df.shape[1]
True

Note

Since Pandas stores data in-memory, it is not optimized to work with very large datasets. The maximum amount of data that Pandas can handle depends on 2 things:

  • The type of data you’re working with (e.g., integers and floats are more space-efficient than strings)

  • RAM of your machine. The larger the memory, the more data you’ll be able to work with.

If your dataset is too big for Pandas to handle, you’ll get a MemoryError. There are several ways to get around this problem, as described in the Pandas documentation, including splitting the dataset into smaller chunks or using Dask, a parallel computing library, instead.
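As a rough sketch of the chunking approach, read_csv’s chunksize parameter returns an iterator of smaller dataframes instead of one big one. The in-memory CSV below is a made-up stand-in for a file too large to load at once:

```python
import io

import pandas as pd

# Stand-in for a CSV file too large to load in one go.
csv_data = io.StringIO("SALE PRICE\n100\n200\n300\n400\n500\n")

total_rows = 0
# chunksize=2 yields dataframes of (up to) 2 rows each.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)  # process each chunk, then let it go

print(total_rows)  # 5
```

Each chunk is an ordinary dataframe, so anything you can do to df you can do to a chunk; only one chunk needs to fit in memory at a time.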

3) Viewing the data

Now, let’s take a peek at our data to get a sense of what it looks like 👀. There are 3 Pandas methods that you can use to inspect data:

  1. .head(): view the first n rows

  2. .tail(): view the last n rows

  3. .sample(): view n random rows

If you don’t mind viewing a lot of data at once, you can simply print df, which will display the first 5 rows and last 5 rows of the dataframe.
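You can see this truncation behavior on a small made-up dataframe (the 100-row frame here is just for illustration; the cutoff is governed by pandas’ display.max_rows option):

```python
import pandas as pd

# A made-up 100-row dataframe, just to show truncated printing.
df_demo = pd.DataFrame({"x": range(100)})

text = repr(df_demo)
# Only the first and last rows are rendered; the middle is elided,
# and a footer reports the true shape.
print(len(text.splitlines()) < 100)      # True: output is truncated
print("[100 rows x 1 columns]" in text)  # True
```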

First \(n\) rows

By default, df.head() returns the first 5 rows but you can customize this by defining n rows that you want to return inside of the parentheses. For example, df.head(n=3) would return 3 rows and df.head(n=7) would return 7 rows.

df.head()
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 392 6 C2 153 AVENUE B ... 5 0 5 1633 6440 1900 2 C2 6625000 2017-07-19 00:00:00
1 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 26 C7 234 EAST 4TH STREET ... 28 3 31 4616 18690 1900 2 C7 - 2016-12-14 00:00:00
2 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 39 C7 197 EAST 3RD STREET ... 16 1 17 2212 7803 1900 2 C7 - 2016-12-09 00:00:00
3 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2B 402 21 C4 154 EAST 7TH STREET ... 10 0 10 2272 6794 1913 2 C4 3936272 2016-09-23 00:00:00
4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 404 55 C2 301 EAST 10TH STREET ... 6 0 6 2369 4615 1900 2 C2 8000000 2016-11-17 00:00:00

5 rows × 21 columns

Last \(n\) rows

df.tail() does the opposite of df.head(): it returns the last 5 rows of a dataframe rather than the first 5. Similarly, you can specify how many rows you want to return by passing in the number inside the parentheses:

df.tail(n=2)
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
84546 5 WOODROW 22 STORE BUILDINGS 4 7100 28 K6 2730 ARTHUR KILL ROAD ... 0 7 7 208033 64117 2001 4 K6 11693337 2016-12-21 00:00:00
84547 5 WOODROW 35 INDOOR PUBLIC AND CULTURAL FACILITIES 4 7105 679 P9 155 CLAY PIT ROAD ... 0 1 1 10796 2400 2006 4 P9 69300 2016-10-27 00:00:00

2 rows × 21 columns

Note

df.tail(2) and df.tail(n=2) return the same thing. The latter is more explicit which can be easier for other people reading your code to understand.

Sampling \(n\) rows

If you need to randomly sample rows from your dataframe, you can use df.sample(). Unlike df.head() and df.tail(), df.sample() will return 1 row by default instead of 5.

df.sample()
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
51394 4 BAYSIDE 02 TWO FAMILY DWELLINGS 1 7366 63 B3 48-25 207TH STREET ... 2 0 2 4000 1812 1935 1 B3 1090000 2017-06-15 00:00:00

1 rows × 21 columns
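df.sample() also takes a few handy parameters: n for a fixed number of rows, frac for a proportion of rows, and random_state to make the draw reproducible. A quick sketch on a made-up dataframe:

```python
import pandas as pd

# Made-up data for illustration.
df_demo = pd.DataFrame({"price": [100, 200, 300, 400, 500]})

# Same seed in, same rows out, run after run.
s1 = df_demo.sample(n=2, random_state=42)
s2 = df_demo.sample(n=2, random_state=42)
print(s1.equals(s2))  # True

# frac draws a proportion of rows instead of a fixed count.
print(len(df_demo.sample(frac=0.4, random_state=0)))  # 2
```

Setting random_state is worth the extra keystrokes whenever you want a colleague (or your future self) to reproduce exactly the rows you looked at.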

4) Assessing missing values

The fastest way to assess missing values of a given dataset is to use df.info(). It gives a good overview of the columns, datatypes, and “non-null count” (a cryptic way of describing how many values are not missing).

Let’s test it out on our dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84548 entries, 0 to 84547
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   BOROUGH                         84548 non-null  int64 
 1   NEIGHBORHOOD                    84548 non-null  object
 2   BUILDING CLASS CATEGORY         84548 non-null  object
 3   TAX CLASS AT PRESENT            84548 non-null  object
 4   BLOCK                           84548 non-null  int64 
 5   LOT                             84548 non-null  int64 
 6   EASE-MENT                       84548 non-null  object
 7   BUILDING CLASS AT PRESENT       84548 non-null  object
 8   ADDRESS                         84548 non-null  object
 9   APARTMENT NUMBER                84548 non-null  object
 10  ZIP CODE                        84548 non-null  int64 
 11  RESIDENTIAL UNITS               84548 non-null  int64 
 12  COMMERCIAL UNITS                84548 non-null  int64 
 13  TOTAL UNITS                     84548 non-null  int64 
 14  LAND SQUARE FEET                84548 non-null  object
 15  GROSS SQUARE FEET               84548 non-null  object
 16  YEAR BUILT                      84548 non-null  int64 
 17  TAX CLASS AT TIME OF SALE       84548 non-null  int64 
 18  BUILDING CLASS AT TIME OF SALE  84548 non-null  object
 19  SALE PRICE                      84548 non-null  object
 20  SALE DATE                       84548 non-null  object
dtypes: int64(9), object(12)
memory usage: 13.5+ MB

At the very bottom of the output, we see:

dtypes: int64(9), object(12)

There are 9 columns of type int64, which store integer (numerical) values, and 12 columns of type “object”, which typically describes columns storing string values.
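You can watch this type inference happen on a tiny made-up frame (the values below echo two of our real columns but are illustrative):

```python
import pandas as pd

# Made-up frame mixing the two dtypes discussed above.
df_demo = pd.DataFrame(
    {"BLOCK": [392, 399], "NEIGHBORHOOD": ["ALPHABET CITY", "CHELSEA"]}
)

# Whole numbers infer as int64; strings fall back to "object".
print(df_demo.dtypes["BLOCK"])         # int64
print(df_demo.dtypes["NEIGHBORHOOD"])  # object
```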

There are 84,548 rows, and every one of the 21 columns has 84,548 non-null values, which indicates that we don’t have any missing values in our dataset. 🎉

Using .isnull() and .notnull()

You can use .isnull() to generate a boolean mask that returns True when the value is missing and False when it is not.

Note

To reverse the boolean mask, you can use df.notnull() which returns True when the value is not null (i.e., not missing).

Let’s try it out on our dataframe:

df.isnull().head()
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False

5 rows × 21 columns

We can count the number of missing values in each column like this:

df.isnull().sum()
BOROUGH                           0
NEIGHBORHOOD                      0
BUILDING CLASS CATEGORY           0
TAX CLASS AT PRESENT              0
BLOCK                             0
LOT                               0
EASE-MENT                         0
BUILDING CLASS AT PRESENT         0
ADDRESS                           0
APARTMENT NUMBER                  0
ZIP CODE                          0
RESIDENTIAL UNITS                 0
COMMERCIAL UNITS                  0
TOTAL UNITS                       0
LAND SQUARE FEET                  0
GROSS SQUARE FEET                 0
YEAR BUILT                        0
TAX CLASS AT TIME OF SALE         0
BUILDING CLASS AT TIME OF SALE    0
SALE PRICE                        0
SALE DATE                         0
dtype: int64

As expected, there are 0 missing values in our dataset. 👍
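Chaining a second .sum() collapses the per-column counts into one grand total, which makes for a handy single-number check. Here is a sketch on a made-up frame that does contain missing values:

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values.
df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

print(df_demo.isnull().sum().sum())   # 2: total missing cells
print(df_demo.notnull().sum().sum())  # 4: total present cells
```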

Using the missingno package

Another way to visualize missing values in your dataset is to use the missingno package. It’s a great way to assess which rows have missing values and how sparse or dense your dataset is. Let’s import it and give it the alias “msno”.

import missingno as msno

Let’s use the matrix() function to visualize patterns of missing values in our dataset:

msno.matrix(df)
<AxesSubplot:>
[Figure: missingno matrix of df, with every column rendered as a fully shaded bar]

From the plot above, we can see that our data is very dense. There are no missing values. For examples of what the matrix would look like with missing data, check out the missingno docs.