Post

Summary of Kaggle-Pandas Course Content

I have summarized the content of the Pandas mini-course from Kaggle's public courses.

Pandas

Solve short hands-on challenges to perfect your data manipulation skills.

Lesson 1. Creating, Reading and Writing

Importing Pandas

1
import pandas as pd

Pandas has two core objects: DataFrame and Series.

DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column. DataFrame entries don’t need to be integers.

1
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

DataFrames are declared in Python dictionary format. The key is the column name, and the value is a list of items to be entered.

Usually, when declaring a DataFrame, column labels are assigned the name of that column, but row labels are assigned integers 0, 1, 2… If necessary, you can manually specify row labels. The list of row labels in a DataFrame is called an Index, and can be specified using the index parameter.

1
2
3
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

Series

A Series is a sequence of data values.

1
pd.Series([1, 2, 3, 4, 5])

A Series is essentially the same as a single column of a DataFrame. Therefore, you can also specify an index. However, the difference is that it has a name instead of a ‘column name’.

1
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

Series and DataFrames are closely related. It helps to think of a DataFrame as just a bunch of Series put together.

Reading Data Files

In many cases, rather than writing data directly, you use existing data. Data can be stored in various formats, but the most basic form is a CSV file. The contents of a CSV file typically look like this:

1
2
3
4
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11

That is, a CSV file is a table where each value is separated by a comma. That’s why it’s called “Comma-Separated Values”, CSV.

To load data in CSV file format into a DataFrame, use the pd.read_csv() function.

You can check the size of a DataFrame using the shape attribute.

You can view the first five rows of a DataFrame using the head() command.

The pd.read_csv() function has over 30 parameters. For example, if the CSV file you’re trying to load contains its own index, you can specify the value of the index_col parameter to use that column as the index instead of Pandas automatically assigning an index.

Writing Data

You can export a DataFrame to a CSV file using the to_csv() method. It’s used like this:

1
(DataFrame name).to_csv("(CSV file path)")

Lesson 2. Indexing, Selecting & Assigning

Selecting specific values to use from a Pandas DataFrame or Series is a step in almost every data-using task.

This post is licensed under CC BY-NC 4.0 by the author.

Comments powered by Disqus.