Part 2 — Manipulating Data with Pandas

Feb 10, 2023

A machine learning model is shaped by the data on which it is trained. A model trained on data that is scarce, incorrectly entered, out of range, or full of missing values does not perform well in the real world. Ideally we would work with the whole population, but collecting every data point in a population is nearly impossible. Instead we work with a sample: a selected subset of the population that is meant to be representative of it.

We can deal with bad data in one of three ways:

· Choose a model that can work with incomplete data, or

· Remove samples (rows) that have incomplete data, or

· Artificially fill in missing values with reasonable substitutes such as the mean, the median, or the most frequent entry.

Pandas is a package used to manipulate data in DataFrame form, a tabular structure much like a matrix or a spreadsheet.

We have the following categories of data:

1. Continuous data are numbers that can be increased or decreased by any amount. Floating-point numbers are the best fit for continuous data.

2. Categorical data are data that don't fall on a spectrum. They can't be stored as numbers in an obvious way. Data with only two categories can usually be encoded as Boolean or integer values. With three or more categories, one typically needs to one-hot encode or binary encode them.

3. Ordinal data are categorical data that have an order, and so can be stored as numbers. Ordinal data are typically encoded as integers in ascending order, so that the numbers preserve the ranking.
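As a rough illustration of these encodings, here is a minimal sketch. The column names, values, and the `size_order` mapping are invented for this example and are not from any particular dataset.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration.
df = pd.DataFrame({
    "height_cm": [170.5, 182.0, 165.2],    # continuous: stored as floats
    "is_member": ["yes", "no", "yes"],     # categorical with two categories
    "colour": ["red", "green", "blue"],    # categorical with three categories
    "size": ["small", "large", "medium"],  # ordinal
})

# Two categories: encode as a Boolean column.
df["is_member"] = df["is_member"] == "yes"

# Three or more categories: one-hot encode into one column per category.
colour_dummies = pd.get_dummies(df["colour"], prefix="colour")

# Ordinal: map each level to an integer that preserves the order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

print(colour_dummies)
print(df)
```

Each row of `colour_dummies` has exactly one column switched on, which is what "one-hot" refers to.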

Below are common Pandas functions for manipulating data.

Below is how to load different files into a Pandas dataframe

import pandas as pd
df = pd.read_csv('file.csv')
df = pd.read_table('file.tsv')  # expects tab-separated values by default
df = pd.read_parquet('file.parquet')
df = pd.read_excel('file.xlsx')
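To try the readers without any files on disk, the sketch below feeds in-memory text to `read_csv` and `read_table`. The sample contents are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents, held in memory instead of a file.
csv_text = "name,age,city\nAda,36,London\nLinus,28,Helsinki\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default, so it needs tab-separated input.
tsv_text = csv_text.replace(",", "\t")
df_tab = pd.read_table(io.StringIO(tsv_text))

print(df)
print(df.equals(df_tab))
```

Both calls produce the same DataFrame here. The only difference is the default separator each reader expects.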

Returns the first N rows of a dataframe

df.head()    # returns the first 5 rows
df.head(30)  # returns the first 30 rows
print(df.head(3))

Returns the last N rows of a dataframe

df.tail()    # returns the last 5 rows
df.tail(20)  # returns the last 20 rows
print(df.tail(3))

Returns the number of rows and columns in a dataframe

df.shape
# returns a tuple such as (45, 3): the number of rows, then the number of columns

Returns information about a dataframe, including column names, data types, and memory usage

df.info()

Returns basic statistics for all numerical columns in a dataframe

df.describe()
df.describe(include='all')  # also summarises non-numeric columns
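The inspection calls above can be tried together on a small frame. This is only a sketch, and the `city` and `temp_c` columns are made-up sample data.

```python
import pandas as pd

# Hypothetical data: six rows, one text column and one numeric column.
df = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F"],
    "temp_c": [21.0, 19.5, 25.0, 30.5, 18.0, 22.0],
})

first_two = df.head(2)    # rows for cities A and B
last_two = df.tail(2)     # rows for cities E and F
n_rows, n_cols = df.shape

summary = df.describe()                   # numeric columns only
summary_all = df.describe(include="all")  # adds count/unique/top/freq for text

df.info()  # prints column names, dtypes, and memory usage
print(n_rows, n_cols)
print(summary_all)
```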

Access a specific column by name

df['column_name']
df.column_name  # attribute access; works only when the name is a valid Python identifier

Access a specific row by index label

df.loc[0]

Access a specific row or cell by integer-based index

df.iloc[0]     # first row
df.iloc[0, 0]  # cell in the first row and first column
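The difference between `loc` and `iloc` only shows up when index labels differ from positions. The sketch below uses a made-up frame whose labels are 10, 20, and 30 to make that visible.

```python
import pandas as pd

# Hypothetical data with index labels that are not 0, 1, 2.
df = pd.DataFrame(
    {"name": ["Ada", "Linus", "Grace"], "age": [36, 28, 45]},
    index=[10, 20, 30],
)

row_by_label = df.loc[10]     # the row whose index label is 10
row_by_position = df.iloc[0]  # the first row, whatever its label

cell_by_label = df.loc[20, "age"]  # label 20, column "age"
cell_by_position = df.iloc[1, 1]   # second row, second column

print(row_by_label)
print(cell_by_label, cell_by_position)
```

On this frame `df.loc[0]` would raise a `KeyError`, because no row carries the label 0.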

Sorts a dataframe by one or more columns

df.sort_values(by='column_name', ascending=False, inplace=True)
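Sorting by more than one column takes a list of names and, optionally, a list of directions. The following is a minimal sketch with invented `team` and `score` columns. It returns a new frame instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "team": ["b", "a", "b", "a"],
    "score": [3, 7, 9, 1],
})

# Sort by team ascending, then by score descending within each team.
sorted_df = (
    df.sort_values(by=["team", "score"], ascending=[True, False])
      .reset_index(drop=True)  # renumber rows after sorting
)

print(sorted_df)
```

Without `inplace=True` the original `df` is left untouched, which can make a chain of steps easier to follow.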

Groups a dataframe by one or more columns and applies a function to each group

grouped = df.groupby(['column_name'])
print(grouped.mean(numeric_only=True))  # average only the numeric columns
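A grouped object can also be narrowed to one column and given several aggregations at once. Here is a small sketch, with the `team`, `player`, and `score` columns invented for the example.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "player": ["p1", "p2", "p3", "p4"],
    "score": [1, 3, 10, 20],
})

# Mean score per team; selecting the column first avoids non-numeric columns.
mean_scores = df.groupby("team")["score"].mean()

# Several aggregations in one call.
stats = df.groupby("team")["score"].agg(["mean", "max", "count"])

print(mean_scores)
print(stats)
```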

Creates a pivot table from a dataframe

pivot = df.pivot_table(
    index='column_name_1',
    columns='column_name_2',
    values='column_name_3'
)
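For a concrete picture of what a pivot table produces, here is a sketch on invented sales data. Note that `pivot_table` averages values by default, so this example passes `aggfunc="sum"` explicitly.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "product": ["x", "y", "x", "y", "x"],
    "units": [10, 5, 7, 3, 20],
})

# One row per region, one column per product, cells hold total units.
pivot = sales.pivot_table(
    index="region",
    columns="product",
    values="units",
    aggfunc="sum",
)

print(pivot)
```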

Merges two dataframes on a common column

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})
df_merged = pd.merge(df1, df2, on='key')
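By default `pd.merge` performs an inner join, so only keys present in both frames survive. The sketch below reuses the two frames from the article and contrasts the default with `how="outer"`. The suffix names are chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5, 6, 7, 8]})

# Inner join (the default): only keys B and D appear in both frames.
# The clashing "value" columns are renamed value_x and value_y.
inner = pd.merge(df1, df2, on="key")

# Outer join: keep every key from either frame, with NaN where a side is missing.
outer = pd.merge(df1, df2, on="key", how="outer", suffixes=("_left", "_right"))

print(inner)
print(outer)
```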

Concatenates two or more dataframes along a particular axis

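df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})
# A second frame with the same columns (illustrative values)
df2 = pd.DataFrame({'A': ['A4'], 'B': ['B4'], 'C': ['C4'], 'D': ['D4']})
# axis=0 stacks the rows of df2 under df1
df_concat = pd.concat([df1, df2], axis=0, ignore_index=True)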
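To make the choice of axis concrete, here is a minimal sketch with small invented frames. `axis=0` stacks rows and `axis=1` places frames side by side.

```python
import pandas as pd

# Hypothetical frames for illustration.
top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})
extra = pd.DataFrame({"C": ["C0", "C1"]})

# axis=0: append the rows of `bottom` under `top`, renumbering the index.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1: add the columns of `extra` next to `top`, aligned on the index.
side_by_side = pd.concat([top, extra], axis=1)

print(stacked)
print(side_by_side)
```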

Count how many times each unique value appears in a column.

df['column_name'].value_counts()
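`value_counts()` returns one count per distinct value, while `nunique()` returns how many distinct values there are. A short sketch on an invented Series shows the two side by side.

```python
import pandas as pd

# Hypothetical column of colours.
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colours.value_counts()  # occurrences per value, most frequent first
n_unique = colours.nunique()     # number of distinct values

print(counts)
print(n_unique)
```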

Remove missing values (NA) from a DataFrame

df.dropna()

Fill missing values (NA) in a DataFrame with a specified value.

df.fillna(0)   # all the columns
df['column_name'].fillna(0)  # for a specific column
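The substitutes mentioned at the start of the article (the mean, the median, or the most frequent entry) can be sketched as follows. The `age` and `city` columns are hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing label.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["A", "B", None, "A"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill the gaps with a reasonable substitute.
filled = df.copy()
# Numeric column: use the mean (the median, via .median(), works the same way).
filled["age"] = filled["age"].fillna(filled["age"].mean())
# Text column: use the most frequent entry.
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```

Neither `dropna` nor `fillna` modifies `df` here. Both return new objects, so the result has to be assigned to be kept.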

Group a DataFrame by one or more columns and perform aggregation.

df.groupby('column_name').mean(numeric_only=True)


Written by Setumo Raphela