Part 2 — Manipulating Data with Pandas
A machine learning model is shaped by the data on which it is trained. Consequently, a model trained on too little data, or on data that is incorrect, out of range, or full of missing values, does not perform well in the real world. Ideally we would work with the whole population, but it is nearly impossible to collect every data point for a population. A sample is a selected subset of the population that is meant to be representative of it.
We can deal with bad data by:
· Choosing a model that can work with incomplete data, or
· Removing samples (rows) that have incomplete data, or
· Artificially filling in missing values with reasonable substitutes such as the mean, the median, or the most frequent entry (the mode).
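The second and third options above can be sketched in Pandas roughly as follows. The toy DataFrame and its column names (age, city) are made up for illustration, and the choice of median for numbers and mode for text is only one reasonable assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a gap in a numeric and in a text column
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Option 2: drop every row that has at least one missing value
dropped = df.dropna()

# Option 3: fill the gaps with a substitute instead of dropping rows
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # median of 25, 35, 40
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most frequent entry

print(dropped)
print(filled)
```

Dropping rows is simple but throws information away; filling keeps every row at the cost of inventing values, so the better choice depends on how much data is missing.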
Pandas is a package used to manipulate data in the form of a DataFrame, which resembles a matrix or a spreadsheet.
We have the following categories of data:
1. Continuous data are numbers that can be increased or decreased by any amount. When working with continuous data, floating-point numbers are the best fit.
2. Categorical data are data that don't fall on a spectrum. Categorical data can't be stored as numbers in an obvious way. Data with only two categories can usually be encoded as Boolean or integer data. With three or more categories, one typically needs to one-hot encode or binary encode them.
3. Ordinal data are categorical data that have an order, and so can be stored as numbers. Ordinal data are typically encoded as integers whose order matches the order of the categories.
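As a rough sketch of how these categories might be encoded in Pandas, consider the example below. The column names and the size mapping are hypothetical, and `pd.get_dummies` is just one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical data with one column per kind of data
df = pd.DataFrame({
    "temperature": [21.5, 19.0, 23.2],     # continuous, stored as floats
    "is_member": ["yes", "no", "yes"],     # categorical, two categories
    "colour": ["red", "green", "blue"],    # categorical, three categories
    "size": ["small", "large", "medium"],  # ordinal
})

# Two categories: encode as the integers 0 and 1
df["is_member"] = (df["is_member"] == "yes").astype(int)

# Three or more categories: one-hot encode, one new column per category
df = pd.get_dummies(df, columns=["colour"], dtype=int)

# Ordinal: map to integers that keep the order (this mapping is an assumption)
size_order = {"small": 0, "medium": 1, "large": 2}
df["size"] = df["size"].map(size_order)

print(df)
```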
Below are Pandas functions for manipulating data.
Below is how to load different file types into a Pandas DataFrame
import pandas as pd
df = pd.read_csv('file.csv')
df = pd.read_table('file.tsv')  # expects tab-separated values by default
df = pd.read_parquet('file.parquet')
df = pd.read_excel('file.xlsx')
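To try the readers without any files on disk, one option is to feed them in-memory text. This is a small sketch with made-up data; it assumes nothing beyond the standard `io` module and the default separators (comma for `read_csv`, tab for `read_table`).

```python
import io
import pandas as pd

# Made-up comma-separated text standing in for file.csv
csv_text = "name,score\nAda,91\nLinus,78\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same data, tab-separated, read with read_table
tsv_text = "name\tscore\nAda\t91\nLinus\t78\n"
df_tab = pd.read_table(io.StringIO(tsv_text))

print(df_csv)
print(df_tab)
```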
Returns the first N rows of a dataframe
df.head()  # return the first 5 entries
df.head(30)  # return the first 30 entries
print(df.head(3))
Returns the last N rows of a dataframe
df.tail()  # return the last 5 entries
df.tail(20)  # return the last 20 entries
print(df.tail(3))
Returns the number of rows and columns in a dataframe
df.shape
# returns a tuple (rows, columns), e.g. (45, 3) for 45 rows and 3 columns
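A quick way to see head, tail, and shape together is to run them on a small generated DataFrame. The data here is invented purely for the demonstration.

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows and 2 columns
df = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_three = df.head(3)  # rows 0, 1, 2
last_two = df.tail(2)     # rows 8, 9
rows, cols = df.shape     # shape is an attribute, not a method

print(first_three)
print(last_two)
print(rows, cols)
```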
Returns information about a dataframe, including column names, data types, and memory usage
df.info()
Returns basic statistics for all numerical columns in a dataframe
df.describe()
df.describe(include='all')  # also summarize non-numerical columns
Access a specific column by name
df['column_name']
df.column_name  # attribute access works only when the name is a valid Python identifier
Access a specific row by index label
df.loc[0]
Access a specific row or cell by integer position
df.iloc[0]  # first row
df.iloc[0, 0]  # cell in the first row and first column
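The difference between the two accessors only shows up when the index labels are not 0, 1, 2, and so on. Here is a minimal sketch with a hypothetical DataFrame whose index labels are 10, 20, 30.

```python
import pandas as pd

# Hypothetical data with non-default index labels
df = pd.DataFrame(
    {"name": ["Ada", "Linus", "Grace"], "score": [91, 78, 85]},
    index=[10, 20, 30],
)

by_label = df.loc[10]                # the row whose index label is 10
by_position = df.iloc[0]             # the first row, whatever its label is
cell = df.iloc[2, 1]                 # third row, second column
cell_by_label = df.loc[30, "score"]  # the same cell, addressed by labels

print(by_label)
print(cell, cell_by_label)
```

With this index, `df.loc[0]` would raise a `KeyError` because no row is labelled 0, while `df.iloc[0]` still returns the first row.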
Sorts a dataframe by one or more columns
df.sort_values(by='column_name', ascending=False, inplace=True)
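Sorting by more than one column, with a different direction per column, might look like the sketch below. The data is made up, and the last lines illustrate that `inplace=True` modifies the DataFrame and returns `None`.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]})

# Sort by team ascending, then by points descending within each team
sorted_df = df.sort_values(by=["team", "points"], ascending=[True, False])
print(sorted_df)

# With inplace=True the original is changed and nothing is returned
returned = df.sort_values(by="points", inplace=True)
print(returned)
print(df)
```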
Groups a dataframe by one or more columns and applies a function to each group
grouped = df.groupby(['column_name'])
print(grouped.mean(numeric_only=True))  # numeric_only skips text columns, which cannot be averaged
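To make the grouping concrete, here is a small sketch on invented data. Selecting the column to aggregate before calling the function avoids problems with non-numeric columns, and `agg` is one way to compute several statistics at once.

```python
import pandas as pd

# Hypothetical data: two teams, two rows each
df = pd.DataFrame({
    "team": ["a", "b", "a", "b"],
    "player": ["p1", "p2", "p3", "p4"],
    "points": [10, 20, 30, 40],
})

# Mean of points per team
means = df.groupby("team")["points"].mean()

# Several aggregations per team in one call
summary = df.groupby("team")["points"].agg(["mean", "max", "count"])

print(means)
print(summary)
```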
Creates a pivot table from a dataframe
pivot = df.pivot_table(
    index='column_name_1',
    columns='column_name_2',
    values='column_name_3'
)
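A pivot table is easier to picture with real values. The sketch below uses invented sales data. By default `pivot_table` averages the values that land in the same cell, and the `aggfunc` parameter changes that.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "product": ["pen", "ink", "pen", "ink", "pen"],
    "amount": [10, 20, 30, 40, 50],
})

# One row per region, one column per product, mean amount in each cell
pivot = sales.pivot_table(index="region", columns="product", values="amount")

# Same layout, but summing instead of averaging
totals = sales.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)

print(pivot)
print(totals)
```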
Merges two dataframes on a common column
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})
df_merged = pd.merge(df1, df2, on='key')
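Running the merge on these two frames shows what it keeps. By default `pd.merge` performs an inner join, so only keys present in both frames survive, and the clashing value columns get the suffixes _x and _y. The outer join and the custom suffixes below are added for comparison and are not part of the original snippet.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5, 6, 7, 8]})

# Default inner join: only keys B and D appear in both frames
inner = pd.merge(df1, df2, on="key")

# Outer join: keep every key, missing values become NaN
outer = pd.merge(df1, df2, on="key", how="outer", suffixes=("_left", "_right"))

print(inner)
print(outer)
```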
Concatenates two or more dataframes along a particular axis
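df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})
# df2 is an illustrative second frame with the same columns
df2 = pd.DataFrame({
    'A': ['A4', 'A5'], 'B': ['B4', 'B5'], 'C': ['C4', 'C5'], 'D': ['D4', 'D5']
})
df_concat = pd.concat([df1, df2], axis=0)  # axis=0 stacks rows, axis=1 places columns side by side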
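The axis argument decides whether frames are stacked on top of each other or placed side by side. A minimal sketch with small hypothetical frames:

```python
import pandas as pd

# Hypothetical frames
top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})
extra = pd.DataFrame({"C": ["C0", "C1"]})

# axis=0 (the default) stacks rows; ignore_index renumbers them 0..3
stacked = pd.concat([top, bottom], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([top, extra], axis=1)

print(stacked)
print(side_by_side)
```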
Count how many times each unique value appears in a column.
df['column_name'].value_counts()
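Note that `value_counts` returns one count per distinct value, sorted from most to least frequent, whereas `nunique` returns how many distinct values there are. A short sketch on made-up data:

```python
import pandas as pd

# Hypothetical column of colours
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colours.value_counts()  # occurrences of each distinct value
distinct = colours.nunique()     # number of distinct values

print(counts)
print(distinct)
```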
Remove missing values (NA) from a DataFrame
df.dropna()
Fill missing values (NA) in a DataFrame with a specified value.
df.fillna(0)  # fill every column
df['column_name'].fillna(0)  # fill one specific column
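One detail that is easy to miss: `fillna` returns a new object and leaves the original untouched unless the result is assigned back. The sketch below, on invented data, also shows that a dictionary can supply a different fill value for each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, "x"]})

# A different substitute for each column; df itself is not modified
filled = df.fillna({"a": 0, "b": "missing"})

print(df)
print(filled)
```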
Group a DataFrame by one or more columns and perform aggregation.
df.groupby('column_name').mean(numeric_only=True)