Data Cleansing for Machine Learning
Machine learning is a data science technique used to extract patterns from data, allowing computers to identify related data and to forecast future outcomes, behaviors, and trends.
Data cleaning deals with data quality issues such as errors, missing values, and outliers.
Tabular data is typically available in the form of rows and columns.
In tabular data, each row describes a single observation, and each column describes a different property of the observation.
Column values can be continuous (numerical), discrete (categorical), datetime (time-series), or text.
· Null values refer to unknown or missing data, as well as irrelevant responses. Strategies for dealing with this scenario include (see the code sketch after this list):
· Dropping these records
o Works when you do not need to use the information for downstream workloads.
· Adding a placeholder (for example, -1)
o Allows you to see missing data later on without violating a schema.
· Basic imputing
o Allows you to make a “best guess” at what the data could have been, often by using the mean or median of the non-missing values for numerical data, or the most frequent value of the non-missing values for categorical data.
· Advanced imputing
o Determines the “best guess” of what the data should be, using more advanced strategies, for example machine-learning clustering algorithms or oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique).
· In some situations, a column has inconsistent data types.
o For example, a column can have a combination of numbers presented as strings, like "44.5" and "25.1".
o As part of data cleaning, you often have to convert the data in the column to its correct data type.
· In some situations, you find duplicate records in the table.
o The easiest solution is to drop the duplicate records.
· An outlier is an observation that is significantly different from all other observations in a given column.
o There are several ways to identify outliers; one common approach is to compute the Z-score for an observation x: z = (x - mean) / standard deviation. Observations with an absolute Z-score above a chosen threshold (commonly 3) are treated as outliers.
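As an illustration, the cleaning steps in this list can be sketched with pandas, NumPy, and scikit-learn. The DataFrame, its column names, and the Z-score threshold of 3 are hypothetical choices for this example, not fixed rules:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical DataFrame illustrating the issues above
df = pd.DataFrame({
    "age":   [25.0, None, 47.0, 31.0, 31.0],
    "city":  ["NY", "LA", np.nan, "NY", "NY"],
    "price": ["44.5", "25.1", "oops", "30.0", "30.0"],
})

# Null values: drop the records, or add a placeholder such as -1
df_dropped = df.dropna()
df_flagged = df.fillna({"age": -1})

# Basic imputing: mean for numerical data, most_frequent for categorical data
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# Inconsistent data types: convert numbers stored as strings to floats;
# values that cannot be parsed become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Duplicate records: keep the first occurrence, drop the rest
df = df.drop_duplicates()

# Outliers: Z-score of each observation in a hypothetical column
x = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 9.6, 10.1,
              10.0, 9.9, 10.2, 10.3, 9.8, 10.1, 9.7, 10.0, 10.2, 100.0])
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])  # flags the extreme value 100.0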
Machine learning models are only as strong as the data they are trained on.
There are many valid approaches to feature engineering. Some of the most popular ones, categorized by data type, are as follows (a few are sketched in code after the list):
· Aggregation (count, sum, average, mean, median, and the like)
· Part-of (year of date, month of date, week of date, and the like)
· Binning (grouping entities into bins and then applying aggregations)
· Flagging (boolean conditions resulting in True or False)
· Frequency-based (calculating the frequencies of the levels of one or more categorical variables)
· Embedding (transforming one or more categorical or text features into a new set of features, possibly with a different cardinality)
· Deriving by example
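A few of these techniques, sketched in pandas on a hypothetical DataFrame with made-up customer, amount, and date columns:

import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 7.0, 9.0],
    "date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-01-20",
                            "2023-03-15", "2023-03-18"]),
})

# Aggregation: total and mean amount per customer
agg = df.groupby("customer")["amount"].agg(["sum", "mean"])

# Part-of: extract the year and month components of the date
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Binning: group amounts into bins, to which aggregations can then be applied
df["amount_bin"] = pd.cut(df["amount"], bins=[0, 8, 15, 100],
                          labels=["low", "medium", "high"])

# Flagging: a boolean condition resulting in True or False
df["is_large"] = df["amount"] > 15

# Frequency-based: frequency of each level of a categorical variable
df["customer_freq"] = df["customer"].map(df["customer"].value_counts())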
When a cyclical feature such as the hour of the day is encoded with sine and cosine transforms, the cosine function provides symmetrically equal weights to corresponding AM and PM hours, and the sine function provides symmetrically opposite weights to corresponding AM and PM hours.
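A minimal sketch of this sine/cosine encoding for a hypothetical hour-of-day column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": range(24)})

# Map the 24-hour cycle onto the unit circle so that hour 23 and hour 0
# end up close together in feature space
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)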
Scaling numerical features is an important part of pre-processing data for machine learning.
There are two common approaches to scaling numerical features:
· Normalization
o Normalization mathematically rescales the data into the range [0, 1]: x_scaled = (x - min) / (max - min).
· Standardization
o Standardization rescales the data to have mean = 0 and standard deviation = 1: x_scaled = (x - mean) / standard deviation.
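Both approaches are available in scikit-learn; a minimal sketch on illustrative data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

# Normalization: rescale into the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale to mean = 0 and standard deviation = 1
X_std = StandardScaler().fit_transform(X)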
A type of data that is prevalent in machine learning is categorical data.
Categorical data takes on a discrete or limited set of values.
For example, a person’s gender or ethnicity is considered categorical.
There are two common approaches for encoding categorical data:
· Ordinal encoding
o Ordinal encoding converts categorical data into integer codes ranging from 0 to (number of categories - 1).
· One-hot encoding
o One-hot encoding is often the recommended approach. It transforms each categorical value into n binary values (where n is the number of categories), with exactly one of them set to 1 and all others set to 0.
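A minimal sketch of both encodings with scikit-learn; the color categories are hypothetical:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal encoding: integer codes from 0 to (number of categories - 1)
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot encoding: one binary column per category, exactly one set to 1
onehot = OneHotEncoder().fit_transform(colors).toarray()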