Start Your ML Journey with Supervised Learning
A Simple Tutorial to Jumpstart Your ML Journey with Fundamentals
If you are new to machine learning, this tutorial is for you. Supervised learning is a good first step into the world of machine learning, because the model learns from examples that already come with the correct answers.
What is Supervised Learning?
Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset. This means that for each piece of data in the training set, you already have the correct answer (the output). The goal of supervised learning is to learn a mapping function that maps inputs (features) to outputs (targets) so that when the model is given new examples, it can accurately predict the output.
Key Elements of Supervised Learning
Labeled Data: This is a dataset where each example is a pair consisting of an input object (typically a vector) and a desired output value (label).
Features and Target:
- Features are the input variables used for prediction.
- Target is the output variable that the model is trying to predict.
Training and Testing Data: The dataset is typically divided into two sets:
- Training Data: Used to train the model.
- Testing Data: Used to evaluate the performance of the model.
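To make the idea of labeled data and a train/test split concrete, here is a minimal sketch in plain Python. It assumes a 20% hold-out and a fixed shuffle seed, both chosen for illustration; the (size, price) pairs are the ones used in the worked example later in this tutorial.

```python
import random

# Labeled dataset: each example pairs a feature with its target.
# Feature = house size in square feet, target = price in dollars.
examples = [
    (550, 300000), (600, 320000), (1000, 360000), (1200, 400000), (1500, 420000),
    (1800, 450000), (2000, 480000), (2300, 520000), (2500, 550000), (3000, 600000),
]

# Shuffle a copy so the held-out examples are a random sample.
rng = random.Random(0)  # fixed seed (an arbitrary choice) so the split is repeatable
shuffled = examples[:]
rng.shuffle(shuffled)

# Hold out 20% of the examples for testing and train on the rest.
n_test = int(len(shuffled) * 0.2)
test_data = shuffled[:n_test]
train_data = shuffled[n_test:]

# Separate features from targets in the training set.
X_train = [size for size, _ in train_data]
y_train = [price for _, price in train_data]

print(len(train_data), "training examples,", len(test_data), "testing examples")
```

In practice you would usually let a library do this split, but the hand-written version shows that nothing more than shuffling and slicing is involved.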
Types of Supervised Learning:
- Classification: The output variable is a category, such as “spam” or “not spam.”
- Regression: The output variable is a real or continuous value, like “salary” or “weight.”
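The difference between the two types is easiest to see side by side. The sketch below uses scikit-learn with a small, made-up dataset (hours studied, pass/fail outcomes, and exam scores are all hypothetical) to train one classifier and one regressor on the same feature.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical feature: hours studied by six students.
hours = [[1], [2], [3], [4], [5], [6]]

# Classification: the target is a category (0 = fail, 1 = pass).
passed = [0, 0, 0, 1, 1, 1]
classifier = LogisticRegression().fit(hours, passed)
predicted_class = classifier.predict([[5.5]])[0]

# Regression: the target is a continuous value (exam score).
scores = [52.0, 58.0, 65.0, 71.0, 78.0, 85.0]
regressor = LinearRegression().fit(hours, scores)
predicted_score = regressor.predict([[5.5]])[0]

print("Predicted class:", predicted_class)
print("Predicted score:", round(predicted_score, 1))
```

The classifier returns one of the known labels, while the regressor returns a number that need not appear anywhere in the training data.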
Basic Process of Supervised Learning
- Data Collection: Gather and clean the data to be used for training and testing.
- Data Preprocessing: This includes tasks like normalization, handling missing values, and encoding categorical variables.
- Choosing a Model: Based on the problem, select an appropriate model. For example, use linear regression for a regression problem, or a decision tree for a classification problem.
- Training the Model: Feed the training data into the model, allowing it to learn the relationship between features and targets.
- Model Evaluation: Use the testing data to measure how well the model performs on examples it has not seen, and make adjustments as needed.
- Prediction: Once the model is trained and evaluated, use it to make predictions on new data.
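The preprocessing, model choice, training, and prediction steps above can be chained together in scikit-learn. The following is only a sketch under stated assumptions: the column names (size, area, price) and all values are hypothetical, median imputation and one-hot encoding are just one reasonable choice, and the train/test split and evaluation are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with one missing value and one categorical column.
raw = pd.DataFrame({
    "size": [550, 1000, np.nan, 1800, 2300, 3000],
    "area": ["suburb", "city", "city", "suburb", "city", "suburb"],
    "price": [300000, 360000, 420000, 450000, 520000, 600000],
})
X = raw[["size", "area"]]
y = raw["price"]

# Preprocessing: fill missing sizes, normalize numbers, encode categories.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("numeric", numeric_steps, ["size"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["area"]),
])

# Model choice and training, chained after preprocessing.
workflow = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
workflow.fit(X, y)

# Prediction on a new, unseen example.
new_house = pd.DataFrame({"size": [1600], "area": ["city"]})
predicted_price = workflow.predict(new_house)[0]
print("Predicted price:", round(predicted_price))
```

Wrapping the steps in a single pipeline means the same preprocessing is applied automatically at prediction time.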
Algorithms in Supervised Learning
- Linear Regression: Used for regression problems.
- Logistic Regression: Used for classification problems.
- Decision Trees: Can be used for both regression and classification.
- Support Vector Machines (SVM): Used mainly for classification problems.
- Neural Networks: Versatile and can be used for both regression and classification.
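In scikit-learn, these algorithms share the same fit/predict interface, so trying a different one is often a one-line change. Here is a small sketch with three of the classifiers listed above; the toy data (hours studied and hours slept, labeled pass or fail) is made up for illustration, and the default settings are used rather than tuned ones.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [hours studied, hours slept] -> 0 = fail, 1 = pass.
X = [[1, 4], [2, 5], [2, 8], [3, 6], [6, 5], [7, 7], [8, 6], [9, 8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Every estimator is trained and queried in exactly the same way.
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = int(model.predict([[7, 6]])[0])
    print(name, "->", predictions[name])
```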
Let's Understand This Through a Simple, Practical Example
Let’s go through a simple Python example of supervised learning using linear regression. This example will predict house prices based on their size (in square feet). We’ll use Python with NumPy, pandas, scikit-learn, and Matplotlib.
First, ensure you have the necessary libraries installed. You can install them using pip:
pip install numpy pandas scikit-learn matplotlib
Now, let’s go through the code step by step.
Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
- numpy: A library for numerical computations and working with arrays.
- pandas: Used for data manipulation and analysis.
- matplotlib.pyplot: A plotting library to visualize data.
- train_test_split: Function from sklearn.model_selection to split data arrays into two subsets: training data and testing data.
- LinearRegression: A linear regression model from sklearn.linear_model.
- mean_squared_error: Function from sklearn.metrics to measure the average of the squares of the errors (difference between the actual and predicted values).
Step 2: Create a Sample Dataset
Here we create a simple dataset of house sizes and their prices.
# Sample dataset
data = {
    'Size': [550, 600, 1000, 1200, 1500, 1800, 2000, 2300, 2500, 3000],
    'Price': [300000, 320000, 360000, 400000, 420000, 450000, 480000, 520000, 550000, 600000]
}
df = pd.DataFrame(data)
This code creates a sample dataset using a dictionary with house sizes and corresponding prices. The pd.DataFrame function converts this dictionary into a pandas DataFrame, a 2D labeled data structure similar to a table.
Step 3: Prepare the Data
We split the data into ‘features’ (X) and ‘labels’ (y), and then into training and testing sets.
X = df[['Size']] # Features (2D array)
y = df['Price'] # Labels (1D array)
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
- X and y are conventional names for features and labels in machine learning. Here, 'Size' is the feature (independent variable), and 'Price' is the label (dependent variable).
- train_test_split splits the dataset into training and testing sets.
- test_size=0.2 means 20% of the data is used for testing and the rest for training. With 10 rows, that is 2 rows for testing and 8 for training.
- random_state=0 fixes the random shuffle, so the same split is produced every time the code runs. This makes the results reproducible.
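If you want to check what the split produced, a short sketch like the one below prints the shapes of the two sets and the rows that were held out. It repeats the dataset and the split so it can run on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Size': [550, 600, 1000, 1200, 1500, 1800, 2000, 2300, 2500, 3000],
    'Price': [300000, 320000, 360000, 400000, 420000, 450000, 480000, 520000, 550000, 600000],
})
X = df[['Size']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Shapes are (rows, columns): 8 training rows and 2 testing rows.
print("X_train:", X_train.shape, "X_test:", X_test.shape)

# The held-out houses, with their original row labels preserved.
print(X_test.join(y_test))
```

Notice that the row labels in the test set come from the original DataFrame, which makes it easy to trace each held-out house back to the source data.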
Step 4: Train the Model
model = LinearRegression()
model.fit(X_train, y_train)
- LinearRegression() creates an instance of the linear regression model.
- model.fit(X_train, y_train) trains the model using the training data (X_train and y_train).
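Training a linear regression model amounts to finding a straight line, price = intercept + slope * size. As a sketch, you can read the learned values from the fitted model's coef_ and intercept_ attributes. The block repeats the earlier setup so it runs by itself.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Size': [550, 600, 1000, 1200, 1500, 1800, 2000, 2300, 2500, 3000],
    'Price': [300000, 320000, 360000, 400000, 420000, 450000, 480000, 520000, 550000, 600000],
})
X = df[['Size']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# coef_ holds one slope per feature; intercept_ is the price at size 0.
slope = model.coef_[0]
intercept = model.intercept_
print(f"price = {intercept:.2f} + {slope:.2f} * size")
```

The slope can be read as the estimated price increase, in dollars, for each additional square foot.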
Step 5: Make Predictions
Use the trained model to make predictions on the test data.
# Making predictions using the test set
y_pred = model.predict(X_test)
- model.predict(X_test) uses the trained model to predict the prices (y_pred) for the house sizes in the test set (X_test).
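To see the predictions next to the real prices, and to price a house the model has never seen, you could try something like the following sketch. The 1,600 square foot house is a hypothetical example, and the setup is repeated so the block is self-contained.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Size': [550, 600, 1000, 1200, 1500, 1800, 2000, 2300, 2500, 3000],
    'Price': [300000, 320000, 360000, 400000, 420000, 450000, 480000, 520000, 550000, 600000],
})
X = df[['Size']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Actual and predicted prices for the held-out houses, side by side.
comparison = pd.DataFrame({
    'Size': X_test['Size'],
    'Actual': y_test,
    'Predicted': y_pred.round(0),
})
print(comparison)

# Predict for a new house (hypothetical size), keeping the same column name.
new_house = pd.DataFrame({'Size': [1600]})
new_price = model.predict(new_house)[0]
print("Predicted price for 1600 sq ft:", round(new_price))
```

Passing a DataFrame with the same 'Size' column keeps the input in the format the model was trained on.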
Step 6: Evaluate the Model
We can evaluate our model’s performance using metrics such as Mean Squared Error.
# Calculating the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
- mean_squared_error calculates the mean squared error, a common metric to evaluate the performance of a regression model. It compares the actual prices (y_test) with the predicted prices (y_pred).
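The metric itself is simple enough to compute by hand. The sketch below does so in plain Python using made-up actual and predicted prices (illustrative values only, not the output of the model above), and also takes the square root to get the Root Mean Squared Error (RMSE), which is in dollars rather than squared dollars.

```python
import math

# Illustrative values only, not the tutorial model's real predictions.
actual = [360000, 550000]
predicted = [363000, 541000]

# Square each error so positive and negative misses do not cancel out.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# MSE is the average of the squared errors.
mse = sum(squared_errors) / len(squared_errors)

# RMSE brings the error back to the same unit as the prices.
rmse = math.sqrt(mse)

print("MSE:", mse)
print("RMSE:", round(rmse, 2))
```

Because the errors are squared, a single large miss weighs much more heavily on the MSE than several small ones.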
Step 7: Visualize the Results
plt.scatter(X_test, y_test, color='red', label='Actual Price')
plt.plot(X_test, y_pred, color='blue', label='Predicted Price')
plt.title('Actual vs Predicted House Prices')
plt.xlabel('Size of House')
plt.ylabel('Price')
plt.legend()
plt.show()
- plt.scatter and plt.plot create a scatter plot of the actual prices and a line plot of the predicted prices, respectively.
- plt.title, plt.xlabel, plt.ylabel, and plt.legend add a title, labels, and a legend to the plot.
- plt.show() displays the plot.
Final Output
Running this code produces two things:
- Mean Squared Error (MSE): Printed to the console. For our model it is approximately 46,562,552.63. This is the average squared difference between the actual and predicted prices, so it is measured in squared dollars; its square root, about $6,800, gives a rough sense of the typical prediction error. A lower MSE indicates a better fit of the model to the data.
- Plot: Shows the actual house prices (red dots) and the predicted prices (blue line) against house size. Because the test set holds only two houses, you will see two red dots. The blue line through the predictions shows the linear relationship between size and price that the model has learned, and the gap between each dot and the line shows how far off that prediction is.
In upcoming posts, I’ll explore other fundamental topics such as unsupervised and reinforcement learning. Stay tuned and follow me for more insights!
Thanks for reading :)