Introduction to Machine Learning & Modelling Techniques

Supervised, Unsupervised & Reinforcement Learning

A 1-hour conceptual masterclass designed for beginner-to-intermediate data professionals to build an intuitive, visual, and practical mental map of ML algorithms.

Machine Learning Cover Graphic

Navigating the AI Ecosystem

  • Artificial Intelligence (AI): The overarching field of creating systems that mimic human intellect (includes search, heuristics, planning).
  • Machine Learning (ML): Systems that learn rules and functions directly from data rather than from hand-coded logic.
  • Deep Learning (DL): Nested layers of neural networks learning complex patterns from raw unstructured data.
  • Large Language Models (LLM): Transformer-based generative AI systems understanding human text sequences.
AI ML DL LLM Venn Diagram

The Three Pillars of ML

  • Supervised Learning: Training model parameters using paired input-output datasets to predict numeric labels or classes.
    Example: Predicting customer default risk using historical loan details.
  • Unsupervised Learning: Organizing unlabeled data into natural partitions, dimensions, or clusters without human correction.
    Example: Grouping buyers into behavioral segments based on shopping habits.
  • Reinforcement Learning: An active agent optimizing behavioral policies inside environments using trial-and-error rewards.
    Example: Training a robot vacuum to navigate a room using pathfinding rewards.
Three Pillars of Machine Learning

Machine Learning Terminology

1. The Core Learning Frameworks

To build a strong foundation in Machine Learning 101, you need to master the core terminology that dictates how models learn, fail, and get evaluated.

  • Supervised Learning: Learning from labeled data (inputs paired with known correct outputs).
    Example: Predicting housing prices from historical size, zip code, and sales price.
  • Unsupervised Learning: Finding hidden patterns or structures in unlabeled data.
    Example: Grouping customers into distinct cohorts based on purchasing behaviors (Clustering).
  • Reinforcement Learning: Learning through trial and error using a system of rewards and penalties.
    Example: Training self-driving agents to stay on a track by rewarding correct steering and penalizing collision.
  • Semi-Supervised Learning: Combining a small amount of labeled data with a large amount of unlabeled data to train models cost-effectively.
    Example: Labeling a few medical scans manually, then training a classifier on those scans alongside thousands of unlabeled scans.

2. The Mechanics of Learning

  • Features vs. Target: Features are your input attributes (independent variables, X); the Target is what you want to predict (dependent variable, Y).
  • Loss (Cost) Function: A mathematical formula measuring how wrong a model's predictions are compared to actual targets. The goal of training is to minimize this loss.
  • Gradient Descent: The optimization algorithm used to tweak a model's internal weights step-by-step to lower the loss.
  • Learning Rate: A hyperparameter controlling how big of a step weights take during gradient descent. Too big overshoots; too small takes forever.
  • Parametric vs. Non-Parametric: Parametric models have a fixed number of weights (like Linear Regression). Non-parametric models grow parameters dynamically with the dataset size (like k-NN or Decision Trees).
  • Linear vs. Non-Linear vs. Spatial: Linear models assume straight trends. Non-linear models capture curves (kernels/trees). Spatial models classify based on coordinate proximity (k-NN).
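
The gradient descent and learning rate bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the toy data (generated from y = 2x + 1), the step counts, and the two learning rates are assumptions chosen to show one run that converges and one that overshoots.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]


def mse(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(learning_rate, steps):
    """Run batch gradient descent from w = b = 0 and return (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE loss with respect to w and b.
        grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# A moderate learning rate settles near w = 2, b = 1.
w_good, b_good = gradient_descent(learning_rate=0.05, steps=2000)

# A learning rate that is too large overshoots and the loss grows.
w_big, b_big = gradient_descent(learning_rate=0.5, steps=20)
```

Comparing `mse(w_good, b_good)` with `mse(w_big, b_big)` shows the "too big overshoots" behaviour directly: the second loss ends up larger than the loss at the starting point.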

3. Generalisation & Pitfalls

  • Overfitting: When a model learns the training data too well—including the random noise—and fails to predict new, unseen data accurately.
    Symptom: High training accuracy, low validation accuracy.
  • Underfitting: When a model is too simple to capture the underlying trend in the data.
    Symptom: Low training accuracy, low validation accuracy.
  • Bias-Variance Tradeoff: The ultimate balancing act. Bias is error from overly simple assumptions (underfitting). Variance is error from extreme sensitivity to small fluctuations in training data (overfitting).
  • Regularisation (L1/L2): Techniques used to prevent overfitting by adding a penalty to the loss function for models that get too complex.
    L1 (Lasso) performs feature selection; L2 (Ridge) performs weight decay.

4. Data Splitting & Evaluation

  • Train / Validation / Test Splits:
    Train: Used to teach the model's weights.
    Validation: Used to tune hyperparameters and choose best model.
    Test: Hidden until the very end to evaluate final, real-world performance.
  • Cross-Validation: Splitting data into multiple rotating chunks to ensure the model evaluates well across the entire dataset, avoiding lucky splits.
  • Confusion Matrix: A table layout used to visualize the performance of a classification model, showing True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
  • Precision vs. Recall:
    Precision: Out of predicted positives, how many were actually positive?
    Recall: Out of actual positives, how many did the model find?
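
As a small sketch of how the confusion matrix feeds precision and recall, the counts can be tallied by hand. The label lists below are made up for illustration; 1 means positive and 0 means negative.

```python
# Made-up ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Precision: of the predicted positives, how many were actually positive?
precision = tp / (tp + fp)

# Recall: of the actual positives, how many did the model find?
recall = tp / (tp + fn)
```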

Supervised Learning Basics

  • Core Goal: Learn an approximation function Y = f(X) where X are inputs and Y are target labels.
  • Regression: The target label is a continuous numeric value.
    Examples: Estimating real estate market prices or tracking server CPU temperatures.
  • Classification: The target label is a discrete categorical bucket.
    Examples: Flagging transaction fraud or sorting incoming emails into spam folders.
Classification vs Regression Scatter Plot

Bias, Variance & Model Tuning

  • Underfitting (High Bias): The model is too simple to capture underlying patterns.
    Example: Predicting house price using only size, ignoring location. The model is too rigid.
  • Overfitting (High Variance): Model memorizes training noise and outliers.
    Example: Fitting a high-degree polynomial that matches every single outlier, but fails on new houses.
  • Cross-Validation: Splitting data into k-folds to validate model performance.
    Example: Splitting data into 5 groups, training on 4, testing on 1, and rotating 5 times to prevent bias.
  • Hyperparameter Tuning: Adjusting model options before training to control complexity.
    Example: Limiting a decision tree's depth to 3 levels, or setting k = 5 in k-NN to smooth out predictions.
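
The 5-fold rotation described under Cross-Validation can be sketched as an index generator. This is a simplified illustration: the helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle, which real workflows usually do before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any leftover samples over the first `remainder` folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx
        start = stop


# 10 samples, 5 folds: train on 8 samples, validate on 2, rotate 5 times.
folds = list(k_fold_indices(n_samples=10, k=5))
```

Averaging a metric over the five validation folds gives a steadier estimate than a single split.
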
Bias Variance tradeoff

Regression Models In-Depth

📈 1. Linear Regression

Fits the line (or hyperplane) that minimizes the residual sum of squares between the actual targets and the model's predictions. Best suited to simple linear trends.

  • Real Example: Estimating monthly retail revenue based on ad spend.
  • Real Example: Estimating crop yield based on rainfall index.

🛡️ 2. Regularization (Ridge & Lasso)

Adds a penalty "budget" to the coefficients to prevent overfitting. Ridge shrinks all coefficients toward zero without eliminating any. Lasso can shrink some coefficients exactly to zero, performing automatic feature selection.

  • Real Example: Customer lifetime value modeling across sparse tables.
  • Real Example: Regressions with highly correlated (multicollinear) features.

💡 Selection Guide: When to Choose Which?

📈 Linear Regression Choose if the feature-to-target relationships are simple and linear, you need maximum coefficient interpretability, or require a fast baseline model.
🛡️ Ridge Regression (L2) Choose when you have many highly correlated variables (multicollinearity) and want to keep all of them while preventing overfitting via weight shrinkage.
✂️ Lasso Regression (L1) Choose when you want automated feature selection, shrinking weights of irrelevant variables to exactly zero to create a sparse, highly interpretable model.

Ordinary Least Squares (OLS)

The math behind standard Linear Regression. We fit a linear function and optimize parameters by minimizing the Mean Squared Error (MSE) loss function:

Linear Equation: y = β₀ + β₁x₁ + ... + βₙxₙ
Cost Function: MSE = (1/n) * Σ (yᵢ - ŷᵢ)²
  • yᵢ is the true target, ŷᵢ is the model prediction.
  • Minimizing MSE yields the line of best fit (least squares residuals).
    from sklearn.linear_model import LinearRegression

    model = LinearRegression(fit_intercept=True)
    model.fit(X_train, y_train)
OLS Linear Regression trend line and coordinates

🏠 Intuitive Real-World Example

Predict Housing Price (y, Dependent Variable) based on House Size (x₁, Independent Variable). The weight β₁ represents the price increase per additional sq. ft. (e.g., +$250/sq.ft.), while β₀ (intercept) is the base land cost.
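
For a single feature, the least-squares line has a closed form, which makes the housing example easy to check by hand. The sketch below uses made-up sizes and prices generated from a base cost of $50,000 plus $250 per sq. ft., so the recovered β₁ and β₀ are known in advance.

```python
# Hypothetical data: price = 50,000 + 250 * size (no noise, for clarity).
sizes = [1000.0, 1500.0, 2000.0, 2500.0]           # x1, in sq. ft.
prices = [300000.0, 425000.0, 550000.0, 675000.0]  # y, in dollars

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Slope: covariance of x and y divided by the variance of x.
beta_1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)

# Intercept: the fitted line passes through the point of means.
beta_0 = mean_y - beta_1 * mean_x

# Mean squared error of the fitted line on the same data.
mse = sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(sizes, prices)) / n
```

With several features the same idea generalises to solving the normal equations, which is what library implementations handle for you.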

Regularization Math

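To prevent overfitting, we add a penalty on coefficient magnitude, scaled by a strength hyperparameter α, to the OLS cost function (the weights w_j below are the coefficients β_j from the linear equation):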

Ridge (L2 Penalty) Cost:

J(w) = MSE + α * Σ (w_j)²

Lasso (L1 Penalty) Cost:

J(w) = MSE + α * Σ |w_j|
    from sklearn.linear_model import Ridge, Lasso

    ridge = Ridge(alpha=1.0)
    lasso = Lasso(alpha=0.1)
Bias Variance tradeoff: Underfitting vs Good Fit vs Overfitting

📉 Regularization & Weight Shrinkage

If predicting housing price using Size, Bedrooms, and Wall Color: a high penalty α shrinks weights. Ridge (L2) shrinks all weights toward zero but keeps every feature, while Lasso (L1) can drive the weight of an uninformative feature such as Wall Color to exactly zero, automatically discarding it.
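
The difference between the two penalties can be seen in a deliberately simplified sketch: one feature at a time, no intercept, using the cost functions J(w) = MSE + α·w² and J(w) = MSE + α·|w| exactly as written in this deck. Setting the derivative to zero gives the closed forms below. The data values and function names are hypothetical, and scikit-learn scales its objectives differently, so its `alpha` is not directly comparable.

```python
def ridge_weight(xs, ys, alpha):
    """Minimizer of (1/n) * sum((y - w*x)^2) + alpha * w^2 for one feature."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * alpha)


def lasso_weight(xs, ys, alpha):
    """Minimizer of (1/n) * sum((y - w*x)^2) + alpha * |w| (soft-thresholding)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    threshold = n * alpha / 2
    if sxy > threshold:
        return (sxy - threshold) / sxx
    if sxy < -threshold:
        return (sxy + threshold) / sxx
    return 0.0  # weak signal: the weight is set exactly to zero


alpha = 0.5

# A strongly predictive feature (think "Size"): both penalties keep it.
size_x, size_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
ridge_size = ridge_weight(size_x, size_y, alpha)
lasso_size = lasso_weight(size_x, size_y, alpha)

# A weakly related feature (think "Wall Color"): Ridge shrinks, Lasso zeroes.
color_x, color_y = [1.0, -1.0, 1.0, -1.0], [0.5, 0.1, -0.2, 0.0]
ridge_color = ridge_weight(color_x, color_y, alpha)
lasso_color = lasso_weight(color_x, color_y, alpha)
```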

Classification: Part 1

  • Logistic Regression: Fits features through a sigmoid activation curve to predict category probabilities between 0 and 1.
    Example: Predicting if a client defaults on a loan (Probability 0 to 1).
  • Support Vector Machines (SVM): Finds the decision boundary that maximizes the margin between classes; kernels extend this to non-linear boundaries.
    Example: Classifying handwritten letters by drawing the widest possible corridor between classes.
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    lr = LogisticRegression(C=1.0)  # C is inverse regularization strength
    svm = SVC(kernel='rbf', C=1.0)
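
To show what the sigmoid step does to a linear score, here is a minimal sketch in plain Python. The weights, bias, and feature values are hypothetical, not fitted to any data.

```python
import math


def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_default_probability(features, weights, bias):
    """Linear score followed by the sigmoid, as in logistic regression."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two scaled features of a loan applicant.
weights = [0.8, -0.5]
bias = -0.2

probability = predict_default_probability([2.0, 1.0], weights, bias)

# A common convention is to predict the positive class at 0.5 or above.
label = int(probability >= 0.5)
```
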
Classification Decision Boundaries

⚖️ Selection Guide: Logistic vs. SVM vs. k-NN

📈 Logistic Regression

When to choose: Need fast training, highly interpretable coefficients, or explicitly require probabilistic scores (e.g. default probability).

🛡️ Support Vector Machines (SVM)

When to choose: Non-linearly separable data (via kernels), high dimensionality (features > samples), or when maximum accuracy is the main goal.

📍 k-Nearest Neighbors (k-NN)

When to choose: Small datasets, complex decision boundaries, and need an intuitive, instance-based model with zero training overhead.

💡 Pro Tip: Inverse Regularization strength C

C is the inverse of regularization strength (C = 1/λ).
Smaller C: Stronger regularization; penalizes complex models to prevent overfitting (simpler decision boundary).
Larger C: Weaker regularization; allows model weights to grow to fit training data tightly (risk of overfitting).

Distance-Based: k-NN

  • k-Nearest Neighbors (k-NN): Classifies a point by the majority label of its 'k' closest points in feature space.
  • Details: Requires no mathematical training beforehand (lazy learner), but is computationally expensive for large tables.
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=5, p=2)
k-NN Proximity Voting circles

🏨 Intuitive Hotel Classification Example

To classify a new hotel based on Price per Night ($) and Distance to Beach (meters) under k=5: locate the 5 nearest hotels in coordinate space. If 4 are "Budget Hostel" and 1 is "Luxury Resort", the model classifies the new hotel as Budget Hostel by majority vote.

k-NN: Step-by-Step Example

Classifying a new fruit (Sweetness = 6, Crunchiness = 4) with k = 3 neighbors.

1. Measure Distance (Euclidean)

Formula: Distance = √((x₂ - x₁)² + (y₂ - y₁)²)

Fruit    | Sweet (x) | Crunch (y) | Calculation                  | Distance
Apple A  | 7         | 7          | √((7-6)² + (7-4)²) = √(1+9)  | 3.16
Apple B  | 8         | 5          | √((8-6)² + (5-4)²) = √(4+1)  | 2.24
Orange A | 3         | 3          | √((3-6)² + (3-4)²) = √(9+1)  | 3.16
Orange B | 7         | 2          | √((7-6)² + (2-4)²) = √(1+4)  | 2.24
Orange C | 6         | 2          | √((6-6)² + (2-4)²) = √(0+4)  | 2.00

2. Find 3 Neighbors

  1. Orange C (Dist: 2.00)
  2. Apple B (Dist: 2.24)
  3. Orange B (Dist: 2.24)

3. Vote for Class

  • Oranges: 2 votes (C, B)
  • Apples: 1 vote (B)
KNN Classification Example Plot

✅ Final Classification

The new fruit is classified as an Orange because it secured the majority of the votes (2 out of 3).
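
The three steps above can be reproduced with a short plain-Python sketch using the same five fruits. The function name `knn_classify` is an assumption for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

# (label, sweetness, crunchiness) for the five known fruits in the table.
training = [
    ("Apple", 7, 7),
    ("Apple", 8, 5),
    ("Orange", 3, 3),
    ("Orange", 7, 2),
    ("Orange", 6, 2),
]


def knn_classify(point, data, k):
    """Return the majority label among the k nearest training points."""
    # Step 1: Euclidean distance from the new point to every known fruit.
    distances = sorted((math.dist(point, (x, y)), label) for label, x, y in data)
    # Step 2: keep the labels of the k closest neighbours.
    nearest = [label for _, label in distances[:k]]
    # Step 3: majority vote.
    return Counter(nearest).most_common(1)[0][0]


prediction = knn_classify((6, 4), training, k=3)
```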

Tree-Based Ensembles

  • Decision Trees: A hierarchical sequence of logic cuts (rules) dividing features into homogeneous groups.
  • Random Forest (Bagging): Aggregates many trees trained on random sample subsets, voting in parallel to reduce variance.
  • XGBoost (Extreme Gradient Boosting): Sequentially fits trees where each new tree corrects the residuals (mistakes) of the ensemble built so far.
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier

    rf = RandomForestClassifier(n_estimators=100, max_depth=8)
    xgb = XGBClassifier(n_estimators=100, learning_rate=0.1)
Decision Tree, Random Forest and XGBoost Comparison

🌲 Selection Guide: Tree Algorithms

  • Decision Trees: Best for quick, simple baselines where explainability is crucial. Highly sensitive to minor dataset shifts.
  • Random Forest (Bagging): Best for general tabular data. Trains independent trees in parallel on bootstrap samples, reducing variance out-of-the-box.
  • XGBoost (Boosting): Best when predictive accuracy is the priority. Fits sequential trees that correct the residual errors of earlier trees to reduce bias.
🛠️ Crucial Tuning Parameters

Max Depth: Limits how deep trees grow. Low values prevent overfitting; high values capture complex structures.
n_estimators & LR: Random Forest is robust to high counts; XGBoost requires balancing trees with a smaller learning rate.

💡 Split Example (Elaborated Decision Tree Branches)

Root: Credit Score > 650?
├─ YES → DTI < 40%?
│   ├─ YES → Employed == True? → Approved (96% prob, Low Risk)
│   └─ NO → Income > $80k? → Approved (80% prob, High Interest)
└─ NO → Collateral == True?
    ├─ YES → Approved (70% prob, Secured Loan)
    └─ NO → Rejected (95% prob, Extreme Risk)

Supervised Learning: Case Studies

Case Study 1 📵

Telecom Customer Churn

Problem: Predict user subscription cancellations based on usage records.
Type: Classification.
Models: Logistic Regression, Random Forest, XGBoost.

Case Study 2 🏠

Real Estate Pricing

Problem: Estimate home values based on size, zip codes, crime rates.
Type: Regression.
Models: Ridge/Lasso, Random Forest Regressor.

Case Study 3 ✉️🛡️

Email Spam Filter

Problem: Classify incoming emails as legitimate or spam category.
Type: Classification.
Models: Naive Bayes, Decision Trees, SVM.

Supervised Quiz: Classification vs. Regression

1. Predict a stock's closing price tomorrow (e.g. $150.25).
2. Identify if an incoming email is a phishing attempt.
3. Estimate delivery ETA duration of a food courier.
4. Diagnose whether medical scans show malignant or benign cell tumors.

Unsupervised Learning Basics

What is Unsupervised Learning?

Analyzing unlabeled data to extract its inherent shapes, dimensions, and segment groups without teacher supervision.

Core Sub-Types

  • Clustering: Grouping spatial coordinates based on similarity metrics.
  • Dimensionality Reduction: Compressing input dimensions while keeping variance.
  • Anomaly Detection: Isolating rare outliers.
Flow: 📊 Unlabeled Data (only features X) ➡️ 🧠 Unsupervised Engine (find shapes & compression) ➡️ 🎨 Clustering (groups), 🗜️ Dim. Reduction (axes), 🚨 Anomaly Detection (outliers)

Clustering Algorithms

  • K-Means: Groups data points into k spherical partitions by iteratively relocating centroids to match local means.
  • Hierarchical Clustering: Builds nested tree branches (agglomerative) to connect data coordinates without predefining cluster counts.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups dense coordinate neighborhoods, discovering arbitrary cluster shapes and isolating sparse noise.
    from sklearn.cluster import KMeans, DBSCAN

    kmeans = KMeans(n_clusters=3)
    dbscan = DBSCAN(eps=0.5, min_samples=5)
K-Means vs DBSCAN Comparison Dendrogram

🔵 Selection Guide: Clustering Algorithms

📍 K-Means (Centroid-Based)

When to choose: Spherical clusters, similar sizes, and need maximum speed/scalability on large datasets.
Caveat: Sensitive to feature scaling; requires predefining number of clusters k.

🌿 Hierarchical (Tree-Based)

When to choose: Need taxonomic nested hierarchies (dendrograms) or deterministic results on small datasets.
Caveat: High computational complexity (O(N³)), making it sluggish for datasets over 10,000 samples.

🌀 DBSCAN (Density-Based)

When to choose: Arbitrary shapes (rings, loops), data has noise/outliers to filter, and cluster count is unknown.
Caveat: Fails on datasets with highly varying densities; tuning search radius Eps is critical.

💡 Clustering Examples

K-Means: Customer segmentation for targeted marketing campaigns.
Hierarchical: Building evolutionary trees of animal species (dendrograms).
DBSCAN: Mapping city crime or traffic hotspots while filtering outlier noise.

Mechanics of Clustering: How They Work

K-Means Working

Centroid-Based Partitioning:

  1. Initialize: Randomly place K center coordinates (centroids).
  2. Assign: Map every data point to its closest centroid using Euclidean distance.
  3. Update: Shift each centroid to the mean coordinate of all points assigned to it.
  4. Iterate: Repeat steps 2-3 until centroids stop shifting.
Flow: 📍 1. Assign ➡️ 🔄 2. Update ➡️ 🎯 3. Iterate

Hierarchical Working

Bottom-Up Agglomerative Trees:

  1. Initialize: Treat every single coordinate point as its own distinct cluster.
  2. Measure: Calculate distance between all clusters using a linkage metric.
  3. Merge: Group the two closest clusters into a single parent cluster.
  4. Iterate: Repeat steps 2-3 until only one root cluster remains, forming a dendrogram.

* Note: Divisive (top-down) is conceptually opposite but mathematically asymmetric to Agglomerative. Agglomerative merges local pairs; Divisive splits global structures.

Flow: 🌱 1. Pairs ➡️ 🌿 2. Merge ➡️ 🌳 3. Tree
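
The agglomerative steps can be sketched on one-dimensional values. This is a naive illustration that assumes single linkage (distance between the closest members of two clusters) as the linkage metric; the input values and the function name are made up.

```python
def single_linkage(values):
    """Agglomerative clustering on 1-D values; returns the merge history."""
    # Step 1: every value starts as its own cluster.
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 2: measure the distance between every pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 3: merge the two closest clusters and record the merge height.
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


merges = single_linkage([1.0, 2.0, 6.0, 7.5, 20.0])
```

Each entry in `merges` corresponds to one junction of the dendrogram, with the third element giving the height at which the two branches join.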

DBSCAN Working

Density-Based Connectivity:

  1. Scan Neighbors: For each point, count how many coordinates lie within radius (ε).
  2. Core Points: If the count is ≥ min_samples, mark the point as a "Core" point; a core point that is not yet assigned starts a new cluster.
  3. Expand Border: Add every point within ε of a core point to that core point's cluster; added points that are not core themselves are marked "Border".
  4. Isolate Noise: Any remaining coordinates not reachable from core points are labeled as "Noise" (outliers).
Flow: 🌐 1. Scan ➡️ 🔗 2. Link ➡️ 🚫 3. Noise
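
A compact, unoptimized sketch of the DBSCAN steps in plain Python follows. The points, `eps`, and `min_samples` values are made up, and the neighbour count includes the point itself, following scikit-learn's convention. Real implementations use spatial indexes rather than the brute-force neighbour search shown here.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""

    def neighbors(i):
        # Step 1: indices of all points within eps of point i (itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may turn out to be a border point
            continue
        # Step 2: an unassigned core point starts a new cluster.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        # Step 3: expand the cluster through density-reachable points.
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is core too, so keep expanding
    # Step 4: anything still labelled -1 is noise.
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_samples=3)
```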

K-Means: Step-by-Step Example

Grouping customers into K = 2 clusters based on Age and Spending Score (1-10). Initial centroids: C1 = (20,3), C2 = (40,8).

1. Distance Assignment Step

Customer | Age (x) | Spend (y) | Dist to C1 (20,3) | Dist to C2 (40,8) | Cluster
User A   | 22      | 4         | 2.24              | 18.44             | C1
User B   | 28      | 2         | 8.06              | 13.42             | C1
User C   | 45      | 9         | 25.71             | 5.10              | C2
User D   | 38      | 7         | 18.44             | 2.24              | C2

2. Update Centroid Center Step

  • New C1 Center: Average of User A(22,4) & B(28,2) → ((22+28)/2, (4+2)/2) = (25, 3)
  • New C2 Center: Average of User C(45,9) & D(38,7) → ((45+38)/2, (9+7)/2) = (41.5, 8)
K-Means Customer Clusters Update Plot

🔄 3. Repeat Until Convergence

Steps 1 & 2 are repeated with updated centroids. The process stops when cluster assignments freeze and centroids no longer shift coordinate positions.
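
The assign/update loop for this example can be sketched in plain Python with the same four customers and starting centroids. The helper names are hypothetical, and the sketch assumes no cluster ever becomes empty, which holds for this data.

```python
import math

points = {"A": (22, 4), "B": (28, 2), "C": (45, 9), "D": (38, 7)}
centroids = [(20, 3), (40, 8)]  # initial C1 and C2 from the example


def assign(points, centroids):
    """Map each customer to the index of the nearest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for name, p in points.items()
    }


def update(points, assignment, k):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i in range(k):
        members = [points[name] for name, c in assignment.items() if c == i]
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


assignment = assign(points, centroids)
while True:
    centroids = update(points, assignment, k=2)
    new_assignment = assign(points, centroids)
    if new_assignment == assignment:
        break  # assignments froze, so the centroids will not move again
    assignment = new_assignment
```

On this data the loop stops after a single update, because no customer changes cluster once the centroids move.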

Dimensionality Reduction

  • PCA (Principal Component Analysis): Projects high-dimensional data orthogonally to new directions capturing maximum variance.
  • t-SNE (t-Distributed Stochastic Neighbor Embedding) / UMAP (Uniform Manifold Approximation and Projection): Non-linear manifold mapping that preserves local neighborhoods to visualize distributions in 2D coordinate space.
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    pca = PCA(n_components=2)
    tsne = TSNE(n_components=2, perplexity=30)
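
To make "new directions capturing maximum variance" concrete, here is a hand-rolled sketch of PCA for two features, using the closed-form eigen-decomposition of a 2×2 covariance matrix. The four data points are made up and lie close to the line y = x; the eigenvector formula used assumes the covariance between the two features is non-zero.

```python
import math

# Made-up 2-D points lying close to the line y = x.
points = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8)]
n = len(points)

# Centre the data on its mean.
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
var_x = sum(cx * cx for cx, _ in centered) / (n - 1)
var_y = sum(cy * cy for _, cy in centered) / (n - 1)
cov_xy = sum(cx * cy for cx, cy in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (var_x + var_y) / 2
spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy**2)
lambda_1 = half_trace + spread  # variance along the first principal component
lambda_2 = half_trace - spread  # variance along the second

# Unit eigenvector for lambda_1 (valid because cov_xy is non-zero here).
norm = math.hypot(cov_xy, lambda_1 - var_x)
pc1 = (cov_xy / norm, (lambda_1 - var_x) / norm)

# Project each centred point onto the first component: 2-D down to 1-D.
projected = [cx * pc1[0] + cy * pc1[1] for cx, cy in centered]

# Share of total variance kept by the first component.
explained_ratio = lambda_1 / (lambda_1 + lambda_2)
```
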
PCA Projection 3D to 2D t-SNE Embeddings Mapping

📊 Selection Guide: PCA vs. t-SNE / UMAP

📈 PCA (Linear Projection)

When to choose: Preprocessing features for other models, noise reduction, or preserving global structural patterns.
Projection: Supports out-of-sample projection (`pca.transform(X_new)`) for new data points.

🗺️ t-SNE & UMAP (Non-Linear Mapping)

When to choose: Visualizing high-dimensional clusters in 2D or 3D coordinate space.
Projection: t-SNE does *not* support projecting new points and requires re-running on the entire dataset; UMAP (in the `umap-learn` package) can `transform` new points after fitting.

💡 Reduction Examples

PCA: Compressing 100+ user survey answers into 2 principal components to feed into a regression model.
t-SNE / UMAP: Visualizing high-dimensional single-cell genetic sequences in a 2D scatter plot.

Unsupervised Quiz: Clustering or Dimensionality Reduction?

1. Group coordinate locations of delivery drops to establish local sorting hubs.
2. Compress 50 columns from a buyer survey into 2 summary metrics for a 2D plot.
3. Segment online news articles into topic folders to help readers browse.

Unsupervised Learning Case Studies

Case Study 1 🛒

Customer Segmentation

Problem: Group buyers based on shopping volumes and session times.
Type: Clustering.
Models: K-Means, Hierarchical Clustering.

Case Study 2 🧬

Genomics Visualization

Problem: Map expression distributions of 20k genes in 2D plots.
Type: Dimensionality Reduction.
Models: t-SNE, PCA.

Case Study 3 💳🛡️

Credit Anomaly Detection

Problem: Flag highly rare bank credit card transaction anomalies.
Type: Outlier / Anomaly Detection.
Models: Isolation Forest, One-class SVM.

Reinforcement Learning Basics

  • Concept: Teaching model behaviors using feedback loops based on action trials and environmental rewards.
  • Agent: The core decision-making AI engine.
  • Environment: The interactive space surrounding the agent.
  • State: Current environmental configurations.
  • Action: Movement selection performed by the agent.
  • Reward: Positive or negative numeric feedback score.
RL Loop diagram

Interactive: RL Agent Learning Journey

Exploration Blind Trial: The robot moves forward, enters a hazard zone (fire), and immediately receives a large negative reward (-100). The agent updates its memory weights to avoid this action in similar grid states in future epochs.

RL Step Fail

Exploitation Path Corrected: After multiple iterations, the agent learns the barrier locations, navigates around hazard areas, reaches the battery charger goal, and earns a positive reward (+100).

RL Step Success

RL Core Concepts & Algorithms

Q-Learning (Value-Based)

Maintains a lookup table (Q-table) that stores the expected cumulative reward for each action in each state.

  • Use Case: Industrial robot vacuum navigation.
Q-table (rows: states, columns: actions)
State / Action | ⬅️ Left | ⬆️ Up  | ➡️ Right
S1 (Start)     | 0.0     | -10.0 | +0.5
S2 (Hazard)    | +1.2    | 0.0   | -100.0
    import gymnasium as gym
    import numpy as np

    env = gym.make('FrozenLake-v1')
    # Q-table: states x actions matrix
    q_table = np.zeros([env.observation_space.n, env.action_space.n])
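
The Q-table update itself can be sketched without any libraries. The environment below is a hypothetical five-state corridor (not FrozenLake) with a +1 reward at the right end, and the hyperparameters `alpha`, `gamma`, and `epsilon` are illustrative choices rather than recommended values.

```python
import random

random.seed(0)

N_STATES, GOAL = 5, 4  # corridor of states 0..4 with the goal at the right end
LEFT, RIGHT = 0, 1
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # step size, discount, exploration rate

q_table = [[0.0, 0.0] for _ in range(N_STATES)]


def step(state, action):
    """Deterministic move; reaching the goal pays +1 and ends the episode."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def choose_action(state):
    """Epsilon-greedy: explore sometimes, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice([LEFT, RIGHT])
    best = max(q_table[state])
    return random.choice([a for a in (LEFT, RIGHT) if q_table[state][a] == best])


for _ in range(500):
    state, done = 0, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward
        # reward + discounted value of the best next action.
        target = reward + (0.0 if done else gamma * max(q_table[next_state]))
        q_table[state][action] += alpha * (target - q_table[state][action])
        state = next_state

# Greedy action in every non-terminal state after training.
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a]) for s in range(GOAL)]
```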

Policy Gradients (Policy-Based)

Directly parameterizes the policy (a probability distribution over actions) and adjusts it toward actions that earn higher reward, without maintaining a state-value lookup table.

  • Use Case: Continuous throttle controls for quadcopters.
Flow: Input State [x, y, vx, vy] ➡️ NN Policy π(a│s), Softmax Outputs ➡️ Left: 12%, Up: 85%, Right: 3%
    from stable_baselines3 import PPO
    import gymnasium as gym

    env = gym.make('CartPole-v1')
    model = PPO('MlpPolicy', env, verbose=0)
    model.learn(total_timesteps=5000)
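
The softmax output stage of a policy network can be sketched on its own. The raw scores below are hypothetical stand-ins for a network's outputs, chosen so the probabilities land near the Left/Up/Right split pictured above; no training happens in this sketch.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


actions = ["Left", "Up", "Right"]
logits = [0.5, 2.5, -1.0]  # hypothetical policy-network outputs for one state

probs = softmax(logits)

# A stochastic policy samples an action instead of always taking the top one.
random.seed(0)
sampled_action = random.choices(actions, weights=probs, k=1)[0]
```

Policy-gradient training then nudges the scores so that actions followed by high reward become more probable.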

Reinforcement Learning Case Studies

Case Study 1 ♞

Grandmaster Chess/Go AI

Problem: Master complex game strategies to beat humans.
Type: Reinforcement Learning.
Models: Monte Carlo Tree Search, Deep Q-Networks.

Case Study 2 🦾

Robotic Arm Control

Problem: Grasp moving objects without damage.
Type: Continuous RL.
Models: PPO, Deep Deterministic Policy Gradients.

Case Study 3 🎯💰

Ad Bidding Optimization

Problem: Select user ad impressions to maximize clicks.
Type: Multi-Armed Bandit RL.
Models: Thompson Sampling, UCB.

Choosing the Right Model: Decision Matrix

Supervised (Predictive)

  • OLS Linear Regression (Fast): Simple linear relationships, maximum interpretability.
  • Ridge & Lasso (Fast): High-dimensional inputs, sparse data tables, multicollinearity.
  • Logistic Regression (Fast): Baseline binary classification splits.
  • SVM (Moderate): Complex margins, high-dimensional text/feature matrices.
  • Random Forest / XGB (Heavy): Complex structured tables, non-linear relationships (highest accuracy).

Unsupervised (Discovery)

  • K-Means (Fast): Evenly sized, spherical, distinct customer clusters.
  • Hierarchical (Moderate): Tree-like taxonomic relationships (e.g. biology).
  • DBSCAN (Moderate): Dense clusters of arbitrary shapes with noise anomaly isolation.
  • PCA (Fast): Linear dimension reduction to speed up downstream models.
  • t-SNE (Heavy): Mapping complex non-linear manifolds strictly for 2D/3D visualization.

Reinforcement (Behavioral)

  • Multi-Armed Bandits (Fast): Static states, balancing real-time web testing (explore vs. exploit).
  • Q-Learning (Moderate): Low-dimensional discrete states and discrete action grids.
  • Policy Gradients (PPO) (Heavy): Continuous control spaces (drones, robotics, complex behaviors).

How Many Models Exist?

There is no fixed count, but models group into 5 core families:

  • 📉 Linear Models: Fit flat linear planes (e.g. Linear/Logistic, Ridge/Lasso, and Polynomial Regression).
  • 🌲 Tree-based Ensembles: Cut spaces into nested step rules (e.g. Decision Trees, Random Forests, XGBoost).
  • 📏 Distance & Kernel Models: Proximity distance & coordinate projections (e.g. k-NN, Kernel SVM, K-Means, DBSCAN).
  • 🎲 Probabilistic Models: Rely on Bayesian probability weights (e.g. Naive Bayes, GMM).
  • 🕸️ Neural Networks: Layers of nodes mapping inputs to outputs through non-linear transformations.
Model Taxonomy Tree Diagram

* Key Contrast: While Linear Models assume flat, straight-line relationships (with Polynomial mapping non-linear relations using linear solvers), Tree Ensembles, Kernels, and Neural Networks are inherently Non-Linear—allowing them to fit complex curved boundaries.

The 6-Step ML Workflow

  • Data Collection: Querying raw database storage tables or APIs.
  • Preprocessing: Cleaning outliers, scaling values, feature engineering.
  • Model Choice: Picking the target mapping algorithm class.
  • Training: Fitting candidate model weights on data splits.
  • Evaluation: Validating output accuracy metrics on test folds.
  • Deployment: Packaging final models into live inference APIs.
Machine Learning Workflow Pipeline

Model Deployment: Model Packaging & Export

📦 Packaging Models for Production

Before a trained machine learning model can serve live predictions, its in-memory weights must be serialized into a persistent, portable file artifact.

  • Lasso / Ridge / Random Forest: Serialized using Joblib or Pickle.
  • Neural Networks / Multi-platform: Compiled into ONNX format.
Flow: 📓 1. Train & Validate ➡️ 💾 2. Save Artifact ➡️ 📦 3. Deploy Model
    import joblib

    # 1. Train machine learning model
    model.fit(X_train, y_train)

    # 2. Serialize model as file artifact
    joblib.dump(model, 'model.joblib')
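
The save-then-load round trip can be sketched with the standard library's `pickle`, which the list above names as an alternative to Joblib. The "model" here is only a dictionary of hypothetical linear-regression parameters standing in for a fitted estimator, and the file lives in a temporary directory.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: hypothetical linear-regression parameters.
model_params = {"intercept": 50000.0, "weights": [250.0]}


def predict(params, features):
    """Apply the stored linear model to one row of features."""
    return params["intercept"] + sum(
        w * x for w, x in zip(params["weights"], features)
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Serialize the in-memory object to a file artifact.
    with open(path, "wb") as f:
        pickle.dump(model_params, f)

    # Later (for example at server startup): load the artifact back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = predict(restored, [1200.0])
```

Pickle files can execute arbitrary code when loaded, so only load artifacts from sources you trust.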

Model Deployment: Inference API Deployment

🚀 Serving Real-Time Predictions

Deploying models means exposing them via a Python web framework (e.g., FastAPI or Flask) as a REST API endpoint for consumption by external applications.

  • Load: The serialized model is loaded in memory on server startup.
  • Expose: API endpoints receive incoming user feature inputs.
  • Predict: Perform real-time inference and return output scores.
Flow: 💻 Client App (POST /predict) ➡️ Req / ⬅️ Res: Flask App (API Endpoint) ↔️ 💾 Loaded Model (in memory)
    from flask import Flask, request, jsonify
    import joblib

    app = Flask(__name__)
    model = joblib.load('model.joblib')  # loaded once at server startup

    @app.route('/predict', methods=['POST'])
    def predict():
        features = request.json['features']
        prediction = model.predict([features])
        # .tolist() converts NumPy values into plain Python types for JSON
        return jsonify({'prediction': prediction.tolist()})

Mapping Tasks & Tooling

Pipeline Alignment ⚙️

  • Regression & Classification represent core Model Choice, Training, and Evaluation blocks.
  • Clustering & PCA fit directly into Preprocessing (mapping datasets to compressed arrays prior to training).

Key Tool Ecosystems 🛠️

  • Pandas / NumPy: DataFrames and arrays
  • Scikit-Learn: Classical ML
  • TensorFlow / PyTorch: Deep Learning
  • MLflow: MLOps

Key Takeaways & Wrap-Up

  • Define Problem First: Always map your requirements to labels (Supervised), feature structure (Unsupervised), or state interactions (Reinforcement).
  • Prioritize Baseline Models: Try simple linear weights or single decision splits before compiling deep neural networks or complex boosting stacks.
  • Iterate on Pipeline Data: Most modeling errors stem from poor feature preprocessing, not hyperparameter tuning. Clean and scale your raw inputs carefully.

Visual Cheat Sheet Summary

Takehome Infographic summary chart

Summary Quiz: Paradigm Matchmaker

1. Train a self-driving car to steer and avoid road cones via feedback rewards.
2. Predict the salary of a new job listing based on experience, role, and location.
3. Find hidden, fraudulent cohorts inside banking transactions without labels.

Audience Q&A

Ask Me Anything!

Suggested Topics to Start:

  • How to manage imbalanced class labels?
  • When is deep neural computing actually necessary?
  • Transition paths from traditional BI analytics to ML engineering.
Audience Q&A Graphic