Supervised, Unsupervised & Reinforcement Learning
A 1-hour conceptual masterclass designed for beginner-to-intermediate data professionals to build an intuitive, visual, and practical mental map of ML algorithms.
To build a strong foundation in Machine Learning 101, you need to master the core terminology that dictates how models learn, fail, and get evaluated.
Features are the inputs the model learns from (independent variables, X);
the Target is what you want to predict (dependent variable, Y).
Fits an optimal line to the target outputs by minimizing the residual sum of squares between observed targets and predictions. Suited to simple linear data trends.
Adds a penalty "budget" to the coefficients to prevent overfitting. Ridge shrinks all coefficients toward zero without eliminating any. Lasso can shrink some coefficients exactly to zero, performing automatic feature selection.
The math behind standard Linear Regression. We fit a linear function and optimize parameters by minimizing the Mean Squared Error (MSE) loss function:
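A standard way to write this objective for the single-feature case, using this section's β₀, β₁, x₁, and y. The sample count n and the index i are notation introduced here for illustration:

$$
\mathrm{MSE}(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{1,i}) \right)^2
$$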
Predict Housing Price (y, Dependent Variable) based on House Size (x₁, Independent Variable). The weight β₁ represents the price increase per additional sq. ft. (e.g., +$250/sq.ft.), while β₀ (intercept) is the base land cost.
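The housing example can be sketched in a few lines of plain Python. This is a minimal illustration, not a production fit: the four data points are invented so that the true relationship is a $50,000 base cost plus $250 per sq. ft., and the helper names `fit_simple_ols` and `mse` are hypothetical.

```python
# Hypothetical training data: house size (sq. ft.) and sale price (dollars).
# The values are made up so that price = 50,000 + 250 * size exactly.
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [300000.0, 425000.0, 550000.0, 675000.0]


def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimize the MSE of y ≈ beta0 + beta1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Closed-form solution for one feature: slope = covariance / variance.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def mse(x, y, beta0, beta1):
    """Mean squared error of the fitted line on the given data."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


beta0, beta1 = fit_simple_ols(sizes, prices)
print("intercept (base cost):", beta0)
print("slope (price per sq. ft.):", beta1)
print("training MSE:", mse(sizes, prices, beta0, beta1))
```

Because the invented data lie exactly on a line, the fit recovers the slope of 250 and the error is zero. Real data would leave a nonzero residual.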
To prevent overfitting, we add a penalty on coefficient magnitude, scaled by a strength parameter α, to the OLS cost function:
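One common way to write the two penalized objectives. The notation is introduced here for illustration: p coefficients β_j, predictions ŷ_i, and an unpenalized intercept. Libraries differ in how they scale the error term.

$$
J_{\text{Ridge}}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
\qquad
J_{\text{Lasso}}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j|
$$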
If predicting housing price using Size, Bedrooms, and Wall Color: a high penalty α shrinks the weights. Ridge (L2) shrinks all weights toward zero while retaining every feature, whereas Lasso (L1) can drive the weight of an uninformative feature such as Wall Color to exactly zero, automatically discarding it.
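The different shrinkage behavior can be seen in a simplified sketch. It assumes the features are centered and mutually uncorrelated, so each weight can be solved on its own. Under that assumption, minimizing squared error plus α·β² gives β = S_xy / (S_xx + α), and minimizing squared error plus α·|β| gives a soft-thresholded β. The S_xy values below are invented for illustration.

```python
def ridge_weight(s_xy, s_xx, alpha):
    """Minimizer of sum((y - b*x)^2) + alpha * b^2 for one centered feature."""
    return s_xy / (s_xx + alpha)


def soft_threshold(value, threshold):
    """Pull value toward zero by threshold; return 0.0 if it falls inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_weight(s_xy, s_xx, alpha):
    """Minimizer of sum((y - b*x)^2) + alpha * |b| for one centered feature."""
    return soft_threshold(s_xy, alpha / 2.0) / s_xx


# Hypothetical feature/target cross-products S_xy, with features scaled so S_xx = 1.
# A larger value means the feature carries more signal about price.
feature_s_xy = {"Size": 5.0, "Bedrooms": 2.0, "Wall Color": 0.2}
alpha = 1.0  # assumed penalty strength

ridge = {name: ridge_weight(s_xy, 1.0, alpha) for name, s_xy in feature_s_xy.items()}
lasso = {name: lasso_weight(s_xy, 1.0, alpha) for name, s_xy in feature_s_xy.items()}

for name in feature_s_xy:
    print(f"{name:10s} ridge={ridge[name]:.2f} lasso={lasso[name]:.2f}")
```

In this toy setting Ridge keeps a small non-zero weight for Wall Color, while Lasso sets it to exactly zero. With correlated features the weights interact and an iterative solver is needed.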
📈 Logistic Regression
• When to choose: Need fast training, highly interpretable coefficients, or explicitly require probabilistic scores (e.g. default probability).
🛡️ Support Vector Machines (SVM)
• When to choose: Non-linearly separable data (via kernels), high dimensionality (features > samples), or when predictive accuracy matters more than interpretability or training speed.
📍 k-Nearest Neighbors (k-NN)
• When to choose: Small datasets, complex decision boundaries, or when you need an intuitive, instance-based model with no training phase (the cost shifts to prediction time).
💡 Pro Tip: Inverse Regularization strength C
C is the inverse of regularization strength (C = 1/λ).
• Smaller C: Stronger regularization; penalizes complex models to prevent
overfitting (simpler decision boundary).
• Larger C: Weaker regularization; allows model weights to grow to fit training
data tightly (risk of overfitting).
To classify a new hotel based on Price per Night ($) and Distance to Beach (meters) under k=5: locate the 5 nearest hotels in coordinate space. If 4 are "Budget Hostel" and 1 is "Luxury Resort", the model classifies the new hotel as Budget Hostel by majority vote.
Classifying a new fruit (Sweetness = 6, Crunchiness = 4) with k = 3 neighbors.
Formula: Distance = √((x₂ - x₁)² + (y₂ - y₁)²)
| Fruit | Sweet (x) | Crunch (y) | Calculation | Distance |
|---|---|---|---|---|
| Apple A | 7 | 7 | √((7-6)² + (7-4)²) = √(1+9) | 3.16 |
| Apple B | 8 | 5 | √((8-6)² + (5-4)²) = √(4+1) | 2.24 |
| Orange A | 3 | 3 | √((3-6)² + (3-4)²) = √(9+1) | 3.16 |
| Orange B | 7 | 2 | √((7-6)² + (2-4)²) = √(1+4) | 2.24 |
| Orange C | 6 | 2 | √((6-6)² + (2-4)²) = √(0+4) | 2.00 |
The three nearest neighbors are Orange C (2.00), Apple B (2.24), and Orange B (2.24). The new fruit is classified as an Orange because Oranges hold the majority of the votes (2 out of 3).
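The fruit table can be reproduced with a short sketch in plain Python. The function name `knn_classify` and the tuple layout are illustrative choices, not part of any library. Ties in the vote are not handled specially here.

```python
import math
from collections import Counter

# (name, label, sweetness, crunchiness) taken from the table above.
labeled_fruits = [
    ("Apple A", "Apple", 7, 7),
    ("Apple B", "Apple", 8, 5),
    ("Orange A", "Orange", 3, 3),
    ("Orange B", "Orange", 7, 2),
    ("Orange C", "Orange", 6, 2),
]


def knn_classify(point, samples, k):
    """Return (majority label, names of the k nearest samples) by Euclidean distance."""
    ranked = sorted(samples, key=lambda s: math.dist(point, (s[2], s[3])))
    nearest = ranked[:k]
    votes = Counter(label for _, label, _, _ in nearest)
    majority_label = votes.most_common(1)[0][0]
    return majority_label, [name for name, _, _, _ in nearest]


new_fruit = (6, 4)  # Sweetness = 6, Crunchiness = 4
predicted_label, neighbor_names = knn_classify(new_fruit, labeled_fruits, k=3)
print(predicted_label, neighbor_names)
```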
• Max Depth: Limits how deep trees grow. Low values prevent overfitting; high values
capture complex structures.
• n_estimators & learning rate (LR): Random Forest is robust to high tree counts; XGBoost requires
balancing the number of trees against the learning rate (a smaller rate generally needs more trees).
• Root: Credit Score > 650?
├─ YES ➔ DTI < 40%?
│  ├─ YES ➔ Employed == True? ➔ Approved (96% prob, Low Risk)
│  └─ NO ➔ Income > $80k? ➔ Approved (80% prob, High Interest)
└─ NO ➔ Collateral == True?
   ├─ YES ➔ Approved (70% prob, Secured Loan)
   └─ NO ➔ Rejected (95% prob, Extreme Risk)
Telecom Customer Churn
Problem: Predict user subscription cancellations based on usage records.
Type: Classification.
Models: Logistic Regression, Random Forest, XGBoost.
Real Estate Pricing
Problem: Estimate home values based on size, zip codes, crime rates.
Type: Regression.
Models: Ridge/Lasso, Random Forest Regressor.
Email Spam Filter
Problem: Classify incoming emails as legitimate or spam.
Type: Classification.
Models: Naive Bayes, Decision Trees, SVM.
Analyzing unstructured, unlabeled data patterns to extract inherent shapes, dimensions, and segment groups without teacher supervision.
📍 K-Means (Centroid-Based)
• When to choose: Spherical clusters, similar sizes, and need maximum
speed/scalability on large datasets.
• Caveat: Sensitive to feature scaling; requires predefining number of clusters
k.
🌿 Hierarchical (Tree-Based)
• When to choose: Need taxonomic nested hierarchies (dendrograms) or deterministic
results on small datasets.
• Caveat: High computational complexity (O(N³)), making it sluggish for
datasets over 10,000 samples.
🌀 DBSCAN (Density-Based)
• When to choose: Arbitrary shapes (rings, loops), data has noise/outliers to
filter, and cluster count is unknown.
• Caveat: Fails on datasets with highly varying densities; tuning search radius
Eps is critical.
💡 Clustering Examples
• K-Means: Customer segmentation for targeted marketing campaigns.
• Hierarchical: Building evolutionary trees of animal species (dendrograms).
• DBSCAN: Mapping city crime or traffic hotspots while filtering outlier noise.
Centroid-Based Partitioning:
Bottom-Up Agglomerative Trees:
* Note: Divisive (top-down) is conceptually opposite but mathematically asymmetric to Agglomerative. Agglomerative merges local pairs; Divisive splits global structures.
Density-Based Connectivity:
If a point has at least min_samples neighbors within radius Eps, mark it as a "Core" point and start a new cluster.
Grouping customers into K = 2 clusters based on Age and Spending Score (1-10). Initial centroids: C1 = (20,3), C2 = (40,8).
| Customer | Age (x) | Spend (y) | Dist to C1 (20,3) | Dist to C2 (40,8) | Cluster |
|---|---|---|---|---|---|
| User A | 22 | 4 | 2.24 | 18.44 | C1 |
| User B | 28 | 2 | 8.06 | 13.42 | C1 |
| User C | 45 | 9 | 25.71 | 5.10 | C2 |
| User D | 38 | 7 | 18.44 | 2.24 | C2 |
Steps 1 & 2 are repeated with updated centroids. The process stops when cluster assignments freeze and centroids no longer shift coordinate positions.
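One pass of the assignment and centroid-update steps for the customer table can be sketched as follows. The helper names `assign_clusters` and `update_centroids` are hypothetical, and a full K-Means run would loop until the assignments stop changing.

```python
import math

# Customers (Age, Spending Score) and initial centroids from the worked example.
customers = {
    "User A": (22, 4),
    "User B": (28, 2),
    "User C": (45, 9),
    "User D": (38, 7),
}
centroids = {"C1": (20, 3), "C2": (40, 8)}


def assign_clusters(points, centers):
    """Assign every point to the centroid with the smallest Euclidean distance."""
    return {
        name: min(centers, key=lambda c: math.dist(p, centers[c]))
        for name, p in points.items()
    }


def update_centroids(points, assignments, old_centers):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    new_centers = {}
    for center_name, old_position in old_centers.items():
        members = [points[n] for n, c in assignments.items() if c == center_name]
        if not members:
            new_centers[center_name] = old_position
            continue
        new_centers[center_name] = tuple(
            sum(coords) / len(members) for coords in zip(*members)
        )
    return new_centers


assignments = assign_clusters(customers, centroids)
new_centroids = update_centroids(customers, assignments, centroids)
print(assignments)
print(new_centroids)
```

With these four customers, the first update moves C1 to the mean of Users A and B and C2 to the mean of Users C and D.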
📈 PCA (Linear Projection)
• When to choose: Preprocessing features for other models, noise reduction, or
preserving global structural patterns.
• Projection: Supports out-of-sample projection (`pca.transform(X_new)`) for
new data points.
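To illustrate what out-of-sample projection means, here is a minimal 2D-to-1D PCA sketch in plain Python. It is not the scikit-learn implementation: the functions `fit_pca_2d` and `transform` are hypothetical, the data points are invented, and the closed-form angle formula only applies to two features.

```python
import math


def fit_pca_2d(points):
    """Return (mean, direction) of the first principal component of 2D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix entries.
    c_xx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c_yy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    c_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Angle of the axis with maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * c_xy, c_xx - c_yy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def transform(point, mean, direction):
    """Project a point onto the fitted component, reusing the training mean."""
    return (point[0] - mean[0]) * direction[0] + (point[1] - mean[1]) * direction[1]


# Hypothetical training data lying on the line y = 2x.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mean, direction = fit_pca_2d(train)

# A new point never seen during fitting is projected with the stored parameters.
new_point = (5.0, 10.0)
new_score = transform(new_point, mean, direction)
print(direction, new_score)
```

The key design point is that fitting stores a mean and a direction, so new points can be projected later without refitting.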
🗺️ t-SNE & UMAP (Non-Linear Mapping)
• When to choose: Visualizing high-dimensional clusters in 2D or 3D coordinate
space.
• Projection: Standard t-SNE does *not* support projecting new points and requires re-running on
the entire dataset; UMAP can embed new points with a fitted model's `transform` method.
💡 Reduction Examples
• PCA: Compressing 100+ user survey answers into 2 principal components to feed
into a regression model.
• t-SNE / UMAP: Visualizing high-dimensional single-cell genetic sequences in a
2D scatter plot.
Customer Segmentation
Problem: Group buyers based on shopping volumes and session times.
Type: Clustering.
Models: K-Means, Hierarchical Clustering.
Genomics Visualization
Problem: Map expression distributions of 20k genes in 2D plots.
Type: Dimensionality Reduction.
Models: t-SNE, PCA.
Credit Anomaly Detection
Problem: Flag rare, anomalous credit card transactions.
Type: Outlier / Anomaly Detection.
Models: Isolation Forest, One-class SVM.
Exploration Blind Trial: The robot moves forward, enters a hazard zone (fire), and immediately receives a large negative reward (-100). The agent updates its memory weights to avoid this action in similar grid states in future episodes.
Exploitation Path Corrected: After multiple iterations, the agent learns the barrier locations, navigates around hazard areas, reaches the battery charger goal, and earns a positive reward (+100).
Maintains a lookup table (Q-table) that maps each state-action pair to the expected cumulative reward of taking that action in that state.
| State/Action | ⬅️ Left | ⬆️ Up | ➡️ Right |
|---|---|---|---|
| S1 (Start) | 0.0 | -10.0 | +0.5 |
| S2 (Hazard) | +1.2 | 0.0 | -100.0 |
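A single tabular Q-learning update on the table above can be sketched as follows. Several things are assumptions made for illustration: that moving Right from S1 leads to S2 with an immediate reward of 0, the learning rate of 0.5, and the discount factor of 0.9. The update rule itself is the standard Q(s,a) ← Q(s,a) + lr · (r + γ · max Q(s',·) − Q(s,a)).

```python
# Q-table values copied from the table above.
q_table = {
    "S1": {"Left": 0.0, "Up": -10.0, "Right": 0.5},
    "S2": {"Left": 1.2, "Up": 0.0, "Right": -100.0},
}


def q_update(q, state, action, reward, next_state, lr=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = max(q[next_state].values())   # value of the best follow-up action
    td_target = reward + gamma * best_next    # bootstrapped estimate of the return
    q[state][action] += lr * (td_target - q[state][action])
    return q[state][action]


def greedy_action(q, state):
    """Exploitation: pick the action with the highest stored Q-value."""
    return max(q[state], key=q[state].get)


# Assumed transition: S1 --Right--> S2 with reward 0.
new_value = q_update(q_table, "S1", "Right", reward=0.0, next_state="S2")
print("updated Q(S1, Right):", round(new_value, 2))
print("greedy action in S2:", greedy_action(q_table, "S2"))
```

The large negative entry for Right in S2 is what steers the greedy policy away from the hazard.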
Directly models and adjusts the probability distribution over actions, without storing an intermediate state-value lookup table.
Grandmaster Chess/Go AI
Problem: Master complex game strategies to beat humans.
Type: Reinforcement Learning.
Models: Monte Carlo Tree Search, Deep Q-Networks.
Robotic Arm Control
Problem: Grasp moving objects without damage.
Type: Continuous RL.
Models: PPO, Deep Deterministic Policy Gradients.
Ad Bidding Optimization
Problem: Select user ad impressions to maximize clicks.
Type: Multi-Armed Bandit RL.
Models: Thompson Sampling, UCB.
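The exploration bonus behind UCB can be shown with a small sketch of the UCB1 score, mean reward plus sqrt(2 · ln(total pulls) / pulls of this arm). The ad names and click counts are invented, and the code assumes every ad has been shown at least once.

```python
import math

# Hypothetical campaign history: impressions served and clicks received per ad.
impressions = {"Ad A": 100, "Ad B": 100, "Ad C": 5}
clicks = {"Ad A": 10, "Ad B": 12, "Ad C": 0}


def ucb1_scores(clicks_by_ad, impressions_by_ad):
    """Return the UCB1 score per ad: observed click rate plus an exploration bonus."""
    total = sum(impressions_by_ad.values())
    scores = {}
    for ad, shown in impressions_by_ad.items():
        click_rate = clicks_by_ad[ad] / shown
        bonus = math.sqrt(2.0 * math.log(total) / shown)  # shrinks as an ad is shown more
        scores[ad] = click_rate + bonus
    return scores


scores = ucb1_scores(clicks, impressions)
next_ad = max(scores, key=scores.get)
best_observed = max(impressions, key=lambda ad: clicks[ad] / impressions[ad])
print("highest observed click rate:", best_observed)
print("UCB1 picks next:", next_ad)
```

Here the rarely shown ad wins the next impression despite its poor observed click rate, because its estimate is still highly uncertain.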
* Key Contrast: Linear Models assume flat, straight-line relationships (Polynomial Regression captures non-linear relations by adding polynomial features while still using a linear solver), whereas Tree Ensembles, Kernels, and Neural Networks are inherently Non-Linear, allowing them to fit complex curved boundaries.
| Paradigm | Model / Algorithm | Slide Reference | Type |
|---|---|---|---|
| Supervised (Regression) | 1. Ordinary Least Squares (OLS) Linear Regression | Slide 7 | Linear |
| Supervised (Regression) | 2. Ridge Regression (L2 Penalty) | Slide 8 | Linear |
| Supervised (Regression) | 3. Lasso Regression (L1 Penalty) | Slide 8 | Linear |
| Supervised (Regression) | 4. Polynomial Regression (Bridge Model) | Slide 25 (Mentioned) | Non-Linear features / Linear params |
| Supervised (Classification) | 5. Logistic Regression | Slide 9 | Linear |
| Supervised (Classification) | 6. Support Vector Classifier (Linear SVM) | Slide 9 | Linear |
| Supervised (Classification) | 7. Kernel SVM (RBF / Polynomial Kernels) | Slide 9 | Non-Linear |
| Supervised (Classification) | 8. k-Nearest Neighbors (k-NN) | Slide 10 | Non-Linear |
| Tree Ensembles (Reg/Clas) | 9. Decision Trees | Slide 11 | Non-Linear |
| Tree Ensembles (Reg/Clas) | 10. Random Forest (Bagging) | Slide 11 | Non-Linear |
| Tree Ensembles (Reg/Clas) | 11. XGBoost, LightGBM, CatBoost (Boosting) | Slide 11 | Non-Linear |
| Unsupervised (Clustering) | 12. K-Means Clustering | Slide 15 | Non-Linear |
| Unsupervised (Clustering) | 13. Hierarchical Clustering (Agglomerative) | Slide 15 | Non-Linear |
| Unsupervised (Clustering) | 14. DBSCAN (Density-Based) | Slide 15 | Non-Linear |
| Unsupervised (Dim. Reduction) | 15. Principal Component Analysis (PCA) | Slide 17 | Linear |
| Unsupervised (Dim. Reduction) | 16. t-SNE & UMAP (Manifold Learning) | Slide 17 | Non-Linear |
| Unsupervised (Anomaly) | 17. Isolation Forest | Slide 19 | Non-Linear |
| Unsupervised (Anomaly) | 18. One-Class SVM | Slide 19 | Non-Linear |
| Probabilistic Models | 19. Naive Bayes | Slide 12 | Non-Linear |
| Probabilistic Models | 20. Gaussian Mixture Models (GMM) | Slide 25 | Non-Linear |
| Reinforcement Learning | 21. Multi-Armed Bandits (Thompson / UCB) | Slide 23 | Behavioral / Policy |
| Reinforcement Learning | 22. Q-Learning (Value-Based) | Slide 22 | Behavioral / Value-Based |
| Reinforcement Learning | 23. Policy Gradients / PPO | Slide 22 | Behavioral / Policy-Based |
| Reinforcement Learning | 24. Deep Q-Networks (DQN) & DDPG | Slide 23 | Behavioral / Deep RL |
| Deep Learning | 25. Neural Networks (MLPs, CNNs, Transformers) | Slide 25 | Non-Linear |
Before a trained machine learning model can serve live predictions, its in-memory weights must be serialized into a persistent, portable file artifact.
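A minimal sketch of the save-and-restore round trip, using JSON from the standard library so it runs without extra dependencies. The weights, the file name `model.json`, and the helper functions are hypothetical. Real scikit-learn models are more commonly persisted with tools such as `pickle` or `joblib`.

```python
import json
import os
import tempfile

# Hypothetical trained linear model: intercept plus one coefficient per feature.
model_weights = {"intercept": 50000.0, "coefficients": {"size_sqft": 250.0}}


def save_model(weights, path):
    """Serialize in-memory weights to a file artifact."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(weights, f)


def load_model(path):
    """Restore weights from a previously saved artifact."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def predict(weights, features):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    return weights["intercept"] + sum(
        weights["coefficients"][name] * value for name, value in features.items()
    )


# Write the artifact to a temporary directory, then reload it as a server would.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "model.json")
    save_model(model_weights, artifact_path)
    restored = load_model(artifact_path)

prediction = predict(restored, {"size_sqft": 2000.0})
print("restored model prediction:", prediction)
```

The restored weights give the same predictions as the in-memory model, which is the property any serialization format needs to guarantee before deployment.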
Deploying models means exposing them via a Python web framework (e.g., FastAPI or Flask) as a REST API endpoint for consumption by external applications.
Ask Me Anything!