# Flashcards: Advanced Pandas & Seaborn for ML Preparation

Use these conceptual flashcards to review key data manipulation topics before transitioning to the Scikit-Learn machine learning classes.

---

## 🗂️ Category 1: Pandas Data Structures

### Flashcard 1: DataFrame vs. Series in ML
* **Question**: How do Pandas DataFrames and Series map to the mathematical inputs of a supervised machine learning model?
* **Answer**: 
  * A **DataFrame (2D)** contains rows (samples) and columns (features). It maps to the **feature matrix $X$**.
  * A **Series (1D)** contains a single labeled array representing target values. It maps to the **target vector $y$**.
* **Visual Concept**: 
  * $X$ = Multi-column table (Features)
  * $y$ = Single column (Labels/Target)
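
The mapping above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy table; the column names `age`, `income`, and `churned` are assumptions and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy dataset: column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 88_000, 61_000],
    "churned": [0, 0, 1, 1],  # the label we want to predict
})

X = df.drop(columns=["churned"])  # DataFrame (2D): feature matrix
y = df["churned"]                 # Series (1D): target vector

print(type(X).__name__, X.shape)  # DataFrame (4, 2)
print(type(y).__name__, y.shape)  # Series (4,)
```

Selecting a single column with `df["churned"]` returns a Series, while `df.drop(columns=...)` keeps the result two-dimensional.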

---

## 🧹 Category 2: Data Cleaning for ML Compliance

### Flashcard 2: The NaN Constraint
* **Question**: Why must missing values (`NaN`) be handled before passing a DataFrame to a Scikit-Learn estimator?
* **Answer**: Most classical Scikit-Learn estimators (e.g., `LinearRegression`, `LogisticRegression`, `SVC`) cannot compute on missing values (`NaN`) and raise a `ValueError` stating that the input contains NaN. A few estimators, such as `HistGradientBoostingClassifier`, accept `NaN` natively.
* **Remediation**:
  * **Drop**: Use `df.dropna(subset=['target_col'])` when labels are missing.
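  * **Impute**: Use `df['feature'].fillna(df['feature'].median())` to fill missing feature values with the column median (or `.mean()` for the mean). Compute the statistic on the training split only, so that no information from the test set leaks into the features.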
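
The two remediation steps can be combined as in the sketch below. The data is made up for illustration, and the column names simply reuse the placeholders `feature` and `target_col` from the card.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing feature value and one missing label.
df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 10.0, 5.0],
    "target_col": [0.0, 1.0, np.nan, 1.0, 0.0],
})

# Drop: a row without a label cannot be used for supervised training.
clean = df.dropna(subset=["target_col"]).copy()

# Impute: fill the remaining feature gap with the column median.
median = clean["feature"].median()
clean["feature"] = clean["feature"].fillna(median)

print(median)                    # 5.0
print(clean.isna().sum().sum())  # 0
```

The median is computed after the unlabeled row is dropped, so it reflects only the rows that will actually be used.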

### Flashcard 3: Duplicate Rows & Data Leakage
* **Question**: How do duplicate rows impact machine learning models, and how do we resolve them?
* **Answer**: Duplicate rows cause **data leakage** when identical samples end up in both the training set and the test set. The model is then scored on rows it has already seen, which inflates its test metrics (making it look better than it is) even though it may perform poorly on genuinely unseen data.
* **Syntax**: `df = df.drop_duplicates(keep='first')`
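
A minimal sketch of deduplicating before splitting, on made-up data. The split here uses `DataFrame.sample` purely to keep the example dependency-free; that choice is an assumption of this sketch, not a recommendation over a dedicated splitting utility.

```python
import pandas as pd

# Made-up data: rows 2 and 5 repeat rows 1 and 4 exactly.
df = pd.DataFrame({
    "x1": [1, 2, 2, 3, 4, 4],
    "x2": [10, 20, 20, 30, 40, 40],
    "y": [0, 1, 1, 0, 1, 1],
})
print(df.duplicated().sum())  # 2

# Deduplicate first, then split.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
train = deduped.sample(frac=0.75, random_state=0)
test = deduped.drop(train.index)

# An inner merge on all shared columns lists rows present in both splits.
overlap = train.merge(test, how="inner")
print(len(overlap))  # 0
```

Running the same overlap check on a split of the original `df` is a quick way to audit an existing pipeline for this kind of leakage.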

---

## 🏷️ Category 3: Categorical Encoding

### Flashcard 4: Ordinal Mapping vs. One-Hot Encoding
* **Question**: What is the difference between Ordinal Mapping and One-Hot Encoding? When do we use each?
* **Answer**:
  * **Ordinal Mapping**: Maps ordered categories (e.g. Bronze $\rightarrow$ 0, Silver $\rightarrow$ 1, Gold $\rightarrow$ 2) using a dictionary. Used when there is a natural ranking.
    * *Syntax*: `df['tier_enc'] = df['tier'].map({'Bronze': 0, 'Silver': 1, 'Gold': 2})`
  * **One-Hot Encoding**: Converts nominal categories (no rank, e.g. gender, city) into multiple binary columns (0 or 1).
    * *Syntax*: `df_enc = pd.get_dummies(df, columns=['city'], drop_first=True)`
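
Both encodings are shown side by side in the sketch below. The rows and city names are invented for illustration.

```python
import pandas as pd

# Invented example rows.
df = pd.DataFrame({
    "tier": ["Bronze", "Gold", "Silver", "Gold"],
    "city": ["Austin", "Boston", "Chicago", "Austin"],
})

# Ordinal mapping: the integers preserve the Bronze < Silver < Gold ranking.
tier_order = {"Bronze": 0, "Silver": 1, "Gold": 2}
df["tier_enc"] = df["tier"].map(tier_order)

# One-hot encoding: one binary column per city, minus the first category.
df_enc = pd.get_dummies(df, columns=["city"], drop_first=True)

print(df["tier_enc"].tolist())  # [0, 2, 1, 2]
print(df_enc.columns.tolist())  # ['tier', 'tier_enc', 'city_Boston', 'city_Chicago']
```

Note that `.map()` returns `NaN` for any category missing from the dictionary, so it is worth checking for new `NaN` values after mapping.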

### Flashcard 5: The Dummy Variable Trap
* **Question**: What is the "Dummy Variable Trap" and how does setting `drop_first=True` resolve it?
* **Answer**: When one-hot encoding, creating a binary dummy column for *every* category introduces perfect multicollinearity: the dummy columns always sum to 1, so any one of them can be derived from the others, and together they duplicate the model's intercept. For example, `gender_M` and `gender_F` are perfectly negatively correlated (if one is 1, the other is 0). This redundancy makes linear model coefficients unstable or non-unique. Setting `drop_first=True` drops the first category, which becomes the reference level, and removes the redundancy.
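
One way to see the trap numerically is to compare the rank of the design matrix (intercept plus dummies) with and without `drop_first=True`. This is a sketch on invented data; `design_matrix` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["M", "F", "F", "M", "F"], name="gender")  # invented data

full = pd.get_dummies(gender, prefix="gender", dtype=float)
reduced = pd.get_dummies(gender, prefix="gender", drop_first=True, dtype=float)

def design_matrix(dummies: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: prepend an intercept column of ones."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy()])

# 3 columns but rank 2: gender_F + gender_M equals the intercept column.
print(np.linalg.matrix_rank(design_matrix(full)))     # 2

# 2 columns and rank 2: the matrix is full rank.
print(np.linalg.matrix_rank(design_matrix(reduced)))  # 2
```

A rank lower than the number of columns means at least one column is an exact linear combination of the others, which is the condition the card describes.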

---

## 🤝 Category 4: Combining Data

### Flashcard 6: Concatenation vs. Merging
* **Question**: When do you use `pd.concat()` vs. `pd.merge()`?
* **Answer**:
  * **`pd.concat()`**: Used to stack tables together, either vertically (`axis=0`, adding more rows/samples, aligned on column names) or horizontally (`axis=1`, adding more features, aligned on index labels).
  * **`pd.merge()`**: Used to perform SQL-style relational database joins based on matching key columns (primary/foreign keys), such as merging transaction history with customer demographics.
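
The sketch below uses both functions on invented tables: monthly transaction files are stacked with `pd.concat()`, then enriched with customer attributes via `pd.merge()`. All table and column names are assumptions made for this example.

```python
import pandas as pd

# Invented monthly transaction tables with identical columns.
jan = pd.DataFrame({"customer_id": [1, 2], "amount": [20.0, 35.0]})
feb = pd.DataFrame({"customer_id": [2, 3], "amount": [15.0, 50.0]})

# concat: stack rows (more samples); ignore_index rebuilds a clean 0..n-1 index.
transactions = pd.concat([jan, feb], axis=0, ignore_index=True)

# Invented customer demographics keyed by customer_id.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "South", "West"],
})

# merge: left join keeps every transaction and attaches the matching region.
enriched = pd.merge(transactions, customers, on="customer_id", how="left")

print(transactions.shape)  # (4, 2)
print(enriched.shape)      # (4, 3)
```

A left join is used here so that no transaction is dropped when a customer record is missing; unmatched rows would get `NaN` in `region`.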

---

## 📅 Category 5: Time Series Preprocessing

### Flashcard 7: Lag and Lead Features
* **Question**: What are lag and lead features, and how are they created in Pandas?
* **Answer**:
  * **Lag Feature (`.shift(1)`)**: Moves values one row forward in time, so each row holds the previous period's value and the first row becomes `NaN`. Used to create autoregressive predictors (e.g., using yesterday's stock price as a feature to predict today's price). The rows must already be sorted by time.
  * **Lead Feature (`.shift(-1)`)**: Moves values one row backward in time, so each row holds the next period's value and the last row becomes `NaN`. Used to construct targets for predictive modeling (e.g., matching today's record with tomorrow's actual closing price to serve as the training label $y$).
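
A short sketch of both shifts on an invented daily price series. The column names `close_lag1` and `target_next` are assumptions chosen for this example.

```python
import pandas as pd

# Invented closing prices on a daily index, already sorted by date.
prices = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 105.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

prices["close_lag1"] = prices["close"].shift(1)    # yesterday's close (feature)
prices["target_next"] = prices["close"].shift(-1)  # tomorrow's close (label)

# The first row has no lag and the last row has no lead, so both are dropped.
model_df = prices.dropna()

print(prices)
print(len(model_df))  # 2
```

Because the shifts create `NaN` at the edges, the `dropna()` step ties back to the NaN constraint from Flashcard 2.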

---

## 📊 Category 6: Exploratory Data Analysis & Seaborn

### Flashcard 8: Boxplots & Outlier Audits
* **Question**: Why is identifying outliers with Seaborn boxplots critical before training linear models?
* **Answer**: Outliers heavily skew models that minimize squared errors (like Ordinary Least Squares regression) because squaring the large error of an outlier pulls the fitted line/hyperplane away from the main data distribution.
* **Visual Tool**: `sns.boxplot(data=df, x='category', y='amount')` draws these anomalies as individual points beyond the whiskers, which by default extend to the most extreme observations within 1.5 × IQR (interquartile range) of the box edges.
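
The points a boxplot flags can also be listed numerically. The sketch below applies the common 1.5 × IQR rule with plain pandas on invented values; it is meant to approximate the default boxplot behavior, not to reproduce Seaborn's internals.

```python
import pandas as pd

# Invented amounts with one obvious outlier.
amount = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="amount")

q1 = amount.quantile(0.25)
q3 = amount.quantile(0.75)
iqr = q3 - q1

# Fences: values outside this range are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = amount[(amount < lower) | (amount > upper)]
print(outliers.tolist())  # [95]
```

Listing the flagged rows makes it easier to decide whether each one is a data error to fix or a genuine extreme value to keep.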

### Flashcard 9: Heatmaps & Multicollinearity
* **Question**: What does a correlation matrix heatmap tell us during feature selection?
* **Answer**: It highlights pairwise correlation coefficients. If two features have a correlation close to +1.0 or -1.0, they carry largely redundant information (a pairwise sign of multicollinearity). Dropping one of them shrinks the feature matrix $X$, stabilizes linear model coefficients, and can speed up model training.
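* **Syntax**: `sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')` (`numeric_only=True` skips non-numeric columns, which otherwise raise an error in pandas 2.x)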
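
The same correlation matrix that feeds the heatmap can drive a simple redundancy filter. This is a sketch on synthetic data; the 0.95 threshold and the column names are assumptions for illustration, and which feature of a correlated pair to keep is a judgment call.

```python
import numpy as np
import pandas as pd

# Synthetic data: height_in is an exact rescaling of height_cm.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,
    "weight_kg": rng.normal(70, 12, size=200),
    "label": ["a"] * 200,  # non-numeric column, skipped by numeric_only=True
})

corr = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each feature pair is checked once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Assumed threshold: flag the later column of any pair with |r| > 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)  # ['height_in']
```

Pairwise correlation only catches redundancy between two features at a time; a feature that is a combination of several others can slip through this check.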
