Assignment 2a: Exploratory Data Analysis


Author	Robert Frenken
Estimated time	6-8 hours
Prerequisites	Assignment 1 completed, Python basics, GitHub account

What You'll Build

An exploratory data analysis (EDA) of a real automotive intrusion detection dataset, published as a blog post on your Quarto website. You'll load and explore the data with pandas, create visualizations with matplotlib and seaborn, train two simple ML models with scikit-learn, and recreate two plots from the Python Graph Gallery.

Part 0: Project Setup

0.1 Create a GitHub Repository

Go to github.com/new
Repository name: eda-assignment (or similar)
Set to Public, check "Add a README", add a Python .gitignore
Clone it to your machine:

git clone git@github.com:YOURUSERNAME/eda-assignment.git
cd eda-assignment

0.2 Set Up a Python Environment

Choose one of the three options below.

Option 1: uv (recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install packages
uv venv
source .venv/bin/activate   # Windows Git Bash: source .venv/Scripts/activate
uv pip install pandas matplotlib seaborn scikit-learn jupyter

Option 2: pip

python -m venv .venv
source .venv/bin/activate   # Windows Git Bash: source .venv/Scripts/activate
pip install pandas matplotlib seaborn scikit-learn jupyter

Option 3: conda

conda create -n eda python=3.12 pandas matplotlib seaborn scikit-learn jupyter -y
conda activate eda

For more details on Python environments, see the Python Environment Setup guide.

0.3 Project Directory Layout

Organize your repository like this:

eda-assignment/
├── data/               # Raw and processed data (add to .gitignore if large)
├── figures/            # Saved plot images
├── notebooks/
│   └── eda.ipynb       # Your main analysis notebook
├── .gitignore
├── README.md
└── requirements.txt    # Pin your dependencies

Create the directories and save your dependencies:

mkdir -p data figures notebooks
pip freeze > requirements.txt   # or: uv pip freeze > requirements.txt

Don't commit large data files

Add data/ to your .gitignore if the dataset exceeds a few MB. Git keeps every committed version of a file in the repository history, so large data files bloat the repository permanently.

Part 1: Get & Explore the Data

1.1 Download the Dataset

Download the HCRL Survival IDS dataset from ocslab.hksecurity.net/Datasets/survival-ids.

Visit the link and download the dataset files
Place the CSV file(s) in your data/ directory
Open notebooks/eda.ipynb (create it in VS Code or with jupyter notebook)

1.2 Load and Inspect

import pandas as pd

df = pd.read_csv("../data/survival_ids.csv")  # adjust filename as needed

# Basic inspection
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()
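A wrong relative path is a common cause of failure at this step, because "../data" is resolved relative to the notebook's working directory. The sketch below is one way to fail with a readable message instead of a bare traceback. The helper name `load_first_csv` is hypothetical, and it assumes the download contains at least one file with a `.csv` extension.

```python
from pathlib import Path

import pandas as pd


def load_first_csv(data_dir: Path) -> pd.DataFrame:
    """Load the alphabetically first CSV in data_dir, or raise a readable error."""
    csv_files = sorted(data_dir.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(
            f"No CSV files in {data_dir.resolve()}; check where the download was saved."
        )
    print(f"Loading {csv_files[0].name}")
    return pd.read_csv(csv_files[0])


# Usage from notebooks/eda.ipynb (the path is relative to the notebook's folder):
# df = load_first_csv(Path("../data"))
```

If the dataset ships as several files, the same `glob` call gives you the full list to loop over.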

1.3 Summary Statistics

Run these in separate notebook cells and read the output — don't just run them blindly:

# Data types and non-null counts
df.info()

# Descriptive statistics
df.describe()

# Check for missing values
df.isnull().sum()

# Value counts for categorical columns (if any)
# df["column_name"].value_counts()
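Raw missing-value counts are hard to judge without the row count, and it is not always obvious which columns are categorical. The following sketch shows one way to get both at a glance. It runs on a tiny synthetic DataFrame, and the column names (`can_id`, `payload_len`, `timestamp`) are assumptions for illustration, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; column names are hypothetical.
toy = pd.DataFrame(
    {
        "can_id": ["0x1A", "0x1A", "0x2B", "0x2B", "0x2B", "0x3C"],
        "payload_len": [8, 8, 4, np.nan, 8, 2],
        "timestamp": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
    }
)

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Number of distinct values per column (NaN is not counted).
# Columns with few distinct values relative to the row count are
# candidates for value_counts() and categorical plots.
unique_counts = toy.nunique()
print(unique_counts)
```

On the real data, replace `toy` with `df`.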

Data loaded successfully with pd.read_csv
df.info() output reviewed — you understand the column types
df.describe() output reviewed — you can identify reasonable ranges
Missing values checked

Part 2: Visualizations

Create at least 3 EDA plots. Use matplotlib and seaborn. Save each figure to figures/.

2.1 Distribution Plot

Pick a numeric column and plot its distribution:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df["your_column"], bins=50, kde=True, ax=ax)
ax.set_title("Distribution of Your Column")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
fig.savefig("../figures/distribution.png", dpi=150, bbox_inches="tight")
plt.show()

2.2 Correlation Heatmap

Visualize relationships between numeric features:

fig, ax = plt.subplots(figsize=(10, 8))
numeric_cols = df.select_dtypes(include="number")
sns.heatmap(numeric_cols.corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
fig.savefig("../figures/correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()
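A correlation matrix is symmetric, so half of the heatmap repeats the other half. A common variation, sketched below on synthetic features named `f1` to `f4` (hypothetical), masks the upper triangle and pins the color scale to the range -1 to 1 so colors are comparable across plots.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic numeric features; names are placeholders.
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
corr = toy.corr()

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    mask=mask,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    center=0,
    ax=ax,
)
ax.set_title("Lower-Triangle Correlation Heatmap (toy data)")
```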

2.3 Categorical Breakdown

If the dataset has a label or class column, visualize its distribution:

fig, ax = plt.subplots(figsize=(8, 5))
sns.countplot(data=df, x="label_column", ax=ax)
ax.set_title("Class Distribution")
ax.set_xlabel("Class")
ax.set_ylabel("Count")
fig.savefig("../figures/class_distribution.png", dpi=150, bbox_inches="tight")
plt.show()
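Intrusion detection data is often imbalanced, and raw counts on a bar chart do not make the ratio obvious. As a rough sketch, the bars can be annotated with each class's share of the total. The labels and counts below are synthetic, and this version uses plain matplotlib bars rather than `sns.countplot`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic labels; the class names and counts are hypothetical.
toy_labels = pd.Series(["normal"] * 90 + ["attack"] * 10, name="label")

counts = toy_labels.value_counts()
shares = counts / counts.sum()

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(list(counts.index), counts.to_numpy())

# Write each class's percentage of all rows above its bar.
ax.bar_label(bars, labels=[f"{s:.1%}" for s in shares])
ax.set_title("Class Distribution (toy data)")
ax.set_xlabel("Class")
ax.set_ylabel("Count")
```

Knowing the imbalance up front also helps when reading the classification reports in Part 3, since accuracy alone can look good on a heavily skewed dataset.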

Make your plots readable

Always include a title, axis labels, and a legend (if applicable). Use bbox_inches="tight" when saving to avoid clipped labels.

Distribution plot created and saved
Correlation heatmap created and saved
Categorical breakdown (or third plot of your choice) created and saved

Part 3: Toy ML Models

Train two simple classifiers and evaluate them. This is a first exposure to the sklearn API — the goal is to learn the workflow, not to achieve state-of-the-art accuracy.

3.1 Prepare the Data

from sklearn.model_selection import train_test_split

# Adjust column names to match your dataset
X = df.drop("label_column", axis=1)
y = df["label_column"]

# Keep only numeric feature columns (non-numeric columns are dropped here)
X = X.select_dtypes(include="number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
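The `stratify=y` argument is what keeps the class proportions the same in the train and test sets. A small check like the one below can confirm that it worked. The data is synthetic, with roughly 10% of rows labeled "attack", and the `_toy` variable names are placeholders chosen so they do not overwrite your real `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic features and an imbalanced label (about 10% "attack").
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series(np.where(rng.random(1000) < 0.1, "attack", "normal"), name="label")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class shares in the full data and in each split should be nearly identical.
balance = pd.DataFrame(
    {
        "full": y_toy.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
)
print(balance.round(3))
```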

3.2 Train Two Models

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Model 1: Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)

# Model 2: Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
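Logistic regression can be slow to converge when features are on very different scales, while random forests are largely insensitive to scaling. If you see a `ConvergenceWarning`, one option is to standardize the features inside a pipeline, as in this sketch. It uses synthetic data from `make_classification` with one feature deliberately inflated, so the numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data.
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=42)
X_toy[:, 0] *= 1000.0  # exaggerate one feature's scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler is fit on the training data only, then applied to the test data
# automatically when the pipeline predicts or scores.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_tr, y_tr)
print(f"Test accuracy: {scaled_lr.score(X_te, y_te):.3f}")
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training, which can happen if you scale the full dataset before splitting.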

3.3 Evaluate with Confusion Matrix

from sklearn.metrics import classification_report, confusion_matrix

for name, model in [("Logistic Regression", lr), ("Random Forest", rf)]:
    y_pred = model.predict(X_test)
    print(f"\n{'='*40}")
    print(f"{name}")
    print(f"{'='*40}")
    print(classification_report(y_test, y_pred))

    # Plot confusion matrix
    fig, ax = plt.subplots(figsize=(6, 5))
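    # Fix the class order and show class names on the axes instead of 0, 1, ...
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=model.classes_, yticklabels=model.classes_, ax=ax)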
    ax.set_title(f"Confusion Matrix — {name}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    fig.savefig(f"../figures/confusion_matrix_{name.lower().replace(' ', '_')}.png",
                dpi=150, bbox_inches="tight")
    plt.show()

Train/test split created
Two models trained (Logistic Regression + Random Forest)
classification_report printed for both models
Confusion matrix heatmaps created and saved

Part 4: Graph Gallery Picks

Visit the Python Graph Gallery and pick 2 plots that look interesting. Recreate them using the Survival IDS dataset (or a subset of it).

Requirements

Choose 2 different chart types (e.g., violin plot, radar chart, hexbin, pair plot, bubble chart)
Adapt the gallery code to use columns from your dataset
Customize each plot: change the color palette, add proper titles/labels, add annotations if relevant
Save both figures to figures/

Picking good chart types

Don't just pick the simplest plots. Try something you haven't used before — the point is to expand your visualization toolkit. Good picks: violin plots, pair plots, radar charts, parallel coordinates, ridgeline plots.
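To illustrate what "adapt and customize" might look like, here is a rough sketch of a violin plot with a custom color, explicit category order, labels, and median annotations. The data is synthetic and the column names `label` and `feature` are placeholders; your own version should start from the gallery code and use real columns.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Synthetic data: one numeric feature with a different distribution per class.
toy = pd.DataFrame(
    {
        "label": np.repeat(["normal", "attack"], 300),
        "feature": np.concatenate(
            [rng.normal(0.0, 1.0, 300), rng.normal(2.0, 0.5, 300)]
        ),
    }
)

order = ["normal", "attack"]
fig, ax = plt.subplots(figsize=(7, 4))
sns.violinplot(data=toy, x="label", y="feature", order=order, color="#66c2a5", ax=ax)

# Annotate each violin with its median value.
medians = toy.groupby("label")["feature"].median()
for position, name in enumerate(order):
    ax.text(position + 0.05, medians[name], f"median = {medians[name]:.2f}", va="center")

ax.set_title("Feature Distribution by Class (toy data)")
ax.set_xlabel("Class")
ax.set_ylabel("Feature value")
```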

Graph Gallery plot 1 created, customized, and saved
Graph Gallery plot 2 created, customized, and saved

Part 5: Publish as a Blog Post

Turn your analysis into a blog post on your Quarto website from Assignment 1.

5.1 Clean Your Notebook

Restart the kernel and run all cells top-to-bottom (in VS Code, click Restart and then Run All in the notebook toolbar)
Remove any scratch/debug cells
Add markdown cells that explain what you're doing and what the results mean — a reader should understand the analysis without reading the code

5.2 Add a YAML Header

Add this to the first cell of your notebook (as a Raw cell) or convert the notebook to .qmd:

---
title: "Exploratory Data Analysis: Survival IDS Dataset"
description: "EDA and baseline ML models on the HCRL Survival IDS dataset."
date: "2026-01-15"  # Use the date you completed the assignment
categories: [eda, python, machine-learning]
---

5.3 Publish on Your Quarto Site

Copy the notebook (or .qmd file) into its own folder under your Quarto site's posts/ directory (for example, posts/eda-post/)
Copy any required figures into the same post folder
Preview locally: quarto preview
Commit and push:

git add posts/eda-post/
git commit -m "Add EDA blog post"
git push

Notebook runs cleanly top-to-bottom
Markdown explanations added between code cells
Blog post published and live on your Quarto site

Final Deliverables

Submit the following:

GitHub repo URL for your EDA project (e.g., github.com/YOURUSERNAME/eda-assignment)
5+ visualizations in figures/ (3 EDA + 2 Graph Gallery picks)
Confusion matrix heatmaps for both models
Blog post live on your Quarto site
requirements.txt in the repo root
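Before submitting, it may help to check the repository contents automatically. The following sketch covers only the items that live in the repo, assuming the directory layout from Part 0.3 and PNG figures. The function name `check_deliverables` is hypothetical, and it cannot verify that the blog post is live.

```python
from pathlib import Path


def check_deliverables(repo_root: Path, min_figures: int = 5) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []

    figures = sorted((repo_root / "figures").glob("*.png"))
    if len(figures) < min_figures:
        problems.append(
            f"Expected at least {min_figures} PNG files in figures/, found {len(figures)}"
        )

    if not (repo_root / "requirements.txt").is_file():
        problems.append("requirements.txt is missing from the repo root")

    if not any((repo_root / "notebooks").glob("*.ipynb")):
        problems.append("No notebook found in notebooks/")

    return problems


# Usage from the repository root:
# for problem in check_deliverables(Path(".")):
#     print("MISSING:", problem)
```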

Troubleshooting

Problem	Fix
`ModuleNotFoundError: No module named 'pandas'`	Make sure your virtual environment is activated and you've installed the packages
Jupyter kernel doesn't see installed packages	Select the correct kernel — click the kernel name in the top-right of VS Code and pick your `.venv`
`quarto render` fails on notebook	Restart kernel, run all cells, fix any errors, then try again
Data file too large for Git	Add `data/` to `.gitignore` — don't commit large files to Git
Heatmap annotations overlap	Reduce the number of features or use `annot=False` for large matrices
`SettingWithCopyWarning`	Use `.copy()` when creating subsets: `X = df.drop(...).copy()`