Assignment 2a: Exploratory Data Analysis¶
| Author | Robert Frenken |
| Estimated time | 6--8 hours |
| Prerequisites | Assignment 1 completed, Python basics, GitHub account |
What you'll build
An exploratory data analysis of a real automotive intrusion detection dataset, published as a blog post on your Quarto website. You'll load and explore the data with polars, create visualizations with altair, train two simple ML models with scikit-learn, and recreate two plots from the interactive charting ecosystem.
Why polars + altair instead of pandas + matplotlib?
Both stacks get the job done, but the lab standardizes on polars + altair for new projects:
- Polars is 10–100× faster than pandas on most operations, has a cleaner API (no
inplace=, noSettingWithCopyWarning), and expression-based syntax that composes well. It's what you'll see in current lab code. - Altair produces interactive, grammar-of-graphics plots out of the box. Hover tooltips, zoom, and linked-brushing come for free. It also renders cleanly in Quarto, Jupyter, and on the web.
- You'll still encounter pandas + matplotlib/seaborn in older code and tutorials. Knowing both is useful, but start here.
Part 0: Project Setup¶
0.1 Create a GitHub repository¶
- Go to github.com/new
- Repository name:
eda-assignment - Set to Public, check "Add a README", add a Python
.gitignore - Clone it:
0.2 Set up a Python environment¶
Why vl-convert-python?
Altair renders to HTML by default. vl-convert-python adds the ability to save charts as PNG/SVG/PDF — useful for the figures you'll commit to the repo.
For more on Python environments, see Python Environment Setup.
0.3 Project layout¶
eda-assignment/
├── data/ # Raw and processed data (gitignored if large)
├── figures/ # Saved plot images
├── notebooks/
│ └── eda.ipynb # Your main analysis notebook
├── .gitignore
├── README.md
└── requirements.txt
Create the directories and save dependencies:
Don't commit large data files
Add data/ to your .gitignore if the dataset exceeds a few MB. Git is not designed for large binary files — use a link in the README to the dataset source instead.
- Repo created and cloned
- Virtual environment activated
- Dependencies installed and pinned to
requirements.txt - Directory structure in place
Part 1: Get & Explore the Data¶
1.1 Download the dataset¶
Download the HCRL Survival IDS dataset from ocslab.hksecurity.net/Datasets/survival-ids.
- Visit the link and download the dataset files
- Place the CSV file(s) in your
data/directory - Open
notebooks/eda.ipynb(create it in VS Code or withjupyter notebook)
1.2 Load and inspect¶
import polars as pl
df = pl.read_csv("../data/survival_ids.csv") # adjust filename as needed
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns}")
df.head()
Polars vs pandas cheatsheet
| Task | pandas | polars |
|---|---|---|
| Load CSV | pd.read_csv(f) | pl.read_csv(f) |
| Shape | df.shape | df.shape |
| First rows | df.head() | df.head() |
| Column types | df.dtypes | df.schema |
| Summary stats | df.describe() | df.describe() |
| Select columns | df[["a", "b"]] | df.select("a", "b") |
| Filter | df[df.x > 0] | df.filter(pl.col("x") > 0) |
| New column | df["y"] = df.x * 2 | df.with_columns((pl.col("x") * 2).alias("y")) |
| Group + aggregate | df.groupby("g").sum() | df.group_by("g").sum() |
| Null count per column | df.isnull().sum() | df.null_count() |
1.3 Summary statistics¶
Run these in separate notebook cells and read the output — don't just run them blindly:
# Schema: column names + dtypes
df.schema
# Descriptive statistics (mean, std, min, max, quartiles)
df.describe()
# Missing values per column
df.null_count()
# Value counts for a categorical column (if any)
# df.group_by("column_name").len().sort("len", descending=True)
- Data loaded successfully with
pl.read_csv - Schema reviewed — you understand the column types
-
df.describe()output reviewed — you can identify reasonable ranges - Missing values checked
Part 2: Visualizations¶
Create at least 3 EDA plots with altair. Save each figure to figures/.
Altair in 30 seconds
Altair charts are built by chaining three things:
alt.Chart(data)— the data source (polars DataFrame works directly).mark_*(...)— the visual element (mark_bar,mark_line,mark_rect,mark_point, etc.).encode(...)— which columns map to which visual channels (x, y, color, size, tooltip)
Then .properties(title=..., width=..., height=...) to style.
2.1 Distribution plot¶
Pick a numeric column and visualize its distribution:
import altair as alt
hist = (
alt.Chart(df)
.mark_bar()
.encode(
alt.X("your_column:Q", bin=alt.Bin(maxbins=50), title="Value"),
alt.Y("count():Q", title="Count"),
tooltip=["count():Q"],
)
.properties(title="Distribution of Your Column", width=600, height=350)
)
hist.save("../figures/distribution.png", scale_factor=2)
hist
Adding a density overlay
Layer a transform_density chart over the histogram for a smoother view:
2.2 Correlation heatmap¶
Altair heatmaps need data in long form (one row per cell). Polars makes that a two-step transform:
# 1) Compute correlation matrix on numeric columns only
numeric = df.select(pl.col(pl.NUMERIC_DTYPES))
corr = numeric.corr()
# 2) Reshape to long form: one row per (row_feature, col_feature, correlation)
corr_long = (
corr.with_columns(pl.Series("row", numeric.columns))
.unpivot(index="row", variable_name="col", value_name="corr")
)
heatmap = (
alt.Chart(corr_long)
.mark_rect()
.encode(
alt.X("col:N", title=None, sort=numeric.columns),
alt.Y("row:N", title=None, sort=numeric.columns),
alt.Color("corr:Q", scale=alt.Scale(scheme="redblue", domain=[-1, 1]), title="r"),
tooltip=["row:N", "col:N", alt.Tooltip("corr:Q", format=".2f")],
)
.properties(title="Feature Correlation Heatmap", width=500, height=500)
)
# Overlay the numeric values on top of each cell
text = (
alt.Chart(corr_long)
.mark_text(baseline="middle", fontSize=10)
.encode(
x="col:N",
y="row:N",
text=alt.Text("corr:Q", format=".2f"),
color=alt.condition("abs(datum.corr) > 0.5", alt.value("white"), alt.value("black")),
)
)
(heatmap + text).save("../figures/correlation_heatmap.png", scale_factor=2)
heatmap + text
Why the long-form reshape?
Altair follows the grammar-of-graphics convention where each row is one observation. A correlation matrix is a 2D grid, so we flatten it: every cell becomes a row with (row_feature, col_feature, correlation). The unpivot (also called melt in pandas) does this in one call.
2.3 Class distribution¶
If your dataset has a label column, visualize its balance:
class_counts = df.group_by("label_column").len().rename({"len": "count"})
bar = (
alt.Chart(class_counts)
.mark_bar()
.encode(
alt.X("label_column:N", title="Class"),
alt.Y("count:Q", title="Count"),
alt.Color("label_column:N", legend=None),
tooltip=["label_column:N", "count:Q"],
)
.properties(title="Class Distribution", width=500, height=300)
)
bar.save("../figures/class_distribution.png", scale_factor=2)
bar
Make your plots readable
Always include a title and axis labels. Altair auto-generates titles from column names — override them for a cleaner read. Hover tooltips give readers a zoom-in without cluttering the chart.
- Distribution plot created and saved
- Correlation heatmap created and saved (with annotations)
- Class distribution (or third plot of your choice) created and saved
Part 3: Toy ML Models¶
Train two simple classifiers and evaluate them. This is a first exposure to the sklearn API — the goal is to learn the workflow, not to achieve state-of-the-art accuracy.
3.1 Prepare the data¶
Polars converts to numpy arrays for sklearn:
from sklearn.model_selection import train_test_split
# Drop the label and any non-numeric columns
X = df.drop("label_column").select(pl.col(pl.NUMERIC_DTYPES)).to_numpy()
y = df["label_column"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
3.2 Train two models¶
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
3.3 Evaluate with confusion matrices¶
Build one altair confusion matrix function you can reuse for both models:
from sklearn.metrics import classification_report, confusion_matrix
def confusion_chart(y_true, y_pred, title: str) -> alt.Chart:
cm = confusion_matrix(y_true, y_pred)
labels = sorted(set(y_true))
cm_long = pl.DataFrame(
{
"actual": [labels[i] for i in range(len(labels)) for _ in labels],
"predicted": [labels[j] for _ in labels for j in range(len(labels))],
"count": cm.flatten().tolist(),
}
)
base = alt.Chart(cm_long).encode(
alt.X("predicted:N", title="Predicted"),
alt.Y("actual:N", title="Actual"),
)
rects = base.mark_rect().encode(
alt.Color("count:Q", scale=alt.Scale(scheme="blues"), title="Count"),
tooltip=["actual:N", "predicted:N", "count:Q"],
)
text = base.mark_text(fontSize=14, fontWeight="bold").encode(
text="count:Q",
color=alt.condition("datum.count > 50", alt.value("white"), alt.value("black")),
)
return (rects + text).properties(title=title, width=300, height=300)
for name, model in [("Logistic Regression", lr), ("Random Forest", rf)]:
y_pred = model.predict(X_test)
print(f"\n{'='*40}\n{name}\n{'='*40}")
print(classification_report(y_test, y_pred))
chart = confusion_chart(y_test, y_pred, f"Confusion Matrix — {name}")
chart.save(
f"../figures/confusion_matrix_{name.lower().replace(' ', '_')}.png",
scale_factor=2,
)
chart.display()
What's the white/black text trick doing?
Dark cells need white text to stay readable; light cells need black. The alt.condition expression checks the cell's count value and picks the right text color. This is a common altair pattern for annotated heatmaps.
- Train/test split created
- Two models trained (Logistic Regression + Random Forest)
-
classification_reportprinted for both models - Confusion matrices created and saved
Part 4: Gallery Picks¶
Visit the Vega-Altair Example Gallery and pick 2 chart types that look interesting. Recreate them using the Survival IDS dataset (or a subset of it).
Requirements¶
- Choose 2 different chart types (e.g., violin plot, strip plot, radial chart, parallel coordinates, linked brushing, faceted chart)
- Adapt the gallery code to use columns from your dataset
- Customize each plot: tweak the color scheme, add a meaningful title/labels, add tooltips or interaction where it makes sense
- Save both figures to
figures/
Picking good chart types
Don't just pick the simplest plots. Try something you haven't used before — the point is to expand your visualization toolkit. Good picks:
- Violin / strip plots for distributions across categories
- Linked brushing across two panels (altair's superpower)
- Faceted charts (
.facet(...)) for comparing groups - Parallel coordinates for multi-feature comparison
Linked brushing starter
brush = alt.selection_interval()
top = (
alt.Chart(df).mark_point()
.encode(x="feature_a:Q", y="feature_b:Q",
color=alt.condition(brush, "label_column:N", alt.value("lightgray")))
.add_params(brush)
)
bottom = (
alt.Chart(df).mark_bar()
.encode(x="label_column:N", y="count()")
.transform_filter(brush)
)
(top & bottom).save("../figures/linked_brushing.png", scale_factor=2)
Drag a box on the scatter — the bar chart below updates in real time.
- Gallery plot 1 created, customized, and saved
- Gallery plot 2 created, customized, and saved
Part 5: Publish¶
Turn your analysis into a blog post on your Quarto website. The full workflow (YAML header, categories, preview/commit/push) lives in the Publishing Guide — follow it end-to-end.
Quick reference for this assignment:
- Suggested categories:
[eda, python, visualization, machine-learning] - Cover image: lead with your correlation heatmap or the best gallery pick
-
What to write in the "What I Learned" section: the single most surprising finding from the data, and one thing that changed in your mental model (polars vs pandas, altair vs matplotlib, etc.)
-
Notebook runs cleanly top-to-bottom after a kernel restart
- Markdown explanations added between code cells
- Blog post live on your Quarto site
Final Deliverables¶
- GitHub repo URL for your EDA project (e.g.,
github.com/YOURUSERNAME/eda-assignment) - 5+ visualizations in
figures/(3 EDA + 2 gallery picks) - Confusion matrix charts for both models
- Blog post live on your Quarto site
-
requirements.txtin the repo root
Troubleshooting¶
| Problem | Fix |
|---|---|
ModuleNotFoundError: No module named 'polars' | Make sure your virtual environment is activated and you've installed the packages |
| Jupyter kernel doesn't see installed packages | Click the kernel name in the top-right of VS Code and pick your .venv |
| Altair chart displays as raw JSON in Jupyter | Run alt.renderers.enable("default") at the top of the notebook, or update to altair ≥ 5 |
chart.save("file.png") fails | Install vl-convert-python (uv pip install vl-convert-python) |
quarto render fails on notebook | Restart kernel, run all cells, fix any errors, then try again |
| Data file too large for Git | Add data/ to .gitignore — don't commit large files to Git |
| Correlation heatmap has too many features | Filter to the top-N most variable columns first: df.select([c for c in numeric.columns if df[c].std() > threshold]) |
Polars error on .corr() with non-numeric columns | Select numeric columns first: df.select(pl.col(pl.NUMERIC_DTYPES)) |
Sources¶
- Polars user guide — Polars team
- Altair example gallery — Vega-Altair project
- HCRL Survival IDS dataset — Hacking and Countermeasure Research Lab