Employee Health Risk Tracker · Resume Guide

ML-Powered Return-to-Office Risk Prediction · Scikit-learn · FastAPI · Streamlit · 2022
The Project Story
Why it exists · What problem it solves · 2022 post-pandemic context
💡 The One-Line Pitch
"Employee Health Risk Tracker is an ML system that predicts which employees are unlikely to return to the office post-pandemic — scoring each person 0–100% risk using 10 HR features, with a Streamlit dashboard for HR teams and a FastAPI endpoint for downstream integration."
โš™๏ธ End-to-End ML Pipeline
๐Ÿ“‹ HR Data
10 features
โ†’
๐Ÿงน Preprocessing
StandardScaler
โ†’
VS
๐Ÿ“Š Logistic Reg
Baseline model
+
๐ŸŒฒ Random Forest
100 estimators
๐Ÿ“ CV ROC-AUC
5-fold comparison
โ†’
โœ… Best Model
joblib saved
โ†’
โšก FastAPI
/predict endpoint
โ†’
๐Ÿ“Š Streamlit UI
HR Dashboard
๐ŸŒ The 2022 Context โ€” Post-Pandemic WFO Problem
In early 2022, companies globally were announcing return-to-office mandates. HR teams faced a new challenge: which employees will actually come back, and who needs intervention? Some employees had long commutes, young children, health anxieties, or had become so productive working from home that returning felt pointless. Blanket mandates ignored these differences and caused attrition. Companies needed data-driven employee risk segmentation โ€” not guesswork.
💡 The Solution — ML Risk Scoring
Predict each employee's WFO risk score (0–100%) using 10 HR factors: commute distance, vaccination status, children under 5, pre-pandemic office habits, home internet quality, team size, manager behaviour, anxiety score, WFH productivity rating, and age. HR teams get a ranked list — High / Medium / Low risk — with specific recommendations per employee. Instead of one-size-fits-all mandates, HR can offer targeted support: transport allowances for long commuters, childcare assistance for parents, anxiety support programs.
โš™๏ธ Why Two Models? Why Not Just Pick One?
Logistic Regression is the baseline โ€” fast, interpretable, works well when feature relationships are roughly linear. HR teams can understand "commute distance has this much weight."

Random Forest captures non-linear interactions โ€” for example, the combination of long commute + children + high anxiety is more than the sum of parts. It also gives feature importances, showing HR which factors drive risk most.

I train both, compare using 5-fold cross-validated ROC-AUC (not just test accuracy โ€” avoids lucky train/test splits), and automatically save the winner. The model comparison dashboard makes this transparent and explainable to stakeholders.
๐Ÿ“Š The Dashboard โ€” For Non-Technical HR Teams
The Streamlit dashboard has three views: Dataset Overview (risk distribution pie, histogram, scatter plots โ€” gives HR an at-a-glance picture of the workforce), Predict Employee (enter any employee's details, get a real-time risk score with a gauge chart and recommendation), and Model Comparison (side-by-side metrics, confusion matrices, feature importance chart โ€” for analytics teams who want to understand the model). Non-technical HR managers see the prediction and recommendation; data-savvy analysts see under the hood.
๐Ÿญ Production Considerations
The ML model is saved with joblib and loaded lazily on first prediction โ€” no startup delay for the API. StandardScaler is saved separately alongside the model โ€” this is critical: you must use the same scaler fitted on training data to transform inference inputs, or predictions are meaningless. FastAPI endpoints are Pydantic-validated with range constraints per feature. Batch prediction endpoint handles multiple employees in one call, returning a summary of High/Medium/Low counts โ€” useful for HR reporting.
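The lazy-loading pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the artifact filenames (best_model.pkl, scaler.pkl) and the tiny stand-in training step are assumptions.

```python
# Minimal sketch of lazy artifact loading; filenames are illustrative.
from functools import lru_cache

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in for artifacts that the training script would have produced.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
scaler = StandardScaler().fit(X)
joblib.dump(LogisticRegression().fit(scaler.transform(X), y), "best_model.pkl")
joblib.dump(scaler, "scaler.pkl")

@lru_cache(maxsize=1)
def get_artifacts():
    # Runs once, on the first prediction call, not at API startup.
    return joblib.load("best_model.pkl"), joblib.load("scaler.pkl")

def predict_risk(features):
    model, fitted_scaler = get_artifacts()        # cached after first call
    scaled = fitted_scaler.transform([features])  # same scaler as training
    return float(model.predict_proba(scaled)[0, 1]) * 100  # 0-100% score
```

The key point is that model and scaler are loaded together: the scaler holds the training-set statistics the model's inputs depend on.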
🎤 2-Minute Interview Pitch
"Employee Health Risk Tracker is an ML project I built in the 2022 post-pandemic context — companies were mandating return-to-office but had no visibility into which employees were actually likely to comply versus resist or quit. I framed it as a binary classification problem: predict whether each employee is high or low WFO risk based on 10 HR factors like commute distance, childcare responsibilities, vaccination status, and WFH productivity score.

I trained two models — Logistic Regression as the interpretable baseline and Random Forest to capture non-linear interactions. I selected the best model using 5-fold cross-validated ROC-AUC rather than simple test accuracy, to avoid overfitting to a lucky split. Random Forest typically wins because it captures compound effects — the combination of long commute, young children, and high anxiety is worse than any single factor alone.

I exposed predictions through a FastAPI service and built a Streamlit dashboard for HR teams — three views: workforce overview charts, individual employee risk prediction with a gauge chart and recommendation, and a model comparison panel showing feature importances. The whole stack runs with Docker Compose."
Resume Description
Copy-paste ready bullets · Short version · LinkedIn summary
📄 Full Version — 5 Bullets
Employee Health Risk Tracker — ML-Powered Return-to-Office Risk Prediction
Personal Project  ·  2022
Python · Scikit-learn · Pandas · FastAPI · Streamlit · Plotly · joblib · Docker
  • Built an end-to-end ML pipeline to predict employee return-to-office risk using 10 HR features (commute distance, vaccination status, childcare responsibilities, WFH productivity, anxiety score) — enabling HR teams to segment employees into High / Medium / Low risk categories with targeted intervention recommendations.
  • Trained and compared Logistic Regression and Random Forest classifiers; selected best model using 5-fold cross-validated ROC-AUC to prevent test-set overfitting — Random Forest achieved ROC-AUC of ~0.91 vs Logistic Regression's ~0.87.
  • Implemented StandardScaler preprocessing saved alongside the model with joblib, ensuring training-inference feature scaling consistency — a common production pitfall in ML deployment.
  • Exposed predictions via FastAPI REST service with Pydantic-validated endpoints supporting single and batch inference; batch endpoint returns summary counts (High/Medium/Low) for HR reporting workflows.
  • Built a 3-panel Streamlit dashboard for HR stakeholders: workforce risk distribution charts (Plotly), real-time risk gauge for individual employees, and model comparison panel with confusion matrices and Random Forest feature importance rankings.
📄 Short Version — 2 Bullets
Employee Health Risk Tracker — ML Return-to-Office Risk Prediction
Personal Project  ·  2022
Python · Scikit-learn · FastAPI · Streamlit · Plotly · Docker
  • Built an ML pipeline comparing Logistic Regression and Random Forest to predict employee WFO risk from 10 HR features; selected best model via 5-fold CV ROC-AUC and deployed via FastAPI with Pydantic validation and batch inference support.
  • Built a Streamlit dashboard with Plotly charts for HR teams — workforce risk distribution overview, real-time individual risk gauge with recommendations, and model comparison panel with feature importance rankings.
📊 Model Comparison — What You Cite in Interviews
Metric             | Logistic Regression | Random Forest        | Winner
Accuracy           | ~85%                | ~89%                 | RF ✓
ROC-AUC            | ~0.87               | ~0.91                | RF ✓
CV ROC-AUC         | ~0.86               | ~0.90                | RF ✓
Interpretability   | High — coefficients | Medium — importances | LR ✓
Training speed     | Very fast           | Moderate             | LR ✓
Non-linear capture | No                  | Yes                  | RF ✓
Selected as best?  | No                  | Yes ✓                | —
How I Built It โ€” Step by Step
Exact decisions · Why each choice was made · What you learned
01
Feature Engineering — What Predicts WFO Risk?
Most important step · Defines model quality
Before writing any code, I defined 10 features based on real 2022 HR research on post-pandemic return-to-office resistance:

Commute distance — the single biggest factor. 50km+ commutes are a strong predictor.
Children under 5 — childcare responsibility significantly increases WFH preference.
Vaccination status — unvaccinated employees had higher office anxiety in 2022.
Prior WFO days/week — pre-pandemic office habit. 5 days → likely to return.
Home internet quality — good internet = comfortable WFH = less urgency to return.
Anxiety score — self-reported anxiety about returning.
WFH productivity — high productivity at home = less incentive to return.
Manager WFO — if the manager returns, employees follow. Social pressure.
Team size — larger teams feel more collaborative pressure to return.
Age — younger employees are slightly more comfortable returning.
02
Synthetic Dataset Generation
No real HR data available · Realistic simulation
Real HR data is confidential — I couldn't use it. So I generated a 1,000-employee synthetic dataset using NumPy with realistic distributions: commute distances follow an exponential distribution (most people live close, a few live far), anxiety scores are uniform 1–10, vaccination rate ~82% (matching 2022 India figures).

The risk score formula is a weighted sum of features with added Gaussian noise — so the dataset is realistic, not perfectly separable. This means both models have to actually learn, not just memorise.

A binary label threshold at risk_score > 0.45 gives roughly 45% high-risk, 55% low-risk — a balanced dataset, avoiding class imbalance issues.
python train.py  # generates data → trains both models → saves best
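The generation recipe above can be sketched like this. The specific weights, the noise scale, and the reduced feature set are illustrative stand-ins, not the project's actual formula:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

commute_km = rng.exponential(scale=15.0, size=n)   # most live close, a few far
anxiety = rng.integers(1, 11, size=n)              # uniform 1-10
vaccinated = (rng.random(n) < 0.82).astype(int)    # ~82% vaccination rate

# Weighted sum + Gaussian noise: realistic, not perfectly separable.
risk_score = (0.40 * np.clip(commute_km / 60.0, 0, 1)
              + 0.35 * (anxiety / 10.0)
              + 0.25 * (1 - vaccinated)
              + rng.normal(0.0, 0.08, size=n))

high_risk = (risk_score > 0.45).astype(int)        # binary label threshold
```

Because the label depends on noisy feature combinations rather than a clean rule, a model that merely memorises training rows will not generalise.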
03
Preprocessing — StandardScaler
Critical for Logistic Regression · Good practice for RF
Features are on very different scales — age is 18–70, commute is 0–200km, anxiety is 1–10. Without scaling, Logistic Regression would give huge weight to commute_distance (large numbers) and tiny weight to binary features. StandardScaler normalises each feature to mean=0, std=1.

Critical production rule: fit the scaler only on training data (scaler.fit_transform(X_train)), then use scaler.transform(X_test) and scaler.transform(inference_input). Never refit on test or inference data — that's data leakage.

I save the scaler with joblib alongside the model — both must travel together to production.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit only on train
X_test_scaled = scaler.transform(X_test) # transform only
joblib.dump(scaler, "models/scaler.pkl") # save for inference
04
Training Both Models + Cross-Validation
Fair comparison · CV prevents lucky splits
Logistic Regression: C=1.0 (default regularisation), max_iter=1000 (ensures convergence). Fast, interpretable, linear decision boundary.

Random Forest: n_estimators=100 (100 trees — diminishing returns beyond this), max_depth=8 (prevents overfitting), min_samples_split=10 (each split needs at least 10 samples). n_jobs=-1 uses all CPU cores for parallel tree training.

Model selection uses 5-fold cross-validated ROC-AUC, not just test-set accuracy. Why? A single 80/20 split can be lucky — CV averages performance across 5 different splits, giving a more reliable estimate. ROC-AUC over accuracy because it measures ranking quality regardless of threshold.
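The selection step might look roughly like this. Synthetic data from make_classification stands in for the HR dataset; the hyperparameters are the ones listed above. Wrapping the scaler in a Pipeline keeps its fit inside each CV fold, consistent with the leakage rule from the preprocessing step:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the 1,000-employee HR dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=8, min_samples_split=10,
        n_jobs=-1, random_state=42),
}

# 5-fold cross-validated ROC-AUC per model; the winner gets saved.
cv_auc = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
best_name = max(cv_auc, key=cv_auc.get)
```

Which model "wins" depends on the data; on real HR features with interactions, the forest would be expected to edge out the linear baseline.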
05
FastAPI Inference Service
Serves predictions · Batch endpoint for HR reporting
Built two prediction endpoints: POST /predict for a single employee with full recommendation text, and POST /predict/batch for multiple employees, returning a High/Medium/Low summary — useful for bulk HR analysis.

The model is loaded lazily — only on the first prediction call, not at startup. This avoids slow startup times and allows the API to start even before training is complete.

Pydantic validation enforces feature ranges at the API boundary — age: int = Field(..., ge=18, le=70) — bad inputs get a 422 error before reaching the model, not a confusing prediction error.
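A minimal sketch of that validation layer, with the caveat that the field names and ranges here are illustrative, not the project's exact schema:

```python
from pydantic import BaseModel, Field, ValidationError

class EmployeeFeatures(BaseModel):
    # Ranges enforced at the API boundary, before the model sees anything.
    age: int = Field(..., ge=18, le=70)
    commute_km: float = Field(..., ge=0, le=200)
    anxiety_score: int = Field(..., ge=1, le=10)

valid = EmployeeFeatures(age=34, commute_km=52.5, anxiety_score=7)

try:
    EmployeeFeatures(age=12, commute_km=52.5, anxiety_score=7)  # under 18
    rejected = False
except ValidationError:
    rejected = True  # FastAPI turns this into an HTTP 422 response
```

When such a model is the request body type of a FastAPI endpoint, the framework performs this validation and returns the 422 automatically.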
06
Streamlit Dashboard — 3 Views
HR-friendly · Non-technical stakeholders
Dataset Overview: KPI row (total employees, risk counts), risk distribution pie chart, score histogram, commute vs risk scatter, children vs risk bar chart. HR managers see the workforce at a glance.

Predict Employee: Sliders and dropdowns for all 10 features. Calls the FastAPI /predict endpoint and displays a Plotly gauge chart (0–100% with colour zones) and recommendation text. Real-time — results appear in under a second.

Model Comparison: Side-by-side accuracy/ROC-AUC bar chart, confusion matrices for both models, and a Random Forest feature importance horizontal bar chart — shows HR which factors drive risk most. Commute distance and anxiety score typically top the list.
Key Concepts to Know
Everything to deeply understand to defend this in any ML interview
🎯
Binary Classification
Predicting one of two outcomes — High Risk (1) or Low Risk (0). The model outputs a probability (0–1), not just a class. We threshold at 0.5 for the label, but use the raw probability as the risk score shown to HR.
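In scikit-learn terms, the probability-versus-label distinction looks like this (toy data for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X[:5])[:, 1]   # P(high risk), one per employee
labels = (proba >= 0.5).astype(int)        # thresholded class label
risk_pct = np.round(proba * 100, 1)        # the 0-100% score shown to HR
```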
📏
ROC-AUC — Why Not Accuracy?
Accuracy is misleading on imbalanced data. ROC-AUC measures the model's ability to rank high-risk employees above low-risk ones, regardless of threshold. AUC=1.0 is perfect, AUC=0.5 is random. It's the standard metric for binary risk scoring.
🔄
5-Fold Cross-Validation
Split the data into 5 parts. Train on 4, test on 1. Repeat 5 times. Average the scores. Much more reliable than a single train/test split — a lucky or unlucky split can misrepresent true model performance.
📊
Logistic Regression
Linear model that outputs a probability via the sigmoid function. Coefficients are directly interpretable — "commute_distance has weight 0.8". Fast, explainable, but only captures linear relationships between features and risk.
🌲
Random Forest
Ensemble of 100 decision trees, each trained on a random subset of data and features. Final prediction = majority vote. Captures non-linear interactions. Feature importances show which features reduce impurity most across all trees.
⚖️
StandardScaler — Training-Inference Parity
Fit the scaler on training data only. Use it to transform test and inference data. Never refit on inference data — that would be data leakage. The scaler must be saved alongside the model and loaded together for every prediction.
💾
joblib — Model Serialisation
Saves trained sklearn models to disk as .pkl files. Preserves all learned parameters โ€” weights, tree structures, feature importances. Load with joblib.load() for instant predictions without retraining.
🔢
Confusion Matrix
2×2 table: True Positives, True Negatives, False Positives, False Negatives. FP cost = flagging low-risk employees as high-risk (wasted HR intervention). FN cost = missing high-risk employees (they leave). Balance based on business priority.
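Extracting the four cells with scikit-learn (toy labels for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual high(1)/low(0) risk
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```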
📚 Skills Demonstrated
Binary Classification · Logistic Regression · Random Forest · Cross-Validation · ROC-AUC · StandardScaler · Feature Engineering · Model Comparison · joblib Serialisation · FastAPI · Pydantic · Streamlit · Plotly · Batch Inference · Docker · HR Analytics Domain
Interview Questions & Answers
Every ML question they can ask · Click to reveal confident answers
โ“
Walk me through the WFO Risk project end to end.
EASY
โ–ผ
"Employee Health Risk Tracker is an ML system to predict employee return-to-office risk in the 2022 post-pandemic context. The problem is binary classification โ€” High or Low risk โ€” based on 10 HR features: commute distance, children under 5, vaccination status, pre-pandemic office habit, home internet quality, team size, manager behaviour, anxiety score, and WFH productivity score.

I trained two models โ€” Logistic Regression as the interpretable baseline and Random Forest to capture non-linear interactions. I selected the best model using 5-fold cross-validated ROC-AUC. The trained model and scaler are saved with joblib and exposed via a FastAPI service. There's also a Streamlit dashboard with three panels: workforce overview charts, individual risk prediction with a gauge chart, and model comparison with feature importances."
โ“
Why did you use ROC-AUC instead of accuracy to compare models?
EASY
โ–ผ
"Two reasons.

First โ€” accuracy is misleading on imbalanced datasets. If 60% of employees are low-risk, a model that always predicts low-risk gets 60% accuracy while being completely useless.

Second โ€” this is a risk scoring problem, not just a classification problem. HR doesn't just want a binary label โ€” they want to rank employees by risk level. ROC-AUC directly measures the model's ability to rank high-risk employees above low-risk ones. An AUC of 0.91 means the model correctly ranks a randomly chosen high-risk employee above a randomly chosen low-risk employee 91% of the time.

I also used cross-validated ROC-AUC rather than just test-set AUC to get a more reliable estimate โ€” averaging across 5 different train/test splits removes the influence of a lucky or unlucky data split."
โ“
Why Random Forest over a single decision tree?
EASY
โ–ผ
"A single decision tree overfits badly โ€” it memorises training data and generalises poorly. Random Forest solves this through two mechanisms.

Bagging โ€” each tree is trained on a random bootstrap sample of the training data (sampling with replacement). Different trees see different data, so they make different errors.

Feature randomness โ€” at each split, only a random subset of features is considered (typically โˆšn features). This prevents all trees from looking identical and ensures diversity.

The final prediction is the majority vote across all 100 trees. Averaging many decorrelated trees dramatically reduces variance without increasing bias โ€” that's why Random Forest almost always outperforms a single tree."
โ“
What is data leakage and how did you prevent it?
MEDIUM
โ–ผ
"Data leakage is when information from the test set (or future data) influences model training, making your evaluation metrics artificially optimistic. The model appears to perform well in testing but fails in production.

The most common leakage in this project would be with StandardScaler. If I fit the scaler on the entire dataset (train + test combined), the scaler 'knows' the mean and std of the test set โ€” that's leakage. The test set has technically influenced preprocessing.

I prevented it by calling scaler.fit_transform(X_train) โ€” fitting only on training data โ€” then scaler.transform(X_test) โ€” applying the training distribution's parameters to test data without refitting. The scaler is saved to disk and used the same way during inference โ€” ensuring production inputs are transformed with training-set statistics, not their own."
โ“
What are the hyperparameters you chose for Random Forest and why?
MEDIUM
โ–ผ
"n_estimators=100 โ€” 100 trees. Beyond ~100, performance improvement is minimal while training time keeps increasing. Diminishing returns kick in around 100 for a 1,000-sample dataset.

max_depth=8 โ€” limits tree depth to prevent overfitting. An unconstrained tree will grow until every training sample is a leaf, memorising noise. Depth 8 allows complex patterns while staying generalizable.

min_samples_split=10 โ€” a node must have at least 10 samples before it's allowed to split further. Prevents the tree from creating splits based on tiny, unrepresentative subgroups โ€” another overfitting guard.

random_state=42 โ€” for reproducibility. Same seed = same tree structure every run.

n_jobs=-1 โ€” uses all available CPU cores for parallel tree training. With 100 trees, parallelism gives a significant speedup on multi-core machines."
โ“
What does the confusion matrix tell you? What's the cost of each error type here?
MEDIUM
โ–ผ
"The confusion matrix has four cells: True Positives (correctly identified high-risk), True Negatives (correctly identified low-risk), False Positives (low-risk flagged as high-risk), False Negatives (high-risk missed, predicted as low-risk).

False Positives โ€” low-risk employee flagged as high-risk. Cost: HR spends time and resources intervening with someone who would have returned anyway. Wasted effort, possibly annoys the employee. Relatively low cost.

False Negatives โ€” high-risk employee not flagged. Cost: employee quietly resists the mandate, disengages, or leaves. HR gets no opportunity to intervene. Much higher business cost โ€” losing an employee costs 50โ€“200% of their annual salary in recruitment and onboarding.

This means for this use case, recall (sensitivity) matters more than precision โ€” we'd rather over-flag and waste some HR time than miss employees who are about to leave. If I were optimising a threshold, I'd lower it below 0.5 to capture more true positives."
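The recall effect of lowering the threshold can be seen directly (toy data; the 0.35 cutoff is an illustrative choice, not the project's tuned value):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 4))
y = ((X[:, 0] + 0.5 * rng.normal(size=400)) > 0).astype(int)

proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

recall_default = recall_score(y, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y, (proba >= 0.35).astype(int))  # flags more people
# Lowering the threshold can only add positive predictions,
# so recall never decreases; precision usually pays the price.
```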
โ“
How would you improve this model if you had more time?
ADVANCED
โ–ผ
"Three concrete improvements.

1. Hyperparameter tuning with GridSearchCV or RandomizedSearchCV โ€” I used sensible defaults for Random Forest, but the optimal max_depth, n_estimators, and min_samples_split depend on the data. RandomizedSearchCV would try many combinations and find better parameters, likely improving ROC-AUC by 2โ€“4%.

2. SHAP values for explainability โ€” feature importances tell you global feature relevance across all predictions. SHAP (SHapley Additive exPlanations) gives per-prediction explanations โ€” 'For Employee EMP0042, commute distance contributed +0.15 to risk score and manager_wfo contributed -0.08.' This is critical for HR โ€” they need to justify why an employee is flagged.

3. Threshold optimisation โ€” as I discussed, false negatives are more costly than false positives here. I'd use the ROC curve to find the threshold that maximises recall while keeping precision above a minimum acceptable level (e.g., precision โ‰ฅ 0.6). The default 0.5 threshold is not optimal for most business problems."
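A sketch of improvement #1, with synthetic stand-in data; the parameter distributions and n_iter here are illustrative choices:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(3, 15),
        "min_samples_split": randint(2, 20),
    },
    n_iter=5,            # random samples, far cheaper than an exhaustive grid
    cv=5,
    scoring="roc_auc",   # the same metric used for model selection
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
best_params, best_auc = search.best_params_, search.best_score_
```

Whether tuning actually lifts ROC-AUC, and by how much, depends on the dataset; the cross-validated best_score_ is the honest number to report.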