E-Commerce Fraud Detection
Real-time fraud detection with XGBoost and SHAP explainability
Overview
This project implements a production-ready machine learning system for detecting fraudulent e-commerce transactions in real-time. The system transforms raw transaction data into 30 engineered features and uses an optimized XGBoost classifier to identify fraud while minimizing false positives that could affect legitimate customers.
Fraud Detection Capabilities
The system’s 30 engineered features enable detection of diverse fraudulent activity patterns:
| Feature Category | Detection Signals |
|---|---|
| Temporal Analysis | Unusual transaction timing, timezone mismatches between user location and purchase time, late-hour activity |
| Amount Patterns | Deviations from typical purchase amounts, micro-transactions indicative of card testing, high-value anomalies |
| User Behavior | Account age relative to transaction patterns, purchase velocity, session characteristics |
| Geographic Risk | Distance between user origin and shipping destination, cross-border transactions, location inconsistencies |
| Security Indicators | Composite risk scores combining multiple signals, device and browser fingerprinting patterns |
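For illustration, a minimal sketch of how a few of these signals might be computed is shown below. The column names (`purchase_time`, `user_tz_offset_hours`, `user_lat`, and so on) and the exact formulas are assumptions for the sketch, not the project's actual feature definitions.

```python
import numpy as np
import pandas as pd

def example_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a few illustrative fraud signals from hypothetical raw columns."""
    out = pd.DataFrame(index=df.index)

    # Temporal: flag late-hour activity in the user's local timezone.
    local_hour = (df["purchase_time"].dt.hour + df["user_tz_offset_hours"]) % 24
    out["is_late_night"] = local_hour.between(0, 5).astype(int)

    # Amount: deviation from the user's typical purchase amount.
    out["amount_zscore"] = (
        (df["amount"] - df["user_mean_amount"]) / df["user_std_amount"].replace(0, np.nan)
    ).fillna(0)

    # Amount: micro-transactions often associated with card testing.
    out["is_micro_txn"] = (df["amount"] < 1.0).astype(int)

    # Geographic: haversine distance between user origin and shipping destination.
    lat1, lon1, lat2, lon2 = map(
        np.radians, (df["user_lat"], df["user_lon"], df["ship_lat"], df["ship_lon"])
    )
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    out["ship_distance_km"] = 6371 * 2 * np.arcsin(np.sqrt(a))

    return out
```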
Technical Architecture
The pipeline processes transactions through five integrated stages:
Feature Engineering
Custom sklearn-compatible transformer generates 30 features from 15 raw inputs: timezone-aware temporal features, amount deviations, user behavior metrics, geographic risk indicators, and security composite scores.
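A minimal sketch of the transformer shell, assuming placeholder column names and showing only three of the engineered features:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FraudFeatureTransformer(BaseEstimator, TransformerMixin):
    """Sketch of a sklearn-compatible transformer: raw transaction columns in,
    engineered feature matrix out (the real transformer emits 30 features)."""

    def fit(self, X: pd.DataFrame, y=None):
        # Stateless sketch; a real transformer could learn per-user statistics here.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        features = pd.DataFrame(index=X.index)
        features["amount_log"] = np.log1p(X["amount"])          # amount pattern
        features["account_age_days"] = X["account_age_days"]    # user behavior
        features["is_cross_border"] = (
            X["user_country"] != X["ship_country"]
        ).astype(int)                                           # geographic risk
        # ...remaining engineered features, up to 30 in total
        return features
```

Because it implements `fit`/`transform`, the transformer can be chained with the classifier in a standard `sklearn.pipeline.Pipeline`.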
Model Inference
XGBoost classifier with tuned hyperparameters generates fraud probability scores with P95 latency under 40ms.
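A minimal inference sketch, assuming the tuned model has been persisted; the artifact path `models/fraud_xgb.json` is a hypothetical placeholder:

```python
import pandas as pd
import xgboost as xgb

# Hypothetical artifact path; in practice, load whatever the training job persisted.
model = xgb.XGBClassifier()
model.load_model("models/fraud_xgb.json")

def fraud_probability(features: pd.DataFrame) -> pd.Series:
    """Return P(fraud) for each row of the engineered feature matrix."""
    return pd.Series(model.predict_proba(features)[:, 1], index=features.index)
```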
Threshold Strategies
Five configurable strategies enable precision-recall trade-offs for different business requirements.
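As an illustration of how such trade-offs can be tuned, the sketch below shows two plausible strategies (a precision floor and F1 maximization) computed on a validation set; these are assumptions for the sketch, not the project's five named strategies.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_score, min_precision: float = 0.90) -> float:
    """Lowest score threshold whose validation precision meets a floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    ok = precision[:-1] >= min_precision   # precision/recall have one extra trailing point
    return float(thresholds[ok].min()) if ok.any() else 1.0

def threshold_for_f1(y_true, y_score) -> float:
    """Score threshold that maximizes F1 on a validation set."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return float(thresholds[int(np.argmax(f1))])
```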
SHAP Explainability
TreeSHAP explanations show top risk contributors for each prediction, enabling transparent fraud decisions.
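A minimal sketch of how TreeSHAP can surface per-prediction risk contributors; the helper name and arguments are illustrative:

```python
import pandas as pd
import shap
import xgboost as xgb

def top_risk_contributors(model: xgb.XGBClassifier, features: pd.DataFrame,
                          row_idx: int, k: int = 3) -> list[tuple[str, float]]:
    """Return the k features pushing one prediction most strongly toward fraud."""
    explainer = shap.TreeExplainer(model)
    shap_row = explainer.shap_values(features.iloc[[row_idx]])[0]
    contributions = zip(features.columns, shap_row)
    return sorted(contributions, key=lambda t: t[1], reverse=True)[:k]
```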
Deployment
FastAPI service containerized with Docker, deployed on Google Cloud Run with auto-scaling.
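A minimal sketch of what the prediction endpoint might look like; the request fields, the `score` helper, and the threshold value are placeholders, not the service's actual schema:

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(title="Fraud Detection API")
THRESHOLD = 0.5  # placeholder; the real service derives this from a threshold strategy

class Transaction(BaseModel):
    # Illustrative subset of the raw inputs; field names are assumptions.
    amount: float = Field(gt=0)
    user_country: str
    ship_country: str
    account_age_days: int = Field(ge=0)

class Prediction(BaseModel):
    fraud_probability: float
    is_fraud: bool

def score(txn: Transaction) -> float:
    """Placeholder for the real path: feature engineering, then XGBoost predict_proba."""
    return 0.0

@app.post("/predict", response_model=Prediction)
def predict(txn: Transaction) -> Prediction:
    proba = score(txn)
    return Prediction(fraud_probability=proba, is_fraud=proba >= THRESHOLD)
```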
Technology Stack
- ML Pipeline: Python 3.12, XGBoost, scikit-learn, pandas, numpy
- Explainability: SHAP (TreeSHAP for feature importance)
- API Service: FastAPI, Uvicorn, Pydantic validation
- Deployment: Docker, Google Cloud Run
- Testing: pytest (425 tests), Locust (load testing)
Model Performance
Model: XGBoost (n_estimators=100, max_depth=4), trained on 299K transactions with a 44:1 class imbalance.
| Metric | Value |
|---|---|
| PR-AUC | 0.866 |
| ROC-AUC | 0.976 |
| F1 Score | 0.778 |
| Precision | 73.2% |
| Recall | 82.9% |
| P95 Latency | 36 ms (Cloud Run) |
| Throughput | 25 requests/second |
All target metrics exceeded.
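For reference, a training configuration consistent with the reported hyperparameters; using `scale_pos_weight` to offset the 44:1 imbalance and `aucpr` as the evaluation metric are assumptions, not documented project choices.

```python
import xgboost as xgb

# Hyperparameters match the reported model; scale_pos_weight and eval_metric are assumptions.
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    scale_pos_weight=44,   # roughly negatives / positives for a 44:1 imbalance
    eval_metric="aucpr",   # PR-AUC is the headline metric under heavy imbalance
    n_jobs=-1,
)
# model.fit(X_train, y_train)
```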