Built because
3am pages are avoidable

Every engineering team has the same story. A deploy goes out Friday at 5pm. The diff was "small." The tests passed. By Saturday morning, production is down. The signals were all there — high failure rate, end-of-day timing, rushed commit. Nobody was reading them. Now a machine does.

85.1%
Precision of the base model
0.934
AUC-ROC score (1.0 is perfect)
<200ms
Score latency per request
$0
Monthly infrastructure cost

Deploy risk is
invisible until it isn't

Engineering teams make deploy decisions based on gut feeling. "It's a small change." "Tests passed." "We need this out today." But every one of those decisions contains measurable signals that predict failure.

Friday afternoon. 800+ line diff. 3 of the last 10 builds failed. The ML model would have scored that deploy 94/100. But without SafeShip, nobody was looking.

BEFORE SAFESHIP
Deploy goes out Friday 5pm
Tests passed, feels safe
Production down Saturday 3am
4 hours of incident response
Root cause: Friday + large diff + failures
WITH SAFESHIP
Deploy scores 94/100 BLOCKED
Top reasons shown instantly
Team delays to Monday 10am
Deploy scores 18/100 SAFE
Ships cleanly, no incident

How the model works

🌲
Random Forest

100 decision trees, depth 8, balanced class weights. Works well on 80–5000 samples — exactly the range real teams produce.
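
As a rough illustration, the configuration described above could be written with scikit-learn as follows. This is a sketch, not SafeShip's actual training code: the synthetic dataset, the fixed seed, and the 0-100 scaling of the probability are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a tenant's build history: 500 builds, 10 features,
# roughly 10% labelled risky (class 1). Real features would come from CI data.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

model = RandomForestClassifier(
    n_estimators=100,         # 100 decision trees
    max_depth=8,              # depth 8
    class_weight="balanced",  # balanced class weights
    random_state=0,           # assumption: fixed seed for reproducibility
)
model.fit(X, y)

# Probability of the risky class scaled to 0-100 (the scaling is an assumption).
risk_score = int(round(float(model.predict_proba(X[:1])[0, 1]) * 100))
```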

⚖️
SMOTE Balancing

Most builds succeed, so a model trained on raw history could predict "safe" for everything and still look accurate. SMOTE synthesizes extra examples of the rare risky builds, so the model has to learn their patterns.
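
The core idea behind SMOTE can be sketched in a few lines of plain Python: pick a minority-class row, pick one of its nearest minority neighbours, and create a new row somewhere on the line between them. The function name and the sample rows below are hypothetical, and a production pipeline would more likely rely on a library such as imbalanced-learn than on this simplified version.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical failed builds as [diff_size, recent_failure_rate]
failed_builds = [[820.0, 0.3], [640.0, 0.4], [910.0, 0.2], [700.0, 0.5]]
extra = smote_like_oversample(failed_builds, n_new=6)
```

Every synthetic row lies between two real risky builds, so the minority class grows without simply duplicating existing rows.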

🎯
10 Features

Diff size, failure rate, time of day, day of week, test pass rate, hotfix flag, deployer experience, days since deploy, and more.

🔄
Nightly Retraining

EC2 cron at 2am UTC. Trains per-tenant models when 80+ labelled builds exist. 5-check validation gate before any model swap.

📊
Drift Detection

Weekly KS-test across all 10 features. If your team's deploy patterns shift significantly, the model automatically retrains.
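
A two-sample KS test compares the distribution of one feature in the training window against its recent values. The sketch below computes the KS statistic by hand and applies the standard large-sample critical value for a 5% significance level. The function names, the sample data, and the choice of significance level are assumptions; the page does not state which threshold SafeShip uses.

```python
import math

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    ref, cur = sorted(reference), sorted(current)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance both samples past x, then compare the ECDF values at x.
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d

def has_drifted(reference, current, c_alpha=1.358):
    """Large-sample KS test; c_alpha = 1.358 corresponds to alpha = 0.05."""
    n, m = len(reference), len(current)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical

# Hypothetical diff sizes: training window vs. a week of much larger diffs
training_window = [float(v) for v in range(100)]
recent_week = [v + 200.0 for v in training_window]
drift = has_drifted(training_window, recent_week)
```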

🔒
Validation Gate

5 checks before any model swap: dataset size, precision ≥0.75, AUC ≥0.70, no regression vs old model, min risky ratio.
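
One way to express the five checks is a single function that returns a pass/fail verdict plus the names of any failed checks. The precision and AUC thresholds come from the text above. The 80-build minimum is borrowed from the retraining rule, while the minimum risky ratio and the use of AUC for the regression check are assumptions, since the page does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Metrics for a newly trained model (hypothetical structure)."""
    n_samples: int
    precision: float
    auc: float
    risky_ratio: float  # share of training rows labelled risky

MIN_SAMPLES = 80        # assumption: reuses the 80+ labelled builds rule
MIN_PRECISION = 0.75
MIN_AUC = 0.70
MIN_RISKY_RATIO = 0.05  # assumption: the page gives no value

def validation_gate(new, old_auc=None):
    """Return (passed, failed_check_names) for a candidate model."""
    checks = {
        "dataset_size": new.n_samples >= MIN_SAMPLES,
        "precision": new.precision >= MIN_PRECISION,
        "auc": new.auc >= MIN_AUC,
        # assumption: "no regression" is judged on AUC alone
        "no_regression": old_auc is None or new.auc >= old_auc,
        "risky_ratio": new.risky_ratio >= MIN_RISKY_RATIO,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)

ok, failed = validation_gate(Candidate(120, 0.851, 0.934, 0.12), old_auc=0.90)
```

Returning the failed check names makes it easy to log why a nightly model was rejected and the old one kept.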

All 10 Scored Features
Feature              Signal Weight  Extraction                        Fallback
diff_size            High           git diff --stat | parse lines     files × 20
recent_failure_rate  Very High      Jenkins API last 10 builds        0.0
test_pass_rate       High           Jenkins test report API           1.0
is_hotfix            High           Branch name contains hotfix/fix   0
hour_of_day          Medium         System time at trigger            Never missing
day_of_week          Medium         System time at trigger            Never missing
files_changed        Medium         git diff --name-only | wc -l      5
days_since_deploy    Medium         Jenkins last success timestamp    7.0
deployer_exp         Low            Count in tenant S3 CSV            1
build_time_delta     Low            Current vs 14-day avg duration    0.0
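
The fallback column can be read as a small imputation step applied before scoring. The sketch below is one possible reading of it: `fill_missing` is a hypothetical helper, and `files_changed` is a hypothetical name for the file-count feature extracted with `git diff --name-only | wc -l`. The two time features have no fallback because they are never missing.

```python
# Fallback values taken from the table; diff_size is handled separately
# because its fallback depends on the file count.
FALLBACKS = {
    "recent_failure_rate": 0.0,
    "test_pass_rate": 1.0,
    "is_hotfix": 0,
    "files_changed": 5,
    "days_since_deploy": 7.0,
    "deployer_exp": 1,
    "build_time_delta": 0.0,
}

def fill_missing(raw):
    """Return a copy of `raw` with missing or None features replaced."""
    features = dict(raw)
    for name, default in FALLBACKS.items():
        if features.get(name) is None:
            features[name] = default
    if features.get("diff_size") is None:
        # Table fallback: files x 20 lines
        features["diff_size"] = features["files_changed"] * 20
    return features

# Jenkins unreachable, git gave only the file count
scored_input = fill_missing({"files_changed": 12, "hour_of_day": 17, "day_of_week": 4})
```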

Everything on
AWS Free Tier

EC2 t2.micro
Scoring API

Flask + Gunicorn behind Nginx. Model loaded in memory, hot-reloaded every 5 min from S3. Ansible-provisioned.
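
The 5-minute hot reload could work roughly like the wrapper below, which refreshes the in-memory model lazily when a request arrives after the interval has elapsed. This is only a sketch: `HotReloadingModel` and `fetch_model` are hypothetical names, with `fetch_model` standing in for whatever downloads and unpickles the model file from S3. The real service might use a background thread instead.

```python
import time

class HotReloadingModel:
    """Keeps a model in memory and refetches it once the interval has passed."""

    def __init__(self, fetch_model, interval_s=300, clock=time.monotonic):
        self._fetch = fetch_model    # hypothetical: downloads + loads the model
        self._interval = interval_s  # 300 s = 5 minutes
        self._clock = clock          # injectable so the logic can be tested
        self._model = fetch_model()
        self._loaded_at = clock()

    def get(self):
        """Return the current model, reloading first if it is stale."""
        if self._clock() - self._loaded_at >= self._interval:
            self._model = self._fetch()
            self._loaded_at = self._clock()
        return self._model

# Demonstration with a fake clock and fake model versions
fake_now = [0.0]
versions = iter(["model-v1", "model-v2"])
holder = HotReloadingModel(lambda: next(versions), clock=lambda: fake_now[0])
first = holder.get()
fake_now[0] = 300.0
second = holder.get()
```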

S3 (2 buckets)
Storage

Model files (.pkl) and per-tenant build CSVs. 90-day rolling window. Monthly archive. Free tier: 5 GB.

DynamoDB
Tenant Registry

Single table, partition key: tenant_id. All tenant metadata, thresholds, model phase. Free tier: 25 GB.

Start shipping safer today

Free. Self-hosted. Your data stays in your infra.