About SafeShip

Built because 3am pages are avoidable

Every engineering team has the same story. A deploy goes out Friday at 5pm. The diff was "small." The tests passed. By Saturday morning, production is down. The signals were all there: a high recent failure rate, end-of-day timing, a rushed commit. Nobody was reading them. Now a machine does.

85.1%

Precision of the base model

0.934

AUC-ROC score (1.0 is a perfect ranking)

<200ms

Score latency per request

Monthly infrastructure cost

The Problem

Deploy risk is invisible until it isn't

Engineering teams make deploy decisions based on gut feeling. "It's a small change." "Tests passed." "We need this out today." But every one of those decisions contains measurable signals that predict failure.

Friday afternoon. 800+ line diff. 3 of the last 10 builds failed. The ML model would have scored that deploy 94/100. But without SafeShip, nobody was looking.

BEFORE SAFESHIP

✗ Deploy goes out Friday 5pm

✗ Tests passed, feels safe

✗ Production down Saturday 3am

✗ 4 hours of incident response

✗ Root cause: Friday + large diff + failures

WITH SAFESHIP

✓ Deploy scores 94/100 BLOCKED

✓ Top reasons shown instantly

✓ Team delays to Monday 10am

✓ Deploy scores 18/100 SAFE

✓ Ships cleanly, no incident

The ML Engine

How the model works

🌲

Random Forest

100 decision trees with a maximum depth of 8 and balanced class weights. Works well on 80 to 5,000 samples, the range real teams produce.

⚖️

SMOTE Balancing

Most builds succeed. Without rebalancing, the model would just predict "safe" for everything. SMOTE synthesizes extra examples of the rare risky builds so the model has to learn risky patterns.
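To make the balancing step concrete, here is a minimal pure-Python sketch of the SMOTE idea: pick a risky build, pick one of its nearest risky neighbours, and create a new point somewhere on the line between them. The function name, the toy data, and the two-feature layout are illustrative assumptions, not SafeShip's actual code.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy data (made up): [diff_size, recent_failure_rate] for builds labelled risky.
risky = [[820.0, 0.3], [640.0, 0.4], [910.0, 0.2], [700.0, 0.5]]
extra = smote(risky, n_synthetic=6, k=2)
```

Every synthetic point lies between two real risky builds, so the classifier sees more risky examples without any being copied verbatim. In practice the features would usually be scaled first, since a raw Euclidean distance is dominated by large-valued columns such as diff size.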

🎯

10 Features

Diff size, failure rate, time of day, day of week, test pass rate, hotfix flag, deployer experience, days since deploy, and more.

🔄

Nightly Retraining

EC2 cron at 2am UTC. Trains per-tenant models when 80+ labelled builds exist. 5-check validation gate before any model swap.

📊

Drift Detection

Weekly Kolmogorov-Smirnov (KS) test across all 10 features. If your team's deploy patterns shift significantly, the model retrains automatically.
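As a rough illustration of the drift check, the sketch below computes the two-sample KS statistic for one feature and compares it with the standard large-sample critical value. The function names, the 0.05 significance level, and the sample data are assumptions; the page does not state which threshold SafeShip uses.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def has_drifted(reference, recent, alpha=0.05):
    """Flag drift when D exceeds the asymptotic critical value (assumed alpha)."""
    n, m = len(reference), len(recent)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > critical

# Toy hour_of_day samples (made up): mostly morning deploys vs mostly evening deploys.
reference_hours = [9, 10, 10, 11, 11, 11, 12, 13, 14, 15] * 5
recent_hours = [15, 16, 16, 17, 17, 17, 18, 18, 19, 20] * 5
drift = has_drifted(reference_hours, recent_hours)
```

A weekly job could run this once per feature and trigger retraining if any feature is flagged. Testing ten features at once raises the chance of a false alarm, so a stricter alpha may be worth considering.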

🔒

Validation Gate

5 checks before any model swap: dataset size, precision ≥0.75, AUC ≥0.70, no regression vs old model, min risky ratio.
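The gate can be pictured as five boolean checks that must all pass. The sketch below uses the thresholds stated above (precision at least 0.75, AUC at least 0.70) and the 80-build minimum from the retraining card. The minimum risky ratio of 0.05, the class and function names, and the choice to compare both precision and AUC against the old model are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelReport:
    n_samples: int
    precision: float
    auc: float
    risky_ratio: float  # share of training builds labelled risky

def validation_gate(new: ModelReport, old: Optional[ModelReport],
                    min_samples=80, min_precision=0.75, min_auc=0.70,
                    min_risky_ratio=0.05):  # min_risky_ratio is an assumed value
    """Return (passed, failed_check_names) for a candidate model."""
    checks = {
        "dataset_size": new.n_samples >= min_samples,
        "precision": new.precision >= min_precision,
        "auc": new.auc >= min_auc,
        # Assumed meaning of "no regression": neither metric gets worse.
        "no_regression": old is None
        or (new.precision >= old.precision and new.auc >= old.auc),
        "risky_ratio": new.risky_ratio >= min_risky_ratio,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)

current = ModelReport(n_samples=400, precision=0.80, auc=0.90, risky_ratio=0.12)
candidate = ModelReport(n_samples=450, precision=0.78, auc=0.91, risky_ratio=0.11)
ok, failed = validation_gate(candidate, current)
```

Returning the names of the failed checks, rather than a bare boolean, makes it easy to log why a nightly model was rejected.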

All 10 Scored Features

Feature	Signal Weight	Extraction	Fallback
`diff_size`	High	git diff --stat | parse lines	files × 20
`recent_failure_rate`	Very High	Jenkins API last 10 builds	0.0
`test_pass_rate`	High	Jenkins test report API	1.0
`is_hotfix`	High	Branch name contains "hotfix" or "fix"	0
`hour_of_day`	Medium	System time at trigger	Never missing
`day_of_week`	Medium	System time at trigger	Never missing
`diff_size`	Medium	git diff --name-only | wc -l	5
`days_since_deploy`	Medium	Jenkins last success timestamp	7.0
`deployer_exp`	Low	Count in tenant S3 CSV	1
`build_time_delta`	Low	Current vs 14-day avg duration	0.0
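The fallback column can be read as a small "fill in what is missing" step that runs before scoring. The sketch below applies the fallbacks from the table. The function names are hypothetical, and so is `files_changed`, used here for the file-count row (the table labels that row `diff_size` as well). Encoding `day_of_week` with Monday as 0 is also an assumption.

```python
from datetime import datetime, timezone

# Fallback values taken from the table above.
STATIC_FALLBACKS = {
    "recent_failure_rate": 0.0,
    "test_pass_rate": 1.0,
    "is_hotfix": 0,
    "days_since_deploy": 7.0,
    "deployer_exp": 1,
    "build_time_delta": 0.0,
}

def is_hotfix(branch):
    """1 if the branch name contains "hotfix" or "fix", else 0."""
    name = branch.lower()
    return int(any(tag in name for tag in ("hotfix", "fix")))

def fill_features(raw, now=None):
    """Return a complete feature dict, replacing missing (None) values."""
    now = now or datetime.now(timezone.utc)
    features = {k: v for k, v in raw.items() if v is not None}
    features.setdefault("files_changed", 5)  # hypothetical name for the file count
    features.setdefault("diff_size", features["files_changed"] * 20)  # files x 20
    features.setdefault("hour_of_day", now.hour)       # never missing
    features.setdefault("day_of_week", now.weekday())  # Monday = 0 (assumed)
    for name, default in STATIC_FALLBACKS.items():
        features.setdefault(name, default)
    return features

# A Friday 5pm hotfix where the diff size could not be parsed.
friday_5pm = datetime(2024, 1, 5, 17, 0, tzinfo=timezone.utc)
features = fill_features(
    {"files_changed": 3, "diff_size": None,
     "is_hotfix": is_hotfix("hotfix/login-timeout")},
    now=friday_5pm,
)
```

Here the missing diff size becomes 3 × 20 = 60 lines, and the time features come from the trigger time, so the model always receives all ten inputs.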

Architecture

Everything on AWS Free Tier

EC2 t2.micro

Scoring API

Flask + Gunicorn behind Nginx. Model loaded in memory, hot-reloaded every 5 min from S3. Ansible-provisioned.
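One way to picture the hot reload is a small wrapper that keeps the model in memory and refreshes it once the copy is older than five minutes. This is a sketch under assumptions: `loader` is a hypothetical callable standing in for "download the .pkl from S3 and unpickle it", and the class name is invented for illustration.

```python
import time

class HotReloadingModel:
    """Serve an in-memory model and refresh it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds=300, clock=time.monotonic):
        self._loader = loader      # hypothetical: fetches and unpickles the model
        self._ttl = ttl_seconds    # 300 s = the 5-minute reload interval
        self._clock = clock
        self._model = loader()
        self._loaded_at = clock()

    def get(self):
        if self._clock() - self._loaded_at >= self._ttl:
            try:
                self._model = self._loader()
            except Exception:
                pass  # keep serving the previous model if the refresh fails
            self._loaded_at = self._clock()
        return self._model

# Demo with a fake clock and a loader that returns version strings.
fake_now = [0.0]
versions = []

def fake_loader():
    versions.append(len(versions) + 1)
    return f"model-v{versions[-1]}"

model = HotReloadingModel(fake_loader, ttl_seconds=300, clock=lambda: fake_now[0])
first = model.get()
fake_now[0] = 301.0
second = model.get()
```

Refreshing lazily on the request path avoids a background thread in each Gunicorn worker, at the cost of one slower request every five minutes. A failed refresh leaves the previous model in place, so scoring keeps working.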

S3 (2 buckets)

Storage

Model files (.pkl) and per-tenant build CSVs. 90-day rolling window. Monthly archive. Free tier: 5 GB.
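The 90-day rolling window amounts to splitting each tenant's build rows by age. A minimal sketch, assuming each CSV row carries an ISO 8601 `timestamp` column (the column name and the sample rows are hypothetical):

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def split_rolling_window(rows, now, days=90, ts_field="timestamp"):
    """Split rows into (recent, archive) around a cutoff of `days` before `now`."""
    cutoff = now - timedelta(days=days)
    recent, archive = [], []
    for row in rows:
        ts = datetime.fromisoformat(row[ts_field])
        (recent if ts >= cutoff else archive).append(row)
    return recent, archive

# Made-up sample of a per-tenant build CSV.
sample_csv = """timestamp,diff_size,failed
2024-01-02T10:00:00+00:00,120,0
2024-03-20T17:05:00+00:00,840,1
2024-04-28T09:30:00+00:00,45,0
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
now = datetime(2024, 5, 1, tzinfo=timezone.utc)
recent, archive = split_rolling_window(rows, now)
```

Rows in `recent` would feed nightly training, while `archive` rows are the ones a monthly archive job would move out of the active file.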

DynamoDB

Tenant Registry

Single table, partition key: tenant_id. All tenant metadata, thresholds, model phase. Free tier: 25 GB.

Start shipping safer today

Free. Self-hosted. Your data stays in your infra.

