Built because 3am pages are avoidable
Every engineering team has the same story. A deploy goes out Friday at 5pm. The diff was "small." The tests passed. By Saturday morning, production is down. The signals were all there — high failure rate, end-of-day timing, rushed commit. Nobody was reading them. Now a machine does.
Deploy risk is invisible until it isn't
Engineering teams make deploy decisions based on gut feeling. "It's a small change." "Tests passed." "We need this out today." But every one of those decisions contains measurable signals that predict failure.
Friday afternoon. 800+ line diff. 3 of the last 10 builds failed. The ML model would have scored that deploy 94/100. But without SafeShip, nobody was looking.
How the model works
100 decision trees, depth 8, balanced class weights. Works well on 80–5000 samples — exactly the range real teams produce.
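The ensemble described above maps directly onto a standard scikit-learn configuration. A minimal sketch (the `random_state` value is an assumption for reproducibility; the three other parameters come straight from the text):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, depth 8, balanced class weights so the rare
# "failed deploy" label is not drowned out by successes.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    class_weight="balanced",
    random_state=42,  # assumption: any fixed seed, for reproducible training
)
```

Depth 8 keeps individual trees simple enough to avoid memorizing a few dozen samples, which matters at the low end of the 80–5000 range.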
Most builds succeed. Without SMOTE, the model would just predict "safe" for everything. SMOTE forces it to learn risky patterns.
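In practice SMOTE usually comes from the imbalanced-learn library; the core idea can be sketched in a few lines of NumPy — synthesize new risky samples by interpolating between a minority point and one of its nearest minority neighbours (a sketch only, not SafeShip's implementation):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = rng or np.random.default_rng(0)
    # pairwise distances within the minority (risky) class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # never pick yourself
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # pick a risky sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]  # one of its neighbours
        lam = rng.random()                     # interpolation factor in [0, 1]
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

# hypothetical risky builds: (diff lines, recent failure rate)
risky = np.array([[800., 0.3], [650., 0.4], [900., 0.5], [700., 0.2]])
new_points = smote_sketch(risky, n_new=6)   # 6 synthetic risky samples
```

The synthetic points stay inside the region the real risky builds occupy, so the classifier sees more risky examples without seeing impossible ones.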
Diff size, failure rate, time of day, day of week, test pass rate, hotfix flag, deployer experience, days since deploy, and more.
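Those signals reduce to a flat numeric vector per build. A sketch of the feature extraction, assuming an illustrative schema — the field names here are hypothetical, not SafeShip's actual data model:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Build:                         # hypothetical input record
    diff_lines: int
    recent_failure_rate: float       # fraction of last 10 builds that failed
    test_pass_rate: float
    is_hotfix: bool
    deployer_deploy_count: int       # proxy for deployer experience
    days_since_last_deploy: float
    started_at: datetime

def to_features(b: Build) -> list[float]:
    return [
        float(b.diff_lines),
        b.recent_failure_rate,
        float(b.started_at.hour),        # time of day
        float(b.started_at.weekday()),   # day of week (4 = Friday)
        b.test_pass_rate,
        float(b.is_hotfix),
        float(b.deployer_deploy_count),
        b.days_since_last_deploy,
    ]

# the Friday-5pm deploy from the story above
vec = to_features(Build(812, 0.3, 0.97, True, 4, 0.2,
                        datetime(2024, 5, 17, 17, 5)))
```

Every value is a plain float, so the same vector feeds both training and the live scoring endpoint.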
EC2 cron at 2am UTC. Trains per-tenant models when 80+ labelled builds exist. 5-check validation gate before any model swap.
Weekly KS-test across all 10 features. If your team's deploy patterns shift significantly, the model automatically retrains.
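The drift check itself is a two-sample Kolmogorov–Smirnov test per feature, comparing the distribution the model was trained on against recent builds. A minimal sketch with SciPy (the `p < 0.01` cutoff is an assumption; SafeShip's actual threshold isn't stated here):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# hypothetical diff-size distributions: training window vs. this week
trained_on = rng.normal(200, 50, 500)   # typical diffs the model learned on
recent = rng.normal(600, 50, 200)       # the team started shipping huge diffs

stat, p = ks_2samp(trained_on, recent)
drifted = p < 0.01   # assumed significance cutoff; drift triggers a retrain
```

Run once per feature, any significant shift flags the tenant for retraining on the next 2am cycle.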
5 checks before any model swap: dataset size, precision ≥0.75, AUC ≥0.70, no regression vs old model, min risky ratio.
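The gate is a short conjunction of those five checks. A sketch — the precision and AUC floors come from the text; the minimum sample count (80, matching the training trigger) and minimum risky ratio are assumptions:

```python
def passes_gate(n_samples: int, precision: float, auc: float,
                old_auc: float, risky_ratio: float,
                min_samples: int = 80, min_risky: float = 0.05) -> bool:
    """The 5-check validation gate: all must pass before a model swap."""
    checks = [
        n_samples >= min_samples,   # 1. enough labelled builds
        precision >= 0.75,          # 2. precision floor
        auc >= 0.70,                # 3. AUC floor
        auc >= old_auc,             # 4. no regression vs current model
        risky_ratio >= min_risky,   # 5. enough risky examples in the data
    ]
    return all(checks)

ok = passes_gate(n_samples=240, precision=0.81, auc=0.78,
                 old_auc=0.74, risky_ratio=0.12)
```

Failing any single check leaves the old model in place, so a bad training run can never degrade live scoring.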
Everything on AWS Free Tier
Flask + Gunicorn behind Nginx. Model loaded in memory, hot-reloaded every 5 min from S3. Ansible-provisioned.
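The hot-reload pattern is swap-on-change: poll, compare a version marker, and only deserialize when it differs. A self-contained sketch — production would poll an S3 ETag via boto3 every 5 minutes, but this version watches a local file's mtime so it runs anywhere:

```python
import os
import pickle
import tempfile

class HotModel:
    """Reload the pickled model only when the file actually changes.
    (Sketch: SafeShip compares against S3, not a local path.)"""

    def __init__(self, path: str):
        self.path = path
        self._mtime = None
        self.model = None
        self.reload_if_changed()

    def reload_if_changed(self) -> None:
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:            # skip the deserialize if unchanged
            with open(self.path, "rb") as f:
                self.model = pickle.load(f)
            self._mtime = mtime

# simulate a published model artifact
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump({"version": 1}, f)

hm = HotModel(path)   # in the app, reload_if_changed() runs on a timer
```

Keeping the model in process memory means each Gunicorn worker scores requests with zero network calls on the hot path.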
S3: model files (.pkl) and per-tenant build CSVs. 90-day rolling window, monthly archive. Free tier: 5 GB.
DynamoDB: single table, partition key tenant_id. All tenant metadata, thresholds, model phase. Free tier: 25 GB.