What you need to know before reading anything else.
The platform does what it promises. The problem is safety, not features. Right now the keys to the production servers and the customer database are written down inside the code itself, where anyone with a copy can read them. A few other doors that should be locked are standing open. None of this is hard to fix, but it should be treated as an urgent, this-week priority, because it is exploitable today.
No jargon. Each one is a real risk to the business, with an everyday analogy.
The same findings, framed for the call each of you actually needs to make.
.dockerignore, rebuild images. Today.
file:// sink, and unverified model weights: individually serious, chainable into full AWS takeover.
FORCE_ENV.
Why three "separate" issues are really one path to full compromise.
Sequenced by urgency. The top two rows remove the real danger.
.dockerignore and rebuild clean images so keys stop shipping inside them.
file:// URL fetcher. Add integrity checks to AI model downloads. Restrict CORS to known origins.
FORCE_ENV). Tighten rate-limiting and shorten token lifetimes. Unify the two deploy methods.
Everything below is concrete and reproducible. Line numbers reflect the reviewed commit.
Real EC2 SSH private keys (-----BEGIN RSA PRIVATE KEY-----), the production RDS password, prod/dev EC2 IPs, and the NewRelic license key are tracked in git. The RDS URL is hardcoded again in alembic.ini. An empty .dockerignore entry + COPY ./ /app bakes them into every shipped image and registry layer.
Impact: Anyone with repo read, any historical clone, or anyone who can pull an image gets SSH + DB access. Full compromise.
git filter-repo/BFG. Add EC2/, certs/, *.pem, *.ppk, credentials.txt to .gitignore + .dockerignore. Move secrets to AWS SSM; source the alembic URL from env.
/business/* namespace
The skip-path list contains the bare prefix "/business/", matched by path.startswith(skip_path). Nearly every business endpoint short-circuits to call_next before any token check. Auth now relies on each route remembering Depends(JWTBearer()).
Impact: Multi-tenant data exposure on any under-protected endpoint; the assumed-global control is silently off.
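The prefix-matching pitfall can be illustrated with a minimal sketch. The skip lists and function names below are hypothetical and not taken from the reviewed code; only the path.startswith(skip_path) pattern and the "/business/" entry come from the finding.

```python
# Hypothetical skip lists; only the "/business/" prefix mirrors the finding.
SKIP_PATHS_PREFIX = ["/health", "/business/"]
# Assumed set of routes that are genuinely meant to be public.
SKIP_PATHS_EXACT = frozenset({"/health", "/business/login"})


def skips_auth_prefix(path: str) -> bool:
    """Prefix matching: any route under a listed prefix bypasses the token check."""
    return any(path.startswith(skip_path) for skip_path in SKIP_PATHS_PREFIX)


def skips_auth_exact(path: str) -> bool:
    """Exact matching: only explicitly listed routes bypass the token check."""
    return path in SKIP_PATHS_EXACT
```

With prefix matching, a path such as /business/customers/42 skips the middleware check entirely; with exact matching it does not, so the middleware remains a real backstop even if a route forgets its own dependency.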
/business/* route for an explicit auth dependency.
Fetches arbitrary user-supplied URLs with requests.get(url), with no scheme/host allow-list, and explicitly handles file:// to read local disk. A bare except: pass hides abuse.
Impact: file:///etc/passwd / committed keys, and http://169.254.169.254/... (EC2 IMDS) → steal IAM role → AWS creds. Chains with 1.1.
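A minimal sketch of the kind of URL validation this finding calls for, using only the standard library. The allow-listed host and the function names are assumptions for illustration, not part of the reviewed code. The DNS check is best-effort: it does not by itself defeat DNS rebinding, so the caller should also disable redirects and set timeouts on the actual fetch.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical allow-list; replace with the hosts you actually serve assets from.
ALLOWED_HOSTS = frozenset({"assets.example.com"})


def is_public_ip(address: str) -> bool:
    """Return False for private, loopback, link-local (incl. IMDS) and similar ranges."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def validate_fetch_url(url: str, resolve: bool = True) -> str:
    """Raise ValueError unless the URL is https, allow-listed, and resolves publicly."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    host = parts.hostname  # already lower-cased by urlsplit
    if host is None or host not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {host!r}")
    if resolve:
        # Check every address the name resolves to, not just the first one.
        for info in socket.getaddrinfo(host, parts.port or 443, proto=socket.IPPROTO_TCP):
            if not is_public_ip(info[4][0]):
                raise ValueError(f"host resolves to a non-public address: {info[4][0]}")
    return url
```

A caller would run validate_fetch_url(url) first and then fetch with something like requests.get(url, timeout=5, allow_redirects=False), so that a redirect cannot route the request back to an internal address.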
https schemes, drop file://, validate host against an allow-list, block RFC1918/link-local/metadata IPs, set timeouts + size caps. Ideally fetch only from your own S3/CloudFront.
Weights download from S3 verified only by os.path.exists(), with no checksum/size check, then load via YOLO(...), which runs torch.load on a pickle (arbitrary code execution). Truncated downloads cache as valid.
Impact: Any S3 write access to the model bucket → RCE in production; partial downloads → silent model corruption.
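One way to close the gap between "file exists" and "file is the expected artifact" is a size and SHA-256 check before loading. This is a sketch under assumptions: the function names are hypothetical, and the expected digest and size are assumed to come from a trusted manifest kept outside the model bucket.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weights need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str, expected_size: int) -> None:
    """Raise if the file is missing, truncated, or does not match the pinned digest."""
    if not path.is_file():
        raise FileNotFoundError(path)
    actual_size = path.stat().st_size
    if actual_size != expected_size:
        raise ValueError(f"size mismatch: {actual_size} != {expected_size}")
    if not hmac.compare_digest(sha256_of(path), expected_sha256.lower()):
        raise ValueError("checksum mismatch")
```

Calling verify_weights(...) before the model is loaded turns both a truncated download and a swapped file into a hard failure instead of a silently cached artifact. It does not make pickle loading safe on its own; it only ensures the pickle is the one that was pinned.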
OpenCV, PIL saves, synchronous requests.get and time.sleep run inside async def handlers (453 of them), with only ~16 run_in_executor/to_thread calls in the tree. A 10s blocking call freezes the whole worker.
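The effect of offloading blocking work can be seen in a small standalone sketch. The function names are hypothetical, and time.sleep stands in for any blocking call such as an OpenCV operation or a synchronous requests.get; asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    """Stand-in for a blocking call (image processing, sync HTTP, disk I/O)."""
    time.sleep(seconds)
    return "done"


async def handler(seconds: float) -> str:
    """Run the blocking call in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(blocking_work, seconds)


async def main() -> float:
    """Run three handlers concurrently and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(0.2) for _ in range(3)))
    return time.perf_counter() - start
```

Because each call runs in its own thread, the three handlers overlap rather than queue behind each other. Calling blocking_work directly inside the async def handler would instead serialize them and stall every other request on the same worker.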
asyncio.to_thread/executor or Celery; or make endpoints plain def. Use httpx.AsyncClient; never time.sleep in async.
model = YOLO(get_model_path(...)) sits inside the per-request function: no singleton/cache, weights even re-fetched. Large x-variant models run on CPU while ~5GB of unused nvidia-cu12 wheels ship. Documented root cause of the 500–800MB/request peaks and OOM kills.
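A per-process cache is one simple way to stop reloading a model on every request. The sketch below uses a stand-in class rather than the real YOLO constructor so that it runs without the model library; FakeModel and get_model are hypothetical names.

```python
from functools import lru_cache


class FakeModel:
    """Stand-in for an expensive-to-construct model object such as YOLO(...)."""

    load_count = 0  # counts how many times a model was actually constructed

    def __init__(self, weights_path: str) -> None:
        FakeModel.load_count += 1
        self.weights_path = weights_path


@lru_cache(maxsize=4)
def get_model(weights_path: str) -> FakeModel:
    """Construct the model once per distinct weights path, then reuse it."""
    return FakeModel(weights_path)
```

Repeated calls with the same path return the same object, so the load cost is paid once per process. Note that lru_cache may still invoke the function more than once if several threads hit a cold cache at the same moment, so loading at startup is the safer option where that matters.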
lru_cache. Move inference to a GPU Celery worker; install CPU-only torch on the API image.
The only test files are a rate-limit script and the vendored matting lib. CI on push to main SSHes into prod and swaps containers, with no test/lint/type/build gate and no approval. "Blue-green" actually does down then up = multi-minute 502 window per deploy.
Almost every direct dep is unpinned, so a re-lock silently pulls new torch/ultralytics/numpy, changing model numerics or pulling CVEs. No wheel hashes = no build-time integrity. Two conflicting OpenCV builds; rembg[cli] drags the gradio UI stack into a UI-less API.
--generate-hashes, build with --require-hashes. Drop [cli], standardize on opencv-python-headless, split GPU/audio deps into a worker file.
Three .pptx (44/32/20 MB) + 48 PNGs are raw git blobs bloating history (.git is 80M). models.py is a 364KB ORM monolith; four route/function files total ~49K lines.
models.py into a domain package; decompose routes into per-resource APIRouter modules.
Starlette reflects the request Origin when credentials are on, which is effectively all origins. Cross-tenant theft via a victim's browser. Fix: explicit origin allow-list
Forged X-Forwarded-For = fresh bucket per request → brute-force on login/OTP/reset. Redis outage = unlimited. Fix: trusted-proxy hop count + atomic Lua bucket + fail-closed on auth
Leaked refresh token = a year of access; weak HS256 secret = forgeable tokens = full bypass. Fix: 14–30d refresh + rotation; hard-fail on <32-char secret
FORCE_ENV & safety flags hardcoded in tracked source
One stray commit pinning FORCE_ENV='prod' points a non-prod deploy at prod DB. Fix: drive env + toggles from deploy env vars; assert prod values in CI
Silent failures + query-per-iteration with no eager loading, compounding 2.1. Fix: specific excepts + logging; selectinload/joinedload; query-count assertions
The supervisor script sets concurrency=4, violating the documented concurrency=1 OOM invariant, and targets the wrong dir; running it OOMs the box. Scripts lack set -euo pipefail; rollback only guards the web image. Fix: pick one model; harden scripts; align concurrency=1
Unmaintained base, no USER, NPM admin on :81 with npm/npm, and 4G limits may not be enforced. Fix: python:3.10-slim multi-stage non-root; strong NPM creds + localhost-only :81; mem_limit/cpus
API map handed to attackers; auth drift across live v1/v2; no way to reproduce a past inference or catch a model regression. Fix: gate docs in prod; consolidate auth; weight→sha→git-ref manifest + golden-image eval
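For the rate-limiting item in the list above, the "trusted-proxy hop count" fix can be sketched as follows. The function name and parameters are hypothetical. The sketch assumes each trusted proxy appends the address it received the connection from to X-Forwarded-For, so only the rightmost entries are reliable.

```python
from typing import Optional


def client_ip(xff_header: Optional[str], peer_ip: str, trusted_hops: int = 1) -> str:
    """Pick the client address written by the outermost trusted proxy.

    Entries to the left of that position are client-controlled and ignored.
    Falls back to the socket peer address when the header is absent or too short.
    """
    if not xff_header or trusted_hops < 1:
        return peer_ip
    hops = [hop.strip() for hop in xff_header.split(",") if hop.strip()]
    if len(hops) < trusted_hops:
        return peer_ip
    return hops[-trusted_hops]
```

With one trusted proxy, a forged header such as "1.2.3.4, 203.0.113.7" still yields 203.0.113.7, the address the proxy itself observed, so rotating the forged value no longer produces a fresh bucket per request.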
src/schemas.py is 0 bytes, src/routes.py is a stub.
*.json and *.md globally ignored.
logs/security.log, uploads/example.txt tracked; use .gitkeep.
src/; micro_haircomv_v2/ has no LICENSE/README.
print() in rate-limit hot path: leaks bucket state every request.
shell=True: subprocess uses arg lists; no command injection.
run.py documents the memory math + concurrency invariant.