Codeaza Deep Review — B2B AI SaaS Platform

Technical Deep-Dive · For Engineers

Findings with file references

Everything below is concrete and reproducible. Line numbers reflect the reviewed commit.

1 Critical — fix this week

Critical 1.1: Live secrets committed to git (SSH keys + prod DB password)

EC2/B2B-AI-SaaS.pem · .ppk · EC2/credentials.txt · alembic.ini · newrelic.ini

Real EC2 SSH private keys (-----BEGIN RSA PRIVATE KEY-----), the production RDS password, the prod and dev EC2 IPs, and the NewRelic license key are all tracked in git. The RDS URL is hardcoded a second time in alembic.ini. With nothing in .dockerignore to exclude them, COPY ./ /app bakes these files into every shipped image and registry layer.

Impact: Anyone with repo read access, an old clone, or pull access to an image gets SSH and DB access. Full compromise.

Fix now: Rotate the RDS password, both EC2 keypairs, and the NewRelic key today. Purge history with git filter-repo or BFG. Add EC2/, certs/, *.pem, *.ppk, and credentials.txt to both .gitignore and .dockerignore. Move secrets to AWS SSM and source the alembic URL from the environment.
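
A minimal sketch of sourcing the alembic URL from the environment instead of alembic.ini. The variable name DATABASE_URL and the helper database_url_from_env are assumptions, not names from the repo; config.set_main_option is the standard Alembic call for overriding sqlalchemy.url.

```python
import os


def database_url_from_env(environ=os.environ, var="DATABASE_URL"):
    """Return the database URL from the environment, failing loudly if unset.

    DATABASE_URL is an assumed variable name; use whatever the deploy sets.
    """
    url = environ.get(var, "").strip()
    if not url:
        raise RuntimeError(f"{var} is not set; refusing to fall back to a hardcoded URL")
    return url


# In alembic/env.py (not executed here) the value would replace the
# hardcoded sqlalchemy.url from alembic.ini:
#     config.set_main_option("sqlalchemy.url", database_url_from_env())
```

Failing at startup when the variable is missing is deliberate: a silent fallback is how a hardcoded prod URL ends up back in the file.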

Critical 1.2: Auth middleware is a no-op for the whole /business/* namespace

src/middleware/auth_middleware.py:35 (skip list) · :92-94 (startswith match)

The skip-path list contains the bare prefix "/business/", matched with path.startswith(skip_path). Nearly every business endpoint therefore short-circuits to call_next before any token check, so authentication depends entirely on each route declaring Depends(JWTBearer()).

Impact: Multi-tenant data exposure on any route that lacks its own auth dependency; the control everyone assumes is global is silently off.

Fix: Match static public paths exactly and allow-list dynamic ones with an anchored regex or a set. Never put a namespace root in a startswith skip list. Audit every /business/* route for an explicit auth dependency.
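
A sketch of the exact-match plus anchored-regex check the middleware could use in place of startswith. The paths listed here are hypothetical placeholders; the real public endpoints have to come from the route audit.

```python
import re

# Hypothetical public endpoints, for illustration only.
PUBLIC_EXACT = frozenset({"/health", "/business/login"})
PUBLIC_PATTERNS = (
    re.compile(r"/business/invite/[A-Za-z0-9_-]+/accept"),
)


def is_public_path(path: str) -> bool:
    """True only for paths that are explicitly allow-listed."""
    if path in PUBLIC_EXACT:
        return True
    # fullmatch anchors both ends, so a pattern can never act as a prefix.
    return any(pattern.fullmatch(path) for pattern in PUBLIC_PATTERNS)
```

With this shape, adding a new business route is private by default; making it public requires an explicit entry that shows up in review.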

Critical 1.3: SSRF + local-file disclosure in URL fetcher

src/utils/image_utils.py:171-204 · callers in image_processing.py:415, study_group_patients.py

The fetcher calls requests.get(url) on arbitrary user-supplied URLs with no scheme or host allow-list, and it explicitly handles file:// by reading from local disk. A bare except: pass hides abuse.

Impact: file:// reads expose /etc/passwd and the committed keys; http://169.254.169.254/... (EC2 IMDS) exposes the instance's IAM role credentials, and with them AWS access. Chains with 1.1.

Fix: Reject non-https schemes, drop file://, validate host against an allow-list, block RFC1918/link-local/metadata IPs, set timeouts + size caps. Ideally fetch only from your own S3/CloudFront.
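
A sketch of the validation step, under stated assumptions: ALLOWED_HOSTS is a hypothetical allow-list, and the resolver is injectable so the check can be tested without network access. It uses only the standard library (urllib.parse, socket, ipaddress).

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical allow-list; replace with the real asset hosts.
ALLOWED_HOSTS = frozenset({"assets.example.com"})


def _resolve(host):
    """Return every IP address the host currently resolves to."""
    infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


def validate_fetch_url(url, allowed_hosts=ALLOWED_HOSTS, resolve=_resolve):
    """Return the URL if it is safe to fetch, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        raise ValueError(f"host not allowed: {host!r}")
    for addr in resolve(host):
        ip = ipaddress.ip_address(addr)
        # is_global is False for RFC 1918, loopback, and link-local ranges;
        # link-local includes the EC2 metadata address 169.254.169.254.
        if not ip.is_global:
            raise ValueError(f"host resolves to non-public address: {ip}")
    return url
```

Validation alone does not close the hole: the HTTP client resolves the name again and may follow redirects. The actual fetch should pass a timeout, disable redirects (or re-validate each hop), and stream the body with a size cap.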

Critical 1.4: Model weights loaded with no integrity check → pickle RCE

micro_haircomv_v2/micro_haircomb/download_weights.py · segmentation.py:113 · follicle.py:36

Weights are downloaded from S3 and verified only with os.path.exists(), with no checksum or size check. They are then loaded via YOLO(...), which runs torch.load on a pickle file, and unpickling untrusted data can execute arbitrary code. A truncated download is cached as if it were valid.

Impact: Anyone with write access to the model bucket gets RCE in production; a partial download silently corrupts the model.

Fix: Ship a SHA256 manifest; download to a temp file, verify the digest, then rename atomically. Restrict bucket writes to a CI role.
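
A sketch of the download, verify, atomic-rename sequence. The names sha256_of and install_verified are hypothetical, and the download callable stands in for the existing S3 fetch; the expected digest would come from the manifest.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_verified(download, dest, expected_sha256):
    """Run download(tmp_path), verify the digest, then publish to dest."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temp file lives next to dest so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    os.close(fd)
    try:
        download(tmp_path)
        actual = sha256_of(tmp_path)
        if actual != expected_sha256:
            raise ValueError(f"checksum mismatch for {dest}: got {actual}")
        os.replace(tmp_path, dest)  # atomic when source and target share a filesystem
    finally:
        # Clean up after a failed download or a failed verification.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return dest
```

Because dest only ever appears after verification, os.path.exists(dest) becomes a meaningful check again, and a truncated download can no longer be cached as valid.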

2 High priority

High 2.1: Heavy CV work runs on the async event loop

business_functions/optimized_bg_process.py · patient_image_set_v2.py · image_processing.py

OpenCV calls, PIL saves, synchronous requests.get, and time.sleep all run inside async def handlers. The tree has 453 async def handlers but only ~16 run_in_executor/to_thread calls. A 10 s blocking call freezes every request on that worker for 10 s.

Fix: Offload CV work and blocking I/O with asyncio.to_thread or an executor, move it to Celery, or declare the endpoints as plain def so FastAPI runs them in its threadpool. Use httpx.AsyncClient for HTTP, and never call time.sleep in async code (use asyncio.sleep).
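
A minimal illustration of the offload pattern, assuming Python 3.9+ for asyncio.to_thread. blocking_cv_step and handler are hypothetical stand-ins, not functions from the repo; the sleep simulates an OpenCV or PIL call.

```python
import asyncio
import threading
import time


def blocking_cv_step(duration):
    """Stand-in for an OpenCV/PIL call; it blocks the calling thread."""
    time.sleep(duration)
    return threading.get_ident()


async def handler(duration=0.05):
    # to_thread runs the blocking call in a worker thread, so the event
    # loop stays free to serve other requests while it waits.
    worker_thread = await asyncio.to_thread(blocking_cv_step, duration)
    return worker_thread, threading.get_ident()
```

Threads help most when the blocking call releases the GIL or waits on I/O. Sustained CPU-heavy inference is still a better fit for a separate worker process such as Celery.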

High 2.2: YOLO models reloaded from disk on every inference (CPU)

micro_haircomv_v2/micro_haircomb/segmentation.py:113 · follicle.py:36 · run.py

model = YOLO(get_model_path(...)) sits inside the per-request function, with no singleton or cache, and the weights are even re-fetched. Large x-variant models run on CPU while ~5 GB of unused nvidia-cu12 wheels ship. This is the documented root cause of the 500–800 MB per-request memory peaks and the OOM kills.

Fix: Load each model once at startup into a module-level singleton / lru_cache. Move inference to a GPU Celery worker; install CPU-only torch on the API image.
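
A sketch of the load-once pattern using functools.lru_cache. _load_from_disk is a stand-in for YOLO(path) so the example runs without ultralytics; the counter exists only to show that the loader runs once per path.

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}


def _load_from_disk(path):
    """Stand-in for YOLO(path); counts how often weights are actually loaded."""
    LOAD_COUNT["n"] += 1
    return {"weights": path}


@lru_cache(maxsize=None)
def get_model(path):
    """Return the cached model for this path, loading it on first use only."""
    return _load_from_disk(path)
```

lru_cache keeps its cache consistent across threads but does not stop two concurrent first calls from both loading, so calling get_model for each path once at startup is the safer way to warm it.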

High 2.3: No tests; CI deploys straight to prod with no gate

.github/workflows/main-ci.yml · no tests/ dir · pre-commit = black/isort only

The only test files are a rate-limit script and those in the vendored matting lib. On every push to main, CI SSHes into prod and swaps containers with no test, lint, type, or build gate and no approval step. The "blue-green" deploy actually runs down then up, which leaves a multi-minute 502 window on each deploy.

Fix: Add pytest + a required test/lint/mypy job before deploy; gate prod behind a GitHub Environment with reviewers; deploy on tag/release. Implement real parallel-stack cut-over.

High 2.4: Unpinned, un-hashed dependencies

requirements.in (torch, ultralytics, opencv, numpy… unpinned) · requirements.txt (no --generate-hashes)

Almost every direct dependency is unpinned, so a re-lock silently pulls new torch, ultralytics, and numpy releases, which can change model numerics or introduce CVEs. Without wheel hashes there is no build-time integrity check. Two conflicting OpenCV builds are installed, and rembg[cli] drags the gradio UI stack into a UI-less API.

Fix: Pin every direct dep; regenerate with --generate-hashes, build with --require-hashes. Drop [cli], standardize on opencv-python-headless, split GPU/audio deps into a worker file.
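
The pinning rule is easy to enforce mechanically. A sketch of a check that could run in pre-commit or CI and flags any line of requirements.in without an exact == pin. The function name and regex are assumptions, and the sketch does not handle URL or VCS requirements.

```python
import re

# name, optional [extras], then an exact == version
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text):
    """Return requirement lines that lack an exact == pin."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r/-c
        if not PINNED.match(line):
            offenders.append(line)
    return offenders
```

A CI job would fail the build when the returned list is non-empty, before the lock file is regenerated with --generate-hashes.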

High 2.5: 96 MB binaries in git; god-files everywhere

src/admin/admin_scribble_tool/templates/*.pptx (96MB) · src/models.py (9,321 lines, 228 classes) · business/routes.py (10,599 lines)

Three .pptx files (44, 32, and 20 MB) plus 48 PNGs are stored as raw git blobs and bloat history (.git is 80 MB). models.py is a 364 KB ORM monolith, and four route/function files total ~49K lines.

Fix: Move binaries to S3/Git LFS and purge history. Split models.py into a domain package; decompose routes into per-resource APIRouter modules.

3 Medium

Medium 3.1: CORS wildcard with credentials

src/main.py:94-101 & :173-180 — allow_origins=["*"] + allow_credentials=True

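With credentials enabled, Starlette reflects the request Origin, so the wildcard effectively allows every origin to make credentialed requests. A malicious page can then read another tenant's data through a victim's browser. Fix: use an explicit origin allow-list.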
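
A sketch of building the allow-list from configuration while refusing wildcards. The env var name CORS_ORIGINS and the helper parse_allowed_origins are assumptions; the https-only rule would need a dev-mode exception for http://localhost.

```python
def parse_allowed_origins(raw):
    """Parse a comma-separated origin list, rejecting wildcards and non-https."""
    origins = [item.strip().rstrip("/") for item in raw.split(",") if item.strip()]
    if not origins:
        raise ValueError("no CORS origins configured")
    for origin in origins:
        if "*" in origin or not origin.startswith("https://"):
            raise ValueError(f"refusing CORS origin {origin!r}")
    return origins


# Usage with FastAPI/Starlette (not executed here; CORS_ORIGINS is assumed):
#     app.add_middleware(
#         CORSMiddleware,
#         allow_origins=parse_allowed_origins(os.environ["CORS_ORIGINS"]),
#         allow_credentials=True,
#     )
```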

Medium 3.2: Rate limiter trusts X-Forwarded-For & fails open

rate_limit_identifier.py:62-69 · token_bucket.py:63 (allow on Redis error, non-atomic)

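A forged X-Forwarded-For value yields a fresh bucket on every request, which permits brute-forcing the login, OTP, and reset endpoints. During a Redis outage requests are unlimited. Fix: derive the client IP using a trusted-proxy hop count, make the bucket update atomic with a Lua script, and fail closed on auth endpoints.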
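
A sketch of the hop-count rule. It assumes each proxy the team operates appends the address of the peer it received the request from, so only the last trusted_hops entries of the header are trustworthy. client_ip and its parameters are hypothetical names.

```python
def client_ip(peer_ip, x_forwarded_for, trusted_hops):
    """Pick the address appended by the outermost trusted proxy.

    trusted_hops is the number of proxies we operate in front of the app.
    Anything to the left of the last trusted_hops entries was supplied by
    the client and is ignored.
    """
    if trusted_hops <= 0 or not x_forwarded_for:
        return peer_ip
    chain = [part.strip() for part in x_forwarded_for.split(",") if part.strip()]
    if len(chain) < trusted_hops:
        return peer_ip  # header shorter than expected: do not trust it
    return chain[-trusted_hops]
```

With one trusted proxy, a client that sends "X-Forwarded-For: 1.2.3.4" arrives as "1.2.3.4, <real address>", and the function returns the real address, so the forged value no longer selects the bucket.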

Medium 3.3: 1-year refresh tokens; JWT secret only warned

src/helpers/token.py:49 (REFRESH=365d) · :38-42 (weak-secret warning, proceeds)

A leaked refresh token grants a year of access, and a weak HS256 secret makes tokens forgeable, which is a full auth bypass. Fix: cut the refresh lifetime to 14–30 days with rotation, and fail hard at startup on a secret shorter than 32 characters.
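
A sketch of the hard-fail check that would replace the current warning. require_strong_secret is a hypothetical name; the 32-character floor is the one recommended above.

```python
MIN_SECRET_LENGTH = 32


def require_strong_secret(secret):
    """Return the secret, or raise instead of warning and continuing."""
    if secret is None or len(secret) < MIN_SECRET_LENGTH:
        raise RuntimeError(
            f"JWT secret must be at least {MIN_SECRET_LENGTH} characters; refusing to start"
        )
    return secret
```

Calling this once at import time of the token module means a misconfigured deploy crashes on boot rather than serving forgeable tokens.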

Medium 3.4: FORCE_ENV & safety flags hardcoded in tracked source

src/utils/config.py — FORCE_ENV, ENABLE_RATE_LIMITING=False, ENABLE_REAL_EMAIL_DELIVERY=False; env from GIT_BRANCH

A single stray commit that pins FORCE_ENV='prod' points a non-prod deploy at the prod DB. Fix: drive the environment and toggles from deploy-time env vars, and assert the prod values in CI.
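
A sketch of loading the environment and toggles from deploy-time variables only. APP_ENV, the "staging" value, and load_settings are assumptions; ENABLE_RATE_LIMITING and ENABLE_REAL_EMAIL_DELIVERY are the flag names from config.py.

```python
import os

VALID_ENVS = frozenset({"dev", "staging", "prod"})  # assumed environment names


def _flag(environ, name, default):
    """Read a boolean toggle from the environment."""
    return environ.get(name, default).strip().lower() in {"1", "true", "yes"}


def load_settings(environ=os.environ):
    """Build runtime settings from deploy-time variables, never from source."""
    env = environ.get("APP_ENV", "").strip().lower()
    if env not in VALID_ENVS:
        raise RuntimeError(f"APP_ENV must be one of {sorted(VALID_ENVS)}, got {env!r}")
    settings = {
        "env": env,
        "rate_limiting": _flag(environ, "ENABLE_RATE_LIMITING", "true"),
        "real_email": _flag(environ, "ENABLE_REAL_EMAIL_DELIVERY", "false"),
    }
    # The same assertion can run in CI against the prod variable set.
    if env == "prod" and not settings["rate_limiting"]:
        raise RuntimeError("rate limiting must be enabled in prod")
    return settings
```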

Medium 3.5: Swallowed exceptions & probable N+1 queries

repo-wide: 28 bare except, 15 except…pass · ~1,300 db.query, ~490 loops in src/business

Failures are swallowed silently, and queries appear to run once per loop iteration with no eager loading, which compounds 2.1. Fix: catch specific exceptions and log them, use selectinload/joinedload, and add query-count assertions.

Medium 3.6: Two contradictory deploy models; dangerous ops scripts

deploy_celery_production.sh (supervisor, --concurrency=4, wrong paths) vs docker-compose via GitHub Actions

The supervisor script sets concurrency=4, which violates the documented concurrency=1 OOM invariant, and it targets the wrong directory; running it OOMs the box. The scripts lack set -euo pipefail, and rollback only guards the web image. Fix: pick one deploy model, harden the scripts, and align on concurrency=1.

Medium 3.7: Root containers, deprecated base, weak proxy creds

Dockerfile (tiangolo base, root, bundles Chrome) · ssl-docker-compose.yml (npm/npm) · compose deploy.resources ignored outside Swarm

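The base image is unmaintained, there is no USER directive so containers run as root, NPM admin is exposed on :81 with npm/npm, and the 4G limits may not be enforced. Fix: move to a multi-stage python:3.10-slim image running as non-root, set strong NPM credentials and bind :81 to localhost only, and use mem_limit/cpus.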

Medium 3.8: Public Swagger in prod; v1/v2 auth duplication; no model versioning

/docs & /openapi.json public (title leaks env+branch) · 7 login implementations · weights versioned by filename date

Public docs hand attackers a map of the API, auth logic drifts across the live v1 and v2 implementations, and there is no way to reproduce a past inference or catch a model regression. Fix: gate docs in prod, consolidate auth, and add a manifest mapping each weight file to its SHA and git ref, plus a golden-image eval.
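
Gating the docs can be a small change at app construction. A sketch, assuming the environment name is available as a string; docs_kwargs is a hypothetical helper, while docs_url, redoc_url, and openapi_url are FastAPI constructor parameters that disable the corresponding routes when set to None.

```python
def docs_kwargs(env):
    """Keyword arguments for FastAPI(...) that hide the API docs in prod."""
    if env == "prod":
        return {"docs_url": None, "redoc_url": None, "openapi_url": None}
    return {}


# Usage (not executed here):
#     app = FastAPI(title="API", **docs_kwargs(settings_env))
```

Keeping the title free of the env and branch name is a separate fix; this only removes the public /docs and /openapi.json routes.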

4 Low / hygiene

No lint/type gate — pre-commit is black/isort/whitespace only; add ruff + mypy.
Dead scaffolding — src/schemas.py is 0 bytes, src/routes.py is a stub.
Over-broad .gitignore — *.json and *.md globally ignored.
Committed runtime dirs — logs/security.log, uploads/example.txt tracked; use .gitkeep.
Vendored matting lib — full 2017 research repo under src/; micro_haircomv_v2/ has no LICENSE/README.
Celery result backend is the prod RDS — extra write load; consider dedicated Redis.
Debug print() in rate-limit hot path — leaks bucket state every request.
First request pays weight-download cost — download + verify at boot, fail fast.

5 What's done well

Passwords hashed with bcrypt (passlib) — correct.
SQL is parameterized via the ORM — no f-string injection found.
No shell=True — subprocess uses arg lists; no command injection.
The OOM problem is genuinely understood — run.py documents the memory math + concurrency invariant.
Background-removal service is the right pattern — caps images at 512px, bounded workers, memory guards.
Security-headers, webhook secret & token-bucket middleware exist — the bones of a real posture are there.
