Codeaza Technologies · Engineering Due Diligence

B2B AI SaaS Platform
Deep Review

An independent health-check of the SOCAi hair-styling computer-vision platform: security, reliability, code health and AI pipeline.
Repository: SOCAi-Labs / B2B-AI-SaaS-Platform
Scope: 676 files · 4 review tracks
Date: 24 June 2026
Executive Briefing · Plain English

The one-minute verdict

What you need to know before reading anything else.

Risk level: High

The product works, but the front door is unlocked.

The platform does what it promises. The problem is safety, not features. Right now the keys to the production servers and the customer database are written down inside the code itself, where anyone with a copy can read them. A few other doors that should be locked are standing open. None of this is hard to fix, but it should be treated as an urgent, this-week priority, because it is exploitable today.

🔐 Security: Critical
🏗️ Code Health: At Risk
☁️ Infrastructure: Critical
🤖 AI Pipeline: At Risk
🧪 Safety Net (Tests): ~ None
Executive Briefing

The 5 things that matter, explained simply

No jargon. Each one is a real risk to the business, with an everyday analogy.

🔑 Passwords left in the code · Severe
It's like taping the keys to your office and the safe combination onto the front door. The live server keys and the customer-database password are written inside the shared code.
Business impact: Anyone who has ever had a copy of the code can access customer data. This is a data-breach risk today.
🚪 The security guard is asleep · Severe
The system that's supposed to check "are you allowed in?" was accidentally switched off for the main part of the app. It only works where each page happens to ask for ID itself.
Business impact: One customer could potentially see another customer's data. A trust-killer for a B2B product.
🐒 The AI restarts itself every time · Reliability
Every single image request makes the AI "boot up" from scratch instead of staying ready. Like restarting your laptop before every email.
Business impact: This is the cause of the crashes and slow loads your team already sees. It also caps how many users you can serve.
🪂 No safety net before going live · Reliability
There are almost no automated tests, and new code ships straight to customers with no checks. Like publishing a book with no proofreader.
Business impact: Bugs reach customers undetected, and every update risks breaking something that worked yesterday.
📦 Heavy clutter in the codebase · Maintainability
Large PowerPoint files and a few giant files of 9,000+ lines are mixed into the code, making it slow to work in and easy to break.
Business impact: Slower development, more bugs, and harder onboarding for new engineers. It quietly taxes every future feature.
✅ The good news · Solid
User passwords are stored correctly (hashed), the database is protected from a common hacking trick, and the team clearly understands its hardest performance problem.
Business impact: The foundations are sound. These are fixable issues, not a rewrite.
Leadership View

Two desks, two decisions

The same findings, framed for the call each of you actually needs to make.

👔 For the CEO

Risk, trust & money

  • Breach exposure is real today. Customer data could leak via the exposed credentials, which is a reputation risk and possibly a legal/contractual risk for a B2B deal.
  • This is not a rewrite. The fixes are days-to-weeks of focused work, not a rebuild. Budget impact is small relative to the risk removed.
  • Reliability is costing you now. The crashes/slow-loads your team reports trace to one fixable issue, and fixing it improves the customer experience directly.
  • One ask: approve a 1-week "stabilise & secure" sprint as the team's top priority before new features.
🛠️ For the CTO

What to action

  • Incident first: rotate the leaked RDS password + EC2 keys, purge them from git history, fix .dockerignore, rebuild images. Today.
  • Close the exploit chain: the auth skip-path bypass, the SSRF/file:// sink, and unverified model weights are individually serious and chainable into full AWS takeover.
  • Stop the bleeding on reliability: cache ML models at startup; move CV work off the async event loop; add a CI test gate before prod deploys.
  • Pay down structural debt: pin dependencies, split the god-files, move binaries out of git, kill FORCE_ENV.
How a breach would actually happen

The kill-chain

Why three "separate" issues are really one path to full compromise.

From a copy of the code → to owning everything

STEP 1: Read the committed keys & DB password from the repo
OR 1b: Abuse the URL fetcher to hit the cloud metadata endpoint & steal the server's identity
→ STEP 2: The auth guard is off for most endpoints, so data is reachable anyway
→ RESULT: Full production database + AWS account takeover
The Plan

Remediation roadmap

Sequenced by urgency. The top two rows remove the real danger.

TODAY · Incident
Rotate the RDS password, EC2 keypairs & NewRelic key. Purge secrets from git history. Fix .dockerignore and rebuild clean images so keys stop shipping inside them.
THIS WEEK · Exploitable
Fix the auth skip-path bypass. Lock down the SSRF / file:// URL fetcher. Add integrity checks to AI model downloads. Restrict CORS to known origins.
SPRINT 1 · Stabilise
Add automated tests + a CI gate before prod deploys. Cache ML models at startup (fixes the crashes/slowness). Move image processing off the event loop. Pin & lock dependencies.
SPRINT 2 · Harden
Move 96 MB of binaries out of git. Drive environment/feature flags from deploy config (kill FORCE_ENV). Tighten rate-limiting and shorten token lifetimes. Unify the two deploy methods.
ONGOING · Health
Split the giant files, ship non-root slim Docker images, add a model registry + quality checks, enable linting/type-checks, consolidate the v1/v2 auth code.
Technical Deep-Dive · For Engineers

Findings with file references

Everything below is concrete and reproducible. Line numbers reflect the reviewed commit.

1 Critical (fix this week)

Critical 1.1: Live secrets committed to git (SSH keys + prod DB password)
EC2/B2B-AI-SaaS.pem · .ppk · EC2/credentials.txt · alembic.ini · newrelic.ini

Real EC2 SSH private keys (-----BEGIN RSA PRIVATE KEY-----), the production RDS password, prod/dev EC2 IPs, and the NewRelic license key are tracked in git. The RDS URL is hardcoded again in alembic.ini. An empty .dockerignore entry + COPY ./ /app bakes them into every shipped image and registry layer.

Impact: Anyone with repo read, any historical clone, or anyone who can pull an image gets SSH + DB access. Full compromise.

Fix now: Rotate RDS password, both EC2 keypairs, NR key today. Purge history with git filter-repo/BFG. Add EC2/, certs/, *.pem, *.ppk, credentials.txt to .gitignore + .dockerignore. Move secrets to AWS SSM; source alembic URL from env.
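
To keep secrets like these from being committed again, a lightweight check can run before each commit or in CI. The following is only a minimal sketch: the pattern names and the `find_secrets` helper are hypothetical, the two regexes cover just the kinds of secret named in this finding (PEM private-key headers and passwords embedded in connection URLs), and a dedicated secret scanner would catch far more.

```python
import re

# Hypothetical patterns covering only the secret types named in finding 1.1.
PATTERNS = {
    "private-key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "url-password": re.compile(r"[a-z][a-z0-9+]*://[^/\s:@]+:[^/\s@]+@"),
}


def find_secrets(text):
    """Return (line_number, kind) for every line that looks like a secret."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, kind))
    return hits
```

A pre-commit hook or CI job could call `find_secrets` on each staged file and fail when the list is non-empty. It complements, and does not replace, rotating the leaked credentials and purging history.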

Critical 1.2: Auth middleware is a no-op for the whole /business/* namespace
src/middleware/auth_middleware.py:35 (skip list) · :92-94 (startswith match)

The skip-path list contains the bare prefix "/business/", matched by path.startswith(skip_path). Nearly every business endpoint short-circuits to call_next before any token check. Auth now relies on each route remembering to declare Depends(JWTBearer()).

Impact: Multi-tenant data exposure on any under-protected endpoint; the assumed-global control is silently off.

Fix: Exact-match static public paths; allow-list dynamic ones via regex/set. Never put a namespace root in a startswith skip list. Audit every /business/* route for an explicit auth dependency.
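
The difference between prefix matching and exact matching can be sketched as follows. This is an illustration, not the project's middleware: the public paths and the invite-link pattern are hypothetical placeholders, and `buggy_is_public` only mimics the startswith behaviour described above.

```python
import re

# Hypothetical public routes; the real list must come from a route audit.
PUBLIC_EXACT = frozenset({"/health", "/business/login"})
PUBLIC_PATTERNS = (re.compile(r"/business/invite/[A-Za-z0-9_-]+"),)


def buggy_is_public(path, skip_paths=("/business/",)):
    """Mimics the reported bug: a namespace root in a startswith skip list."""
    return any(path.startswith(prefix) for prefix in skip_paths)


def is_public_path(path):
    """Exact match for static paths, full-string regex match for dynamic ones."""
    if path in PUBLIC_EXACT:
        return True
    return any(pattern.fullmatch(path) for pattern in PUBLIC_PATTERNS)
```

With the prefix rule, `/business/patients/42` is treated as public. With the exact-match rule it is not, so the request reaches the token check.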

Critical 1.3: SSRF + local-file disclosure in URL fetcher
src/utils/image_utils.py:171-204 · callers in image_processing.py:415, study_group_patients.py

Fetches arbitrary user-supplied URLs with requests.get(url), with no scheme/host allow-list, and explicitly handles file:// to read local disk. A bare except: pass hides abuse.

Impact: file:///etc/passwd or the committed keys can be read, and http://169.254.169.254/... (EC2 IMDS) lets an attacker steal the IAM role and obtain AWS creds. Chains with 1.1.

Fix: Reject non-https schemes, drop file://, validate host against an allow-list, block RFC1918/link-local/metadata IPs, set timeouts + size caps. Ideally fetch only from your own S3/CloudFront.
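
A validation step along these lines could run before any fetch. It is a sketch under stated assumptions: `ALLOWED_HOSTS` and `validate_fetch_url` are hypothetical names, the host shown is a placeholder, and the `resolve` parameter exists only so the DNS lookup can be replaced in tests.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical allow-list; ideally only the platform's own S3/CloudFront hosts.
ALLOWED_HOSTS = frozenset({"cdn.example.com"})


def validate_fetch_url(url, resolve=socket.getaddrinfo):
    """Raise ValueError unless url is https, allow-listed and publicly routable."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    host = parts.hostname  # already lower-cased by urlsplit
    if host is None or host not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {host!r}")
    for info in resolve(host, parts.port or 443, proto=socket.IPPROTO_TCP):
        address = ipaddress.ip_address(info[4][0])
        # is_global is False for RFC1918, loopback and link-local (IMDS) ranges.
        if not address.is_global:
            raise ValueError(f"{host} resolves to non-public address {address}")
    return url
```

The check alone does not close every gap. The fetch itself should still disable redirects, set a timeout and cap the response size, and a hostname can resolve differently between validation and fetch.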

Critical 1.4: Model weights loaded with no integrity check → pickle RCE
micro_haircomv_v2/micro_haircomb/download_weights.py · segmentation.py:113 · follicle.py:36

Weights downloaded from S3 are verified only by os.path.exists(), with no checksum/size check, then loaded via YOLO(...), which runs torch.load on a pickle (arbitrary code execution). Truncated downloads are cached as valid.

Impact: Any S3 write access to the model bucket → RCE in production; partial downloads → silent model corruption.

Fix: Ship a SHA256 manifest; download to temp, verify, atomic-rename. Lock bucket writes to a CI role only.
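
The download-to-temp, verify, atomic-rename sequence might look like the sketch below. `install_verified` is a hypothetical helper. It assumes the expected digest comes from a manifest shipped with the code, and that the caller supplies the downloaded bytes as an iterable of chunks.

```python
import hashlib
import os
import tempfile


def install_verified(data_chunks, dest_path, expected_sha256):
    """Write chunks to a temp file, verify SHA-256, then move into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    digest = hashlib.sha256()
    # Same directory as the destination so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in data_chunks:
                digest.update(chunk)
                tmp.write(chunk)
        if digest.hexdigest() != expected_sha256.lower():
            raise ValueError("checksum mismatch; refusing to install weights")
        os.replace(tmp_path, dest_path)  # atomic rename on the same filesystem
    except BaseException:
        # Never leave a partial or unverified file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

Because the destination only ever appears after verification, an os.path.exists() check on it becomes meaningful again: a truncated or tampered download never reaches the cached path.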

2 High priority

High 2.1: Heavy CV work runs on the async event loop
business_functions/optimized_bg_process.py · patient_image_set_v2.py · image_processing.py

OpenCV, PIL saves, synchronous requests.get and time.sleep run inside async def handlers (453 of them), with only ~16 run_in_executor/to_thread calls in the tree. A 10s blocking call freezes the whole worker.

Fix: Offload CV/IO to asyncio.to_thread/executor or Celery; or make endpoints plain def. Use httpx.AsyncClient; never time.sleep in async.
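
A small demonstration of the offloading pattern, assuming Python 3.9+ for asyncio.to_thread. `blocking_cv_step` is a hypothetical stand-in for the OpenCV/PIL work, and the heartbeat task exists only to show that the loop keeps running while the blocking call is in flight.

```python
import asyncio
import time


def blocking_cv_step(seconds):
    """Hypothetical stand-in for synchronous OpenCV/PIL/requests work."""
    time.sleep(seconds)
    return "done"


async def handler(seconds):
    # The blocking call runs in a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_cv_step, seconds)


async def demo():
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    result = await handler(0.2)
    beat.cancel()
    return result, ticks
```

If `handler` called `blocking_cv_step` directly, the heartbeat would record no ticks for the whole 0.2 seconds. Threads help most when the work releases the GIL or waits on I/O; CPU-heavy pure-Python steps may still be better served by a process pool or a Celery worker.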

High 2.2: YOLO models reloaded from disk on every inference (CPU)
micro_haircomv_v2/micro_haircomb/segmentation.py:113 · follicle.py:36 · run.py

model = YOLO(get_model_path(...)) sits inside the per-request function: there is no singleton/cache, and weights are even re-fetched. Large x-variant models run on CPU while ~5GB of unused nvidia-cu12 wheels ship. This is the documented root cause of the 500–800MB/request peaks and OOM kills.

Fix: Load each model once at startup into a module-level singleton / lru_cache. Move inference to a GPU Celery worker; install CPU-only torch on the API image.
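
The caching idea can be sketched with functools.lru_cache. Everything here is illustrative: `_load_from_disk` stands in for the expensive YOLO(...) construction, and `get_model`, `segment` and the weights filename are hypothetical names.

```python
from functools import lru_cache

load_calls = []  # records every real load so the effect is visible


def _load_from_disk(path):
    """Hypothetical stand-in for the expensive YOLO(path) construction."""
    load_calls.append(path)
    return {"weights": path}


@lru_cache(maxsize=None)
def get_model(path):
    """Load each weights file once per process and reuse the instance."""
    return _load_from_disk(path)


def segment(image, path="segmentation.pt"):
    model = get_model(path)  # cache hit on every call after the first
    return model["weights"], image
```

Calling `get_model` for each weights file during application startup moves the load cost out of the first request. Whether a shared model instance is safe for concurrent inference is something to confirm for the specific model library before relying on it.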

High 2.3: No tests; CI deploys straight to prod with no gate
.github/workflows/main-ci.yml · no tests/ dir · pre-commit = black/isort only

The only test files are a rate-limit script and the vendored matting lib. CI on push to main SSHes into prod and swaps containers with no test/lint/type/build gate and no approval. The "blue-green" deploy actually runs down then up, which leaves a multi-minute 502 window per deploy.

Fix: Add pytest + a required test/lint/mypy job before deploy; gate prod behind a GitHub Environment with reviewers; deploy on tag/release. Implement real parallel-stack cut-over.

High 2.4: Unpinned, un-hashed dependencies
requirements.in (torch, ultralytics, opencv, numpy… unpinned) · requirements.txt (no --generate-hashes)

Almost every direct dep is unpinned, so a re-lock silently pulls new torch/ultralytics/numpy, changing model numerics or pulling in CVEs. No wheel hashes means no build-time integrity. There are two conflicting OpenCV builds, and rembg[cli] drags the gradio UI stack into a UI-less API.

Fix: Pin every direct dep; regenerate with --generate-hashes, build with --require-hashes. Drop [cli], standardize on opencv-python-headless, split GPU/audio deps into a worker file.
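
A CI job could enforce the pinning rule with a check like this one. It is a rough sketch: `unpinned` is a hypothetical helper, it only recognises simple `name==version` lines (with optional extras), and it deliberately skips pip option lines rather than parsing the full requirements grammar.

```python
import re

# Matches "name==version" or "name[extra]==version" at the start of a line.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned(lines):
    """Return requirement lines that are not pinned with ==."""
    offenders = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # blank line, comment, or a pip option such as -r
        if not PINNED.match(line):
            offenders.append(line)
    return offenders
```

Running it over requirements.in and failing the build when the result is non-empty keeps new unpinned entries from slipping in; hash checking at install time remains a separate control.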

High 2.5: 96 MB binaries in git; god-files everywhere
src/admin/admin_scribble_tool/templates/*.pptx (96MB) · src/models.py (9,321 lines, 228 classes) · business/routes.py (10,599 lines)

Three .pptx (44/32/20 MB) + 48 PNGs are raw git blobs bloating history (.git is 80M). models.py is a 364KB ORM monolith; four route/function files total ~49K lines.

Fix: Move binaries to S3/Git LFS and purge history. Split models.py into a domain package; decompose routes into per-resource APIRouter modules.

3 Medium

Medium 3.1: CORS wildcard with credentials
src/main.py:94-101 & :173-180: allow_origins=["*"] + allow_credentials=True

Starlette reflects the request Origin when credentials are on, so every origin is effectively allowed. This enables cross-tenant data theft via a victim's browser.

Fix: use an explicit origin allow-list.

Medium 3.2: Rate limiter trusts X-Forwarded-For & fails open
rate_limit_identifier.py:62-69 · token_bucket.py:63 (allow on Redis error, non-atomic)

A forged X-Forwarded-For header gets a fresh bucket per request, which permits brute-forcing login/OTP/reset. A Redis outage means unlimited requests.

Fix: trusted-proxy hop count + atomic Lua bucket + fail-closed on auth.
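
The trusted-proxy hop count can be illustrated as below. This is a sketch, not the project's identifier code: `client_ip` is a hypothetical helper, and it assumes each trusted proxy appends the address it received the connection from to X-Forwarded-For, so only the rightmost entries are trustworthy.

```python
def client_ip(peer_ip, x_forwarded_for, trusted_hops=1):
    """Pick the rate-limit key using a fixed number of trusted proxies.

    peer_ip is the socket peer (the nearest proxy when one is in front).
    """
    if trusted_hops <= 0 or not x_forwarded_for:
        return peer_ip
    chain = [part.strip() for part in x_forwarded_for.split(",") if part.strip()]
    chain.append(peer_ip)
    # The last `trusted_hops` entries are our own proxies; the entry just
    # before them is the client as seen by the outermost trusted proxy.
    index = len(chain) - 1 - trusted_hops
    if index < 0:
        return peer_ip  # shorter chain than expected: do not trust the header
    return chain[index]
```

Anything the caller places at the left of the header is ignored, so forging X-Forwarded-For no longer yields a fresh bucket. The hop count has to match the real deployment topology exactly.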

Medium 3.3: 1-year refresh tokens; weak JWT secret only warned about
src/helpers/token.py:49 (REFRESH=365d) · :38-42 (weak-secret warning, proceeds)

A leaked refresh token gives a year of access; a weak HS256 secret makes tokens forgeable, which is a full auth bypass.

Fix: 14–30d refresh lifetime + rotation; hard-fail on a <32-char secret.

Medium 3.4: FORCE_ENV & safety flags hardcoded in tracked source
src/utils/config.py: FORCE_ENV, ENABLE_RATE_LIMITING=False, ENABLE_REAL_EMAIL_DELIVERY=False; env from GIT_BRANCH

One stray commit pinning FORCE_ENV='prod' points a non-prod deploy at the prod DB.

Fix: drive env + toggles from deploy env vars; assert prod values in CI.
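
Driving the environment and toggles from deploy-time variables could look like the sketch below. The `APP_ENV` variable, the environment names and the `Settings` class are hypothetical. The rule that prod requires both flags on is an assumption made for illustration, not something the reviewed code states.

```python
from dataclasses import dataclass

ALLOWED_ENVS = ("dev", "staging", "prod")  # hypothetical environment names


@dataclass(frozen=True)
class Settings:
    env: str
    rate_limiting: bool
    real_email: bool


def _flag(environ, name, default):
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def load_settings(environ):
    """Build settings from deploy env vars; nothing is read from tracked source."""
    env = environ.get("APP_ENV", "")
    if env not in ALLOWED_ENVS:
        raise RuntimeError(f"APP_ENV must be one of {ALLOWED_ENVS}, got {env!r}")
    settings = Settings(
        env=env,
        rate_limiting=_flag(environ, "ENABLE_RATE_LIMITING", True),
        real_email=_flag(environ, "ENABLE_REAL_EMAIL_DELIVERY", False),
    )
    # Assumed invariant: prod must never start with its safety flags off.
    if env == "prod" and not (settings.rate_limiting and settings.real_email):
        raise RuntimeError("prod requires rate limiting and real email delivery")
    return settings
```

Calling `load_settings(os.environ)` at startup, and again in a CI step against the production variable set, turns a mis-set flag into a failed deploy instead of a silent misconfiguration.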

Medium 3.5: Swallowed exceptions & probable N+1 queries
repo-wide: 28 bare except, 15 except…pass · ~1,300 db.query, ~490 loops in src/business

Silent failures plus query-per-iteration with no eager loading, compounding 2.1.

Fix: specific excepts + logging; selectinload/joinedload; query-count assertions.

Medium 3.6: Two contradictory deploy models; dangerous ops scripts
deploy_celery_production.sh (supervisor, --concurrency=4, wrong paths) vs docker-compose via GitHub Actions

The supervisor script sets concurrency=4, violating the documented concurrency=1 OOM invariant, and targets the wrong dir; running it OOMs the box. Scripts lack set -euo pipefail; rollback only guards the web image.

Fix: pick one model; harden scripts; align on concurrency=1.

Medium 3.7: Root containers, deprecated base, weak proxy creds
Dockerfile (tiangolo base, root, bundles Chrome) · ssl-docker-compose.yml (npm/npm) · compose deploy.resources ignored outside Swarm

Unmaintained base image, no USER directive, NPM admin exposed on :81 with npm/npm, and the 4G limits may not be enforced.

Fix: python:3.10-slim multi-stage non-root image; strong NPM creds + localhost-only :81; mem_limit/cpus.

Medium 3.8: Public Swagger in prod; v1/v2 auth duplication; no model versioning
/docs & /openapi.json public (title leaks env+branch) · 7 login implementations · weights versioned by filename date

The API map is handed to attackers; auth logic drifts across live v1/v2; there is no way to reproduce a past inference or catch a model regression.

Fix: gate docs in prod; consolidate auth; weight→sha→git-ref manifest + golden-image eval.

4 Low / hygiene

5 What's done well

  • Passwords hashed with bcrypt (passlib): correct.
  • SQL is parameterized via the ORM: no f-string injection found.
  • No shell=True: subprocess uses arg lists; no command injection.
  • The OOM problem is genuinely understood: run.py documents the memory math + concurrency invariant.
  • Background-removal service is the right pattern: caps images at 512px, bounded workers, memory guards.
  • Security-headers, webhook secret & token-bucket middleware exist: the bones of a real posture are there.