Gap Analysis (v1 → v2)

What was broken in v1, why, and how v2 fixes it. Each row maps a defect to the v2 change that closes it.

| # | v1 defect | Root cause | v2 fix |
|---|---|---|---|
| 1 | Massive false-positive rate | patterns.json matched variable NAMES (password, key, id), not vendor value-shapes | rules.yaml with per-vendor value-shape regex + entropy + allowlist |
| 2 | Missed private keys (.ssh/id_rsa, *.pem) | Detection regex required name = value shape; PEM blocks have no shape | private-key-block rule matches PEM anchor line directly |
| 3 | Missed Docker registry credentials | Same as #2: JSON-nested value not "named" | docker-registry-auth rule keys on "auth": token |
| 4 | Random categorization | Same key appeared in multiple categories in patterns.json; last-loaded won | Each rule has a stable unique id; severity replaces category |
| 5 | Double-scan in DAST mode | crawler.scan_for_secrets AND scanner.scan_remote_file both fired | Single crawl path, scanning is centralized in engine.scanner.scan_text |
| 6 | core/crawler-clean.py dead duplicate | Two parallel codebases | Deleted; new engine/crawler.py is single source |
| 7 | SAST opened binary files | No extension / binary check | NUL-byte heuristic + skip-dir + max-size + glob exclude |
| 8 | except Exception: pass swallowed errors | Hid coverage holes | Replaced with logging.debug calls |
| 9 | Status 301/302 dead branch | requests.get follows redirects by default | Code path simplified, no longer claims to handle redirects it never sees |
| 10 | release.yml referenced src/main.py (doesn't exist) | Stale path | Rewritten: builds wheel + Windows EXE + Linux EXE + Docker image |
| 11 | fpm -s python without setup.py | No packaging | pyproject.toml ships, pip install . works, wheel built in CI |
| 12 | wordlist.txt grew forever | Appended every run | Removed mutating wordlist write; new crawler doesn't write side files |
| 13 | requirements.txt lied (argparse, python-docx, fpdf v1) | Drift | Replaced with accurate, pinned deps (fpdf2, pyahocorasick, tldextract, PyYAML) |
| 14 | Default python-requests UA blocked by WAFs | No UA customization | --user-agent flag, sensible default UA |
| 15 | Single-threaded crawler | Recursive requests.get | ThreadPoolExecutor with --threads (default 16) |
| 16 | KeyboardInterrupt lost results | exit(0) mid-scan | Crawler catches RequestException; CLI traps SIGINT cleanly |
| 17 | Config paths relative to CWD | Hardcoded relative paths | Path(__file__).resolve().parent.parent / "config" |
| 18 | No deduplication | Same line could emit multiple times | Dedup keyed on (file, line, sha256(secret), rule_id) |
| 19 | No exit code semantics | Always exited 0 | --fail-on flag; exit 1 if any finding meets threshold |
| 20 | No max-depth / max-URLs / scope control | Crawler could run forever | --max-urls, --max-depth, eTLD+1 scope (default) or --strict-host |
| 21 | Value char-class allowed (){} | Matched CSS / JS blocks | Per-rule regex; no broad char-class |
| 22 | Duplicate keywords in patterns.json | Manual list maintenance | Each rule has unique id; rule loader warns on duplicates |
| 23 | python-docx import never used | Stale requirement | Removed |
| 24 | PDF crashed on non-ASCII | fpdf v1 default fonts are Latin-1 | fpdf2 + safe-encode helper |
| 25 | HTML report was raw df.to_html() | No design | New HTML v2: severity pills, sortable columns, live filter, dark theme |
| 26 | No SARIF output | Couldn't go into GitHub Code Scanning | SARIF 2.1.0 reporter |
| 27 | No JSONL output | Couldn't stream into SOAR/SIEM | JSONL reporter |
| 28 | No CLI flags for proxy / cookie / header | Couldn't test authenticated apps | --cookie, --header K:V, --proxy, --insecure |
| 29 | No verification | Findings were hypotheses | --verify runs vendor probes; finding gets verified=true or verified=false |
| 30 | No JS source-map parsing | .js.map files ignored | engine/sourcemap.py extracts sourcesContent[] and scans each |
| 31 | No JS endpoint extraction | Only HTML link extraction | crawler.py regex-extracts string literals from JS |
| 32 | No HTTP header scan | Body-only | Each Header: Value line scanned through the same engine |
| 33 | No package install path | git clone + python main.py | pip install scan4secrets, console entry, Docker image |
| 34 | No pre-commit hook | Couldn't be gated | .pre-commit-hooks.yaml ships |
| 35 | README oversold "400+ rules" | Marketing vs reality | Honest count + a comparison table vs gitleaks/trufflehog/detect-secrets |
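
Row 18 describes deduplication keyed on (file, line, sha256(secret), rule_id). The sketch below shows one way such a key could work. The `Finding` type and the `dedupe` helper are hypothetical names chosen for illustration and are not taken from the v2 codebase.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple, Tuple


class Finding(NamedTuple):
    """Hypothetical finding record; the real v2 structure may differ."""
    file: str
    line: int
    secret: str
    rule_id: str


def dedup_key(finding: Finding) -> Tuple[str, int, str, str]:
    # Hash the secret so the raw value is never stored in the seen-set.
    digest = hashlib.sha256(finding.secret.encode("utf-8")).hexdigest()
    return (finding.file, finding.line, digest, finding.rule_id)


def dedupe(findings: Iterable[Finding]) -> Iterator[Finding]:
    """Yield each finding once, keeping the first occurrence of every key."""
    seen = set()
    for finding in findings:
        key = dedup_key(finding)
        if key not in seen:
            seen.add(key)
            yield finding
```

Because rule_id is part of the key, two different rules that match the same secret on the same line are both reported, while a repeat hit from the same rule is dropped.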

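Row 30 says engine/sourcemap.py extracts sourcesContent[] and scans each entry. As a rough sketch of that idea, the function below pulls the embedded sources out of a source map using the standard `sources` and `sourcesContent` fields. The function name is hypothetical, and the actual module may be structured differently.

```python
import json
from typing import Iterator, Tuple


def iter_embedded_sources(map_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, source_text) pairs embedded in a .js.map file."""
    data = json.loads(map_text)
    sources = data.get("sources") or []
    contents = data.get("sourcesContent") or []
    for index, content in enumerate(contents):
        # Entries may be null when a source was not embedded; skip those.
        if not isinstance(content, str):
            continue
        name = sources[index] if index < len(sources) else None
        if not isinstance(name, str):
            name = f"<source {index}>"
        yield name, content
```

Each yielded text could then be handed to the same scanning entry point used for any other file, with the source name reported as the file path.
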
What's still on the roadmap (v2.1+)

These are intentionally deferred from v2.0:

  • Git history scan (--git, --since) — iterate every blob across all branches; biggest real-world secret source
  • OpenAPI / GraphQL / Swagger ingestion for DAST seed URLs
  • Wayback / common-crawl URL source for archived endpoints
  • --gitleaks-import flag to re-verify + re-format an existing gitleaks JSON report
  • Tech-stack-aware wordlist selection (currently loads all wordlists indiscriminately)
  • Sitemap.xml / robots.txt ingestion as DAST seeds
  • Smart 404 detection (sites returning 200 with a "Not Found" body)
  • Process-pool SAST for very large monorepos
  • Diff / baseline mode (--baseline previous.json)
  • Test fixtures + pytest CI (skeleton is in tests/, fixtures need population)

How v2 compares operationally

| Dimension | v1 | v2 |
|---|---|---|
| Time to first useful run | git clone && pip install -r req && python main.py --path X | pip install scan4secrets && scan4secrets --path X |
| FP rate on clean repo | ~13% per file | 0% per file (empirical) |
| Detects private keys | No | Yes |
| Detects Docker registry creds | No | Yes |
| Authenticated DAST | No | Yes (cookie, header, proxy) |
| JS source maps | No | Yes |
| CI-native output | No | SARIF, JSONL, exit codes |
| Live verification | No | Yes (4 vendors built-in, schema for more) |
| Lines of detection engine code | ~300 lines, broken | ~400 lines, tested |
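
The exit-code behaviour behind the "CI-native output" row (exit 1 if any finding meets the --fail-on threshold) can be sketched as follows. The severity names and their ordering are assumptions made for illustration; this document does not list the actual levels, and `exit_code` is a hypothetical helper.

```python
from typing import Iterable, Optional

# Assumed severity scale, lowest to highest; the real levels may differ.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def exit_code(severities: Iterable[str], fail_on: Optional[str]) -> int:
    """Return 1 if any finding is at or above the threshold, else 0."""
    if fail_on is None:
        # No threshold given: never fail the run on findings alone.
        return 0
    threshold = SEVERITY_ORDER.index(fail_on)
    for severity in severities:
        if SEVERITY_ORDER.index(severity) >= threshold:
            return 1
    return 0
```

A CI job would pass the result to `sys.exit`, so the pipeline step fails only when a finding at or above the chosen severity is present.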