Skip to main content

Architecture

How scan4secrets is organized, how data flows through it, and where to extend it.

Package layout

scan4secrets/
├── scan4secrets/ # the package
│ ├── __init__.py # version
│ ├── __main__.py # `python -m scan4secrets`
│ ├── cli.py # argparse, orchestration
│ ├── config/
│ │ ├── rules.yaml # the detection rules (YAML)
│ │ ├── extensions.json # extensions to gate DAST URL discovery
│ │ └── wordlist/ # path-guess wordlists for DAST
│ ├── engine/
│ │ ├── findings.py # Finding dataclass, severity, redact
│ │ ├── entropy.py # Shannon entropy
│ │ ├── rules.py # Rule loading, Aho-Corasick keyword index
│ │ ├── scanner.py # SAST: walks filesystem, applies rules
│ │ ├── crawler.py # DAST: concurrent web crawl, source-maps, JS endpoints
│ │ ├── sourcemap.py # parses .js.map sourcesContent
│ │ ├── wordlists.py # load path-guess wordlists for DAST seeding
│ │ └── verifier.py # live vendor-API verification
│ └── reporters/
│ ├── __init__.py # writer registry
│ ├── sarif.py # SARIF 2.1.0
│ ├── json_.py # pretty JSON
│ ├── jsonl.py # one finding per line
│ ├── csv_.py # CSV
│ ├── html.py # sortable filterable HTML
│ ├── excel.py # XLSX via openpyxl
│ └── pdf.py # PDF via fpdf2 (UTF-8 safe)
├── docs/ # this directory
├── tests/ # pytest + planted-secret fixtures
├── pyproject.toml # build / install / console entry
├── Dockerfile # container image
├── .pre-commit-hooks.yaml # pre-commit framework integration
├── .github/workflows/ # CI + release pipelines
├── main.py # backward-compat shim → cli.main()
└── README.md

Data flow

┌──────────────────────┐ ┌──────────────────────┐
│ User CLI invocation │ │ Custom rules.yaml │
└──────────┬───────────┘ └─────────┬────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────┐
│ cli.main(): parse args, load_rules(), filter, build CLI │
└────────────────────┬────────────────────────────────────┘

┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌──────────┐
│ stdin │ │ SAST │ │ DAST │
│ scan │ │ walk + │ │ crawl + │
│ text │ │ scan │ │ scan │
└───┬────┘ └───┬────┘ └────┬─────┘
│ │ │
└─────────────┼──────────────┘

┌──────────────────────┐
│ List[Finding] (dedup) │
└──────────┬───────────┘

┌──────────┴────────────┐
│ if --verify: verifier │ ──► live HTTP probes to vendor APIs
└──────────┬────────────┘ (sets Finding.verified)


┌──────────────────────┐
│ reporters.write_* │ ──► sarif / json / jsonl / csv / html / excel / pdf
└──────────┬───────────┘


exit-code gate
(0 / 1 based on --fail-on)

Engine model

A rule is (id, severity, keywords, regex, entropy_min, allowlist, verify). The keyword pre-filter (Aho-Corasick) is what makes the engine fast on large repos — for each line we identify only the rules whose keywords appear, then run only those regexes. Without the pre-filter, scanning a large repo with 100+ rules is O(N×R) for every line; with it, ~O(N).

The pipeline per line:

line ──► keyword index ──► candidate rules ──► regex.finditer ──► captured value


shannon_entropy

≥ entropy_min ?


allowlist.line ?
allowlist.path ?


emit Finding

Verification model

The verify: block on a rule is opt-in (--verify flag at runtime). For each finding whose rule has verify, the verifier sends one HTTP request:

headers[v.header_name] = v.header_value.replace("{{value}}", finding.secret)
r = requests.request(v.method, v.url, headers=headers, timeout=5, allow_redirects=False)
finding.verified = (r.status_code == v.success_status)

Each verification is concurrent (ThreadPoolExecutor, default 8 workers). Verification calls are issued AFTER scanning so a non-network scan is unaffected.

A verified finding is incident-grade evidence; an unverified one is a hypothesis. The two are visually distinct in the HTML report and tagged in SARIF properties so downstream triage tools can prioritize.

Extension points

Add a...File to edit
New detection rulescan4secrets/config/rules.yaml (append; no code change)
New vendor verifierverify: block on the new rule (no code change)
New reporter (e.g. Markdown)scan4secrets/reporters/<name>.py + register in reporters/__init__.py
New CLI flagscan4secrets/cli.py (_parser() then main())
Custom keyword index backendscan4secrets/engine/rules.py (the KeywordIndex class)
New URL discovery source (sitemap, openapi)scan4secrets/engine/crawler.py (extra_seeds parameter)
New path-guess wordlistdrop *.txt under scan4secrets/config/wordlist/

Performance notes

  • Aho-Corasick keyword pre-filter: 100+ rules → ~5x faster than naive per-rule regex scanning.
  • Binary skip: files with NUL byte in first 4096 bytes skipped (no garbage matches, faster).
  • Max file size: 10 MB default (skips minified bundles, build artifacts).
  • Max line length: 4096 chars (long minified lines truncated; prevents regex backtracking pathology).
  • Crawler concurrency: ThreadPoolExecutor with --threads (default 16); use 32-64 for fast targets.
  • Verifier concurrency: 8 workers; each probe ≤ 5s timeout.
  • Skipped directories by default: .git, node_modules, vendor, dist, build, .venv, __pycache__, .next, .nuxt, .tox, .cache, .mypy_cache, .pytest_cache, bower_components, coverage, .idea, .vscode.

Why YAML rules, not TOML or code

  • YAML round-trips edits cleanly in PRs.
  • Human-readable for non-Python contributors.
  • Same shape can be loaded by tests, distributed via package data, or referenced by --rules custom.yaml from outside the repo.