Architecture

How scan4secrets is organized, how data flows through it, and where to extend it.

Package layout

scan4secrets/
├── scan4secrets/                  # the package
│   ├── __init__.py                # version
│   ├── __main__.py                # `python -m scan4secrets`
│   ├── cli.py                     # argparse, orchestration
│   ├── config/
│   │   ├── rules.yaml             # the detection rules (YAML)
│   │   ├── extensions.json        # extensions to gate DAST URL discovery
│   │   └── wordlist/              # path-guess wordlists for DAST
│   ├── engine/
│   │   ├── findings.py            # Finding dataclass, severity, redact
│   │   ├── entropy.py             # Shannon entropy
│   │   ├── rules.py               # Rule loading, Aho-Corasick keyword index
│   │   ├── scanner.py             # SAST: walks filesystem, applies rules
│   │   ├── crawler.py             # DAST: concurrent web crawl, source-maps, JS endpoints
│   │   ├── sourcemap.py           # parses .js.map sourcesContent
│   │   ├── wordlists.py           # load path-guess wordlists for DAST seeding
│   │   └── verifier.py            # live vendor-API verification
│   └── reporters/
│       ├── __init__.py            # writer registry
│       ├── sarif.py               # SARIF 2.1.0
│       ├── json_.py               # pretty JSON
│       ├── jsonl.py               # one finding per line
│       ├── csv_.py                # CSV
│       ├── html.py                # sortable filterable HTML
│       ├── excel.py               # XLSX via openpyxl
│       └── pdf.py                 # PDF via fpdf2 (UTF-8 safe)
├── docs/                          # this directory
├── tests/                         # pytest + planted-secret fixtures
├── pyproject.toml                 # build / install / console entry
├── Dockerfile                     # container image
├── .pre-commit-hooks.yaml         # pre-commit framework integration
├── .github/workflows/             # CI + release pipelines
├── main.py                        # backward-compat shim → cli.main()
└── README.md

Data flow

┌──────────────────────┐                          ┌──────────────────────┐
│ User CLI invocation  │                          │  Custom rules.yaml   │
└──────────┬───────────┘                          └─────────┬────────────┘
           │                                                │
           ▼                                                ▼
     ┌─────────────────────────────────────────────────────────┐
     │ cli.main(): parse args, load_rules(), filter, build CLI │
     └────────────────────┬────────────────────────────────────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
            ▼             ▼             ▼
       ┌────────┐    ┌────────┐    ┌──────────┐
       │ stdin  │    │  SAST  │    │   DAST   │
       │ scan   │    │ walk + │    │  crawl + │
       │ text   │    │  scan  │    │  scan    │
       └───┬────┘    └───┬────┘    └────┬─────┘
           │             │              │
           └─────────────┼──────────────┘
                         ▼
              ┌──────────────────────┐
              │ List[Finding] (dedup) │
              └──────────┬───────────┘
                         │
              ┌──────────┴────────────┐
              │ if --verify: verifier  │ ──► live HTTP probes to vendor APIs
              └──────────┬────────────┘     (sets Finding.verified)
                         │
                         ▼
              ┌──────────────────────┐
              │  reporters.write_*   │ ──► sarif / json / jsonl / csv / html / excel / pdf
              └──────────┬───────────┘
                         │
                         ▼
                  exit-code gate
              (0 / 1 based on --fail-on)

Engine model

A rule is (id, severity, keywords, regex, entropy_min, allowlist, verify). The keyword pre-filter (Aho-Corasick) is what makes the engine fast on large repos — for each line we identify only the rules whose keywords appear, then run only those regexes. Without the pre-filter, scanning a large repo with 100+ rules is O(N×R) for every line; with it, ~O(N).

The pipeline per line:

line ──► keyword index ──► candidate rules ──► regex.finditer ──► captured value
                                                                       │
                                                                       ▼
                                                              shannon_entropy
                                                                       │
                                                              ≥ entropy_min ?
                                                                       │
                                                                       ▼
                                                              allowlist.line ?
                                                              allowlist.path ?
                                                                       │
                                                                       ▼
                                                                 emit Finding

Verification model

The verify: block on a rule is opt-in (--verify flag at runtime). For each finding whose rule has verify, the verifier sends one HTTP request:

headers[v.header_name] = v.header_value.replace("{{value}}", finding.secret)
r = requests.request(v.method, v.url, headers=headers, timeout=5, allow_redirects=False)
finding.verified = (r.status_code == v.success_status)

Each verification is concurrent (ThreadPoolExecutor, default 8 workers). Verification calls are issued AFTER scanning so a non-network scan is unaffected.

A verified finding is incident-grade evidence; an unverified one is a hypothesis. The two are visually distinct in the HTML report and tagged in SARIF properties so downstream triage tools can prioritize.

Extension points

Add a...	File to edit
New detection rule	`scan4secrets/config/rules.yaml` (append; no code change)
New vendor verifier	`verify:` block on the new rule (no code change)
New reporter (e.g. Markdown)	`scan4secrets/reporters/<name>.py` + register in `reporters/__init__.py`
New CLI flag	`scan4secrets/cli.py` (`_parser()` then `main()`)
Custom keyword index backend	`scan4secrets/engine/rules.py` (the `KeywordIndex` class)
New URL discovery source (sitemap, openapi)	`scan4secrets/engine/crawler.py` (`extra_seeds` parameter)
New path-guess wordlist	drop `*.txt` under `scan4secrets/config/wordlist/`

Performance notes

Aho-Corasick keyword pre-filter: 100+ rules → ~5x faster than naive per-rule regex scanning.
Binary skip: files with NUL byte in first 4096 bytes skipped (no garbage matches, faster).
Max file size: 10 MB default (skips minified bundles, build artifacts).
Max line length: 4096 chars (long minified lines truncated; prevents regex backtracking pathology).
Crawler concurrency: ThreadPoolExecutor with --threads (default 16); use 32-64 for fast targets.
Verifier concurrency: 8 workers; each probe ≤ 5s timeout.
Skipped directories by default: .git, node_modules, vendor, dist, build, .venv, __pycache__, .next, .nuxt, .tox, .cache, .mypy_cache, .pytest_cache, bower_components, coverage, .idea, .vscode.

Why YAML rules, not TOML or code

YAML round-trips edits cleanly in PRs.
Human-readable for non-Python contributors.
Same shape can be loaded by tests, distributed via package data, or referenced by --rules custom.yaml from outside the repo.

Package layout​

Data flow​

Engine model​

Verification model​

Extension points​

Performance notes​

Why YAML rules, not TOML or code​