VIPERSHADOW — Dark Web Investigation Platform

1. Snapshot

Environment: Self-Hosted (Docker, Tor)

Purpose: Investigation-grade Threat Intel

Tools/Tech: Python / FastAPI / React / Tor / Selenium / LLM

Status: In Progress (v0.2.0 milestone)

2. Why I Built This

❌ The Gap in Dark Web Tooling

Existing dark web tooling forces a choice between shallow and unaffordable:

Free OSINT scanners (Robin, OnionSearch, Ahmia) go one layer deep and return a link dump with no matching, IOC extraction, or real analysis
Credential monitoring services (Dark Web ID, SpyCloud, Breachsense) only watch for breached emails on a recurring schedule — they don't investigate anything beyond credentials
Enterprise threat intelligence platforms (Recorded Future, Flashpoint, Intel 471, DarkOwl) do real investigation at depth, but cost $50k–$500k per year and assume a dedicated CTI team

The security operator who needs to actually investigate something — a threat actor during an active incident, a malware family in research, exposed infrastructure during attack surface review, a fresh breach landing on the desk — has nowhere to go between those tiers.

VIPERSHADOW exists to close that gap: investigation-grade depth, self-hosted, built around a single-operator workflow, with no recurring license fee. It is built for the operator who needs to answer "what do we actually know about this?" without a $100k Recorded Future seat to throw at the question.

3. What It Does

🔍 Four Investigation Modes

Identity (Email, Username, Full Name, Phone), Threat Actor (Group Name, Alias, Malware), Domain (Domain, IP), Keyword (Search term). Each mode runs against tuned LLM prompts and target field sets appropriate to the investigation type.
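
The mode-to-field mapping above can be pictured as a small lookup table. The following is a minimal sketch, not the project's code: the names `Mode`, `MODE_FIELDS`, and `validate_targets`, and the snake_case field keys, are assumptions made for illustration.

```python
from __future__ import annotations

from enum import Enum


class Mode(Enum):
    IDENTITY = "identity"
    THREAT_ACTOR = "threat_actor"
    DOMAIN = "domain"
    KEYWORD = "keyword"


# Target fields per mode, taken from the list above.
# The snake_case key names are assumptions.
MODE_FIELDS = {
    Mode.IDENTITY: ("email", "username", "full_name", "phone"),
    Mode.THREAT_ACTOR: ("group_name", "alias", "malware"),
    Mode.DOMAIN: ("domain", "ip"),
    Mode.KEYWORD: ("search_term",),
}


def validate_targets(mode: Mode, targets: dict[str, str]) -> dict[str, str]:
    """Drop empty values and reject fields that do not belong to the mode."""
    allowed = MODE_FIELDS[mode]
    unknown = sorted(set(targets) - set(allowed))
    if unknown:
        raise ValueError(f"fields not valid for {mode.value}: {unknown}")
    cleaned = {k: v.strip() for k, v in targets.items() if v and v.strip()}
    if not cleaned:
        raise ValueError("at least one target field is required")
    return cleaned
```

Rejecting fields that do not belong to the selected mode keeps a Domain scan from silently carrying an email target into prompts tuned for a different investigation type.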

🕸️ Multi-Layer Tor Scraping

Adaptive L1–L4 depth scraping through Tor. L1 follows search results, L2 follows links on L1 pages, L3 follows internal links on matched L2 pages, and L4 follows internal links on exact-matched L3 pages. Depth is adaptive: the crawler goes past L2 only on pages that produced a match, and past L3 only on pages with an exact match. Where single-layer OSINT tools grab the matched page and stop, VIPERSHADOW follows references two and three layers deeper.
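
The per-layer rules can be sketched as a breadth-first crawl in which each layer decides which links earn another hop. This is an illustration under stated assumptions, not the project's crawler: `Page`, `links_to_follow`, and `crawl` are hypothetical names, the `fetch` callable stands in for the Tor-backed fetcher, and the `"fuzzy"` / `"exact"` labels are placeholders for the match engine's verdict.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional
from urllib.parse import urlsplit


@dataclass
class Page:
    url: str
    links: list[str] = field(default_factory=list)
    match: Optional[str] = None  # None, "fuzzy", or "exact" (assumed labels)


def same_host(a: str, b: str) -> bool:
    return urlsplit(a).hostname == urlsplit(b).hostname


def links_to_follow(page: Page, depth: int) -> list[str]:
    """Apply the per-layer rule that decides whether a page is expanded."""
    if depth == 1:  # L1 -> L2: follow every link on the page
        return list(page.links)
    internal = [u for u in page.links if same_host(u, page.url)]
    if depth == 2 and page.match is not None:  # L2 -> L3: any match
        return internal
    if depth == 3 and page.match == "exact":  # L3 -> L4: exact match only
        return internal
    return []  # L4 is the deepest layer


def crawl(
    seeds: list[str],
    fetch: Callable[[str], Optional[Page]],
    max_depth: int = 4,
) -> list[tuple[int, Page]]:
    """Breadth-first crawl; `seeds` are the search-result URLs (L1)."""
    seen: set[str] = set()
    pages: list[tuple[int, Page]] = []
    frontier = list(seeds)
    for depth in range(1, max_depth + 1):
        next_frontier: list[str] = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)
            if page is None:  # blocked or unreachable source
                continue
            pages.append((depth, page))
            next_frontier.extend(links_to_follow(page, depth))
        if not next_frontier:
            break
        frontier = next_frontier
    return pages
```

Passing `fetch` in as a parameter keeps the depth logic testable against an in-memory dictionary of pages, with no Tor circuit involved.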

🎯 Four-Pass Match Engine

Exact string match, rapidfuzz fuzzy-high (>85% confidence), rapidfuzz fuzzy-low (70–85% confidence), and regex pattern matching. Pages under 300 characters and search-result-page titles are discarded before matching.
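
The pass ordering and thresholds can be sketched as follows. To stay dependency-free, this sketch swaps in `difflib.SequenceMatcher` from the standard library for rapidfuzz, so its scores will not equal rapidfuzz's. The function names, the case-insensitive exact pass, and the word-window scoring are assumptions, and the search-result title filter is left out.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher
from typing import Optional

MIN_PAGE_CHARS = 300  # pages shorter than this are discarded before matching


def best_window_score(target: str, text: str) -> float:
    """Best 0-100 similarity between the target and any word window of equal word count."""
    words = text.lower().split()
    n = max(1, len(target.split()))
    needle = target.lower()
    best = 0.0
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        best = max(best, SequenceMatcher(None, needle, window).ratio() * 100)
    return best


def classify(target: str, text: str, pattern: Optional[str] = None) -> Optional[str]:
    """Return the first pass that hits: EXACT, FUZZY_HIGH, FUZZY_LOW, REGEX, or None."""
    if len(text) < MIN_PAGE_CHARS:
        return None
    if target.lower() in text.lower():  # pass 1: exact (case-insensitive here)
        return "EXACT"
    score = best_window_score(target, text)
    if score > 85:  # pass 2: fuzzy-high
        return "FUZZY_HIGH"
    if score >= 70:  # pass 3: fuzzy-low
        return "FUZZY_LOW"
    if pattern and re.search(pattern, text, re.IGNORECASE):  # pass 4: regex
        return "REGEX"
    return None
```

Running the cheapest and strictest pass first means a page is labeled with the strongest evidence it offers, which matters when deeper scraping layers are gated on exact matches.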

📦 IOC Extraction

Emails, onion addresses, BTC/ETH/XMR cryptocurrency wallets, IPs, clearnet domains, MD5/SHA1/SHA256 hashes, and CVE identifiers. Runs only on pages with substantive content (500+ characters, non-search URLs).
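
A regex-based extractor for a subset of these indicator types might look like the sketch below. The patterns are common community-style expressions, not the project's own. BTC and XMR wallets and clearnet domains are omitted for brevity. `looks_like_search_url` is a hypothetical heuristic for the "non-search URL" condition.

```python
from __future__ import annotations

import re

MIN_IOC_PAGE_CHARS = 500  # only pages with substantive content are scanned

_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"

IOC_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "onion": re.compile(r"\b[a-z2-7]{56}\.onion\b"),  # v3 onion addresses
    "eth_wallet": re.compile(r"\b0x[a-fA-F0-9]{40}\b"),
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def looks_like_search_url(url: str) -> bool:
    """Hypothetical heuristic: treat query-style URLs as search-result pages."""
    return bool(re.search(r"[?&](?:q|query|search)=|/search\b", url, re.IGNORECASE))


def extract_iocs(url: str, text: str) -> dict[str, list[str]]:
    """Return de-duplicated, sorted IOCs per type, or {} when the page is skipped."""
    if len(text) < MIN_IOC_PAGE_CHARS or looks_like_search_url(url):
        return {}
    found: dict[str, list[str]] = {}
    for kind, pattern in IOC_PATTERNS.items():
        hits = sorted({m.group(0) for m in pattern.finditer(text)})
        if hits:
            found[kind] = hits
    return found
```

The word boundaries around the hash patterns keep a 64-character SHA256 from also being reported as an MD5 or SHA1, and keep the hex tail of an Ethereum address from being counted as a SHA1.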

🧠 LLM-Driven Analysis

The LLM analyst receives only matched excerpts and extracted IOCs — never unmatched scrape content. Returns a structured intelligence report with HIGH/MEDIUM/LOW confidence findings across Threat Actor Profile, Attack Methodology, Infrastructure Indicators, Victim Profile, and Current Activity Assessment sections. Generates suggested Next Queries to drive the investigation forward. Returns explicit INSUFFICIENT DATA flags when signal is weak rather than fabricating findings.
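
The "matched excerpts and IOCs only" rule amounts to a filter applied before any prompt is assembled. The sketch below shows that filter and nothing more: `ScrapedPage` and `build_analyst_prompt` are hypothetical names, the prompt wording is invented for illustration, and no LLM client is called.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

SECTIONS = (
    "Threat Actor Profile",
    "Attack Methodology",
    "Infrastructure Indicators",
    "Victim Profile",
    "Current Activity Assessment",
)


@dataclass
class ScrapedPage:
    url: str
    text: str                # full scraped body; never forwarded to the analyst
    excerpt: Optional[str]   # matched excerpt, or None when nothing matched


def build_analyst_prompt(
    target: str,
    pages: list[ScrapedPage],
    iocs: dict[str, list[str]],
) -> str:
    """Assemble analyst input from matched excerpts and IOCs only."""
    matched = [p for p in pages if p.excerpt]  # unmatched pages are dropped here
    lines = [f"Target: {target}", "", "Matched excerpts:"]
    for i, page in enumerate(matched, start=1):
        lines.append(f"[{i}] {page.url}: {page.excerpt}")
    lines += ["", "Extracted IOCs:"]
    for kind in sorted(iocs):
        lines.append(f"{kind}: {', '.join(iocs[kind])}")
    lines += [
        "",
        "Report sections: " + "; ".join(SECTIONS),
        "Tag each finding HIGH, MEDIUM, or LOW and cite excerpt numbers.",
        "Write INSUFFICIENT DATA for any section the evidence does not support.",
        "End with suggested Next Queries.",
    ]
    return "\n".join(lines)
```

Numbering the excerpts gives the model something concrete to cite, which makes each finding traceable back to a source URL.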

4. How It Works

The operator-in-the-loop pipeline:

1. Configure Identity — Operator pre-loads a throwaway username and password before the scan (not generated on-the-fly; Tor circuits don't tolerate mid-scan latency)

2. Set Investigation Mode — Select Identity, Threat Actor, Domain, or Keyword. Populate target fields appropriate to the mode

3. Pipeline Runs Mechanical Work — Search across 11 Tor engines → multi-layer scraping → 4-pass match engine → IOC extraction

4. Operator Steers Through Gates — When a registration wall is hit, operator clicks Insert Creds (uses pre-loaded identity) or Skip. When CAPTCHA blocks, operator solves it in the browser panel and the scrape resumes

5. Block Detection Categorizes Failures — REGISTRATION_GATED, JS_GATED, HONEYPOT, TIMEOUT, UNREACHABLE, TEXT_CAPTCHA — sources are logged with reasons so operator can decide which to pursue manually

6. LLM Synthesizes Intelligence Report — Receives matched excerpts and IOCs only. Produces confidence-tagged findings with cited sources

7. Suggested Next Queries Drive Pivots — Investigation history persists; operator can immediately run LLM-suggested follow-up queries on related angles
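
The block categories in step 5 can be modeled as an enum with a classifier in front of it. Only the six category names come from the pipeline description. The detection heuristics below are placeholders chosen for illustration, and HONEYPOT detection is not sketched because the source gives no criteria for it.

```python
from __future__ import annotations

import re
from enum import Enum
from typing import Optional


class BlockReason(Enum):
    REGISTRATION_GATED = "REGISTRATION_GATED"
    JS_GATED = "JS_GATED"
    HONEYPOT = "HONEYPOT"  # listed for completeness; not detected in this sketch
    TIMEOUT = "TIMEOUT"
    UNREACHABLE = "UNREACHABLE"
    TEXT_CAPTCHA = "TEXT_CAPTCHA"


# Placeholder phrasing for registration walls (an assumption).
_REGISTRATION = re.compile(r"\b(?:register|sign up|log ?in) to (?:view|continue|access)")


def classify_block(html: Optional[str], error: Optional[str] = None) -> Optional[BlockReason]:
    """Return why a source is blocked, or None when the page looks usable."""
    if error == "timeout":
        return BlockReason.TIMEOUT
    if html is None:  # no response body at all
        return BlockReason.UNREACHABLE
    lowered = html.lower()
    if "captcha" in lowered:
        return BlockReason.TEXT_CAPTCHA
    if _REGISTRATION.search(lowered):
        return BlockReason.REGISTRATION_GATED
    if "<noscript" in lowered and "enable javascript" in lowered:
        return BlockReason.JS_GATED
    return None
```

Returning a reason instead of raising lets the pipeline log the source and move on, leaving the decision to pursue it manually with the operator.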

Standard-depth scans complete in roughly 10–15 minutes. The operator-in-the-loop layer (gated content handling) is the current v0.2.0 milestone. The multi-layer scraping and match/IOC/synthesis pipeline are operational and producing substantively more usable intelligence than shallow OSINT alternatives on the same queries.

5. Screenshots / Artifacts

Secure Boot Sequence — Tor circuit bootstrap, engine initialization, operator authentication

Master Identity — Throwaway credential configuration with operational discipline warnings

Block Report — Categorized unreachable sources (UNREACHABLE, HONEYPOT, JS CHALLENGE, TIMEOUT)

Live Console — L2 fetches with EXACT matches and categorized block detection in real time

Intelligence Report — Conti investigation, Risk Score 65, HIGH-confidence findings across five sections

Evidence Panel — 13 matches, 513 IOCs, 200 pages scraped in a single investigation

A single Threat Actor investigation against a well-documented historical target produced 13 matches, 513 extracted IOCs, and 200 pages scraped — with the LLM synthesizing five sections of HIGH-confidence findings. Depth comes from multi-layer scraping against open content; gated content handling expands coverage further as the v0.2.0 milestone completes.

6. Links

🔗 GitHub Repository

📖 Project Wiki

Source code is currently private while the v0.2.0 milestone is completed. The public README and Wiki document the project's architecture, design decisions, and operational characteristics. Source will be published at v1.0.0.