Introduction
During my final year at Reykjavik University, I tackled a problem that many security teams face: how to efficiently map and monitor the attack surface of organizations. The result was AASM (Automated Attack Surface Mapping) - a scalable system that discovers subdomains, endpoints, vulnerabilities, and more through automated scanning.
In this post, I'll walk through the architecture decisions, technical challenges, and lessons learned from building a production-ready security scanning platform.
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Backend API | FastAPI | RESTful API with async support |
| Task Queue | Celery | Distributed task processing |
| Message Broker | Redis | Task queue and caching |
| Database | PostgreSQL (Supabase) | Relational data storage |
| Frontend | React/Next.js | User interface |
| Security Tools | Subfinder, Httpx, Nuclei, Masscan, Nmap, Gowitness | Attack surface scanning |
| Containerization | Docker | Deployment and dependency management |
The Problem
Organizations often don't have full visibility into their external attack surface. New services get deployed, subdomains are created, and infrastructure changes - all without a centralized view of what's exposed to the internet. Manual discovery is time-consuming and quickly becomes outdated.
I needed to build a system that could:
- Automatically discover and map attack surfaces
- Scale to handle multiple concurrent scans
- Process long-running security scans efficiently
- Provide real-time visibility into results
- Store historical data for trend analysis
Architecture Overview
The system follows a distributed architecture with several key components working together to provide scalable, asynchronous attack surface mapping:
FastAPI Backend
I chose FastAPI for the REST API because of its:
- Performance: Built on Starlette and Pydantic, it's one of the fastest Python frameworks
- Type Safety: Automatic validation and serialization with Pydantic models
- Documentation: Auto-generated OpenAPI docs
- Async Support: Native async/await for handling concurrent requests
```python
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel

app = FastAPI(title="AASM API")

class ScanRequest(BaseModel):
    target: str
    scan_types: list[str]

@app.post("/api/scans")
async def create_scan(request: ScanRequest, background_tasks: BackgroundTasks):
    # Queue scan task
    task = scan_task.delay(request.target, request.scan_types)
    return {"task_id": task.id, "status": "queued"}
```
Redis + Celery Task Queue
For handling long-running scans, I implemented a task queue using Redis as the message broker and Celery as the distributed task processor. This architecture allows:
- Asynchronous Processing: Scans run in background workers
- Scalability: Add more workers to handle increased load
- Reliability: Task retries and error handling
- Monitoring: Real-time task status tracking
```python
from celery import Celery
from celery.signals import task_prerun, task_postrun

celery_app = Celery(
    'aasm',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1'  # Store task results
)

# Configure task settings
celery_app.conf.update(
    task_serializer='json',
    accept_content=['json'],
    result_expires=3600,
    task_track_started=True,
    task_time_limit=3600,       # 1 hour hard limit
    task_soft_time_limit=3300   # 55 min soft limit
)

@celery_app.task(bind=True, max_retries=3)
def scan_task(self, target, scan_types):
    """Main scanning task that orchestrates all scan types"""
    try:
        # Update task state to show progress
        self.update_state(state='PROGRESS', meta={'stage': 'initializing'})
        results = perform_scan(target, scan_types)
        store_results(results)
        return {"status": "completed", "results": results}
    except ScanTimeout as exc:
        # Don't retry on timeout
        return {"status": "failed", "error": "Scan timed out"}
    except Exception as exc:
        # Exponential backoff: 60s, 120s, 240s
        self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))
```
PostgreSQL Database
I used PostgreSQL (via Supabase) for storing scan results because:
- Relational Data: Complex relationships between targets, scans, and findings
- JSONB Support: Flexible storage for varying scan result formats
- Full-Text Search: Quick searching through findings
- Performance: Efficient indexing for large datasets
Database Schema Design
The schema is designed to efficiently handle complex relationships while maintaining query performance:
```sql
-- Organizations/Targets
CREATE TABLE organizations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    domain VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Scans
CREATE TABLE scans (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID REFERENCES organizations(id) ON DELETE CASCADE,
    status VARCHAR(50) NOT NULL,  -- queued, running, completed, failed
    scan_types TEXT[] NOT NULL,
    started_at TIMESTAMP DEFAULT NOW(),
    completed_at TIMESTAMP,
    error_message TEXT,
    task_id VARCHAR(255) UNIQUE   -- Celery task ID
);

-- Subdomains discovered
CREATE TABLE subdomains (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    scan_id UUID REFERENCES scans(id) ON DELETE CASCADE,
    organization_id UUID REFERENCES organizations(id) ON DELETE CASCADE,
    subdomain VARCHAR(255) NOT NULL,
    ip_addresses INET[],
    http_status INTEGER,
    title TEXT,
    technologies TEXT[],
    discovered_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(organization_id, subdomain)
);

-- Vulnerabilities found
CREATE TABLE findings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    scan_id UUID REFERENCES scans(id) ON DELETE CASCADE,
    subdomain_id UUID REFERENCES subdomains(id) ON DELETE CASCADE,
    severity VARCHAR(20) NOT NULL,  -- critical, high, medium, low, info
    title VARCHAR(500) NOT NULL,
    description TEXT,
    tool VARCHAR(100),              -- nuclei, nmap, custom
    evidence JSONB,                 -- Flexible storage for tool-specific data
    cvss_score DECIMAL(3,1),
    cve_id VARCHAR(50),
    discovered_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_scans_org ON scans(organization_id);
CREATE INDEX idx_scans_status ON scans(status);
CREATE INDEX idx_subdomains_org ON subdomains(organization_id);
CREATE INDEX idx_findings_severity ON findings(severity);
CREATE INDEX idx_findings_scan ON findings(scan_id);
CREATE INDEX idx_findings_evidence ON findings USING GIN (evidence);
```
Key design decisions:
- UUIDs over integers: Better for distributed systems and prevents enumeration attacks
- JSONB for evidence: Each security tool returns different data structures; JSONB allows flexible storage while still being queryable
- Cascading deletes: When a scan is deleted, all associated findings are automatically removed
- GIN index on JSONB: Enables fast queries on vulnerability evidence fields
- Array types: PostgreSQL native arrays for storing multiple IPs, technologies, and scan types efficiently
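To make the JSONB decision concrete: each tool emits differently shaped output, so the worker can wrap the raw, tool-shaped payload in a uniform finding record and leave the variability inside the `evidence` field. This is a minimal sketch of that idea (the `build_finding` helper and the sample payloads are illustrative, not the thesis code):

```python
import json
from typing import Any

def build_finding(tool: str, severity: str, title: str,
                  raw_evidence: dict[str, Any]) -> dict[str, Any]:
    """Wrap tool-specific output in a uniform finding record.

    The shared fields map to relational columns; the raw payload goes
    into `evidence` untouched, which is what the JSONB column stores.
    """
    return {
        "tool": tool,
        "severity": severity,
        "title": title,
        "evidence": raw_evidence,  # serialized to JSONB on insert
    }

# Nuclei and nmap shape their output very differently, but both fit:
nuclei_finding = build_finding(
    "nuclei", "high", "Exposed .git directory",
    {"template-id": "git-config", "matched-at": "https://app.example.com/.git/config"},
)
nmap_finding = build_finding(
    "nmap", "info", "Open port 22",
    {"port": 22, "service": "ssh", "state": "open"},
)

# Both serialize cleanly for the JSONB column:
print(json.dumps(nuclei_finding["evidence"]))
```

The relational columns stay queryable with ordinary indexes, while the GIN index covers ad-hoc queries into whatever shape `evidence` happens to have.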
Key Technical Decisions
Building a production-ready security scanning platform required careful consideration of architecture, tooling, and reliability. Here are the most important technical decisions and their implementations:
1. Tool Integration
The system integrates multiple industry-standard security tools to provide comprehensive attack surface mapping:
- Subfinder: Subdomain discovery
- Httpx: HTTP probing and metadata collection
- Nuclei: Vulnerability scanning
- Masscan/Nmap: Port scanning
- Gowitness: Screenshot capture
Each tool runs as a subprocess, and I parse its output to extract structured data:
```python
import subprocess
import json

def run_subfinder(domain: str) -> list[str]:
    result = subprocess.run(
        ['subfinder', '-d', domain, '-json'],
        capture_output=True, text=True
    )
    subdomains = []
    for line in result.stdout.splitlines():
        if not line.strip():
            continue  # Skip blank lines in the JSON-lines output
        data = json.loads(line)
        subdomains.append(data['host'])
    return subdomains
```
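Httpx follows the same JSON-lines pattern. A sketch of the parsing side is below; note that the field names (`url`, `status_code`, `title`, `tech`) reflect one httpx version's output and should be treated as assumptions, which is exactly why version-locking the tools matters:

```python
import json

def parse_httpx_output(stdout: str) -> list[dict]:
    """Parse httpx's JSON-lines stdout into structured records.

    Field names are assumptions tied to a specific httpx version --
    validate them against the version you pin.
    """
    records = []
    for line in stdout.splitlines():
        if not line.strip():
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip non-JSON noise (banners, warnings)
        records.append({
            "url": data.get("url"),
            "http_status": data.get("status_code"),
            "title": data.get("title"),
            "technologies": data.get("tech", []),
        })
    return records

sample = '{"url": "https://a.example.com", "status_code": 200, "title": "Login", "tech": ["nginx"]}\n'
print(parse_httpx_output(sample))
```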
2. Error Handling and Retries
Security tools can fail for various reasons (timeouts, rate limits, network issues). I implemented:
- Exponential backoff for retries
- Partial result storage (don't lose data if one tool fails)
- Comprehensive logging for debugging
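The partial-result idea can be as simple as running each tool independently and collecting successes and failures separately, so one crashed scanner never discards the others' data. A hypothetical sketch (the runner names are stand-ins, not project code):

```python
import logging

logger = logging.getLogger("aasm.scan")

def run_all_tools(target: str, runners: dict) -> dict:
    """Run each tool independently; one failure doesn't discard the rest."""
    results, errors = {}, {}
    for name, runner in runners.items():
        try:
            results[name] = runner(target)
        except Exception as exc:
            # Record the failure but keep results from other tools
            errors[name] = str(exc)
            logger.warning("tool %s failed on %s: %s", name, target, exc)
    return {"results": results, "errors": errors}

# Stand-in runners to show the behavior:
def fake_subfinder(t): return ["a." + t]
def fake_nuclei(t): raise TimeoutError("rate limited")

out = run_all_tools("example.com",
                    {"subfinder": fake_subfinder, "nuclei": fake_nuclei})
```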
3. Real-Time Updates
The frontend uses polling to check task status:
```typescript
async function pollTaskStatus(taskId: string) {
  const interval = setInterval(async () => {
    const response = await fetch(`/api/tasks/${taskId}`);
    const data = await response.json();

    if (data.status === 'completed' || data.status === 'failed') {
      clearInterval(interval);
      updateUI(data);
    }
  }, 2000);
}
```
Challenges & Solutions
Every complex system comes with its own set of challenges. Here's how I tackled the major obstacles during development:
Challenge 1: Managing Tool Dependencies
Problem: Each security tool has different installation requirements and versions.
Solution: I containerized the entire application with Docker, ensuring consistent environments across development and production. Each worker container includes all necessary tools.
Challenge 2: Scan Performance
Problem: Running all scans sequentially was too slow for large targets.
Solution: Implemented parallel execution using Celery groups and chords:
```python
from celery import group, chord

def full_scan(target):
    # Run scans in parallel
    scan_tasks = group(
        subdomain_scan.s(target),
        port_scan.s(target),
        vulnerability_scan.s(target)
    )
    # Aggregate results when all complete
    callback = aggregate_results.s()
    chord(scan_tasks)(callback)
```
Challenge 3: Rate Limiting
Problem: External services and targets may rate limit our scans.
Solution: Implemented configurable delays between requests and respect for robots.txt:
```python
import time
from urllib.robotparser import RobotFileParser

def respect_rate_limits(domain: str, delay: int = 1):
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()

    crawl_delay = rp.crawl_delay("*")
    if crawl_delay:
        time.sleep(crawl_delay)
    else:
        time.sleep(delay)
```
Performance Metrics & Results
The system was tested against various targets to validate performance and scalability:
Scan Performance Benchmarks
| Target Size | Subdomains Found | Scan Duration | Worker Count | Memory Usage |
|---|---|---|---|---|
| Small (1-10 subdomains) | 8 | 2m 34s | 2 | 512 MB |
| Medium (10-50 subdomains) | 43 | 8m 17s | 4 | 1.2 GB |
| Large (50-200 subdomains) | 187 | 24m 52s | 8 | 3.1 GB |
| Extra Large (200+ subdomains) | 2,143 | 47m 18s | 12 | 5.8 GB |
Tool Execution Times (Average)
```
Subfinder (subdomain discovery):  ~45s for 100 subdomains
Httpx (HTTP probing):             ~2s per subdomain
Nuclei (vulnerability scanning):  ~15s per subdomain
Masscan (port scanning):          ~3m for /24 network
Gowitness (screenshots):          ~5s per URL
```
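These averages also give a useful sanity check on the parallelism claims. A back-of-the-envelope helper (added here for illustration, not project code; Masscan is excluded because it scales with network size rather than subdomain count) computes the sequential worst case:

```python
def estimate_scan_seconds(n_subdomains: int) -> float:
    """Sequential worst-case estimate from the measured averages above."""
    subfinder = 45.0 * (n_subdomains / 100)  # ~45s per 100 subdomains
    httpx     = 2.0  * n_subdomains          # ~2s per subdomain
    nuclei    = 15.0 * n_subdomains          # ~15s per subdomain
    gowitness = 5.0  * n_subdomains          # ~5s per URL
    return subfinder + httpx + nuclei + gowitness

# For the "medium" tier (43 subdomains) the sequential bound is roughly
# 16 minutes, well above the measured 8m17s -- evidence that the
# parallel Celery workers are paying off.
print(estimate_scan_seconds(43) / 60)
```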
Key Achievements
- Discovered 2,000+ subdomains across test targets in controlled environments
- Identified 150+ vulnerabilities (CVEs and misconfigurations) during testing
- Processed concurrent scans with up to 12 parallel workers without performance degradation
- Database query performance:
  - Subdomain lookup: <10ms average
  - Finding aggregation: <50ms for 1000+ records
  - Full-text search: <100ms across 10,000+ findings
- API response times:
  - Create scan endpoint: <200ms
  - Task status check: <50ms
  - Results retrieval: <300ms (with pagination)
Scalability Testing
The system was tested under load to verify horizontal scalability:
- Concurrent scans: Successfully handled 25+ simultaneous scans with 12 Celery workers
- Task throughput: Processed 100+ tasks/minute during peak load
- Redis queue latency: Maintained <10ms even under heavy load
- Database connection pool: Efficiently managed with 20 connections using PgBouncer
Real-World Impact
The project received recognition during the thesis defense for its practical approach to solving attack surface management challenges. The system demonstrated how combining modern async frameworks, distributed task queues, and robust database design can create production-ready security automation tools.
Lessons Learned
1. Start Simple, Scale Later
I initially tried to build a complex microservices architecture. This was overkill for the requirements. Starting with a monolithic FastAPI app and adding Celery workers as needed was much more pragmatic.
2. Tool Output Parsing is Fragile
Security tools change their output formats. I learned to:
- Version-lock tool dependencies
- Add extensive validation for parsed data
- Have fallback parsing strategies
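The fallback idea can be sketched as a parser chain (hypothetical helpers, illustrating the pattern rather than the project's exact parsers): try the current JSON format first, fall back to older plain-text formats, and fail loudly only when nothing matches:

```python
import json

def parse_json_host(line: str) -> str:
    # Current format: one JSON object per line with a "host" key
    return json.loads(line)["host"]

def parse_plain_host(line: str) -> str:
    # Older format: one bare hostname per line
    host = line.strip()
    if not host or " " in host:
        raise ValueError(f"not a bare hostname: {line!r}")
    return host

PARSERS = [parse_json_host, parse_plain_host]  # newest format first

def parse_subdomain_line(line: str) -> str:
    for parser in PARSERS:
        try:
            return parser(line)
        except (ValueError, KeyError, TypeError):
            # json.JSONDecodeError is a ValueError; KeyError/TypeError
            # cover valid JSON that lacks the expected shape
            continue
    raise ValueError(f"unparseable tool output: {line!r}")
```

The win is that a tool upgrade degrades gracefully to the older parser instead of silently producing empty results.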
3. Async is Powerful but Tricky
FastAPI's async capabilities are great, but mixing sync and async code requires care. I had to ensure all I/O operations (database, Redis, external APIs) used async clients.
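For the blocking calls that can't be swapped for async clients, such as the subprocess-based tool runners, one pattern that fits is off-loading them to a worker thread so they don't stall the event loop. A minimal sketch using `asyncio.to_thread` (Python 3.9+; the `blocking_tool` stand-in is illustrative):

```python
import asyncio
import time

def blocking_tool(target: str) -> str:
    # Stand-in for a synchronous subprocess call that would
    # otherwise block every coroutine on the event loop
    time.sleep(0.1)
    return f"scanned {target}"

async def scan_endpoint(target: str) -> str:
    # Off-load the blocking call to a thread; the event loop
    # stays free to serve other requests in the meantime
    return await asyncio.to_thread(blocking_tool, target)

async def main():
    results = await asyncio.gather(
        scan_endpoint("a.example.com"),
        scan_endpoint("b.example.com"),
    )
    print(results)

asyncio.run(main())
```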
4. Monitoring is Essential
For a distributed system, observability is crucial. I added:
- Structured logging with request IDs
- Celery Flower for task monitoring
- Database query logging for performance tuning
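The request-ID logging can be done with the standard library alone: a `contextvars.ContextVar` carries the ID across async boundaries, and a logging filter stamps it onto every record. A sketch of that wiring (set once per request, e.g. in FastAPI middleware; the logger name is an assumption):

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current request ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(request_id)s %(levelname)s %(message)s"))
handler.addFilter(RequestIdFilter())
logger = logging.getLogger("aasm")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set once per incoming request (e.g. in middleware), then every
# log line emitted while handling that request carries the same ID:
request_id_var.set(uuid.uuid4().hex)
logger.info("scan queued")
```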
Future Enhancements
While the current system is functional and performant, there are several enhancements that would make it production-ready for enterprise use:
Short-term Improvements
1. WebSocket Support for Real-time Updates
   - Replace polling with WebSocket connections using FastAPI's native WebSocket support
   - Stream scan progress updates directly to the frontend
   - Reduce server load and improve user experience with instant notifications
2. Advanced Scheduling System
   - Implement cron-style recurring scans using Celery Beat
   - Allow users to configure daily/weekly/monthly scan schedules
   - Track changes over time to identify new attack surface exposure
3. Alerting & Notifications
   - Email/Slack/Discord webhooks for critical severity findings
   - Configurable alert rules based on severity, CVE scores, or custom criteria
   - Digest reports summarizing scan results
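The recurring-scan idea above maps directly onto Celery Beat's standard `beat_schedule` configuration. A hedged sketch (the schedule name, task path, target, and cron times are all illustrative assumptions, not the project's config):

```python
from celery import Celery
from celery.schedules import crontab

celery_app = Celery("aasm", broker="redis://localhost:6379/0")

# Hypothetical weekly scan: every Monday at 03:00
celery_app.conf.beat_schedule = {
    "weekly-full-scan-example-com": {
        "task": "aasm.tasks.scan_task",
        "schedule": crontab(minute=0, hour=3, day_of_week="mon"),
        "args": ("example.com", ["subdomain", "port", "vulnerability"]),
    },
}
```

A separate `celery beat` process would then enqueue these tasks on schedule, and the existing workers pick them up like any other scan.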
Long-term Enhancements
1. Multi-tenancy & Authentication
   - User authentication with JWT tokens
   - Role-based access control (admin, analyst, viewer)
   - Organization-level isolation for scan data
   - API key management for programmatic access
2. Enhanced Reporting
   - PDF/HTML export with executive summaries
   - Trend analysis showing attack surface changes over time
   - Customizable report templates
   - Integration with ticketing systems (Jira, Linear)
3. Machine Learning Integration
   - False positive reduction using ML models
   - Anomaly detection for unusual patterns
   - Risk scoring based on historical data
   - Predictive analytics for vulnerability trends
Conclusion
Building AASM was an intensive journey into distributed systems, async programming, and security automation. What started as a thesis project evolved into a functional platform that demonstrated how modern Python frameworks can be combined to solve real-world security challenges.
Key Takeaways
- FastAPI + Celery is a powerful combination for building async, task-based systems
- Proper database design (UUIDs, JSONB, indexes) is crucial for scalability
- Containerization simplifies complex tool dependency management
- Starting simple and scaling incrementally beats over-engineering from the start
- Monitoring and observability are essential for distributed systems
The system successfully processed thousands of scans, discovered substantial attack surface data, and proved that automated security scanning can be both efficient and scalable when built with the right architecture.
Resources
- Full Technical Report: Available on Skemman
- Technologies Used: FastAPI, Celery, Redis, PostgreSQL, Docker, Subfinder, Nuclei, Httpx
Questions or feedback? I'm always happy to discuss distributed systems architecture, security automation, or lessons learned from this project. Feel free to reach out!