Firehose Weekly fuel for the dev firehose


Ssg_implementation_checklist

SSGv3 Implementation Checklist - Test-First Approach

Testing Strategy

  • Unit tests: Test each component in isolation with mocks
  • Integration tests: Test stage interactions with real file fixtures
  • E2E tests: Full pipeline runs on sample sites
  • Property tests: Use Hypothesis to check cache-key determinism and URL-normalization idempotence

Test Data Principles

  • Use realistic content (actual markdown with frontmatter)
  • Cover boundary conditions (empty files, missing fields)
  • Include unicode in slugs and content
  • Test with various date formats and timezones
  • Include invalid inputs to verify error handling

Phase 1: Foundation - Data Structures & Models

Core Models

  • Write test: BuildItem creation with minimal fields
  • Write test: BuildItem state transitions (SCANNED → BUILT → WRITTEN)
  • Write test: BuildItem.to_dict() serialization
  • Write test: BuildItem.from_dict() deserialization
  • Implement BuildItem base class
  • Write test: ContentItem with templates_used list
  • Write test: AssetItem with preserve_mtime flag
  • Write test: IndexItem with pagination fields
  • Implement ContentItem, AssetItem, IndexItem subclasses
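
The Core Models tests above could be satisfied by something like the following. This is a minimal sketch, not the project's implementation: the state names, to_dict() and from_dict() come from the checklist, while the src and out fields, the advance() helper, and the strict one-step transition rule are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class BuildState(Enum):
    SCANNED = "scanned"
    BUILT = "built"
    WRITTEN = "written"


# Assumed rule: states only move forward, one step at a time.
_NEXT = {BuildState.SCANNED: BuildState.BUILT, BuildState.BUILT: BuildState.WRITTEN}


@dataclass
class BuildItem:
    src: Optional[str]   # hypothetical field: source path, None for generated items
    out: str             # hypothetical field: output path
    state: BuildState = BuildState.SCANNED

    def advance(self) -> None:
        """Move to the next state; raise if the item is already WRITTEN."""
        if self.state not in _NEXT:
            raise ValueError(f"no transition from {self.state.name}")
        self.state = _NEXT[self.state]

    def to_dict(self) -> dict:
        data = asdict(self)
        data["state"] = self.state.value  # store the enum as a plain string
        return data

    @classmethod
    def from_dict(cls, data: dict) -> "BuildItem":
        return cls(src=data["src"], out=data["out"], state=BuildState(data["state"]))
```

Storing the state as its string value keeps the dictionary JSON-friendly, which matters once items are written into a manifest.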

Error Models

  • Write test: SSGError with message and suggestion
  • Write test: BuildError with code, src, message
  • Write test: CollisionError with url and multiple sources
  • Write test: TemplateError for missing template
  • Implement error class hierarchy
  • Write test: ErrorCollector accumulates multiple errors
  • Write test: ErrorCollector.format_report() output
  • Implement ErrorCollector class

Configuration

  • Write test: Config loads from TOML with all fields
  • Write test: Config validates content_dir exists
  • Write test: Config validates output_dir not inside content_dir
  • Write test: Config validates permalink template syntax
  • Write test: Config blacklist includes default patterns
  • Implement Config dataclass with validation
  • Write test: Config detects invalid page_size (< 1)
  • Write test: Config normalizes paths to absolute

Phase 2: Utilities Layer

File Utilities

  • Write test: FileUtils.read_file() handles UTF-8
  • Write test: FileUtils.read_file() raises on invalid encoding
  • Write test: FileUtils.stat_file() returns mtime and size
  • Write test: FileUtils.is_ignored() matches patterns
  • Implement FileUtils class
  • Write test: FileUtils handles symlinks (don’t follow by default)

Date Parsing

  • Write test: DateParser.parse() handles ISO format
  • Write test: DateParser.parse() handles “YYYY-MM-DD” format
  • Write test: DateParser.parse() handles human formats (“Jan 15, 2025”)
  • Write test: DateParser.to_iso() produces standard format
  • Write test: DateParser.extract_from_filename() finds “YYYY-MM-DD-title.md”
  • Implement DateParser class
  • Write test: DateParser handles timezones consistently

Path Resolution

  • Write test: PathResolver.resolve() makes paths absolute
  • Write test: PathResolver.make_relative() produces relative path
  • Write test: PathResolver.is_within() detects containment
  • Write test: PathResolver.is_within() rejects traversal attempts
  • Implement PathResolver class

Slug Generation

  • Write test: SlugGenerator.generate() lowercases input
  • Write test: SlugGenerator.generate() transliterates unicode (ö→o, é→e)
  • Write test: SlugGenerator.generate() removes punctuation except hyphens
  • Write test: SlugGenerator.generate() collapses whitespace to single hyphen
  • Write test: SlugGenerator.generate() strips leading/trailing hyphens
  • Implement SlugGenerator class
  • Write test: SlugGenerator with preserve_unicode=True keeps diacritics
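
One way the slug rules listed above might fit together is sketched below. The function name generate_slug is an assumption (the checklist names SlugGenerator.generate()), and transliteration here relies only on Unicode decomposition, so characters without a decomposition (for example ß or ø) are simply dropped rather than mapped.

```python
import re
import unicodedata


def generate_slug(text: str, preserve_unicode: bool = False) -> str:
    if preserve_unicode:
        # Keep diacritics, but use one canonical composed form.
        text = unicodedata.normalize("NFC", text)
    else:
        # Decompose accented letters, then drop everything non-ASCII
        # (the combining marks), so that e.g. an accented e becomes plain e.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    text = re.sub(r"[^\w\s-]", "", text)  # remove punctuation except hyphens
    text = re.sub(r"\s+", "-", text)      # runs of whitespace become one hyphen
    text = re.sub(r"-{2,}", "-", text)    # collapse repeated hyphens
    return text.strip("-")                # no leading/trailing hyphens
```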

Phase 3: Metadata System

Metadata Extraction

  • Write test: MetadataExtractor.parse_frontmatter() extracts YAML
  • Write test: MetadataExtractor.parse_frontmatter() handles empty frontmatter
  • Write test: MetadataExtractor.parse_frontmatter() raises on invalid YAML
  • Write test: MetadataExtractor.get_system_defaults() uses filename for slug
  • Write test: MetadataExtractor.get_system_defaults() uses mtime for date
  • Write test: MetadataExtractor.get_path_derived() extracts category from parent
  • Write test: MetadataExtractor.get_path_derived() handles top-level files
  • Implement MetadataExtractor class
  • Write test: MetadataExtractor.merge_metadata() respects precedence
  • Write test: MetadataExtractor.extract() combines all three sources
  • Write test: Frontmatter category overrides directory structure

Phase 4: Template System

Template Dependency Tracking

  • Write test: TemplateDependencyTracker.build_graph() finds includes
  • Write test: TemplateDependencyTracker.build_graph() parses Jinja2
  • Write test: TemplateDependencyTracker.detect_cycles() finds A→B→A
  • Write test: TemplateDependencyTracker.detect_cycles() handles self-reference
  • Write test: TemplateDependencyTracker.detect_cycles() returns empty for DAG
  • Implement TemplateDependencyTracker class
  • Write test: TemplateDependencyTracker.compute_hash() includes content
  • Write test: TemplateDependencyTracker.compute_hash() includes all includes transitively
  • Write test: TemplateDependencyTracker.compute_hash() is deterministic
  • Write test: TemplateDependencyTracker raises error on circular dependency
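
The cycle tests above (A→B→A, self-reference, empty result for a DAG) can be met with a depth-first search. The sketch below assumes the include graph has already been built as a plain dict mapping each template name to the names it includes; that representation and the function name are illustrative, not taken from the design.

```python
from typing import Dict, List


def detect_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return each cycle found as a path that starts and ends on the same node."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    stack: List[str] = []
    cycles: List[List[str]] = []

    def visit(node: str) -> None:
        color[node] = GREY
        stack.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: the path back to it is a cycle.
                cycles.append(stack[stack.index(dep):] + [dep])
            elif state == WHITE:
                visit(dep)
        stack.pop()
        color[node] = BLACK

    for node in sorted(graph):            # sorted for deterministic output
        if color[node] == WHITE:
            visit(node)
    return cycles
```

Returning the full path rather than a boolean makes it possible to print the offending include chain in the error message.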

Template Selection

  • Write test: TemplateSelector.select() uses frontmatter template if present
  • Write test: TemplateSelector.select() uses category template if exists
  • Write test: TemplateSelector.select() falls back to default.html
  • Write test: TemplateSelector.resolve_category_template() checks file exists
  • Implement TemplateSelector class
  • Write test: TemplateSelector raises error if default.html missing

Template Rendering

  • Write test: TemplateRenderer.load_templates() loads from directory
  • Write test: TemplateRenderer.render() produces HTML
  • Write test: TemplateRenderer.render() provides content in context
  • Write test: TemplateRenderer.render() provides metadata in context
  • Write test: TemplateRenderer.get_templates_used() tracks includes
  • Implement TemplateRenderer with Jinja2
  • Write test: TemplateRenderer handles missing template gracefully

Phase 5: URL System

URL Normalization

  • Write test: URLNormalizer.normalize() adds trailing slash
  • Write test: URLNormalizer.normalize() lowercases URL
  • Write test: URLNormalizer.normalize() collapses multiple slashes
  • Write test: URLNormalizer.normalize() ensures leading slash
  • Implement URLNormalizer class
  • Write test: URLNormalizer.normalize() is idempotent
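
As a rough illustration of the four normalization rules and the idempotence requirement above, the following sketch handles site-relative, directory-style paths only. The function name and the decision to ignore query strings and file-style URLs are assumptions.

```python
import re


def normalize_url(url: str) -> str:
    url = url.strip().lower()            # lowercase
    url = re.sub(r"/{2,}", "/", url)     # collapse multiple slashes
    if not url.startswith("/"):
        url = "/" + url                  # ensure leading slash
    if not url.endswith("/"):
        url += "/"                       # ensure trailing slash
    return url
```

Because every step leaves an already-normalized URL unchanged, applying the function twice gives the same result as applying it once, which is the property the idempotence test checks.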

Permalink Generation

  • Write test: PermalinkGenerator.parse_template() finds placeholders
  • Write test: PermalinkGenerator.parse_template() handles format specs ({year:04d})
  • Write test: PermalinkGenerator.apply_template() substitutes category
  • Write test: PermalinkGenerator.apply_template() substitutes date parts
  • Write test: PermalinkGenerator.apply_template() substitutes slug
  • Implement PermalinkGenerator class
  • Write test: PermalinkGenerator.generate() returns normalized URL
  • Write test: PermalinkGenerator.resolve_output_path() appends index.html
  • Write test: PermalinkGenerator validates template has required placeholders

Collision Detection

  • Write test: CollisionDetector.build_url_map() creates mapping
  • Write test: CollisionDetector.find_collisions() detects duplicate URLs
  • Write test: CollisionDetector.find_collisions() normalizes before comparing
  • Write test: CollisionDetector.format_error() lists all source files
  • Implement CollisionDetector class
  • Write test: CollisionDetector.detect() handles empty list
  • Write test: CollisionDetector catches index URL vs content URL collision

Phase 6: Content Processing

Markdown Transformation

  • Write test: MarkdownTransformer.transform() converts headers
  • Write test: MarkdownTransformer.transform() converts lists
  • Write test: MarkdownTransformer.transform() handles code blocks
  • Write test: MarkdownTransformer.configure_extensions() enables extensions
  • Implement MarkdownTransformer class
  • Write test: MarkdownTransformer with tables extension

Content Processor

  • Write test: ContentProcessor.read_file() loads content
  • Write test: ContentProcessor.parse_content() splits frontmatter and body
  • Write test: ContentProcessor.should_process() returns true for .md files
  • Write test: ContentProcessor.process() combines all steps
  • Implement ContentProcessor class
  • Write test: ContentProcessor.process() sets templates_used
  • Write test: ContentProcessor.process() resolves metadata
  • Write test: ContentProcessor.process() renders through template

Phase 7: Cache System

Cache Key Computation

  • Write test: CacheManager.compute_cache_key() includes content hash
  • Write test: CacheManager.compute_cache_key() includes metadata
  • Write test: CacheManager.compute_cache_key() includes template hash
  • Write test: CacheManager.compute_cache_key() includes permalink template
  • Write test: CacheManager.compute_cache_key() is deterministic
  • Write test: CacheManager.compute_cache_key() changes when content changes
  • Implement CacheManager.compute_cache_key()

Cache Operations

  • Write test: CacheManager.needs_processing() returns true on cold start
  • Write test: CacheManager.needs_processing() returns false on cache hit
  • Write test: CacheManager.needs_processing() returns true when key differs
  • Write test: CacheManager.load_cached() returns stored HTML
  • Write test: CacheManager.mark_processed() stores cache entry
  • Implement CacheManager class with in-memory storage
  • Write test: CacheManager persists across instances

Manifest Management

  • Write test: ManifestManager.load() parses JSON manifest
  • Write test: ManifestManager.load() returns None if missing
  • Write test: ManifestManager.save_atomic() writes to temp then renames
  • Write test: ManifestManager.save_atomic() includes all item fields
  • Write test: ManifestManager.compare() finds orphaned files
  • Implement ManifestManager class
  • Write test: ManifestManager handles corrupted manifest gracefully
  • Write test: ManifestManager includes template hashes in manifest
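
The "write to temp then rename" behaviour tested above might look like the sketch below. It assumes a JSON manifest and treats a corrupted file the same as a missing one; the function names mirror the checklist, but the signatures are guesses.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_atomic(path: Path, manifest: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(manifest, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load(path: Path) -> Optional[dict]:
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None
    except json.JSONDecodeError:
        return None  # assumption: a corrupted manifest forces a full rebuild
```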

Phase 8: Index System

Pagination

  • Write test: Paginator.paginate() splits items by page_size
  • Write test: Paginator.paginate() handles empty list
  • Write test: Paginator.paginate() handles partial last page
  • Write test: Paginator.get_page_url() generates /page/2/ format
  • Write test: Paginator.get_page_url() generates /category/page/2/ for categories
  • Write test: Paginator.get_page_url() generates /index.html for page 1
  • Implement Paginator class
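
A small sketch of the pagination behaviour listed above follows. The URL shapes for /page/2/, /category/page/2/ and /index.html come from the tests; returning no pages at all for an empty list and using /category/index.html for a category's first page are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def paginate(items: Sequence[T], page_size: int) -> List[List[T]]:
    if page_size < 1:
        raise ValueError("page_size must be >= 1")
    # Step through the list in page_size chunks; the last chunk may be partial.
    return [list(items[i:i + page_size]) for i in range(0, len(items), page_size)]


def get_page_url(page: int, category: str = "") -> str:
    prefix = f"/{category}" if category else ""
    if page == 1:
        return f"{prefix}/index.html"
    return f"{prefix}/page/{page}/"
```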

Index Generation

  • Write test: IndexGenerator.create_main_index() sorts by date desc
  • Write test: IndexGenerator.create_main_index() paginates correctly
  • Write test: IndexGenerator.create_category_indexes() groups by category
  • Write test: IndexGenerator.compute_cache_key() includes items_on_page
  • Write test: IndexGenerator.compute_cache_key() includes pagination context
  • Write test: IndexGenerator.compute_cache_key() includes total_pages
  • Implement IndexGenerator class
  • Write test: IndexGenerator.should_rebuild() detects membership change
  • Write test: IndexGenerator.should_rebuild() detects ordering change
  • Write test: IndexGenerator.generate() creates IndexItems with correct URLs

Phase 9: Stage 1 - SCAN

Directory Walking

  • Write test: ScanStage.walk_directory() finds all files
  • Write test: ScanStage.walk_directory() skips blacklisted paths
  • Write test: ScanStage.walk_directory() skips dot directories
  • Write test: ScanStage.walk_directory() skips output directory
  • Write test: ScanStage.should_ignore() matches blacklist patterns
  • Implement ScanStage.walk_directory()

File Classification

  • Write test: ScanStage.classify_file() returns ‘content’ for .md
  • Write test: ScanStage.classify_file() returns ‘asset’ for .css
  • Write test: ScanStage.classify_file() returns ‘asset’ for images
  • Implement ScanStage.classify_file()

BuildItem Creation

  • Write test: ScanStage.create_build_item() extracts initial_slug from filename
  • Write test: ScanStage.create_build_item() extracts initial_category from parent
  • Write test: ScanStage.create_build_item() handles top-level files
  • Write test: ScanStage.create_build_item() captures file stats
  • Implement ScanStage.create_build_item()

Scan Integration

  • Write test: ScanStage.run() returns sorted list of BuildItems
  • Write test: ScanStage.run() marks all items as SCANNED
  • Write test: ScanStage.run() on sample directory structure
  • Implement ScanStage.run()
  • Write test: ScanStage deterministic ordering (same input → same output)

Phase 10: Stage 2 - BUILD Phase 1

Content Item Processing

  • Write test: BuildStage.process_content_item() reads file
  • Write test: BuildStage.process_content_item() parses frontmatter
  • Write test: BuildStage.process_content_item() merges metadata
  • Write test: BuildStage.process_content_item() transforms markdown
  • Write test: BuildStage.process_content_item() renders template
  • Write test: BuildStage.process_content_item() computes cache_key
  • Write test: BuildStage.process_content_item() generates permalink
  • Implement BuildStage.process_content_item()

Asset Item Processing

  • Write test: BuildStage.process_asset_item() computes cache_key
  • Write test: BuildStage.process_asset_item() generates output path
  • Write test: BuildStage.process_asset_item() preserves relative path
  • Implement BuildStage.process_asset_item()

Cache Integration

  • Write test: BuildStage.run_phase_one() checks cache before processing
  • Write test: BuildStage.run_phase_one() loads cached HTML on hit
  • Write test: BuildStage.run_phase_one() processes only cache misses
  • Write test: BuildStage.run_phase_one() assigns cache_key to all items
  • Implement BuildStage.run_phase_one()

Error Collection

  • Write test: BuildStage.collect_errors() accumulates parse errors
  • Write test: BuildStage.collect_errors() accumulates template errors
  • Write test: BuildStage.collect_errors() continues processing after errors
  • Implement BuildStage.collect_errors()

Phase 11: Stage 2 - BUILD Phase 2

Index Creation

  • Write test: BuildStage.run_phase_two() generates main index
  • Write test: BuildStage.run_phase_two() generates category indexes
  • Write test: BuildStage.run_phase_two() uses cache_keys from phase 1
  • Write test: BuildStage.run_phase_two() computes index cache keys
  • Write test: BuildStage.run_phase_two() checks index cache
  • Write test: BuildStage.run_phase_two() renders index templates
  • Implement BuildStage.run_phase_two()

Collision Detection Integration

  • Write test: BuildStage.detect_collisions() checks content items
  • Write test: BuildStage.detect_collisions() checks index items
  • Write test: BuildStage.detect_collisions() normalizes URLs first
  • Write test: BuildStage.detect_collisions() returns CollisionErrors
  • Implement BuildStage.detect_collisions()

Build Integration

  • Write test: BuildStage.run() executes phase 1 then phase 2
  • Write test: BuildStage.run() detects collisions after phase 2
  • Write test: BuildStage.run() returns errors if any found
  • Write test: BuildStage.run() returns built items if no errors
  • Implement BuildStage.run()
  • Write test: BuildStage.run() on complete sample site

Phase 12: Stage 3 - WRITE

Output Writing

  • Write test: OutputWriter.write_text() creates parent directories
  • Write test: OutputWriter.write_text() writes UTF-8 content
  • Write test: OutputWriter.copy_file() preserves mtime for assets
  • Write test: OutputWriter.ensure_directory() creates nested paths
  • Write test: OutputWriter.fsync() ensures durability
  • Implement OutputWriter class

Timestamped Directories

  • Write test: WriteStage.create_timestamped_directory() uses YYYYMMDD_HHMMSS format
  • Write test: WriteStage.create_timestamped_directory() creates directory
  • Implement WriteStage.create_timestamped_directory()

Atomic Swapping

  • Write test: AtomicSwapper.verify_symlink_support() detects platform support
  • Write test: AtomicSwapper.create_symlink() creates symlink
  • Write test: AtomicSwapper.update_symlink_atomic() uses temp+rename pattern
  • Write test: AtomicSwapper.swap() updates symlink atomically
  • Implement AtomicSwapper class
  • Write test: AtomicSwapper.swap() preserves old symlink target on failure
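
The temp+rename symlink update named above can be sketched as follows for POSIX systems, where renaming one symlink over another replaces it in a single step. The function name and the ".tmp" suffix are illustrative, and platforms without symlink support would need a different strategy.

```python
import os
import tempfile
from pathlib import Path


def swap(link: Path, new_target: Path) -> None:
    """Point `link` at `new_target` without a moment where `link` is missing."""
    tmp_link = link.with_name(link.name + ".tmp")
    if os.path.lexists(tmp_link):
        tmp_link.unlink()                 # clear a leftover from an earlier failed run
    os.symlink(new_target, tmp_link, target_is_directory=True)
    try:
        os.replace(tmp_link, link)        # rename swaps the link itself, not its target
    except OSError:
        tmp_link.unlink()                 # on failure the old link is left untouched
        raise
```

If the rename fails, the previous symlink still points at the previous build, which is the behaviour the "preserves old symlink target on failure" test describes.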

Manifest Writing

  • Write test: WriteStage.write_manifest() before symlink update
  • Write test: WriteStage.write_manifest() includes all items
  • Write test: WriteStage.write_manifest() includes template hashes
  • Write test: WriteStage.write_manifest() uses atomic write
  • Implement WriteStage.write_manifest()

Cleanup

  • Write test: OutputCleaner.find_orphans() compares manifests
  • Write test: OutputCleaner.delete_old_outputs() keeps N recent
  • Write test: OutputCleaner.keep_recent() sorts by timestamp
  • Write test: OutputCleaner.cleanup() removes orphan files
  • Implement OutputCleaner class
  • Write test: OutputCleaner.cleanup() skips non-empty directories

Write Integration

  • Write test: WriteStage.run() creates timestamped directory
  • Write test: WriteStage.run() writes all items to temp directory
  • Write test: WriteStage.run() writes manifest
  • Write test: WriteStage.run() updates symlink
  • Write test: WriteStage.run() cleans old outputs
  • Write test: WriteStage.run() on complete sample site
  • Implement WriteStage.run()
  • Write test: WriteStage.run() leaves previous output on failure

Phase 13: Pipeline Integration

Build Context

  • Write test: BuildContext.add_item() stores item
  • Write test: BuildContext.get_items() returns all items
  • Write test: BuildContext filters items by state
  • Implement BuildContext class

Pipeline Orchestration

  • Write test: Pipeline.run() executes SCAN stage
  • Write test: Pipeline.run() executes BUILD stage
  • Write test: Pipeline.run() executes WRITE stage
  • Write test: Pipeline.run() aborts before WRITE if BUILD errors
  • Write test: Pipeline.run() returns success on complete build
  • Implement Pipeline.run()
  • Write test: Pipeline.run_dry() skips WRITE stage
  • Implement Pipeline.run_dry()

End-to-End Tests

  • Write test: Full pipeline on minimal site (1 page, 1 asset)
  • Write test: Full pipeline on multi-category site
  • Write test: Full pipeline with pagination (11+ posts)
  • Write test: Incremental build (change 1 file, rebuild)
  • Write test: Template change invalidation
  • Write test: Collision detection prevents build
  • Write test: Invalid frontmatter aborts before WRITE
  • Write test: Second build uses cache (fast rebuild)

Phase 14: Edge Cases & Robustness

Encoding & Parsing

  • Write test: ContentProcessor handles UTF-8 with BOM
  • Write test: ContentProcessor handles mixed line endings
  • Write test: MetadataExtractor handles TOML frontmatter
  • Write test: MetadataExtractor handles missing frontmatter delimiter

Template Edge Cases

  • Write test: TemplateRenderer handles template with no includes
  • Write test: TemplateRenderer handles deeply nested includes
  • Write test: TemplateDependencyTracker handles partial with same name as template

Permalink Edge Cases

  • Write test: PermalinkGenerator handles missing date in metadata
  • Write test: PermalinkGenerator handles empty category
  • Write test: PermalinkGenerator handles special characters in slug
  • Write test: PermalinkGenerator validates required placeholders present

Cache Edge Cases

  • Write test: CacheManager handles corrupted cache gracefully
  • Write test: CacheManager handles schema version mismatch
  • Write test: ManifestManager handles missing manifest file

Write Edge Cases

  • Write test: WriteStage handles disk full error
  • Write test: WriteStage handles permission denied
  • Write test: AtomicSwapper handles existing symlink
  • Write test: AtomicSwapper rollback on failure
  • Write test: OutputCleaner handles permission errors on delete

Index Edge Cases

  • Write test: IndexGenerator handles zero posts
  • Write test: IndexGenerator handles exactly one page of posts
  • Write test: Paginator handles page_size=1
  • Write test: IndexGenerator handles category with one post

Phase 15: Performance Validation

Benchmarks

  • Benchmark: SCAN 1000 files
  • Benchmark: BUILD 1000 posts (cold)
  • Benchmark: BUILD 1000 posts with 1 change (incremental)
  • Benchmark: Template change affecting 100 posts
  • Benchmark: WRITE 1000 files
  • Benchmark: Full pipeline 1000 posts

Memory Profiling

  • Profile: Memory usage during BUILD with 1000 posts
  • Profile: Memory usage during index generation
  • Test: Verify no memory leaks across multiple builds

Stress Tests

  • Test: Build with 5000 posts
  • Test: Site with 50 categories
  • Test: Site with 1000 tags
  • Test: Deeply nested directory structure (10+ levels)
  • Test: Very long post (10MB markdown file)
  • Test: 100+ posts in single category (pagination)

Implementation Notes

Test Fixtures Structure

tests/
├── fixtures/
│   ├── minimal_site/
│   │   ├── content/
│   │   │   └── hello.md
│   │   ├── templates/
│   │   │   └── default.html
│   │   └── config.toml
│   ├── multi_category/
│   │   ├── content/
│   │   │   ├── python/
│   │   │   │   ├── intro.md
│   │   │   │   └── advanced.md
│   │   │   └── rust/
│   │   │       └── ownership.md
│   │   └── templates/
│   │       ├── default.html
│   │       └── python.html
│   └── pagination_site/
│       ├── content/
│       │   └── posts/ (15 markdown files)
│       └── templates/
│           ├── default.html
│           └── index.html
└── unit/
    ├── test_models.py
    ├── test_metadata.py
    ├── test_templates.py
    ├── test_urls.py
    ├── test_cache.py
    ├── test_indexes.py
    └── test_stages.py

Ssg_design 2

SSGv3: Complete Design Document

Static Site Generator with Validate-Then-Commit Architecture

Version: 3.0
Last Updated: 2025-10-04


Table of Contents

  1. Core Concept
  2. Architecture Overview
  3. Build Flow
  4. Entities
  5. Configuration System
  6. Frontmatter Parsing
  7. Permalink System
  8. Cache Strategy
  9. Index Generation
  10. Pipeline Stages
  11. Processor Interface
  12. Service Interfaces
  13. Error Handling
  14. Performance Characteristics

Core Concept

SSGv3 uses a validate-then-commit architecture with incremental validation by default.

Key Principles:

  • Everything flows through the system as a BuildItem
  • Two-phase build: Validate (read-only) then Commit (write)
  • Cache is trusted for incremental builds (fast)
  • Full validation available for paranoid builds (safe)
  • Atomic output updates - never partial builds
  • All errors reported upfront before any writes

Build Philosophy:

  1. Load config
  2. Discover all content
  3. Validate everything (trust cache for unchanged items)
  4. If ANY errors → abort and show all errors
  5. If validation passes → commit all changes atomically

Architecture Overview

Two-Phase Design

Phase 1: VALIDATE (Read-Only)

  • Discover all files
  • Check cache for unchanged items
  • Fully process changed/new items
  • Detect all errors
  • Compute all output paths
  • Check for collisions
  • No disk writes

Phase 2: COMMIT (Write)

  • Write all changed outputs
  • Update cache atomically
  • Generate indexes
  • Write manifest
  • Clean up orphaned files

Component Hierarchy

SSGv3
├── Config (from config.toml)
├── BuildContext (shared state)
├── BuildPipeline
│   ├── DiscoveryStage
│   ├── ValidationStage
│   └── CommitStage
├── Processors
│   ├── MarkdownContentProcessor
│   ├── SassProcessor
│   ├── CopyProcessor
│   └── IndexGenerator
└── Services
    ├── CacheManager
    ├── TemplateRenderer
    ├── OutputWriter
    ├── PermalinkGenerator
    └── MetadataExtractor

Build Flow

Complete Build Sequence

1. Startup
   - Parse CLI arguments
   - Load config.toml
   - Deserialize to Config dataclass
   - Run ConfigValidator
   - Initialize Services
   - Initialize BuildPipeline

2. DiscoveryStage
   - Walk content directories
   - Apply PathFilter (skip blacklisted, dot files, output/cache dirs)
   - Extract category from directory structure
   - Create BuildItem(state=RAW) for each file
   - Output: List[BuildItem]

3. ValidationStage (Incremental Mode - Default)
   For each BuildItem:
     a. Compute content hash
     b. Check cache:
        - Hash matches → load metadata from cache, mark VALIDATED
        - Hash differs → full validation:
          * Read file
          * Parse frontmatter
          * Extract metadata
          * Transform content (markdown→HTML)
          * Detect errors
          * Mark VALIDATED or ERROR
     
     c. Check URL collisions across all items
     d. If any errors → collect and abort before commit
   
   Output: List[BuildItem(state=VALIDATED)] or error list

4. CommitStage
   - Call processor.before_build() for all processors
   
   For each changed item:
     - Write output file
     - Update cache entry
   
   - Call processor.after_build() for all processors
     * IndexGenerator runs here
     * Generates category indexes, main index, RSS
   
   - Update manifest
   - CleanupStage: remove orphaned outputs
   
   Output: Build statistics

5. Report Results
   - Total files
   - Processed (changed)
   - Skipped (cached)
   - Failed (validation errors)

Incremental Validation Details

Cache Trust Strategy:

  • If the key built from hash(file_content), processor_id, and template_hashes matches the cache entry → skip processing
  • Load metadata from cache (don’t re-read/re-parse file)
  • Still check URL collisions
  • Still track template dependencies

Performance:

  • 1000 posts, 1 changed: ~1 second
  • 999 hash checks (~1ms each)
  • 1 full processing
  • Memory holds only changed content + all metadata

Cache Invalidation Triggers:

  • Source file content changed
  • Template file changed (via dependency tracking)
  • Config changed (permalink format, markdown settings)
  • Cache schema version mismatch
  • Manual: ssg clean

Full Validation Mode (Optional)

ssg build --full-validation

Or in config.toml:

[build]
validation_mode = "full"

Full mode adds:

  • Stat every file (verify exists)
  • Check file permissions
  • Verify file readable
  • Still skip content transformation if hash matches

Use for:

  • CI/CD deployments
  • Production builds
  • After system updates
  • Debugging

Entities

BuildItem (Base Class)

Represents anything the system builds.

Attributes:

BuildItem:
  src: Path | None          # Source file (None for generated items)
  out: Path                 # Output path
  kind: str                 # "content", "asset", "index", "feed"
  mtime: float              # Source modification time
  state: BuildState         # RAW | VALIDATED | ERROR | WRITTEN
  hash: str                 # Content hash for cache checking

States:

  • RAW - Just discovered, not validated
  • VALIDATED - Checked, ready to commit
  • ERROR - Validation failed; blocks the commit
  • WRITTEN - Output file written

ContentItem (Specialized)

ContentItem(BuildItem):
  metadata: dict
    - title: str
    - date: datetime
    - date_iso: str
    - date_formatted: str
    - slug: str
    - category: str
    - tags: List[str]
    - url: str
  
  html: str                      # Rendered HTML content
  templates_used: List[Path]     # For dependency tracking

AssetItem (Specialized)

AssetItem(BuildItem):
  asset_type: str                # "static", "sass", "image"

IndexItem (Specialized)

IndexItem(BuildItem):
  page: int                      # Pagination page number
  items: List[ContentItem]       # Content in this index
  metadata: dict
    - category: str (optional)
    - total_pages: int

Examples

Blog Post:

ContentItem(
  src=Path('content/python/hello.md'),
  out=Path('public/python/2024/10/hello/index.html'),
  kind='content',
  state=VALIDATED,
  metadata={
    'title': 'Hello Python',
    'category': 'python',
    'date': datetime(2024, 10, 4),
    'slug': 'hello',
    'url': '/python/2024/10/hello/'
  }
)

Static Asset:

AssetItem(
  src=Path('assets/css/style.css'),
  out=Path('public/assets/css/style.css'),
  kind='asset',
  asset_type='static',
  state=VALIDATED
)

Generated Index:

IndexItem(
  src=None,
  out=Path('public/python/index.html'),
  kind='index',
  state=VALIDATED,
  metadata={'category': 'python'}
)

Configuration System

Config Dataclass

from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

@dataclass
class Config:
    # Paths
    content_dir: Path
    asset_dir: Path
    output_dir: Path
    cache_dir: Path = Path('.cache')
    template_dir: Optional[Path] = None
    
    # Content (list/dict defaults must use default_factory in a dataclass)
    content_extensions: List[str] = field(default_factory=lambda: ['.md', '.markdown'])
    blacklist: List[str] = field(default_factory=list)
    
    # Permalinks
    permalink_templates: Dict[str, str] = field(default_factory=lambda: {
        'default': '/{year}/{month:02d}/{day:02d}/{slug}/'
    })
    default_category: str = ''
    
    # Site
    site_title: str = 'My Site'
    site_url: str = 'https://example.com'
    site_description: str = ''
    site_author: str = ''
    
    # Pagination
    page_size: int = 10
    
    # Markdown
    markdown_extensions: List[str] = field(
        default_factory=lambda: ['codehilite', 'fenced_code', 'toc', 'tables'])
    markdown_extension_configs: Dict[str, Dict] = field(default_factory=dict)
    
    # Build
    incremental: bool = True
    validation_mode: str = 'incremental'  # or 'full'
    clean_output: bool = False
    
    @classmethod
    def from_toml(cls, path: Path) -> 'Config':
        ...  # Parse the TOML file, convert paths, and build a Config

config.toml Example

# config.toml

# Paths
content_dir = "content"
asset_dir = "assets"
output_dir = "public"
cache_dir = ".cache"
template_dir = "templates"

# Content
content_extensions = [".md", ".markdown"]
blacklist = ["drafts/", "README.md"]

# Permalinks
default_category = ""

[permalink_templates]
default = "/{year}/{month:02d}/{day:02d}/{slug}/"
page = "/{slug}/"
doc = "/{category}/{slug}/"

# Site
[site]
title = "My Blog"
url = "https://example.com"
description = "A blog about things"
author = "Jane Doe"

# Pagination
[pagination]
page_size = 10

# Markdown
[markdown]
extensions = ["codehilite", "fenced_code", "toc", "tables"]

[markdown.extension_configs.codehilite]
css_class = "highlight"
linenums = true

[markdown.extension_configs.toc]
anchorlink = true

# Build
[build]
incremental = true
validation_mode = "incremental"
clean_output = false

Configuration Loading

1. CLI specifies: ssg build --config path/to/config.toml
2. Or look in current directory: ./config.toml
3. Parse TOML → dict
4. Convert string paths to Path objects
5. Deserialize to Config dataclass
6. Run ConfigValidator.validate(config)
7. Pass to BuildPipeline

Configuration Validation

ConfigValidator checks:

  • Required fields present
  • Paths exist (content_dir, asset_dir)
  • Permalink templates valid:
    • {slug} present (required)
    • Valid format syntax
    • Known placeholders only
  • site_url format (http:// or https://)
  • No conflicting directories (output != cache)
  • page_size > 0
  • markdown_extensions are strings
  • validation_mode in ['incremental', 'full']

On validation error:

  • Print helpful error message
  • Show expected format
  • Suggest fix
  • Exit with code 1

Environment-Specific Configs

config.toml          # Base
config.dev.toml      # Dev overrides
config.prod.toml     # Prod overrides

ssg build --env dev
ssg build --env prod

Merge base config with environment-specific overrides.
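
The design does not say how the merge works. One plausible reading, sketched below, is a recursive merge in which nested tables are combined key by key while scalars and lists in the override replace the base value outright. The function name and these semantics are assumptions.

```python
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    merged = dict(base)  # shallow copy so the base config is left untouched
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested tables
        else:
            merged[key] = value                             # override wins
    return merged
```

With this approach config.dev.toml only needs to list the keys that differ, for example a different site url, and everything else is inherited from config.toml.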


Frontmatter Parsing

When Parsing Happens

NOT in DiscoveryStage (too expensive)

IN ValidationStage (only for changed items)

Parsing Flow

1. Read file content
   - Try utf-8 encoding
   - Fallback to latin-1
   
2. Use python-frontmatter library
   - Separates YAML/TOML frontmatter from content
   - Returns metadata dict + content string
   
3. Parse dates with dateutil.parser
   - Handles: "2024-10-04 13:00 IST"
   - Handles: "2024-09-22 11:54 +0530"
   - Strip timezone (keep naive datetime)
   
4. Merge metadata with defaults

Metadata Precedence

Final metadata = generated_defaults 
               + directory_metadata
               + frontmatter_overrides

Generated defaults:

  • slug from filename
  • title from filename
  • date from file mtime

Directory metadata:

  • category from parent directory

Frontmatter overrides:

  • Any field in frontmatter overrides above
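
The precedence chain above maps naturally onto dictionary unpacking, where later dictionaries win. This is a minimal sketch: resolve_metadata and the way a title is derived from the filename are assumptions for illustration, not part of the design.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Dict


def resolve_metadata(src: Path, mtime: float, category: str,
                     frontmatter: Dict[str, Any]) -> Dict[str, Any]:
    """Merge generated defaults, directory metadata and frontmatter, in that order."""
    generated_defaults = {
        "slug": src.stem,
        # Assumed title rule: hyphens become spaces, words are capitalised.
        "title": src.stem.replace("-", " ").title(),
        "date": datetime.fromtimestamp(mtime),
    }
    directory_metadata = {"category": category}
    # Later dictionaries override earlier ones, so frontmatter always wins.
    return {**generated_defaults, **directory_metadata, **frontmatter}
```

A file with no frontmatter keeps every default; a file that sets only category keeps the generated slug, title and date.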

Frontmatter Example

---
title: My Custom Title
date: 2024-10-04 13:00 IST
category: tutorials
tags: [python, web]
slug: custom-slug
---

# Content starts here

Error Handling

Malformed YAML:

  • Log error: “Failed to parse frontmatter in {file}: {error}”
  • Return None from validation
  • Mark item as ERROR
  • Continue validating other files
  • Show all errors at end

Invalid date format:

  • Log warning: “Invalid date in {file}, using mtime”
  • Fall back to file mtime
  • Continue processing

Missing fields:

  • Use defaults (title from filename, date from mtime)
  • No error, just defaults

Unreadable file:

  • Log error: “Cannot read {file}: {error}”
  • Mark as ERROR
  • Continue validation

Permalink System

Core Concept

Permalink = Template + Metadata

Template has placeholders filled from item metadata.

Supported Placeholders

  • {year} - 4-digit year
  • {month} - Month number (1-12)
  • {month:02d} - Zero-padded month (01-12)
  • {day} - Day number (1-31)
  • {day:02d} - Zero-padded day (01-31)
  • {slug} - URL-safe slug (REQUIRED)
  • {category} - Category from directory or frontmatter

Template Examples

Blog (date-based):
  /{year}/{month:02d}/{day:02d}/{slug}/
  → /2024/10/04/my-post/

Docs (category-based):
  /{category}/{slug}/
  → /python/my-post/

Flat:
  /{slug}/
  → /my-post/

Hybrid:
  /{category}/{year}/{slug}/
  → /python/2024/my-post/

Template Validation

Required:

  • {slug} must be present (ensures uniqueness)

Optional:

  • Date placeholders (content without dates still works)
  • {category} (works with or without categories)

Invalid:

  • Unknown placeholders → error at config validation
  • Invalid format syntax → error

Category Extraction

From directory structure:

content/                    → category = ""
content/python/             → category = "python"
content/python/web/file.md  → category = "python" (single-level)

Single-level extraction:

  • Uses the first directory level below content_dir only
  • Ignores deeper nesting
  • Simpler URLs, easier reasoning

Future: Could add full-path option via config
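
A small sketch of single-level extraction that reproduces the three example mappings above (the category is the first directory below the content root). The function name and the use of PurePosixPath are assumptions for illustration.

```python
from pathlib import PurePosixPath


def extract_category(src: PurePosixPath, content_dir: PurePosixPath) -> str:
    """Return the first directory below content_dir, or "" for top-level files."""
    rel = src.relative_to(content_dir)
    # rel.parts ends with the filename; anything before it is a directory.
    return rel.parts[0] if len(rel.parts) > 1 else ""
```

Deeper nesting is ignored on purpose, so content/python/web/file.md and content/python/file.md share the category "python".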

Category Override

Frontmatter overrides directory-based category:

---
category: tutorials  # Override "python" from directory
---

Use cases:

  • Content organization ≠ URL structure
  • A/B testing categorizations
  • Legacy content migration

URL Collision Detection

Problem: Two files generate same URL

Detection: During ValidationStage, build map url → src

On collision:

ERROR: URL collision detected
  URL: /python/2024/10/my-post/
  Sources:
    - content/python/my-post.md
    - content/python/tutorials/my-post.md
  
  Fix: Use different slugs or override category in frontmatter

Build aborts - forces explicit resolution
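
The url → src map can be sketched as a dictionary of lists, so that every source claiming a URL is available for the error report. The function name and the (url, src) tuple input are hypothetical simplifications of BuildItem.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_url_collisions(items: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Given (url, src) pairs, return only the URLs claimed by several sources."""
    sources_by_url: Dict[str, List[str]] = defaultdict(list)
    for url, src in items:
        sources_by_url[url].append(src)
    # Sort sources so the error report is stable between runs.
    return {url: sorted(srcs)
            for url, srcs in sources_by_url.items() if len(srcs) > 1}
```

An empty result means the build can proceed; any entry corresponds to one "URL collision detected" block in the report.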

PermalinkGenerator Interface

PermalinkGenerator:
  set_template(format_string: str) -> None
  validate_template() -> None (raises ValueError)
  generate(metadata: dict) -> str

Responsibilities:

  • Parse template placeholders
  • Validate required placeholders
  • Map metadata to URL
  • URL-encode special characters
  • Return path string

NOT responsible for:

  • Extracting metadata
  • Checking uniqueness
  • Writing files

Template Changes

Permalink template is global config.

If template changes:

  • All items have new output paths
  • Cache becomes invalid
  • Full rebuild required
  • Store template hash in manifest to detect

Cache Strategy

Two-Table Schema

Table 1: build_items

CREATE TABLE build_items (
    src TEXT PRIMARY KEY,
    out TEXT NOT NULL,
    kind TEXT NOT NULL,
    hash TEXT NOT NULL,        -- Combined hash for incremental builds
    processor_id TEXT NOT NULL,
    mtime REAL NOT NULL
)

Tracks all processed files (content + assets).

Table 2: content_metadata

CREATE TABLE content_metadata (
    src TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    date_iso TEXT NOT NULL,
    date_formatted TEXT NOT NULL,
    url TEXT NOT NULL,
    slug TEXT NOT NULL,
    category TEXT NOT NULL,
    html_content TEXT NOT NULL,
    templates_used TEXT NOT NULL,  -- JSON array
    FOREIGN KEY (src) REFERENCES build_items(src)
)

Stores content-specific metadata.

Hash Computation

item_hash = hash(file_content)
          + processor.id
          + hash(templates_used)
          + hash(relevant_metadata)

More reliable than mtime:

  • Detects actual content changes
  • Works with git checkout
  • Works with cloud sync
  • Catches template changes

Template Dependency Tracking

Problem: Template change rebuilds everything

Solution: Track which templates each item uses

How:

  1. During rendering, TemplateRenderer records accessed templates
  2. Store in ContentItem.templates_used
  3. Save to content_metadata.templates_used (JSON)
  4. When template changes:
    • Compute new template hash
    • Query items using that template
    • Only rebuild affected items
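
The "query items using that template" step could look like the sketch below. It uses an in-memory SQLite database and a trimmed-down content_metadata table (only src and templates_used), and it decodes the JSON array in Python rather than relying on SQLite's JSON functions; all of that is an assumption made to keep the example self-contained.

```python
import json
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
# Reduced schema for illustration: the real table has more columns.
conn.execute(
    "CREATE TABLE content_metadata (src TEXT PRIMARY KEY, templates_used TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO content_metadata VALUES (?, ?)",
    [
        ("content/python/intro.md", json.dumps(["base.html", "post.html"])),
        ("content/about.md", json.dumps(["base.html", "page.html"])),
    ],
)


def get_items_using_template(db: sqlite3.Connection, template: str) -> List[str]:
    """Return the src of every item whose templates_used list names the template."""
    affected = []
    query = "SELECT src, templates_used FROM content_metadata ORDER BY src"
    for src, templates_json in db.execute(query):
        if template in json.loads(templates_json):
            affected.append(src)
    return affected
```

Changing post.html would then rebuild only intro.md, while changing base.html would rebuild both items.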

Cache Trust in Incremental Mode

If hash matches:

  • Trust file unchanged
  • Load metadata from cache
  • Skip reading file
  • Skip parsing frontmatter
  • Skip content transformation
  • Mark as VALIDATED

Cache checked for:

  • Content hash match
  • Processor ID match
  • Template hashes match

Cache invalidated on:

  • Source file changed
  • Template changed
  • Config changed
  • Processor updated
  • Manual ssg clean

CacheManager Interface

CacheManager:
  needs_processing(src: Path, hash: str) -> bool
  mark_processed(item: BuildItem) -> None
  save_content_metadata(item: ContentItem) -> None
  get_all_content() -> List[ContentItem]
  get_content_by_category(category: str) -> List[ContentItem]
  get_items_using_template(template: Path) -> List[BuildItem]

Cache Failure Handling

Corrupted database:

  • Detect via schema version check
  • Log warning
  • Delete cache
  • Full rebuild

Missing cache:

  • First build, expected
  • Create new cache
  • Process everything

Partial cache:

  • Some items missing
  • Process missing items
  • Update cache

Index Generation

Design Approach

IndexGenerator is a special processor that runs in the after_build() phase.

IndexGenerator Interface

IndexGenerator(Processor):
  can_handle(item) -> False  # Doesn't process discovered items
  
  before_build(context):
    pass  # No setup
  
  process(item, services):
    pass  # Doesn't process items
  
  after_build(context):
    # All index generation happens here

Generation Flow

In after_build():

1. Query all processed content
   items = context.get_all_items(kind='content')

2. Generate category indexes
   by_category = group_by(items, 'metadata.category')
   for category, category_items in by_category:
     create_index(category, category_items)

3. Generate main index (paginated)
   pages = paginate(items, page_size=10)
   for page_num, page_items in enumerate(pages, start=1):
     create_paginated_index(page_num, page_items)

4. Generate RSS feed
   create_feed(items[:20])

5. Generate sitemap
   create_sitemap(items)

Index Pages as BuildItems

IndexItem(
  src=None,  # Virtual
  out=Path('public/python/index.html'),
  kind='index',
  metadata={'category': 'python', 'items': [...]}
)

Written via OutputWriter like any other item.

Pagination

IndexItem(out='public/index.html', metadata={'page': 1})
IndexItem(out='public/page/2/index.html', metadata={'page': 2})
IndexItem(out='public/page/3/index.html', metadata={'page': 3})

Template receives page number and items for that page.
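
The page-to-path mapping above can be sketched as a helper that slices the item list and names each output file. The function name and the tuple return shape are hypothetical; the paths follow the three IndexItem examples.

```python
from typing import List, Sequence, Tuple


def paginate(items: Sequence[str], page_size: int) -> List[Tuple[int, str, List[str]]]:
    """Split items into pages of (page_number, output_path, page_items)."""
    if page_size <= 0:
        raise ValueError("page_size must be > 0")
    pages = []
    for start in range(0, len(items), page_size):
        page_num = start // page_size + 1  # pages are numbered from 1
        # Page 1 lives at the site root; later pages under page/N/.
        out = ("public/index.html" if page_num == 1
               else f"public/page/{page_num}/index.html")
        pages.append((page_num, out, list(items[start:start + page_size])))
    return pages
```

Note that an empty item list yields no pages in this sketch; a real generator would probably still emit an empty first page.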

Why after_build()

Advantages:

  • Only includes successfully validated items
  • Has complete picture of all content
  • Can compute statistics (tag counts, category counts)
  • Validation failures don’t corrupt index

vs during-build approach:

  • During-build could include items that later fail
  • Index might be inconsistent

Pipeline Stages

Stage Interface

PipelineStage(ABC):
  run(context: BuildContext) -> None

Each stage modifies BuildContext and passes it to the next stage.

DiscoveryStage

Input: Config

Process:

1. Walk content_dir recursively
2. Apply PathFilter:
   - Skip output_dir, cache_dir
   - Skip dot directories (.git, .cache)
   - Skip blacklisted paths
   - Skip dot files
3. For each file:
   - Determine kind (check extension)
   - Extract category (parent directory)
   - Stat file (get mtime, size)
   - Create BuildItem(state=RAW)

Output: context.items = List[BuildItem(state=RAW)]

No file reading - fast, filesystem metadata only

ValidationStage

Input: List[BuildItem(state=RAW)]

Process (Incremental Mode):

errors = []

For each item:
  1. Compute hash = hash(file_content) + processor.id + template_hashes
  
  2. Check cache:
     if cache.has_item(src, hash):
       # FAST PATH
       item.metadata = cache.get_metadata(src)
       item.state = VALIDATED
       continue
     
     # SLOW PATH - changed item
     try:
       Read file
       Parse frontmatter
       Extract metadata
       Transform content
       item.state = VALIDATED
     except Exception as e:
       errors.append((item, e))
       item.state = ERROR

3. Build URL map, detect collisions
   
4. If errors:
     print all errors
     abort build

Output:

  • Success: List[BuildItem(state=VALIDATED)]
  • Failure: Error list, exit

Full Validation Mode:

  • Also stat every file
  • Verify readable
  • Still skip transformation if hash matches

CommitStage

Input: List[BuildItem(state=VALIDATED)]

Process:

1. Call processor.before_build(context) for all processors

2. For each changed item:
   - Get processor via can_handle()
   - Call processor.process(item, services)
   - Write output
   - Update cache
   - item.state = WRITTEN

3. Call processor.after_build(context) for all processors
   - IndexGenerator runs here
   - Generates indexes, feeds, sitemap

4. Update manifest atomically

5. CleanupStage:
   - Find outputs in old manifest not in current build
   - Delete orphaned files

Output: Build statistics

Atomic guarantee:

  • Cache updated only after all writes succeed
  • On write failure, cache remains old state
  • Can retry build without corruption

Processor Interface

Base Interface

Processor(ABC):
  id: str                    # Processor version for cache
  
  can_handle(item: BuildItem) -> bool
  
  before_build(context: BuildContext) -> None
  
  process(item: BuildItem, services: Services) -> BuildItem
  
  after_build(context: BuildContext) -> None

Lifecycle Phases

before_build() - Setup

  • Load templates into memory
  • Initialize caches
  • Prepare shared resources
  • Called once per build

process() - Transform

  • Read source (if not cached)
  • Parse/transform content
  • Render templates
  • Write output
  • Return updated item
  • Called per item

after_build() - Finalization

  • Generate auxiliary content
  • Flush caches
  • Create sitemaps/feeds
  • Called once per build

MarkdownContentProcessor

can_handle:

return item.kind == 'content' and item.src.suffix in ['.md', '.markdown']

process:

1. Read file content
2. Parse frontmatter (YAML/TOML)
3. Extract metadata with precedence
4. Parse dates with dateutil
5. Convert markdown to HTML
6. Render template
7. Write output
8. Track templates used
9. Return updated ContentItem

Tracks:

  • Templates used (for dependency tracking)
  • Metadata for index generation

SassProcessor

can_handle:

return item.kind == 'asset' and item.src.suffix == '.scss'

process:

1. Read SCSS
2. Compile to CSS
3. Generate source maps
4. Minify (optional)
5. Write output

CopyProcessor

can_handle:

return item.kind == 'asset'

process:

1. Copy file with shutil.copy2
2. Preserve mtime

Fallback for unknown asset types.
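
Because CopyProcessor accepts every asset, including .scss files, the order in which processors are consulted matters. The sketch below assumes first-match dispatch over an ordered list; Item is a hypothetical stand-in for BuildItem, and the classes carry only their can_handle() logic.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional, Sequence


@dataclass
class Item:
    """Hypothetical stand-in for BuildItem with only the fields can_handle() reads."""
    src: PurePosixPath
    kind: str


class MarkdownContentProcessor:
    def can_handle(self, item: Item) -> bool:
        return item.kind == "content" and item.src.suffix in (".md", ".markdown")


class SassProcessor:
    def can_handle(self, item: Item) -> bool:
        return item.kind == "asset" and item.src.suffix == ".scss"


class CopyProcessor:
    def can_handle(self, item: Item) -> bool:
        return item.kind == "asset"


# Specific processors first; CopyProcessor last because it accepts every asset.
PROCESSORS = (MarkdownContentProcessor(), SassProcessor(), CopyProcessor())


def select_processor(item: Item,
                     processors: Sequence[object] = PROCESSORS) -> Optional[object]:
    """Return the first processor whose can_handle() accepts the item, else None."""
    for processor in processors:
        if processor.can_handle(item):
            return processor
    return None
```

With CopyProcessor listed before SassProcessor, stylesheets would be copied verbatim instead of compiled, so the registration order is part of the contract.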

IndexGenerator

can_handle:

return False  # Special processor

after_build:

1. Query context.get_all_items(kind='content')
2. Group by category
3. Generate category indexes
4. Paginate main index
5. Generate RSS feed
6. Generate sitemap
7. Write all via OutputWriter

Service Interfaces

Services are pluggable for testability and extensibility.

FileManager (Abstract)

FileManager(ABC):
  read(path: Path) -> bytes
  read_text(path: Path, encoding: str) -> str
  write(path: Path, data: bytes) -> None
  write_text(path: Path, content: str) -> None
  copy(src: Path, dest: Path) -> None
  remove(path: Path) -> None
  hash_file(path: Path) -> str
  exists(path: Path) -> bool
  stat(path: Path) -> FileStat

Default: LocalFileManager (filesystem)

Alternative: S3FileManager, MemoryFileManager (testing)

TemplateRenderer (Abstract)

TemplateRenderer(ABC):
  load_templates(template_dir: Path) -> None
  render(template_name: str, context: dict) -> str
  get_template_hash(template_name: str) -> str
  get_templates_used() -> List[Path]

Default: Jinja2Renderer

Features:

  • Template inheritance
  • Partials/includes
  • Custom filters
  • Autoescaping

PermalinkGenerator (Abstract)

PermalinkGenerator(ABC):
  set_template(format: str) -> None
  validate_template() -> None
  generate(metadata: dict) -> str

Default: PatternPermalinkGenerator

Alternative: CustomPermalinkGenerator (complex routing)

MetadataExtractor

MetadataExtractor:
  parse_frontmatter(content: str) -> dict
  extract_from_file(item: BuildItem, config: Config) -> dict
  generate_slug(text: str) -> str
  generate_title(filename: str) -> str
  parse_date(date_value: Any) -> datetime

Handles:

  • Frontmatter parsing (YAML/TOML)
  • Metadata extraction
  • Slug generation
  • Date parsing (flexible formats)

OutputWriter

OutputWriter:
  write_html(path: Path, content: str) -> None
  copy_file(src: Path, dest: Path) -> None
  ensure_directory(path: Path) -> None
  remove_file(path: Path) -> None

All file writes go through this interface.

Services Bundle

Services:
  cache: CacheManager
  templates: TemplateRenderer
  output: OutputWriter
  permalinks: PermalinkGenerator
  metadata: MetadataExtractor

Passed to processor.process() to avoid many parameters.


Error Handling

Validation Phase Errors

Collected, not thrown:

errors: List[Tuple[BuildItem, Exception]] = []

For each item:
  try:
    validate(item)
  except Exception as e:
    errors.append((item, e))
    continue  # Keep validating

All errors shown:

ERROR: Found 3 errors during validation

1. content/python/bad.md
   Failed to parse frontmatter: Invalid YAML at line 5

2. content/tutorials/test.md
   Invalid date format: "not a date"

3. URL collision: /python/2024/10/post/
   - content/python/post.md
   - content/python/tutorials/post.md

Build aborted. Fix errors and retry.

Error Types

Validation errors:

  • Malformed frontmatter
  • Invalid date formats
  • URL collisions
  • Encoding errors
  • Missing required metadata
  • Template not found

Commit errors:

  • Permission denied (output dir)
  • Disk full
  • File locked

Config errors:

  • Invalid TOML syntax
  • Missing required fields
  • Invalid permalink template
  • Invalid paths

Error Recovery

Validation phase:

  • Collect all errors
  • Show complete report
  • Abort before any writes
  • No partial builds

Commit phase:

  • First error stops build
  • Log error clearly
  • Cache remains in old state
  • Can retry without corruption

User-Friendly Messages

Bad:

KeyError: 'slug'

Good:

ERROR: content/python/test.md
  Missing required field: 'slug'
  
  Either:
    1. Add 'slug' to frontmatter, or
    2. Ensure filename can generate valid slug

Performance Characteristics

Incremental Build (Default)

Scenario: 1000 posts, 1 file changed

DiscoveryStage:     ~100ms  (walk filesystem)
ValidationStage:    ~1s     (999 hash checks, 1 full process)
CommitStage:        ~50ms   (write 1 file, update cache)
Total:              ~1.2s

Memory: ~50MB (all metadata + 1 content item)

Full Validation Build

Scenario: 1000 posts, paranoid build

DiscoveryStage:     ~100ms
ValidationStage:    ~5s     (999 stat+hash, 1 full process)
CommitStage:        ~50ms
Total:              ~5.2s

Memory: ~50MB (metadata only)

Cold Build (No Cache)

Scenario: 1000 posts, first build

DiscoveryStage:     ~100ms
ValidationStage:    ~60s    (read + parse + transform all)
CommitStage:        ~2s     (write all outputs, create cache)
Total:              ~62s

Memory: ~500MB (all content in memory during validation)

Optimization Strategies

Parallel Processing (Future):

  • Validate independent items in parallel
  • Thread-safe cache access
  • 4-8x speedup on multi-core systems

Lazy Loading:

  • Load templates on-demand
  • Parse frontmatter only when needed
  • Stream large files

Cache Warming:

  • Pre-compute hashes on file watch
  • Background cache updates
  • Faster incremental builds

BuildContext

Shared state across pipeline stages and processors.

Structure

BuildContext:
  config: Config                    # Site configuration (immutable)
  items: List[BuildItem]            # All discovered/processed items
  manifest: BuildManifest           # Previous build state
  stats: BuildStats                 # Counters
  url_map: Dict[str, Path]          # URL collision detection

Query Interface

get_all_items(kind: Optional[str] = None,
              category: Optional[str] = None,
              tags: Optional[List[str]] = None) -> List[BuildItem]

get_item_by_path(src: Path) -> Optional[BuildItem]

get_items_using_template(template: Path) -> List[BuildItem]

Used by IndexGenerator to query processed content.

BuildStats

BuildStats:
  total_files: int
  processed: int      # Changed items
  skipped: int        # Cached items
  failed: int         # Validation errors
  index_generated: bool
  duration: float

CLI Interface

Commands

# Build site
ssg build

# Build with options
ssg build --config path/to/config.toml
ssg build --env prod
ssg build --full-validation
ssg build --verbose

# Clean cache and output
ssg clean

# Initialize new site
ssg init

# Watch and rebuild on changes (future)
ssg watch

# Serve locally (future)
ssg serve

Build Command Options

--config PATH       Config file location (default: ./config.toml)
--env ENV           Environment (dev/prod, loads config.ENV.toml)
--full-validation   Use full validation mode (paranoid)
--verbose           Show debug output
--dry-run           Validate only, don't write outputs
--clean             Clean before building

Exit Codes

0   Success
1   Validation errors
2   Commit errors
3   Configuration errors
4   File system errors

Watch Mode (Future Feature)

Design

1. Initial build
2. Watch filesystem for changes
3. On change:
   - Debounce (wait 100ms for more changes)
   - Determine affected items
   - Run incremental build
   - Notify browser (LiveReload)

Change Detection

File changed:

  • Reprocess that item only
  • Update cache
  • Regenerate indexes

Template changed:

  • Query items using that template
  • Reprocess affected items
  • Update cache

Config changed:

  • Full rebuild required
  • Restart watch

LiveReload Integration

1. Build includes JS snippet
2. SSG serves WebSocket
3. On rebuild, send reload message
4. Browser refreshes

Plugin System (Future Feature)

Plugin Interface

Plugin(ABC):
  name: str
  version: str
  
  register_processors() -> List[Processor]
  register_filters() -> Dict[str, Callable]
  register_commands() -> List[Command]
  
  on_config_loaded(config: Config) -> None
  on_build_start(context: BuildContext) -> None
  on_build_complete(context: BuildContext) -> None

Example Plugin

class ImageOptimizationPlugin(Plugin):
    def register_processors(self):
        return [ImageOptimizer()]
    
    def on_build_complete(self, context):
        # Generate responsive images
        for item in context.get_all_items(kind='asset'):
            if is_image(item):
                generate_thumbnails(item)

Plugin Discovery

[plugins]
enabled = ["image-optimization", "search-index"]

[plugins.image-optimization]
quality = 85
formats = ["webp", "avif"]

[plugins.search-index]
fields = ["title", "content", "tags"]

Testing Strategy

Unit Tests

Test data classes:

def test_build_item_state_transitions():
    item = BuildItem(state=RAW)
    assert item.state == RAW
    item.state = VALIDATED
    assert item.state == VALIDATED

Test utilities:

def test_slug_generation():
    assert generate_slug("Hello World") == "hello-world"
    assert generate_slug("C++") == "cpp"

Integration Tests

Test pipeline stages:

def test_discovery_stage():
    # Create temp filesystem
    # Run discovery
    # Assert correct items found

Test processors:

def test_markdown_processor():
    # Create test markdown file
    # Process
    # Assert HTML output correct

End-to-End Tests

Test complete builds:

def test_incremental_build():
    # Build once
    # Modify one file
    # Build again
    # Assert only one file processed

Snapshot Testing

Test output stability:

def test_output_unchanged():
    # Build with known input
    # Compare output to snapshot
    # Assert no differences

Performance Tests

Benchmark builds:

def test_build_performance():
    # Build 1000 posts
    # Assert completes in < 60s

Migration from Current SSG

Compatibility Layer

Map old concepts to SSGv3:

Old FileInfo          → BuildItem
Old ContentType       → Processor
Old should_rebuild()  → needs_processing() + cache check
Old process()         → validate() + commit()
Old CacheManager      → BuildContext + cache tables

Migration Steps

  1. Add state tracking:
    • Wrap FileInfo in BuildItem
    • Add state field
  2. Split processing:
    • Extract validation logic
    • Separate from writing
  3. Update cache schema:
    • Add hash column
    • Add templates_used column
    • Migration script for old caches
  4. Refactor pipeline:
    • Create stage classes
    • Move logic from methods to stages
  5. Add validation mode:
    • Implement incremental mode
    • Add full validation option

Backward Compatibility

Config migration:

# Old CONFIG dict still works
config = Config.from_dict(CONFIG)

Keep old CLI:

# Old commands still work
python ssg_generator.py build

Gradual adoption:

  • Stage 1: Add BuildItem wrapper
  • Stage 2: Add validation phase
  • Stage 3: Enable commit phase
  • Stage 4: Remove old code

Future Enhancements

Near-Term

  1. Watch mode - File watching + live reload
  2. Parallel processing - Multi-threaded builds
  3. Better error messages - Context + suggestions
  4. Progress bars - Visual feedback during builds
  5. Dry-run mode - Validate without writing

Medium-Term

  1. Plugin system - Extensible architecture
  2. Multiple content formats - RST, AsciiDoc, Org
  3. Asset pipeline - Sass, TypeScript, image optimization
  4. Search index - Client-side full-text search
  5. Multilingual - i18n support

Long-Term

  1. Distributed builds - Build on multiple machines
  2. Cloud storage - S3/GCS output
  3. Incremental deploys - Only upload changed files
  4. Build analytics - Performance insights
  5. Visual editor - GUI for content management

Appendix: Design Decisions

Why Validate-Then-Commit?

Problem with immediate writes:

  • Partial builds leave broken output
  • Can’t show all errors at once
  • Hard to implement dry-run
  • Difficult to roll back

Validate-then-commit solves:

  • Output always consistent
  • All errors reported upfront
  • Dry-run is just “stop after validate”
  • Easy rollback (just don’t commit)

Why Trust Cache by Default?

Incremental mode is fast:

  • 1000 posts, 1 changed: ~1 second
  • Matches user expectations
  • Good for development workflow

Full validation available when needed:

  • CI/CD deployments
  • Production builds
  • After system updates

Risk is low:

  • Hash detects content changes
  • Template tracking detects template changes
  • Config changes force full rebuild

Why Single-Level Categories?

Simpler:

  • Easier to understand
  • Cleaner URLs
  • Less nesting complexity

Good enough:

  • Most sites have 5-10 categories
  • Deep nesting rare
  • Can add full-path later

Why Jinja2?

Mature:

  • Battle-tested
  • Well-documented
  • Large ecosystem

Features:

  • Template inheritance
  • Macros/includes
  • Custom filters
  • Autoescaping

Alternative:
Could support multiple template engines via TemplateRenderer interface

Why TOML Config?

Readable:

  • Clean syntax
  • Comments supported
  • No significant whitespace

Type-safe:

  • Clear data types
  • Nested structures
  • Arrays and tables

Alternative:
Could support YAML via same Config.from_dict() pattern


Appendix: Glossary

BuildItem - Representation of anything the system builds (content, asset, index)

Processor - Handler that transforms BuildItems (markdown→HTML, SCSS→CSS)

Pipeline Stage - Phase of the build process (discovery, validation, commit)

BuildContext - Shared state across pipeline stages and processors

Services - Collection of utility objects (cache, templates, output writer)

Frontmatter - YAML/TOML metadata at the top of content files

Permalink - URL pattern for generated pages

Category - Classification from directory structure or frontmatter

Slug - URL-safe identifier derived from title or filename

Incremental Build - Only rebuild changed items

Cache - Persistent storage of previous build state

Hash - Content fingerprint for detecting changes

Template Dependency - Tracking which templates each item uses

Validation Phase - Read-only checking of all items

Commit Phase - Write all validated outputs

Index - Generated page listing multiple content items (blog index, category page)


Document Information

Version: 3.0
Date: 2025-10-04
Status: Complete Design
Next Steps: Implementation

Changes from v2:

  • Added validate-then-commit architecture
  • Added incremental validation mode
  • Added full validation option
  • Refined cache trust strategy
  • Added complete error handling
  • Added performance characteristics
  • Added migration guide



End of Document

# permalink

Ssg_design 1

Static Site Generator v3

Design - Core Specification

Updated: 2025 Oct 31

Architecture Overview

SSGv3 is a static site generator built on a three-stage pipeline where each stage has exclusive responsibilities:

SCAN → BUILD → WRITE

Core principle: All validation and transformation happens in memory during BUILD. WRITE is a pure commit operation that either completes fully or leaves the previous output untouched.


Stage Boundaries

Stage   File Reads                     Transformations                     File Writes   Failures Allowed
SCAN    Metadata only (stat)           None                                None          Yes (abort before BUILD)
BUILD   Content (changed files only)   Markdown→HTML, Template rendering   None          Yes (abort before WRITE)
WRITE   None                           None                                All outputs   No (atomic commit)

Guarantee: If BUILD completes without errors, WRITE will succeed or leave the system in the previous consistent state.


Component Hierarchy

Static Site Generator
│
├── Three-Stage Pipeline
│   ├── SCAN Stage
│   ├── BUILD Stage
│   │   ├── Phase 1: Content Processing
│   │   └── Phase 2: Index Generation
│   └── WRITE Stage
│
├── Core Data Model
│   └── BuildItem
│       ├── ContentItem
│       ├── AssetItem
│       └── IndexItem
│
├── Metadata System
│   ├── System Defaults
│   ├── Path-Derived Metadata
│   └── Frontmatter Overrides
│
├── Content Processing
│   ├── Slug Normalization
│   ├── Template Selection
│   ├── Markdown Transformation
│   └── Permalink Generation
│
├── Cache System
│   ├── Cache Key Computation
│   ├── Manifest Management
│   └── Invalidation Logic
│
├── Template System
│   ├── Template Selection
│   ├── Dependency Graph
│   ├── Hash Computation
│   └── Cycle Detection
│
├── Index System
│   ├── Pagination
│   ├── Index Cache Keys
│   └── Rebuild Detection
│
├── URL Management
│   ├── Permalink Templates
│   ├── URL Normalization
│   └── Collision Detection
│
└── Atomic Write System
    ├── Symlink-Based Swap
    ├── Timestamped Directories
    └── Orphan Cleanup

Core Data Model

BuildItem

Every discovered file becomes a BuildItem. After processing, it may become:

  • ContentItem: Markdown file that generates HTML
  • AssetItem: File copied byte-for-byte (CSS, images, etc.)
  • IndexItem: Virtual item for paginated index pages

Each BuildItem contains:

  • src: Original source file path (null for virtual items)
  • out: Final output path (resolved during BUILD)
  • url: Site-relative URL (e.g., /python/2025/intro/)
  • kind: content | asset | index
  • state: SCANNED | BUILT | WRITTEN
  • metadata: Resolved metadata dictionary
  • cache_key: SHA256 hash identifying this item’s dependencies
  • html: Rendered HTML (content and index items only)

Stage 1: SCAN

Purpose

Fast discovery of all source files with minimal I/O. Produces a deterministic list of BuildItems with initial metadata.

Inputs

  • Project directory structure
  • Content directory (e.g., content/)
  • Asset directory (e.g., assets/)
  • Blacklist patterns (e.g., ['.git', '_drafts'])

Process

  1. Directory Walking: Traverse content and asset directories using filesystem scanning
  2. Filtering: Skip blacklisted paths, hidden directories (.git, .cache), and output/cache directories
  3. Classification: Determine item kind based on file extension:
    • .md, .markdown → content
    • Everything else → asset
  4. Initial Metadata: Extract from file path structure:
    • initial_slug: Filename without extension
    • initial_category: Immediate parent directory name (empty if top-level)
    • path_rel: Path relative to content/asset root
  5. File Stats: Capture modification time and size (for change detection)
  6. Ordering: Sort items by relative path for deterministic processing

Outputs

List of BuildItem(state=SCANNED) with minimal metadata. No file contents read, no transformations performed.

Example

Input structure:
  content/python/intro.md
  content/rust/ownership.md
  assets/style.css

Output items:
  BuildItem(src=content/python/intro.md, kind=content, 
            initial_slug=intro, initial_category=python)
  BuildItem(src=content/rust/ownership.md, kind=content,
            initial_slug=ownership, initial_category=rust)
  BuildItem(src=assets/style.css, kind=asset,
            initial_slug=style, initial_category='')
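
The SCAN steps can be sketched with os.walk, which allows pruning directories in place so blacklisted trees are never visited. This is an illustration only: plain dictionaries stand in for BuildItem, blacklist entries are matched against exact file or directory names, and the function name scan is hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, List, Sequence

CONTENT_SUFFIXES = {".md", ".markdown"}


def scan(root: Path, blacklist: Sequence[str] = (".git", "_drafts")) -> List[Dict[str, object]]:
    """Walk root and return one SCANNED record per file, sorted by relative path."""
    items = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames
                       if not d.startswith(".") and d not in blacklist]
        for name in filenames:
            if name in blacklist:
                continue
            src = Path(dirpath) / name
            rel = src.relative_to(root)
            stat = src.stat()  # metadata only; file contents are never read
            items.append({
                "src": src,
                "path_rel": rel.as_posix(),
                "kind": "content" if src.suffix.lower() in CONTENT_SUFFIXES else "asset",
                "initial_slug": src.stem,
                "initial_category": rel.parent.name,  # "" for top-level files
                "mtime": stat.st_mtime,
                "size": stat.st_size,
                "state": "SCANNED",
            })
    # Deterministic ordering regardless of filesystem enumeration order.
    items.sort(key=lambda item: item["path_rel"])
    return items
```

Calling scan once for the content directory and once for the asset directory would yield the item list shown in the example above.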

Stage 2: BUILD

Purpose

Transform source files into renderable outputs, resolve all metadata, detect collisions, and prepare everything for atomic write. All operations happen in memory.

Two-Phase Process

BUILD operates in two distinct phases to resolve circular dependencies:

Phase 1: Content Processing

  • Process all content and asset items
  • Assign final cache_key to each item
  • Items are now ready for indexing

Phase 2: Index Generation

  • Create virtual IndexItems for pagination
  • Use cache_keys from Phase 1 in index cache computation
  • Detect URL collisions across all items

Phase 1: Content Processing

For each BuildItem from SCAN:

1. Metadata Resolution

Merge metadata from three sources in order of precedence (later sources override earlier):

System Defaults:

  • slug: filename without extension
  • category: parent directory name
  • date: file modification time

Path-Derived (only if not in frontmatter):

  • Category from directory structure
  • Date from filename patterns like YYYY-MM-DD-title.md

Frontmatter (always wins):

  • Any explicitly set field overrides all others
  • Parsed from YAML block at file start
  • Example: category: tutorials overrides directory structure

Key behavior: Files can move between directories without URL changes if frontmatter specifies category. Without frontmatter, URL follows directory structure.

2. Slug Normalization

Convert title/filename into URL-safe slug:

  • Convert to lowercase
  • Transliterate Unicode to ASCII (ö → o, é → e)
  • Remove all punctuation except hyphens
  • Replace whitespace sequences with single hyphen
  • Strip leading/trailing hyphens

Result: "My Cool Post!" → "my-cool-post"
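
The five normalization steps can be sketched with the standard library alone. The transliteration here relies on Unicode NFKD decomposition, which is an assumption: it handles accented Latin letters such as ö and é, but characters with no ASCII decomposition (for example ß or ø) are simply dropped. The function name normalize_slug is hypothetical.

```python
import re
import unicodedata


def normalize_slug(text: str) -> str:
    """Apply the slug rules in order: lowercase, transliterate, strip, hyphenate."""
    text = text.lower()
    # NFKD splits "ö" into "o" plus a combining mark; the ASCII encode drops the mark.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation: keep only letters, digits, whitespace and hyphens.
    text = re.sub(r"[^a-z0-9\s-]", "", text)
    # Replace each whitespace run with a single hyphen.
    text = re.sub(r"\s+", "-", text)
    return text.strip("-")
```

Applying the function to its own output returns the same slug, which is the idempotence property the test plan asks for.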

3. Cache Key Computation

Compute a deterministic SHA256 hash that captures all inputs affecting the rendered output:

Input object (deterministically serialized to JSON):
{
  "content_hash": SHA256(file_bytes),
  "metadata": {
    "slug": resolved_slug,
    "category": resolved_category,
    "date_iso": resolved_date_in_ISO_format
  },
  "selected_template": template_name,
  "template_hash": hash_of_template_and_all_its_includes,
  "permalink_format": "{category}/{year}/{month}/{slug}/",
  "schema_version": 1
}

cache_key = SHA256(JSON.dumps(input_object, sorted_keys))

Critical: Template hash must include all partials/includes transitively (see Template Dependency Tracking section).
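
As a sketch, the key computation above maps onto hashlib and json from the standard library. The function name and parameter list are assumptions; the returned value is a bare hex digest, without the "sha256:" prefix shown in the manifest example.

```python
import hashlib
import json


def compute_cache_key(
    file_bytes: bytes,
    metadata: dict,
    selected_template: str,
    template_hash: str,
    permalink_format: str,
    schema_version: int = 1,
) -> str:
    """Hash every input that affects the rendered output (illustrative sketch)."""
    payload = {
        "content_hash": hashlib.sha256(file_bytes).hexdigest(),
        "metadata": {
            "slug": metadata["slug"],
            "category": metadata["category"],
            "date_iso": metadata["date_iso"],
        },
        "selected_template": selected_template,
        "template_hash": template_hash,
        "permalink_format": permalink_format,
        "schema_version": schema_version,
    }
    # sort_keys plus fixed separators gives a byte-for-byte stable serialization.
    serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```

Because the keys are sorted during serialization, the order in which the metadata dict was built does not affect the result.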

4. Cache Consultation

Check if this item needs processing:

  • Cache hit (cache_key matches stored key):

    • Load pre-rendered HTML from cache
    • Load resolved metadata
    • Skip to next item
  • Cache miss (no entry or key changed):

    • Read file contents
    • Parse frontmatter
    • Transform Markdown to HTML
    • Render through template
    • Store result in memory for WRITE

5. Template Selection

Choose template using first-match algorithm:

  1. If frontmatter contains template: X → use templates/X.html
  2. Else if category-specific template exists → use templates/{category}.html
  3. Else use templates/default.html (required)

Error if: Selected template file doesn’t exist.
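
The first-match rule can be sketched without touching the filesystem by passing in the set of template filenames that exist. The names select_template and TemplateNotFound are hypothetical; the spec only names a TemplateError class in its checklist.

```python
from typing import Optional, Set


class TemplateNotFound(Exception):
    """Raised when the selected template does not exist (hypothetical name)."""


def select_template(frontmatter: dict, category: Optional[str], available: Set[str]) -> str:
    """Return the template filename chosen by the first-match rule."""
    explicit = frontmatter.get("template")
    if explicit:
        # Rule 1: an explicit frontmatter choice is binding; no silent fallback.
        name = f"{explicit}.html"
        if name not in available:
            raise TemplateNotFound(f"templates/{name} requested in frontmatter")
        return name

    # Rule 2: a category-specific template, if one exists.
    if category and f"{category}.html" in available:
        return f"{category}.html"

    # Rule 3: the required default.
    if "default.html" not in available:
        raise TemplateNotFound("templates/default.html is required")
    return "default.html"
```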

6. Content Transformation

For content items (cache miss only):

  1. Read file as UTF-8 (fail if invalid encoding)
  2. Split frontmatter (YAML between --- markers) from body
  3. Parse Markdown body to HTML (single pass, with configured extensions)
  4. Assemble template context:
    • content: HTML body
    • metadata: All resolved metadata
    • site: Global site configuration
  5. Render final HTML through selected template
  6. Track which templates were used (for invalidation)

For asset items:

  • No transformation
  • Cache key based only on file hash and output path

7. Permalink Generation

Apply permalink template to resolved metadata:

Template format: {category}/{year:04d}/{month:02d}/{slug}/

Supported placeholders:

  • {category}: Category string
  • {year}, {month}, {day}: Date components with optional formatting
  • {slug}: URL-safe slug

Example:

Metadata: {category: "python", date: "2025-10-28", slug: "intro"}
Template: "{category}/{year}/{month}/{slug}/"
Result URL: "/python/2025/10/intro/"
Output path: "public/python/2025/10/intro/index.html"

URL Normalization Rules:

  • All URLs end with / (trailing slash required)
  • All URLs are lowercase
  • Multiple slashes collapsed to single slash
  • Leading slash always present
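
A sketch of permalink rendering and URL normalization using str.format. All names here are assumptions. The PaddedInt helper exists because the examples in this document show a bare {month} rendering as "06", while a plain int would render as "6"; it zero-pads by default and still honours an explicit spec such as {month:02d}.

```python
import re
from datetime import date


class PaddedInt(int):
    """Int that zero-pads by default but still honours explicit format specs."""

    def __new__(cls, value: int, width: int):
        obj = super().__new__(cls, value)
        obj.width = width
        return obj

    def __format__(self, spec: str) -> str:
        return format(int(self), spec or f"0{self.width}d")


def normalize_url(url: str) -> str:
    """Lowercase, force leading and trailing slashes, collapse repeated slashes."""
    url = "/" + url.lower() + "/"
    return re.sub(r"/{2,}", "/", url)


def render_permalink(template: str, category: str, slug: str, when: date) -> str:
    raw = template.format(
        category=category,
        slug=slug,
        year=PaddedInt(when.year, 4),
        month=PaddedInt(when.month, 2),
        day=PaddedInt(when.day, 2),
    )
    return normalize_url(raw)


def output_path(output_root: str, url: str) -> str:
    """Map a normalized URL to its index.html file under the output root."""
    return output_root.rstrip("/") + url + "index.html"
```

normalize_url is idempotent: applying it to an already normalized URL returns the same string.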

8. Assign Final Paths

For each item:

  • Set item.url to normalized permalink
  • Set item.out to filesystem output path (URL + index.html)
  • Set item.cache_key to computed hash
  • Set item.state = BUILT

Phase 2: Index Generation

After all content items have cache_keys assigned:

1. Collect Items for Indexing

Main index: All content items, sorted by date descending
Category indexes: Items grouped by category, sorted by date descending

2. Paginate Item Lists

Split sorted items into pages using configured page size (e.g., 10 posts per page).

Page URLs:

Main index:
  Page 1: /index.html
  Page 2: /page/2/index.html
  Page 3: /page/3/index.html

Category index (e.g., python):
  Page 1: /python/index.html
  Page 2: /python/page/2/index.html
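
Pagination and the page paths listed above could be sketched like this. Both function names are assumptions. An empty item list yields zero pages here; the spec does not say whether an empty index page should still be generated.

```python
from typing import List, Optional, Sequence


def paginate(items: Sequence, page_size: int) -> List[list]:
    """Split already-sorted items into consecutive pages of at most page_size."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [list(items[i:i + page_size]) for i in range(0, len(items), page_size)]


def index_page_path(page_number: int, category: Optional[str] = None) -> str:
    """Page 1 lives at the index root; later pages live under page/N/."""
    prefix = f"/{category}" if category else ""
    if page_number == 1:
        return f"{prefix}/index.html"
    return f"{prefix}/page/{page_number}/index.html"
```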

3. Compute Index Cache Keys

For each index page, compute cache key capturing:

{
  "index_type": "main" or "category:python",
  "page_number": 2,
  "template_hash": hash_of_index_template,
  "pagination_context": {
    "total_items": 47,
    "total_pages": 5,
    "items_per_page": 10
  },
  "items_on_page": [
    {
      "cache_key": item.cache_key,
      "url": item.url,
      "date_iso": item.metadata.date_iso
    }
    for each item on this page (sorted order)
  ]
}

Key insight: This captures both membership and ordering. Any change to constituent items or pagination boundaries invalidates the index page.
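
A sketch of the index key, assuming each item on the page is available as a dict with cache_key, url, and date_iso fields (the spec accesses them as attributes on item objects). The function name is hypothetical.

```python
import hashlib
import json
from typing import List


def index_cache_key(
    index_type: str,
    page_number: int,
    template_hash: str,
    total_items: int,
    items_per_page: int,
    items_on_page: List[dict],
) -> str:
    """Hash membership, ordering, and pagination context of one index page."""
    total_pages = -(-total_items // items_per_page)  # ceiling division
    payload = {
        "index_type": index_type,
        "page_number": page_number,
        "template_hash": template_hash,
        "pagination_context": {
            "total_items": total_items,
            "total_pages": total_pages,
            "items_per_page": items_per_page,
        },
        # A list keeps its order under sort_keys, so ordering affects the key.
        "items_on_page": [
            {"cache_key": i["cache_key"], "url": i["url"], "date_iso": i["date_iso"]}
            for i in items_on_page
        ],
    }
    serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
```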

4. Check Index Cache

For each index page:

  • Cache hit: Load pre-rendered HTML
  • Cache miss: Render index template with:
    • Current page items
    • Pagination metadata (current page, total pages, prev/next URLs)
    • Category information (for category indexes)

5. Create IndexItems

Construct virtual BuildItems:

  • kind = index
  • src = null (virtual item)
  • url and out set to index page paths
  • html contains rendered or cached HTML
  • state = BUILT

Add IndexItems to the main items list.

Collision Detection

After both phases complete:

  1. Build URL map: {url: [items_with_that_url]}
  2. Normalize all URLs before mapping
  3. Identify collisions: any URL with multiple source items
  4. For each collision:
    • Create detailed error with all source paths
    • Suggest resolution (change slug in frontmatter, change category, adjust permalink template)
  5. If any collisions found, abort before WRITE

Special cases checked:

  • Content item URL matching index URL
  • Asset filename matching generated content URL
  • Multiple content items with same slug in same category
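
The URL map check could be sketched as below, taking (url, src) pairs where src is None for virtual index items. The function names and the "<index>" label are assumptions made for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def normalize_url(url: str) -> str:
    """Lowercase, force leading and trailing slashes, collapse repeated slashes."""
    return re.sub(r"/{2,}", "/", "/" + url.lower() + "/")


def find_collisions(items: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Return {normalized_url: [sources]} for every URL claimed more than once."""
    by_url = defaultdict(list)
    for url, src in items:
        # Normalize before mapping so /Python//intro and /python/intro/ collide.
        by_url[normalize_url(url)].append(src if src is not None else "<index>")
    return {url: sorted(srcs) for url, srcs in by_url.items() if len(srcs) > 1}
```

An empty result means the build may proceed to WRITE; any entry becomes a collision error listing all of its source paths.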

Error Collection

BUILD accumulates all errors without stopping:

  • Frontmatter parse failures
  • Invalid date formats
  • Missing required templates
  • URL collisions
  • Encoding errors

At end of BUILD, if any errors exist, return error list and abort. No partial builds.
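
A minimal sketch of the accumulate-then-abort pattern. ErrorCollector, BuildError, and format_report are named in the implementation checklist, but the fields, method signatures, and report layout below are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BuildError:
    code: str
    src: Optional[str]  # None for virtual items such as index pages
    message: str
    suggestion: Optional[str] = None


@dataclass
class ErrorCollector:
    errors: List[BuildError] = field(default_factory=list)

    def add(self, code: str, src: Optional[str], message: str,
            suggestion: Optional[str] = None) -> None:
        """Record an error and keep going; BUILD decides at the end."""
        self.errors.append(BuildError(code, src, message, suggestion))

    def has_errors(self) -> bool:
        return bool(self.errors)

    def format_report(self) -> str:
        lines = [f"{len(self.errors)} error(s) found; build aborted:"]
        for err in self.errors:
            lines.append(f"  [{err.code}] {err.src or '<virtual>'}: {err.message}")
            if err.suggestion:
                lines.append(f"      hint: {err.suggestion}")
        return "\n".join(lines)
```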

BUILD Success Criteria

BUILD phase succeeds only when:

  • All items processed without errors
  • All cache keys computed
  • All URLs resolved and collision-free
  • All HTML rendered (or loaded from cache)
  • No missing templates
  • All items in state=BUILT

Template Dependency Tracking

Problem

Template changes must invalidate all items that use them, including transitive dependencies through partials/includes.

Template Hash Computation

When templates are loaded:

  1. Build dependency graph:

    • Parse each template for include/import statements
    • Build map: {template_name: [included_template_names]}
  2. Detect cycles:

    • Walk dependency graph with depth-first search
    • Track visiting vs. visited nodes
    • If cycle detected: reject template set with error
    • Example error: “Circular template dependency: base.html → header.html → base.html”
  3. Compute hashes bottom-up:

    • Start with leaf templates (no includes)
    • For each template with includes:
      template_hash = SHA256(
        template_file_content +
        sorted_list_of_included_template_hashes
      )
      
    • Store computed hash for reuse
  4. Cache hash results: Template hashes are computed once at template load and reused for all items.
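
The graph walk, cycle check, and bottom-up hashing can be combined in one depth-first pass. This sketch assumes the include map has already been extracted by a template parser; the function and exception names are hypothetical.

```python
import hashlib
from typing import Dict, List


class TemplateCycleError(Exception):
    """Raised when templates include each other in a loop (hypothetical name)."""


def compute_template_hashes(
    sources: Dict[str, str], includes: Dict[str, List[str]]
) -> Dict[str, str]:
    """Return {template_name: transitive hash} for every template in sources."""
    hashes: Dict[str, str] = {}
    visiting: List[str] = []  # current DFS path, used for cycle reporting

    def visit(name: str) -> str:
        if name in hashes:  # already computed: reuse
            return hashes[name]
        if name in visiting:
            cycle = visiting[visiting.index(name):] + [name]
            raise TemplateCycleError(
                "Circular template dependency: " + " → ".join(cycle)
            )
        visiting.append(name)
        child_hashes = sorted(visit(child) for child in includes.get(name, []))
        visiting.pop()

        digest = hashlib.sha256(sources[name].encode("utf-8"))
        for child_hash in child_hashes:
            digest.update(child_hash.encode("ascii"))
        hashes[name] = digest.hexdigest()
        return hashes[name]

    for name in sources:
        visit(name)
    return hashes
```

Editing a leaf partial changes its own hash and, through the sorted child hashes, the hash of every template that includes it directly or transitively.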

Invalidation on Template Change

When a template file changes:

  1. Recompute its hash (and hashes of templates that include it)
  2. Query cache for items that used that template
  3. Mark those items as needing rebuild
  4. During BUILD, these items get cache misses and reprocess

Note: Template hash is included in item cache_key, so any template change naturally invalidates dependent items.


Stage 3: WRITE

Purpose

Atomically commit all built outputs to disk. Either complete successfully or leave previous output unchanged.

Atomic Commit Strategy

SSGv3 uses symlink-based atomic swap for true atomicity:

Output Directory Structure

project/
  public@ -> output_20251028_143022/   # symlink to current build
  output_20251028_143022/              # actual output directory (timestamped)
  output_20251028_120000/              # previous build (kept for rollback)

The canonical public/ path is always a symlink, never a real directory.

Write Process

  1. Create timestamped directory: output_YYYYMMDD_HHMMSS/

  2. Write all outputs:

    • For each BUILT item:
      • Create parent directories as needed
      • Write file contents
      • For assets: copy bytes preserving modification time
      • For content/index: write HTML as UTF-8 text
    • All writes to temporary directory, not live output
  3. Fsync all files: Ensure writes are committed to disk

  4. Write manifest:

    • Create .ssg_cache/manifest.json.new with:
      • Build timestamp
      • List of all items with their cache_keys and output paths
      • Template hashes used
    • Fsync manifest file
    • Atomically rename manifest.json.new → manifest.json
    • This is the commit point for cache
  5. Update symlink atomically:

    ln -sfn output_20251028_143022 public.tmp
    mv -T public.tmp public
    
    • The mv of symlink is atomic on POSIX systems
    • Between creation of .tmp and final mv, live site remains on old build
    • The mv operation switches the site instantly
  6. Cleanup old outputs:

    • Keep N most recent output directories (configurable, default 2)
    • Delete older timestamped directories
    • This provides instant rollback capability
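
The two commit points in the write process, the manifest rename and the symlink swap, could look like this in Python. Both helper names are assumptions. os.replace maps to rename(2) on POSIX, which replaces the destination atomically and does not follow a symlink at the destination.

```python
import json
import os


def write_manifest_atomically(manifest: dict, cache_dir: str) -> str:
    """Write manifest.json.new, fsync it, then rename it over manifest.json."""
    final_path = os.path.join(cache_dir, "manifest.json")
    tmp_path = final_path + ".new"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, sort_keys=True, indent=2)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
    os.replace(tmp_path, final_path)  # commit point for the cache
    return final_path


def swap_symlink(link_path: str, target: str) -> None:
    """Point link_path at target by renaming a fresh symlink over the old one."""
    tmp_link = link_path + ".tmp"
    try:
        os.unlink(tmp_link)  # clear a stale temp link left by a failed run
    except FileNotFoundError:
        pass
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link_path)  # readers see the old or new target, never neither
```

For example, swap_symlink("public", "output_20251028_143022") corresponds to the two shell commands in step 5. The sketch does not fsync the containing directory, which a stricter durability guarantee would also require.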

Success and Failure

Success criteria:

  • All files written to timestamped directory
  • Manifest committed
  • Symlink updated
  • Old outputs cleaned (if cleanup enabled)

On failure during file writes:

  • Delete incomplete timestamped directory
  • Leave public symlink unchanged (still points to previous good build)
  • No manifest update
  • Report error and exit with code 2

On failure during symlink update (rare):

  • Attempt to delete timestamped directory
  • Leave system with previous build active
  • Report error with manual recovery steps
  • Exit with code 2

Guarantee: Users never see partially-written output. Site is either old version or new version, never mixed.

Non-Symlink Fallback

On systems without symlink support (Windows without dev mode):

  1. Write to output.tmp/ directory
  2. After all writes succeed and manifest committed:
    • Rename output/ → output.old/ (if exists)
    • Rename output.tmp/ → output/
    • Delete output.old/

Limitation: Small time window where output/ doesn’t exist (between renames). Document this as degraded atomicity mode.

Orphan Cleanup

After successful symlink update:

  1. Load previous manifest
  2. Compare with new manifest
  3. Identify files in old manifest but not in new manifest (orphans)
  4. For each orphan path:
    • Verify it’s inside previous output directory
    • Delete file
    • Delete parent directories if empty

Safety: Only delete from old timestamped directories, never from current output. This is safe because old directories are not served.
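
Orphan detection reduces to a set difference between the two manifests. The sketch below assumes the manifest layout shown in the Manifest Format section, where each "out" value starts with the timestamped directory name; that first component is stripped so paths from different builds can be compared. All function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Set


def relative_outputs(manifest: dict) -> Set[str]:
    """Output paths with the leading timestamped directory removed."""
    rel = set()
    for entry in manifest["items"].values():
        parts = PurePosixPath(entry["out"]).parts
        rel.add(str(PurePosixPath(*parts[1:])))
    return rel


def find_orphans(old_manifest: dict, new_manifest: dict) -> Set[str]:
    """Paths written by the previous build that the new build no longer produces."""
    return relative_outputs(old_manifest) - relative_outputs(new_manifest)


def orphan_target(previous_dir: str, rel_path: str) -> PurePosixPath:
    """Return the path to delete, refusing anything that escapes previous_dir."""
    rel = PurePosixPath(rel_path)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"refusing to delete outside {previous_dir}: {rel_path}")
    return PurePosixPath(previous_dir) / rel
```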


Cache Management

Cache Structure

Cache stored in .ssg_cache/ directory:

  • manifest.json: Authoritative record of last successful build
  • db.sqlite: Item metadata and dependencies (optional, for advanced queries)

Manifest Format

{
  "schema_version": 1,
  "build_timestamp": "2025-10-28T14:30:22Z",
  "ssg_version": "3.0.0",
  "template_hashes": {
    "default.html": "sha256:abc...",
    "python.html": "sha256:def..."
  },
  "permalink_template": "{category}/{year}/{month}/{slug}/",
  "items": {
    "content/python/intro.md": {
      "cache_key": "sha256:123...",
      "url": "/python/2025/10/intro/",
      "out": "output_20251028_143022/python/2025/10/intro/index.html",
      "templates_used": ["default.html"]
    },
    ...
  }
}

Cache Operations

During BUILD:

  • Read manifest to check cache_keys
  • On cache hit: load cached HTML and metadata
  • On cache miss: mark item for processing

During WRITE:

  • After all files written
  • Before symlink update
  • Write new manifest atomically
  • Manifest commit makes cache changes durable

Cache invalidation triggers:

  • Source file content changed (content_hash differs)
  • Template changed (template_hash differs)
  • Metadata changed (different date/slug/category in frontmatter)
  • Permalink format changed
  • Dependencies changed (for indexes: constituent items changed)

Cold Build vs. Incremental Build

Cold build (no cache):

  • Process all items
  • Generate all outputs
  • Create initial manifest

Incremental build (cache exists):

  • SCAN: detect all files as before
  • BUILD Phase 1:
    • Compute cache_key for each item
    • Compare with manifest
    • Process only items with changed keys
    • Load cached HTML for unchanged items
  • BUILD Phase 2:
    • Check index cache keys
    • Rebuild only indexes whose constituent items changed
  • WRITE: Commit new manifest and swap outputs

Performance target: For 1000-post site with 1 file changed, incremental build should complete in under 5 seconds.


Metadata Precedence Model

Three Sources, Clear Rules

Metadata comes from three sources with strict precedence:

  1. System Defaults (lowest priority)
  2. Path-Derived (medium priority)
  3. Frontmatter (highest priority)

Example Walkthrough

File: content/tutorials/python/intro.md

System defaults provide:

  • slug: "intro" (from filename)
  • category: "python" (from parent directory)
  • date: 2025-10-28 (from file mtime)

Frontmatter overrides:

---
title: Introduction to Python
category: programming-basics
date: 2024-06-15
---

Final resolved metadata:

  • slug: "intro" (default, not overridden)
  • category: "programming-basics" (frontmatter wins)
  • date: 2024-06-15 (frontmatter wins)
  • title: "Introduction to Python" (from frontmatter)

Resulting URL: /programming-basics/2024/06/intro/

Stability Guarantees

With frontmatter category:

  • File can move to any directory
  • URL remains stable (frontmatter category used)
  • Useful for reorganizing source without breaking links

Without frontmatter category:

  • URL tracks directory structure
  • Moving file changes URL
  • Useful for organizing by category via folders

User choice: Explicit frontmatter for stable URLs, directory structure for convenience.


Key Design Decisions

1. Why Three Stages?

Separation of concerns:

  • SCAN: Pure discovery, no side effects, fast
  • BUILD: All computation and validation, memory-only
  • WRITE: Pure commit, no decisions

Benefits:

  • Easy to test each stage in isolation
  • Clear failure boundaries
  • Can abort before any writes
  • Can preview build output before committing

2. Why Symlink-Based Atomic Swap?

Problem: Directory rename atomicity varies by filesystem and platform.

Solution: Replacing a symlink by renaming a new one over it is atomic on all POSIX systems (a single rename(2) call).

Tradeoff: Requires symlink support, but this is standard on Linux/macOS and modern Windows.

Result: True atomic site publish with zero-downtime deployments.

3. Why Two-Phase BUILD?

Problem: Index cache keys need content item cache keys, but indexes are also items.

Solution: Process content first (compute cache keys), then generate indexes (using those keys).

Benefit: Clear dependency ordering, no circular references.

4. Why Template Dependency Tracking?

Problem: Changing a shared partial should rebuild all pages that use it.

Without tracking: Must rebuild entire site on any template change.

With tracking: Rebuild only affected pages, preserving incremental build performance.

Implementation cost: Parse templates once at load, compute hashes with includes. Worth the complexity for speed.

5. Why Reject Circular Template Dependencies?

Alternative: Allow cycles, break arbitrarily during hash computation.

Problem: Non-deterministic behavior, unpredictable cache invalidation.

Decision: Reject cycles explicitly. Templates should compose in directed acyclic graph. Cycles indicate design error.

Benefit: Clear, predictable behavior. Easy to debug.

6. Why Include Pagination Context in Index Cache?

Problem: Adding one post can shift items between pages without changing the items on a specific page.

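Example: A site has 50 posts at 10 per page, so 5 index pages. A backdated post is added and sorts onto page 5, pushing the oldest post onto a new page 6. The items on pages 1 through 4 are unchanged, but their pagination controls (total page count, last-page link) need updating.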

Solution: Include total counts and page boundaries in cache key. Any change to pagination structure invalidates all affected index pages.

Tradeoff: Adding one post invalidates multiple index pages. Acceptable because indexes are cheap to render compared to content.


Error Handling Philosophy

Fail Fast, Fail Explicitly

  • Detect errors as early as possible
  • Never write partial output
  • Accumulate all errors before reporting
  • Provide actionable error messages with suggestions

Error Categories

Configuration errors (fail in SCAN):

  • Invalid config file
  • Missing required directories
  • Malformed blacklist patterns

Content errors (fail in BUILD Phase 1):

  • Invalid frontmatter YAML
  • Unparseable dates
  • Missing required metadata
  • Invalid UTF-8 encoding

Template errors (fail in BUILD Phase 1):

  • Missing template files
  • Circular template dependencies
  • Template syntax errors

Collision errors (fail in BUILD Phase 2):

  • Multiple items generating same URL
  • Index URL conflicting with content URL

Write errors (fail in WRITE):

  • Disk full
  • Permission denied
  • Filesystem errors

Recovery Patterns

Before WRITE: No recovery needed, just abort. Previous build still intact.

During WRITE: Partial writes cleaned up, symlink left pointing to previous build. User can fix issue and retry.

After WRITE: Success. Any issues in cleanup (old output deletion) logged but don’t cause failure.


Performance Model

Bottlenecks and Optimizations

SCAN: I/O bound (filesystem traversal)

  • Optimization: Filter early, prune ignored directories immediately
  • Target: 10,000 files scanned in <2 seconds

BUILD Phase 1: CPU bound (Markdown parsing, template rendering)

  • Optimization: Cache aggressively, process only changed items
  • Target: 1 changed file in 1000-file site processes in <1 second

BUILD Phase 2: Memory bound (holding all items for indexing)

  • Optimization: None needed for reasonable site sizes
  • Target: Support sites with 10,000+ posts in <2GB memory

WRITE: I/O bound (writing files)

  • Optimization: Batch directory creation, optional parallel writes
  • Target: Write 1000 files in <3 seconds on SSD

Scaling Characteristics

Cold build: O(n) in number of source files
Incremental build: O(n) for scanning and cache-key comparison, plus O(m) for rendering, where m = changed files, n = total files
Index rebuild: O(k) where k = number of posts on changed index pages
Memory usage: O(n) where n = number of posts (holds HTML in memory)

Practical Limits

  • Recommended: Up to 5,000 posts, typical site
  • Tested: Up to 10,000 posts
  • Theoretical: Limited by available memory (roughly 5MB per 1000 posts for HTML storage)

Summary of Core Concepts

  1. Three-Stage Pipeline: Clear boundaries prevent partial builds
  2. Two-Phase BUILD: Content first, then indexes (resolves dependencies)
  3. Cache Keys: Deterministic hashes capture all dependencies
  4. Metadata Precedence: System < Path < Frontmatter (users control URL stability)
  5. Template Tracking: Transitive hash includes all partials, enables selective rebuilds
  6. Symlink Atomicity: True atomic publish with rollback capability
  7. URL Normalization: Consistent collision detection
  8. Fail-Before-Write: All validation in BUILD, so WRITE can fail only on I/O errors
  9. Collision Detection: Explicit checks prevent silent overwrites
  10. Incremental Correctness: Cache invalidation based on complete dependency graph

These concepts compose to create a static site generator that is fast, correct, and predictable.


pytest

export PYTEST_ADDOPTS="--color=yes --disable-warnings --capture=no"

pytest --stepwise

links

Asianometry & Dylan Patel – How the Semiconductor Industry Actually Works - YouTube

The Complete Guide to Yakisugi (Shou Sugi Ban)

Python Programming Exercises, Gently Explained

What are some good python codebases to read? - Lobsters

The Composition Over Inheritance Principle

PyHAT-stack/awesome-python-htmx: A curated list of things related to python-based web development using htmx

miyuchina/mistletoe: A fast, extensible and spec-compliant Markdown parser in pure Python.

Metaphor: ReFantazio - Rock Paper Shotgun

Rabbit Waves


Caveman

From Caveman to Chinaman - Cremieux Recueil

Briefly, seasonality is primarily dependent on three factors. The first is the Earth’s tilt, or obliquity, which determines how hemispheres will be tilted towards the Sun in summer and away in winter. The other two factors are lesser-known, but they are nevertheless important. They are the eccentricity of Earth’s orbit, which is how elliptical it is, and the precession, which is about whether the Earth’s closest approach to the Sun happens in the northern or southern hemisphere’s summer season


Longreads 2023

Python Tricks: The Book

Sculpting a Python function - by Nobody has time for Python

Data Classes in Python 3.7+ (Guide) – Real Python

Make your own Tower Defense Game with PyGame • Inspired Python

Malcolm Gladwell and William Cohan on What Really Happened to GE

Life Lessons After 10 Years of BetterExplained.com – BetterExplained

‘Story Of Your Life’ Is Not A Time-Travel Story · Gwern.net

What “Drive My Car” Reveals on a Second Viewing

‘Drive My Car’ and ‘Uncle Vanya’: How Intertextuality Enriches a Film

Creating Isometric RPG Game Backgrounds - using Stable Diffusion techniques to create 2D game environments

Game Design Logs – LOSTGARDEN

Great game development is actively harmed by this assumption. Pre-allocating resources at an early stage interrupts the exploratory iteration needed to find the fun in a game. A written plan that stretches months into the future is like a stake through the heart of a good game process. Instead of quickly pivoting to amplify a delightful opportunity found during play testing, you end up blindly barreling towards completion on a some ineffectual paper fantasy.

A Skeleton Key to Ali Smith’s Artful — The Airship, from the Wayback machine

The most beautiful thing about Ali Smith’s book Artful — at once a series of real-life Oxford lectures and a metafictional post-love story — is the way she carries us through her unnamed narrator’s emotional progression

The Guardian review, theliterarysisters, #AliSmith

Google - We Have No Moat, And Neither Does OpenAI, discussion at Hacker News

Leaked internal document claims open-source will outcompete Google and OpenAI
the uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch.

Development notes from xkcd’s Gravity and Escape Speed #gamedev

Escape Speed is a large space exploration game created and drawn by Randall Munroe.

This was one of the most ambitious — and most delayed — April Fools comics we’ve ever shipped. The art, physics, story, game logic, and render performance all needed to work together to pull this off. We decided to take the time to get it right.

The game is a spiritual successor to last year’s Gravity space exploration comic. Our goal was to deepen the game with a bigger map and more orbital mechanics challenges to play with.

Andreas Fragner - Writing summaries is more important than reading more books

Ted Chiang’s essay - Will A.I. Become the New McKinsey?, comments from Schneier on Security. previously, Ted Chiang - ChatGPT Is a Blurry JPEG of the Web

Out of Sir Vidia’s Shadow, Paul Theroux @LRB