SSGv3: Complete Design Document
Static Site Generator with Validate-Then-Commit Architecture
Version: 3.0
Last Updated: 2025-10-04
Table of Contents
- Core Concept
- Architecture Overview
- Build Flow
- Entities
- Configuration System
- Frontmatter Parsing
- Permalink System
- Cache Strategy
- Index Generation
- Pipeline Stages
- Processor Interface
- Service Interfaces
- Error Handling
- Performance Characteristics
Core Concept
SSGv3 uses a validate-then-commit architecture with incremental validation by default.
Key Principles:
- Everything flows through the system as a BuildItem
- Two-phase build: Validate (read-only) then Commit (write)
- Cache is trusted for incremental builds (fast)
- Full validation available for paranoid builds (safe)
- Atomic output updates - never partial builds
- All errors reported upfront before any writes
Build Philosophy:
- Load config
- Discover all content
- Validate everything (trust cache for unchanged items)
- If ANY errors → abort and show all errors
- If validation passes → commit all changes atomically
Architecture Overview
Two-Phase Design
Phase 1: VALIDATE (Read-Only)
- Discover all files
- Check cache for unchanged items
- Fully process changed/new items
- Detect all errors
- Compute all output paths
- Check for collisions
- No disk writes
Phase 2: COMMIT (Write)
- Write all changed outputs
- Update cache atomically
- Generate indexes
- Write manifest
- Clean up orphaned files
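The two phases can be illustrated with a minimal sketch. The names `BuildResult`, `build`, and `require_title` are hypothetical and not part of the design; a dict stands in for the output directory, and only the control flow (collect every error, commit nothing if any exist) is meant to mirror the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class BuildResult:
    written: Dict[str, str] = field(default_factory=dict)        # src -> rendered output
    errors: List[Tuple[str, str]] = field(default_factory=list)  # (src, message)


def build(sources: Dict[str, str],
          validate: Callable[[str, str], str]) -> BuildResult:
    """Validate every source first; commit only if nothing failed."""
    result = BuildResult()
    staged: Dict[str, str] = {}

    # Phase 1: VALIDATE (read-only). Collect every error instead of stopping.
    for src, text in sources.items():
        try:
            staged[src] = validate(src, text)
        except ValueError as exc:
            result.errors.append((src, str(exc)))

    if result.errors:
        return result  # abort: nothing has been written

    # Phase 2: COMMIT. A dict stands in for the real output directory.
    result.written.update(staged)
    return result


def require_title(src: str, text: str) -> str:
    """Hypothetical validator: the first line must be a non-empty title."""
    lines = text.splitlines()
    title = lines[0].strip() if lines else ""
    if not title:
        raise ValueError("missing title")
    return f"<h1>{title}</h1>"
```

With one bad source among several, `build` returns every error and an empty `written` map, which is the "all errors upfront, no partial builds" behaviour described above.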
Component Hierarchy
SSGv3
├── Config (from config.toml)
├── BuildContext (shared state)
├── BuildPipeline
│ ├── DiscoveryStage
│ ├── ValidationStage
│ └── CommitStage
├── Processors
│ ├── MarkdownContentProcessor
│ ├── SassProcessor
│ ├── CopyProcessor
│ └── IndexGenerator
└── Services
├── CacheManager
├── TemplateRenderer
├── OutputWriter
├── PermalinkGenerator
└── MetadataExtractor
Build Flow
Complete Build Sequence
1. Startup
- Parse CLI arguments
- Load config.toml
- Deserialize to Config dataclass
- Run ConfigValidator
- Initialize Services
- Initialize BuildPipeline
2. DiscoveryStage
- Walk content directories
- Apply PathFilter (skip blacklisted, dot files, output/cache dirs)
- Extract category from directory structure
- Create BuildItem(state=RAW) for each file
- Output: List[BuildItem]
3. ValidationStage (Incremental Mode - Default)
For each BuildItem:
a. Compute content hash
b. Check cache:
- Hash matches → load metadata from cache, mark VALIDATED
- Hash differs → full validation:
* Read file
* Parse frontmatter
* Extract metadata
* Transform content (markdown→HTML)
* Detect errors
* Mark VALIDATED or ERROR
c. Check URL collisions across all items
d. If any errors → collect and abort before commit
Output: List[BuildItem(state=VALIDATED)] or error list
4. CommitStage
- Call processor.before_build() for all processors
For each changed item:
- Write output file
- Update cache entry
- Call processor.after_build() for all processors
* IndexGenerator runs here
* Generates category indexes, main index, RSS
- Update manifest
- CleanupStage: remove orphaned outputs
Output: Build statistics
5. Report Results
- Total files
- Processed (changed)
- Skipped (cached)
- Failed (validation errors)
Incremental Validation Details
Cache Trust Strategy:
- If hash(file_content) + processor_id + template_hashes matches cache → skip processing
- Load metadata from cache (don’t re-read/re-parse file)
- Still check URL collisions
- Still track template dependencies
Performance:
- 1000 posts, 1 changed: ~1 second
- 999 hash checks (~1ms each)
- 1 full processing
- Memory holds only changed content + all metadata
Cache Invalidation Triggers:
- Source file content changed
- Template file changed (via dependency tracking)
- Config changed (permalink format, markdown settings)
- Cache schema version mismatch
- Manual: ssg clean
Full Validation Mode (Optional)
Enable from the CLI:
ssg build --full-validation
Or in config.toml:
[build]
validation_mode = "full"
Full mode adds:
- Stat every file (verify exists)
- Check file permissions
- Verify file readable
- Still skip content transformation if hash matches
Use for:
- CI/CD deployments
- Production builds
- After system updates
- Debugging
Entities
BuildItem (Base Class)
Represents anything the system builds.
Attributes:
BuildItem:
    src: Path | None      # Source file (None for generated items)
    out: Path             # Output path
    kind: str             # "content", "asset", "index", "feed"
    mtime: float          # Source modification time
    state: BuildState     # RAW | VALIDATED | WRITTEN | ERROR
    hash: str             # Content hash for cache checking
States:
RAW - Just discovered, not validated
VALIDATED - Checked, ready to commit
WRITTEN - Output file written
ERROR - Validation failed (set by ValidationStage)
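A minimal sketch of the state model as Python types. The transition table `_ALLOWED` and the `advance` method are assumptions added for illustration; the design only implies the ordering RAW, then VALIDATED, then WRITTEN, plus the ERROR state that ValidationStage assigns.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class BuildState(Enum):
    RAW = "raw"
    VALIDATED = "validated"
    WRITTEN = "written"
    ERROR = "error"


# Assumed legal transitions; not specified explicitly in the design.
_ALLOWED = {
    BuildState.RAW: {BuildState.VALIDATED, BuildState.ERROR},
    BuildState.VALIDATED: {BuildState.WRITTEN, BuildState.ERROR},
    BuildState.WRITTEN: set(),
    BuildState.ERROR: set(),
}


@dataclass
class BuildItem:
    src: Optional[Path]   # None for generated items
    out: Path
    kind: str
    mtime: float = 0.0
    state: BuildState = BuildState.RAW
    hash: str = ""

    def advance(self, new_state: BuildState) -> None:
        """Move to new_state, rejecting transitions outside _ALLOWED."""
        if new_state not in _ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
```

Guarding transitions this way would make it harder for a stage to write an item that was never validated, at the cost of a little ceremony.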
ContentItem (Specialized)
ContentItem(BuildItem):
metadata: dict
- title: str
- date: datetime
- date_iso: str
- date_formatted: str
- slug: str
- category: str
- tags: List[str]
- url: str
html: str # Rendered HTML content
templates_used: List[Path] # For dependency tracking
AssetItem (Specialized)
AssetItem(BuildItem):
asset_type: str # "static", "sass", "image"
IndexItem (Specialized)
IndexItem(BuildItem):
page: int # Pagination page number
items: List[ContentItem] # Content in this index
metadata: dict
- category: str (optional)
- total_pages: int
Examples
Blog Post:
ContentItem(
src=Path('content/python/hello.md'),
out=Path('public/python/2024/10/hello/index.html'),
kind='content',
state=VALIDATED,
metadata={
'title': 'Hello Python',
'category': 'python',
'date': datetime(2024, 10, 4),
'slug': 'hello',
'url': '/python/2024/10/hello/'
}
)
Static Asset:
AssetItem(
src=Path('assets/css/style.css'),
out=Path('public/assets/css/style.css'),
kind='asset',
asset_type='static',
state=VALIDATED
)
Generated Index:
IndexItem(
src=None,
out=Path('public/python/index.html'),
kind='index',
state=VALIDATED,
metadata={'category': 'python'}
)
Configuration System
Config Dataclass
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

@dataclass
class Config:
    # Paths
    content_dir: Path
    asset_dir: Path
    output_dir: Path
    cache_dir: Path = Path('.cache')
    template_dir: Optional[Path] = None
    # Content (mutable defaults must use default_factory in a dataclass)
    content_extensions: List[str] = field(default_factory=lambda: ['.md', '.markdown'])
    blacklist: List[str] = field(default_factory=list)
    # Permalinks
    permalink_templates: Dict[str, str] = field(default_factory=lambda: {
        'default': '/{year}/{month:02d}/{day:02d}/{slug}/'
    })
    default_category: str = ''
    # Site
    site_title: str = 'My Site'
    site_url: str = 'https://example.com'
    site_description: str = ''
    site_author: str = ''
    # Pagination
    page_size: int = 10
    # Markdown
    markdown_extensions: List[str] = field(
        default_factory=lambda: ['codehilite', 'fenced_code', 'toc', 'tables'])
    markdown_extension_configs: Dict[str, Dict] = field(default_factory=dict)
    # Build
    incremental: bool = True
    validation_mode: str = 'incremental'  # or 'full'
    clean_output: bool = False

    @classmethod
    def from_toml(cls, path: Path) -> 'Config':
        ...  # Parse TOML, flatten tables such as [site], convert paths
config.toml Example
# config.toml
# Paths
content_dir = "content"
asset_dir = "assets"
output_dir = "public"
cache_dir = ".cache"
template_dir = "templates"
# Content
content_extensions = [".md", ".markdown"]
blacklist = ["drafts/", "README.md"]
# Permalinks
default_category = ""
[permalink_templates]
default = "/{year}/{month:02d}/{day:02d}/{slug}/"
page = "/{slug}/"
doc = "/{category}/{slug}/"
# Site
[site]
title = "My Blog"
url = "https://example.com"
description = "A blog about things"
author = "Jane Doe"
# Pagination
[pagination]
page_size = 10
# Markdown
[markdown]
extensions = ["codehilite", "fenced_code", "toc", "tables"]
[markdown.extension_configs.codehilite]
css_class = "highlight"
linenums = true
[markdown.extension_configs.toc]
anchorlink = true
# Build
[build]
incremental = true
validation_mode = "incremental"
clean_output = false
Configuration Loading
1. CLI specifies: ssg build --config path/to/config.toml
2. Or look in current directory: ./config.toml
3. Parse TOML → dict
4. Convert string paths to Path objects
5. Deserialize to Config dataclass
6. Run ConfigValidator.validate(config)
7. Pass to BuildPipeline
Configuration Validation
ConfigValidator checks:
- Required fields present
- Paths exist (content_dir, asset_dir)
- Permalink templates valid:
  - {slug} present (required)
  - Valid format syntax
  - Known placeholders only
- site_url format (http:// or https://)
- No conflicting directories (output != cache)
- page_size > 0
- markdown_extensions are strings
- validation_mode in ['incremental', 'full']
On validation error:
- Print helpful error message
- Show expected format
- Suggest fix
- Exit with code 3 (configuration error, per the Exit Codes table)
Environment-Specific Configs
config.toml # Base
config.dev.toml # Dev overrides
config.prod.toml # Prod overrides
ssg build --env dev
ssg build --env prod
Merge base config with environment-specific overrides.
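One plausible reading of "merge" is a recursive dictionary merge applied to the parsed TOML before it is deserialized into Config. The sketch below assumes that reading; `merge_config` is a hypothetical helper, and replacing lists wholesale (rather than concatenating them) is a design assumption, not something the document specifies.

```python
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins; nested tables merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (assumed behaviour).
            merged[key] = value
    return merged
```

Under this scheme config.dev.toml only needs to list the keys that differ from the base file.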
Frontmatter Parsing
When Parsing Happens
NOT in DiscoveryStage (too expensive)
IN ValidationStage (only for changed items)
Parsing Flow
1. Read file content
- Try utf-8 encoding
- Fallback to latin-1
2. Use python-frontmatter library
- Separates YAML/TOML frontmatter from content
- Returns metadata dict + content string
3. Parse dates with dateutil.parser
- Handles: "2024-10-04 13:00 IST"
- Handles: "2024-09-22 11:54 +0530"
- Strip timezone (keep naive datetime)
4. Merge metadata with defaults
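Step 1 of the flow can be sketched as follows. `decode_source` is a hypothetical helper name; the fallback order (utf-8, then latin-1) is the one listed above.

```python
def decode_source(data: bytes) -> str:
    """Decode file bytes as UTF-8, falling back to Latin-1 on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return data.decode("latin-1")
```

Because the Latin-1 fallback cannot fail, a file in some third encoding would be decoded without error but with wrong characters; logging when the fallback is taken may be worth considering.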
Metadata Precedence
Final metadata = generated_defaults
+ directory_metadata
+ frontmatter_overrides
Generated defaults:
- slug from filename
- title from filename
- date from file mtime
Directory metadata:
- category from parent directory
Frontmatter overrides:
- Any field in frontmatter overrides above
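The precedence rule maps naturally onto dictionary unpacking, where later sources win. This is a minimal sketch: `merge_metadata` is a hypothetical name, and the slug and title derivations from the filename are simplified assumptions.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Dict


def merge_metadata(src: Path, mtime: float, category: str,
                   frontmatter: Dict[str, Any]) -> Dict[str, Any]:
    """Combine generated defaults, directory metadata, and frontmatter."""
    generated = {
        "slug": src.stem.lower().replace(" ", "-"),
        "title": src.stem.replace("-", " ").replace("_", " ").title(),
        "date": datetime.fromtimestamp(mtime),
    }
    directory = {"category": category}
    # Later dicts override earlier ones: frontmatter has the final say.
    return {**generated, **directory, **frontmatter}
```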
Frontmatter Example
---
title: My Custom Title
date: 2024-10-04 13:00 IST
category: tutorials
tags: [python, web]
slug: custom-slug
---
# Content starts here
Error Handling
Malformed YAML:
- Log error: “Failed to parse frontmatter in {file}: {error}”
- Return None from validation
- Mark item as ERROR
- Continue validating other files
- Show all errors at end
Invalid date format:
- Log warning: “Invalid date in {file}, using mtime”
- Fall back to file mtime
- Continue processing
Missing fields:
- Use defaults (title from filename, date from mtime)
- No error, just defaults
Unreadable file:
- Log error: “Cannot read {file}: {error}”
- Mark as ERROR
- Continue validation
Permalink System
Core Concept
Permalink = Template + Metadata
Template has placeholders filled from item metadata.
Supported Placeholders
{year} - 4-digit year
{month} - Month number (1-12)
{month:02d} - Zero-padded month (01-12)
{day} - Day number (1-31)
{day:02d} - Zero-padded day (01-31)
{slug} - URL-safe slug (REQUIRED)
{category} - Category from directory or frontmatter
Template Examples
Blog (date-based):
/{year}/{month:02d}/{day:02d}/{slug}/
→ /2024/10/04/my-post/
Docs (category-based):
/{category}/{slug}/
→ /python/my-post/
Flat:
/{slug}/
→ /my-post/
Hybrid:
/{category}/{year}/{slug}/
→ /python/2024/my-post/
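Since the placeholders use Python format-string syntax, the examples above can be reproduced with `str.format`. The function name `generate_permalink` and the metadata keys are assumptions for illustration; URL-encoding and empty-category handling (which would produce a double slash here) are left out.

```python
from datetime import datetime
from typing import Any, Dict


def generate_permalink(template: str, metadata: Dict[str, Any]) -> str:
    """Fill a permalink template from item metadata."""
    date: datetime = metadata["date"]
    fields = {
        "year": date.year,
        "month": date.month,   # int, so {month:02d} zero-pads
        "day": date.day,
        "slug": metadata["slug"],
        "category": metadata.get("category", ""),
    }
    # Unused keys are ignored by str.format, so every template shape works.
    return template.format(**fields)
```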
Template Validation
Required:
- {slug} must be present (distinguishes items that share a date or category; collisions are still checked during validation)
Optional:
- Date placeholders (content without dates still works)
- {category} (works with or without categories)
Invalid:
- Unknown placeholders → error at config validation
- Invalid format syntax → error
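These rules can be checked with the standard library's `string.Formatter`, which parses format strings without filling them. The sketch below is one possible approach; `validate_template` and `KNOWN_PLACEHOLDERS` are hypothetical names, and returning a list of problems (rather than raising) is a choice made here so that all issues can be reported together.

```python
from string import Formatter
from typing import List

KNOWN_PLACEHOLDERS = {"year", "month", "day", "slug", "category"}


def validate_template(template: str) -> List[str]:
    """Return a list of problems; an empty list means the template is valid."""
    try:
        fields = [name for _, name, _, _ in Formatter().parse(template)
                  if name is not None]
    except ValueError as exc:  # e.g. an unmatched '{'
        return [f"invalid format syntax: {exc}"]

    problems = []
    unknown = sorted(set(fields) - KNOWN_PLACEHOLDERS)
    if unknown:
        problems.append(f"unknown placeholders: {', '.join(unknown)}")
    if "slug" not in fields:
        problems.append("missing required placeholder: {slug}")
    return problems
```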
Category Extraction
From directory structure:
content/ → category = ""
content/python/ → category = "python"
content/python/web/file.md → category = "python" (single-level)
Single-level extraction:
- Uses the first directory below content/ only
- Ignores deeper nesting
- Simpler URLs, easier reasoning
Future: Could add full-path option via config
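The three cases above can be expressed as a short function. `extract_category` is a hypothetical name; the sketch follows the examples, taking the first directory below the content root.

```python
from pathlib import Path


def extract_category(src: Path, content_dir: Path) -> str:
    """Return the first directory below content_dir, or '' for root files."""
    parts = src.relative_to(content_dir).parts
    # parts[-1] is the filename; anything before it is a directory.
    return parts[0] if len(parts) > 1 else ""
```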
Category Override
Frontmatter overrides directory-based category:
---
category: tutorials # Override "python" from directory
---
Use cases:
- Content organization ≠ URL structure
- A/B testing categorizations
- Legacy content migration
URL Collision Detection
Problem: Two files generate same URL
Detection: During ValidationStage, build map url → src
On collision:
ERROR: URL collision detected
URL: /python/2024/10/my-post/
Sources:
- content/python/my-post.md
- content/python/tutorials/my-post.md
Fix: Use different slugs or override category in frontmatter
Build aborts - forces explicit resolution
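Detection amounts to grouping sources by URL and reporting any group with more than one member. A minimal sketch, with `find_url_collisions` as a hypothetical name and plain (url, src) string pairs standing in for BuildItems:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_url_collisions(items: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each URL claimed by more than one source to its sources."""
    by_url: Dict[str, List[str]] = defaultdict(list)
    for url, src in items:
        by_url[url].append(src)
    return {url: srcs for url, srcs in by_url.items() if len(srcs) > 1}
```

Collecting all sources per URL, instead of stopping at the first duplicate, lets the error report list every conflicting file at once.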
PermalinkGenerator Interface
PermalinkGenerator:
set_template(format_string: str) -> None
validate_template() -> None (raises ValueError)
generate(metadata: dict) -> str
Responsibilities:
- Parse template placeholders
- Validate required placeholders
- Map metadata to URL
- URL-encode special characters
- Return path string
NOT responsible for:
- Extracting metadata
- Checking uniqueness
- Writing files
Template Changes
Permalink template is global config.
If template changes:
- All items have new output paths
- Cache becomes invalid
- Full rebuild required
- Store template hash in manifest to detect
Cache Strategy
Two-Table Schema
Table 1: build_items
CREATE TABLE build_items (
src TEXT PRIMARY KEY,
out TEXT NOT NULL,
kind TEXT NOT NULL,
hash TEXT NOT NULL, -- Combined hash for incremental builds
processor_id TEXT NOT NULL,
mtime REAL NOT NULL
)
Tracks all processed files (content + assets).
Table 2: content_metadata
CREATE TABLE content_metadata (
src TEXT PRIMARY KEY,
title TEXT NOT NULL,
date_iso TEXT NOT NULL,
date_formatted TEXT NOT NULL,
url TEXT NOT NULL,
slug TEXT NOT NULL,
category TEXT NOT NULL,
html_content TEXT NOT NULL,
templates_used TEXT NOT NULL, -- JSON array
FOREIGN KEY (src) REFERENCES build_items(src)
)
Stores content-specific metadata.
Hash Computation
item_hash = hash(file_content)
+ processor.id
+ hash(templates_used)
+ hash(relevant_metadata)
More reliable than mtime:
- Detects actual content changes
- Works with git checkout
- Works with cloud sync
- Catches template changes
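A minimal sketch of the combined hash, reading the "+" above as "combine into one digest" rather than string concatenation. SHA-256 and the name `item_hash` are assumptions (the design does not name an algorithm), and the relevant-metadata component is omitted for brevity.

```python
import hashlib
from typing import Iterable


def item_hash(content: bytes, processor_id: str,
              template_hashes: Iterable[str]) -> str:
    """Digest that changes when content, processor, or any template changes."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(content).digest())
    h.update(processor_id.encode("utf-8"))
    h.update(b"\0")  # separator so adjacent fields cannot run together
    for template_hash in sorted(template_hashes):  # order-independent
        h.update(template_hash.encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()
```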
Template Dependency Tracking
Problem: Template change rebuilds everything
Solution: Track which templates each item uses
How:
- During rendering, TemplateRenderer records accessed templates
- Store in ContentItem.templates_used
- Save to content_metadata.templates_used (JSON)
- When template changes:
- Compute new template hash
- Query items using that template
- Only rebuild affected items
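The "query items using that template" step could look like the sketch below, using an in-memory SQLite database. The table is a reduced version of content_metadata with only the two columns the query needs; `items_using_template` is a hypothetical helper, and filtering in Python after decoding the JSON is chosen here for portability, not prescribed by the design.

```python
import json
import sqlite3
from typing import List


def items_using_template(conn: sqlite3.Connection, template: str) -> List[str]:
    """Return the sources whose templates_used JSON array contains template."""
    rows = conn.execute(
        "SELECT src, templates_used FROM content_metadata ORDER BY src"
    )
    return [src for src, used in rows if template in json.loads(used)]


conn = sqlite3.connect(":memory:")
# Reduced schema: only the columns this query needs.
conn.execute(
    "CREATE TABLE content_metadata ("
    "src TEXT PRIMARY KEY, templates_used TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO content_metadata VALUES (?, ?)",
    [
        ("content/python/hello.md", json.dumps(["base.html", "post.html"])),
        ("content/about.md", json.dumps(["base.html", "page.html"])),
    ],
)
```

A change to post.html would then rebuild one item, while a change to base.html would rebuild both.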
Cache Trust in Incremental Mode
If hash matches:
- Trust file unchanged
- Load metadata from cache
- Skip re-reading the file for parsing (its bytes are read once, to compute the hash)
- Skip parsing frontmatter
- Skip content transformation
- Mark as VALIDATED
Cache checked for:
- Content hash match
- Processor ID match
- Template hashes match
Cache invalidated on:
- Source file changed
- Template changed
- Config changed
- Processor updated
- Manual: ssg clean
CacheManager Interface
CacheManager:
needs_processing(src: Path, hash: str) -> bool
mark_processed(item: BuildItem) -> None
save_content_metadata(item: ContentItem) -> None
get_all_content() -> List[ContentItem]
get_content_by_category(category: str) -> List[ContentItem]
get_items_using_template(template: Path) -> List[BuildItem]
Cache Failure Handling
Corrupted database:
- Detect via schema version check
- Log warning
- Delete cache
- Full rebuild
Missing cache:
- First build, expected
- Create new cache
- Process everything
Partial cache:
- Some items missing
- Process missing items
- Update cache
Index Generation
Design Approach
IndexGenerator is a special processor that does all of its work in the after_build() phase.
IndexGenerator Interface
IndexGenerator(Processor):
can_handle(item) -> False # Doesn't process discovered items
before_build(context):
pass # No setup
process(item, services):
pass # Doesn't process items
after_build(context):
# All index generation happens here
Generation Flow
In after_build():
1. Query all processed content
   items = context.get_all_items(kind='content')
2. Generate category indexes
   by_category = group_by(items, 'metadata.category')
   for category, category_items in by_category.items():
       create_index(category, category_items)
3. Generate main index (paginated)
   pages = paginate(items, page_size=context.config.page_size)
   for page_num, page_items in enumerate(pages, start=1):
       create_paginated_index(page_num, page_items)
4. Generate RSS feed
   create_feed(items[:20])  # first 20 items; assumes newest-first order
5. Generate sitemap
   create_sitemap(items)
Index Pages as BuildItems
IndexItem(
src=None, # Virtual
out=Path('public/python/index.html'),
kind='index',
metadata={'category': 'python', 'items': [...]}
)
Written via OutputWriter like any other item.
Pagination
IndexItem(out='public/index.html', metadata={'page': 1})
IndexItem(out='public/page/2/index.html', metadata={'page': 2})
IndexItem(out='public/page/3/index.html', metadata={'page': 3})
Template receives page number and items for that page.
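The page layout above can be sketched as two small helpers. `paginate` and `index_path` are hypothetical names; the path scheme (page 1 at the root, later pages under page/N/) follows the three IndexItem examples.

```python
from pathlib import Path
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def paginate(items: Sequence[T], page_size: int) -> List[Sequence[T]]:
    """Split items into consecutive pages of at most page_size entries."""
    if page_size <= 0:
        raise ValueError("page_size must be > 0")
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


def index_path(output_dir: Path, page: int) -> Path:
    """Output path for a 1-based page number of the main index."""
    if page == 1:
        return output_dir / "index.html"
    return output_dir / "page" / str(page) / "index.html"
```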
Why after_build()
Advantages:
- Only includes successfully validated items
- Has complete picture of all content
- Can compute statistics (tag counts, category counts)
- Validation failures don’t corrupt index
vs during-build approach:
- During-build could include items that later fail
- Index might be inconsistent
Pipeline Stages
Stage Interface
PipelineStage(ABC):
run(context: BuildContext) -> None
Each stage modifies BuildContext and passes to next stage.
DiscoveryStage
Input: Config
Process:
1. Walk content_dir recursively
2. Apply PathFilter:
- Skip output_dir, cache_dir
- Skip dot directories (.git, .cache)
- Skip blacklisted paths
- Skip dot files
3. For each file:
- Determine kind (check extension)
- Extract category (parent directory)
- Stat file (get mtime, size)
- Create BuildItem(state=RAW)
Output: context.items = List[BuildItem(state=RAW)]
No file reading - fast, filesystem metadata only
ValidationStage
Input: List[BuildItem(state=RAW)]
Process (Incremental Mode):
errors = []
For each item:
    1. Compute hash = hash(file_content) + processor.id + template_hashes
    2. Check cache:
        if cache.has_item(src, hash):
            # FAST PATH
            item.metadata = cache.get_metadata(src)
            item.state = VALIDATED
            continue
        # SLOW PATH - changed item
        try:
            Read file
            Parse frontmatter
            Extract metadata
            Transform content
            item.state = VALIDATED
        except Exception as e:
            errors.append((item, e))
            item.state = ERROR
3. Build URL map, detect collisions
4. If errors:
    print all errors
    abort build
Output:
- Success:
List[BuildItem(state=VALIDATED)]
- Failure: Error list, exit
Full Validation Mode:
- Also stat every file
- Verify readable
- Still skip transformation if hash matches
CommitStage
Input: List[BuildItem(state=VALIDATED)]
Process:
1. Call processor.before_build(context) for all processors
2. For each changed item:
- Get processor via can_handle()
- Call processor.process(item, services)
- Write output
- Update cache
- item.state = WRITTEN
3. Call processor.after_build(context) for all processors
- IndexGenerator runs here
- Generates indexes, feeds, sitemap
4. Update manifest atomically
5. CleanupStage:
- Find outputs in old manifest not in current build
- Delete orphaned files
Output: Build statistics
Atomic guarantee:
- Cache updated only after all writes succeed
- On write failure, cache remains old state
- Can retry build without corruption
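At the level of a single file, a common way to avoid half-written output is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that technique only; `write_atomic` is a hypothetical helper, and the design's build-wide guarantee (cache updated only after all writes succeed) would be layered on top of it.

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, content: str) -> None:
    """Write content so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_name, path)  # atomic replace on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```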
Processor Interface
Base Interface
Processor(ABC):
id: str # Processor version for cache
can_handle(item: BuildItem) -> bool
before_build(context: BuildContext) -> None
process(item: BuildItem, services: Services) -> BuildItem
after_build(context: BuildContext) -> None
Lifecycle Phases
before_build() - Setup
- Load templates into memory
- Initialize caches
- Prepare shared resources
- Called once per build
process() - Transform
- Read source (if not cached)
- Parse/transform content
- Render templates
- Write output
- Return updated item
- Called per item
after_build() - Finalization
- Generate auxiliary content
- Flush caches
- Create sitemaps/feeds
- Called once per build
MarkdownContentProcessor
can_handle:
return item.kind == 'content' and item.src.suffix in ['.md', '.markdown']
process:
1. Read file content
2. Parse frontmatter (YAML/TOML)
3. Extract metadata with precedence
4. Parse dates with dateutil
5. Convert markdown to HTML
6. Render template
7. Write output
8. Track templates used
9. Return updated ContentItem
Tracks:
- Templates used (for dependency tracking)
- Metadata for index generation
SassProcessor
can_handle:
return item.kind == 'asset' and item.src.suffix == '.scss'
process:
1. Read SCSS
2. Compile to CSS
3. Generate source maps
4. Minify (optional)
5. Write output
CopyProcessor
can_handle:
return item.kind == 'asset'
process:
1. Copy file with shutil.copy2
2. Preserve mtime
Fallback for unknown asset types.
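Because CopyProcessor accepts every asset, including .scss files, the order in which processors are consulted matters. The sketch below assumes a first-match dispatch, which the design does not state explicitly; `select_processor` is a hypothetical helper, and `can_handle` takes (kind, src) instead of a BuildItem to keep the example self-contained.

```python
from pathlib import Path
from typing import Optional, Sequence


class Processor:
    id = "base-1"

    def can_handle(self, kind: str, src: Path) -> bool:
        raise NotImplementedError


class SassProcessor(Processor):
    id = "sass-1"

    def can_handle(self, kind: str, src: Path) -> bool:
        return kind == "asset" and src.suffix == ".scss"


class CopyProcessor(Processor):
    id = "copy-1"

    def can_handle(self, kind: str, src: Path) -> bool:
        return kind == "asset"  # matches every asset, so it must come last


def select_processor(processors: Sequence[Processor],
                     kind: str, src: Path) -> Optional[Processor]:
    """Return the first processor that accepts the item, or None."""
    for processor in processors:
        if processor.can_handle(kind, src):
            return processor
    return None
```

Registering CopyProcessor before SassProcessor would cause stylesheets to be copied verbatim instead of compiled.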
IndexGenerator
can_handle:
return False # Special processor
after_build:
1. Query context.get_all_items(kind='content')
2. Group by category
3. Generate category indexes
4. Paginate main index
5. Generate RSS feed
6. Generate sitemap
7. Write all via OutputWriter
Service Interfaces
Services are pluggable for testability and extensibility.
FileManager (Abstract)
FileManager(ABC):
read(path: Path) -> bytes
read_text(path: Path, encoding: str) -> str
write(path: Path, data: bytes) -> None
write_text(path: Path, content: str) -> None
copy(src: Path, dest: Path) -> None
remove(path: Path) -> None
hash_file(path: Path) -> str
exists(path: Path) -> bool
stat(path: Path) -> FileStat
Default: LocalFileManager (filesystem)
Alternative: S3FileManager, MemoryFileManager (testing)
TemplateRenderer (Abstract)
TemplateRenderer(ABC):
load_templates(template_dir: Path) -> None
render(template_name: str, context: dict) -> str
get_template_hash(template_name: str) -> str
get_templates_used() -> List[Path]
Default: Jinja2Renderer
Features:
- Template inheritance
- Partials/includes
- Custom filters
- Autoescaping
PermalinkGenerator (Abstract)
PermalinkGenerator(ABC):
set_template(format: str) -> None
validate_template() -> None
generate(metadata: dict) -> str
Default: PatternPermalinkGenerator
Alternative: CustomPermalinkGenerator (complex routing)
MetadataExtractor
MetadataExtractor:
parse_frontmatter(content: str) -> dict
extract_from_file(item: BuildItem, config: Config) -> dict
generate_slug(text: str) -> str
generate_title(filename: str) -> str
parse_date(date_value: Any) -> datetime
Handles:
- Frontmatter parsing (YAML/TOML)
- Metadata extraction
- Slug generation
- Date parsing (flexible formats)
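One possible `generate_slug`, written to satisfy the slug test listed under Testing Strategy ("Hello World" becomes "hello-world", "C++" becomes "cpp"). Mapping "+" to "p" and stripping accents via NFKD normalization are assumptions made to meet those cases, not rules stated by the design.

```python
import re
import unicodedata


def generate_slug(text: str) -> str:
    """Lowercase ASCII slug with runs of other characters collapsed to '-'."""
    text = text.replace("+", "p")  # assumption: makes "C++" come out as "cpp"
    # Decompose accented characters, then drop the non-ASCII remainder.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return text.strip("-")
```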
OutputWriter
OutputWriter:
write_html(path: Path, content: str) -> None
copy_file(src: Path, dest: Path) -> None
ensure_directory(path: Path) -> None
remove_file(path: Path) -> None
All file writes go through this interface.
Services Bundle
Services:
cache: CacheManager
templates: TemplateRenderer
output: OutputWriter
permalinks: PermalinkGenerator
metadata: MetadataExtractor
Passed to processor.process() to avoid many parameters.
Error Handling
Validation Phase Errors
Collected, not thrown:
errors: List[Tuple[BuildItem, Exception]] = []
for item in items:
    try:
        validate(item)
    except Exception as e:
        errors.append((item, e))
        continue  # Keep validating
All errors shown:
ERROR: Found 3 errors during validation
1. content/python/bad.md
Failed to parse frontmatter: Invalid YAML at line 5
2. content/tutorials/test.md
Invalid date format: "not a date"
3. URL collision: /python/2024/10/post/
- content/python/post.md
- content/python/tutorials/post.md
Build aborted. Fix errors and retry.
Error Types
Validation errors:
- Malformed frontmatter
- Invalid date formats
- URL collisions
- Encoding errors
- Missing required metadata
- Template not found
Commit errors:
- Permission denied (output dir)
- Disk full
- File locked
Config errors:
- Invalid TOML syntax
- Missing required fields
- Invalid permalink template
- Invalid paths
Error Recovery
Validation phase:
- Collect all errors
- Show complete report
- Abort before any writes
- No partial builds
Commit phase:
- First error stops build
- Log error clearly
- Cache remains in old state
- Can retry without corruption
User-Friendly Messages
Bad:
Good:
ERROR: content/python/test.md
Missing required field: 'slug'
Either:
1. Add 'slug' to frontmatter, or
2. Ensure filename can generate valid slug
Performance Characteristics
Incremental Build (Default)
Scenario: 1000 posts, 1 file changed
DiscoveryStage: ~100ms (walk filesystem)
ValidationStage: ~1s (999 hash checks, 1 full process)
CommitStage: ~50ms (write 1 file, update cache)
Total: ~1.2s
Memory: ~50MB (all metadata + 1 content item)
Full Validation Build
Scenario: 1000 posts, paranoid build
DiscoveryStage: ~100ms
ValidationStage: ~5s (999 stat+hash, 1 full process)
CommitStage: ~50ms
Total: ~5.2s
Memory: ~50MB (metadata only)
Cold Build (No Cache)
Scenario: 1000 posts, first build
DiscoveryStage: ~100ms
ValidationStage: ~60s (read + parse + transform all)
CommitStage: ~2s (write all outputs, create cache)
Total: ~62s
Memory: ~500MB (all content in memory during validation)
Optimization Strategies
Parallel Processing (Future):
- Validate independent items in parallel
- Thread-safe cache access
- 4-8x speedup on multi-core systems
Lazy Loading:
- Load templates on-demand
- Parse frontmatter only when needed
- Stream large files
Cache Warming:
- Pre-compute hashes on file watch
- Background cache updates
- Faster incremental builds
BuildContext
Shared state across pipeline stages and processors.
Structure
BuildContext:
config: Config # Site configuration (immutable)
items: List[BuildItem] # All discovered/processed items
manifest: BuildManifest # Previous build state
stats: BuildStats # Counters
url_map: Dict[str, Path] # URL collision detection
Query Interface
get_all_items(kind: str = None,
category: str = None,
tags: List[str] = None) -> List[BuildItem]
get_item_by_path(src: Path) -> Optional[BuildItem]
get_items_using_template(template: Path) -> List[BuildItem]
Used by IndexGenerator to query processed content.
BuildStats
BuildStats:
total_files: int
processed: int # Changed items
skipped: int # Cached items
failed: int # Validation errors
index_generated: bool
duration: float
CLI Interface
Commands
# Build site
ssg build
# Build with options
ssg build --config path/to/config.toml
ssg build --env prod
ssg build --full-validation
ssg build --verbose
# Clean cache and output
ssg clean
# Initialize new site
ssg init
# Watch and rebuild on changes (future)
ssg watch
# Serve locally (future)
ssg serve
Build Command Options
--config PATH Config file location (default: ./config.toml)
--env ENV Environment (dev/prod, loads config.ENV.toml)
--full-validation Use full validation mode (paranoid)
--verbose Show debug output
--dry-run Validate only, don't write outputs
--clean Clean before building
Exit Codes
0 Success
1 Validation errors
2 Commit errors
3 Configuration errors
4 File system errors
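The codes can be kept in one place as an `IntEnum`. The member names and the `exit_code_for` helper are hypothetical; only the numeric values come from the table above, and the precedence among failure kinds is an assumption.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    VALIDATION_ERROR = 1
    COMMIT_ERROR = 2
    CONFIG_ERROR = 3
    FILESYSTEM_ERROR = 4


def exit_code_for(config_ok: bool, validation_errors: int,
                  commit_failed: bool) -> ExitCode:
    """Pick an exit code; earlier build phases take precedence (assumed)."""
    if not config_ok:
        return ExitCode.CONFIG_ERROR
    if validation_errors:
        return ExitCode.VALIDATION_ERROR
    if commit_failed:
        return ExitCode.COMMIT_ERROR
    return ExitCode.SUCCESS
```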
Watch Mode (Future Feature)
Design
1. Initial build
2. Watch filesystem for changes
3. On change:
- Debounce (wait 100ms for more changes)
- Determine affected items
- Run incremental build
- Notify browser (LiveReload)
Change Detection
File changed:
- Reprocess that item only
- Update cache
- Regenerate indexes
Template changed:
- Query items using that template
- Reprocess affected items
- Update cache
Config changed:
- Full rebuild required
- Restart watch
LiveReload Integration
1. Build includes JS snippet
2. SSG serves WebSocket
3. On rebuild, send reload message
4. Browser refreshes
Plugin System (Future Feature)
Plugin Interface
Plugin(ABC):
name: str
version: str
register_processors() -> List[Processor]
register_filters() -> Dict[str, Callable]
register_commands() -> List[Command]
on_config_loaded(config: Config) -> None
on_build_start(context: BuildContext) -> None
on_build_complete(context: BuildContext) -> None
Example Plugin
class ImageOptimizationPlugin(Plugin):
    def register_processors(self):
        return [ImageOptimizer()]

    def on_build_complete(self, context):
        # Generate responsive images
        for item in context.get_all_items(kind='asset'):
            if is_image(item):
                generate_thumbnails(item)
Plugin Discovery
[plugins]
enabled = ["image-optimization", "search-index"]
[plugins.image-optimization]
quality = 85
formats = ["webp", "avif"]
[plugins.search-index]
fields = ["title", "content", "tags"]
Testing Strategy
Unit Tests
Test data classes:
def test_build_item_state_transitions():
    item = BuildItem(state=RAW)
    assert item.state == RAW
    item.state = VALIDATED
    assert item.state == VALIDATED
Test utilities:
def test_slug_generation():
    assert generate_slug("Hello World") == "hello-world"
    assert generate_slug("C++") == "cpp"
Integration Tests
Test pipeline stages:
def test_discovery_stage():
# Create temp filesystem
# Run discovery
# Assert correct items found
Test processors:
def test_markdown_processor():
# Create test markdown file
# Process
# Assert HTML output correct
End-to-End Tests
Test complete builds:
def test_incremental_build():
# Build once
# Modify one file
# Build again
# Assert only one file processed
Snapshot Testing
Test output stability:
def test_output_unchanged():
# Build with known input
# Compare output to snapshot
# Assert no differences
Performance Tests
Benchmark builds:
def test_build_performance():
# Build 1000 posts
# Assert completes in < 60s
Migration from Current SSG
Compatibility Layer
Map old concepts to SSGv3:
Old FileInfo → BuildItem
Old ContentType → Processor
Old should_rebuild() → needs_processing() + cache check
Old process() → validate() + commit()
Old CacheManager → BuildContext + cache tables
Migration Steps
1. Add state tracking:
- Wrap FileInfo in BuildItem
- Add state field
2. Split processing:
- Extract validation logic
- Separate from writing
3. Update cache schema:
- Add hash column
- Add templates_used column
- Migration script for old caches
4. Refactor pipeline:
- Create stage classes
- Move logic from methods to stages
5. Add validation mode:
- Implement incremental mode
- Add full validation option
Backward Compatibility
Config migration:
# Old CONFIG dict still works
config = Config.from_dict(CONFIG)
Keep old CLI:
# Old commands still work
python ssg_generator.py build
Gradual adoption:
- Stage 1: Add BuildItem wrapper
- Stage 2: Add validation phase
- Stage 3: Enable commit phase
- Stage 4: Remove old code
Future Enhancements
Near-Term
- Watch mode - File watching + live reload
- Parallel processing - Multi-threaded builds
- Better error messages - Context + suggestions
- Progress bars - Visual feedback during builds
- Dry-run mode - Validate without writing
Medium-Term
- Plugin system - Extensible architecture
- Multiple content formats - RST, AsciiDoc, Org
- Asset pipeline - Sass, TypeScript, image optimization
- Search index - Client-side full-text search
- Multilingual - i18n support
Long-Term
- Distributed builds - Build on multiple machines
- Cloud storage - S3/GCS output
- Incremental deploys - Only upload changed files
- Build analytics - Performance insights
- Visual editor - GUI for content management
Appendix: Design Decisions
Why Validate-Then-Commit?
Problem with immediate writes:
- Partial builds leave broken output
- Can’t show all errors at once
- Hard to implement dry-run
- Difficult to rollback
Validate-then-commit solves:
- Output always consistent
- All errors reported upfront
- Dry-run is just “stop after validate”
- Easy rollback (just don’t commit)
Why Trust Cache by Default?
Incremental mode is fast:
- 1000 posts, 1 changed: ~1 second
- Matches user expectations
- Good for development workflow
Full validation available when needed:
- CI/CD deployments
- Production builds
- After system updates
Risk is low:
- Hash detects content changes
- Template tracking detects template changes
- Config changes force full rebuild
Why Single-Level Categories?
Simpler:
- Easier to understand
- Cleaner URLs
- Less nesting complexity
Good enough:
- Most sites have 5-10 categories
- Deep nesting rare
- Can add full-path later
Why Jinja2?
Mature:
- Battle-tested
- Well-documented
- Large ecosystem
Features:
- Template inheritance
- Macros/includes
- Custom filters
- Autoescaping
Alternative:
Could support multiple template engines via TemplateRenderer interface
Why TOML Config?
Readable:
- Clean syntax
- Comments supported
- No significant whitespace
Type-safe:
- Clear data types
- Nested structures
- Arrays and tables
Alternative:
Could support YAML via same Config.from_dict() pattern
Appendix: Glossary
BuildItem - Representation of anything the system builds (content, asset, index)
Processor - Handler that transforms BuildItems (markdown→HTML, SCSS→CSS)
Pipeline Stage - Phase of the build process (discovery, validation, commit)
BuildContext - Shared state across pipeline stages and processors
Services - Collection of utility objects (cache, templates, output writer)
Frontmatter - YAML/TOML metadata at the top of content files
Permalink - URL pattern for generated pages
Category - Classification from directory structure or frontmatter
Slug - URL-safe identifier derived from title or filename
Incremental Build - Only rebuild changed items
Cache - Persistent storage of previous build state
Hash - Content fingerprint for detecting changes
Template Dependency - Tracking which templates each item uses
Validation Phase - Read-only checking of all items
Commit Phase - Write all validated outputs
Index - Generated page listing multiple content items (blog index, category page)
Document Information
Version: 3.0
Date: 2025-10-04
Status: Complete Design
Next Steps: Implementation
Changes from v2:
- Added validate-then-commit architecture
- Added incremental validation mode
- Added full validation option
- Refined cache trust strategy
- Added complete error handling
- Added performance characteristics
- Added migration guide
End of Document
Static Site Generator v3
Design - Core Specification
Updated: 2025 Oct 31
Architecture Overview
SSGv3 is a static site generator built on a three-stage pipeline (SCAN, BUILD, WRITE) in which each stage has exclusive responsibilities.
Core principle: All validation and transformation happens in memory during BUILD. WRITE is a pure commit operation that either completes fully or leaves the previous output untouched.
Stage Boundaries
| Stage | File Reads | Transformations | File Writes | Failures Allowed |
| --- | --- | --- | --- | --- |
| SCAN | Metadata only (stat) | None | None | Yes (abort before BUILD) |
| BUILD | Content (changed files only) | Markdown→HTML, Template rendering | None | Yes (abort before WRITE) |
| WRITE | None | None | All outputs | No (atomic commit) |
Guarantee: If BUILD completes without errors, WRITE will succeed or leave the system in the previous consistent state.
Component Hierarchy
Static Site Generator
│
├── Three-Stage Pipeline
│ ├── SCAN Stage
│ ├── BUILD Stage
│ │ ├── Phase 1: Content Processing
│ │ └── Phase 2: Index Generation
│ └── WRITE Stage
│
├── Core Data Model
│ └── BuildItem
│ ├── ContentItem
│ ├── AssetItem
│ └── IndexItem
│
├── Metadata System
│ ├── System Defaults
│ ├── Path-Derived Metadata
│ └── Frontmatter Overrides
│
├── Content Processing
│ ├── Slug Normalization
│ ├── Template Selection
│ ├── Markdown Transformation
│ └── Permalink Generation
│
├── Cache System
│ ├── Cache Key Computation
│ ├── Manifest Management
│ └── Invalidation Logic
│
├── Template System
│ ├── Template Selection
│ ├── Dependency Graph
│ ├── Hash Computation
│ └── Cycle Detection
│
├── Index System
│ ├── Pagination
│ ├── Index Cache Keys
│ └── Rebuild Detection
│
├── URL Management
│ ├── Permalink Templates
│ ├── URL Normalization
│ └── Collision Detection
│
└── Atomic Write System
├── Symlink-Based Swap
├── Timestamped Directories
└── Orphan Cleanup
Core Data Model
BuildItem
Every discovered file becomes a BuildItem. After processing, it may become:
- ContentItem: Markdown file that generates HTML
- AssetItem: File copied byte-for-byte (CSS, images, etc.)
- IndexItem: Virtual item for paginated index pages
Each BuildItem contains:
- src: Original source file path (null for virtual items)
- out: Final output path (resolved during BUILD)
- url: Site-relative URL (e.g., /python/2025/intro/)
- kind: content | asset | index
- state: SCANNED | BUILT | WRITTEN
- metadata: Resolved metadata dictionary
- cache_key: SHA256 hash identifying this item’s dependencies
- html: Rendered HTML (content and index items only)
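The field list above maps naturally onto a dataclass. The following is only a sketch: the enum names Kind and State and the chosen defaults are assumptions made for illustration, not part of the specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional


class Kind(Enum):
    CONTENT = "content"
    ASSET = "asset"
    INDEX = "index"


class State(Enum):
    SCANNED = "SCANNED"
    BUILT = "BUILT"
    WRITTEN = "WRITTEN"


@dataclass
class BuildItem:
    kind: Kind
    src: Optional[Path] = None       # None for virtual items (indexes)
    out: Optional[Path] = None       # resolved during BUILD
    url: Optional[str] = None        # site-relative URL
    state: State = State.SCANNED
    metadata: dict = field(default_factory=dict)
    cache_key: Optional[str] = None  # SHA256 hex digest
    html: Optional[str] = None       # content and index items only


item = BuildItem(kind=Kind.CONTENT, src=Path("content/python/intro.md"))
```

Using default_factory for metadata gives each item its own dictionary, so mutating one item's metadata cannot leak into another.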
Stage 1: SCAN
Purpose
Fast discovery of all source files with minimal I/O. Produces a deterministic list of BuildItems with initial metadata.
Inputs
- Project directory structure
- Content directory (e.g., content/)
- Asset directory (e.g., assets/)
- Blacklist patterns (e.g., ['.git', '_drafts'])
Process
- Directory Walking: Traverse content and asset directories using filesystem scanning
- Filtering: Skip blacklisted paths, hidden directories (.git, .cache), and output/cache directories
- Classification: Determine item kind based on file extension:
  - .md, .markdown → content
  - Everything else → asset
- Initial Metadata: Extract from file path structure:
  - initial_slug: Filename without extension
  - initial_category: Immediate parent directory name (empty if top-level)
  - path_rel: Path relative to content/asset root
- File Stats: Capture modification time and size (for change detection)
- Ordering: Sort items by relative path for deterministic processing
Outputs
List of BuildItem(state=SCANNED) with minimal metadata. No file contents read, no transformations performed.
Example
Input structure:
content/python/intro.md
content/rust/ownership.md
assets/style.css
Output items:
BuildItem(src=content/python/intro.md, kind=content,
initial_slug=intro, initial_category=python)
BuildItem(src=content/rust/ownership.md, kind=content,
initial_slug=ownership, initial_category=rust)
BuildItem(src=assets/style.css, kind=asset,
initial_slug=style, initial_category='')
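A minimal sketch of the SCAN walk, assuming one call per root directory (content or assets) and plain dictionaries in place of BuildItem objects. The function name scan and the dictionary keys mtime and size are illustrative choices, not names from the specification.

```python
import os
from pathlib import Path

CONTENT_EXTS = {".md", ".markdown"}


def scan(root: Path, blacklist=(".git", "_drafts")):
    """Walk `root` and return item dicts sorted by relative path.

    Only stat() metadata is collected; file contents are never read.
    """
    items = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune blacklisted and hidden directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames
                       if d not in blacklist and not d.startswith(".")]
        for name in filenames:
            path = Path(dirpath) / name
            rel = path.relative_to(root)
            st = path.stat()
            items.append({
                "src": path,
                "path_rel": rel.as_posix(),
                "kind": "content" if path.suffix.lower() in CONTENT_EXTS else "asset",
                "initial_slug": path.stem,
                "initial_category": rel.parent.name,  # "" for top-level files
                "mtime": st.st_mtime,
                "size": st.st_size,
            })
    # Sorting by relative path makes processing order deterministic.
    return sorted(items, key=lambda i: i["path_rel"])
```

Pruning dirnames in place is what keeps the walk from descending into ignored directories at all, which matches the "filter early" intent of this stage.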
Stage 2: BUILD
Purpose
Transform source files into renderable outputs, resolve all metadata, detect collisions, and prepare everything for atomic write. All operations happen in memory.
Two-Phase Process
BUILD operates in two distinct phases to resolve circular dependencies:
Phase 1: Content Processing
- Process all content and asset items
- Assign final cache_key to each item
- Items are now ready for indexing
Phase 2: Index Generation
- Create virtual IndexItems for pagination
- Use cache_keys from Phase 1 in index cache computation
- Detect URL collisions across all items
Phase 1: Content Processing
For each BuildItem from SCAN:
1. Metadata Resolution
Merge metadata from three sources in order of precedence (later sources override earlier):
System Defaults:
slug: filename without extension
category: parent directory name
date: file modification time
Path-Derived (only if not in frontmatter):
- Category from directory structure
- Date from filename patterns like YYYY-MM-DD-title.md
Frontmatter (always wins):
- Any explicitly set field overrides all others
- Parsed from YAML block at file start
- Example: category: tutorials overrides directory structure
Key behavior: Files can move between directories without URL changes if frontmatter specifies category. Without frontmatter, URL follows directory structure.
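The precedence rule can be sketched as a layered dictionary merge. The helper name resolve_metadata is hypothetical, and skipping None values is an assumption about how unset fields are represented.

```python
def resolve_metadata(defaults: dict, path_derived: dict, frontmatter: dict) -> dict:
    """Merge the three sources; later sources override earlier ones."""
    resolved = {}
    for source in (defaults, path_derived, frontmatter):
        # Skip unset values so a missing field never erases an earlier source.
        resolved.update({k: v for k, v in source.items() if v is not None})
    return resolved


resolved = resolve_metadata(
    defaults={"slug": "intro", "category": "python", "date": "2025-10-28"},
    path_derived={"category": "python"},
    frontmatter={"title": "Introduction to Python",
                 "category": "programming-basics",
                 "date": "2024-06-15"},
)
```

Here the slug survives from the defaults because frontmatter does not set it, while category and date come from frontmatter.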
2. Slug Normalization
Convert title/filename into URL-safe slug:
- Convert to lowercase
- Transliterate Unicode to ASCII (ö → o, é → e)
- Remove all punctuation except hyphens
- Replace whitespace sequences with single hyphen
- Strip leading/trailing hyphens
Result: "My Cool Post!" → "my-cool-post"
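One possible implementation of these rules using only the standard library is sketched below. It relies on Unicode NFKD decomposition for transliteration, which is an assumption: characters with no ASCII decomposition (for example ß or ø) are simply dropped rather than mapped.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    text = text.lower()
    # Decompose accented characters, then drop anything non-ASCII (ö → o, é → e).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9\s-]", "", text)  # remove punctuation except hyphens
    text = re.sub(r"\s+", "-", text)          # whitespace runs become one hyphen
    return text.strip("-")                    # strip leading/trailing hyphens


print(slugify("My Cool Post!"))  # my-cool-post
```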
3. Cache Key Computation
Compute a deterministic SHA256 hash that captures all inputs affecting the rendered output:
Input object (deterministically serialized to JSON):
{
"content_hash": SHA256(file_bytes),
"metadata": {
"slug": resolved_slug,
"category": resolved_category,
"date_iso": resolved_date_in_ISO_format
},
"selected_template": template_name,
"template_hash": hash_of_template_and_all_its_includes,
"permalink_format": "{category}/{year}/{month}/{slug}/",
"schema_version": 1
}
cache_key = SHA256(JSON.dumps(input_object, sorted_keys))
Critical: Template hash must include all partials/includes transitively (see Template Dependency Tracking section).
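The computation above can be sketched with hashlib and json. The function and parameter names are illustrative; the important detail is sort_keys=True, which makes the serialization independent of dictionary insertion order.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def compute_cache_key(file_bytes: bytes, metadata: dict, template_name: str,
                      template_hash: str, permalink_format: str,
                      schema_version: int = 1) -> str:
    payload = {
        "content_hash": sha256_hex(file_bytes),
        "metadata": {
            "slug": metadata["slug"],
            "category": metadata["category"],
            "date_iso": metadata["date_iso"],
        },
        "selected_template": template_name,
        "template_hash": template_hash,
        "permalink_format": permalink_format,
        "schema_version": schema_version,
    }
    # Sorted keys and fixed separators give a byte-identical string for equal inputs.
    serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return sha256_hex(serialized.encode("utf-8"))
```

Bumping schema_version changes every key at once, which gives a simple way to invalidate the whole cache when the payload layout changes.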
4. Cache Consultation
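Check whether this item needs processing by comparing its cache_key with the entry recorded in the manifest:
- Cache hit: load cached HTML and metadata
- Cache miss: mark item for processing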
5. Template Selection
Choose template using first-match algorithm:
- If frontmatter contains template: X → use templates/X.html
- Else if category-specific template exists → use templates/{category}.html
- Else use templates/default.html (required)
Error if: Selected template file doesn’t exist.
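A sketch of the first-match rule, assuming the set of available template file names has already been collected from templates/. The helper name select_template and the use of FileNotFoundError are illustrative choices.

```python
def select_template(metadata: dict, available: set) -> str:
    """Return the template file name for an item, or raise if it is missing."""
    explicit = metadata.get("template")
    category_name = f"{metadata.get('category', '')}.html"
    if explicit:
        name = f"{explicit}.html"      # frontmatter wins, even if the file is missing
    elif category_name in available:
        name = category_name           # category-specific template
    else:
        name = "default.html"          # required fallback
    if name not in available:
        raise FileNotFoundError(f"Template not found: templates/{name}")
    return name
```

Note that an explicit frontmatter template that does not exist is reported as an error rather than silently falling back to the default.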
6. Content Transformation
For content items (cache miss only):
- Read file as UTF-8 (fail if invalid encoding)
- Split frontmatter (YAML between --- markers) from body
- Parse Markdown body to HTML (single pass, with configured extensions)
- Assemble template context:
  - content: HTML body
  - metadata: All resolved metadata
  - site: Global site configuration
- Render final HTML through selected template
- Track which templates were used (for invalidation)
For asset items:
- No transformation
- Cache key based only on file hash and output path
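The first two content steps (strict UTF-8 decoding and frontmatter splitting) can be sketched without third-party libraries. Parsing the YAML text and rendering Markdown would need external packages and are left out; the function names are hypothetical.

```python
def decode_source(raw: bytes) -> str:
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
    return raw.decode("utf-8")


def split_frontmatter(text: str):
    """Return (frontmatter_text, body).

    Frontmatter is the block between a leading '---' line and the next '---' line.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter block
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    raise ValueError("Unterminated frontmatter block")


fm, body = split_frontmatter("---\ntitle: Hi\n---\n# Body\n")
```

Raising on an unterminated block lets BUILD record it as a frontmatter parse failure instead of treating the whole file as body text.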
7. Permalink Generation
Apply permalink template to resolved metadata:
Template format: {category}/{year:04d}/{month:02d}/{slug}/
Supported placeholders:
{category}: Category string
{year}, {month}, {day}: Date components with optional formatting
{slug}: URL-safe slug
Example:
Metadata: {category: "python", date: "2025-10-28", slug: "intro"}
Template: "{category}/{year}/{month}/{slug}/"
Result URL: "/python/2025/10/intro/"
Output path: "public/python/2025/10/intro/index.html"
URL Normalization Rules:
- All URLs end with / (trailing slash required)
- All URLs are lowercase
- Multiple slashes collapsed to single slash
- Leading slash always present
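Permalink expansion and normalization can be sketched with str.format. This assumes the date is available as a datetime.date and that the template uses explicit format specs such as {month:02d}; with a bare {month}, str.format would not zero-pad. The function names and the "public" output root are illustrative.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def normalize_url(url: str) -> str:
    url = "/" + url.lower() + "/"    # lowercase, leading and trailing slash
    return re.sub(r"/+", "/", url)   # collapse repeated slashes


def make_permalink(template: str, metadata: dict) -> str:
    d = metadata["date"]
    raw = template.format(category=metadata["category"], slug=metadata["slug"],
                          year=d.year, month=d.month, day=d.day)
    return normalize_url(raw)


def output_path(url: str, output_root: str = "public") -> str:
    # URL + index.html, rooted at the output directory.
    return (PurePosixPath(output_root) / url.strip("/") / "index.html").as_posix()


meta = {"category": "python", "slug": "intro", "date": date(2025, 10, 28)}
url = make_permalink("{category}/{year:04d}/{month:02d}/{slug}/", meta)
print(url, output_path(url))
```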
8. Assign Final Paths
For each item:
- Set item.url to normalized permalink
- Set item.out to filesystem output path (URL + index.html)
- Set item.cache_key to computed hash
- Set item.state = BUILT
Phase 2: Index Generation
After all content items have cache_keys assigned:
1. Collect Items for Indexing
Main index: All content items, sorted by date descending
Category indexes: Items grouped by category, sorted by date descending
2. Paginate Item Lists
Split sorted items into pages using configured page size (e.g., 10 posts per page).
Page URLs:
Main index:
Page 1: /index.html
Page 2: /page/2/index.html
Page 3: /page/3/index.html
Category index (e.g., python):
Page 1: /python/index.html
Page 2: /python/page/2/index.html
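Sorting, slicing, and page path construction might look like the sketch below. The helper names are hypothetical, and items are assumed to be dictionaries with an ISO-format date string so that string comparison matches date order.

```python
def sort_for_index(items):
    # Newest first; ISO date strings sort correctly as plain strings.
    return sorted(items, key=lambda i: i["date_iso"], reverse=True)


def paginate(items, per_page):
    """Split a sorted list into consecutive pages of at most `per_page` items."""
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]


def index_page_path(page_number, category=None):
    prefix = f"/{category}" if category else ""
    if page_number == 1:
        return f"{prefix}/index.html"
    return f"{prefix}/page/{page_number}/index.html"


posts = [{"date_iso": f"2025-01-{d:02d}"} for d in range(1, 29)]
pages = paginate(sort_for_index(posts), per_page=10)
```

With 28 posts and a page size of 10 this yields three pages, the last holding the 8 oldest posts.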
3. Compute Index Cache Keys
For each index page, compute cache key capturing:
{
"index_type": "main" or "category:python",
"page_number": 2,
"template_hash": hash_of_index_template,
"pagination_context": {
"total_items": 47,
"total_pages": 5,
"items_per_page": 10
},
"items_on_page": [
{
"cache_key": item.cache_key,
"url": item.url,
"date_iso": item.metadata.date_iso
}
for each item on this page (sorted order)
]
}
Key insight: This captures both membership and ordering. Any change to constituent items or pagination boundaries invalidates the index page.
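As a sketch, the index key can be derived the same way as item keys: serialize the structure above deterministically and hash it. The function name and the dictionary shape of items_on_page are assumptions for illustration.

```python
import hashlib
import json


def index_cache_key(index_type, page_number, template_hash,
                    total_items, items_per_page, items_on_page):
    total_pages = -(-total_items // items_per_page)  # ceiling division
    payload = {
        "index_type": index_type,
        "page_number": page_number,
        "template_hash": template_hash,
        "pagination_context": {
            "total_items": total_items,
            "total_pages": total_pages,
            "items_per_page": items_per_page,
        },
        # A list, not a set: order on the page is part of the key.
        "items_on_page": [
            {"cache_key": it["cache_key"], "url": it["url"], "date_iso": it["date_iso"]}
            for it in items_on_page
        ],
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()
```

Because the item list is kept in page order, swapping two posts changes the key even when membership is identical.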
4. Check Index Cache
For each index page:
- Cache hit: Load pre-rendered HTML
- Cache miss: Render index template with:
- Current page items
- Pagination metadata (current page, total pages, prev/next URLs)
- Category information (for category indexes)
5. Create IndexItems
Construct virtual BuildItems:
kind = index
src = null (virtual item)
url and out set to index page paths
html contains rendered or cached HTML
state = BUILT
Add IndexItems to the main items list.
Collision Detection
After both phases complete:
- Build URL map: {url: [items_with_that_url]}
- Normalize all URLs before mapping
- Identify collisions: any URL with multiple source items
- For each collision:
- Create detailed error with all source paths
- Suggest resolution (change slug in frontmatter, change category, adjust permalink template)
- If any collisions found, abort before WRITE
Special cases checked:
- Content item URL matching index URL
- Asset filename matching generated content URL
- Multiple content items with same slug in same category
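The URL map approach can be sketched as follows, taking (url, source) pairs as input. The function name find_collisions and the inline normalization are illustrative; index items, which have no source file, could pass a descriptive label instead.

```python
import re
from collections import defaultdict


def normalize_url(url: str) -> str:
    return re.sub(r"/+", "/", "/" + url.lower() + "/")


def find_collisions(items):
    """items: iterable of (url, source) pairs. Returns {url: [sources]} for duplicates."""
    by_url = defaultdict(list)
    for url, source in items:
        by_url[normalize_url(url)].append(source)  # normalize before mapping
    return {url: sources for url, sources in by_url.items() if len(sources) > 1}


collisions = find_collisions([
    ("/python/intro/", "content/python/intro.md"),
    ("/Python/intro", "content/misc/intro.md"),
    ("/rust/ownership/", "content/rust/ownership.md"),
])
```

Normalizing first is what catches the second entry, which differs from the first only in case and trailing slash.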
Error Collection
BUILD accumulates all errors without stopping:
- Frontmatter parse failures
- Invalid date formats
- Missing required templates
- URL collisions
- Encoding errors
At end of BUILD, if any errors exist, return error list and abort. No partial builds.
BUILD Success Criteria
BUILD phase succeeds only when:
- All items processed without errors
- All cache keys computed
- All URLs resolved and collision-free
- All HTML rendered (or loaded from cache)
- No missing templates
- All items in state=BUILT
Template Dependency Tracking
Problem
Template changes must invalidate all items that use them, including transitive dependencies through partials/includes.
Template Hash Computation
When templates are loaded:
- Build dependency graph:
  - Parse each template for include/import statements
  - Build map: {template_name: [included_template_names]}
- Detect cycles:
  - Walk dependency graph with depth-first search
  - Track visiting vs. visited nodes
  - If cycle detected: reject template set with error
  - Example error: “Circular template dependency: base.html → header.html → base.html”
- Compute hashes bottom-up:
  - Start with leaf templates (no includes)
  - For each template with includes:
    template_hash = SHA256(
        template_file_content +
        sorted_list_of_included_template_hashes
    )
  - Store computed hash for reuse
- Cache hash results: Template hashes are computed once at template load and reused for all items.
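Cycle detection and bottom-up hashing can be combined in one depth-first traversal, as sketched below. The include map is taken as an input here; extracting it from template source is not shown. The function name and the choice of ValueError are assumptions, and every included name is assumed to exist in sources.

```python
import hashlib


def compute_template_hashes(sources: dict, includes: dict) -> dict:
    """sources: name -> template text; includes: name -> list of included names."""
    hashes = {}    # finished templates (the "visited" set, with results)
    visiting = []  # current DFS path, used to detect and report cycles

    def visit(name):
        if name in hashes:
            return hashes[name]
        if name in visiting:
            cycle = visiting[visiting.index(name):] + [name]
            raise ValueError("Circular template dependency: " + " → ".join(cycle))
        visiting.append(name)
        # Children are hashed first, so leaves are resolved before their parents.
        child_hashes = sorted(visit(child) for child in includes.get(name, []))
        visiting.pop()
        data = sources[name] + "".join(child_hashes)
        hashes[name] = hashlib.sha256(data.encode("utf-8")).hexdigest()
        return hashes[name]

    for name in sources:
        visit(name)
    return hashes
```

Because each template's hash folds in its children's hashes, editing a deeply nested partial changes the hash of every template that reaches it.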
Invalidation on Template Change
When a template file changes:
- Recompute its hash (and hashes of templates that include it)
- Query cache for items that used that template
- Mark those items as needing rebuild
- During BUILD, these items get cache misses and reprocess
Note: Template hash is included in item cache_key, so any template change naturally invalidates dependent items.
Stage 3: WRITE
Purpose
Atomically commit all built outputs to disk. Either complete successfully or leave previous output unchanged.
Atomic Commit Strategy
SSGv3 uses symlink-based atomic swap for true atomicity:
Output Directory Structure
project/
public@ -> output_20251028_143022/ # symlink to current build
output_20251028_143022/ # actual output directory (timestamped)
output_20251028_120000/ # previous build (kept for rollback)
The canonical public/ path is always a symlink, never a real directory.
Write Process
- Create timestamped directory: output_YYYYMMDD_HHMMSS/
- Write all outputs:
  - For each BUILT item:
    - Create parent directories as needed
    - Write file contents
  - For assets: copy bytes preserving modification time
  - For content/index: write HTML as UTF-8 text
  - All writes go to the new timestamped directory, not the live output
- Fsync all files: Ensure writes are committed to disk
- Write manifest:
  - Create .ssg_cache/manifest.json.new with:
    - Build timestamp
    - List of all items with their cache_keys and output paths
    - Template hashes used
  - Fsync manifest file
  - Atomically rename manifest.json.new → manifest.json
  - This is the commit point for cache
- Update symlink atomically:
  ln -sfn output_20251028_143022 public.tmp
  mv -T public.tmp public
  - The -n flag keeps ln from following a stale public.tmp symlink into the directory it points to
  - The mv of a symlink is a single rename() call, which is atomic on POSIX systems (the -T flag is specific to GNU mv)
  - Between creation of public.tmp and the final mv, the live site remains on the old build
  - The mv operation switches the site instantly
- Cleanup old outputs:
  - Keep N most recent output directories (configurable, default 2)
  - Delete older timestamped directories
  - This provides instant rollback capability
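The symlink swap step can be sketched in Python with os.symlink and os.replace, which maps to rename() on POSIX. The function name publish and its parameters are hypothetical; the sketch assumes a POSIX filesystem with symlink support.

```python
import os


def publish(project_dir: str, build_dir_name: str, link_name: str = "public") -> None:
    """Point <project_dir>/<link_name> at build_dir_name with an atomic rename."""
    link = os.path.join(project_dir, link_name)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)               # clear a leftover from an interrupted run
    os.symlink(build_dir_name, tmp)  # relative target, resolved next to the link
    os.replace(tmp, link)            # rename(): atomically replaces the old symlink
```

Readers resolving the public path see either the old target or the new one; at no point is the link missing.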
Success and Failure
Success criteria:
- All files written to timestamped directory
- Manifest committed
- Symlink updated
- Old outputs cleaned (if cleanup enabled)
On failure during file writes:
- Delete incomplete timestamped directory
- Leave public symlink unchanged (still points to previous good build)
- No manifest update
- Report error and exit with code 2
On failure during symlink update (rare):
- Attempt to delete timestamped directory
- Leave system with previous build active
- Report error with manual recovery steps
- Exit with code 2
Guarantee: Users never see partially-written output. Site is either old version or new version, never mixed.
Non-Symlink Fallback
On systems without symlink support (Windows without dev mode):
- Write to output.tmp/ directory
- After all writes succeed and manifest committed:
  - Rename output/ → output.old/ (if exists)
  - Rename output.tmp/ → output/
  - Delete output.old/
Limitation: Small time window where output/ doesn’t exist (between renames). Document this as degraded atomicity mode.
Orphan Cleanup
After successful symlink update:
- Load previous manifest
- Compare with new manifest
- Identify files in old manifest but not in new manifest (orphans)
- For each orphan path:
- Verify it’s inside previous output directory
- Delete file
- Delete parent directories if empty
Safety: Only delete from old timestamped directories, never from current output. This is safe because old directories are not served.
Cache Management
Cache Structure
Cache stored in .ssg_cache/ directory:
manifest.json: Authoritative record of last successful build
db.sqlite: Item metadata and dependencies (optional, for advanced queries)
Manifest Format
{
"schema_version": 1,
"build_timestamp": "2025-10-28T14:30:22Z",
"ssg_version": "3.0.0",
"template_hashes": {
"default.html": "sha256:abc...",
"python.html": "sha256:def..."
},
"permalink_template": "{category}/{year}/{month}/{slug}/",
"items": {
"content/python/intro.md": {
"cache_key": "sha256:123...",
"url": "/python/2025/10/intro/",
"out": "output_20251028_143022/python/2025/10/intro/index.html",
"templates_used": ["default.html"]
},
...
}
}
Cache Operations
During BUILD:
- Read manifest to check cache_keys
- On cache hit: load cached HTML and metadata
- On cache miss: mark item for processing
During WRITE:
- After all files written
- Before symlink update
- Write new manifest atomically
- Manifest commit makes cache changes durable
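The write-then-rename pattern for the manifest might be sketched as below. The function name write_manifest is hypothetical, and the cache directory is passed in rather than hard-coded.

```python
import json
import os


def write_manifest(cache_dir: str, manifest: dict) -> None:
    """Write manifest.json so readers see either the old or the new file, never a partial one."""
    os.makedirs(cache_dir, exist_ok=True)
    final = os.path.join(cache_dir, "manifest.json")
    tmp = final + ".new"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk before the rename
    os.replace(tmp, final)    # the commit point for the cache
```

If the process dies before os.replace, the previous manifest.json is still intact and only a stale manifest.json.new is left behind.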
Cache invalidation triggers:
- Source file content changed (content_hash differs)
- Template changed (template_hash differs)
- Metadata changed (different date/slug/category in frontmatter)
- Permalink format changed
- Dependencies changed (for indexes: constituent items changed)
Cold Build vs. Incremental Build
Cold build (no cache):
- Process all items
- Generate all outputs
- Create initial manifest
Incremental build (cache exists):
- SCAN: detect all files as before
- BUILD Phase 1:
- Compute cache_key for each item
- Compare with manifest
- Process only items with changed keys
- Load cached HTML for unchanged items
- BUILD Phase 2:
- Check index cache keys
- Rebuild only indexes whose constituent items changed
- WRITE: Commit new manifest and swap outputs
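The comparison in BUILD Phase 1 amounts to partitioning items by whether their freshly computed key matches the manifest. A sketch, assuming both sides are keyed by source path as in the manifest format; the function name is hypothetical.

```python
def partition_items(current_keys: dict, manifest_items: dict):
    """current_keys: src -> cache_key; manifest_items: src -> {"cache_key": ...}.

    Returns (changed, unchanged, removed) lists of source paths.
    """
    changed, unchanged = [], []
    for src, key in current_keys.items():
        previous = manifest_items.get(src, {}).get("cache_key")
        (unchanged if previous == key else changed).append(src)
    # Sources recorded in the manifest but no longer present on disk.
    removed = [src for src in manifest_items if src not in current_keys]
    return changed, unchanged, removed


changed, unchanged, removed = partition_items(
    {"a.md": "k1", "b.md": "k2-new", "c.md": "k3"},
    {"a.md": {"cache_key": "k1"}, "b.md": {"cache_key": "k2"}, "old.md": {"cache_key": "k9"}},
)
```

New files land in the changed list because they have no manifest entry, so a cold build is just the case where everything is changed.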
Performance target: For 1000-post site with 1 file changed, incremental build should complete in under 5 seconds.
Metadata Precedence Model
Three Sources, Clear Rules
Metadata comes from three sources with strict precedence:
- System Defaults (lowest priority)
- Path-Derived (medium priority)
- Frontmatter (highest priority)
Example Walkthrough
File: content/tutorials/python/intro.md
System defaults provide:
slug: "intro" (from filename)
category: "python" (from parent directory)
date: 2025-10-28 (from file mtime)
Frontmatter overrides:
---
title: Introduction to Python
category: programming-basics
date: 2024-06-15
---
Final resolved metadata:
slug: "intro" (default, not overridden)
category: "programming-basics" (frontmatter wins)
date: 2024-06-15 (frontmatter wins)
title: "Introduction to Python" (from frontmatter)
Resulting URL: /programming-basics/2024/06/intro/
Stability Guarantees
With frontmatter category:
- File can move to any directory
- URL remains stable (frontmatter category used)
- Useful for reorganizing source without breaking links
Without frontmatter category:
- URL tracks directory structure
- Moving file changes URL
- Useful for organizing by category via folders
User choice: Explicit frontmatter for stable URLs, directory structure for convenience.
Key Design Decisions
1. Why Three Stages?
Separation of concerns:
- SCAN: Pure discovery, no side effects, fast
- BUILD: All computation and validation, memory-only
- WRITE: Pure commit, no decisions
Benefits:
- Easy to test each stage in isolation
- Clear failure boundaries
- Can abort before any writes
- Can preview build output before committing
2. Why Symlink-Based Atomic Swap?
Problem: Directory rename atomicity varies by filesystem and platform.
Solution: Replacing a symlink with rename() is atomic on all POSIX systems (single syscall).
Tradeoff: Requires symlink support, but this is standard on Linux/macOS and modern Windows.
Result: True atomic site publish with zero-downtime deployments.
3. Why Two-Phase BUILD?
Problem: Index cache keys need content item cache keys, but indexes are also items.
Solution: Process content first (compute cache keys), then generate indexes (using those keys).
Benefit: Clear dependency ordering, no circular references.
4. Why Template Dependency Tracking?
Problem: Changing a shared partial should rebuild all pages that use it.
Without tracking: Must rebuild entire site on any template change.
With tracking: Rebuild only affected pages, preserving incremental build performance.
Implementation cost: Parse templates once at load, compute hashes with includes. Worth the complexity for speed.
5. Why Reject Circular Template Dependencies?
Alternative: Allow cycles, break arbitrarily during hash computation.
Problem: Non-deterministic behavior, unpredictable cache invalidation.
Decision: Reject cycles explicitly. Templates should compose in directed acyclic graph. Cycles indicate design error.
Benefit: Clear, predictable behavior. Easy to debug.
6. Why Include Pagination Context in Index Cache?
Problem: Adding or removing a post can change the pagination structure (total pages, prev/next links) without changing the items listed on a specific page.
Example: A site has 20 posts on 2 pages. Adding a backdated post that sorts after all existing ones creates page 3. Pages 1 and 2 list the same items as before, but page 2 now needs a "next" link and both pages must show the new total page count.
Solution: Include total counts and page boundaries in cache key. Any change to pagination structure invalidates all affected index pages.
Tradeoff: Adding one post invalidates multiple index pages. Acceptable because indexes are cheap to render compared to content.
Error Handling Philosophy
Fail Fast, Fail Explicitly
- Detect errors as early as possible
- Never write partial output
- Accumulate all errors before reporting
- Provide actionable error messages with suggestions
Error Categories
Configuration errors (fail in SCAN):
- Invalid config file
- Missing required directories
- Malformed blacklist patterns
Content errors (fail in BUILD Phase 1):
- Invalid frontmatter YAML
- Unparseable dates
- Missing required metadata
- Invalid UTF-8 encoding
Template errors (fail in BUILD Phase 1):
- Missing template files
- Circular template dependencies
- Template syntax errors
Collision errors (fail at end of BUILD, after Phase 2):
- Multiple items generating same URL
- Index URL conflicting with content URL
Write errors (fail in WRITE):
- Disk full
- Permission denied
- Filesystem errors
Recovery Patterns
Before WRITE: No recovery needed, just abort. Previous build still intact.
During WRITE: Partial writes cleaned up, symlink left pointing to previous build. User can fix issue and retry.
After WRITE: Success. Any issues in cleanup (old output deletion) logged but don’t cause failure.
Performance Model
Bottlenecks and Optimizations
SCAN: I/O bound (filesystem traversal)
- Optimization: Filter early, prune ignored directories immediately
- Target: 10,000 files scanned in <2 seconds
BUILD Phase 1: CPU bound (Markdown parsing, template rendering)
- Optimization: Cache aggressively, process only changed items
- Target: 1 changed file in 1000-file site processes in <1 second
BUILD Phase 2: Memory bound (holding all items for indexing)
- Optimization: None needed for reasonable site sizes
- Target: Support sites with 10,000+ posts in <2GB memory
WRITE: I/O bound (writing files)
- Optimization: Batch directory creation, optional parallel writes
- Target: Write 1000 files in <3 seconds on SSD
Scaling Characteristics
Cold build: O(n) in number of source files
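Incremental build: O(n + m) where m = changed files, n = total files (every file is still scanned and has its cache key checked; only the m changed files are fully processed)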
Index rebuild: O(k) where k = number of posts on changed index pages
Memory usage: O(n) where n = number of posts (holds HTML in memory)
Practical Limits
- Recommended: Up to 5,000 posts (typical site)
- Tested: Up to 10,000 posts
- Theoretical: Limited by available memory (roughly 5MB per 1000 posts for HTML storage)
Summary of Core Concepts
- Three-Stage Pipeline: Clear boundaries prevent partial builds
- Two-Phase BUILD: Content first, then indexes (resolves dependencies)
- Cache Keys: Deterministic hashes capture all dependencies
- Metadata Precedence: System < Path < Frontmatter (users control URL stability)
- Template Tracking: Transitive hash includes all partials, enables selective rebuilds
- Symlink Atomicity: True atomic publish with rollback capability
- URL Normalization: Consistent collision detection
- Fail-Before-Write: All validation in BUILD; WRITE either commits fully or leaves the previous build untouched
- Collision Detection: Explicit checks prevent silent overwrites
- Incremental Correctness: Cache invalidation based on complete dependency graph
These concepts compose to create a static site generator that is fast, correct, and predictable.