LLMs produce inconsistent Markdown. A prompt that returns a clean `` ```python `` block on one run might return `` ```Python ``, `` ``` python ``, or a bare `` ``` `` with no language tag on the next. If your pipeline ingests that output and tries to save or execute the code, that inconsistency breaks everything downstream. This article shows how to extract fenced code blocks reliably, validate them, fix common LLM-introduced errors, and integrate the whole thing into a chapter-level pipeline.
## Why LLMs Hallucinate Code Fences
LLMs are trained on diverse Markdown from across the web. Code fences appear as:

- `` ```python `` (correct)
- `` ```Python `` (capitalized language tag)
- `` ``` python `` (space before language tag)
- `` ```py `` (alias)
- `` ``` `` (no language tag)
- Indented four-space blocks (old Markdown style)
- Code inside a `<pre>` tag (if the model pulls from HTML training data)
They also occasionally emit mismatched fence lengths (`` ``` `` vs `` ```` ``) or forget the closing fence entirely, leaving the rest of the document inside the block. Your extractor has to handle all of these.
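To make the failure mode concrete, here is a quick sketch (with made-up snippets) of how a strict, exact-match pattern fares against the variants above:

```python
import re

# Four ways an LLM might emit the same block (hypothetical examples)
variants = [
    "```python\nprint('hi')\n```",   # correct
    "```Python\nprint('hi')\n```",   # capitalized tag
    "``` python\nprint('hi')\n```",  # space before tag
    "```\nprint('hi')\n```",         # no tag
]

# A naive pattern that expects exactly "```python"
naive = re.compile(r"```python\n(.*?)```", re.DOTALL)

hits = sum(1 for v in variants if naive.search(v))
print(hits)  # 1 -- the strict pattern misses three of the four variants
```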
## Regex vs AST: When to Use Each
Use regex for extraction: you want to pull blocks out of arbitrary text without a full Markdown parse. It's fast, handles partial/malformed documents, and runs on raw strings.
Use AST (via Python's `ast.parse()`) for validation of the extracted Python code: it catches syntax errors that regex cannot detect, without executing the code.

Never use regex to validate Python syntax, and never `exec()` untrusted code to check if it runs.
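For instance, a missing colon is invisible to any fence-level regex but trips `ast.parse()` immediately, with no code executed (the snippet is illustrative):

```python
import ast

snippet = "def greet(name)\n    return 'hi ' + name"  # missing colon

try:
    ast.parse(snippet)
    valid = True
except SyntaxError as e:
    valid = False
    print(f"line {e.lineno}: {e.msg}")  # points at the broken def

print(valid)  # False
```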
## Extracting All Fenced Code Blocks
```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeBlock:
    language: str  # normalized to lowercase, e.g. "python"
    source: str    # raw code content
    index: int     # position in document (0-based)


FENCE_PATTERN = re.compile(
    r"^(`{3,}|~{3,})[ \t]*(?P<lang>[a-zA-Z0-9_+-]*)[ \t]*\n"  # opening fence + optional lang
    r"(?P<code>.*?)"                                          # code content (non-greedy)
    r"^\1[ \t]*$",                                            # closing fence (same length)
    re.MULTILINE | re.DOTALL,
)

LANG_ALIASES = {
    "py": "python",
    "python3": "python",
    "js": "javascript",
    "ts": "typescript",
    "sh": "bash",
    "shell": "bash",
    "": "unknown",
}


def extract_code_blocks(markdown: str) -> list[CodeBlock]:
    """Extract all fenced code blocks from a Markdown string."""
    blocks = []
    for i, match in enumerate(FENCE_PATTERN.finditer(markdown)):
        # Lowercase first so "Python" and "PY" hit the alias table too
        raw_lang = match.group("lang").strip().lower()
        lang = LANG_ALIASES.get(raw_lang, raw_lang)
        code = match.group("code")
        blocks.append(CodeBlock(language=lang, source=code, index=i))
    return blocks
```
The key detail: the closing-fence pattern (`^\1`) backreferences the opening fence, so a four-backtick `` ```` `` fence only closes with `` ```` ``, not `` ``` ``. This handles nested or mismatched fences correctly.
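A quick check of that behavior, using a condensed standalone copy of the pattern: the four-backtick fence below contains a three-backtick line that must not close it.

```python
import re

# Condensed copy of FENCE_PATTERN from above
FENCE = re.compile(
    r"^(`{3,}|~{3,})[ \t]*(?P<lang>[a-zA-Z0-9_+-]*)[ \t]*\n"
    r"(?P<code>.*?)"
    r"^\1[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

doc = "````python\nprint('outer')\n```\nstill inside the block\n````\n"
m = FENCE.search(doc)
print(m.group("lang"))        # python
print(repr(m.group("code")))  # the inner ``` line stays inside the block
```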
## Detecting and Fixing Common LLM Code Issues

### Missing Imports
LLMs often use libraries in code examples without importing them. This heuristic scan catches the most common cases:
```python
import ast

COMMON_MISSING_IMPORTS = {
    "Path": "from pathlib import Path",
    "dataclass": "from dataclasses import dataclass, field",
    "Optional": "from typing import Optional",
    "List": "from typing import List",
    "Dict": "from typing import Dict",
    "datetime": "from datetime import datetime",
    "json": "import json",
    "os": "import os",
    "re": "import re",
    "logging": "import logging",
}


def detect_missing_imports(code: str) -> list[str]:
    """Return suggested import statements for symbols used but not imported."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # syntax errors handled separately

    # Collect all names defined by imports in this block
    imported_names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported_names.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported_names.add(alias.asname or alias.name)

    # Collect every Name node used anywhere in the tree
    used_names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

    suggestions = []
    for name, import_stmt in COMMON_MISSING_IMPORTS.items():
        if name in used_names and name not in imported_names:
            suggestions.append(import_stmt)
    return suggestions
```
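The core mechanism in isolation: `ast.walk` surfaces every `Name` node, so a `Path` that is used but never imported shows up directly (the snippet is a made-up example):

```python
import ast

code = "p = Path('notes.txt')\ntext = p.read_text()"

tree = ast.parse(code)
used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
imported = set()  # no Import/ImportFrom nodes in this snippet

print("Path" in used and "Path" not in imported)  # True -> suggest the import
```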
### Wrong Indentation

LLMs sometimes produce mixed tabs and spaces. Python 3 raises `TabError` on this. Normalize before parsing:
```python
def fix_indentation(code: str) -> str:
    """Replace tabs with 4 spaces to avoid TabError on parse."""
    return code.expandtabs(4)
```
### Validate with AST

```python
def validate_python(code: str) -> tuple[bool, Optional[str]]:
    """
    Try to parse code as a Python AST.

    Returns (is_valid, error_message).
    """
    try:
        ast.parse(fix_indentation(code))
        return True, None
    except SyntaxError as e:
        return False, f"SyntaxError at line {e.lineno}: {e.msg}"
    except ValueError as e:
        return False, f"ValueError: {e}"
```
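Usage, with a condensed standalone copy of the validator:

```python
import ast


def validate_python(code: str):
    """Condensed copy of the validator above, for a standalone check."""
    try:
        ast.parse(code.expandtabs(4))
        return True, None
    except SyntaxError as e:
        return False, f"SyntaxError at line {e.lineno}: {e.msg}"


print(validate_python("x = 1"))  # (True, None)

ok, err = validate_python("def f(:\n    pass")
print(ok)   # False
print(err)  # SyntaxError at line 1: ...
```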
## Integration with a Chapter Pipeline
The full extraction step in a chapter-level pipeline: parse the LLM response, validate Python blocks, prepend missing imports, and store code separately from prose.
```python
from pathlib import Path
import json


def process_chapter_output(
    markdown: str,
    chapter_id: str,
    output_dir: Path,
) -> dict:
    """
    Extract code from a chapter's Markdown output.

    Returns a summary dict with validation results.
    """
    blocks = extract_code_blocks(markdown)
    results = {"chapter_id": chapter_id, "blocks": []}

    code_dir = output_dir / "code" / chapter_id
    code_dir.mkdir(parents=True, exist_ok=True)

    for block in blocks:
        result = {
            "index": block.index,
            "language": block.language,
            "valid": None,
            "error": None,
            "missing_imports": [],
            "file": None,
        }
        if block.language == "python":
            fixed = fix_indentation(block.source)
            is_valid, error = validate_python(fixed)
            result["valid"] = is_valid
            result["error"] = error
            if is_valid:
                suggestions = detect_missing_imports(fixed)
                result["missing_imports"] = suggestions
                # Prepend missing imports and save
                final_code = "\n".join(suggestions) + ("\n\n" if suggestions else "") + fixed
                filename = code_dir / f"block_{block.index:02d}.py"
                filename.write_text(final_code)
                result["file"] = str(filename)
        results["blocks"].append(result)

    # Save summary
    summary_path = output_dir / "code" / f"{chapter_id}_summary.json"
    summary_path.write_text(json.dumps(results, indent=2))

    valid_count = sum(1 for b in results["blocks"] if b.get("valid") is True)
    print(f"Chapter {chapter_id}: {len(blocks)} blocks, {valid_count} valid Python")
    return results
```
This gives you a clean separation between prose (the full Markdown file) and code (individually saved .py files with an accompanying JSON summary). When a block fails AST validation, you know exactly which chapter and which block index to fix, without searching through thousands of words of text.
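Downstream, the summary JSON is all you need to triage failures. A sketch with hypothetical data in the shape `process_chapter_output` writes:

```python
# Hypothetical summary contents for one chapter
summary = {
    "chapter_id": "ch01",
    "blocks": [
        {"index": 0, "language": "python", "valid": True, "error": None},
        {"index": 1, "language": "python", "valid": False,
         "error": "SyntaxError at line 3: invalid syntax"},
        {"index": 2, "language": "bash", "valid": None, "error": None},
    ],
}

# Only failed Python blocks need a human (or another LLM pass)
failing = [b["index"] for b in summary["blocks"] if b["valid"] is False]
print(failing)  # [1]
```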
Full pipeline + source code: germy5.gumroad.com/l/xhxkzz — $19.99, 30-day refund.
📋 Free resource: AI Publishing Checklist — 7 steps to ship a technical ebook with Python (free PDF)