2026-05-14 10:16:59 +09:00


AGENTS.md

This file gives implementation instructions for coding agents working in this repository.

Project Mission

Build a local-only PDF-to-Markdown converter for math-heavy digital PDFs. The converter must produce Obsidian-friendly Markdown and preserve enough metadata to debug formulas, reading order, tables, figures, and assets.

Project Guidelines

Behavioral guidelines to reduce common LLM coding mistakes. Merge with project-specific instructions as needed.

Tradeoff: These guidelines bias toward caution over speed. For trivial tasks, use judgment.

1. Think Before Coding

Don't assume. Don't hide confusion. Surface tradeoffs.

Before implementing:

  • State your assumptions explicitly. If uncertain, ask.
  • If multiple interpretations exist, present them - don't pick silently.
  • If a simpler approach exists, say so. Push back when warranted.
  • If something is unclear, stop. Name what's confusing. Ask.

2. Simplicity First

Minimum code that solves the problem. Nothing speculative.

  • No features beyond what was asked.
  • No abstractions for single-use code.
  • No "flexibility" or "configurability" that wasn't requested.
  • No error handling for impossible scenarios.
  • If you write 200 lines and it could be 50, rewrite it.

Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.

3. Surgical Changes

Touch only what you must. Clean up only your own mess.

When editing existing code:

  • Don't "improve" adjacent code, comments, or formatting.
  • Don't refactor things that aren't broken.
  • Match existing style, even if you'd do it differently.
  • If you notice unrelated dead code, mention it - don't delete it.

When your changes create orphans:

  • Remove imports/variables/functions that YOUR changes made unused.
  • Don't remove pre-existing dead code unless asked.

The test: Every changed line should trace directly to the user's request.

4. Goal-Driven Execution

Define success criteria. Loop until verified.

Transform tasks into verifiable goals:

  • "Add validation" -> "Write tests for invalid inputs, then make them pass"
  • "Fix the bug" -> "Write a test that reproduces it, then make it pass"
  • "Refactor X" -> "Ensure tests pass before and after"

For multi-step tasks, state a brief plan:

1. [Step] -> verify: [check]
2. [Step] -> verify: [check]
3. [Step] -> verify: [check]

Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.


These guidelines are working if diffs contain fewer unnecessary changes, overcomplication causes fewer rewrites, and clarifying questions come before implementation rather than after mistakes.

Commands

| Command | Description |
| --- | --- |
| uv run pytest | Run the default fast test suite. |
| uv run pdf2md doctor | Check local Python, uv, MinerU, GPU/PyTorch, model/cache, MathJax, and strict-local setup. |
| uv run pytest tests/test_ui_runner.py | Run focused UI command-resolution and subprocess tests. |
| uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py | Rebuild the thin Windows UI executable. |
| uv run pdf2md convert paper.pdf --out outputs --chunk-pages --gpu auto --mineru-profile auto --strict-local | Optional local conversion smoke; keep generated output ignored. |

Source Documents

  • PLAN.md: shared plan, planned work, open questions, and ownership for agents.
  • PROGRESS.md: completed work, current status, blockers, and next actions for agents.
  • PRD.md: product requirements, user scope, CLI/API requirements, acceptance criteria.
  • ARCHITECTURE.md: system layers, MinerU adapter contract, intermediate representation, metadata schema, and local-only enforcement.
  • docs/KNOWLEDGEBASE.md: research basis and implementation background.
  • docs/V1IMPLEMENTATIONPLAN.md: v1 implementation sequence, sprint contracts, verification gates, and agent ownership.
  • docs/UI_RESEARCH.md: research basis for the implemented minimal Windows UI launcher.
  • docs/WORKARCHIVE.md: archived completed work, historical sprint outcomes, setup results, verification history, and sample conversion evidence.
  • docs/Sprints/*.md: active and historical sprint contracts.
  • docs/superpowers/specs/*.md: design specs created for focused project workflows.
  • docs/superpowers/plans/*.md: executable task plans created from specs, including completed UI folder batch work and abandoned historical plans.
  • .codex/agents/*.toml: project-scoped custom subagent roles.
  • .codex/commands/*.md: reusable project prompt commands.
  • .codex/skills/*/SKILL.md: project-specific Codex skills.
  • .codex/hooks.json and .codex/hooks/*.py: project hook configuration and deterministic hook scripts.

Startup Workflow

At the start of every task:

  • Read PLAN.md and PROGRESS.md before deciding what to do.
  • Read only the other source documents needed for the task.
  • Read docs/WORKARCHIVE.md when the task needs historical completed-work context, previous verification results, or sample conversion evidence.
  • Use .codex/agents, .codex/commands, and .codex/skills when the user explicitly asks for agent delegation, reusable workflows, or specialized project guidance.
  • State the relevant current goal, next action, and blocker if one exists.
  • If PLAN.md and PROGRESS.md conflict, trust PROGRESS.md for what has happened and update PLAN.md when making the next change.

Progress Tracking

Use PLAN.md and PROGRESS.md to coordinate work across agents.

  • Update PLAN.md when planned work, ownership, sequencing, open questions, or decisions change.
  • Update PROGRESS.md after meaningful work, verification, blockers, or next actions change.
  • Move completed work from PROGRESS.md to docs/WORKARCHIVE.md when it is no longer needed for current coordination.
  • Keep entries short and factual.
  • Do not use these files as scratchpads or long research notes.
  • Do not mark work complete until it has been verified.
  • When multiple agents work in parallel, each agent must leave enough context in PROGRESS.md for the next agent to resume without guessing.

Long-Running Harness Workflow

For substantial implementation work, follow the planner/generator/evaluator pattern from Anthropic's long-running harness design article: https://www.anthropic.com/engineering/harness-design-long-running-apps.

Use the harness only when task complexity justifies the overhead. For small documentation edits or narrow fixes, a single agent with focused verification is preferred.

Harness roles:

  • harness-planner-agent: expands a brief request into product context, high-level technical direction, non-goals, risks, and a sequence of small contracts.
  • feature-generator-agent: implements one agreed contract at a time after implementation has been explicitly requested.
  • evaluation-agent: independently reviews proposed contracts and completed work. It must be skeptical, specific, and willing to fail work that is incomplete, stubbed, unverified, or below threshold.

Before each implementation chunk:

  • Write or update a concise sprint contract before code changes start.
  • Include objective, touched surfaces, expected outputs, non-goals, verification steps, hard failure criteria, and handoff fields.
  • Let evaluation-agent review the contract before feature-generator-agent implements it.
  • Avoid over-specifying low-level implementation before the responsible agent has inspected the code.
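
The contract fields listed above can be checked mechanically before review. The following is a minimal sketch, assuming a sprint contract has already been parsed into a plain dict; the field keys and the function name are hypothetical and are not taken from the repository.

```python
# Keys mirror the field list above; the exact spellings are an assumption.
REQUIRED_FIELDS = (
    "objective",
    "touched_surfaces",
    "expected_outputs",
    "non_goals",
    "verification_steps",
    "hard_failure_criteria",
    "handoff",
)


def missing_contract_fields(contract: dict) -> list:
    """Return required fields that are absent or empty, in declaration order."""
    return [name for name in REQUIRED_FIELDS if not contract.get(name)]


# Illustrative draft contract with most fields still unwritten.
draft = {
    "objective": "Normalize display math spacing",
    "verification_steps": ["uv run pytest"],
}
print(missing_contract_fields(draft))
```

An empty result means every field is present; it says nothing about whether the content is good enough, which remains the evaluation-agent's call.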

After each implementation chunk:

  • feature-generator-agent runs a self-check but does not approve its own work.
  • evaluation-agent performs independent checks against the contract and reports actionable findings.
  • If the chunk fails, feed the evaluator's findings back into the next generator pass.
  • Update PROGRESS.md with completed work, checks run, residual risks, and the next concrete action.

When context becomes too large or a task spans sessions, prefer a clean structured handoff over relying only on conversation history. The handoff must include current state, decisions made, files touched, checks run, known failures, and the next action.

Periodically re-evaluate the harness itself. Remove roles, contracts, or checks that are not load-bearing, and add structure only when it improves correctness, scope control, or verification quality.

Fixed Product Decisions

  • Language: Python.
  • Workflow: uv.
  • Interface: CLI plus Python library.
  • Default CLI name: pdf2md.
  • Runtime policy: local-only. Do not add cloud OCR, remote LLM, or external document upload paths.
  • Default output: Obsidian-friendly Markdown.
  • Inline math: $...$.
  • Display math: $$...$$.
  • Conversion engine: MinerU 3.1.0.
  • Hardware target: NVIDIA GPU.
  • Input priority: digital PDFs with text layers.
  • Quality workflow: fully automatic. Log warnings and continue when possible.
  • MinerU execution: direct local mineru CLI only. MinerU 3.1.0 may launch a temporary local mineru-api internally when the CLI runs without --api-url.
  • Output layout: write <out>/<stem>/<stem>_001.md, shared <out>/<stem>/images/, and <out>/<stem>/<stem>_report.md; new conversions do not persist public metadata JSON after Sprint 16.
  • UI folder batch conversion: the UI may convert direct-child PDFs in a selected folder by sequentially invoking existing pdf2md convert commands.
  • v1 use case: personal/research. MinerU and transitive model/package licenses must be documented before redistribution.
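
The output layout and the folder batch rule above can be expressed as pure path planning. This is a sketch under stated assumptions, not the project's actual code: OutputPlan, plan_output, and direct_child_pdfs are hypothetical names, and the _001 suffix is read here as a 1-based, zero-padded chunk index.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class OutputPlan:
    """Hypothetical container for the paths one conversion writes."""

    root: Path
    markdown: Path
    images: Path
    report: Path


def plan_output(pdf: Path, out: Path, chunk: int = 1) -> OutputPlan:
    """Derive the <out>/<stem>/... paths without touching the filesystem."""
    stem = pdf.stem
    root = out / stem
    return OutputPlan(
        root=root,
        # Assumption: "_001" is a 1-based, zero-padded chunk index.
        markdown=root / f"{stem}_{chunk:03d}.md",
        images=root / "images",
        report=root / f"{stem}_report.md",
    )


def direct_child_pdfs(folder: Path) -> list:
    """List PDFs directly inside folder (no recursion), sorted by path."""
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


plan = plan_output(Path("papers/paper.pdf"), Path("outputs"))
print(plan.markdown, plan.images, plan.report)
```

Keeping path planning free of I/O makes the overwrite and layout rules testable without running MinerU.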

Architecture Guidance

Follow ARCHITECTURE.md for implementation structure. Do not duplicate architecture decisions in code comments or docs unless the new text points back to that file.

Key implementation constraints:

  • Keep MinerU-specific objects behind the MinerU adapter.
  • Keep public CLI/library contracts stable and project-owned.
  • Keep Obsidian Markdown normalization separate from MinerU execution.
  • Keep metadata and warning generation structured.
  • Keep quality report generation derived from metadata and local checks.
  • Do not add runtime engine selection in v1.

Local-Only Requirements

Never add runtime dependencies that upload PDFs, page images, or extracted text to remote services. Follow the strict-local enforcement rules in ARCHITECTURE.md.

Allowed in v1: direct mineru CLI execution and the CLI-internal temporary local mineru-api process.

Do not pass --api-url or use remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible inference endpoints in v1.
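
One way to back this rule with code is a guard that inspects the argument list before the mineru process is started. The sketch below is illustrative only: check_local_only is a hypothetical helper, the authoritative enforcement rules live in ARCHITECTURE.md, and the -p/-o arguments in the example call are an assumption about the MinerU CLI that should be checked against the installed version.

```python
# Flags that would redirect MinerU to a remote service (from the rule above).
FORBIDDEN_FLAGS = ("--api-url",)
REMOTE_PREFIXES = ("http://", "https://")


def check_local_only(argv: list) -> None:
    """Raise ValueError if argv contains a forbidden flag or a URL value."""
    for arg in argv:
        # Handle both "--flag value" and "--flag=value" spellings.
        flag, _, value = arg.partition("=")
        if flag in FORBIDDEN_FLAGS:
            raise ValueError(f"strict-local violation: {flag} is not allowed in v1")
        candidate = value if value else arg
        if candidate.lower().startswith(REMOTE_PREFIXES):
            raise ValueError(f"strict-local violation: remote URL in {arg!r}")


# Passes silently: no forbidden flag and no URL argument.
check_local_only(["mineru", "-p", "paper.pdf", "-o", "outputs"])
```

A guard like this is deliberately conservative: it rejects any URL-shaped argument, including loopback addresses, because v1 never needs to pass one.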

CLI Behavior

Follow the CLI requirements in PRD.md. Do not add commands, flags, config files, or runtime engine selection unless the user explicitly asks for them.

Do not overwrite user files unless the requested behavior and --overwrite semantics allow it.

Testing Guidance

Add tests in proportion to behavior risk.

Required early tests:

  • Math delimiter normalization.
  • Display math spacing.
  • Asset path normalization.
  • Metadata schema creation.
  • Warning aggregation.
  • CLI path planning and overwrite behavior.
  • MinerU adapter contract with mocked outputs.
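
As an illustration of the first item, a delimiter-normalization test might look like the sketch below. It assumes, hypothetically, that upstream text can contain \(...\) and \[...\] delimiters; normalize_math_delimiters is an invented name, and the regex ignores edge cases such as LaTeX line-break spacing (\\[2pt]).

```python
import re


def normalize_math_delimiters(text: str) -> str:
    r"""Rewrite \[...\] as $$...$$ and \(...\) as $...$, trimming inner spaces."""
    text = re.sub(
        r"\\\[(.+?)\\\]",
        lambda m: f"$${m.group(1).strip()}$$",
        text,
        flags=re.DOTALL,
    )
    text = re.sub(
        r"\\\((.+?)\\\)",
        lambda m: f"${m.group(1).strip()}$",
        text,
        flags=re.DOTALL,
    )
    return text


def test_inline_math_uses_single_dollars():
    assert normalize_math_delimiters(r"Let \( x^2 \) be given.") == "Let $x^2$ be given."


def test_display_math_uses_double_dollars():
    assert normalize_math_delimiters(r"\[ a + b \]") == "$$a + b$$"


def test_existing_dollar_math_is_untouched():
    assert normalize_math_delimiters("Already $x$ and $$y$$.") == "Already $x$ and $$y$$."
```

The replacement is passed as a function rather than a string so that backslashes inside the formula are never reinterpreted as regex escapes.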

Fixture PDFs should cover:

  • Simple digital PDF.
  • Math-heavy academic PDF.
  • Multi-column paper.
  • Table with formulas.
  • Figure with caption.

Any test that depends on large local models should be optional or marked separately so normal CI/dev checks can run quickly.
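
One possible way to keep model-dependent tests out of the default run is an explicit opt-in switch. The sketch below assumes a hypothetical environment variable, PDF2MD_RUN_MODEL_TESTS; the repository may already use a different mechanism, such as a custom pytest marker.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Hypothetical opt-in variable; not defined anywhere in the project docs.
MODEL_TESTS_ENV = "PDF2MD_RUN_MODEL_TESTS"


def model_tests_enabled(env: Optional[Mapping] = None) -> bool:
    """Return True only when the opt-in variable is set to exactly "1"."""
    env = os.environ if env is None else env
    return env.get(MODEL_TESTS_ENV) == "1"
```

In a pytest module, the result could feed pytest.mark.skipif(not model_tests_enabled(), reason="needs local MinerU models"), so uv run pytest stays fast unless the variable is set.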

Git Workflow

After changing files:

  • Run the smallest useful verification for the change.
  • Check git status --short.
  • Commit the completed change unless the user explicitly asks not to.
  • Do not include unrelated user edits in the commit.
  • Commit rollback requests: verify the target commit and current status first, then use a direct non-interactive reset. Leave untracked generated/local artifacts such as build/, dist/, samples/, and *.spec files untouched unless deletion is explicitly requested.
  • Installed-runtime doctor debugging: test both uv run pdf2md doctor and direct venv execution such as .venv\Scripts\pdf2md.exe doctor; direct execution may not inherit the same PATH behavior as uv run.
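
For the git status --short check and the rule about leaving untracked artifacts alone, it can help to separate tracked changes from untracked paths before committing or resetting. This is a small sketch that parses already-captured output; split_status is a hypothetical helper and the sample paths are invented.

```python
def split_status(short_output: str) -> tuple:
    """Split `git status --short` output into (tracked changes, untracked paths)."""
    tracked, untracked = [], []
    for line in short_output.splitlines():
        if not line.strip():
            continue
        # Short format: two status characters, one space, then the path.
        code, path = line[:2], line[3:]
        (untracked if code == "??" else tracked).append(path)
    return tracked, untracked


# Invented sample output for illustration.
sample = " M src/pdf2md/cli.py\n?? dist/\n?? pdf2md-ui.spec\n"
tracked, untracked = split_status(sample)
print("tracked:", tracked)
print("untracked:", untracked)
```

Only the tracked list is a candidate for the commit; anything in the untracked list stays in place unless the user asks for deletion.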

Documentation Guidance

Keep documentation explicit about:

  • Local-only privacy behavior.
  • NVIDIA GPU expectations.
  • MinerU installation and model downloads.
  • Known limitations of automatic formula reconstruction.
  • Dependency licenses.
  • Obsidian output assumptions.

Do not imply perfect LaTeX conversion. The correct guarantee is best-effort automatic conversion with warnings and provenance.