Conversion Policy

This document records implementation decisions for the PDF-to-Markdown conversion engine. It is planning guidance, not implementation code.

Input Classification

Support mixed PDFs by default: text-layer pages, scanned pages, and mixed pages can appear in the same document.
Use PyMuPDF or equivalent lightweight page analysis before heavy parsing to estimate text-layer quality per page.
Decide OCR intervention per page instead of treating the entire PDF as text-only or scan-only.
Prefer Marker's OCR/layout functionality for scanned or weak text-layer pages.
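
The per-page decision above could be sketched roughly as follows. This is a minimal illustration, not the engine's implementation: the names `classify_page` and `needs_ocr`, the three labels, and every threshold are assumptions. The inputs are expected to come from a lightweight pass, for example the text returned by PyMuPDF's `page.get_text()` plus an estimate of how much of the page is covered by raster images.

```python
def classify_page(text: str, image_coverage: float,
                  min_chars: int = 200, max_garbage_ratio: float = 0.2) -> str:
    """Return 'text', 'weak', or 'scan' for a single page.

    text: text-layer content of the page.
    image_coverage: fraction of the page area covered by raster images (0..1).
    The thresholds are illustrative assumptions, not tuned values.
    """
    stripped = text.strip()
    if not stripped:
        # No text layer at all: treat the page as scanned.
        return "scan"

    # Replacement characters and stray control characters hint at a broken text layer.
    garbage = sum(
        1 for ch in stripped
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\t\r")
    )
    if garbage / len(stripped) > max_garbage_ratio:
        return "weak"

    # A tiny text layer on top of a page-sized image is usually a scan with stray text.
    if len(stripped) < min_chars and image_coverage > 0.5:
        return "weak"

    return "text"


def needs_ocr(label: str) -> bool:
    """OCR is decided per page: only scanned or weak pages are routed to OCR."""
    return label in {"scan", "weak"}
```

Because the decision is made per page, a mixed PDF can send only its scanned or weak pages through OCR while the remaining pages keep their original text layer.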

Marker owns overall layout tracking, reading order, body extraction, table structure, image extraction, headings, captions, and semantic block roles.
Nougat owns only mathematical expressions and formula block parsing.
Do not use Nougat as the main document parser.
Send a block to Nougat when Marker identifies it as an equation area or when text-pattern detection marks it as mathematical content.
If Nougat conversion fails, preserve information by falling back to Marker's extracted source text.
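
The routing and fallback rule above might look like the sketch below. It is written under stated assumptions: the block is a plain dict with hypothetical `role` and `text` keys, `formula_parser` is an injected callable standing in for the Nougat call, and `MATH_PATTERN` is only an illustrative text-pattern detector. None of these names come from Marker or Nougat.

```python
import re
from typing import Callable

# Illustrative pattern only: a few LaTeX commands, sub/superscript braces, math symbols.
MATH_PATTERN = re.compile(r"\\(?:frac|sum|int|sqrt|begin)\b|[=^_]\s*\{|[∑∫√≤≥±]")


def looks_mathematical(text: str) -> bool:
    return bool(MATH_PATTERN.search(text))


def convert_block(block: dict, formula_parser: Callable[[dict], str]) -> str:
    """Send formula blocks to the formula parser; fall back to the source text on failure."""
    text = block.get("text", "")
    is_formula = block.get("role") == "equation" or looks_mathematical(text)
    if not is_formula:
        return text
    try:
        latex = formula_parser(block)
        if latex and latex.strip():
            return latex
    except Exception:
        # Any parser failure falls through to the layout engine's extracted text,
        # so no information is lost.
        pass
    return text
```

Injecting the parser keeps the fallback path testable without loading any model.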

Treat formulas embedded inside a sentence without independent line spacing as inline formulas.
Treat formulas that occupy their own line, set off from the surrounding text by vertical whitespace, as block formulas.
Preserve formula numbers detected near the right or bottom side of a formula region.
Attach anchors to extracted formula numbers and rewrite body references such as Eq. (3) or 식 (5) as internal Markdown links when confidence is sufficient.
Validate Markdown math delimiters by checking that every opening $ or $$ has a matching closing delimiter within each chunk.
Validate common LaTeX environments by checking matching \begin{...} and \end{...} names and counts.
If delimiter or environment validation fails, repair the output at the closest logical location so that Markdown rendering stays intact.
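
The two validation rules above can be approximated with a short check like the one below. This is a sketch under assumptions: the function names are hypothetical, escaped dollars (`\$`) are ignored, and dollar signs inside code spans are not handled, so a real validator would need to skip code first.

```python
import re

ENV_RE = re.compile(r"\\(begin|end)\{([^}]+)\}")


def math_delimiters_balanced(md: str) -> bool:
    """True when $$ ... $$ and $ ... $ delimiters both occur in even numbers."""
    text = md.replace(r"\$", "")                     # escaped dollars are not delimiters
    block_count = text.count("$$")                   # block delimiters first
    inline_count = text.replace("$$", "").count("$")  # then the remaining inline ones
    return block_count % 2 == 0 and inline_count % 2 == 0


def unbalanced_environments(md: str) -> list[str]:
    """Return environment names whose \\begin / \\end do not pair up, in order found."""
    stack: list[str] = []
    problems: list[str] = []
    for match in ENV_RE.finditer(md):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(name)   # \end without a matching open \begin
    problems.extend(stack)          # \begin that was never closed
    return problems
```

An empty list from `unbalanced_environments` together with `True` from `math_delimiters_balanced` means the chunk passes both checks. Anything else marks the chunk for repair.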

Prefer Markdown tables when structure can be represented without major loss.
Use limited HTML <table> output for tables with merged cells, multi-row headers, or structures that exceed GitHub Flavored Markdown table expressiveness.
Preserve table footnotes as regular text immediately below the table.
Preserve top or bottom captions as text and create internal links from body references such as Table 1.
If structured table extraction loses too much information, also save a screenshot of the table region as a fallback asset and link it near the structured output.
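
One way to express the Markdown-versus-HTML choice above is sketched here. The `Cell` dataclass and the functions `needs_html` and `to_markdown` are hypothetical. They assume the table has already been extracted into rows of cells with span information.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1


def needs_html(rows: list[list[Cell]], header_rows: int = 1) -> bool:
    """Merged cells or multi-row headers exceed what a GFM table can express."""
    if header_rows > 1:
        return True
    return any(c.rowspan > 1 or c.colspan > 1 for row in rows for c in row)


def to_markdown(rows: list[list[Cell]]) -> str:
    """Render a simple table (first row is the header) as a GFM table."""
    def fmt(row: list[Cell]) -> str:
        # Pipes inside a cell must be escaped so they are not read as column breaks.
        return "| " + " | ".join(c.text.replace("|", r"\|") for c in row) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])
```

When `needs_html` returns `True`, the caller would emit limited HTML `<table>` output instead. That path is omitted from the sketch.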

Use deterministic image asset naming such as {document-slug}_fig-{figure-number}.png when a figure number is available.
Include chunk/page/block identifiers in names or anchors when needed to avoid collisions.
Place extracted image assets in the document images/ directory.
Add figure captions below Markdown image links.
Rewrite body references such as Fig. 2 to internal Markdown links when the figure target can be identified.
Deduplicate extracted images by hash and let repeated references share one asset and anchor.
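
The naming and deduplication rules above could be combined as in this sketch. The fallback name pattern with page and block numbers, the `AssetRegistry` class, and the choice of SHA-256 are assumptions made for illustration. Only the `{document-slug}_fig-{figure-number}.png` pattern and the `images/` directory come from the rules themselves.

```python
import hashlib
from typing import Optional


def asset_name(slug: str, figure_number: Optional[int] = None,
               page: Optional[int] = None, block: Optional[int] = None) -> str:
    """Deterministic asset name; falls back to page/block identifiers."""
    if figure_number is not None:
        return f"{slug}_fig-{figure_number}.png"
    if page is None or block is None:
        raise ValueError("page and block are required when no figure number is known")
    return f"{slug}_p{page:04d}-b{block:03d}.png"


class AssetRegistry:
    """Maps image content hashes to the first name registered for that content."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def register(self, data: bytes, name: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Repeated content reuses the first asset name instead of creating a new file.
        return self._by_hash.setdefault(digest, name)


def figure_markdown(name: str, caption: str) -> str:
    """Image link into images/ with the caption on its own line below."""
    return f"![{caption}](images/{name})\n\n{caption}"
```

Because the registry returns the first registered name for identical bytes, repeated references to the same image resolve to a single asset and anchor.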

Stitch lines into paragraphs when a line does not end with terminal punctuation and the next line begins like a continuation, or when bounding-box line spacing matches intra-paragraph spacing.
Join hyphenated line breaks when a line-ending hyphen is followed by a lowercase continuation without whitespace.
Preserve the hyphen for known compounds, identifiers, and proper nouns, and whenever confidence in the join is low.
Use Marker bounding boxes to validate that the linearized text flow matches expected reading order in sample PDFs.
Detect repeated header/footer/page-number patterns in stable top/bottom page regions and exclude them from body Markdown, or separate them from the main body flow.
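
The text-only part of the stitching and hyphen rules above might be sketched as follows. It deliberately leaves out the bounding-box spacing signal and the header/footer detection. The terminal-punctuation set and the `KEEP_HYPHEN` allow-list are illustrative assumptions.

```python
import re

TERMINAL = (".", "!", "?", ":", ";")
# Hypothetical allow-list of compounds whose hyphen must survive a line break.
KEEP_HYPHEN = {"finite-element", "well-known"}


def stitch_lines(lines: list[str]) -> list[str]:
    """Merge extracted lines into paragraphs using text-only heuristics."""
    paragraphs: list[str] = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            head = current.rsplit(" ", 1)[-1]        # last word fragment, hyphen included
            tail = re.match(r"\w+", line).group(0)   # first word of the next line
            if (head + tail).lower() in KEEP_HYPHEN:
                current += line                      # known compound: keep the hyphen
            else:
                current = current[:-1] + line        # soft hyphen: join the word
        elif not current.endswith(TERMINAL) and line[0].islower():
            current += " " + line                    # continuation of the same sentence
        else:
            paragraphs.append(current)
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs
```

For example, `["The beam de-", "flection is small", "under load."]` is merged into the single paragraph `"The beam deflection is small under load."`.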

Use 20 pages as the default chunk target.
Prefer logical block boundaries over strict page boundaries when a paragraph, formula, table, or figure would be cut in the middle.
If a block crosses a chunk boundary, keep the block intact by moving it whole into the previous or the next chunk, whichever does less damage to the surrounding content.
Add minimal context at the top of each chunk, including document title, page range, and chunk number.
Avoid sidecar metadata by default; put only core metadata in concise Markdown frontmatter.
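
A simplified version of the chunking and frontmatter rules above is sketched below. It makes a deliberate simplification that should be read as an assumption: a block that crosses a chunk boundary always stays with the chunk in which it starts, rather than being placed by a least-damage comparison. The `Block` dataclass and the frontmatter keys `title`, `chunk`, and `pages` are also hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Block:
    block_id: str
    first_page: int   # 1-based page where the block starts
    last_page: int    # 1-based page where the block ends


def plan_chunks(blocks: list[Block], target_pages: int = 20) -> list[list[Block]]:
    """Group blocks into chunks of about target_pages pages without splitting a block."""
    chunks: dict[int, list[Block]] = {}
    for block in blocks:
        # The chunk is chosen by the start page only, so a block is never cut in the middle.
        index = (block.first_page - 1) // target_pages
        chunks.setdefault(index, []).append(block)
    return [chunks[i] for i in sorted(chunks)]


def chunk_frontmatter(title: str, chunk_number: int, first_page: int, last_page: int) -> str:
    """Concise frontmatter with only the core context for one chunk."""
    return "\n".join([
        "---",
        # json.dumps yields a double-quoted, escaped string, which is also valid YAML.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"chunk: {chunk_number}",
        f"pages: {first_page}-{last_page}",
        "---",
        "",
    ])
```

With the default target, a table that spans pages 20 and 21 stays in the first chunk, so that chunk's page range extends one page past the nominal boundary.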

Ensure the same PDF and same options produce stable output structure and filenames.
Use deterministic slug, anchor, asset, and chunk naming rules.
Prefer pathlib for filesystem paths.
Test Korean filenames, paths with spaces, and long Windows paths.
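
A deterministic slug rule compatible with the points above could look like this sketch. The function name `document_slug`, the 60-character cap, and the `"document"` fallback are assumptions. The cap is only meant to leave headroom for long Windows paths.

```python
import re
import unicodedata
from pathlib import Path


def document_slug(pdf_path: str, max_length: int = 60) -> str:
    """Derive a stable, filesystem-friendly slug from a PDF filename."""
    stem = Path(pdf_path).stem
    # NFC keeps Korean text in one canonical form across operating systems.
    stem = unicodedata.normalize("NFC", stem)
    # \w is Unicode-aware for str patterns, so Hangul survives; runs of other
    # characters (spaces, brackets, dots) collapse to a single hyphen.
    slug = re.sub(r"[^\w]+", "-", stem).strip("-_").lower()
    return slug[:max_length].rstrip("-_") or "document"
```

For example, `document_slug("samples/Beam Theory (v2).pdf")` gives `"beam-theory-v2"`. A Korean filename keeps its Hangul characters, with spaces replaced by hyphens.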

Use conservative batch sizes, usually 1 or 2, on the GTX 1070 Ti (8 GB VRAM).
If a GPU out-of-memory error occurs, retry with a smaller batch or smaller page unit where possible.
If the user explicitly requests --device cuda or --runtime cuda, fail fast instead of silently switching to CPU.
If the user requests --runtime auto, warn and fall back to CPU when CUDA initialization fails.
Keep model cache locations explicit, preferably under a local project or user-configured model cache directory, so offline operation can reuse already-downloaded weights.
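
The runtime selection and out-of-memory retry rules above can be sketched without tying them to a particular framework. The names here are hypothetical: `cuda_ok` is an injected probe (with PyTorch it could wrap `torch.cuda.is_available()`), `is_oom` decides whether an exception counts as out-of-memory, and `MemoryError` is only a stand-in for the framework's own exception.

```python
import sys
from typing import Callable, Sequence


class CudaUnavailableError(RuntimeError):
    """Raised when CUDA was requested explicitly but cannot be used."""


def resolve_device(runtime: str, cuda_ok: Callable[[], bool]) -> str:
    """Map the requested runtime ('cpu', 'cuda', 'auto') to an actual device."""
    if runtime == "cpu":
        return "cpu"
    if runtime not in {"cuda", "auto"}:
        raise ValueError(f"unknown runtime: {runtime!r}")
    available = cuda_ok()
    if runtime == "cuda":
        if not available:
            # Explicit request: fail fast, never switch to CPU silently.
            raise CudaUnavailableError("CUDA was requested but could not be initialized")
        return "cuda"
    if available:
        return "cuda"
    print("warning: CUDA initialization failed, falling back to CPU", file=sys.stderr)
    return "cpu"


def run_with_backoff(items: Sequence, process_batch: Callable[[Sequence], list],
                     batch_size: int = 2,
                     is_oom: Callable[[Exception], bool] = lambda e: isinstance(e, MemoryError)) -> list:
    """Process items in batches, halving the batch size after an out-of-memory error."""
    results: list = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(process_batch(batch))
            i += len(batch)
        except Exception as exc:
            if not is_oom(exc) or batch_size == 1:
                raise                      # not recoverable by shrinking the batch
            batch_size = max(1, batch_size // 2)
    return results
```

Keeping the fail-fast branch separate from the `auto` branch makes the difference between the two behaviours explicit and easy to test.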

Show chunk-level progress and success/failure status in the CLI.
Write warnings and errors to stderr and to a local log file.
Do not inject warnings or error logs into generated Markdown because they reduce document readability and integrity.
Support resuming failed conversions by skipping already successful chunks when a local state/cache file is available.
Sidecar outputs are still out of scope unless explicitly requested; a resume state file is a runtime cache, not part of the document output contract.
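
A resume state file of the kind described above could be as small as the sketch below. The JSON layout (`{"done": [...]}`) and the function names are assumptions. The file is treated purely as a runtime cache, so a missing or corrupt file simply means every chunk is converted again.

```python
import json
from pathlib import Path


def load_done(state_path: Path) -> set[int]:
    """Read the set of already converted chunk indexes; tolerate a missing or corrupt file."""
    if not state_path.exists():
        return set()
    try:
        return set(json.loads(state_path.read_text(encoding="utf-8"))["done"])
    except (ValueError, KeyError, TypeError):
        return set()


def mark_done(state_path: Path, chunk_index: int) -> None:
    """Record one successful chunk, writing through a temp file to avoid half-written state."""
    done = load_done(state_path)
    done.add(chunk_index)
    tmp = state_path.with_name(state_path.name + ".tmp")
    tmp.write_text(json.dumps({"done": sorted(done)}), encoding="utf-8")
    tmp.replace(state_path)


def pending_chunks(total: int, state_path: Path) -> list[int]:
    """Chunk indexes that still need conversion on this run."""
    done = load_done(state_path)
    return [i for i in range(total) if i not in done]
```

Deleting the state file is always safe: it only causes already converted chunks to be processed again.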

Prefer focused assertions over full Markdown snapshots.
Validate heading structure, formula delimiter balance, LaTeX environment pairs, image links, caption matching, table parseability, and no-exception conversion.
Use regex and Markdown/HTML parsers where practical instead of ad hoc string checks.
Maintain a sample metadata mapping file for samples/ that tags each PDF by traits such as text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
Use engineering/mechanics PDFs with multi-column layout, formulas, graphs, and tables as the MVP acceptance corpus.
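
Focused assertions of the kind listed above might start from small helpers such as these. The helper names are hypothetical and the regex-based heading scan is a simplification: it would also count `#` lines inside fenced code blocks, which is one reason a Markdown parser is preferable where practical.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+\S", re.MULTILINE)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def heading_levels(md: str) -> list[int]:
    """Levels of ATX headings in document order."""
    return [len(m.group(1)) for m in HEADING_RE.finditer(md)]


def heading_jumps(md: str) -> list[tuple[int, int]]:
    """Pairs of consecutive headings where the level increases by more than one."""
    levels = heading_levels(md)
    return [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1]


def image_targets(md: str) -> list[str]:
    """Link targets of all Markdown image links."""
    return IMAGE_RE.findall(md)
```

A test can then assert, for instance, that `heading_jumps(output)` is empty and that every entry in `image_targets(output)` starts with `images/`, without comparing the whole Markdown file against a snapshot.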

Current use is personal, which lowers immediate distribution risk.
If redistribution or commercial use becomes relevant, revisit the implications of Marker's GPL license and of the model-weight licenses before packaging.
Process or service isolation can be considered as a licensing risk-mitigation strategy, but it is not a legal conclusion and should be reviewed before distribution.

Keep the core conversion engine as a Python API/CLI package.
A future PyQt UI should remain a thin client over the same API and must not duplicate conversion logic.