# Conversion Policy
This document records implementation decisions for the PDF-to-Markdown conversion engine. It is planning guidance, not implementation code.
## Input Classification
- Support mixed PDFs by default: text-layer pages, scanned pages, and mixed pages can appear in the same document.
- Use PyMuPDF or equivalent lightweight page analysis before heavy parsing to estimate text-layer quality per page.
- Decide OCR intervention per page instead of treating the entire PDF as text-only or scan-only.
- Prefer Marker's OCR/layout functionality for scanned or weak text-layer pages.
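
The per-page OCR decision above can be sketched as a small pure function. This is a minimal illustration, not the engine's implementation: `PageStats`, `needs_ocr`, `plan_ocr`, and every threshold are hypothetical names and values chosen for the example. In practice the statistics could be gathered with PyMuPDF or an equivalent tool, for example from the length of the text returned by `page.get_text()`.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Lightweight per-page measurements (hypothetical structure)."""
    char_count: int         # characters found in the text layer
    printable_ratio: float  # share of text-layer characters that are printable
    image_coverage: float   # fraction of the page area covered by images, 0..1


def needs_ocr(stats, min_chars=200, min_printable=0.9, min_scan_coverage=0.5):
    """Return True when a page looks scanned or has a weak text layer.

    All thresholds are assumptions and should be tuned on sample PDFs.
    """
    # Scanned page: almost no text, but a large image fills the page.
    if stats.char_count < min_chars and stats.image_coverage >= min_scan_coverage:
        return True
    # Weak text layer: text exists but much of it is unprintable debris.
    if stats.char_count > 0 and stats.printable_ratio < min_printable:
        return True
    return False


def plan_ocr(pages):
    """Return the 0-based indices of pages that should go through OCR."""
    return [index for index, stats in enumerate(pages) if needs_ocr(stats)]
```

Because the decision is made page by page, a mixed PDF only pays the OCR cost for the pages that need it.
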
## Parser Responsibilities
- Marker owns overall layout tracking, reading order, body extraction, table structure, image extraction, headings, captions, and semantic block roles.
- Nougat owns only mathematical expressions and formula block parsing.
- Do not use Nougat as the main document parser.
- Send a block to Nougat when Marker identifies it as an equation area or when text-pattern detection marks it as mathematical content.
- If Nougat conversion fails, preserve information by falling back to Marker's extracted source text.
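
The routing and fallback rules above might look like the following sketch. It does not call Marker or Nougat: `Block`, `MATH_HINT`, `looks_mathematical`, and `convert_block` are hypothetical names, the role label `"equation"` is an assumed value, and `math_parser` stands in for whatever wrapper ends up calling Nougat.

```python
import re
from dataclasses import dataclass

# Assumed text patterns that hint at mathematical content.
MATH_HINT = re.compile(r"[∑∫√≤≥±]|\\(?:frac|sum|int)\b|[\^_]\{")


@dataclass
class Block:
    role: str  # hypothetical role label assigned by the layout parser
    text: str  # source text extracted by the layout parser


def looks_mathematical(block):
    """True for equation-role blocks or text that matches a math pattern."""
    return block.role == "equation" or bool(MATH_HINT.search(block.text))


def convert_block(block, math_parser):
    """Send math blocks to `math_parser`; keep the source text otherwise.

    `math_parser` is any callable taking a Block and returning Markdown.
    """
    if not looks_mathematical(block):
        return block.text
    try:
        return math_parser(block)
    except Exception:
        # Broad catch on purpose in this sketch: any parser failure
        # falls back to the extracted text so no information is lost.
        return block.text
```

Keeping the formula parser behind a plain callable also makes the fallback path easy to test without loading any model.
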
## Formula Handling
- Treat formulas embedded inside a sentence without independent line spacing as inline formulas.
- Treat formulas that occupy their own line or are set off by vertical whitespace as block formulas.
- Preserve formula numbers detected near the right or bottom side of a formula region.
- Attach anchors to extracted formula numbers and rewrite body references such as `Eq. (3)` or `식 (5)` as internal Markdown links when confidence is sufficient.
- Validate Markdown math delimiters by counting opening and closing `$ ... $` and `$$ ... $$` pairs across each chunk.
- Validate common LaTeX environments by checking matching `\begin{...}` and `\end{...}` names and counts.
- If delimiter or environment validation fails, repair the output at the nearest logical location so that Markdown rendering stays intact.
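
The two validation rules above can be sketched with the standard library only. The function names are hypothetical, and the delimiter check is deliberately simple: it ignores `\$` escapes but does not skip code spans or fenced code, which a real implementation would probably need to do.

```python
import re

ENV_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def math_delimiters_balanced(text):
    """Check that `$$` and `$` delimiters each occur an even number of times."""
    unescaped = re.sub(r"\\\$", "", text)        # drop escaped dollar signs
    display_count = unescaped.count("$$")        # block delimiters
    inline_count = unescaped.replace("$$", "").count("$")  # remaining inline ones
    return display_count % 2 == 0 and inline_count % 2 == 0


def environment_problems(text):
    """Return a list of mismatched or unclosed LaTeX environments."""
    stack = []
    problems = []
    for match in ENV_RE.finditer(text):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()                          # properly nested \end
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed \\begin{{{name}}}" for name in reversed(stack))
    return problems
```

An empty list from `environment_problems` together with `True` from `math_delimiters_balanced` means the chunk passes both checks; anything else can be handed to the repair step.
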
## Tables
- Prefer Markdown tables when structure can be represented without major loss.
- Use limited HTML `<table>` output for tables with merged cells, multi-row headers, or structures that exceed GitHub Flavored Markdown table expressiveness.
- Preserve table footnotes as regular text immediately below the table.
- Preserve top or bottom captions as text and create internal links from body references such as `Table 1`.
- If structured table extraction loses too much information, also save a screenshot of the table region as a fallback asset and link it near the structured output.
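
The choice between a Markdown table and HTML output could be made as in the sketch below. `Cell`, `needs_html`, and `to_markdown` are hypothetical names, and the span-based cell model is an assumption about what the table extractor provides.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1


def needs_html(rows, header_rows=1):
    """True when the table exceeds what a GFM table can express."""
    merged = any(c.rowspan > 1 or c.colspan > 1 for row in rows for c in row)
    return merged or header_rows > 1


def to_markdown(rows):
    """Render a simple table (first row is the header) as a GFM table."""
    def line(cells):
        # Escape pipes so cell text cannot break the column structure.
        return "| " + " | ".join(c.text.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(row) for row in body)])
```

A caller would check `needs_html` first and only fall through to `to_markdown` for tables with single-row headers and no merged cells.
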
## Figures And Images
- Use deterministic image asset naming such as `{document-slug}_fig-{figure-number}.png` when a figure number is available.
- Include chunk/page/block identifiers in names or anchors when needed to avoid collisions.
- Place extracted image assets in the document `images/` directory.
- Add figure captions below Markdown image links.
- Rewrite body references such as `Fig. 2` to internal Markdown links when the figure target can be identified.
- Deduplicate extracted images by hash and let repeated references share one asset and anchor.
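
A minimal sketch of the naming and deduplication rules above follows. The figure-number pattern comes from this section; the page/block fallback pattern, `figure_asset_name`, and `AssetRegistry` are assumptions made for the example.

```python
import hashlib


def figure_asset_name(slug, figure_number=None, page=None, block=None):
    """Build a deterministic asset filename.

    Uses the figure number when known; otherwise falls back to page and
    block identifiers (the fallback pattern is an assumption).
    """
    if figure_number is not None:
        return f"{slug}_fig-{figure_number}.png"
    return f"{slug}_p{page:04d}-b{block:03d}.png"


class AssetRegistry:
    """Map image content hashes to the first asset name registered for them."""

    def __init__(self):
        self._by_hash = {}

    def register(self, image_bytes, name):
        """Return the asset name to link; repeated images share one name."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        return self._by_hash.setdefault(digest, name)
```

Because the registry keeps the first name it sees for a hash, processing pages in a fixed order keeps the shared asset name stable across runs.
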
## Reading Order And Paragraph Flow
- Stitch lines into paragraphs when a line does not end with terminal punctuation and the next line begins like a continuation, or when bounding-box line spacing matches intra-paragraph spacing.
- Join hyphenated line breaks when a line-ending hyphen is followed by a lowercase continuation without whitespace.
- When join confidence is low, preserve the hyphen, particularly for known compounds, identifiers, and proper nouns.
- Use Marker bounding boxes to validate that the linearized text flow matches expected reading order in sample PDFs.
- Detect repeated header/footer/page-number patterns in stable top/bottom page regions and exclude them from body Markdown, or separate them from the main body flow.
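
The text-based part of the stitching and hyphenation rules above can be sketched as follows. It leaves out the bounding-box spacing check and header/footer removal, and `stitch_lines`, `TERMINAL`, and the `known_compounds` set are hypothetical.

```python
TERMINAL = (".", "!", "?")


def stitch_lines(lines, known_compounds=frozenset()):
    """Merge extracted lines into paragraphs using text cues only."""
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Line-ending hyphen followed by a lowercase continuation.
            candidate = (current.split()[-1] + line.split()[0]).strip(".,;:").lower()
            if candidate in known_compounds:
                current += line                 # keep the hyphen
            else:
                current = current[:-1] + line   # join the broken word
        elif not current.endswith(TERMINAL) and line[0].islower():
            current += " " + line               # sentence continues
        else:
            paragraphs.append(current)          # paragraph boundary
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs
```

For example, with `known_compounds={"thin-walled"}` the lines `"The defor-"`, `"mation of a thin-"`, `"walled tube grows with"`, `"the applied load."` become one paragraph that reads "The deformation of a thin-walled tube grows with the applied load."
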
## Chunking
- Use 20 pages as the default chunk target.
- Prefer logical block boundaries over strict page boundaries when a paragraph, formula, table, or figure would be cut in the middle.
- If a block crosses a chunk boundary, keep the block intact by moving it whole into the previous or next chunk, whichever boundary does the least damage.
- Add minimal context at the top of each chunk, including document title, page range, and chunk number.
- Avoid sidecar metadata by default; put only core metadata in concise Markdown frontmatter.
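
One way to honor the boundary rules above is to forbid any cut that would split a block and then pick the allowed cut nearest to the 20-page target. The sketch below works at page granularity only, which is a simplification; `plan_chunks` and the `(first_page, last_page)` span format are assumptions.

```python
def plan_chunks(page_count, block_spans, target=20):
    """Return inclusive, 1-indexed (first_page, last_page) chunk ranges.

    `block_spans` lists (first_page, last_page) for blocks that must stay
    whole, such as a table continuing onto the next page.
    """
    def cut_allowed(page):
        # A cut after `page` is forbidden if any block starts on or
        # before it and ends after it. The last page is always allowed.
        if page == page_count:
            return True
        return not any(first <= page < last for first, last in block_spans)

    chunks = []
    start = 1
    while start <= page_count:
        ideal = min(start + target - 1, page_count)
        candidates = [p for p in range(start, page_count + 1) if cut_allowed(p)]
        # Nearest allowed cut to the ideal one; ties go to the earlier page.
        end = min(candidates, key=lambda p: (abs(p - ideal), p))
        chunks.append((start, end))
        start = end + 1
    return chunks
```

With 45 pages and one block spanning pages 20 to 21, the cut after page 20 is forbidden, so the first chunk ends at page 19 and the block moves whole into the second chunk.
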
## Determinism And Paths
- Ensure the same PDF and same options produce stable output structure and filenames.
- Use deterministic slug, anchor, asset, and chunk naming rules.
- Prefer `pathlib` for filesystem paths.
- Test Korean filenames, paths with spaces, and long Windows paths.
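
Deterministic slug and chunk naming with `pathlib` might look like this sketch. The exact slug rules, the `_chunk-NNN.md` pattern, and both function names are assumptions, not decisions recorded in this document.

```python
import re
import unicodedata
from pathlib import Path


def document_slug(pdf_path):
    """Derive a stable slug from the PDF filename.

    NFC normalization keeps Korean filenames identical across platforms
    that store composed or decomposed Hangul.
    """
    stem = unicodedata.normalize("NFC", Path(pdf_path).stem)
    # Collapse every run of non-word characters (spaces, dots) to one dash.
    slug = re.sub(r"\W+", "-", stem).strip("-").lower()
    return slug or "document"


def chunk_path(output_dir, slug, chunk_number):
    """Return the output path for a chunk; zero padding keeps sort order."""
    return Path(output_dir) / f"{slug}_chunk-{chunk_number:03d}.md"
```

Since both functions depend only on their arguments, the same PDF name and chunk number always yield the same output path.
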
## Runtime And Recovery
- Use conservative batch sizes, usually 1 or 2, on the GTX 1070 Ti (8 GB VRAM).
- If a GPU out-of-memory error occurs, retry with a smaller batch or smaller page unit where possible.
- If the user explicitly requests `--device cuda` or `--runtime cuda`, fail fast instead of silently switching to CPU.
- If the user requests `--runtime auto`, warn and fall back to CPU when CUDA initialization fails.
- Keep model cache locations explicit, preferably under a local project or user-configured model cache directory, so offline operation can reuse already-downloaded weights.
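
The device and out-of-memory rules above can be sketched without importing any GPU library. `resolve_device` and `run_with_oom_retry` are hypothetical names. The caller is assumed to pass in whether CUDA initialized successfully and which exception type signals an out-of-memory condition; the builtin `MemoryError` is only a placeholder here.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_device(requested, cuda_available):
    """Apply the fail-fast and fallback rules for the runtime option."""
    if requested == "cuda":
        if not cuda_available:
            # Explicit request: never switch to CPU silently.
            raise RuntimeError("CUDA was requested explicitly but is not available")
        return "cuda"
    if requested == "auto":
        if cuda_available:
            return "cuda"
        logger.warning("CUDA initialization failed; falling back to CPU")
        return "cpu"
    if requested == "cpu":
        return "cpu"
    raise ValueError(f"unknown runtime: {requested!r}")


def run_with_oom_retry(convert, pages, batch_size=2, oom_error=MemoryError):
    """Call `convert(pages, batch_size)`, halving the batch on OOM errors."""
    while True:
        try:
            return convert(pages, batch_size)
        except oom_error:
            if batch_size == 1:
                raise                       # nothing smaller left to try
            batch_size = max(1, batch_size // 2)
            logger.warning("Out of memory; retrying with batch size %d", batch_size)
```

Passing the exception type as a parameter keeps the retry loop independent of the GPU framework that raises it.
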
## Logging And Resume
- Show chunk-level progress and success/failure status in the CLI.
- Print warnings and errors to stderr and a local log file.
- Do not inject warnings or error logs into generated Markdown because they reduce document readability and integrity.
- Support resuming failed conversions by skipping already successful chunks when a local state/cache file is available.
- Sidecar outputs are still out of scope unless explicitly requested; a resume state file is a runtime cache, not part of the document output contract.
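
A resume state file of the kind described above could be as small as a JSON list of finished chunk identifiers. The file layout and the function names below are assumptions for illustration; the file is a runtime cache, not document output.

```python
import json
from pathlib import Path


def load_done(state_path):
    """Return the set of chunk ids recorded as successful."""
    path = Path(state_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8"))["done"])


def mark_done(state_path, chunk_id):
    """Record one successful chunk; sorted output keeps the file stable."""
    done = load_done(state_path) | {chunk_id}
    payload = json.dumps({"done": sorted(done)}, ensure_ascii=False)
    Path(state_path).write_text(payload, encoding="utf-8")


def pending_chunks(all_chunks, state_path):
    """Return the chunks that still need conversion, in original order."""
    done = load_done(state_path)
    return [chunk for chunk in all_chunks if chunk not in done]
```

A conversion run would call `mark_done` after each chunk is written, so a later run with the same state file skips straight to the chunks that failed or never started.
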
## Quality Tests
- Prefer focused assertions over full Markdown snapshots.
- Validate heading structure, formula delimiter balance, LaTeX environment pairs, image links, caption matching, table parseability, and no-exception conversion.
- Use regex and Markdown/HTML parsers where practical instead of ad hoc string checks.
- Maintain a sample metadata mapping file for `samples/` that tags each PDF by traits such as text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
- Use engineering/mechanics PDFs with multi-column layout, formulas, graphs, and tables as the MVP acceptance corpus.
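
Two of the focused assertions listed above, heading structure and image links, can be sketched with regular expressions. The helper names are hypothetical, and the patterns do not skip fenced code blocks, which a fuller check built on a Markdown parser would handle.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})[ \t]+\S", re.MULTILINE)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def heading_level_jumps(markdown):
    """Return (previous, current) level pairs that skip a heading level."""
    levels = [len(m.group(1)) for m in HEADING_RE.finditer(markdown)]
    return [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1]


def missing_images(markdown, existing_assets):
    """Return image link targets that are not among the known assets."""
    return [t for t in IMAGE_RE.findall(markdown) if t not in existing_assets]
```

A test can then assert that both helpers return empty lists for a converted sample, which stays stable when unrelated wording in the output changes.
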
## Licensing
- Current use is personal, which lowers immediate distribution risk.
- If redistribution or commercial use becomes relevant, revisit the implications of Marker's GPL license and the model-weight licenses before packaging.
- Process or service isolation can be considered as a licensing risk-mitigation strategy, but it is not a legal conclusion and should be reviewed before distribution.
## UI Boundary
- Keep the core conversion engine as a Python API/CLI package.
- A future PyQt UI should remain a thin client over the same API and must not duplicate conversion logic.