Project Retrospectives
Why PromptReady needed an offline Markdown baseline first
How PromptReady moved past generic page cleanup by making local Markdown capture, source metadata, fallback selection, and AI cleanup boundaries explicit.
PromptReady started from a simple product idea: capture a web page and turn it into clean Markdown that is useful for prompts, notes, and export.
The first version of that idea was too optimistic. I assumed the hard part would be the Markdown cleanup after extraction. In practice, the hard part was preserving enough source context before cleanup so that the Markdown had something real to represent.
The first failure
The naive shape is familiar:
- take the current page HTML
- run a readability-style extractor
- convert the selected content to Markdown
- clean up spacing, links, and code blocks
That works on simple pages. It breaks down when the captured HTML contains an app shell, lazily loaded content, duplicated navigation, newsletter blocks, or text that only becomes meaningful after the browser has rendered the page.
The failure was not just ugly Markdown. The worse failure was plausible Markdown that had lost the source. A document can look neat while quietly dropping headings, examples, or the exact technical blocks that made the page valuable.
The baseline became the product boundary
The fix was to treat local capture as a real pipeline instead of a prelude to AI cleanup.
PromptReady now carries source metadata through the path: title, URL, captured time, selection identity, and optional metadata HTML. The local processor builds the Markdown baseline first. That baseline is not a temporary artifact. It is the document the rest of the system has to respect.
That changed the architecture in a useful way. The extension can still return a usable result when AI is not configured. When AI is configured, it gets the offline Markdown as source context instead of being asked to summarize raw HTML from scratch.
The contract became:
rendered page -> local extraction -> canonical Markdown baseline -> optional AI cleanup -> quality gate -> export
Every step after extraction has to respect the baseline. If it cannot, the local result wins.
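The contract can be sketched as a small orchestrator. This is a minimal illustration, not PromptReady's code: `runPipeline`, `PipelineSteps`, `passesGate`, and the warning strings are hypothetical names, and the AI step is modelled as a synchronous function for brevity where a real one would be asynchronous.

```typescript
// Hypothetical sketch of the contract; names are illustrative only.
interface PipelineSteps {
  // Local extraction plus canonicalization: rendered HTML in, Markdown baseline out.
  buildBaseline: (html: string) => string;
  // Optional AI cleanup. It receives the baseline, never the raw HTML.
  cleanupWithAi?: (baseline: string) => string;
  // Quality gate: true when the candidate still respects the baseline.
  passesGate: (baseline: string, candidate: string) => boolean;
}

interface PipelineResult {
  markdown: string;
  source: "local" | "ai";
  warning?: string; // stable code explaining why the local result was used
}

function runPipeline(html: string, steps: PipelineSteps): PipelineResult {
  const baseline = steps.buildBaseline(html);

  // No AI configured: the baseline is already a usable result.
  if (!steps.cleanupWithAi) {
    return { markdown: baseline, source: "local" };
  }

  let candidate: string;
  try {
    candidate = steps.cleanupWithAi(baseline);
  } catch (_err) {
    return { markdown: baseline, source: "local", warning: "ai-cleanup-failed" };
  }

  // If the AI output cannot respect the baseline, the local result wins.
  if (!steps.passesGate(baseline, candidate)) {
    return { markdown: baseline, source: "local", warning: "ai-output-rejected" };
  }
  return { markdown: candidate, source: "ai" };
}
```

In this shape, every exit path except the last one returns the baseline, so a misbehaving AI step can degrade the result only to the local output.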
The browser capture contract also carries diagnostics, so a later failure is not just “bad Markdown”:
export interface CanonicalCapturePayload {
  html: string;
  url: string;
  title: string;
  selectionHash: string;
  isSelection: boolean;
  metadataHtml?: string;
  captureDiagnostics?: {
    strategy: "initial-body-html" | "deep-body-html";
    settleWaitMs: number;
    settleTimedOut?: boolean;
    scrollStepsExecuted: number;
    initialScrollHeight: number;
    finalScrollHeight: number;
    initialTextLength?: number;
    deepTextLength?: number;
    headingCountDelta?: number;
    deepUsedReason?: string;
  };
}
That shape matters because the background pipeline can reject malformed capture payloads before extraction, and the exported result can explain whether it used the initial body or a deeper rendered snapshot.
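One way to reject malformed payloads before extraction is a runtime type guard. The sketch below is an assumption about how such a check could look, not the extension's actual validator: `isCapturePayload` and `describeCapture` are hypothetical names, and the block redeclares a trimmed copy of the payload shape so that it stands alone.

```typescript
// Trimmed copy of the capture shape so this sketch is self-contained.
type CaptureStrategy = "initial-body-html" | "deep-body-html";

interface MinimalCapturePayload {
  html: string;
  url: string;
  title: string;
  selectionHash: string;
  isSelection: boolean;
  captureDiagnostics?: { strategy: CaptureStrategy };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Hypothetical guard: true only when every required field has the right type
// and, if diagnostics are present, the strategy is one of the known values.
function isCapturePayload(value: unknown): value is MinimalCapturePayload {
  if (!isRecord(value)) return false;
  const stringFields = ["html", "url", "title", "selectionHash"];
  for (const field of stringFields) {
    if (typeof value[field] !== "string") return false;
  }
  if (typeof value["isSelection"] !== "boolean") return false;

  const diagnostics = value["captureDiagnostics"];
  if (diagnostics === undefined) return true;
  if (!isRecord(diagnostics)) return false;
  return (
    diagnostics["strategy"] === "initial-body-html" ||
    diagnostics["strategy"] === "deep-body-html"
  );
}

// Turns the recorded strategy into a line an export could display.
function describeCapture(payload: MinimalCapturePayload): string {
  const strategy = payload.captureDiagnostics?.strategy;
  if (strategy === "deep-body-html") return "used deep rendered snapshot";
  if (strategy === "initial-body-html") return "used initial body HTML";
  return "no capture diagnostics recorded";
}
```

Running the guard at the boundary between the content script and the background pipeline means extraction code can rely on the types instead of re-checking each field.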
What the local path has to protect
The local baseline needs to preserve the parts that make a captured page useful:
- headings and their order
- code fences and inline code
- commands, package names, URLs, and config blocks
- source metadata
- enough body text to keep the document from becoming a summary
- the distinction between page content and surrounding clutter
That list is not glamorous, but it is the difference between a note that can be trusted and a note that only looks clean.
One of the useful tests came from a PromptReady landing-page fixture. The test does not just check that processing succeeds. It checks for actual page phrases, headings, pricing copy, and code-fence health, while also rejecting known clutter such as popup ads and irrelevant navigation fragments.
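The shape of such a test can be sketched as two phrase lists: text that must survive and clutter that must not. The `checkFixture` helper and its field names are hypothetical, and any phrases passed to it here are placeholders, since the real fixture's strings are not shown in this post.

```typescript
interface FixtureExpectation {
  mustContain: string[];    // phrases from the real page that must survive
  mustNotContain: string[]; // known clutter that must be gone
}

interface FixtureReport {
  missing: string[]; // expected phrases that were lost
  leaked: string[];  // clutter phrases that survived
  ok: boolean;
}

// Case-insensitive phrase check over the processed Markdown.
function checkFixture(markdown: string, expected: FixtureExpectation): FixtureReport {
  const haystack = markdown.toLowerCase();
  const has = (phrase: string): boolean =>
    haystack.indexOf(phrase.toLowerCase()) !== -1;

  const missing = expected.mustContain.filter((phrase) => !has(phrase));
  const leaked = expected.mustNotContain.filter((phrase) => has(phrase));
  return { missing, leaked, ok: missing.length === 0 && leaked.length === 0 };
}
```

Returning the missing and leaked phrases, rather than a bare boolean, makes a failing test say which part of the page was lost or which clutter got through.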
Why this stayed offline-first
AI can be useful for cleanup, but it is not a good source of truth for extraction. If the AI step is allowed to invent the structure from weak HTML, it may produce something readable while dropping the exact details I wanted to preserve.
So the safer boundary is:
- Capture rendered source from the browser.
- Build canonical Markdown locally.
- Pass that Markdown baseline into the AI prompt only as an optional cleanup input.
- Reject AI output that loses too much of the baseline.
- Fall back to the local result with a stable warning.
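The rejection step could compare structure rather than wording. The sketch below is one possible heuristic, not the gate PromptReady ships: the helper names, the reason codes, and the 0.7 length ratio are hypothetical, and it assumes ATX-style headings and triple-backtick fences.

```typescript
interface GateVerdict {
  accepted: boolean;
  reasons: string[]; // stable reason codes, suitable for a warning
}

// Collect headings and fenced code bodies, ignoring "#" lines inside fences.
function outline(markdown: string): { headings: string[]; codeBlocks: string[] } {
  const headings: string[] = [];
  const codeBlocks: string[] = [];
  let current: string[] | null = null; // lines of the fence being read, if any
  for (const line of markdown.split("\n")) {
    if (/^\s{0,3}`{3}/.test(line)) {
      if (current === null) {
        current = [];
      } else {
        codeBlocks.push(current.join("\n"));
        current = null;
      }
      continue;
    }
    if (current !== null) {
      current.push(line);
      continue;
    }
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) headings.push(match[1].trim());
  }
  return { headings, codeBlocks };
}

// True when every baseline heading appears in the candidate, in the same order.
function keepsHeadingOrder(base: string[], next: string[]): boolean {
  let cursor = 0;
  for (const heading of next) {
    if (cursor < base.length && heading === base[cursor]) cursor += 1;
  }
  return cursor === base.length;
}

function gateAiOutput(baseline: string, candidate: string, minLengthRatio = 0.7): GateVerdict {
  const base = outline(baseline);
  const next = outline(candidate);
  const reasons: string[] = [];
  if (!keepsHeadingOrder(base.headings, next.headings)) {
    reasons.push("heading-dropped-or-reordered");
  }
  // Code must survive verbatim; a reworded command is a different command.
  if (base.codeBlocks.some((code) => next.codeBlocks.indexOf(code) === -1)) {
    reasons.push("code-block-changed");
  }
  if (candidate.length < baseline.length * minLengthRatio) {
    reasons.push("body-too-short");
  }
  return { accepted: reasons.length === 0, reasons };
}
```

The length check is deliberately crude. Its only job here is to catch output that has turned into a summary, and the threshold would need tuning against real captures.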
That keeps PromptReady useful as a browser extension instead of turning it into a thin UI around an AI rewrite.
The tradeoff
This boundary makes the system less magical. PromptReady has to carry metadata, classify fallback paths, and reject AI output that may look cleaner than the local result.
But the product is more honest. A capture tool should preserve source material before it tries to improve the writing.
The remaining boundary
Offline-first does not mean every page is solved. App-heavy pages, lazy sections, and social sites can still produce weak captures. I also do not want to pretend deep capture is a universal answer. It helps when more rendered content appears after scroll and settle waits. It does not fix every extraction problem.
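One way to keep deep capture from being treated as a universal answer is to use it only when it measurably adds content. The sketch below assumes both snapshots have already been measured. The `chooseSnapshot` name, the 20% growth threshold, and the reason strings are hypothetical; the field names echo the diagnostics shape shown earlier.

```typescript
type Strategy = "initial-body-html" | "deep-body-html";

interface SnapshotMeasurements {
  initialTextLength: number;
  deepTextLength: number;
  headingCountDelta: number; // deep heading count minus initial heading count
}

interface SnapshotChoice {
  strategy: Strategy;
  deepUsedReason?: string; // set only when the deep snapshot is chosen
}

function chooseSnapshot(m: SnapshotMeasurements, minGrowth = 0.2): SnapshotChoice {
  // New headings after scrolling suggest new sections were rendered.
  if (m.headingCountDelta > 0) {
    return { strategy: "deep-body-html", deepUsedReason: "more-headings" };
  }

  // Relative text growth, guarding against an empty initial body.
  let growth: number;
  if (m.initialTextLength === 0) {
    growth = m.deepTextLength > 0 ? Infinity : 0;
  } else {
    growth = (m.deepTextLength - m.initialTextLength) / m.initialTextLength;
  }
  if (growth >= minGrowth) {
    return { strategy: "deep-body-html", deepUsedReason: "text-growth" };
  }

  // Otherwise the deep pass added nothing worth the extra risk.
  return { strategy: "initial-body-html" };
}
```

Defaulting to the initial snapshot keeps the simpler path as the norm and makes every use of deep capture carry a recorded reason.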
The next useful work is better diagnostics: when a capture fails, the extension should make it easier to tell whether the problem came from capture, extraction, Markdown canonicalization, or the optional AI cleanup step.
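One way to make the failing stage visible is to tag every failure with the stage that raised it. This is a minimal sketch of that idea, not existing PromptReady code: `runStage`, `StageResult`, and `describeFailure` are hypothetical names, and the stage names mirror the four listed above.

```typescript
type Stage = "capture" | "extraction" | "canonicalization" | "ai-cleanup";

// Either the stage's value, or a failure labelled with the stage that raised it.
type StageResult<T> =
  | { ok: true; value: T }
  | { ok: false; stage: Stage; message: string };

// Runs one stage and converts a thrown error into a labelled failure.
function runStage<T>(stage: Stage, work: () => T): StageResult<T> {
  try {
    return { ok: true, value: work() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, stage, message };
  }
}

// One-line summary an export or log could show; empty when nothing failed.
function describeFailure<T>(result: StageResult<T>): string {
  if (result.ok) return "";
  return "failed at " + result.stage + ": " + result.message;
}
```

A caller would wrap each step, for example `runStage("extraction", () => extract(html))` with a hypothetical `extract` function, and stop at the first result whose `ok` is false.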
That is the engineering lesson I took from this pass: local-first extraction is only credible when failures are observable and protected by tests.