Ahmed Hamza

Project Retrospectives

When local capture failed: empty shells, lazy pages, and fixture tests

Notes on turning PromptReady's weak page captures into an explicit deep-capture policy, rendered fixtures, diagnostics, and repeatable no-network tests.


Read: 3 min

One of the easiest mistakes in a browser extension is assuming that page HTML is the page.

PromptReady exposed that mistake quickly. The page looked complete in the browser, but the captured material could still be shallow, lazy, or almost empty. The browser had meaning. The source snapshot did not.

The empty shell problem

The failure shape was simple:

<div id="root"></div>

That kind of capture is not useful for Markdown extraction. It is not a formatting problem. No amount of Turndown cleanup can recover content that was never captured.

The same class of bug showed up with lazy sections and app-heavy pages. A normal DOM snapshot might contain headers, placeholders, or navigation while missing the sections a user actually saw.
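
One way to make "almost empty" measurable is to count the text and headings in the captured HTML. The sketch below is not PromptReady's implementation: `measureSnapshot` and `looksLikeEmptyShell` are hypothetical names, the 200-character threshold is an arbitrary assumption, and the regex-based tag stripping is a rough stand-in for measuring the rendered DOM.

```typescript
// Hypothetical helper: rough metrics for a captured HTML string.
// Regex stripping is only an approximation; a real extension would
// more likely measure the live DOM.
interface SnapshotMetrics {
  textLength: number;
  headingCount: number;
}

function measureSnapshot(html: string): SnapshotMetrics {
  // Drop script and style bodies so their source does not count as text.
  const withoutCode = html
    .replace(/<script\b[\s\S]*?<\/script>/gi, " ")
    .replace(/<style\b[\s\S]*?<\/style>/gi, " ");
  // Remove remaining tags and collapse whitespace.
  const text = withoutCode
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  const headingCount = (withoutCode.match(/<h[1-6]\b/gi) ?? []).length;
  return { textLength: text.length, headingCount };
}

// Assumed rule: very little text and no headings counts as a shell.
function looksLikeEmptyShell(html: string, minTextLength = 200): boolean {
  const metrics = measureSnapshot(html);
  return metrics.textLength < minTextLength && metrics.headingCount === 0;
}
```

Run against the `<div id="root"></div>` shell above, `measureSnapshot` reports zero text and zero headings, which is the kind of signal a capture policy can act on.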

The capture policy changed

The fix was to move from “capture whatever is there now” toward an explicit capture policy.

PromptReady’s capture path can wait for the page to settle, scroll enough to trigger lazy content, compare the initial and deeper snapshots, and carry diagnostics about which strategy was used. That does not make every site easy, but it gives the system something better than a blind HTML grab.
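
A sketch of the settle-and-scroll loop, under stated assumptions: `PageDriver` is a hypothetical abstraction over the tab (in a real extension it would wrap content-script calls), and `maxScrolls` and `settleMs` are illustrative policy fields rather than PromptReady's actual settings.

```typescript
// Hypothetical abstraction over the page, so the loop can run against a fake.
interface PageDriver {
  scrollToBottom(): Promise<void>;
  wait(ms: number): Promise<void>;
  textLength(): Promise<number>;
}

interface ScrollPolicy {
  maxScrolls: number; // upper bound on scroll attempts
  settleMs: number; // pause after each scroll so lazy content can load
}

interface ScrollResult {
  scrollCount: number;
  finalTextLength: number;
}

// Scroll until the page stops growing or the scroll budget is spent.
async function scrollUntilStable(
  page: PageDriver,
  policy: ScrollPolicy,
): Promise<ScrollResult> {
  let lastLength = await page.textLength();
  let scrollCount = 0;
  while (scrollCount < policy.maxScrolls) {
    await page.scrollToBottom();
    await page.wait(policy.settleMs);
    scrollCount += 1;
    const current = await page.textLength();
    const grew = current > lastLength;
    lastLength = current;
    if (!grew) break; // nothing new appeared after this scroll
  }
  return { scrollCount, finalTextLength: lastLength };
}
```

Keeping the page behind an interface is a design choice for testability: the same loop can be exercised with a fake driver whose text length grows for a few scrolls and then stops.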

The important part is that deep capture is still a policy, not a magic switch. If the deeper snapshot does not improve text or heading coverage enough, keeping it would add risk. If it does help, the result should say why it was retained.

The decision looks more like a gate than a toggle:

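// Take a second, deeper snapshot according to the capture policy.
const deepCapture = await this.captureDeepSnapshot(policy);

// Relative text growth. If the initial snapshot had no text,
// any text in the deep snapshot counts as a full gain.
const textGainRatio = initialSnapshot.textLength > 0
  ? (deepCapture.snapshot.textLength - initialSnapshot.textLength) /
    initialSnapshot.textLength
  : (deepCapture.snapshot.textLength > 0 ? 1 : 0);
const headingGain =
  deepCapture.snapshot.headingCount - initialSnapshot.headingCount;

// Keep the deep snapshot only if it added text AND cleared
// at least one of the policy thresholds.
const shouldUseDeepSnapshot =
  deepCapture.snapshot.textLength > initialSnapshot.textLength &&
  (textGainRatio >= policy.minTextGainRatio ||
    headingGain >= policy.minHeadingGain);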

The output then records the strategy, scroll count, text lengths, heading delta, and the reason the deep snapshot was retained or rejected. That made weak captures debuggable instead of mysterious.
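
To illustrate how the gate and its diagnostics could fit together, here is a self-contained sketch. The `CaptureDiagnostics` shape, the `reason` strings, and the `decideSnapshot` name are assumptions for illustration; only the gain calculation mirrors the excerpt above.

```typescript
interface SnapshotMetrics {
  textLength: number;
  headingCount: number;
}

interface CapturePolicy {
  minTextGainRatio: number;
  minHeadingGain: number;
}

// Assumed diagnostics shape; field names are illustrative.
interface CaptureDiagnostics {
  strategy: "initial" | "deep";
  scrollCount: number;
  initialTextLength: number;
  deepTextLength: number;
  headingDelta: number;
  reason: string;
}

function decideSnapshot(
  initial: SnapshotMetrics,
  deep: SnapshotMetrics,
  scrollCount: number,
  policy: CapturePolicy,
): CaptureDiagnostics {
  const textGainRatio = initial.textLength > 0
    ? (deep.textLength - initial.textLength) / initial.textLength
    : (deep.textLength > 0 ? 1 : 0);
  const headingGain = deep.headingCount - initial.headingCount;

  let strategy: "initial" | "deep" = "initial";
  let reason: string;
  if (deep.textLength <= initial.textLength) {
    reason = "deep snapshot added no text";
  } else if (textGainRatio >= policy.minTextGainRatio) {
    strategy = "deep";
    reason = `text gain ${textGainRatio.toFixed(2)} met threshold`;
  } else if (headingGain >= policy.minHeadingGain) {
    strategy = "deep";
    reason = `heading gain ${headingGain} met threshold`;
  } else {
    reason = "gains below policy thresholds";
  }

  return {
    strategy,
    scrollCount,
    initialTextLength: initial.textLength,
    deepTextLength: deep.textLength,
    headingDelta: headingGain,
    reason,
  };
}
```

Returning the reason alongside the decision means a rejected deep snapshot leaves the same trail as a retained one.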

Fixtures made the failures concrete

The practical improvement was building a fixture loop.

Instead of relying on live pages during normal tests, PromptReady uses pinned HTML fixtures and focused assertions. A representative fixture can prove that the offline path extracts meaningful text, preserves headings, keeps code fences balanced, and rejects known page clutter.

The tests check both sides of the contract: what must be present in the output and what must be absent.
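
A minimal sketch of such a two-sided check follows. The `FixtureExpectation` shape and the `checkFixture` helper are hypothetical; in practice the Markdown would come from running the extractor on a pinned HTML file rather than from an inline string.

```typescript
// Build the fence marker from single backticks.
const FENCE = "`".repeat(3);

interface FixtureExpectation {
  mustContain: string[]; // content the extractor has to keep
  mustNotContain: string[]; // known page clutter it has to drop
  minTextLength: number; // guards against shell-like output
}

// Returns human-readable failures; an empty list means the fixture passed.
function checkFixture(
  markdown: string,
  expected: FixtureExpectation,
): string[] {
  const failures: string[] = [];

  if (markdown.trim().length < expected.minTextLength) {
    failures.push("output shorter than " + expected.minTextLength + " characters");
  }
  for (const needle of expected.mustContain) {
    if (!markdown.includes(needle)) failures.push("missing: " + needle);
  }
  for (const clutter of expected.mustNotContain) {
    if (markdown.includes(clutter)) failures.push("clutter present: " + clutter);
  }

  // Fence lines must come in open/close pairs.
  const fenceLines = markdown
    .split("\n")
    .filter((line) => line.trim().startsWith(FENCE));
  if (fenceLines.length % 2 !== 0) failures.push("unbalanced code fences");

  return failures;
}
```

Because each failure names the specific expectation that broke, a red test says which part of the contract regressed.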

That is a much better signal than a snapshot test that only says “the output changed.”

Why no-network tests matter

Live capture is useful when refreshing the corpus, but it is a poor default for normal regression testing. Network pages change. They load slowly. They add experiments. They fail for reasons that are unrelated to the extractor.

The normal test loop should be deterministic. That is why pinned fixtures matter. They let me reproduce a failure like “this rendered page became a shell” or “this code fence got glued to prose” without depending on the current state of the website.
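
One way to keep the normal loop deterministic is to make accidental network access fail loudly. The sketch below assumes a runtime that exposes a global `fetch` (modern browsers, Node 18 or later). `withNetworkBlocked` is a hypothetical helper, not part of PromptReady or any test framework, and it only covers `fetch`, not other network APIs.

```typescript
// Hypothetical helper: run a test body with the global fetch replaced
// by a function that throws, then restore the original.
async function withNetworkBlocked<T>(body: () => Promise<T>): Promise<T> {
  const globals = globalThis as unknown as { fetch?: unknown };
  const original = globals.fetch;
  globals.fetch = () => {
    throw new Error("Network access is disabled in fixture tests");
  };
  try {
    return await body();
  } finally {
    globals.fetch = original; // always restore, even when the body throws
  }
}
```

A fixture test wrapped this way fails immediately if any code path reaches for a live page instead of the pinned HTML.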

The tradeoff

Fixtures are less exciting than live website testing. They also need maintenance when the extractor improves or the source corpus changes.

But they make the regression loop honest. A failing fixture points to capture, extraction, Markdown repair, or clutter filtering, instead of sending me off to debug a live page that may have changed for unrelated reasons.

What remains unsolved

There are still site-specific edges. Some social pages can collapse to source metadata plus a heading, even when deep capture is enabled. That points to extraction and selection behavior, not just capture timing.

The useful boundary is honesty: PromptReady is stronger because the local path has rendered capture, diagnostics, and fixtures. It is not correct to claim perfect extraction from every live site.

The next step is more targeted failure classification. If a result is only metadata and a title, the extension should make that state visible and guide debugging toward capture source, extraction choice, or site-specific handling.
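
As a sketch of what that classification could look like, assume the extractor emits Markdown and that source metadata appears as `Key: value` lines. Both are assumptions, as are the `classifyResult` name and the 80-character threshold.

```typescript
type ResultClass = "empty" | "metadata-only" | "content";

// Heuristic only: headings and "Key: value" lines are treated as
// non-body text, so a prose line that starts like "Note: ..." is
// also excluded from the body count.
function classifyResult(markdown: string, minBodyChars = 80): ResultClass {
  const lines = markdown
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return "empty";

  const bodyLines = lines.filter(
    (line) => !line.startsWith("#") && !/^[A-Za-z][\w -]*:\s/.test(line),
  );
  const bodyChars = bodyLines.join(" ").length;
  return bodyChars < minBodyChars ? "metadata-only" : "content";
}
```

A `metadata-only` result could then be surfaced in the UI with a pointer toward the capture diagnostics, rather than being presented as a successful extraction.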