Skip to content

add metadata for .mdx, and add explicit wait to reduce chances of not scraping anything on long substack posts#42

Open
angelotc wants to merge 1 commit intotimf34:mainfrom
angelotc:optimizations
Open

add metadata for .mdx, and add explicit wait to reduce chances of not scraping anything on long substack posts#42
angelotc wants to merge 1 commit intotimf34:mainfrom
angelotc:optimizations

Conversation

@angelotc
Copy link
Copy Markdown

Summary

  • Emit YAML frontmatter (title, subtitle, date, author, image) at the top of every scraped .md so the files can drop straight into an MDX-based site. Replaces the previous # title /
    **date** / **Likes:** N header block.
  • Pull author, datePublished (ISO YYYY-MM-DD), and cover image from the page's ld+json — more
    reliable than the old div.meta-EgzBVA lookup and avoids the stray "Date not found" frontmatter.
  • Stop writing "null" posts. Previously, if the page hadn't rendered or the layout didn't match, the
    scraper silently wrote a file with title: "Untitled", date: "Date not found", and an empty body —
    and the os.path.exists cache check meant reruns never retried it.

What changed in substack_scraper.py

  • combine_metadata_and_content now writes YAML frontmatter; escapes embedded quotes in
    title/subtitle/author.
  • extract_post_data now takes a url and, on extraction failure (missing title or empty
    div.available-content), prints a [EXTRACT FAIL] diagnostic and dumps the raw page HTML to
    data/_debug/<writer>/<slug>.html for inspection.
  • scrape_posts skips writing the .md/.html when extraction fails, so reruns keep retrying
    instead of caching a broken file.
  • PremiumSubstackScraper.get_url_soup:
    • Replaces the fixed sleep(2) with a WebDriverWait(..., 20) that returns as soon as
      div.available-content, h1.post-title, h2.paywall-title, or a rate-limit <pre> appears. Timeout
      logs a warning instead of crashing.
    • Detects h2.paywall-title and returns None (mirroring the free scraper), so inaccessible
      premium posts are cleanly skipped instead of producing empty files.

Example frontmatter

---
title: "The Bento Box: Issue 4"
date: "2026-04-03"
author: "Michelle Flores"
image: "https://substackcdn.com/image/fetch/.../bento.png"
---

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant