add metadata for .mdx, and add explicit wait to reduce chances of not scraping anything on long substack posts by angelotc · Pull Request #42 · timf34/Substack2Markdown

angelotc · 2026-04-14T05:18:53Z

Summary

- Emit YAML frontmatter (title, subtitle, date, author, image) at the top of every scraped .md so
  the files can drop straight into an MDX-based site. This replaces the previous
  # title / **date** / **Likes:** N header block.
- Pull author, datePublished (ISO YYYY-MM-DD), and cover image from the page's ld+json, which is
  more reliable than the old div.meta-EgzBVA lookup and avoids the stray "Date not found"
  frontmatter.
- Stop writing "null" posts. Previously, if the page hadn't rendered or the layout didn't match,
  the scraper silently wrote a file with title: "Untitled", date: "Date not found", and an empty
  body, and the os.path.exists cache check meant reruns never retried it.
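
To illustrate the ld+json lookup described above, here is a minimal sketch using only the standard library. It is not the code from this PR: the names `LdJsonParser` and `extract_ld_json_metadata` are hypothetical, and it assumes the page exposes a schema.org-style object with `datePublished`, `author`, and `image` keys.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_ld_json:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld_json = False


def _first_field(value, key):
    """Unwrap a list and/or dict down to a single string (assumed schema.org shapes)."""
    if isinstance(value, list):
        value = value[0] if value else None
    if isinstance(value, dict):
        value = value.get(key)
    return value


def extract_ld_json_metadata(html):
    """Return author, date (YYYY-MM-DD), and image from the first usable ld+json block."""
    parser = LdJsonParser()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: try the next one
        if not isinstance(data, dict) or "datePublished" not in data:
            continue
        return {
            "author": _first_field(data.get("author"), "name"),
            # Keep only the date part of an ISO 8601 timestamp.
            "date": str(data["datePublished"])[:10],
            "image": _first_field(data.get("image"), "url"),
        }
    return None  # caller can treat this as "metadata not found"
```

Returning None rather than placeholder strings lets the caller decide whether to skip the post, which fits the goal of no longer writing "Date not found" into the output.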

- combine_metadata_and_content now writes YAML frontmatter and escapes embedded quotes in
  title/subtitle/author.
- extract_post_data now takes a url and, on extraction failure (missing title or empty
  div.available-content), prints a [EXTRACT FAIL] diagnostic and dumps the raw page HTML to
  data/_debug/<writer>/<slug>.html for inspection.
- scrape_posts skips writing the .md/.html when extraction fails, so reruns keep retrying
  instead of caching a broken file.
- PremiumSubstackScraper.get_url_soup:
  - Replaces the fixed sleep(2) with a WebDriverWait(..., 20) that returns as soon as
    div.available-content, h1.post-title, h2.paywall-title, or a rate-limit <pre> appears. A
    timeout logs a warning instead of crashing.
  - Detects h2.paywall-title and returns None (mirroring the free scraper), so inaccessible
    premium posts are cleanly skipped instead of producing empty files.
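
The explicit wait in get_url_soup could look roughly like the sketch below. This is an illustration under assumptions, not the PR's implementation: `READY_SELECTORS` and `wait_for_post` are hypothetical names, and the bare `pre` selector for the rate-limit page is a guess at how that element is matched. Selenium is imported inside the function so the module loads even where Selenium is not installed.

```python
# CSS selectors whose appearance means the page has reached a usable state.
# The first three come from the PR description; "pre" is an assumed selector
# for the rate-limit response.
READY_SELECTORS = (
    "div.available-content",
    "h1.post-title",
    "h2.paywall-title",
    "pre",
)


def wait_for_post(driver, timeout=20):
    """Block until any ready selector matches, or log a warning on timeout.

    Returns True if something matched, False if the wait timed out.
    """
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # A comma-separated selector list matches if any one selector matches.
    combined = ", ".join(READY_SELECTORS)
    try:
        # until() polls the callable and returns on the first truthy result;
        # an empty list from find_elements is falsy, so polling continues.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, combined)
        )
        return True
    except TimeoutException:
        print(f"[WARN] timed out after {timeout}s waiting for post content")
        return False
```

Unlike a fixed sleep, this returns as soon as the page is ready, so short posts are not slowed down and long posts get up to the full timeout.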

---
title: "The Bento Box: Issue 4"
date: "2026-04-03"
author: "Michelle Flores"
image: "https://substackcdn.com/image/fetch/.../bento.png"
---
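
A frontmatter block like the one above could be produced by a helper along these lines. This is a sketch, not the body of combine_metadata_and_content: `yaml_quote` and `build_frontmatter` are hypothetical names, omitting empty fields is an assumption (the example above has no subtitle), and this version escapes every field rather than only title/subtitle/author.

```python
def yaml_quote(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_frontmatter(metadata):
    """Render a flat dict as a YAML frontmatter block, skipping empty fields."""
    lines = ["---"]
    for key in ("title", "subtitle", "date", "author", "image"):
        value = metadata.get(key)
        if value:  # assumption: missing or empty fields are left out entirely
            lines.append(f"{key}: {yaml_quote(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_frontmatter({
        "title": 'The "Bento" Box: Issue 4',
        "date": "2026-04-03",
        "author": "Michelle Flores",
    }))
```

Backslashes are escaped before quotes so that an existing backslash in a title is not mistaken for part of an escape sequence in the double-quoted YAML string.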

add metadata for mdx, and add explicit wait

6c759bf