Content Production & Publishing

Content Extraction

Pull an existing site's content, media, navigation, and SEO metadata into a structured, downloadable inventory for migrations and redesigns. Works for WordPress and static sites.

See this in the workflow → Explore Content Production & Publishing

Maturity: GA
Plan: Pro/Agency
Access: All users
Works with: Any Website · Static Sites · WordPress · WordPress API
Operator step: 2. Produce

The problem

What it solves

How to migrate WordPress content
How to back up WordPress content for redesign
How to preserve SEO during site migration
WordPress content export beyond XML
How to extract custom post types from WordPress
How to get ACF fields via REST API
How to find logo and contact info during site migration
How to organize content for a homepage redesign
How to match team photos to people during WordPress migration
How to validate social media links belong to the right business
How to extract business hours from a WordPress site
WordPress REST API double encoding fix
How to find all embedded maps and videos during site migration
How to catalog Street View virtual tour embeds across a WordPress site
How to extract content from a static website
How to migrate content from a non-WordPress site
How to crawl a website for content migration
Alternative to WordPress XML export for site migration

How it works

One managed system, not scattered tools

Content Extraction pulls all content from a connected WordPress site via the REST API (posts, pages, media library, categories, tags, authors, navigation menus, and SEO metadata) into a browsable dashboard view and a downloadable JSON file. It is designed for site migrations, redesigns, and content audits. It includes page-builder markup cleanup (Elementor, Divi, WPBakery), an SEO metadata fallback via HTML scraping when plugins don't expose data to the API, and an automated extraction report explaining what was and wasn't captured.

Extraction also covers site settings (title, tagline, logo, favicon), custom post type detection and extraction with ACF/custom field support, widget and sidebar content, homepage identity scraping (phone, email, social links, copyright), and image role categorization (logo, hero, headshot, background, etc.). A content map pre-organizes all content into homepage-ready sections: About, Services, Team, Testimonials, Contact, FAQ, Blog, Portfolio, Pricing, and Careers.

A data quality layer ensures that all text is entity-decoded, social links are validated against the site domain, duplicate media variants are grouped with primary flags, and system author accounts are separated from real people. The Site Brief resolves person photos even when WordPress featured_media is empty by matching image filenames to person names in page content.

Static and other non-WordPress sites are supported via HTML crawling. The crawler discovers pages through sitemap.xml parsing or homepage link spidering, scrapes each page for content, metadata, images, links, embeds, and schema.org data, then runs the same intelligence layers (page-type classifier, image role tagging, embeds inventory, content map, site brief). This works for any website with a public URL.

Static site extraction depends on the site having a parseable HTML structure. Sites with heavy JavaScript rendering (SPAs, React/Angular without SSR) may return minimal content. Crawls are rate-limited to 200 pages at 0.3-second intervals to avoid overloading the target server.
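To make the static-site path more concrete, here is a minimal Python sketch of sitemap-based page discovery followed by a capped, paced crawl. It is an illustration, not AlmaSEO's implementation: the function names (urls_from_sitemap, crawl) and the injected fetch callable are hypothetical, and reading "200 pages at 0.3s intervals" as a 200-page cap with a 0.3-second pause between requests is an assumption.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

MAX_PAGES = 200      # page cap quoted in the description (interpretation assumed)
DELAY_SECONDS = 0.3  # pause between requests quoted in the description


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}loc' down to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Return page URLs from a <urlset> sitemap, in order, without duplicates.

    Sitemap index files (<sitemapindex>) are not followed in this sketch.
    """
    root = ET.fromstring(sitemap_xml)
    seen = set()
    urls: List[str] = []
    for entry in root:
        if local_name(entry.tag) != "url":
            continue
        for child in entry:
            # Only the direct <loc> child is a page URL; image or video
            # extension elements nested deeper are ignored.
            if local_name(child.tag) == "loc" and child.text:
                url = child.text.strip()
                if url and url not in seen:
                    seen.add(url)
                    urls.append(url)
    return urls


def crawl(
    urls: List[str],
    fetch: Callable[[str], str],
    max_pages: int = MAX_PAGES,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch at most max_pages URLs, pausing `delay` seconds between requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls[:max_pages]):
        if index > 0:
            sleep(delay)  # no pause before the first request
        pages[url] = fetch(url)
    return pages
```

In practice, fetch would wrap an HTTP client call and return the page HTML. Passing fetch and sleep in as parameters keeps the sketch testable without network access or real waiting.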

Where this fits

A step in the operator loop

Content Extraction runs at step 2, Produce, as part of the Produce & Ship Content workflow. See how the work moves before and after it.

Open the Produce & Ship Content workflow →

The engine behind it

Part of Content Production & Publishing

Content Extraction is one capability in the Content Production & Publishing engine — the part of AlmaSEO that gets a client's work found. Explore the rest of the engine and the capabilities it shares data with.

Explore Content Production & Publishing →

See Content Extraction in the bigger picture

Content Extraction is one entry point into the AlmaSEO operating system. Follow it into the workflow it belongs to, step up to the engine that runs it — or see the whole thing on one of your client sites.
