Content Extraction
Pull an existing site's content, media, navigation, and SEO metadata into a structured, downloadable inventory for migrations and redesigns. Works for WordPress and static sites.
- Maturity: GA
- Plan: Pro/Agency
- Access: All users
- Works with: Any Website · Static Sites · WordPress · WordPress API
- Operator step: 2. Produce
What it solves
- How to migrate WordPress content
- How to back up WordPress content for a redesign
- How to preserve SEO during site migration
- WordPress content export beyond XML
- How to extract custom post types from WordPress
- How to get ACF fields via REST API
- How to find logo and contact info during site migration
- How to organize content for a homepage redesign
- How to match team photos to people during WordPress migration
- How to validate social media links belong to the right business
- How to extract business hours from a WordPress site
- WordPress REST API double encoding fix
- How to find all embedded maps and videos during site migration
- How to catalog Street View virtual tour embeds across a WordPress site
- How to extract content from a static website
- How to migrate content from a non-WordPress site
- How to crawl a website for content migration
- Alternative to WordPress XML export for site migration
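One of the questions above concerns double-encoded text coming back from the WordPress REST API, for example `&amp;amp;` where a plain `&` was intended. A minimal sketch of one way to handle it, assuming the fix amounts to unescaping HTML entities until the text stops changing. The function name `decode_entities` and the `max_passes` limit are hypothetical and do not describe AlmaSEO's actual implementation.

```python
import html


def decode_entities(text: str, max_passes: int = 3) -> str:
    """Unescape HTML entities repeatedly until the text stops changing.

    Hypothetical helper: max_passes bounds how many layers of encoding
    are peeled off, so the loop always terminates.
    """
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            # Nothing left to decode.
            break
        text = decoded
    return text


if __name__ == "__main__":
    print(decode_entities("Tom &amp;amp; Jerry"))
    print(decode_entities("Caf&amp;eacute; hours"))
```

Repeated decoding is a blunt tool: content that deliberately shows an entity as literal text (such as a tutorial about HTML escaping) would be over-decoded, so a real pipeline would likely apply it only to fields known to be affected.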
One managed system, not scattered tools
Content Extraction pulls all content from a connected WordPress site via the REST API (posts, pages, media library, categories, tags, authors, navigation menus, and SEO metadata) into a browsable dashboard view and a downloadable JSON file. It is designed for site migrations, redesigns, and content audits. It cleans up page-builder markup (Elementor, Divi, WPBakery), falls back to HTML scraping for SEO metadata when plugins don't expose it to the API, and generates an automated extraction report explaining what was and wasn't captured.
Extraction also covers site settings (title, tagline, logo, favicon), custom post type detection and extraction with ACF/custom field support, widget and sidebar content, homepage identity scraping (phone, email, social links, copyright), and image role categorization (logo, hero, headshot, background, etc.). A content map pre-organizes all content into homepage-ready sections: About, Services, Team, Testimonials, Contact, FAQ, Blog, Portfolio, Pricing, and Careers.
A data quality layer ensures that all text is entity-decoded, social links are validated against the site domain, duplicate media variants are grouped with primary flags, and system author accounts are separated from real people. The Site Brief resolves person photos even when WordPress featured_media is empty by matching image filenames to person names in page content.
Static and non-WordPress sites are supported via HTML crawling. The crawler discovers pages through sitemap.xml parsing or homepage link spidering, scrapes each page for content, metadata, images, links, embeds, and schema.org data, then runs the same intelligence layers (page-type classifier, image role tagging, embeds inventory, content map, site brief). This works for any website with a public URL, provided the site has a parseable HTML structure. Sites with heavy JavaScript rendering (SPAs, React/Angular without SSR) may return minimal content. Crawls are capped at 200 pages, with 0.3s intervals between requests to avoid overloading the target server.
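The sitemap-based page discovery and the crawl limits described above can be illustrated with a short sketch. It assumes a standard sitemap.xml made of `<url><loc>` entries and reuses the 200-page cap and 0.3s interval from the description. The function names `urls_from_sitemap` and `minimum_crawl_seconds` are hypothetical; this is an illustration, not AlmaSEO's actual crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

MAX_PAGES = 200    # page cap taken from the description above
INTERVAL_S = 0.3   # delay between requests taken from the description above


def _local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def urls_from_sitemap(xml_text: str, site_host: str) -> list[str]:
    """Return unique same-host page URLs from sitemap XML, capped at MAX_PAGES."""
    root = ET.fromstring(xml_text)
    seen: set[str] = set()
    urls: list[str] = []
    for url_el in root.iter():
        # Only <url> entries describe pages; sitemap index entries are skipped.
        if _local_name(url_el.tag) != "url":
            continue
        for child in url_el:
            if _local_name(child.tag) != "loc" or not child.text:
                continue
            url = child.text.strip()
            same_host = urlparse(url).netloc.lower() == site_host.lower()
            if same_host and url not in seen:
                seen.add(url)
                urls.append(url)
        if len(urls) >= MAX_PAGES:
            break
    return urls[:MAX_PAGES]


def minimum_crawl_seconds(page_count: int) -> float:
    """Time spent waiting between requests (no wait before the first page)."""
    return max(page_count - 1, 0) * INTERVAL_S
```

Under these assumptions, a full 200-page crawl spends about a minute in delays alone, before any download or parsing time. A sitemap index file (one that lists other sitemaps rather than pages) yields no URLs here, which is the kind of case where falling back to homepage link spidering would apply.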
A step in the operator loop
Content Extraction runs at 2. Produce — part of the Produce & Ship Content workflow. See how the work moves before and after it.
Open the Produce & Ship Content workflow →
Part of Content Production & Publishing
Content Extraction is one capability in the Content Production & Publishing engine — the part of AlmaSEO that gets a client's work found. Explore the rest of the engine and the capabilities it shares data with.
Explore Content Production & Publishing →
See Content Extraction in the bigger picture
Content Extraction is one entry point into the AlmaSEO operating system. Follow it into the workflow it belongs to, step up to the engine that runs it — or see the whole thing on one of your client sites.