E-commerce SEO Is Plumbing With a Cash Register
Faceted navigation, product taxonomy, and the art of not indexing things.
A client calls. They sell 40,000 products. They have 2.3 million URLs in Google's index. Forty thousand products. Two point three million URLs. That's a ratio of 57 indexed URLs per product, which is the kind of number that makes you close your laptop, stare at the ceiling, and contemplate whether becoming a goat farmer in Vermont is still a viable career option.
I ask the client how this happened. The client says they don't know. The client says they hired an agency two years ago and the agency "did SEO" and the traffic went up for a while and then it went down and then it went down more and now it's been going down for eight months and nobody can figure out why. I ask the client to send me their Search Console data and their server logs and a full crawl export. The client sends me everything. I open the crawl export. The file is 1.2 gigabytes. A spreadsheet that is 1.2 gigabytes. My laptop fan sounds like a jet engine preparing for takeoff.
The problem, which takes me about forty minutes to identify and which I then spend the next three hours confirming because the scope of it is so spectacular that I keep thinking I must be misreading the data, is this: the client's faceted navigation (those filter options on category pages - filter by color, filter by size, filter by material, filter by price range, filter by brand, filter by availability) generates a unique URL for every possible combination of filters. Every combination. Color plus size. Color plus material. Color plus size plus material. Color plus size plus material plus price range. Color plus size plus material plus price range plus brand. Color plus size plus material plus price range plus brand plus availability.
For a category with 8 colors, 12 sizes, 6 materials, 5 price ranges, 15 brands, and 2 availability states, the number of possible filter combinations is - and I want you to sit down for this - 8 times 12 times 6 times 5 times 15 times 2, which is 86,400 possible URLs. For one category. The client has 340 categories. Not all categories have the same number of filter options, obviously, but you see where the math goes. It goes to 2.3 million.
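You can check that multiplication in a couple of lines of Python - and the toy script below also shows why the real number is even worse once you account for users leaving facets unselected. The facet counts are the ones from this example category; nothing here is the client's actual code:

```python
from math import prod

# Facet option counts for the example category: color, size,
# material, price range, brand, availability.
facets = [8, 12, 6, 5, 15, 2]

# One selection per facet: every filter applied simultaneously.
full_combinations = prod(facets)  # 8 * 12 * 6 * 5 * 15 * 2

# The real number is worse: each facet can also be left
# unselected, so every facet contributes (options + 1) states.
# Subtract 1 for the unfiltered base category page itself.
with_optional = prod(n + 1 for n in facets) - 1

print(full_combinations)  # 86400 - one category's worth of URLs
print(with_optional)      # 235871 once partial selections count
```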
This is the fundamental e-commerce SEO problem, and I don't mean "one of the fundamental problems." I mean the fundamental problem. The single issue that accounts for more wasted crawl budget, more indexation bloat, more ranking dilution, and more confused Googlebot behavior than any other factor in e-commerce. If you understand this problem and how to solve it, you understand about sixty percent of e-commerce technical SEO. The other forty percent is taxonomy and internal linking, which we'll get to. But first, the plumbing.
Why E-commerce Sites Are Plumbing Problems
I call e-commerce SEO "plumbing with a cash register" because the metaphor is almost literally true. A well-functioning e-commerce site works like a well-designed plumbing system: clean water (Googlebot, users) flows in through the main pipe (the homepage and top-level categories), gets distributed through a branching network of pipes (subcategories, product listings) to the individual fixtures (product pages) where the actual work happens (transactions). The system has valves (robots.txt, noindex tags, canonical tags) that control flow and prevent waste. The system has pressure (link equity, crawl budget) that needs to be managed so the important fixtures get enough water and the unimportant ones don't drain the system.
When the plumbing breaks - when there are leaks, blockages, or cross-connections - the system still looks fine from the outside. The website still loads. The products still display. The checkout still works. But behind the walls, the pressure is dropping, the flow is going to the wrong places, and the fixtures that matter (the high-margin product pages, the category pages that should be ranking) are getting a trickle while the crawl budget firehose is pointed at 86,400 filter combination pages that should never have existed in the first place.
The previous agency "doing SEO" on this site had optimized title tags on 40,000 product pages. They'd written meta descriptions. They'd added schema markup. They'd done everything right at the fixture level while the plumbing behind the walls was flooding the basement. This is why the traffic went up briefly (Google found some better-optimized pages and rewarded them) and then went down (Google's crawler got progressively more bogged down in the faceted navigation swamp, spent less and less of its crawl budget on the pages that actually mattered, and eventually started deindexing good pages because it ran out of crawl budget before reaching them).
The ratio that should scare you: If your indexed URL count is more than 3-5x your actual page count (unique products plus categories plus supporting content), you almost certainly have an index bloat problem. Check this in Search Console's Indexing > Pages report. Then look at the actual number of pages you want indexed. If the gap is large, start looking at your faceted navigation, parameterized URLs, and session-based URL generation.
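If you'd rather run the check as code than as back-of-envelope arithmetic, here's the ratio test. The supporting-content count below is a made-up placeholder; plug in your own numbers:

```python
def index_bloat_ratio(indexed_urls: int, intended_pages: int) -> float:
    """Ratio of what Google has indexed to what you meant to publish."""
    return indexed_urls / intended_pages

# Illustrative figures from the opening anecdote; the supporting-page
# count is a guess, not a number from the client's site.
products, categories, supporting = 40_000, 340, 500
intended = products + categories + supporting

ratio = index_bloat_ratio(2_300_000, intended)
print(f"{ratio:.0f}x")  # anything past roughly 3-5x signals index bloat
```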
Crawl Budget: Why It Matters More for E-commerce
Crawl budget is one of those SEO concepts that gets thrown around a lot and understood very little, so let me be precise about what it means and why it matters disproportionately for e-commerce sites.
Google assigns every site a crawl budget - the number of pages Googlebot will crawl in a given time period. This budget is determined by two factors: crawl rate limit (how fast Google can crawl without overloading your server) and crawl demand (how much Google wants to crawl based on perceived importance and freshness). For most small sites, crawl budget is irrelevant because Google can crawl the entire site in minutes. For a site with 40,000 products and 2.3 million URLs, crawl budget is the single most important technical factor because Google literally cannot crawl everything.
When Google can't crawl everything, it makes choices. It prioritizes pages it thinks are important (based on internal links, external links, traffic, and freshness signals) and deprioritizes pages it thinks are less important. This is where the faceted navigation problem becomes catastrophic: Google doesn't know that /shoes/running?color=blue&size=10&brand=nike is a junk filter page and /shoes/running is the category page you need to rank. To Google, they're both URLs. And if there are 86,400 of the former and one of the latter, the crawler will spend most of its budget on the junk and very little on the page that matters.
I've seen this pattern so many times it's almost boring to describe. (Almost. It's never actually boring because every time I see it, someone is losing real money.) The site owner looks at Search Console and sees that thousands of product pages are in "Discovered - currently not indexed" status. They panic. They ask "why won't Google index my products?" The answer is that Google ran out of crawl budget before it got to those products because it spent all its budget crawling filter combinations, paginated URLs, sort-order variations, and other technical debris that the site generates automatically and that nobody ever told Googlebot to ignore.
The fix isn't to increase crawl budget (you can't really control that, despite what some people will tell you). The fix is to reduce crawl waste. Stop Googlebot from crawling the pages that don't matter so it has budget left for the pages that do. This is the valve system. This is where robots.txt, noindex, canonical tags, and URL parameter handling come in, and each one does something different, and using the wrong one in the wrong situation is a common and sometimes expensive mistake.
The Noindex vs. Canonical vs. Robots.txt Decision Tree
This is the section where I need to be technically precise because getting this wrong can range from "slightly suboptimal" to "you just deindexed your entire product catalog." I'm going to lay out the decision tree I use for faceted navigation URLs, and I'm going to explain why each choice exists, because if you just follow the tree without understanding it, you'll eventually encounter a situation the tree doesn't cover and you'll make the wrong call.
Option 1: Robots.txt disallow. This tells Googlebot "don't crawl this URL." It saves crawl budget, which is good. But - and this is the critical part that a lot of people get wrong - it does not prevent indexing. If another page links to a robots.txt-blocked URL, or if an external site links to it, Google may index it anyway. It'll show up in search results with a snippet that says "A description for this result is not available because of this site's robots.txt" and it'll look terrible and it'll potentially rank for things you don't want it to rank for. Use robots.txt to save crawl budget on URLs that nothing external links to and whose internal links are fully under your control. Best for: session ID URLs, internal search result pages, print-friendly versions, and similar technical artifacts that nobody links to.
Option 2: Noindex meta tag. This tells Google "crawl this page but don't put it in the index." It removes the page from search results, which is what you want. But - and here's the catch - Google has to crawl the page to see the noindex tag. So it doesn't save crawl budget. Google still visits the URL, reads the content, sees the noindex tag, and moves on. If your primary problem is crawl budget waste (which it usually is for large e-commerce sites), noindex alone doesn't solve it. Use noindex when the page might get external links (and therefore might get indexed despite robots.txt) but you don't want it in the search index. Best for: faceted pages that filter by a single non-search-relevant attribute (filter by availability, filter by sale status), paginated pages beyond page 1, and search result pages.
Option 3: Canonical tag pointing to the base category page. This tells Google "this page is a variant of that page, please consolidate signals." It's a hint, not a directive - Google can ignore it. But when it works, it's elegant: the filter URL still exists and still functions for users, Google crawls it but understands it's a variant, and the link equity flows to the canonical page. Use canonical when the faceted page's content is substantially similar to the base category page (which it usually is - the same products, just filtered). Best for: most faceted navigation URLs, sort-order variations, and view-type variations (grid vs. list).
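To make the three valves concrete, here's what each one looks like on the page and in robots.txt. These are illustrative patterns only - example.com URLs, and a robots.txt wildcard that assumes query-string faceting; adapt them to your actual URL scheme:

```
# Valve 1 - robots.txt: stop crawling of multi-parameter URLs
# (matches any URL containing a "?" followed somewhere by an "&",
# i.e. two or more query parameters)
User-agent: *
Disallow: /*?*&*
```

```html
<!-- Valve 2 - noindex: crawled, but kept out of the index -->
<meta name="robots" content="noindex, follow">

<!-- Valve 3 - canonical: a hint to consolidate signals to the base category -->
<link rel="canonical" href="https://example.com/shoes/running">
```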
Option 4: Combination approach (the one I actually use). For large sites, the right answer is almost never a single technique. Here's what I typically recommend:
The faceted navigation decision framework:
1. Does this filter combination represent a real search intent? (Do people actually search for "blue running shoes size 10"?) If YES, let it be indexed. Add a canonical to itself. Optimize it. This is a landing page, not a faceted URL.
2. Does this filter combination have significant search volume? Check keyword data. If YES and the volume justifies it, treat it as a real landing page with unique content, proper internal links, and self-referencing canonical. If NO, proceed to step 3.
3. Is this a single-attribute filter on a search-relevant dimension? (Color, brand, material - things people actually filter by in search queries.) If YES, canonical to the base category page. Allow crawling. Google will consolidate.
4. Is this a multi-attribute filter combination? (Color AND size AND brand.) If YES, add a canonical to the most relevant single-attribute or base category page AND add a robots.txt disallow for the multi-parameter pattern. Belt and suspenders - with the caveat that Googlebot can't see the canonical while the URL is blocked, so the disallow does the real work and the canonical is a fallback for whenever the block is lifted.
5. Is this a non-search-relevant filter? (Sort order, items per page, availability, view type.) If YES, noindex plus robots.txt disallow. Nobody is searching for "running shoes sorted by price ascending." Nuke it from orbit.
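For what it's worth, the five steps collapse into a short routing function. This is a sketch, not production code - the attribute sets below and the has_search_volume flag are stand-ins for your actual keyword data:

```python
from dataclasses import dataclass

# Dimensions people actually put in search queries (illustrative set).
SEARCH_RELEVANT = {"color", "brand", "material"}
# Parameters nobody searches for - pure technical debris.
NOISE = {"sort", "per_page", "availability", "view"}

@dataclass
class FacetedURL:
    filters: dict            # e.g. {"color": "blue", "size": "10"}
    has_search_volume: bool  # from your keyword data, not guessed

def route(url: FacetedURL) -> str:
    """Return the indexing treatment for a faceted URL (steps 1-5)."""
    if url.has_search_volume:               # steps 1-2: a real landing page
        return "index: self-canonical, optimize"
    dims = set(url.filters)
    if dims & NOISE:                        # step 5: nuke it from orbit
        return "noindex + robots.txt disallow"
    if len(dims) == 1 and dims <= SEARCH_RELEVANT:  # step 3
        return "canonical to base category"
    return "canonical to base + robots.txt disallow"  # step 4

print(route(FacetedURL({"color": "blue"}, has_search_volume=True)))
print(route(FacetedURL({"sort": "price_asc"}, has_search_volume=False)))
```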
The key insight behind this framework is that not all faceted URLs are equal. Some represent genuine search intent and should be landing pages. Some represent reasonable variations that Google can understand through canonicals. Some are pure technical waste that should be blocked from crawling entirely. The mistake most people make is treating all faceted URLs the same way, which means either blocking potentially valuable landing pages or allowing millions of worthless combinations to eat crawl budget.
The Taxonomy Problem
Here's something that took me an embarrassingly long time to figure out (we're talking years, not months, which is a timeline of ignorance that I share with you only because I promised to show actual process rather than pretend I arrived at all of this fully formed like some kind of SEO Athena springing from Zeus's forehead): the category structure of an e-commerce site is not a navigation decision. It is a ranking decision. The way you organize your products into categories and subcategories determines, more than almost any other single factor, what keywords your site can rank for and how well.
Think about it this way. When someone searches for "men's running shoes," Google wants to show them a page about men's running shoes. Not a page about shoes in general. Not a product page for a specific shoe. A category page - a curated collection of men's running shoes - that matches the intent precisely. If your site has a category called "Men's Running Shoes" with a unique URL, unique content, proper internal links, and a clean taxonomy that establishes it as a distinct entity within your site's architecture, Google can understand what that page is and rank it appropriately.
If your site has products tagged with attributes but no dedicated category page - if the only way to see "men's running shoes" is to filter the main "shoes" page by gender and activity type - then you don't have a page about men's running shoes. You have a faceted navigation URL. And we've already discussed what happens to faceted navigation URLs. They get canonicalized, or noindexed, or robots.txt blocked, or - if you haven't done anything - they bloat your index with millions of variations while the one combination that actually matches a real keyword goes unrecognized.
Category architecture is information architecture is ranking architecture. These three things are the same thing, and understanding that they're the same thing is the difference between an e-commerce site that ranks and one that doesn't.
The practical implication is that your category structure should be built from keyword research, not from product attributes. Don't organize your products by whatever internal logic your merchandising team uses. Organize them by how people actually search. If people search for "men's running shoes," create a category for it. If people search for "waterproof hiking boots," create a category for it. If people search for "women's cross-training shoes size 9," they probably don't (check the data), so don't create a category for it - let the faceted navigation handle it.
The taxonomy should be two to four levels deep. Homepage > Department > Category > Subcategory. Sometimes a third level of subcategory if the product range is genuinely that broad. Deeper than four levels and you're burying pages too far from the homepage, which dilutes both link equity and crawl priority. Shallower than two levels and you're not creating enough specificity to match long-tail queries.
The taxonomy test: Take your top 100 target keywords for organic search. For each keyword, identify which page on your site should rank for it. If the answer for more than 20% of your keywords is "we don't have a page for that" or "it's a faceted navigation URL," your taxonomy needs restructuring. Every high-volume keyword you want to rank for should map to a real, indexable, internally-linked category or subcategory page with unique content.
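The taxonomy test is easy to run as a script once you've built the keyword-to-page mapping. A minimal sketch - the mapping below is a toy; yours comes from actual keyword research:

```python
def taxonomy_gap(keyword_to_page: dict) -> float:
    """Fraction of target keywords with no real indexable page.

    keyword_to_page maps each target keyword to the page that should
    rank for it, or to None when no such page exists (or when the
    only match is a faceted navigation URL).
    """
    missing = sum(1 for page in keyword_to_page.values() if page is None)
    return missing / len(keyword_to_page)

# Toy mapping; in practice this is your top 100 keywords.
mapping = {
    "men's running shoes": "/mens/running-shoes",
    "waterproof hiking boots": "/boots/waterproof-hiking",
    "trail running shoes": None,  # only reachable via filters today
}
gap = taxonomy_gap(mapping)
print(f"{gap:.0%} of keywords unmapped")  # restructure if above 20%
```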
Product Page Optimization at Scale
Let me talk about product pages, because product pages are where the money literally is and where most e-commerce SEO efforts are simultaneously most focused and least effective.
The problem with product pages at scale is a content problem. If you sell 40,000 products, you need 40,000 product pages. Each of those pages needs enough unique, valuable content to justify its existence in Google's index. Manufacturer descriptions don't count - if you're using the same product description that's on every other retailer's site (and you are, because the manufacturer sent the same copy to everyone), you have a duplicate content problem across 40,000 pages.
Now, I'm not going to tell you to write unique descriptions for 40,000 products because that's insane. That's the kind of recommendation that sounds reasonable in an audit and is completely unreasonable in practice. But I am going to tell you that you need a strategy for creating unique content at scale, and that strategy needs to be practical enough that someone can actually execute it.
Here's what works. First, prioritize. You don't need unique content on all 40,000 pages. You need unique content on the pages that matter. Pull your Search Console data. Identify the product pages that get impressions. These are the pages Google is already considering for rankings - they just need a push. Write unique descriptions for those first. For most e-commerce sites, this will be somewhere between 5% and 20% of the total catalog, which is a manageable number.
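Here's a sketch of that prioritization pass. It assumes a Search Console performance export with "Page" and "Impressions" columns, plain integer impression counts, and a /product/ URL pattern - all three are assumptions; adjust to your actual export and URL scheme:

```python
import csv

def pages_worth_writing(gsc_export_path: str, min_impressions: int = 100):
    """Product URLs Google already shows, sorted by impressions.

    Column names ("Page", "Impressions") and the "/product/" URL
    pattern are hypothetical - match them to your own export.
    """
    with open(gsc_export_path, newline="") as f:
        rows = [
            (r["Page"], int(r["Impressions"]))
            for r in csv.DictReader(f)
            if "/product/" in r["Page"]
            and int(r["Impressions"]) >= min_impressions
        ]
    # Highest-impression pages first: these need unique content soonest.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```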
Second, use structured content to create uniqueness algorithmically. A product page that has manufacturer description plus customer reviews plus Q&A sections plus technical specifications in a structured format plus comparison tables plus "frequently bought together" sections is a more unique page than one with just the manufacturer description, even though you didn't "write" most of that content. Each of those content modules adds unique text and context that distinguishes the page from the same product on a competitor's site.
Third, schema markup. This isn't a content strategy exactly, but it amplifies everything else. Product schema (price, availability, condition, brand, SKU), review schema (aggregate rating, review count), FAQ schema (if you have Q&A sections), breadcrumb schema. Schema doesn't directly help rankings (despite what some agencies will tell you), but it dramatically improves click-through rates from the SERP, and click-through rate is a signal that does affect rankings over time. A product page with a rich snippet showing 4.7 stars from 283 reviews, a price, and availability status gets more clicks than a plain blue link, and more clicks means more engagement signals, which means better rankings over time.
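For what a product rich-result payload actually looks like, here's a minimal sketch that builds the JSON-LD in Python and wraps it in the script tag a product template would emit. The product name, SKU, and brand are invented; the property names follow schema.org's Product, Offer, and AggregateRating types:

```python
import json

# Illustrative values only - the product is made up.
product_schema = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Trailblazer Running Shoe",
    "sku": "TB-RUN-042",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "offers": {
        "@type": "Offer",
        "price": "89.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.7",
        "reviewCount": "283",
    },
}

# Emit as a JSON-LD block for the product page's <head>.
jsonld = json.dumps(product_schema, indent=2)
print(f'<script type="application/ld+json">\n{jsonld}\n</script>')
```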
Fourth, don't forget the images. Product images are often the most unique content on a product page, and they're almost always under-optimized. Unique, high-quality product photography (not manufacturer stock photos), descriptive file names, informative alt text that includes the product name and key attributes, and proper image sitemaps so Google Images can find them. Google Images is an underappreciated traffic source for e-commerce - people search for products visually, and if your images are well-optimized and unique, you'll capture traffic that competitors who use generic manufacturer photos will miss.
Internal Linking for Deep Catalogs
The internal linking problem in e-commerce is a distribution problem. You have link equity flowing into the site through external links, and you need to distribute that equity to the pages that matter most. In a site with 40,000 product pages, the natural distribution (where link equity flows through the standard navigation) will be highly concentrated at the top (homepage, main categories) and extremely thin at the bottom (individual product pages, deep subcategories). This is the "long tail" distribution that every SEO talks about, and it means that the pages deepest in your site - which are usually the product pages where transactions actually happen - get the least link equity and the worst crawl priority.
The solution is strategic internal linking that pushes equity deeper into the site. And by "strategic" I mean "based on business priorities, not on some automated internal linking plugin that adds 'related products' links randomly."
Here's the internal linking architecture I recommend for e-commerce sites, and I'm being specific because "improve your internal linking" is the kind of advice that sounds helpful and isn't:
Category pages should link to subcategory pages and to featured/high-priority products. Not to all products - to featured ones. The "featured" designation should be based on margin, conversion rate, or strategic importance, not on recency or alphabetical order. Every category page should have a "top products" or "bestsellers" section near the top that links directly to your most important product pages in that category.
Product pages should link to related products (genuinely related, not randomly selected), to the parent category and subcategory, and (this is the one everyone misses) to other products that answer related search queries. If someone is looking at a running shoe, linking to socks, insoles, and running accessories isn't just good UX - it's creating a topical cluster that signals to Google that your site covers the topic comprehensively.
Blog content (if you have it) should link aggressively to category and product pages. Every guide, comparison, and review article should link to the relevant product and category pages using descriptive anchor text. A blog post about "how to choose running shoes for flat feet" should link to your "Running Shoes for Flat Feet" category page (which you created because the taxonomy test told you people search for this). This is how content marketing actually works for e-commerce - not by driving traffic to blog posts, but by building internal link equity that flows to the pages that generate revenue.
Breadcrumbs are not optional. Every product page should have breadcrumb navigation that reflects the category hierarchy, and those breadcrumbs should be marked up with breadcrumb schema. Breadcrumbs serve three purposes: they help users navigate, they pass link equity up the hierarchy, and they help Google understand the site structure. Implementing breadcrumbs is one of those rare SEO recommendations that has no downside and measurable upside, which is why it baffles me how many e-commerce sites either don't have them or have them implemented incorrectly.
The Seasonal Inventory Problem
Here's a problem specific to e-commerce that most SEO advice doesn't address because most SEO advice is written by people who've never managed a large product catalog: products come and go. Inventory changes. Seasonal items appear and disappear. Products get discontinued. New versions replace old ones. The catalog is not static, and every catalog change creates an SEO decision.
The wrong approach (which I see constantly) is to delete product pages when products go out of stock. The product is discontinued, so the page gets removed, the URL returns a 404, and whatever rankings, link equity, and traffic that page had accumulated simply vanishes. Months later, when the product comes back in stock (or a replacement product launches), a new page is created from scratch with zero authority. This is the SEO equivalent of demolishing your house every winter and rebuilding it from scratch every spring.
The right approach depends on the situation, and there are really only four situations:
Temporarily out of stock: Keep the page live. Display a clear "out of stock" message. Add schema markup indicating the availability status. Offer email notification for when it's back. Do not noindex it. Do not redirect it. The page should remain in the index, ranking, accumulating equity, and ready to convert the moment inventory returns. Google has explicitly stated that temporarily out-of-stock pages should remain indexable.
Seasonally unavailable: Same as above, but with additional content. "This product is available seasonally from April through September. Sign up for notifications." Keep the page ranking during the off-season because when the season starts, you want to already be in position, not starting over. If you deindex a seasonal product page every October and reindex it every April, you're giving up six months of equity accumulation for no reason.
Permanently discontinued with a replacement: 301 redirect the old URL to the replacement product page. This passes link equity from the old page to the new one, maintains the URL's position in external links, and provides a decent user experience (they wanted the old product, here's the new version). If there's no single replacement but a category of alternatives, redirect to the relevant category page.
Permanently discontinued with no replacement: This is the only case where letting the page 404 is defensible, and even then I'd argue for redirecting to the parent category if the page had any meaningful traffic or links. A 404 wastes everything the page accumulated. A redirect to a relevant category page preserves some equity and gives the user a next step. The redirect isn't perfect - the user wanted a specific product and they're getting a category - but it's better than a dead end.
Never do this: Do not redirect all discontinued products to the homepage. This is a common pattern and it's terrible. First, it provides an awful user experience (I wanted a specific product and you sent me to your homepage?). Second, Google treats a mass of redirects to the homepage as soft 404s and ignores them anyway. Third, it concentrates redirect signals in a way that dilutes rather than preserves equity. Redirect to the most relevant alternative page. If there is no relevant alternative, let it 404 cleanly.
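The four situations (plus the never-redirect-to-the-homepage rule) reduce to a small dispatch. A sketch with hypothetical state names, not a drop-in handler - the point is that the decision is mechanical once you know which state you're in:

```python
from typing import Optional, Tuple

def discontinued_response(state: str,
                          replacement_url: Optional[str],
                          category_url: Optional[str]) -> Tuple[int, Optional[str]]:
    """Map an inventory state to (HTTP status, redirect target).

    State names are illustrative; wire them to your own catalog flags.
    """
    if state in ("temporarily_out_of_stock", "seasonal"):
        return (200, None)                 # keep the page live and indexable
    if state == "discontinued":
        if replacement_url:
            return (301, replacement_url)  # pass equity to the successor
        if category_url:
            return (301, category_url)     # next-best relevant page
        return (404, None)                 # clean 404; never the homepage
    raise ValueError(f"unknown state: {state}")

print(discontinued_response("seasonal", None, None))          # (200, None)
print(discontinued_response("discontinued", None, "/shoes"))  # (301, '/shoes')
```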
Putting It All Together
Let me come back to the client with 40,000 products and 2.3 million indexed URLs, because I promised you plumbing and plumbing ends with things working.
Here's what we did, in order, over about four months.
Month one: We fixed the faceted navigation. We implemented the decision framework I described above - identified the filter combinations that represented real search intent (about 200 of them), created proper landing pages for those, and blocked or canonicalized everything else. This single change reduced the indexed URL count from 2.3 million to approximately 85,000. The crawl budget that had been wasted on 2.2 million junk URLs was suddenly available for the 85,000 pages that mattered. Within three weeks, we saw hundreds of product pages move from "Discovered - currently not indexed" to "Indexed" in Search Console.
Month two: We restructured the taxonomy. We took the existing category structure (which had been organized by the merchandising team based on internal product codes, not search behavior) and rebuilt it based on keyword research. We created 34 new subcategory pages targeting specific keyword clusters. We wrote unique content for each one - not marketing copy, but genuinely useful buying guides that helped users understand the product differences within each category. Each page had proper schema markup, breadcrumbs, and internal links to the highest-priority products.
Month three: We built the internal linking architecture. We added "featured products" sections to every category page, implemented breadcrumbs across the entire site (they hadn't had breadcrumbs, which is like building a skyscraper without elevator buttons), and created a blog content strategy focused on publishing buying guides that linked aggressively to category pages. We also ran a redirect audit on the 800+ discontinued products that had been returning 404s for months and redirected the ones with meaningful traffic or links to relevant alternatives.
Month four: We optimized product pages for the top 2,000 products by search impression volume. Unique descriptions, enhanced structured data, optimized images, review integration. We did not optimize all 40,000 product pages because that would have taken a year and the ROI on optimizing product #37,248 (an M4 zinc-plated hex nut with twelve monthly searches) is approximately zero.
The results, which I share not to brag (okay, partially to brag) but to illustrate the disproportionate impact of fixing the plumbing versus polishing the fixtures: organic traffic increased 340% over the following six months. Organic revenue increased 280% (not quite as much as traffic because some of the new traffic was informational rather than transactional, which is fine because those visitors enter the funnel and some of them convert later). The indexed URL count stabilized at around 90,000, which is a healthy ratio for a 40,000-product site.
The previous agency had spent two years optimizing title tags. We spent four months fixing the plumbing. The plumbing won.
It always wins.