I have a ritual when I take on a new client. Before I look at their analytics, before I open Search Console, before I run a single keyword report, I crawl the site. Not the site they think they have. The site they actually have. The full site. Every subdomain. Every directory. Every forgotten corner of their digital property. I do this the way an archaeologist surveys a dig site before they start digging - not because I expect to find treasure (though sometimes I do), but because I expect to find bones. And I always find bones.

The crawl takes time. Sometimes hours. Sometimes, for the enterprise clients with the complicated histories and the multiple acquisitions and the decade-plus web presence, it takes days. I run Screaming Frog or Sitebulb or sometimes both, pointed not just at the www subdomain that appears in the address bar and the marketing materials and the business cards, but at every subdomain DNS will reveal, every directory that responds to an HTTP request, every corner of the domain that has ever hosted a page. I watch the URLs scroll by. Thousands of them. Tens of thousands. Sometimes hundreds of thousands. And mixed in with the current pages, the live pages, the pages that the marketing team knows about and the content team maintains and the development team deployed on purpose, there are other pages. Pages that nobody knows about. Pages that nobody maintains. Pages that nobody deployed on purpose but that exist anyway, stubbornly, silently, like squatters in a building the owner forgot they owned.

These are the bones.

Every company's website is built on top of a graveyard. I don't mean this metaphorically, though the metaphor works well enough that I'll lean into it for the next several thousand words. I mean it structurally, architecturally, literally. The website you see when you visit a company's homepage - the polished design, the carefully written copy, the conversion-optimized landing pages, the blog with its regular posting schedule and its stock photography and its relentless commitment to the content calendar - that website sits on top of a foundation of abandoned projects, forgotten experiments, decommissioned campaigns, and orphaned pages that nobody remembers creating and nobody has taken responsibility for and that Google is still, right now, today, as you read this, attempting to crawl.

I find these things because I look. I look because nobody else does. And nobody else does because looking is boring and unsexy and what you find is almost always unpleasant and the fix is almost always tedious and nobody ever got promoted for cleaning up digital archaeology. You get promoted for launching things. You get a Slack thread full of party emojis for shipping the new campaign. Nobody gives you a party emoji for discovering that the staging site has been indexed since 2021 and is serving an outdated version of your product pages to approximately 340 real humans a month who found them through Google and are now looking at pricing from two years ago and a CEO photo of a person who no longer works there.

But I'm getting ahead of myself. Let me tell you about the graveyard.

The First Dig

The first time I understood what I was looking at - the first time the bones resolved from abstract clutter into a coherent picture of what had gone wrong and when and how - was in 2016. The client was a mid-market SaaS company. Eighty employees. Series B. Good product, decent revenue, growing steadily, the kind of company that has its act together in every department except, somehow, its website. They hired me because their organic traffic had been flat for eighteen months despite doing, in their words, "everything right." They had a content team. They had a blog. They published regularly. They had title tags and meta descriptions and alt text on their images and all the SEO hygiene items that someone had read about in a blog post in 2013 and implemented with admirable thoroughness.

I crawled the site. And what I found was not a website. It was a palimpsest - a document written over and over on the same surface, each layer partially obscuring the one beneath it, none of them fully erased.

The company had been through three redesigns in seven years. Not unusual. What was unusual was that nobody had cleaned up after any of them. The current site lived at www.example.com (not the real domain, obviously, though the temptation to name names is always there and I'm a worse person for resisting it). But also living on that domain, accessible to anyone with a browser and indexed by anyone with a crawler, were:

A staging environment at staging.example.com. This was the big one. The staging site was a complete copy of the production site, or rather, a complete copy of the production site as it had existed in January 2014 when someone had set up the staging environment and never password-protected it and never added a robots.txt and never added a noindex tag and never, in two and a half years, thought to check whether Google had found it. Google had found it. Google had found it almost immediately, because Google finds everything, because that is Google's entire job, to find things on the internet, and a fully functional website sitting on a subdomain with no access restrictions is, from Google's perspective, a fully functional website that should be indexed and served to users.

The staging site had 1,200 pages. The production site had 1,400 pages. That meant that Google was indexing 2,600 pages across the domain, of which 1,200 were duplicates with outdated content, old designs, dead links, broken images, pricing from 2014, team photos of people who had left the company, and a blog section that stopped updating in January 2014 because of course it did, because the staging site was a snapshot frozen in time, a digital Pompeii, preserved exactly as it was the moment someone typed git clone and walked away.

I showed the client. I remember the meeting. The VP of Marketing and the CTO were both on the call. The VP of Marketing said "how is this possible" in a way that made clear she was asking a question and issuing an accusation simultaneously. The CTO said nothing for a very long time and then said "huh." Not "huh" as in surprise. "Huh" as in "I probably should have known about this and the fact that I didn't is going to come up in my next performance review."

We killed the staging site. Proper noindex tags first, then a password gate, then eventually a full decommission with 301 redirects pointing all the old staging URLs back to their production equivalents. The process took two weeks because the staging site had accumulated its own backlinks (people had linked to it, thinking it was the real site, because from the outside it looked exactly like the real site) and its own search visibility (Google was ranking staging pages for brand queries, which meant that real potential customers searching for the company were sometimes landing on a version of the site from two years ago with the wrong phone number and a signup form that went nowhere).
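
For what it's worth, the check that would have caught this years earlier is not sophisticated. Something like the following sketch - illustrative hostname, deliberately blunt heuristics, nobody's production tooling - answers the only three questions that matter about a staging host: does it respond publicly, does it block crawlers, does it demand a password.

```python
# Minimal exposure check for a host that should never be public.
# The hostname is illustrative; the heuristics are deliberately blunt.
import requests

def indexability_report(host: str) -> dict:
    base = f"https://{host}"
    report = {"host": host}

    # 1. Is the host publicly reachable, and does it demand credentials?
    resp = requests.get(base, timeout=10, allow_redirects=True)
    report["status"] = resp.status_code
    report["password_protected"] = resp.status_code in (401, 403)

    # 2. Does robots.txt block crawling outright?
    robots = requests.get(f"{base}/robots.txt", timeout=10)
    report["robots_disallow_all"] = robots.status_code == 200 and any(
        line.strip().lower() == "disallow: /" for line in robots.text.splitlines()
    )

    # 3. Is there a noindex directive in the headers or (crude check,
    #    not a real HTML parse) in the first chunk of the page?
    x_robots = resp.headers.get("X-Robots-Tag", "").lower()
    report["noindex"] = "noindex" in x_robots or "noindex" in resp.text[:20000].lower()

    # If nothing above protects the host, it is wide open to Googlebot.
    report["exposed_to_indexing"] = not (
        report["password_protected"]
        or report["robots_disallow_all"]
        or report["noindex"]
    )
    return report

print(indexability_report("staging.example.com"))
```

If all three answers come back "no," Google will find the host, because finding things is Google's entire job.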

Within three months of cleaning it up, the client's organic traffic increased by 40%. Not because we'd done anything to improve the site. Because we'd removed the thing that was hurting it. The staging site had been cannibalizing the production site for over two years, splitting link equity, confusing crawlers, serving duplicate content, and generally making a mess of what should have been a straightforward domain, and nobody knew. Nobody looked. Nobody had any reason to look, because looking at your full domain infrastructure is not something that appears on anybody's quarterly objectives or sprint planning or content calendar or any of the other systems that modern companies use to organize their digital efforts.

That was 2016. I've done this a hundred times since then. The details change. The pattern doesn't.

A Taxonomy of Ghosts

After a decade of digital archaeology, I've developed a taxonomy. A classification system for the things that haunt company websites. I present it here not because I enjoy taxonomies (I don't, they remind me of the kind of deliverable that consultants produce when they want to look smart without doing anything useful) but because naming the categories makes them easier to find, and finding them is the entire point.

Category One: The Staging Ghosts. I've already told you about staging sites. They're the most common haunting and the most damaging. Almost every company with a web development team has, at some point, created a staging environment on a subdomain. Staging.domain.com. Dev.domain.com. Test.domain.com. Preview.domain.com. Sometimes numbered: staging2.domain.com, dev-v3.domain.com, test-sept-2019.domain.com. These sites are created for a legitimate reason - to test changes before deploying them to production - and they are almost never properly secured, almost never noindexed, and almost never decommissioned when the project they were created for is complete.

I have found staging sites that were five years old. I have found staging sites that contained entire product lines that the company had discontinued. I have found a staging site that was, through some misconfiguration that I never fully understood, receiving production database updates while serving an old frontend, which meant it was displaying current customer data in a 2019 design template, which is both a UX nightmare and a potential data privacy issue and the kind of thing that makes your compliance officer's eye twitch. I have found staging sites with open admin panels. I have found staging sites with debug mode enabled, leaking stack traces and database connection strings to anyone who triggered an error. I have found staging sites where someone had been testing a new payment flow and had left test credit card numbers in the page source, which is not technically a security breach because they were test numbers, but which is the kind of thing that a security auditor writes up in red font with exclamation marks.

Every single one of these was indexed by Google. Every single one was accessible to the public. Every single one was invisible to the company that owned it, because nobody was looking.

Category Two: The Campaign Ghosts. Marketing departments launch campaigns. This is what marketing departments do. They create landing pages. They create microsites. They spin up subdomains. They register new domains. They build things with energy and enthusiasm and budget, and then the campaign ends, and the landing pages stay. The microsites stay. The subdomains stay. The domains auto-renew because nobody updated the billing contact and the credit card on file still works, so the registrar keeps charging it, and the site keeps existing, a monument to a campaign nobody remembers about a product nobody sells for a promotion that expired three years ago.

I had a client in the B2B space. Enterprise software. Good company, smart people, the kind of company where everyone has strong opinions about Salesforce and nobody remembers to cancel unused SaaS subscriptions. They had sixteen subdomains. Not all of them were active. Not all of them were intentional. But all of them were there, responding to HTTP requests, serving content, being crawled by Google.

One subdomain was a microsite for a campaign from 2019. The campaign was tied to a specific industry event - a trade show that the company had sponsored. The microsite had a landing page with a form for scheduling demos, a page listing the speakers at the event, a page with the booth location and schedule, and a blog post about why the company was excited to attend. The event had happened four years earlier. The speakers had moved to different companies. The booth location was irrelevant. The demo scheduling form was still functional and was still, according to the form submissions database, receiving approximately two submissions a month from people who had found the microsite through Google and didn't realize they were looking at a time capsule.

Two demo requests a month. Going to a form handler. That routed to an email address. That belonged to an employee who had left the company in 2021. The emails were bouncing. The demo requests were evaporating. Potential customers were filling out a form, expecting a response, and receiving nothing, which is worse than receiving nothing from a company that doesn't have a form, because at least then the customer knows they haven't made contact. These people thought they had made contact. They had filled out a form. They had entered their name and their email and their phone number and their company name and the size of their team and their timeline for purchase, which are the kinds of details that sales teams would pay real money to acquire, and the form was eating them. Silently. For two years.

How many customers did this company lose? I don't know. Nobody knows. Nobody will ever know. The demo requests are gone, bounced into the void, and the customers who submitted them have long since bought a competitor's product or forgotten they were looking or simply moved on with their lives, carrying with them, perhaps, a vague sense that this company was unresponsive, that they'd tried to get in touch and been ghosted, which is the worst possible brand impression you can create and which was being created, automatically, by a microsite that nobody knew existed for a campaign that nobody remembered.

I told this to the VP of Sales. His face did something complicated.

Category Three: The Acquisition Ghosts. This is the big one. The one that keeps me crawling enterprise sites at 2 AM with a sense of grim anticipation. When one company acquires another, they acquire the other company's website. They acquire its subdomains. They acquire its redirects. They acquire its DNS records. They acquire, in short, every digital decision that company ever made, including the bad ones, including the forgotten ones, including the ones that the acquired company's developers don't remember and the acquired company's marketing team never knew about.

I worked with a company that had made four acquisitions in six years. A growth-by-acquisition strategy. Very common in private equity-backed businesses, which is a world I spend a lot of time in and which I've written about before (the PE firms and their SEO mistakes are a subject I find endlessly fascinating, in the way that other people find true crime podcasts fascinating, which is to say it's a fascination grounded in horror). Each acquisition came with its own domain, its own website, its own subdomains, its own technical debt.

When I crawled the combined digital footprint of this company - the parent domain plus all four acquired domains plus all their subdomains plus all the redirects and legacy infrastructure - the total URL count was 187,000. One hundred and eighty-seven thousand URLs that Google was aware of across this company's digital presence. The company's current website - the one they showed to customers, the one they'd invested in, the one they were proud of - had 2,400 pages. Meaning that for every page they intended to have, they had roughly 77 pages they didn't intend to have. A ratio of 1:77. For every room in the house, there were 77 rooms in the basement that nobody had been in for years and that were full of things best not examined too closely.

Of those 187,000 URLs, approximately 160,000 were returning 200 OK status codes. Not 404s. Not 301 redirects. 200 OKs. Serving content. Old content. Irrelevant content. Content from acquired companies that had been folded into the parent brand years ago but whose websites were still live, still serving their old homepages, still ranking for their old brand terms, still sending visitors to product pages for products that had been discontinued or rebranded or merged into the parent company's product line.

Google was spending the vast majority of its crawl budget on these legacy sites. I could see it in the crawl stats. The parent company's actual website - the 2,400 pages they cared about - was getting crawled less frequently than the 160,000 legacy pages they didn't know existed. Google was dutifully visiting dead product pages from 2018, expired event landing pages from 2020, blog posts from acquired companies that contradicted the parent company's current messaging, and in one memorable case, a competitor comparison page from an acquired company that compared the acquired company favorably against the parent company, which meant that the parent company was, unknowingly, hosting a page that argued against its own product. It was ranking on page two. Customers were finding it.

I showed the CEO. He put his head in his hands. Not figuratively. Literally. Both hands. Full face coverage. The meeting was on Zoom so I had an excellent view of the top of his head for about thirty seconds.

Category Four: The Orphan Ghosts. Orphaned pages are pages that exist on a website but aren't linked to from anywhere on that website. They're reachable by URL - if you type the address directly or if you have an old bookmark or if Google has it in its index from a crawl it did six months ago - but they're not part of the site's navigation, they're not in the sitemap, they're not linked to from any other page. They exist in a state of digital limbo, connected to the domain but disconnected from the site, like a room in a house with no door.

Orphaned pages accumulate naturally. You publish a blog post in 2020. In 2022, you redesign the blog. The new blog template paginates differently, and the pagination cuts off at page 50, meaning that at ten posts per page, any post older than the 500th most recent post is no longer accessible through the blog's navigation. But the post is still there. The URL still works. Google still has it in its index. It's just that nobody can find it by navigating the site, which means nobody is linking to it internally, which means Google's signals about its importance are weakening over time, which means it's slowly sinking in the rankings like a boat with a small but persistent leak.

Or: you create a landing page for a PPC campaign. The campaign runs for six months. The campaign ends. The ads stop running. The landing page is no longer receiving paid traffic. But nobody deletes it. Nobody redirects it. Nobody removes it from the sitemap (if it was ever in the sitemap, which it probably wasn't, because PPC landing pages are often noindexed to prevent them from competing with organic pages, except when the person who set up the noindex tag leaves the company and the person who replaces them removes the noindex tag during a site migration because they see it and think "that looks wrong" and there's nobody around to explain why it was there). The page becomes an orphan. It floats. It accumulates dust. It gets crawled. It gets indexed. It sits in Google's index, not ranking for anything useful, not contributing to the site's overall authority, just existing, consuming a small but nonzero amount of Google's attention every time the crawler comes back to check if anything has changed (it hasn't, nothing has changed on this page since the campaign ended two years ago, but Google doesn't know that, Google checks anyway, because Google is a very thorough, very patient, very literal-minded system that does not understand the concept of "we don't need this anymore").

Every site I've ever crawled has orphaned pages. Every single one. The number varies - I've seen sites with a dozen and I've seen sites with 40,000 - but the existence of orphans is universal. They're the most common haunting because they're the most natural. They're not the result of a mistake or a forgotten project or a failed acquisition. They're the result of time. Of publishing content over years and decades without a system for managing the lifecycle of that content from creation to retirement.

Category Five: The Test Ghosts. These are my favorites, in the way that a coroner might have a favorite cause of death, which is to say professionally rather than emotionally. Test pages are pages that were created during development - "lorem ipsum" pages, test product entries, placeholder content, dummy data - that were never removed before the site went live. They're the equivalent of leaving the construction scaffolding up after the building is finished, except the scaffolding is invisible and Google is crawling it.

I found a test page once on a financial services website - a company that manages people's retirement savings, a company where trust and professionalism are, you would think, paramount - that contained the text "This is a test page please delete me lol." The page had been live for eight months. It was indexed. It was ranking for a long-tail query that included the company's brand name plus the word "test." Not many people were searching for this query. But some were. And when they found it, they found a page on their retirement savings provider's website that said "please delete me lol," which is not the kind of messaging that inspires confidence in the stewardship of your 401(k).

I found another test page on an e-commerce site that was a fully functional product page for a product called "Test Product DO NOT USE" priced at $0.01. The product was in the sitemap. Google had indexed it. Google Shopping had picked it up. It was appearing in product listing ads. You could, theoretically, add it to your cart and check out. I don't know what would have happened if you'd completed the purchase. I didn't test it. Some graves are better left undisturbed.

The Crawl Budget Tax

Here's where the graveyard metaphor starts to carry real weight. Because ghosts, in the traditional sense, are mostly harmless. They rattle chains. They make the lights flicker. They provide content for reality television programs of questionable production value. But digital ghosts - the staging sites, the campaign remnants, the acquisition debris, the orphaned pages, the test content - aren't harmless. They're actively expensive.

The expense is crawl budget. Crawl budget is the amount of attention Google allocates to your domain - the number of pages Google will crawl, the frequency with which it will return, the speed at which it will process and index new content. Crawl budget is finite. Google has explicitly said this. Google has said that there's a crawl rate limit (how fast Googlebot can crawl without degrading your server performance) and a crawl demand (how much Google wants to crawl, based on freshness, importance, and other factors). Whichever of the two is lower sets your effective crawl budget.

When you have 2,400 pages and Google is allocating crawl budget based on 187,000 URLs, the math is not in your favor. Google is spending crawl resources on dead pages. On orphaned content. On staging sites that haven't changed since 2019. On campaign microsites that stopped being relevant three years ago. Every crawl Google wastes on a ghost page is a crawl it doesn't spend on a live page. Every time Googlebot visits a product page from an acquired company that no longer exists, that's a visit it could have spent on your new product page, the one you launched last week, the one that's sitting in Google's crawl queue waiting for its turn, a turn that's coming later than it should because Google is busy cataloging your ghosts.

This matters more than most people think. For small sites - a few hundred pages, straightforward architecture, no subdomains, no acquisition history - crawl budget isn't a concern. Google will find everything and crawl it regularly regardless. But for large sites, for sites with complex histories, for sites with subdomains and legacy infrastructure and the accumulated digital debris of a decade of web presence, crawl budget becomes a real constraint. I've seen sites where new content took months to get indexed because Google was spending its crawl allocation on legacy pages. Months. In a world where businesses expect their new product page to show up in search results within days, "months" is a word that makes executives ask uncomfortable questions about why they're paying for SEO.

The crawl budget tax is invisible. That's what makes it so insidious. You can't see it in Google Analytics. You can't see it in Search Console (well, you can, if you know where to look and what to compare, but nobody is teaching this, nobody is writing blog posts titled "How to Check if Google Is Wasting Its Time Crawling Your Staging Site," because that's not a sexy topic and it doesn't generate social shares and it doesn't have a snappy acronym that you can put on a conference slide). The crawl budget tax just silently makes everything slower. Indexing takes longer. Rankings take longer to update. New content sits in limbo longer. And because the effect is diffuse - spread across the entire site, across hundreds or thousands of pages, across months and years - it's almost impossible to attribute specific business outcomes to the underlying cause.
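
If you want to make the tax visible without waiting for Search Console to cooperate, the server access logs will tell you where Googlebot actually spends its visits. A rough sketch, assuming a combined-format log with the request line in quotes - your log path, format, and bot-verification requirements will differ:

```python
# Count Googlebot requests per top-level site section from an access log.
# "access.log" is illustrative; a production version would verify Googlebot
# by reverse DNS rather than trusting the user-agent string.
import re
from collections import Counter

REQUEST_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

def googlebot_hits_by_section(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if "Googlebot" not in line:
                continue
            match = REQUEST_LINE.search(line)
            if not match:
                continue
            # Bucket by first path segment: /old/, /blog/, /products/, ...
            section = "/" + match.group("path").lstrip("/").split("/", 1)[0]
            hits[section] += 1
    return hits

for section, count in googlebot_hits_by_section("access.log").most_common(15):
    print(f"{count:8d}  {section}")
```

Run that against a month of logs for a domain with a healthy graveyard and the output is usually its own argument: the sections nobody has touched since 2019, sitting at the top of the list.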

You can't walk into a meeting and say "we lost $400,000 in revenue because Google was wasting crawl budget on a staging site from 2019." You can't prove that. The causal chain is too long, too indirect, too tangled with other variables. But the loss is real. The delay is real. The opportunity cost of Google spending its finite attention on your ghosts instead of your live content is real. It's just invisible. And invisible costs are the hardest ones to justify fixing, which is why they don't get fixed, which is why the graveyard keeps growing.

The Horror Stories

I promised horror stories. I should deliver. These are all real. The details have been changed enough to protect the people involved, who are good people who made reasonable decisions at the time and simply didn't anticipate the consequences, which is, when you think about it, the description of nearly every decision that ends up in a horror story.

The Law Firm With Two Websites. A mid-size law firm. Forty attorneys. Bread-and-butter practice areas - personal injury, family law, estate planning, the kind of work that lives and dies on local SEO because clients search "divorce lawyer near me" not "top-rated matrimonial attorney in the greater metropolitan statistical area." They had been struggling with local rankings for two years. They'd hired three different SEO agencies. None had moved the needle. They came to me as a last resort, which is how most clients come to me, not because I'm particularly impressive but because I'm expensive enough that people don't hire me first.

I crawled the domain. And I found a second website.

Not a subdomain. Not a staging site. A completely separate website, on the same domain, in a subdirectory called /old/. The law firm had redesigned their website in 2020. The web development agency that handled the redesign had, instead of replacing the old site, simply moved it into a subdirectory and built the new site on top of it. The old site was still there. All of it. Every page. Every attorney bio (including six attorneys who had left the firm). Every practice area page. Every blog post. Every piece of content the firm had ever published, living in a parallel universe at domain.com/old/ while the new site lived at domain.com/.

The /old/ directory had 600 pages. The new site had 200 pages. Google was indexing all 800. Google was, in many cases, preferring the old pages because they had more backlinks (they'd been accumulating links for years before the redesign) and more content (the old site was more verbose, in that mid-2010s way where every practice area page was 3,000 words of keyword-stuffed prose because someone had told the firm that longer content ranks better, which was sort of true at the time but is the kind of half-truth that ages like milk).

For two years, this law firm's new website had been competing against its old website. Their "divorce lawyer" page on the new site was competing for rankings against their "divorce lawyer" page on the old site. Google, faced with two pages on the same domain targeting the same keywords, was choosing between them in a way that satisfied neither - sometimes ranking the new page, sometimes ranking the old page, sometimes ranking neither, because when Google can't determine which of two duplicate pages is authoritative, it sometimes throws up its hands (metaphorically, Google doesn't have hands) and ranks a competitor instead.

The fix took four hours. We 301 redirected every page in the /old/ directory to its equivalent on the new site. Where there was no equivalent (the departed attorneys, the discontinued practice areas), we redirected to the nearest relevant page. We updated the robots.txt to disallow the /old/ directory. We submitted the updated sitemap. We waited.
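
Mechanically, the fix was a mapping exercise: every URL under /old/ points at its nearest equivalent on the live site, and anything without an equivalent falls back to the closest relevant page. Something in the spirit of this sketch - the inputs, the fallback page, and the output format are illustrative, not the actual deliverable from that engagement:

```python
# Build a 301 map for a site that was shoved into an /old/ subdirectory.
# The fallback page and example URLs are hypothetical.
from urllib.parse import urlparse

FALLBACK = "/practice-areas/"  # nearest relevant page when no 1:1 match exists

def build_redirect_map(old_urls: list[str], live_paths: set[str]) -> dict[str, str]:
    redirects = {}
    for url in old_urls:
        old_path = urlparse(url).path                  # e.g. /old/divorce-lawyer/
        candidate = old_path.replace("/old/", "/", 1)  # e.g. /divorce-lawyer/
        redirects[old_path] = candidate if candidate in live_paths else FALLBACK
    return redirects

# Emit "from<TAB>to" pairs that a developer can translate into nginx map
# entries or Apache RewriteRules.
live = {"/divorce-lawyer/", "/estate-planning/", "/personal-injury/"}
old = [
    "https://example.com/old/divorce-lawyer/",
    "https://example.com/old/attorneys/jane-doe/",
]
for src, dst in build_redirect_map(old, live).items():
    print(f"{src}\t{dst}")
```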

Six weeks later, the firm was ranking in the local three-pack for their primary practice areas for the first time in two years. Not because we'd done anything clever. Because we'd removed the thing that was preventing the obvious outcome from occurring. The grave was dug. The ghost was laid to rest. The living could finally be seen.

The Retailer With 40,000 Ghosts. An e-commerce company. Apparel. About 2,000 current products. Seasonal catalog that rotated quarterly. Been in business since 2009. Their website had grown organically (in the non-SEO sense of the word) over fourteen years, each season adding new product pages, each year adding new category pages, each redesign reshuffling the navigation without removing the old pages.

The crawl revealed 43,000 URLs returning 200 status codes. Forty-three thousand. For a company with 2,000 products. The other 41,000 pages were ghosts - old product pages for items no longer sold, seasonal landing pages from years past (Summer Sale 2016, Holiday Gift Guide 2018, Back to School 2019, and so on, a calendar of marketing optimism stretching back more than a decade), category pages for organizational schemes that had been replaced three redesigns ago, filtered URLs that the faceted navigation had generated in infinite combinations (size small + color blue + style casual + sort by price + page 3, multiplied by every permutation of every filter, generating URLs that no human had ever visited and no human ever would).

Google was attempting to crawl all 43,000 URLs on a regular basis. "Attempting" is the key word. Because the crawl budget wasn't sufficient - this was a mid-size retailer, not Amazon, and Google's patience for mid-size retailers is finite - Google was crawling maybe 5,000 URLs per day. At that rate, it took Google roughly nine days to make one complete pass through the site. Nine days to see everything once. For a site with seasonal products that changed weekly, "once every nine days" meant that new products were sitting in the crawl queue for over a week before Google even discovered them, let alone indexed them, let alone ranked them.

During peak selling seasons - Black Friday, back-to-school, holiday - a week's delay in indexing meant real money. Not abstract opportunity-cost money. Real, countable, somebody-lost-their-bonus money. New product pages launching for holiday campaigns weren't appearing in search results until the campaign was half over, because Google was too busy re-crawling the Summer Sale 2016 landing page to notice that new things existed.

We pruned 38,000 pages. Not all at once. Not recklessly. We did it over three months, carefully, with redirects where appropriate and 410 Gone status codes where redirects didn't make sense and canonical tags on the filtered URLs and a completely rebuilt sitemap that included only the pages we actually wanted Google to index. We monitored the crawl stats daily. We watched Google's crawl rate adjust in real time, shifting from the diluted, spread-thin crawling of 43,000 pages to the focused, efficient crawling of 5,000 pages.
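
The rebuilt sitemap was the simplest piece of that work and the one with the most leverage: a file containing only the URLs we wanted crawled and nothing else. A minimal sketch with the standard library - the input file is illustrative, one URL per line:

```python
# Rebuild sitemap.xml from a curated keep-list instead of letting the CMS
# dump every URL it knows about. "keep_urls.txt" is an illustrative input.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def write_sitemap(url_file: str, out_file: str = "sitemap.xml") -> None:
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    with open(url_file, encoding="utf-8") as fh:
        for line in fh:
            url = line.strip()
            if not url:
                continue
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
    ET.ElementTree(urlset).write(out_file, encoding="utf-8", xml_declaration=True)

write_sitemap("keep_urls.txt")
```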

The result: new products were being indexed within 24 to 48 hours instead of nine days. The speed-to-index improvement alone, during the following holiday season, was worth - and I'm being conservative here, using the client's own attribution model - approximately $280,000 in incremental revenue. From deleting pages. From taking things away. From cleaning up the graveyard so that the living could breathe.

The Company That Was Ranking for Its Own Security Vulnerability. This is the one I tell at parties, by which I mean the kind of parties I attend, which are parties where people voluntarily discuss canonicalization over drinks, which is a specific kind of party that you either understand or you don't and if you don't I envy you.

The client was a healthcare technology company. HIPAA compliance was not a suggestion for them. It was the entire regulatory framework within which they existed. Their main site was clean, well-maintained, properly secured. But on a subdomain - one that the IT team had set up years ago for internal documentation and had then forgotten about, in the way that IT teams forget about things they set up years ago, which is to say routinely and universally - there was a wiki. An actual, honest-to-god, open-to-the-public, indexed-by-Google wiki. Running software that hadn't been updated in three years. With known security vulnerabilities. That Google had indexed.

The wiki contained internal documentation. Process documents. Onboarding guides. Meeting notes. And - because the universe has a sense of humor that I am still learning to appreciate - a page titled "Known Security Issues - Q3 2021" that listed, in helpful detail, the company's known security vulnerabilities, the systems affected, the remediation timeline, and the name of the contractor responsible for the fixes.

This page was ranking. Not well. Not on page one. But it was ranking. For the company's brand name plus the word "security." Which meant that a journalist, a competitor, a potential customer, or a bad actor conducting reconnaissance could have found it with a moderately creative Google search. And would have found a detailed map of the company's security weaknesses, written by the company itself, published by the company itself, and indexed by a search engine because nobody had thought to put a password on the wiki or at least noindex the subdomain.

I did not tell this story at the client meeting. I told it to the CTO privately, over the phone, in a conversation that lasted four minutes and contained three instances of the word I mentioned earlier, the four-letter one, none of them mine. The wiki was offline within the hour. A proper security audit followed. I billed for the crawl. The CTO sent me a bottle of wine. We never discussed it again.

Why Nobody Looks

The question I get asked most often, after telling these stories, is "why?" Why doesn't anybody check? Why doesn't anybody crawl their full domain? Why do companies with seven-figure marketing budgets and dedicated IT departments and compliance teams and security audits and all the infrastructure of modern business management simply not look at what's on their own website?

The answer is structural, not personal. It's not that individual people are negligent. It's that no single person or department owns the full picture.

The marketing team owns the current website. The pages they know about. The content they've published. The campaigns they've launched. They don't own the subdomains. They don't own the staging environments. They don't own the legacy infrastructure from before they were hired. They don't even know it exists, because it's outside their scope, outside their tools, outside their mental model of what "the website" is.

The IT team owns the infrastructure. The servers. The DNS records. The subdomains. They know, in theory, that staging.domain.com exists, because they set it up. But they don't think about it in SEO terms. They don't think about whether Google is indexing it. They don't think about whether it's cannibalizing the production site. They think about uptime and security and server configuration, and if the staging server is up and not being actively attacked, it falls off their priority list, because IT teams have enough to worry about without also worrying about crawl budget.

The development team owns the code. The deployments. The test environments. They create test pages and staging instances as part of their workflow, and cleaning them up when they're done is (in theory) part of the workflow too, but in practice, "cleanup" is the step that gets skipped when the sprint deadline is looming and the product manager is asking why the feature isn't live yet and nobody has ever been fired for leaving a test page up but plenty of people have been fired for missing a deadline.

The result is a gap. A seam between departments, between responsibilities, between mental models. The website as Marketing sees it is not the website as IT sees it is not the website as Development sees it is not the website as Google sees it. And Google's view - the comprehensive, subdomain-spanning, directory-traversing, orphan-finding, test-page-discovering view - is the only one that matters for SEO. Because Google doesn't care about your org chart. Google doesn't know that staging.domain.com belongs to the development team and that the development team doesn't think about SEO. Google sees URLs. Google crawls URLs. Google indexes URLs. And when those URLs hurt your site, Google doesn't send a memo to the right department. Google just indexes the damage and moves on.

This is the fundamental problem, and it's why I do the crawl before I do anything else. Because the crawl shows me the site as Google sees it, not as the client sees it. And those two views are always different. Always. I have never, in twenty-plus years, crawled a client's full domain and found nothing unexpected. The size of the graveyard varies. The severity of the hauntings varies. But the graveyard is always there.

The Cleanup Framework

I've developed a process for this. It's not sexy. It doesn't have a name. It doesn't come with a certification. It's not the kind of thing you can put on a slide at a conference and have people tweet about it. It's grunt work. Tedious, methodical, unglamorous grunt work. It's the SEO equivalent of sorting through a dead relative's attic, except the attic is digital and the dead relative is every decision the company made before you arrived.

Step one: crawl everything. Not just the main domain. Every subdomain in DNS. Every IP that the domain's A records point to. Every variation (www vs. non-www, http vs. https). Use a crawler that can handle hundreds of thousands of URLs without choking. Run it multiple times if you have to. Let it run overnight. Let it find everything.
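
A starting point for the "everything" part, if you want one: before the crawler runs, resolve the usual suspects and treat anything that answers as a crawl seed. A quick-and-dirty sketch - the prefix list is illustrative and nowhere near exhaustive, and a real survey also pulls from DNS zone data and certificate transparency logs:

```python
# Resolve common subdomain patterns and collect the ones that exist in DNS.
# Caveat: wildcard DNS will make everything resolve; sanity-check against a
# nonsense prefix before trusting the results.
import socket

COMMON_PREFIXES = [
    "www", "staging", "stage", "dev", "test", "preview", "demo",
    "old", "new", "beta", "app", "blog", "shop", "wiki", "docs",
]

def surviving_hosts(domain: str) -> list[str]:
    found = []
    for prefix in [""] + COMMON_PREFIXES:
        host = f"{prefix}.{domain}".lstrip(".")
        try:
            socket.gethostbyname(host)  # does this name resolve at all?
            found.append(host)
        except socket.gaierror:
            pass
    return found

# Each resolving host becomes a crawl seed, in both http and https,
# with and without www where that applies.
for host in surviving_hosts("example.com"):
    print(host)
```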

Step two: categorize. Every URL that the crawl discovers goes into one of four buckets. Live and needed: these are the current pages, the ones the client knows about, the ones that should be indexed. Live and unneeded: these are the ghosts, the pages that are serving content but shouldn't be, the staging sites and campaign remnants and orphaned pages and test content. Redirecting: these are the URLs that already redirect somewhere, and you need to check that they redirect to the right place. Erroring: these are the 404s and 500s and other broken things, and you need to decide whether they should be fixed, redirected, or left to die.
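
If it helps to see the buckets as something more concrete than a sentence, here is the categorization step as a sketch. It assumes a crawl export with url and status columns and a list of the URLs the client actually intends to have; the file names and column names are illustrative and will vary with your crawler:

```python
# Sort a crawl export into the four buckets described above.
# "crawl.csv" and "known_good.txt" are illustrative inputs.
import csv

def bucket_crawl(crawl_csv: str, known_good_file: str) -> dict[str, list[str]]:
    with open(known_good_file, encoding="utf-8") as fh:
        known_good = {line.strip() for line in fh if line.strip()}

    buckets = {"live_needed": [], "live_unneeded": [], "redirecting": [], "erroring": []}
    with open(crawl_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            url, status = row["url"], int(row["status"])
            if 300 <= status < 400:
                buckets["redirecting"].append(url)
            elif status >= 400:
                buckets["erroring"].append(url)
            elif url in known_good:
                buckets["live_needed"].append(url)
            else:
                buckets["live_unneeded"].append(url)  # the ghosts
    return buckets

for name, urls in bucket_crawl("crawl.csv", "known_good.txt").items():
    print(f"{name}: {len(urls)} URLs")
```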

Step three: prioritize. Not all ghosts are equally dangerous. A staging site that's been indexed for three years and is cannibalizing your main site is a priority-one emergency. An orphaned blog post from 2017 that gets two visits a month is a nuisance, not an emergency. A test page that says "lorem ipsum" is embarrassing but probably not hurting your rankings. Triage aggressively. Fix the things that are actively damaging your SEO first. Fix the things that are embarrassing second. Fix the things that are merely messy third. Accept that you will never fix everything. The graveyard is too big. The best you can do is contain it.

Step four: execute. For staging sites and other environments that shouldn't be public: noindex first, then password-protect, then decommission. For campaign remnants and dead microsites: 301 redirect to the most relevant live page. For orphaned content that still has value: re-link it, bring it back into the site architecture, give it a path from the navigation. For orphaned content that has no value: 301 to a relevant page if it has backlinks, 410 Gone if it doesn't. For test pages and placeholder content: delete. Just delete. Nobody needs them. Nobody wants them. Let them go.
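
Written down as a decision table, step four looks something like this - the categories mirror the taxonomy above, the backlink counts come from whatever link index you trust, and the return values are instructions for a human, not an automation:

```python
# Map a ghost's category (and whether it has backlinks) to the cleanup action
# described in the text. Category names are illustrative labels.
def action_for(ghost_type: str, backlinks: int = 0) -> str:
    if ghost_type == "staging":
        return "noindex, then password-protect, then decommission"
    if ghost_type in ("campaign", "microsite"):
        return "301 redirect to the most relevant live page"
    if ghost_type == "orphan_valuable":
        return "re-link into the site architecture"
    if ghost_type == "orphan_worthless":
        return "301 redirect" if backlinks > 0 else "410 Gone"
    if ghost_type in ("test", "placeholder"):
        return "delete"
    return "review manually"

print(action_for("orphan_worthless", backlinks=0))  # -> 410 Gone
```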

Step five: monitor. And this is where the whole thing falls apart, because this is where you need ongoing vigilance, and ongoing vigilance is the one thing that organizations are structurally incapable of providing. Setting up the crawl as a one-time project is easy (well, tedious, but achievable). Doing it on a recurring basis - monthly, quarterly, annually - requires someone to own it. Someone to run the crawl. Someone to review the results. Someone to notice when a new staging site appears or a new orphaned page is created or a new test page goes live. Someone to care. And caring, in the organizational context, means budget and headcount and priority, which means convincing someone with authority that preventing future ghosts is worth the same investment as creating new content or launching new campaigns or building new features.
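
The monitoring itself is not the hard part. It reduces to a diff: crawl on a schedule, compare the URL set against the previous run, flag anything that showed up without anyone announcing it. A sketch - the file names are illustrative, and in practice this runs from cron or CI and posts its findings somewhere a human will actually look:

```python
# Flag URLs that appeared since the last crawl. Each input file is a plain
# list of URLs, one per line, exported from whatever crawler you use.
def load_urls(path: str) -> set[str]:
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}

def new_ghosts(previous_crawl: str, current_crawl: str) -> set[str]:
    return load_urls(current_crawl) - load_urls(previous_crawl)

for url in sorted(new_ghosts("crawl_q1.txt", "crawl_q2.txt")):
    print("NEW, UNREVIEWED:", url)
```

The script is trivial. An owner for it is not.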

This is the conversation that almost never happens. I can clean up a graveyard. I've done it dozens of times. But if nobody is maintaining the cemetery, new graves keep appearing. New staging sites. New campaigns that launch and are never decommissioned. New acquisitions that bring new domains and new subdomains and new legacy infrastructure. The graveyard doesn't stop growing just because you cleaned it once. It stops growing when someone's job includes making sure it stops growing. And in most organizations, that's nobody's job.

The Broader Point

I've been talking about specific problems. Staging sites. Campaign remnants. Orphaned pages. Technical debris. The specific, identifiable, fixable things that I find when I crawl a domain. But the broader point is not about any specific ghost. The broader point is about technical debt.

Every software engineer knows about technical debt. It's one of the foundational concepts of software development - the idea that every shortcut you take, every quick fix you implement, every "we'll clean this up later" that you never actually clean up, accumulates over time into a burden that makes future development slower, more expensive, and more fragile. Technical debt is invisible day to day. It doesn't show up in the sprint metrics or the deployment logs or the product roadmap. But it shows up in the velocity. In the speed at which the team can ship new features. In the frequency of bugs. In the frustration of developers who spend 60% of their time working around decisions that someone made three years ago.

SEO has its own technical debt, and it works exactly the same way. Every staging site that goes undecommissioned, every campaign page that goes unredirected, every orphaned URL that goes uncleaned, every test page that goes undeleted, every acquisition domain that goes unintegrated - these are not one-time costs. They're ongoing. They accumulate. They compound. And unlike software technical debt, which at least has a constituency of engineers who understand what it is and advocate (loudly, persistently, usually unsuccessfully) for addressing it, SEO technical debt has no constituency. Nobody advocates for it. Nobody measures it. Nobody budgets for it.

This is why I call it a graveyard. Not because it's spooky (though it is, sometimes, in the way that finding a staging site with customer data is spooky). Not because the metaphor is clever (it's not, particularly). But because graveyards are places where things are put and forgotten. Where decisions are made - to bury, to move on, to stop looking - and the consequences of those decisions persist long after the decision-makers have moved to other companies or other roles or other concerns. The graveyard outlasts the groundskeeper. The ghosts outlast the people who created them.

And the ghosts are patient. They don't demand attention. They don't crash the site. They don't trigger alerts or set off alarms or show up as red lines on a monitoring dashboard. They just sit there, quietly, incrementally, persistently making everything a little worse. A little slower to index. A little more diluted in authority. A little more confusing to crawlers. A little more embarrassing when a customer finds the test page or the old campaign or the wiki with the security vulnerabilities.

The damage is invisible until it's catastrophic. That's the nature of debt, technical or otherwise. You don't notice it accumulating. You notice it when you can't ship fast enough, or when your new pages aren't indexing, or when a competitor overtakes you and you can't figure out why, or when a journalist finds your internal wiki and writes a story about it, or when the crawl budget that should be spent on your new product launch is being consumed by the digital remains of a trade show sponsorship from 2018.

By the time you see it, it's already too late to address it quickly. The cleanup takes weeks. Sometimes months. Sometimes, for the enterprise clients with the four acquisitions and the 187,000 URLs, it takes the better part of a year. And during that year, the debt continues to compound, because you can't stop publishing new content and launching new campaigns and making new acquisitions while you're cleaning up the old ones. The living don't stop living because the graveyard needs tending.

What I Wish Someone Had Told Me

When I started doing this work - when I started crawling full domains as a standard practice, when I started treating digital archaeology as a core SEO discipline rather than an occasional curiosity - I was operating on instinct. I had a feeling that there was more to most sites than met the eye. I had seen enough unexpected URLs scroll past in crawl reports to suspect that the surface of most websites was only a fraction of the total mass, like an iceberg, except icebergs have been overdone as a metaphor and I'm trying to maintain some originality here.

What I wish someone had told me earlier is that this work is not optional. It's not a nice-to-have. It's not something you do after you've optimized the title tags and built the content calendar and fixed the page speed issues and done all the other SEO things that SEO people do. It's the first thing you do. Before anything else. Because if the foundation is rotten - if the graveyard is large enough and active enough and toxic enough - then everything you build on top of it is compromised. Your content strategy is compromised because you're creating new pages that have to compete with ghost pages from your own domain for Google's attention. Your link building is compromised because the authority you're building is being diluted across thousands of URLs that don't deserve it. Your technical SEO is compromised because your crawl budget is being spent on pages that don't matter instead of pages that do.

You can't optimize a graveyard. You have to clean it first.

And I wish someone had told me that the resistance you encounter when you propose this work - the skepticism, the budgetary hesitation, the "that sounds like a lot of work for something that might not move the needle" pushback that you get from stakeholders who would rather spend the money on a new content campaign or a link-building initiative or literally anything that sounds more productive than "let's delete a bunch of old pages" - that resistance is not ignorance. It's rational. From the stakeholder's perspective, cleaning up ghosts is unsexy, expensive, time-consuming, hard to measure, and offers no guarantee of improvement. They're not wrong about any of that. They're wrong about the conclusion they draw from it, which is that the work isn't worth doing, when in fact the work is essential precisely because it's hard to measure. The things that are hardest to measure are the things that go longest without being addressed, and the things that go longest without being addressed are the things that cause the most damage.

This is true in SEO. This is true in infrastructure. This is true in organizations. This is probably true in life, though I'm an SEO consultant and not a philosopher and the last time I tried to make a grand statement about the human condition my wife told me to stop reading so much Marcus Aurelius and take out the recycling.

The Ritual

I still do the crawl. Every new client. Every new engagement. Before I open their analytics, before I check their rankings, before I look at a single keyword or a single backlink or a single piece of content. I crawl the site. The full site. Every subdomain. Every directory. Every corner.

I do it knowing what I'll find. I do it knowing that somewhere in that domain, behind a subdomain nobody remembers and a directory nobody checks, there are pages that shouldn't be there. Pages that are costing money and credibility and crawl budget and rankings. Pages that were created with good intentions and abandoned with no intentions and that have been sitting, patiently, in the dark, waiting for someone to look.

I look.

That's the whole job, sometimes. Not the clever strategy. Not the innovative approach. Not the cutting-edge technique. Just looking. Just pointing the crawler at the full domain and watching what comes back and sorting through the results with the patience and the stomach for unpleasant surprises that this work demands.

The bones are always there. In every company. Under every website. In every domain that has existed for more than a few years and that has been touched by more than a few hands and that has been through more than a few changes. The graveyard is always there.

The only question is whether anyone is brave enough to dig.