TL;DR: A Pew Research Center study found that 38% of webpages from a decade ago, and about 25% of pages sampled across the decade, are now inaccessible; our analysis shows that the Wayback Machine has rescued a large share of those otherwise dead pages (16% of all sampled URLs in the full Pew dataset, against 26% found dead).
In 2024, the Pew Research Center published a link-rot study, “When Online Content Disappears”. They stated, “38% of webpages that existed in 2013 are no longer accessible a decade later”. They further noted, “a quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible”. This is not the only report to quantify the rate at which online information is lost. Numerous other link-rot studies over the last two decades have reported similar or worse numbers, depending on the context and samples. For example, Ahrefs, an SEO company, reported in the same year, “At Least 66.5% of Links to Sites in the Last 9 Years Are Dead”. In 2021, Jonathan Zittrain published an article in The Atlantic, “The Internet Is Rotting”, in which his team analyzed about 2 million external links from New York Times (NYTimes) articles and reported that 25% of deep links had rotted. They further noted that 72% of the older links from 1998 were dead. A recent longitudinal link-rot study from Old Dominion University (ODU), “Some URLs Are Immortal, Most Are Ephemeral”, analyzed 27.3 million URLs sampled from the Wayback Machine since 1996 and reported that about 65% of them were dead on the live Web when checked in 2023. Brewster Kahle, the founder of the Internet Archive, has long cited numbers from the early days of the Web that put the average life of a webpage anywhere from 40 to 100 days. A 2026 book, “Vanishing Culture: A Report on Our Fragile Cultural Record”, by Messarra et al. highlights the underlying causes of numerous recent losses of digital culture while emphasizing the critical roles libraries and archives must play in maintaining our cultural history for the future. Different studies have looked at the problem from different perspectives and in different contexts, so it is often difficult to compare them side by side, but they all agree that an increasing number of links rot with the passage of time.
However, some (not all) of these studies do not acknowledge the existence of Web archives, such as the Wayback Machine, where a portion of the dead Web might be preserved and can serve as a fallback when a reference leads to a broken link.
In this post we go through some of the link-rot studies and look at them from the perspective of the Wayback Machine to see how much of the dead Web can be rescued. Table 1 shows the status of the dead and rescued Web at a glance as sampled by a few different studies.
| Study | Year | Period | Samples | Dead | Rescued |
|---|---|---|---|---|---|
| Pew (All) | 2024 | 2013-2023 | 5.4M | 26% | 16% |
| Pew (General) | 2024 | 2013-2023 | 1M | 27% | 13% |
| Zittrain NYT* | 2021 | 2013 | 88K | 40% | 38% |
| ODU NYPW | 2024 | 1996-2021 | 27.3M | 65% | 65% |
* The NYT numbers are based on our recreated dataset.
Let us begin by looking at the study from the Pew Research Center. They generously shared their dataset with us, so it was fairly straightforward (after some transformations and extractions, as the original dataset was stored in Parquet files) to check the URLs against the Wayback Machine to see if and when each of them was first archived. Their dataset contains 5.4 million unique URLs in four categories (general, news, government, and Wikipedia references), sampled from the Common Crawl archive and Wikipedia pages. They also reported on Tweets in their post, but that dataset was not shared with us due to restrictions posed by the usage policies.
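As an illustration of the "if and when first archived" check (not the pipeline used for this analysis, which is not described here), the sketch below asks the public Wayback Machine CDX endpoint for the earliest capture of one URL. The function names are hypothetical, and it assumes that a query for a single URL with `limit=1` returns the oldest capture first and that `output=json` returns a header row followed by data rows.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def first_capture_query(url: str) -> str:
    """Build a CDX query URL asking for a single capture of `url`."""
    params = {"url": url, "output": "json", "fl": "timestamp,original", "limit": "1"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_first_capture(payload: str) -> Optional[str]:
    """Return the 14-digit timestamp of the first row, or None if there is no capture."""
    rows = json.loads(payload) if payload.strip() else []
    # The first row is assumed to be a header naming the requested fields.
    if len(rows) < 2:
        return None
    header, first = rows[0], rows[1]
    return first[header.index("timestamp")]


def first_capture(url: str, timeout: float = 30.0) -> Optional[str]:
    """Fetch and parse the first capture of `url` (performs a network request)."""
    with urlopen(first_capture_query(url), timeout=timeout) as response:
        return parse_first_capture(response.read().decode("utf-8"))
```

A URL counts as archived when `first_capture` returns a timestamp. For millions of URLs, a per-URL HTTP request like this would be slow, so a bulk lookup against an index would be a more realistic choice.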
Before we dive into our findings, below are brief descriptions of some terms that we will use frequently:
- Alive: URLs that return a 200 OK HTTP status code when resolved
- Dead: URLs that return an HTTP error status code, a TCP connection error, or a DNS failure when resolved
- Preserved: URLs that are Alive on the live Web as well as present in a Web archive
- Rescued: URLs that are Dead on the live Web, but are present in a Web archive
- Endangered: URLs that are Alive on the live Web, but are not present in any Web archive
- Vanished: URLs that are Dead on the live Web and also not present in any Web archive
- Archived: Preserved + Rescued
- Accessible: Preserved + Rescued + Endangered
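The definitions above can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the function names are hypothetical, the status is the final one observed after any redirects, and `None` stands in for a TCP or DNS failure.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def is_alive(http_status: Optional[int]) -> bool:
    """Alive means a 200 OK; None represents a DNS or TCP failure."""
    return http_status == 200


def classify(http_status: Optional[int], archived: bool) -> str:
    """Map one URL's live-Web status and archive presence to a category."""
    if is_alive(http_status):
        return "preserved" if archived else "endangered"
    return "rescued" if archived else "vanished"


def summarize(observations: Iterable[Tuple[Optional[int], bool]]) -> Dict[str, float]:
    """Return the percentage of URLs per category plus the derived totals."""
    counts = Counter(classify(status, archived) for status, archived in observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    pct = {name: 100.0 * counts[name] / total
           for name in ("preserved", "rescued", "endangered", "vanished")}
    pct["archived"] = pct["preserved"] + pct["rescued"]
    pct["accessible"] = pct["archived"] + pct["endangered"]
    pct["dead"] = pct["rescued"] + pct["vanished"]
    return pct


# Synthetic sample of 100 URLs, not real data.
sample = ([(200, True)] * 56 + [(404, True)] * 16
          + [(200, False)] * 18 + [(None, False)] * 10)
print(summarize(sample))
```

Every URL falls into exactly one of the four base categories, so Preserved, Rescued, Endangered, and Vanished always sum to 100%.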
When we do not take any Web archives into account, about a quarter of the 5.4 million sampled URLs would be considered inaccessible or dead, as illustrated in Figure 1. However, when we leverage the Wayback Machine to access otherwise dead URLs, the fraction of inaccessible or vanished URLs drops from one in every four to only one in every ten. The Wayback Machine has about 72% of the entire dataset archived: 56% are preserved URLs that are still alive on the live Web and 16% are rescued from the dead. Another 18% of the sampled URLs are still alive but have not yet been archived in the Wayback Machine; we call these endangered, as they may vanish if they ever cease to exist on the live Web before being archived. It is worth noting that we did not account for captures of these URLs that might be present in any of the many smaller Web archives other than the Wayback Machine, which, if accounted for, might increase the percentage of accessible URLs a little more. Moreover, we relied on HTTP status codes and did not look into the contents of the pages to check for soft-404s (i.e., error pages that wrongly return a 200 OK HTTP status code) or other irrelevant content, which might change the numbers further.
A subset of about 1 million URLs from the Pew dataset is a sample of general webpages from the last decade, spanning 11 years from 2013 to 2023. They noted that about a quarter of the URLs from this subset were dead in 2023, with older URLs showing a greater percentage of loss, all the way up to 38% for links from 2013. We recreated their yearly graph in Figure 2 in orange, with an overlay in green of the URLs rescued by the Wayback Machine. We found that about 38% of the 38% dead URLs from 2013 (i.e., about 15% of the total) are rescued by the Wayback Machine. Moreover, of the roughly one quarter of all URLs in the general sample that were considered dead, about half were rescued by the Wayback Machine. It is worth noting that the last three years in Figure 2 seem to be rescued almost completely, but this is a side effect of the ingestion of recent Common Crawl data into the Wayback Machine, and Common Crawl happens to be the source of the Pew sample.

We requested access to the dataset of about 2 million URLs from Zittrain’s NYTimes outlinks study, but have not received it yet. In the interim, we created our own dataset by downloading all the NYTimes pages published in 2013 that are present in the Wayback Machine, extracting all the outlinks from them, and excluding all the links to pages from NYTimes itself. We were able to collect about 88 thousand such URLs this way. Then we checked the live Web status of each of the URLs (after following up to 5 redirects, if any) and also checked for their presence in the Wayback Machine. We found that 40% of the external links from NYTimes pages from 2013 were dead on the live Web, but 96% of those dead URLs are archived in the Wayback Machine. This means that only about 2% of the URLs from this sample have vanished. However, this impressive number needs to be taken with a grain of salt because we do not have the original URL sample, and our own sample is derived from pages present in the Wayback Machine. This introduces an inherent bias: outlinks from archived pages are more likely to be archived themselves than outlinks from pages that are not present in the Wayback Machine. That said, we will be keen to revisit these numbers if and when we get access to the original sample of URLs used in Zittrain’s study.
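The outlink extraction step described above can be sketched with the Python standard library. This is an illustrative sketch rather than the code used for the study: the helper names are hypothetical, and it assumes the input is the originally captured HTML (not a replay page whose links have been rewritten) and that "links to NYTimes itself" means any host at or under `nytimes.com`.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urljoin, urlsplit


class _AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_internal(host: str, site: str) -> bool:
    """True if `host` is the site itself or one of its subdomains."""
    host = host.lower()
    return host == site or host.endswith("." + site)


def external_links(html: str, page_url: str, site: str = "nytimes.com") -> Set[str]:
    """Return absolute http(s) outlinks of a page that leave `site`."""
    collector = _AnchorCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue  # skip mailto:, javascript:, and similar
        if not is_internal(parts.hostname, site):
            links.add(absolute)
    return links
```

Returning a set deduplicates links within a page. Deduplicating across pages would need a second pass over the combined results.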
A recent, and perhaps the most comprehensive, longitudinal link-rot study from ODU, on which we collaborated, analyzed 27.3 million URLs sampled from the index of the Wayback Machine, spanning more than two and a half decades. They reported that about 65% of the sampled URLs from 1996 to 2021 were dead in 2023. A significant number of these did not even resolve in DNS, indicating that many of those domain names were no longer registered. They found that most URLs die rapidly in the first few years of their existence, but some of the longest-living sites are not dead yet. Luckily, all of the dead URLs in this sample are rescued by the Wayback Machine by virtue of it being the source of the sample in the first place. This also means that the ODU study cannot tell us the percentages of endangered or vanished URLs, because its dataset contains no URLs that were never archived.
In summary, all of the link-rot studies, despite their varying numbers, indicate that the Web is brittle and that an increasing number of Web resources die with the passage of time. However, we found that Web archives like the Wayback Machine play an increasingly important role in rescuing the dead Web and minimizing the fracture of the knowledge graph on the Web, though there is a lot more to do. For example, the Turn All References Blue (TARB) project has fixed more than 30 million broken links (and counting) on hundreds of wikis with the help of the InternetArchiveBot, the WaybackMedic bot, and the Wayback Machine.
While there is not a lot that can be done to resurrect the vanished Web other than attempting to find alternate locations where the content might have moved to (via projects like FABLE), we are determined to minimize the percentage of endangered URLs. However, some internal and external factors limit our ability to bring it down to zero, such as resource limitations, JavaScript-heavy pages, bot blocking, login walls, paywalls, the deep Web, and lack of timely discovery. We strive to narrow down the potential loss of our cultural heritage by different means, such as ingesting feeds from MediaCloud, GDELT, and Wikipedia EventStream, and, more recently, becoming part of the IndexNow initiative to discover links soon after the corresponding pages are created or updated on the Web. Moreover, we have the Save Page Now (SPN) service and urge that when you “See Something, Save Something!”. Your continued support will help us preserve more of the Web, and preserve it better.
ACKNOWLEDGEMENTS: We thank our friends at the Pew Research Center and Old Dominion University and our colleagues Jake LaFountain, Stephen Balbach, Chris Freeland, and Mark Graham for their help and support in this work.
—
Dr. Sawood Alam
Research Lead, Wayback Machine
Internet Archive

















