According to a new report from Cloudflare, web crawlers deployed by Perplexity to scan websites appear to be circumventing site restrictions. Specifically, the report claims that the company’s bots are “stealth crawling” websites, masking their identity to get around robots.txt files and firewalls.
Robots.txt is a simple text file hosted on a website that tells web crawlers which content they may or may not crawl. Perplexity’s officially declared bots are “PerplexityBot” and “Perplexity-User”. In Cloudflare’s tests, Perplexity was still able to display the content of a new, unindexed website even when these specific bots were blocked by the robots.txt file. The same behavior was seen on websites with specific web application firewall (WAF) rules that restricted web crawlers.
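To make the robots.txt mechanism concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the example.com URL, the helper name `is_allowed` and the Chrome-style user agent string are illustrative assumptions, not taken from Cloudflare's report. It shows what a well-behaved crawler would conclude when asking whether it may fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the two declared Perplexity bots
# and allows everyone else. The rules are an assumption for illustration.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    chrome_like = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    for agent in ("PerplexityBot", "Perplexity-User", chrome_like):
        print(agent[:20], is_allowed(agent, page))
```

Because the rules are keyed on the user agent a crawler announces, and compliance is voluntary, a client presenting a generic browser string falls under the wildcard rule rather than the bot-specific blocks.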
Cloudflare believes that Perplexity gets around these obstacles by switching to a “generic browser designed to mimic Google Chrome on macOS” when robots.txt prohibits its declared bots. In Cloudflare’s tests, the undeclared crawler also rotated through IP addresses that are not listed in Perplexity’s official IP address range to get past firewalls. Cloudflare claims that Perplexity appears to be doing the same thing with Autonomous System Numbers (ASNs), which identify groups of IP networks run by the same operator, writing that it has noticed the crawler switching ASNs “across tens of thousands of domains and millions of requests per day.”
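The following Python sketch illustrates why rules keyed on a declared user agent and a published IP range can miss this kind of traffic. It is not Cloudflare's detection method, which the report describes only at a high level. The function names and the IP range are hypothetical: `192.0.2.0/24` is a reserved documentation range used here as a stand-in for a published crawler range.

```python
import ipaddress

# The declared bot names come from the article; the range below is a
# placeholder (RFC 5737 documentation block), NOT Perplexity's real range.
DECLARED_AGENT_TOKENS = ("PerplexityBot", "Perplexity-User")
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def in_published_ranges(ip: str) -> bool:
    """Check whether an IP address falls inside any published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)


def classify_request(user_agent: str, ip: str) -> str:
    """Naive classification based only on the user agent and source IP."""
    ua = user_agent.lower()
    declared = any(token.lower() in ua for token in DECLARED_AGENT_TOKENS)
    if declared and in_published_ranges(ip):
        return "declared"      # announces itself and comes from a listed IP
    if declared:
        return "unverified"    # claims to be the bot, but the IP is unlisted
    return "unidentified"      # looks like any other browser to this filter


if __name__ == "__main__":
    print(classify_request("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
    print(classify_request("Mozilla/5.0 (Macintosh) Chrome/124.0.0.0", "198.51.100.7"))
```

A request with a browser-like user agent from an unlisted address ends up as "unidentified", the same bucket as ordinary visitors, which is why a filter of this kind cannot block it on these two signals alone.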
Engadget has reached out to Perplexity for comment on Cloudflare’s report. We will update this article if we hear back.
Up-to-date information from websites is vital for companies that train AI models, especially since services like Perplexity are being used as a replacement for search engines. Perplexity has been caught bending the rules before. In 2024, several websites reported that Perplexity was still accessing their content despite being blocked in robots.txt; the company blamed third-party web crawlers it was using at the time. Later, Perplexity partnered with several publishers to share the revenue generated by ads displayed alongside their content, ostensibly as compensation for its past behavior.
Stopping companies from scraping content off the internet is likely to remain a game of whack-a-mole. In the meantime, Cloudflare has removed Perplexity’s bots from its list of verified bots and implemented a way to identify and block Perplexity’s stealth crawlers from accessing its customers’ content.