In its August 2025 report, Cloudflare accused Perplexity AI of deliberately bypassing website protection systems and technical restrictions designed to prevent content from being harvested for AI training. The company identified suspicious activity across more than 32,000 websites over a three-month period. According to the report, Perplexity's bots disguised their digital identifiers and routed requests through proxy servers to access material that publishers had explicitly blocked, including content behind paywalls. This is particularly troubling for publishers whose business models depend on selling premium content.
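The "explicit blocks" at issue are typically robots.txt directives, which a well-behaved crawler is expected to consult before fetching a page. A minimal sketch of that check, using Python's standard `urllib.robotparser` and a hypothetical publisher robots.txt (the bot name and URL here are illustrative, not taken from the report):

```python
from urllib import robotparser

# Hypothetical robots.txt of the kind publishers use to opt out of
# AI crawlers: Perplexity's bot is disallowed, everyone else allowed.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler runs this check before every fetch; a bot that
# disguises its user agent sidesteps exactly this rule.
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))   # True
```

Because the rules are keyed to the self-reported user-agent string, a crawler that identifies itself as a generic browser is not caught by them, which is why Cloudflare's report treats identifier disguising as deliberate evasion rather than a configuration error.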
The report noted that problematic scraping activity spiked shortly after the April launch of Perplexity Pro Search. By May, the company’s bots were visiting an average of 1.7 million restricted pages each day. A new Cloudflare defence system introduced in June blocked 119 million unauthorised attempts, 78 million of which were traced to Perplexity. While the company has denied the accusations, its new Chief Technology Officer, Mike Schroepfer, admitted in early August that technical errors may have occurred. He pledged a full review and proposed a simpler opt-out mechanism for publishers.
The controversy highlights the regulatory gaps surrounding the collection of training data for AI. Major publishers – including The New York Times, The Guardian and The Washington Post – have confirmed that Perplexity accessed their content despite explicit prohibitions. This case reflects a wider industry conflict, with content owners increasingly resorting to legal action against AI companies. Although Perplexity reached 15.2 million monthly active users in July, a 127% year-on-year increase, the scandal poses a significant reputational risk at a time when data protection and AI ethics are becoming ever more important.