At Scrunch we have customers across a wide range of sizes, industries and technology stacks. We help them understand and optimize how their brand shows up across AI platforms – since AI is increasingly sourcing information from the web in real time, that often means helping them optimize their website, CMS and published content so it can be used by AI platforms like ChatGPT Search, Perplexity and Google AI Overviews.
AI optimization is a fast-evolving category and best practices are changing as the platforms evolve. The five foundational issues below are the most common and most impactful ones we see. As a bonus, solving them can also improve traditional SEO and the experience for human website visitors – they’re low-risk changes that can have a big impact.
To be findable in web search, your content needs to be accessible to search engine indexing bots like Googlebot or Bingbot. These bots check your website periodically for new and updated content and make that content findable in their search index, usually within a few days of visiting. Large sites typically see Googlebot activity every day; smaller properties might see Googlebot drop by once or twice a week. Luckily, most website platforms understand the importance of these bots and will ensure they are not blocked unless you specifically configure them otherwise.
Most AI search engines start by using existing search indexes such as Bing’s or Google’s. But they also retrieve content from your website on demand when it shows up as relevant to user questions. We’ve seen many examples where default platform configurations block these bots.
We’ve also seen cases where IT and security teams have conservatively blocked all bots from providers like OpenAI in order to prevent OpenAI from training on their content. However, OpenAI’s documentation clearly states that it won’t train models on content retrieved on behalf of users – you can have your cake and eat it too by specifically permitting the “real time retrieval” bot and blocking the training data crawler.
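If you manage bot access through robots.txt, a minimal sketch of that split looks like the following. It uses the crawler names OpenAI documents at the time of writing (GPTBot for training data collection, OAI-SearchBot for search indexing, ChatGPT-User for user-initiated retrieval); verify the current names against OpenAI’s documentation before relying on them.

```
# Block OpenAI's training data crawler
User-agent: GPTBot
Disallow: /

# Allow search indexing and real-time retrieval on behalf of users
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```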
How to check for this issue: Review website configurations (firewall rules, robots.txt) for bot blocking rules. Try asking specific questions in ChatGPT Search that mention your brand and see if ChatGPT will cite your relevant content. Or use Scrunch’s Site Audit feature to assess and remediate issues in an automated and scalable way.
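For a quick scripted spot-check, a sketch like this (Python, using the requests library) fetches a page while presenting an AI bot user agent and compares the response to a normal browser fetch. It only surfaces user-agent-based blocking; rules keyed to the bots’ published IP ranges won’t show up this way, and the user agent strings below are illustrative rather than authoritative.

```python
import requests

URL = "https://www.example.com/"  # replace with a page you expect AI search to cite

# User agent strings are abbreviated/illustrative; check each provider's docs for current values.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "ChatGPT-User": "Mozilla/5.0; compatible; ChatGPT-User/1.0; +https://openai.com/bot",
    "OAI-SearchBot": "OAI-SearchBot/1.0; +https://openai.com/searchbot",
}

for name, ua in USER_AGENTS.items():
    resp = requests.get(URL, headers={"User-Agent": ua}, timeout=15)
    # A 401/403 status, or a much smaller body, for the bot user agents suggests UA-based blocking.
    print(f"{name:15s} status={resp.status_code} bytes={len(resp.content)}")
```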
How to fix it: Update configuration with your website platform, web application firewall or anti-bot provider. Common providers here include AWS WAF Bot Control, Cloudflare, Datadome, and Imperva.
For more information: Read more about AI User Agents
Current-generation AI retrieval bots can’t execute JavaScript. They get the “web 1.0” version of your website that includes only the text of the web page as returned by your website’s server, before any JavaScript executes or dynamic content is injected into the page.
In the most severe version of this problem, some websites have no content at all without JavaScript – everything interesting is dynamically added to the page via JavaScript code and API calls (a pattern web developers sometimes call a “single-page application” (SPA) or “app shell”). Websites that behave this way cannot be accurately cited in AI search today.
Googlebot originally did not execute JavaScript either. As JavaScript and dynamic content became more pervasive on the web, Google officially launched support for indexing JavaScript in 2014 – over ten years ago! Understandably, web developers have come to rely on this capability to make their modern, JavaScript-heavy web experiences accessible in search, and Google’s guidelines say using JavaScript is fine.
You might reasonably ask: Why don’t AI search retriever bots like OpenAI’s ChatGPT-User execute JavaScript, then, since that’s how Google does it? An important difference between Googlebot and agents like ChatGPT-User is that Googlebot indexes content asynchronously – before, and regardless of whether, a user is asking a question about it.
This means that Googlebot is not as sensitive to how long it takes to parse, execute and understand any given web page. In contrast, ChatGPT retrieves content synchronously, immediately after a user asks a relevant question. In addition, ChatGPT usually retrieves several web pages in parallel to find additional sources of information.
These factors mean that ChatGPT has a very limited “time budget” to retrieve and understand pages without negatively impacting the user experience – in other words, without making the user wait too long.
How to check for this issue: Try viewing your web pages with JavaScript turned off in your browser. In Chrome, you can do this by opening Developer Tools and checking the “Disable JavaScript” box in its settings. (This will turn off JavaScript only while Developer Tools are open, so it won’t interfere with your normal web browsing.) Scrunch’s Site Audit can also detect this issue, comparing the “human” (browser-rendered, JavaScript-enabled) and “AI” views of your pages.
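A scripted version of the same check: fetch the raw server response (no JavaScript execution, just like an AI retrieval bot) and see how much readable text is actually there. Here’s a rough sketch using the requests and BeautifulSoup libraries, with a placeholder URL.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://www.example.com/some-article"  # a page whose content matters for AI search

html = requests.get(URL, timeout=15).text
soup = BeautifulSoup(html, "html.parser")

# Drop script/style blocks, then look at the remaining visible text.
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()
text = " ".join(soup.get_text(separator=" ").split())

print(f"Visible text without JavaScript: {len(text)} characters")
print(text[:500])  # eyeball it: is the meaningful content here, or just an app shell?
```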
How to fix it: Modern technology stacks mean you can have plenty of dynamic, JavaScript-controlled behavior on your website while still “pre-rendering” the meaningful content – for example, articles, product descriptions or reviews – in the initial response from your web server. This is better for AI search, better for human visitors (they’ll see content faster) and even better for Googlebot. For cases where it’s not possible to revamp your tech stack to support this, or where you want fine-grained control over how AI bots access your content, Scrunch’s AI Middleware feature can help.
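To make the idea concrete, here is a deliberately simplified sketch in Python/Flask (with a hypothetical load_article helper) contrasting an “app shell” response with a pre-rendered one. A real stack would normally rely on its framework’s server-side rendering or static generation support rather than hand-built routes like these.

```python
from flask import Flask

app = Flask(__name__)

def load_article(slug):
    # Hypothetical helper: in a real stack this would come from your CMS or database.
    return {"title": "Example article", "body": "<p>The content that should be citable.</p>"}

# App-shell style: the server returns an empty shell and JavaScript fills it in later.
# AI retrieval bots (and JavaScript-disabled browsers) see no meaningful content here.
@app.route("/shell/<slug>")
def shell(slug):
    return '<div id="root"></div><script src="/static/app.js"></script>'

# Pre-rendered style: the meaningful content is already in the initial HTML response.
# JavaScript can still enhance the page after it loads.
@app.route("/prerendered/<slug>")
def prerendered(slug):
    article = load_article(slug)
    return f"<article><h1>{article['title']}</h1>{article['body']}</article>"
```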
For more information: Check out Google’s web.dev article on different rendering approaches on the web.
This one is simple, and often shows up in conjunction with issue #2 – ChatGPT and other AI tools love text content, particularly straightforwardly written “prose” content like articles, product descriptions, and FAQs.
Although newer AI models underlying tools like ChatGPT can natively understand images, documents, audio and even video, these capabilities aren’t “hooked up” to the AI search features, at least not yet. So if you have resources on your site that just consist of embedded videos, diagrams, audio players (such as podcast players), etc. – ChatGPT and others can’t make use of them.
Worse: AI search platforms will often try to retrieve content from these types of pages if they are otherwise ranked highly in search. But because AI search retrieves a limited number of pages for each user question, these pages may end up “wasting a slot” – depriving you of the opportunity to get a more contentful page cited instead.
The inverse problem can also happen, although it’s rarer. If you have tons of text content on a single page, AI may struggle to retrieve specific facts or accurately summarize the overall content.
The AI-savvy might ask “what about ‘long context’ models like Gemini?” here. Two things – first, the most common AI search experiences today are facilitated via GPT-4o, GPT-4o-mini, or derivatives of Llama 3.3, all of which have a context window of 128K tokens. The era of million-token AI search contexts isn’t quite here yet. Second, remember that AI search is conversational and can pull in multiple sources, so you’re sharing the AI’s “token budget” with lots of other data!
A rule of thumb: a medium-length New Yorker article will work fine. An average novel is longer than we’d recommend, but may work for lower-funnel content where there are few competing sources. If your web pages are the same length as a James Joyce novel, that’s too long.
How to check for this issue: You can assess your pages manually. Scrunch’s Site Audit feature also automatically checks for content length.
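For a rough, scripted sense of how long a page is in AI terms, you can count tokens with OpenAI’s tiktoken library after stripping the page down to visible text. The thresholds in this sketch are illustrative guesses on our part, not published platform limits.

```python
import requests
import tiktoken  # pip install tiktoken (a recent release, for the gpt-4o encoding)
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://www.example.com/some-article"  # placeholder

html = requests.get(URL, timeout=15).text
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()
text = " ".join(soup.get_text(separator=" ").split())

tokens = len(tiktoken.encoding_for_model("gpt-4o").encode(text))
print(f"~{tokens} tokens of visible text")

# Illustrative guideposts only (roughly "magazine article" vs "novel" territory):
if tokens < 200:
    print("Very little text: consider adding transcripts, captions or prose.")
elif tokens > 20_000:
    print("Very long: consider splitting into logically organized sections.")
```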
How to fix it: For pages that are too short, enhance them with additional – helpful and relevant! – text. For example, add transcripts to podcast episode pages and captions to diagrams. For pages that are too long, try breaking content into logically organized sections (for example, organizing a long FAQ page into subpages for different topics).
We specifically don’t recommend simply paginating a long article for AI search optimization – this usually just leads to ChatGPT et al. retrieving a single page.
As described in issues #2 and #3, AI search engines read the plain text version of your web page, before it’s enhanced by JavaScript, and without seeing or understanding media content on the page like images and video.
Let’s say you’ve got a page that’s entirely rendered on the server, requires no JavaScript, and has all the important information in text. Can things still go wrong? Yes.
We frequently see pages where the text content on the page is returned in a fragmented and disorganized way after going through the processing pipelines that AI search engines use to parse content. (For example, turning HTML – the language of web pages – into simple Markdown-formatted text is a very common technique used in AI search engines as well as in-house “retrieval augmented generation” tools to make text more understandable to AI models.)
This doesn’t mean you need to have perfect “semantic HTML”. However, highly structured information such as pricing tables, hours of availability, calendars, etc. can have a high risk of being misinterpreted by AI, particularly with some commonly-used visual authoring tools that we’ve seen become popular with marketing teams.
For example, we’ve seen pricing tables that look like this to humans:
| Product | Silver Plan Price | Gold Plan Price |
| --- | --- | --- |
| Foo | $100 | $120 |
| Bar | $200 | $220 |
but come out looking like this to AI:
ProductSilver Plan PriceGold Plan Price
FooBar
100200
120220
Powerful AI models are generally very good at understanding even highly fragmented information from context clues, but as you can imagine, they aren’t guaranteed to answer pricing questions correctly given input like this.
How to check for this issue: Experienced HTML authors can take a peek at page markup using Developer Tools and spot potential problem areas (remember to turn JavaScript off first!). For folks who aren’t HTML experts, using the “Reader Mode” feature in your browser can often give you a quick preview of what your content will look like with formatting and layouts simplified or eliminated. If you’re using Scrunch Site Audit, the AI Preview section can show you a more accurate version, and our Content Clarity check can identify cases that are especially likely to be problematic.
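You can also approximate the HTML-to-Markdown step described above with a small script. html2text is one commonly used Python library for this; the pipelines AI platforms actually run will differ, so treat the output as a rough preview rather than an exact reproduction.

```python
import requests
import html2text  # pip install html2text

URL = "https://www.example.com/pricing"  # a page with structured content you care about

html = requests.get(URL, timeout=15).text

converter = html2text.HTML2Text()
converter.ignore_links = False
converter.ignore_images = True
converter.body_width = 0  # don't hard-wrap lines

print(converter.handle(html))
# Read the output: do your tables, prices and hours still make sense as plain text?
```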
How to fix it: You can update your HTML structure and content ordering so that meaning is preserved even when the markup is stripped out. Alternatively, you can provide simpler “semi-structured” versions of content that can’t be mangled – for example, a pricing FAQ section that just repeats your pricing structure in prose.
For SEO practitioners in some verticals, adding structured “schema” metadata to pages – in formats like JSON-LD – can be very impactful. Google and Bing use this kind of metadata to enrich search results with structured product cards, information about location & hours for local businesses, biographical data, and more.
Although AI search engines are improving how they handle things like local search results, AI is still primarily driven by unstructured text. In fact, the whole reason AI search works so magically (sometimes!) is because of the inherent power of large language models to extract knowledge, summarize and overall make sense of the chaotic sea of text on the Internet.
Long story short – adding JSON-LD schema to pages is awesome for Google, especially for focused experiences like Google Maps or Shopping. For AI search, make sure to have solid text content that is readable by humans and AI on the page as well. Review the above four tips to make sure your textual content is in good shape for AI search, and you’re good to go!