How to get cited by ChatGPT, Perplexity and Google AI Overviews

This article provides a concrete checklist for optimizing website content to be accurately retrieved and cited by major generative AI engines, including ChatGPT, Perplexity, and Google AI Overviews.

By the Heron team · Published · Reviewed for accuracy

1. Crawler Access and Bot Recognition

Generative AI engines consume web data through two main mechanisms: traditional search engine crawlers (like Googlebot) and specialized AI crawlers (such as GPTBot, ClaudeBot, and PerplexityBot). To make your content available for citation, first verify that your robots.txt file does not block these user agents. Many sites block generic bots to save bandwidth, and a broad rule written for that purpose can shut out AI crawlers as well. Vendors often run more than one agent: OpenAI, for instance, documents GPTBot for training data, OAI-SearchBot for search results, and ChatGPT-User for user-initiated fetches, so check each vendor's current list before writing rules.
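The robots.txt check described above can be sketched with Python's standard library. This is a minimal sketch, not a definitive audit: the `SAMPLE_ROBOTS` file and the `blocked_agents` helper are hypothetical, and the agent list only covers the tokens named in this article.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this article; confirm current tokens in vendor docs.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"]

# Hypothetical robots.txt that blocks GPTBot and allows everything else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, url: str, agents=AI_USER_AGENTS) -> list:
    """Return the agents that robots_txt disallows from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS, "https://yourdomain.com/docs/api-reference"))
```

To check a live site, the same parser can load the file with `set_url()` and `read()` instead of `parse()`.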

You can test your site’s accessibility to AI crawlers by checking your server logs for requests from GPTBot (used by OpenAI) and ClaudeBot (used by Anthropic). Also make sure your critical content is present in the initial HTML response rather than injected by JavaScript: Googlebot renders JavaScript, but many AI crawlers appear to fetch raw HTML without executing scripts. For robots.txt itself, note that Google has retired the standalone 'robots.txt Tester' in Search Console. The robots.txt report that replaced it shows how Google fetches and parses your file, but it does not test third-party agents such as GPTBot, so check those rules with a separate robots.txt parser.
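A log check like the one described above could look like the following sketch. The log lines in `SAMPLE_LOG` are made up and simplified, and `count_ai_bot_hits` is a hypothetical helper; real user-agent strings are longer and can be spoofed, so treat the counts as a first signal rather than proof.

```python
from collections import Counter

# Tokens to look for in the user-agent field (names taken from this article).
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

# Made-up access-log lines in a simplified combined log format.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /docs HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
    '203.0.113.9 - - [01/Jan/2025:00:00:07 +0000] "GET /blog HTTP/1.1" 200 734 "-" "Mozilla/5.0; compatible; ClaudeBot/1.0"',
    '198.51.100.2 - - [01/Jan/2025:00:01:12 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
    '203.0.113.5 - - [01/Jan/2025:00:02:30 +0000] "GET /faq HTTP/1.1" 200 333 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
]


def count_ai_bot_hits(log_lines, tokens=AI_BOT_TOKENS) -> Counter:
    """Count log lines whose text mentions each bot token (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    for bot, hits in count_ai_bot_hits(SAMPLE_LOG).items():
        print(f"{bot}: {hits} request(s)")
```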

For Perplexity and other search-integrated AI tools, ensure your pages are indexed by the search engines those tools draw on. Google AI Overviews is built on Google’s index, ChatGPT search has relied on Bing among other sources, and Perplexity operates its own crawler (PerplexityBot) alongside third-party results. Pages marked noindex are dropped from these indexes and will likely be excluded from AI-generated answers. Orphaned pages are only discovered if other sites link to them, so add them to your internal linking and sitemap.
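One way to spot noindexed pages is to scan the HTML for a robots meta tag. The sketch below assumes the directive is set in a `<meta name="robots">` or `<meta name="googlebot">` tag; it does not inspect the `X-Robots-Tag` HTTP header, which can carry the same directive. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in {"robots", "googlebot"}:
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindexed(html: str) -> bool:
    """Return True if the page's meta tags include a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(is_noindexed(page))
```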

2. Implementing llms.txt for Direct AI Access

The llms.txt file is a proposed convention, not a ratified standard: a Markdown file placed in the root directory of your website that serves as a 'menu' for AI tools. Unlike robots.txt, which controls access, llms.txt points AI models to the pages that matter most for answering questions. It lets you group links under headings by content type (e.g., documentation, blog posts, FAQs) with a brief description of each link, which can help a model prioritize high-value content during retrieval. Support among the major AI crawlers is not guaranteed, so treat llms.txt as a low-cost supplement to the other steps in this checklist rather than a replacement for them.

To implement llms.txt, create a Markdown file at yourdomain.com/llms.txt. Under the proposal, the file starts with an H1 containing the site or project name, optionally followed by a blockquote summary, and then H2 sections that each hold a list of links. Each list item takes the form of a Markdown link, followed by a colon and a description. For example: - [API reference](https://yourdomain.com/docs/api-reference): A comprehensive guide to API endpoints. This explicit labeling helps AI models distinguish between core product information and peripheral marketing content, which may increase the likelihood of your technical pages being cited in detailed answers.
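The llms.txt proposal describes a Markdown layout: an H1 title, an optional blockquote summary, and H2 sections containing link lists. The following is a minimal sketch that generates a file in that layout; the `build_llms_txt` helper, the site name, and the section names are hypothetical, and the URL is the placeholder used in this article.

```python
def build_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render an llms.txt document.

    sections maps a heading to a list of (title, url, description) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, description in links:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    content = build_llms_txt(
        "Example Docs",
        "Product documentation and guides.",
        {
            "Docs": [
                (
                    "API reference",
                    "https://yourdomain.com/docs/api-reference",
                    "A comprehensive guide to API endpoints.",
                )
            ]
        },
    )
    print(content)
```

Generating the file from your CMS or sitemap data, rather than editing it by hand, makes it easier to keep in step with newly published pages.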

The llms.txt proposal has no 'ignore' directive; excluding pages from crawling remains the job of robots.txt. What you can do is leave unsuitable pages, such as printer-friendly versions or pages with heavy media, out of the file, and place secondary links under a section titled 'Optional', which the proposal reserves for material that can be skipped when a shorter context is needed. By curating your content feed, you reduce noise in the AI’s context window, which should lead to more accurate and relevant citations. Update the file as you publish new content so it reflects your latest pages.

3. Structured Data and Schema.org Markup

Structured data using Schema.org markup is critical for AI models to understand the context and relationships between entities on your page. While human readers rely on visual hierarchy, AI parsers look for explicit labels such as Article, FAQPage, HowTo, and Product. Implementing FAQPage schema is particularly effective, as it directly maps question-and-answer pairs that AI models frequently extract for concise responses.
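As a rough illustration of the FAQPage markup mentioned above, this sketch builds the JSON-LD structure from question-and-answer pairs using the Schema.org types `FAQPage`, `Question`, and `Answer`. The `faq_jsonld` helper is hypothetical; the sample pair is shortened from this article's own FAQ.

```python
import json


def faq_jsonld(pairs) -> dict:
    """Build a Schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


if __name__ == "__main__":
    faq = faq_jsonld(
        [
            (
                "How does llms.txt differ from robots.txt?",
                "Robots.txt controls which pages crawlers can access, "
                "while llms.txt provides a curated menu of content.",
            )
        ]
    )
    print(json.dumps(faq, indent=2))
```

The printed JSON would go inside a `<script type="application/ld+json">` element on the page that displays the same questions and answers.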

Ensure that your key entities (people, organizations, products) are clearly defined using JSON-LD format. Google recommends JSON-LD over microdata, and it is generally easier to parse programmatically because the structured data sits in a single block rather than being interleaved with the HTML. Include properties like author, datePublished, and mainEntity to provide temporal and authoritative context. For example, a news article with a clear author and publication date gives a retrieval system more evidence of credibility than an undated blog post.
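The properties named above can be emitted as an Article object in JSON-LD. This is a minimal sketch under assumed inputs: the `article_jsonld` and `as_script_tag` helpers, the author name, and the dates are all hypothetical placeholders.

```python
import json
from datetime import date
from typing import Optional


def article_jsonld(headline: str, author_name: str, published: date,
                   modified: Optional[date] = None) -> dict:
    """Build a Schema.org Article object with author and date properties."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": published.isoformat(),
    }
    if modified is not None:
        data["dateModified"] = modified.isoformat()
    return data


def as_script_tag(data: dict) -> str:
    """Serialize to a JSON-LD script element, escaping '</' so the tag cannot close early."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # Placeholder author and date, for illustration only.
    article = article_jsonld("Example headline", "Jane Doe", date(2024, 1, 15))
    print(as_script_tag(article))
```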

Avoid overloading your structured data with irrelevant properties. Keep the markup clean and focused on the core content. Use tools like Google’s Rich Results Test to validate your schema. Consistent and accurate structured data helps AI models disambiguate similar terms and correctly attribute information to your source, reducing the risk of hallucination or misattribution.
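Alongside a validator such as the Rich Results Test, a quick local check can catch missing properties before publishing. The sketch below extracts JSON-LD blocks from a page and reports which of a chosen set of properties are absent. The helper names and the default `required` list are assumptions, and objects nested under `@graph` are not handled.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def missing_properties(html: str, required=("author", "datePublished")) -> list:
    """Return (type, missing_keys) for each top-level JSON-LD object in html."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    report = []
    for raw in extractor.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            report.append(("<invalid JSON>", list(required)))
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                missing = [key for key in required if key not in item]
                report.append((item.get("@type", "<no @type>"), missing))
    return report


if __name__ == "__main__":
    page = (
        '<script type="application/ld+json">'
        '{"@type": "Article", "author": {"@type": "Person", "name": "Jane Doe"}}'
        "</script>"
    )
    print(missing_properties(page))
```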

4. Writing for Self-Contained Citability

AI models favor content that is self-contained, meaning a paragraph or section can be understood and cited without requiring the reader to navigate to other pages. Write clear, declarative sentences that state facts directly. Avoid vague references like 'as mentioned above' or 'see below' without providing sufficient context within the sentence itself. This makes it easier for AI to extract a coherent snippet for an answer.
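A simple lint pass can flag the kind of vague references described above. In this sketch the first two phrases come from this article; the others are illustrative additions, and the `find_vague_references` helper is hypothetical. A match is a prompt to review the paragraph, not proof that it lacks context.

```python
import re

# First two phrases are from the article; the rest are illustrative additions.
VAGUE_REFERENCES = [
    "as mentioned above",
    "see below",
    "as noted earlier",
    "as discussed previously",
]


def find_vague_references(text: str, phrases=VAGUE_REFERENCES) -> list:
    """Return (paragraph_index, phrase) for each vague reference found.

    Paragraphs are separated by blank lines and indexed from 0.
    """
    findings = []
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, paragraph, re.IGNORECASE):
                findings.append((index, phrase))
    return findings


if __name__ == "__main__":
    draft = (
        "OpenAI released ChatGPT in November 2022.\n\n"
        "As mentioned above, the tool is popular. See below for details."
    )
    for index, phrase in find_vague_references(draft):
        print(f"paragraph {index}: '{phrase}'")
```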

Use specific, named entities and precise terminology. Instead of saying 'the company released a new tool,' say 'OpenAI released ChatGPT in November 2022.' Specificity increases the likelihood of your content being selected as a precise answer. AI models also prefer content that is well-structured with clear headings (H2, H3) that act as semantic anchors, helping parsers identify the start and end of relevant information blocks.
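To see how headings divide a page into extractable blocks, the following sketch splits HTML into sections keyed by their H2 and H3 headings. It is a simplified model of what a parser might do, not a description of any particular AI engine; the class and function names are hypothetical, and text before the first heading is ignored.

```python
from html.parser import HTMLParser


class HeadingSectioner(HTMLParser):
    """Group page text under the H2/H3 heading that precedes it."""

    HEADINGS = {"h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []  # each entry: [heading_parts, body_parts]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.sections.append([[], []])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first heading
        target = 0 if self._in_heading else 1
        self.sections[-1][target].append(text)


def sections_by_heading(html: str) -> list:
    """Return (heading, body_text) tuples in document order."""
    parser = HeadingSectioner()
    parser.feed(html)
    return [(" ".join(head), " ".join(body)) for head, body in parser.sections]


if __name__ == "__main__":
    page = (
        "<h2>Crawler access</h2><p>GPTBot is used by OpenAI.</p>"
        "<h3>Logs</h3><p>Check server logs.</p>"
    )
    for heading, body in sections_by_heading(page):
        print(f"{heading}: {body}")
```

If a section's body only makes sense alongside the previous section, that is a sign the block is not yet self-contained.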

Maintain a consistent tone and avoid excessive marketing jargon. Factual, objective statements are easier for an AI model to extract and restate than promotional claims, which tend to be vague or unverifiable. If your page contains both technical details and marketing copy, ensure the technical details are clearly separated or highlighted. This helps AI models distinguish between core facts and supplementary commentary, which supports more accurate citations in technical or professional queries.

5. Strengthening E-E-A-T for AI Trust

Experience, Expertise, Authoritativeness, and Trustworthiness (E-E-A-T) is a framework from Google’s search quality rater guidelines, and the signals behind it matter for AI answers because many AI engines retrieve sources through search indexes that rank on authority. Links are one such signal. Ensure your site has a strong internal linking structure that connects related topics, and a robust external backlink profile from other reputable sites. High-quality backlinks act as votes of confidence, signaling to search systems, and to the AI engines built on them, that your content is a trusted source.

Display clear author bios and credentials, especially for expert-authored content. AI models can parse author information to assess expertise. Pages with named authors who have established profiles on other platforms (like LinkedIn or Twitter) are more likely to be cited as authoritative. Additionally, include publication dates and update timestamps to help AI models prioritize fresh and current information, particularly for time-sensitive topics.

Transparency in sourcing and methodology also enhances E-E-A-T. If you present data or statistics, cite the original sources within the text. AI models often cross-reference multiple sources to verify facts. By providing clear citations and references, you make it easier for AI to validate your claims and include your content in multi-source answers. This reduces the risk of your content being overlooked in favor of more transparent competitors.

Key takeaways

FAQ

Does blocking Googlebot also block ChatGPT and Perplexity?
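Not directly. Blocking Googlebot removes your pages from Google Search, and therefore from Google AI Overviews, but it does not stop crawlers run by OpenAI or Perplexity, which follow the robots.txt rules written for their own user agents (such as GPTBot, OAI-SearchBot, and PerplexityBot). You should ensure your robots.txt allows these specific user agents, even if you restrict other generic bots.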
How does llms.txt differ from robots.txt?
Robots.txt controls which pages crawlers can access, while llms.txt provides a curated menu of content, describing what each page is about. This helps AI models prioritize high-value pages and understand context more efficiently during retrieval.
Is JSON-LD better than microdata for AI citation?
Generally, yes. Google recommends JSON-LD, and it is less prone to formatting errors because it keeps structured data in one block, separate from the HTML content. That separation makes it simpler for any parser, including those feeding AI models, to extract key entities and relationships.
Can AI models cite content from pages with heavy JavaScript?
Only if the content reaches the crawler. Googlebot renders JavaScript, but many AI crawlers, reportedly including GPTBot and ClaudeBot, fetch HTML without executing scripts, so content that appears only after client-side rendering may never be seen. Ensuring your critical content is present in the initial HTML load is the safest way to make it extractable.