45 Comments
Steve Sailer's avatar

Can you have the LLMs rank-order their lists and then give us examples from both ends?

Robin Hanson's avatar

I linked to the LLM convos, so you can continue them as you like.

Noah's Titanium Spine's avatar

"I then asked (paid versions of) three LLMs to..."

I immediately stopped reading. Please stop doing this. I want to know what YOU think, not what the parrot says.

barnabus's avatar

It's actually interesting to see what the parrot says, because you can then see how the parrot was trained. Exactly because it is a parrot. For example, it looks like ChatGPT in particular is trained not to reveal politics.

Also, I don't think any of the LLMs actually read the novels, but instead read the internet-available literary analyses of the novels. Obviously, there are lots of these on the internet, because great novels are being studiously analysed in high school and college.

Phil Getts's avatar

That's an important point. None of the LLMs I use read anything that I ask them to. I didn't realize this for a while; they always wrote authoritatively, but were regularly wrong about the contents, even when I was confident the book or article was in their training set (e.g., I'm using Gemini, and the book is in Google Books). For some reason they can read and analyze web pages, but not their training data, and not scientific articles on web pages. I think they don't even read the web pages containing scientific articles that are open-access. Not sure. I should test that.

Catherine Caldwell-Harris's avatar

LLMs can't read web pages that were in their training data, because their training data has been crunched down into vast multidimensional arrays of vectors - the same way all your life experiences have been distributed across your own neural networks. Information in both artificial and human neural-network memories is superpositional. This is why LLMs make so many of the same overgeneralization errors in reasoning as humans.

barnabus's avatar

Still, I am perfectly capable of remembering novels I have read. I am not that much into Tolstoy, but I do remember many other 19th- and early-20th-century writers. The major problem for AI is that it apparently can't easily work from first principles. Almost always, the results one gets from an AI search come from some meta-analysis that's on the web somewhere. Enhanced by stochastic parroting, of course.

That bit about memory is totally bogus - all the big novels written up to now would consume less than 1 gigabyte. Even if they read every major literary and technical book available, with graphs represented vectorially - 10 terabytes, tops. Obviously, bitmaps would be a big problem - but without bitmaps, just text and vector graphs, it's 10 terabytes, tops.

Phil Getts's avatar

Training data for Claude is thought to be around 200 terabytes, which would be about enough for all the books published in the past 30 years. But more importantly, LLMs are neural networks in memory. They don't have tiered memory. They are usually less than a terabyte. And the training process isn't even trying to remember facts; it's trying to construct an ontology that will increase prediction of the next token. It has no propositions in its memory, no sentences, just a complex and possibly inhuman concept hierarchy.

barnabus's avatar

Prediction of the next token can sometimes be a thankless business. It reminds one of the saying: "Eat sh*t, people - millions of flies can't be wrong!"

Catherine Caldwell-Harris's avatar

Re: "The major problem for AI is that it appears it can't easily work from first principles." Agree. Nor can humans (easily). :-). Humans use heuristics, do statistical pattern matching, have reasoning biases ... oops, the very name of this blog!

Catherine Caldwell-Harris's avatar

I asked Gemini to confirm or qualify what I wrote above. Answer:

A. Verbatim Memorization

While LLMs don't have a "file system," they can sometimes "read" (reconstruct) their training data verbatim if that data was repeated thousands of times across the web (like the US Constitution or popular poems). This is known as memorization, where the weights become so tuned to a specific sequence that the original text emerges perfectly.

B. "Can't Read Web Pages"

If an LLM has a "Browsing" tool (like the one I am using now), it can read web pages and scientific articles in real-time. In this case, the text is placed into the context window (short-term memory).

In the context window, the text is not yet "crunched" into weights; it exists as active tokens that the model can analyze with high precision.

The Substack commenter's frustration likely stems from the model failing to "remember" a specific paper it was trained on, which brings us back to your point: it can't "search" its training data like a Google index.

Catherine Caldwell-Harris's avatar

Gemini continued on:

When an LLM uses a browsing tool, it operates with two distinct types of data:

The "Crunched" Memory (Training Data): This is exactly what you described—the vast, multidimensional vector space where information is superimposed. This is where I "know" things like general history or how to speak English.

The "Working" Memory (Context Window): When I browse a web page or read a scientific article, that specific text is pulled into my context window. In this state, the information is not compressed into weights yet. It’s "active" text that I can analyze with high precision, almost like a human holding a physical paper in their hands.

Phil Getts's avatar

I know. But when the pages are still on the web, they should be able to access them via the same mechanism they use for other web pages. It's peculiar that an LLM can use web pages about Moby Dick, but apparently not a web page containing the text of Moby Dick. Probably pages are sorted by expected value per byte, but it shouldn't be too hard to hack that sort routine, which is probably (though not necessarily) outside the LLM proper, to prioritize the original text.

Catherine Caldwell-Harris's avatar

I asked Gemini to comment on your paragraph.

Phil Getts is touching on a fascinating technical "glitch" in how AI currently interacts with the world, but his proposal about "prioritizing the original text" via a "sort routine" misunderstands the barrier between an AI's brain (the weights) and its tools (the browser).

Here is an evaluation of his claim and where it hits a technical wall:

1. The "Moby Dick" Paradox: Why it can't "just access it"

Getts asks: If it's on the web, why can't the AI just use the same mechanism it uses for other pages?

The "mechanism" he is referring to is Retrieval Augmented Generation (RAG) or Browsing. When I (Gemini) browse the web, I am essentially using a search engine to find a URL and then reading the text into my "working memory."

The reason an AI doesn't "just do this" for everything in its training data (like Moby Dick) is cost and intent:

The Intent Problem: If you ask me a question about Moby Dick, my neural network "thinks" it knows the answer because the book was in its training data. It doesn't "know" that its memory of the book is a "blurry" multidimensional vector. It feels like a solid fact. Therefore, the AI doesn't "trigger" the browsing tool because it doesn't realize it needs to.

The Cost/Speed Problem: Running a web search for every single noun or fact mentioned in a conversation would make the AI 100x slower and incredibly expensive to run.

2. Is there a "Sort Routine" to hack?

Getts suggests that pages are sorted by "expected value per byte" and that we should "prioritize the original text."

In the context of training, he is partially right. AI companies use "data deduplication" and "quality filtering." If there are 10,000 copies of Moby Dick on the web, the trainers might only keep one or two "high-quality" versions to save space.

However, in the context of live browsing, there is no central "sort routine" within the AI that decides which web pages are "original." The AI sees the web through the lens of a search engine (like Google or Bing).

If a search engine ranks a Wikipedia summary of Moby Dick higher than the Project Gutenberg full text, the AI will likely "read" the summary first.

The "hack" Getts wants isn't a change to the AI's internal code; it would require the AI to be programmed to distrust its own memory and always verify "source texts" via a search—which is exactly what "Agentic AI" (AI that plans and double-checks) is trying to do now.

3. The "Value per Byte" Insight

Getts is very insightful regarding the "value per byte" comment. AI training does prioritize "information density."

A scientific abstract has a high "value per byte" because it summarizes 30 pages into one paragraph.

The raw data or the full 200,000 words of a novel have a lower "value per byte" for a model trying to learn "how language works."

Because of this, models often learn the metadata and summaries of a topic better than the primary source. This is why an AI can sometimes tell you everything about a poem but struggle to recite the third stanza accurately.

The Verdict on Getts' Observation:

He is right about the symptom, but slightly off on the cure. The problem isn't a "sort routine" that needs hacking; it's the "Self-Knowledge Gap." The AI doesn't know the difference between a "perfect memory" (a web page it's currently looking at) and a "compressed reconstruction" (its training data).

The "hack" being developed currently is called "Long-Context"—giving the AI a "working memory" large enough to hold the entire text of Moby Dick at once, so it never has to rely on its "blurry" training memory at all.

Jack's avatar

If you want to be 100% sure the LLM is processing a given piece of content, you can export it as pdf (or better still, txt) and provide that as input. Gemini and the other AIs can download data from the web but it's hit or miss when/whether they do that.

Phil Getts's avatar

Right; but Robin probably can't do that with 240 books.

Hook's avatar

I don't see a ton of value in using LLMs to analyze novels when all you use is the LLM's memory of the novels. If you want an analysis, they should be provided with the full text of the novel.

le raz's avatar

Every time you write about using an LLM, you should mentally substitute the phrase "an unpaid intern." E.g., "I asked three different unpaid interns to read these books, and they concluded..." Or "I asked three unpaid interns whether my idea was novel, and they all agreed it was."

If your writing still makes sense / has value under this substitution, then fine. If it doesn't, then you are just wasting your own and everyone else's time.

G            G's avatar

that list sucks

Robin Hanson's avatar

They at least show how they try to combine many criteria.

Phil Getts's avatar

I've read 75 of the first 100, so I must be in alignment with the selection criteria. Though a list that makes Ulysses and In Search of Lost Time its top two, and even mentions The Man Without Qualities, has a strong pretentiousness component. Interesting to see The Good Soldier back; it was considered the greatest novel of its time by many modernist writers, then fell into oblivion.

But it is not a good list for trying to understand novels, because it contains so very many extremely peculiar novels.

Ben Hoffman's avatar

This has to be wrong; Pierre Bezukhov's arc in War and Peace is a very clear counterexample, and War and Peace is often cited as an exemplary or central case of the novel. My instance of Claude Opus writes:

>Pierre Bezukhov's arc involves substantial stance changes driven by several of the "rare" causes Hanson lists—particularly gaining new associates and copying them (f): first the Freemasons (especially Bazdeev), then Platon Karataev during captivity, then his post-war political circle that points toward Decembrist-adjacent views. The Karataev transformation also has elements of "it just felt right" (g), though that's intertwined with the (f) mechanism. His post-war radicalization combines (a) seeing the events of 1812 with (f) new associates.

This is just the example that immediately came to mind before even looking at your list; I would bet at even odds I can name ten comparably clear counterexamples without much trouble.

Ben Hoffman's avatar

Dorothea Brooke in Middlemarch also changes her political alignment due to some of your rare causes. My Claude Opus writes:

Dorothea's arc involves stance changes on political reform—she starts with vague philanthropic enthusiasm, marries Casaubon partly hoping to contribute to some grand intellectual/moral project, and ends up with Will Ladislaw who is actively involved in Reform politics (he becomes a Reform MP).

But the mechanisms are mixed:

(f) Gaining new associates: Will Ladislaw is clearly a new associate whose views she comes to share, and her exposure to his political circle matters.

(a) Seeing events: Her disillusionment with Casaubon comes from observing his actual work and character—that's reality-testing.

(g) It just felt right: There's something ineffable about her attraction to Will and what he represents versus the desiccated scholasticism of Casaubon.

Robin Hanson's avatar

Claude Opus is one of the LLMs I used.

Robin Hanson's avatar

If we both asked the same Claude system to do the same thing and got different answers, then all the more I want to average over many different evaluations by many different systems.

Ben Hoffman's avatar

Seems to me like the thing to do would be to inspect one of the examples where the systems strongly differ - or, short of that, to feed the output of one system into the other and ask it why & how it disagrees. I picked War & Peace and Middlemarch as examples because I've read both - War & Peace more recently, so I was more confident it was a counterexample.

I'm not sure how to make sense of this conversation. Do you care whether the data is accurate, or only whether we can average enough garbage to drown out signal? If you care whether it's accurate, then I could presumably stake some amount of money at even odds - enough to make it worth your time - that if you read War and Peace, you'd agree more with my characterization than with the one you got from your LLM instances. If you just want to average out bigger and bigger datasets without calibrating on facts, then IDK what I could possibly say that would register as relevant information.

Robin Hanson's avatar

My point is that I did the sort of thing I could do in a few hours. Not willing to spend months or years on such a project.

Phil Getts's avatar

War & Peace and Middlemarch are considered great novels, not typical novels. You could read every romance novel that Harlequin has ever published and not find one story like Dorothea Brooke's.

Ben Hoffman's avatar

And if the original claim had been about pulp rather than a top novels list that would be a relevant criticism.

Phil Getts's avatar

"The original claim" is the claim of the list. But I'm only interested in the claim that this is a good list of books to use to study novels in general. I think it's a very bad list for studying novels, because it contains so many singular, atypical novels, plus many nonfiction books, essays, short story collections, and poems: Ulysses, In Search of Lost Time, The Magic Mountain, The Man Without Qualities, Waiting for Godot (not a novel, btw), The Waste Land (also not a novel), Tristram Shandy (an anti-novel), Ficciones (also not a novel). Also, I would stratify the books by century or literary period if you use the whole list; I think you'll get more noise from changing standards than information from the extra data.

Harry's avatar

Cool list. I was amazed how many of them I’ve read over the years.

Duane Stiller's avatar

Great work. TL;DR: pick the right model and get it to show its work before classifying. Here are a few thoughts and suggestions based on similar studies we do:

A. Wide LLM variance: Given the nature of Large Language Models (LLMs) and the specific results Hanson reported, the massive divergence (ChatGPT finding 9 novels vs. Claude finding 180) is likely due to three distinct technical factors: safety refusals, recall capabilities, and prompt interpretation.

1. Safety Filters and "Refusal" (The ChatGPT Bottleneck)

The most likely reason ChatGPT only found 9 novels is its aggressive safety training regarding political topics. The Mechanism: OpenAI trains its models to avoid taking stances on political figures or sensitive social issues to prevent "hallucination" or bias. The Result: When asked to identify if a character supports a "political movement," ChatGPT often triggers a refusal response (e.g., "I cannot definitively state the character's political views") or defaults to a neutral position unless the book explicitly names a historical event (like the French Revolution in A Tale of Two Cities). It likely filtered out 95% of the novels as "too ambiguous to classify" to remain safe.

2. Deep Literary Recall vs. Summarization (The Claude Advantage)

Claude (specifically the Opus/3.5 models) is widely noted for having stronger long-context recall and a more "literary" training set compared to its peers. The Mechanism: Claude is often more willing to engage in deep literary analysis and hypothetical reasoning without hitting a "citation" guardrail. The Result: Claude likely "remembered" specific plot points or internal monologues from 180 novels that allowed it to infer a stance, whereas the other models might only have had access to high-level summaries that didn't include those specific character details.

3. Definition of "Social Movement"

The models clearly used different thresholds for what counts as a movement.

• Strict (ChatGPT/Gemini): They likely looked for Capitalized Historical Events (e.g., The Bolshevik Revolution, Abolitionism). If a character was just "fighting against conformity," they ignored it.

• Broad (Claude): Claude likely interpreted "movement" to include thematic struggles (e.g., a character fighting against "hypocrisy" or "bureaucracy"), which exists in almost every novel. This explains why it found relevant characters in nearly 75% of the list.

B. The Best Way to Prompt for This Analysis (The "Gold Standard" Method)

If you wanted to replicate this study and get high-quality, consistent data, you should not use the chat interface. You should use a Python script accessing the API.

Here is the optimal workflow for analyzing a specific list of 240 works:

Step A: The Setup (One-by-One Processing)

Do not batch them. Analyze one book per API call. This ensures the model dedicates its full "attention" to that specific book without bleeding context from the previous book in the list.

Step B: The "Chain of Thought" Prompt Structure

You need to force the model to "show its work" before it gives you the final classification. If you just ask "Is there a movement? (Y/N)", it will guess.

Recommended Prompt Template:

Role: You are an expert literary critic and sociologist.

Task: Analyze the novel: [INSERT TITLE] by [INSERT AUTHOR].

Step 1 (Recall): Briefly summarize the central conflict and the protagonist's relationship to society. List any specific groups, organizations, or ideologies the protagonist interacts with.

Step 2 (Filter): Identify whether there is a specific "Social Movement" involved.

Definition: A collective, organized effort to change laws (Political) or norms (Cultural).

Exclusion: Do not count personal vendettas, romantic rivalries, or general teenage rebellion unless it is part of a named group (e.g., "The Jacobins").

Step 3 (Analysis): If a movement exists, identify the primary cause of the character's stance change from the following list of 8 options: [List Hanson's 8 causes].

Step 4 (Output): Output your final answer in strictly valid JSON format:

JSON:

{
  "has_movement": true,
  "movement_name": "The Abolitionist Movement",
  "movement_type": "Political",
  "primary_cause_of_change": "Saw unexpected facts"
}

Step C: Why this is better

Grounding: By asking for a summary first (Step 1), you force the model to retrieve the actual plot details from its latent space before it makes a decision. This reduces hallucinations.

Normalization: By explicitly telling the model what to exclude (Step 2), you align the "strictness" of Claude and ChatGPT so you don't get one finding 180 and the other finding 9.

Parsable Data: Requesting JSON output means you can instantly turn the results into a spreadsheet without manually reading 240 chat logs.
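The workflow above can be sketched in a short Python script. This is a minimal, hypothetical sketch, not a tested pipeline: the helper names (`build_prompt`, `parse_verdict`, `call_llm`) and the exact prompt wording are my own, and the actual API call is left as a commented placeholder, since any chat-completion SDK would slot in there.

```python
import json

# One-book-per-call workflow: build a fresh chain-of-thought prompt for each
# title, then pull the final JSON verdict out of the model's reply.

PROMPT_TEMPLATE = """Role: You are an expert literary critic and sociologist.
Task: Analyze the novel: {title} by {author}.
Step 1 (Recall): Briefly summarize the central conflict and the protagonist's
relationship to society, listing any groups or ideologies they interact with.
Step 2 (Filter): Identify whether a specific "Social Movement" is involved: a
collective, organized effort to change laws (Political) or norms (Cultural).
Exclude personal vendettas and general rebellion unless tied to a named group.
Step 3 (Analysis): If a movement exists, pick the primary cause of the
character's stance change from: {causes}.
Step 4 (Output): End your reply with strictly valid JSON:
{{"has_movement": ..., "movement_name": ..., "movement_type": ...,
"primary_cause_of_change": ...}}"""

def build_prompt(title: str, author: str, causes: list[str]) -> str:
    """One fresh prompt per book, so no context bleeds between titles."""
    return PROMPT_TEMPLATE.format(title=title, author=author,
                                  causes="; ".join(causes))

def parse_verdict(reply: str) -> dict:
    """Pull the trailing flat JSON object out of a chain-of-thought reply."""
    verdict = json.loads(reply[reply.rindex("{"):])
    if not isinstance(verdict.get("has_movement"), bool):
        raise ValueError("model did not return the required boolean field")
    return verdict

# Driver loop sketch (provider-specific call elided):
# for title, author in books:
#     reply = call_llm(build_prompt(title, author, HANSON_CAUSES))
#     rows.append(parse_verdict(reply))  # rows -> spreadsheet
```

Because the verdicts come back as uniform JSON rows, the 240 results drop straight into a spreadsheet, which is the "Parsable Data" payoff described above.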

James R Smith's avatar

Does the lack of a Pynchon novel(s) reflect on the ability of the writer, or the ability of the Chat Box?

Phil Getts's avatar

sorry, I didn't need to say that.

Phil Getts's avatar

If it does, I think it reflects well. :)

Pynchon's novels, at least Gravity's Rainbow and Lot 49, are incomprehensible unless you realize they're fantasies which are hybrid allegories, meaning the real world and the allegorical world are mashed together inextricably, so there isn't one self-consistent story-world. And the allegories are about Freudian diagnoses.

Beautiful sentences, though.

Catherine Caldwell-Harris's avatar

Sigh of relief. I now understand why I couldn't get through a page of Pynchon.

Phil Getts's avatar

I'd love to see the whole list of novels and characters.

Robin Hanson's avatar

I linked to the whole list.

Michael Vassar's avatar

It's easy to see counterexamples off the top of my head. Huck Finn, for instance, changes his stance on slavery/goodness to resolve inconsistency between his prior norms and actions.

And Boromir saw an opportunity to gain power, status, or attention.

Those two examples come to mind before any examples of political opinions changed by facts.

Raskolnikov’s change just feels right.

I'm still not immediately seeing a first example of a fact-driven shift.

Robin Hanson's avatar

If I could get you to code all 240 novels, I'd compare that as a dataset.

Phil Getts's avatar

What do you mean by "code"?

You could post a Google survey asking your readers to make the same judgements on those novels which they read. (Do ask people who only saw the movie not to answer.)

Gemini says these tools exist to create a google forms survey from your text description:

### **1. Google Forms Native "Help Me Create a Form" (Gemini)**

Google has integrated its AI, **Gemini**, directly into Google Forms for eligible Google Workspace users. This is the most direct way to build a survey from a text specification. It requires Google One AI Premium or some other premium Google AI subscription.

* **How it works:** Open a new form, and a "Help me create a form" prompt box will appear. You simply type your description (e.g., *"Create a registration form for a weekend yoga retreat with dietary options and room preferences"*), and it generates the questions and options for you.

* **Source:** Google Docs Editors Help, "Create a form with Gemini in Google Forms," [support.google.com](https://support.google.com/docs/answer/16346789?hl=en), January 2026.

### **2. Google Workspace Add-ons**

If you don't have the Gemini business tier, you can install third-party add-ons from the Google Workspace Marketplace that specialize in "Prompt-to-Form" conversion.

* **MagicForm.app:** This is a popular add-on that lives in your Google Forms sidebar. You can paste your text specification or even upload a PDF, and it converts the content into a structured survey or quiz in seconds.

* *Citation:* "MagicForm.app: Automated Quiz Creation from Text," [unrealspeech.com/ai-apps/magicform-app](https://unrealspeech.com/ai-apps/magicform-app), 2025.

* **GPT for Google Forms:** Developed by Lincoln Apps, this tool allows you to enter a topic or text prompt and select the number of questions. It then generates the form and adds the questions directly to your current document.

* *Citation:* "GPT for Google Forms | Quiz Builder | ChatGPT," [workspace.google.com/marketplace](https://workspace.google.com/marketplace/app/gpt_for_google_forms_quiz_builder_chatgp/37349114302), 2025.

### **3. AI Platforms with Export Capabilities**

Some external AI platforms build the form in their own interface but allow you to sync the data to Google Sheets or export the structure.

* **Weavely.ai:** An AI-native form maker that lets you describe a form via text or voice. It specializes in advanced logic and design. While it is its own platform, it offers seamless integration with Google Sheets for response tracking.

* *Citation:* "Weavely Review 2025: The Google Form Maker AI That Builds Itself?", [skywork.ai](https://skywork.ai/skypage/en/Weavely-Review-2025-The-Google-Form-Maker-AI-That-Builds-Itself/1975260490308317184), October 2025.

* **Form Builder Plus (ChatGPT Custom GPT):** If you use ChatGPT Plus, there are custom GPTs that can generate a direct Google Form link. After you describe the survey in the chat, the GPT uses an API connection to create the file in your Google Drive.

* *Citation:* "How to build Google Forms inside ChatGPT," [weavely.ai/blog/google-forms-gpt](https://www.weavely.ai/blog/google-forms-gpt), 2025.

### **Which one should you use?**

| If you want... | Use this... |
| --- | --- |
| **Simplicity** (inside Google) | **MagicForm.app** or **Gemini** |
| **Advanced Design & Logic** | **Weavely.ai** |
| **Conversational Building** | **Form Builder Plus** (Custom GPT) |

Robin Hanson's avatar

Seems like a big project to get human readers to code 240 novels.

Phil Getts's avatar

The Gospels contain a large number of fact-driven shifts to follow Jesus, where the facts are things like "was healed by Jesus." Good luck to your LLMs in figuring out what percentage political his movement was.

Gulliver doesn't join a movement, but is converted to the POV of the Houhynyms (sp?) by seeing how nice their society is.

The Watchmen are persuaded to cooperate with the "villain" at the end of /Watchmen/ by facts. But it is not a happy ending, it's a tragedy.

The recent Young Adult fantasy series /Keeper of the Lost Cities/ has a protagonist who is abducted into a utopia, and gradual revelations of clues and facts turn her against it a few books into the series, then back into alliance with it as several different movements and races in a conflict, each react and adapt to events and disclosures of fact.

Does "fact-driven" include people converted by losing their belief in facts? eg, 1984.