Discussion about this post

Daniel Haggard

I don't understand the strategy of using LLMs to generate the scores. On what basis can we have any confidence that the generated numbers carry any empirical weight at all?

Besides that, what about low-dependence contexts that seem extremely thick? E.g. the internet (discussion forums, etc.), where tolerance seems extremely low.

hwold

Please always cite the exact model you're using when you use an LLM.

In particular: ChatGPT is an outlier here, and I'm not convinced dismissing it (as the median does) is the right call. Is it 5.2 auto? Instant? Thinking? If auto, did it get routed to instant or thinking?

What about other models? Is it Opus 4.5 or Sonnet 4.5? Gemini Pro or Flash?

The negative correlations sound very fishy, too. Increasing power means decreasing thickness? What? Granted, they are very small, but still.

