Brains first evolved to do concrete mental tasks, like chasing prey. Then language evolved, to let brains think together, such as on how to chase prey together. Words are how we share thoughts.
So we think a bit, say some words, they think a bit, say some words, and so on. Each time we hear some words we update our mental model of their thoughts, which also updates us about the larger world. Then we think some more, drawing more conclusions about the world, and seek words that, when said, help them to draw similar conclusions. Along the way, mostly as a matter of habit, we judge each other’s ability to think and talk. Sometimes we explicitly ask questions, or assign small tasks, which we expect to be especially diagnostic of relevant abilities in some area.
The degree to which such small-task performance is diagnostic of abilities at the more fundamental human task of thinking together varies a lot. It depends, in part, on how much people are rewarded merely for passing those tests, and how much time and effort they can focus on learning to pass tests. We teachers are quite familiar with such “teaching to the test”, and it is often a big problem. There are many topics that we don’t teach much because we see that we just don’t have good small test tasks. And arguably schools actually fail most of the time; they pretend to teach many things but mostly just rank students on general abilities to learn to pass tests, and on inclinations to do what they are told, abilities which can predict job performance.
Which brings us to the topic of recent progress in machine learning. Google just announced its PaLM system, which fit 540 billion parameters to a “high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases”, in order to predict from past words the next words appropriate for a wide range of small language tasks. Its performance is impressive; it does well compared to humans on many such tasks. And yet it still basically “babbles”; it seems not remotely up to the task of thinking together with a human. If you talked with it for a long time, you might well find ways that it could help you. But still, it wouldn’t think with you.
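For concreteness, here is a minimal sketch of the next-word-prediction setup such systems are trained for. PaLM itself is not publicly available, so this uses the small public GPT-2 model via the Hugging Face transformers library as a stand-in; the prompt and sampling settings are purely illustrative.

```python
# Minimal illustration of autoregressive next-token prediction, using the
# public GPT-2 model as a stand-in for PaLM (which is not publicly released).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To chase prey together, the hunters first"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The model only ever answers one question: given these past tokens, which
# token comes next? Repeating that step is what produces fluent "babble".
output_ids = model.generate(input_ids, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```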
Maybe this problem will be solved by just adding more parameters and data. But I doubt it. I expect that a bigger problem is that such systems have been trained on these small language tasks, instead of on the more fundamental task of thinking together. Yes, most of the language data on which they are built is from conversations where humans were thinking together. So they can learn well to say the next small thing in such a conversation. But they seem to be failing to infer the deeper structures that support shared thinking among humans.
It might help to assign such a system the task of “useful inner monologue”. That is, it would start talking to itself, and keep talking indefinitely, continually updating its representations from the data of its internal monologue. The trick would be to generate these monologues, and to do these updates, so that the resulting system got better at other useful tasks. (I don’t know how to arrange this.) While versions of this approach have been tried before, the fact that it isn’t the usual approach suggests that it doesn’t now produce gains as quickly, at least on these small language tasks. Even so, if those are misleading metrics, this approach might do more to produce real progress toward artificial thinking.
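To make the idea concrete, a naive version of such a loop might look like the sketch below, again using public GPT-2 via transformers as a stand-in. Note that as written it would mostly just reinforce the model’s existing habits: the hard open part, tying each update to improved performance on other useful tasks, is left as a comment, because that is exactly the part no one yet knows how to arrange.

```python
# A naive sketch of an "inner monologue" training loop: the model talks to
# itself, then updates its weights on its own output. The crucial missing
# step (making updates track usefulness on other tasks) is noted at the end.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Seed the monologue with an arbitrary illustrative prompt.
prompt_ids = tokenizer("Let me think this through:", return_tensors="pt").input_ids

for step in range(100):
    # 1. The system talks to itself, continuing from the seed.
    with torch.no_grad():
        monologue = model.generate(prompt_ids, max_new_tokens=64, do_sample=True)

    # 2. It updates its representations on its own words, here via the
    #    ordinary next-token-prediction loss on the generated text.
    loss = model(monologue, labels=monologue).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # 3. The missing piece: shape steps 1 and 2 so that scores on *other*
    #    useful tasks improve, rather than mere self-consistency.
```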
I will sit up and take notice when the main improvements to systems with impressive broad language abilities come from such inner monologues, or from thinking together on other useful tasks. That will look more like systems that have learned how to think. And when such abilities work across a wide scope of topics, that will look to me more like the proverbial “artificial general intelligence”. But I still don’t expect to see that for a long time. We see progress, but the road ahead is still quite long.
Robin, by the time we reach your benchmark of Noticeable Improvement, we are already dead.
Re: research on thinking together. There is a (fairly old) line of work by OpenAI that goes in this direction: https://openai.com/blog/lea... Is that closer to what you had in mind?