
Large Language Models and the Rationalist-Empiricist Debate.

                          Introduction.

From about 1950 onwards there was a resurgence of interest in the rationalist-empiricist debate, with Chomsky viewed as an updated rationalist carrying on the tradition of Leibniz and Descartes, and Quine and Skinner viewed as updated empiricists carrying on the tradition of Locke and Hume. A consensus emerged that Chomsky’s rationalism won out over the empiricism of Quine and Skinner.

In recent years, with the rise of deep convolutional neural networks and large language models, some have argued that because their architecture is data driven they are an existence proof that empiricist learning is a viable way of modelling the mind (Buckner 2018). This has prompted debate, with others arguing that these models don’t vindicate any kind of empiricism because they rely on innate architecture.

In this paper my focus will be on LLMs, because their linguistic capacities make them directly relevant to the earlier debates between Chomsky, Quine and Skinner, and I will evaluate whether LLMs do indeed vindicate a type of empiricism. I will argue that LLMs are too dissimilar to human cognitive systems to be used as a model for human cognition: they make bad models of both human linguistic competence and human linguistic performance. So, I argue, they offer no vindication of empiricism (or, for that matter, rationalism).

 Chomsky and Quine and the Rationalist-Empiricist Debate.

Famously, the rationalist-empiricist debate between thinkers such as Locke and Descartes, which focused on questions such as whether humans are born with innate concepts, was revisited in the 1950s. When Noam Chomsky burst onto the scene with his review of Skinner’s book Verbal Behaviour, many viewed it as a revival of the rationalist-empiricist debate. Chomsky entitled his 1966 book ‘Cartesian Linguistics: A Chapter in the History of Rationalist Thought’, thus setting himself up as the heir apparent to Descartes’ rationalism.

In Verbal Behaviour Skinner divided language into seven verbal operants, which he argued are controlled by his three-term contingency of antecedent, behaviour, and consequence. Chomsky criticized Skinner for taking concepts from the laboratory, where they were well understood from the animal literature, and extending them into areas where there was no such experimental evidence. He charged Skinner with either using the technical terms literally, in which case his claims were false, or metaphorically, in which case the technical terms were as vague as ordinary terms from folk psychology.

In his 1965 book ‘Aspects of the Theory of Syntax’ Chomsky made his famous distinction between competence and performance. Chomsky argued that the only substantive theory of performance possible would come via a theory of underlying competence, and he illustrated this point by showing how elements of our underlying grammatical competence could predict and explain aspects of our linguistic performance. Since behaviourism was primarily concerned with behaviour, many viewed Chomsky’s competence-performance distinction as an attack on behaviourism, and many scientists influenced by Chomsky argued that behaviourism’s lack of a competence-performance distinction meant that it couldn’t be taken seriously as a science (Jackendoff 2002, Collins 2007).

Chomsky’s (1972, 1986) poverty of stimulus argument was deemed a further dent in the behaviourist project. Chomsky used the structure dependence of syntactic movement, such as auxiliary inversion, as an example of syntactic knowledge which a person acquires even though they could go through much or all of their life without ever encountering evidence for the construction. The argument runs as follows: if the child learned the construction despite never encountering evidence for its structure, and the child didn’t engage in trial-and-error learning in which incorrect constructions were systematically corrected by peers until the correct one was arrived at (Crain and Nakayama 1987, Brown and Hanlon 1970), then knowledge of the construction must be built into the child innately.

Theorists viewed poverty of stimulus arguments as further evidence that the behaviourist project was doomed to failure, with many contrasting Chomsky’s emphasis on innate knowledge with Skinner’s supposed blank-slate philosophy (Pinker 2002), thereby situating Skinner as a modern-day Locke battling a modern-day Descartes (Chomsky). The consensus was that modern science had shown the rationalist position to be correct.

                       Quine and Chomsky.

The debate between Chomsky and Skinner was primarily focused on issues in linguistics and psychology. In philosophy the rationalist-empiricist debate played out between Quine and Chomsky. Quine billed himself as an externalized empiricist whose primary aim was to explain, in a naturalistic manner, how humans go from stimulus to science. His entire project centred on naturalising both epistemology and metaphysics. On the epistemological side, his need to explain how we go from stimulus to science involved psychological speculations on how we acquire our language and how we develop the ability to refer to objects. Quine was explicit in these speculations that any linguistic theory was bound to be behaviourist in tone, since we acquire our language through intersubjective mouthing of words in public settings. This commitment to behaviourism set Quine at odds with Chomsky.

In 1969 Chomsky wrote a criticism of Quine called ‘Quine’s Empirical Assumptions’. The criticism argued that Quine’s notion of a pre-linguistic quality space wasn’t sufficient to account for language acquisition, that Quine’s indeterminacy of translation argument was trivial and amounted to nothing more than ordinary underdetermination, and that Quine’s invocation of the notion of the probability of a sentence being spoken was meaningless.

Skinner never replied to Chomsky[1], arguing that Chomsky so badly misunderstood his position that further dialogue was pointless. But Quine (1969) did reply, and in his reply he charged Chomsky with misunderstanding his position and attacking a strawman. On the issue of a prelinguistic quality space he argued that it was postulated as a necessary condition of acquiring the ability to learn from induction or reinforcement; he never thought it was a sufficient condition of our acquiring language. Quine argued that “the behaviourist was knowingly and cheerfully up to his neck in innate apparatus”. He further argued that indeterminacy of translation was additional to underdetermination and revealed difficulties with linguists’ and philosophers’ uncritical usage of “meanings, ideas and propositions”. And finally, he noted that Chomsky misunderstood Quine’s discussion of the probability of a sentence being spoken: Quine wasn’t speaking about the absolute probability of a sentence being spoken, but rather about the probability of a sentence being spoken in response to queries in an experimental setting.

This led to a series of exchanges between Chomsky and Quine. In his (1970) ‘Methodological Reflections on Current Linguistic Theory’, Quine criticized Chomsky’s notion of implicit rule-following. Quine noted that there are two senses of rule-following he could make sense of: (1) being guided by a rule, where a person follows a rule they can explicitly state; and (2) fitting a rule, where a person’s behaviour conforms to any of an infinite number of extensionally equivalent rules. But Quine charged Chomsky with appealing to a third type of rule: (3) a rule that the person cannot state but is nonetheless implicitly following, a particular rule distinct from all the other extensionally equivalent rules that the person’s behaviour conforms to. Chomsky (1975) correctly responded that Quine was again arbitrarily assuming that underdetermination was somehow terminal in linguistics but harmless in physics. As Chomsky’s approach became less about rules and more about parameter setting, Quine’s rule-following critique had less and less traction.

Chomsky’s critique of Skinner achieved the status of almost a creation myth in cognitive science, with most introductory texts in psychology or cognitive science presenting Chomsky’s review of Verbal Behaviour as the death-knell of behaviourism and the birth of cognitive science. Chomsky’s criticism of Quine, by contrast, wasn’t as well known and received a more nuanced reading. While a lot of people came to the view that Chomsky won that debate too, it never attained the creation-myth status that the review had. Nonetheless, it is fair to say that most philosophers accepted Chomsky’s criticisms of Quine as on point.

Outside the realm of academic debate, when Skinner is spoken about in the popular press he is referred to as a blank-slate theorist (Pinker 2002), with Skinner, and to a lesser degree Quine, placed in the empiricist camp as modern-day inheritors of John Locke’s mantle, and Chomsky a self-described exemplar of the Cartesian tradition. The scientific consensus is that Chomsky’s rationalism has won out over Quine and Skinner’s empiricism.

       Artificial Intelligence and Empiricism.

In recent years, as AI has grown more and more sophisticated, philosophers, psychologists, and linguists have begun to explore what these AI systems tell us about the rationalist-empiricist debate, with some theorists arguing that empiricist architecture is responsible for the success of recent AI systems (Buckner 2018, Long 2024), while others have argued that because the architecture needs in-built biases it in fact supports rationalism (Childers et al 2020 p. 87).

Buckner (2018) argued that deep convolutional neural networks are useful models of mammalian cognition, and he further argued that these DCNNs’ use of “transformational abstraction” vindicates Hume’s empiricist conception of how humans acquire abstract ideas. Childers et al (2020) have hit back at this view, arguing that LLMs and DCNNs require built-in biases in order to be successful. They further argue that the need for built-in biases is analogous to the way Quine needed to posit innate knowledge to explain language acquisition, thereby, according to them, undermining their empiricist credentials (Childers et al 2020 p. 72).

Childers et al’s reading of the rationalist-empiricist debate is extremely idiosyncratic. Their assertion that the postulation of any innate dispositions is an immediate weakening of empiricism is bizarre (ibid. p. 84), and this reading of the rationalist-empiricist dispute doesn’t stand up to scrutiny. Hume, an early arch-empiricist, needed innate formation principles in the human mind to account for how we combine the ideas we receive from impressions into complex thoughts (Fodor 2003). And even Chomsky, who is viewed as a paradigm exemplar of the rationalist tradition, argued that innateness wasn’t the issue when it came to the rationalist-empiricist debate:

“The various empiricist and behaviourist approaches mentioned postulate innate principles and structures (cf. Aspects, pp. 47 f.). What is at issue is not whether there are innate principles and structures, but rather what is their character: specifically, are they of the character of empiricist or rationalist hypotheses, as there construed?” (Chomsky & Katz 1975).

“…Each postulates innate dispositions, inclinations, and natural potentialities. The two approaches differ in what they take them to be…The crucial question is not whether there are innate potentialities or innate structure. No rational person denies this, nor has the question been at issue. The crucial question is whether this structure is of the character of E or R; whether it is of the character of “powers” or “dispositions”; whether it is a passive system of incremental data processing, habit formation, and induction, or an “active” system which is the source of “linguistic competence” as well as other systems of knowledge and belief” (Chomsky 1975 pp. 215-216).

And Watson, Quine and Skinner were consistent on this point throughout their careers: Watson 1924 p. 135, Skinner 1953 p. 90, Skinner 1966 p. 1205, Quine 1969 p. 57, Quine 1973 p. 13, Skinner 1974 p. 43.

The point of Childers et al’s criticism was that Hume’s empiricism, with its appeal to a few laws of association, needed to be supplanted by Kant’s system, which postulated many more innate priors (Childers et al 2020 p. 87). This may have been a problem for Hume, but it is no difficulty for the likes of Quine, an externalized empiricist who had no issue whatsoever with innate priors provided they could be determined experimentally (Quine 1969 p. 57). When it comes to artificial intelligence there is a legitimate debate about whether it is pragmatic to build systems on rationalist or empiricist principles. But this only has relevance to the rationalist-empiricist debate if it can be demonstrated that artificial intelligence systems learn in the same way humans do. In the next section I will evaluate how closely AI systems model human cognition. To do this I will focus on LLMs and the degree to which they accurately model human linguistic cognition.

 Large Language Models and Human Linguistic Competence.

Theorists have argued that the similarities between LLMs’ output and human linguistic output make LLMs, and the way they learn, directly relevant to theoretical linguistics. Thus Piantadosi (2023) has argued that LLMs refute central claims about language acquisition made by Chomsky and others in the generative grammar tradition. This comparison of LLMs to actual human cognition has been challenged in the literature (Chomsky et al 2023, Kodner et al 2023, Katzir 2023). In this section I will consider various disanalogies between LLMs and human linguistic cognition which make any comparison between them problematic. In the final section I will consider the relevance of these disanalogies to whether work in AI is pertinent to debates about rationalism versus empiricism.

    Poverty of Stimulus Arguments, Artificial Intelligence, & Human Linguistic Capacities.

A clear disanalogy between human linguistic abilities and LLMs is that humans acquire their language despite a poverty of stimulus, while LLMs learn because of a richness of stimulus (Kodner et al 2023, Chomsky et al 2023, Long 2024, Marcus 2020). To see the importance of this distinction, a brief discussion of the role that poverty of stimulus arguments have played in linguistics is necessary; with this in place we can return to the stimulus which LLMs are trained on.

Chomsky (1965) noted that people acquire syntactic knowledge despite a poverty of stimulus. Humans are exposed to limited, fragmentary data and still manage to arrive at a steady state of linguistic competence, including knowledge of syntactic rules which they may never have encountered in their primary linguistic data. Chomsky used auxiliary inversion as his paradigm example of a poverty of stimulus (Chomsky 1965, 1968, 1971, 1972, 1975, 1986, 1988[2]). Pullum and Scholz (2002) reconstructed Chomsky’s poverty of stimulus argument as follows:

  1. Humans learn their language either through data driven learning or innately primed learning.
  2. If humans acquire their first language through data driven learning, then they can never acquire anything for which they lack crucial evidence.
  3. But infants do indeed learn things for which they lack crucial evidence.
  4. Thus, humans do not learn their first language by means of data-driven learning.
  5. Conclusion: humans learn their first language by means of innately primed learning (Pullum and Scholz 2002).

Pullum & Scholz (2002) isolated premise three as the key premise in the argument, and they sought empirical evidence about how often constructions bearing evidence for auxiliary inversion occur in a sample of written material. The material they chose to examine was Wall Street Journal back issues. They also estimated the amount of linguistic data a person is on average exposed to, relying on Hart and Risley’s (1997) ‘Meaningful Differences in the Everyday Experiences of Young Children’, which estimated that the average child from a middle-class background will have been exposed to about 30 million word tokens by the age of three. Pullum and Scholz argue that the child will have been exposed to about 7,500 relevant examples in three years, which amounts to about 7 relevant questions per day. But a primary criticism of their work was that the Wall Street Journal isn’t representative of the type of data a child would be exposed to. Sampson (2002) searched the British National Corpus and argued that the child would be exposed to about 1 relevant example every 10 days.

But the next question was whether a child would be able to learn the relevant construction from one example every 10 days (Clark & Lappin 2011). Reali and Christiansen (2005) and Perfors, Tenenbaum & Regier (2006) have all constructed mathematical models suggesting that children are capable of learning from the above amounts of data. However, Berwick, Pietroski and Chomsky (2011) in their ‘Poverty of the Stimulus Revisited’ have hit back, arguing that auxiliary inversion is meant as an expository example to illustrate the argument from the poverty of the stimulus (APS) to the general public, and that there are much deeper syntactic properties which children could not learn from the primary linguistic data (PLD). The debate still rages on, but the consensus in generative grammar remains that the poverty of the stimulus is a real phenomenon which humans need domain-specific innate knowledge to overcome.

As our discussion above indicates, there has been some push-back against poverty of stimulus arguments, but it is still the default position in linguistics. Furthermore, even those who push back against the APS would readily admit that the linguistic data an LLM is trained on is not analogous to the primary linguistic data of the average child. Children are exposed to roughly 10 million tokens a year, while LLMs are trained on around 300 billion tokens, and this number is increasing exponentially (Kodner et al 2023). So, while the output of an LLM and a human may be roughly analogous, the linguistic input they receive is in no way analogous.
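To make the scale gap concrete, here is a rough back-of-the-envelope calculation in Python using the figures cited above; the token counts are the illustrative estimates quoted from Hart and Risley, Pullum and Scholz, and Kodner et al, not precise measurements.

```python
# Back-of-the-envelope comparison of linguistic input, using the rough
# estimates quoted above (illustrative figures, not precise measurements).

child_tokens_per_year = 10_000_000       # ~10 million word tokens per year
llm_training_tokens = 300_000_000_000    # ~300 billion tokens in a training run

# How many years of child-level exposure would match one LLM training corpus?
equivalent_child_years = llm_training_tokens / child_tokens_per_year
print(f"LLM corpus ~ {equivalent_child_years:,.0f} years of child-level input")
# -> roughly 30,000 years of exposure at a child's rate

# Pullum & Scholz's figure of ~7,500 relevant auxiliary-inversion examples
# over the first three years works out to roughly seven per day.
examples_per_day = 7_500 / (3 * 365)
print(f"~ {examples_per_day:.1f} relevant examples per day")   # ~ 6.8
```

On these figures, one LLM training corpus corresponds to tens of thousands of years of exposure at a child’s rate, which is precisely the disanalogy at issue.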

 Competence & Performance in LLMs & Humans.

The divergence in the linguistic data on which LLMs and humans are trained is a clear indicator that they work off different underlying competencies. Other differences emerge in terms of the materials they are built from: in terms of computational considerations, chips are faster than neurons (Long 2024). And to the degree that outputs are similar, that does not prove that they rely on the same underlying competencies (Firestone 2020, Kodner et al 2023, Milliere & Buckner 2024). Kodner et al give the example of two watches, both of which keep time accurately, but one of which is digital and the other mechanical: despite similar performances, they achieve them through different underlying mechanisms (Kodner et al 2023).

   But Are Human and LLM Outputs Analogous?

The question of whether human and LLM outputs are analogous is obviously vital if we want to understand whether they operate using the same underlying competencies. We have already seen that the two systems seem to learn differently, one despite the poverty of stimulus and one because of the richness of stimulus, which points towards different underlying competencies. Underlying competencies aside, the next section will argue that the performances of the two systems are also very different.

At a superficial level, LLM and human outputs appear very similar. GPT-4 can to some degree fool a competent reader into thinking that a human produced the outputted sentences. Clark et al (2021) studied human-created stories, news articles, and other texts and got an LLM to create similarly sized texts; the 130 participants tested couldn’t tell the human texts from the LLM texts at a rate greater than chance. Schwitzgebel et al (2024) created a language model which was able to simulate Daniel Dennett’s writing style, and though experts were able to distinguish Dennett’s own answers from the model’s at rates barely above chance, it was surprising how close-run the thing was, given that those being probed were scholars who were experts on Dennett.

While an LLM can construct sentences which appear analogous to ordinary human sentences, there are obviously a lot of disanalogies. LLMs can reliably produce syntactically sound sentences which are semantically interpretable, but the words the LLM uses have no meaning for the LLM, only for the humans who interpret its output. The reason they have no meaning is that, for the LLM, they are not grounded in sensory experience, whereas for humans they obviously are (Harnad 2024). The LLM, unlike the human, isn’t talking about any state of affairs in the world; rather, it is merely grouping tokens together according to the statistical patterns in which those tokens occur in its training data.
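To illustrate what “grouping tokens together according to statistical patterns in the training data” amounts to in its simplest form, here is a toy next-word predictor built from raw bigram counts. This is only a sketch: real LLMs use transformer networks and learned embeddings rather than raw counts, and the tiny corpus here is made up for illustration. The point it makes is that everything the model “knows” is derived from co-occurrence in text, with no link to objects in the world anywhere in the loop.

```python
from collections import Counter, defaultdict

# A deliberately crude "language model": next-word prediction from bigram
# counts over a made-up corpus. Real LLMs use transformers and learned
# embeddings, but the relevant point survives the simplification: the model's
# predictions are driven entirely by patterns in the text it was fed.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    bigram_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation of `word` in the training text."""
    counts = bigram_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("sat"))   # 'on'  -- purely textual co-occurrence
print(predict_next("the"))   # one of 'cat'/'mat'/'dog'/'rug' (tied counts)
```

The scale and the architecture change enormously in a transformer trained on hundreds of billions of tokens, but the training signal remains text-internal in just this way.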

When criticisms are made that LLMs’ outputs don’t have meanings, we need to be careful how we parse these statements. Obviously, LLMs handle meaning in the sense that they can distinguish between two sentences which are syntactically identical but which don’t have the same meaning. But their sentences do not have meaning in the sense of a referential relation between the words and a mind-independent reality. However, given that the idea of explicating meaning in terms of a word-world relation has been questioned (Chomsky 2000, Quine 1973), it is difficult to know what to make of the claim that LLMs don’t have meanings because their words don’t refer to mind-independent objects.

Bender & Koller (2020) used a thought experiment to illustrate why they believe that LLMs do not mean anything when they respond to queries. The thought experiment imagines two people trapped on different islands who communicate with each other in code through a wire stretched between the islands along the ocean floor. An octopus, who is a statistical genius, taps the wire and is able to communicate with the people on the islands through pattern recognition. But though he is able to figure out which code to use, and when, from the context in which the code is used and the patterns in which the signals are grouped together, he has no understanding of what is being said. Bender & Koller argue that if a person on the island asked the octopus how to build a catapult out of coconuts and wood, he wouldn’t know how to answer, because he has no real-world knowledge from interacting with the world and is instead merely grouping brute statistical patterns together.

Piantadosi & Hill (2023), in their “Meaning Without Reference in Large Language Models”, argue that thought experiments such as the octopus one fail because they make the unwarranted assumption that meaning can be explicated in terms of reference. They argue that meaning cannot be explicated in terms of reference for the following reasons:

  • There are many terms which are meaningful to us but which have no clear reference, e.g. justice.
  • We can think of concepts of non-existent objects; these have meanings but don’t refer to anything in the mind-independent world.
  • We have concepts of impossible objects, such as a round square or a perpetual motion machine.
  • We have concepts which pick out nobody but which are meaningful, e.g. the present King of Ireland.
  • We have concepts which have meaning but which don’t refer to concrete particulars, e.g. concepts of abstract objects.
  • We have terms which have different meanings but the same reference, e.g. “the morning star” and “the evening star”.

They go on to argue that conceptual role theory, in which meaning is determined in terms of entire structured domains (like Quine’s web of belief), plays a large role in our overall theory of the world. They nonetheless acknowledge that reference plays some role in grounding our concepts, just not as large a role as some theorists criticizing LLMs believe. They are surely right that as our theory of the world becomes more and more sophisticated it will, as Quine noted, face empirical checkpoints only at the periphery. Nonetheless, when humans are acquiring their language in childhood they must go through a period where they learn to use the right word in the presence of the right object, and to somehow triangulate with their peers in using the same word to pick out a common object in their environment.

As we saw above, Piantadosi & Hill (2023) shared concerns about crude referentialist theories of meaning and their relation to LLMs, but they did acknowledge that there are word-world connections between some words and objects in the mind-independent world. Quine famously tried to connect our sentences to the world through his notion of an observation sentence. He argued that an observation sentence is a sentence which members of a verbal community would immediately assent to in the presence of the relevant non-verbal stimuli.

But Quine immediately ran into a difficulty with this approach. The difficulty stemmed from the fact that Quine found it hard to say why different members of a speech community assented to an observation sentence. Quine tried to cash out the meaning in terms of stimulus meaning where a particular observation sentence was associated with a particular pattern of sensory receptors being triggered.  But this made intersubjective assent on observation sentences difficult to explain given that different subjects would obviously have different patterns of sensory receptors triggered in various ways in response to the same observation sentences. What pattern of sensory receptors was triggered by what observation sentences would be largely the result of each subject’s long forgotten learning history. All of this made it difficult to see how Quine could make sense of a community assenting to an observation sentence being used in various circumstances.

Quine eventually made sense of intersubjective assenting to an observation sentence by appealing to the theory of evolution. He argued that there was a pre-established harmony between our subjective standards of perceptual similarity and trends in the environment. And that humans as a species were shaped by natural selection to ensure that they shared perceptual similarity standards. This fact was what made it possible for humans to share assent and dissent to observation sentences being used in certain circumstances.

For Quine, shared perceptual similarity standards and reinforcement for using certain sounds in certain circumstances give observation sentences empirical content. But to achieve actual reference, Quine argues, we need to add things like quantifiers, pronouns, and demonstratives. The key point is that observation sentences link with the world (even if in a manner less tight than objective reference) because our shared perceptual similarity standards match objective trends in our environment. This is what makes intersubjective communication possible.

LLMs are not responsive to the environment in anything like the way humans are when they are acquiring their first words. Even prior to humans learning the referential capacities of language, children using observation sentences are still in contact with the world. To this degree, Piantadosi & Hill’s concerns about reference are beside the point. As humans begin to acquire their first words, they do so through observation sentences which are connected to the environment. The fact that once we acquire a language complete with words and productive syntax we can speak about theoretical items, fictional items, impossible objects, etc. is interesting, but it doesn’t speak to the LLM issue. The fact is that as humans first acquire their words they do so in response to their shared sensory environment. LLMs do not learn in this way at all: their training is entirely the result of exposure to textual examples, which they group into tokens based on the statistical likelihood of textual data occurring together. So, while humans eventually learn to speak about things that are not sensorially experienced, their first words are keyed to sensory experience, and this is a key difference between them and LLMs. In the next section I will consider some objections to this view, but first I want to recap what we have discussed so far.

                     Interlude: Brief Recap.

Thus far we have considered the rationalist-empiricist debate as it played out between Skinner, Quine and Chomsky, noting that the consensus is that Chomsky’s rationalism won the day over the empiricism of Quine and Skinner. In recent years, as AI has grown more and more sophisticated, there has been a resurgence of interest in AI and its relation to the rationalist-empiricist debate, with some theorists arguing that modern AI shows that empiricist theories of cognition are, contrary to what was previously believed, being realized by some current systems. I then discussed whether this is the case, relating it to earlier episodes of the rationalist-empiricist debate and arguing that, contrary to Childers et al, an AI can learn in an empiricist manner even if it does so via built-in constraints. With this established, I questioned whether the fact that some AI learns in a particular manner tells us much about human cognition. To sharpen the issue, I narrowed the debate down to whether, if LLMs learned in an empiricist manner, this would tell us much about the rationalist-empiricist debate as it applies to humans.

To decide this issue, I considered whether LLMs and humans learn in the same manner, concluding, like many others, that they learn in entirely different ways: humans learn despite the poverty of stimulus, while LLMs learn because of the richness of their stimulus. I argued that the different manners in which they learn indicate that there are probably different competencies underlying their respective performances.

     AI and the Relevance of Rationalism and Empiricism.

Above we discussed some disanalogies between LLMs and human linguistic capacities. Two main differences were noted: (1) differences in the stimulus needed for the respective agents to acquire language, which indicate different underlying competencies; and (2) LLM language is not grounded, while human language is grounded, a difference which results from the different ways in which they acquire their language: humans begin by being responsive to the world in a triangular relation with others, while the LLM acquires its language through statistical grouping of the text it is trained on.

So, given that human and LLM outputs are the result of exposure to different quantities and types of data, which indicate different underlying competencies, a question arises as to the relevance of AI to understanding human linguistic capacities. Earlier we discussed the debate between Chomsky, Skinner and Quine and its relation to current debates on the nature of LLMs. But given the disanalogies we have noted between LLMs and humans, it is questionable whether LLMs have anything to tell us about the rationalist-empiricist debate at all.

The debate between the rationalists and empiricists was never centred on whether innate apparatus is necessary for a creature to learn a particular competency. All sides of the debate agreed that some innate apparatus is necessary to explain particular competencies, and that the degree of innate apparatus which needs to be postulated is to be determined empirically. The empiricist position argued that humans learn primarily through data-driven learning (supported by innate architecture), while the rationalist argued that humans learn through innate domain-specific competencies being triggered by environmental input.

However, given the differences between LLM and human linguistic competencies, their relevance to each other in the rationalist-empiricist debate is in doubt. Even if we could conclusively demonstrate that an LLM or a DCNN learns in an empiricist manner, this would not tell us anything about whether humans learn in an empiricist manner. Because human competencies are so different from an LLM’s, it is simply irrelevant to the rationalist-empiricist debate about human cognition whether LLMs learn in an empiricist manner or not.

It is theoretically possible that a philosopher could argue, à la Kant, that any form of cognition will a priori need to implement innate domain-specific machinery to arrive at its steady state. And one could offer empirical data to support this a priori claim by showing that very different forms of cognition, e.g. humans and LLMs, both learn by implementing innate domain-specific architecture. But this isn’t how the debate has played out in the literature. Typically, it is argued that LLMs are largely empiricist and that this fact vindicates empiricism in general, or that they are largely rationalist and that this fact vindicates rationalism. I have argued here that, given the very different natures of LLMs and humans, it is irrelevant to the question of how humans acquire their knowledge whether LLMs are rationalist or empiricist.

But this isn’t to say that it is unimportant whether LLMs or other forms of AI learn in a rationalist or an empiricist manner. There are still practical engineering issues as to whether one is more likely to succeed in building things like artificial general intelligence using empiricist architecture or not. Thus, people like Marcus (2020) argue that however far empiricist principles take us, we will not be able to build AGI unless we build substantial innate domain-specific knowledge into the system. Marcus even argues that to understand what innate architecture needs to be built into our AI models, we should look to our best example of an organism with general intelligence, i.e. humans.

The engineering question is extremely interesting from a practical point of view, and it could motivate an interest in whether LLMs or other types of AI learn in a rationalist or an empiricist manner. But when it comes to LLMs, the question of whether they learn in an empiricist or a rationalist manner is largely irrelevant to the rationalist-empiricist question as it applies to humans.

                            Bibliography

Berwick, R, & Pietroski, P, & Chomsky, N. 2011 “Poverty of the Stimulus Revisited.” Cognitive Science. Vol 35. Issue 7 pp. 1207-1242.

Bender & Koller. (2020) “Climbing Towards NLU: On Meaning, Form and Understanding in the Age of Data.” Proceedings of the 58th annual meeting of the Association for Computational Linguistics.

Brown, R. & Hanlon, C. 1970. “Derivational complexity and order of acquisition in child speech.” In Hayes, J.R. (eds). Cognition and the Development of Language. New York Wiley.

Buckner, C. 2018. “Empiricism without Magic: Transformational Abstraction in Deep Convolutional Neural Networks.” Synthese. Vol 195 pp. 5339-5372.

Childers, T, & Hvorecky, J, & Majer, O. 2023. “Empiricism and the foundations of Cognition”. AI and Society. Vol 38 pp. 67-87.

Chomsky, N. 1959. “A review of B.F. Skinner’s Verbal Behaviour”. Language, 35. PP 26-57.

Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

Chomsky, N. 1966. Cartesian Linguistics. New York: Harper & Row.

Chomsky, N. 1969. “Quine’s Empirical Assumptions”. Synthese 19 pp. 53-68.

Chomsky, N. 1975. Reflections on Language. New York: Random House.

Chomsky, N. 1986. Knowledge of Language: Its Nature, Origin and Use. New York, New York: Praeger.

Chomsky, N. 2000. New Horizons in the Study of Language and Mind. Cambridge: Cambridge University Press.

Chomsky, N, Roberts, I, Watumull, J. 2023. “The False Promise of ChatGPT.” The New York Times.

Chomsky, N & Katz, J. 1975. “On Innateness: A Reply to Cooper.” Philosophical Review. 84 pp. 70-84.

Clark, A. & Lappin, S. 2011. Linguistic Nativism and the Poverty of the Stimulus. Wiley-Blackwell.

Collins, J. 2007. “Linguistic Competence Without Knowledge of Language”. Philosophy Compass 2 (6) pp. 880-895

Crain, S & M, Nakayama. 1987. “Structure Dependence in Grammar Formation”. Language, Vol 63 pp. 522-543.

Firestone, C. 2020. “Performance vs. Competence in Human-Machine Comparisons.” Proceedings of the National Academy of Sciences. 117 (43) pp. 26562-26572.

Fodor, J. 2003. Hume Variations. Oxford University Press.

Hart, B & Risley, T, & Kirby, J. 1997. “Meaningful differences in the everyday experiences of the young American.” Canadian Journal of Education. Vol 22. Issue 3.

Harnad, S. 2024. “Language Writ Large: LLMs, ChatGPT, Grounding, Meaning, and Understanding.”

Jackendoff, R. 2002. Foundations of Language. Great Clarendon Street. Oxford University Press.

Katzir, R. 2023. “Are Large Language Models Poor Theories of Linguistic Cognition? A Reply to Piantadosi.”

Kodner, J, & Payne, S, & Heinz, J. 2023. “Why Linguistics Will Thrive in the 21st Century: A Reply to Piantadosi.”

Long, R. 2024. “Nativism and Empiricism in Artificial Intelligence”. Philosophical Studies. Vol 181 pp 763-788.

Marcus, G. 2020. “The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence”

Mollo, D, & Milliere, R. “The Vector Grounding Problem”.

Milliere, R & Buckner C. 2024. “A Philosophical Introduction to Language Models: Part 2”.

Piantadosi, S. 2023 “Modern Language Models Refute Chomsky’s Approach to Language.”

Piantadosi, S, & Hill, F. 2023. “Meaning Without Reference in Large Language Models.”

Pinker, S. 2002. The Blank Slate: The Modern Denial of Human Nature. New York Viking.

Pullum, G. K. & Scholz, B. 2002. “Empirical Assessment of Stimulus Poverty Arguments.” The Linguistic Review. 19 pp. 9-50.

Quine, W. 1968. “Reply to Chomsky”. Synthese 19 pp. 274-283.

Quine, W. 1970. “Methodological Reflections on Current Linguistic Theory”. Synthese 19, pp. 264-321.

Quine, W. 1974. The Roots of Reference. La Salle. Open Court Press.

Sampson, G. 2002. “Exploring the Richness of the Stimulus”. The Linguistic Review. 19 pp. 73-104.

Schwitzgebel, E, & Schwitzgebel, D, & Strasser, A. 2024. “Creating a Large Language Model of a Philosopher.” Mind & Language. 39 (2) pp. 237-259.

Skinner, B.F. 1957. Verbal Behaviour. New Jersey: Prentice-Hall Inc.  


[1] For an interesting reply on Skinner’s behalf see Kenneth MacCorquodale (1970).

[2] In the following I am relying on Pullum and Scholz (2002), ‘Empirical Assessment of Stimulus Poverty Arguments’.

Chat GPT and the Meaning of Meaning.

In the last decade there has been an explosion of literature on Large Language Models. The availability of models such as GPT-3 and GPT-4, via ChatGPT, has created interest in the topic among the general public. The resulting literature has to some degree led to a rehashing of age-old debates in philosophy which began with the rise of symbolic systems. John Searle, circa 1980, with his Chinese Room argument, claimed that there was no reason a syntactic system should automatically acquire semantic content. Searle had his critics, such as Dennett with his 1991 systems objection, and it is fair to say that most computer scientists took Dennett’s side in the debate.

With the rise of ChatGPT and other LLMs this debate has arisen again, but in a slightly different form. Virtually nobody thinks that LLMs are conscious agents, and nobody thinks that, as they stand, they exhibit artificial general intelligence. Nonetheless, a debate has begun as to whether, as they stand, they exhibit any meaning. Leaving aside the “hallucinations” which sometimes occur when people use GPT, the answers the system provides generally seem to be largely pragmatically coherent, syntactically well formed, and semantically coherent from the point of view of the people interpreting them.

But despite users finding the models semantically interpretable, theorists argue that there is little reason to attribute any semantics to the systems. To evaluate this claim one needs to be explicit about what is meant by semantics in such discussions. There is a long tradition in linguistics, model theory, and philosophy of thinking of semantics in terms of a referential relation: the meaning of “apple” is a physical apple in our intersubjective world of experience, and the same would be true of other words in our language. On this conception LLMs clearly have no semantics, because they refer to no mind-independent properties and objects. The story would go that there is a causal interaction between humans, their socio-linguistic community, and shared objects of experience as humans causally interact with their world and with others. On this view it is blindingly obvious that ChatGPT doesn’t have any semantics in the sense of a word-world relation.

But referential semantics isn’t the sole criterion of semantics; critics of referential semantics abound in the literature. Some critics, such as Chomsky, who do think of language as involving a representational system, don’t think of language as being explicable in terms of a referential relation (Chomsky 2000), while philosophers such as Quine have drawn similar conclusions using different arguments (Quine 1960). However, reference aside, it is apparent that human words clearly have meaning, even if this meaning cannot be accounted for in terms of a crude referential relation. And if our words can have meaning without a crude referential relation, then there is no reason to think that GPT-4 or any other LLM needs to use a referential relation in order to mean something by what it says.

Nonetheless there seems to be something almost perverse in attributing semantics to an unthinking large language model. But as Sogaard (2022) has noted, we need some concept of meaning to explain how we can differentiate between syntactically identical sentences such as (1) ‘Colourless green ideas sleep furiously’ and (2) ‘Happy purple dreams dance silently’. We need something to determine what distinguishes these syntactically identical sentences, and a key candidate is the meanings of the individual words (however we parse meaning). In the next section we will try to explicate what we mean by meaning and ask whether LLMs exhibit meaning in any of the traditional senses.

                    Meaning as Reference.

A commonsensical parsing of what people understand by meaning is that they understand it in terms of reference. Experimental philosophy has revealed some cultural variation in how people conceive of the reference relation (Deutsch 2009). However, while it is interesting to note cultural variation in how people intuitively understand the reference relation, the job of the philosopher is to evaluate the logical coherence of understanding meaning in terms of a reference relation.

Obviously, there is a class of words which clearly cannot be cashed out in terms of a reference relation: grammatical particles such as “to”, “not”, “-ing”, “-s”, and “-ed”, which modify nouns and verbs. These particles obviously cannot be parsed in terms of a reference relation. The same is true of logical operators such as “and”, “or”, “not”, and “if and only if”. While any attempt to understand the grammatical particles in terms of a reference relation is defunct, understanding the logical operators in terms of reference seems equally hopeless, unless one were to appeal to reference to some non-physical Platonic realm.

But logical operators and grammatical particles aside, people typically think of the reference relation as best cashed out in terms of nouns and of the categories which describe and modify nouns, such as verbs and adjectives. Thus a sentence such as ‘The tall man runs quickly’ can be cashed out referentially as referring to a particular man to whom we are ascribing certain properties. At an intuitive level we could argue that the meaning of this sentence can be cashed out in referential terms.

But even with nouns we run into intractable problems in cashing out meanings in terms of reference. Fodor and Pylyshyn (2015) have summed up these problems. We use words to describe objects which cannot be cashed out in sensory experience. We talk about extremely large entities such as the multiverse, which cannot be cashed out in any simple way in terms of a referential relation, and extremely small objects such as quarks, which cannot be directly observed (and hence cannot be cashed out in terms of a reference relation). There are impossible objects such as a round square which clearly have a meaning but which do not refer to any objects. There are fictional objects such as unicorns which do not refer to any objects but which clearly have meaning.

When combined with the fact that grammatical particles and logical constants cannot be cashed out in terms of reference, the above considerations show that many of our concepts which have meanings cannot be cashed out in terms of reference. It is therefore rational to doubt that meaning can be cashed out in terms of reference. But this brings us full circle to our consideration of LLMs and whether the words they use have meanings.

Some points to note. The sentences used by LLMs clearly have a meaning to us as interpreters: we can parse their sentences as sensible or senseless, as true or false, and so on. But the question is, do the words the LLMs use have meaning for the LLM? The answer to this question is clearly no, and one of the key reasons is that they do not acquire their words through a grounding in sensory experience. None of the words they use mean anything to them, because they are not grounded in sensory experience and hence are nothing other than collections of tokens grouped together based on statistical patterns in the training data.

                       The Contradiction.

But we seem to have arrived at a contradiction here. We have argued that LLMs do not have meanings because their words are not grounded in sensory experience, but when it comes to humans we have argued that we cannot cash out meanings in terms of reference. This conflict is more apparent than real. Sensory grounding occurs in the context of what Quine calls the mid-level scheme of ordinary enduring physical objects, or what Husserl would call the lifeworld: the intersubjective, shared world in which embodied humans live.

It will help to begin simply. Let us take the grounding of the word “apple”. Before the child acquires any words, their cognitive apparatus already embodies ontological assumptions. Children will perceptually categorise the objects they experience, and they will have implicit assumptions about object behaviour such as object permanence and object solidity (Carey 2009). Such ontological categories, shared by all humans, ensure a surprising convergence between humans when they interpret the sounds others are using in relation to shared objects of experience (Pinker 1994). But in order for a child to interpret what the sounds being spoken to him signify, the child will need more than pre-linguistic ontological commitments to guide him. He will also need to know that the sounds being used in his presence are intended communicatively. To do this he will need a theory of mind, the ability to interpret pointing in terms of a shared object of experience, the ability to track eye contact in terms of a shared object of experience, and a cooperative instinct.

When thinking about this issue in terms of epistemology, Quine called this the problem of how we go from meagre input (the impact of light and sound on our sensory receptors) to torrential output (our total scientific theory of the world). Quine recognised that we couldn’t just appeal to shared objects of experience to answer the epistemological question of how we go from stimulus to science; we must explain how this is done. His explanation relied both on a pre-established harmony of perceptual similarity space built by natural selection and on shared empathy, which ensures that we understand that others who are gesturing towards objects have a psychology and motivation similar to our own.

Sanford and Hayes (2014) cashed out our ability to go from simple referential relations between two people to more complex capacities, such as the capacity to grasp the frame of coordination and other complex relational frames. They do this by appealing to group selection as a factor which selected for cooperation amongst humans, which in turn made it easier to triangulate on shared objects of experience.

The key point to note is that on this picture (1) prelinguistic ontological commitments to object ontology (hinted at by Quine’s pre-established harmony and demonstrated by results in developmental psychology by Carey, Spelke et al.) and (2) cooperation and empathy (an empathic instinct argued for by Quine and demonstrated by Tomasello 2013, Sanford & Hayes 2014) together make it possible for a child to triangulate with a parent on a shared object of experience when acquiring their basic concepts. This epistemological triangle results in the child’s early concepts being grounded in an umwelt shared with their species and social community. Once a child has acquired his grounded base-level concepts, he will have the capacity to learn other concepts which are not grounded, e.g. unicorn, quark, round square. But this process will involve using combinatorial syntax (merge), along with analogical reasoning, and perhaps the use of relational frames, to create more complex concepts which cannot be cashed out in terms of our shared sensory umwelt.

This removes our supposed contradiction. A lot of our complex concepts are not explicable in terms of our shared umwelt. Nonetheless, our base-level concepts, cashed out in terms of our shared umwelt, are grounded through triangulation on shared objects of experience as humans acquire their concepts. This triangulation is something that LLMs do not have. Rather than triangulating on shared objects of experience, they simply detect statistical patterns in the input they receive, modified by whatever biases and/or reinforcement learning the model is subjected to.

So, if we go back to our reference problem from above: we noted that impossible objects, very small objects, very large objects, and fictional objects have meanings, but these cannot be cashed out in terms of reference. This created a puzzle, since when LLMs answer questions they appear meaningful to us, while the sceptic argues that they cannot be meaningful because their words aren’t picking out mind-independent objects. But our consideration of human concepts shows that a lot of our concepts do not have meanings conferred in terms of mind-independent objects. If cashing out meanings in terms of mind-independent objects isn’t what makes human words meaningful and LLM words meaningless (unless interpreted by humans), one wonders what is. The answer is that when humans acquire a word they do so by triangulating on a shared object of experience, made possible by shared concepts of objects and a shared instinct to cooperate. This means that as humans acquire their base-level concepts, they do so in terms of a shared world of experience in which multiple humans are causally interacting with the world while communicating with each other. While humans can go on to learn new concepts which are not grounded directly in the lived world of sensory experience, the concepts they first acquire are grounded in the shared human world of experience. For LLMs, by contrast, none of their words are grounded, hence they mean nothing by what they say, though we as agents with grounded concepts can attribute meaning to their linguistic output (Harnad 2024).

                                  Some Objections

Not everyone agrees that grounding is what separates humans and LLMs. Some theorists even go as far as to argue that LLMs can derive meaning via reference in a manner similar to humans:

“Referential semantics or grounding now amounts to learning a mapping between the Transformer model vector space and this target space. But why, you may ask, would language model vector spaces be isomorphic to representations of our physical, mental, and social world? After all, language model vector spaces are induced merely from higher-order co-occurrence statistics. I think the answer is straight-forward: Words that are used together, tend to refer to things that, in our experience, occur together. When you tell someone about your recent hiking trip, you are likely to use words like mountain, trail, or camping. Such words, as a consequence, end up close in the vector space of a language model, while being also intimately connected in our mental representations of the world. If we accept the idea that our mental organization maps (is approximately isomorphic to) the structure of the world, the world-model isomorphism follows straight-forwardly (by closure of isomorphisms) from the distributional hypothesis.” (Sogaard, 2022 pp. 442-443).
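To see concretely what Sogaard’s appeal to “higher-order co-occurrence statistics” amounts to, here is a toy distributional model in Python; the four-sentence corpus and the window size are made up for illustration, and real language models use learned transformer embeddings rather than raw co-occurrence counts. It shows how “mountain” and “camping” can end up geometrically close without the model ever encountering a mountain or a campsite.

```python
import math
from collections import Counter, defaultdict

# Toy distributional semantics: each word is represented by a vector of
# co-occurrence counts within a small window. This stands in (very crudely)
# for a language model's vector space: similarity falls out of which words
# appear together in text, not out of any contact with the world.
sentences = [
    "the hikers followed the mountain trail to the camping site",
    "camping near the mountain trail is popular with hikers",
    "students read novels in the quiet library reading room",
    "the library reading room was full of students with novels",
]

window = 2
vectors = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                vectors[word][words[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

print(cosine(vectors["mountain"], vectors["camping"]))  # higher (~0.65 here)
print(cosine(vectors["mountain"], vectors["library"]))  # lower  (~0.32 here)
```

Nothing in this construction, however, connects the vectors to mountains or libraries themselves; that is the gap pressed on below.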

Even if we accept Sogaard’s tendentious claim that vectors for words which are statistically likely to occur together are to some degree isomorphic to structural features of the world, we still have a gap to fill in explaining how such vector spaces are used to refer to objective features of the mind-independent world. It is a leap to go from saying that pattern A is roughly isomorphic with pattern B to saying that a subject is using pattern A as a representation of pattern B. And thus far there is no reason to think that GPT or any other LLM is using words in a representational manner at all.

Bender and Koller used an interesting thought experiment, which is useful as an intuition pump, to demonstrate the lack of grounding in LLMs’ speech. They ask us to think of two agents trapped on separate islands who can communicate using a code they send to each other through a wire. They then ask us to imagine an octopus, a statistical genius, who floats beneath the ocean. This genius manages to intercept the wire, learns how to interpret the signals (picking out statistical patterns in the data), and communicates with the islanders. He becomes so effective that he manages to fool islander A into thinking that he is islander B. But Bender and Koller argue that if islander A asked the octopus to find some coconuts, build a coconut catapult on the island, and report its findings back, the octopus would not be able to do so. The fact that ChatGPT could answer this question by having access to the world’s written material is irrelevant. The point is that while the octopus can mimic the islanders by detecting statistical patterns in their conversations, he would not have the capacity to think up an answer to the catapult question unless he was repeating what he had been exposed to before, behaving like a giant “stochastic parrot”, or unless he was an embodied agent whose understanding of the concepts was grounded in experience of interacting with the objects under discussion.

Bender and Koller’s thought experiment reiterates what we have been arguing throughout this blogpost. While LLMs can appear surprisingly sophisticated when answering questions about various topics, there is no evidence that their words have any meaning in the sense that human words have meaning. And the reason for this is that the systems are not embodied and are not causally interacting with a world they are communicating about with other agents.