Abstracts

Perception of tone sequences in monkeys

Akihiro Izumi

Primate Research Institute, Kyoto University


    I examined auditory properties of Japanese monkeys (Macaca fuscata) in perceiving tone sequences. Perception of relative properties in auditory sequences is essential in perceiving human speech. Human speech consists of sequences of sounds, and listeners have to perceive the spectral-temporal relations among the component sounds. A comparison of the properties of such auditory organization between humans and other primate species may help us understand the evolution of the human auditory system. Initially, monkeys were trained to detect changes from rising to falling contours of tone sequences. Probe tests with novel sequences showed that monkeys discriminated by relative pitch when the frequency ranges of the sequences were within the training range. Like humans, monkeys showed auditory stream segregation based on frequency proximity/separation. Even when a distractor sequence was presented simultaneously, monkeys could segregate tone sequences based on frequency proximity and discriminate the frequency contours of the target sequence. The discrimination performance of the monkeys deteriorated when a temporal gap (i.e., silence) was inserted between the component tones of sequences. A comparison experiment did not show such effects in human participants. These results suggest that monkeys discriminated tone sequences based on frequency transitions instead of global frequency contours. The monkeys’ performance seems to be explained in terms of global/local processing. Local features are more dominant discrimination cues in monkeys than in humans. Extensive use of global cues in humans may relate to the perception of complex auditory patterns such as speech.

The neuroethology of primate vocal communication

Asif A. Ghazanfar

Max-Planck-Institute for Biological Cybernetics


    The ethological approach has already provided rich insights into the neurobiology of a number of different species. Unfortunately, neuroscientists who investigate the function and structure of primate brains often focus on what they believe are more general cognitive processes and neglect primates' species-typical behaviours. Based on the consistency with which behavioral adaptations are mediated by specialized neural systems in the animal kingdom, I believe that the design of primate neural systems should reflect the specialized functions that these systems have evolved to carry out--in particular, social communication. My research combines the study of primate vocal behaviour with neurophysiology. My current work focuses on the auditory, visual and multi-modal processing of vocal and facial expressions in rhesus monkeys (Macaca mulatta).

Neural Basis of Vocal Communication in a New World Primate, Callithrix jacchus

Xiaoqin Wang

Department of Biomedical Engineering and Center for Hearing Sciences, Johns Hopkins University

    Understanding how the brain processes vocal communication sounds remains one of the most challenging problems in neuroscience and holds the promise to shed light on the origin of human speech and language processing. Species-specific vocalizations of non-human primates are vocal communication sounds used in intra-species interactions, analogous to speech in humans. Primate vocalizations are of special interest to us because, compared with other animal species, non-human primates share the most similarities with humans in the anatomical structures of their central nervous systems, including the cerebral cortex. Therefore, neural mechanisms underlying perception and production of species-specific primate vocalizations may have direct implications for those operating in the human brain for speech processing. Although field studies provide full access to the natural behavior of primates, it is difficult to combine them with physiological studies at the single neuron level in the same animals. The challenge is to develop appropriate primate models for laboratory studies where both vocal behavior and underlying physiological structures and mechanisms can be systematically investigated. This is a crucial step in understanding how the brain processes vocal communication sounds at the cellular and systems levels. Most non-human primates have a well-developed and sophisticated vocal repertoire in their natural habitats. However, for many non-human primate species, vocal activities largely diminish under the captive conditions commonly found in research institutions, due in part to the lack of proper social housing environments. Fortunately, some non-human primate species such as New World monkeys remain highly vocal in properly configured captive conditions. These non-human primate species can serve as excellent models to study neural mechanisms responsible for processing species-specific vocalizations.
Our long-term objective is to understand physiological mechanisms underlying perception and production of vocal communication sounds in the common marmoset (Callithrix jacchus), a highly vocal New World primate. The marmoset is chosen for our studies because it has several advantages over other primate species in behavioral, neurophysiological and developmental studies in laboratory conditions (it is highly vocal in captivity, is easily bred, and has a smooth brain). Our quantitative analyses of marmoset vocalizations have shown that they contain exquisite information for discriminating call types and caller identity, similar to the properties that have been demonstrated for human speech. We have further shown that acoustic structures of marmoset vocalizations undergo developmental changes that are subject to auditory inputs during development, indicating plasticity in marmoset vocal production. Our neurophysiological experiments have revealed a close correlation between spectral and temporal characteristics of vocalizations and response properties of neurons in the auditory cortex of marmosets. Our recent study has further shown that auditory cortex neurons are modulated by self-produced vocalizations, indicating auditory-vocal interactions in the superior temporal gyrus. These studies show that a great deal can be learned from the marmoset model toward our understanding of human speech and language processing in the brain.

Relations between music and language in developmental perspective

Laurel J. Trainor

McMaster University

    In this paper we consider the question of why we have two communication systems (language and music) based in the auditory modality. Language and music share a number of commonalities. Both rely on grammatical structures in which a small number of elements (phonemes or notes) are combined through rules to produce an unlimited number of utterances or musical compositions. In both, the information unfolds over time, and the listener needs to parse the ongoing stream into units such as words and phrases. Both are originally based in the auditory system, but in both we have developed written notation. In both, the listener must solve the normalization problem: a word or musical phrase must be identifiable whether spoken quickly or slowly, at a high or low pitch, or with different timbres. At the same time, however, language and music also differ in fundamental ways. Language is referential, referring to specific things, actions, or ideas; whereas music, for the most part, is not referential. On the other hand, music induces emotions directly in listeners whereas a linguistic utterance typically does not directly evoke emotion, although it may be about emotion. Interestingly, there are intermediate forms that appear to fall somewhere between language and music, such as poetry, in which both the referential meaning and the emotion evoked by the actual sounds and rhythmic patterns of the words are important.
    There are different views on the evolutionary relation between music and language, ranging from the idea that music is an unimportant evolutionary accident of language to the idea that music is an evolutionary adaptation in its own right. Music is found in all cultures, and it is clear from archeological evidence that music has been around for a very long time. It is possible that music and language had a common root, and diverged later in order to meet different needs. In terms of ontogeny, infants appear to go through similar stages of language and music acquisition. In language, infants are initially sensitive to more universal aspects of speech sounds, showing, for example, categorical perception. Through experience with a particular language, however, their speech sound categories reorganize to reflect the specific categories of their language-to-be. Similarly in music, infants are initially sensitive to universal aspects of musical structure, preferring at two months of age to listen to combinations of sounds that are consonant (sound pleasant to adults) in comparison to sounds that are dissonant (sound unpleasant or rough to adults). Considerable incidental experience is required, however, before infants become tuned to the particular scales and harmonies used in their musical system of exposure. For both music and language, formal training is not needed for implicit acquisition. Even musically untrained adults are aware when a note of a melody goes outside the key in which it is composed, and even the unschooled produce grammatically correct utterances.
    How are music and speech related in development? Interestingly, there is some evidence that music and speech may not be as differentiated for young infants as they are for older children and adults. Infant-directed speech, with its celebrated high pitch, large pitch contours, slow tempo, and repetition is clearly musical in character, and is often referred to as musical speech. Songs for infants, on the other hand, tend to be more conversational than most other musical forms. Structurally, they are simpler, with slower pitch contour and a more restricted pitch range than most other styles, making them more similar to infant-directed speech. Infant-directed songs are also rendered much as conversations, with the caregiver singing one-on-one with the infant, a phenomenon rarely seen other than in stylized opera. Pauses are often inserted between phrases as the singer waits for a response from the infant, and improvisation is often used, with phrases or verses repeated or left out at will, and the child's name inserted at strategic places. Early in infancy, caregivers use both speech and music to communicate emotionally at a basic level with their preverbal and pre-musical infants, and it may be that only with experience and cognitive maturation do speech and music become clearly differentiated.

What does a good "akachan-kotoba" sound like?: Japanese adults' intuitive judgments about a good Child-Directed Vocabulary

Reiko Mazuka

Duke University

    The Japanese language has a large vocabulary that is specific to child-directed speech. A survey of Japanese mothers with young infants showed that a large portion of Child-Directed Vocabulary (CDV) takes one of the following phonological forms:
(1) three-mora words with a special mora in the middle; e.g., maNma, buRbu, kuQku, aNyo.
(2) four-mora words with special morae in the second and the fourth position; e.g., waNwaN, shiRshiR.
(N stands for moraic nasal, Q geminate stops and fricatives, and R long vowel. These are called special morae in Japanese. They represent one mora, but do not constitute a syllable on their own.)
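The two word forms above can be made concrete with a small sketch. It assumes words are written in the romanized notation used in this abstract (lowercase letters for ordinary morae, N, Q and R for special morae); the function names `split_morae` and `cdv_form` and the simple mora-splitting rule are illustrative assumptions, not the procedure used in the study.

```python
import re

# A mora is either one special mora (N, Q, R) or an optional run of
# consonant letters followed by a single vowel (assumed romanization).
MORA = re.compile(r"[NQR]|[^aeiouNQR]*[aeiou]")
SPECIAL = {"N", "Q", "R"}


def split_morae(word):
    """Split a romanized word such as 'maNma' into a list of morae."""
    morae = MORA.findall(word)
    if "".join(morae) != word:
        raise ValueError(f"cannot split {word!r} into morae")
    return morae


def cdv_form(word):
    """Return 1 or 2 if the word matches form (1) or (2), else None."""
    is_special = [m in SPECIAL for m in split_morae(word)]
    if is_special == [False, True, False]:
        return 1  # three morae, special mora in the middle
    if is_special == [False, True, False, True]:
        return 2  # four morae, special morae in positions 2 and 4
    return None


for w in ["maNma", "buRbu", "kuQku", "aNyo", "waNwaN", "shiRshiR", "takutaku"]:
    print(w, split_morae(w), cdv_form(w))
```

Under these assumptions, the example words listed under (1) and (2) are classified as form 1 and form 2 respectively, while a plain four-mora word such as takutaku matches neither.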
Perception experiments with Japanese infants showed that by 8 to 10 months of age, Japanese infants show a preference for words that conform to the phonological forms of typical CDV. Interestingly, however, these are not the most common forms in adult Japanese.
The present study investigates where these forms of CDV come from. Approximately 1500 nonsense words (written in Katakana) were created, which varied systematically in the number of morae and syllables, the types and positions of consonants and vowels, and the types and positions of special morae. In Experiments 1 and 2, 160 adult native speakers of Japanese who had little experience with young children (college students) were asked to rate each word on a 7-point scale in terms of “How good an ‘akachan-kotoba’ (CDV) it sounds like.” In Experiment 3, a subset of the original nonsense words was given to 152 mothers of infants and young children for the same CDV rating. In Experiment 4, 147 college students were asked to rate each of these words in terms of “How good a Japanese word it sounds like.”
The results showed that the CDV ratings of the college students were highly correlated with those of the mothers (r=.87, p<.001). The fact that college students’ judgments about “good CDV” are highly similar to those of the mothers suggests that such judgment/knowledge need not be learned through extensive experience of interacting with children. In both groups, the words rated most highly as good CDV conformed to the word forms typically found in CDV (cf. 1 and 2 above). More specifically, 3- and 4-mora words that contained special morae in the second (and the fourth) positions were rated as the best CDV. When the special mora was a long vowel, the ratings were the highest (e.g., daRda, paRpaR).
On the other hand, college students’ ratings of ‘Japanese-ness’ had no correlation with mothers’ CDV ratings (r=-.03, p>.5), but had a small negative correlation with the CDV ratings of other college students (r=-.23, p<.001). The results indicate that Japanese adults’ judgment of good CDV words is distinct from their judgment of Japanese-ness. There are, however, interesting similarities between the two. In four-mora words, repetitions of two-mora sequences, i.e., abab (takutaku), were rated highly both as CDV and as Japanese. The results of a natural conversation analysis from 10 Japanese mother-infant dyads will also be discussed to determine the implications of the current findings for a model of language learning.
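For readers less familiar with the statistic, the correlations reported above are of the kind computed by the following sketch of Pearson's r over per-word mean ratings. The function name `pearson_r` is an assumption for illustration, and the toy ratings are invented placeholders, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


# Invented placeholder values: mean rating per nonsense word on a
# 7-point scale for two hypothetical rater groups.
students_toy = [6.1, 5.4, 2.0, 3.2]
mothers_toy = [5.8, 5.9, 2.5, 2.9]
r_toy = pearson_r(students_toy, mothers_toy)
print(round(r_toy, 2))
```

Each list holds one mean rating per word, so the coefficient describes how similarly the two groups rank the same set of words.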

Infant speech perception sets the stage for language acquisition

Janet F. Werker

Department of Psychology, University of British Columbia

    Unravelling the mystery of how infants acquire language so rapidly and apparently effortlessly is an exciting challenge. In this talk I will discuss recent research showing how this journey begins in perception. In particular, I will review recent research from our lab showing that 1) infants are born with perceptual biases which help direct their attention to key properties in language, 2) infants are able to use statistical regularities in the input to reorganize these biases to match the phonological categories that are needed in the native language, and finally 3) these reorganized perceptual categories help bootstrap infants into language acquisition proper.

Infants' recognition of word units

Sachiyo Kajikawa

NTT Communication Science Laboratories

    Segmenting the speech stream into units is essential for language acquisition. This presentation will focus on infants' recognition of word units as a basis for lexical development. How robustly and precisely infants recognize word units will be discussed on the basis of three experimental studies: word recognition in speech and song, detection of word boundaries in speech, and sensitivity to word-final phonetic changes. The results of these studies indicate that infants at around 12 months of age have a robust processing ability to recognize word units in songs and non-native speech, as well as native speech.

Gesture, speech, and language

Susan Goldin-Meadow

Department of Psychology, University of Chicago

    In all cultures in which hearing is possible language has become the province of speech (the oral modality) and not gesture (the manual modality). Why? This question is particularly baffling given that humans are equipotential with respect to language-learning - that is, if exposed to language in the manual modality, children will learn that signed language as quickly and effortlessly as they learn a spoken language. Thus, on the ontogenetic time scale, humans can, without retooling, acquire language in either the manual or the oral modality. Why then, on an evolutionary time scale, has the oral modality become the channel of choice for languages around the globe? One might guess that the oral modality triumphed over the manual modality simply because it is so good at encoding messages in the segmented and combinatorial form that human languages have come to assume. But this is not the case - the manual modality is just as good as the oral modality at segmented and combinatorial encoding. There is thus little to choose between sign and speech on these grounds. However, language serves another important function - it conveys mimetic information. The oral modality is not well suited to this function, but the manual modality excels at it. Indeed, the manual modality has taken over this role (in the form of spontaneous gestures that accompany speech) in all cultures. It is possible, then, that the oral modality assumed the segmented and combinatorial code, not because of its strengths but to compensate for its weaknesses.
This argument rests on several assumptions. The first is that the manual modality is as adept as the oral modality at segmented and combinatorial encoding. The fact that sign languages of the deaf take on the fundamental structural properties found in spoken language supports this assumption. Even more striking, however, is the fact that deaf children not exposed to sign language can invent a gestural language that also has the fundamental structural properties of spoken language. I begin by describing data on these homemade gestural systems. The second assumption is that mimetic encoding is an important aspect of human communication, well served by the manual modality. I present data on the gestures that accompany speech in hearing individuals, focusing on the robustness of the phenomenon (e.g., the fact that congenitally blind children gesture even though they have never seen anyone gesture) and the centrality of gesturing to human thought. I end with a brief discussion of the advantages of having a language system that contains both a mimetic and a segmented/combinatorial code, and of the role that gesture might have played in linguistic evolution.

The neural systems underlying sign language

Karen Emmorey

Laboratory for Cognitive Neuroscience, Salk Institute for Biological Studies

    For more than a century we have known that the left hemisphere of the human brain is critical for producing and comprehending language. Similarly, research over the past two decades has shown that the left cerebral hemisphere is also critical to processing signed languages. However, we are only now beginning to discover whether the neural systems within the left hemisphere are influenced by the visual input pathways or manual output pathways required for sign language comprehension and production. In this talk, I will explore how visual perception and action interact with respect to linguistic and non-linguistic processes, presenting results from a series of positron emission tomography (PET) and functional Magnetic Resonance Imaging (fMRI) studies. Our results indicate that the distinct biological basis of sign language results in a unique interface between vision and language and between action systems and language production.
        The ASL system of classifier constructions is a linguistic subsystem that is used to represent objects moving or located with respect to each other in space. Spatial information is encoded by the relationship between the hands in signing space, rather than by a grammatical morpheme (e.g., a preposition such as on). Such spatial encoding has several consequences for the cognitive and neural systems that underlie spatial language in signed languages. A second modality-driven distinction between signed and spoken languages is the fact that the articulators required for signing are the same as those involved in non-linguistic reaching and grasping movements. However, unlike reaching and grasping, sign articulations are structured within a phonological system of contrasts. Unlike speech, however, certain signs (specifically, handling classifier verbs such as STIR or BRUSH-HAIR) exhibit visual-motoric iconicity related to action production. The results indicate that even when the form of a sign is indistinguishable from a pantomimic gesture, the neural systems underlying its production mirror those for spoken words. Finally, a third distinction between signed and spoken languages is the use of the face to express linguistic contrasts. Recognizing that certain facial expressions mark grammatical distinctions is unique to signers, whereas recognizing emotional facial expressions is universal to humans. We investigated the neural systems underlying recognition of these functionally distinct expressions, comparing deaf ASL signers and hearing nonsigners. The results indicate that function in part drives the lateralization of neural systems that process human facial expressions.  
    In sum, I will discuss the neural-cognitive underpinnings of linguistic systems (spatial language, sign production, and perception of linguistic facial expressions) compared to similar non-linguistic systems (spatial cognition, action production, and perception of emotional facial expressions). 

The Role of Broca’s Area and Mirror System for Language Learning

Nobuo Masataka

Primate Research Institute, Kyoto University

    With regard to the production of rhythmic manual movements and canonical babbling, a significant positive relationship is observed between increased production of upper limb rhythmicities and the age of onset of canonical babbling. Neurologically, it is known that the supplementary motor area plays an important role in such vocal-motor coordination. Noninvasive measurement of brain activity in hearing adults learning novel words from a foreign language reveals that this region is an essential neural correlate of the motor planning of speech actions. In a similar experiment with congenitally deaf adults who had acquired a signed language as their first language, this region is not activated during the learning of novel signs from a foreign signed language. Instead, activation of the so-called 'mirror system' in the brain is recorded. At the initial stage of learning, the signing action is processed on the basis of a visual analysis of the different elements that form it, with no motor involvement. During this stage, the movement is perceived as biological motion. However, the deaf adults soon come to map the visual representation of the observed action onto the motor representation that they possess for their first language. Accompanying this change, the mirror system, which includes Broca’s area and its right homologue, comes to be activated. These regions are also important for hearing young children when categorizing the meanings of new words. One of the difficulties they face in word learning is the so-called mapping problem. When a caregiver holds up a ball and says “This is a ball”, the question arises as to how they know that it is the ball that is being labeled rather than the ball’s color, the material the ball is made of, or the shape of the ball.
The results of my experiment introduced here indicate that, in order to solve this difficulty, they make use of information about the caregiver’s action, and that this processing occurs in the mirror system in the brain. The linguistic function exerted by Broca’s area should be regarded as much more communicative than has ever been assumed.

What gestures in modern humans can tell us about origins of language

Sotaro Kita

University of Bristol

    One of the important questions regarding origins of language is how language came to have the properties that it has. I will discuss the roles children and gesture could have taken in evolution of language based on findings from two studies. 
One of the hallmarks of language is that complex information is broken down into words that refer to component concepts, and they are put into a hierarchically organized linear sequence. I argue that children's learning process has a bias toward linear and hierarchical organization of symbols. The evidence comes from a comparison of speech-accompanying gestures by hearing Spanish-speaking Nicaraguans with Nicaraguan Sign Language, a new language that has emerged over the last two decades within the Nicaraguan Deaf community. It is found that Nicaraguan Sign Language is more linear and hierarchical than speech-accompanying gestures and, furthermore, that the linear-hierarchical property of Nicaraguan Sign Language increased during the first decade after its inception.
Modern humans gesture spontaneously when they speak. The "urge" to gesture while speaking is very robust, as demonstrated by a study of gesturing by congenitally blind individuals (Iverson & Goldin-Meadow, 1998). Speakers' "urge" to gesture is especially strong for a special class of words, called "mimetics" ("giongo gitaigo") in Japanese linguistics. Mimetics are sound symbolic in the sense that there is a systematic relationship between sounds and meaning. In addition, mimetics exhibit iconic relationships between the word form and meaning. When Japanese speakers utter a mimetic in a narrative, they almost always produce an iconic gesture that refers to the same event as the mimetic. This may suggest that the origin of the speech-gesture link started in a special subtype of words like mimetics, and then spread into "normal" words in language. According to this view, iconic gestures are not a precursor to language, but emerged out of the evolution of our ability to iconically map different types of mental representations (e.g., sound and vision in some of the mimetics, hand movement and spatial representation in some of the iconic gestures).
Thus, children's language development creates linear-hierarchical organization of symbols. Mimetics-like words connect language and gesture. The possible evolutionary relationships between these two lines of development will be discussed.

Interaction analysis of multi-party conversation using ubiquitous sensor data

Mayumi Bono and Yasuhiro Katagiri

ATR Media Information Science Laboratories

    In this paper, we focus on taking turns and exchanging participant roles in face-to-face interactions, with the aim of building applications that detect real-life activities by computer. We collected many kinds of data into the Interaction Corpus (Hagita et al., 2003). This corpus was constructed during the ATR Exhibition 2002, when a wide variety of people from outside ATR visited poster and oral presentations and joined demonstrations for two days. The corpus consists of the data that were gathered in the Ubiquitous Sensor Room, which had numerous sensors and cameras for recording the behaviours of exhibition participants. This room had five booths where the exhibitors held poster presentations. Here, we show the necessity of providing a technical method of using this huge data set for understanding human behaviours in open situations. Many studies have been conducted in the fields of sociology and anthropology on the structure of conversations and interactive communities. However, very little work has been done in the Computer-Human Interface community, despite the increasing recognition of the importance of non-verbal information and its functions in human-to-human interaction. To grasp the mechanism of the interactive dynamics occurring in a poster presentation room, it is useful to analyze the turn-taking system (Sacks et al., 1974) and the exchange of participant roles (Goffman, 1981; Clark, 1982) used by the participants.
 First, turn-taking is one of the most fundamental mechanisms of human conversation in face-to-face interactions. We analyzed the turn dominance of each participant in conversation by observing speech duration and pause duration as factors representing the state of each poster presentation booth. The exhibitors of the poster presentations were the leaders of each conversation. They could control whether the turn-taking system starts up or not by continuing to control the turn. We analyzed the ratio of speech to pause in 10 exhibitors’ presentations. There were consistent tendencies in the exhibitors’ turn-dominating behaviour. We measured the durations of pauses in the exhibitors’ speech. At the beginning of presentations, they maintained a pause duration of about 500 ms or less. They used these pauses for breathing. In the middle of presentations, the use of pauses changed remarkably. The pause durations became longer, averaging about 3000 ms. They then became shorter again toward the end of presentations. We assume that the short pauses are used to take a breath for producing speech and that the long pauses are used not only for breathing but also to give turns to interlocutors. Consequently, we can predict turn-taking dynamics through observation of exhibitors’ speech behaviours. We call the conversational state with short pauses ‘Lecture mode: L-mode’ and the state with long pauses ‘Interaction mode: I-mode’. We established a rule for automatically finding conversational-state transitions by using speech data, and applied it to all of the data in the corpus. After detecting conversational modes for all data, we compared participants’ behaviours in both modes, including eye-gaze direction, body direction, and participation in conversation. As a result, we found that when exhibitors explain their work to an audience of several people, they tend to look at the poster, and when they have a single-person audience, they tend to look at the audience.
This phenomenon can be explained by referring to the turn-taking system described by Sacks et al. (1974), that is, the process of selecting next speaker.
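The idea of detecting transitions between L-mode and I-mode from pause durations can be sketched as follows. The abstract does not state the actual rule, so the single fixed threshold used here (1000 ms, chosen only because it lies between the reported values of about 500 ms and about 3000 ms), the function names, and the sample pause values are all assumptions for illustration.

```python
def label_modes(pause_ms, threshold_ms=1000.0):
    """Label each pause 'L' (short, lecture-like) or 'I' (long, interactive).

    threshold_ms is a hypothetical cut-off, not a value from the study.
    """
    return ["I" if p >= threshold_ms else "L" for p in pause_ms]


def find_transitions(labels):
    """Return (index, previous_mode, new_mode) wherever the mode changes."""
    return [
        (i, labels[i - 1], labels[i])
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]


# Invented pause durations (ms) for one hypothetical presentation:
# short pauses at the start, long pauses in the middle, short at the end.
pauses = [400, 450, 500, 2800, 3200, 3000, 480, 420]
modes = label_modes(pauses)
print(modes)
print(find_transitions(modes))
```

A practical rule would probably need smoothing over several consecutive pauses so that one unusually long breath does not register as a mode change; that refinement is left out of this sketch.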
 Second, we assume that exhibitors’ behavior will change depending on audience size. We compared the conversational organization of multi-party and two-party conversations through observation of a few cases. Some sociologists and anthropologists who analyze the phases of interaction have defined the internal structure of conversation as 'participation structure' or 'participation framework,' which is created by overlapping individual behaviors. In conversation, the participants exchange roles such as ‘speaker’, ‘addressee’, and ‘side-participant’ (Clark, 1982) by momentarily exchanging the right to speak, that is, by taking turns. Based on the model of conversational participation structure, we hypothesized that the audience diversity in spontaneous conversations can be understood by observing the participant role of each person standing in front of a poster. We propose that the participation structure observed in the Interaction Corpus can effectively be captured by the differences in the transition patterns of the participant roles taken by each participant in the conversation. As our procedure, we established a coding rule for assigning participant roles to each person by using sensors to calculate distance, speech duration, pause duration and direction of body. After that, we coded participant roles for audience members, which allowed us to predict how actively or passively a particular audience member would participate in conversation. We confirmed that these verbal and non-verbal participation behaviours reflected audience interest in the poster’s information by comparing our observations with the results of questionnaires given to the participants.
 We proposed an application of ubiquitous sensor technology for analyzing the dynamics of multi-party human-to-human conversational interactions in open situations. We demonstrated this method’s ability to explain the variety of possible transitions of participation structure by utilizing dynamic shifts in the roles of participation. The implications of our study, although still preliminary and restricted to a small range of interaction types, are promising for future expansion. We believe the method itself has a huge potential for elucidating the use of non-verbal cues in human-to-human interactions, particularly when combined with various automatic signal processing technologies.

The evolution of language: Surveying the hypotheses

Tecumseh Fitch

University of St. Andrews

    I will argue that scientific consideration of the evolution of language has reached a point where both the number of hypotheses and the mass of data available to test them demand a clear statement of the problems to be solved, and a dispassionate survey of the multiple available hypotheses. It is easy enough to generate speculative hypotheses about the evolution of any single component of the language faculty. However, the task of constructing a coherent account of the evolution of all the various elements underlying language as a whole, which is consistent with both available data and accepted neo-Darwinian theory, is far from trivial. Features of language to be accounted for include speech production, especially vocal imitation, complex intersubjective cognition including theory of mind, intentional semantics or "aboutness", and complex syntax. Constraints to be satisfied include compatibility with modern evolutionary theory, known hominid phylogeny, existence of plausible precursor abilities in animals, and conservatism of the vertebrate brain (and genome). A brief survey of single-cause hypotheses suggests that none is adequate to account for all features of language, implying that some variant on a two-stage hypothesis, incorporating selection on an intermediate "protolanguage", is necessary. I consider four such hypotheses: Condillac's gestural origins hypothesis, Bickerton's asyntactic protolanguage hypothesis, Merlin Donald's mimetic stage, and Darwin's prosodic protolanguage hypothesis. Although at present none of these hypotheses is perfect or complete, I conclude that Donald's and Darwin's hypotheses do the best job and leave the fewest questions unanswered. More importantly, I conclude that progress in the field demands rational, simultaneous comparison of multiple hypotheses with the whole field of data, rather than passionate single-hypothesis advocacy.

From hand to mouth: The origins and evolution of language

Michael C. Corballis

Research Centre for Cognitive Neuroscience, University of Auckland

    I argue that language originated in manual gestures rather than vocalizations. The primary evidence for this is as follows:
1. Primates, including the great apes, have much better cortical control over the hands than over vocalization;
2. Attempts to teach great apes to communicate using a modified form of sign language, or by pointing to visually displayed symbols, have been much more successful than attempts to teach them anything resembling speech;
3. The “mirror-neuron system” in monkeys, which has to do with the control of reaching and grasping actions, is located in the homolog of Broca’s area;
4. The left cerebral hemisphere in most people is responsible for both speech and manual praxis, and this lateralization is also manifest in right-handedness;
5. Signed languages invented by the deaf have all of the essential properties of spoken language, including generative syntax, and go through the same developmental stages (including manual “babbling”);
6. People characteristically gesture manually as they speak.
Speech itself can be regarded as a gestural system, involving programmed movements of articulators, including the mouth and tongue. Language may have progressed from a predominantly manual system to a facial system as the hands were more involved in the manufacture and use of tools. The vocal component may have originated as grunts accompanying manual and later facial gestures, but gradually assumed an additional communicative role, allowing access to gestures otherwise hidden from view in the mouth, and providing extra contrasts between voiced and unvoiced gestures. Ultimately, the vocal component became dominant.
Syntactic language probably evolved with the emergence of the genus Homo from around 2 million years ago. Evidence for increased cognitive sophistication includes the increase in brain size, the emergence of stone tool industries, and migrations from Africa. The progression to a fully autonomous vocal system may not have been complete until the appearance of Homo sapiens within the past 200,000 years. Some insight into precisely when autonomous speech may have emerged comes from the FOXP2 gene on chromosome 7, which has been implicated in disorders of speech; a comparative study suggests that this gene underwent a mutation at some point within the past 200,000 years.
Evidence on the role of the FOXP2 gene remains controversial, but there are reasons to believe that it is involved primarily in articulation rather than language per se. One possible scenario is that a mutation of this gene some time between 100,000 and 50,000 years ago in Africa was the final step in the emergence of autonomous speech, and was indirectly responsible for the domination of migrants out of Africa around 50,000 years ago, resulting in the so-called “human revolution” in Europe from around 40,000 years ago.

Evolution of motherese in prelinguistic hominins

Dean Falk

Department of Anthropology, Florida State University

    Genetic data strongly suggest that human beings and chimpanzees are descended from a common ancestor that lived between 5 and 7 million years ago. For this reason, comparative studies of the two living groups provide an important tool for formulating hypotheses about human evolution. This paper uses such an approach to explore when and how the special form of infant-directed speech known as baby talk, or motherese, first evolved in hominins, and to formulate hypotheses about its relationship to the eventual emergence of protolanguage.
    Although infant chimpanzees older than two months are able to cling unaided to their mothers’ bodies, human infants never develop the ability to do so because they are born at extremely undeveloped stages (i.e., when their heads are still small enough to negotiate the birth canal). The high degree of helplessness in human infants is the result of structural constraints that were imposed on the morphology of the birth canal by selection for bipedalism in conjunction with an evolutionary trend for increased brain (and fetal head) size. Thus, unlike the human mother, the chimpanzee mother is able to go about her business with her tiny infant autonomously attached to her abdomen, and with her forelimbs free to forage for food or grasp branches. According to the “putting the baby down” hypothesis, before the invention of baby slings, early bipedal mothers must have spent a good deal of time carrying their helpless infants in their arms and would have routinely freed their hands to forage for food by putting their babies down nearby where they could be kept under close surveillance. Unlike chimpanzee infants, human babies cry excessively as an honest signal of the need for reestablishing physical contact with caregivers, and it is suggested that such crying evolved to compensate for the loss of infant-riding during the evolution of bipedalism. Similarly, unlike chimpanzee mothers, human mothers universally engage in motherese that functions to soothe, calm, and reassure infants, and this, too, probably began evolving when infant-riding was lost and babies were periodically put down so that their mothers could forage nearby. Thus, for both mothers and babies, special vocalizations are hypothesized to have evolved in the wake of selection for bipedalism to compensate for the loss of direct physical contact that was previously achieved by grasping extremities.
    In contrast to the relatively silent mother/infant interactions that characterize living chimpanzees (and presumably their ancestors), as human infants develop, motherese provides (among other functions) a scaffold for their eventual acquisition of language. Infant-directed speech varies cross-culturally in subtle ways that are tailored to the specific difficulties inherent in learning particular languages. As a general rule, infants’ perception of the prosodic cues of motherese in association with linguistic categories is important for their acquisition of knowledge about phonology, the boundaries between words or phrases in their native languages, and, eventually, syntax. Prosodic cues also prime infants’ eventual acquisition of semantics and morphology. The growing literature on developmental linguistics therefore suggests that the vocalizations with their special signaling properties that first emerged in early hominin mother/infant pairs continued to evolve and eventually formed the prelinguistic substrates from which protolanguage emerged.

Motor control, memory and the evolution of human linguistic and cognitive ability

Philip Lieberman

Cognitive and Linguistic Sciences, Brown University

    Evidence derived from the study of the behavioral deficits that follow brain damage and from brain imaging of neurologically intact human subjects shows that the 19th century theory that identifies Broca's and Wernicke's areas as the "seats" of human linguistic ability is wrong. As is the case for many other aspects of behavior, human speech and syntax involve the activity of neuronal populations in many parts of the brain, linked in cortical-striatal-cortical circuits. Although Broca's area, one of the traditional "language organs" of the cortex, is involved in speech production and sentence comprehension (as well as manual activity), experimental evidence from many independent research projects shows that the striatal basal ganglia play a key role in producing human speech, comprehending the meaning of a sentence, and shifting criteria during cognition. The basal ganglia, by inhibiting and activating motor or cognitive pattern generators, can "reiterate" a finite set of elements to produce voluntary speech and a potentially infinite number of sentences. Similar basal ganglia operations confer cognitive flexibility and the ability to comprehend complex syntax. Recursive motor activities that have a cognitive base, such as dancing, involve equivalent basal ganglia activity.
    Motor control thus constitutes one of the preadaptive factors for the evolution of human cognitive and linguistic ability, and Darwinian natural selection for bipedal locomotion in the earliest stages of hominid evolution may have triggered the evolution of an enhanced striatal sequencing engine. However, genetic and developmental studies of FOXP2, the putative "speech and language" gene, which regulates the development of basal ganglia and associated neural structures, suggest that articulate speech, complex syntax, and enhanced cognitive ability (as well as dancing) are species-specific attributes of Homo sapiens that evolved about 200,000 years ago. Apes, which lack a human striatal system, can neither talk nor command complex syntax.
    Neanderthal hominids, who diverged from the human line 500,000 years ago, may have lacked the basal ganglia sequencing engine necessary to produce voluntary speech. Neanderthals also appear to have lacked the anatomy necessary to produce the full range of human speech. Ontogenetic studies of human neonates and young children show that the face and mouth follow a developmental trajectory that differs from that of non-human primates. The human mouth is short, the human larynx is low, and the human tongue can produce abrupt and extreme discontinuities in the volume of the oral cavity and pharynx, which constitute the supralaryngeal vocal tract (SVT). The human species-specific supralaryngeal airway can produce sounds enhancing the perception of speech. In contrast, Neanderthals exhibit the non-human primate skull growth pattern; their mouths are long, precluding functionally human supralaryngeal vocal tracts. Their larynges would have to be placed in a low position that does not occur in any hominoid to achieve an SVT that was functionally equivalent to that of a modern human.
    Equal weight must be placed on the brain's dictionary. Imaging studies (fMRI and PET) show that the neural lexicon is a memory base associating sensory and objective knowledge gained through real-world experience with the phonetic shapes of words and their syntactic roles. Comparative studies of living apes show that they have limited lexical ability, which suggests that the evolution of the human brain's dictionary can be traced back to the common ancestor of apes and humans. However, the posterior regions of the human brain that can be associated with the neural dictionary are disproportionately large compared to those of non-human primates; enhanced lexical ability again appears to be a species-specific human attribute.