Gestures and multilevel discourse in spontaneous speech corpora: the case of reported speech


  • Received: 15/11/2023
  • Accepted: 30/11/2023
  • Published: 04/01/2024


The aim of this work was to analyze reported speech units and the gestural units that are produced alongside them. Corpus-based studies grounded in the Language into Act Theory, the theoretical framework adopted in this research, have shown that the boundary zones of gestural units tend to coincide with the prosodic boundaries of speech. Regarding reported speech, Good 2015 observes that when we insert it into the flow of spontaneous speech, we use resources that show its meta-illocutionary character, such as prosodic variation, changes in body posture and also enacted gestures. Thus, differences in the prosodic and gestural profile are to be expected in the shift from unreported to reported speech, and vice versa. In this research, these aspects were analyzed based on a corpus of spontaneous speech, C-ORAL-BGEST, informationally labeled according to the Language into Act Theory and gesturally labeled according to the guidelines of McNeill 1992, Kendon 2004 and Bressem, Ladewig and Müller 2013. The results seem to show that the change in discursive level, i.e. the transition from the level of the utterance to the level of reported speech, is noticeable not only prosodically, but also gesturally.

1. Introduction

This work analyzes spontaneous speech in order to study reported speech units and the gestural units that are produced in parallel with them. For this purpose, gestures are understood as deliberate actions that the interlocutor perceives as aiming at expression, and not at some purely practical goal (Kendon 2004).

Furthermore, in this perspective, besides being expression-oriented and synchronous with speech, gestures are idiosyncratic, since they are produced on the spot by speakers as they express themselves (McNeill & Duncan 2000). In other words, although a few gestures are conventional in certain cultures, gestures produced in parallel with speech do not follow a pre-established or conventionalized pattern shared among speakers.

Given their expressive scope, the gestures we perform when engaging in everyday interaction can be conceived as integral parts of an utterance. They unfold consistently with what is expressed in speech, i.e., the spoken component of that same utterance (Kendon 2004). Speech and gesture thus correlate so intrinsically that they can be treated as parts of a single semiotic sphere (McNeill 1992).

Regarding the relation between gestures and prosody, an aspect of speech of particular interest to this research, gestural and prosodic units tend to be synchronized, as already noted by Kendon (1972) in his early analyses (see also McNeill 1992, 2005; Kita et al. 1998). Subsequent corpus-based studies grounded in the Language into Act Theory (Cresti 2000; Moneglia & Raso 2014; Cavalcante 2020) have shown, among other things, that the boundaries of gestural units tend to coincide with prosodic boundaries (Cantalini 2018; Cantalini & Moneglia 2020; Barros 2021). The present research is also based on the Language into Act Theory framework; section 2 therefore briefly presents its main theoretical guidelines to contextualize our analyses.


2. The Language into Act Theory

This study is grounded in the theoretical framework of the Language into Act Theory, henceforth L-AcT (Cresti 2000; Moneglia & Raso 2014; Cavalcante 2020), a pragmatic, corpus-driven theory aimed at studying the informational structure of spontaneous speech. Within this framework, the units of reference for linguistic analysis are the minimal units that can be pragmatically and prosodically interpreted in the flow of speech (Izre'el et al. 2020). Such units are identified through prosodic and pragmatic parameters: they contain at least one illocution and are considered terminated units because a terminal prosodic boundary can be perceived after them.

These units of reference can consist of one or more prosodic/informational patterns. According to L-AcT, informational functions are conveyed by prosodic units. A prosodic/informational pattern, in turn, is built around a nuclear unit that carries the illocutionary force, the comment unit (COM). A pattern can also host non-illocutionary units of two types: textual units, which compose the semantic text of the pattern, and dialogic units, which regulate the dialogic interaction. These units can be classified according to three criteria: function, position, and prosodic form (f0 curve, duration, intensity, and alignment with the syllabic structure) ('t Hart, Collier & Cohen 1990; Firenzuoli 2003). Within a pattern, the prosodic units are separated by non-terminal prosodic boundaries.

Based on the concept of prosodic/informational pattern, the units of reference can be of two types: utterances or stanzas. Utterances can be defined as the terminated units made up of just one prosodic/informational pattern; stanzas are terminated units made up of more than one juxtaposed pattern linked by a boundary with a continuity signal.

Among the non-illocutionary units that can make up a prosodic/informational pattern, the locutive introducer (INT) is of particular interest to our study. INT signals that what comes next belongs to a different level, not directly related to the current conversation (Maia Rocha & Raso 2011).


3. Discursive levels

Three different discursive levels can be observed in spontaneous speech: the level of the utterance, the level of the parentheticals, and the meta-illocutionary level (see example 3.5 and figure 3.1). In the speech flow, when speakers move from one level to another, they mark this transition prosodically to indicate that adjacent prosodic units belong to different hierarchical levels.

The level of the utterance is where most of the interactional flow between speakers takes place. It is grounded in the here and now of the interaction. The parenthetical level, in turn, conveys any comments made by the speaker about what they say at the level of the utterance in order to make its interpretation clearer. These comments can be metanarrative – when the speaker adds information considered crucial to understanding what has been said (see example 3.1); modal – when the speaker shows their commitment to what has been said (see example 3.2); or metalinguistic in a lexical sense – when the speaker reformulates something they said (see example 3.3) (Barros 2021; Santos 2020; Tucci 2010).

The examples below were all extracted from the C-ORAL-BRASIL I (Raso & Mello 2012) corpus, and the parentheticals are labeled PAR.


(3.1) Metanarrative parenthetical: bfamdl03[1041]:

*LUZ: aqui o’ / eu topei cum caminhão aqui / o dia que eu vim sozinha /=PAR= ele / fazendo a curva / subindo / me espremeu ali / quase que eu caí na vala //

*LUZ: here look / I bumped into a truck here / the day I came alone /=PAR= he / making the turn / going up / squeezed me there / I almost fell into the ditch //


(3.2) Modal parenthetical: bfamdl01[44]:

*FLA: só que é de micro-ondas / eu acho //=PAR=

*FLA: but it's from a microwave / I think //=PAR=


(3.3) Metalinguistic (lexical) parenthetical: bfammn06[37]:

*JOR:  e nós távamos entrando com outro tipo de aparelho de televisor no mercado / que era uma coqueluche /=PAR= era uma novidade /=PAR= e os próprios vendedores das loja nũ / tinham experiência pra mostrar aquilo pro consumidor brasileiro //

*JOR: and we were entering the market with another type of television set / which was a coqueluche /=PAR= was a novelty /=PAR= and the salesmen themselves didn't / have the experience to show it to the Brazilian consumer //

Prosodically, the parenthetical level often exhibits a change in f0, usually a fall, compared to adjacent units belonging to the utterance level. It also shows a reduction in intensity and may exhibit a higher articulation rate relative to its surroundings. There is often a pause before and/or after the parenthetical (see Figure 3.1).

The meta-illocutionary level, in turn, indicates that the here and now of the situation is pragmatically suspended. The most common meta-illocutions are emblematic exemplification, instruction, and, above all, reported speech, which will be discussed in section 3.1. Example (3.4) illustrates instruction.


(3.4) Instruction: bfamcv03

*CEL: ah / faz assim /=INT= mata o &no [/2] o oito nosso direto //

*CEL: ah / do like this /=INT= kill the &ni [/2] our eight directly //

Example (3.5), extracted from the C-ORAL-BGEST corpus (Mello et al. in preparation), illustrates a situation where all three levels can be observed. The units in this excerpt are tagged with the informational labels adopted by L-AcT: COM, COB and CMM stand for the illocutionary units; TOP stands for topic, i.e., the domain of identification for the interpretation of the illocution; and PAR stands for parenthetical.

To indicate that a certain unit is reported, « _r » (r for reported) is added to the label assigned to that unit. Therefore, if there is, for instance, a reported illocutionary unit, its label will be COM_r, COB_r, or CMM_r.

In the example below, the meta-illocutionary level is indicated in bold, and the parenthetical level in italics.

*CAR: aí do nada /=TOP= eu tava [/2] eu voltei a fazer terapia /=PAR= aí / eu tava [/1] tinha terminado de fazer minha sessão /=TOP= eu catei o celular /=COB= e ele /=INT= tamo nessa de não se falar de novo //=COM_r=

*CAR: then out of the blue /=TOP= I was [/2] I was back in therapy /=PAR= then / I was [/1] I’ve just finished the session /=TOP= I picked up my cell phone /=COB= and he /=INT= we're not talking again //=COM_r=

                 Figure 3.1 – Intensity and f0 contour for example (3.5)

Since speech and gesture can be treated as parts of a single semiotic sphere, we wanted to verify whether the differences between the three discursive levels are marked not only prosodically but also gesturally. In fact, Barros 2021 noted that the shift from the utterance level to the parenthetical level is marked through strategies of gestural pattern contrast. This paper aims to analyze the shift to the meta-illocutionary level, focusing on the case of reported speech.


3.1 Reported speech

Reported speech can be defined as direct discourse that speakers insert into the flow of their own speech. In reported speech, there is a change in deictic parameters; that is, the here and now is suspended to insert something that was said at another time, either by the speaker or by someone else.

However, when reporting a past discourse, the speaker not only repeats (or tries to repeat) what has been said but may also incorporate other strategies that allow the hearer to notice the suspension of the here and now of the current situation. Indeed, Good (2015) argues that reported speech is actually just one facet of a whole reported action.

According to Good 2015, some features that may accompany reported speech are shifts in body posture and gaze, prosodic variation, and enacted gestures. Regarding prosodic variation, reported speech tends to exhibit higher intensity and f0 values than unreported speech.

Zuckerman 2021 draws attention to the fact that reported speech is not an exact and faithful enactment of the reported event. Hence, if we analyze reported speech from a gestural perspective, and not only from a semantic or prosodic viewpoint, we can access valuable information about the reported action itself and, above all, about how the speaker chooses to report it.

In this paper, we investigate, based on spontaneous speech corpus data, the hypothesis that the shift from unreported to reported speech, and vice versa, is also marked gesturally.

4. Gestures

Just like speech, gestures can be subdivided into units, the so-called gesture units (GUnits). A GUnit is completed when the speaker removes their hands from a default position, defined as the rest position, performs a series of gestures, and then returns their hands to the rest position. Therefore, GUnits can be defined as «the set of gestures performed between two resting positions.»

The GUnits themselves can be subdivided into gesture phrases (GPhrases). GPhrases are defined by the presence of one gestural nucleus, called the stroke (see fig. 4.1). The stroke is one of the gesture phases (GPhases) that constitute a GPhrase and is the only mandatory one. Just as the illocution is the nucleus of a prosodic and informational pattern, the stroke is the nucleus of a GPhrase. Other GPhases can be combined with the stroke to form the GPhrase, and they fall into four types, depending on the function they perform (fig. 4.1): preparation, retraction, hold, and rest (Ladewig & Bressem 2013).

Figure 4.1 – Gesture phases (Ladewig & Bressem 2013, p. 1075)
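
As an illustration of these definitions, the following minimal Python sketch segments a sequence of annotated gesture phases into GUnits and counts the GPhrases in each. It is not part of the annotation procedure used for the corpus: the plain list of phase labels and the function names are hypothetical, and the number of GPhrases is simply taken to be the number of strokes, given that each GPhrase contains exactly one stroke.

```python
from itertools import groupby

def split_gesture_units(phases):
    """Group a phase sequence into GUnits: runs of non-rest phases.

    Assumption for illustration: `phases` is a chronologically ordered list
    of phase labels. A run that is not closed by a final "rest" is still
    returned, although strictly speaking that GUnit is not yet completed.
    """
    units = []
    for is_rest, run in groupby(phases, key=lambda phase: phase == "rest"):
        if not is_rest:
            units.append(list(run))
    return units

def count_gesture_phrases(unit):
    """Each GPhrase has exactly one stroke, so strokes count GPhrases."""
    return sum(1 for phase in unit if phase == "stroke")

# Hypothetical annotation, not taken from the corpus.
phases = [
    "rest",
    "preparation", "stroke", "hold", "preparation", "stroke", "retraction",
    "rest",
    "preparation", "stroke", "retraction",
    "rest",
]

units = split_gesture_units(phases)
phrase_counts = [count_gesture_phrases(unit) for unit in units]
print(len(units), phrase_counts)
```

In this toy sequence the hands leave the rest position twice, so two GUnits are found; the first contains two strokes and the second one.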

Studies conducted on spontaneous Italian speech (Cantalini 2018; Cantalini & Moneglia 2020) have shown synchrony between prosodic and gesture units and a tendency for the two nuclei (i.e., illocutionary focus and stroke) to align with each other. Furthermore, the authors claimed that GPhrases tend to remain within the same unit of reference (see section 2) and never cross terminal prosodic boundaries. Barros 2021, while investigating parentheticals in Brazilian Portuguese, also observed that the stroke tends to synchronize with prosodic prominences. Mayberry & Jacques 2000, focusing on stuttering, concluded that interruptions in speech flow had repercussions on gestural patterns, which were interrupted or altered and were resumed only when fluent speech was also resumed.

The concept of gestural pattern considered here encompasses the gestural parameters introduced by Bressem, Ladewig, & Müller 2013, namely those related to movement type, direction, and quality; handshape, orientation, and position (for further details, see Bressem 2013). A change in at least one of these parameters constitutes a change in gestural pattern.
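
The criterion above can be made concrete with a small sketch. Assuming, purely for illustration, that each gesture is annotated with one value for each of the six parameters, two gestures belong to different gestural patterns whenever at least one value differs. The class, field names, and values below are hypothetical and do not reproduce the corpus tagset.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class GesturalPattern:
    # One field per parameter mentioned in the text; values are free strings
    # here, which is an assumption made only for this example.
    movement_type: str
    movement_direction: str
    movement_quality: str
    handshape: str
    orientation: str
    position: str

def changed_parameters(before, after):
    """Return the names of the parameters whose values differ."""
    return [
        f.name
        for f in fields(before)
        if getattr(before, f.name) != getattr(after, f.name)
    ]

def is_pattern_change(before, after):
    """A change in at least one parameter counts as a pattern change."""
    return bool(changed_parameters(before, after))

# Hypothetical annotations for two consecutive gestures.
utterance_level = GesturalPattern(
    "straight", "away from body", "enlarged", "flat hand", "palm up", "left"
)
reported_level = GesturalPattern(
    "straight", "away from body", "enlarged", "flat hand", "palm up", "right"
)

diff = changed_parameters(utterance_level, reported_level)
print(diff, is_pattern_change(utterance_level, reported_level))
```

Here only the position differs, which is already enough for the two gestures to count as instances of different gestural patterns.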


C-ORAL-BGEST (Mello et al. in preparation) is a multimodal corpus of spontaneous speech that is part of the C-ORAL-BRASIL project. At the time this paper was written, the corpus consisted of eleven multimodal texts in Brazilian Portuguese and included video and audio recordings. Altogether, the recordings were 24 minutes and 28 seconds long.

The 11 texts analyzed are informationally tagged according to the Language into Act Theory and gesturally labeled according to the guidelines of McNeill 1992, Kendon 2004, and Bressem, Ladewig & Müller 2013. The perceptually based prosodic segmentation follows the pragmatic-prosodic parameters used in other reference corpora, such as C-ORAL-BRASIL I (Raso & Mello 2012) and C-ORAL-ROM (Cresti & Moneglia 2005), and relies on the perception of non-terminal boundaries (marked by a single slash, «/») and terminal boundaries (marked by a double slash, «//»). Text, image, and sound were aligned in the ELAN software (Sloetjes & Wittenburg 2008), in which the qualitative analyses were also carried out.
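
To make the annotation format concrete, the sketch below splits a transcribed turn into prosodic units, recording for each unit its informational label (if any), whether it ends in a terminal boundary, and whether it is reported (label ending in «_r»). It is only an illustrative reading of the conventions described in this paper, not a tool distributed with the corpus; the function name, the dictionary layout, and the regular expression are assumptions. Retracting marks such as «[/2]» are deliberately not treated as boundaries.

```python
import re

# A boundary is "/" or "//", optionally followed by "=LABEL=".
# The lookbehind skips the slash inside retracting marks such as "[/2]".
BOUNDARY = re.compile(r"(?<!\[)(//|/)(?:=([A-Za-z]+(?:_r)?)=)?")

def parse_turn(line):
    """Split one transcribed turn into labeled prosodic units (sketch)."""
    speaker, _, body = line.partition(":")
    units = []
    position = 0
    for match in BOUNDARY.finditer(body):
        label = match.group(2)  # None when the boundary carries no label
        units.append({
            "text": body[position:match.start()].strip(),
            "label": label,
            "terminal": match.group(1) == "//",
            "reported": label is not None and label.endswith("_r"),
        })
        position = match.end()
    return speaker.lstrip("*").strip(), units

# Example (3.5) from this paper.
line = (
    "*CAR: aí do nada /=TOP= eu tava [/2] eu voltei a fazer terapia /=PAR= "
    "aí / eu tava [/1] tinha terminado de fazer minha sessão /=TOP= "
    "eu catei o celular /=COB= e ele /=INT= "
    "tamo nessa de não se falar de novo //=COM_r="
)

speaker, units = parse_turn(line)
for unit in units:
    print(unit["label"], unit["terminal"], unit["reported"], unit["text"])
```

Applied to example (3.5), the sketch yields seven prosodic units, of which only the last is both terminal and reported.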

6. Results and discussion

By analyzing the 11 texts that make up the C-ORAL-BGEST corpus, eight reported speech constructions were detected. In total, 103 reported prosodic units were found, structured in 35 shifts from utterance level to the reported speech level (and vice versa). According to our observations, 30 of the 35 shifts occurred alongside changes in the gestural pattern.
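
The proportion behind these figures can be checked with simple arithmetic. The snippet also shows one possible way of counting level shifts in a sequence of informational labels, under the simplifying assumption (made here for illustration only) that a unit belongs to the reported level if and only if its label ends in «_r»; the label sequence is a made-up toy example, not corpus data.

```python
def count_level_shifts(labels):
    """Count transitions between unreported and reported units.

    Assumption: a unit is reported if and only if its label ends in "_r".
    """
    reported = [label.endswith("_r") for label in labels]
    return sum(1 for prev, cur in zip(reported, reported[1:]) if prev != cur)

# Figures reported in the text.
total_shifts = 35
shifts_with_gestural_change = 30

shifts_without_change = total_shifts - shifts_with_gestural_change
share_with_change = shifts_with_gestural_change / total_shifts
print(f"{share_with_change:.1%} of the shifts co-occur with a gestural change")
print(f"{shifts_without_change} shifts show no change in gestural pattern")

# Toy label sequence (hypothetical): utterance level, reported, utterance level.
toy_labels = ["COB", "INT", "COB_r", "COM_r", "COM"]
toy_shifts = count_level_shifts(toy_labels)
print(toy_shifts)
```

In other words, about 86% of the observed shifts were accompanied by a change in gestural pattern, while five were not.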

Regarding the situations in which no significant change in the gestural pattern was identified, two of them occurred with a speaker who did not gesture profusely throughout the interaction. In the specific passages in which reported speech occurred, this speaker did not gesture either at the level of the utterance or at the level of the reported speech. In the other cases, the speakers either did not gesticulate in the excerpts of reported speech or only made «incidental movements» (Kendon 2004), i.e., movements that were not aimed at expression, such as scratching the eyes.

Among the observed changes in gestural pattern, some concerned the use of body space. In these cases, the speaker used, for instance, his left body space to perform the gestures belonging to the utterance level but his right body space to gesticulate during the reported speech. In other situations, there were changes in the amplitude of the gestures, for example from more closed hands to more open hands when switching levels. There were also instances in which speakers changed their head orientation or body position only during the reported speech. Moreover, we identified situations in which the speaker stopped gesturing during reported units and resumed gesturing when the utterance level was restored.

These results seem to show a relation between the shift between discourse levels and the gestural patterns performed concurrently.


6.1 Examples

In this section, we present several examples that formed the foundation of our analysis of reported speech. In all examples (6.1.1, 6.1.2, 6.1.3, 6.1.4), the meta-illocutionary level corresponds to reported speech and has been highlighted in bold. Recall that labels assigned to this level are appended with « _r », where « r » stands for reported.

Example (6.1.1) below was extracted from the multimodal text bgest_004. *PEU is the speaker.


*PEU: aí minha mãe me ligou /=COB= e falou olha /=COB_r= &he &s [/1] &he / a gente nũ [/3] cê não tem mais dependência do seu pai agora /=COB_r= você [/1] no plano de saúde /=COB_r= que / o trabalho dele oferece /=PAR_r= aí / a gente vai ter que procurar outro plano de saúde //=COM_r=

*PEU: then my mother called me /=COB= and said look /=COB_r= &he &s [/1] &he / we don’t [/3] you're no longer dependent on your father now /=COB_r= you [/1] in the health insurance /=COB_r= that / his work offers /=PAR_r= then / we'll have to look for another health insurance plan //=COM_r=

Figure 6.1.1 – left: unreported speech | right: reported speech

The image on the left, taken from the initial, unreported part of example (6.1.1) («aí minha mãe me ligou» | «then my mother called me»), shows that, while at the utterance level, *PEU's gestures were mainly characterized by larger movements. At this point, the speaker kept his arms rather open. During the reported speech, as we can see in the image on the right, *PEU kept his hands closer together while gesturing. Back at the level of the utterance, the speaker did not gesture immediately: his hands remained in the rest position for approximately 5 seconds, and *PEU resumed gesturing in another reported speech unit later.

Example (6.1.2) was extracted from the multimodal text bgest_005. The reported parts are marked in bold, and *ZUC is the speaker.


*ZUC: aí eu comecei a fazer &a [/1] cálculo estrutural /=COB= e o cálculo estrutural já depende da geometria /=COB= depende de tudo //=COM=

*ZUC: aí vem o cara lá do anteprojeto e fala /=INT= não /=COB_r= vamo ter que aumentar /=COB_r= trinta centímetros da asa //=COM_r=

*ZUC: aí / eu vou ficar /=INT= seu filha da puta //=COM_r=

*ZUC: eu já calculei essa droga [/1] desgraça entendeu /=COB_r= eu nũ vou fazer isso de novo //=COM_r=

*ZUC: aí / esse é o problema da coisa entendeu //=COM=

*ZUC: then I started to do the [/1] structural calculation /=COB= and the structural calculation already depends on the geometry /=COB= depends on everything //=COM=

*ZUC: then the guy from the preliminary project says /=INT= no /=COB_r= we're going to have to increase /=COB_r= thirty centimeters of the wing //=COM_r=

*ZUC: then / I'll be like /=INT= you son of a bitch //=COM_r=

*ZUC: I already calculated that shit [/1] damn it you know /=COB_r= I'm not going to do it again //=COM_r=

*ZUC: then / that's the problem you see //=COM=