- Chen:06ICMI
-
Using Maximum Entropy (ME) Model to Incorporate Gesture Cues for SU Detection
L. Chen and M. Harper and Z. Huang
(2006)
- Quek:02MS
-
Speech and Gesture Analysis for Evaluation of Parkinson Disease
F. Quek and R. Bryll and M. Harper and L. Chen and L. Ramig
(2002)
- Galley:03ACL
-
Discourse segmentation of multi-party conversation
M. Galley and K. McKeown and E. Fosler-Lussier and H. Jing
562-569
(2003)
- Ruiter:06Language
-
Projecting the end of a speaker's turn: A cognitive cornerstone of conversation
J. P. de Ruiter and H. Mitterer and N. J. Enfield
Language
82
(2006)
A key mechanism in the organization of turns at talk in conversation is the ability to anticipate or PROJECT the moment of completion of a current speaker's turn. Some authors suggest that this is achieved via lexicosyntactic cues, while others argue that projection is based on intonational contours. We tested these hypotheses in an on-line experiment, manipulating the presence of symbolic (lexicosyntactic) content and intonational contour of utterances recorded in natural conversations. When hearing the original recordings, subjects can anticipate turn endings with the same degree of accuracy attested in real conversation. With intonational contour entirely removed (leaving intact words and syntax, with a completely flat pitch), there is no change in subjects' accuracy of end-of-turn projection. But in the opposite case (with original intonational contour intact, but with no recognizable words), subjects' performance deteriorates significantly. These results establish that the symbolic (i.e. lexicosyntactic) content of an utterance is necessary (and possibly sufficient) for projecting the moment of its completion, and thus for regulating conversational turn-taking. By contrast, and perhaps surprisingly, intonational contour is neither necessary nor sufficient for end-of-turn projection.
- Jovanovic06:EACL
-
Addressee Identification in Face-to-Face Meetings
N. Jovanovic and R. Akker and A. Nijholt
(2006)
- Takeuchi03:eurospeech
-
Generation of Natural Response Timing Using Decision Tree Based on Prosodic and Linguistic Information
M. Takeuchi and N. Kitaoka and S. Nakagawa
(2003)
- Ward00:pragmatics
-
Prosodic Features Which Cue Backchannel Responses in English and Japanese
N. Ward
Journal of Pragmatics
32
1177-1207
(2000)
- Ward96:ICSLP
-
Using Prosodic Clues to decide when to produce back-channel utterances
N. Ward
(1996)
- Ford96
-
Interactional units in conversation: syntactic, intonational, and pragmatic resources for the management of turns
C. E. Ford and S. A. Thompson
(1996)
- Local86:prosody
-
Projection and 'silences': Notes on phonetic and conversational structure
J. Local and J. Kelly
Human Studies
9
185-204
(1986)
- Keep_Floor
-
Keeping the Floor in Multiparty Conversations: Intonation, Syntax, and Pause
A. Wennerstrom and A. F. Siegel
Discourse Processes
36
77-107
(2003)
- Ferrer02:ICSLP
-
Is the speaker done yet? faster and more accurate end-of-utterance detection using prosody in human-computer dialog
L. Ferrer and E. Shriberg and A. Stolcke
(2002)
- Heldner06:inter_control
-
Prosodic Cues for Interaction Control in Spoken Dialogue Systems
M. Heldner and J. Edlund
(2006)
- Edlund05:Phonetica
-
Exploring Prosody in Interaction Control
J. Edlund and M. Heldner
Phonetica
62
215-226
(2005)
- Berger:96ME
-
A maximum entropy approach to natural language processing
A. Berger and S. Pietra and V. Pietra
Computational Linguistics
22
39-72
(1996)
- Goldin-Meadow:99review
-
The role of gesture in communication and thinking
S. Goldin-Meadow
Trends in Cognitive Sciences
3
(1999)
- Coquoz:04
-
Broadcast news segmentation using MDE and STT information to improve speech recognition
S. Coquoz
(2004)
- JHU_WS05
-
Final Report: parsing speech and structural event detection
M. Harper and B. Dorr and B. Roark and J. Hale and Z. Shafran and Y. Liu and M. Lease and M. Snover and L. Young and R. Stewart and A. Krasnyanskaya
(2005)
http://www.clsp.jhu.edu/ws2005/groups/eventdetect/documents/finalreport.pdf
- Zhang:Maxent
-
Maximum Entropy Modeling Toolkit for Python and C++
L. Zhang
()
- Eisenstein:05MIT
-
Gestural Cues for Sentence Segmentation
J. Eisenstein and R. Davis
MIT AI Memo
(2005)
- Poggi:03GW
-
Gesture Mind Markers in ECAs
I. Poggi and C. Pelachaud and E. M. Caldognetto
(2003)
- Chen05:AMI
-
Locating Salient Portions of Meeting Using Multimodal Cues
L. Chen
(2005)
- Chen:06MLMI
-
A Multimodal Analysis of Floor Control in Meetings
L. Chen and M. Harper and A. Franklin and T. R. Rose and I. Kimbara and Z. Q. Huang and F. Quek
(2006)
- LDC_QTR
-
Meeting Recording Quick Transcription Guidelines
Linguistic Data Consortium (LDC)
(2004)
http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/pdf/MeetingDataQTRSpec-V1.3.pdf
- Cassell:book
-
Embodied Conversational Agents
J. Cassell and J. Sullivan and S. Prevost and E. Churchill
()
- Rehm05:ECA
-
Informing the design of embodied conversational agents by analysing multimodal politeness behaviors in human-human communication
M. Rehm and E. André
(2005)
- Jurafsky97:DA
-
Automatic Detection of discourse structure for speech recognition and understanding
D. Jurafsky and R. Bates et al.
(1997)
- Jovanovic05:SIGdial
-
A corpus for studying addressing behavior in multi-party dialogues
N. Jovanovic and R. Akker and A. Nijholt
(2005)
- Vertegaal01:CHI
-
Eye Gaze Patterns in Conversations: There is More to Conversational Agents Than Meets the Eyes
R. Vertegaal and R. Slagter and G. Veer and A. Nijholt
(2001)
- Clark04:SIGdial
-
Multi-level Dialog Act Tags
A. Clark and A. Popescu-Belis
(2004)
- Purver05:MLMI
-
A Multimodal Discourse Ontology for Meeting Understanding
J. Niekrasz and M. Purver
(2005)
- Bosch04:specom
-
Turn-taking in social talk dialogues: temporal, formal and functional aspects
L. Bosch and N. Oostdijk and J. P. Ruiter
(2004)
- Caspers01:SIGDial
-
Melodic cues to turn-taking in English: evidence from perception
A. Wichmann and J. Caspers
(2001)
- Caspers01:EuroSpeech
-
Testing the Perceptual Relevance of Syntactic Completion and Melodic Configuration for Turn-Taking in Dutch
J. Caspers
(2001)
- Caspers00:ICSLP
-
Pitch Accents, Boundary Tones and Turn-taking in Dutch Map Task Dialogues
J. Caspers
(2000)
- LREC06_PM
-
An Open Source Prosodic Feature Extraction Tool
Z. Q. Huang and L. Chen and M. Harper
(2006)
- MRDA
-
The ICSI meeting recorder dialog act (MRDA) corpus
E. Shriberg and R. Dhillon and S. Bhagat and J. Ang and H. Carvey
(2004)
- Bosch04:dur_turn
-
Durational Aspects of Turn-taking in Spontaneous Face-to-Face and Telephone Dialogues
L. Bosch and N. Oostdijk and J. P. Ruiter
563-570
(2004)
- Novick05:gaze
-
Models of Gaze in Multi-party Discourse
D. G. Novick
(2005)
- Padilha02:edilog
-
A simulation of small group discussion
E. Padilha and J. Carletta
117-124
(2002)
- Torres97:gaze
-
Modeling Gaze Behavior as a Function of Discourse Structure
O. Torres and J. Cassell and S. Prevost
(1997)
- Novick96:gaze
-
Coordinating Turn-Taking With Gaze
D. G. Novick and B. Hansen and K. Ward
(1996)
- Kelly99:JML
-
Offering a Hand to Pragmatic Understanding: The Role of Speech and Gesture in Comprehension and Memory
S. D. Kelly and D. J. Barr and R. B. Church and K. Lynch
Journal of Memory and Language
40
(1999)
Most theories of pragmatics take as the basic unit of communication the verbal content of spoken
or written utterances. However, many of these theories have overlooked the fact that important
information about an utterance's meaning can be conveyed nonverbally. In the present study, we
investigate the pragmatic role that hand gestures play in language comprehension and memory. In
Experiments 1 and 2, we found that people were more likely to interpret an utterance as an indirect
request when speech was accompanied by a relevant pointing gesture than when speech or gesture
was presented alone. Following up on this, Experiment 3 supported the idea that speech and gesture
mutually disambiguate the meanings of one another. Finally, Experiment 4 generalized the findings
to different types of speech acts (recollection of events) with a different type of gesture (iconic
gestures). The results from these experiments suggest that broader units of analysis beyond the verbal
message may be needed in studying pragmatic understanding.
- Ju98:gesture
-
Summarization of Videotaped Presentations: Automatic Analysis of Motion and Gesture
S. X. Ju and M. J. Black and S. Minneman and D. Kimber
IEEE Trans. on Circuits and Systems for Video Technology
8
(1998)
- ISL_MB
-
Meeting Browser: Tracking and Summarizing Meetings
A. Waibel and M. Bett and M. Finke
(1998)
- Waibel01:ICASSP
-
Advances in Automatic Meeting Record Creation and Access
A. Waibel and M. Bett et al.
(2001)
- Ellis03:ICASSP
-
Pitch-Based Emphasis Detection for Characterization of Meeting Records
L. S. Kennedy and D. Ellis
(2003)
- Arons94:ICSLP
-
Pitch-Based Emphasis Detection for Segmenting Speech Recordings
B. Arons
4
1931-1934
(1994)
- Erol03:ICME
-
Multimodal Summarization of Meeting Recordings
B. Erol and D. S. Lee and J. Hull
(2003)
- Liu05:dis
-
Comparing HMM, Maximum Entropy, and Conditional Random Fields for Disfluency Detection
Y. Liu and E. Shriberg and A. Stolcke and M. Harper
(2005)
- Ang05:ICASSP
-
Automatic Dialog act segmentation and classification in multiparty meetings
J. Ang and Y. Liu and E. Shriberg
(2005)
- multimodal_utt_end
-
Predicting end of utterance in multimodal and unimodal conditions
P. Barkhuysen and E. Krahmer and M. Swerts
(2005)
- PUPM
-
Praat based Prosodic Feature Extraction Toolkit
L. Chen and Z. Q. Huang and M. Harper
(2005)
- Banerjee04:ICSLP
-
Using Simple Speech-Based Features to Detect the State of a Meeting and the Roles of the Meeting Participants
S. Banerjee and A. I. Rudnicky
(2004)
- SRI-LM
-
SRILM - An Extensible Language Modeling Toolkit
A. Stolcke
(2002)
- Han05:HCI
-
Articulated Body Tracking Using Dynamic Belief Propagation
T. X. Han and T. S. Huang
(2005)
- Pellom01:sonic
-
SONIC: The University of Colorado Continuous Speech Recognizer
B. Pellom
(2001)
- McNeill:05MLMI
-
Gesture, Gaze, and Ground
D. McNeill
(2005)
- SRI-Prosody
-
Prosodic Features Extraction
L. Ferrer
(2002)
- Chen:05MLMI
-
VACE Multimodal Meeting Corpus
L. Chen and T. R. Rose et al.
(2005)
- Huang05:Seg
-
Speech and Non-Speech Detection In Meeting Audio for Transcription
Z. Huang and M. Harper
(2005)
- Tu05:ICCV_lowres
-
Accurate Head Pose Tracking in Low Resolution Video
J. Tu and H. Tao and D. Forsyth and T. S. Huang
(2005)
- Tu05:ICCV_meanshift
-
Online Updating Appearance Generative Mixture Model for Meanshift Tracking
J. Tu and T. S. Huang
(2005)
- Strassel:04ICASSP
-
Shared Linguistic Resources for Human Language Technology in the Meeting Domain
S. Strassel and M. Glenn
(2004)
- McCowan05:PAMI
-
Automatic Analysis of Multimodal Group Actions in Meetings
I. McCowan and D. Gatica-Perez and S. Bengio and G. Lathoud and M. Barnard and D. Zhang
IEEE Trans. on Pattern Analysis and Machine Intelligence
27
305-317
(2005)
- Reidsma04:MLMI
-
Meeting Modeling in the Context of Multimodal Research
D. Reidsma and R. Rienks and N. Jovanovic
(2004)
- LDC:tools
-
Linguistic Annotation Tools
LDC
()
http://www.ldc.upenn.edu/annotation/
- Wu01:hand_modeling
-
Hand modeling, analysis and recognition
Y. Wu and T. S. Huang
IEEE Signal Processing Magazine
18
51-60
(2001)
- wavesurfer
-
Wavesurfer software
KTH Speech, Music and Hearing
()
http://www.speech.kth.se/wavesurfer/
- Thorisson02:YTTM
-
Natural Turn-Taking Needs no Manual: Computational Theory and Model
K. Thorisson
173-207
(2002)
- Goodwin81:book
-
Conversational Organization: Interaction Between Speakers and Hearers
C. Goodwin
(1981)
- Garofolo:04lrec
-
The NIST Meeting Room Pilot Corpus
J. Garofolo and C. Laprun and M. Michel and V. Stanford and E. Tabassi
(2004)
- Adler89:blackboard
-
Blackboard Systems
R. Adler
110-116
(1989)
- Quek:nistgaze
-
A Coding Tool for Multimodal Analysis of Meeting Video
F. Quek and D. McNeill and T. Rose and Y. Shi
(2003)
- Quek:05CVIU
-
Vector Coherence Mapping: Motion Field Extraction by Exploiting Multiple Coherences
F. Quek and R. Bryll and Y. Qiao and T. Rose
CVIU special issue on Spatial Coherence in Visual Motion Analysis (submitted)
(2005)
- Maes90:agent
-
Designing Autonomous Agents: Theory and Practice from Biology to Engineering and back
P. Maes
(1990)
- Rime83:no_gesture
-
The elimination of visible behaviour from social interactions: effects on verbal, nonverbal and interpersonal variables
B. Rimé
European Journal of Social Psychology
12
113-129
(1983)
- Argyle76:gaze
-
Gaze and Mutual Gaze
M. Argyle and M. Cook
(1976)
- Cassell98:mismatch
-
Speech-Gesture Mismatches: Evidence For One Underlying Representation of Linguistic & Nonlinguistic Information
J. Cassell and D. McNeill and K. E. McCullough
Pragmatics & Cognition
6
(1998)
- Dybkjer02:lrec
-
Natural Interactivity Resources - Data, Annotation Schemes and Tools
L. Dybkjer and N. Bernsen
(2002)
This paper presents results of three surveys of natural interactivity and multimodal resources carried out by a Working Group in the
ISLE project on International Standards for Language Engineering. Information has been collected on a large number of corpora,
coding schemes and coding tools world-wide. The paper presents the information collection process, the description and validation
methods used, the surveyed resources, and brief conclusions for each of the three resource areas reviewed. Observations on user
profiles, user needs and best practices are briefly presented.
- Bigbee:01_gesana
-
Emerging Requirements for Multi-Modal Annotation & Analysis Tools
T. Bigbee and D. Loehr et al.
(2001)
- Feyereisen87:PR
-
Gestures and speech, interactions and separations: A reply to McNeill
P. Feyereisen
Psychological Review
94
493-498
(1987)
- Kita00:book
-
How Representational Gestures Help Speaking
S. Kita
162-185
(2000)
- Beattie81:gaze
-
The Regulation of Speaker Turns in Face-to-Face Conversation: Some Implications for Conversation in Sound-only Communication Channels
G. Beattie
Semiotica
34
55-70
(1981)
- SignStream
-
SignStream: A Database Tool for Research on Visual-Gestural Language
C. Neidle
Sign Language and Linguistics
4
203-214
(2002)
- Butterworth89
-
Gesture, Speech, and Computational Stages: a Reply to McNeill
B. Butterworth and U. Hadar
Psychological Review
(1989)
- Child_GESDEV
-
Meaning in Movements: an Investigation into the Interrelationship of Physiographic Gestures and Speech in Seven-year-olds
M. G. Riseborough
British Journal of Psychology
73
497-503
(1982)
- Kalma92:gaze
-
Gazing in Trials - a Powerful Signal in Floor Appointment
A. Kalma
British Journal of Social Psychology
31
21-39
(1992)
- Transana
-
Transana: a tool for the transcription and qualitative analysis of audio and video data, http://transana.org
C. Fassnacht and D. Woods
()
http://www.transana.org
- Kendon67:gaze
-
Some Functions of Gaze-direction in Social Interaction
A. Kendon
Acta Psychologica
26
22-63
(1967)
- CLAN
-
Using CLAN
CHILDES program of CMU
()
http://childes.psy.cmu.edu/clan
- Weilhammer03:duration
-
Durational Aspects in Turn Taking
K. Weilhammer and S. Rabold
(2003)
- ANVIL
-
Anvil
M. Kipp
()
http://www.dfki.de/~kipp/anvil/
- Clark04:addressee
-
Speaking while monitoring addressees for understanding
H. Clark and M. Krych
Journal of Memory and Language
50
62-81
(2004)
- Eisenstein03:UIST
-
Natural Gesture in Descriptive Monologues
J. Eisenstein and R. Davis
(2003)
- Vertegaal00:gaze
-
Effects of Gaze on Multiparty Mediated Communication
R. Vertegaal and G. Veer and H. Vons
(2000)
- Shi04:lrec
-
A system for Situated Temporal Analysis of Multimodal Communication
Y. Shi and T. Rose and F. Quek
(2004)
- Cassell01:non-verbal
-
Non-Verbal Cues for Discourse Structure
J. Cassell and T. Nakano and T. W. Bickmore and C. Sidner and C. Rich
106-115
(2001)
- Ruiter00:sketch
-
The production of gesture and speech
J. P. Ruiter
284-311
(2000)
- aphasic
-
Manual activity during speaking in aphasic subjects
P. Feyereisen
International Journal of Psychology
18
545-556
(1983)
- Padilha03:non-verbal
-
Nonverbal Behaviours Improving a Simulation of Small Group Discussion
E. Padilha and J. Carletta
(2003)
- Loehr04:PhD
-
Gesture and Intonation
D. P. Loehr
(2004)
- Alibali00:lang_cog
-
Gesture and the process of speech production: We think, therefore we gesture
M. W. Alibali and S. Kita and A. J. Young
Language and Cognitive Processes
6
593-613
(2000)
- Steininger:02_lrec_mc
-
Development of the user-state conversations for the multimodal corpus in SmartKom
S. Steininger and O. Dioubina and F. Schiel
(2002)
- Quek01:Cue_Comm
-
Gestural Origo and Loci-transitions in Natural Discourse Segmentation
F. Quek and R. Bryll and D. McNeill and M. Harper
189-192
(2001)
- Martell:FORM
-
FORM: a kinematically-based Gesture Annotation Scheme, http://www.ldc.upenn.edu/Projects/FORM/index.html
C. Martell and K. Myers and O. Syrotkin
()
- Cassell01:AI
-
Embodied Conversational Agents: Representation and Intelligence in User Interface
J. Cassell
AI Magazine
22
67-83
(2001)
How do we decide how to represent an intelligent system in its interface, and how do we decide how the interface represents information about the world and about its own workings to a user? This article addresses these questions by examining the interaction between representation and intelligence in user interfaces. The rubric "representation" covers at least three topics in this context: how a computational system is represented in its user interface, how the interface conveys its representations of information and of the world to human users, and how the system's internal representation affects the human user's interaction with that system. I will argue that each of these kinds of representations (of the system, of information and the world, of the interaction) is key to how users make the kind of attributions of intelligence that facilitate their interactions with intelligent systems. I will argue for representing a system as a human in those cases where social collaborative behavior is key, and I will argue for the system representing its knowledge to humans in multiple ways on multiple modalities. I will demonstrate my claims by discussing issues of representation and intelligence in an embodied conversational agent - an interface in which the system is represented as a person, in which information is conveyed to human users via multiple modalities such as voice and hand gestures, and in which the internal representation is modality-independent, and both propositional and non-propositional.
- LDC_ges_tool
-
Gesture Annotation: Tools and Data, http://www.ldc.upenn.edu/annotation/gesture
()
http://www.ldc.upenn.edu/annotation/gesture
- Rose04:mac_vissta
-
MacVissta: A System for Multimodal Analysis
T. Rose and F. Quek and Y. Shi
(2004)
- Tsai87:calibration
-
A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses
R. Tsai
IEEE Journal of Robotics and Automation
3
323-344
(1987)
- IW
-
The Language-Thought-Hand System
S. Gallagher and J. Cole and D. McNeill
(2001)
- Sharma:99_iMap
-
Toward Interpretation of natural speech/gesture for spatial planning on a virtual map
R. Sharma and I. Poddar and E. Ozyildiz and S. Kettebekov and H. Kim and T. S. Huang
(1999)
- Seyfeddinipur01:Diss
-
Gesture as an Indicator of Early Error Detection in Self-Monitoring of Speech
S. Seyfeddinipur and S. Kita
(2001)
There is a theoretical controversy regarding when the self-monitoring process interrupts the speech stream. One view holds that the speech stream is interrupted as soon as an error is detected. Another view holds that, even after an error is detected, the speaker does not interrupt immediately but continues speaking and at the same time plans the upcoming repair. We address this question by observing speech-accompanying gestures at the moment of speech disfluency. The results show that the concurrent gestural movements are typically stopped on average 240 ms before speech is stopped. In other words, the gesture suspension foreshadows the speech suspension. The gestural foreshadowing shows that the speaker must know early on that he is going to suspend speech. The gestural indication of an upcoming speech suspension suggests that the speaker does not interrupt speech at the very moment s/he detects an error. This result supports the hypothesis on speech monitoring stating that the speaker continues to talk after error detection and at the same time plans the upcoming repair.
- Kendon74:prosody
-
Movement coordination in social interaction: some examples described
A. Kendon
(1974)
- Quek02:ICSLP_spa
-
Gestural Spatialization in Natural Discourse Segmentation
F. Quek and D. McNeill and R. Bryll and M. Harper
(2002)
- G-M04:hearing
-
Hearing Gesture: how our hands help us think
S. Goldin-Meadow
(2003)
- Gherbi99:GW
-
Pointing Gesture Interpretation in a Multimodal Context
R. Gherbi and A. Braffort
(1999)
- Ekman69:ges_cat
-
The repertoire of nonverbal behavioral categories
P. Ekman and W. Friesen
Semiotica
(1969)
- Chai04:IUI
-
A probabilistic approach to reference resolution in multimodal user interfaces
J. Y. Chai and P. Hong and M. X. Zhou
70-77
(2004)
Multimodal user interfaces allow users to interact with computers through multiple modalities, such as speech, gesture, and gaze. To be effective, multimodal user interfaces must correctly identify all objects which users refer to in their inputs. To systematically resolve different types of references, we have developed a probabilistic approach that uses a graph-matching algorithm. Our approach identifies the most probable referents by optimizing the satisfaction of semantic, temporal, and contextual constraints simultaneously. Our preliminary user study results indicate that our approach can successfully resolve a wide variety of referring expressions, ranging from simple to complex and from precise to ambiguous ones.
- Darrell93:CVPR
-
Space-time gestures
T. Darrell and A. Pentland
335-340
(1993)
A method for learning, tracking, and recognizing human gestures using a view-based approach to model articulated objects is presented. Objects are represented using sets of view models, rather than single templates. Stereotypical space-time patterns, i.e., gestures, are then matched to stored gesture patterns using dynamic time warping. Real-time performance is achieved by using special purpose correlation hardware and view prediction to prune as much of the search space as possible. Both view models and view predictions are learned from examples. Results showing tracking and recognition of human hand gestures at over 10 Hz are presented
- Krauss00:lexical
-
Lexical gestures and lexical access
R. Krauss and Y. Chen and R. F. Gottesman
261-283
(2000)
- Hayamizu99:Eurospeech
-
A Multimodal Database of Gestures and Speech
S. Hayamizu and S. Nagaya and K. Watanuki and M. Nakazawa and S. Nobe and T. Yoshimura
2247-2250
(1999)
- Kita93:PhD
-
Language and thought interface: A study of spontaneous gestures and Japanese mimetics
S. Kita
(1993)
- Vygotsky87:GP
-
Thinking and Speaking
L. Vygotsky
1
39-285
(1987)
- Rauscher96:Psy
-
Gesture, speech, and lexical access: the role of lexical movements in speech production
F. H. Rauscher and R. Krauss and Y. Chen
Psychological Science
7
226-230
(1996)
- rheme-theme
-
Topic and focus of a sentence and the patterning of a text
E. Hajcova and P. Sgall
(1988)
- Schlenzig94:vision
-
Vision based hand gesture interpretation using recursive estimation
J. Schlenzig and E. Hunter and R. Jain
2
1267-1271
(1994)
Gesture recognition requires spatio-temporal image sequence analysis. The actual length of the sequence varies with each instantiation of the gesture, and can be quite long in the case of a multiple gesture sequence. To achieve adequate system response we introduce the concept of recursive estimation of the gesture state. This consists of modeling the gestures as a sequence of static hand poses. Using a hidden Markov model where the unobservable state is the spatio-temporal gesture and the hand poses are the observations allows us to determine the current probabilities of each gesture with a finite state estimator. This decomposes the gesture recognition process into two stages: identification of the hand pose within the current image frame and incorporation of the new information into the probability estimates. We illustrate the performance of the estimator by describing the implementation of a telerobotic application
- Cassell96:model
-
Distribution of Semantic Features across speech and gesture by humans and machines
J. Cassell and S. Prevost
(1996)
- Hostetter:04cogsci
-
On the Tip of the Mind: Gesture as a key to Conceptualization
A. B. Hostetter and M. W. Alibali
(2004)
- Petrelli98:visual
-
Visual display: pointing and natural language: the power of multimodal interaction
A. D. Angeli and W. Gerbino and G. Cassano and D. Petrelli
(1998)
- Murphy98:IEEE_bio
-
Biological and cognitive foundations of intelligent sensor fusion
R. R. Murphy
IEEE Transactions on Systems, Man and Cybernetics, Part A
26
42-51
(1996)
This paper reviews the literature from the biological and cognitive sciences in sensory integration and derives principles for use in constructing intelligent sensor fusion systems. In particular, it presents psychophysical and neurophysical studies on how sensor fusion is accomplished and cognitive models of associated activities, including optimization of sensing configurations, improvement of sensing quality, and filtering of noise. The sensor fusion effects architecture for robot navigation is also presented as one example of how these insights from the biological and computer science can be applied to robotic sensor fusion. Experimental results demonstrate the utility of the biological and cognitive insights, especially that of fusion modes. Other representative architectures for robotic sensor fusion are contrasted with the biological and cognitive principles.
- Mayberry00:book
-
Gesture production during stuttered speech: Insights into the nature of gesture-speech integration
R. I. Mayberry and J. Jaques
199-213
(2000)
- MediaTagger
-
MediaTagger: Macintosh-based video transcription, http://www.mpi.nl/world/tg/CAVA/mt/MTandDB.html
Max Planck Institute for Psycholinguistics
()
http://www.mpi.nl/world/tg/CAVA/mt/MTandDB.html
- Eickeler98:PR
-
Hidden Markov model based continuous online gesture recognition
S. Eickeler and A. Kosmala and G. Rigoll
2
1206-1208
(1998)
Presents the extension of an existing vision-based gesture recognition system using hidden Markov models (HMMs). Several improvements have been carried out in order to increase the capabilities and the functionality of the system. These improvements include position independent recognition, rejection of unknown gestures, and continuous online recognition of spontaneous gestures. We show that especially the latter requirement is highly complicated and demanding, if we allow the user to move in front of the camera without any restrictions and to perform the gestures spontaneously at any arbitrary moment. We present solutions to this problem by modifying the HMM-based decoding process and by introducing online feature extraction and evaluation methods.
- GestureDB
-
CAVA data base
Max Planck Institute for Psycholinguistics
()
http://www.mpi.nl/world/tg/CAVA/mt/CAVA_db.html
- Cohen96:FG
-
Dynamical system representation, generation, and recognition of basic oscillatory motion gestures
C. J. Cohen and L. Conway and D. Koditschek
60-65
(1996)
We present a system for generation and recognition of oscillatory gestures. Inspired by gestures used in two representative human-to-human control areas, we consider a set of oscillatory motions and refine from them a 24 gesture lexicon. Each gesture is modeled as a dynamical system with added geometric constraints to allow for real time gesture recognition using a small amount of processing time and memory. The gestures are used to control a pan-tilt camera neck. We propose extensions for use in areas such as mobile robot control and telerobotics
- Bobick:97_IEEE_PAM
-
A state-based approach to the representation and recognition of gesture
A. F. Bobick and A. D. Wilson
IEEE Trans. on Pattern Analysis and Machine Intelligence
19
1325-1337
(1997)
A state-based technique for the representation and recognition of gesture is presented. We define a gesture to be a sequence of states in a measurement or configuration space. For a given gesture, these states are used to capture both the repeatability and variability evidenced in a training set of example trajectories. Using techniques for computing a prototype trajectory of an ensemble of trajectories, we develop methods for defining configuration states along the prototype and for recognizing gestures from an unsegmented, continuous stream of sensor data. The approach is illustrated by application to a range of gesture-related sensory data: the two-dimensional movements of a mouse input device, the movement of the hand measured by a magnetic spatial position and orientation sensor, and, lastly, the changing eigenvector projection coefficients computed from an image sequence
- Cassell91:poetics
-
Gesture and the poetics of prose
J. Cassell and D. McNeill
Poetics Today
12
375-404
(1991)
- ISLE02:WP8
-
ISLE Natural Interactivity and Multimodality (NIMM) WP8.1 - Survey of NIMM Data Resources, Current and Future User Profiles, Markets and User Needs for NIMM Resources
M. Knudsen and J. Martin and L. Dybkjer et al.
(2002)
- Cohen97:SMC
-
Dynamic system representation of basic and nonlinear in parameters oscillatory motion gestures
C. J. Cohen and L. Conway and D. Koditschek and G. P. Roston
5
4513-4518
(1997)
We present a system for generation and recognition of oscillatory gestures. Inspired by gestures used in two representative human-to-human control areas, we consider a set of oscillatory (circular) motions and refine from them a 24 gesture lexicon. Each gesture is modeled as a dynamic system with added geometric constraints to allow for real time gesture recognition using a small amount of processing time and memory. The gestures are used to control a pan-tilt camera neck. The gesture lexicon is then enhanced to include nonlinear in parameter ("come here") gesture representations. An enhancement is suggested which would enable the system to be trained to recognize previously unidentified yet consistent human generated oscillatory motion gestures.
- Eisenstein04:ICMI
-
Visual and Linguistic Information in Gesture Classification
J. Eisenstein and R. Davis
(2004)
- Bobick95:FG
-
A state-based technique for the summarization and recognition of gesture
A. F. Bobick and A. D. Wilson
382-388
(1995)
We define a gesture to be a sequence of states in a measurement or configuration space. For a given gesture, these states are used to capture both the repeatability and variability evidenced in a training set of example trajectories. The states are positioned along a prototype of the gesture, and shaped such that they are narrow in the directions in which the ensemble of examples is tightly constrained, and wide in directions in which a great deal of variability is observed. We develop techniques for computing a prototype trajectory of an ensemble of trajectories, for defining configuration states along the prototype, and for recognizing gestures from an unsegmented, continuous stream of sensor data. The approach is illustrated by application to a range of gesture-related sensory data: the two-dimensional movements of a mouse input device, the movement of the hand measured by a magnetic spatial position and orientation sensor, and, lastly, the changing eigenvector projection coefficients computed from an image sequence
- Sharma98:IEEE
-
Toward multimodal human-computer interface
R. Sharma and V. I. Pavlovic and T. S. Huang
Proceedings of the IEEE
86
853-869
(1998)
Recent advances in various signal processing technologies, coupled with an explosion in the available computing power, have given rise to a number of novel human-computer interaction (HCI) modalities: speech, vision-based gesture recognition, eye tracking, electroencephalograph, etc. Successful embodiment of these modalities into an interface has the potential of easing the HCI bottleneck that has become noticeable with the advances in computing and communication. It has also become increasingly evident that the difficulties encountered in the analysis and interpretation of individual sensing modalities may be overcome by integrating them into a multimodal human-computer interface. We examine several promising directions toward achieving multimodal HCI. We consider some of the emerging novel input modalities for HCI and the fundamental issues in integrating them at various levels, from early signal level to intermediate feature level to late decision level. We discuss the different computational approaches that may be applied at the different levels of modality integration. We also briefly review several demonstrated multimodal HCI systems and applications. Despite all the recent developments, it is clear that further research is needed for interpreting and fitting multiple sensing modalities in the context of HCI. This research can benefit from many disparate fields of study that increase our understanding of the different human communication modalities and their potential role in HCI
- Grobel97:SMC
-
Isolated sign language recognition using hidden Markov models
K. Grobel and M. Assan
1
162-167
(1997)
This paper is concerned with the video-based recognition of isolated signs. Concentrating on the manual parameters of sign language, the system aims for the signer dependent recognition of 262 different signs. For hidden Markov modelling a sign is considered a doubly stochastic process, represented by an unobservable state sequence. The observations emitted by the states are regarded as feature vectors that are extracted from video frames. The system achieves recognition rates up to 94%.
- Schlenzig94:ACV_recursive
-
Recursive identification of gesture inputs using hidden Markov models
J. Schlenzig and E. Hunter and R. Jain
187-194
(1994)
Human-machine interfaces play a role of growing importance as computer technology continues to evolve. Motivated by the desire to provide users with an intuitive gesture input system, we describe the design of a recursive filter applied to the vision-based gesture interpretation problem. The gestures are modeled as a hidden Markov model with the state representing the gesture sequences, and the observations being the current static hand pose. At each time step the recursive filter updates its estimate of what gesture is occurring based on the current extracted pose information. The result is a robust system which provides the user with continual feedback during compound gestures
- Darrell96:IEEE_PAM
-
Task-specific gesture analysis in real-time using interpolated views
T. J. Darrell and I. A. Essa and A. P. Pentland
IEEE Trans. on Pattern Analysis and Machine Intelligence
18
1236-1242
(1996)
Hand and face gestures are modeled using an appearance-based approach in which patterns are represented as a vector of similarity scores to a set of view models defined in space and time. These view models are learned from examples using unsupervised clustering techniques. A supervised learning paradigm is then used to interpolate view scores into a task-dependent coordinate system appropriate for recognition and control tasks. We apply this analysis to the problem of context-specific gesture interpolation and recognition, and demonstrate real-time systems which perform these tasks.
- Black98:FG
-
Recognizing temporal trajectories using the condensation algorithm
M. J. Black and A. D. Jepson
16-21
(1998)
The recognition of human gestures in image sequences is an important and challenging problem that enables a host of human-computer interaction applications. This paper describes an incremental recognition strategy that is an extension of the "Condensation" algorithm proposed by Isard and Blake (1996). Gestures are modeled as temporal trajectories of some estimated parameter over time (in this case velocity). The condensation algorithm is used to incrementally match the gesture models to the input data. The method is demonstrated with an example of an augmented office white-board in which a user makes simple hand gestures to grab regions of the board, print them, save them, etc.
- Wilson:97_CVPR
-
Temporal classification of natural gesture and application to video coding
A. D. Wilson and A. F. Bobick and J. Cassell
948-954
(1997)
A method for the temporal classification of natural gesture from video imagery is presented. The work is motivated by recent developments in the theory of natural gesture which have identified several key temporal aspects of gesture important to communication. In particular gesticulation during conversation can be coarsely characterized as periods of bi-phasic or tri-phasic gesture separated by a rest state. We first present an automatic procedure for hypothesizing plausible rest state configurations of a speaker. Second, we develop a state-based parsing algorithm used to both select among candidate rest states and to parse an incoming video stream into bi-phasic and tri-phasic gestures. Finally, we demonstrate the use of the bi-phasic/tri-phasic labeling to select semantically significant static images for low bandwidth coding of video of story-telling speakers
- Eisenstein04:HLT
-
A Salience-Based Approach to Gesture-Speech Alignment
J. Eisenstein and C. M. Christoudias
(2004)
- Kahn:95_CV
-
Understanding people pointing: the Perseus system
R. E. Kahn and M. J. Swain
569-574
(1995)
We present Perseus, a purposive visual system used by our robot, CHIP, to locate objects being pointed at by people. Perseus uses knowledge about the task and environment at all levels of processing to more accurately and efficiently perform visual tasks
- Bohme:98GW
-
Neural Architecture for Gesture-Based Human-Machine-Interaction
H. Bohme and A. Brakensiek and U. Braumann and M. Krabbes and H. Gross
219--232
(1998)
- Sharma00:FG
-
Exploiting speech/gesture co-occurrence for improving continuous gesture recognition in weather narration
R. Sharma and J. Cai and S. Chakravarthy and I. Poddar and Y. Sethi
422-427
(2000)
In order to incorporate naturalness in the design of human computer interfaces (HCI), it is desirable to develop recognition techniques capable of handling continuous natural gesture and speech inputs. Though many different researchers have reported high recognition rates for gesture recognition using hidden Markov models (HMM), the gestures used are mostly pre-defined and are bound with syntactical and grammatical constraints. But natural gestures do not string together in syntactical bindings. Moreover, strict classification of natural gestures is not feasible. We have examined hand gestures made in a very natural domain, that of a weather person narrating in front of a weather map. The gestures made by the weather person are embedded in a narration. This provides us with abundant data from an uncontrolled environment to study the interaction between speech and gesture in the context of a display. We hypothesize that this domain is very similar to that of a natural human-computer interface. We present an HMM architecture for continuous gesture recognition framework and keyword spotting. To explore the relation between gesture and speech, we conducted a statistical co-occurrence analysis of different gestures with a selected set of spoken keywords. We then demonstrate how this co-occurrence analysis can be exploited to improve the performance of continuous gesture recognition
- Schiel:02_lrec_smartkom
-
The SmartKom Multimodal Corpus at BAS
F. Schiel and S. Steininger and U. Turk
(2002)
http://www.smartkom.org/reports/Report-NR-34.pdf
- Bryll:04PhD
-
A Robust Agent-Based Gesture Tracking System
R. Bryll
(2004)
- Kahn:96_CVPR
-
Gesture recognition using the Perseus architecture
R. E. Kahn and M. J. Swain and P. N. Prokopowicz and R. J. Firby
734-741
(1996)
Communication involves more than simply spoken information. Typical interactions use gestures to accurately and efficiently convey ideas that are more easily expressed with actions than words. A more intuitive interface with machines should involve not only speech recognition, but gesture recognition as well. One of the most frequently used and expressively powerful gestures is pointing. It is far easier and more accurate to point to an object than give a verbal description of its location. To produce a more efficient, accurate, and natural human-machine interface we use the Perseus architecture to interpret the pointing gesture. Perseus uses a variety of techniques to reliably solve this complex visual problem in non-engineered worlds. Knowledge about the task and environment is used at all stages of processing to best interpret the scene for the current situation. Once the visual operators are chosen, contextual knowledge is used to tune them for maximal performance. Redundant interpretation of the scene provides robustness to errors in interpretation. Fusion of independent types of information results in increased tolerance when assumptions about the environment fail. Windows of attention are used to improve speed and remove distractions from the scene. Furthermore, reuse is a major issue in the design of Perseus. Information about the environment and task is explicitly represented so it can easily be re-used in tasks other than pointing. A clean interface to Perseus is provided for symbolic higher level systems like the RAP reactive execution system. In this paper we describe Perseus in detail and show how it is used to locate objects pointed to by people
- Steininger:01_GW_labeling_SK
-
Labeling of gestures in SmartKom - The coding system
G. Steininger and B. Lindemann and T. Paetzold
(2001)
- Cohen:98book
-
Synergistic use of direct manipulation and natural language
P. Cohen et al.
29-35
(1998)
- Neal:89_S_NLP_WS
-
Natural Language with Integrated deictic and graphic gestures
J. G. Neal and C. Y. Thielman and Z. Dobes and S. M. Haller and S. C. Shapiro
410-423
(1989)
- Nakamura00:ICSLP
-
Multimodal Corpora for Human-machine interaction research
S. Nakamura et al.
(2000)
- Johnston:97_unification
-
Unification-based multimodal integration
M. Johnston and P. Cohen and D. McGee and S. Oviatt and J. Pittman and I. Smith
281--288
(1997)
Recent empirical research has shown conclusive advantages of multimodal interaction over speech-only interaction for map-based tasks. This paper describes a multimodal language processing architecture which supports interfaces allowing simultaneous input from speech and gesture recognition. Integration of spoken and gestural input is driven by unification of typed feature structures representing the semantic contributions of the different modes. This integration method allows the component modalities to mutually compensate for each other's errors. It is implemented in QuickSet, a multimodal (pen/voice) system that enables users to set up and control distributed interactive simulations.
- Sharma:00_IEEE_CG
-
Speech/Gesture Interface to a Visual-Computing Environment
R. Sharma and M. Zeller and V. I. Pavlovic and T. S. Huang and Z. Lo and S. Chu and Y. Zhao and J. C. Phillips and K. Schulten
IEEE Comput. Graph. Appl.
20
29--37
(2000)
Recent progress in 3D immersive display and virtual reality (VR) technologies has made possible many exciting applications. To fully exploit this potential requires "natural" interfaces that allow manipulating such displays without cumbersome attachments. In this article we describe using visual hand-gesture analysis and speech recognition for developing a speech/gesture interface to control a 3D display. The interface enhances an existing application, VMD, which is a VR visual computing environment for structural biology. The free-hand gestures manipulate the 3D graphical display, together with a set of speech commands. We found
- Neal:98book
-
Natural Language with Integrated deictic and graphic gestures
J. G. Neal and C. Y. Thielman and Z. Dobes and S. M. Haller and S. C. Shapiro
38-51
(1998)
- Nickel:03_ICMI
-
Pointing gesture recognition based on 3D-tracking of face, hands and head orientation
K. Nickel and R. Stiefelhagen
140--146
(2003)
In this paper, we present a system capable of visually detecting pointing gestures and estimating the 3D pointing direction in real-time. In order to acquire input features for gesture recognition, we track the positions of a person's face and hands on image sequences provided by a stereo-camera. Hidden Markov Models (HMMs), trained on different phases of sample pointing gestures, are used to classify the 3D-trajectories in order to detect the occurrence of a gesture. When analyzing sample pointing gestures, we noticed that humans tend to look at the pointing target while performing the gesture. In order to utilize this behavior, we additionally measured head orientation by means of a magnetic sensor in a similar scenario. By using head orientation as an additional feature, we observed significant gains in both recall and precision of pointing gestures. Moreover, the percentage of correctly identified pointing targets improved significantly from 65% to 83%. For estimating the pointing direction, we comparatively used three approaches: 1) the line of sight between head and hand, 2) the forearm orientation, and 3) the head orientation.
- Koons:98_book
-
Integrating simultaneous input from speech, gaze, and hand gestures
D. Koons and C. Sparrell and K. Thorisson
53-62
(1998)
- Sowa:99
-
Understanding coverbal dimensional gestures in a virtual design environment
T. Sowa and I. Wachsmuth
117-120
(1999)
- Xiong:03_ICMI
-
Hand motion gestural oscillations and multimodal discourse
Y. Xiong and F. Quek and D. McNeill
132--139
(2003)
To develop multimodal interfaces, one needs to understand the constraints underlying human communicative gesticulation and the kinds of features one may compute based on these underlying human characteristics. In this paper we address hand motion oscillatory gesture detection in natural speech and conversation. First, the hand motion trajectory signals are extracted from video. Second, a wavelet analysis based approach is presented to process the signals. In this approach, wavelet ridges are extracted from the responses of wavelet analysis for the hand motion trajectory signals, which can be used to characterize frequency properties of the hand motion signals. The hand motion oscillatory gestures can be extracted from these frequency properties. Finally, we relate the hand motion oscillatory gestures to the phases of speech and multimodal discourse analysis. We demonstrate the efficacy of the system on a real discourse dataset in which a subject described her action plan to an interlocutor. We extracted the oscillatory gestures from the x, y and z motion traces of both hands. We further demonstrate the power of gestural oscillation detection as a key to unlock the structure of the underlying discourse.
- Bourguet:98CHI
-
Synchronization of speech and hand gestures during multimodal human-computer interaction
M. Bourguet and A. Ando
241--242
(1998)
- Sowa:00_Gesture
-
Coverbal iconic gestures for object descriptions in virtual environments: an empirical study
T. Sowa and I. Wachsmuth
(Apr. 2000)
- Sharma:03Crisis
-
Speech-gesture driven multimodal interfaces for crisis management
R. Sharma and M. Yeasin and N. Krahnstoever and I. Rauschert and G. Cai and I. Brewer and A. M. MacEachren and K. Sengupta
Proceedings of the IEEE
91
1327-1354
(2003)
Emergency response requires strategic assessment of risks, decisions, and communications that are time critical while requiring teams of individuals to have fast access to large volumes of complex information and technologies that enable tightly coordinated work. The access to this information by crisis management teams in emergency operations centers can be facilitated through various human-computer interfaces. Unfortunately, these interfaces are hard to use, require extensive training, and often impede rather than support teamwork. Dialogue-enabled devices, based on natural, multimodal interfaces, have the potential of making a variety of information technology tools accessible during crisis management. This paper establishes the importance of multimodal interfaces in various aspects of crisis management and explores many issues in realizing successful speech-gesture driven, dialogue-enabled interfaces for crisis management. This paper is organized in five parts. The first part discusses the needs of crisis management that can be potentially met by the development of appropriate interfaces. The second part discusses the issues related to the design and development of multimodal interfaces in the context of crisis management. The third part discusses the state of the art in both the theories and practices involving these human-computer interfaces. In particular, it describes the evolution and implementation details of two representative systems, Crisis Management (XISM) and Dialog Assisted Visual Environment for Geoinformation (DAVE_G). The fourth part speculates on the short-term and long-term research directions that will help address the outstanding challenges in interfaces that support dialogue and collaboration. Finally, the fifth part concludes the paper.
- Chen:02ICMI
-
Gesture Patterns during Speech Repairs
L. Chen and M. Harper and F. Quek
(2002)
- Bolt:92UST
-
Two-handed gesture in multi-modal natural dialog
R. A. Bolt and E. Herranz
7--14
(1992)
- Quek:02ACM
-
Multimodal human discourse: Gesture and speech
F. Quek and D. McNeill and R. Bryll and S. Duncan and X. Ma and C. Kirbas and K. E. McCullough and R. Ansari
ACM Trans. Comput.-Hum. Interact.
9
171--193
(2002)
Gesture and speech combine to form a rich basis for human conversational interaction. To exploit these modalities in HCI, we need to understand the interplay between them and the way in which they support communication. We propose a framework for the gesture research done to date, and present our work on the cross-modal cues for discourse segmentation in free-form gesticulation accompanying speech in natural conversation as a new paradigm for such multimodal interaction. The basis for this integration is the psycholinguistic concept of the coequal generation of gesture and speech from the same semantic intent. We present a detailed case study of a gesture and speech elicitation experiment in which a subject describes her living space to an interlocutor. We perform two independent sets of analyses on the video and audio data: video and audio analysis to extract segmentation cues, and expert transcription of the speech and gesture data by microanalyzing the videotape using a frame-accurate videoplayer to correlate the speech with the gestural entities. We compare the results of both analyses to identify the cues accessible in the gestural and audio data that correlate well with the expert psycholinguistic analysis. We show that "handedness" and the kind of symmetry in two-handed gestures provide effective supersegmental discourse cues.
- Koons:93_sp_gz_gs
-
Integrating simultaneous input from speech, gaze, and hand gestures
D. Koons and C. Sparrell and K. Thorisson
257--276
(1993)
- Wexelblat:95ACM_HCI
-
An approach to natural gesture in virtual environments
A. Wexelblat
ACM Trans. Comput.-Hum. Interact.
2
179--200
(1995)
- Wilson:96FG
-
Recovering the Temporal Structure of Natural Gesture
A. Wilson and A. F. Bobick and J. Cassell
(1996)
A method for the recovery of the temporal structure and phases in natural gesture is presented. The work is motivated by recent developments in the theory of natural gesture which have identified several key aspects of gesture important to communication. In particular, gesticulation during conversation can be coarsely characterized as periods of bi-phasic or tri-phasic gesture separated by a rest state. We first present an automatic procedure for hypothesizing plausible rest state configurations of a speaker; the method uses the repetition of subsequences to indicate potential rest states. Second, we develop a state-based parsing algorithm used to both select among candidate rest states and to parse an incoming video stream into bi-phasic and multi-phasic gestures. We present results from examples of story-telling speakers.
- Pavlovic:96_FG
-
Gestural Interface to a visual computing Environment for Molecular biologists (invited speech)
V. Pavlovic and R. Sharma and T. S. Huang
30
(1996)
In recent years there has been tremendous progress in 3-D immersive display and virtual reality (VR) technologies. Scientific visualization of data is one of many applications that has benefited from this progress. To fully exploit the potential of these applications in the new environment there is a need for "natural" interfaces that allow the manipulation of such displays without burdensome attachments. This paper describes the use of visual hand gesture analysis enhanced with speech recognition for developing a bimodal gesture/speech interface for controlling a 3-D display. The interface augments an existing application, VMD, which is a VR visual computing environment for molecular biologists. The free hand gestures are used for manipulating the 3-D graphical display together with a set of speech commands. We concentrate on the visual gesture analysis techniques used in developing this interface. The dual modality of gesture/speech is found to greatly aid the interaction capability.
- Oviatt:00_ACM
-
Perceptual user interfaces: multimodal interfaces that process what comes naturally
S. Oviatt and P. Cohen
Commun. ACM
43
45--53
(2000)
- Oviatt:99_CHI
-
Mutual disambiguation of recognition errors in a multimodal architecture
S. Oviatt
576--583
(1999)
- Pavlovic:97_IEEE_PAMI
-
Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review
V. Pavlovic and R. Sharma and T. S. Huang
IEEE Trans. on Pattern Analysis and Machine Intelligence
19
677--695
(1997)
- Oviatt:99_ACM
-
Ten myths of multimodal interaction
S. Oviatt
Commun. ACM
42
74--81
(1999)
- Oviatt:97CHI_sync
-
Integration and synchronization of input modes during multimodal human-computer interaction
S. Oviatt and A. DeAngeli and K. Kuhn
415--422
(1997)
- Marsh:98VR
-
Shape your imagination: Iconic gestural-based interaction
T. Marsh and A. Watt
122-125
(1998)
Iconic hand gestures are a natural and intuitive way to convey spatial information. Capturing and interpreting iconic hand gestures will augment users' ability to convey spatial information with computers. This paper discusses the on-going research and the results of a study to employ iconic hand gestures as a human computer interaction (HCI) technique for the input and manipulation of objects and shapes within 3D computer generated graphical environments.
- Wexelblat:98_GW
-
Research Challenges in Gesture: Open Issues and Unsolved Problems
A. Wexelblat
1--11
(1998)
- Esposito:01
-
Disfluencies in gesture: Gestural correlates to speech silent and filled pauses
A. Esposito and K. E. McCullough and F. Quek
(2001)
- Quek:95IVC
-
Eyes in the interface
F. Quek
Image and Vision Computing
13
511-525
(1995)
Computer vision has a significant role to play in the human-computer interaction (HCI) devices of the future. All computer input devices, however, serve only one purpose: they transduce some motion or energy from a human agent into machine useable signals. Thus input devices are 'perceptual organs' by which computers sense the intents of their human users. The present study outlines the role computer vision will play, highlights the impediments to the development of vision-based interfaces, and proposes an approach for overcoming these impediments.
- Wu:99IEEE_NN
-
Statistical multimodal integration for intelligent HCI
L. Wu and S. Oviatt and P. Cohen
Proceedings of the 1999 9th IEEE Workshop on Neural Networks for Signal Processing (NNSP'99), Aug 23-Aug 25 1999
487-496
(1999)
This paper presents a statistical approach to developing multimodal recognition systems and, in particular, to integrating the posterior probabilities of parallel input signals involved in the multimodal system. We first derive the performance bounds of multimodal recognition probabilities, and identify the primary factors that influence multimodal recognition performance. We then develop a technique, a Members-Teams-Committee (MTC) recognition approach, designed to optimize accurate recognition during the multimodal integration process. We evaluate these methods using Quickset, a speech/gesture multimodal system, and report evaluation results based on an empirical corpus collected with Quickset. From an architectural perspective, the integration technique presented here offers enhanced robustness. It also is premised on more realistic assumptions than previous multimodal systems using semantic fusion. From a methodological standpoint, the evaluation techniques that we describe provide a valuable tool for evaluating multimodal systems.
- Esposito:02ICSLP
-
Holds as Gestural Correlates to Empty and Filled Speech Pauses
A. Esposito and S. Duncan and F. Quek
(2002)
- Kjeldsen:97CVPR
-
Interaction with on-screen objects using visual gesture recognition
R. Kjeldsen and J. Kender
788-793
(1997)
This paper will review the design of a working system that visually recognizes hand gestures for the control of a window based user interface. After an overview of the system, it will explore one aspect of gestural interaction in depth, hand tracking, and what is needed for the user to be able to interact comfortably with on-screen objects. We describe how the location of the hand is mapped to a location on the screen, and how it is both necessary and possible to smooth the camera input using a non-linear physical model of the cursor. The performance of the system is examined, especially with respect to object selection. We show how a standard HCI model of object selection (Fitts' Law) can be extended to model the selection performance of free-hand pointing.
- Quek:02ICMI
-
Gestural Trajectory Symmetries and Discourse Segmentation
F. Quek and Y. Xiong and D. McNeill
(2002)
- Yong:01PR
-
Hand gesture recognition using combined features of location, angle and velocity
H. S. Yoon and J. Soh and Y. J. Bae and H. S. Yang
Pattern Recognition
34
1491-1501
(2001)
The use of hand gesture provides an attractive alternative to cumbersome interface devices for human-computer interaction (HCI). Many hand gesture recognition methods using visual analysis have been proposed: syntactical analysis, neural networks, the hidden Markov model (HMM). In our research, an HMM is proposed for various types of hand gesture recognition. In the preprocessing stage, this approach consists of three different procedures for hand localization, hand tracking and gesture spotting. The hand location procedure detects hand candidate regions on the basis of skin-color and motion. The hand tracking algorithm finds the centroids of the moving hand regions, connects them, and produces a hand trajectory. The gesture spotting algorithm divides the trajectory into real and meaningless segments. To construct a feature database, this approach uses combined and weighted location, angle and velocity feature codes, and employs a k-means clustering algorithm for the HMM codebook. In our experiments, 2400 trained gestures and 2400 untrained gestures are used for training and testing, respectively. Those experimental results demonstrate that the proposed approach yields a satisfactory and higher recognition rate for user images of different hand size, shape and skew angle.
- Quek:02ICSLP
-
Speech Pauses and Gestural Holds in Parkinson's Disease
F. Quek and M. Harper and Y. Haciahmetoglu and L. Chen and L. Ramig
(2002)
- Gupta:01IEEE_SMC
-
Gesture-based interaction and communication: Automated classification of hand gesture contours
L. Gupta and S. Ma
IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews
31
114-120
(2001)
The accurate classification of hand gestures is crucial in the development of novel hand gesture based systems designed for human computer interaction (HCI) and for human alternative and augmentative communication (HAAC). A complete vision-based system consisting of hand gesture acquisition, segmentation, filtering, representation, and classification is developed to robustly classify hand gestures. The gray scale image of a hand gesture is segmented using a histogram thresholding algorithm. A morphological filtering approach is designed to effectively remove background and object noise in the segmented image. The contour of a gesture is represented by a localized contour sequence whose samples are the perpendicular distances between the contour pixels and the chord connecting the end-points of a window centered on the contour pixels. Gesture similarity is determined by measuring the similarity between the localized contour sequences of the gestures. Linear alignment and nonlinear alignment are developed to measure the similarity between the localized contour sequences. Experiments and evaluations on a subset of American Sign Language (ASL) hand gestures show that, by using nonlinear alignment, no gestures are misclassified by the system. Additionally, it is estimated that real-time gesture classification is possible through the use of a high-speed PC, high-speed digital signal processing chips, and code optimization.
- Quek03:ICASSP
-
Oscillatory Gestures and Discourse
F. Quek and Y. Xiong
(2003)
- Sato:01VR
-
Real-time input of 3D pose and gestures of a user's hand and its applications for HCI
Y. Sato and M. Saito and H. Koike
79-86
(2001)
In this paper, we introduce a method for tracking a user's hand in 3D and recognizing the hand's gesture in real-time without the use of any invasive devices attached to the hand. Our method uses multiple cameras for determining the position and orientation of a user's hand moving freely in a 3D space. In addition, the method identifies predetermined gestures in a fast and robust manner by using a neural network which has been properly trained beforehand. This paper also describes results of a user study of our proposed method and its use in several types of applications, including 3D object handling for a desktop system and 3D walk-through for a large immersive display system.
- Huang:98ICSP
-
Video camera-based dynamic gesture recognition for HCI
Y. Huang and Y. Zhu and G. Xu and H. Zhang and Z. Wen and H. Ren
2
904-907
(1998)
This paper presents a small HCI (Human Computer Interaction) system whose input is visual gesturing and whose output is a panorama image viewer. The spatial-temporal features of gesturing are extracted by motion-based segmentation and used for recognition with dynamic time warping (DTW). Gesture commands let the user navigate in the image-based virtual environment. The gesture classes consist of 12 kinds of dynamic gestures: 6 translations and 6 rotations along the 3 coordinate axes.
- Quek:02Euro-SIP
-
The Catchment Feature Model: A Device for Multimodal Fusion and A Bridge Between Signal and Sense
F. Quek
EURASIP Journal on Applied Signal Processing
(2002 in review)
- Xiong:02ICMI
-
Gestural Hand Trajectory Symmetries and Discourse Segmentation
Y. Xiong and F. Quek and D. McNeill
(2002)
- Xiong:03CVPRHCI
-
Gestural Hand Trajectory Symmetries and Discourse Segmentation
Y. Xiong and F. Quek
(2003)
- kdi:talkbank
-
Talkbank Program, Grant No.BCS-996009 KDI SBE
()
http://www.talkbank.org
- Vislab:KDI
-
KDI: Cross-modal Analysis of Signal and Sense: Data and Computational Resources for Gesture, Speech and Gaze Research, http://vislab.cs.vt.edu/KDI
F. Quek and et al.
()
http://vislab.cs.vt.edu/KDI
- Vislab:VACE
-
ARDA VACE project http://vislab.cs.wright.edu/Projects/MEETING-ANALYSIS/Overview.html
()
- EARS
-
DARPA EARS Program, http://www.darpa.mil/ipto/programs/ears/
()
- Bolt:80
-
Put-that-there
R. A. Bolt
Computer Graphics
14
262-270
(1980)
- Koons:94CHI
-
ICONIC: Speech and Depictive Gestures at the Human-Machine Interface
D. B. Koons and C. Sparrell
(1994)
- Thorisson:92CHI
-
Multi-Modal Natural Dialogue
K. Thorisson and D. Koons and R. A. Bolt
(1992)
- Poddar:99MS
-
Continuous Recognition of Deictic Gestures for Multimodal Interfaces
I. Poddar
(1999)
- Quek:95FG
-
FingerMouse: A freehand pointing interface
F. Quek and T. Mysliwiec and M. Zhao
372-377
(1995)
- Kjeldsen:96FG
-
Toward the use of gesture in traditional user interfaces
R. Kjeldsen and J. Kender
151-156
(1996)
- Freeman:95WSFG
-
Television control by hand gestures
W. Freeman and C. Weissman
179-183
(1995)
- Freeman:98IEEE
-
Computer vision in interactive computer graphics
W. Freeman and et al.
IEEE Computer Graphics and Applications
18
42-53
(1998)
- Quek:96IEEE-MM
-
Unencumbered gestural interaction
F. Quek
IEEE Multimedia
4
36-47
(1996)
- Yamato:92CVPR
-
Recognizing human action in time-sequential images using hidden Markov model
J. Yamato and J. Ohya and K. Ishii
379-385
(1992)
- Wilson:99IEEE
-
Parametric hidden Markov models for gesture recognition
A. Wilson and A. F. Bobick
IEEE Trans. on Pattern Analysis and Machine Intelligence
21
884-900
(1999)
A method for the representation, recognition, and interpretation of parameterized gesture is presented. By parameterized gesture we mean gestures that exhibit a systematic spatial variation; one example is a point gesture where the relevant parameter is the two-dimensional direction. Our approach is to extend the standard hidden Markov model method of gesture recognition by including a global parametric variation in the output probabilities of the HMM states. Using a linear model of dependence, we formulate an expectation-maximization (EM) method for training the parametric HMM. During testing, a similar EM algorithm simultaneously maximizes the output likelihood of the PHMM for the given sequence and estimates the quantifying parameters. Using visually derived and directly measured three-dimensional hand position measurements as input, we present results that demonstrate the recognition superiority of the PHMM over standard HMM techniques, as well as greater robustness in parameter estimation with respect to noise in the input features. Finally, we extend the PHMM to handle arbitrary smooth (nonlinear) dependencies. The nonlinear formulation requires the use of a generalized expectation-maximization (GEM) algorithm for both training and the simultaneous recognition of the gesture and estimation of the value of the parameter. We present results on a pointing gesture, where the nonlinear approach permits the natural spherical coordinate parameterization of pointing direction
- Wexelblat:97
-
Research challenges in gesture: open issues and unsolved problems
A. Wexelblat
1-11
(1997)
- Cohen:89CHI
-
Synergistic use of direct manipulation and natural language
P. Cohen and et al.
227-234
(1989)
- Cohen:97ACM
-
QuickSet: Multimodal Interaction for Distributed Applications
P. Cohen and et al.
31-40
(1997)
- Oviatt:00ACM
-
Multimodal interfaces that process what comes naturally
S. Oviatt and P. Cohen
Communications of the ACM
43
43-53
(2000)
- McNeill:92book
-
Hand and Mind: What Gestures Reveal about Thought
D. McNeill
(1992)
- McNeill:00GP
-
Growth Points, Catchments, and Contexts
D. McNeill
Cognitive Studies: Bulletin of the Japanese Cognitive Science Society
7
(2000)
- McNeill:00Catchment
-
Catchments and Context: Non-modular Factors in Speech and Gesture
D. McNeill
312-328
(2000)
- McNeill00:GPinbook
-
Growth points in thinking-for-speaking
D. McNeill and S. Duncan
141-161
(2000)
- Grosz:95PH
-
Instructions for annotating discourses
C. Nakatani and B. Grosz and D. Ahn and J. Hirschberg
(1995)
- McNeill:02gesture
-
Catchments, prosody and discourse
D. McNeill and F. Quek and et al.
Gesture
1
9-33
(2002)
- McNeill:02Kluwer
-
Dynamic imagery in speech and gesture
D. McNeill and F. Quek and et al.
27-44
(2002)
- Quek:99ICCV
-
Gesture cues for conversational interaction in monocular video
F. Quek and D. McNeill and et al.
64-69
(1999)
- Quek:02ICSLP:symmetry
-
Gestural trajectory symmetries and discourse segmentation
F. Quek and Y. Xiong and D. McNeill
(2002)
- Bell:00ICSLP
-
A Comparison of Disfluency Distribution in a Unimodal and a Multimodal Speech Interface
L. Bell and R. Eklund and R. Gustafson
(2000)
In this paper, we compare the distribution of disfluencies in two human-computer dialogue corpora. One corpus consists of unimodal travel booking dialogues, which were recorded over the telephone. In this unimodal system, all components except the speech recognition were authentic. The other corpus was collected using a semi-simulated multimodal dialogue system with an animated talking agent and a clickable map. The aim of this paper is to analyze and discuss the effects of modality, task and interface design on the distribution and frequency of disfluencies in these two corpora.
- Johnston:00COLING
-
Finite-state multimodal parsing and understanding
M. Johnston and S. Bangalore
(2000)
- Oviatt:96IEEE
-
User-Centered Modeling for Spoken Language and Multimodal Interfaces
S. Oviatt
IEEE Multimedia
26-35
(1996)
- Oviatt:96IEEE
-
User-Centered Modeling for Spoken Language and Multimodal Interfaces
S. Oviatt
Proceedings of the IEEE
91
1457-1468
(2003)
- Billinghurst:98CG
-
Put That Where? Voice and Gesture at the Graphics Interface
M. Billinghurst
Computer Graphics
(1998)
- Hauptmann:93
-
Gesture with Speech for Graphics Manipulation
A. G. Hauptmann and P. McAvinney
Int'l J. Man-Machine Studies
38
231-249
(1993)
- Maes:99_ALIVE
-
ALIVE: Artificial Life Interactive Video Environment
P. Maes and et al.
Intercommunication
7
48-49
(1999)
- pentland96:_smart_rooms
-
Smart Rooms
A. Pentland
Scientific American
54-62
(1996)
- Finlayson:03Diss
-
Effects of the Restriction of Hand Gestures on Disfluency
S. Finlayson and V. Forrest and R. Lickley and J. Mackenzie
(2003)
This paper describes an experimental pilot study of disfluency and gesture rates in spontaneous speech where speakers perform a communication task in three conditions: hands free, one arm immobilized, both arms immobilized.
Previous work suggests that the restriction of the ability to gesture can have an impact on the fluency of speech. In particular, it has been found that the inability to produce iconic gestures, which depict actions and objects, results in a higher rate of disfluency. Models of speech production account for this by suggesting that gesture and speech production are part of the same integrated system. Such models differ in their interpretation of the location of the gesture planning mechanism in relation to the speech model: some authors suggest that iconic gestures relate closely to lexical access, while others suggest that the link is located around the conceptualization stage.
The findings of this study tentatively confirm that there is a relationship between gesture and fluency: overall, disfluency increases as gesture is restricted. But it remains unclear whether the disfluency is more related to lexical access than to conceptualization. Proposals for a larger study are suggested.
The work is of interest to psycholinguists focusing on the integration of gesture into models of speech production and to Speech and Language Therapists who need to know about the impact that an impaired ability to produce gestures may have on communication.
- Davis94:VISP
-
Visual Gesture Recognition
J. Davis and M. Shah
Vision, Image and Signal Processing
141
(1994)
- Walter01:BMVC
-
Data Driven Gesture Model Acquisition using Minimum Description Length
M. Walter and A. Psarrou and S. Gong
(2001)
- Asada84
-
The curvature primal sketch
H. Asada and M. Brady
(1984)
- Kipp01:_anvil
-
ANVIL: A Generic Annotation Tool for Multimodal Dialogue
M. Kipp
(2001)
- Quek02:_visst
-
VisSTA: A Tool for Analyzing Multimodal Discourse Data
F. Quek and Y. Shi and C. Kirbas and S. Wu
(2002)
- Buntine92:_learn
-
Learning classification trees
W. Buntine
Statistics and Computing
2
63-73
(1992)
- Cassell99:_AAAI
-
Living Hand to Mouth: Psychological Theories about Speech and Gesture in Interactive Dialogue Systems
J. Cassell and M. Stone
(1999)
- Bryll01:_HOLD
-
Automatic Hand Hold Detection in Natural Conversation
R. Bryll and F. Quek and A. Esposito
(2001)
- Quek99:_VCM
-
A Parallel Algorithm for Dynamic Gesture Tracking
F. Quek and R. Bryll and X. F. Ma
(1999)
- gibbon03:_syntax_gesture
-
Formal Syntax of Gesture : CoGesT1.1
D. Gibbon and B. Hell and K. Looks and S. Trippel
(2003)
- Munhall04:_Visual_Prosody
-
Visual Prosody and Speech Intelligibility
K. G. Munhall and J. A. Jones and D. E. Callan and T. Kuratate and E. Vatikiotis-Bateson
Psychological Science
15
133-137
(2004)
- kendon72:_speech
-
Some relationships between body motion and speech
A. Kendon
(1972)
- Kendon86:_current
-
Current Issues in the Study of Gesture
A. Kendon
23-47
(1986)
- Kettebekov02:_ICMI
-
Prosody Based Co-analysis for Continuous Recognition of Coverbal Gestures
S. Kettebekov and M. Yeasin and R. Sharma
(2002)
- Fels97:_GT2
-
Glove-talk II - a neural-network interface which maps gestures to parallel formant speech synthesizer controls
S. S. Fels and H. G. E.
IEEE Transactions on Neural Networks
8
977-984
(1997)
- Chen02_ACM_MM
-
Achieving Effective Floor Control with a Low-Bandwidth Gesture-Sensitive Videoconferencing System
M. Chen
(2002)
Multiparty videoconferencing with even a small number of people is often infeasible due to the high network bandwidth required. Bandwidth can be significantly reduced if most of the advantages of using full-motion video can be achieved with low-frame-rate video; unfortunately, the impact of low-frame-rate video on communication is relatively unexplored. We implemented a multiparty videoconferencing system that supports full-motion video, low-frame-rate video where the video is updated only once every few seconds, and a hybrid scheme where full-motion video is transmitted when the system detects that a user is making a gesture and low-frame-rate video is transmitted at all other times. We studied people using our system for small-group discussions and found that low-frame-rate video limited people's ability to request to speak or judge when to stop speaking. The hybrid scheme, conversely, was as effective as full-motion video for floor control, resulting in a similar number of speaker changes, while using only ten percent of the bandwidth.
- Chen:04ICMI
-
Multimodal Model Integration for Sentence Unit Detection
L. Chen and Y. Liu and M. Harper and E. Shriberg
(2004)
In this paper, we adopt a direct modeling approach to utilize
conversational gesture cues in detecting sentence boundaries, called
SUs, in videotaped conversations. We treat the detection of SUs as a
classification task such that for each inter-word boundary, the
classifier decides whether there is an SU boundary or not. In
addition to gesture cues, we also utilize prosody and lexical
knowledge sources. In a first investigation, we find that gesture
features complement the prosodic and lexical knowledge sources for
this task. By using all of the knowledge sources, the model is
able to achieve the lowest overall SU detection error rate.
- Kettebekov02:_IEEE
-
Prosody Based Audio-Visual Co-analysis for Coverbal Gesture Recognition
S. Kettebekov and M. Yeasin and R. Sharma
(2002)
- Kettebekov03:CVPR
-
Improving continuous gesture recognition with spoken prosody
S. Kettebekov and M. Yeasin and R. Sharma
(2003)
Despite recent advances in gesture recognition, reliance on the visual signal alone to classify unrestricted continuous gesticulation is inherently error-prone. Since spontaneous gesticulation is mostly coverbal in nature, there have been some attempts to use speech cues to improve gesture recognition, e.g., keyword-gesture co-analysis. Use of such a scheme is burdened by the complexity of natural language understanding. This paper offers a "signal-level" perspective by exploring prosodic phenomena of spontaneous gesture and speech co-production. We present a computational framework for improving continuous gesture recognition based on two phenomena that capture voluntary (co-articulation) and involuntary (physiological) contributions of prosodic synchronization. Physiological constraints, manifested as signal interruptions in multimodal production, are exploited in an audio-visual feature integration framework using hidden Markov models (HMMs). Co-articulation is analyzed using a Bayesian network of naive classifiers to explore alignment of intonationally prominent speech segments and hand kinematics. The efficacy of the proposed approach was demonstrated on a multimodal corpus created from the Weather Channel broadcast. Both schemas were found to contribute uniquely by reducing different error types, which subsequently improves the performance of continuous gesture recognition.
- LDC:tool
-
Linguistic Annotation: Survey by LDC http://www.ldc.upenn.edu/annotation/
S. Bird and M. Liberman
()
http://www.ldc.upenn.edu/annotation/
- Hirschberg96:ACL
-
A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues
J. Hirschberg and C. H. Nakatani
(1996)
- Liu05:ICASSP
-
Structural Metadata Research in the EARS Program
Y. Liu and E. Shriberg and A. Stolcke and B. Peskin and J. Ang and H. D. and M. Ostendorf and M. Tomalin and P. Woodland and M. Harper
(2005)
- Silverman92:ToBI
-
ToBI: A Standard for Labeling English Prosody
K. Silverman
(1992)
- Liu04:RT
-
The ICSI/SRI/UW RT-04 Structural Metadata Extraction System
Y. Liu and E. Shriberg and A. Stolcke and B. Peskin and M. Harper
(2004)
- Morris01:multi-stream
-
Multi-stream adaptive evidence combination for noise robust ASR
A. Morris and A. Hagen and H. Glotin and H. Bourlard
Speech Communication
(2001)
- Oviatt95:CSL
-
Predicting spoken disfluencies during human-computer interaction
S. Oviatt
Computer Speech and Language
9
19-35
(1995)
- Lapping94:CL_anaphora
-
An algorithm for pronominal anaphora resolution
S. Lappin and H. Leass
Computational Linguistics
20
535-561
(1994)
- Sacks74:turn
-
A Simplest Systematics for the Organization of Turn-Taking for Conversation
H. Sacks and E. A. Schegloff and G. Jefferson
Language
50
696-735
(1974)
- Stevenson00:NACCL
-
Experiments on sentence boundary detection
M. Stevenson and R. Gaizauskas
(2000)
- Beeferman98:cyperpunc
-
Cyberpunc: A lightweight punctuation annotation system for speech
D. Beeferman and A. Berger and J. Lafferty
(1998)
- Shafran:03ICASSP
-
Robust speech detection and segmentation for real-time ASR applications
I. Shafran and R. Rose
1
I-432-I-435 vol.1
(2003)
This paper provides a solution for robust speech detection that can be applied across a variety of tasks. The solution is based on an algorithm that performs non-parametric estimation of the background noise spectrum using minimum statistics of the smoothed short-time Fourier transform (STFT). It is shown that the new algorithm can operate effectively under varying signal-to-noise ratios. Results are reported on two tasks - HMIHY and SPINE - which differ in their speaking style, background noise type and bandwidth. With a computational cost of less than 2% real-time on a 1GHz P-3 machine and a latency of 400 ms, it is suitable for real-time ASR applications.
- Duncan74:turn
-
On the Structure of Speaker-Auditor Interaction During Speaking Turns
S. Duncan
Language in Society
3
161-180
(1974)
- Duncan72:turn
-
Some Signals and Rules for Taking Speaking Turns in Conversations
S. Duncan
Journal of Personality and Social Psychology
23
283-292
(1972)
- Fach99:Euro
-
A comparison between syntactic and prosodic phrasing
M. Fach
(1999)
- Liu04:EM_NLP
-
Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech
Y. Liu and A. Stolcke and E. Shriberg and M. Harper
(2004)
- Liu04:PhD
-
Structural Event Detection for Rich Transcription of Speech
Y. Liu
(2004)
- Chen99:EuroSpeech
-
Speech Recognition with Automatic Punctuation
C. J. Chen
(1999)
- Kemp00:ICASSP
-
Strategies for automatic segmentation of audio data
T. Kemp and M. Schmidt and M. Westphal and A. Waibel
3
1423-1426 vol.3
(2000)
In many applications, like indexing of broadcast news or surveillance applications, the input data consists of a continuous, unsegmented audio stream. Speech recognition technology, however, usually requires segments of relatively short length as input. For such applications, effective methods to segment continuous audio streams into homogeneous segments are required. In this paper, three different segmenting strategies (model-based, metric-based and energy-based) are compared on the same broadcast news test data. It is shown that model-based and metric-based techniques outperform the simpler energy-based algorithms. While model-based segmenters achieve a very high level of segment boundary precision, the metric-based segmenter performs better in terms of segment boundary recall (RCL). To combine the advantages of both strategies, a new hybrid algorithm is introduced. For this, the results of a preliminary metric-based segmentation are used to construct the models for the final model-based segmenter run. The new hybrid approach is shown to outperform the other segmenting strategies.
- Jin:03RT
-
Speaker segmentation on conversational telephone speech
Q. Jin
()
www.nist.gov/speech/tests/rt/rt2003/spring/presentations/RT03_Slides.pdf
- Hosom:00PhD
-
Automatic Time Alignment of Phonemes Using Acoustic-Phonetic Information
J. P. Hosom
(2000)
- Basu:03ICASSP
-
A linked-HMM model for robust voicing and speech detection
S. Basu
1
I-816-I-819 vol.1
(2003)
We present a novel method for simultaneous voicing and speech detection based on a linked-HMM architecture, with robust features that are independent of the signal energy. Because this approach models the change in dynamics between speech and nonspeech regions, it is robust to low sampling rates, significant levels of additive noise, and large distances from the microphone. We demonstrate the performance of our method in a variety of testing conditions and also compare it to other methods reported in the literature.
- Ljolje:91ICASSP
-
Automatic Segmentation and Labeling of Speech.
A. Ljolje and M. D. Riley
(1991)
- owen:98PhD
-
Multiple Media Correlation: Theory and Application
C. B. Owen
(1998)
- Pfau:01ASRU
-
Multispeaker speech activity detection for the ICSI meeting recorder
T. Pfau and D. Ellis and A. Stolcke
107-110
(2001)
As part of a project into speech recognition in meeting environments, we have collected a corpus of multichannel meeting recordings. We expected the identification of speaker activity to be straightforward given that the participants had individual microphones, but simple approaches yielded unacceptably erroneous labelings, mainly due to crosstalk between nearby speakers and wide variations in channel characteristics. Therefore, we have developed a more sophisticated approach for multichannel speech activity detection using a simple hidden Markov model (HMM). A baseline HMM speech activity detector has been extended to use mixtures of Gaussians to achieve robustness for different speakers under different conditions. Feature normalization and crosscorrelation processing are used to increase the channel independence and to detect crosstalk. The use of both energy normalization and crosscorrelation based postprocessing results in a 35% relative reduction of the frame error rate. Speech recognition experiments show that it is beneficial in this multispeaker setting to use the output of the speech activity detector for presegmenting the recognizer input, achieving word error rates within 10% of those achieved with manual turn labeling.
- Kingsbury:02ICASSP
-
Robust speech recognition in noisy environments: the 2001 IBM SPINE evaluation system
B. Kingsbury and G. Saon and L. Mangu and M. Padmanabhan and R. Sarikaya
1
I-53-I-56 vol.1
(2002)
We report on the system IBM fielded in the second SPeech In Noisy Environments (SPINE-2) evaluation, conducted by the Naval Research Laboratory in October 2001. The key components of the system include an HMM-based automatic segmentation module using a novel set of LDA-transformed voicing and energy features, a multiple-pass decoding strategy that uses several speaker- and environment-normalization operations to deal with the highly variable acoustics of the evaluation, the combination of hypotheses from decoders operating on three distinct acoustic feature sets, and a class-based language model that uses both the SPINE-1 and SPINE-2 training data to estimate reliable probabilities for the new SPINE-2 vocabulary
- Barras:01Speech_Comm
-
Transcriber: Development and use of a tool for assisting speech corpora production
C. Barras and E. Geoffrois and Z. Wu and M. Liberman
Speech Communication
33
5-22
(2001)
- Deshmukh:98ICSLP
-
Resegmentation of switchboard
N. Deshmukh and A. Ganapathiraju and J. Hamaker and P. Picone
(1998)
- Rapp:95german
-
Automatic phonemic transcription and linguistic annotation from known text with Hidden Markov Models, An Aligner for German
S. Rapp
(1995)
- Sjolander:03Sweden
-
An HMM-based system for automatic segmentation and alignment of speech
K. Sjolander
(2003)
- Makashay:00ICSLP
-
Perceptual Evaluation of Automatic Segmentation in Text-To-Speech Synthesis
M. J. Makashay and C. Wightman and A. K. Syrdal and A. Conkie
(2000)
- Greenberg:00NIST
-
An Introduction to the Diagnostic Evaluation of SwitchBoard-Corpus Automatic Speech Recognition System
S. Greenberg and S. Chang and J. Hollenback
(2000)
- Greenberg:00ASR
-
Linguistic Dissection of Switchboard-Corpus Automatic Speech Recognition System
S. Greenberg and S. Chang
(2000)
- Finke:97ASRU
-
Flexible Transcription Alignment
M. Finke and A. Waibel
(1997)
Presents a set of techniques that we employed in our Janus Recognition Toolkit (JRTk) Switchboard and CallHome recognizer in order to deal with imperfections in the transcriptions: inconsistent transcription of pronunciations and contractions, as well as errors in utterance segmentations. These techniques consist of a dynamic, speaking-mode-dependent pronunciation model and a flexible utterance alignment procedure which is based on speaker-adapted models (label boosting). The idea is (a) to automatically retranscribe the training corpus based on these models and procedures, (b) to train a recognizer based on these flexible transcription graphs, and (c) to decode with a dynamic speaking-mode-dependent dictionary. The framework is successfully applied to increase the performance of our state-of-the-art JRTk Switchboard recognizer significantly
- Hain:00NIST
-
The CU-HTK MARCH 2000 HUB5E TRANSCRIPTION SYSTEM
T. Hain and P. C. Woodland and G. Evermann and D. Povey
(2000)
We describe the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together have resulted in an 11% relative decrease in word error rate on the 1998 evaluation test set. Major changes include the use of maximum mutual information estimation in training as well as conventional maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. On the March 2000 Hub5 evaluation set the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin. This paper describes the new system features and gives the results of each processing stage for both the 1998 and 2000 evaluation sets.
- Stolcke:00NIST
-
The SRI March 2000 Hub-5 Conversational Speech Transcription System
A. Stolcke and et al.
(2000)
We describe SRI's large vocabulary conversational speech recognition system as used in the March 2000 NIST Hub-5E evaluation. The system performs four recognition passes: (1) bigram recognition with phone-loop-adapted, within-word triphone acoustic models, (2) lattice generation with transcription-mode-adapted models, (3) trigram lattice recognition with adapted cross-word triphone models, and (4) N-best rescoring and reranking with various additional knowledge sources. The system incorporates two new kinds of acoustic model: triphone models conditioned on speaking rate, and an explicit joint model of within-word phone durations. We also obtained an unusually large improvement from modeling cross-word pronunciation variants in "multiword" vocabulary items. The language model (LM) was enhanced with an "anti-LM" representing acoustically confusable word sequences. Finally, we applied a generalized ROVER algorithm to combine the N-best hypotheses from several systems based on different acoustic models.
- Gent:97EuroSpeech
-
Improving the phonetic annotation by means of prosodic phrasing
V. Gent and et al.
(1997)
- Yangl:03CSL
-
The Effect of Pruning and Compression on Graphical Representation of the Output of a Speech Recognizer
Y. Liu and M. Harper and et al.
Computer Speech and Language
(to appear)
- Young:02HTKBOOK
-
The HTK Book
S. Young and et al.
(2002)
- Wightman:97Aligner
-
The Aligner
C. Wightman and D. Talkin
(1997)
- Boersma96:_praat
-
Praat, a system for doing phonetics by computer.
P. Boersma and D. Weenink
(1996)
- META_spec_V5
-
Simple Metadata Annotation Specification
S. Strassel
(2003)
- Barras:01SC
-
Transcriber: Development and use of a tool for assisting speech corpora production
C. Barras and E. Geoffrois and Z. Wu and M. Liberman
Speech Communication
(2001)
Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on Tcl/Tk with extensions such as Snack (Sound) and tcLex for lexical analysis.
- Pitt:03IEEE
-
The VIC Corpus of Conversational Speech
M. Pitt and K. Johnson and E. Hume and S. Kiesling and M. Raymond
Audio Signal Processing, IEEE Trans. on
(2003)
Their procedure is the same as ours: they use Aligner/Xwaves, whereas we use a better tool.
- Eickeler:02lrec
-
Creation of an Annotated German Broadcast Speech Database for Spoken Document Retrieval
S. Eickeler and M. Larson and R. M. and J. Köhler
(2002)
- Salor:02ICSLP
-
On Developing New Text and Audio Corpora and Speech Recognition Tools for the Turkish Language
O. Özgül Salor and B. Pellom and et al.
(2002)
- Gales:97
-
Maximum likelihood linear transformations for HMM-based speech recognition
M. Gales
(1997)
citeseer.nj.nec.com/article/gales98maximum.html
- Anastasakos:SAT
-
A Compact Model for Speaker-Adaptive Training
T. Anastasakos and J. McDonough and R. Schwartz and J. Makhoul
(1996)
- Sundaram:01ISIP
-
ISIP 2000 Conversational Speech Evaluation System
R. Sundaram and A. Ganapathiraju and J. Hamaker and J. Picone
(2000)
- Kessens:03CSL
-
On automatic phonetic transcription quality: lower word error rates do not guarantee better transcriptions
J. M. Kessens and H. Strik
Computer Speech and Language
(2003)
- Moreno:98ICSLP
-
A RECURSIVE ALGORITHM FOR THE FORCED ALIGNMENT OF VERY LONG AUDIO SEGMENTS
P. J. Moreno and C. Joerg and et al.
(1998)
- palmer94:_adapt
-
Adaptive sentence boundary disambiguation
D. D. Palmer and M. A. Hearst
78-83
(1994)
- Gotoh00:_SU
-
Sentence boundary detection in broadcast speech transcript
Y. Gotoh and S. Renals
(2000)
- Shriberg00:_SC
-
Prosody-based automatic segmentation of speech into sentences and topics
E. Shriberg and A. Stolcke and D. Hakkani-Tur and G. Tur
Speech Communication
(2000)
- Stolcke98:_ICSLP
-
Automatic detection of sentence boundaries and disfluencies based on recognized words
A. Stolcke and E. Shriberg and R. Bates and et al.
2247-2250
(1998)
- Shriberg02:_WKSP_NLP
-
Prosody modeling for automatic speech recognition and understanding
E. Shriberg and A. Stolcke
(2002)
- Ang02:ICSLP
-
Prosody-based automatic detection of annoyance and frustration in human-computer dialog
J. Ang and R. Dhillon and A. Krupski and et al.
2037-2040
(2002)
- Stolcke00:CL
-
Dialogue act modeling for automatic tagging and recognition of conversational speech
A. Stolcke and K. Ries and N. Coccaro and et al.
Computational Linguistics
339-373
(2000)
- Levelt89:_book
-
Speaking: from intention to articulation
W. Levelt
(1989)
- Mackay72
-
The structure of words and syllables: Evidence from errors in speech
D. G. MacKay
Cognitive Psychology
3
(1972)
- Clark98:_repeat
-
Repeating words in spontaneous speech
H. H. Clark and T. Wasow
Cognitive Psychology
201-242
(1998)
- Brennan01:_listener
-
How listeners compensate for disfluencies in spontaneous speech
S. E. Brennan
Journal of Memory and Language
44
274-296
(2001)
- Tree95
-
The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech
F. Tree
Journal of Memory and Language
34
709-738
(1995)
- Shriberg94:PhD
-
Preliminaries to A Theory of Speech Disfluencies
E. Shriberg
(1994)
- Bear92:_integ
-
Integrating multiple knowledge sources for detecting and correction of repairs in human-computer dialog
J. Bear and J. Dowding and E. Shriberg
56-63
(1992)
- Charniak01:_edit
-
Edit detection and parsing for transcribed speech
E. Charniak and M. Johnson
118-126
(2001)
- Core99:_syntax
-
A syntactic framework for speech repairs and other disruption
M. G. Core and L. K. Schubert
413-420
(1999)
- Zechner01:PhD
-
Automatic Summarization of Spoken Dialogues in Unrestricted Domains
K. Zechner
(2001)
- rabiner86
-
An Introduction to Hidden Markov Models
L. R. Rabiner and B. H. Juang
IEEE ASSP Magazine
3
4-16
(1986)
- Stolcke96:ICASSP
-
Statistical Language Modeling for Speech disfluencies
A. Stolcke and E. Shriberg
(1996)
- Heeman99:CL
-
Speech repairs, intonational phrases and discourse markers: Modeling speakers' utterances in spoken dialogue
P. Heeman and J. Allen
Computational Linguistics
(1999)
- Lickley91:_how
-
How and when are disfluencies found
R. Lickley and R. Shillcock and E. Bard
(1991)
- Lickley98:_when
-
When can listeners detect disfluency in spontaneous speech
R. Lickley and E. Bard
Language and Speech
(1998)
- Nakatani94
-
A corpus-based study of repair cues in spontaneous speech
C. Nakatani and J. Hirschberg
Journal of the Acoustical Society of America
1603-1616
(1994)
- Shaughnessy93:_analy
-
Analysis and automatic recognition of false starts in spontaneous speech
D. O'Shaughnessy
724-727
(1993)
- Shriberg97:Euro
-
A prosody-only decision-tree model for disfluency detection
E. Shriberg and A. Stolcke
2383-2386
(1997)
- Hindle83
-
Deterministic parsing of syntactic nonfluencies
D. Hindle
123-128
(1983)
- Shriberg99:_phonet
-
Phonetic consequences of speech disfluency
E. Shriberg
619-622
(1999)
- Lickley96:ICSLP
-
Juncture cues to disfluency
R. Lickley
(1996)
- Lickley96:ICSLP2
-
On not recognizing disfluencies in dialog
R. Lickley and E. Bard
1876-1879
(1996)
- Bortfeld01:_disf
-
Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role and Gender
H. Bortfeld and S. Leon and J. Bloom and M. Schober and S. Brennan
Language and Speech
44
123-147
(2001)
- Shriberg01:_disf
-
To errr is human: ecology and acoustics of speech disfluencies
E. Shriberg
Journal of the International Phonetic Association
31
(2001)
- Cristoforetti00:lrec
-
Annotation of a Multichannel noisy speech corpus
L. Cristoforetti and M. Matassoni and M. Omologo and S. P. and E. Zovato
(2000)
- Hayamizu:99EuroSpeech
-
A Multimodal Database of Gesture and Speech
S. Hayamizu et al.
5
2247-2250
(1999)
- Nakamura:00ICSLP
-
Multimodal Corpora for Human-Machine Interaction Research
S. Nakamura et al.
IV
25-28
(2000)
- Braffort98:INSE
-
Video-Tracking and recognition of pointing gesture using Hidden Markov Models
A. Braffort and A. Gherbi
(1998)
- Kita99:_gesture
-
Gesture and speech dysfluencies
M. Seyfeddinipur and S. Kita
(1999)
- Mayberry00:_stutter
-
Gesture production during stuttered speech: Insights into the nature of gesture-speech integration
R. Mayberry and J. Jaques
199-213
(2000)
- NCSLGR
-
National Center Sign Language and Gesture Resource
http://www.bu.edu/asllrp/cslgr/
- ISLE:_survey
-
Survey of NIMM data Resources, Current and Future User Profiles, Markets and User Needs for NIMM Resources
M. Knudsen et al.
(2002)
- Janin03:_ICASSP
-
The ICSI Meeting Corpus
A. Janin et al.
(2003)
- Wrede03:_ASRU
-
The Relation between Dialogue Acts and Hot Spots in Meetings
B. Wrede and E. Shriberg
(2003)
- Wrede03:_ES
-
Spotting "Hot Spots" in Meetings: Human Judgements and Prosodic Cues
B. Wrede and E. Shriberg
(2003)
- Lathoud03:_ICASSP
-
Location based Speaker Segmentation
G. Lathoud and I. McCowan
(2003)
- Wright03:_ES
-
Feature Selection for the Classification of Crosstalk in Multi-Channel Audio
S. Wright and G. Brown and V. Wan and S. Renals
469-472
(2003)
- Reyes03:_ICASSP
-
Multi-Channel Source Separation by Factorial HMMs
M. Reyes and B. Raj and D. Ellis
(2003)
- Hillard03:_HLT
-
Detection of Agreement vs. Disagreement in Meetings: Training with Unlabeled data
D. Hillard and M. Ostendorf and E. Shriberg
(2003)
- Liu03:CSL
-
Resampling Techniques for Sentence Boundary Detection: A Case Study in Machine Learning from Imbalanced Data for Spoken Language Processing
Y. Liu and N. V. Chawla and E. Shriberg and A. Stolcke and M. Harper
Computer Speech and Language
(To appear)
- Chen04:_lrec
-
Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus
L. Chen and Y. Liu and M. Harper and E. Maia and S. McRoy
(2004)
- Kim-punctuation
-
The Use of Prosody In a Combined System for Punctuation Generation and Speech Recognition
J. Kim and P. C. Woodland
(2001)
- Sonmez
-
Modeling dynamic prosodic variation for speaker verification
K. Sonmez and E. Shriberg and L. Heck and M. Weintraub
3189-3192
(1998)
- Christensen-punctuation
-
Punctuation Annotation using Statistical prosody models
H. Christensen
(2001)
- Shriberg-Spchcom2000
-
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
E. Shriberg and A. Stolcke et al.
Speech Communication
32
127-154
(2000)
- Stolcke-HE-lm
-
Automatic Linguistic segmentation of conversational speech
A. Stolcke and E. Shriberg
1005-1008
(1996)
- class-ngram
-
Class-Based n-gram Models of Natural Language
P. Brown and V. Pietra and P. DeSouza et al.
Computational Linguistics
467-479
(1992)
- Brants-POS2000
-
TnT - a Statistical Part-of-Speech Tagger
T. Brants
224-231
(2000)
- Coquoz:2004
-
Broadcast News Segmentation Using MDE and STT Information to Improve Speech Recognition
S. Coquoz
(2004)
- NIST-RT03F
-
MDE Research at ICSI+SRI+UW, NIST RT-03F Workshop
E. Shriberg and Y. Liu et al.
(2003)
- Liu:03
-
Resampling Techniques for Sentence Boundary Detection: A Case Study in Machine Learning from Imbalanced Data for Spoken Language Processing
Y. Liu and N. Chawla and E. Shriberg and A. Stolcke and M. Harper
(2003)
- Mateer
-
Disfluency Annotation Stylebook for the Switchboard Corpus
M. Mateer and A. Taylor
(1995)
- Shriberg:_04Prosody
-
Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing
E. Shriberg and A. Stolcke
(2004)
- McCowan03:MM_action
-
Automatic Analysis of Multimodal Group Actions in Meetings
I. McCowan and D. Gatica-Perez and S. Bengio and G. Lathoud
(2003)
- Ajmera04:MM4_spkseg
-
Clustering and segmenting speakers and their locations in meetings
J. Ajmera and G. Lathoud and I. McCowan
1
605-608
(2004)
This paper presents a new approach toward automatic annotation of meetings in terms of speaker identities and their locations. This is achieved by segmenting the audio recordings using two independent sources of information: magnitude spectrum analysis and sound source localization. We combine the two in an appropriate HMM framework. There are three main advantages of this approach. First, it is completely unsupervised, i.e. speaker identities and number of speakers and locations are automatically inferred. Second, it is threshold-free, i.e. the decisions are made without the need of a threshold value which generally requires an additional development dataset. The third advantage is that the joint segmentation improves over the speaker segmentation derived using only acoustic features. Experiments on a series of meetings recorded in the IDIAP Smart Meeting Room demonstrate the effectiveness of this approach.
- Jovanovic03:action
-
Recognition of meeting action using information obtained from different modalities
N. Jovanovic
(2003)
- Jovanovic04:MM_address
-
Towards automatic addressee identification in multi-party dialogues
N. Jovanovic and R. Akker
(2004)
- ISL_gaze
-
Head orientation and gaze direction in meetings
R. Stiefelhagen and J. Zhu
858-859
(2002)
Detecting who is looking at whom during multiparty interaction is useful for various tasks such as meeting analysis. There are two contributing factors in the formation of where a person is looking: head orientation and eye orientation. In this poster, we present an experiment aimed at evaluating the potential of head orientation estimation in detecting who is looking at whom, because head orientation can be estimated accurately and robustly with non-intrusive methods while eye orientation cannot. Experimental results show that head orientation contributes 68.9% on average to the overall gaze direction, and focus of attention estimation based on head orientation alone can get an average accuracy of 88.7% in a meeting application scenario with four participants. We conclude that head orientation is a good indicator of focus of attention in human computer interaction applications.
- Bounif04:MMM
-
A multimodal database framework for multimedia meeting annotations
H. Bounif and O. Drutskyy and F. Jouanot and S. Spaccapietra
17-25
(2004)
The main objective of this paper is to present a flexible annotation management framework for a multimedia database system, applied to meeting recordings. Presented research and development activities are carried out within the scope of the IM2 project in which annotations play an important role in describing raw data from various points of view and in enhancing the query process. We focus on a database system capable of managing annotations (e.g. text transcriptions, dialog acts, speaker space position, etc.) and keeping links with raw data (audio, video, digital documents). This database provides a schema evolution mechanism and a meta-description layer ensuring flexible and incremental annotation definitions. To enhance this database system, some research works are currently in progress: a predictive methodology for schema evolution and a query technique that deals with fuzzy concepts and ontological commitments. We describe our on-going prototype development, in which we focus on data storage and interactive data access.
- Burger02:ISL_MC
-
The ISL Meeting Corpus: The impact of Meeting Type on Speech Type
S. Burger and V. MacLaren and H. Yu
(2002)
- Moore03:ICASSP
-
Microphone array speech recognition: Experiments on overlapping speech in meetings
D. Moore and I. McCowan
5
497-500
(2003)
- McCowan03:MM4_ICIP
-
Modeling human interaction in meetings
I. McCowan and S. Bengio and D. Gatica-Perez and G. Lathoud and F. Monay and D. Moore and P. Wellner and H. Bourlard
4
748-751
(2003)
This paper investigates the recognition of group actions in meetings by modeling the joint behaviour of participants. Many meeting actions, such as presentations, discussions and consensus, are characterised by similar or complementary behaviour across participants. Recognising these meaningful actions is an important step towards the goal of providing effective browsing and summarisation of processed meetings. In this work, a corpus of meetings was collected in a room equipped with a number of microphones and cameras. The corpus was labeled in terms of a predefined set of meeting actions characterised by global behaviour. In experiments, audio and visual features for each participant are extracted from the raw data and the interaction of participants is modeled using HMM-based approaches. Initial results on the corpus demonstrate the ability of the system to recognise the set of meeting actions.
-
Finding presentations in recorded meetings using audio and video features
J. Foote and J. Boreczky and L. Wilcox
6
3029-3032
(1999)
- Gross00:ISL_ICME
-
Towards a multimodal meeting record
R. Gross and M. Bett and H. Yu and X. Zhu and Y. Pan and J. Yang and A. Waibel
1593-1596
(2000)
Face-to-face meetings usually encompass several modalities including speech, gesture, handwriting, and person identification. Recognition and integration of each of these modalities is important to create an accurate record of a meeting. However, each of these modalities presents recognition difficulties. Speech recognition must be speaker and domain independent, have low word error rates, and be close to real time to be useful. Gesture and handwriting recognition must be writer independent and support a wide variety of writing styles. Person identification has difficulty with segmentation in a crowded room. Furthermore, in order to produce the record automatically, we have to solve the assignment problem (who is saying what), which involves people identification and speech recognition. This paper will examine a multimodal meeting room system under development at Carnegie Mellon University that enables us to track, capture and integrate the important aspects of a meeting from people identification to meeting transcription. Once a multimedia meeting record is created, it can be archived for later retrieval.
- Vertegall99:Gaze
-
GAZE Groupware System: Mediating joint attention in multiparty communication and collaboration
R. Vertegaal
294-301
(1999)
In this paper, we discuss why, in designing multiparty mediated systems, we should focus first on providing non-verbal cues which are less redundantly coded in speech than those normally conveyed by video. We show how conveying one such cue, gaze direction, may solve two problems in multiparty mediated communication and collaboration: knowing who is talking to whom, and who is talking about what. As a candidate solution, we present the GAZE Groupware System, which combines support for gaze awareness in multiparty mediated communication and collaboration with small and linear bandwidth requirements. The system uses an advanced, desk-mounted eyetracker to metaphorically convey gaze awareness in a 3D virtual meeting room and within shared documents.
- Yu99:EuroSP
-
Progress in automatic meeting transcription
H. Yu and M. Finke and A. Waibel
(1999)
- Shriberg01:prosody_meeting
-
Can Prosody Aid the Automatic Processing of Multi-Party Meetings: Evidence from Predicting Punctuation, Disfluencies, and Overlapping Speech
E. Shriberg and A. Stolcke and D. Baron
(2001)
- Yang99:ISL_ACMMM
-
Multimodal people ID for a multimedia meeting browser
J. Yang and X. Zhu and R. Gross and J. Kominek and Y. Pan and A. Waibel
159-168
(1999)
A meeting browser is a system that allows users to review a multimedia meeting record from a variety of indexing methods. Identification of meeting participants is essential for creating such a multimedia meeting record. Moreover, knowing who is speaking can enhance the performance of speech recognition and indexing meeting transcription. In this paper, we present an approach that identifies meeting participants by fusing multimodal inputs. We use face ID, speaker ID, color appearance ID, and sound source directional ID to identify and track meeting participants. After describing the different modules in detail, we will discuss a framework for combining the information sources. Integration of the multimodal people ID into the multimedia meeting browser is in its preliminary stage.
- Morgan03:ICASSP_ICSI_MC
-
Meetings about meetings: Research at ICSI on speech in multiparty conversations
N. Morgan et al.
4
740-743
(2003)
A report on the progress made in ICSI project on processing speech from meetings was presented. The development of a prosodic database for a large subset of these meetings and its subsequent use for punctuation and disfluency detection was discussed. The report also included the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.
- Gatica-Perez03:MM4_spk_track
-
Audio-visual speaker tracking with importance particle filters
D. Gatica-Perez and G. Lathoud and I. McCowan and J. Odobez and D. Moore
3
25-28
(2003)
We present a probabilistic method for audio-visual (AV) speaker tracking, using an uncalibrated wide-angle camera and a microphone array. The algorithm fuses 2-D object shape and audio information via importance particle filters (I-PFs), allowing for the asymmetrical integration of AV information in a way that efficiently exploits the complementary features of each modality. Audio localization information is used to generate an importance sampling (IS) function, which guides the random search process of a particle filter towards regions of the configuration space likely to contain the true configuration (a speaker). The measurement process integrates contour-based and audio observations, which results in reliable head tracking in realistic scenarios. We show that imperfect single modalities can be combined into an algorithm that automatically initializes and tracks a speaker, switches between multiple speakers, tolerates visual clutter, and recovers from total AV object occlusion, in the context of a multimodal meeting room.
- Schultz01:ISL_meetingroom
-
The ISL Meeting Room System
T. Schultz and A. Waibel et al.
(2001)
- Polzin98:ISL_emotion
-
Detecting Emotions in Speech
T. S. Polzin and A. Waibel
(1998)
- Ang02:ICSI_emotion
-
Prosody-based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog
J. Ang and R. Dhillon and A. Krupski and E. Shriberg and A. Stolcke
(2002)
- Lathoud03:MM4_locseg
-
Location based speaker segmentation
G. Lathoud and I. A. McCowan
1
176-179
(2003)
This paper proposes a technique that segments audio according to speakers based on their location. In many multi-party conversations, such as meetings, the location of participants is restricted to a small number of regions, such as seats around a table, or at a whiteboard. In such cases, segmentation according to these discrete regions would be a reliable means of determining speaker turns. We propose a system that uses microphone pair time delays as features to represent speaker locations. These features are integrated in a GMM/HMM framework to determine an optimal segmentation of the audio according to location. The HMM framework also allows extensions to recognise more complex structure, such as the presence of two simultaneous speakers. Experiments testing the system on real recordings from a meeting room show that the proposed location features can provide greater discrimination than standard cepstral features, and also demonstrate the success of an extension to handle dual-speaker overlap.
- Gatica-Perez03:ICIP
-
On automatic annotation of meeting databases
D. Gatica-Perez and I. McCowan and M. Barnard and S. Bengio and H. Bourlard
3
629-632
(2003)
- ISL_attention
-
Tracking Focus of Attention in Meetings
R. Stiefelhagen
(2002)
- Renals03:ICASSP
-
Audio information access from meeting rooms
S. Renals and D. Ellis
4
744-747
(2003)
- Baron02:ICSI_meeting_SU
-
Automatic Punctuation and Disfluency Detection in Multi-Party Meetings Using Prosodic and Lexical Cues
D. Baron and E. Shriberg and A. Stolcke
(2002)
- Alfred04:MM_structure
-
Dynamic Bayesian networks for meeting structuring
A. Dielmann and S. Renals
5
629-632
(2004)
- Shribeg01:Eurospeech
-
Observations on Overlap: Findings and Implications for Automatic Processing of Multi-Party Conversation
E. Shriberg and A. Stolcke and D. Baron
(2001)
- Janin03:ICASSP
-
The ICSI meeting corpus
A. Janin and D. Baron and J. Edwards and D. Ellis and D. Gelbart and N. Morgan and B. Peskin and T. Pfau and E. Shriberg and A. Stolcke and C. Wooters
1
364-367
(2003)
- Bhagat03:ICPhS
-
Automatically generated Prosodic Cues to Lexically Ambiguous Dialog Acts in Multiparty Meetings
S. Bhagat and H. Carvey and E. Shriberg
(2003)
- ISL_tracking
-
Lecture and Presentation Tracking in an intelligent Meeting Room
I. Rogina and T. Schaaf
(2002)
- Morgan01:HLT_ICSI_MC
-
The Meeting Project at ICSI
N. Morgan and D. Baron and J. Edwards and D. Ellis and D. Gelbart and J. A. and T. Pfau and E. Shriberg and A. Stolcke
(2001)
- Gelbart02:ICSI_MC_farfield
-
Double the Trouble: Handling Noise and Reverberation in Far-Field Automatic Speech Recognition
D. Gelbart and N. Morgan
(2002)