Speech Recognition Research · Advisor: Yulia Tsvetkov · 2014

Using tweets to improve an automatic speech recognizer

Abstract

Automatic speech recognition (ASR) systems convert spoken words into text. They first map speech to sequences of word sounds and then decode the most likely word from those sounds. Experts define how each word should sound, but the way people actually pronounce words in conversation can differ from the expert-defined pronunciation.

Tweets, however, may offer clues to how words are actually pronounced. Under Twitter's 140-character limit, users are forced to shorten words in creative ways while still hinting at the original word. Because they tend to keep the parts of a word that matter most, the resulting sequence of characters suggests the word's pronunciation.

The Problem

Automatic speech recognizers (ASRs) convert spoken language to text by mapping audio to phonetic pronunciations and then to words. Those pronunciations are typically determined by linguistic experts — but real conversational speech often diverges from the expert-defined canonical forms.

Twitter offered an unusual window into this gap. Given a 140-character limit, users shorten words creatively while preserving enough phonetic structure that the original word remains recognizable. The shortened spellings that result, it turns out, encode informal pronunciation.

Figure: Standard ASR pipeline, showing audio input, acoustic model, pronunciation dictionary, and language model. The pronunciation dictionary, the component we targeted, maps words to their phonetic sequences.

Approach

We ran Brown clustering on raw tweet data to group words that were contextually interchangeable: words that appear in similar positions across sentences cluster together regardless of spelling variation. This gave us sets of related forms, with standard spellings alongside abbreviated, phonetically spelled, and contracted variants.

Figure: Brown cluster output, showing groups of related word variants extracted from tweet data. Clusters surface informal variants alongside their standard forms; the phonetic relationship between them is what we extracted.
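To make the grouping step concrete, here is a minimal sketch of collecting words by cluster. It assumes the clustering output is available as tab-separated lines of cluster bit-string, word, and count, which is a layout some Brown clustering tools produce; the function name `group_by_cluster` and the sample rows are made up for illustration and are not taken from the project data.

```python
from collections import defaultdict


def group_by_cluster(lines, prefix_len=None):
    """Group words by Brown cluster ID.

    Assumes each line looks like "<bit-string>\t<word>\t<count>".
    If prefix_len is given, clusters are merged by the first
    prefix_len bits of the bit-string (a coarser grouping).
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        path, word = parts[0], parts[1]
        key = path[:prefix_len] if prefix_len else path
        clusters[key].append(word)
    return dict(clusters)


# Illustrative rows only; real clusters come from the tweet corpus.
sample = [
    "0110\ttomorrow\t120",
    "0110\ttmrw\t45",
    "0110\t2moro\t12",
    "0111\ttonight\t98",
    "0111\t2nite\t30",
]
clusters = group_by_cluster(sample)
```

Each resulting list pairs a standard spelling with the informal variants that share its contexts, which is the input the pronunciation-inference step needs.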

From these clusters, each team member developed an independent algorithm for inferring pronunciation. My approach: given a cluster of word variants with known pronunciations, find the longest common subsequence of phonemes across the expert-defined pronunciations in the cluster. That subsequence represents the sounds consistently preserved across informal variants — likely the phonetically load-bearing portion of the word.
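The longest-common-subsequence idea can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: pronunciations are represented as lists of phoneme symbols, the helper names `lcs` and `cluster_lcs` are hypothetical, and the cluster-level result is obtained by folding the pairwise LCS across the cluster. That fold always returns a subsequence common to every pronunciation, but it is a heuristic and is not guaranteed to be the longest one.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two phoneme lists (dynamic programming)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def cluster_lcs(pronunciations):
    """Fold pairwise LCS over all pronunciations in a cluster (heuristic)."""
    if not pronunciations:
        return []
    return reduce(lcs, pronunciations)


# Made-up, ARPAbet-style phoneme lists for one hypothetical cluster.
cluster = [
    ["B", "IH", "K", "AO", "Z"],
    ["K", "AO", "Z"],
    ["K", "AH", "Z"],
]
shared = cluster_lcs(cluster)
```

For the toy cluster above, the shared phonemes are `K` and `Z`: the vowel differs between variants, so only the consonant frame survives.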

Results

The final pronunciation variants did not produce a measurable improvement in ASR accuracy on the test corpus. The substantive contribution was instead the set of methods for extracting pronunciation structure from informal text, a starting point for further work on data-driven pronunciation modeling from social media.

Role

Researcher. Responsible for Kaldi toolkit setup and training, and for the pronunciation-extraction algorithm based on the longest common phoneme subsequence.