ElevenLabs Scribe V2: Triplicated Text Bug
Introduction: The API Bug and Its Impact
This article documents a bug in the ElevenLabs Scribe v2 Realtime WebSocket API: under certain circumstances, the committed_transcript event returns triplicated text even though the user spoke the word or phrase only once. The problem lies in the API's behavior itself, not in the Python SDK. It manifests intermittently, producing duplicated words or phrases that corrupt the output of any application relying on accurate real-time transcription, such as live captioning systems, voice-controlled applications, and real-time note-taking tools. The result is garbled, unreliable text that significantly degrades the user experience. This report provides a thorough analysis of the issue, including detailed reproduction steps, evidence from raw WebSocket messages, and an examination of the likely source of the problem.
Detailed Description of the API Bug
The central problem lies in how the Scribe v2 Realtime WebSocket API commits transcribed text. The API converts spoken language into written text in real time; when a user speaks a word or short phrase, it should transcribe that utterance exactly once. Instead, the committed_transcript event intermittently returns the text multiple times: if a user says "Hello" once, the API may return "Hello Hello Hello." The environment matters for understanding the bug's scope. The endpoint is wss://api.elevenlabs.io/v1/speech-to-text/realtime and the model is scribe_v2_realtime, configured with a sample rate of 16000 Hz, audio format pcm_16000, language en, and VAD (Voice Activity Detection) settings including vad_silence_threshold_secs, vad_threshold, min_speech_duration_ms, and min_silence_duration_ms. The bug's intermittent nature, combined with the consistent pattern of duplication, points to a subtle fault in the API's processing pipeline: a breakdown in how transcribed text is finalized and committed just before the committed_transcript event is sent.
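As a side note on the configuration above, the audio format implies a fixed streaming data rate, which matters when chunking microphone input for the WebSocket. A quick worked calculation (assuming pcm_16000 denotes 16-bit mono PCM at 16 kHz; the 100 ms chunk size is an illustrative choice, not an API requirement):

```python
# Data rate implied by the session config described above.
SAMPLE_RATE = 16000      # Hz, from the session config
BYTES_PER_SAMPLE = 2     # assumption: pcm_16000 is 16-bit (2-byte) mono PCM

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE   # 32000 bytes/s
chunk_ms = 100                                      # illustrative chunk size
chunk_bytes = bytes_per_second * chunk_ms // 1000   # 3200 bytes per 100 ms
```

At this rate a 1.5 s silence window (vad_silence_threshold_secs) corresponds to 48000 bytes of audio, which gives a sense of how much buffered data the VAD commit strategy is working with.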
Expected vs. Actual Behavior: A Comparison
To fully understand the severity of this API bug, a clear distinction between the expected and actual behavior is necessary. The expected behavior is that when a user speaks a word or phrase once, the API should return a single, accurate transcription of that utterance. For example, if the user says, "Hello," the API should respond with:
{
  "message_type": "committed_transcript",
  "text": "Hello"
}
This is the core functionality that users and developers expect to work flawlessly. However, the actual behavior deviates from this expectation. The API intermittently returns triplicated text in the committed_transcript event. This means, in the example above, instead of receiving a single "Hello," the user might receive "Hello Hello Hello." This duplication can happen with any spoken word or phrase. Here's another example:
User says "I'm here" once → API returns:
{
  "message_type": "committed_transcript",
  "text": "I'm here I'm here I'm here"
}
The deviation from the expected behavior highlights a significant reliability issue. Because the bug does not occur on every utterance, it is difficult to detect and debug, and the unexpected repetition can cause serious problems in applications that require precise transcription. The mismatch between expected and actual behavior points to a flaw in the API's internal processing, specifically during the final commitment of transcriptions, and underscores the need for an API fix.
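Affected applications can at least detect this failure mode client-side. Below is a small helper (illustrative only, not part of any ElevenLabs SDK) that checks whether a committed transcript is an exact k-fold repetition of a shorter phrase:

```python
def repetition_factor(text: str) -> int:
    """Return the largest k such that `text` is one phrase repeated k times.

    Returns 1 for normal, non-repeated text. Operates on whitespace-split
    words, so "Hello Hello Hello" -> 3 and "I'm here I'm here" -> 2.
    """
    words = text.split()
    n = len(words)
    for k in range(n, 1, -1):          # try the largest factor first
        if n % k == 0:
            chunk = n // k
            if all(words[i * chunk:(i + 1) * chunk] == words[:chunk]
                   for i in range(k)):
                return k
    return 1
```

Note that this is a heuristic: a user who genuinely says the same phrase twice in a row will also produce a factor greater than 1, so the result should be treated as a signal for logging or review rather than proof of the bug.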
Reproduction Steps: How to Replicate the Bug
Reproducing the bug is essential both for understanding the issue and for demonstrating its impact. The following steps reliably replicate the triplicated text:

1. Establish a WebSocket connection to the API endpoint wss://api.elevenlabs.io/v1/speech-to-text/realtime.
2. Configure the session: set model_id to scribe_v2_realtime, encoding to pcm_16000, sample_rate to 16000, commit_strategy to vad, and language_code to en.
3. Apply the VAD settings: vad_silence_threshold_secs=1.5, vad_threshold=0.4, min_speech_duration_ms=100, and min_silence_duration_ms=100.
4. Send audio in which a word or short phrase is spoken only once, for example "Hello" or "I'm here."
5. Observe the committed_transcript events. The expected outcome is a single occurrence of the spoken text; the bug manifests when the API returns it triplicated.

By following these steps, developers can replicate the issue and provide evidence to the ElevenLabs team for debugging and resolution.
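The reproduction steps above can be sketched as an async client using the third-party `websockets` package. The endpoint and parameter values come from this report; the handshake details (query-string configuration, the `xi-api-key` header name, and sending all audio before reading) are assumptions for illustration and may differ from the official SDK's wire protocol:

```python
import asyncio
import json
import urllib.parse

ENDPOINT = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"

PARAMS = {
    "model_id": "scribe_v2_realtime",
    "encoding": "pcm_16000",
    "sample_rate": 16000,
    "language_code": "en",
    "commit_strategy": "vad",
    "vad_silence_threshold_secs": 1.5,
    "vad_threshold": 0.4,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
}

def build_url(endpoint: str, params: dict) -> str:
    """Append the session parameters as a query string (assumed format)."""
    return endpoint + "?" + urllib.parse.urlencode(params)

async def reproduce(api_key: str, audio_chunks) -> None:
    """Stream one utterance and print every committed_transcript event."""
    # Third-party dependency, imported lazily: pip install websockets.
    # The header-argument name varies across websockets versions
    # (extra_headers vs additional_headers); adjust for your install.
    import websockets
    headers = {"xi-api-key": api_key}  # header name is an assumption
    async with websockets.connect(build_url(ENDPOINT, PARAMS),
                                  extra_headers=headers) as ws:
        for chunk in audio_chunks:     # raw pcm_16000 bytes, spoken once
            await ws.send(chunk)
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("message_type") == "committed_transcript":
                print("committed:", repr(msg.get("text")))
```

In a real reproduction the audio would be streamed as it is captured rather than sent up front, but this linear sketch is enough to surface the triplicated committed_transcript events.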
Raw WebSocket Messages: Evidence of the Bug
Analyzing raw WebSocket messages is the most direct way to gather evidence and understand the inner workings of the ElevenLabs Scribe v2 API. The following is a detailed look at the messages exchanged during a typical session and how they demonstrate the triplication bug. The session begins with a session_started event, which contains essential information about the session's configuration. The key details include the session_id, sample_rate, audio_format, language_code, and model_id. Here's an example of a session_started event:
{
  "message_type": "session_started",
  "session_id": "8b8acc4591f545a79b39521c43e83249",
  "config": {
    "sample_rate": 16000,
    "audio_format": "pcm_16000",
    "language_code": "en",
    "timestamps_granularity": "word",
    "vad_commit_strategy": true,
    "vad_silence_threshold_secs": 1.5,
    "vad_threshold": 0.4,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "max_tokens_to_recompute": 5,
    "model_id": "scribe_v2_realtime",
    "disable_logging": false,
    "include_timestamps": false,
    "include_language_detection": false
  }
}
Following the session_started event, the core of the issue becomes visible in the committed_transcript events. For example, when the user says "Hello" once, the API should return:
{
  "message_type": "committed_transcript",
  "text": "Hello"
}
However, the bug causes the API to return the text multiple times, as evidenced in these examples:
{
  "message_type": "committed_transcript",
  "text": "Hello Hello Hello"
}
and
{
  "message_type": "committed_transcript",
  "text": "I'm here I'm here I'm here"
}
The partial_transcript events often show the correct, single occurrence of the spoken words, while the corresponding committed_transcript events contain the duplicated text, further indicating that the fault lies in the commitment step. The session ends with a committed_transcript event containing an empty string, signifying the end of speech. Detailed analysis of these messages is crucial to diagnosing the bug's behavior and potential causes.
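The partial-versus-committed discrepancy can be checked mechanically over a captured message log. The sketch below (message shapes as shown above; the log format, a list of raw JSON strings, is an assumption about how a client might record the session) flags committed transcripts whose text is the preceding partial transcript repeated two or more times:

```python
import json

def find_duplicated_commits(raw_messages):
    """Scan a captured WebSocket log and return (partial, committed) pairs
    where the committed text is the last partial repeated 2+ times."""
    last_partial = ""
    flagged = []
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("message_type")
        text = (msg.get("text") or "").strip()
        if kind == "partial_transcript":
            last_partial = text
        elif kind == "committed_transcript" and last_partial:
            words, base = text.split(), last_partial.split()
            # Flag only exact whole-multiple repetitions of the partial.
            if len(words) > len(base) and all(
                    words[i:i + len(base)] == base
                    for i in range(0, len(words), len(base))):
                flagged.append((last_partial, text))
    return flagged
```

Running this over the session log shown above would flag the "Hello" and "I'm here" commits while leaving correctly committed utterances untouched.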
Analysis: Decoding the Root Cause
This section analyzes the triplicated text bug in detail, drawing on the partial and committed transcripts along with other observations. A critical observation is the difference between the partial_transcript and committed_transcript messages: partial_transcript messages consistently show the correct, single occurrence of spoken words, such as "Hello" or "I'm here," while committed_transcript messages intermittently show duplications, sometimes 2x and sometimes 3x. This discrepancy localizes the fault to the final commitment phase of the API's processing pipeline. The bug's intermittent nature adds further difficulty: it does not occur on every utterance, which suggests a trigger condition that is not yet understood and makes the source hard to isolate. Notably, the duplication always appears as an exact multiple of the original text, a pattern consistent with a loop or repeated processing of the text during the commitment stage. One suspected contributor is the max_tokens_to_recompute: 5 parameter, which may be causing tokens to be duplicated during recomputation at commit time.
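Because the duplication is always an exact multiple of the original text, an application-level workaround is possible until the API is fixed: collapse an exact k-fold repetition back to its base phrase before using the committed text. A sketch of such a workaround (a heuristic only: it will also collapse an utterance the user genuinely repeated back-to-back):

```python
def collapse_repetition(text: str) -> str:
    """Collapse an exact whole-phrase repetition to its base phrase,
    e.g. "Hello Hello Hello" -> "Hello". Non-repeated text is returned
    unchanged. Heuristic: genuinely repeated speech is also collapsed."""
    words = text.split()
    n = len(words)
    for period in range(1, n // 2 + 1):   # shortest repeating unit first
        if n % period == 0 and words == words[:period] * (n // period):
            return " ".join(words[:period])
    return text
```

Applying this to each committed_transcript text restores the single-occurrence output in the cases documented above, at the cost of occasionally merging deliberate repetitions.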
Conclusion
The ElevenLabs Scribe v2 Realtime WebSocket API exhibits an intermittent bug where the committed_transcript events return triplicated text, disrupting the intended functionality and affecting the accuracy of transcriptions. The reproduction steps, raw WebSocket messages, and detailed analysis provided in this report offer clear evidence of the bug's behavior, helping to identify its root cause. The discrepancies between partial_transcript and committed_transcript events and the inconsistent pattern of duplication indicate a problem in the API's final commitment phase. This report serves as a thorough analysis of the issue, and these findings can help the ElevenLabs team address and resolve this bug.
External Link:
For more information on the ElevenLabs API and its features, visit the official ElevenLabs website: https://elevenlabs.io.