Functionality and definitions
Basic terminology
This section provides essential technical concepts and parameters for effectively using the Behavioral Signals streaming API. Understanding these components will help you optimize audio processing for your specific use case, configure the API appropriately, and interpret the results correctly.
✍️ Usage
The Streaming API is suitable for sending chunks of audio (for example retrieved from a microphone) and getting back the results in real time.
Some example use cases include:
- Streaming live calls, for example in call centers,
- Agentic AI applications where the user's voice is processed live.
Streaming is implemented using gRPC for high efficiency.
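The client side follows the standard bidirectional gRPC streaming pattern: send request messages carrying small audio chunks, and consume result messages as the server produces them. The sketch below shows only that pattern; the RPC name and request factory are placeholders for the generated classes in the service's .proto file, not the actual API.

```python
from typing import Callable, Iterable, Iterator

def stream_audio(stub, chunks: Iterable[bytes], make_request: Callable) -> Iterator:
    """Send audio chunks over a bidirectional gRPC stream and yield responses.

    `stub` and `make_request` come from the generated gRPC code; "StreamAudio"
    below is a placeholder RPC name, not the real method.
    """
    requests = (make_request(chunk) for chunk in chunks)
    # Requests are consumed lazily while responses arrive as soon as the
    # server has results for a completed segment or utterance.
    yield from stub.StreamAudio(requests)
```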
ℹ️ Example response
The Streaming API yields protobuf messages with the format below (when serialized to JSON):
{
"pid": 44676,
"cid": 10000119,
"message_id": 0,
"results": [
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "features",
"prediction": [
{
"label": "",
"posterior": "",
"dominantInSegments": null
}
],
"finalLabel": "",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "gender",
"prediction": [
{
"label": "male",
"posterior": "1.0",
"dominantInSegments": null
},
{
"label": "female",
"posterior": "0.0",
"dominantInSegments": null
}
],
"finalLabel": "male",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "age",
"prediction": [
{
"label": "31 - 45",
"posterior": "0.9987",
"dominantInSegments": null
},
{
"label": "18 - 22",
"posterior": "0.0012",
"dominantInSegments": null
},
{
"label": "23 - 30",
"posterior": "1e-04",
"dominantInSegments": null
},
{
"label": "46 - 65",
"posterior": "0.0",
"dominantInSegments": null
}
],
"finalLabel": "31 - 45",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "emotion",
"prediction": [
{
"label": "neutral",
"posterior": "0.5481",
"dominantInSegments": null
},
{
"label": "sad",
"posterior": "0.4272",
"dominantInSegments": null
},
{
"label": "happy",
"posterior": "0.0186",
"dominantInSegments": null
},
{
"label": "angry",
"posterior": "0.0061",
"dominantInSegments": null
}
],
"finalLabel": "neutral",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "positivity",
"prediction": [
{
"label": "neutral",
"posterior": "0.8244",
"dominantInSegments": null
},
{
"label": "negative",
"posterior": "0.1555",
"dominantInSegments": null
},
{
"label": "positive",
"posterior": "0.0201",
"dominantInSegments": null
}
],
"finalLabel": "neutral",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "strength",
"prediction": [
{
"label": "weak",
"posterior": "0.5726",
"dominantInSegments": null
},
{
"label": "neutral",
"posterior": "0.4245",
"dominantInSegments": null
},
{
"label": "strong",
"posterior": "0.0029",
"dominantInSegments": null
}
],
"finalLabel": "weak",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "speaking_rate",
"prediction": [
{
"label": "fast",
"posterior": "0.82",
"dominantInSegments": null
},
{
"label": "normal",
"posterior": "0.1246",
"dominantInSegments": null
},
{
"label": "slow",
"posterior": "0.0553",
"dominantInSegments": null
},
{
"label": "very slow",
"posterior": "0.0",
"dominantInSegments": null
},
{
"label": "very fast",
"posterior": "0.0",
"dominantInSegments": null
}
],
"finalLabel": "fast",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "hesitation",
"prediction": [
{
"label": "no",
"posterior": "0.9098",
"dominantInSegments": null
},
{
"label": "yes",
"posterior": "0.0902",
"dominantInSegments": null
}
],
"finalLabel": "no",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "engagement",
"prediction": [
{
"label": "neutral",
"posterior": "0.5343",
"dominantInSegments": null
},
{
"label": "withdrawn",
"posterior": "0.3897",
"dominantInSegments": null
},
{
"label": "engaged",
"posterior": "0.0761",
"dominantInSegments": null
}
],
"finalLabel": "neutral",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
}
]
}
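The per-task results can be reduced to a compact summary directly from this JSON shape. The helper below is illustrative, but it uses only fields shown in the example above (note that posteriors arrive as strings, and the features task carries empty label/posterior values).

```python
import json

def summarize_response(message_json: str) -> list[dict]:
    """Flatten a streaming response into one row per task with the winning label."""
    message = json.loads(message_json)
    rows = []
    for result in message.get("results", []):
        posteriors = {
            p["label"]: float(p["posterior"])
            for p in result["prediction"]
            if p["posterior"]  # the "features" task has empty label/posterior strings
        }
        rows.append({
            "task": result["task"],
            "start": float(result["startTime"]),
            "end": float(result["endTime"]),
            "level": result["level"],
            "label": result["finalLabel"],
            "confidence": posteriors.get(result["finalLabel"]),
        })
    return rows
```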
🎚️ Audio Processing Levels
The Behavioral Signals streaming API processes audio at two distinct levels:
Segment Level
- Definition: Fixed-duration (2-second) chunks of audio processed independently
- Purpose: Provides granular, time-aligned behavioral metrics
- Output: Continuous stream of metrics for each segment
- Characteristics:
  - Consistent temporal resolution
  - Suitable for real-time monitoring
  - Results available after each segment
  - Note that the smaller the segment length, the less accurate the results may be
- Use Case: Live dashboards, immediate feedback, continuous monitoring, real-time emotion detection
Utterance Level
- Definition: Complete speech units bounded by silence or speaker changes
- Duration: Variable, determined by Voice Activity Detection (VAD)
- Context-Aware: Considers full speech context
- Output: The analysis is performed over the entire utterance, leading to better accuracy
- Characteristics:
  - Natural speech boundaries
  - Variable duration based on speech patterns
  - Comprehensive analysis of complete thoughts
- Use Case: Conversational analysis, speaker emotion profiling, turn-taking detection
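To make the difference concrete, segment-level results tile the audio into fixed 2-second windows, while utterance boundaries come from VAD and vary with the speech itself. The sketch below only illustrates the fixed tiling; whether a trailing partial window is emitted is an assumption, not documented behavior.

```python
def segment_boundaries(duration_sec: float, segment_sec: float = 2.0) -> list[tuple[float, float]]:
    """Fixed 2-second tiling, e.g. 7.0 s -> [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 7.0)]."""
    bounds, start = [], 0.0
    while start < duration_sec:
        bounds.append((start, min(start + segment_sec, duration_sec)))
        start += segment_sec
    return bounds
```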
🗣️ Voice Activity Detection (VAD)
Definition: Algorithm that detects the presence or absence of human speech in audio signals
Core Functionality:
- Distinguishes speech from non-speech segments
- Identifies natural pause points in conversation
- Determines utterance boundaries automatically
Key Components:
- Speech Duration: Minimum continuous speech time required to form a valid segment
- Stream Resolution: Audio chunk size for streaming. There is no limitation on the chunk size; we recommend values between 100 ms and 500 ms (see the size calculation below).
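For 16-bit mono LINEAR_PCM audio (see the encoding parameter below), the chunk size in bytes follows directly from the sampling rate and the chosen chunk duration:

```python
def chunk_size_bytes(sampling_rate_hz: int, chunk_ms: int) -> int:
    """Bytes per streamed chunk for 16-bit (2 bytes per sample) mono PCM audio."""
    bytes_per_sample = 2
    return sampling_rate_hz * bytes_per_sample * chunk_ms // 1000

# e.g. 16 kHz audio streamed in 200 ms chunks -> 6400 bytes per chunk
assert chunk_size_bytes(16000, 200) == 6400
```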
⚙️ Configuration Parameters
These are provided by the user in the configuration parameters of the stream.
sampling_rate - required
- Type: Audio parameter (Hz)
- Description: Number of audio samples per second
- Supported Rates: All rates are supported (audio is always resampled to 16 kHz, which is the rate used by our models)
- Impact: Determines audio quality and bandwidth requirements
encoding - required
- Type: Audio format specification
- Definition: Audio data format specification for transmission and processing
- Streaming Support: LINEAR_PCM (uncompressed 16-bit signed little-endian PCM)
- Requirements (see the conversion sketch below):
  - Mono channel (single-channel audio)
  - Consistent bit depth (2 bytes per sample)
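If your capture device or decoder delivers floating-point or multi-channel samples, they must be downmixed and quantized before streaming. A minimal sketch using NumPy (the use of NumPy is an assumption; any equivalent conversion works):

```python
import numpy as np

def to_linear_pcm(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1, 1] to 16-bit little-endian mono PCM bytes."""
    if samples.ndim == 2:                  # shaped (num_samples, num_channels)
        samples = samples.mean(axis=1)     # downmix to a single (mono) channel
    samples = np.clip(samples, -1.0, 1.0)
    pcm = (samples * 32767).astype("<i2")  # 2 bytes per sample, little endian
    return pcm.tobytes()
```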
level - optional
This parameter defines the audio processing level, as described above. The valid values are:
- segment
- utterance
If left blank, both levels are returned in the response (the level is designated by the "level" field of each result):
{
"pid": 0,
"cid": 0,
"message_id": 0,
"results": [
{
"id": "0",
"startTime": "0.0",
"endTime": "10.0",
"task": "<task>",
"prediction": [
{
"label": "<label_1>",
"posterior": "0.9576",
},
{
"label": "<label_2>",
"posterior": "0.0377",
},
...
],
"finalLabel": "<label_1>",
"level": "utterance"
},
...
]
}
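When both levels are returned, the "level" field is what separates them; for example:

```python
def split_by_level(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a response's results into segment-level and utterance-level lists."""
    segments = [r for r in results if r["level"] == "segment"]
    utterances = [r for r in results if r["level"] == "utterance"]
    return segments, utterances
```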
Defining configuration
The above configuration can be defined:
- either by setting the AudioConfig object of the protobuf (if using bare gRPC),
- or by setting the StreamingOptions, defined here, if using the Python SDK.
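The values themselves are the parameters documented above, whichever object carries them. The snippet below is only a sketch of that mapping; the exact AudioConfig/StreamingOptions field names and constants should be taken from the protobuf and SDK references.

```python
# Illustrative values for the documented configuration parameters; how they are
# passed (AudioConfig protobuf vs. StreamingOptions) depends on the client used.
stream_config = {
    "sampling_rate": 16000,    # Hz; any rate is accepted, audio is resampled to 16 kHz
    "encoding": "LINEAR_PCM",  # uncompressed 16-bit signed little-endian mono PCM
    "level": "segment",        # "segment", "utterance", or omit to receive both levels
}
```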