Functionality and definitions
Basic terminology
This section provides essential technical concepts and parameters for effectively using the Behavioral Signals streaming API. Understanding these components will help you optimize audio processing for your specific use case, configure the API appropriately, and interpret the results correctly.
✍️ Usage
The Streaming API is suitable for sending chunks of audio (for example retrieved from a microphone) and getting back the results in real time.
Some example use cases include:
- Streaming live calls, for example in call centers,
- Agentic AI applications where the user's voice is processed live.
Streaming is implemented using gRPC for high efficiency.
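The client side follows the standard bidirectional gRPC streaming pattern: send request messages carrying small audio chunks, and consume result messages as the server produces them. The sketch below shows only that pattern; the RPC name and request factory are placeholders for the generated classes in the service's .proto file, not the actual API.

```python
from typing import Callable, Iterable, Iterator

def stream_audio(stub, chunks: Iterable[bytes], make_request: Callable) -> Iterator:
    """Send audio chunks over a bidirectional gRPC stream and yield responses.

    `stub` and `make_request` come from the generated gRPC code; "StreamAudio"
    below is a placeholder RPC name, not the real method.
    """
    requests = (make_request(chunk) for chunk in chunks)
    # Requests are consumed lazily while responses arrive as soon as the
    # server has results for a completed segment or utterance.
    yield from stub.StreamAudio(requests)
```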
ℹ️ Example response
The Streaming API yields protobuf messages with the format below (when serialized to JSON):
{
"pid": 44676,
"cid": 10000119,
"message_id": 0,
"results": [
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "features",
"prediction": [
{
"label": "",
"posterior": "",
"dominantInSegments": null
}
],
"finalLabel": "",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "gender",
"prediction": [
{
"label": "male",
"posterior": "1.0",
"dominantInSegments": null
},
{
"label": "female",
"posterior": "0.0",
"dominantInSegments": null
}
],
"finalLabel": "male",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "age",
"prediction": [
{
"label": "31 - 45",
"posterior": "0.9987",
"dominantInSegments": null
},
{
"label": "18 - 22",
"posterior": "0.0012",
"dominantInSegments": null
},
{
"label": "23 - 30",
"posterior": "1e-04",
"dominantInSegments": null
},
{
"label": "46 - 65",
"posterior": "0.0",
"dominantInSegments": null
}
],
"finalLabel": "31 - 45",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "emotion",
"prediction": [
{
"label": "neutral",
"posterior": "0.5481",
"dominantInSegments": null
},
{
"label": "sad",
"posterior": "0.4272",
"dominantInSegments": null
},
{
"label": "happy",
"posterior": "0.0186",
"dominantInSegments": null
},
{
"label": "angry",
"posterior": "0.0061",
"dominantInSegments": null
}
],
"finalLabel": "neutral",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "positivity",
"prediction": [
{
"label": "neutral",
"posterior": "0.8244",
"dominantInSegments": null
},
{
"label": "negative",
"posterior": "0.1555",
"dominantInSegments": null
},
{
"label": "positive",
"posterior": "0.0201",
"dominantInSegments": null
}
],
"finalLabel": "neutral",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "strength",
"prediction": [
{
"label": "weak",
"posterior": "0.5726",
"dominantInSegments": null
},
{
"label": "neutral",
"posterior": "0.4245",
"dominantInSegments": null
},
{
"label": "strong",
"posterior": "0.0029",
"dominantInSegments": null
}
],
"finalLabel": "weak",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "speaking_rate",
"prediction": [
{
"label": "fast",
"posterior": "0.82",
"dominantInSegments": null
},
{
"label": "normal",
"posterior": "0.1246",
"dominantInSegments": null
},
{
"label": "slow",
"posterior": "0.0553",
"dominantInSegments": null
},
{
"label": "very slow",
"posterior": "0.0",
"dominantInSegments": null
},
{
"label": "very fast",
"posterior": "0.0",
"dominantInSegments": null
}
],
"finalLabel": "fast",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "hesitation",
"prediction": [
{
"label": "no",
"posterior": "0.9098",
"dominantInSegments": null
},
{
"label": "yes",
"posterior": "0.0902",
"dominantInSegments": null
}
],
"finalLabel": "no",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
},
{
"id": "2",
"startTime": "0.00",
"endTime": "2.00",
"task": "engagement",
"prediction": [
{
"label": "neutral",
"posterior": "0.5343",
"dominantInSegments": null
},
{
"label": "withdrawn",
"posterior": "0.3897",
"dominantInSegments": null
},
{
"label": "engaged",
"posterior": "0.0761",
"dominantInSegments": null
}
],
"finalLabel": "neutral",
"level": "segment",
"embedding": "",
"st": 0.0,
"et": 2.0
}
]
}
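The per-task results can be reduced to a compact summary directly from this JSON shape. The helper below is illustrative, but it uses only fields shown in the example above (note that posteriors arrive as strings, and the features task carries empty label/posterior values).

```python
import json

def summarize_response(message_json: str) -> list[dict]:
    """Flatten a streaming response into one row per task with the winning label."""
    message = json.loads(message_json)
    rows = []
    for result in message.get("results", []):
        posteriors = {
            p["label"]: float(p["posterior"])
            for p in result["prediction"]
            if p["posterior"]  # the "features" task has empty label/posterior strings
        }
        rows.append({
            "task": result["task"],
            "start": float(result["startTime"]),
            "end": float(result["endTime"]),
            "level": result["level"],
            "label": result["finalLabel"],
            "confidence": posteriors.get(result["finalLabel"]),
        })
    return rows
```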
🎚️ Audio Processing Levels
The Behavioral Signals streaming API processes audio at two distinct levels:
Segment Level
- Definition: Fixed-duration (2-second) chunks of audio processed independently
- Purpose: Provides granular, time-aligned behavioral metrics
- Output: Continuous stream of metrics for each segment
- Characteristics:
  - Consistent temporal resolution
  - Suitable for real-time monitoring
  - Results available after each segment
  - Note that the smaller the segment length, the less accurate the results may be
- Use Case: Live dashboards, immediate feedback, continuous monitoring, real-time emotion detection
Utterance Level
- Definition: Complete speech units bounded by silence or speaker changes
- Duration: Variable, determined by Voice Activity Detection (VAD)
- Context-Aware: Considers full speech context
- Output: The analysis is performed over the entire utterance, leading to better accuracy
- Characteristics:
  - Natural speech boundaries
  - Variable duration based on speech patterns
  - Comprehensive analysis of complete thoughts
- Use Case: Conversational analysis, speaker emotion profiling, turn-taking detection
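To make the difference concrete, segment-level results tile the audio into fixed 2-second windows, while utterance boundaries come from VAD and vary with the speech itself. The sketch below only illustrates the fixed tiling; whether a trailing partial window is emitted is an assumption, not documented behavior.

```python
def segment_boundaries(duration_sec: float, segment_sec: float = 2.0) -> list[tuple[float, float]]:
    """Fixed 2-second tiling, e.g. 7.0 s -> [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 7.0)]."""
    bounds, start = [], 0.0
    while start < duration_sec:
        bounds.append((start, min(start + segment_sec, duration_sec)))
        start += segment_sec
    return bounds
```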
🗣️ Voice Activity Detection (VAD)
Definition: Algorithm that detects the presence or absence of human speech in audio signals
Core Functionality:
- Distinguishes speech from non-speech segments
- Identifies natural pause points in conversation
- Determines utterance boundaries automatically
Key Components:
- Speech Duration: Minimum continuous speech time required to form a valid segment
- Stream Resolution: Audio chunk size for streaming. There is no limitation on the chunk size; we recommend values between 100 ms and 500 ms (see the size calculation below).
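For 16-bit mono LINEAR_PCM audio (see the encoding parameter below), the chunk size in bytes follows directly from the sampling rate and the chosen chunk duration:

```python
def chunk_size_bytes(sampling_rate_hz: int, chunk_ms: int) -> int:
    """Bytes per streamed chunk for 16-bit (2 bytes per sample) mono PCM audio."""
    bytes_per_sample = 2
    return sampling_rate_hz * bytes_per_sample * chunk_ms // 1000

# e.g. 16 kHz audio streamed in 200 ms chunks -> 6400 bytes per chunk
assert chunk_size_bytes(16000, 200) == 6400
```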
⚙️ Configuration Parameters
These are provided by the user in the configuration parameters of the stream.
sampling_rate - required
- Type: Audio parameter (Hz)
- Description: Number of audio samples per second
- Supported Rates: All rates are supported (audio is always resampled to 16 kHz, which is the rate used by our models)
- Impact: Determines audio quality and bandwidth requirements
encoding - required
- Type: Audio format specification
- Definition: Audio data format specification for transmission and processing
- Streaming Support: LINEAR_PCM (uncompressed 16-bit signed little-endian PCM)
- Requirements (see the conversion sketch below):
  - Mono channel (single-channel audio)
  - Consistent bit depth (2 bytes per sample)
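If your capture device or decoder delivers floating-point or multi-channel samples, they must be downmixed and quantized before streaming. A minimal sketch using NumPy (the use of NumPy is an assumption; any equivalent conversion works):

```python
import numpy as np

def to_linear_pcm(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1, 1] to 16-bit little-endian mono PCM bytes."""
    if samples.ndim == 2:                  # shaped (num_samples, num_channels)
        samples = samples.mean(axis=1)     # downmix to a single (mono) channel
    samples = np.clip(samples, -1.0, 1.0)
    pcm = (samples * 32767).astype("<i2")  # 2 bytes per sample, little endian
    return pcm.tobytes()
```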
level - optional
This parameter defines the audio processing level, as described above. The valid values are:
- segment
- utterance
If left blank, both levels are returned in the response (the level is designated by the "level" field of each result):
{
"pid": 0,
"cid": 0,
"message_id": 0,
"results": [
{
"id": "0",
"startTime": "0.0",
"endTime": "10.0",
"task": "<task>",
"prediction": [
{
"label": "<label_1>",
"posterior": "0.9576",
},
{
"label": "<label_2>",
"posterior": "0.0377",
},
...
],
"finalLabel": "<label_1>",
"level": "utterance"
},
...
]
}
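When both levels are returned, the "level" field is what separates them; for example:

```python
def split_by_level(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a response's results into segment-level and utterance-level lists."""
    segments = [r for r in results if r["level"] == "segment"]
    utterances = [r for r in results if r["level"] == "utterance"]
    return segments, utterances
```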
Defining configuration
The above configuration can be defined:
- either by setting the AudioConfig object of the protobuf (if using bare gRPC),
- or by setting the StreamingOptions, defined here, if using the Python SDK.
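The values themselves are the parameters documented above, whichever object carries them. The snippet below is only a sketch of that mapping; the exact AudioConfig/StreamingOptions field names and constants should be taken from the protobuf and SDK references.

```python
# Illustrative values for the documented configuration parameters; how they are
# passed (AudioConfig protobuf vs. StreamingOptions) depends on the client used.
stream_config = {
    "sampling_rate": 16000,    # Hz; any rate is accepted, audio is resampled to 16 kHz
    "encoding": "LINEAR_PCM",  # uncompressed 16-bit signed little-endian mono PCM
    "level": "segment",        # "segment", "utterance", or omit to receive both levels
}
```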