Embeddings
Introduction
Embeddings are numerical representations of data (vectors) that capture the underlying patterns and relationships within the input. In the context of audio analysis, speaker embeddings are compact vector representations that encapsulate the unique characteristics of a speaker’s voice, such as pitch, tone, and speaking style. Similarly, behavioral embeddings represent the various behavioral traits extracted from the audio, including emotion, engagement, and politeness.
These embeddings transform complex audio signals into fixed-dimensional vectors, making it easier to perform various machine learning tasks, such as clustering, classification, and similarity measurement. By converting raw audio data into these standardized formats, embeddings enable more efficient storage, retrieval, and processing of information, facilitating a wide range of analytical and predictive applications.
Common use cases for embeddings include speaker recognition, where embeddings help identify or verify a speaker’s identity across different recordings, even in varying acoustic environments. In customer service, embeddings can be used to analyze and improve interactions by identifying behavioral patterns and tailoring responses accordingly. For instance, recognizing when a customer is frustrated can prompt a system to escalate the call to a human representative. In media and content analysis, embeddings assist in indexing and retrieving audio segments based on speaker characteristics or emotional content, enhancing search capabilities. Additionally, embeddings enable advanced analytics, such as detecting trends in customer sentiment or engagement over time, which can be invaluable for market research and business intelligence.
Our API offers two types of embeddings, both calculated per utterance:
- Speaker Embedding: A 192-dimensional embedding that captures the speaker's tone and the unique characteristics of their voice. It can be used for speaker identification/verification; see the sketch after this list.
- Behavioral Embedding: A 768-dimensional embedding that encapsulates the behavioral characteristics of speech, such as the emotion, positivity, and strength of the utterance.
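For illustration only, the sketch below compares two speaker embeddings with cosine similarity, as one might do for speaker verification. The NumPy usage, the random placeholder vectors, and the 0.7 threshold are assumptions for the example, not part of the API.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 192-dimensional speaker embeddings; in practice these come
# from the "embedding" field of two diarization items (see Response Schema).
emb_a = np.random.rand(192)
emb_b = np.random.rand(192)

score = cosine_similarity(emb_a, emb_b)
# The 0.7 threshold is purely illustrative; calibrate it on your own data.
print(f"similarity={score:.3f}, same speaker: {score > 0.7}")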
🌐 cURL Example
For embeddings to be included in the response, you must set the embeddings parameter in the submit audio request:
curl --request POST \
--url https://api.behavioralsignals.com/clients/your-client-id/processes/audio \
--header 'X-Auth-Token: your-api-token' \
--header 'accept: application/json' \
--header 'content-type: multipart/form-data' \
--form name=my-awesome-audio \
--form embeddings=true \
--form 'meta={"key": "value"}'
Then, you can retrieve the results as described here, where the response will include the aforementioned embeddings.
🐍 Using the Python SDK
With the Python SDK, you can request embeddings by setting the embeddings flag to True when calling the upload_audio method:
from behavioralsignals import Client
client = Client(YOUR_CID, YOUR_API_KEY)
# submit for processing
response = client.behavioral.upload_audio(file_path="audio.wav", embeddings=True)
Then, the results retrieved with the get_result method (see here) will include the embeddings in the response.
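As a rough sketch of this retrieval step, assuming the object returned by upload_audio exposes the process id as pid, that get_result accepts it, and that the result can be read like the dict-shaped response in the next section (check the SDK reference for the exact signatures):

# Hypothetical usage: the pid argument and dict-style access are assumptions.
result = client.behavioral.get_result(pid=response.pid)

speaker_items = [r for r in result["results"] if r["task"] == "diarization"]
feature_items = [r for r in result["results"] if r["task"] == "features"]
# Each of these items carries its embedding serialized as a string (see below).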
Response Schema
As mentioned in Retrieve results, when submitting with the embeddings flag enabled, you will notice two changes with respect to the default case:
- the "embedding" field of the response item with "task": "diarization" will contain the d=192 speaker embedding serialized as a string (instead of null),
- the response will include an additional item with "task": "features", whose "embedding" field contains the d=768 behavioral embedding serialized as a string.
Example (we have trimmed the serialized values with "..."):
{
"pid": 0,
"cid": 0,
"code": 2,
"message": "Processing Complete",
"results": [
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "asr",
"prediction": [
{
"label": " Wait for my buddy. Wait for my buddy to get over here. Hey, we're just doing a social experiment. That's it.",
"posterior": null,
"dominantInSegments": []
}
],
"finalLabel": " Wait for my buddy. Wait for my buddy to get over here. Hey, we're just doing a social experiment. That's it.",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "diarization",
"prediction": [
{
"label": "SPEAKER_00",
"posterior": null,
"dominantInSegments": []
}
],
"finalLabel": "SPEAKER_00",
"level": "utterance",
"embedding": "[16.078125, ..., 12.0625]",
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "language",
"prediction": [
{
"label": "en",
"posterior": "0.9560546875",
"dominantInSegments": []
},
{
"label": "zh",
"posterior": "0.005016326904296875",
"dominantInSegments": []
},
{
"label": "ru",
"posterior": "0.0034465789794921875",
"dominantInSegments": []
}
],
"finalLabel": "en",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "features",
"prediction": [
{
"label": null,
"posterior": null,
"dominantInSegments": []
}
],
"finalLabel": null,
"level": "utterance",
"embedding": "[0.028762999922037125, ..., 0.3249509930610657]",
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "gender",
"prediction": [
{
"label": "male",
"posterior": "0.8219",
"dominantInSegments": []
},
{
"label": "female",
"posterior": "0.1781",
"dominantInSegments": []
}
],
"finalLabel": "male",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "age",
"prediction": [
{
"label": "46 - 65",
"posterior": "0.7322",
"dominantInSegments": []
},
{
"label": "31 - 45",
"posterior": "0.2184",
"dominantInSegments": []
},
{
"label": "23 - 30",
"posterior": "0.033",
"dominantInSegments": []
},
{
"label": "18 - 22",
"posterior": "0.0164",
"dominantInSegments": []
}
],
"finalLabel": "46 - 65",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "emotion",
"prediction": [
{
"label": "neutral",
"posterior": "0.9283",
"dominantInSegments": []
},
{
"label": "angry",
"posterior": "0.0446",
"dominantInSegments": []
},
{
"label": "happy",
"posterior": "0.0217",
"dominantInSegments": []
},
{
"label": "sad",
"posterior": "0.0054",
"dominantInSegments": []
}
],
"finalLabel": "neutral",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "positivity",
"prediction": [
{
"label": "neutral",
"posterior": "0.9615",
"dominantInSegments": []
},
{
"label": "negative",
"posterior": "0.0247",
"dominantInSegments": []
},
{
"label": "positive",
"posterior": "0.0139",
"dominantInSegments": []
}
],
"finalLabel": "neutral",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "strength",
"prediction": [
{
"label": "neutral",
"posterior": "0.9234",
"dominantInSegments": []
},
{
"label": "strong",
"posterior": "0.0694",
"dominantInSegments": []
},
{
"label": "weak",
"posterior": "0.0072",
"dominantInSegments": []
}
],
"finalLabel": "neutral",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "speaking_rate",
"prediction": [
{
"label": "fast",
"posterior": "0.7854",
"dominantInSegments": []
},
{
"label": "slow",
"posterior": "0.1226",
"dominantInSegments": []
},
{
"label": "normal",
"posterior": "0.092",
"dominantInSegments": []
}
],
"finalLabel": "fast",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "hesitation",
"prediction": [
{
"label": "no",
"posterior": "0.9934",
"dominantInSegments": []
},
{
"label": "yes",
"posterior": "0.0066",
"dominantInSegments": []
}
],
"finalLabel": "no",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
},
{
"id": "0",
"startTime": "0.217",
"endTime": "5.026",
"task": "engagement",
"prediction": [
{
"label": "engaged",
"posterior": "0.6509",
"dominantInSegments": []
},
{
"label": "neutral",
"posterior": "0.328",
"dominantInSegments": []
},
{
"label": "withdrawn",
"posterior": "0.0211",
"dominantInSegments": []
}
],
"finalLabel": "engaged",
"level": "utterance",
"embedding": null,
"st": 0.217,
"et": 5.026
}
]
}
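To show how a response like the one above might be consumed, here is a small sketch, assuming the response has already been parsed into a Python dict, that pairs each utterance's speaker and behavioral embeddings by id and deserializes them:

import json
from collections import defaultdict

import numpy as np

def collect_embeddings(response: dict) -> dict:
    # Map utterance id -> {"speaker": 192-d vector, "behavioral": 768-d vector}.
    per_utterance = defaultdict(dict)
    for item in response["results"]:
        if item["embedding"] is None:
            continue
        vector = np.array(json.loads(item["embedding"]), dtype=np.float32)
        if item["task"] == "diarization":
            per_utterance[item["id"]]["speaker"] = vector
        elif item["task"] == "features":
            per_utterance[item["id"]]["behavioral"] = vector
    return dict(per_utterance)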
