Using Millis Platform via WebSocket to build voice agents on desktop and mobile
This tutorial guides you through integrating the Millis AI platform directly over WebSocket to build voice agents for desktop or mobile apps. You capture audio natively, send it to Millis over the WebSocket connection, and receive voice responses in real time.
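First, open a WebSocket connection to Millis. The URL below is a placeholder, not the real endpoint; substitute the WebSocket endpoint from your Millis dashboard or API reference.

// Placeholder URL -- substitute the actual Millis WebSocket endpoint
const ws = new WebSocket("wss://<your-millis-endpoint>");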
Once connected, send an initiate message to start the interaction.
ws.onopen = () => {
  let initiateMessage = {
    method: "initiate",
    data: {
      agent: {
        agent_id: "your_agent_id" // Or replace with agent_config: <config> for dynamic configuration
      },
      public_key: "your_public_key",
      metadata: { key: "value" },      // Optional: extra data attached to the call
      include_metadata_in_prompt: true // Optional: set true to include the metadata in the agent's system prompt
    }
  };
  ws.send(JSON.stringify(initiateMessage));
};
Alternatively, pass a full agent_config to configure the agent dynamically:

ws.onopen = () => {
  let initiateMessage = {
    method: "initiate",
    data: {
      agent: {
        agent_config: {
          prompt: "",
          voice: {
            provider: "elevenlabs",
            voice_id: "..."
          }
        }
      },
      public_key: "your_public_key",
      metadata: { key: "value" },      // Optional: extra data attached to the call
      include_metadata_in_prompt: true // Optional: set true to include the metadata in the agent's system prompt
    }
  };
  ws.send(JSON.stringify(initiateMessage));
};
Millis will respond with the message {"method": "onready"}, indicating it is ready. Once you receive it, start streaming the client's audio to Millis as binary packets.
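As a minimal sketch of the capture side, assuming the browser Web Audio API and raw 16-bit PCM mono input (both assumptions here; match whatever audio format and sample rate your Millis agent is configured for):

// Minimal capture sketch. Raw 16-bit PCM mono is an assumption --
// use the audio format and sample rate your agent is configured for.
async function startStreaming(ws) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  // ScriptProcessorNode is deprecated but widely supported;
  // an AudioWorklet is the modern equivalent.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (event) => {
    const float32 = event.inputBuffer.getChannelData(0);
    // Convert Float32 samples in [-1, 1] to 16-bit PCM
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(pcm16.buffer); // Audio goes out as binary frames
    }
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}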
Millis will send audio responses as ArrayBuffers, in the same format and sample rate as the audio you send. You need to buffer and play these packets on your side.
ws.binaryType = "arraybuffer"; // Without this, browsers deliver binary frames as Blobs

ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    // Binary frame received: handle as audio packets
    let audioResponse = new Uint8Array(event.data);
    // Buffer and play the audio response
  } else {
    // String frame received: handle as a normal event
    let message = JSON.parse(event.data);
    handleIncomingMessage(message);
  }
};
ArrayBuffer data will be the audio packets, while string data indicates normal events that you need to process accordingly.
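One way to buffer and play the packets is a small playback queue on top of the Web Audio API. The sketch below is a hypothetical helper, assuming 16-bit PCM at a fixed SAMPLE_RATE (adjust to your configured audio settings); it exposes pause/unpause/clear hooks for the session events described next.

const SAMPLE_RATE = 16000; // Assumption: must match your configured audio settings

class AudioPlayer {
  constructor() {
    this.context = new AudioContext({ sampleRate: SAMPLE_RATE });
    this.queue = [];
    this.paused = false;
    this.playing = false;
  }

  // Call this for every ArrayBuffer received from Millis
  enqueue(arrayBuffer) {
    this.queue.push(arrayBuffer);
    if (!this.paused && !this.playing) this.playNext();
  }

  playNext() {
    if (this.paused || this.queue.length === 0) {
      this.playing = false;
      return;
    }
    this.playing = true;
    const packet = this.queue.shift();

    // Convert 16-bit PCM to Float32 samples for Web Audio
    const pcm16 = new Int16Array(packet);
    const buffer = this.context.createBuffer(1, pcm16.length, SAMPLE_RATE);
    const channel = buffer.getChannelData(0);
    for (let i = 0; i < pcm16.length; i++) channel[i] = pcm16[i] / 0x8000;

    const sourceNode = this.context.createBufferSource();
    sourceNode.buffer = buffer;
    sourceNode.connect(this.context.destination);
    sourceNode.onended = () => this.playNext();
    sourceNode.start();
  }

  pause() { this.paused = true; } // Keep buffering, but start no new packets

  unpause() {
    this.paused = false;
    if (!this.playing) this.playNext(); // Resume buffered audio
  }

  clear() {
    // Drop all buffered audio; a full implementation would also
    // stop the currently playing source
    this.queue = [];
  }
}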
Millis may send various events to manage the session and interaction. Here is the logic behind each message:
pause: Millis detected some voice activity from the client. The agent temporarily pauses talking to observe the next voice activity. Keep buffering incoming audio packets, but do not play them.
unpause: Millis determined that the human is not trying to talk over or interrupt the agent, so the agent continues talking. Resume playing the audio packets in the buffer.
clear: Millis detected the human's voice, indicating an intent to interrupt. The agent resets and stays silent to let the human continue talking. Clear all audio buffers and stop playback.
ontranscript: Real-time transcript of the client’s audio.
onresponsetext: Real-time transcript of the agent’s response.
onsessionended: Whenever Millis decides to end the session, for any reason, you will receive this event.
start_answering: The agent decides to start answering the human’s query.
ai_action: For debugging purposes. During the conversation, Millis AI may intelligently decide to take an action; listen to this event to understand what the agent is trying to do.
Example:
function handleIncomingMessage(message) {
  switch (message.method) {
    case "pause":
      // Pause playback and keep buffering incoming audio packets
      break;
    case "unpause":
      // Resume playback of buffered audio packets
      break;
    case "clear":
      // Clear the audio buffer and stop playback
      break;
    case "ontranscript":
      console.log("Client's audio transcript:", message.data);
      break;
    case "onresponsetext":
      console.log("Agent's response transcript:", message.data);
      break;
    case "onsessionended":
      console.log("Session ended.");
      ws.close();
      break;
    case "start_answering":
      console.log("Agent starts answering the query.");
      break;
    case "ai_action":
      console.log("AI Action:", message.data);
      break;
  }
}
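In practice, the pause, unpause, and clear branches above would call into your playback buffer, e.g. player.pause(), player.unpause(), and player.clear() if you use something like the AudioPlayer sketch shown earlier.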