At calldesk, we've always strived to challenge our platform and push it to its limits. Since we won the AWS "Architecture of the Year" 2020 Challenge, we thought it would be interesting to share our journey and feedback on how we recently scaled our platform from 1000 to 5000 phone calls in parallel using AWS, in less than a month. Quite a technical challenge!
We will be writing a series of articles to explain how this challenge was handled as a team. You'll discover how the engineering team organized itself to reach its goal within 2 sprints, what tools we used, the problems we faced, and what solutions we implemented.
This first article aims to give you an overview of the calldesk technology, the architecture, its associated limits, and why it was not an easy task. The following articles will focus on the solutions we've implemented. Stay tuned!
What the engineering team does at calldesk, in brief
At calldesk, the engineering team works on 2 main products:
- Our AI-powered voice agents, which are the core business of calldesk, and can automate repetitive calls in a customer call center
- Our studio, which is the cloud-based SaaS application we've developed to ease voice agent creation and deployment for our customers and partners
The calldesk technology is a cloud architecture designed to be:
- fast: a voice agent needs to understand the caller and respond in real time. We can't afford a long response time, which is why we have designed a streaming architecture
- scalable: our business is growing fast thanks to our customers and partners, so we need to handle more and more calls in parallel
- reliable: we are committed to a 99.99% SLA with our customers and partners. No downtime is acceptable.
- maintainable: a lot of updates have to be released in production every week. We need our architecture to embrace changes
- secure: our voice agents are used by banks, insurance, transport, and utility companies. We have access to sensitive data and need to protect it.
These 5 values drive the engineering culture on a daily basis, and we're happy to share with you how we handled this big technical challenge in the past few weeks.
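To put the 99.99% figure in perspective, here is a quick back-of-the-envelope calculation of the downtime such a target allows. The function name and the 30-day and 365-day windows are assumptions chosen for illustration; an actual SLA may define its measurement window differently.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Maximum downtime, in minutes, allowed by an availability target over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


# Assumed measurement windows: a 30-day month and a 365-day year.
monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(f"30-day budget: {monthly:.2f} min, 365-day budget: {yearly:.2f} min")
```

Under these assumptions, the budget is roughly 4 minutes per month, which is why every component on the call path has to scale without interruption.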
What is the big challenge we faced?
Due to the COVID-19 pandemic, we've observed that many businesses are switching from an on-premises contact center to a cloud-based one in order to adapt to the new remote-first environment. This change is not limited to France; it is a global one.
In this context, we are expecting more and more calls in parallel on our platform. Until now, we have mainly handled phone calls for large French companies (La Poste, OUI.sncf, Enedis, Dalkia, CNP Assurances...). But phone call volumes are much higher on the US market, so partnering with tech giants on that market requires us to scale the capabilities of our platform.
To be more precise, before starting these partnerships, we had sustained up to 1000 phone calls in parallel. With demand expected to grow, our goal was to handle 5 times more, that is, 5000 phone calls in parallel. It's like moving from 1000 agents in a call center to 5000. Not an easy task.
Why was it not an easy task?
Three main design choices help explain the challenges raised by this task.
If you're not familiar with telephony protocols and integration, here is a quick explanation of the 2 ways of connecting phone calls to a platform like ours:
- using the PSTN, which is the old, traditional telephony network. It can work fine for small call volumes, but it has a lot of limits.
- using VoIP (SIP protocol), which runs over the internet protocol and is the preferred solution when it comes to scaling an infrastructure. VoIP is the kind of technology used when making a call on WhatsApp or Facebook Messenger.
We had to set up a test infrastructure in order to make thousands of calls on our platform using the SIP protocol, without being limited by our laptop capabilities or local network.
First, calldesk technology relies on a best-of-breed architecture, which is a key differentiator of our technology on the market. Best-of-breed means that we are capable of running multiple Automatic Speech Recognition (ASR) engines in parallel, competing with one another in real time.
In other words, our architecture uses multiple services to transcribe speech to text and is able to use them simultaneously in order to select the best transcript depending on the use case (asking for a last name, for an address...). This lets us get the best understanding performance available for each question, and this is what makes our technology so powerful.
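The idea of running several ASR engines at once and keeping the best transcript for a given use case can be sketched as follows. This is a minimal illustration, not the calldesk implementation: the engine names, the `fake_engine` stand-in, the `ENGINE_WEIGHTS` table, and the "confidence times weight" selection rule are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Transcript:
    engine: str
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the engine


# Hypothetical stand-in for a real ASR client, which would stream audio
# to a third-party service and await its answer.
async def fake_engine(name: str, text: str, confidence: float, delay: float) -> Transcript:
    await asyncio.sleep(delay)
    return Transcript(name, text, confidence)


# Hypothetical per-use-case weights: how much each engine is trusted
# for a given kind of question.
ENGINE_WEIGHTS = {
    "last_name": {"engine_a": 1.0, "engine_b": 1.3},
    "address": {"engine_a": 1.2, "engine_b": 0.9},
}


def select_best(transcripts: list[Transcript], use_case: str) -> Transcript:
    """Pick the transcript with the highest weighted confidence."""
    weights = ENGINE_WEIGHTS.get(use_case, {})
    return max(transcripts, key=lambda t: t.confidence * weights.get(t.engine, 1.0))


async def transcribe(use_case: str) -> Transcript:
    # Both engines run concurrently; gather returns once all have answered.
    results = await asyncio.gather(
        fake_engine("engine_a", "my last name is smyth", 0.80, 0.02),
        fake_engine("engine_b", "my last name is smith", 0.75, 0.01),
    )
    return select_best(list(results), use_case)


best = asyncio.run(transcribe("last_name"))
print(best.engine, "->", best.text)
```

With these made-up weights, the same two answers lead to a different winner depending on whether the voice agent asked for a last name or an address. Note that every extra engine multiplies the number of concurrent third-party streams per call, which matters a lot when scaling.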
Moreover, we have a streaming architecture, which means that voice agents try to understand the caller on every syllable, just like humans do. They don’t wait for the full sentence, which guarantees a very low response time.
Let's take an example. James is calling his customer service and has to give his last name, which is Smith. He's going to say "My last name is Smith". On every syllable, the voice agent will try to understand the caller's intent, just like this:
- "My last"
- "My last name": at this point, the sentence is not finished but the voice agent has already understood that James is giving his last name
- "My last name is"
- "My last name is Smith": at this point, the voice agent is able to extract the last name using a Named Entity Recognition algorithm, and can already answer the caller. The processing is done in real time.
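The walkthrough above can be sketched in a few lines: each partial transcript is analyzed as soon as it arrives, instead of waiting for the full sentence. The two regular expressions are deliberately naive, hypothetical stand-ins for real intent classification and Named Entity Recognition models, and the `give_last_name` label is an assumption.

```python
import re
from typing import Iterable, Iterator, Optional

# Hypothetical, naive rules standing in for real intent and NER models.
INTENT_PATTERN = re.compile(r"\bmy last name\b", re.IGNORECASE)
ENTITY_PATTERN = re.compile(r"\bmy last name is (\w+)", re.IGNORECASE)


def understand(
    partials: Iterable[str],
) -> Iterator[tuple[str, Optional[str], Optional[str]]]:
    """Yield (partial, intent, last_name) for each partial transcript as it arrives."""
    for partial in partials:
        intent = "give_last_name" if INTENT_PATTERN.search(partial) else None
        match = ENTITY_PATTERN.search(partial)
        last_name = match.group(1) if match else None
        yield partial, intent, last_name


partials = ["My last", "My last name", "My last name is", "My last name is Smith"]
results = list(understand(partials))
for partial, intent, last_name in results:
    print(f"{partial!r}: intent={intent}, last_name={last_name}")
```

In this sketch the intent is known from the second partial, and the last name only from the fourth one. The trade-off is that every partial triggers processing, so the load grows with the number of partials per call, not just the number of calls.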
However, scaling this kind of architecture is not easy. We are often limited by third-party services and their streaming capabilities. We'll tell you more about the solutions we've chosen in the next articles!
How did we start?
Before moving on to implementing solutions, the first thing we did was to think and plan. As we were a team of 5 software engineers working on this, we could not just start without a clear objective. We needed to be organized in order to be effective and quickly reach our goals.
Thinking and aligning the team
This may sound obvious, but it was actually a critical step: it let us answer the following questions and align the team in a single direction:
- What is the objective and deadline?
- What are the current limitations we are already aware of?
- What are the risks of failing?
- What is the plan for the next few weeks?
- When are we going to do the first stress test?
The objective and the deadline were clear: be able to handle 5000 calls in parallel within 4 weeks. 2 sprints. WOW.
Regarding current limitations, we knew our platform could handle 1000 phone calls in parallel. We also knew about the limits of our third-party services and asked our providers to raise them before starting anything.
We then discussed some risks: VoIP proxies and DNS resolution, FIFO messaging queues, AWS Lambda concurrent executions, logging volumes...
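To illustrate why concurrent executions appear on a list like this, here is a rough estimate based on the usual rule that concurrency is approximately the invocation rate multiplied by the average duration. Every number below (invocations per call, duration, and the concurrency limit) is an illustrative assumption, not a calldesk measurement; the real quota depends on the account and region.

```python
def estimated_concurrency(
    parallel_calls: int,
    invocations_per_call_per_sec: float,
    avg_duration_sec: float,
) -> float:
    """Concurrency = arrival rate (invocations/s) x average duration (s)."""
    arrival_rate = parallel_calls * invocations_per_call_per_sec
    return arrival_rate * avg_duration_sec


# All values below are illustrative assumptions.
INVOCATIONS_PER_CALL_PER_SEC = 0.5  # one invocation every 2 seconds per call
AVG_DURATION_SEC = 0.3              # average function duration
CONCURRENCY_LIMIT = 1000            # assumed account-level concurrency quota

for calls in (1000, 5000):
    needed = estimated_concurrency(
        calls, INVOCATIONS_PER_CALL_PER_SEC, AVG_DURATION_SEC
    )
    headroom = CONCURRENCY_LIMIT - needed
    print(f"{calls} calls -> ~{needed:.0f} concurrent executions (headroom: {headroom:.0f})")
```

With these made-up figures, going from 1000 to 5000 calls takes the estimate from about 150 to about 750 concurrent executions, leaving little margin for bursts or slower invocations. This kind of estimate helps decide which quotas to raise before the first stress test.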
Planning the next steps
We finally defined the plan for the following weeks: they would most likely be the most intensive weeks we've had so far at calldesk. We needed to be very well organized in order to be productive and effective. No room for hesitation. Time was our enemy.
From day 1 to day 3
The first 3 days would be dedicated to developing functional tests so that we could test critical components in isolation. This would allow us to quickly iterate on components and see the results. We would also create the voice agent that would handle the phone calls and prepare the test scenarios and protocol.
On day 3
On the third day, during the night (to prevent production phone calls from being impacted), we would launch the first stress test scenarios. We would track every result and every alert triggered on the platform for further analysis.
From day 4 to day 5
The next day, we would analyze the results and logs to identify which components had been pushed beyond their limits. We would define the next steps for the iteration, prioritize them, plan the next stress test, and start coding bug fixes and implementations.
We would repeat this pattern (day 3 to day 5) over and over again as we increased the volume of phone calls made on our platform. This would let us quickly iterate on the solutions and reach milestones on the road to our final goal.
In the next article, we'll tell you more about the test scenarios and the tools we used, results, limits, and what solutions we decided to implement to make our platform scale. Stay tuned!
Want to be part of this amazing tech journey? Check out our engineering job openings. We would be happy to talk with you!