Operating a VoIP system with a focus on great customer experience can be quite challenging, especially if you run a heterogeneous network with lots of different SIP clients (various software clients, all kinds of SIP phones and terminal adapters, and especially IP PBXs). SIP clients are known to have all kinds of quirks and implementation errors, and if you don’t control them yourself (e.g. with a central device provisioning tool), configuration errors introduced by your customers come into play as well. Putting the right values into a client’s configuration interface is not always straightforward; it sometimes takes an engineering degree to figure out parameters like registrar, outbound proxy, session timers, codec ordering etc. Flexibility is not always key, especially when it comes to end-user interfaces. That’s why Skype is so successful: “it just works”.
In any case, if a customer uses your VoIP service (especially if it’s a paid service), it just needs to work. If it doesn’t, you had better pin down the cause as soon as possible and provide a solution, otherwise the customer will turn away from you quite quickly.
The poor man’s approach
In the past, VoIP troubleshooting went something like this (we’ve been there and done that):
- Ask the customer approximately when she made the failed call or failed to register her phone
- Grep the (hopefully extensive) log files for hints pointing to the error
- If nothing obvious comes up there, start a tcpdump on the system and ask the customer to try the call again
- Copy the resulting trace to your local machine and try to extract the relevant packets from a potentially HUGE trace
- Analyze the call, take your actions, and if necessary repeat the process
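The extraction step in the list above can be sketched in a few lines of Python. This is only an illustration, assuming the SIP messages have already been exported from the trace as plain text strings; the helper names (`header_values`, `involves_user`, `filter_by_user`) are made up for this example, and folded (multi-line) headers are not handled.

```python
import re

# Matches the user part of a SIP URI, e.g. "alice" in "sip:alice@example.com".
_URI_USER = re.compile(r"sips?:([^@;>\s]+)@", re.IGNORECASE)


def header_values(message, *names):
    """Return the values of the named headers (case-insensitive) of one SIP message."""
    wanted = {n.lower() for n in names}
    values = []
    for line in message.splitlines()[1:]:  # skip the request/status line
        if not line.strip():               # an empty line ends the header section
            break
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in wanted:
            values.append(value.strip())
    return values


def involves_user(message, user):
    """True if the user appears in the From or To header of the message."""
    # "f" and "t" are the compact forms of From and To.
    for value in header_values(message, "From", "f", "To", "t"):
        match = _URI_USER.search(value)
        if match and match.group(1) == user:
            return True
    return False


def filter_by_user(messages, user):
    """Keep only the messages that belong to the given subscriber."""
    return [m for m in messages if involves_user(m, user)]
```

Even with such a helper, somebody still has to capture the trace, export it and run the script, which is exactly the manual overhead discussed next.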
This approach has some obvious flaws. First, your support agent needs direct access to the system and the proper rights to start a trace.
It is also quite time-consuming, and it hardly looks professional if you need to ask your customer to take action just so you can find the problem. It’s also a heavily manual process that requires considerable technical expertise, and if the support agent needs to escalate the issue to 2nd-level support, it involves uploading SIP traces somewhere, or even worse, sending them back and forth by email.
External Monitoring Tools to the rescue!
Due to the huge overhead of the traditional troubleshooting approach, a whole new ecosystem has emerged around external SIP monitoring and analysis. New start-ups were created to tackle these issues, and established network monitoring vendors pushed into the market, providing traffic analyzer solutions to ease the pain of VoIP support. The problem for small VoIP operators is that these solutions can be horrendously expensive. In the telephony industry, licensing models are broken down to a per-line or per-subscriber price, and it’s not uncommon for the line price of the analyzer tools to exceed the line price of the VoIP soft-switch, which is simply not feasible.
However, since open source projects are increasingly gaining a foothold in the VoIP market, it’s only natural that open source VoIP monitoring and troubleshooting tools are starting to appear as well. The most promising project in the open source landscape is Homer, an open source SIP capturing server. Since it can passively tap traffic on mirrored switch ports, it integrates nicely into a VoIP network environment without interfering with existing network elements.
Using such tools, the support process changes significantly, because all SIP packets are constantly captured on the network and can be filtered and viewed on web interfaces. Most of them, like Homer, also visually present the call flows of the SIP packets, so it gets very easy to spot issues between the involved hops.
Instead of having to involve the customer in the troubleshooting process, it becomes something like this:
- Filter for calls or registrations of the respective customer
- Visually check the call flows and packets for obvious issues
- If necessary, grep the logs for specific calls
- Take actions and repeat the process if necessary
If more people need to be involved in the troubleshooting process, only the link to the call flow in question needs to be shared.
However, the problem with such tools is that they can only provide an external view of a VoIP system, because in most cases it’s not possible to hook into the internal communication of a VoIP soft-switch. For example, the Sipwise sip:provider appliances consist of several SIP elements communicating with each other on the local interface, and this traffic can’t be captured without “opening up” the soft-switch and installing additional software on it, which might either be impossible altogether or void any warranties provided by the vendor.
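To give a rough idea of what such a tool does under the hood, the following Python sketch groups captured messages by their Call-ID and prints a simple textual call flow. It is not how Homer is implemented; the `Captured` record and all function names are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Captured:
    """One captured SIP message (hypothetical record, not a Homer structure)."""
    ts: float   # capture time in seconds
    src: str    # source address
    dst: str    # destination address
    raw: str    # SIP message as text


def call_id(raw):
    """Return the Call-ID header value ("i" is its compact form), or None."""
    for line in raw.splitlines()[1:]:
        if not line.strip():
            break
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in ("call-id", "i"):
            return value.strip()
    return None


def summary(raw):
    """'INVITE' for a request, '180 Ringing' for a response."""
    lines = raw.splitlines()
    first = lines[0] if lines else ""
    if first.startswith("SIP/2.0 "):
        return first[len("SIP/2.0 "):]
    return first.split(" ", 1)[0]


def group_by_call(packets):
    """Group messages by Call-ID, each group ordered by capture time."""
    calls = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.ts):
        cid = call_id(p.raw)
        if cid is not None:
            calls[cid].append(p)
    return dict(calls)


def render_flow(packets):
    """Render one call as 'time  src -> dst  message' lines."""
    return "\n".join(
        f"{p.ts:8.3f}  {p.src} -> {p.dst}  {summary(p.raw)}" for p in packets
    )
```

Calling `render_flow(group_by_call(packets)[some_call_id])` yields one line per hop, which is the textual equivalent of the graphical call flows these tools present.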
The Sipwise Approach
To get a complete view of the SIP packet flows inside the VoIP system as well, we have integrated a first version of our own SIP monitoring and troubleshooting system into the upcoming version 2.6 of the sip:providerPRO platform. It provides deep insights into past and current call flows by laying out a breakdown of SIP requests and responses, as well as visual call graphs and packet details. The advantage over external solutions is that it integrates tightly into the existing Administrative Interface.
An overview of the number and distribution of the various requests and responses gives you useful hints for predicting failures. We’re working hard to also implement trending and prediction of issues, so countermeasures can be taken proactively before complaints start hitting your support team.
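As a rough illustration of what such an overview is based on, the sketch below counts request methods and response codes in a list of SIP messages and computes the share of failed final responses. It is an assumption-laden example, not the sip:providerPRO implementation; the function names are invented, and it treats every 4xx, 5xx and 6xx response as a failure, although 401 and 407 are usually just authentication challenges.

```python
from collections import Counter


def classify(raw):
    """Return ('request', method) or ('response', status code), or None if unparseable."""
    lines = raw.splitlines()
    first = lines[0] if lines else ""
    parts = first.split(" ", 2)
    if parts[0] == "SIP/2.0" and len(parts) >= 2 and parts[1].isdigit():
        return ("response", parts[1])
    if len(parts) == 3 and parts[2] == "SIP/2.0":
        return ("request", parts[0])
    return None


def distribution(messages):
    """Count request methods and response codes separately."""
    requests, responses = Counter(), Counter()
    for raw in messages:
        kind = classify(raw)
        if kind is None:
            continue
        target = requests if kind[0] == "request" else responses
        target[kind[1]] += 1
    return requests, responses


def failure_share(responses):
    """Share of final responses (non-1xx) that are 4xx, 5xx or 6xx."""
    final = sum(n for code, n in responses.items() if code[0] != "1")
    failed = sum(n for code, n in responses.items() if code[0] in "456")
    return failed / final if final else 0.0
```

Tracking a value like `failure_share` over time is one simple way to notice that something is going wrong before customers start calling.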
To troubleshoot customer issues, all call scenarios are listed directly in the subscriber view, so you don’t have to search for calls belonging to specific customers.
Each call scenario provides a dynamically rendered graphical representation of the call flow, so you can easily spot any issues in the call routing directly on a network level.
The call scenario is clickable, so you can easily dig into the details of a specific packet.
For further analysis, you can also download the raw SIP trace in PCAP format.
We’ve learned that it is extremely important to provide a very simple way to get an overview of what is going on at any given moment directly on a network level, because log files don’t always provide all the information needed to troubleshoot an issue. It is crucial to be able to analyze calls that happened in the past, so you don’t have to bother a customer with any actions during the troubleshooting process.
Another major task is to counteract emerging issues before they pile up, and most importantly before customers are affected. Our focus will be on extending the described tools to predict issues where possible, so you’ll be able to react before problems escalate.