Making the 24/7 Customer Experience Possible Through Monitoring
SolarWinds web APM products give Traxo a new level of visibility, lower costs, and easier troubleshooting.
For any 24/7 business, knowing when applications or infrastructure aren’t performing or whether systems are acting up is critical. “We like to know when things are going right too – that’s great,” says Chris Stevens, chief technology officer, Traxo, “but when things are going wrong, we expect to be alerted and need to troubleshoot quickly and easily.”
Traxo’s customers are part of the broader travel ecosystem and include corporate travel and procurement organizations, travel management companies, expense management applications, and risk management services. They depend on the data aggregation company’s information to power expense products and duty of care solutions that keep employees safe while they’re traveling.
The Traxo multi-region, multi-zone AWS environment includes a number of similar hosts managed by HashiCorp’s Nomad cluster and application scheduler, along with Vault for sensitive data management and Consul for service discovery and configuration.
Also located in that cluster are a number of different containerized workloads on Docker that include multi-tier microservices. Stevens says they monitor “everything from the host to the container to the application, and also ingress and egress through the various load balancers.”
Traxo’s visibility into its applications started with a homegrown tool built with StatsD installed on Graphite™. However, that became increasingly difficult to manage. And when Traxo moved to AWS, Stevens says, “We wanted to offload anything not needed to run the core business.”
As they added more segments to metrics, however, the StatsD tagging system wasn’t enough. “It was hard to write good aggregates and see segments by host, for example,” says Stevens.
In addition, troubleshooting was extremely manual, and Traxo needed a better solution.
That’s where Librato (since upgraded to AppOptics™), Loggly, and Pingdom, all from the SolarWinds web application performance monitoring (APM) portfolio, come in.
“We came to the SolarWinds products organically,” says Stevens. “We started with a need for metrics and evolved from Librato to AppOptics. We then realized the need for log management after that and picked Loggly. We needed outside verification and availability, so we found Pingdom and rounded out our implementation of the full SolarWinds web APM portfolio.”
With Librato, Traxo could tag each of its hosts, write a much smaller set of metrics, and then slice and dice them by tag after the fact. In turn, this made it easier for application developers to decide what to show, and it left the metric namespace less cluttered.
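The idea of tagging first and slicing later can be illustrated with a minimal sketch. It does not use the Librato or AppOptics API; the metric name, tag keys, host names, and the `aggregate_by` helper are all hypothetical and exist only to show one metric being grouped by different tags after the fact.

```python
from collections import defaultdict

# Hypothetical tagged measurements: a single metric name, with the
# dimensions (host, region) carried as tags instead of baked into the name.
measurements = [
    {"name": "request.latency_ms", "value": 120.0,
     "tags": {"host": "web-1", "region": "us-east-1"}},
    {"name": "request.latency_ms", "value": 80.0,
     "tags": {"host": "web-1", "region": "us-east-1"}},
    {"name": "request.latency_ms", "value": 200.0,
     "tags": {"host": "web-2", "region": "us-west-2"}},
]


def aggregate_by(measurements, name, tag):
    """Average the values of one metric, grouped by a single tag key."""
    groups = defaultdict(list)
    for m in measurements:
        if m["name"] == name and tag in m["tags"]:
            groups[m["tags"][tag]].append(m["value"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# The same stored metric answers two different questions.
by_host = aggregate_by(measurements, "request.latency_ms", "host")
by_region = aggregate_by(measurements, "request.latency_ms", "region")
```

With an untagged scheme, each of those views would typically need its own metric name (for example, one per host), which is how a namespace becomes cluttered.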
As SolarWinds integrated Librato into the larger AppOptics ecosystem, the process was easy, according to Stevens: “The transition was fast. Our team members were already familiar with Librato so there wasn’t much training. Everything was up and running within a matter of days.”
Because Traxo knew it would grow in the short term, price was a consideration when looking for a monitoring solution. Other questions asked: What would the training requirements be? Would the solution provide new capabilities to Traxo? Did it have open source support? “Those came together with AppOptics,” says Stevens, “and we got the added benefit of the APM ecosystem.”
Traxo uses AppOptics for proactive monitoring: to determine host status, to know how much capacity is being used in a cluster, to see what applications are doing, and to ensure KPIs are being met. A dashboard display provides at-a-glance, full awareness for the team. The solution also supports deployments, so engineers can monitor a set of metrics during a rollout. As the rollouts finish in various regions, they crosscheck to see if expectations were met.
“In an incident environment, the alerting is incredibly powerful,” says Stevens. “We have alerts pushed out through Slack and PagerDuty. There are many different options, and we were able to just click and configure. The team can see if the app is misbehaving, if it’s a particular host or a section of hosts in an availability zone, or a host with a particularly noisy neighbor that’s having problems.”
A correlation ID traces all the way in from the edge service to one or many backend services. “The tracing is incredibly important and is a new capability for our team. We found a number of different hotspots that way,” says Stevens.
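As a rough illustration of how a correlation ID can follow a request from the edge to the backends, here is a minimal sketch. It is not Traxo's implementation or the AppOptics tracing API; the header name, the service functions, and the in-memory log list are all assumptions made for the example.

```python
import uuid

# Assumed header name; any consistently used header would work.
CORRELATION_HEADER = "X-Correlation-ID"


def pricing_service(headers, log):
    """Hypothetical backend: records the ID it received from the caller."""
    log.append({"service": "pricing",
                "correlation_id": headers[CORRELATION_HEADER]})


def itinerary_service(headers, log):
    """Hypothetical backend: records the ID it received from the caller."""
    log.append({"service": "itinerary",
                "correlation_id": headers[CORRELATION_HEADER]})


def edge_service(headers, log):
    """Reuse an incoming correlation ID or mint one, then fan out."""
    headers = dict(headers)  # copy so the caller's dict is not mutated
    cid = headers.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    log.append({"service": "edge", "correlation_id": cid})
    for backend in (pricing_service, itinerary_service):
        backend(headers, log)  # the same headers travel downstream
    return cid


log = []
cid = edge_service({}, log)
# Every log entry for this request now carries the same ID, so a search
# on that one value returns the full path of the request.
```

Because each service logs the same ID, filtering logs or traces on that single value reconstructs one user request across services.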
Traxo uses Pingdom to report on the uptime and response time of its edge-facing applications. The information from Pingdom is piped into a customer-facing Atlassian Statuspage with a view of what the systems are doing at any given moment.
“We added that to our SLA contracts. It’s in all of our public sites, and our users know they can see the current status of all of our systems and also any incidents that might be ongoing or in the recent history.”
Engineering and operations teams primarily use Loggly to set up alerts and to access a record of all hosts in the system in one place. They also rely on the search functionality to drill down to a particular region or host and look at the stream of logs from the various applications.
“Mostly, we wanted to have a well-documented TLS configuration with both the existing Linux host and the syslog agent. Once we plugged Loggly in, our image builder ran, and we had working logs on the same location. It was very easy to get running,” explains Stevens.
Since rolling out the SolarWinds tools, Traxo has more visibility into its applications and systems. Operations cost for support is “way down.” Stevens says, “All those things play together and allow us to do better things for our customers.”
One of the biggest changes is time to resolve any incident, according to Stevens. “We’re able to home in faster on the core problem, trace into it, follow a user request, and more easily isolate hosts or regions that are experiencing a particular problem. The ability to sign on across a number of different products within the SolarWinds family cuts down on another login step. Plus, the tools talk to each other.”
He adds that Loggly is exceeding objectives. “Now, we don’t have to think about it. Our logs are there within just a few seconds. We’re satisfied to know that our logs are offsite, searchable, and archived for compliance reasons.”
For others considering any of the SolarWinds web APM products independently, Stevens believes it’s also important to understand how well they all work together.
“AppOptics, Pingdom, and Loggly solve particular needs—each one independently—but we’re also seeing value in how the tools can be integrated together. The interoperability between systems allows us to get to the root of our problem faster and deliver our service better.”
He also touts the value of a multi-prong, multi-service engagement through a single vendor like SolarWinds and sums up the benefits: “The package of SolarWinds products we use has helped us serve our customers by lowering support cost and enabling better visibility—both to us running the system and our customers consuming it. We’ve been able to close the feedback loop between us providing the service and somebody consuming it.”