Dynatrace Metrics Ingest

Today we’re going to be talking about some exciting new functionality that was recently added to Dynatrace. We’ve talked about Dynatrace in this blog before, but for those who may not be familiar, Dynatrace is an all-in-one software intelligence platform and a Leader in the Gartner Magic Quadrant for APM. Dynatrace has always been a frontrunner in understanding application performance, and its AI and automation help tackle challenges that would otherwise require countless hours of manual effort.

Until now, most of the data captured in Dynatrace was gathered by the Dynatrace OneAgent or by Dynatrace Extensions, which pull data from APIs. This meant that if a metric wasn’t native to Dynatrace, it couldn’t be consumed by the Dynatrace platform. But,

  • What if you want to keep track of a certain file’s size on a disk?
  • What if you have an important InfluxDB you want to monitor?
  • What if you want to know the number of currently running Ansible deployments, or the failed ones?

This blog will cover:

  1. A high-level overview of the “New Metrics Ingestion Methods”
  2. A “Cheat Sheet” for selecting which method is best for you
  3. A “Brief Review on Ingestion Methods”
  4. Shortcomings of this Early Adopter release, and what we hope to see in the future
  5. An example – “Ansible Tower Overview Dashboard”

New Metrics Ingestion Methods

Historically, teams could write OneAgent plugins, but those required development effort and a knowledge of Python. Now that Dynatrace has released the new Metrics Ingestion, custom metrics can be sent to the AI-powered Dynatrace platform more easily than ever. There are four main ways to achieve this:

  • The Dynatrace StatsD implementation
  • The Metrics API v2 / OneAgent REST API
  • The Dynatrace Telegraf output
  • The dynatrace_ingest script shipped with the OneAgent

Dynatrace has already published technical blogs about how to send metrics with each of these methods, so this blog will focus on the pros and cons of each, along with a cheat sheet for deciding which path is likely best for your business use case.

Cheat Sheet

When deciding which route to take, follow this cheat sheet:

  • Is Telegraf already installed and gathering metrics? Use the Dynatrace Telegraf Plugin
    • Or, does Telegraf have a built-in Input Plugin for the technology that requires monitoring? Telegraf may still be the best route, because capturing the metrics will be effortless.
  • Is something already scraping metrics in StatsD format? Use the StatsD Implementation
  • If none of the above, the best route is likely to use the Metrics API v2 / OneAgent REST API.

Brief Review on Ingestion Methods

Since Dynatrace has already written about each method except Telegraf, those details won’t be duplicated in this blog. Instead, here’s a quick overview of each ingestion method:

  • Dynatrace StatsD Implementation – If there’s an app that’s already emitting StatsD-formatted metrics, this implementation is the most direct. The OneAgents listen on UDP port 18125 for StatsD metrics. Dynatrace has extended the StatsD protocol to support dimensions (for tagging and filtering). The StatsD format is not as sleek as the new Dynatrace Metrics Syntax, so this path is not recommended unless StatsD is already present.
  • Metrics API v2 (OneAgent REST API) – There is an API endpoint listening for metrics in the Dynatrace Metrics Syntax (if you happen to be familiar with InfluxDB’s Line Protocol, it’s almost identical).
  • Dynatrace Telegraf Output – The latest releases of Telegraf now include a dedicated Dynatrace output, which makes sending metrics to Dynatrace extremely easy when Telegraf is installed. Telegraf can either push metrics to the local OneAgent or out to the Dynatrace cluster.
    • If Telegraf is not yet installed, it may still be the easiest route forward if Telegraf natively supports a technology that needs to be monitored. The list of Telegraf inputs can be found in the Telegraf documentation. Installing Telegraf is quite easy, and the Telegraf configuration is detailed well in the Dynatrace documentation.
  • Scripting Languages (Shell) – If code has to be written to output Dynatrace Metrics Syntax or StatsD metrics, it can be slightly simplified by using the dynatrace_ingest script provided with each OneAgent. Rather than writing networking code to push the metrics, metric lines can simply be piped into this executable (a minimal sketch of this pattern follows this list).
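
To make the Metrics API v2 and dynatrace_ingest paths concrete, here is a minimal Python sketch (not production code). It assumes the OneAgent’s documented local ingestion endpoint, http://localhost:14499/metrics/ingest, and the metric key, dimension, and value are purely hypothetical:

```python
# Minimal sketch: build one line in the Dynatrace metrics syntax, then either
# print it to stdout (to pipe into dynatrace_ingest or use as a Telegraf exec
# input) or POST it to the local OneAgent metric API. Names are illustrative.
import sys
import urllib.request

ONEAGENT_INGEST_URL = "http://localhost:14499/metrics/ingest"  # local OneAgent endpoint

def build_line(key: str, value: float, **dimensions: str) -> str:
    """Return one metric line: metric.key,dim1=val1,... value"""
    dims = ",".join(f"{k}={v}" for k, v in dimensions.items())
    return f"{key},{dims} {value}" if dims else f"{key} {value}"

def push_to_oneagent(lines: list) -> None:
    """POST the metric lines as plain text to the local OneAgent endpoint."""
    body = "\n".join(lines).encode("utf-8")
    request = urllib.request.Request(
        ONEAGENT_INGEST_URL,
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        print(f"OneAgent responded with HTTP {response.status}", file=sys.stderr)

if __name__ == "__main__":
    # Hypothetical metric: the size of an important file on disk.
    line = build_line("custom.file.size_bytes", 1024, file="messages_log")
    print(line)                   # stdout mode: pipe into dynatrace_ingest or Telegraf exec
    # push_to_oneagent([line])    # or POST straight to the local OneAgent
```

The stdout mode is what makes this style of script usable as a Telegraf exec input or as a feed for dynatrace_ingest, and it is the same pattern used in the Ansible Tower example later in this post.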

These ingestion methods allow Dynatrace to contend with open monitoring platforms, but they’re not without their faults. Before moving on to the example use case and dashboard, let’s look at the most important caveats we discovered with Metrics Ingestion.

Early Adopter Shortcomings

While evaluating this new functionality, a couple of missing features surfaced. Highlighted below are the most challenging issues we faced, each with a proposed solution to remedy the shortcoming.

No Query Language Functions

Problem – The largest shortcoming of the Custom Charts explorer is the limited set of aggregation options it presents.

Example Use Case –

  • If an ingested metric is a COUNT over time, its value can become astronomically large. For a COUNT type of metric, a user may want to see the overall count, but likely the delta is more important.
  • Another example: if a metric needs arithmetic applied to it – say the value of a query needs to be multiplied by 10 or divided by 100 – that’s not possible.
  • And another: when the difference between two queries needs to be calculated (CPU Used – CPU System = CPU not used by the OS), that’s also not possible.

The workaround here is to modify the metrics before they’re sent to Dynatrace, but that’s not practical for a lot of use cases.
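
As one rough sketch of that workaround (the metric name and the scraped totals are hypothetical), the delta of a COUNT metric can be computed client-side before the value is ever sent:

```python
# Sketch of the client-side workaround: convert an ever-growing counter into a
# per-interval delta before sending it, since the chart explorer cannot do this
# at query time today. Metric name and sample values are illustrative.
from typing import Optional

class DeltaTracker:
    """Remembers the previous counter reading and emits only the change."""
    def __init__(self) -> None:
        self._last: Optional[float] = None

    def delta(self, current: float) -> Optional[float]:
        previous, self._last = self._last, current
        if previous is None or current < previous:  # first sample, or counter reset
            return None
        return current - previous

if __name__ == "__main__":
    tracker = DeltaTracker()
    for total_jobs in (100, 140, 190):  # pretend these are successive scrapes
        change = tracker.delta(total_jobs)
        if change is not None:
            # Emit the per-interval delta in the Dynatrace metrics syntax.
            print(f"custom.awx.jobs.completed.delta {change}")
```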

Proposed Solution – Add mathematical operators and query functions. For example, Grafana has dozens built into its product that make data manipulation at query time very easy.

Incomplete Metrics List in Explorer

Problem – The list of metrics presented in the Custom Charts Explorer is not complete, which can be misleading.

Example use case – If a user searches for “awx”, the explorer returns at most 100 metrics with a matching name. Scrolling through that list and exploring the new metrics, the user may believe those 100 metrics are the only ones available, even when more exist, leading to confusion.

Proposed Solution – The list of metrics should indicate whether the list is complete.

New Metrics Registration is Slow

Problem – It can take up to 5 minutes for a new metric to be registered and become queryable in Dynatrace.

Example use case – If you are very familiar with this new Metrics Ingestion, you can send metrics and assume they will properly register. But, when new users are testing out the functionality and developing their workflows, this delay can become a real headache.

Proposed Solution – As soon as a metric has been sent, it should be registered and then shown in the Metrics Explorer. Even if the data itself hasn’t been stored, the metric name should still be queryable near instantaneously.

Although these gaps in functionality are annoying today, the new Metrics Ingestion still allows insightful third-party dashboards to be built.

Example – Ansible Tower Overview Dashboard

At Evolving Solutions, we’re a Red Hat Apex partner and we use a lot of Ansible. If you haven’t seen it yet, Ansible Tower is a very extensible solution for managing your deployment and configuration pipelines. I wanted to try to gather metrics from Ansible Tower’s Metrics API so I could track how many jobs were running and completed.

I wrote two applications which read from the local Ansible Tower Metrics API and scrape those metrics. One of the apps prints the output to stdout, while the other pushes metrics via UDP to the StatsD listening port; a simplified sketch of that StatsD push is shown below. The one which writes to stdout can be used as a Telegraf input (exec input) or piped into the dynatrace_ingest script.
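
Here is a rough Python sketch of the StatsD push, assuming the OneAgent’s StatsD listener on UDP port 18125 mentioned earlier and a DogStatsD-style "|#key:value" dimension syntax; the metric names and values are illustrative:

```python
# Minimal sketch: send gauges in StatsD format, with dimensions appended in the
# "|#key:value" tag style, to the OneAgent's UDP StatsD listener. Host, port,
# metric names, and values are illustrative.
import socket

STATSD_HOST = "localhost"
STATSD_PORT = 18125  # OneAgent StatsD listener (UDP)

def send_statsd_gauge(name: str, value: float, **dimensions: str) -> None:
    """Send one StatsD gauge line, e.g. awx.jobs.running:3|g|#tower:prod"""
    dims = ",".join(f"{k}:{v}" for k, v in dimensions.items())
    line = f"{name}:{value}|g" + (f"|#{dims}" if dims else "")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (STATSD_HOST, STATSD_PORT))

if __name__ == "__main__":
    # Pretend these values were just scraped from the Ansible Tower Metrics API.
    send_statsd_gauge("awx.jobs.running", 3, tower="prod")
    send_statsd_gauge("awx.jobs.pending", 1, tower="prod")
```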

With the data sent to Dynatrace, I built an example dashboard showing how these metrics could be used. In the dashboard, I leveraged:

  • Dynatrace (Agent-gathered) Metrics:
    • Host Health Overview
    • Host Metrics
    • Automatically Detected Problems
  • Ansible Tower Metrics (through the Telegraf metrics ingest):
    • Overall Job Status & Status Over Time (Successful vs Failed vs Cancelled jobs)
    • Tower Job Capacity, number of Executing Jobs, and the number of Pending Jobs
    • Ansible Tower Usage Stats (User count, Organizations count, Workflow count)

As you can see, sending these extra Ansible Tower Metrics to Dynatrace allows us to build a detailed overview of the Ansible Tower platform. With new features like open Metrics Ingestion, Dynatrace is continuing to differentiate itself and disrupt the APM market.

Ansible Tower monitoring is a great use case, but it’s only one of an endless number of possibilities – do you have any systems you’d like deeper insight into with Dynatrace? Reach out to us at Evolving Solutions and we can help you gain complete visibility of your critical systems.

Update (21/01/11) – A previous version of this blog said that metrics could not have dimensions added or removed after they had been set. After speaking with Dynatrace Product Management, we learned that this is not true; we had simply hit an obscure edge case. If you encounter hiccups with the new Metrics Ingestion, click the “Contact Us” button below.

Evolving Solutions Author:
Brett Barrett
Senior Delivery Consultant
Enterprise Monitoring & Analytics
brett.b@evolvingsol.com

Tackling Common Mainframe Challenges

Today, we’re going to talk about the mainframe. Yes, that mainframe which hosts more transactions daily than Google; that mainframe which is used by 70% of Fortune 500 companies; that mainframe which saw a 69% rise in sales last quarter. Over the decades, analysts have predicted the mainframe would go away, particularly in recent years with the ever-expanding public cloud, but it’s hard to outclass the performance and reliability of the mainframe, especially now that a single IBM Z system can process 19 billion transactions per day.

Since the mainframe seems here to stay and it’s a critical IT component, we need to make sure it’s appropriately monitored. If you Google “Mainframe Monitoring Tools”, you’ll find a plethora of bespoke mainframe tools. Most of these tools are great at showcasing what’s happening inside the mainframe…but that’s it. Yes, of course we need to know what’s happening in the mainframe, but when leadership asks, “why are we having a performance issue?”, siloed tools don’t provide the necessary context to understand where the problem lies. So, what can provide this critical context across multiple tools and technologies?

The Dynatrace Software Intelligence Platform was built to provide intelligent insights into the performance and availability of your entire application ecosystem. Dynatrace has been an industry leader in the Gartner Magic Quadrant for APM ever since the quadrant was created, and it’s trusted by 72 of the Fortune 100. Perhaps more pertinent to this blog, Dynatrace is built with native support for the IBM mainframe, in addition to dozens of other commonly used enterprise technologies. With this native support for the mainframe, Dynatrace is able to solve some common mainframe headaches, which we’ll discuss below.

End-to-End Visibility

As mentioned earlier, tailored mainframe tools allow us to understand what’s happening in the mainframe, but not necessarily how or why. Dynatrace automatically discovers the distributed applications, services and transactions that interact with your mainframe and provides automatic fault domain isolation (FDI) across your complete application delivery chain.

In the screenshot above, we can see the end-to-end flow from an application server (Tomcat), into a queue (IBM MQ), and finally to the point where that message was picked up and processed by the mainframe (“CICS on ET01”). With this service-to-service data provided automatically, out of the box, understanding your application’s dependencies and breaking points has never been easier.

Quicker Root Cause Analysis (RCA)

With this end-to-end visibility, RCA time is severely reduced. Do you need to know if it was one user having an issue, or one set of app servers, or an API gateway, or the mainframe? Dynatrace can pinpoint performance issues with automated Problem Cards to give you rapid insight into what’s happening in your environment.

When there are multiple fault points in your applications, Dynatrace doesn’t create alert storms for each auto-baselined metric. Instead, Dynatrace’s DAVIS AI correlates those events, with context, to deliver a single Problem, representing all related impacted entities. An example of a Problem Card is displayed below:

In the screenshot above, there are a couple key takeaways:

  1. In the maroon square, Dynatrace tells you the business impact. When multiple problems arise, you can prioritize by addressing the Problem with the largest impact first.
  2. In the blue square, Dynatrace provides the underlying root cause(s). Yes, we can also see there were other impacted services, but at the end of the day, long garbage-collection times caused the slow response times.
    1. This is a critical example for the mainframe. Yes, the mainframe is not the root cause in this case, but that’s great to know! Now we don’t have to bring in the mainframe team, or even the database team. We can go straight to the frontend app developers and start talking garbage-collection strategies.
  3. Finally, Dynatrace is watching your entire environment with intuitive insights into how your systems interact. Because this problem was so large, over a billion dependencies (green square) were analyzed before a root cause was provided. There is simply no way this could be done manually.

Optimize and Reduce Mainframe Workloads

The IBM mainframe uses a consumption-based licensing model, where cost is tied to the amount of work executed, measured in MSUs (“million service units”). As more and more applications are built that rely on the mainframe, the number of MIPS required increases. Tools that focus only on understanding what happens inside the mainframe can tell you 10,000 queries were made, but not why. Because Dynatrace provides end-to-end visibility from your end user, through your hybrid cloud, and into the mainframe, it can tell you exactly where those queries came from. These insights are critical to identifying potential optimization candidates, and they can help you tackle your MIPS death by a thousand paper cuts.

In the screenshot below, you can see (in green) that 575 messages were read off the IBM MQ queue, but those caused 77,577 interactions on the mainframe! There is likely room for significant optimization here.

Yes, those requests may have executed quickly, but perhaps they could have been optimized so that each message triggered only 10 mainframe calls, or even 5, as opposed to ~135. Without Dynatrace, it is an intense exercise for mainframe admins to track down all of the new types of queries being sent their way.

In Closing

With Dynatrace, all of your teams can share a single pane of glass to visualize where performance degradations and errors are introduced across the entire application delivery chain. With its native instrumentation of IBM mainframe, Dynatrace provides world-class insights into what’s calling your mainframe, and how that’s executing inside the system.

Now that we’ve discussed the common mainframe headaches of end-to-end visibility, root cause analysis, and workload optimization, it’s time to conclude this high-level blog. Hopefully this blog has given you insight into common use cases where Dynatrace provides immense value to mainframe teams, application developers, and business owners. Soon, we’ll be following up with a more technical Dynatrace walkthrough to show you exactly how to get to this data.

Until then, if you have any questions or comments, feel free to reach out to us at ema@evolvingsol.com. We’d love to chat and learn the successes and challenges you have with monitoring your mainframe environments.

Evolving Solutions Author:
Brett Barrett
Senior Delivery Consultant
Enterprise Monitoring & Analytics
brett.b@evolvingsol.com

Proactive Application Performance Monitoring Can Free Up Resources and Boost the Bottom Line

As hybrid cloud use and application complexity continue to grow, organizations of all sizes are increasingly challenged by the need to monitor their applications’ availability and performance. Companies are seeking the best methods to support mission-critical IT functions as they deploy new and more complex applications to the cloud.

The Cost of Application Outages

What happens, then, when these mission-critical applications experience outages or performance issues? In IT shops that are increasingly siloed by discipline, who’s in charge of holistically monitoring the entire application?

Application outages can directly impact the bottom line. “In this day and age of social media, they may find out that their application is slow and not performing acceptably on social media,” says Jaime Gmach, president and CEO of Evolving Solutions. “The financial impact to a company ranges from hundreds of thousands of dollars to millions of dollars—to a company going out of business.”

What’s more, reactive solutions to application outages are costly and inefficient. War rooms, where representatives from various departments (e.g., network, server and firewall teams) all gather in a conference room or on a conference call at all hours of the day or night, tie up resources. These silos of employees often can’t see beyond their particular area of expertise, failing to take a comprehensive look at the entire enterprise’s applications.

“They get all of their best people troubleshooting,” explains Nate Austin, practice director, Enterprise Monitoring and Analytics at Evolving Solutions. “Because it’s their best people, it’s all of their most expensive people.”

Proactive Application Performance Monitoring

That’s where application performance monitoring (APM) comes into the picture. APM gives an organization a clearer picture of the health of their mission-critical applications. As applications become more complex—integrated into other processes and into the cloud—managing applications and identifying where (and why) an issue resides becomes more challenging.

To better address the issues clients are facing with application monitoring, Evolving Solutions launched its Enterprise Monitoring and Analytics practice. The practice offers clients the resources and expertise needed to provide continuous, centralized monitoring of various types of applications. This encompasses collecting and analyzing various performance, availability, and even business-based metrics.

“Being able to measure and optimize all aspects of enterprise performance is critical for our clients that require their applications to be up and operational 100% of the time,” Gmach says. “Without it, they risk having an outage or application slowdown, which can be extremely costly. Our goal is to identify issues before they impact users or, in a worst-case scenario, to find and resolve issues more quickly than previously possible.”

What Sets Evolving Solutions’ Application Performance Monitoring Apart?

By working with Evolving Solutions, clients are connected directly to skilled APM engineers to answer questions, resolve technical issues and, most importantly, to help the organizations modernize their overall monitoring capabilities. This can be difficult for many organizations that don’t have a dedicated monitoring team with cross-silo expertise.

“We’ve demonstrated to clients that we can modernize and automate their mission-critical infrastructure to support digital transformation,” Gmach says. The Enterprise Monitoring and Analytics practice’s consultative approach will build upon that foundation to address IT needs.

“We’re going to be working directly with clients to understand what’s important to them up and down the organization,” Austin says. “We’re not just going to come in, install some software and check some boxes. We’re going to make sure that the client can leverage the software to its fullest, including aligning back to executive goals.”

This type of holistic APM approach can help clients reduce their mean time to resolve problems (MTTR) and eliminate the need for war rooms, which can ultimately boost the bottom line for organizations.

Extending Your Team With Application Performance Monitoring Experts

Austin and his APM team are dedicated to ensuring that clients minimize downtime and keep their mission-critical applications running optimally.

“As digital and business transformation continues to grow at a rapid pace, firms will need to monitor their revenue-generating applications and eliminate disruptions,” says Gmach. “By expanding into a new business practice with our highly knowledgeable staff, Evolving Solutions will help clients improve their digital services, create better end-user experiences and ultimately drive business success.”

Evolving Solutions Author: Bo Gebbie