Monitoring Docker Containers with AppDynamics

Posted on

As more organizations break down their monolithic applications into dozens of microservices, monitoring containers is more important, and more complicated, than ever. Fortunately, AppDynamics has built Docker Monitoring to give you end-to-end monitoring for all your Docker containers. This lets you know immediately when containers are causing resource contention issues, and it helps you resolve infrastructure-related issues with your containers dynamically. You can see below how AppDynamics gives an overview of your running containers and detail on each of them:

The list of running containers

Container Insights

This information is invaluable, but the AppDynamics documentation can be confusing when you try to work out how to set up container monitoring. Making this even more challenging, the AppDynamics-provided images are, as of this writing, not up to date with the current agents. Evolving Solutions recognized this problem and addressed it by automatically building new Docker images whenever new AppDynamics Machine Agents are released. If you would like to use our images, you can find our repo here. Otherwise, if you'd like to build your own image of the agent, read on.

In this blog, I’m going to walk you through the process of creating your own Docker Image to run the AppDynamics Machine Agent. When you run the image as a sidecar to your Docker containers, it will provide:

  • Infrastructure Information for your containers, such as
    • Container metadata
    • Tags
      • Name-Value pairs derived from Docker/Kubernetes
      • AWS tags where applicable
    • Infrastructure health insights, such as CPU, memory, network, and disk usage

Prerequisites

  1. Container monitoring requires a Server Visibility license, version 4.3.3 or later, for both the Controller and the Machine Agent.
  2. AppDynamics recommends Docker CE/EE 17.03 or Docker Engine 1.13 with this product.
  3. Container Monitoring is not supported on Docker for Windows or Docker for Mac.
  4. Server Visibility must be enabled on the Machine Agent (shown in the sample Dockerfile below).
  5. Docker Visibility must be enabled on the Machine Agent (shown in the sample Dockerfile below).

Creating the Dockerfile

  1. Download the Machine Agent installer for Linux with the bundled JRE from https://download.appdynamics.com/download/, then unzip it and re-zip it as machine-agent.zip.

This step is important: the zip as downloaded is sometimes not read properly and does not produce the expected folder structure. Do this on the machine where the Docker image will be built and run; a sketch of the re-packaging step follows.
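The following is a minimal sketch of that re-packaging, assuming you run it in your working directory. The installer file name varies by agent version, and the machine-agent folder name is an assumption chosen to match the MACHINE_AGENT_HOME path used in the Dockerfile below:

# Illustrative only: substitute the actual bundle name you downloaded
unzip -q machineagent-bundle-64bit-linux-*.zip -d machine-agent
zip -qr machine-agent.zip machine-agent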

  2. Create a directory named MachineAgent on the machine where you will run the Docker instance, e.g. /Users/<username>/Docker/DockerVisibility/MachineAgent (or any directory of your choosing).
  3. Copy machine-agent.zip to this location.
  4. Create a new file called Dockerfile with the following code and give it 744 permissions:
# Sample Dockerfile for the AppDynamics Standalone Machine Agent
FROM ubuntu:14.04
# Install required packages
RUN apt-get update && apt-get upgrade -y && apt-get install -y unzip && apt-get clean
# Install AppDynamics Machine Agent
ENV APPDYNAMICS_HOME /opt/appdynamics
ADD machine-agent.zip /tmp/
RUN mkdir -p ${APPDYNAMICS_HOME} && unzip -oq /tmp/machine-agent.zip -d ${APPDYNAMICS_HOME} && rm /tmp/machine-agent.zip
# Setup MACHINE_AGENT_HOME
ENV MACHINE_AGENT_HOME /opt/appdynamics/machine-agent
# Comment out the following ENV line only if you are using docker-compose to build/run the machine-agent container
ENV MA_PROPERTIES "-Dappdynamics.controller.hostName=<<ControllerHost>> -Dappdynamics.controller.port=<<ControllerPort>> -Dappdynamics.controller.ssl.enabled=false -Dappdynamics.agent.accountName=<<accountname>> -Dappdynamics.agent.accountAccessKey=<<accesskey>> -Dappdynamics.sim.enabled=true -Dappdynamics.docker.enabled=true -Dappdynamics.docker.container.containerIdAsHostId.enabled=true"
# Include start script to configure and start MA at runtime
ADD start-appdynamics ${APPDYNAMICS_HOME}
RUN chmod 744 ${APPDYNAMICS_HOME}/start-appdynamics
# Configure and Run AppDynamics Machine Agent
CMD "${APPDYNAMICS_HOME}/start-appdynamics"

Depending on how you build and run the machine-agent container (i.e. via docker-compose or docker build/docker run), you'll need to comment or un-comment the corresponding portions of this file and of the start script. The start script sets the AppDynamics-specific environment variables needed by the Machine Agent and executes the machineagent.jar file.

5. This Dockerfile will:

    • Use an ubuntu:14.04 base image (you can use any base image you want)
    • Install the unzip package
    • Copy machine-agent.zip to the /tmp directory of the image
    • Extract the Machine Agent artifacts to /opt/appdynamics/machine-agent/
    • Clean up the /tmp directory
    • Copy the Machine Agent startup script, start-appdynamics, into the /opt/appdynamics/ directory
    • Run the script

Note: We are using our own controller parameters in the MA_PROPERTIES environment variable in this Dockerfile. You'll need to substitute your own controller information in this environment variable.

Creating the Docker Start Script

  1. Create another file called start-appdynamics in the same MachineAgent folder with the following:
#!/bin/bash
# Sample Docker start script for the AppDynamics Standalone Machine Agent
# In this example, APPD_* environment variables are passed to the container at runtime
# Uncomment all the lines in the below section when you are using docker-compose to build and run machine-agent container
#MA_PROPERTIES="-Dappdynamics.controller.hostName=${APPD_HOST}"
#MA_PROPERTIES+=" -Dappdynamics.controller.port=${APPD_PORT}"
#MA_PROPERTIES+=" -Dappdynamics.agent.accountName=${APPD_ACCOUNT_NAME}"
#MA_PROPERTIES+=" -Dappdynamics.agent.accountAccessKey=${APPD_ACCESS_KEY}"
#MA_PROPERTIES+=" -Dappdynamics.controller.ssl.enabled=${APPD_SSL_ENABLED}"
#MA_PROPERTIES+=" -Dappdynamics.sim.enabled=${APPD_SIM_ENABLED}"
#MA_PROPERTIES+=" -Dappdynamics.docker.enabled=${APPD_DOCKER_ENABLED}"
#MA_PROPERTIES+=" -Dappdynamics.docker.container.containerIdAsHostId.enabled=${APPD_CONTAINERID_AS_HOSTID_ENABLED}"
# Start Machine Agent
${MACHINE_AGENT_HOME}/jre/bin/java ${MA_PROPERTIES} -jar ${MACHINE_AGENT_HOME}/machineagent.jar
  2. Give the file appropriate permissions so it is readable and executable (e.g. 777).

Creating the Docker Build Script

Create a script called build-docker.sh in the same MachineAgent folder with the following:

docker build -t appdynamics/docker-machine-agent:latest .

Note: This file also needs appropriate permissions (it must be executable). If you use docker-compose, this script is not needed.

Creating the Docker Run Script

Create a script called run-docker.sh in the same MachineAgent folder with the following:

docker run --rm -it -v /:/hostroot:ro -v /var/run/docker.sock:/var/run/docker.sock appdynamics/docker-machine-agent

Note: Give this file appropriate permissions as well (it must be executable). Again, if docker-compose is used, this script is not needed.

Build and Run the Image

To build the image, run ./build-docker.sh; then, to run the Docker image, run ./run-docker.sh. A few optional sanity checks are sketched below.
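These checks are illustrative; run them from a second terminal and substitute the real container ID:

# Confirm the image exists
docker images appdynamics/docker-machine-agent
# Confirm the agent container is running
docker ps --filter "ancestor=appdynamics/docker-machine-agent"
# Follow the agent output; it should report connecting to your controller
docker logs -f <container-id>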

Docker-Compose

If you wish to use docker-compose, create a file named docker-compose.yml in the same MachineAgent directory with the following code.

version: '3'
services:
  docker-machine-agent:
    build: .
    container_name: docker-machine-agent
    image: appdynamics/docker-machine-agent
    environment:
      - APPD_HOST=<<CONTROLLER HOST>>
      - APPD_PORT=<<CONTROLLER PORT>>
      - APPD_ACCOUNT_NAME=<<CONTROLLER ACCOUNT>>
      - APPD_ACCESS_KEY=<<CONTROLLER ACCESS KEY>>
      - APPD_SSL_ENABLED=false
      - APPD_SIM_ENABLED=true
      - APPD_DOCKER_ENABLED=true
      - APPD_CONTAINERID_AS_HOSTID_ENABLED=true
    volumes:
      - /:/hostroot:ro
      - /var/run/docker.sock:/var/run/docker.sock

Use the commands docker-compose build and docker-compose up to build and run the container, respectively; an example follows.
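For example, assuming the docker-compose.yml above sits in the current directory:

# Build the image defined in docker-compose.yml
docker-compose build
# Start the machine agent in the background
docker-compose up -d
# Follow the agent logs to confirm it connects to the controller
docker-compose logs -f docker-machine-agent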

Automation

Would you like to learn how Evolving Solutions used the steps above to automate the build and deploy of newly released AppDynamics Docker agents? Look out for our upcoming blog posts!

Getting started with AppDynamics

If you don’t have an AppDynamics account, you can start your own free trial here.

After you've created your account, you can visit AppDynamics University to view hundreds of bite-size videos covering all things AppD.

Evolving Solutions Author:
Steven Colliton
Delivery Consultant
Enterprise Monitoring & Analytics
steven.c@evolvingsol.com

Dynatrace Metrics Ingest

Posted on

Today we're going to be talking about some exciting new functionality that was recently added to Dynatrace. We've talked about Dynatrace in this blog before, but for those who may not be familiar, Dynatrace is an all-in-one software intelligence platform and a leader in the Gartner Magic Quadrant for APM. Dynatrace has always been a frontrunner in understanding application performance, and its AI and automation help tackle challenges that would otherwise require countless hours of manual effort.

Most of the data captured in Dynatrace up to this point was gathered by the Dynatrace OneAgent or by Dynatrace Extensions, which pull data from APIs. This meant that if a metric wasn't native to Dynatrace, it couldn't be consumed by the Dynatrace platform. But,

  • What if you want to keep track of a certain file’s size on a disk?
  • What if you have an important InfluxDB you want to monitor?
  • What if you want to know the number of currently running Ansible deployments, or the failed ones?

This blog will cover:

  1. A high-level overview of the "New Metrics Ingestion Methods"
  2. A "Cheat Sheet" for selecting which method is best for you
  3. A "Brief Review on Ingestion Methods"
  4. Shortcomings of this Early Adopter release, and what we hope to see in the future
  5. An example – the "Ansible Tower Overview Dashboard"

New Metrics Ingestion Methods

Historically, teams could write OneAgent plugins, but those required development effort and a knowledge of Python. Now that Dynatrace has released the new metric ingestion, any custom metric can be sent to the AI-powered Dynatrace platform more easily than ever. There are four main ways to achieve this: the Dynatrace StatsD implementation, the Metrics API v2 / OneAgent REST API, the Dynatrace Telegraf output, and scripting via the dynatrace_ingest utility.

Dynatrace has already written technical blogs about how to send the metrics (linked above), so this blog will aim to discuss the pros and cons of each method, along with giving a cheat sheet on which path is likely best depending on your business use case.

Cheat Sheet

When deciding which route to take, follow this cheat sheet:

  • Is Telegraf already installed and gathering metrics? Use the Dynatrace Telegraf plugin.
    • Or, does Telegraf have an input plugin built in for the technology that requires monitoring? Telegraf may still be the best route, because capturing the metrics will be effortless.
  • Is something already scraping metrics in StatsD format? Use the StatsD implementation.
  • If none of the above, the best route is likely the Metrics API v2 / OneAgent REST API.

Brief Review on Ingestion Methods

Since Dynatrace has already written about each method, except Telegraf, those details won’t be duplicated in this blog. Instead, here’s a quick overview on each Ingestion Method:

  • Dynatrace StatsD Implementation – If there's an app that's already emitting StatsD-formatted metrics, this implementation is the most direct. The OneAgents listen on port 18125 for StatsD metrics sent via UDP. Dynatrace has enhanced the StatsD protocol to support dimensions (for tagging and filtering). The StatsD format is not as sleek as the new Dynatrace metrics syntax, so this path is not recommended unless StatsD is already present.
  • Metrics API v2 (OneAgent REST API) – There is an API endpoint listening for metrics in the Dynatrace metrics syntax (if you happen to be familiar with InfluxDB's Line Protocol, it's almost identical).
  • Dynatrace Telegraf Output – The latest releases of Telegraf include a dedicated Dynatrace output, which makes sending metrics to Dynatrace extremely easy when Telegraf is installed. Telegraf can either push metrics to the local OneAgent or out to the Dynatrace cluster.
    • If Telegraf is not yet installed, it may still be the easiest route forward if Telegraf natively supports a technology that needs to be monitored. The list of Telegraf "inputs" can be found here. Installing Telegraf is quite easy, and the configuration is detailed well in the Dynatrace documentation.
  • Scripting Languages (Shell) – If code has to be written to output Dynatrace metrics syntax or StatsD metrics, it can be slightly simplified by using the dynatrace_ingest script provided with each OneAgent. Instead of writing networking code to push the metrics, they can simply be piped into this executable (see the sketch after this list).
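To give a flavor of the last two paths, here is a hedged sketch of sending a single data point in the Dynatrace metrics syntax. The metric name and dimension are invented for this example, port 14499 is the OneAgent's default local metric ingestion port, and the dynatrace_ingest path is an assumption that may differ in your installation:

# Push one value to the local OneAgent metric ingestion endpoint (Metrics API v2 syntax)
echo "custom.ansible.jobs.running,tower=tower01 7" | curl -s -X POST --data-binary @- http://localhost:14499/metrics/ingest

# Or pipe the same line into the dynatrace_ingest helper shipped with the OneAgent
echo "custom.ansible.jobs.running,tower=tower01 7" | /opt/dynatrace/oneagent/agent/tools/dynatrace_ingest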

These ingestion methods allow Dynatrace to contend with open monitoring platforms, but they’re not without their own faults. Before moving to the example use case and dashboard, the most important caveats we discovered in Metric Ingestion will be discussed.

Early Adopter Shortcomings

While evaluating this new functionality, a couple of missing features surfaced. Highlighted below are the most challenging issues we faced, each followed by a proposed solution to remedy the shortcoming.

No Query Language Functions

Problem – The largest shortcoming of this Explorer is the limited aggregation options presented.

Example Use Case –

  • If an ingested metric is a COUNT over time, its value can become astronomically large. For a COUNT type of metric, a user may want to see the overall count, but likely the delta is more important.
  • Another example is if there’s a metric which needs arithmetic applied to it – say the value of a query needs to be multiplied by 10 or divided by 100 – it’s not possible.
  • And another example is when the difference between two different queries needs to be calculated (CPU Used – CPU System = CPU not used by OS) – it’s also not possible.

The workaround here is to modify the metrics before they’re sent to Dynatrace, but that’s not practical for a lot of use cases.

Proposed Solution – Add mathematical operators and query functions. For example, Grafana has dozens built into its product that make data manipulation at query time very easy.

Incomplete Metrics List in Explorer

Problem – The list of metrics presented in the Custom Charts Explorer is not complete, which can be misleading.

Example use case – If a user searches for “awx” they will find up to 100 metrics with a matching name. If that user scrolls through the list, exploring the new metrics, they may believe the 100 metrics were the only ones available, leading to confusion.

Proposed Solution – The list of metrics should indicate whether the list is complete.

New Metrics Registration is Slow

Problem – It can take up to 5 minutes for a new metric to be registered and become queryable in Dynatrace.

Example use case – If you are very familiar with this new Metrics Ingestion, you can send metrics and assume they will properly register. But, when new users are testing out the functionality and developing their workflows, this delay can become a real headache.

Proposed Solution – As soon as a metric has been sent, it should be registered and then shown in the Metrics Explorer. Even if the data itself hasn’t been stored, the metric name should still be queryable near instantaneously.

Although these gaps in functionality are annoying at this time, the new Metrics Ingestion still allows for insightful 3rd-party dashboards to be made.

Example – Ansible Tower Overview Dashboard

At Evolving Solutions, we’re a Red Hat Apex partner and we use a lot of Ansible. If you haven’t seen it yet, Ansible Tower is a very extensible solution for managing your deployment and configuration pipelines. I wanted to try to gather metrics from Ansible Tower’s Metrics API so I could track how many jobs were running and completed.

I wrote two applications which read from the local Ansible Tower Metrics API and scrape those metrics. One of the apps prints the output to stdout, while the other pushes metrics via UDP to the StatsD metrics listening port. The one which writes to stdout can be used as a Telegraf input (an exec input) or piped into the dynatrace_ingest script; a simplified StatsD sketch follows.
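For the StatsD path, here is a simplified, hedged sketch of what the UDP push looks like against the OneAgent's StatsD listener on port 18125 (described above); the metric name and dimension are illustrative, not the actual names my apps use:

# Send a single gauge, with one dimension, to the local StatsD listener
echo "ansible.tower.jobs.running:7|g|#tower:tower01" | nc -u -w 1 localhost 18125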

With the data sent to Dynatrace, I made an example dashboard of how these metrics could be used. In the dashboard, I leveraged

  • Dynatrace (Agent-gathered) Metrics:
    • Host Health Overview
    • Host Metrics
    • Automatically Detected Problems
  • Ansible Tower Metrics (through the Telegraf metrics ingest):
    • Overall Job Status & Status Over Time (Successful vs Failed vs Cancelled jobs)
    • Tower Job Capacity, number of Executing Jobs, and the number of Pending Jobs
    • Ansible Tower Usage Stats (User count, Organizations count, Workflow count)

As you can see, sending these extra Ansible Tower Metrics to Dynatrace allows us to build a detailed overview of the Ansible Tower platform. With new features like open Metrics Ingestion, Dynatrace is continuing to differentiate itself and disrupt the APM market.

Ansible Tower monitoring is a great use case, but it’s only one of an endless number of use cases – do you have any systems you’d like deeper monitoring into with Dynatrace? Reach out to us at Evolving Solutions and we can help you gain complete visibility of your critical systems.

(21/01/11) – A previous version of this blog said that metrics could not have dimensions added or removed after they’ve been set. After speaking with Dynatrace Product Management, it was discovered that this is not true, and instead an obscure edge case was encountered. If you encounter hiccups with the new Metrics Ingestion, click the “Contact Us” button below.

Evolving Solutions Author:
Brett Barrett
Senior Delivery Consultant
Enterprise Monitoring & Analytics
brett.b@evolvingsol.com

Tackling Common Mainframe Challenges

Posted on

Today, we're going to talk about the mainframe. Yes, that mainframe which hosts more transactions daily than Google; that mainframe which is used by 70% of Fortune 500 companies; that mainframe which saw a 69% rise in sales over the prior quarter. Over the decades, analysts predicted the mainframe would go away, particularly in recent years with the ever-expanding public cloud, but it's hard to outclass the performance and reliability of the mainframe, especially with new advancements that let a single IBM Z system process 19 billion transactions per day.

Since the mainframe seems here to stay and it's a critical IT component, we need to make sure it's appropriately monitored. If you Google "Mainframe Monitoring Tools", you'll find a plethora of bespoke mainframe tools. Most of these tools are great at showcasing what's happening inside the mainframe… but that's it. Yes, of course we need to know what's happening in the mainframe, but when leadership asks, "why are we having a performance issue?", siloed tools don't provide the necessary context to understand where the problem lies. So, what can provide this critical context across multiple tools and technologies?

The Dynatrace Software Intelligence Platform was built to provide intelligent insights into the performance and availability of your entire application ecosystem. Dynatrace has been an industry leader in the Gartner APM quadrant ever since the quadrant was created and it’s trusted by 72 of the Fortune 100. Maybe more pertinent to this blog, Dynatrace is built with native support for IBM mainframe, in addition to dozens of other commonly-used enterprise technologies. With this native support for mainframe, Dynatrace is able to solve some common mainframe headaches, which we’ll discuss below.

End-to-End Visibility

As mentioned earlier, tailored mainframe tools allow us to understand what’s happening in the mainframe, but not necessarily how or why. Dynatrace automatically discovers the distributed applications, services and transactions that interact with your mainframe and provides automatic fault domain isolation (FDI) across your complete application delivery chain.

In the screenshot above, we can see the end-to-end flow from an application server (Tomcat), into a queue (IBM MQ), and then when that message was picked up and processed by the mainframe “CICS on ET01”. With this service-to-service data provided automatically, out-of-the-box, understanding your application’s dependencies and breaking points has never been easier.

Quicker Root Cause Analysis (RCA)

With this end-to-end visibility, RCA time is severely reduced. Do you need to know if it was one user having an issue, or one set of app servers, or an API gateway, or the mainframe? Dynatrace can pinpoint performance issues with automated Problem Cards to give you rapid insight into what’s happening in your environment.

When there are multiple fault points in your applications, Dynatrace doesn’t create alert storms for each auto-baselined metric. Instead, Dynatrace’s DAVIS AI correlates those events, with context, to deliver a single Problem, representing all related impacted entities. An example of a Problem Card is displayed below:

In the screenshot above, there are a couple key takeaways:

  1. In the maroon square, Dynatrace tells you the business impact. When multiple problems arise, you can prioritize which are the most important by addressing the Problem with the largest impact first.
  2. In the blue square, Dynatrace provides the underlying root cause(s). Yes, we can also see there were other impacted services, but at the end of the day, long garbage-collection times caused the slow response times.
    1. This is a critical mainframe example. The mainframe is not the root cause in this case, and that's great to know! Now we don't have to bring in the mainframe team, or even the database team. We can go straight to the frontend app developers and start talking garbage-collection strategies.
  3. Finally, Dynatrace is watching your entire environment with intuitive insights into how your systems interact. Because this problem was so large, over a billion dependencies (green square) were analyzed before a root cause was provided. There is simply no way this could be done manually.

Optimize and Reduce Mainframe Workloads

The IBM Mainframe uses a consumption-based licensing model, where cost is related to how much work is executed (measured in MSUs, "million service units"). As more and more applications are built that rely on the mainframe, the number of MIPS required increases. Tools that focus only on understanding what happens in the mainframe can tell you 10,000 queries were made, but not why. Because Dynatrace provides end-to-end visibility from your end user to your hybrid cloud into the mainframe, it can tell you exactly where those queries came from. These insights are critical to identifying potential optimization candidates, and can help you tackle your MIPS death by a thousand paper cuts.

In the screenshot below, you can see (in green) that 575 messages were read off IBM MQ, but those messages then caused 77,577 interactions on the mainframe! Likely, there is room for significant optimization here.

Yes, those requests may have executed quickly, but maybe they could have been optimized so that each message triggered only 10 mainframe calls, or even 5, as opposed to the roughly 135 observed here (77,577 ÷ 575 ≈ 135). Without Dynatrace, it is an intense exercise for mainframe admins to track down all of the new types of queries being sent to them.

In Closing

With Dynatrace, all of your teams can share a single pane of glass to visualize where performance degradations and errors are introduced across the entire application delivery chain. With its native instrumentation of IBM mainframe, Dynatrace provides world-class insights into what’s calling your mainframe, and how that’s executing inside the system.

Now that we’ve discussed the common mainframe headaches of end-to-end visibility, root cause analysis, and workload optimization, it’s time to conclude this high-level blog. Hopefully this blog has given you insight into common use cases where Dynatrace provides immense value to mainframe teams, application developers, and business owners. Soon, we’ll be following up with a more technical Dynatrace walkthrough to show you exactly how to get to this data.

Until then, if you have any questions or comments, feel free to reach out to us at ema@evolvingsol.com. We’d love to chat and learn the successes and challenges you have with monitoring your mainframe environments.

Evolving Solutions Author:
Brett Barrett
Senior Delivery Consultant
Enterprise Monitoring & Analytics
brett.b@evolvingsol.com

IBM Reinvents the Mainframe Again – IBM z15 Boosts New Features

Posted on

As with each IBM Mainframe announcement, the IBM Z15 server is no stranger to innovative features.  This article gives a perspective on two exciting features of IBM’s latest mainframe technology, the IBM Z15.

Compression Acceleration

The first feature is known as Compression Acceleration and is made possible by a new on-chip accelerator known as the Nest Acceleration Unit, or NXU.  The NXU is the functional replacement for a formerly priced feature known as zEDC (zEnterprise Data Compression).  This means the zEDC adapters you may have invested in on prior mainframe servers will not carry forward; the NXU becomes the functional replacement for that capability.  That is actually a good thing!  An image of the NXU is found in Figure 1, courtesy of IBM.

FIGURE 1: Nest Acceleration Unit on the Z15

Figure 1 shows the IBM Z15 chip with twelve (12) cores; the cores are your processors.  The element circled in red is the NXU, which is shared by all processor cores on the chip.  Unlike the zEDC feature, which could be used by at most 15 LPARs, the NXU can be shared by all LPARs on the server, up to 85.

When you compare the two features, you soon realize the raw compression throughput possible with zEDC is 1 GB/second per zEDC adapter.  Prior Z System servers allowed you to purchase up to 16 adapters per server, providing an architectural throughput rate of 16 GB/second.  The usable rate on those cards was half that value, in that for every card you placed into production, you invested in a backup card in case the primary card failed.  The raw throughput of the NXU is 26 GB per Z15 core.  Based on IBM benchmarks, the largest IBM Z15 with this integrated compression accelerator compresses up to 260 GB per second1.

Mainframe clients use this feature to compress large files.  Doing so reduces the amount of time spent moving data via I/O operations and that, in turn, lowers their IBM software costs.  Not a bad trade-off when you think about it, and the feature is now a no-cost part of the Z15 server.

System Recovery Boost 

The second feature is known as System Recovery Boost.  This feature offers faster system shutdown, restart and workload catchup.  Instant Recovery is an alternate name for this offering.

System Recovery Boost is where the operating system brings on additional capacity to speed up OS shutdown, OS restart, and the catch-up of work that may be queued due to the scheduled or unscheduled outage. System Recovery Boost is capable of taking your sub-capacity General Purpose CPs and running them at full speed. This is known as a Speed Boost.

System Recovery Boost is even capable of allowing General Purpose workload to run on zIIP engines.  This is known as a zIIP Boost and provides additional capacity and parallelism to accelerate processing during the boost periods. IBM refers to this as blurring the CPs and zIIPs together.

Find out more about this capability from IBM at System Recovery Boost and through the IBM z15 Redpiece.

Understanding Sub-capacity CP Speed Boost

IBM's Speed Boost only applies to sub-capacity servers, e.g. 4xx, 5xx and 6xx models.  Client LPARs that are running in a boost period access their engines as 7xx models. The remaining LPARs on the same server run at the sub-capacity setting purchased by the client.  By way of example, consider Figure 2.

FIGURE 2: Speed Boost example.

Looking at the preceding figure, when LPARs enter a boost period, work that is dispatched from LPiD3 runs at CP7 (full capacity). Other LPARs continue to be dispatched at CP5 (sub-capacity). One boost period is started at LPiD3 shutdown and a new boost period started at re-IPL of LPiD3.

Now, you might be asking “How does the operating system know you are shutting down LPiD3 and the shutdown boost period of thirty (30) minutes should start?”  It’s quite simple. Operations staff will issue a START command against Started Procedure IEASDBS (Shut Down Boost Start).

Upon re-IPL of LPiD3, Boost would be “On by Default” for that image, offering up sixty (60) minutes of boosted capacity to get the operating system and subsystems up along with allowing workload to continue processing at an accelerated pace for the duration of the Boost period.

Understanding zIIP capacity Boost (zIIP Boost)

For those familiar with this platform, you know that zIIPs traditionally only host DRDA, IPSec and IBM Db2 utility workloads, along with non-IBM software solutions that have chosen to leverage the zIIP API.  During System Recovery Boost, if you have at least one zIIP engine available to the LPAR, it can run both traditional zIIP-eligible workload and General Purpose CP workload.  As noted earlier, IBM dubs this capability CP blurring.  Just like Speed Boost, zIIP Boost lasts thirty (30) minutes on shutdown and sixty (60) minutes on restart.

So, what runs on the zIIP during the boost period? The short answer – any program!2

Understanding zIIP Turbo Boost

Unlike IBM's System Recovery Boost, which is a no-charge feature, the zIIP Turbo Boost is a priced feature consisting of:

  • FC 9930, a no-charge entitlement feature code.
  • FC 6802, a temporary pre-paid zIIP boost record, effectively providing you, on an annual basis, 20 additional zIIP engines that you can activate during boost periods. You are allowed to activate this feature up to thirty (30) times per year.

What is vital is you must remember to activate the boost record (FC 6802) before your boost event (shutdown and subsequent restart).  In addition, you must have at least one zIIP already online to the LPAR that will be boosted.

When you are planning out how many physical zIIPs you will add to your server out of the maximum 20, and how many reserved zIIPs you will define, Evolving Solutions recommends that you work with us.  We will use an IBM supplied tool to better understand the impact on your server and the LPAR topology when you add the additional physical zIIPs.

This same tool will even warn you if adding the additional physical zIIPs may cause an LPAR to cross a drawer boundary on a CPC that could lead to performance irregularities. Weights can be adjusted if you know about them in advance to ensure this exposure is mitigated.

Platform Positioning

Platform positioning is pretty straightforward.  First, you must invest in a Z15 processor, and you were going to do that anyway.  At the same time, you must be running either z/OS V2R3 or V2R4 with the following APARs applied: OA57849 (z/OS), OA57552 (CPM), OA57478 (CIM), and OA56683 (RMF).  The latter can easily be positioned beforehand, as you will be installing Device Support maintenance.

For zIIP Boost, you also need to have one or more zIIPs defined in the image activation profile (either as initial or reserved processors), have physical zIIPs installed, have HiperDispatch enabled (it defaults to on), and be running with shared, not dedicated, processors.  Pretty straightforward for those that know this platform.

Initial System Setup

As mentioned earlier in this article, System Recovery Boost is enabled by default and can be controlled via the BOOST= parameter in your IEASYSxx parmlib member.  The z/OS V2R4 Initialization and Tuning Reference has been updated to reflect this new keyword; see Figure 3 and the value summary that follows.

FIGURE 3: BOOST keyword in IEASYSxx
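For reference, the BOOST keyword accepts a small set of values; the summary below is a sketch based on the V2R4 documentation, so verify it against the Initialization and Tuning Reference for your release:

BOOST=SYSTEM   Enable both the Speed Boost and the zIIP Boost (the default)
BOOST=ZIIP     Enable the zIIP Boost only
BOOST=SPEED    Enable the Speed Boost only
BOOST=NONE     Disable System Recovery Boost for this image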

You should also review the image profile settings for the LPARs that will be boosted, ensuring that at least one zIIP is available and that the weights you have specified are reasonable.  Keep in mind that any RESERVED CPs and/or zIIPs that are also physically available on the processor will be brought online automatically as part of the boost period and taken back offline automatically at its conclusion.

Lastly, you must update your LPAR Shutdown Procedures to include “START IEASDBS” at the start of the shutdown process.

Operational Changes

IBM has introduced a new operator command, DISPLAY IPLINFO,BOOST.  This command tells you the BOOST system parameter specification for the LPAR.  You can also go to the HMC and double-click the Image Details panel; when you do, you see a new value:  System Recovery Boost: On|Off.

There is also a pair of new cataloged procedures.  Earlier in this article, we told you about IEASDBS (Shut Down Boost Start).  There is also IEABE (Boost End), if you want to end the boost period early for some reason.  These would be placed in the appropriate PROCLIB concatenation on your system.  An illustrative command sequence follows.
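Pulling the operational pieces together, a typical boosted shutdown and restart might look like the following from the console (illustrative; fold these into your own shutdown and startup procedures):

START IEASDBS           Begin the 30-minute shutdown boost, then shut down the image
(re-IPL the LPAR)       The 60-minute restart boost is on by default
DISPLAY IPLINFO,BOOST   Confirm the BOOST specification after the re-IPL
START IEABE             End the boost period early, if desired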

Automation Considerations

It comes as no surprise that there are automation considerations.  For example, at this URL: https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.izsb100/sysrb_automationconsiderations.htm there are several new messages to consider.

Two messages are worth calling out in this article: IEA676I and IEA678I.  Both signal the end of the boost period.  What automation changes might you consider?  Here is a short list for starters:

  • Activating your purchased zIIP Turbo Boost temporary capacity record before shutdown, and then deactivating it after the boost period is complete.
  • Dynamically changing LPAR weights as required during a shutdown or startup boost, to avoid vertical-low zIIPs.
  • Adding the start of the IEASDBS proc to your existing shutdown automation.
  • Changing the level of parallelism present in the workload at startup and shutdown. The odds are high your automation solution paces these activities today; with Boost, more parallelism will be desired.

Performance Considerations

IBM has also published a very informative System Recovery Boost white paper. When you study the credits on this white paper, you quickly realize the authors are the who's who from IBM Poughkeepsie that genuinely know this platform!

A couple of key takeaways:

  • IBM System Recovery Boost will benefit IPL and those clients that are sub-capacity. Once the operating system is up, both the Speed Boost and the zIIP Boost benefit clients.
  • During IPL, remember the boost period is sixty minutes, so you should make sure the boost is focused on meaningful activities.  One example of the wrong activity is IPL device enumeration: excessive time spent on this activity means less time in support of workload catch-up.
  • The white paper goes further by recommending a classic Redbook to revisit; the publication number is SG24-7816-00 and it centers on Z Systems shutdown and restart resiliency.

In closing

  • System Recovery Boost will significantly reduce the time required to shut down z/OS images. At the same time, your restart process is greatly improved, especially when zIIP engines are present.
  • From an Evolving Solutions perspective, justifying software currency for your enterprise is a bit easier, in that rolling IPLs for maintenance changes complete much faster with this capability.
  • IBM's zIIP Boost and Turbo Boost are worth evaluating as well. Consider for a moment that a zIIP costs approximately $145K per year with maintenance. The 3-year cost of the Turbo Boost will be $500K, while the 3-year cost of 20 zIIPs is $3M. The odds are high you can find $0.5M USD worth of value by investing in the Turbo Boost feature.

If you are interested in learning more about the IBM Z15 server, or even in considering platform services that promote software currency, feel free to reach out to the author via LinkedIn or send the author an email at Jim.F@evolvingsol.com. Find more on Evolving Solutions mainframe services here.

FOOTNOTES

1  IBM z15 On Chip Compression

2  See IBM’s zIIP Authorized Use Table for a description on what workload is allowed to run on zIIP processors ftp://public.dhe.ibm.com/systems/support/warranty/pdfs/aut/Authorized_Use_Table_09-2019_en_US.pdf

Thoughts on IBM Z Systems Off-Platform Migrations

Posted on

Have you committed to migrating off your mainframe platform in favor of alternative solutions?  If you have, you are not alone.  As powerful as this platform is, clients dependent on this technology commonly look for ways to reduce costs, and their mainframe is an easy target.  However, many off-platform migrations have been going on for years, and they have yet to be completed.

A typical off-platform migration starts with a directive from senior management to functionally stabilize the platform.  What this means is the IT staff is directed to no longer invest in hardware and software updates.  This results in software stacks that are extremely dated and long since withdrawn from IBM marketing, running on aging mainframe server technology, not to mention very real staffing challenges over time.

Those familiar with this platform who have embraced a functionally stabilized posture will find themselves supporting their business on older operating system releases such as z/OS 1.9 or z/OS 1.13, hosting transaction and database managers that are decades old.  Even though their platform is "functionally stabilized," these customers still have the obligation to maintain their monthly license payments to IBM and third-party suppliers.  All the while, they sit on a platform that has not been updated but faithfully serves up data and transactions at rates that would cause Google search engine advocates to blush.

Consider for just a moment that Google receives over 63,000 searches per second on any given day. That’s 5.6 billion searches per day1.  Sounds impressive, doesn’t it?

Now consider the fact that IBM Mainframes that run CICS handle more than 1.1 million transactions per second worldwide. That’s more than 95 billion transactions per day2.  If you have a mainframe, the odds are high you are running CICS as your transaction manager.

Your mainframe, though functionally stabilized, remains extremely important to your organization, and it’s not serving up YouTube or Funny Cat videos; it’s running your business and our economy.

Are there consequences to your organization when you stabilize such an important platform?  Here at Evolving Solutions, we believe the consequences are very real.  So do the clients that we support.

One consequence is that customers will experience operational challenges as their company attempts to support a platform that is extremely back-level.  What happens if you have a software problem?  How do you obtain service and support for software and/or hardware that is no longer supported?  The short answer is, you can’t.  If you try, it’s going to cost you.

Another example: customers often reach an inflection point where hardware and software currency is no longer possible due to the underlying mainframe server technology that supports their business.  A practical example of this challenge is clients that have limited resources available on their large system server. For example, IBM z/OS version 2.3 requires a minimum of 8 GB of real storage to initialize, as well as a server platform that supports the IPL.   Clients with limited real storage or on back-level IBM hardware are impacted by this constraint, which prevents them from upgrading and exposes their enterprise to the following challenges:

  • Increased business risk driven by limited hardware currency & flexibility
  • Reduced business agility
  • Increased cost of ownership

Further, IBM's z/OS coexistence policy supports N-2 migrations only. Migration to IBM's z/OS V2R3 operating system is only possible if you are on V2R2 or V2R1.  What if you are on version 1 of z/OS (and there are customers out there on that version)?  You must do a side-by-side migration.

Side-by-side migrations typically lead to two distinct challenges: the first is high services costs as your enterprise scrambles to move your mainframe platform back into a supported environment; the second is schedule slippage, which adds further cost driven by the software migration timelines put forth by vendors.

The preceding discussion represents challenges that are faced by every mainframe client attempting to move away from their platform.  Many of these enterprises recognize that their mainframe is hosting workloads that remain important to their business, even while their migration occurs. Some even reconsider their need to migrate off completely, effectively recommitting to the platform and taking advantage of the strengths this platform naturally offers.

We can help you overcome these challenges and get you on a path to currency! If you are one of those clients that are reconsidering their off-platform migration strategy, please contact Evolving Solutions so we can get to work customizing a solution for you.

Jim Fyffe is a solutions architect focused on IBM Z, platform security, LinuxONE, open source, and Geographically Dispersed Parallel Sysplex. Some may say his alter ego is the Flash! Find more from Jim Fyffe on LinkedIn.

Footnotes

1 See 63 Fascinating Google Search Statistics at this URL: https://seotribunal.com/blog/google-stats-and-facts/

2 According to Marc Staimer of Dragon Consulting: “CICS handles more than 1.1 million transactions per second worldwide. That’s more than 95 billion transactions per day.” Sep 21, 2018