Strange Juniper SRX CPU spikes – tracking the bugger

Last week I identified a potential issue with my Juniper SRX firewalls. When the CPU spikes occur, neither the routing engine CPU nor the traffic ever looks unusually high. PPS is pretty normal too.

I found a potential correlation with a BSD process whose spikes cause the kernel CPU to spike. A high CPU percentage for the g_down thread in FreeBSD indicates that higher-level entities, i.e. user processes, are trying to access physical devices (disk, storage, memory, I/O) at such a high rate that there is a resource crunch. This in turn drives the kernel processes to high values, so you will see g_down running high at the same time as the kernel-level processes.

I’ve finished writing a monitoring tool that will track this. Hopefully it is sensitive enough to either support or disprove the theory.

There are other things we need to start monitoring, but this will hopefully give me more visibility into the performance of the Juniper firewalls.

It’s an APM component attached to a Linux node in Orion. It also graphs these two statistics, and it can be monitored from anywhere because of the unfortunately convoluted way it was created.

I will add more monitors based upon the JunOS shell script wrapper I created as time goes by.

You may have better ideas for implementing this, but this was the quickest way for me right now.

I created a read-only account on the Juniper firewalls. This allows access to the JunOS CLI, but not to the shell. Now, I could get around this by running it from cron directly on the firewall and setting up SSH keys to connect to the secondary nodes in the cluster and so on. However, Orion has a strange way of working with a monitor, so this is how I have implemented it for now. Unfortunately, it does not attach to the actual firewall nodes, but I have created a custom monitoring page with each component purely for the Juniper layer.

The first script is the expect script (srx.exp), which authenticates on the firewall and runs the given command.


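#!/usr/bin/expect -f
# Usage: srx.exp <password> <ipaddr> <command> [args...]
# lindex returns the plain string; lrange would return a (possibly quoted) list
set password [lindex $argv 0]
set ipaddr [lindex $argv 1]
set scriptname [lindex $argv 2]
# Remaining words of the JunOS command
set arg1 [lrange $argv 3 10]
# Wait indefinitely for the firewall to respond
set timeout -1
spawn ssh <username>@$ipaddr $scriptname $arg1
match_max 100000
expect "*?assword:*"
send -- "$password\r"
send -- "\r"
expect eof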

The bootstrap shell script:

#!/bin/bash
# Emit a Message/Statistic pair for the g_down CPU usage on each cluster node
echo "Message.<fw_name>gdN0: <fw_name> g_down Node 0 CPU"
echo "Message.<fw_name>gdN1: <fw_name> g_down Node 1 CPU"
echo "Statistic.<fw_name>gdN0:$(/Monitoring/Juniper/srx.exp <password> <fw_cluster_ip> show system processes node 0 detail | grep g_down | awk '{print $3}')"
echo "Statistic.<fw_name>gdN1:$(/Monitoring/Juniper/srx.exp <password> <fw_cluster_ip> show system processes node 1 detail | grep g_down | awk '{print $3}')"
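
If the SSH session fails or the output format changes, the grep/awk pipeline above prints nothing and the Statistic line ends up without a value. Here is a minimal sketch of a stricter extraction step, assuming the value is still the third field as in the script above. The function name extract_gdown is hypothetical, and stripping a trailing % sign is an assumption about the output format rather than something confirmed here.

```shell
#!/bin/bash
# Hypothetical helper: reads command output on stdin and prints the third
# whitespace-separated field of the first line that mentions g_down.
# Returns non-zero (and prints nothing) if no numeric value is found.
extract_gdown() {
    local value
    # expect runs ssh on a pty, so lines end in \r\n; drop the carriage returns
    value=$(tr -d '\r' | awk '/g_down/ {print $3; exit}')
    # Assumption: the value may carry a trailing percent sign
    value=${value%\%}
    if [[ $value =~ ^[0-9]+([.][0-9]+)?$ ]]; then
        printf '%s\n' "$value"
    else
        return 1
    fi
}
```

It would replace the grep and awk stages, e.g. `srx.exp ... | extract_gdown`, and the caller can then decide what to print when it returns non-zero.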

This all runs from a cron job on the Linux monitoring host every 60 seconds:

* * * * * /Monitoring/Juniper/ > /Monitoring/Juniper/SRX.stats
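
One thing to watch with a redirect like this: the shell truncates SRX.stats before the script starts, and the script then has to wait on its SSH sessions, so anything reading the file in that window sees an empty or partial file. A minimal sketch of one way around it is to write to a temporary file and rename it into place once the command has finished. The function name write_atomically is hypothetical, and the chmod assumes the file needs to be readable by another account.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command and replace the target file only
# after the command has finished successfully.
# Usage: write_atomically <target-file> <command> [args...]
write_atomically() {
    local target=$1 tmp
    shift
    # Create the temp file next to the target so mv is a same-filesystem rename
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        # mktemp creates the file with mode 600; assumption: others must read it
        chmod 644 "$tmp"
        mv -f "$tmp" "$target"
    else
        # Keep the previous stats file and discard the failed output
        rm -f "$tmp"
        return 1
    fi
}
```

The cron entry would then call a small script that runs `write_atomically /Monitoring/Juniper/SRX.stats` followed by the bootstrap script, instead of redirecting directly.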

The output is:

Message.<fw_name>gdN0: <fw_name> g_down Node 0 CPU
Message.<fw_name>gdN1: <fw_name> g_down Node 1 CPU

From here I created a new Application Performance Monitor template within Orion. It basically just cats that stats file. I did think about making things happen on the firewall (as mentioned above), but I decided I wanted to keep this all together on the Linux host. This way, I can expand the capability with new monitoring features, attach it to any Juniper node globally pretty easily, and only have one place to edit any code.
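
A plain cat will keep returning the last values forever if the cron job stops running. As a sketch only, the component script could check the age of the file first. The function name emit_stats and the age threshold are hypothetical, `stat -c %Y` is the GNU coreutils form found on Linux, and it is an assumption that the Orion script monitor treats a non-zero exit status as a failed poll.

```shell
#!/bin/bash
# Hypothetical component script body: print the stats file only if it is fresh.
# Usage: emit_stats <stats-file> <max-age-seconds>
emit_stats() {
    local file=$1 max_age=$2 now mtime
    if [[ ! -r $file ]]; then
        echo "stats file $file is missing or unreadable" >&2
        return 1
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat: modification time in epoch seconds
    if (( now - mtime > max_age )); then
        echo "stats file $file is older than ${max_age}s" >&2
        return 1
    fi
    cat "$file"
}
```

For example, `emit_stats /Monitoring/Juniper/SRX.stats 180` would tolerate a couple of missed cron runs before reporting a problem.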

Hope this helps!