Monitoring Apple Caching Server

Summary

So I have a lot of Apple Caching Servers to manage. The problem for me is monitoring their stats. How much data have they saved the site where they are located? Sure, you can look at the Server.app stats and get a general idea, or you can look at the raw data from:

serveradmin fullstatus caching

You could also use the fantastic Cacher script from Erik Gomez, which can trigger server alerts to send email notifications, as well as Slack notifications, to provide you with some statistics from your caching server.

And this is all great for a small number of caching servers, but once your fleet starts getting up into the 100+ territory, we really need something better. Management is always asking me for stats for this site and that site, or for a mixture of sites, or a region, or all of the sites combined. Collecting this data and then creating graphs in Excel with the above methods is rather painful.

There has to be a better way!

Enter the ILG stack (InfluxDB, Logstash and Grafana)

If you have had a poke around the Server.app 5.2 caching server on macOS 10.12, you may have noticed that there is a Metrics.sqlite database in

/Library/Server/Caching/Logs/

Let's have a look at what's in this little database:

$ sqlite3 Metrics.sqlite
SQLite version 3.14.0 2016-07-26 15:17:14
Enter ".help" for usage hints.

Let's turn on headers and column mode:

sqlite> .headers ON
sqlite> .mode column

Now let's see what tables we have in here:

sqlite> .tables
statsData  version

statsData sounds like what we want; let's see what's in there:

sqlite> select * from statsData;
entryIndex  collectionDate  expirationDate  metricName               dataValue
----------  --------------  --------------  -----------------------  ----------
50863       1487115473      1487720273      bytes.fromcache.topeers  0
50864       1487115473      1487720273      requests.fromclients     61
50865       1487115473      1487720273      imports.byhttp           0
50866       1487115473      1487720273      bytes.frompeers.toclien  0
50867       1487115473      1487720273      bytes.purged.total       0
50868       1487115473      1487720273      replies.fromorigin.tope  0
50869       1487115473      1487720273      bytes.purged.youngertha  0
50870       1487115473      1487720273      bytes.fromcache.toclien  907
50871       1487115473      1487720273      bytes.imported.byxpc     0
50872       1487115473      1487720273      requests.frompeers       0
50873       1487115473      1487720273      bytes.fromorigin.toclie  227064
50874       1487115473      1487720273      replies.fromcache.topee  0
50875       1487115473      1487720273      bytes.imported.byhttp    0
50876       1487115473      1487720273      bytes.dropped            284
50877       1487115473      1487720273      replies.fromcache.tocli  4
50878       1487115473      1487720273      replies.frompeers.tocli  0
50879       1487115473      1487720273      imports.byxpc            0
50880       1487115473      1487720273      bytes.purged.youngertha  0
50881       1487115473      1487720273      bytes.fromorigin.topeer  0
50882       1487115473      1487720273      replies.fromorigin.tocl  58
50883       1487115473      1487720273      bytes.purged.youngertha  0

Well now this looks like the kind of data we are after!

Looks like all the data is stored in bytes, so no conversions from MB, KB or TB need to be done. Bonus.
It also looks like each stat or measurement, e.g. bytes.fromcache.topeers, is written to this DB after, or very shortly after, a transaction or event occurs on the caching server, such as a GET request for content from a device. This means that we can add all these stats up over a day and get a much more accurate idea of how much data the caching server is seeing.
This solves the problem that the Cacher script by Erik runs into when the server reboots.

In Cacher, the script looks for a summary of how much data the server has served since the service started by scraping Debug.log. You have probably seen lines like this in the debug log:

2017-02-22 09:41:10.137 Since server start: 1.08 GB returned to clients, 973.5 MB stored from Internet, 0 bytes from peers; 0 bytes imported.

Cacher then checks the last value of the previous day, compares it to the latest value for the end of the report-period day, and works out the difference to arrive at a figure for how much data was served for that report day. While this works great on a stable caching server that never reboots or has the service restart on you, it is a little too fragile for my needs. I'm sure Erik would also like a more robust method to generate that information as well.

Looking back at the Metrics.sqlite DB, if you are wondering about those collectionDate and expirationDate values, they are epoch timestamps, which is also a bonus as these are very easy to convert into something human readable with a command like:

$ date -j -f %s 1487115473
Wed 15 Feb 2017 10:37:53 AEDT

Epoch timestamps also make it easy to do comparisons and simple math if you need to.
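For example, a single query straight against the database can turn those epoch timestamps into calendar days and total a measurement per day, which is exactly the per-day figure Cacher has to infer from the debug log. This is just a rough sketch picking one metric; adjust the metricName to taste:

$ sqlite3 /Library/Server/Caching/Logs/Metrics.sqlite \
    "SELECT date(collectionDate, 'unixepoch', 'localtime') AS day,
            SUM(dataValue) AS bytes
     FROM statsData
     WHERE metricName = 'bytes.fromcache.toclients'
     GROUP BY day;"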

Having all this information in a sqlite database already makes it quite easy-ish for us to pick up this data with Logstash, feed it into an InfluxDB instance and then visualise it with Grafana.


With this setup I was able to very easily show the statistics of all our caching servers at once. Of course, we can also drill down into individual schools' caching servers to reveal those results as well.

YAY PRETTY GRAPHS!

[Screenshot: Grafana dashboard of caching server stats]

The nuts and bolts

So how do we get this set up? Well, this is not going to be a step-by-step walkthrough, but it should be enough to get you going. You can then make your own changes for how you want to set it up in your own environment; everyone's prod environment is a little different, but this should be enough to get you going with a PoC environment.

Let's start by getting Logstash set up on your caching server.

Requirements:

  • macOS 10.12.x +
  • Server.app 5.2.x +
  • Java 8
  • Java 8 JDK
  • The Java JVM script from the always helpful Rich and Frogor, available here

Start by:

  • Getting your caching server up and running.
  • Installing Java 8 and the Java 8 JDK.
  • Running the JVM script.
  • Confirming that you have Java 8 installed correctly by running java -version from the command line.

If all has gone well, you should get something like this back:

# java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Now we are ready to install Logstash.

  • Download the latest tarball from here: https://www.elastic.co/downloads/logstash
  • Store it somewhere useful like /usr/local
    • Extract the tar with tar -zxvf logstash-5.2.1.tar.gz -C /usr/local
    • This will extract it into the /usr/local directory for you

Now we need to add some plugins; this is where it gets a little tricky.
If you have authenticated proxy servers, you are going to have a bad time, so let's pretend you don't.

Installing Logstash plugins

First, let's get the plugin that will allow Logstash to send output to InfluxDB.

Run the logstash-plugin binary and install logstash-output-influxdb:

$ cd /usr/local/logstash-5.2.1/bin
$ ./logstash-plugin install logstash-output-influxdb
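If you want to confirm it installed, logstash-plugin can list what it has; the grep just trims the output:

$ ./logstash-plugin list | grep influxdb
logstash-output-influxdb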

Now we will install the SQLite JDBC connector that allows Logstash to access the SQLite DB that the caching server saves its metrics into.

  • Download the sqlite-jdbc-3.16.1.jar driver from here: https://bitbucket.org/xerial/sqlite-jdbc/downloads/
  • Create a directory in our Logstash dir to save it; I like to put it in ./plugins
    • mkdir -p /usr/local/logstash-5.2.1/plugins
  • Copy the SQLite driver into our new directory
    • cp sqlite-jdbc-3.16.1.jar /usr/local/logstash-5.2.1/plugins/

Ok we now have Logstash installed and ready to go! Next up we make a configuration file to do all the work.

Creating the configuration file

This is the most challenging part, and a huge shoutout goes to @mosen for all his help on this; I definitely wouldn't have been able to get this working without him.

The configuration file we need contains three basic components: an input, a filter, and an output.

The input

The input is where we get our data from; in our case it's the SQLite DB, so our input is going to be the SQLite JDBC plugin, and we need to configure it so that it knows what information to get and where to get it from.

It's pretty straightforward and should make sense, but I'll describe the important items below:

input {
    jdbc {
        jdbc_driver_library => "/usr/local/logstash-5.2.1/plugins/sqlite-jdbc-3.16.1.jar"
        jdbc_driver_class => "org.sqlite.JDBC"
        jdbc_connection_string => "jdbc:sqlite:/Library/Server/Caching/Logs/Metrics.sqlite"
        jdbc_user => ""
        schedule => "* * * * *"
        statement => "SELECT * FROM statsData"
        tracking_column => "entryindex"
        use_column_value => true
    }
}

The Logstash documentation is pretty good and describes each of the above items; check out the documentation here

The only thing to really worry about here is the schedule. This is in regular cron-style format; with the setting above, Logstash will check the Metrics.sqlite database every minute and submit the information to InfluxDB.

This is probably far too often for a production system; for testing it's fine though, as you will see almost instant results. But before you go to production you should consider running this on a saner schedule, like perhaps every hour or two, or whatever suits your environment.
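For example, an hourly schedule uses standard cron syntax, so the input block would just change to something like:

        schedule => "0 * * * *"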

The filter

The filter is applied to the data that we have retrieved with the input, so here is where we are going to add some extra fields to go with our data.
Think of these fields as a way to ‘tag’ the data coming from this caching server with information about which physical caching server it is.

In my environment I have 3 tags that I want the data to have, which I can search on and group with. In my case:

  • region – the physical region where the server is located
  • site_code – a unique number that each site is assigned
  • school_type – in my case this is either primary school or high school

filter {
    date {
        match => [ "collectiondate", "UNIX" ]
        remove_field => [ "collectiondate" ]
    }
    date {
        match => [ "expirationdate", "UNIX" ]
        remove_field => [ "expirationdate" ]
        target => "expiration"
    }
    mutate {
        add_field => {
            "region" => "Northern region"
            "site_code" => "1234"
            "school_type" => "Primary School"
        }
    }
}

Again, the Logstash documentation does a pretty good job of describing how each of these items works; check here for the documentation

The important parts above that you might want to modify are the fields that are added with the mutate section.

The output

Now we are getting closer. The output section is where we tell Logstash what to do with all the data we have ingested, filtered, and mutated.
Again, all of this is pretty straightforward, but there are a couple of things that I'll talk about:

output {
    influxdb {
        allow_time_override => true
        host => "my.influxdb.server"
        measurement => "%{metricname}"
        send_as_tags => [ "region", "site_code", "school_type" ]
        data_points => {
            "value" => "%{datavalue}"
            "region" => "%{region}"
            "site_code" => "%{site_code}"
            "school_type" => "%{school_type}"
        }
        coerce_values => {
            "value" => "integer"
        }
        db => "caching"
        retention_policy => "autogen"
    }
}

Really, the only interesting things here are:

send_as_tags: This is where we send the fields we created in the mutate section to InfluxDB as tags. The trick here, which is barely documented if at all, is that we also need to specify them as data points.

data_points: Here we need to add our tags (the extra fields we added with mutate) as data points to send to InfluxDB. We use the %{name} syntax just like we would use a $name variable in bash; this will replace the variable with the content of the field from the mutate section.

retention_policy: This is the retention policy of the InfluxDB database. Again, documentation was a bit hard to find on this one, but the default retention policy is not actually called default, as it seems to be mentioned everywhere; the default policy is actually called 'autogen'.
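You can check this for yourself from the influx CLI; on a stock install the output should look something like this:

> SHOW RETENTION POLICIES ON caching
name     duration  shardGroupDuration  replicaN  default
----     --------  ------------------  --------  -------
autogen  0s        168h0m0s            1         true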

Completed conf file

Now that we have those sections filled out, we should have a complete conf file that looks somewhat like this (it is simply the three sections above combined; adjust the paths, tags, and InfluxDB host for your environment):
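input {
    jdbc {
        jdbc_driver_library => "/usr/local/logstash-5.2.1/plugins/sqlite-jdbc-3.16.1.jar"
        jdbc_driver_class => "org.sqlite.JDBC"
        jdbc_connection_string => "jdbc:sqlite:/Library/Server/Caching/Logs/Metrics.sqlite"
        jdbc_user => ""
        schedule => "* * * * *"
        statement => "SELECT * FROM statsData"
        tracking_column => "entryindex"
        use_column_value => true
    }
}

filter {
    date {
        match => [ "collectiondate", "UNIX" ]
        remove_field => [ "collectiondate" ]
    }
    date {
        match => [ "expirationdate", "UNIX" ]
        remove_field => [ "expirationdate" ]
        target => "expiration"
    }
    mutate {
        add_field => {
            "region" => "Northern region"
            "site_code" => "1234"
            "school_type" => "Primary School"
        }
    }
}

output {
    influxdb {
        allow_time_override => true
        host => "my.influxdb.server"
        measurement => "%{metricname}"
        send_as_tags => [ "region", "site_code", "school_type" ]
        data_points => {
            "value" => "%{datavalue}"
            "region" => "%{region}"
            "site_code" => "%{site_code}"
            "school_type" => "%{school_type}"
        }
        coerce_values => {
            "value" => "integer"
        }
        db => "caching"
        retention_policy => "autogen"
    }
}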

Install the conf file

  • Create a directory in our logstash dir to store our conf file
    • mkdir -p /usr/local/logstash-5.2.1/conf
  • Create the conf file and move it into this new location
    • cp logstash.conf /usr/local/logstash-5.2.1/conf/

Running Logstash

OK, so now we have Logstash all installed and configured, we need a way to get Logstash running and using our configuration file.

Of course, this is a perfect place to use a LaunchDaemon. I won't go into much depth, as there are many great resources out there on how to create and use LaunchDaemons.
If you haven't already, go ahead and check out launchd.info

You will need a LaunchDaemon that runs Logstash with our conf file; just pop it into your /Library/LaunchDaemons folder, give your machine a reboot, and Logstash should start running.
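A minimal sketch might look something like this (the label is just a placeholder, and the paths assume the install locations used above):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Placeholder label; rename to suit your org -->
    <key>Label</key>
    <string>com.example.logstash</string>
    <!-- Run Logstash with the conf file we created above -->
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/logstash-5.2.1/bin/logstash</string>
        <string>-f</string>
        <string>/usr/local/logstash-5.2.1/conf/logstash.conf</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>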

Setting up InfluxDB and Grafana

There are lots and lots of guides on the web for how to get these two items set up, so I won't go into too much detail. My preferred method of deployment for these kinds of things is to use Docker.

This makes it very quick to deploy and manage the service.

I’ll assume that you already have a machine that is running docker and have a basic understanding of how docker works.
If not, again there are tons of guides out there and it really is pretty simple to get started.

InfluxDB

You can get an InfluxDB instance set up very quickly with the command below. This will create a DB called caching; you can of course give it any name you like, but you will need to remember it when we connect Grafana to it later on.

docker run -d -p 8083:8083 -p 8086:8086 -e PRE_CREATE_DB=caching --expose 8090 --expose 8099 --name influxdb tutum/influxdb

You should now have InfluxDB up and running on your docker machine.
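If you want a quick sanity check from the command line, InfluxDB's /ping endpoint should return a 204 (replace <docker-host> with your Docker machine's address):

$ curl -s -o /dev/null -w "%{http_code}\n" http://<docker-host>:8086/ping
204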

Port 8083 is the admin web app port, and you can check that your InfluxDB is up and running by pointing your web browser at your Docker machine's IP address on port 8083. You should then get the InfluxDB web app, like this:

[Screenshot: InfluxDB admin web UI]

Grafana

You can also set up Grafana on the same machine with the following command. This will automatically 'link' the Grafana instance to the InfluxDB container and allow communication between the two containers.

docker run -d -p 3000:3000 --link influxdb:influxdb --name grafana grafana/grafana

Now you should also have a Grafana instance running on your docker machine on port 3000. Load up a web browser and point it to your docker machine IP address on port 3000 and you should get the Grafana web app like this:

[Screenshot: Grafana login page]

The default login should be admin/admin

Log in and add a data source
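Assuming the Docker setup above, the data source settings should look roughly like this (the influxdb hostname comes from the --link, caching is the DB we pre-created, and the proxy access mode lets Grafana resolve the linked hostname):

Type:     InfluxDB
URL:      http://influxdb:8086
Access:   proxy
Database: caching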

[Screenshot: adding the InfluxDB data source in Grafana]

Setting up the dashboards

So now we get to the fun stuff, displaying the data!

Start by creating a new dashboard

[Screenshot: creating a new dashboard]

Now select the Graph panel.

[Screenshot: selecting the Graph panel]

On the Panel Title select edit

[Screenshot: the panel edit menu]

Now we can get to the guts of it: creating the query to display the information we want.

Under the Metrics heading, click on the A to expand the query.

[Screenshot: expanding query A under the Metrics heading]

From here it is pretty straightforward, as Grafana will help you by giving you pop-up menus of the items you can choose:

[Screenshot: Grafana query builder pop-up menus]

What might be a bit strange is that the FROM is actually the retention policy; you might think that the FROM should be the name of the database. But no, it's the name of the default retention policy, which in our case should be autogen.

If you need to remove an item, just click it and a menu will appear allowing you to remove it. Here's an example of removing the mean() item:

[Screenshot: removing the mean() aggregation]

So to display some information you can start with a query like this:

[Screenshot: a starting query]

This is going to select all the data from the database caching, with the retention policy of autogen, in the measurement called bytes.fromcache.toclients

Next we are going to select all of those values in that bytes.fromcache.toclients measurement, by telling it to select field(value)

Then we click the plus next to field(value) and, from the aggregations menu, choose sum(); this will add all of the values together.

Then we want to display that total grouped by 1 day – time(1d)

This will show us how much data has been delivered to client devices, from the cache on our caching server in 1 day groupings.
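If you prefer to see it as raw InfluxQL, the query Grafana builds should be roughly equivalent to this ($timeFilter is the placeholder Grafana fills in with the selected time range):

SELECT sum("value") FROM "autogen"."bytes.fromcache.toclients" WHERE $timeFilter GROUP BY time(1d)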

Phew, OK, that's the query done.

Now we need to format the graph to look pretty. Under the Axes heading we need to change the unit to bytes.

[Screenshot: setting the Y axis unit to bytes]

Under the Legend heading, we can also add the Total so that it prints the total next to our measurement on our graph.

[Screenshot: enabling the Total value in the legend]

And to finish it off we will change the display from lines to bars. Under the Display heading check bars and uncheck lines.

[Screenshot: switching the display from lines to bars]

Almost there.

From the top right, let's select a date range to display, like this week for example.

[Screenshot: selecting the date range]

AND BOOM!

[Screenshot: the finished bar graph]

You can of course change the heading from Panel Title to something more descriptive, add your own headings and axis titles, etc.

Of course you can also add additional queries to the graph so you can see multiple measurements at once for comparison.

For example, we might want to see how much data was sent to clients from the cache, and how much data had to be retrieved from Apple.

We just add another query under the metrics heading.

[Screenshot: adding a second query under the Metrics heading]

So let's add the data from the bytes.fromorigin.toclients measurement.

[Screenshot: query for bytes.fromorigin.toclients]

We can also use the WHERE filter to select only the data from a particular caching server rather than all of the caching server data that is being shown above.
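In raw InfluxQL terms that just adds a tag condition to the query; for instance, to limit it to a hypothetical site tagged 1234:

SELECT sum("value") FROM "autogen"."bytes.fromcache.toclients" WHERE "site_code" = '1234' AND $timeFilter GROUP BY time(1d)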

[Screenshot: filtering by site_code with a WHERE clause]

That should be enough to get you going and creating some cool dashboards for your management types.

 
