Monitoring Apple Caching Server

Update 10-04-2017 :

So after running this in test/pre-prod for some time, I realised a couple of problems with my initial configuration.

  • My math was off. I was graphing two values, ‘bytes.fromcache.toclients’ and ‘bytes.fromorigin.toclients’.

This is not correct. What we actually want is the total of ALL data sent to clients, which is the sum of three values: bytes.fromcache.toclients, bytes.fromorigin.toclients and bytes.frompeers.toclients.

Then we can accurately see exactly how much data was served to client devices vs how much data was pulled from Apple over the WAN (bytes.fromorigin.toclients).

  • The way I was importing data into InfluxDB was incorrect. I was importing each metricName from sqlite as its own ‘measurement’ in InfluxDB.

This is not the way it should be done with Influx; rather, we should create a ‘measurement’ and then add fields to that measurement.

The main reason is that Influx cannot do any math between different measurements, i.e. it can’t join measurements or sum across them.

For example, if I wanted to run a query showing the sum of the three metrics mentioned above, and those three metrics were separate measurements in InfluxDB, it would not work.

Luckily, all that was required to fix these two issues was to adjust my Logstash configuration.

Logstash is super flexible and allows for multiple inputs and outputs and basic if/then logic.

I have updated the post below to include the new information.

Summary

So I have a lot of Apple Caching Servers to manage. The problem for me is monitoring their stats. How much data have they saved the site where they are located? Sure you can look at the Server.app stats and get a general idea. You could also look at the raw data from:

serveradmin fullstatus caching

You could also use Cacher, the fantastic script from Erik Gomez, which can trigger server alerts to send email notifications as well as Slack notifications to provide you with some statistics from your caching server.

And this is all great for a small number of caching servers, but once your fleet starts getting up into the 100+ territory, we really need something better. Management are always asking me for stats for this site and that site, or for a mixture of the sites, or a region, or all of the sites combined. Collecting this data and then creating graphs in Excel with the above methods is rather painful.

There has to be a better way!

Enter the ILG stack (InfluxDB, Logstash and Grafana)

If you have had a poke around Server.app 5.2 caching server on macOS 10.12, you may have noticed that there is a Metrics.sqlite database in

/Library/Server/Caching/Logs/

Let’s have a look at what’s in this little database:

$ sqlite3 Metrics.sqlite
SQLite version 3.14.0 2016-07-26 15:17:14
Enter ".help" for usage hints.

Let’s turn on headers and column mode:

sqlite> .headers ON
sqlite> .mode column

Now let’s see what tables we have in here:

sqlite> .tables
statsData  version

statsData sounds like what we want; let’s see what’s in there.

sqlite> select * from statsData;
entryIndex  collectionDate  expirationDate  metricName               dataValue
----------  --------------  --------------  -----------------------  ----------
50863       1487115473      1487720273      bytes.fromcache.topeers  0
50864       1487115473      1487720273      requests.fromclients     61
50865       1487115473      1487720273      imports.byhttp           0
50866       1487115473      1487720273      bytes.frompeers.toclien  0
50867       1487115473      1487720273      bytes.purged.total       0
50868       1487115473      1487720273      replies.fromorigin.tope  0
50869       1487115473      1487720273      bytes.purged.youngertha  0
50870       1487115473      1487720273      bytes.fromcache.toclien  907
50871       1487115473      1487720273      bytes.imported.byxpc     0
50872       1487115473      1487720273      requests.frompeers       0
50873       1487115473      1487720273      bytes.fromorigin.toclie  227064
50874       1487115473      1487720273      replies.fromcache.topee  0
50875       1487115473      1487720273      bytes.imported.byhttp    0
50876       1487115473      1487720273      bytes.dropped            284
50877       1487115473      1487720273      replies.fromcache.tocli  4
50878       1487115473      1487720273      replies.frompeers.tocli  0
50879       1487115473      1487720273      imports.byxpc            0
50880       1487115473      1487720273      bytes.purged.youngertha  0
50881       1487115473      1487720273      bytes.fromorigin.topeer  0
50882       1487115473      1487720273      replies.fromorigin.tocl  58
50883       1487115473      1487720273      bytes.purged.youngertha  0

Well now this looks like the kind of data we are after!

It looks like all the data is stored in bytes, so no conversions to or from KB, MB or TB need to be done. Bonus.
It also looks like each stat or measurement (e.g. bytes.fromcache.topeers) is written to this DB after, or very shortly after, a transaction or event occurs on the caching server, such as a GET request for content from a device. This means we can add all these stats up over a day and get a much more accurate idea of how much data the caching server is seeing.
This solves the problem that the Cacher script by Erik runs into when the server reboots.

In Cacher, the script looks for a summary of how much data the server has served since the service started, by scraping the Debug.log. You have probably seen this in the Debug.log:

2017-02-22 09:41:10.137 Since server start: 1.08 GB returned to clients, 973.5 MB stored from Internet, 0 bytes from peers; 0 bytes imported.

Cacher then checks the last value of the previous day, compares it to the latest value at the end of the reporting day, and works out the difference to arrive at a figure for how much data was served to clients on that report day. While this works great on a stable caching server that never reboots or has the service restart on you, it is a little too fragile for my needs. I’m sure Erik would also like a more robust method of generating that information as well.

Looking back at the Metrics.sqlite DB, if you are wondering about those collectionDate and expirationDate values, they are epoch timestamps. This is another bonus, as they are very easy to convert into human-readable dates with a command like:

$ date -j -f %s 1487115473
Wed 15 Feb 2017 10:37:53 AEDT

It also makes it easy to do comparisons and simple math if you need to.
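For example, a quick sanity check straight from sqlite: a hypothetical query summing everything the server has logged over the last 24 hours (86400 being a day’s worth of seconds):

sqlite> SELECT metricName, sum(dataValue) AS total
   ...>   FROM statsData
   ...>   WHERE collectionDate > strftime('%s','now') - 86400
   ...>   GROUP BY metricName;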

Having all this information in a sqlite database already makes it quite easy-ish for us to pick up this data with Logstash, feed it into an InfluxDB instance and then visualise it with Grafana.


With this setup I was able to very easily show the statistics of all our caching servers at once. Of course we can also drill down into individual schools’ caching servers to reveal those results as well.

YAY PRETTY GRAPHS!

[Screenshot: Grafana dashboard of caching server graphs]

The nuts and bolts

So how do we get this set up? Well, this is not going to be a step-by-step walkthrough, but it should be enough to get you going. You can then make your own changes for how you want to set it up in your own environment; everyone’s prod environment is a little different, but this should be enough to get you going with a PoC environment.

Let’s start by getting Logstash set up on your caching server.

Requirements:

  • macOS 10.12.x +
  • Server.app 5.2.x +
  • Java8
  • Java8JDK
  • The Java JVM script from the always helpful Rich and Frogor, available here

Start by:

  • Getting your caching server up and running.
  • Install Java 8 and the Java 8 JDK.
  • Run the JVM script
  • Confirm that you have Java 8 installed correctly by running  java -version from the command line

If all has gone well, you should get something like this back:

# java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Now we are ready to install Logstash.

  • Download the latest tar ball from here: https://www.elastic.co/downloads/logstash
  • Store it somewhere useful like /usr/local
    • Extract the tar with tar -zxvf logstash-5.2.1.tar.gz -C /usr/local
    • This will extract it into the /usr/local directory for you

Now we need to add some plugins; this is where it gets a little tricky.
If you have authenticated proxy servers, you are going to have a bad time, so let’s pretend you don’t.

Installing Logstash plugins

First, let’s get the plugin that allows Logstash to send output to InfluxDB.

Run the logstash plugin binary and install the plugin logstash-output-influxdb:

$ cd /usr/local/logstash-5.2.1/bin
$ ./logstash-plugin install logstash-output-influxdb
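You can confirm the plugin installed by listing the plugins; logstash-output-influxdb should show up in the output:

$ ./logstash-plugin list | grep influxdb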

Now we will install the SQLite JDBC connector that allows Logstash to access the sqlite db that caching server saves its metrics into.

  • Download the sqlite-jdbc-3.16.1 plugin from here: https://bitbucket.org/xerial/sqlite-jdbc/downloads/
  • Create a directory in our Logstash dir to save it in; I like to put it in ./plugins
    • mkdir -p /usr/local/logstash-5.2.1/plugins
  • Copy the sqlite plugin jar into our new directory
    • cp sqlite-jdbc-3.16.1.jar /usr/local/logstash-5.2.1/plugins/

Ok we now have Logstash installed and ready to go! Next up we make a configuration file to do all the work.

Thinking about the InfluxDB Schema

Before we start pumping data into Influx, we should probably think about how we are going to structure that data.

I came up with a very basic schema: seven ‘measurements’, each of which groups together related metricNames from the caching server sqlite database.

Because of the way math works in InfluxDB, this allows me to write a query like:

SELECT sum("bytes.fromcache.toclients") + sum("bytes.fromorigin.toclients") + sum("bytes.frompeers.toclients") as TotalServed FROM "autogen"."bytestoclients" WHERE "site_code" = '1234' AND $timeFilter GROUP BY time(1d)

This query will add the three metrics together to give a total of all bytes from cache, origin and peers to client devices.
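To make this concrete, once the tags from the filter section further down are attached, a single point in the bytestoclients measurement ends up looking roughly like this in InfluxDB line protocol (the values are just the sample numbers from the sqlite dump above):

bytestoclients,region=Region\ 1,site_name=Site\ Name\ Alpha,site_code=1234,school_type=High\ School bytes.fromcache.toclients=907,bytes.fromorigin.toclients=227064,bytes.frompeers.toclients=0 1487115473000000000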

 

Creating the configuration file

This is the most challenging part, and a huge shoutout goes to @mosen for all his help on this; I definitely wouldn’t have been able to get this far without him.

The configuration file we need contains three basic components, the inputs, a filter and the outputs.

The inputs

The input is where we get our data from; in our case that’s the sqlite DB, so our input is going to be the sqlite jdbc plugin, and we need to configure it so that it knows what information to get and where to get it from.

It’s pretty straightforward and should make sense, but I’ll describe each item below.

We are going to have an input for each measurement we want; this way we can write a sqlite query to pull just the metricNames we want to put into that measurement.

For example, the input below is going to get the following metricNames:

  1. bytes.fromcache.toclients
  2. bytes.fromorigin.toclients
  3. bytes.frompeers.toclients
  4. bytes.fromcache.topeers
  5. bytes.fromorigin.topeers

These are pulled from the sqlite database by running the query in the statement section.

Then we add a label to this input so that we can call it later and ensure the data from this input is put into the correct measurement in the influxdb output.

The label is set by using the type key in the input as you see below

input {
    jdbc {
        jdbc_driver_library => "/usr/local/logstash-5.2.1/plugins/sqlite-jdbc-3.16.1.jar"
        jdbc_driver_class => "org.sqlite.JDBC"
        jdbc_connection_string => "jdbc:sqlite:/Library/Server/Caching/Logs/Metrics.sqlite"
        jdbc_user => ""
        schedule => "* * * * *"
        statement => "SELECT * FROM statsData WHERE metricName LIKE 'bytes.%.toclients' OR metricName LIKE 'bytes.%.topeers'"
        tracking_column => "entryindex"
        use_column_value => true
        type => "bytestoclients"
    }
}

The Logstash documentation is pretty good and describes each of the above items; check out the documentation here.

The only thing to really worry about here is the schedule, which is in regular cron-style format. With the current setting as above, Logstash will check the Metrics.sqlite database every minute and submit information to InfluxDB.

This is probably far too often for a production system; for testing it’s fine though, as you will see almost instant results. But before you go to production you should consider running this on a saner schedule, perhaps every hour or two, or whatever suits your environment.
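For example, to have Logstash poll the database once an hour instead, the schedule line would simply become:

schedule => "0 * * * *"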

So in the completed Logstash config file we will end up with a jdbc input for each sqlite statement or query we need to run to populate the 7 measurements we add to InfluxDB.
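As a sketch, a second jdbc block (sitting alongside the first one inside the same input { } section) for a hypothetical ‘requests’ measurement might look like this; only the statement and type change, and the measurement name here is just an example:

    jdbc {
        jdbc_driver_library => "/usr/local/logstash-5.2.1/plugins/sqlite-jdbc-3.16.1.jar"
        jdbc_driver_class => "org.sqlite.JDBC"
        jdbc_connection_string => "jdbc:sqlite:/Library/Server/Caching/Logs/Metrics.sqlite"
        jdbc_user => ""
        schedule => "* * * * *"
        statement => "SELECT * FROM statsData WHERE metricName LIKE 'requests.%'"
        tracking_column => "entryindex"
        use_column_value => true
        type => "requests"
    }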

The filter

The filter is applied to the data that we have retrieved with the input. This is where we add some extra fields and tags to go with our data, which lets us use some logic to direct the right input to the right output in the Logstash file, as well as group and search our data based on which server the data is coming from.
Think of these fields as a way to ‘tag’ the data coming from this caching server with information about which physical caching server it is.

In my environment I have 4 tags that I want the data to have that I can search on and group with. In my case:

region – This is the physical region where the server is located.

site_name – This is the actual name of the site.

site_code – This is a unique number that each site is assigned.

school_type – In my case this is either primary school or high school.

We are going to use some logic here to add a tag to the inputs depending upon what the type is. We can’t use the ‘type’ directly in the output section, so we have to convert it into a tag; then we can send that to the output and do logic on it.

We also remove any unneeded fields, such as collectiondate and expirationdate, with the date/match/remove_field section.

Then we add our location and server information by adding our tags (region, site_name, etc.) with the mutate function.


filter {
    if [type] == "bytestoclients" {
        mutate {
            add_tag => [ "bytestoclients" ]
        }
     }
    date {
        match => [ "collectiondate", "UNIX" ]
        remove_field => [ "collectiondate", "expirationdate" ]
    }
    mutate {
        add_field => {
            "region" => "Region 1"
            "site_name" => "Site Name Alpha"
            "site_code" => "1234"
            "school_type" => "High School"
         }
    }
}

Again, the Logstash documentation describes how each of these items works pretty well; check here for the documentation.

The important parts above that you might want to modify are the fields that are added with the mutate section.

The output

Now we are getting closer. The output section is where we tell Logstash what to do with all the data we have ingested, filtered and mutated.
Again, all of this is pretty straightforward, but there are a couple of things I’ll talk about:

We are going to use an if statement to check whether the data coming from our input contains a certain string in a tag.

For example, if the string ‘bytestoclients’ exists in a tag, then we should use a certain output.

This allows us to direct the inputs we created above to a specific output. Each output will have a measurement name and a list of fields (data points) that will be sent to Influx.

We have to list each metric name in the coerce_values section to ensure the data is sent as a float or integer; otherwise it will be sent as a string, which is no good for our math.

There is also an open issue on GitHub with the influxdb output plugin where we can’t use a variable to handle this. Ideally we would simply be able to use something like:

coerce_values => {
    "%{metricname}" => "integer"
}

But unfortunately this does not work, and we must list out each metric name, like an animal.

output {
    if "bytestoclients" in [tags] {
        influxdb {
            allow_time_override => true
            host => "my.influxdb.server"
            measurement => "bytestoclients"
            idle_flush_time => 1
            flush_size => 100
            send_as_tags => [ "region", "site_code", "site_name", "school_type" ]
            data_points => {
                "%{metricname}" => "%{datavalue}"
                "region" => "%{region}"
                "site_name" => "%{site_name}"
                "site_code" => "%{site_code}"
                "school_type" => "%{school_type}"
            }
            coerce_values => {
                "bytes.fromcache.toclients" => "integer"
                "bytes.fromorigin.toclients" => "integer"
                "bytes.frompeers.toclients" => "integer"
                "bytes.fromorigin.topeers" => "integer"
                "bytes.fromcache.topeers" => "integer"
            }
            db => "caching"
            retention_policy => "autogen"
        }
    }
}

So really the only interesting things here are:

send_as_tags : This is where we send the fields we created in the mutate section to Influx as tags. The trick here, which is barely documented if at all, is that we also need to specify them as data points.

data_points : Here we add our tags (the extra fields we added in mutate) as data points to send to InfluxDB. We use the %{name} syntax just like we would use a $name variable in bash; this replaces the variable with the content of the field from the mutate section.

retention_policy : This is the retention policy of the InfluxDB database. Again, documentation was a bit hard to find on this one, but the default retention policy is not actually called ‘default’ as seems to be mentioned everywhere; the default policy is actually called ‘autogen’.
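If you want to double check what your InfluxDB instance calls it, the influx CLI will show you; on a fresh database the output should look something like this:

> SHOW RETENTION POLICIES ON caching
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true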

Consult the InfluxDB documentation for more info

Completed conf file

So now that we have those sections filled out, we should have a complete conf file that looks somewhat like this:
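Here is a trimmed sketch of the assembled file, showing just the bytestoclients input and output from above; the remaining measurements are simply extra jdbc inputs and influxdb outputs following exactly the same pattern, and the site details are placeholders for your own:

input {
    jdbc {
        jdbc_driver_library => "/usr/local/logstash-5.2.1/plugins/sqlite-jdbc-3.16.1.jar"
        jdbc_driver_class => "org.sqlite.JDBC"
        jdbc_connection_string => "jdbc:sqlite:/Library/Server/Caching/Logs/Metrics.sqlite"
        jdbc_user => ""
        schedule => "* * * * *"
        statement => "SELECT * FROM statsData WHERE metricName LIKE 'bytes.%.toclients' OR metricName LIKE 'bytes.%.topeers'"
        tracking_column => "entryindex"
        use_column_value => true
        type => "bytestoclients"
    }
    # ...one jdbc block per measurement...
}

filter {
    if [type] == "bytestoclients" {
        mutate {
            add_tag => [ "bytestoclients" ]
        }
    }
    date {
        match => [ "collectiondate", "UNIX" ]
        remove_field => [ "collectiondate", "expirationdate" ]
    }
    mutate {
        add_field => {
            "region" => "Region 1"
            "site_name" => "Site Name Alpha"
            "site_code" => "1234"
            "school_type" => "High School"
        }
    }
}

output {
    if "bytestoclients" in [tags] {
        influxdb {
            allow_time_override => true
            host => "my.influxdb.server"
            measurement => "bytestoclients"
            idle_flush_time => 1
            flush_size => 100
            send_as_tags => [ "region", "site_code", "site_name", "school_type" ]
            data_points => {
                "%{metricname}" => "%{datavalue}"
                "region" => "%{region}"
                "site_name" => "%{site_name}"
                "site_code" => "%{site_code}"
                "school_type" => "%{school_type}"
            }
            coerce_values => {
                "bytes.fromcache.toclients" => "integer"
                "bytes.fromorigin.toclients" => "integer"
                "bytes.frompeers.toclients" => "integer"
                "bytes.fromorigin.topeers" => "integer"
                "bytes.fromcache.topeers" => "integer"
            }
            db => "caching"
            retention_policy => "autogen"
        }
    }
    # ...one influxdb block per measurement...
}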

Install the conf file

  • Create a directory in our logstash dir to store our conf file
    • mkdir -p /usr/local/logstash-5.2.1/conf
  • Create the conf file and move it into this new location
    • cp logstash.conf /usr/local/logstash-5.2.1/conf/

Running Logstash

OK, so now we have Logstash installed and configured, we need a way to get Logstash running and using our configuration file.

Of course this is a perfect place to use a launch daemon. I won’t go into much depth as there are many great resources out there on how to create and use LaunchDaemons.
If you haven’t already, go ahead and check out launchd.info

Here is a LaunchDaemon that I’ve created already; just pop this into your /Library/LaunchDaemons folder, give your machine a reboot, and Logstash should start running.
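A minimal version of such a LaunchDaemon looks something like this (the label is just an example, and the paths assume the install locations used above):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Pick your own reverse-DNS label -->
    <key>Label</key>
    <string>com.example.logstash</string>
    <!-- Run Logstash with the conf file we created earlier -->
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/logstash-5.2.1/bin/logstash</string>
        <string>-f</string>
        <string>/usr/local/logstash-5.2.1/conf/logstash.conf</string>
    </array>
    <!-- Start at boot and restart if it exits -->
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>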

Setting up InfluxDB and Grafana

There are lots and lots of guides on the web for how to get these two items setup, so I won’t go into too much detail. My preferred method of deployment for these kinds of things is to use Docker.

This makes it very quick to deploy and manage the service.

I’ll assume that you already have a machine that is running docker and have a basic understanding of how docker works.
If not, again there are tons of guides out there and it really is pretty simple to get started.

InfluxDB

You can get an InfluxDB instance set up very quickly with the command below. This will create a DB called caching; you can of course give it any name you like, but you will need to remember it when we connect Grafana to it later on.

docker run -d -p 8083:8083 -p 8086:8086 -e PRE_CREATE_DB=caching --expose 8090 --expose 8099 --name influxdb tutum/influxdb

You should now have InfluxDB up and running on your docker machine.
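If you want a quick check from the command line as well, the InfluxDB HTTP API has a ping endpoint that should come back with an HTTP 204 (substitute your own docker host address):

$ curl -i http://your-docker-host:8086/ping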

Port 8083 is the admin web app port, and you can check that your InfluxDB is up and running by pointing your web browser to your docker machine’s IP address on port 8083. You should then get the InfluxDB web app, like this:

[Screenshot: InfluxDB admin web UI]

Grafana

You can also set up Grafana on the same machine with the following command; this will automatically ‘link’ the Grafana instance to InfluxDB and allow communication between the two containers.

docker run -d -p 3000:3000 --link influxdb:influxdb --name grafana grafana/grafana

Now you should also have a Grafana instance running on your docker machine on port 3000. Load up a web browser and point it to your docker machine IP address on port 3000 and you should get the Grafana web app like this:

[Screenshot: Grafana login page]

The default login should be admin/admin

Log in and add a data source.

[Screenshot: Adding the InfluxDB data source in Grafana]

Setting up the dashboards

So now we get to the fun stuff, displaying the data!

Start by creating a new dashboard

[Screenshot: Creating a new dashboard in Grafana]

Now select the Graph panel.

[Screenshot: Selecting the Graph panel]

On the Panel Title select edit

[Screenshot: The Panel Title edit menu]

Now we can get to the guts of it: creating the query to display the information we want.

Under the Metrics heading, click on the A to expand the query.

[Screenshot: Expanding query A under the Metrics heading]

From here it is pretty straightforward, as Grafana helps you out with pop-up menus of the items you can choose:

[Screenshot: Grafana query builder pop-up menus]

What might be a bit strange is that the FROM is actually the retention policy. You might think that the FROM should be the name of the database, but no, it’s the name of the default retention policy, which in our case should be autogen.

If you need to remove an item, just click it and a menu will appear allowing you to remove it. Here’s an example of removing the mean() item:

[Screenshot: Removing the mean() item from the query]

So to display some information you can start with a query like this:

[Screenshot: A starting query selecting the bytes.fromcache.toclients field]

This is going to select data from the caching database, with the retention policy of autogen, from the field called bytes.fromcache.toclients.

Next we are going to select all of the values for that bytes.fromcache.toclients field, by telling it to select field(value).

Then we click the plus next to field(value) and from the Aggregations menu choose sum(); this will add all the values together.

Then we want to display that total grouped by 1 day – time(1d)

This will show us how much data has been delivered to client devices, from the cache on our caching server in 1 day groupings.

Phew, OK, that’s the query done.

But that’s just going to show us how much data came from the “cache”; it’s not going to show us how much data was delivered to clients from cache + peers + origin.

So for that query, we have to do a little trick.

We select the measurement bytestoclients.

Then we select the field bytes.fromcache.toclients, click the plus, and add our other fields to it.

But you might notice that this doesn’t give us a single bar graph like we want; we have to manually edit the query to remove some erroneous commas.

Hit the toggle edit mode button, and then swap the commas for plus symbols so the three sums are added together into a single series.
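Assuming the three fields above, that means editing the raw query from something like this:

SELECT sum("bytes.fromcache.toclients"), sum("bytes.fromorigin.toclients"), sum("bytes.frompeers.toclients") FROM "autogen"."bytestoclients" WHERE $timeFilter GROUP BY time(1d)

to this:

SELECT sum("bytes.fromcache.toclients") + sum("bytes.fromorigin.toclients") + sum("bytes.frompeers.toclients") FROM "autogen"."bytestoclients" WHERE $timeFilter GROUP BY time(1d)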

Now we need to format the graph to look pretty. Under the Axes heading we need to change the unit to bytes.

[Screenshot: Axes settings with the unit set to bytes]

Under the Legend heading, we can also add the Total so that it prints the total next to our measurement on our graph.

[Screenshot: Legend settings with Total enabled]

And to finish it off we will change the display from lines to bars. Under the Display heading check bars and uncheck lines.

[Screenshot: Display settings with bars checked and lines unchecked]

Almost there.

From the top right, let’s select a date range to display, like this week for example.

[Screenshot: Selecting a date range]

AND BOOM!

You can of course change the heading from Panel Title to something more descriptive, add your own headings and axis titles, etc.

Of course you can also add additional queries to the graph so you can see multiple measurements at once for comparison.

For example, we might want to see how much data was sent to clients from the cache, and how much data had to be retrieved from Apple.

We just add another query under the Metrics heading.

So let’s add the data from the bytes.fromorigin.toclients field.

We can also use the WHERE filter to select only the data from a particular caching server rather than all of the caching server data that is being shown above.
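That WHERE clause is just the site_code tag we added in the Logstash filter, so a per-server version of the earlier query ends up looking something like this (1234 being the example site code from the mutate section):

SELECT sum("bytes.fromcache.toclients") + sum("bytes.fromorigin.toclients") + sum("bytes.frompeers.toclients") as TotalServed FROM "autogen"."bytestoclients" WHERE "site_code" = '1234' AND $timeFilter GROUP BY time(1d)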

That should be enough to get you going and creating some cool dashboards for your management types.

 
