When Apple Caching Server just can’t even right now

Update: This was logged as a bug to Apple and has been resolved in iOS 10 and macOS 10.12

See http://www.openradar.me/radar?id=4958891762778112 for details

Background

Apple Caching Server is pretty cool and it really makes a lot of sense in a large environment.

However, large environments often have a rather complex network topology which makes configuration and troubleshooting a little more difficult.

I just happen to work in a very large environment with a complex network topology.

We have many public WAN IPs which our client devices and Apple caching servers use to get out to the internet (via authenticated proxies, no less).

Apple has pretty good, if occasionally ambiguous, documentation on configuring Apple Caching for complex networks here: http://help.apple.com/serverapp/mac/5.1/#/apd6015d9573

Essentially we have a network that looks a little bit like this:

[Diagram: complex network topology]

Apple Caching Server supports this network topology; however, we need to give our client devices access to a DNS TXT record in their default search domain so that each client device knows all of our WAN IP ranges.

So how does this caching server thing work on the client anyway?

There is a small binary/framework on the client device that does a ‘discovery’ of Apple caching servers approximately every hour – or if it has not yet populated a special file on disk, it will run immediately when requested by a content downloading service such as the App Store.

This special binary does the discovery by grabbing some info about the client device, such as the LAN IP address and subnet range. It then looks up our special DNS TXT record (_aaplcache._tcp.) and sends all of this data to the Apple locator service at: lcdn-locator.apple.com
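To see what the discovery process would find, the TXT lookup can be reproduced by hand. The sketch below is only an illustration: the function names and example.com are placeholders, dig is assumed to be installed, and the prs= range syntax is taken from a reader's example in the comment thread of this post rather than from Apple's documentation.

```shell
#!/bin/sh
# Sketch: reproduce the caching-server TXT lookup by hand.
# Function names and example.com are hypothetical placeholders.

# Build the TXT record name for a given search domain.
aaplcache_record() {
    printf '_aaplcache._tcp.%s\n' "$1"
}

# Pull the value of the prs= key (public IP ranges) out of a TXT answer.
# Assumes the "prs=start-end" form quoted by a commenter on this post.
extract_prs() {
    printf '%s\n' "$1" | tr -d '"' | tr ' ' '\n' | sed -n 's/^prs=//p'
}

# Usage (requires dig and network access):
#   dig +short TXT "$(aaplcache_record example.com)"
```

If the dig command prints nothing for your search domain, the client has no WAN ranges to hand to the locator service.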

Apple then matches the WAN IP ranges and the LAN IP ranges provided and sends back a config that the special process writes out to disk. This config file contains the URL of the caching server that the client should use (if one has been registered).

This special file on disk is called diskCache.plist. If the client has successfully located a caching server, you should see a line like this in the file:

"localAddressAndPort" => "10.10.10.10:49313"

Where 10.10.10.10:49313 is the IP address and port of the caching server the client should use.

Now this diskCache.plist file exists in a folder called com.apple.AssetCacheLocatorService inside /var/folders. The exact location is under the directory reported by the DARWIN_USER_CACHE_DIR configuration variable, which can be revealed by running:

getconf DARWIN_USER_CACHE_DIR

Which should output a directory path like this:

/var/folders/yd/y87k7kk14494j_9c0y814r8c0000gp/C/

Then you can just use plutil -p to read diskCache.plist:

sudo plutil -p /var/folders/yd/y87k7kk14494j_9c0y814r8c0000gp/C/com.apple.AssetCacheLocatorService/diskCache.plist

And it should print the contents of the plist, including the localAddressAndPort line shown above.

*Thanks to n8felton for the info about /var/folders!
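The lookup steps above can be wrapped in a couple of small helpers. This is a minimal sketch, not an official tool: the function names are made up for illustration, and it assumes plutil -p prints the key in the "localAddressAndPort" => "address:port" form shown earlier.

```shell
#!/bin/sh
# Sketch: find the caching server a client has been assigned.
# Function names are hypothetical; getconf and plutil are used as in the post (macOS only).

# Build the diskCache.plist path from a DARWIN_USER_CACHE_DIR value.
locator_plist_path() {
    # Strip one trailing slash so the result has exactly one separator.
    printf '%s/com.apple.AssetCacheLocatorService/diskCache.plist\n' "${1%/}"
}

# Read `plutil -p` output on stdin and print the first server address:port found.
extract_cache_server() {
    sed -n 's/.*"localAddressAndPort" => "\([^"]*\)".*/\1/p' | head -n 1
}

# Usage on a Mac (the comment thread suggests running as root):
#   plutil -p "$(locator_plist_path "$(getconf DARWIN_USER_CACHE_DIR)")" | extract_cache_server
```

An empty result means the plist holds no caching server entry, which is the symptom described in the rest of this post.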

Now all of this is fine and no problem, it all works as expected.

Except when it doesn’t.

At some sites, we were seeing a failure of client devices to pull their content from their caching server. The client device would simply pull its content over the WAN.

After a lot of trial and error and wire-sharking (is that a thing?) we found the problem.

As I mentioned earlier, _some_ client devices were not able to pull their content from the caching server. After investigating on the client, we found that these devices were not populating their diskCache.plist with the caching server information from the Apple locator service.

How come?

Well, in our environment we use an AD RODC (read-only domain controller) at each site. This RODC also operates as a DNS server, and it is the primary DNS server provided to clients via DHCP.

We have a few “issues” with our RODCs from time to time, and quite often we just shut them down and let the clients talk to our main DCs and DNS servers over the WAN. However, when we shut down the RODCs, we don’t remove them from the DHCP server’s DNS option. So clients still receive a DHCP packet listing our now turned-off RODC as the primary DNS server, along with a secondary and a third DNS server that they will use.

As expected, the clients seem quite happy with this. They are able to perform DNS lookups and browse the internet even though their primary DNS server is non-responsive.

BUT it seems that the special little caching service discovery tool on the client devices does not fail over and use the secondary (or third) DNS server. It seems that this tool only does the DNS lookup for our TXT record against the primary DNS server.

So because this DNS TXT record lookup fails, the caching service discovery tool doesn’t get a list of WAN IP address ranges to send to the Apple locator URL and thus never gets a response back about which caching server it should use!
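One way to reproduce this behaviour is to ask each DNS server for the TXT record individually, so the resolver's normal failover cannot hide a dead primary. The following is a rough sketch under a few assumptions: dig is installed, and the function names, IP addresses and example.com are placeholders rather than anything from our environment.

```shell
#!/bin/sh
# Sketch: ask each DNS server for the TXT record on its own, with no failover.
# Function names, IPs and example.com are hypothetical placeholders.

# Query one server once, with a short timeout (dig is assumed to be installed).
query_txt() {
    dig "@$1" +short +time=2 +tries=1 TXT "$2" 2>/dev/null
}

# Usage: check_servers RECORD SERVER...
# Prints "SERVER ok" or "SERVER FAILED" for each server, in order.
check_servers() {
    record=$1
    shift
    for server in "$@"; do
        # A failed query returns non-zero; a missing record returns no text.
        if answer=$(query_txt "$server" "$record") && [ -n "$answer" ]; then
            printf '%s ok\n' "$server"
        else
            printf '%s FAILED\n' "$server"
        fi
    done
}

# Usage:
#   check_servers _aaplcache._tcp.example.com 10.0.0.1 10.0.0.2 10.0.0.3
```

If the first server listed reports FAILED while the others report ok, the client is in the situation described above.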

The fix.

Once we manually removed the non-responsive primary DNS server from the DHCP options, so that client devices received only our two functional DNS servers as primary and secondary, the caching service discovery tool was able to look up our DNS TXT record and receive the correct caching server URL from the Apple locator service. Everything is right in the world again!
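To confirm that a client has picked up the corrected DHCP options, it helps to list the DNS servers it is actually using. The helper below is a small sketch: the function name is made up, and it assumes that scutil --dns on macOS prints lines of the form "nameserver[0] : 10.0.0.1".

```shell
#!/bin/sh
# Sketch: list the DNS servers a Mac is currently using, in order, without duplicates.
# list_nameservers is a hypothetical helper; it reads `scutil --dns` output on stdin.

list_nameservers() {
    # Keep only the address from each "nameserver[N] : ADDRESS" line,
    # then drop repeats while preserving first-seen order.
    sed -n 's/^[[:space:]]*nameserver\[[0-9]*][[:space:]]*:[[:space:]]*//p' |
        awk '!seen[$0]++'
}

# Usage on a Mac:
#   scutil --dns | list_nameservers
```

After the fix, the turned-off RODC should no longer appear at the top of this list.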

13 comments

  1. Great post. As someone who will probably be implementing an enterprise-level Caching Server (especially after reading some recent release notes), I found this very insightful.

    I wanted to share that the `/var/folders` location isn’t all that random. You should also be able to find it by running a command like `cd $(getconf DARWIN_USER_CACHE_DIR)com.apple.AssetCacheLocatorService` as root (sudo -i).

    http://www.magnusviri.com/OS_X_Admin/what-is-var-folders.html has some great information about /var/folders. Just thought I’d share.

    1. Nice, thanks for that. I remember reading about DARWIN_USER_CACHE_DIR ages ago, but that link has some pretty handy info in there!

    1. I don’t believe it is possible. Perhaps someone could write an app … maybe if I get some free time that might be something I try.

      1. Any chance you figured anything more out on the iOS front? I’m trying to figure out why the iOS devices on my network can’t see the caching server.

      2. No, but you can use `AssetCacheLocatorUtil` on 10.12 to help you. There are so many different network topologies that you need to configure the caching server to suit yours. It is not a case of just turning it on and hoping for the best.

  2. Hello,

    Not sure if this has been asked before, but how would you suggest a caching solution be deployed to cover a true enterprise network with many locations using MPLS / WAN connectivity? The concept of a caching server seems great if you have one deployed at each remote site, but if there are 300+ remote sites with < 10 Mbps of WAN bandwidth, then the solution does not seem to work.

    How do other larger enterprise customers manage OS and app updates for remote locations with 50, 75, 100+ devices without killing their WAN links?

    1. The only way is with a caching server at each remote site, because that is where the link is. It makes no sense to put it further up the topology. It has to be at the edge to be effective in reducing WAN traffic.

      I contracted for a large education department with some 2,000+ remote sites.

      They were instructed to purchase a specific Mac mini model to our specs, then they simply ran our “caching server” Imagr workflow.

      I developed a deployment workflow using NetBoot and Imagr that would automatically build a caching server and configure it based on AD Sites and Services, using the subnet information for the site out of the siteObjectBL AD attribute.

      End users simply click the “caching server” workflow and all the work is done for them. They don’t even get login access to the caching server. It’s not required, since we manage the config centrally. We even monitor the usage stats; see my post on monitoring caching server with influxdb and grafana.

      We currently have around 200 caching servers out in the field, all managed by Munki.

  3. Hi all,

    I am trying to configure a caching server to meet ISP requirements. I followed all the suggestions and opened up all networks in the configuration, but it will not start serving.

    To summarize, we have this scenario:

    Cache server (Public IP: 1.1.1.1/20)
    Users (Public IPs range: 2.2.2.2/24)

    We have a public domain for the reverse DNS, and in that domain we added _aaplcache._tcp for the Users network (TXT with prs=2.2.2.2-2.2.2.255).
    Nothing happened.

    What are we doing wrong ? Can you help ?

    Thanks
