An Overview of Performance Monitoring at PBS

Posted by Jon Brendsel on

Over the past three years we have been slowly building a set of PBS products and services intended to serve the next generation of PBS viewers, who will be consuming our digital content. During this period we have launched the following services (APIs):

  • Video (COVE) API
  • Merlin (Content Metadata) API
  • TV Schedules Service (TVSS) API
  • Universal User Authentication (UUA)
  • Localization Service

Consuming these services are a growing number of audience-facing experiences, including:

  • Mobile products including PBS Video for iPad, PBS Video for iPhone, PBSKids Video for iOS (iPhone and iPad)
  • National Video Portal (COVE API, Merlin API)
  • Station Video Portals (COVE API, Merlin API)

As we mature architecturally and build and reuse these new services, the flip side is that we grow increasingly dependent on them operating at very high levels of reliability and performance.

To monitor this reliability and performance, we have been building out a suite of tools and services:

  • Rightscale.  As many of you know, we use Rightscale as a managed services solution for our AWS infrastructure.  One of the things that Rightscale brings to the party is a pretty sophisticated machine-level monitoring capability built on the collectd framework.  We use this to alert us to worrisome changes in CPU, memory, disk, and network I/O.  Rightscale also has many plugins written for standard applications such as Apache and memcached.
  • Pingdom.  We use Pingdom extensively to monitor the end-to-end performance of our services and experiences from various points of presence (PoPs) in the Pingdom network nationwide.
  • Splunk.  Several years ago we invested in a pretty hefty (50 GB) and expensive license to Splunk.  We have deployed Splunk agents (forwarders) on all of our AWS instances.  These forwarders, by default, index all of the logs in /var/log/* and make them available for search.  We have built out an extensive set of dashboards dedicated to each of our primary user experiences and services (APIs) that measure performance over time.
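
To make the end-to-end idea a little more concrete, here is a minimal sketch in Python of the kind of check an external monitor runs against a service: time a full HTTP request, then grade the result against response-time thresholds. This is not Pingdom's API or our actual configuration. The function names, the 1-second and 3-second thresholds, and any URL you pass in are assumptions made up for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool               # True only when the service answered with HTTP 200
    status: Optional[int]  # HTTP status code, or None if the request failed outright
    elapsed_ms: float      # wall-clock time for the full request, in milliseconds


def http_get_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # read the whole body so the timing covers the full response
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def run_check(
    url: str,
    fetch: Callable[[str], int] = http_get_status,
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Time one request. `fetch` and `clock` are injectable so the check can be tested offline."""
    start = clock()
    try:
        status: Optional[int] = fetch(url)
    except Exception:
        status = None  # DNS failure, timeout, connection refused, and so on
    elapsed_ms = (clock() - start) * 1000.0
    return CheckResult(url=url, ok=(status == 200), status=status, elapsed_ms=elapsed_ms)


def classify(result: CheckResult, warn_ms: float = 1000.0, crit_ms: float = 3000.0) -> str:
    """Grade a result. The thresholds are illustrative assumptions, not real alert settings."""
    if not result.ok or result.elapsed_ms >= crit_ms:
        return "critical"
    if result.elapsed_ms >= warn_ms:
        return "warning"
    return "ok"
```

Calling classify(run_check(url)) on a schedule from several locations gives a rough picture of what a hosted service does for you. The real value of a hosted monitor is the network of PoPs and the alerting around it, which a sketch like this does not attempt to reproduce.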

I will be delving into our Splunk setup and monitoring philosophy more thoroughly in coming posts.