CloudSearch for Merlin

Posted by Jon Brendsel

Source Content

Continuing our explorations into Amazon's new CloudSearch platform, I decided to index all of the PBS Merlin content.  As you know, Merlin is a repository of metadata describing all content that PBS has published, principally web pages and videos for both national and station content.  Here is an example record for a video object.

{'webobject_type': 'Video', 'tv_rating': None, 'text': "May 11, 2012 | New Children's Hospital\nThe downtown CHRISTUS Santa Rosa hospital in San Antonio will undergo an expansion to transform it into a Tier 1 Childrens Hospital. Christus had been working with University Hospital on a joint project but negotiations fell apart. While Christus focuses on downtown there are voices in the medical community that believe a second hospital in the medical center area would be a better location.\nSan Antonio will be receiving at least one Tier One Level Childrens Hospital.\n", 'topics': ['news-public-affairs', 'news-public-affairs-health'], 'visible': True, 'contentchannel': 'Texas Week', 'contentchannel_homepage': 'http://www.klrn.org/texasweek', 'created_by': 'AmyBaroch', 'duration': 1710126, 'guid': 'http://video.klrn.org/video/2233626245/', 'title': "May 11, 2012 | New Children's Hospital", 'nola_episode': None, 'geo_lat': None, 'regions': None, 'station': 'KLRN', 'short_description': 'San Antonio will be receiving at least one Tier One Level Childrens Hospital.', 'topic_titles': ['News & Public Affairs', 'Health'], 'image': 'http://www-tc.pbs.org/s3/pbs.merlin.cdn.prod/webobjects/tmp6NWuue.jpg', 'available': None, 'expires': None, 'description': 'The downtown CHRISTUS Santa Rosa hospital in San Antonio will undergo an expansion to transform it into a Tier 1 Childrens Hospital. Christus had been working with University Hospital on a joint project but negotiations fell apart. While Christus focuses on downtown there are voices in the medical community that believe a second hospital in the medical center area would be a better location.', 'tags': None, 'purchase_url': None, 'geo_long': None, 'encore_date': '2012-05-11 08:01:00', 'created': '2012-05-11 19:43:56', 'url': 'http://video.klrn.org/video/2233626245/', 'modified': '2012-05-11 19:44:00', 'premiered': '2012-05-11 08:01:00', 'player_url': 'http://video.klrn.org/video/2233626245/', 'published': '2012-05-11 23:39:47', 'distribution': 'local', 'nola_root': None}

 

Step 1: Create CloudSearch Schema

For the first iteration of the Merlin CloudSearch service, I decided to create the following fields:

Field            Type
content_channel  text
date_expires     uint
date_modified    uint
description      text
distribution     text
nola_root        text
object_type      text
station          text
tags             text
title            text
topics           text
url_destination  text
url_image        text
visible          literal

This was all configured in the AWS Management console.
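For reference, the field list above can also be captured as a simple name-to-type mapping (a sketch; the field names and types come from the table, while the `SCHEMA` name is mine):

```python
# Sketch: the CloudSearch index fields from the table above,
# expressed as a field-name -> field-type mapping.
SCHEMA = {
    'content_channel': 'text',
    'date_expires': 'uint',
    'date_modified': 'uint',
    'description': 'text',
    'distribution': 'text',
    'nola_root': 'text',
    'object_type': 'text',
    'station': 'text',
    'tags': 'text',
    'title': 'text',
    'topics': 'text',
    'url_destination': 'text',
    'url_image': 'text',
    'visible': 'literal',
}

# Every field is full-text except the two uint dates and the literal flag.
uint_fields = sorted(name for name, ftype in SCHEMA.items() if ftype == 'uint')
print(uint_fields)  # -> ['date_expires', 'date_modified']
```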

 

Step 2: Endpoints

After the initial configuration, the schema was ready in about 20 minutes.  The resulting endpoints are:

document endpoint: doc-merlin-xylyg5xd3bm7eyxb3aj5vzt56q.us-east-1.cloudsearch.amazonaws.com
search endpoint:   search-merlin-xylyg5xd3bm7eyxb3aj5vzt56q.us-east-1.cloudsearch.amazonaws.com
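As a sketch of how these endpoints get used: with the 2011-02-01 API version (the one that appears in the search URL later in this post), document batches are POSTed to a `documents/batch` path on the document endpoint, and queries go to a `search` path on the search endpoint. The helper names here are my own:

```python
API_VERSION = '2011-02-01'

def batch_url(doc_endpoint):
    # SDF document batches are POSTed as JSON to this path
    return 'http://%s/%s/documents/batch' % (doc_endpoint, API_VERSION)

def search_url(search_endpoint):
    # queries are simple GETs against this path
    return 'http://%s/%s/search' % (search_endpoint, API_VERSION)

print(batch_url('doc-merlin-xylyg5xd3bm7eyxb3aj5vzt56q.us-east-1.cloudsearch.amazonaws.com'))
print(search_url('search-merlin-xylyg5xd3bm7eyxb3aj5vzt56q.us-east-1.cloudsearch.amazonaws.com'))
```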

 

Step 3: Populate Index

To do this, I wrote a little Python script, which I executed on my laptop.  The script looped through all of the roughly 50,000 available Merlin documents, mapped them into the requisite SDF documents, and then submitted them to the CloudSearch service at the document endpoint above.


import coveapi, time, sys, os
from datetime import datetime
import simplejson as json
import urllib, urllib2
import boto
from cloudsearch import connect_cloudsearch, get_document_service
import ConfigParser
import ramp
import hashlib

# read config file
config = ConfigParser.ConfigParser()
config.read('config.cfg')
username = config.get("merlinapi", "username")

# get a new instance of cloudsearch.DocumentServiceConnection
doc_endpoint = config.get("merlin_cloudsearch", "doc_endpoint")
doc_service = get_document_service(endpoint=doc_endpoint)

# check that AWS credentials have been properly exposed as environment variables
if not (os.getenv('AWS_ACCESS_KEY_ID') and os.getenv('AWS_SECRET_ACCESS_KEY')):
    print 'Please set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, e.g.:'
    print 'export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXX'
    print 'export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    sys.exit(-1)

# define version number as epoch time in seconds
version = int(time.mktime(datetime.utcnow().timetuple()))

# calculate how many merlin documents are available
merlin_url = "http://%s@merlin.pbs.org/api/1.0/webobjects.json?" % username
response = json.load(urllib.urlopen(merlin_url))

num_available = response['count']
batch_size = 200
total_batches = int(num_available / batch_size) + 2
print "available: %s" % num_available
print "batches: %s" % total_batches

# iterate through all of the batches of merlin documents
for batch_num in range(1, total_batches):

    # note the time that the batch started
    print time.ctime()
    print "Processing batch %s of %s" % (batch_num, total_batches)

    # construct merlin query
    begin = (batch_num - 1) * batch_size
    end = batch_num * batch_size
    merlin_url = "http://%s@merlin.pbs.org/api/1.0/webobjects.json?limit_start=%s&limit_stop=%s" % (username, begin, end)
    response = json.load(urllib.urlopen(merlin_url))

    for object in response['results']:
        doc = {}

        doc['url_destination'] = object['url']
        doc['object_type'] = object['webobject_type']
        doc['description'] = object['text']
        doc['url_image'] = str(object['image'])
        doc['title'] = object['title']
        doc['station'] = object['station']
        doc['nola_root'] = str(object['nola_root'])
        doc['content_channel'] = object['contentchannel']
        if object['tags']:
            doc['tags'] = object['tags']
        doc['distribution'] = object['distribution']
        doc['visible'] = object['visible']
        if object['modified']:
            doc['date_modified'] = int(time.mktime(time.strptime(object['modified'], "%Y-%m-%d %H:%M:%S")))
        if object['expires']:
            doc['date_expires'] = int(time.mktime(time.strptime(object['expires'], "%Y-%m-%d %H:%M:%S")))

        # construct comma-delimited list of topic slugs
        list_topics = ''
        if object['topics']:
            for topic in object['topics']:
                if list_topics:
                    list_topics += ","
                list_topics += topic
            doc['topics'] = list_topics

        # the cloudsearch id field cannot contain the special characters ':' and '/',
        # so we hash the GUID field from merlin
        try:
            if object['guid']:
                guid = hashlib.md5()
                guid.update(object['guid'])
                doc['id'] = guid.hexdigest()
                if doc['visible'] == 1:
                    doc_service.add(doc['id'], version, doc)
                else:
                    doc_service.delete(doc['id'], version)
            else:
                print "Missing GUID. Ignoring merlin document."
        except UnicodeEncodeError:
            print "Could not convert GUID to ascii hash string. Ignoring merlin document."

    # commit sdf batch to cloudsearch doc service
    result = doc_service.commit()

    # print summary stats for batch
    print "Batch %s completed" % batch_num
    print "   Documents added: %s" % result.adds
    print "   Documents deleted: %s" % result.deletes
    if result.status == "error":
        print "   Status: %s %s" % (result.status, result.errors)
    else:
        print "   Status: %s" % result.status
    print "=============================================="

    # clear the batched SDF from the document service connection
    doc_service.clear_sdf()

    # save batch results as a local sdf file
    sdf = result.sdf
    f = open('%s/%s.%s.sdf' % (config.get("merlin_cloudsearch", "sdf_path"), version, batch_num), 'w')
    f.write(sdf)
    f.close()
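Two small conversions in the script are worth calling out: the topic slugs are flattened into a comma-separated string for the `topics` text field, and the Merlin timestamps are converted to epoch seconds for the `uint` date fields. Both can be sketched in isolation (the sample values come from the example record earlier in the post):

```python
import time

# topic slugs -> comma-separated string (equivalent to the accumulator
# loop in the script above)
topics = ['news-public-affairs', 'news-public-affairs-health']
print(','.join(topics))  # -> news-public-affairs,news-public-affairs-health

# merlin timestamp string -> epoch seconds, suitable for the
# date_modified / date_expires uint fields (result depends on local timezone)
modified = '2012-05-11 19:44:00'
epoch = int(time.mktime(time.strptime(modified, "%Y-%m-%d %H:%M:%S")))
print(epoch)
```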

Step 4: Resulting SDF file

The resulting SDF record for each Merlin object looked like the following:

{"lang": "en", "fields": {"url_destination": "http://video.pbs.org/video/2222736272/", "title": "April 13, 2012", "content_channel": "Washington Week", "topics": "news-public-affairs,news-public-affairs-politics", "object_type": "Video", "visible": true, "station": "pbs", "date_modified": 1334369601, "distribution": "national", "url_image": "http://www-tc.pbs.org/s3/pbs.merlin.cdn.prod/webobjects/tmprVTvcS.jpg", "id": "8b7d21b921668a5f392a67e06594db02", "nola_root": "WWIR", "description": "April 13, 2012\nWith Rick Santorum out of the GOP race, Mitt Romney and President Barack Obama are gearing up for the general election. But recent comments about Romney's wife, Ann, have caused controversy over gender politics. Plus, a look at the fragile ceasefire in Syria. Joining Gwen: Dan Balz, Washington Post; Beth Reinhard, National Journal; John Harris, POLITICO; Doyle McManus, Los Angeles Times.\nRick Santorum drops out; Ann Romney and gender politics; ceasefire in Syria.\n"}, "version": 1336839970, "type": "add", "id": "8b7d21b921668a5f392a67e06594db02"}
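Each batch the script uploads is a JSON array of operations shaped like the record above. A minimal sketch of how one such `add` operation is assembled, including the MD5-hashed id (the `make_add` helper is my own name; the field values are illustrative):

```python
import hashlib
import json

def make_add(guid, version, fields):
    # CloudSearch ids may not contain ':' or '/', so the merlin GUID is hashed
    doc_id = hashlib.md5(guid.encode('utf-8')).hexdigest()
    return {
        'type': 'add',
        'id': doc_id,
        'version': version,
        'lang': 'en',
        # the id is repeated inside fields, as in the SDF record above
        'fields': dict(fields, id=doc_id),
    }

op = make_add('http://video.pbs.org/video/2222736272/', 1336839970,
              {'title': 'April 13, 2012', 'station': 'pbs'})
print(json.dumps([op], indent=2))  # a one-element SDF batch
```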

Step 5: Search Results

After the initial run of the script, we have slightly over 50,000 documents from the Merlin API indexed in CloudSearch.

For example, the following query will search all fields for the terms "chasing ghosts":

http://search-merlin-xylyg5xd3bm7eyxb3aj5vzt56q.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q=chasing+ghosts&return-fields=title,nola_root,station,content_channel,object_type,description,distribution,url_destination,url_image,station,tags,visible,topics
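A query URL like this can be assembled with the standard library rather than by hand (a sketch; `build_search_url` and the constant names are mine, and the endpoint and return fields are taken from the query above):

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'search-merlin-xylyg5xd3bm7eyxb3aj5vzt56q.us-east-1.cloudsearch.amazonaws.com'
RETURN_FIELDS = ['title', 'nola_root', 'station', 'content_channel', 'object_type',
                 'description', 'distribution', 'url_destination', 'url_image',
                 'tags', 'visible', 'topics']

def build_search_url(query):
    # urlencode handles the '+' escaping of spaces in the q parameter
    params = urlencode({'q': query, 'return-fields': ','.join(RETURN_FIELDS)})
    return 'http://%s/2011-02-01/search?%s' % (SEARCH_ENDPOINT, params)

print(build_search_url('chasing ghosts'))
```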

The search results returned look like:

[Image: merlin_search_result.png]

Performance

Our realized performance continues to be very good.  A simple uncached search for the term 'Obama' returns several hundred results in under 10 ms.