Playing with Amazon's new CloudSearch service
Last week Amazon announced the release of their new search service, AWS CloudSearch. I had a little downtime yesterday after attending the AWS Summit in NYC, so I decided to get my hands dirty and see how far I could get prototyping a COVE search service. After watching the brief video below and reading a bit of the documentation, I had something up and running within an hour.
- As many of you know, at PBS we use Splunk extensively for operational performance monitoring. One of the byproducts is that we dump a large portion of the COVE API contents to disk each day so that we can join this authoritative dataset with data reported through the various log sources that we monitor. With Splunk, I was able to export the contents of that dump to disk as a CSV file. Amazon limits local uploads to 5 MB, so I had to restrict the COVE API dataset to videos over 2 minutes in length that were currently 'available' to be viewed.
- Using the CloudSearch admin console, I created a new domain called 'cove-2'. Amazon lets you either specify the schema manually in the console or submit a local file for analysis. In this case, I used the CSV file that I had just created.
- The CloudSearch admin console extracted all of the fields from the CSV file and then suggested, for each one, the data type (uint, text, literal), whether the field should be searchable, and finally whether the field should be a facet or should appear in the search results. I don't completely understand why a field can only be a facet or be stored in the search results but not both.
- After I made a few tweaks to Amazon's suggestions, I saved the settings for our new domain. After about 20 minutes the new domain had been created and we were ready to upload the data.
- Once the new domain is created, the next step is to upload the structured data in one of the predefined formats. Amazon gives us the choice of providing files in the actual SDF (Search Data Format) or as CSVs. In my case, I elected to use the same CSV that was used earlier to create the domain schema. It took 5-10 minutes for all 37k lines in the CSV file to be indexed.
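For anyone who would rather prepare the upload by hand, here is a rough sketch of the filtering and SDF steps described above. It is not what I actually ran (I uploaded the CSV directly). The column names (`id`, `title`, `producer`, `duration_seconds`, `availability`) and the sample rows are made up for illustration, and the batch layout (`type`, `id`, `version`, `lang`, `fields`) is my understanding of the JSON flavor of SDF, so check it against Amazon's documentation before relying on it.

```python
import csv
import io
import json


def rows_to_sdf(csv_text, min_seconds=120):
    """Keep available videos longer than min_seconds and wrap each in an 'add' operation."""
    batch = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Hypothetical column names; the real COVE export may differ.
        if row["availability"] != "available":
            continue
        duration = int(row["duration_seconds"])
        if duration <= min_seconds:
            continue
        batch.append({
            "type": "add",
            "id": row["id"],
            "version": 1,      # assumed: a per-document version number
            "lang": "en",
            "fields": {
                "title": row["title"],
                "producer": row["producer"],
                "duration_seconds": duration,
            },
        })
    return batch


# Made-up sample rows standing in for the Splunk export.
sample = """id,title,producer,duration_seconds,availability
v1,Great Expectations,Masterpiece,3300,available
v2,Short Promo,PBS,45,available
v3,Old Episode,NOVA,3200,expired
"""

batch = rows_to_sdf(sample)
print(json.dumps(batch, indent=2))
```

Only the first sample row survives both filters, so the printed batch contains a single document.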
The results were pretty amazing. The following query searches all of the searchable fields (producer+title+...) for the words "great" and "expectations".
The "return-fields" parameter controls which of the fields stored in the index are returned with the search results. Keep in mind that not every indexed field can be returned in the search results. This is very reminiscent of Lucene/Solr behavior.
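To make the shape of such a request concrete, here is a small sketch that builds a search URL with a `q` parameter and a "return-fields" parameter. The endpoint hostname is a placeholder (yours is shown in the admin console), the `2011-02-01` path segment is my assumption about the API version, and the choice of `title` and `producer` as returned fields is only illustrative.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the search endpoint shown for your own domain.
SEARCH_ENDPOINT = "http://search-cove-2-EXAMPLE.us-east-1.cloudsearch.amazonaws.com"
API_VERSION = "2011-02-01"  # assumed API version segment


def build_search_url(terms, return_fields):
    """Build a search URL for the given terms, asking for specific stored fields back."""
    params = {
        "q": " ".join(terms),                      # searches the default searchable fields
        "return-fields": ",".join(return_fields),  # only result-enabled fields can be returned
    }
    return f"{SEARCH_ENDPOINT}/{API_VERSION}/search?{urlencode(params)}"


url = build_search_url(["great", "expectations"], ["title", "producer"])
print(url)
```

Fetching that URL with any HTTP client should return the matching documents; asking for a field that was configured as a facet rather than a result field is where the facet-or-result restriction mentioned earlier would show up.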
The entire process from start to finish took about 30-40 minutes.