Getting started with Yahoo Pipes
February 19, 2007
This post is probably about a week behind the hype surrounding Yahoo Pipes, but I got a request to post it and I think many people will find it useful. Pipes is a new tool from Yahoo for aggregating content from multiple sources (RSS, HTML, geo-data) into a single RSS output feed. You can also chain multiple pipes together to further refine the output feed.

In this example, I build a pipe that aggregates content from a number of deals-oriented websites. The idea behind this pipe is to combine the content of all the feeds and remove any duplicates so that only unique content is shown. It’s not perfect, but it has been successful in filtering out most of the duplicates.

I started by grabbing the RSS feed URLs from the following websites and placing them in a Fetch Source.

Deals 2 Buy
SlickDeals.net
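For readers who think better in code, the fetch-and-combine step can be roughly sketched in Python. This is only an illustration of the idea, not how Pipes works internally: the two feed documents are made up, the links are placeholders, and parse_rss and merge_feeds are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        items.append({
            "title": (node.findtext("title") or "").strip(),
            "link": (node.findtext("link") or "").strip(),
        })
    return items

def merge_feeds(feed_documents):
    """Concatenate the items of several RSS documents, keeping their order."""
    merged = []
    for xml_text in feed_documents:
        merged.extend(parse_rss(xml_text))
    return merged

# Made-up feed contents standing in for the real deal sites.
FEED_A = """<rss version="2.0"><channel><title>Site A</title>
<item><title>USB drive $10</title><link>http://a.example/1</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Site B</title>
<item><title>Laptop bag $15</title><link>http://b.example/7</link></item>
<item><title>USB drive $10</title><link>http://b.example/8</link></item>
</channel></rss>"""

merged = merge_feeds([FEED_A, FEED_B])
for item in merged:
    print(item["title"], "->", item["link"])
```

In a real script the feed text would come from an HTTP request to each feed URL rather than from inline strings.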
I then created two Unique operators to filter out duplicates by Link and by Title. These two operators get rid of any deals with exactly the same title and any that point to exactly the same product link. This still leaves many duplicates, however, because some of the feeds point to a page on the deals website, which in turn has an embedded link to the actual product. Sites do this to keep the RSS feed from stealing too much of their ad revenue. We still want to get as many unique items as possible.
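The effect of the two Unique operators can be approximated with a small Python sketch. It assumes the first item seen for each value is the one that survives, which may not match what Pipes does; unique_by and the sample items are hypothetical.

```python
def unique_by(items, field):
    """Keep only the first item seen for each distinct value of `field`."""
    seen = set()
    kept = []
    for item in items:
        key = item.get(field, "")
        if key in seen:
            continue  # an earlier item already had this exact value
        seen.add(key)
        kept.append(item)
    return kept

# Made-up items: two share a title, two share a link.
feed_items = [
    {"title": "USB drive $10", "link": "http://example.com/a"},
    {"title": "USB drive $10", "link": "http://example.com/b"},
    {"title": "2GB USB drive for $10", "link": "http://example.com/a"},
    {"title": "Laptop bag $15", "link": "http://example.com/c"},
]

# Chain the two filters the same way the pipe chains two Unique operators.
deduped = unique_by(unique_by(feed_items, "link"), "title")
for item in deduped:
    print(item["title"], "->", item["link"])
```

Because both checks are exact matches, a deal that shows up with a slightly different title and a different intermediate link passes straight through, which is the gap described above.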
For further filtering, we run a Content Analysis operator and then remove duplicates based on its output. The Content Analysis operator analyzes the link and generates a meta-data tag, which is appended to the output. Adding another Unique operator after the Content Analysis lets us remove items with duplicate meta-data.
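I don't know how Content Analysis computes its tags, so the sketch below swaps in a deliberately crude stand-in: it builds a tag from the sorted keywords of the title and then drops items whose tag has already been seen. The function names, the stopword list, and the sample items are all assumptions made for illustration.

```python
import re

# Small, arbitrary stopword list used only by this stand-in tagger.
STOPWORDS = {"a", "an", "the", "for", "at", "on", "with", "and", "of", "to", "in"}

def stand_in_tag(item):
    """Crude substitute for a content-analysis tag: sorted title keywords."""
    words = re.findall(r"[a-z0-9]+", item["title"].lower())
    keywords = sorted({w for w in words if w not in STOPWORDS})
    return " ".join(keywords)

def tag_and_dedupe(items):
    """Attach a 'tag' to each item, then keep the first item per tag."""
    seen = set()
    kept = []
    for item in items:
        tagged = dict(item, tag=stand_in_tag(item))
        if tagged["tag"] in seen:
            continue  # same tag as an earlier item, treat as a duplicate
        seen.add(tagged["tag"])
        kept.append(tagged)
    return kept

# Made-up items: the first two describe the same deal with different links.
items = [
    {"title": "Sandisk 2GB USB Drive", "link": "http://deals.example/1"},
    {"title": "USB drive: the Sandisk 2GB", "link": "http://store.example/p/9"},
    {"title": "Laptop Bag", "link": "http://deals.example/2"},
]

result = tag_and_dedupe(items)
for item in result:
    print(item["tag"], "->", item["link"])
```

The point is the shape of the step (derive a tag, then deduplicate on the tag), not the tagging rule itself.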
Depending on the feed, the meta-data can be inaccurate. I expect this will improve as Yahoo improves Pipes and the content providers improve their feeds.
Finally, I added a Sort operator to sort everything by title in ascending order. The result is a custom RSS feed aggregated from 6 RSS feeds with most of the duplicates taken out. Some duplication does remain, but I expect Yahoo will add partial-match filters that let you drop duplicates when a certain percentage of the words in two titles match. That should get rid of the last few duplicates.
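To make the partial-match idea concrete, here is a hedged Python sketch of what such a filter might look like, followed by the sort on title. Pipes does not offer this as far as I know; the overlap measure, the 0.75 threshold, the helper names, and the sample titles are all my own assumptions.

```python
def title_words(title):
    """Lowercased set of whitespace-separated words in a title."""
    return set(title.lower().split())

def overlap(a, b):
    """Share of the shorter title's words that also appear in the other title."""
    wa, wb = title_words(a), title_words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def drop_near_duplicates(items, threshold=0.75):
    """Keep an item only if its title is not too similar to one already kept."""
    kept = []
    for item in items:
        if any(overlap(item["title"], k["title"]) >= threshold for k in kept):
            continue
        kept.append(item)
    return kept

# Made-up items: the third is the first with one extra word.
items = [
    {"title": "Sandisk 2GB USB drive $10"},
    {"title": "Canon photo printer $40"},
    {"title": "Sandisk 2GB USB drive $10 shipped"},
    {"title": "Apple iPod case $5"},
]

# Filter first, then sort by title in ascending order.
final = sorted(drop_near_duplicates(items), key=lambda i: i["title"].lower())
for item in final:
    print(item["title"])
```

A lower threshold catches more rewording but also risks merging different deals on similar products, so the value would need tuning against real feeds.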
You can view the finished pipe here. I’ve published the pipe so anyone can clone or use it. I’d love to see other interesting pipes that others have found or created. Please let me know in the comments.