RSS Feed for Aamir Khan’s blog using YQL and Pipes
As you may know, I had earlier parsed Aamir Khan’s blog to create a feed, using custom screen-scraping code. Today, after reading Anand’s blog, I did the same using YQL and Pipes. Using YQL and Pipes is much easier than writing custom code, and less buggy.
If you have subscribed to http://feeds.thejeshgn.com/aamirkhan, you don’t have to worry. The feed URL remains the same; only the technology behind it has changed, and for the better. If you have not subscribed, I guess it’s a good idea to subscribe.
The rest of this post is for fellow hackers. I have tried to write up in detail the process I followed and the technologies I used.
YQL (Yahoo Query Language) can be used to query the web for data. YQL exposes a SQL-like SELECT syntax with which we are all very familiar. To get the links to the posts on Aamir’s blog I used:
select * from html where url="http://126.96.36.199/blog/login.php" and xpath="//a[contains(@href,'/blog/login.php?topicid=')]"
That query goes to the home page of Aamir’s blog and gets the links to all the recent posts listed in the sidebar.
To test it, go to the YQL console and run the above query. YQL gives you the results as both XML and JSON. It also gives you a RESTful URL to use in your own application.
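For fellow hackers who want to call the query from their own code, here is a minimal Python sketch of how such a REST URL can be put together. It assumes the public YQL endpoint is https://query.yahooapis.com/v1/public/yql and that it takes `q` and `format` parameters; the helper name `build_yql_url` is made up for illustration. The sketch only builds the URL and does not fetch it.

```python
from urllib.parse import urlencode

# Assumed public YQL endpoint; use whatever URL the YQL console actually gives you.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="json"):
    """Return a REST URL that runs `query` and asks for `fmt` ("json" or "xml")."""
    # urlencode percent-escapes the quotes, brackets and spaces in the query.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


# The same query as above, split over two string literals for readability.
query = (
    'select * from html where url="http://126.96.36.199/blog/login.php" '
    "and xpath=\"//a[contains(@href,'/blog/login.php?topicid=')]\""
)

url = build_yql_url(query)
print(url)
```

Passing `fmt="xml"` instead would ask for the XML form of the same result.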
But there was a problem with this approach: it got all the URLs except that of the latest post. On every page, Aamir’s blog lists all the posts except the one you are currently on, and the home page shows the latest post, so it has no link to it. That makes sense for web readers but not for me. So I went to the URL of the 21st post instead, got the links from there, and truncated the results to the first 20 URLs (the 20 latest posts are more than enough for any feed).
select * from html where url="http://188.8.131.52/blog/login.php?topicid=21url" and xpath="//a[contains(@href,'/blog/login.php?topicid=')]"
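In the pipe itself, the filtering is done by the XPath expression and the truncation by Pipes. Purely as an illustration, the same two steps can be sketched in plain Python with the standard library. The class and function names here are hypothetical, and the sample markup is made up rather than copied from the blog.

```python
from html.parser import HTMLParser


class TopicLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at a blog post."""

    def __init__(self, needle="/blog/login.php?topicid="):
        super().__init__()
        self.needle = needle
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same test as the XPath contains(@href, ...) filter.
        if href and self.needle in href:
            self.links.append(href)


def latest_links(html, limit=20):
    """Return the first `limit` post links found in `html`, in page order."""
    parser = TopicLinkParser()
    parser.feed(html)
    return parser.links[:limit]


# Made-up sidebar markup: one unrelated link plus 25 post links, newest first.
sample = '<a href="/about.php">About</a>' + "".join(
    '<a href="/blog/login.php?topicid=%d">Post %d</a>' % (n, n)
    for n in range(25, 0, -1)
)

links = latest_links(sample)
print(len(links), links[0], links[-1])
```

The sketch relies on the page listing posts newest first, which is what makes "first 20" mean "latest 20".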
The most beautiful thing about using Pipes is that YQL is built into it. So I can send the output of a module into YQL and vice versa. This makes YQL and Pipes a deadly combination.
To get the content, I looped through the list of URLs and used the get page module. I am now getting the data between the first <p class="body"> and the first <br>. Yeah, they use <br> for paragraphs. I don’t want to steal readers from his blog, and hence I am getting only the first paragraph.
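The cut itself happens inside the pipe, but the idea is simple enough to sketch in Python. This is only an illustration: the function name `first_paragraph` and the sample page are made up, and the real markup may differ (for example `<br />` instead of `<br>`). The sketch takes the first `<br>` that follows the opening tag.

```python
def first_paragraph(html, start='<p class="body">', end="<br>"):
    """Return the text between the first `start` and the first `end` after it."""
    begin = html.find(start)
    if begin == -1:
        return ""  # no body paragraph on this page
    begin += len(start)
    stop = html.find(end, begin)
    if stop == -1:
        stop = len(html)  # no <br> found: take the rest of the page
    return html[begin:stop].strip()


# Made-up post markup with <br> used as the paragraph separator.
sample_page = (
    '<div><p class="body">First paragraph of the post.<br>'
    "Second paragraph.<br>Third paragraph.</p></div>"
)

excerpt = first_paragraph(sample_page)
print(excerpt)
```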
1. Get the date info, probably the text between spans, and parse it into a date object.
2. Fix the bugs, if there are any. Let me know if you find one.