RSS Feed for Aamir Khan’s blog using YQL and Pipes
As you know, I had earlier parsed Aamir Khan's blog to create a feed, using custom screen-scraping code. Today, after reading Anand's blog, I did the same using YQL and Pipes. Using YQL and Pipes is much easier than writing custom code, and less buggy.
If you have subscribed to http://feeds.thejeshgn.com/aamirkhan, you don't have to worry. The feed URL remains the same; only the technology behind it has changed, and for the better. If you have not subscribed, I guess it's a good idea to subscribe.
The post below is for fellow hackers. I have tried to write a detailed post on the process I followed and technologies I used.
YQL (Yahoo Query Language) can be used to query the web for data. YQL exposes a SQL-like SELECT syntax, with which we are all very familiar. To get the links for the posts from Aamir's blog I used:
select * from html where url="http://74.55.20.11/blog/login.php" and xpath="//a[contains(@href,'/blog/login.php?topicid=')]"
That goes to the home page of Aamir's blog and gets the links to all the recent posts listed in the sidebar.
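For fellow hackers who want to see what that XPath filter selects without going through YQL, here is a minimal sketch in Python using only the standard library. It keeps every anchor whose href contains the post URL fragment. The sample markup and the PostLinkParser name are made up for illustration; the real page will look different.

```python
from html.parser import HTMLParser


class PostLinkParser(HTMLParser):
    """Collect href values of <a> tags whose href contains a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Same idea as contains(@href, '...') in the XPath expression.
        if self.marker in href:
            self.links.append(href)


# Hypothetical sidebar markup, for illustration only.
sample_html = """
<div>
  <a href="/blog/login.php?topicid=12">A post</a>
  <a href="/blog/login.php?topicid=11">Another post</a>
  <a href="/about.php">About</a>
</div>
"""

parser = PostLinkParser("/blog/login.php?topicid=")
parser.feed(sample_html)
post_links = parser.links
```

Only the two post links end up in post_links; the About link is dropped because its href does not contain the marker.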
To test it, go to the YQL console and run the above query. YQL can return the result as either XML or JSON. It also gives you a RESTful URL that you can call from your own application.
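A RESTful URL of that kind is just the query, URL-encoded, attached to an endpoint. The sketch below shows roughly how such a URL could be assembled in Python. The endpoint address and the q and format parameter names are my assumptions here, so copy the real URL from the YQL console rather than trusting this one. Nothing is fetched; the code only builds the string.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed public endpoint; verify against the URL the YQL console shows.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="json"):
    """Return a GET URL carrying the query and the desired output format."""
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


query = (
    'select * from html where url="http://74.55.20.11/blog/login.php" '
    "and xpath=\"//a[contains(@href,'/blog/login.php?topicid=')]\""
)
rest_url = build_yql_url(query)

# Decoding the query string again should give back the original query.
decoded = parse_qs(urlparse(rest_url).query)
```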
But there was a problem with this approach: it got all the URLs except that of the latest post. On his blog, Aamir lists all the posts except the one you are currently on, and since the home page shows the latest post, it has no link to it. That makes sense for web readers but not for me. So I went to the URL of the 21st post, got the links from there, and then truncated the results to the first 20 URLs (the 20 latest posts are more than enough for any feed).
select * from html where url="http://74.55.20.11/blog/login.php?topicid=21url" and xpath="//a[contains(@href,'/blog/login.php?topicid=')]"
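To see why querying an older post's page works, here is a small simulation in Python. It assumes the sidebar lists posts newest first and that the page used is the one for the 21st newest post; the topic ids are made up.

```python
def sidebar_links(all_posts, current):
    """Simulate the sidebar: every post except the one being viewed."""
    return [post for post in all_posts if post != current]


# Hypothetical topic ids, newest first (30 is the latest post).
all_posts = list(range(30, 0, -1))

# The home page shows the latest post, so its sidebar omits that post.
home_links = sidebar_links(all_posts, current=all_posts[0])

# Viewing the 21st newest post omits only that post, so the first
# 20 links are exactly the 20 latest posts.
feed_links = sidebar_links(all_posts, current=all_posts[20])[:20]
```

Here home_links is missing post 30, while feed_links starts with it.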
The most beautiful thing about using Pipes is that YQL is built into it, so I can send the result of a module into YQL and vice versa. This makes YQL and Pipes a deadly combination.
To get the content, I looped through the list of URLs and used the get page module. I am now getting the data between the first <p class="body"> and the first <br>. Yeah, they use <br> for paragraphs. I don't want to steal readers from his blog, and hence I am getting only the first paragraph.
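The cut between the two markers is done inside Pipes, but the same idea can be sketched in a few lines of Python. The first_paragraph function and the sample page are hypothetical, and the sketch assumes the markers appear in the page exactly as written above.

```python
def first_paragraph(page_html, start='<p class="body">', end="<br>"):
    """Return the text between the first start marker and the next end marker."""
    begin = page_html.find(start)
    if begin == -1:
        return ""  # no body paragraph on this page
    begin += len(start)
    stop = page_html.find(end, begin)
    if stop == -1:
        stop = len(page_html)  # no <br> found: keep everything after the marker
    return page_html[begin:stop].strip()


# Hypothetical post markup, for illustration only.
sample_page = (
    '<div><p class="body">First paragraph.<br>'
    "Second paragraph.<br></p></div>"
)
excerpt = first_paragraph(sample_page)
```

Plain string search is enough here because only one fixed pair of markers matters; a page with different markup would need a proper HTML parser.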
You can clone the pipe that I have created to experiment with it.
To do:
1. Get the date info, probably the text between spans such as <span class="graybold">Oct,09,2007</span>, and parse it into a date object.
2. Fix the bugs, if there are any. Let me know if you find one.
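For the first to-do item, a possible way to pull the date out of the span and parse it is sketched below in Python. It assumes every date follows the Oct,09,2007 pattern shown above, with English month abbreviations; the parse_post_date name is made up.

```python
import re
from datetime import datetime

# Matches the date span as it appears in the to-do item above.
DATE_SPAN = re.compile(r'<span class="graybold">([^<]+)</span>')


def parse_post_date(fragment):
    """Return a datetime for the first graybold span, or None if absent."""
    match = DATE_SPAN.search(fragment)
    if match is None:
        return None
    # %b is the abbreviated month name, e.g. Oct (locale dependent).
    return datetime.strptime(match.group(1).strip(), "%b,%d,%Y")


post_date = parse_post_date('<span class="graybold">Oct,09,2007</span>')
```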
Suddenly today I saw 50 new updates on the feed, which I had subscribed to long back, but they had only titles. That changed again a little while ago, when I saw text below those titles. Loved it all then.
Great work thej.
@Prasoon : Thanks. Now you can see the latest post too :)
I was under the impression that Aamir has a feed at “http://feeds2.feedburner.com/aamirkhan” and have been using it for my blogroll for a while now. Now, is that one you created?
@sandeep : Yup. It's created by me :)
Neat!
I’d been using a pure XPath solution that returns just the titles. It had the ghastly URL http://www.s-anand.net/xpath?url=http%3A%2F%2F202.87.41.148%2Fdigital%2FAamirKhan%2Flogin.php%3Ftopicid%3D1&xpath=//a%5Bcontains(@href,%22login.php?topicid=%22)%5D%5Bnot(contains(@href,%22page=%22))][string-length(.)%3E2]%20title-%3E.%20link-%3E./@href
Look forward to moving to yours :-)
Thanks. This is useful.