
Update! My Twitpic scraper (as well as the search API calls) has been integrated into NCSU's Tweetgator. Check it out on GitHub!
A couple of months ago, IU East revamped its Twitter wall. We incorporated a codebase originally developed by NCSU, and I then extended it by adding inline hashtag searching and a Twitpic scraper.
At the time I wrote it, I could not find any other existing Twitpic scraper; Twitpic doesn't have a formal API (or at least, it didn't then, and I don't think it does at the time of this writing, either).
Effectively, what this script does (see after the jump) is browse the Twitpic site, parse out the image IDs, and then reconstruct URLs to the Twitpic images. It is somewhat rudimentary in that it neither caches nor actually downloads the images; anyone reading this may feel free to extend the code in that direction.
The script below is licensed under the GPL2, with all the requirements, freedoms, and obligations therein.
<?php
define('USERNAME', 'yourtwitterusername');

// How many pics to display by default
$quantity = (isset($_GET['qty']) && !empty($_GET['qty'])) ? (int) $_GET['qty'] : 8;

// The rendering format
$format = (isset($_GET['format']) && !empty($_GET['format'])) ? $_GET['format'] : "json";

// The user's Twitpic gallery page to scrape
$url = "http://www.twitpic.com/photos/" . USERNAME;

define('MINI_URL',  'http://twitpic.com/show/mini/%1$s');
define('THUMB_URL', 'http://twitpic.com/show/thumb/%1$s');
define('LARGE_URL', 'http://twitpic.com/show/large/%1$s');
define('PIC_URL',   'http://twitpic.com/%1$s');

// The formatting string, used for LI based outputs.
// PIC_URL and THUMB_URL both contain the %1$s placeholder, so a single
// sprintf() call fills in the photo ID for the link and the thumbnail.
$liFormat = '<li> <a href="' . PIC_URL . '" title="Twitpic"> <img src="' . THUMB_URL . '" alt="twitpic" /> </a> </li>';

///// CURL down the Twitpic data /////////////////////////////
// See below for discussion on why we're not RegExing for images directly
$searchForPhotos = '~<a href="/(\w+)">~';

$ch = curl_init($url);
$photoIDs = array();
$photos = array();

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$temp = curl_exec($ch);
curl_close($ch);

// $photoIDs[0] holds the full matches and $photoIDs[1] the captured IDs;
// shifting off the full matches leaves the IDs at index 0.
preg_match_all($searchForPhotos, $temp, $photoIDs);
array_shift($photoIDs);
$photoIDs = array_slice($photoIDs[0], 0, $quantity);

$output = "";

switch ($format) {
    case "LI":
        foreach ($photoIDs as $id) {
            $output .= sprintf($liFormat, $id);
        }
        break;
    default:
        ///// Parse out the raw data into a usable format (JSON) //////
        foreach ($photoIDs as $id) {
            $photos[$id]["mini"]  = sprintf(MINI_URL, $id);
            $photos[$id]["thumb"] = sprintf(THUMB_URL, $id);
            $photos[$id]["full"]  = sprintf(LARGE_URL, $id);
            $photos[$id]["url"]   = sprintf(PIC_URL, $id);
        }
        $output = json_encode($photos);
        break;
}

echo $output;
?>
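For completeness, here is a minimal sketch of how another page might consume the JSON output server-side. The twitpics.php filename and example.com host are placeholders of my own, not anything the script above defines; point the URL at wherever you actually save the scraper.

<?php
// Hypothetical consumer: fetch the scraper's JSON output and render thumbnails.
// "twitpics.php" and "example.com" are assumptions, not part of the script above.
$json   = file_get_contents('http://example.com/twitpics.php?qty=4&format=json');
$photos = json_decode($json, true);

foreach ($photos as $id => $urls) {
    // Each entry carries "mini", "thumb", "full", and "url" keys (see above)
    echo '<a href="' . $urls['url'] . '" title="Twitpic">'
       . '<img src="' . $urls['thumb'] . '" alt="twitpic" />'
       . '</a>';
}
?>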
The main challenge I encountered initially was that Twitpic obscures the way its images are served. It seems counter-intuitive to search for link tags instead of IMG tags, but if you try hotlinking directly to an IMG src, you'll find that it doesn't work (presumably it is blocked via .htaccess or something similar).
In the interest of fairness, it would probably be ideal to actually download and cache the images rather than hotlink them, and I would advise anyone who implements this script to do that.
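For anyone who takes that advice, here is a rough, untested sketch of the idea. The cache_twitpic_thumb() helper and the cache/ directory are names I made up for illustration, and the assumption is that the web server can write to that directory.

<?php
// Hypothetical caching helper (not part of the original script).
// Downloads a photo's thumbnail into a local cache/ directory once,
// then returns the local path on subsequent requests.
function cache_twitpic_thumb($id, $cacheDir = 'cache')
{
    $localPath = $cacheDir . '/' . $id . '.jpg';

    if (!file_exists($localPath)) {
        $ch = curl_init(sprintf('http://twitpic.com/show/thumb/%s', $id));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow any redirect to the actual image host
        $data = curl_exec($ch);
        curl_close($ch);

        if ($data !== false) {
            file_put_contents($localPath, $data);
        }
    }

    return $localPath;
}
?>

With something like that in place, the JSON loop above could return local paths for the thumb entries instead of hotlinked /show/ URLs.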
Once the page scrape has been parsed for link targets, we take those matches and build new URLs from them. The photos you can actually display offsite live at different URLs than the ones embedded in Twitpic's own pages, which are served out of Amazon's cloud web services. The Twitpic /show/ URLs are about as close to an API / web service as it gets, so that's what we have to work with!
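To make the URL-building concrete: for a hypothetical photo ID of abc123, the sprintf patterns in the script expand as follows.

// For a hypothetical photo ID "abc123", the patterns above expand to:
//   sprintf(MINI_URL,  'abc123')  -> http://twitpic.com/show/mini/abc123
//   sprintf(THUMB_URL, 'abc123')  -> http://twitpic.com/show/thumb/abc123
//   sprintf(LARGE_URL, 'abc123')  -> http://twitpic.com/show/large/abc123
//   sprintf(PIC_URL,   'abc123')  -> http://twitpic.com/abc123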
The IU East twitter wall populates the Twitpic photos with a quick Ajax call, which keeps the initial page load fast.
If you want to make the changes I mention above (or similar improvements), feel free to post them in the comments below (if the form will let you) or email them to me; I have a Gmail account and my username is "armahillo". I will happily modify the code above and credit you for your submission.