Papermashup

Subscribe


Tweets


"RT @Eva_Shaughnessy: #finedining @romomobilecafes restaurant this evening with #NewHair #winning #SaturdayNight #GirlsNightOut http://t.co/…"

@ashleyford 3 weeks ago

"RT @kycutwilson: @ashleyford @burgerbeartom incredible. There's 5 more left! Shout about it!!"

@ashleyford 4 weeks ago

Designer and web developer, Co-founder and Technical Director at Harkable.com. Previously I worked at Spotify, MySpace and InMobi. Contact me - ashley[at]papermashup.com


Use jQuery and PHP to scrape page content

Scrape any page content using PHP and JavaScript

Ashley

So we have content on another domain that we want to load via AJAX into a page. How can we do this? This was a question that was put to me the other day at work. Experienced web developers will know that the browser’s same-origin policy doesn’t allow cross-domain XMLHttpRequests (AJAX requests). There is a ‘dirty’ way to get around this: use PHP and cURL to pull down the HTML of the page you want the content from, so that as far as JavaScript is concerned it’s coming from your own domain. Let me just say, this isn’t an ideal solution, but it’s a useful technique when executed in the right situation.

The PHP

In this example we’re taking the community news section from smashingmagazine.com. First, using PHP, we use cURL to fetch the entire contents of the homepage. We can then use JavaScript to pick out a specific div, as explained below.


$ch = curl_init("http://www.smashingmagazine.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML rather than printing it immediately
$html = curl_exec($ch);
curl_close($ch);
echo $html;

The JavaScript

This must be the simplest couple of lines of JavaScript ever. You can see that within the DOM-ready function we’re loading the contents of the div #noupesoc from curl.php into #content. As simple as that. You can specify any div or element on the page and grab it using this method.


$(document).ready(function() {
    $("#content").load("curl.php #noupesoc");
});
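If you want to handle the case where the proxied request fails, `.load()` also accepts a completion callback. Here’s a rough sketch (the `loadMessage` helper is hypothetical, not part of the original tutorial; `curl.php` and `#noupesoc` are the proxy script and target div from above):

```javascript
// Hypothetical helper: build a user-facing message from the status and HTTP
// code that jQuery passes to load()'s completion callback. Returns null when
// the request succeeded.
function loadMessage(status, httpCode) {
  return status === "error"
    ? "Sorry, the content could not be loaded (HTTP " + httpCode + ")."
    : null;
}

// Usage with the tutorial's load() call:
//   $("#content").load("curl.php #noupesoc", function (response, status, xhr) {
//     var message = loadMessage(status, xhr.status);
//     if (message) $("#content").text(message);
//   });
```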

The HTML


<div id="content"> <img src="ajax-loader.gif" alt="Loading..." /> </div>

download


Comments 24
  • Sawan Sanghvii

    nice article….i m gonna try this…..thnx for sharing…..Check pdf scraping using php on my blog….http://www.webdata-scraping.com/blog


  • wordpress plugin developer

It’s just another reason for me to not like blogs – ha ha. Random past posts get their dates changed and are republished on the front page and in RSS feeds. Very good communication and persuasive expertise are necessary to carry out efficient communication between you and your client, your boss, or your associates in order to close a deal or identify the achievement of a project.

    My homepage: wordpress plugin developer


  • Brijesh

    Hi,

    Thanks for nice tutorial.

Could we create a regular pattern to crawl all pages?

    Thanks
    Brijesh Mishra


  • Delia

Hi there, this doesn’t seem to work in Internet Explorer. And it also doesn’t seem to work for extracting the contents of divs on external sites.


  • SWATANTRA PRASAD CHOUDHURY

    NICE ONE. Loved to go through the same..


  • Steve

    Awesome tutorial!
    I had to use curl on my host 1and1.

    http://www.quickscrape.com/ is what I came up with!


  • mccormicky

    If the content is WP I can and have used Simplepie to do it but getting this content to show in just one place created another problem.

    In the non WP part of the site there are only a few templates that can be used effectively …because the templates in question would be used by more than one page the RSS WP content shows up everywhere the template is used.

    The non WP part of the site uses Qcodo and Ajax. The qcodo controls are in files that have been encrypted. So what if I want to show content from the non WP part of the site in the WP part of the site? This content doesn’t come with RSS.

So I went looking for other solutions: scraping.

    With scraping I can create a WP page and put whatever I want in it then include this page in the non WP part of the site using the techniques either from this tutorial or the one from Net Tuts.

    Rubbish: what solution do you propose other than scraping?

    Ashley: is this safe to do if the content is from the same website/domain?


    • Ashley

@mccormicky if you are in control of the content on both domains then it is safe to use this technique (which is the exact reason I used it); if not, you should consider the reputation of the site in question before proceeding.


  • rubbish

Ashley, I am not referring to content theft (although your argument suggesting that a site’s content is fair game for the taking solely because it is presented on the internet is flat out WRONG).

The fact is, if the user has scripting on their page (HTML) and you scrape their page and incorporate their HTML (to be parsed with jQuery), that active content will have access to your page, will not be subject to cross-domain restrictions, and will have control over any content or cookies set by your server.

    You will be subjected to cross site scripting in the exact same way as if you had an input form which you echoed back to the user.

    I know this because I have done it myself to idiots. :)

Remember, when you use cURL to request a page, that request is being sent with a particular signature that identifies it as a non-browser request (the user agent is all wrong, and there are more hints as well); also, the IP address will be that of YOUR server, not that of a user’s ISP.
Therefore, the target can recognize you and send back targeted text, which can result in an unfriendly site experience. They could even send back JavaScript that could parse and steal the login cookie of whoever is viewing!

    Remember back when Hotlinking graphic files was an issue and people used to substitute offensive files in the place of the hotlinked files?

In any event, the scraped content now constitutes INPUT and will have to be SANITIZED the same way you sanitize ALL input before including it in a page.

You can do this all with PHP; no need to write “serious RegEx”. A short Google for PHP Simple HTML DOM
http://simplehtmldom.sourceforge.net/ will provide all you need.
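    As a rough sketch of the kind of sanitization meant here (a hypothetical illustration only, regular expressions alone are NOT a sufficient defence; a real application should use a proper HTML parser or a vetted sanitizer):

```javascript
// Rough sketch only: strip <script> blocks and inline on* event handlers
// from scraped HTML before injecting it into your page. Regex-based
// filtering is illustrative, not a complete XSS defence.
function stripActiveContent(html) {
  return html
    // Remove <script>...</script> blocks entirely.
    .replace(/<script\b[\s\S]*?<\/script>/gi, "")
    // Remove inline event-handler attributes such as onclick="...".
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, "");
}
```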


    • Ashley

@rubbish, as I stated in my last comment, “I’m not condoning users to actively trawl the internet and scrape any content they wish…”

I see your point about XSS from an external point of view if you don’t control the domain or content; that wasn’t the case in my original solution, however, so it wasn’t a problem.

Maybe you should check out this tutorial from the reputable Nettuts.com; it’s a little more complex, but has the same outcome.

As I have said previously, it’s nicer to use your name rather than hide anonymously.


  • rubbish

Executing someone else’s HTML within the context of your own page is asking for trouble and is the easiest way to get yourself XSS’d.


    • Ashley

@rubbish, you’ve clearly not read the post. I’m not condoning users actively trawling the internet and scraping any content they wish; however, if you don’t want someone to use information from your site, why post it online?

The problem I had at work was that we had two subdomains between which we needed to make AJAX requests, so both were owned by us.

      If you’d left your real name and email address I could have emailed you personally as you’ve clearly got the wrong end of the stick.


  • Tutorial City

    very useful! thanks a lot


  • Josep Viciana

    Hello, seems you have a problem in the code shown. Appears <h1> instead of !


  • Josep Viciana

Hello, it seems you have a problem in the code shown. <h1> appears instead of !


  • Brenelz

What do you think is the best way to deal with images/links, as they will be broken if moved over?


    • Ashley

@Brenelz I’d use a bit of JavaScript and a regular expression to insert the base URL into any img tags.
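      Something along these lines might work (a hypothetical sketch; the base URL is whatever you passed to curl_init() in the PHP, and edge cases like `../` paths aren’t handled):

```javascript
// Hypothetical sketch: turn a relative src/href from the scraped page into
// an absolute URL against the scraped site's base URL.
function absolutize(base, path) {
  // Already absolute? Leave it alone.
  if (/^https?:\/\//i.test(path)) return path;
  // Root-relative paths attach to the scheme + host only.
  if (path.charAt(0) === "/") {
    return base.match(/^https?:\/\/[^\/]+/i)[0] + path;
  }
  // Everything else resolves against the base URL's directory.
  return base.replace(/\/[^\/]*$/, "/") + path;
}

// With jQuery you could then apply it to every image in the loaded content:
//   $("#content img").attr("src", function (i, src) {
//     return absolutize("http://www.smashingmagazine.com/", src);
//   });
```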


  • Davinder

    Hello,
Thanks for this tutorial! But would you be able to create a simple login script with error/warning messages using jQuery?

    for example:
    I log in with a wrong username/password and, without refreshing the page, an error pops up.

    Thank you


  • Ben

Hmm… wouldn’t it be better to use PHP all the way? I can’t really see a practical use for this. Nice tutorial though.


    • Ashley

@Ben the reason we use JavaScript is so we can easily inject the content into our page without compromising the rest of the page loading properly; if there is a problem loading the content, we can easily detect that with jQuery. It’s also a lot easier to do this with jQuery than it would be to strip out the content you want with PHP; you’d have to use some serious RegEx.


  • Eire32

That’s a nice workaround, I like it. Handy for pulling news if a site doesn’t have an RSS feed or the like.