<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Use jQuery and PHP to scrape page content</title>
	<atom:link href="http://papermashup.com/use-jquery-and-php-to-scrape-page-content/feed/" rel="self" type="application/rss+xml" />
	<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/</link>
	<description>Ashley Ford :: CSS &#124; PHP &#124; JavaScript</description>
	<lastBuildDate>Wed, 28 Jul 2010 15:36:02 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>By: JSON and PHP product gallery &#124; Papermashup.com</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1993</link>
		<dc:creator>JSON and PHP product gallery &#124; Papermashup.com</dc:creator>
		<pubDate>Thu, 07 Jan 2010 14:45:07 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1993</guid>
		<description>[...] cross domain ajax requests without too much fuss. This post really follows on from my article about scraping content from a page but also allows you to send and receive GET variables through the JSON [...]</description>
		<content:encoded><![CDATA[<p>[...] cross domain ajax requests without too much fuss. This post really follows on from my article about scraping content from a page but also allows you to send and receive GET variables through the JSON [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ashley</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1974</link>
		<dc:creator>Ashley</dc:creator>
		<pubDate>Mon, 04 Jan 2010 16:46:14 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1974</guid>
		<description>@mccormicky if you are in control of the content on both domains then it is safe to use this technique (which is the exact reason I used the technique) - if not you should consider the reputation of the site in question before proceeding.</description>
		<content:encoded><![CDATA[<p>@mccormicky if you are in control of the content on both domains then it is safe to use this technique (which is the exact reason I used the technique) &#8211; if not you should consider the reputation of the site in question before proceeding.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mccormicky</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1970</link>
		<dc:creator>mccormicky</dc:creator>
		<pubDate>Mon, 04 Jan 2010 13:08:01 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1970</guid>
		<description>If the content is WP I can and have used Simplepie to do it but getting this content to show in just one place created another problem.  

In the non WP part of the site there are only a few templates  that can be used effectively ...because the templates in question would be used by more than one page the RSS WP content shows up everywhere the template is used. 

The non WP part of the site uses Qcodo and Ajax. The qcodo controls are in files that have been encrypted. So what if I want to show content from the non WP part of the site in the WP part of the site? This content doesn&#039;t come with RSS.

So I went looking for other solutions: scraping. 

With scraping I can create a WP page and put whatever I want in it then include this page in the non WP part of the site using the techniques either from this tutorial or the one from Net Tuts.

Rubbish: what solution do you propose other than scraping?

Ashley: is this safe to do if the content is from the same website/domain?</description>
		<content:encoded><![CDATA[<p>If the content is WP I can and have used Simplepie to do it but getting this content to show in just one place created another problem.  </p>
<p>In the non WP part of the site there are only a few templates  that can be used effectively &#8230;because the templates in question would be used by more than one page the RSS WP content shows up everywhere the template is used. </p>
<p>The non WP part of the site uses Qcodo and Ajax. The qcodo controls are in files that have been encrypted. So what if I want to show content from the non WP part of the site in the WP part of the site? This content doesn&#8217;t come with RSS.</p>
<p>So I went looking for other solutions: scraping. </p>
<p>With scraping I can create a WP page and put whatever I want in it then include this page in the non WP part of the site using the techniques either from this tutorial or the one from Net Tuts.</p>
<p>Rubbish: what solution do you propose other than scraping?</p>
<p>Ashley: is this safe to do if the content is from the same website/domain?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ashley</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1933</link>
		<dc:creator>Ashley</dc:creator>
		<pubDate>Sat, 26 Dec 2009 18:25:41 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1933</guid>
		<description>@rubbish, as I stated in my last comment, &quot;&lt;em&gt;I&#8217;m &lt;strong&gt;not&lt;/strong&gt; condoning users to actively trawl the internet and scrape any content they wish...&lt;/em&gt;&quot;

I see your point about XSS from an external point of view if you don&#039;t control the domain or content, this wasn&#039;t however the case in my original solution so wasn&#039;t a problem. 

Maybe you should &lt;a href=&quot;http://net.tutsplus.com/tutorials/php/how-to-syndicate-content-without-utilizing-a-news-feed/&quot; rel=&quot;nofollow&quot;&gt;check this tutorial out&lt;/a&gt; from the reputable Nettuts.com, a little more complex but the same outcome.

As I have said previously, it&#039;s nice to use your name rather than hide anonymously.</description>
		<content:encoded><![CDATA[<p>@rubbish, as I stated in my last comment, &#8220;<em>I&#8217;m <strong>not</strong> condoning users to actively trawl the internet and scrape any content they wish&#8230;</em>&#8221;</p>
<p>I see your point about XSS from an external point of view if you don&#8217;t control the domain or content, this wasn&#8217;t however the case in my original solution so wasn&#8217;t a problem. </p>
<p>Maybe you should <a href="http://net.tutsplus.com/tutorials/php/how-to-syndicate-content-without-utilizing-a-news-feed/" rel="nofollow">check this tutorial out</a> from the reputable Nettuts.com, a little more complex but the same outcome.</p>
<p>As I have said previously, it&#8217;s nice to use your name rather than hide anonymously.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: rubbish</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1932</link>
		<dc:creator>rubbish</dc:creator>
		<pubDate>Sat, 26 Dec 2009 17:56:38 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1932</guid>
		<description>Ashley, I am not referring to content theft, (although your argument suggesting that a site&#039;s content is fair game for the taking solely because it is presented on the internet is flat out WRONG.)

The fact is, if the user has scripting on their page (HTML) and you scrape their page and incorporate their HTML (to be parsed with jQuery), that active content will have access to your page, will not be subject to cross-domain restrictions and will have control over any content or cookies set by your server.

You will be subjected to cross site scripting in the exact same way as if you had an input form which you echoed back to the user.

I know this because I have done it myself to idiots. :)

Remember, when you use CURL to request a page, that request is being sent with a particular signature that identifies it as a non-browser request (the user agent is all wrong, there are more hints as well); also, the IP address will be that of YOUR server, not that of a user ISP.
Therefore, the target can recognize this and send you targeted text back, which can result in an unfriendly site experience. They could even send back javascript that could parse and steal the log in cookie of whoever is viewing!

Remember back when Hotlinking graphic files was an issue and people used to substitute offensive files in the place of the hotlinked files?

In any event, the scraped content now constitutes INPUT and will have to be SANITIZED the same way you sanitize ALL input before including it in a page.

You can do this all with PHP, no need to write &quot;serious regEX&quot; a short google for PhpHTML dom
http://simplehtmldom.sourceforge.net/ will provide all you need.</description>
		<content:encoded><![CDATA[<p>Ashley, I am not referring to content theft, (although your argument suggesting that a site&#8217;s content is fair game for the taking solely because it is presented on the internet is flat out WRONG.)</p>
<p>The fact is, if the user has scripting on their page (HTML) and you scrape their page and incorporate their HTML (to be parsed with jQuery), that active content will have access to your page, will not be subject to cross-domain restrictions and will have control over any content or cookies set by your server.</p>
<p>You will be subjected to cross site scripting in the exact same way as if you had an input form which you echoed back to the user.</p>
<p>I know this because I have done it myself to idiots. <img src='http://papermashup.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Remember, when you use CURL to request a page, that request is being sent with a particular signature that identifies it as a non-browser request (the user agent is all wrong, there are more hints as well); also, the IP address will be that of YOUR server, not that of a user ISP.<br />
Therefore, the target can recognize this and send you targeted text back, which can result in an unfriendly site experience. They could even send back javascript that could parse and steal the log in cookie of whoever is viewing!</p>
<p>Remember back when Hotlinking graphic files was an issue and people used to substitute offensive files in the place of the hotlinked files?</p>
<p>In any event, the scraped content now constitutes INPUT and will have to be SANITIZED the same way you sanitize ALL input before including it in a page.</p>
<p>You can do this all with PHP, no need to write &#8220;serious regEX&#8221; a short google for PhpHTML dom<br />
<a href="http://simplehtmldom.sourceforge.net/" rel="nofollow">http://simplehtmldom.sourceforge.net/</a> will provide all you need.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ashley</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1930</link>
		<dc:creator>Ashley</dc:creator>
		<pubDate>Sat, 26 Dec 2009 09:48:50 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1930</guid>
		<description>@rubbish, You&#039;ve clearly not read the post, I&#039;m not condoning users to actively trawl the internet and scrape any content they wish, however if you don&#039;t want someone to use information from your site why post it online? 

The problem I had at work was that we had 2 sub domains that we needed to make ajax requests on, so they were owned by us.

If you&#039;d left your real name and email address I could have emailed you personally as you&#039;ve clearly got the wrong end of the stick.</description>
		<content:encoded><![CDATA[<p>@rubbish, You&#8217;ve clearly not read the post, I&#8217;m not condoning users to actively trawl the internet and scrape any content they wish, however if you don&#8217;t want someone to use information from your site why post it online? </p>
<p>The problem I had at work was that we had 2 sub domains that we needed to make ajax requests on, so they were owned by us.</p>
<p>If you&#8217;d left your real name and email address I could have emailed you personally as you&#8217;ve clearly got the wrong end of the stick.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: rubbish</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1929</link>
		<dc:creator>rubbish</dc:creator>
		<pubDate>Sat, 26 Dec 2009 00:46:18 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1929</guid>
		<description>executing someone else&#039;s  HTML within the context of your own page is asking for trouble and is the easiest way to get yourself XSS&#039;d.</description>
		<content:encoded><![CDATA[<p>executing someone else&#8217;s  HTML within the context of your own page is asking for trouble and is the easiest way to get yourself XSS&#8217;d.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tutorial City</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1842</link>
		<dc:creator>Tutorial City</dc:creator>
		<pubDate>Mon, 14 Dec 2009 07:17:13 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1842</guid>
		<description>very useful! thanks a lot</description>
		<content:encoded><![CDATA[<p>very useful! thanks a lot</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Really Useful Tutorials You Should Have Read in November 2009 Ajax Help W3C Tag</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1811</link>
		<dc:creator>Really Useful Tutorials You Should Have Read in November 2009 Ajax Help W3C Tag</dc:creator>
		<pubDate>Thu, 10 Dec 2009 07:37:39 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1811</guid>
		<description>[...] Use jQuery and PHP to scrape page content By Ashley Ford, November 18th, 2009 Site: PaperMashup [...]</description>
		<content:encoded><![CDATA[<p>[...] Use jQuery and PHP to scrape page content By Ashley Ford, November 18th, 2009 Site: PaperMashup [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Destillat KW49-2009 &#124; duetsch.info - GNU/Linux, Open Source, Softwareentwicklung, Selbstmanagement, Vim ...</title>
		<link>http://papermashup.com/use-jquery-and-php-to-scrape-page-content/comment-page-1/#comment-1786</link>
		<dc:creator>Destillat KW49-2009 &#124; duetsch.info - GNU/Linux, Open Source, Softwareentwicklung, Selbstmanagement, Vim ...</dc:creator>
		<pubDate>Fri, 04 Dec 2009 08:46:51 +0000</pubDate>
		<guid isPermaLink="false">http://papermashup.com/?p=1268#comment-1786</guid>
		<description>[...] Use jQuery and PHP to scrape page content [...]</description>
		<content:encoded><![CDATA[<p>[...] Use jQuery and PHP to scrape page content [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
