
Screen scraping with XPath in PHP – Web Scraping

2Base Technologies


Published Date: Aug 4, 2013

Screen scraping is a useful skill for any PHP developer. It involves fetching the source code of a web page into a string and then parsing out the parts that you want to use. The process takes a few steps, described below.

Get the source

Getting the contents of the HTML page is easy, and there are several ways to do it.

You can use cURL, file_get_contents(), or even the loadHTMLFile() method of DOMDocument. cURL is the best option if you need a lot of control over your requests, want to make asynchronous requests, or need an HTTP method other than GET. For basic tasks, file_get_contents() works fine.
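For the basic case, a minimal sketch using file_get_contents() could look like the following. The function name fetch_with_fgc, the timeout, and the user agent string are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: fetch a page with file_get_contents().
// The timeout and user agent values are illustrative assumptions.
function fetch_with_fgc(string $url, int $timeout = 10): string
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
        ],
    ]);

    // file_get_contents() returns false on failure, so check explicitly.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}
```

The explicit check matters because a failed request returns false rather than throwing, and an unchecked false would otherwise be passed on to the parser.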

As an example, I will scrape the Google Finance news for the stock GDHI. Google already provides RSS feeds for this news, but it serves well as an example.

<?php
include('curl.php');
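The contents of curl.php are not shown, so the following is only a sketch of what a fetch step might look like with PHP's cURL extension. The function name fetch_html, the timeout values, and the user agent string are assumptions for illustration, not the original helper.

```php
<?php
// Hypothetical stand-in for a cURL-based fetch helper.
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,   // seconds to wait for the connection
        CURLOPT_TIMEOUT        => 30,   // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
    ]);

    // curl_exec() returns false on failure when RETURNTRANSFER is set.
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released automatically when $ch goes out of scope.
    return $body;
}
```

With a helper like this, a call such as `$html_text = fetch_html($url);` with the URL of the page to scrape would fill the variable used in the rest of the walkthrough.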

Now you have the contents of the page in your $html_text variable.

Parsing the HTML

You can parse the HTML with DOM and XPath.

<?php
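A minimal sketch of this step, assuming the page source is already available as a string, might look like the following. The helper name html_to_xpath and the inline sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: build a DOMXPath object from an HTML string.
function html_to_xpath(string $html_text): DOMXPath
{
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html_text);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($dom);
}

// Small inline sample standing in for the fetched page.
$xpath = html_to_xpath('<html><body><div id="news-main"><p>Hello</p></div></body></html>');
echo $xpath->query('//div[@id="news-main"]/p')->item(0)->textContent, "\n";
```

Suppressing libxml warnings is a design choice here: scraped pages often contain markup that the parser complains about but still handles.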

With this DOM document, you can now scrape the news with XPath.

<?php
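As a sketch of the query step, the following selects the div elements with class 'g-section news sfe-break-bottom-16' inside the div with id 'news-main'. The sample markup is an assumption that only mimics that structure, and the variable name $news_nodes is hypothetical; check the real page source before relying on these selectors.

```php
<?php
// Sample markup mimicking the described structure (an assumption, not the real page).
$sample = <<<HTML
<div id="news-main">
  <div class="g-section news sfe-break-bottom-16"><a href="/a">First story</a></div>
  <div class="g-section news sfe-break-bottom-16"><a href="/b">Second story</a></div>
  <div class="other">Not news</div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Match divs whose class attribute equals the full class string,
// anywhere below the div with id "news-main".
$news_nodes = $xpath->query(
    '//div[@id="news-main"]//div[@class="g-section news sfe-break-bottom-16"]'
);

echo $news_nodes->length, "\n"; // number of matching news blocks
```

Note that @class="..." compares the whole attribute value, so the match breaks if the site adds, removes, or reorders a class name.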

Get each of the news with details

Now you have the main div containing the news. The next step is to loop through each news item and parse its title, date, and link.

<?php
$return = array();
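The loop could be sketched as follows. The inner markup, including the class names name, src, and date, is an assumption made up for this example, and the 'link' key is added only because the text mentions the link; adapt the relative XPath expressions to the real page.

```php
<?php
// Sample markup; the inner class names (name, src, date) are assumptions.
$sample = <<<HTML
<div id="news-main">
  <div class="g-section news sfe-break-bottom-16">
    <span class="name"><a href="/news/1">First story</a></span>
    <span class="src">Example Wire</span>
    <span class="date">Aug 1, 2013</span>
  </div>
  <div class="g-section news sfe-break-bottom-16">
    <span class="name"><a href="/news/2">Second story</a></span>
    <span class="src">Example Daily</span>
    <span class="date">Aug 2, 2013</span>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$news_nodes = $xpath->query(
    '//div[@id="news-main"]//div[@class="g-section news sfe-break-bottom-16"]'
);

$return = array();
foreach ($news_nodes as $node) {
    // Passing $node as the context makes each expression relative to one news block.
    $title  = trim($xpath->evaluate('string(.//span[@class="name"]/a)', $node));
    $link   = trim($xpath->evaluate('string(.//span[@class="name"]/a/@href)', $node));
    $source = trim($xpath->evaluate('string(.//span[@class="src"])', $node));
    $date   = trim($xpath->evaluate('string(.//span[@class="date"])', $node));

    // Create the array
    $return[] = array(
        'title'   => $title,
        'date'    => $date,
        'sources' => $source,
        'link'    => $link,
    );
}
```

Using evaluate() with string() returns an empty string when an element is missing, which avoids null checks on each node.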

Our news array is ready. The next step is to display it.

<?php
echo '<pre>';
print_r($return);
?>

Great! We are done. That's how we can write a screen scraper in PHP with XPath and cURL.


// Query for the contents of the div with class 'g-section news sfe-break-bottom-16' inside the div with id 'news-main'. Please refer to the HTML source of the URL being scraped.


// Create the array
$return[] = array(
    'title' => $title,
    'date' => $date,
    'sources' => $source,
);
}
?>
