Screen scraping with XPath in PHP – Web Scraping

Screen scraping is a useful skill for any PHP developer. It involves fetching the source code of a web page into a string and then parsing out the parts you want to use. The process has a few steps, described below.

Get the source

Getting the contents of an HTML page is easy, and there are several ways to do it.

You can use cURL, file_get_contents(), or even the loadHTMLFile() method of DOMDocument. cURL is the best option if you need a lot of control over your requests, want to make asynchronous requests, or need an HTTP method other than GET. For basic use, file_get_contents() works fine.
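
For the simple case, a minimal sketch along these lines may be enough. The helper name fetch_with_fgc, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original example.

```php
<?php
// Hypothetical helper: fetch a URL (or a local path) into a string.
function fetch_with_fgc(string $url, int $timeoutSeconds = 10): string
{
    // The 'http' options are used for http:// and https:// requests.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/1.0', // assumed value
        ],
    ]);

    // Suppress the warning and report failure through an exception instead.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $body;
}
```

With this in place, $html_text = fetch_with_fgc($url); would fill the same variable that the cURL version below produces.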

This example scrapes the Google Finance news page for the stock GDHI. Google already provides RSS feeds for the same data, but the page serves well as an example.

<?php
// curl.php provides the Curl wrapper class used below
include('curl.php');

// Create an instance of the Curl class
$curl = new Curl();

// URL of the page to scrape
$url = "http://www.google.com/finance/company_news?q=PINK:GDHI&start=0&num=30";

// Form fields to send with the request
$fields = "usr=user1&pass=PassWord";

// Fetch the contents of the URL using cURL
$html_text = $curl->postForm($url, $fields);
?>

Now you have the contents of the page in your $html_text variable.
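
The Curl class from curl.php is not shown here. If you do not have such a wrapper, a rough stand-in can be sketched with PHP's built-in cURL functions. The function name curl_post_form and the chosen options (following redirects, a 10-second timeout) are assumptions for illustration; the real class may behave differently.

```php
<?php
// Hypothetical stand-in for Curl::postForm(): POST form fields and return the response body.
function curl_post_form(string $url, string $fields, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,            // send a POST request
        CURLOPT_POSTFIELDS     => $fields,         // URL-encoded form body
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up after this many seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    return $body;
}
```

It would be called the same way as the wrapper: $html_text = curl_post_form($url, $fields);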

Parsing the HTML

You can parse the HTML with the DOM extension and XPath.

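<?php

// Collect libxml parse warnings instead of printing them; real-world HTML is rarely well-formed
libxml_use_internal_errors(true);

// Create the DOM document
$html = new DOMDocument();

// Load the contents of the URL
// If you are not using cURL, load the page directly instead: $html->loadHTMLFile($url);
$html->loadHTML($html_text);
?>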

With this DOM document, you can now scrape the news with XPath.

<?php

// Create the XPath object
$xpath = new DOMXPath( $html );

// Select each div with class 'g-section news sfe-break-bottom-16' inside the div with id 'news-main'.
// Refer to the HTML source of the page being scraped for these names.
$links = $xpath->query( ".//div[@id='news-main']/div[@class='g-section news sfe-break-bottom-16']" );
?>
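
Note that [@class='...'] compares the whole attribute string, so the query above stops matching if the page lists the same class names in a different order. The sketch below shows the difference on a small invented snippet (the markup and the variable names are assumptions, not the real Google Finance source), using a common XPath 1.0 idiom to match a single class name.

```php
<?php
// Invented markup that mimics the structure the query expects.
$sampleHtml = <<<HTML
<html><body>
<div id="news-main">
  <div class="g-section news sfe-break-bottom-16"><span class="name"><a href="/a">First</a></span></div>
  <div class="news g-section sfe-break-bottom-16"><span class="name"><a href="/b">Second</a></span></div>
  <div class="other">Not a news item</div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Exact string comparison: only the first div matches.
$exact = $xp->query("//div[@id='news-main']/div[@class='g-section news sfe-break-bottom-16']");

// Token match: true when "news" is one of the space-separated class names.
$byToken = $xp->query(
    "//div[@id='news-main']/div[contains(concat(' ', normalize-space(@class), ' '), ' news ')]"
);

echo $exact->length, "\n";   // 1
echo $byToken->length, "\n"; // 2
```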

Get each of the news with details

Now you have the divs that contain the news items. The next step is to loop through each item and parse its title, date, and link.

<?php
$return = array();

// Loop through each news item
foreach ( $links as $item ) {

    // Copy the item into its own document so the queries below see only this item
    $newDom = new DOMDocument;
    $newDom->appendChild($newDom->importNode($item, true));
    $itemXpath = new DOMXPath( $newDom );

    // Get the title of the news item
    $title = trim($itemXpath->query("//span[@class='name']/a")->item(0)->nodeValue);

    // Get the date of the news item
    $date = trim($itemXpath->query("//div[@class='byline']/span[@class='date']")->item(0)->nodeValue);

    // Get the link (source) of the news item
    $source = trim($itemXpath->query("//span[@class='name']/a")->item(0)->getAttribute('href'));

    // Add the item to the result array
    $return[] = array(
        'title' => $title,
        'date' => $date,
        'sources' => $source,
    );
}
?>
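
The loop above copies every item into a new document and assumes each query finds a node. As an alternative sketch, DOMXPath::query() accepts a context node as its second argument, so the same fields can be read with relative queries and a null check. The sample markup, the date value, and the variable names below are invented for illustration.

```php
<?php
// Invented markup with one complete item and one item missing its date.
$sampleHtml = <<<HTML
<div id="news-main">
  <div class="g-section news sfe-break-bottom-16">
    <span class="name"><a href="http://example.com/one"> First headline </a></span>
    <div class="byline"><span class="date">Jan 1, 2010</span></div>
  </div>
  <div class="g-section news sfe-break-bottom-16">
    <span class="name"><a href="http://example.com/two">Second headline</a></span>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
libxml_clear_errors();
$xp = new DOMXPath($doc);

$news = [];
$items = $xp->query("//div[@id='news-main']/div[@class='g-section news sfe-break-bottom-16']");
foreach ($items as $item) {
    // A leading "." keeps each query inside the current news item.
    $anchor = $xp->query(".//span[@class='name']/a", $item)->item(0);
    $date   = $xp->query(".//div[@class='byline']/span[@class='date']", $item)->item(0);

    if ($anchor === null) {
        continue; // no title link, so skip the item
    }

    $news[] = [
        'title'   => trim($anchor->nodeValue),
        'date'    => $date !== null ? trim($date->nodeValue) : '',
        'sources' => $anchor->getAttribute('href'),
    ];
}

print_r($news);
```

Here an item without a date yields an empty string rather than a fatal error, and an item without a title link is skipped.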

Our news array is ready. The next step is to display it.

<?php
echo '<pre>';
print_r($return);
?>
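
print_r() is handy for checking the result. To show the items on a real page, the scraped text should be escaped first, since it comes from another site. A minimal sketch, assuming the array shape built above (render_news_list is a hypothetical helper name and the sample item is invented):

```php
<?php
// Hypothetical helper: render scraped items as an HTML list, escaping every value.
function render_news_list(array $items): string
{
    $out = "<ul>\n";
    foreach ($items as $item) {
        $title = htmlspecialchars($item['title'], ENT_QUOTES, 'UTF-8');
        $date  = htmlspecialchars($item['date'], ENT_QUOTES, 'UTF-8');
        $href  = htmlspecialchars($item['sources'], ENT_QUOTES, 'UTF-8');
        $out  .= "  <li><a href=\"$href\">$title</a> ($date)</li>\n";
    }

    return $out . "</ul>\n";
}

// Invented sample data in the same shape as $return.
echo render_news_list([
    [
        'title'   => 'Profit <up> & rising',
        'date'    => 'Jan 1, 2010',
        'sources' => 'http://example.com/?a=1&b=2',
    ],
]);
```

Escaping covers HTML special characters only; it does not validate that the link is a URL you want to expose.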

We are done. That is how you can write a screen scraper in PHP with XPath and cURL.
