2Base Technologies - Application Development and Software Services


Screen scraping with XPath in PHP

August 03, 2010 | Technology | by Administrator

Screen scraping is a useful skill for any PHP developer. It means fetching the source code of a web page into a string and then parsing out the parts that you want to use.
The process takes a few steps, described below.

Get the source

Getting the contents of an HTML page is easy. You can use cURL, file_get_contents(), or the DOM extension's DOMDocument::loadHTMLFile(). cURL is the best option if you need a lot of control over your requests, want to make asynchronous (parallel) requests, or need an HTTP method other than GET. For basic fetching, file_get_contents() works fine.
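As a minimal sketch of the simplest option, the helper below wraps file_get_contents() with a stream context so that a timeout and a User-Agent header are set. The function name fetchPage, the 10-second default, and the User-Agent string are assumptions made for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper: fetch a page with file_get_contents() and a stream context.
// The name fetchPage, the default timeout and the User-Agent are illustrative assumptions.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/1.0',
        ),
    ));

    // Suppress the warning on failure and report it through the return value instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

A call such as `$html_text = fetchPage($url);` returns the page source as a string, or null if the request failed.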

This example uses the Google Finance news page for the stock GDHI. Google already provides an RSS feed for this news, but the page serves well as an example.


<?php
// Curl is a helper class defined in curl.php
include('curl.php');

// Create the Curl object
$curl = new Curl();

// URL of the page to scrape
$url = "http://www.google.com/finance/company_news?q=PINK:GDHI&start=0&num=30";

// Example form fields to POST with the request
$fields = "usr=user1&pass=PassWord";

// Fetch the contents of the URL using cURL
$html_text = $curl->postForm($url, $fields);
?>

Now you have the contents of the page in your $html_text variable.
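The Curl class loaded from curl.php is not shown in this post. As a rough sketch only, a wrapper offering the same postForm($url, $fields) call could be built on PHP's cURL extension as below. The real curl.php may differ, and the option choices here (redirect handling, the 30-second timeout, throwing an exception on failure) are assumptions.

```php
<?php
// Hypothetical stand-in for the Curl class from curl.php (the original file is not shown).
// Requires PHP's cURL extension.
class Curl
{
    // POST a URL-encoded field string to $url and return the response body.
    public function postForm($url, $fields)
    {
        $handle = curl_init($url);
        curl_setopt_array($handle, array(
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $fields, // e.g. "usr=user1&pass=PassWord"
            CURLOPT_RETURNTRANSFER => true,    // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,    // follow redirects
            CURLOPT_TIMEOUT        => 30,      // give up after 30 seconds
        ));

        $body  = curl_exec($handle);
        $error = curl_error($handle);
        // The handle is released when $handle goes out of scope.

        if ($body === false) {
            throw new RuntimeException('cURL request failed: ' . $error);
        }
        return $body;
    }
}
```

Throwing on failure is one possible design; returning false and letting the caller check would work just as well.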

Parsing the HTML

You can parse the HTML with the DOM extension and XPath.


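<?php

// Create the DOM document
$html = new DOMDocument();

// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);

// Load the fetched page contents
// (if you are not using cURL, call $html->loadHTMLFile($url) instead)
$html->loadHTML($html_text);
libxml_clear_errors();
?>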
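Pages in the wild are rarely valid markup, and by default loadHTML() reports each problem as a PHP warning. The following self-contained sketch shows how libxml_use_internal_errors() collects those messages so they can be inspected or logged. It uses a made-up HTML fragment rather than the real Google Finance page.

```php
<?php
// Made-up, deliberately sloppy HTML: the <p> elements are never closed.
$sample = '<html><body><div id="news-main"><p>First<p>Second</div></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($sample);

// Inspect (or log) whatever the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser still builds a usable tree from the broken markup.
$paragraphs = $doc->getElementsByTagName('p');
echo $paragraphs->length . " paragraphs found\n";
```

Each entry in $parseErrors is a LibXMLError object with message, line, and level properties.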

With this DOM document, you can now scrape the news with XPath.


<?php

// Create the XPath object
$xpath = new DOMXPath( $html );

// Query for each div with class 'g-section news sfe-break-bottom-16' inside the div with id 'news-main'
// (see the HTML source of the page being scraped)
$links = $xpath->query( ".//div[@id='news-main']/div[@class='g-section news sfe-break-bottom-16']" );
?>
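One caveat with the query above: @class='g-section news sfe-break-bottom-16' is an exact string comparison, so it stops matching if the page adds, removes, or reorders a class name. A common XPath 1.0 idiom tests for a single class token instead. The sketch below runs both forms against a made-up fragment; the markup is an assumption, not the real page source.

```php
<?php
// Made-up fragment: three divs with different class lists.
$sample = '<div id="news-main">'
        . '<div class="g-section news sfe-break-bottom-16">A</div>'
        . '<div class="news g-section highlighted">B</div>'
        . '<div class="g-section footer">C</div>'
        . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Exact match: only the div whose class attribute is exactly this string.
$exact = $xpath->query(
    "//div[@id='news-main']/div[@class='g-section news sfe-break-bottom-16']"
);

// Token match: any div whose class list contains the word "news".
$token = $xpath->query(
    "//div[@id='news-main']/div[contains(concat(' ', normalize-space(@class), ' '), ' news ')]"
);

echo $exact->length . " exact, " . $token->length . " by token\n";
```

The padding with spaces on both sides keeps a class such as "newsletter" from matching the token "news".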

Get the details of each news item

Now you have the news blocks from inside the main div. The next step is to loop through each news item and parse its title, date, and link.


<?php
$return = array();

// Loop through each news item
foreach ( $links as $item ) {

    // Copy the item into its own document so the // queries below see only this item
    $newDom = new DOMDocument;
    $newDom->appendChild($newDom->importNode($item, true));
    $itemXpath = new DOMXPath( $newDom );

    // query() returns a DOMNodeList; item(0) is the first match, or null if there is none
    $linkNode = $itemXpath->query("//span[@class='name']/a")->item(0);
    $dateNode = $itemXpath->query("//div[@class='byline']/span[@class='date']")->item(0);

    // Skip blocks that do not contain a headline link
    if ($linkNode === null) {
        continue;
    }

    // Collect the title, date, and link (source) of the news item
    $return[] = array(
        'title' => trim($linkNode->nodeValue),
        'date' => $dateNode !== null ? trim($dateNode->nodeValue) : '',
        'sources' => trim($linkNode->getAttribute('href')),
    );
}
?>
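As an alternative sketch, DOMXPath::query() accepts a context node as its second argument, so each item can be queried in place with a relative path (one that starts with a dot) rather than being copied into a new DOMDocument. It checks that a node was found before reading it, since item(0) returns null when nothing matches. The HTML fragment is invented for illustration and only mimics the class names used above.

```php
<?php
// Invented fragment that mimics the structure queried above.
$sample = '<div id="news-main">'
        . '<div class="g-section news sfe-break-bottom-16">'
        . '<span class="name"><a href="http://example.com/a"> First story </a></span>'
        . '<div class="byline"><span class="date">Aug 2, 2010</span></div>'
        . '</div>'
        . '<div class="g-section news sfe-break-bottom-16">'
        . '<span class="name"><a href="http://example.com/b">Second story</a></span>'
        . '</div>'
        . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$items = $xpath->query("//div[@id='news-main']/div[@class='g-section news sfe-break-bottom-16']");

$news = array();
foreach ($items as $item) {
    // Relative queries, evaluated against the current item only.
    $anchor = $xpath->query(".//span[@class='name']/a", $item)->item(0);
    $date   = $xpath->query(".//div[@class='byline']/span[@class='date']", $item)->item(0);

    if ($anchor === null) {
        continue; // no headline link in this block, skip it
    }

    $news[] = array(
        'title'   => trim($anchor->nodeValue),
        'date'    => $date !== null ? trim($date->nodeValue) : '',
        'sources' => trim($anchor->getAttribute('href')),
    );
}

print_r($news);
```

Note that getAttribute() belongs to a single element, not to the DOMNodeList that query() returns, which is why item(0) is taken first.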

Our news array is ready. The next step is to display it.


<?php
// Print the array in a readable, preformatted block
echo '<pre>';
print_r($return);
echo '</pre>';
?>
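Because the titles and links come from a third-party page, it is safer to escape them before printing them as HTML. A minimal sketch, assuming the array shape built above; the function name renderNews and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: render the scraped items as an HTML list, escaping every value.
function renderNews(array $items): string
{
    $html = "<ul>\n";
    foreach ($items as $item) {
        $title = htmlspecialchars($item['title'], ENT_QUOTES, 'UTF-8');
        $date  = htmlspecialchars($item['date'], ENT_QUOTES, 'UTF-8');
        $link  = htmlspecialchars($item['sources'], ENT_QUOTES, 'UTF-8');
        $html .= "  <li><a href=\"{$link}\">{$title}</a> ({$date})</li>\n";
    }
    return $html . "</ul>\n";
}

// Made-up sample data in the same shape as $return.
echo renderNews(array(
    array(
        'title'   => 'Profits <up> & rising',
        'date'    => 'Aug 2, 2010',
        'sources' => 'http://example.com/a?x=1&y=2',
    ),
));
```

With the real data, the call would be `echo renderNews($return);`.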

Great, we are done. That is how you can write a screen scraper in PHP with XPath and cURL.

You can also download the code, with examples for scraping a list and a single portion.


Comments (9)

Why would I be getting the error "Fatal error: Call to undefined method DOMNodeList::getAttribute()", referring to the $source = line?


© 2012 - 2Base Technologies, All rights reserved