I want to extract content from two different tags using PHP. I want to associate each h1 tag with the content of the div tags that immediately follow it, like a parent-child relationship.

<h1>Title 1</h1>
<div class="items">some data and divs here 1</div>
<h1>Title 2</h1>
<div class="items">some data and divs here 2</div>
<div class="items">some data and divs here 3</div>
<h1>Title 3</h1>
<div class="items">some data and divs here 4</div>
<div class="items">some data and divs here 5</div>
<div class="items">some data and divs here 6</div>

The number of items between two h1 tags varies.

I know how to scrape all tags of one type with simple_html_dom or GoutteClient to get:

<h1>Title 1</h1>
<h1>Title 2</h1>
<h1>Title 3</h1>

Or

<div class="items">some data and divs here 1</div>
<div class="items">some data and divs here 2</div>
<div class="items">some data and divs here 3</div>
<div class="items">some data and divs here 4</div>
<div class="items">some data and divs here 5</div>
<div class="items">some data and divs here 6</div>

But I am unable to associate each title with its data. I cannot figure out how to build an array like this:

array (
  0 => 
  array (
    'item' => 'Title 1',
    'data' => 'some data and divs here 1',
  ),
  1 => 
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 2',
  ),
  2 => 
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 3',
  ),
  3 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 4',
  ),
  4 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 5',
  ),
  5 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 6',
  ),
)

I’ve tried an approach based on sibling traversal, but couldn’t get it to work.

2 Answers


  1. Here’s an idea: use some string manipulation to wrap the parts between the h1 tags in a span (for example), then read the result with PHP’s DOMDocument, collecting the elements by tag name (h1 and span).

    Here’s my attempt:

    $html = '<h1>Title 1</h1>
    <div class="items">some data and divs here 1</div>
    <h1>Title 2</h1>
    <div class="items">some data and divs here 2</div>
    <div class="items">some data and divs here 3</div>
    <h1>Title 3</h1>
    <div class="items">some data and divs here 4</div>
    <div class="items">some data and divs here 5</div>
    <div class="items">some data and divs here 6</div>';
    
    $html = str_replace('</h1>', '</h1><span>', $html);
    $html = str_replace('<h1>', '</span><h1>', $html);
    $html = "<span>$html</span>";
    
    $xml = new DOMDocument();
    $xml->loadHTML($html);
    
    $items = array();
    foreach($xml->getElementsByTagName('span') as $item) {
        $items[] = trim($item->nodeValue);
    }
    array_shift($items);  // drop the empty span created before the first h1
    
    $titles = array();
    foreach($xml->getElementsByTagName('h1') as $title) {
        $titles[] = trim($title->nodeValue);
    }
    

    Output for $items and $titles:

    Array
    (
        [0] => some data and divs here 1
        [1] => some data and divs here 2
    some data and divs here 3
        [2] => some data and divs here 4
    some data and divs here 5
    some data and divs here 6
    )
    Array
    (
        [0] => Title 1
        [1] => Title 2
        [2] => Title 3
    )
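
    To get from these two arrays to the array in the question, one option is to split each $items entry back into individual lines and pair every line with its title. This is only a minimal sketch, assuming each div's text sits on its own line as in the sample markup; the values are copied from the output above, and $result is a name chosen for illustration:

    ```php
    <?php
    // Values copied from the output above; in practice they come from the two loops.
    $titles = ['Title 1', 'Title 2', 'Title 3'];
    $items = [
        "some data and divs here 1",
        "some data and divs here 2\nsome data and divs here 3",
        "some data and divs here 4\nsome data and divs here 5\nsome data and divs here 6",
    ];

    $result = [];
    foreach ($titles as $i => $title) {
        // Each $items entry holds the text of every div in one section, one per line.
        $lines = preg_split('/\R/', $items[$i] ?? '', -1, PREG_SPLIT_NO_EMPTY);
        foreach ($lines as $line) {
            $line = trim($line);
            if ($line === '') {
                continue; // skip whitespace-only lines
            }
            $result[] = ['item' => $title, 'data' => $line];
        }
    }
    var_export($result);
    ```

    This breaks down if the text of a single div spans several lines; in that case walking the DOM nodes directly is likely more robust.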
    
  2. Based on the answer to "XPath until next tag", I made only a few modifications to generate the desired result.

    Code:

    // $html holds the markup from the question
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $domNodeList = $xpath->query('/html/body/h1');
    
    $result = [];
    foreach($domNodeList as $element) {
        // Save the h1
        $item = $element->nodeValue;
    
        // Loop over the siblings until the next h1
        while ($element = $element->nextSibling) {
            if ($element->nodeName === "h1") {
                break;
            }
            // Skip text nodes (whitespace between tags); keep only elements
            if ($element->nodeType === XML_ELEMENT_NODE) {
                $result[] = ['item' => $item, 'data' => $element->nodeValue];
            }
        }
    }
    var_export($result);
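
    A variation on the same idea goes the other way: start from each div and look back to its nearest preceding h1 with XPath, instead of looping over siblings. This is only a sketch and not part of the approach above; it assumes the headings and divs are siblings as in the sample markup, and note that //div[@class="items"] would also match nested divs carrying exactly that class attribute:

    ```php
    <?php
    $html = '<h1>Title 1</h1>
    <div class="items">some data and divs here 1</div>
    <h1>Title 2</h1>
    <div class="items">some data and divs here 2</div>
    <div class="items">some data and divs here 3</div>
    <h1>Title 3</h1>
    <div class="items">some data and divs here 4</div>
    <div class="items">some data and divs here 5</div>
    <div class="items">some data and divs here 6</div>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $result = [];
    foreach ($xpath->query('//div[@class="items"]') as $div) {
        // preceding-sibling is a reverse axis, so [1] selects the closest h1 before this div.
        $title = $xpath->evaluate('string(preceding-sibling::h1[1])', $div);
        $result[] = ['item' => trim($title), 'data' => trim($div->nodeValue)];
    }
    var_export($result);
    ```

    A div with no h1 before it ends up with an empty string as its item, so it may be worth filtering those out if the real page has leading divs.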
    