
I am writing a scraping program in PHP. It runs without errors, but the scraped result differs slightly from what I expected.

Here is my script:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $eng_SCCW_array["Here is my website"]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it

    $html = curl_exec($ch);

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings about malformed markup
    $elements_content = $doc->textContent; // all text in the document, with tags dropped
    echo $elements_content."</br>"."</br>";

And here is the scraping result:
enter image description here

The problem is that some white space is missing, because the script does not read any ‘br’ elements. This will make later processing of the data very complicated. I want to split the scraping result as shown in the image below. How can I do that?

enter image description here

2 Answers


  1. First, check whether the response contains any tag or element that you can loop over, adding a line break after each one.
    At the moment you are reading the whole DOM as text, and your code only adds "</br>" at the end of the response.

     $elements_content = $doc->textContent; 
    

    The line above gives you the whole DOM as text, as in this screenshot: https://prnt.sc/-oPTh0o7oXk_

    You need to find the relevant tag and add a line break after each match with a loop:

     $elements = $doc->getElementsByTagName('a');
     foreach($elements as $element) {
         echo $element->nodeValue . '<br>';
     }
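
    If the page separates its lines with br elements, as the question suggests, one option is to swap each br for a newline before reading the text. The sketch below is only an illustration under that assumption: textWithLineBreaks() is a made-up helper name, and the sample markup is a shortened stand-in for the real page.

    ```php
    <?php
    // Hypothetical helper (not from the original post): returns the text of an
    // HTML fragment with every <br> element turned into a newline character.
    function textWithLineBreaks(string $html): string
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // getElementsByTagName() returns a live list, so copy it to an array
        // before changing the tree.
        foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
            $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
        }

        return $doc->documentElement->textContent;
    }

    // Shortened sample modelled on one <pre> block of the report.
    $sample = "<pre>HURRICANE FORCE WINDS IN HONG KONG ADJACENT WATERS,<br/>"
            . "SHANWEI, SOUTH OF HONG KONG AND SHANGCHUAN DAO. STRONG<br/>"
            . "WINDS IN NAN'AO.</pre>";

    // Split on any newline sequence to get one array entry per original line.
    $lines = preg_split('/\R/', textWithLineBreaks($sample));
    echo implode(PHP_EOL, $lines), PHP_EOL;
    ```

    Replacing the nodes in the DOM, rather than running a regular expression over the raw HTML, means the parser decides what counts as a br element.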
    

    If you can share the URL that returns this response, I will check the response type and try to add line breaks in the right places.

  2. If you are going to iterate through an array of URLs, you’ll need to pair each URL with its own instructions or parsing function.
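
    As a rough sketch of that pairing, each URL can be the key of an array whose value is a closure that knows how to read that page. The example.com URLs, the parsePage() helper, and the second parser are invented for illustration; only the weather_report XPath comes from this thread.

    ```php
    <?php
    // Map each (hypothetical) URL to a callable that extracts what is wanted
    // from that particular page layout.
    $parsers = [
        'https://example.com/coastal-waters' => function (DOMXPath $xpath): array {
            $texts = [];
            foreach ($xpath->query('//div[@id="weather_report"]/pre') as $pre) {
                $texts[] = $pre->textContent;
            }
            return $texts;
        },
        'https://example.com/headlines' => function (DOMXPath $xpath): array {
            $texts = [];
            foreach ($xpath->query('//h2') as $heading) {
                $texts[] = trim($heading->textContent);
            }
            return $texts;
        },
    ];

    // Hypothetical helper: parse the HTML once, then hand it to the chosen parser.
    function parsePage(string $html, callable $parser): array
    {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
        $doc->loadHTML($html);
        libxml_clear_errors();

        return $parser(new DOMXPath($doc));
    }

    // Stand-in for the string that curl_exec() would return for the first URL.
    $html = '<div id="weather_report"><pre>WARNINGS:</pre><pre>WEATHER SITUATION:</pre></div>';
    print_r(parsePage($html, $parsers['https://example.com/coastal-waters']));
    ```

    In a real loop you would fetch each key with cURL and pass the response to parsePage() together with the closure stored under that key.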

    For the provided URL, I’d use XPath to target the desired content.

    Code: (Demo)

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@id="weather_report"]/pre') as $item) {
        var_dump($item->textContent);
    }
    

    Output:

    string(34) "(NOT TO BE ISSUED AFTER 3:30 A.M.)"
    string(117) "WEATHER INFORMATION FOR SOUTH CHINA COASTAL WATERS ISSUEDBY THE HONG KONG OBSERVATORY AT 0:25 A.M. ON 2 SEPTEMBER2023"
    string(9) "WARNINGS:"
    string(121) "HURRICANE FORCE WINDS IN HONG KONG ADJACENT WATERS,SHANWEI, SOUTH OF HONG KONG AND SHANGCHUAN DAO. STRONGWINDS IN NAN'AO."
    string(18) "WEATHER SITUATION:"
    string(583) "SUPER TYPHOON SAOLA HAS WEAKENED INTO A SEVERE TYPHOON. AT MIDNIGHT, SAOLA WITH MAXIMUM SUSTAINED WINDS OF ABOUT175 KILOMETRES PER HOUR NEAR ITS CENTRE WAS LOCATED AT 22.0DEGREES NORTH, 114.0 DEGREES EAST. IT IS FORECAST TO MOVEWEST AT ABOUT 14 KILOMETRES PER HOUR ACROSS THE VICINITY OFTHE PEARL RIVER ESTUARY.  MEANWHILE, TYPHOON HAIKUI WITH MAXIMUM SUSTAINED WINDS OFABOUT 120 KILOMETRES PER HOUR NEAR ITS CENTRE WAS LOCATEDAT 22.1 DEGREES NORTH, 127.5 DEGREES EAST. IT IS FORECASTTO MOVE WEST OR WEST-NORTHWEST AT ABOUT 18 KILOMETRES PERHOUR IN THE GENERAL DIRECTION OF TAIWAN."
    string(36) "AREA FORECASTS FOR THE NEXT 24 HOURS"
    string(26) "HONG KONG ADJACENT WATERS:"
    string(131) "CYCLONIC FORCE 12, BECOMING SOUTHEAST FORCE 11 TO 12.FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.VERY HIGH TO PHENOMENAL SEAS."
    string(521) "NAN'AO:SOUTH TO SOUTHEAST FORCE 5 TO 6, FORCE 7 AT FIRST.SCATTERED SQUALLY SHOWERS AND THUNDERSTORMS.MODERATE TO ROUGH SEAS.SHANWEI:SOUTH TO SOUTHEAST FORCE 9 TO 10, FORCE 12 AT FIRST.FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.HIGH SEAS, PHENOMENAL AT FIRST.SOUTH OF HONG KONG:WEST TO SOUTHWEST FORCE 11 TO 12.FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.VERY HIGH TO PHENOMENAL SEAS.SHANGCHUAN DAO:NORTHWEST FORCE 12, BECOMING CYCLONIC FORCE 12.FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.PHENOMENAL SEAS."
    string(34) "OUTLOOK FOR THE FOLLOWING 48 HOURS"
    string(96) "SOUTH TO SOUTHEASTERLY WINDS OF FORCE 10 TO 11. FREQUENTHEAVY SQUALLY SHOWERS AND THUNDERSTORMS."
    string(56) "LATEST REPORTS FROM SOUTH CHINA COASTAL WEATHER STATIONS"
    string(56) "--------------------------------------------------------"
    string(66) "WAGLAN ISLAND      WIND EAST FORCE 10 WITH GUSTS TO FORCE 11"
    string(54) "HUANGMAOZHOU       WIND SOUTH-SOUTHWEST FORCE 9"
    string(112) "MACAU              WIND NORTH-NORTHWEST FORCE 6                    VISIBILITY 2100 METRES IN RAIN"
    string(105) "XISHA              WIND WEST-SOUTHWEST FORCE 4                    VISIBILITY 21 KILOMETRES"
    

    Or if you want the whole HTML chunk so that you can perform surgery on it, use saveXML(). (Demo)

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $item = $xpath->query('//div[@id="weather_report"]')->item(0);
    var_dump($item->ownerDocument->saveXML($item));
    

    Output:

    string(2917) "<div id="weather_report">SOUTH CHINA COASTAL WATERS<br/><br/><pre>(NOT TO BE ISSUED AFTER 3:30 A.M.)</pre><br/><br/><pre>WEATHER INFORMATION FOR SOUTH CHINA COASTAL WATERS ISSUED<br/>BY THE HONG KONG OBSERVATORY AT 0:25 A.M. ON 2 SEPTEMBER<br/>2023</pre><br/><br/><br/><pre>WARNINGS:</pre><br/><pre>HURRICANE FORCE WINDS IN HONG KONG ADJACENT WATERS,<br/>SHANWEI, SOUTH OF HONG KONG AND SHANGCHUAN DAO. STRONG<br/>WINDS IN NAN'AO.</pre><br/><br/><br/><pre>WEATHER SITUATION:</pre><br/><br/><pre>SUPER TYPHOON SAOLA HAS WEAKENED INTO A SEVERE TYPHOON.<br/> <br/>AT MIDNIGHT, SAOLA WITH MAXIMUM SUSTAINED WINDS OF ABOUT<br/>175 KILOMETRES PER HOUR NEAR ITS CENTRE WAS LOCATED AT 22.0<br/>DEGREES NORTH, 114.0 DEGREES EAST. IT IS FORECAST TO MOVE<br/>WEST AT ABOUT 14 KILOMETRES PER HOUR ACROSS THE VICINITY OF<br/>THE PEARL RIVER ESTUARY. <br/> <br/>MEANWHILE, TYPHOON HAIKUI WITH MAXIMUM SUSTAINED WINDS OF<br/>ABOUT 120 KILOMETRES PER HOUR NEAR ITS CENTRE WAS LOCATED<br/>AT 22.1 DEGREES NORTH, 127.5 DEGREES EAST. 
IT IS FORECAST<br/>TO MOVE WEST OR WEST-NORTHWEST AT ABOUT 18 KILOMETRES PER<br/>HOUR IN THE GENERAL DIRECTION OF TAIWAN.</pre><br/><br/><pre>AREA FORECASTS FOR THE NEXT 24 HOURS</pre><br/><br/><br/><pre>HONG KONG ADJACENT WATERS:</pre><br/><pre>CYCLONIC FORCE 12, BECOMING SOUTHEAST FORCE 11 TO 12.<br/>FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.<br/>VERY HIGH TO PHENOMENAL SEAS.</pre><br/><br/><br/><pre>NAN'AO:<br/>SOUTH TO SOUTHEAST FORCE 5 TO 6, FORCE 7 AT FIRST.<br/>SCATTERED SQUALLY SHOWERS AND THUNDERSTORMS.<br/>MODERATE TO ROUGH SEAS.<br/><br/>SHANWEI:<br/>SOUTH TO SOUTHEAST FORCE 9 TO 10, FORCE 12 AT FIRST.<br/>FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.<br/>HIGH SEAS, PHENOMENAL AT FIRST.<br/><br/>SOUTH OF HONG KONG:<br/>WEST TO SOUTHWEST FORCE 11 TO 12.<br/>FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.<br/>VERY HIGH TO PHENOMENAL SEAS.<br/><br/>SHANGCHUAN DAO:<br/>NORTHWEST FORCE 12, BECOMING CYCLONIC FORCE 12.<br/>FREQUENT HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.<br/>PHENOMENAL SEAS.<br/><br/></pre><br/><pre>OUTLOOK FOR THE FOLLOWING 48 HOURS</pre><br/><pre>SOUTH TO SOUTHEASTERLY WINDS OF FORCE 10 TO 11. FREQUENT<br/>HEAVY SQUALLY SHOWERS AND THUNDERSTORMS.</pre><br/><br/><br/><br/><pre>LATEST REPORTS FROM SOUTH CHINA COASTAL WEATHER STATIONS</pre><br/><pre>--------------------------------------------------------</pre><br/><br/><br/><pre>WAGLAN ISLAND      WIND EAST FORCE 10 WITH GUSTS TO FORCE 11</pre><br/><br/><br/><pre>HUANGMAOZHOU       WIND SOUTH-SOUTHWEST FORCE 9</pre><br/><br/><br/><pre>MACAU              WIND NORTH-NORTHWEST FORCE 6 <br/>                   VISIBILITY 2100 METRES IN RAIN</pre><br/><br/><br/><pre>XISHA              WIND WEST-SOUTHWEST FORCE 4 <br/>                   VISIBILITY 21 KILOMETRES</pre><br/><br/>DISPATCHED BY HONG KONG OBSERVATORY AT 00:25 HKT ON 02.09.2023<br/></div>"
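
    One possible form of that surgery, assuming you want one array entry per displayed line, is to turn each br and each closing pre tag into a newline, strip the remaining tags, and split. chunkToLines() is a hypothetical name, and the sample chunk is a shortened copy of the output above.

    ```php
    <?php
    // Hypothetical helper: turns a saveXML() chunk into an array of non-empty lines.
    function chunkToLines(string $xmlChunk): array
    {
        // Treat every <br/>, <br> or <br /> as a line break.
        $text = preg_replace('~<br\s*/?>~i', "\n", $xmlChunk);
        // Break at the end of each <pre> too, so adjacent blocks do not run together.
        $text = str_ireplace('</pre>', "\n", $text);
        // Drop the remaining tags and decode entities such as &amp;.
        $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Split on any newline sequence and trim trailing spaces only, so the
        // column alignment at the start of a line survives.
        $lines = array_map('rtrim', preg_split('/\R/', $text));

        // Discard blank lines and reindex from zero.
        return array_values(array_filter($lines, fn ($line) => trim($line) !== ''));
    }

    $chunk = '<div id="weather_report"><pre>WARNINGS:</pre><br/>'
           . '<pre>HURRICANE FORCE WINDS IN HONG KONG ADJACENT WATERS,<br/>'
           . 'SHANWEI, SOUTH OF HONG KONG AND SHANGCHUAN DAO. STRONG<br/>'
           . 'WINDS IN NAN\'AO.</pre></div>';
    print_r(chunkToLines($chunk));
    ```

    String replacement like this is fine for a small, predictable chunk, but it is more fragile than editing the DOM nodes if the markup changes.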
    