I am writing an audit for our website to check which detail pages are displayed for a given selection. The website is written and maintained by a third party. The issue is that when I fetch a page, for example:
$page = file_get_contents('https://example.com/search=red');
it returns the pre-generated source, which is only a template. The page then uses AJAX calls to fetch the relevant data and builds the final page from that. It is this generated code that I would like to parse to get the links to the detail pages.
I am thinking of something like wkhtmltopdf, except that it would operate on the generated source. Does anyone know of a PHP library that can do this, or a Linux package that can do this which I could call from PHP?
2 Answers
To handle dynamic content loading, you might want to consider using a headless browser or a tool designed for web scraping with JavaScript rendering capabilities. Puppeteer is one such tool, and it’s commonly used with Node.js for tasks like this. However, you can use a PHP library that wraps around a headless browser as well.
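As a sketch of the wrapper approach, assuming the chrome-php/chrome package (`composer require chrome-php/chrome`) and a local Chrome/Chromium install, rendering the page and grabbing the post-AJAX HTML might look like this:

```php
<?php
// Sketch using the chrome-php/chrome package; assumes a local
// Chrome/Chromium binary is available on the system.
require 'vendor/autoload.php';

use HeadlessChromium\BrowserFactory;

$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser();

try {
    $page = $browser->createPage();
    // Navigate and wait for the page to finish loading, so the AJAX
    // calls have run and the final DOM has been built.
    $page->navigate('https://example.com/search=red')->waitForNavigation();

    // The rendered HTML, after JavaScript has executed.
    $html = $page->getHtml();
    echo $html;
} finally {
    $browser->close();
}
```

Once you have the rendered HTML in `$html`, you can parse it for the detail-page links with a DOM parser of your choice.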
One such PHP library is Goutte, which uses Symfony components and provides a simple interface for web scraping. It’s built on top of Guzzle and Symfony BrowserKit. Here’s a basic example of how you might use Goutte for your task:
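A minimal sketch of extracting links with Goutte follows. One caveat worth noting: Goutte does not execute JavaScript, so against the live site it will see the same pre-rendered template; it is most useful once you have the rendered HTML or can call the AJAX endpoints directly. The URL and the bare `a` selector are placeholders; narrow the selector to match the detail-page links on the real site.

```php
<?php
// Sketch using Goutte to crawl a page and collect link hrefs.
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com/search=red');

// Collect the href attribute of every matching anchor tag.
$links = $crawler->filter('a')->each(function ($node) {
    return $node->attr('href');
});

print_r($links);
```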
Remember to install Goutte and its dependencies using Composer: `composer require fabpot/goutte`
I don't quite understand what you are trying to accomplish; I guess you are trying to do some web scraping to get the content of various pages?
Nonetheless, here are some packages that will help you with this: