
I have a couple of sites that monitor Twitter for specific types of statements and scrape relevant Tweets using curl in PHP. A few days ago those sites stopped scraping Twitter. I assumed Twitter had redesigned the layout of the mobile.twitter site and that all I would have to do was point my XPath query at a different class or something similar. Instead, I found that whenever you visit Twitter without JavaScript enabled, you are given a prompt to enable JavaScript to access Twitter. There seems to be no way around this. Before this change, one could access a version of Twitter that did not require JavaScript, so I could scrape Tweets with a simple curl request and an XPath query.

I have searched Google for ways to enable JavaScript support for a curl request but have found nothing. Is it possible to add something to a curl request to parse JavaScript, or do I need to find some other solution?

5 Answers


  1. You cannot "enable" JavaScript in curl. It is not a browser; it only makes HTTP requests. Have you considered using the Twitter API?

    You can also intercept the XHRs on Twitter using your browser's developer tools and work your way through them to figure out which HTTP requests you need to make in order to get the data you want.

    Another solution is to use a scriptable "headless" browser. Check out CasperJS. Simply put, it is a fully functional browser that does not show any UI and that you can control via JS.

  2. There are many free endpoints available that can help solve this, so you do not have to scrape the webpage. If you're looking for specific Tweets, try the new v2 Search API: https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction

    You just need to have an approved developer account.

  3. Take a look at this video from Traversy Media:

    Real-Time Tweets & Socket.io Project | Twitter Streaming API

    https://www.youtube.com/watch?v=PjjjhGW4ceM

    Source code:

    https://github.com/bradtraversy/real-time-tweet-stream

  4. You need to render the page with JavaScript to get the final HTML. You can use PhantomJS (a headless browser) for that purpose. Here is a PHP library for it: http://jonnnnyw.github.io/php-phantomjs/

  5. In fact, a developer account is not required; an anonymous authorization bearer token is enough for now. This token is provided by Twitter itself inside the main app JS.

    For an example implementation, see the snippet https://gitlab.com/Daniel-KM/Omeka-S-module-BlockPlus/-/snippets/2068979.
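
Several of the answers above come down to the same thing: skip the rendered page and make one authenticated HTTP request to a JSON endpoint. Below is a minimal sketch of that idea against the v2 recent search endpoint mentioned in answer 2. It is written in Python only for illustration; the same request can be sent from PHP curl by adding the Authorization header. The function names (`build_search_request`, `extract_texts`, `search_recent`) are hypothetical, and the sketch assumes you already have a valid bearer token and that the response carries matching Tweets in a `data` list, as the v2 search documentation describes.

```python
import json
import urllib.parse
import urllib.request

# v2 recent search endpoint (see the Search API link in answer 2).
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_request(query, bearer_token, max_results=10):
    """Build (but do not send) a GET request for the recent search endpoint."""
    params = urllib.parse.urlencode({"query": query, "max_results": max_results})
    headers = {"Authorization": f"Bearer {bearer_token}"}
    return urllib.request.Request(f"{SEARCH_URL}?{params}", headers=headers, method="GET")


def extract_texts(payload):
    """Return the text of each Tweet in a decoded search response.

    Assumes matching Tweets are listed under the "data" key; a response
    with no matches has no such key, so an empty list is returned.
    """
    return [tweet["text"] for tweet in payload.get("data", [])]


def search_recent(query, bearer_token, max_results=10, timeout=10):
    """Send the request and return the matching Tweet texts."""
    request = build_search_request(query, bearer_token, max_results)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_texts(json.loads(response.read().decode("utf-8")))
```

A call would look something like `search_recent("some phrase", token)`, where `token` is a placeholder for your own bearer token. Because the response is JSON, no XPath query is needed, and the result no longer depends on Twitter's page layout.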
