
This is a follow-up to: How to extract data from HTML page source of (a tab within) a webpage?

We’re currently extracting the tabular data from the Financials section of a company’s page, for example: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2

However, the response my code gets is an empty string. When I look at the Root.App.main section that we were previously extracting, it looks like a bunch of encrypted strings. I am not sure if I am making a mistake in reading this. What’s the best way to extract this in Java for Android?

Is there a better way to extract a specific value? For example, I want to extract 394,328,000, which is the Total Revenue on 9/30/2022. Preferably, I’d like to have the entire table data as a Map.

Here’s my current code, which may shed more light on how it’s currently being done.

String requestURL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
String userAgent = "My UAString";
Document doc = Jsoup.connect(requestURL).userAgent(userAgent).get();

Elements scriptTags = doc.getElementsByTag("script");
String re = "root\\.App\\.main\\s*=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";
String data = null;

for (Element script : scriptTags) {
    Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
    Matcher matcher = pattern.matcher(script.html());

    if (matcher.find()) {
        data = matcher.group(1);
        break;
    }
}
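Note that in a Java string literal every backslash in the regex must itself be doubled (`\\.`, `\\s`), or the source will not compile. The extraction pattern can be sanity-checked offline against a synthetic payload (the payload shape here is an assumption about how the page embeds the JSON):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RootAppMainRegex {
    // Each regex backslash is doubled so it survives the Java string literal.
    static final String RE = "root\\.App\\.main\\s*=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";

    // Returns the captured JSON payload, or null if the pattern does not match.
    static String extract(String script) {
        Matcher m = Pattern.compile(RE, Pattern.DOTALL).matcher(script);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for the script tag's contents.
        String script = "root.App.main = {\"context\":{\"dispatcher\":{}}};\n}(this));";
        System.out.println(extract(script)); // {"context":{"dispatcher":{}}}
    }
}
```

This only verifies the pattern itself; it says nothing about whether the live page still embeds readable JSON at all.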

String requestURL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL";
String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
String row = "totalRevenue";

try {
    Document doc = Jsoup.connect(requestURL).userAgent(userAgent).get();
    String html = doc.html();
    //Log.d("html", html);

    Elements scriptTags = doc.getElementsByTag("script");
    String re = "root\\.App\\.main\\s*=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";

    for (Element script : scriptTags) {
        Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
        Matcher matcher = pattern.matcher(script.html());

        if (matcher.find()) {
            String data = matcher.group(1);
            //Log.d("data", data);

            JSONObject jo = new JSONObject(data);
            JSONArray table = getTable(jo);
            //Log.d("table", table.toString());

            String[] tableRow = getRow(table, row);
            String values = TextUtils.join(", ", tableRow);
            Log.d("values", values);
        }
    }
} catch (Exception e) {
    Log.e("err", "err", e);
}

private JSONArray getTable(JSONObject json) throws JSONException {
    JSONArray table = (JSONArray) json.getJSONObject("context")
            .getJSONObject("dispatcher")
            .getJSONObject("stores")
            .getJSONObject("QuoteSummaryStore")
            .getJSONObject("incomeStatementHistoryQuarterly")
            .getJSONArray("incomeStatementHistory");
    return table;
}

private String[] getRow(JSONArray table, String name) throws JSONException {
    String[] values = new String[table.length()];
    for (int i = 0; i < table.length(); i++) {
        JSONObject jo = table.getJSONObject(i);
        if (jo.has(name)) {
            jo = jo.getJSONObject(name);
            values[i] = jo.has("longFmt") ? jo.get("longFmt").toString() : "-";
        } else {
            values[i] = "-";
        }
    }
    return values;
}

private String[] getDates(JSONArray table) throws JSONException {
    String[] values = new String[table.length()];
    for (int i = 0; i < table.length(); i++) {
        values[i] = table.getJSONObject(i).getJSONObject("endDate")
                .get("fmt").toString();
    }
    return values;
}


Map<String, Map<String, String>> getTableNames() {
    final Map<String, String> revenue = new LinkedHashMap<String, String>() {
        { put("Total Revenue", "totalRevenue"); }
        { put("Cost of Revenue", "costOfRevenue"); }
        { put("Gross Profit", "grossProfit"); }
    };
    final Map<String, String> operatingExpenses = new LinkedHashMap<String, String>() {
        { put("Research Development", "researchDevelopment"); }
        { put("Selling General and Administrative", "sellingGeneralAdministrative"); }
        { put("Non Recurring", "nonRecurring"); }
        { put("Others", "otherOperatingExpenses"); }
        { put("Total Operating Expenses", "totalOperatingExpenses"); }
        { put("Operating Income or Loss", "operatingIncome"); }
    };
    Map<String, Map<String, String>> allTableNames = new LinkedHashMap<String, Map<String, String>>() {
        { put("Revenue", revenue); }
        { put("Operating Expenses", operatingExpenses); }

    };
    return allTableNames;
}

JSONObject jo = new JSONObject(jsData);
JSONArray table = getTable(jo);

Map<String, Map<String, String>> tableNames = getTableNames();
String totalRevenueKey = tableNames.get("Revenue").get("Total Revenue");
String[] totalRevenueValues = getRow(table, totalRevenueKey);
String value = totalRevenueValues[0];

List<String> tableData = new ArrayList<>();
Map<String, Map<String, String>> tableNames = getTableNames();
String[] dates = getDates(table);

for (Map.Entry<String, Map<String, String>> tableEntry : tableNames.entrySet()) {
    tableData.add(tableEntry.getKey());
    tableData.addAll(Arrays.asList(dates));

    for (Map.Entry<String, String> row : tableEntry.getValue().entrySet()) {
        String[] tableRow = getRow(table, row.getValue());
        tableData.add(row.getKey());
        for (String column: tableRow) {
            tableData.add(column);
        }
    }
}
String tableDataString = TextUtils.join(", ", tableData);
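Since the stated goal is a Map rather than a flat list, the per-row values can also be zipped with the column dates into a nested map, keyed by row name and then by date. A minimal, self-contained sketch (the date and value arrays here are stand-ins for the output of getDates and getRow):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TableAsMap {
    // Zip the column dates with one row's values: date -> formatted value.
    static Map<String, String> rowAsMap(String[] dates, String[] values) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(dates.length, values.length); i++) {
            row.put(dates[i], values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Stand-ins for getDates(table) and getRow(table, "totalRevenue").
        String[] dates = {"2022-09-30", "2021-09-30"};
        String[] totalRevenue = {"394,328,000", "365,817,000"};

        Map<String, Map<String, String>> financials = new LinkedHashMap<>();
        financials.put("Total Revenue", rowAsMap(dates, totalRevenue));

        System.out.println(financials.get("Total Revenue").get("2022-09-30")); // 394,328,000
    }
}
```

LinkedHashMap keeps the columns in insertion order, so iterating the map reproduces the table left to right.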

5 Answers


  1. A browser downloads the HTML and runs it as if it were a software program. This particular page uses javascript to build up the entire page. jsoup doesn’t run javascript, hence nothing gets built up and the page is seemingly empty. More and more of the web is done like this (javascript renders everything; the HTML sent by the server doesn’t contain any DOM nodes at all, or if it does, they are all hidden ‘templates’ for the benefit of the javascript that builds up the page).

    If you must do this programmatically, you’d fire up an actual browser and control it like a puppet. Tools like selenium can do this (though it was designed for testing).

    Note that Yahoo has explicitly done all this to stop you from doing it. They will continue to try to stop you; every day your code may (and probably will) break as they try to stop you from doing this by changing stuff. Also, they will likely use whatever legal means available to them to sue you if you try. The laws in your country may indicate that they won’t succeed, but they’ll try.

    In general, then, maybe stop using something that is provided by someone who is bending over backwards to stop you from doing just that. There are APIs designed specifically for you to read them out programmatically; various sites and services offer APIs for reading stock price data.
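    Since selenium is mentioned as the way to drive a real browser, here is a hedged sketch of what that looks like in Java. It assumes the selenium-java dependency plus a chromedriver binary on the PATH, and the `div[data-test='fin-row']` selector is a guess at Yahoo’s markup; it will break when the page is restructured:

```java
// Hypothetical sketch: requires the selenium-java dependency and a matching
// chromedriver on the PATH; the row selector is an assumption, not a stable API.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import java.util.List;

public class FinancialsScraper {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://finance.yahoo.com/quote/AAPL/financials?p=AAPL");
            // Yahoo renders the table as plain divs; this selector is a guess.
            List<WebElement> rows = driver.findElements(By.cssSelector("div[data-test='fin-row']"));
            for (WebElement row : rows) {
                System.out.println(row.getText());
            }
        } finally {
            driver.quit(); // always shut the browser down
        }
    }
}
```

    Selectors aside, the important part is that the page is fetched by a real browser engine, so the javascript that builds the table has actually run before findElements is called.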

  2. You should probably not scrape the data from the webpage; just use the Yahoo API. There are arguments for and against web scraping (take a look here), but from my personal experience, Yahoo Finance isn’t a place to scrape.

    You don’t have to do all the work yourself; there are existing tutorials on getting data via the Yahoo API. Take a look here for an example, or at this Java library, which sounds like just what you need and comes with a lot of code examples like the ones you wanted.

    Web scraping this kind of information is solving a problem that you don’t have. The API gives you everything you want to know in a nice, prepackaged format; no regex or other clean-up of HTML needed.

  3. I guess your main problem is getting the page source!?

    The solution is using cURL.

    Jsoup isn’t smart enough to get the page source here, while cURL is.

    Jsoup can then select the necessary elements, so you don’t have to bother with a separate pattern matcher.

    Be warned, webpages are not stable. These heuristics may fail when the page gets restructured! Use structural checks to detect changes!

    Hint: right-click the page and choose Inspect in your browser. Hovering over the page source will highlight the corresponding element.

            CUrl curl = new CUrl(requestURL);
            String htmlDoc = new String(curl.exec());
            assert(curl.getHttpCode() == 200);
            
            Document doc = Jsoup.parse(htmlDoc);
    
            org.jsoup.select.Elements divTags = doc.getElementsByTag("div");
    
            for (org.jsoup.nodes.Element div : divTags) {
    
                if (div.className().contains("D(tbhg)")) {
                    // fetch header row
                System.out.printf("Processing header %s%n", div.className());
    
                    // display D(tbr) section - containing table header
                    org.jsoup.select.Elements tRow = div.getElementsByAttributeValueContaining("class", "D(tbr)");
                    System.out.println(tRow.text());
                }
    
                if (div.className().contains("D(tbrg)")) {
                    // fetch data rows
                System.out.printf("Processing table data %s%n", div.className());
    
                    // display D(tbr) sections - containing table data
                    org.jsoup.select.Elements tRow = div.getElementsByAttributeValueContaining("class", "D(tbr)");
                    for (org.jsoup.nodes.Element divData : tRow) {
                        System.out.println(divData.text());
                    }
                }
            }
    
  4. As answered by @DHC19, you should always prefer using an API over scraping web pages whenever possible.

    This said, one of the best web scraping libraries available in Java, IMHO, is HtmlUnit.

    It makes it very easy to load pages and interact with them (fill forms, click, …); it supports JavaScript and provides many options for finding elements (by ID, XPath queries, CSS selectors, …).

    Give it a try 🙂

  5. You can use HtmlUnit – check https://htmlunit.sourceforge.io/ for all the details.

    Simple example that gets the text content of the whole page.

    public static void main(String[] args) throws IOException {
        final String url = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
    
        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
    
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScriptStartingBefore(1_000);
    
            System.out.println("----");
            System.out.println(page.asNormalizedText());
        }
    }
    

    You can use asNormalizedText() also for elements on the page. See https://htmlunit.sourceforge.io/gettingStarted.html for some first info about finding elements inside the page.
