Yes, I have searched the web and Stack Overflow. I am having trouble extracting data from a table on a website. I can retrieve the full table with the code below, but I need to extract selected values:

string Url = "https://www.multpl.com/shiller-pe/table/by-month";
var web = new HtmlWeb();
var doc = web.Load(Url);
// InnerText is already a string, so no ToString() call is needed
string pe = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']").InnerText;
Console.Write(pe);

The XPath //*[@id='datatable']/tbody/tr[3]/td[2] for a single data point does not work and throws an error.
This does not work either:

Url = "https://www.multpl.com/shiller-pe/table/by-month";
web = new HtmlWeb();
doc = web.Load(Url);
var table = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']");
var tableRows = table.SelectNodes("tr");
var columns = tableRows[0].SelectNodes("th/text()");
for (int i = 1; i < tableRows.Count; i++)
{
    for (int e = 0; e < columns.Count; e++)
    {
        var value = tableRows[i].SelectSingleNode("td[e + 1]");
        Console.Write(columns[e].InnerText + ":" + value.InnerText);
    }
}

Any direction will help, thank you.

2 Answers


  1. Chosen as BEST ANSWER

    I finally found a solution.

                var numberList = new List<double>(); // collects the parsed cell values
                Url = "https://www.multpl.com/shiller-pe/table/by-month";
                web = new HtmlWeb();
                doc = web.Load(Url);
                // Every value cell on the page carries class="right"
                foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='right']"))
                {
                    numberList.Add(Convert.ToDouble(node.InnerText));
                    //Print(node.InnerText);
                }
                Print(numberList[0]);
    

  2. OK, I found two problems.

    1. The main problem is the XPath td[e + 1]. It is meant to use the variable e, but without string interpolation the letter e is passed to XPath as literal text. Change that line to:
    var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
    
    2. The second problem is the th/text() selector. You only want to count the columns, so change it to th. The HTML at https://www.multpl.com/shiller-pe/table/by-month has several elements in the header of the second column but a single element per cell in the table body, so th is the right selector to use.
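
    To see the difference between the two XPath strings in isolation, here is a minimal sketch. It is not the code from the question: it uses System.Xml's XmlDocument on a tiny hand-written table (hypothetical sample data, not the real page) so that it runs without any NuGet package, and the class and method names are made up for illustration. HtmlAgilityPack accepts the same XPath strings, so the literal form should come back as null there as well, which would explain the exception on value.InnerText.

    ```csharp
    using System;
    using System.Xml;

    public static class XPathIndexDemo
    {
        // Hypothetical sample markup standing in for the real page's table.
        private const string SampleTable =
            "<table id='datatable'>" +
            "<tr><th>Date</th><th>Value</th></tr>" +
            "<tr><td>Jan 1, 2024</td><td> 31.50 </td></tr>" +
            "</table>";

        // Built the way the question did: the text "e + 1" goes to XPath unchanged,
        // where "e" is read as a child element name, not as the C# variable.
        public static string LiteralXPath(int e) => "td[e + 1]";

        // Built with string interpolation: C# evaluates e + 1 first, giving td[1], td[2], ...
        public static string InterpolatedXPath(int e) => $"td[{e + 1}]";

        // Returns the trimmed text of the matching cell in the first data row,
        // or null when the XPath matches nothing.
        public static string CellText(string xpath)
        {
            var doc = new XmlDocument();
            doc.LoadXml(SampleTable);
            XmlNode row = doc.SelectSingleNode("//table[@id='datatable']/tr[2]");
            XmlNode cell = row.SelectSingleNode(xpath);
            return cell?.InnerText.Trim();
        }

        public static void Main()
        {
            for (int e = 0; e < 2; e++)
            {
                string literal = CellText(LiteralXPath(e)) ?? "(no match)";
                string interpolated = CellText(InterpolatedXPath(e)) ?? "(no match)";
                Console.WriteLine(LiteralXPath(e) + " -> " + literal);
                Console.WriteLine(InterpolatedXPath(e) + " -> " + interpolated);
            }
        }
    }
    ```

    The null check in CellText is there on purpose: when a row might not have the cell you ask for, checking for null before reading InnerText avoids the exception.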

    The second column's header is still a special case, so you will still have issues with columns[e].InnerText. It may be better to handle it manually. Column values should be trimmed too, because the second column contains line breaks. Here is my final code:

    public void ParseTable()
    {
        string Url = "https://www.multpl.com/shiller-pe/table/by-month";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(Url);
        var table = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']");
        var tableRows = table.SelectNodes(".//tr"); // leading dot keeps the search inside this table; "//tr" would scan the whole document
    
        var columns = tableRows[0].SelectNodes("th"); // fixed
    
        for (int i = 1; i < tableRows.Count; i++)
        {
            for (int e = 0; e < columns.Count; e++)
            {
                var value = tableRows[i].SelectSingleNode($"td[{e + 1}]"); // fixed
                Console.Write(GetColumnName(columns[e]) + ": " + value.InnerText.Trim()); // fixed
                if (e < columns.Count - 1)
                {
                    Console.Write(" | ");
                }
            }
            Console.WriteLine();
        }
    }
    
    private static string GetColumnName(HtmlNode thNode)
    {
        var spanValue = thNode.SelectSingleNode("span[@class='value']");
        if (spanValue != null)
            return spanValue.InnerText.Trim();
        
        var spanTitle = thNode.SelectSingleNode("span[@class='title']");
        return spanTitle != null 
            ? spanTitle.InnerText.Trim() 
            : thNode.InnerText.Trim();
    }
    