HtmlAgilityPack: extract data from a website datatable

Erik
March 23, 2023
2 Answers

Yes, I have searched the web and Stack Overflow. I am having trouble extracting data from a table on a website. I can retrieve the full table with the code below, but I need to extract selected values:

Url = "https://www.multpl.com/shiller-pe/table/by-month";
web = new HtmlWeb();
doc = web.Load(Url);
pe = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']").InnerText.ToString();
Console.Write(pe);

The XPath //*[@id='datatable']/tbody/tr[3]/td[2] for a single data point does not work and throws an error.
This does not work either:

Url = "https://www.multpl.com/shiller-pe/table/by-month";
web = new HtmlWeb();
doc = web.Load(Url);
var table = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']");
var tableRows = table.SelectNodes("tr");
var columns = tableRows[0].SelectNodes("th/text()");
for (int i = 1; i < tableRows.Count; i++)
{ 
for (int e = 0; e < columns.Count; e++)
    {
    var value = tableRows[i].SelectSingleNode("td[e + 1]");
    Console.Write(columns[e].InnerText + ":" + value.InnerText);
    }
}

Any direction will help, thank you.

Tags: c# datatable html html-agility-pack

Answers

Chosen as BEST ANSWER

I finally found a solution.

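            // numberList was not declared in the original snippet; a List<double> is assumed.
            var numberList = new List<double>();
            string Url = "https://www.multpl.com/shiller-pe/table/by-month";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(Url);
            // Select every <td> with class="right" and convert its text to a number.
            foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='right']"))
            {
                numberList.Add(Convert.ToDouble(node.InnerText));
            }
            // Print is the poster's own output method; it is not defined in this snippet.
            Print(numberList[0]);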
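
One caveat with the snippet above: Convert.ToDouble parses with the current culture, so on a machine whose culture uses a decimal comma, a cell such as "30.81" may be read incorrectly. It also throws on any cell that is not a number. A minimal sketch of a more defensive variant follows, assuming the cell texts have already been collected into strings. ParseCells and the sample values are hypothetical and do not come from the page.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of the original answer): converts scraped
// cell texts to doubles and silently skips anything that is not a number.
static List<double> ParseCells(IEnumerable<string> cellTexts)
{
    var numbers = new List<double>();
    foreach (string text in cellTexts)
    {
        // InvariantCulture always treats '.' as the decimal separator.
        if (double.TryParse(text.Trim(), NumberStyles.Float,
                            CultureInfo.InvariantCulture, out double value))
        {
            numbers.Add(value);
        }
    }
    return numbers;
}

// Made-up sample input standing in for node.InnerText values.
var sample = new[] { " 30.81\n", "estimate", "29.97" };
List<double> parsed = ParseCells(sample);

Console.WriteLine(parsed.Count);                                      // 2
Console.WriteLine(parsed[0].ToString(CultureInfo.InvariantCulture));  // 30.81
```

In the loop above, this would amount to swapping Convert.ToDouble(node.InnerText) for a TryParse call on node.InnerText.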

(Edit)

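OK, I found two problems.

The main problem is the XPath td[e + 1]. You meant to use the C# variable e, but without string interpolation the literal text e + 1 is sent to the XPath engine. The engine looks for a child element named e, matches nothing, and SelectSingleNode returns null. Change that line to: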

var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
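
To make the difference visible, here is a small sketch that only builds the XPath strings and prints them; no HTML is parsed. BuildCellXPath is a hypothetical helper added for illustration, not something from the question.

```csharp
using System;

// Hypothetical helper: XPath for the cell in a zero-based column.
// XPath positions are one-based, hence the + 1.
static string BuildCellXPath(int columnIndex) => $"td[{columnIndex + 1}]";

int e = 0;

// Plain string literal: the characters "e + 1" are passed through untouched.
string literal = "td[e + 1]";

// Interpolated string: the C# expression e + 1 is evaluated first.
string interpolated = $"td[{e + 1}]";

Console.WriteLine(literal);            // td[e + 1]
Console.WriteLine(interpolated);       // td[1]
Console.WriteLine(BuildCellXPath(2));  // td[3]
```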

The second problem is the th/text() selector. You want to count columns, so select th instead. In the HTML at https://www.multpl.com/shiller-pe/table/by-month, the header of the second column contains several nodes while each data row has a single cell for that column, so th is the right selector to use.

The second column's header is still a special case, so columns[e].InnerText will still give you trouble. It may be better to handle it manually. The cell values should also be trimmed, because the second column contains line breaks. Here is my final code:

public void ParseTable()
{
    string Url = "https://www.multpl.com/shiller-pe/table/by-month";
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(Url);
    var table = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']");
    var tableRows = table.SelectNodes(".//tr"); // leading dot keeps the search inside the table

    var columns = tableRows[0].SelectNodes("th"); // fixed

    for (int i = 1; i < tableRows.Count; i++)
    {
        for (int e = 0; e < columns.Count; e++)
        {
            var value = tableRows[i].SelectSingleNode($"td[{e + 1}]"); // fixed
            Console.Write(GetColumnName(columns[e]) + ": " + value.InnerText.Trim()); // fixed
            if (e < columns.Count - 1)
            {
                Console.Write(" | ");
            }
        }
        Console.WriteLine();
    }
}

private static string GetColumnName(HtmlNode thNode)
{
    var spanValue = thNode.SelectSingleNode("span[@class='value']");
    if (spanValue != null)
        return spanValue.InnerText.Trim();
    
    var spanTitle = thNode.SelectSingleNode("span[@class='title']");
    return spanTitle != null 
        ? spanTitle.InnerText.Trim() 
        : thNode.InnerText.Trim();
}
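
To try these selectors without hitting the live site, they can be run against an inline HTML fragment. This is only a sketch: the markup below is an assumed, simplified shape (the span class names are taken from GetColumnName above), and the row values are made-up placeholders, not data from the page. It needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Assumed, simplified markup with placeholder values; the real page may differ.
const string html = @"
<table id='datatable'>
  <tr>
    <th>Date</th>
    <th><span class='title'>Value</span> <span class='value'>Value</span></th>
  </tr>
  <tr><td>Jan 1, 2000</td><td class='right'>
    12.34
  </td></tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var table = doc.DocumentNode.SelectSingleNode("//*[@id='datatable']");
var rows = table.SelectNodes(".//tr");                   // rows inside this table only
var headers = rows[0].SelectNodes("th");                 // one node per column
var headerTexts = rows[0].SelectNodes("th/text()");      // text nodes, not columns

// SelectNodes returns null when nothing matches, hence the null check.
Console.WriteLine($"th: {headers.Count}, th/text(): {headerTexts?.Count ?? 0}");

for (int i = 1; i < rows.Count; i++)
{
    for (int e = 0; e < headers.Count; e++)
    {
        // Interpolated XPath; cell is null if the row has fewer cells than headers.
        var cell = rows[i].SelectSingleNode($"td[{e + 1}]");
        Console.WriteLine($"column {e + 1}: {cell?.InnerText.Trim()}");
    }
}
```

Comparing the two counts on the first output line shows why th, rather than th/text(), is the safer way to count columns.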
