Composing pulled html table cells into objects - Telegram API

koogel
September 8, 2022
210 views
2 votes
3 Answers

I’m trying to build a class schedule Telegram bot for my university. To do so, I used HtmlAgilityPack to pull data from the university’s HTML table into a Pair object (a simple object with Date, Time, Discipline, Lecturer’s Name, and Auditorium properties). The problem is that it pulls individual cells, but I need to compose them into Pair objects so that I can return an object in response to a user’s request. I think I need to use LINQ, but I don’t have much experience with it.
Later, the Date property will be compared with the current date in order to return the whole schedule.
My code is as follows:

public List<Pair> Scrape(string groupNumber)
{
    // this gets all the cells in an html table
    string groupUrl = _websiteUrl + groupNumber + ".xml";
    var web = new HtmlWeb();
    var doc = web.Load(groupUrl);
    var htmlTableCell = from table in doc.DocumentNode.SelectNodes("/html/body/div[6]/div[2]/div/table").Cast<HtmlNode>()
                        from row in table.SelectNodes("//tr").Cast<HtmlNode>()
                        from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
                        select new { CellText = cell.InnerText };

    // this shows all the cells in a logger
    foreach (var cell in htmlTableCell)
    {
        _logger.LogCritical(cell.CellText);
    }
    return _pairs;
}

Tags: c#, html-agility-pack, object

Answers

Chosen as BEST ANSWER

If anyone else runs into this problem, first check how the website’s table is actually structured. In my case it was a list of table rows interleaved with date header rows (a single date row would break the assignment of values to an object). To still get the timetable, I decided to pull the whole week, using the week value taken from the header. My code now looks something like this:

public string Scrape(string? sequence)
{
    var pairs = new List<string>();
    string groupUrl = _websiteUrl + sequence + ".xml";
    var web = new HtmlWeb();
    var doc = web.Load(groupUrl);

    // The last span holds the latest week value; it is used to find all the rows with that value.
    var latestWeek = doc.DocumentNode.SelectNodes("/html/body/div[6]/div[2]/div/div[1]/span").Last().InnerText;
    _logger.LogCritical(latestWeek);

    // All rows of the current week. SelectNodes returns null when nothing matches.
    var currentWeek = doc.DocumentNode.SelectNodes("//tr[@vl = '" + latestWeek + "']");
    if (currentWeek == null)
    {
        return string.Empty;
    }

    foreach (var pair in currentWeek)
    {
        pairs.Add(pair.InnerText);
    }

    // Join all the pairs into one string that can be returned in a Telegram message.
    string week = string.Join(" ", pairs);
    _logger.LogCritical(week);
    return week;
}
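
Note that `InnerText` on a `<tr>` concatenates the text of every cell, including the whitespace between tags, so the joined string can be hard to read in a chat message. Below is a minimal sketch of one alternative, assuming HtmlAgilityPack is installed and that each row's data sits in plain `<td>` cells. The `RowFormatter` class, its `FormatRows` method, and the " | " separator are made up for illustration and are not part of the original code.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original answer.
public static class RowFormatter
{
    // Builds one line per matched row, with cell texts trimmed and joined by " | ".
    public static string FormatRows(string html, string rowXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes(rowXPath);
        if (rows == null)
        {
            return string.Empty;
        }

        var lines = new List<string>();
        foreach (var row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null)
            {
                continue; // skip rows without data cells
            }

            // Decode entities such as &amp; and strip surrounding whitespace.
            var texts = cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim());
            lines.Add(string.Join(" | ", texts));
        }

        return string.Join("\n", lines);
    }
}
```

With the page from the answer, the call might look like `RowFormatter.FormatRows(pageHtml, "//tr[@vl = '" + latestWeek + "']")`, where `pageHtml` is the downloaded markup.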

(Edit)

As I stated above, you really have to understand how your website is structured. In my case every row carried the same value, and the rows had no attributes to distinguish them. I pulled every row, then pulled the cells of each row, and added an if statement that detects the date rows (the ones that had been ruining the pull in the first place) and otherwise creates a Pair object for the row. In code it looks something like this:

List<Pair> test = new List<Pair>();
string groupUrl = _websiteUrl + sequence + ".xml";
var web = new HtmlWeb();
var doc = web.Load(groupUrl);
var latestWeek = doc.DocumentNode.SelectNodes("/html/body/div[6]/div[2]/div/div[1]/span").Last().InnerText;
_logger.LogCritical(latestWeek);
var currentWeek = doc.DocumentNode.SelectNodes("//tr[@vl = '" + latestWeek + "']");
var currentDay = latestWeek;
foreach (var pair in currentWeek)
{
    var firstCell = pair.SelectSingleNode("td");
    if (firstCell.HasClass("head-date"))
    {
        // A date row: remember the date for the rows that follow it.
        currentDay = firstCell.InnerText;
    }
    else
    {
        // A regular row: map its cells onto a Pair.
        Pair newPair = new Pair
        {
            Date = currentDay,
            Time = pair.SelectSingleNode("td[1]").InnerText,
            Discipline = pair.SelectSingleNode("td[2]").InnerText,
            LectorsName = pair.SelectSingleNode("td[3]").InnerText,
            Auditorium = pair.SelectSingleNode("td[4]").InnerText,
        };
        test.Add(newPair);
    }
}
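
The same idea can be tried without hitting the real site. The sketch below is self-contained apart from the HtmlAgilityPack package and parses an HTML string instead of a URL. The `Pair` property names follow the snippet above; the `ScheduleParser` class, the assumption of four data cells per row, and the decision to skip shorter rows are illustrative guesses rather than the author's implementation.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Property names follow the answer; all values are kept as plain strings.
public class Pair
{
    public string Date { get; set; } = "";
    public string Time { get; set; } = "";
    public string Discipline { get; set; } = "";
    public string LectorsName { get; set; } = "";
    public string Auditorium { get; set; } = "";
}

// Hypothetical helper, not part of the original answer.
public static class ScheduleParser
{
    public static List<Pair> Parse(string html)
    {
        var pairs = new List<Pair>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            return pairs;
        }

        var currentDay = "";
        foreach (var row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null)
            {
                continue; // e.g. a header row that only has <th> cells
            }

            if (cells[0].HasClass("head-date"))
            {
                // Date row: remember it for the rows that follow.
                currentDay = Clean(cells[0]);
                continue;
            }

            if (cells.Count < 4)
            {
                continue; // assumption: a lesson row has at least four cells
            }

            pairs.Add(new Pair
            {
                Date = currentDay,
                Time = Clean(cells[0]),
                Discipline = Clean(cells[1]),
                LectorsName = Clean(cells[2]),
                Auditorium = Clean(cells[3]),
            });
        }

        return pairs;
    }

    // Decodes HTML entities and trims surrounding whitespace.
    private static string Clean(HtmlNode node) =>
        HtmlEntity.DeEntitize(node.InnerText).Trim();
}
```

Once the list exists, a single day can be picked out with LINQ, for example `pairs.Where(p => p.Date == day)`. Comparing against `DateTime.Today` would first require parsing the date text in whatever format the site uses, which is not shown in the thread.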

- koogel
- February 14, 2023 at 6:39 am
- 0 votes
0
Additional information: you don’t need to understand how the whole website is structured. In fact, I’d say relying on it is somewhat harmful, because a single change in the site’s HTML structure will completely break your code. What I’d advise instead is to select the elements you want to scrape by their class.

Example (the class is somewhat different, but the principle is the same):
```
using System.Collections.Generic;
using HtmlAgilityPack;

public class GroupScraper : IGroupScraper
{
    public List<Group> ScrapeGroups(string url)
    {
        var groups = new List<Group>();
        var web = new HtmlWeb();
        var doc = web.Load(url);

        // Select the <option> elements by the class of their parent <select>.
        var options = doc.DocumentNode.SelectNodes("//select[@class='sch sch-0 sch-group']/option");
        if (options == null)
        {
            // SelectNodes returns null when nothing matches.
            return groups;
        }

        foreach (var option in options)
        {
            // The option text is split on "/" into course and group number.
            var parts = option.InnerText.Split("/");
            var group = new Group
            {
                GroupLink = option.Attributes["value"].Value,
                GroupCourse = parts[0],
                GroupNumber = parts[1],
                GroupSpecialization = option.Attributes["s"]?.Value,
            };
            groups.Add(group);
        }
        return groups;
    }
}
```
Here the XPath selects the elements by their class. (If you have trouble figuring out how to scrape certain data, consider giving the information you want to scrape to ChatGPT.)
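
One caveat with the XPath above: `@class='sch sch-0 sch-group'` is an exact string comparison, so it stops matching if the site adds, removes, or reorders a class. A minimal sketch of matching on a single class token instead, using the common XPath 1.0 `contains(concat(...))` idiom, is shown below. The `ClassSelector` helper and its method names are hypothetical, and it assumes HtmlAgilityPack is installed.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original answer.
public static class ClassSelector
{
    // Builds an XPath that matches elements whose class list contains
    // className as a whole token, regardless of order or extra classes.
    public static string HasClassXPath(string element, string className) =>
        $"//{element}[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]";

    // Returns the "value" attribute of every <option> under a matching <select>.
    public static List<string> OptionValues(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var options = doc.DocumentNode.SelectNodes(HasClassXPath("select", className) + "/option");
        if (options == null)
        {
            // SelectNodes returns null when nothing matches.
            return new List<string>();
        }

        return options.Select(o => o.GetAttributeValue("value", "")).ToList();
    }
}
```

For the page in the example, `ClassSelector.HasClassXPath("select", "sch-group") + "/option"` could stand in for the hard-coded expression, assuming `sch-group` is the class that identifies the group list.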