
I’m trying to build a class-schedule Telegram bot for my university. I used HtmlAgilityPack to pull data from the university’s HTML table, and I want to turn it into Pair objects (a simple object with Date, Time, Discipline, Lecturer’s Name, and Auditorium properties). The code below pulls the individual cells, but I still need to compose them into Pair objects so that I can return them in response to a user’s request. I think I need LINQ for this, but I don’t have much experience with it.
Later, the Date property will be compared with the current date to decide which part of the schedule to return.
My code is as follows:

public List<Pair> Scrape(string groupNumber)
{
    // Collect the text of every cell in the schedule table.
    string groupUrl = _websiteUrl + groupNumber + ".xml";
    var web = new HtmlWeb();
    var doc = web.Load(groupUrl);
    // Note: SelectNodes returns null when nothing matches.
    var htmlTableCell = from table in doc.DocumentNode.SelectNodes("/html/body/div[6]/div[2]/div/table")
                        // ".//tr" is relative to the table; "//tr" would search the whole document.
                        from row in table.SelectNodes(".//tr")
                        from cell in row.SelectNodes("th|td")
                        select new { CellText = cell.InnerText };
    // Log every cell for inspection.
    foreach (var cell in htmlTableCell)
    {
        _logger.LogCritical(cell.CellText);
    }
    return _pairs;
}

3 Answers


  1. Chosen as BEST ANSWER

    If anyone else runs into this problem, first check how the website’s table is actually structured. In my case it was a flat list of table rows that included date header rows, and a single date row would break the assignment of values to the object. To still get the timetable, I decided to pull the whole week, using the week’s header value. My code now looks something like this:

    public string Scrape(string? sequence)
    {
        var pairs = new List<string>();
        string groupUrl = _websiteUrl + sequence + ".xml";
        var web = new HtmlWeb();
        var doc = web.Load(groupUrl);
        // The last span holds the latest week value; rows of that week carry it in their "vl" attribute.
        var latestWeek = doc.DocumentNode.SelectNodes("/html/body/div[6]/div[2]/div/div[1]/span").Last().InnerText;
        _logger.LogCritical(latestWeek);
        // All rows of the current week (SelectNodes returns null if none match).
        var currentWeek = doc.DocumentNode.SelectNodes("//tr[@vl = '" + latestWeek + "']");
        if (currentWeek == null)
        {
            return string.Empty;
        }

        foreach (var pair in currentWeek)
        {
            pairs.Add(pair.InnerText);
        }
        // Join all rows into one string that can be sent as a single Telegram message.
        string week = string.Join(" ", pairs);
        _logger.LogCritical(week);
        return week;
    }
    

  2. As I stated above, you really have to understand how the website is structured. In my case the rows of a week all shared the same attribute value and had nothing else to distinguish them. I pulled every row and then the cells of each row, and added an if statement that detects the date header rows (which had been breaking the extraction in the first place) and otherwise creates a Pair object for the row. In code it looks something like this:

    var test = new List<Pair>();
    string groupUrl = _websiteUrl + sequence + ".xml";
    var web = new HtmlWeb();
    var doc = web.Load(groupUrl);
    var latestWeek = doc.DocumentNode.SelectNodes("/html/body/div[6]/div[2]/div/div[1]/span").Last().InnerText;
    _logger.LogCritical(latestWeek);
    var currentWeek = doc.DocumentNode.SelectNodes("//tr[@vl = '" + latestWeek + "']");
    var currentDay = latestWeek;
    foreach (var pair in currentWeek)
    {
        var firstCell = pair.SelectSingleNode("td");
        if (firstCell.HasClass("head-date"))
        {
            // A date header row: remember the date for the rows that follow.
            currentDay = firstCell.InnerText;
        }
        else
        {
            // A regular row: build one Pair from its four cells.
            test.Add(new Pair
            {
                Date = currentDay,
                Time = pair.SelectSingleNode("td[1]").InnerText,
                Discipline = pair.SelectSingleNode("td[2]").InnerText,
                LectorsName = pair.SelectSingleNode("td[3]").InnerText,
                Auditorium = pair.SelectSingleNode("td[4]").InnerText,
            });
        }
    }
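
    Since the question mentions comparing the Date property with the current date, here is a minimal sketch of that step using LINQ. It is only an illustration: the `Pair` class is re-declared so the snippet stands alone, `ScheduleFilter` and `ForDay` are hypothetical names, and the `dd.MM.yyyy` date format is an assumption that has to be adjusted to whatever text the page's date header actually contains.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;

    // Stand-in for the Pair class from the question (all properties are strings here).
    public class Pair
    {
        public string Date { get; set; } = "";
        public string Time { get; set; } = "";
        public string Discipline { get; set; } = "";
        public string LectorsName { get; set; } = "";
        public string Auditorium { get; set; } = "";
    }

    // Hypothetical helper, not part of HtmlAgilityPack or the original code.
    public static class ScheduleFilter
    {
        // ASSUMPTION: the scraped date text looks like "16.09.2024".
        private const string AssumedDateFormat = "dd.MM.yyyy";

        // Returns the pairs whose Date parses to the given day;
        // pairs with an unparseable Date are skipped.
        public static List<Pair> ForDay(IEnumerable<Pair> pairs, DateTime day)
        {
            return pairs
                .Where(p => DateTime.TryParseExact(
                                p.Date.Trim(), AssumedDateFormat, CultureInfo.InvariantCulture,
                                DateTimeStyles.None, out var parsed)
                            && parsed.Date == day.Date)
                .ToList();
        }
    }
    ```

    With the list of Pair objects built by the loop above, today's schedule would then be something like `ScheduleFilter.ForDay(scrapedPairs, DateTime.Today)`, where `scrapedPairs` is whatever name that list has in your code.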
    
  3. As an addition: you don’t need to depend on exactly how the website is structured. In fact, I’d say that is somewhat harmful, because a single change in the page’s HTML structure will completely break your code. What I’d advise instead is to select the elements you want to scrape by their class.

    Example (the class is a different one here, but the principle is the same):

    public class GroupScraper : IGroupScraper
    {
        public List<Group> ScrapeGroups(string url)
        {
            var groups = new List<Group>();
            var web = new HtmlWeb();
            var doc = web.Load(url);

            // Select the <option> elements by the class of their parent <select>, not by an absolute path.
            var options = doc.DocumentNode.SelectNodes("//select[@class='sch sch-0 sch-group']/option");
            foreach (var option in options)
            {
                // The option text is split on "/" into course and number.
                var parts = option.InnerText.Split("/");
                var group = new Group
                {
                    GroupLink = option.Attributes["value"].Value,
                    GroupCourse = parts[0],
                    GroupNumber = parts[1],
                    GroupSpecialization = option.Attributes["s"]?.Value,
                };
                groups.Add(group);
            }
            return groups;
        }
    }
    

    Here the selection is done by class. (If you have trouble figuring out how to scrape certain data, consider pasting the HTML you want to scrape into ChatGPT.)
