How to get links from an HTML page using IndexOf and Substring?

OledNeduda
March 21, 2023
117 views
1 vote
2 Answers

The problem is with the result variable.
A single line can contain more than one link ending in jpg.
What I want is to get every link ending in jpg as its own string.
That is, result should hold one link ending in jpg, and on the next iteration it should hold the next link ending in jpg.

For example, a line looks like:

https://something.com/my.jpg/a7gfefg/https://something.com/my2.jpg/sadsadsad64567546/https://something.com/my3.jpg

and I want result to be, the first time:

https://something.com/my.jpg

then in the next iteration:

https://something.com/my2.jpg

```
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace Testing
{
    public partial class Form1 : Form
    {
        private List<string> links = new List<string>();
        string htmlCode;

        public Form1()
        {
            InitializeComponent();

            GetLinks();
        }

        private void GetLinks()
        {
            using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
            {
                htmlCode = client.DownloadString("https://test.com/my-site");
            }

            int index1 = 0;

            using (StringReader reader = new StringReader(htmlCode))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    int index = line.IndexOf("https://test.com");
                    if (index != -1)
                    {
                        index1 = line.IndexOf("png", index);
                    }
                    if (index != -1 && index1 != -1)
                    {
                        string result = line.Substring(index, index1);
                    }
                }
            }

        }
        private void Form1_Load(object sender, EventArgs e)
        {

        }
    }
}
```

Tags: c#, html

Answers

- AboulfazlHadi
- March 21, 2023 at 8:14 pm
- 0 votes
A better way to extract image URLs from HTML code is to use a regular expression.

Regular expression for extracting image URLs:
```
(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)
```
For how to use regular expressions in C#, see:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
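
As a minimal sketch of how the pattern above might be applied in C#, the following uses `Regex.Matches` to collect every match from a string. The class name `ImageLinkRegex` and the method name `Extract` are hypothetical, chosen only for illustration, and the pattern is assumed to contain the `\w`, `\s` and `\.` escapes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageLinkRegex
{
    // Pattern from the answer above. The verbatim string (@"...") keeps the
    // backslashes exactly as written, so they do not need to be doubled.
    private static readonly Regex ImageUrl =
        new Regex(@"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)", RegexOptions.IgnoreCase);

    // Returns every image URL found in the given text, in order of appearance.
    public static List<string> Extract(string html)
    {
        var found = new List<string>();
        foreach (Match match in ImageUrl.Matches(html))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

Calling `ImageLinkRegex.Extract(htmlCode)` on the downloaded page would then give one string per image link, which could be added to the `links` list from the question.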

- AndrewWilliamson
- March 21, 2023 at 8:15 pm
- 0 votes
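You’re passing in the web page’s html to the File.ReadAllLines method as if it’s a file name. You already have the html content as a string variable. Remove the line, and rename ‘content’ to ‘htmlCode’. Also note that IndexOf returns -1 when nothing is found, and that the second argument of Substring is a length, not an end index:
```
private void GetLinks()
{
    using (WebClient client = new WebClient()) // WebClient implements IDisposable
    {
        // Get the html content without saving it to a file
        htmlCode = client.DownloadString("https://my-site");
    }

    using (StringReader reader = new StringReader(htmlCode))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int index = line.IndexOf("https://something/something");
            if (index == -1) continue; // no link on this line

            int index1 = line.IndexOf(".jpg", index);
            if (index1 == -1) continue; // the link does not end in .jpg

            // Substring takes (start, length), so convert the end index to a length
            string result = line.Substring(index, index1 + ".jpg".Length - index);
            links.Add(result);
        }
    }
}
```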
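
The code above picks up at most one link per line, while the question asks for every link in a line. A possible way to do that with only IndexOf and Substring is to keep searching from the end of the previous match, as in this sketch. The class `LinkScanner`, the method `FindLinks` and its parameter names are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class LinkScanner
{
    // Collects every substring of text that starts with prefix and ends with
    // the next occurrence of extension after that prefix.
    public static List<string> FindLinks(string text, string prefix, string extension)
    {
        var found = new List<string>();
        int searchFrom = 0;
        while (searchFrom < text.Length)
        {
            int start = text.IndexOf(prefix, searchFrom, StringComparison.Ordinal);
            if (start == -1) break; // no more links

            int extensionAt = text.IndexOf(extension, start + prefix.Length, StringComparison.Ordinal);
            if (extensionAt == -1) break; // prefix without a closing extension

            int end = extensionAt + extension.Length; // exclusive end index
            found.Add(text.Substring(start, end - start)); // Substring(start, length)
            searchFrom = end; // continue after this link
        }
        return found;
    }
}
```

Inside the `while` loop of `GetLinks`, something like `links.AddRange(LinkScanner.FindLinks(line, "https://something.com", ".jpg"))` would then add each link separately. One limitation of this sketch: if a link with the prefix does not end in the extension, the match runs on to the next place the extension appears.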
A regex to find everything starting with https://test.com/ and ending with .jpg could look like this:
```
https://test.com/.+\.jpg
```
An unescaped . is a special character in a regex, which matches any character. The + after the dot means ‘one or more of the preceding pattern’. The . before the jpg extension has to be escaped with a backslash, because otherwise it would also match any character. Note that when putting the pattern into a regular C# string literal, the backslash itself has to be escaped:
```
"https://test.com/.+\\.jpg"
```
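
One thing to watch for with this pattern: `.+` is greedy, so on a line holding several links, like the example in the question, it matches from the first link through to the last .jpg. Making the quantifier lazy (`.+?`) stops at the first .jpg instead. The sketch below compares the two; the class `JpgLinkRegex` and the method `Extract` are hypothetical names, and the host something.com is taken from the question's example line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class JpgLinkRegex
{
    // Returns the text of every match of pattern in line, in order.
    public static List<string> Extract(string line, string pattern)
    {
        var found = new List<string>();
        foreach (Match match in Regex.Matches(line, pattern))
        {
            found.Add(match.Value);
        }
        return found;
    }

    public static void Main()
    {
        string line = "https://something.com/my.jpg/a7gfefg/https://something.com/my2.jpg";

        // Greedy: one match that spans both links.
        Console.WriteLine(Extract(line, @"https://something\.com/.+\.jpg").Count);

        // Lazy: one match per link.
        foreach (string link in Extract(line, @"https://something\.com/.+?\.jpg"))
        {
            Console.WriteLine(link);
        }
    }
}
```

The verbatim string literals (`@"..."`) are used here so that the backslashes do not need to be doubled.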