
The problem is with the result variable.
The line contains more than one link ending with jpg.
What I want is to get every link ending with jpg, each as its own string.
I mean that result should hold one link ending with jpg on one iteration, then the next link ending with jpg on the following iteration.

The line looks like this:

https://something.com/my.jpg/a7gfefg/https://something.com/my2.jpg/sadsadsad64567546/https://something.com/my3.jpg

and I want result to contain, on the first iteration:

https://something.com/my.jpg

then on the next iteration:

https://something.com/my2.jpg

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace Testing
{
    public partial class Form1 : Form
    {
        private List<string> links = new List<string>();
        string htmlCode;

        public Form1()
        {
            InitializeComponent();

            GetLinks();
        }

        private void GetLinks()
        {
            using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
            {
                htmlCode = client.DownloadString("https://test.com/my-site");
            }

            int index1 = 0;

            using (StringReader reader = new StringReader(htmlCode))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    int index = line.IndexOf("https://test.com");
                    if (index != -1)
                    {
                        index1 = line.IndexOf("png", index);
                    }
                    if (index != -1 && index1 != -1)
                    {
                        string result = line.Substring(index, index1);
                    }
                }
            }

        }
        private void Form1_Load(object sender, EventArgs e)
        {

        }
    }
}

2 Answers


  1. A better way to extract image URLs from HTML code is to use a regular expression.

    Regular expression for extracting image URLs:

    (http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png)
    

    For how to use regular expressions in C#, see:
    https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference

  2. You’re passing the web page’s HTML to the File.ReadAllLines method as if it were a file name. You already have the HTML content in a string variable. Remove that line, and rename ‘content’ to ‘htmlCode’:

    private void GetLinks()
    {
        using (WebClient client = new WebClient()) // WebClient implements IDisposable
        {
            // Get the html content without saving it to a file
            htmlCode = client.DownloadString("https://my-site");
        }
    
        // Read the downloaded html line by line
        using (StringReader reader = new StringReader(htmlCode))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int index = line.IndexOf("https://something/something");
                if (index == -1) continue; // no link on this line

                int index1 = line.IndexOf(".jpg", index);
                if (index1 == -1) continue; // the link does not end with .jpg

                // Substring takes a length, not an end index; + 4 keeps ".jpg"
                string result = line.Substring(index, index1 - index + 4);
            }
        }
    }
    

    A regex to find everything starting with https://test.com/ and ending with .jpg could look like this:

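    https://test.com/.+\.jpg
    

    . is a special character in a regex, which matches any character except a newline. The + after the dot means ‘one or more of the preceding pattern’. The next . before the jpg extension has to be escaped with a backslash because it’s a special character. Note that .+ is greedy: if a line contains several links, it matches from the first https://test.com/ through the last .jpg, so use the lazy form .+? to get each link separately. Also note that when putting the pattern into a regular C# string literal, the backslash itself has to be escaped:

    "https://test.com/.+\\.jpg"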
    
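To get each link as its own string, the regex approach from the answers could be sketched roughly as below. This is only a minimal sketch under a few assumptions: the class name JpgLinkExtractor, the Extract method, and the two pattern constants are hypothetical names made up for illustration, and the something.com host is taken from the sample line in the question. It also assumes a lazy quantifier (.+?) is what you want, so that each match stops at the first .jpg instead of running on to the last one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class JpgLinkExtractor
{
    // Greedy: ".+" runs to the LAST ".jpg", so several links collapse into one match.
    public const string GreedyJpgPattern = @"https://something\.com/.+\.jpg";

    // Lazy: ".+?" stops at the FIRST ".jpg", giving one match per link.
    public const string LazyJpgPattern = @"https://something\.com/.+?\.jpg";

    // Returns every substring of input that matches pattern, in order of appearance.
    public static List<string> Extract(string input, string pattern)
    {
        var links = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            links.Add(match.Value);
        }
        return links;
    }
}
```

Calling JpgLinkExtractor.Extract(line, JpgLinkExtractor.LazyJpgPattern) on the sample line from the question should return the three links separately, so a foreach over the returned list gives one link per iteration in result. With GreedyJpgPattern the same call returns a single match covering the whole line.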