skip to Main Content

Okay, so I’m working on this Microsoft SQL database table that contains message strings, and I am trying to identify all of the URLs in those strings via starting with 'www.' and ending with either ('.biz','.com','.net','.org').

I found a very useful function written by StackOverflow member [https://stackoverflow.com/users/1692632/darka] at [How do i extract one or multiple URLs from a text (NVARCHAR(MAX)) column.

I modified his function to suit my purpose…. adding parameters to find the LastIndexOfChar for the various domain types. But I’ve run into a problem. I have to assign the @LastIndexOfChar parameter with the lowest non-zero value from 4 seperate parameters.

My original idea was to use some kind of temp table, but apparently I can’t use those inside of a function, so I tried a Table Variable. But even so, I’m unable to select from the Table Variable inside of my function.

So, how do I find the lowest non-zero value from these 4 parameters? If all 4 are 0, then we’d pass 0 and end the loop.

Below is my function code:

ALTER FUNCTION dbo.getLinksFromText (@Tekstas NVARCHAR(MAX))
RETURNS @Data TABLE(TheLink NVARCHAR(500))
AS
BEGIN

DECLARE @FirstIndexOfChar INT,
        @LastIndexOfChar INT,
        @LengthOfStringBetweenChars INT,
        @String NVARCHAR(MAX),
        @LastIndexOfCharBIZ INT,
        @LastIndexOfCharCOM INT,
        @LastIndexOfCharNET INT,
        @LastIndexOfCharORG INT

DECLARE @TempTable TABLE ( ParameterName varchar(20), LastIndexOfChar INT);

SET @FirstIndexOfChar    = CHARINDEX('www.',@Tekstas,0) 

WHILE @FirstIndexOfChar > 0
BEGIN

    SET @String = ''
    SET @LastIndexOfCharBIZ =   CHARINDEX('.biz',@Tekstas,@FirstIndexOfChar)
    SET @LastIndexOfCharCOM =   CHARINDEX('.com',@Tekstas,@FirstIndexOfChar)
    SET @LastIndexOfCharNET =   CHARINDEX('.net',@Tekstas,@FirstIndexOfChar)
    SET @LastIndexOfCharORG =   CHARINDEX('.org',@Tekstas,@FirstIndexOfChar)
    SET @LastIndexOfChar    = CHARINDEX('.com',@Tekstas,@FirstIndexOfChar + 0)  -- this the default type of domain we're looking for.

    INSERT INTO @TempTable
    VALUES ('BIZ', @LastIndexOfCharBIZ), ('COM', @LastIndexOfCharCOM), ('NET', @LastIndexOfCharNET), ('ORG', @LastIndexOfCharORG)

    -- The line below is commented out, because that's where this function breaks...  it doesn't allow me to do a select from the table variable
    --SET @LastIndexOfChar =  COALESCE(SELECT MIN(LastIndexOfChar) FROM @TempTable  WHERE LastIndexOfChar > 0),0)

    SET @LengthOfStringBetweenChars = @LastIndexOfChar - @FirstIndexOfChar + 4

    IF (@LastIndexOfChar = 0)
        BREAK;

    SET @String = SUBSTRING(@Tekstas,@FirstIndexOfChar,@LengthOfStringBetweenChars)
    INSERT INTO @Data (TheLink) VALUES (@String);

    SET @Tekstas = SUBSTRING(@Tekstas, @LastIndexOfChar, LEN(@Tekstas))
    SET @FirstIndexOfChar = CHARINDEX('www.',@Tekstas, 0) 

END 

RETURN
END

Any suggestions or ideas on how to find the smallest non-zero value out of the four parameters @FirstIndexOfCharBIZ, @FirstIndexOfCharCOM, @FirstIndexOfCharNET, @FirstIndexOfCharORG would be very helpful!

Please note, the message strings that I’m extracting from might not contain any domains or it might contain multiple domains. I want to identify all the domains within the strings. Below are a few example of the row of strings where I’m trying to extract domains from:

1 Hello, please fix my website at http://www.example.biz, thanks!

Our Function would return www.example.biz

2 Client name: Example Company website: www.example.net/test.html message: Can we find someone to improve our SEO rankings with www.google.com??

Our Function would return 2 values: www.example.net and www.google.com

3 Dear Support, our website at mywebsite.com isn't loading, can you fix?

Our Function would not return any values, as the above domain is missing the www portion (this is fine).

2

Answers


  1. Chosen as BEST ANSWER

    Had to make a quick edit to my code/answer, as there was a small bug Basically commented out the IF statement for (@LastIndexOfChar >= 500) and replaced it with (@LengthOfStringBetweenChar >= 500).


    Original answer:
    Okay, I kept banging away at this problem for most of today and finally got the function to work the way I intended. Below is the full source code of my function, along with notes to explain what's going on:

    ALTER FUNCTION dbo.getLinksFromText (@Tekstas NVARCHAR(MAX))
    RETURNS @Data TABLE(TheLink NVARCHAR(500))
    AS
    BEGIN
    
    DECLARE @FirstIndexOfChar INT,
            @LastIndexOfChar INT,
            @LengthOfStringBetweenChars INT,
            @String NVARCHAR(MAX),
            @TLDType VARCHAR(20),
            @LastIndexOfCharBIZ INT,
            @LastIndexOfCharCOM INT,
            @LastIndexOfCharNET INT,
            @LastIndexOfCharORG INT
    
    DECLARE @TempTable TABLE ( ParameterName varchar(20), LastIndexOfChar INT);
    
    SET @FirstIndexOfChar    = CHARINDEX('www.',@Tekstas,0) 
    
    WHILE @FirstIndexOfChar > 0
    BEGIN
        -- @String will be the output, so we must first clean/initalize it. 
        SET @String = ''
    
        -- @LastIndexOfChar is used to determine how far to look for the TLD portion.
        SET @LastIndexOfCharBIZ =   CHARINDEX('.biz',@Tekstas,@FirstIndexOfChar + 0)
        SET @LastIndexOfCharCOM =   CHARINDEX('.com',@Tekstas,@FirstIndexOfChar + 0)
        SET @LastIndexOfCharNET =   CHARINDEX('.net',@Tekstas,@FirstIndexOfChar + 0)
        SET @LastIndexOfCharORG =   CHARINDEX('.org',@Tekstas,@FirstIndexOfChar + 0)
    
        -- Populate the temp tables with values of where the TLDs are.  If the TLDs aren't found, their values will be 0.  
        INSERT INTO @TempTable
        VALUES ('BIZ', @LastIndexOfCharBIZ), ('COM', @LastIndexOfCharCOM), ('NET', @LastIndexOfCharNET), ('ORG', @LastIndexOfCharORG)
    
        -- The line below is how we get the smallest non-zero LastIndexOfChar value.  If there are no values to pass over, then we use 0.
        SET @LastIndexOfChar =  (SELECT COALESCE(MIN(LastIndexOfChar),0) FROM @TempTable  WHERE LastIndexOfChar > 0)
    
        SET @LengthOfStringBetweenChars = @LastIndexOfChar - @FirstIndexOfChar + 4
    
        --Break out of loop if limits are reached (Last index = 0 or length of string between chars = 500+)
        IF (@LastIndexOfChar = 0)
            BREAK;
        --IF (@LastIndexOfChar >= 500)
        IF (@LengthOfStringBetweenChars >= 500)
            BREAK;
    
        SET @String = SUBSTRING(@Tekstas,@FirstIndexOfChar,@LengthOfStringBetweenChars)
        INSERT INTO @Data (TheLink) VALUES (@String);
    
        SET @Tekstas = SUBSTRING(@Tekstas, @LastIndexOfChar, LEN(@Tekstas))
        SET @FirstIndexOfChar = CHARINDEX('www.',@Tekstas, 0)
    
        -- We must clean out the temptable before we start the next loop
        DELETE FROM @TempTable
    
    END 
    
    RETURN
    END
    

    It turns out that I can run a SELECT... but I just needed to change how it was written in order for SQL to accept it as valid. Originally, I was trying to Coalesce outside of the query, rather than inside. Once I did that, and put my select statements inside of (), things started to work.

    Previously, I had
    SET @LastIndexOfChar = COALESCE(SELECT MIN(LastIndexOfChar) FROM @TempTable WHERE LastIndexOfChar > 0),0)
    Instead, it should have been written as
    SET @LastIndexOfChar = (SELECT COALESCE(MIN(LastIndexOfChar),0) FROM @TempTable WHERE LastIndexOfChar > 0)

    In addition, I had to add another IF statement to break out of the loop if the LengthOfStringBetweenChars was 500+. And delete the @TempTable contents at the end of the loop so that it was ready for the next loop. Once those things were figured out, I was able to apply the function to my table and get a full set of results via:

    SELECT MT.ID, MT.BodyText, allLinks.*
    FROM MessageTable AS MT
    OUTER APPLY dbo.getLinksFromText (MT.BodyText) AS allLinks
    

    This function uses a lot of resources on the SQL Server, thus there are probably better solutions out there. User EzLo made a good point in his comment on my question that this probably should be done in C#, and I agree, but I felt like I was super close to this solution, and didn't want to abandon it and have to start all over with a new language.


  2. Based on your question of simply identifying strings that start with www. and end with one of your TLDs, why not use:

        SELECT string FROM table 
            WHERE RIGHT(string,4) = 'www.'
            AND LEFT(string,4) IN ('.biz','.com','.net','.org')
            ORDER BY string
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search