Okay, so I’m working on this Microsoft SQL database table that contains message strings, and I am trying to identify all of the URLs in those strings via starting with 'www.'
and ending with either ('.biz','.com','.net','.org'
).
I found a very useful function written by StackOverflow member [https://stackoverflow.com/users/1692632/darka] at [How do i extract one or multiple URLs from a text (NVARCHAR(MAX)) column.
I modified his function to suit my purpose…. adding parameters to find the LastIndexOfChar for the various domain types. But I’ve run into a problem. I have to assign the @LastIndexOfChar parameter with the lowest non-zero value from 4 seperate parameters.
My original idea was to use some kind of temp table, but apparently I can’t use those inside of a function, so I tried a Table Variable. But even so, I’m unable to select from the Table Variable inside of my function.
So, how do I find the lowest non-zero value from these 4 parameters? If all 4 are 0, then we’d pass 0 and end the loop.
Below is my function code:
ALTER FUNCTION dbo.getLinksFromText (@Tekstas NVARCHAR(MAX))
RETURNS @Data TABLE(TheLink NVARCHAR(500))
AS
BEGIN
DECLARE @FirstIndexOfChar INT,
@LastIndexOfChar INT,
@LengthOfStringBetweenChars INT,
@String NVARCHAR(MAX),
@LastIndexOfCharBIZ INT,
@LastIndexOfCharCOM INT,
@LastIndexOfCharNET INT,
@LastIndexOfCharORG INT
DECLARE @TempTable TABLE ( ParameterName varchar(20), LastIndexOfChar INT);
SET @FirstIndexOfChar = CHARINDEX('www.',@Tekstas,0)
WHILE @FirstIndexOfChar > 0
BEGIN
SET @String = ''
SET @LastIndexOfCharBIZ = CHARINDEX('.biz',@Tekstas,@FirstIndexOfChar)
SET @LastIndexOfCharCOM = CHARINDEX('.com',@Tekstas,@FirstIndexOfChar)
SET @LastIndexOfCharNET = CHARINDEX('.net',@Tekstas,@FirstIndexOfChar)
SET @LastIndexOfCharORG = CHARINDEX('.org',@Tekstas,@FirstIndexOfChar)
SET @LastIndexOfChar = CHARINDEX('.com',@Tekstas,@FirstIndexOfChar + 0) -- this the default type of domain we're looking for.
INSERT INTO @TempTable
VALUES ('BIZ', @LastIndexOfCharBIZ), ('COM', @LastIndexOfCharCOM), ('NET', @LastIndexOfCharNET), ('ORG', @LastIndexOfCharORG)
-- The line below is commented out, because that's where this function breaks... it doesn't allow me to do a select from the table variable
--SET @LastIndexOfChar = COALESCE(SELECT MIN(LastIndexOfChar) FROM @TempTable WHERE LastIndexOfChar > 0),0)
SET @LengthOfStringBetweenChars = @LastIndexOfChar - @FirstIndexOfChar + 4
IF (@LastIndexOfChar = 0)
BREAK;
SET @String = SUBSTRING(@Tekstas,@FirstIndexOfChar,@LengthOfStringBetweenChars)
INSERT INTO @Data (TheLink) VALUES (@String);
SET @Tekstas = SUBSTRING(@Tekstas, @LastIndexOfChar, LEN(@Tekstas))
SET @FirstIndexOfChar = CHARINDEX('www.',@Tekstas, 0)
END
RETURN
END
Any suggestions or ideas on how to find the smallest non-zero value out of the four parameters @FirstIndexOfCharBIZ, @FirstIndexOfCharCOM, @FirstIndexOfCharNET, @FirstIndexOfCharORG would be very helpful!
Please note, the message strings that I’m extracting from might not contain any domains or it might contain multiple domains. I want to identify all the domains within the strings. Below are a few example of the row of strings where I’m trying to extract domains from:
1 Hello, please fix my website at http://www.example.biz, thanks!
Our Function would return www.example.biz
2 Client name: Example Company website: www.example.net/test.html message: Can we find someone to improve our SEO rankings with www.google.com??
Our Function would return 2 values: www.example.net
and www.google.com
3 Dear Support, our website at mywebsite.com isn't loading, can you fix?
Our Function would not return any values, as the above domain is missing the www portion (this is fine).
2
Answers
Had to make a quick edit to my code/answer, as there was a small bug Basically commented out the
IF
statement for(@LastIndexOfChar >= 500)
and replaced it with(@LengthOfStringBetweenChar >= 500)
.Original answer:
Okay, I kept banging away at this problem for most of today and finally got the function to work the way I intended. Below is the full source code of my function, along with notes to explain what's going on:
It turns out that I can run a SELECT... but I just needed to change how it was written in order for SQL to accept it as valid. Originally, I was trying to Coalesce outside of the query, rather than inside. Once I did that, and put my select statements inside of (), things started to work.
Previously, I had
SET @LastIndexOfChar = COALESCE(SELECT MIN(LastIndexOfChar) FROM @TempTable WHERE LastIndexOfChar > 0),0)
Instead, it should have been written as
SET @LastIndexOfChar = (SELECT COALESCE(MIN(LastIndexOfChar),0) FROM @TempTable WHERE LastIndexOfChar > 0)
In addition, I had to add another IF statement to break out of the loop if the LengthOfStringBetweenChars was 500+. And delete the @TempTable contents at the end of the loop so that it was ready for the next loop. Once those things were figured out, I was able to apply the function to my table and get a full set of results via:
This function uses a lot of resources on the SQL Server, thus there are probably better solutions out there. User EzLo made a good point in his comment on my question that this probably should be done in C#, and I agree, but I felt like I was super close to this solution, and didn't want to abandon it and have to start all over with a new language.
Based on your question of simply identifying strings that start with
www.
and end with one of your TLDs, why not use: