Let assume we have 2 strings ($strA and $strB) of some MBs lenght and we want to find (if exists) any 8-bytes substring of $strA inside $strB. My current approach works however is unacceptable slow:
function FindAnySubstring($strA,$strB)
{
for( $i=0; $i<strlen($strA)-8; $i++ )
{
$toFind=subStr($strA,$i,8);
$pos=strpos($strB,$toFind);
if( $pos === FALSE )
continue;
return [$i,$pos];
}
return FALSE;
}
Can it be done in a better way?
EDIT:
Sample data – well, probably I’m not able to add big files, but let assume that we can use shorter samples just for visualisation:
$strA=’abcdefghijKLMNOPQRstuvxyz’;
$strB=’ldsf32vKLMNOPQR4fa156232543′;
Expected result [11,8] (at 11th pos in $strA and 8th pos in $strB)
Current execution time for 3,321,792 bytes long $strA and $strB it takes between 0 and 2530 seconds to find a solution (!) – depending on position of the substring. I would expect 2 orders faster algorithm.
Considered ideas – I have some ideas:
- find shorter substring (substring of the substring, starting from 1 byte) and if found – check the rest; tested with no improvements
- use pararelling; always work, however harder to code
2
Answers
There are a few improvements you can make to potentially optimize its performance:
Early Exit: You can exit the loop as soon as a match is found, rather than continuing to search for additional matches. This can significantly reduce the number of unnecessary iterations.
Precompute Lengths: Instead of repeatedly calling strlen($strA) inside the loop, you can precompute the length of $strA before the loop starts to improve efficiency.
Here’s an updated version of your function with these improvements:
I wrote this , Also you can change it to find all substrings :