I have a raw HTML file saved as plain text. It contains a directory listing of files that is thousands of lines long.

The list looks like this:

<html>
<head><title>Index of /</title></head>
<body>
<h1>Index of /</h1><hr><pre><a href="../">../</a>
<a href="9999920122022-SP.xml">9999920122022-SP.xml</a>                               20-Dec-2022 15:46                2652
<a href="10000020122022-PR.xml">10000020122022-PR.xml</a>                              20-Dec-2022 15:47               74861
<a href="10000120122022-SC.xml">10000120122022-SC.xml</a>                              20-Dec-2022 16:03              160717
<a href="10000220122022-SC.xml">10000220122022-SC.xml</a>                              20-Dec-2022 16:03              160717
<a href="10852508042023-ESP%3FRITO%20SANTO-ES.xml">10852508042023-ESP?RITO SANTO-ES.xml</a>               08-Apr-2023 22:59              379563
<a href="10000320122022-MG.xml">10000320122022-MG.xml</a>                              20-Dec-2022 16:21             8122831
<a href="11054812072023-S%3FO%20PAULO-SP.xml">11054812072023-S?O PAULO-SP.xml</a>                    12-Jul-2023 21:52              690879
<a href="10170411012023-MATO%20GROSSO-MT.xml">10170411012023-MATO GROSSO-MT.xml</a>                  11-Jan-2023 14:57             5819174
<a href="10272320012023-RIO%20DE%20JANEIRO-RJ.xml">10272320012023-RIO DE JANEIRO-RJ.xml</a>               20-Jan-2023 23:03             1000763
<a href="10000420122022-SP.xml">10000420122022-SP.xml</a>                              20-Dec-2022 17:37               11552
<a href="10000520122022-PR.xml">10000520122022-PR.xml</a>                              20-Dec-2022 17:57               33926
</pre><hr></body>
</html>


I need to find the newest file in this list, based on the date and time columns.

To do this, I’ve created the following script.

#!/bin/bash

# Function to extract the date and time from the HTML line
extract_datetime() {
    line="$1"
    # Extract the date and time from the line using awk
    datetime=$(echo "$line" | awk '{print $3, $4}')
    # Convert the date and time to a format suitable for comparison
    date -d "$datetime" +%s
}

# Read the HTML content from a file
if [ -f "$1" ]; then
    html_content=$(<"$1")
else
    echo "File not found: $1"
    exit 1
fi

newest_datetime=0
# Iterate over each line of the HTML content
while IFS= read -r line; do
    # Check if the line contains an XML file
    if [[ "$line" =~ href=\"([^\"]*\.xml)\" ]]; then
        # Extract the file name and date/time from the line
        filename="${BASH_REMATCH[1]}"
        file_datetime=$(extract_datetime "$line")
        # Compare the current file's date/time with the newest one found so far
        if (( file_datetime > newest_datetime )); then
            newest_datetime=$file_datetime
            newest_file="$filename"
        fi
    fi
done <<< "$html_content"

# Output the newest file
echo "Newest file: $newest_file"


Some files in the list have spaces in their names, which causes awk to extract the wrong fields for the date and time.
As a result, the date command gives me errors like these:

date: invalid date ‘SANTO-ES.xml 08-Apr-2023’
date: invalid date ‘PAULO-SP.xml 12-Jul-2023’
date: invalid date ‘GROSSO-MT.xml 11-Jan-2023’
date: invalid date ‘DE JANEIRO-RJ.xml’

Now I’m having a hard time figuring out how to handle these spaces in the filenames.
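
The only workaround I can think of is to count fields from the end of the line instead of from the left, since the date, time and size always seem to be the last three whitespace-separated fields. Something like this untested change to extract_datetime, although I'm not sure it's the right approach:

extract_datetime() {
    line="$1"
    # Date and time are always the 3rd- and 2nd-to-last fields,
    # no matter how many spaces the filename contains
    datetime=$(echo "$line" | awk '{print $(NF-2), $(NF-1)}')
    date -d "$datetime" +%s
}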

Any thoughts on this?

3 Answers


  1. Why don't you use a bash regular expression with the BASH_REMATCH array for the date matching as well?

    #!/bin/bash
    
    # Read the HTML content from a file
    if [ -f "$1" ]; then
        html_content=$(<"$1")
    else
        echo "File not found: $1"
        exit 1
    fi
    
    newest_datetime=0
    # Iterate over each line of the HTML content
    while IFS= read -r line; do
        # Check if the line contains an XML file
        if [[ "$line" =~ href=\"([^\"]*\.xml)\".*\</a\>\ *([0-9]+-[a-zA-Z]+-[0-9]+\ [0-9]+:[0-9]+) ]]; then
            # Extract the file name and date/time from the line
            filename="${BASH_REMATCH[1]}"
            printf "filename='%s'\n" "$filename"
            file_datetime="${BASH_REMATCH[2]}"
            # Compare the current file's date/time with the newest one found so far
            # printf "file_datetime='%s'\n" "$file_datetime"
            file_datetime=$(date -d "$file_datetime" +%s)
            # printf "file_datetime='%s'\n" "$file_datetime"
            if (( file_datetime > newest_datetime )); then
                newest_datetime=$file_datetime
                newest_file="$filename"
            fi
        fi
    done <<< "$html_content"
    
    # Output the newest file
    echo "Newest file: $newest_file"
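
    Note that the spaces inside the pattern have to be escaped (or the whole pattern kept in a single-quoted variable that is expanded unquoted on the right of =~, as the pure-bash version in the next answer does); with unquoted spaces, [[ ... ]] reports a syntax error. The variable form would look roughly like this:

    re='href="([^"]*\.xml)".*</a> *([0-9]+-[a-zA-Z]+-[0-9]+ [0-9]+:[0-9]+)'
    if [[ "$line" =~ $re ]]; then
        filename="${BASH_REMATCH[1]}"
        file_datetime="${BASH_REMATCH[2]}"
    fi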
    
  2. By default, awk splits fields on runs of whitespace, so when a filename contains spaces the field numbers shift; if you want to work with the columns reliably you have to do more than use the default fields.

    Calling awk once for every line is also inefficient.

    It would be better to use an XML/HTML parser, but assuming the file is guaranteed to have the specific format shown, it is not necessary to call date on each line: the month names can be converted to numbers and the date rearranged into a key that allows a simple sort.

    #!/bin/sh
    
    if ! [ -f "$1" ]; then
        echo "File not found: $1"
        exit 1
    fi
    
    awk '
        BEGIN {
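            # build a month-name -> zero-padded number map (Jan -> 01, ..., Dec -> 12)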
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",a)
            for(i in a) m2i[a[i]] = sprintf("%02d",i)
        }
        /^<a href=/ {
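            # split on < and >: a[3] is the link text; the date, time and size are always the last three fields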
            split($0,a,/[<>]/)
            if (a[3] ~ /[.]xml$/) {
                split($(NF-2),d,/-/)
                print d[3] m2i[d[2]] d[1] $(NF-1), a[3]
            }
        }
    ' "$1" |
    sort -r |
    awk '
        NR>1 && p1!=$1 { exit }
        { p1=$1; print substr($0,15) }
    '
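
    For reference, the first awk prints one line per .xml entry with a fixed-width, lexically sortable key in front of the link text, e.g. for the 20-Dec-2022 17:57 entry:

    2022122017:57 10000520122022-PR.xml

    sort -r then puts the newest key first, and the final awk prints the name(s) that share that key.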
    

    or with just bash:

    #!/bin/bash
    
    if ! [ -f "$1" ]; then
        echo "File not found: $1"
        exit 1
    fi
    
    declare -A m2i
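    # 101..112 with the leading "1" stripped by ${i:1} gives zero-padded month numbers 01..12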
    i=101
    for m in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do
        m2i[$m]=${i:1}
        ((++i))
    done
    
    newest_datetime=000000000000
    newest_files=()
    declare -n q=BASH_REMATCH
    #                1          2          3     4            5          6
    re='<a href=".+">(.+)</a>\s+([0-9]{2})-(...)-([0-9]{4})\s+([0-9]{2}):([0-9]{2})'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            this_datetime=${q[4]}${m2i["${q[3]}"]}${q[2]}${q[5]}${q[6]}
            if [[ $this_datetime > $newest_datetime ]]; then
                newest_datetime=$this_datetime
                newest_files=("${q[1]}")
            elif [[ $this_datetime = $newest_datetime ]]; then
                newest_files+=("${q[1]}")
            fi
        fi
    done <"$1"
    
    printf '%s\n' "${newest_files[@]}"
    

    These should both handle the case of more than one file being "newest".
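
    Both scripts take the saved listing as their first argument. Assuming the listing above was saved as index.html and either script as newest.sh (both names are just examples), running it prints the link text of the newest entry (12-Jul-2023 21:52):

    $ ./newest.sh index.html
    11054812072023-S?O PAULO-SP.xml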

  3. It's always advisable to use a proper XML/HTML parsing tool, never regular expressions.
    Using xmllint and selecting the href attributes and the following text() nodes:

    #!/bin/bash
    
    while IFS= read -r line; do
        if [ -z "$line" ];then
            continue
        fi
        #echo "<<< $line"
        if grep -q '.xml' <<<"$line"; then
            fn="$(cut -d '=' -f2 <<<"$line" | tr -d '"')"
            #echo ">>> fn $fn"
        else
            # squeeze white space and get constant width date
            dt="$(tr -s ' ' <<<"$line" | cut -c 2-19)"
            # format date as unix timestamp to have an easily sortable number
            unixts="$(date -d "$dt" "+%s")"
            #echo ">>> dt $dt"
            printf "%s %s\n" "$unixts ($dt)" "$fn"
        fi
    
    done < <(xmllint --html --xpath '//a/@href[contains(.,".xml")] | //a/following-sibling::text()' tmp.html) | sort -r
    

    Result

    1689209520 (12-Jul-2023 21:52 ) 11054812072023-S%3FO%20PAULO-SP.xml
    1681005540 (08-Apr-2023 22:59 ) 10852508042023-ESP%3FRITO%20SANTO-ES.xml
    1674266580 (20-Jan-2023 23:03 ) 10272320012023-RIO%20DE%20JANEIRO-RJ.xml
    1673459820 (11-Jan-2023 14:57 ) 10170411012023-MATO%20GROSSO-MT.xml
    1671569820 (20-Dec-2022 17:57 ) 10000520122022-PR.xml
    1671568620 (20-Dec-2022 17:37 ) 10000420122022-SP.xml
    1671564060 (20-Dec-2022 16:21 ) 10000320122022-MG.xml
    1671562980 (20-Dec-2022 16:03 ) 10000220122022-SC.xml
    1671562980 (20-Dec-2022 16:03 ) 10000120122022-SC.xml
    1671562020 (20-Dec-2022 15:47 ) 10000020122022-PR.xml
    1671561960 (20-Dec-2022 15:46 ) 9999920122022-SP.xml
    

    To get only the newest one, append head -n1:

    ... | sort -r | head -n1
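
    which, with the listing above, leaves only the top line:

    1689209520 (12-Jul-2023 21:52 ) 11054812072023-S%3FO%20PAULO-SP.xml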
