I have a raw HTML file saved as plain text. It contains a directory listing of files that is thousands of lines long.

The list looks like this:

<html>
<head><title>Index of /</title></head>
<body>
<h1>Index of /</h1><hr><pre><a href="../">../</a>
<a href="9999920122022-SP.xml">9999920122022-SP.xml</a>                               20-Dec-2022 15:46                2652
<a href="10000020122022-PR.xml">10000020122022-PR.xml</a>                              20-Dec-2022 15:47               74861
<a href="10000120122022-SC.xml">10000120122022-SC.xml</a>                              20-Dec-2022 16:03              160717
<a href="10000220122022-SC.xml">10000220122022-SC.xml</a>                              20-Dec-2022 16:03              160717
<a href="10852508042023-ESP%3FRITO%20SANTO-ES.xml">10852508042023-ESP?RITO SANTO-ES.xml</a>               08-Apr-2023 22:59              379563
<a href="10000320122022-MG.xml">10000320122022-MG.xml</a>                              20-Dec-2022 16:21             8122831
<a href="11054812072023-S%3FO%20PAULO-SP.xml">11054812072023-S?O PAULO-SP.xml</a>                    12-Jul-2023 21:52              690879
<a href="10170411012023-MATO%20GROSSO-MT.xml">10170411012023-MATO GROSSO-MT.xml</a>                  11-Jan-2023 14:57             5819174
<a href="10272320012023-RIO%20DE%20JANEIRO-RJ.xml">10272320012023-RIO DE JANEIRO-RJ.xml</a>               20-Jan-2023 23:03             1000763
<a href="10000420122022-SP.xml">10000420122022-SP.xml</a>                              20-Dec-2022 17:37               11552
<a href="10000520122022-PR.xml">10000520122022-PR.xml</a>                              20-Dec-2022 17:57               33926
</pre><hr></body>
</html>


I need to find the newest file in this list, based on the date and time columns.

To do this, I’ve created the following script.

#!/bin/bash

# Function to extract the date and time from the HTML line
extract_datetime() {
    line="$1"
    # Extract the date and time from the line using awk
    datetime=$(echo "$line" | awk '{print $3, $4}')
    # Convert the date and time to a format suitable for comparison
    date -d "$datetime" +%s
}

# Read the HTML content from a file
if [ -f "$1" ]; then
    html_content=$(<"$1")
else
    echo "File not found: $1"
    exit 1
fi

newest_datetime=0
# Iterate over each line of the HTML content
while IFS= read -r line; do
    # Check if the line contains an XML file
    if [[ "$line" =~ href=\"([^\"]*\.xml)\" ]]; then
        # Extract the file name and date/time from the line
        filename="${BASH_REMATCH[1]}"
        file_datetime=$(extract_datetime "$line")
        # Compare the current file's date/time with the newest one found so far
        if (( file_datetime > newest_datetime )); then
            newest_datetime=$file_datetime
            newest_file="$filename"
        fi
    fi
done <<< "$html_content"

# Output the newest file
echo "Newest file: $newest_file"


Some files in the list have spaces in their names, which causes awk to extract the wrong fields for the date and time.
As a result, the date command gives me errors like these:

date: invalid date ‘SANTO-ES.xml 08-Apr-2023’
date: invalid date ‘PAULO-SP.xml 12-Jul-2023’
date: invalid date ‘GROSSO-MT.xml 11-Jan-2023’
date: invalid date ‘DE JANEIRO-RJ.xml’

Now I’m having a hard time figuring out how to handle these spaces in the filenames.
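
The only workaround I can think of is to count fields from the end of the line instead of from the left, since the date, time and size always seem to be the last three whitespace-separated fields. Something like this untested change to extract_datetime, although I'm not sure it's the right approach:

extract_datetime() {
    line="$1"
    # Date and time are always the 3rd- and 2nd-to-last fields,
    # no matter how many spaces the filename contains
    datetime=$(echo "$line" | awk '{print $(NF-2), $(NF-1)}')
    date -d "$datetime" +%s
}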

Any thoughts on this?

3 Answers


  1. Why don't you use a bash regular expression with the BASH_REMATCH array for the date matching as well?

    #!/bin/bash
    
    # Read the HTML content from a file
    if [ -f "$1" ]; then
        html_content=$(<"$1")
    else
        echo "File not found: $1"
        exit 1
    fi
    
    newest_datetime=0
    # Iterate over each line of the HTML content
    while IFS= read -r line; do
        # Check if the line contains an XML file
        if [[ "$line" =~ href=\"([^\"]*\.xml)\".*\</a\>\ *([0-9]+-[a-zA-Z]+-[0-9]+\ [0-9]+:[0-9]+) ]]; then
            # Extract the file name and date/time from the line
            filename="${BASH_REMATCH[1]}"
            printf "filename='%s'\n" "$filename"
            file_datetime="${BASH_REMATCH[2]}"
            # Compare the current file's date/time with the newest one found so far
            # printf "file_datetime='%s'\n" "$file_datetime"
            file_datetime=$(date -d "$file_datetime" +%s)
            # printf "file_datetime='%s'\n" "$file_datetime"
            if (( file_datetime > newest_datetime )); then
                newest_datetime=$file_datetime
                newest_file="$filename"
            fi
        fi
    done <<< "$html_content"
    
    # Output the newest file
    echo "Newest file: $newest_file"
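
    Note that the spaces inside the pattern have to be escaped (or the whole pattern kept in a single-quoted variable that is expanded unquoted on the right of =~, as the pure-bash version in the next answer does); with unquoted spaces, [[ ... ]] reports a syntax error. The variable form would look roughly like this:

    re='href="([^"]*\.xml)".*</a> *([0-9]+-[a-zA-Z]+-[0-9]+ [0-9]+:[0-9]+)'
    if [[ "$line" =~ $re ]]; then
        filename="${BASH_REMATCH[1]}"
        file_datetime="${BASH_REMATCH[2]}"
    fi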
    
  2. By default, awk splits fields on runs of whitespace, so when a filename contains spaces the field numbers shift; if you want to work with the columns reliably you have to do more than use the default fields.

    Calling awk once for every line is also inefficient.

    It would be better to use an XML/HTML parser, but assuming the file is guaranteed to have the specific format shown, it is not necessary to call date on each line: the month names can be converted to numbers and the date rearranged into a key that allows a simple sort.

    #!/bin/sh
    
    if ! [ -f "$1" ]; then
        echo "File not found: $1"
        exit 1
    fi
    
    awk '
        BEGIN {
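            # build a month-name -> zero-padded number map (Jan -> 01, ..., Dec -> 12)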
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",a)
            for(i in a) m2i[a[i]] = sprintf("%02d",i)
        }
        /^<a href=/ {
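            # split on < and >: a[3] is the link text; the date, time and size are always the last three fields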
            split($0,a,/[<>]/)
            if (a[3] ~ /[.]xml$/) {
                split($(NF-2),d,/-/)
                print d[3] m2i[d[2]] d[1] $(NF-1), a[3]
            }
        }
    ' "$1" |
    sort -r |
    awk '
        NR>1 && p1!=$1 { exit }
        { p1=$1; print substr($0,15) }
    '
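
    For reference, the first awk prints one line per .xml entry with a fixed-width, lexically sortable key in front of the link text, e.g. for the 20-Dec-2022 17:57 entry:

    2022122017:57 10000520122022-PR.xml

    sort -r then puts the newest key first, and the final awk prints the name(s) that share that key.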
    

    or with just bash:

    #!/bin/bash
    
    if ! [ -f "$1" ]; then
        echo "File not found: $1"
        exit 1
    fi
    
    declare -A m2i
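    # 101..112 with the leading "1" stripped by ${i:1} gives zero-padded month numbers 01..12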
    i=101
    for m in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do
        m2i[$m]=${i:1}
        ((++i))
    done
    
    newest_datetime=000000000000
    newest_files=()
    declare -n q=BASH_REMATCH
    #                1          2          3     4            5          6
    re='<a href=".+">(.+)</a>\s+([0-9]{2})-(...)-([0-9]{4})\s+([0-9]{2}):([0-9]{2})'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            this_datetime=${q[4]}${m2i["${q[3]}"]}${q[2]}${q[5]}${q[6]}
            if [[ $this_datetime > $newest_datetime ]]; then
                newest_datetime=$this_datetime
                newest_files=("${q[1]}")
            elif [[ $this_datetime = $newest_datetime ]]; then
                newest_files+=("${q[1]}")
            fi
        fi
    done <"$1"
    
    printf '%s\n' "${newest_files[@]}"
    

    These should both handle the case of more than one file being "newest".
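
    Both scripts take the saved listing as their first argument. Assuming the listing above was saved as index.html and either script as newest.sh (both names are just examples), running it prints the link text of the newest entry (12-Jul-2023 21:52):

    $ ./newest.sh index.html
    11054812072023-S?O PAULO-SP.xml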

  3. It's always advisable to use a proper XML/HTML parsing tool, never regular expressions.
    Using xmllint and selecting the href attributes and the following text() nodes:

    #!/bin/bash
    
    while IFS= read -r line; do
        if [ -z "$line" ];then
            continue
        fi
        #echo "<<< $line"
        if grep -q '.xml' <<<"$line"; then
            fn="$(cut -d '=' -f2 <<<"$line" | tr -d '"')"
            #echo ">>> fn $fn"
        else
            # squeeze white space and get constant width date
            dt="$(tr -s ' ' <<<"$line" | cut -c 2-19)"
            # format date as unix timestamp to have an easily sortable number
            unixts="$(date -d "$dt" "+%s")"
            #echo ">>> dt $dt"
            printf "%s %s\n" "$unixts ($dt)" "$fn"
        fi
    
    done < <(xmllint --html --xpath '//a/@href[contains(.,".xml")] | //a/following-sibling::text()' tmp.html) | sort -r
    

    Result

    1689209520 (12-Jul-2023 21:52 ) 11054812072023-S%3FO%20PAULO-SP.xml
    1681005540 (08-Apr-2023 22:59 ) 10852508042023-ESP%3FRITO%20SANTO-ES.xml
    1674266580 (20-Jan-2023 23:03 ) 10272320012023-RIO%20DE%20JANEIRO-RJ.xml
    1673459820 (11-Jan-2023 14:57 ) 10170411012023-MATO%20GROSSO-MT.xml
    1671569820 (20-Dec-2022 17:57 ) 10000520122022-PR.xml
    1671568620 (20-Dec-2022 17:37 ) 10000420122022-SP.xml
    1671564060 (20-Dec-2022 16:21 ) 10000320122022-MG.xml
    1671562980 (20-Dec-2022 16:03 ) 10000220122022-SC.xml
    1671562980 (20-Dec-2022 16:03 ) 10000120122022-SC.xml
    1671562020 (20-Dec-2022 15:47 ) 10000020122022-PR.xml
    1671561960 (20-Dec-2022 15:46 ) 9999920122022-SP.xml
    

    To get only the newest one, append head -n1:

    ... | sort -r | head -n1
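
    which, with the listing above, leaves only the top line:

    1689209520 (12-Jul-2023 21:52 ) 11054812072023-S%3FO%20PAULO-SP.xml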
