I have a file in raw HTML format. It contains a list of files and is thousands of lines long.
The list looks like this:
<html>
<head><title>Index of /</title></head>
<body>
<h1>Index of /</h1><hr><pre><a href="../">../</a>
<a href="9999920122022-SP.xml">9999920122022-SP.xml</a> 20-Dec-2022 15:46 2652
<a href="10000020122022-PR.xml">10000020122022-PR.xml</a> 20-Dec-2022 15:47 74861
<a href="10000120122022-SC.xml">10000120122022-SC.xml</a> 20-Dec-2022 16:03 160717
<a href="10000220122022-SC.xml">10000220122022-SC.xml</a> 20-Dec-2022 16:03 160717
<a href="10852508042023-ESP%3FRITO%20SANTO-ES.xml">10852508042023-ESP?RITO SANTO-ES.xml</a> 08-Apr-2023 22:59 379563
<a href="10000320122022-MG.xml">10000320122022-MG.xml</a> 20-Dec-2022 16:21 8122831
<a href="11054812072023-S%3FO%20PAULO-SP.xml">11054812072023-S?O PAULO-SP.xml</a> 12-Jul-2023 21:52 690879
<a href="10170411012023-MATO%20GROSSO-MT.xml">10170411012023-MATO GROSSO-MT.xml</a> 11-Jan-2023 14:57 5819174
<a href="10272320012023-RIO%20DE%20JANEIRO-RJ.xml">10272320012023-RIO DE JANEIRO-RJ.xml</a> 20-Jan-2023 23:03 1000763
<a href="10000420122022-SP.xml">10000420122022-SP.xml</a> 20-Dec-2022 17:37 11552
<a href="10000520122022-PR.xml">10000520122022-PR.xml</a> 20-Dec-2022 17:57 33926
</pre><hr></body>
</html>
I need to find what is the newest file from this list based on the date and time columns.
To do this, I’ve created the following script.
#!/bin/bash
# Function to extract the date and time from the HTML line
extract_datetime() {
    line="$1"
    # Extract the date and time from the line using awk
    datetime=$(echo "$line" | awk '{print $3, $4}')
    # Convert the date and time to a format suitable for comparison
    date -d "$datetime" +%s
}
# Read the HTML content from a file
if [ -f "$1" ]; then
    html_content=$(<"$1")
else
    echo "File not found: $1"
    exit 1
fi
newest_datetime=0
# Iterate over each line of the HTML content
while IFS= read -r line; do
    # Check if the line contains an XML file
    if [[ "$line" =~ href=\"([^\"]*.xml)\" ]]; then
        # Extract the file name and date/time from the line
        filename="${BASH_REMATCH[1]}"
        file_datetime=$(extract_datetime "$line")
        # Compare the current file's date/time with the newest one found so far
        if (( file_datetime > newest_datetime )); then
            newest_datetime=$file_datetime
            newest_file="$filename"
        fi
    fi
done <<< "$html_content"
# Output the newest file
echo "Newest file: $newest_file"
Some files in the list have one or more spaces in their names, which causes awk to extract the wrong fields for the date and time.
As a result, the date command gives me errors like these:
date: invalid date ‘SANTO-ES.xml 08-Apr-2023’
date: invalid date ‘PAULO-SP.xml 12-Jul-2023’
date: invalid date ‘GROSSO-MT.xml 11-Jan-2023’
date: invalid date ‘DE JANEIRO-RJ.xml’
Now I’m having a hard time figuring out how to handle these spaces in the filenames.
Any thoughts on this?
3 Answers
Why don’t you use the RE in bash with the BASH_REMATCH array for date matching?

By default, awk splits fields by runs of whitespace. If you care about retaining the exact spacing, you have to do more than use the default fields.
Calling awk once for every line is also inefficient.
It would be better to use an XML/HTML parser, but assuming the file is guaranteed to have the specific format shown, it is not necessary to call date on each line, as months can be converted to numbers and then the date rearranged to allow a simple sort.
or with just bash:
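A minimal pure-bash sketch of that idea, assuming bash 4 or later (for associative arrays) and that every entry follows the listing format shown in the question. The function name newest_from_listing and the exact regular expression are illustrative and not taken from the answer:

```shell
#!/bin/bash
# Sketch only. Assumes each entry looks like:
#   <a href="NAME.xml">NAME.xml</a> DD-Mon-YYYY HH:MM SIZE
# newest_from_listing is a hypothetical helper name; it reads the listing on stdin.
newest_from_listing() {
    local line key m best=""
    local -a newest=()
    local -A month=( [Jan]=01 [Feb]=02 [Mar]=03 [Apr]=04 [May]=05 [Jun]=06
                     [Jul]=07 [Aug]=08 [Sep]=09 [Oct]=10 [Nov]=11 [Dec]=12 )
    # Anchor on </a> so that spaces inside the file name cannot shift the fields.
    local re='href="([^"]*\.xml)">.*</a>[[:space:]]+([0-9]{2})-([A-Za-z]{3})-([0-9]{4})[[:space:]]+([0-9]{2}):([0-9]{2})'

    while IFS= read -r line; do
        [[ $line =~ $re ]] || continue
        m=${month[${BASH_REMATCH[3]}]}
        [[ -n $m ]] || continue            # skip unknown month abbreviations
        # YYYYMMDDHHMM sorts chronologically when compared as a plain string.
        key="${BASH_REMATCH[4]}${m}${BASH_REMATCH[2]}${BASH_REMATCH[5]}${BASH_REMATCH[6]}"
        if [[ $key > $best ]]; then
            best=$key
            newest=("${BASH_REMATCH[1]}")  # new maximum: restart the list
        elif [[ $key == "$best" ]]; then
            newest+=("${BASH_REMATCH[1]}") # tie: keep every newest file
        fi
    done

    if (( ${#newest[@]} )); then
        printf '%s\n' "${newest[@]}"
    fi
}
```

It could be called as newest_from_listing < index.html, where index.html stands for the saved listing. It prints the href values as they appear in the file, so names stay URL-encoded, and no external command is run per line.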
These should both handle the case of more than one file being "newest".
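The month-to-number conversion followed by a plain sort could look roughly like this with a single awk call. This is a sketch under stated assumptions: the function name newest_by_sort is hypothetical, and the date, time and size are assumed to always be the last three whitespace-separated fields of each entry, which is what makes spaces in the file name harmless:

```shell
#!/bin/bash
# Sketch only. newest_by_sort is a hypothetical helper name.
# Reads the listing from the files given as arguments, or from stdin.
newest_by_sort() {
    awk '
        BEGIN {
            # Map month abbreviations to two-digit numbers.
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
            for (i = 1; i <= n; i++) month[name[i]] = sprintf("%02d", i)
        }
        match($0, /href="[^"]*\.xml"/) && NF >= 3 {
            # Strip the leading href=" (6 chars) and the closing quote.
            href = substr($0, RSTART + 6, RLENGTH - 7)
            # Count fields from the end so spaces in the name do not matter.
            split($(NF - 2), d, "-")    # DD-Mon-YYYY
            split($(NF - 1), t, ":")    # HH:MM
            if (!(d[2] in month)) next
            # Emit a sortable key YYYYMMDDHHMM, a tab, then the href.
            printf "%s%s%s%s%s\t%s\n", d[3], month[d[2]], d[1], t[1], t[2], href
        }
    ' "$@" |
    sort -r |
    # Keep every line that shares the largest key, so ties are all reported.
    awk -F '\t' 'NR == 1 { top = $1 } $1 == top { print $2 }'
}
```

For example, newest_by_sort index.html (with index.html standing for the saved listing) runs awk and sort once each for the whole file instead of once per line.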

It’s always advisable to use a proper XML/HTML parsing tool, never regular expressions.
Using xmllint and selecting href and following text() nodes:
Result
To get the newest one, append head -n1:
... | sort -r | head -n1