
Hello, I have a problem parsing an HTML table with a bash script. I managed to extract the table, but pulling the data out of it is not going so well.
The table content looks like this:

<tr><td><small>1</small></td><td>Kalisz</td><td>62-800</td><td>Poland</td><td>Greater Poland</td><td>Kalisz</td><td>Kalisz<tr><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_51.75_18.087.html" rel="nofollow"><small>51.75/18.087</small></a></td></tr>
<tr class="odd"><td><small>2</small></td><td>Piotrków Trybunalski</td><td>97-300</td><td>Poland</td><td>Łódź Voivodeship</td><td>Piotrków Trybunalski</td><td>Piotrków Trybunalski<tr class="odd"><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_51.411_19.689.html" rel="nofollow"><small>51.411/19.689</small></a></td></tr>
<tr><td><small>3</small></td><td>Toruń</td><td>87-100</td><td>Poland</td><td>Kujawsko-Pomorskie</td><td>Toruń</td><td>Toruń<tr><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_53.021_18.623.html" rel="nofollow"><small>53.021/18.623</small></a></td></tr>

There are 200 of those rows. My bash script looks like this:

#!/bin/bash

URL="https://www.geonames.org/postalcode-search.html?country=PL&q="

HTML=$(curl -s "$URL")

(echo "$HTML" | grep -A 201 '<table class="restable">' | tail -n 200 )>> table.html

html_lines=()
while IFS= read -r line; do
  html_lines+=("$line")
done < "table.html"

for html_line in "${html_lines[@]}"; do
  field1_value=$(echo "$html_line" | grep -oP '(?<=<td>)[^<]+(?=</td>)')
  field2_value=$(echo "$html_line" | grep -oP '[0-9]{2}-[0-9]{3}')
  field3_value=$(echo "$html_line" | grep -oP '(?<=<small>)[^<]+(?=</small>)')

  # Printing the extracted fields
  echo "$field1_value;$field2_value;$field3_value" >> output.txt
done

The result I’m currently getting is:

Kalisz
62-800
Poland
Greater Poland
Kalisz;62-800;1
51.75/18.087
Piotrków Trybunalski
97-300
Poland
Łódź Voivodeship
Piotrków Trybunalski;97-300;2
51.411/19.689
Toruń
87-100
Poland
Kujawsko-Pomorskie
Toruń;87-100;3
53.021/18.623

And the result I want to have is:

Greater Poland;62-800;51.75/18.087
Łódź Voivodeship;97-300;51.411/19.689
Lesser Poland;33-300;49.609/20.704

I want to parse the rows for later use in a CSV file.

3 Answers


  1. If each row is guaranteed to be on a single line and the formatting never deviates from your example,

    sed 's/<.td><td>/;/g;s/<[^>]*>//g;s/&nbsp;/ /g'
    

    will give you a CSV version of the table that you can parse with awk.

  2. With a Perl one-liner, you could try the following. It was written and tested only against the samples you showed.

    perl -pe 's/^<tr.*?>(?:<td>.*?<\/td>){2}<td>(.*?)<\/td><td>.*?<\/td><td>(.*?)<\/td>.*?<a href=".*?_(.*?)_(.*?)\.html.*$/$2;$1;$3\/$4/'  Input_file
    

    Here is the Online demo of regex.

    Explanation of regex:

    ^                   ## Match from the start of the line.
    <tr.*?>             ## Match <tr up to the next >, which also covers rows with a class= attribute.
    (?:                 ## Open a non-capturing group.
       <td>.*?<\/td>    ## Match <td> up to the next </td>.
    ){2}                ## Repeat the group twice, skipping the first two cells.
    <td>(.*?)<\/td>     ## Match the third cell and store its content in capturing group 1.
    <td>.*?<\/td>       ## Match the fourth cell without capturing it.
    <td>(.*?)<\/td>     ## Match the fifth cell and store its content in capturing group 2.
    .*?                 ## Lazily skip ahead.
    <a href=".*?_       ## Match the first <a href=" up to the next _.
    (.*?)_              ## Capturing group 3: lazy match up to the next _.
    (.*?)               ## Capturing group 4: lazy match.
    \.html.*$           ## Match a literal . followed by html, then the rest of the line.
    
  3. Using any awk:

    $ cat tst.awk
    BEGIN { FS="</td>"; OFS=";" }
    {
        for ( i=1; i<=NF; i++ ) {
            gsub("</?tr>|<td[^>]*>|.*<small>|</small>.*","",$i)
        }
        print $5, $3, $8
    }
    

    $ awk -f tst.awk file
    Greater Poland;62-800;51.75/18.087
    Łódź Voivodeship;97-300;51.411/19.689
    Kujawsko-Pomorskie;87-100;53.021/18.623
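
The first answer stops at the sed step and leaves the awk part open. Here is a minimal sketch of one way to chain the two, assuming (as that answer does) that every row sits on a single line shaped like the samples in the question. The flattened line is split on ";", and the coordinates are taken as the last whitespace-separated token of the seventh field. The function name rows_to_csv is made up for illustration and is not part of any answer above.

```shell
#!/bin/bash

# rows_to_csv is a hypothetical helper: it reads table rows on stdin and
# prints one "region;postal code;coordinates" line per row.
# Step 1: the sed command from the first answer turns cell boundaries
#         into ';', strips the remaining tags, and replaces &nbsp; with spaces.
# Step 2: field 7 then looks like "Kalisz   51.75/18.087", so awk keeps
#         only its last whitespace-separated token.
rows_to_csv() {
  sed 's/<.td><td>/;/g;s/<[^>]*>//g;s/&nbsp;/ /g' |
    awk -F';' -v OFS=';' '{ n = split($7, parts, " "); print $5, $3, parts[n] }'
}

# Sample row copied from the question.
row='<tr><td><small>1</small></td><td>Kalisz</td><td>62-800</td><td>Poland</td><td>Greater Poland</td><td>Kalisz</td><td>Kalisz<tr><td></td><td colspan=6>&nbsp;&nbsp;&nbsp;<a href="/maps/browse_51.75_18.087.html" rel="nofollow"><small>51.75/18.087</small></a></td></tr>'

printf '%s\n' "$row" | rows_to_csv
```

For the sample row this prints Greater Poland;62-800;51.75/18.087. On the real data you would feed table.html to the function instead, for example rows_to_csv < table.html > output.txt. Any row that deviates from the sample layout would need separate handling.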
    
