Redis - Parsing file names to get python package names consistently

StephenPaulin
October 25, 2023
259 views
2 votes
3 Answers

Need some scripting help. Here is the problem: I have a set of Python packages (files):

beautifulsoup4-4.12.2-py3-none-any.whl
certifi-2023.7.22-py3-none-any.whl
charset_normalizer-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
google-3.0.0-py2.py3-none-any.whl
idna-3.4-py3-none-any.whl
protobuf3-0.2.1.tar.gz
protobuf-3.19.6-py2.py3-none-any.whl
proton-0.9.1.tar.gz
python-qpid-proton-0.38.0.tar.gz
redis-4.5.5-py3-none-any.whl
requests-2.31.0-py3-none-any.whl
robotframework-6.1.1-py3-none-any.whl
robotframework_requests-0.9.1-py3-none-any.whl
robotframework-run-keyword-async-1.0.8.tar.gz
soupsieve-2.5-py3-none-any.whl
urllib3-2.0.7-py3-none-any.whl

I need to parse each file name to get the package name and its version. I have it working for some names, while others fail. For the list above, the expected name and version of each package are:

beautifulsoup4                    4.12.2
certifi                           2023.7.22
charset-normalizer                3.3.1
google                            3.0.0
idna                              3.4
protobuf3                         0.2.1
protobuf                          3.19.6
proton                            0.9.1
python-qpid-proton                0.38.0
redis                             4.5.5
requests                          2.31.0
robotframework                    6.1.1
robotframework-requests           0.9.1
robotframework-run-keyword-async  1.0.8
soupsieve                         2.5
urllib3                           2.0.7

I’ve tried cut, grep, sed, and awk to get this working, but digits appearing in names, multi-digit version components, and inconsistent patterns cause one method or another to fail. You’ll also notice that charset_normalizer and robotframework_requests change _ to -, but I’m hoping those cases are less frequent and I can work around them when they happen.

I’m stuck on how to make this work. Any ideas? Here is my current script logic (fullName is a file name from the list above), but it fails for certifi, charset-normalizer, idna, soupsieve, and robotframework-requests.

version=`echo "$fullName" | sed -nre 's/^[^0-9]*(([0-9]+\.)*[0-9]+).*/\1/p'`
artifactId=`echo "$fullName" | sed -r "s/-${version}.*//g"`

Specifically, for the names that don’t work, the current script builds the artifact and version as:

beautifulsoup4                  4
protobuf3-0.2.1.tar.gz          3
urllib3-2.0.7-py3-none-any.whl  3

If anyone has a good way to parse the artifactId/version with regex or any other bash scripting method, I’m open to trying anything.

Thanks

Answers

- RavinderSingh13
- October 25, 2023 at 12:05 pm
- 0 votes
If you have a single input file and want to extract the values from it, try the following. Written and tested in GNU awk (the three-argument match() is a GNU awk extension), with the shown samples only.
```
awk '
match($0,/^(.*)-([0-9]+(\.[0-9]+)*)[.-].*/,arr){
  print arr[1]"\t"arr[2]
}
' packagesFile.txt  | column -t
```

Using sed, you could use 2 capture groups: capture everything up to the last occurrence of - that is followed by a number format, then capture that number format, which must itself be followed by either - or .

If there has to be at least a single dot in the version, you can change (\.[0-9]+)* to (\.[0-9]+)+

sed -E "s/^(.*)-([0-9]+(\.[0-9]+)*)[.-].*/\1 \2/" file | column -t

Output

beautifulsoup4                    4.12.2
certifi                           2023.7.22
charset_normalizer                3.3.1
google                            3.0.0
idna                              3.4
protobuf3                         0.2.1
protobuf                          3.19.6
proton                            0.9.1
python-qpid-proton                0.38.0
redis                             4.5.5
requests                          2.31.0
robotframework                    6.1.1
robotframework_requests           0.9.1
robotframework-run-keyword-async  1.0.8
soupsieve                         2.5
urllib3                           2.0.7
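Note that the output above keeps the underscores in charset_normalizer and robotframework_requests. If you also want them converted to hyphens, one possible sketch is to add a second sed expression. The function name normalize_names is made up for illustration. It relies on the version containing only digits and dots, so after the first substitution any remaining _ can only be in the name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer): reads file names on
# stdin and prints "name version" per line, with _ converted to - in the name.
normalize_names() {
    # 1st expression: same substitution as above (name, space, version).
    # 2nd expression: replace every remaining underscore with a hyphen.
    sed -E -e 's/^(.*)-([0-9]+(\.[0-9]+)*)[.-].*/\1 \2/' \
           -e 's/_/-/g'
}

printf '%s\n' \
    'charset_normalizer-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl' \
    'robotframework_requests-0.9.1-py3-none-any.whl' |
    normalize_names
```

For the two sample names this should print charset-normalizer 3.3.1 and robotframework-requests 0.9.1. One caveat: a line that does not match the first expression is passed through unchanged except for its underscores, so you may want to filter such lines out first.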

- EdMorton
- October 25, 2023 at 2:15 pm
- 0 votes
If the input were a list of names stored in a file or coming in from a pipe, a sed or awk solution would be the better approach, as that’d be far more efficient than a shell loop reading the input (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). Given that the input is actually just 1 string stored in a variable, though, bash builtins are the better approach, as that’d be more efficient than spawning a subshell to call echo+sed in a pipe.

So, to regexp-match a variable’s contents and output just the parts you want, converting _s to -s, just using bash builtins:
```
[[ $fullName =~ ^(.*)-([0-9]+(\.[0-9]+)*) ]] &&
    printf '%s\t%s\n' "${BASH_REMATCH[1]//_/-}" "${BASH_REMATCH[2]}"
```
e.g. using a shell loop just to populate fullName from your provided sample input so we can demonstrate the above solution (again, you would not write a shell loop just to manipulate text):
```
$ while IFS= read -r fullName; do
    [[ $fullName =~ ^(.*)-([0-9]+(\.[0-9]+)*) ]] &&
        printf '%s\t%s\n' "${BASH_REMATCH[1]//_/-}" "${BASH_REMATCH[2]}"
  done < file  | column -t
beautifulsoup4                    4.12.2
certifi                           2023.7.22
charset-normalizer                3.3.1
google                            3.0.0
idna                              3.4
protobuf3                         0.2.1
protobuf                          3.19.6
proton                            0.9.1
python-qpid-proton                0.38.0
redis                             4.5.5
requests                          2.31.0
robotframework                    6.1.1
robotframework-requests           0.9.1
robotframework-run-keyword-async  1.0.8
soupsieve                         2.5
urllib3                           2.0.7
```
See:
- Regexp Matching with BASH_REMATCH
- Parameter Substitution with ${var//old/new}
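
Since the question stores the results in artifactId and version, here is a minimal sketch of the same regexp match wrapped in a function that sets those two variables instead of printing. The function name parse_pkg and its return-code convention are assumptions for illustration, not part of the answer above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split one file name into artifactId and version.
# Sets the global variables artifactId and version on success;
# returns 1 (leaving them untouched) if no "-<version>" part is found.
parse_pkg() {
    # Keeping the regexp in a variable avoids shell-quoting surprises in [[ =~ ]].
    local re='^(.*)-([0-9]+(\.[0-9]+)*)'
    [[ $1 =~ $re ]] || return 1
    artifactId=${BASH_REMATCH[1]//_/-}   # convert _ to - in the name
    version=${BASH_REMATCH[2]}
}

for fullName in 'beautifulsoup4-4.12.2-py3-none-any.whl' 'protobuf3-0.2.1.tar.gz'; do
    if parse_pkg "$fullName"; then
        printf '%s\t%s\n' "$artifactId" "$version"
    else
        printf 'could not parse: %s\n' "$fullName" >&2
    fi
done
```

As with the one-liner, this was only reasoned through against the sample names in the question. A wheel that carries a numeric build tag after the version (something like -1- before the python tag) would likely be split in the wrong place, so treat it as a starting point.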