I need some scripting help. Here is the problem: I have a set of Python package files:
beautifulsoup4-4.12.2-py3-none-any.whl
certifi-2023.7.22-py3-none-any.whl
charset_normalizer-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
google-3.0.0-py2.py3-none-any.whl
idna-3.4-py3-none-any.whl
protobuf3-0.2.1.tar.gz
protobuf-3.19.6-py2.py3-none-any.whl
proton-0.9.1.tar.gz
python-qpid-proton-0.38.0.tar.gz
redis-4.5.5-py3-none-any.whl
requests-2.31.0-py3-none-any.whl
robotframework-6.1.1-py3-none-any.whl
robotframework_requests-0.9.1-py3-none-any.whl
robotframework-run-keyword-async-1.0.8.tar.gz
soupsieve-2.5-py3-none-any.whl
urllib3-2.0.7-py3-none-any.whl
I need to parse each file name to get the package name and its version. Some names parse correctly while others fail. The list above should yield the following name and version for each file:
beautifulsoup4 4.12.2
certifi 2023.7.22
charset-normalizer 3.3.1
google 3.0.0
idna 3.4
protobuf3 0.2.1
protobuf 3.19.6
proton 0.9.1
python-qpid-proton 0.38.0
redis 4.5.5
requests 2.31.0
robotframework 6.1.1
robotframework-requests 0.9.1
robotframework-run-keyword-async 1.0.8
soupsieve 2.5
urllib3 2.0.7
I’ve tried cut, grep, sed, and awk to get this working, but numbers appearing in names, multi-digit versions, and inconsistent patterns cause one method or another to fail. You’ll also notice that charset-normalizer and robotframework-requests change _ to -, but I’m hoping those cases are less frequent and I can work around them when they happen.
I’m stuck on how to make this work. Any ideas? Here is my current script logic (fullName is the file name listed above), but it fails for certifi, charset-normalizer, idna, soupsieve, and robotframework-requests.
version=`echo "$fullName" | sed -nre 's/^[^0-9]*(([0-9]+\.)*[0-9]+).*/\1/p'`
artifactId=`echo "$fullName" | sed -r "s/-${version}.*//g"`
Specifically, for the ones that don’t work, the current script produces the artifact and version as:
beautifulsoup4 4
protobuf3-0.2.1.tar.gz 3
urllib3-2.0.7-py3-none-any.whl 3
If anyone has a good way to parse the artifactId/version with regex or any other bash scripting method, I’m open to trying anything.
Thanks
3 Answers
If you have only 1 Input_file and you want to get values out of it, then try the following. Written and tested in GNU awk, with the shown samples only.

Using sed, you could use 2 capture groups and capture a number format after the last occurrence of -, followed by either - or .
If there has to be at least a single dot in the version, you can change (\.[0-9]+)* to (\.[0-9]+)+
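A minimal sketch of what the sed approach described above could look like. The function name parse_with_sed and the exact expression are assumptions made for illustration, not the answer's own code. The greedy first group runs up to the last - that is followed by a dotted number and then - or ., and the second group captures that number. Underscores in the name are left as they are.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): prints "name version" for one file name.
# Group 1: everything before the last "-" that starts a version.
# Group 2: digits with optional ".digits" parts, which must be followed by "-" or ".".
parse_with_sed() {
    printf '%s\n' "$1" |
        sed -E 's/^(.*)-([0-9]+(\.[0-9]+)*)[-.].*/\1 \2/'
}

parse_with_sed 'beautifulsoup4-4.12.2-py3-none-any.whl'   # prints: beautifulsoup4 4.12.2
parse_with_sed 'protobuf3-0.2.1.tar.gz'                   # prints: protobuf3 0.2.1
```

To require at least one dot in the version, the (\.[0-9]+)* part would become (\.[0-9]+)+ as noted above.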
If the input was a list of names stored in a file or coming in from a pipe then a sed or awk solution would be the better approach as that’d be far more efficient than a shell loop reading the input (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice), but given the input is actually just 1 string stored in a variable, bash builtins are the better approach as that’d be more efficient than spawning a subshell to call echo+sed in a pipe.

So, to regexp-match a variable’s contents and output just the parts you want, converting _s to -s, just using bash builtins:

e.g. using a shell loop just to populate fullName from your provided sample input so we can demonstrate the above solution (again, you would not write a shell loop just to manipulate text):

See:
Matching with BASH_REMATCH
Substitution with ${var//old/new}
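A minimal sketch of the builtin-only approach this answer describes, combining a BASH_REMATCH match with a ${var//old/new} substitution. The function name parse_name and the regular expression are assumptions for illustration. The variable names fullName, artifactId and version follow the question. The loop exists only to feed a few of the sample names through the function.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): sets the globals artifactId and
# version from a single file name, using only bash builtins (no subshell, no sed).
parse_name() {
    local fullName=$1
    # name, then "-", then a dotted number that is followed by "-" or "."
    local re='^(.*)-([0-9]+(\.[0-9]+)*)[-.]'
    if [[ $fullName =~ $re ]]; then
        artifactId=${BASH_REMATCH[1]//_/-}   # convert every _ in the name to -
        version=${BASH_REMATCH[2]}
    else
        return 1                             # no version found in the name
    fi
}

# Demonstration only: populate fullName from a few of the sample names.
for fullName in \
    'charset_normalizer-3.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl' \
    'robotframework-run-keyword-async-1.0.8.tar.gz' \
    'urllib3-2.0.7-py3-none-any.whl'
do
    parse_name "$fullName" && printf '%s %s\n' "$artifactId" "$version"
done
```

Storing the pattern in a variable and leaving it unquoted on the right-hand side of =~ keeps bash from treating it as a literal string.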