I’m using optparse in a ruby program (ruby 2.7.1p83) under Linux. If any of the command-line arguments are filenames with "special" characters in them, the parse! method fails with this error:
invalid byte sequence in UTF-8
This is the code which fails …
parser = OptionParser.new { |opts|
... etc. ...
}
parser.parse! # error occurs here
I know about the scrub method and other ways to do encoding in ruby. However, the place where the error occurs is in a library routine (OptionParser#parse!), and I have no control over how this library routine deals with strings.
I could pre-process the command-line arguments and replace the special characters in these arguments with an acceptable encoding, but then, in the case where the argument is a file name, I will be unable to open that file later in the program, because the filename I have accepted into the program will have been altered from the file’s original name.
I could do something complicated like pre-traversing the arguments, building a hashmap where the key is the encoded argument and the value is the original argument, changing the ARGV values to the encoded values, parsing the encoded arguments using OptionParser, and then going through the resulting arguments after OptionParser completes and using the hashmap in a procedure which replaces the encoded arguments with their original values … and then continuing with the program.
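For illustration, the mapping approach described above could be sketched roughly as follows. This is only a sketch under stated assumptions: the helper name parse_preserving_bytes is hypothetical, invalid bytes are replaced with "?" via scrub, and only the leftover (non-option) arguments are mapped back, so values handed to option handlers would still be the scrubbed strings. Two different arguments that scrub to the same string would also collide in the hash.

```ruby
require 'optparse'

# Hypothetical helper (not part of optparse): parse a copy of argv in which
# invalid byte sequences have been scrubbed, then map the leftover
# arguments back to their original, unmodified byte strings.
def parse_preserving_bytes(parser, argv)
  originals = {}

  safe_args = argv.map do |arg|
    next arg if arg.valid_encoding?

    scrubbed = arg.scrub('?')     # replace each invalid byte sequence with "?"
    originals[scrubbed] = arg     # key: scrubbed form, value: original bytes
    scrubbed
  end

  # OptionParser#parse accepts an array and returns the non-option arguments.
  leftover = parser.parse(safe_args)

  # Restore the original strings where a scrubbed form was substituted.
  leftover.map { |arg| originals.fetch(arg, arg) }
end

parser = OptionParser.new
files = parse_preserving_bytes(parser, ["\xC4foo", "rtest.rb"])
puts files.map(&:bytes).inspect
```

The restored strings keep their original bytes, so they can still be passed to File.open afterwards.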
But I’m hoping that there would be a much simpler way to solve this problem in ruby.
Thank you in advance for any ideas or suggestions.
UPDATE: Here is more detailed info …
I wrote the following minimal program called rtest.rb in order to test this:
#!/usr/bin/env run-ruby
# -*- ruby -*-
require 'optparse'
parser = OptionParser.new {
}
parser.parse!
Process.exit(0)
I ran it as follows, with the only files present in the current directory being rtest.rb itself, and another file having this name: Äfoo …
export LC_TYPE='en_us.UTF-8'
export LC_COLLATE='en_us.UTF-8'
./rtest.rb *
It generated the following error and stack trace …
Traceback (most recent call last):
7: from /home/hippo/bin/rtest.rb:8:in `<main>'
6: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1691:in `parse!'
5: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1666:in `permute!'
4: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1569:in `order!'
3: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `parse_in_order'
2: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `catch'
1: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `block in parse_in_order'
/opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `===': invalid byte sequence in UTF-8 (ArgumentError)
Here is what appears in the pertinent section of the file /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb. See line 1579 …
1572 def parse_in_order(argv = default_argv, setter = nil, &nonopt) # :nodoc:
1573 opt, arg, val, rest = nil
1574 nonopt ||= proc {|a| throw :terminate, a}
1575 argv.unshift(arg) if arg = catch(:terminate) {
1576 while arg = argv.shift
1577 case arg
1578 # long option
1579 when /\A--([^=]*)(?:=(.*))?/m
1580 opt, rest = $1, $2
In other words, the regex match on the argument is failing due to this encoding issue.
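The failure can be reproduced without optparse at all. The sketch below assumes the file name is the single byte 0xC4 (the ISO-8859-1 code for Ä) followed by "foo", tagged as UTF-8, which is consistent with the ls output quoted later in this question; the variable names are only for illustration.

```ruby
# The byte 0xC4 opens a two-byte UTF-8 sequence, but "f" is not a valid
# continuation byte, so the string is not valid UTF-8.
arg = "\xC4foo".dup.force_encoding(Encoding::UTF_8)
valid = arg.valid_encoding?

# The long-option pattern from line 1579 of optparse.rb.
long_option = /\A--([^=]*)(?:=(.*))?/m

# Matching a regexp against a string with a broken encoding raises.
message = begin
  long_option === arg
  nil
rescue ArgumentError => e
  e.message
end

puts "valid_encoding? #{valid}"
puts "error: #{message.inspect}"
```

So the check arg.valid_encoding? is enough to tell in advance which arguments will make parse! fail.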
When I have time (not right away, unfortunately), I’ll put some code into that module to do encoding of the arg variable, to see if this might fix the problem.
FURTHER UPDATE: I am running under Ubuntu 20.04, and the version of ruby that’s offered is 2.7.0. I also managed to get 2.7.1 running on my ancient debian 8 box. This error occurs in both environments. I would have to install a newer version of ruby or compile it from source before I could try version 2.7.7 or version 3.x.
YET ANOTHER UPDATE: I had some unexpected spare time, and so I built ruby-3.3.0 from source and re-ran the test. I got the same error!
% /opt/local/rubies/ruby-3.3.0/bin/ruby ./rtest.rb *
/opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `===': invalid byte sequence in UTF-8 (ArgumentError)
from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `block in parse_in_order'
from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `catch'
from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `parse_in_order'
from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1630:in `order!'
from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1739:in `permute!'
from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1764:in `parse!'
from ./rtest.rb:8:in `<main>'
However, I now think the error occurs because the filename is encoded in an unusual manner. If I do echo * in that directory, I see this, which is what I expect:
% echo *
Äfoo rtest.rb
However, if I do /bin/ls in the same directory, I see this:
% /bin/ls *
''$'\304''foo' rtest.rb
And even the OS can’t recognize the file with the name specified as follows …
% /bin/cat 'Äfoo'
/bin/cat: Äfoo: No such file or directory
But if I use the longer, encoded file name, the OS has no trouble accessing the file …
% /bin/cat ''$'\304''foo'
File contents
File contents
The ls command seems to know how to encode the Äfoo filename into ''$'\304''foo', but ruby doesn’t seem to know how to do this.
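The quoted form printed by ls looks like GNU ls's shell-style escaping of a single raw byte (octal 304 is 0xC4, the ISO-8859-1 code for Ä) rather than a different file name. As a hedged sketch, assuming a Linux filesystem that accepts arbitrary bytes in file names, ruby can list and open such a file as long as the bytes are left unmodified. The temporary directory and the file contents here are made up for the demonstration.

```ruby
require 'tmpdir'

listed = nil
read_back = nil

Dir.mktmpdir do |dir|
  # Create a file whose name is the raw byte 0xC4 followed by "foo".
  raw_name = "\xC4foo".b
  File.write(File.join(dir, raw_name), "File contents\n")

  # The directory listing returns the same bytes, whatever encoding
  # ruby tags the string with.
  listed = Dir.children(dir).first

  # Opening the file works because the name is passed through unchanged.
  read_back = File.read(File.join(dir, listed))
end

puts listed.bytes.inspect
puts read_back
```

The string is only a problem for operations that need valid text, such as regexp matching; passing it straight back to the OS is fine.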
2 Answers
NOTE: I prefer my other Answer. However, I'm leaving this Answer in place also, in case anyone is still interested.
Per the discussion below my original question, and especially per the comments there by @Schwern, it seems like this error is due to an un-parseable and un-encodable set of bytes in the file name that I have been having problems with. Therefore, it's likely to be impossible in ruby to deal properly with files named this way.
And to be clear, this problem occurs for any string which contains such un-encodable bytes, not just file names.
Therefore, I am simply checking for such un-parseable strings on the command line, and I'm exiting the ruby script with an error if I encounter any.
The following is my improved test program which shows how I am now handling this case:
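A minimal sketch of such a check is given here for illustration. It is not the original test program, and the helper name invalid_args is hypothetical.

```ruby
# Hypothetical helper: return the arguments whose bytes are not valid in
# the encoding ruby has tagged them with.
def invalid_args(argv)
  argv.reject(&:valid_encoding?)
end

if __FILE__ == $PROGRAM_NAME
  bad = invalid_args(ARGV)
  unless bad.empty?
    # inspect shows the offending bytes as escapes, e.g. "\xC4foo".
    bad.each { |arg| warn "invalid byte sequence in argument: #{arg.inspect}" }
    exit 1
  end
  # ... OptionParser#parse! would follow here ...
end
```

Running the check before parse! turns the stack trace into a readable error message, at the cost of refusing such file names entirely.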
I am provisionally treating this as my "Answer", unless something better comes along.
A better Answer has now indeed come along. See the other Answer of mine here.
I came up with this slightly hacky workaround that seems to me to be a better Answer than the other one that I posted here.
This is therefore my preferred Answer.
I wrote a preprocessor called ruby-preproc which tries to determine, based upon the program’s command-line arguments, which encoding might work for a given ruby program (see below to examine the code for ruby-preproc). Then, all conforming ruby programs would simply need to be written as follows …
If the ruby program which uses this convention is called the-script.rb, then it would simply be invoked as normal:
But also, this preprocessor enables the use of a special, optional, initial argument, -E<encoding>. In this case, the specified encoding will be forced instead of the argument list being examined. For example, for any ruby program which is set up to use this preprocessor, the following can be done …
And if the initial -E<encoding> argument is not given, then the ruby-preproc processor examines all of the command-line arguments that have been specified, and it looks for an encoding that works for every one of them. If such an encoding is found, then the script is run with that encoding being specified.
Here is the code for ruby-preproc (this is an improved version of the original that I posted in this "Answer" yesterday) …
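As a rough illustration of the selection step described above, and not the actual ruby-preproc source, the search for an encoding that works for every argument might look like this. The candidate list and the name pick_encoding are assumptions made for the sketch.

```ruby
# Assumed candidate list, tried in order. ISO-8859-1 accepts every byte,
# so it acts as a catch-all when it is last.
CANDIDATE_ENCODINGS = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

# Hypothetical helper: return the first candidate encoding under which
# every argument is a valid string, or nil if none fits.
def pick_encoding(args, candidates = CANDIDATE_ENCODINGS)
  candidates.find do |enc|
    # dup so that the caller's strings are not re-tagged.
    args.all? { |arg| arg.dup.force_encoding(enc).valid_encoding? }
  end
end

chosen = pick_encoding(["\xC4foo", "rtest.rb"])
puts chosen ? chosen.name : "no suitable encoding"
```

Because force_encoding only changes the tag and never the bytes, file names chosen this way can still be opened afterwards.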