In ruby, optparse raises error when filename contains certain characters - Ubuntu

HippoMan
February 22, 2023
2 Answers

I’m using optparse in a ruby program (ruby 2.7.1p83) under Linux. If any of the command-line arguments are filenames with "special" characters in them, the parse! method fails with this error:

invalid byte sequence in UTF-8

This is the code which fails …

parser = OptionParser.new {
  |opts|
  ... etc. ...
}
parser.parse! # error occurs here

I know about the scrub method and other ways to do encoding in ruby. However, the place where the error occurs is in a library routine (OptionParser#parse!), and I have no control over how this library routine deals with strings.

I could pre-process the command-line arguments and replace the special characters in these arguments with an acceptable encoding, but then, in the case where the argument is a file name, I will be unable to open that file later in the program, because the filename I have accepted into the program will have been altered from the file’s original name.

I could do something complicated like pre-traversing the arguments, building a hashmap where the key is the encoded argument and the value is the original argument, changing the ARGV values to the encoded values, parsing the encoded arguments using OptionParser, and then going through the resulting arguments after OptionParser completes, using the hashmap to replace the encoded arguments with their original values … and then continuing with the program.
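The mapping approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `parse_preserving_bytes` is a hypothetical helper name, invalid bytes are swapped for a `%xx` hex placeholder via `String#scrub`, and only the leftover positional arguments are mapped back (values captured by option handlers would need the same lookup).

```ruby
require 'optparse'

# Hypothetical helper (not part of optparse): parse a scrubbed copy of
# the arguments, then map the remaining ones back to their original bytes.
def parse_preserving_bytes(parser, argv)
  originals = {}
  scrubbed = argv.map do |arg|
    next arg if arg.valid_encoding?
    # Replace each invalid byte sequence with a "%xx" hex placeholder.
    safe = arg.scrub { |bytes| "%#{bytes.unpack1('H*')}" }
    originals[safe] = arg
    safe
  end
  # OptionParser#parse returns the non-option arguments.
  parser.parse(scrubbed).map { |arg| originals.fetch(arg, arg) }
end

parser = OptionParser.new { |opts| opts.on('-v') {} }
bad    = "\xC4foo".b.force_encoding(Encoding::UTF_8)
rest   = parse_preserving_bytes(parser, ['-v', bad, 'rtest.rb'])
p rest.map(&:bytes)
```

Two different original names could in principle scrub to the same placeholder, so a real implementation would need a collision check.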

But I’m hoping that there would be a much simpler way to solve this problem in ruby.

Thank you in advance for any ideas or suggestions.

UPDATE: Here is more detailed info …

I wrote the following minimal program called rtest.rb in order to test this:

#!/usr/bin/env run-ruby                                                                                                                               
# -*- ruby -*-                                                                                                                                        

require 'optparse'

parser = OptionParser.new {
}
parser.parse!

Process.exit(0)

I ran it as follows, with the only files present in the current directory being rtest.rb itself, and another file having this name: Äfoo …

export LC_TYPE='en_us.UTF-8'
export LC_COLLATE='en_us.UTF-8'
./rtest.rb *

It generated the following error and stack trace …

Traceback (most recent call last):
    7: from /home/hippo/bin/rtest.rb:8:in `<main>'
    6: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1691:in `parse!'
    5: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1666:in `permute!'
    4: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1569:in `order!'
    3: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `parse_in_order'
    2: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `catch'
    1: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `block in parse_in_order'
/opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `===': invalid byte sequence in UTF-8 (ArgumentError)

Here is the pertinent section of the file /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb (see line 1579):

 1572   def parse_in_order(argv = default_argv, setter = nil, &nonopt)  # :nodoc:                                                                     
 1573     opt, arg, val, rest = nil
 1574     nonopt ||= proc {|a| throw :terminate, a}
 1575     argv.unshift(arg) if arg = catch(:terminate) {
 1576       while arg = argv.shift
 1577         case arg
 1578           # long option                                                                                                                           
 1579           when /\A--([^=]*)(?:=(.*))?/m
 1580             opt, rest = $1, $2

In other words, the regex match on the argument is failing due to this encoding issue.
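The failure can be reproduced without optparse at all. The following is a minimal sketch, assuming the argument arrives as the single byte 0xC4 followed by "foo" and is tagged as UTF-8 (which is what a UTF-8 locale would produce); the variable names are illustrative only.

```ruby
# The file name as the shell passes it: a lone 0xC4 byte followed by
# "foo", tagged as UTF-8 even though the bytes are not valid UTF-8.
arg = "\xC4foo".b.force_encoding(Encoding::UTF_8)

valid = arg.valid_encoding?

error_message = nil
begin
  # Same pattern as the `when` clause at line 1579 of optparse.rb.
  arg =~ /\A--([^=]*)(?:=(.*))?/m
rescue ArgumentError => e
  error_message = e.message
end

puts "valid_encoding? #{valid}"
puts "regex match raised: #{error_message.inspect}"
```

So the exception comes from Ruby's regex engine refusing a string whose bytes do not match its tagged encoding, not from anything specific to OptionParser.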

When I have time (not right away, unfortunately), I’ll put some code into that module to do encoding of the arg variable, to see if this might fix the problem.

FURTHER UPDATE: I am running under Ubuntu 20.04, and the version of ruby that’s offered is 2.7.0. I also managed to get 2.7.1 running on my ancient Debian 8 box. This error occurs in both environments. I would have to install a newer version of ruby or compile it from source before I could try version 2.7.7 or version 3.x.

YET ANOTHER UPDATE: I had some unexpected spare time, and so I built ruby-3.3.0 from source and re-ran the test. I got the same error!

% /opt/local/rubies/ruby-3.3.0/bin/ruby ./rtest.rb *
/opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `===': invalid byte sequence in UTF-8 (ArgumentError)
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `block in parse_in_order'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `catch'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `parse_in_order'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1630:in `order!'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1739:in `permute!'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1764:in `parse!'
    from ./rtest.rb:8:in `<main>'

However, I now think the error occurs because the filename is encoded in an unusual manner. If I do echo * in that directory, I see this, which is what I expect:

% echo *
Äfoo rtest.rb

However, if I do /bin/ls in the same directory, I see this:

% /bin/ls *
''$'\304''foo'   rtest.rb

And even the OS can’t recognize the file with the name specified as follows …

% /bin/cat 'Äfoo'
/bin/cat: Äfoo: No such file or directory

But if I use the longer, encoded file name, the OS has no trouble accessing the file …

% /bin/cat ''$'\304''foo'
File contents
File contents

The ls command seems to know how to encode the Äfoo filename into ''$'\304''foo', but ruby doesn’t seem to know how to do this.
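The octal value 304 in the `ls` output denotes the single byte 0xC4, which is how ISO-8859-1 encodes Ä; UTF-8 uses the two bytes 0xC3 0x84 for the same character. That would explain why typing Äfoo in a UTF-8 terminal does not match the file. A small sketch comparing the two byte sequences (it assumes, without confirmation from the post, that the file was created under a Latin-1 locale):

```ruby
name_utf8   = "\u00C4foo"                            # "Äfoo" as typed in a UTF-8 terminal
name_latin1 = name_utf8.encode(Encoding::ISO_8859_1) # same text, Latin-1 bytes

p name_utf8.bytes     # => [195, 132, 102, 111, 111]
p name_latin1.bytes   # => [196, 102, 111, 111]
p 0o304               # => 196, the byte that ls prints as octal 304
```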

Answers

Chosen as BEST ANSWER

NOTE: I prefer my other Answer. However, I'm leaving this Answer in place also, in case anyone is still interested.

Per the discussion below my original question, and especially per the comments there by @Schwern, it seems like this error is due to an un-parseable and un-encodable set of bytes in the file name that I have been having problems with. Therefore, it's likely to be impossible in ruby to deal properly with files named as such.

And to be clear, this problem occurs for any string which contains such un-encodable bytes, not just file names.

Therefore, I am simply checking for such un-parseable strings on the command line, and I'm exiting the ruby script with an error if I encounter any.

The following is my improved test program which shows how I am now handling this case:

#!/usr/bin/env ruby                                                                                                               
# -*- ruby -*-                                                                                                                                        

require 'optparse'

badargs = []
ARGV.each {
  |arg|
  begin
    # The following test is being used because this                                                                                                   
    # error shows up in a regex match of the argument                                                                                               
    # when done within the OptionParser code. There are
    # no doubt other and possibly better ways to trigger
    # this error, but this is good enough for me for the
    # time being, especially because this is just an
    # illustrative test program.                                                                                                      
    arg =~ /./m
  rescue
    badargs << arg
  end
}

nbad = badargs.length
if nbad > 0 then
  if nbad == 1 then
    object = 'this command-line argument'
  else
    object = 'these command-line arguments'
  end
  puts "Unable to parse #{object}: #{badargs}"
  Process.exit(1)
end

parser = OptionParser.new {
}
parser.parse!

Process.exit(0)
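The same pre-check can also be written without provoking a regex exception, by asking each string directly whether its bytes are valid via `String#valid_encoding?`. A minimal sketch, where `partition_args` and the sample arguments are hypothetical and only for illustration:

```ruby
# Split arguments into those whose bytes are valid in their tagged
# encoding and those that are not.
def partition_args(argv)
  argv.partition(&:valid_encoding?)
end

sample = ['--verbose', "\xC4foo".b.force_encoding(Encoding::UTF_8)]
good, bad = partition_args(sample)

unless bad.empty?
  # String#inspect escapes the offending bytes instead of raising.
  warn "Unable to parse: #{bad.map(&:inspect).join(', ')}"
end
```

In a real script, `ARGV` would be passed in place of `sample`, and the script would exit with a non-zero status when `bad` is not empty.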

~~I am provisionally treating this as my "Answer", unless something better comes along.~~
A better Answer has now indeed come along. See the other Answer of mine here.

(Edit)

I came up with this slightly hacky workaround that seems to me to be a better Answer than the other one that I posted here.

This is therefore my preferred Answer.

I wrote a preprocessor called ruby-preproc which tries to determine, based upon the program’s command-line arguments, which encoding might work for a given ruby program (see below to examine the code for ruby-preproc). Then, all conforming ruby programs would simply need to be written as follows …

#!/usr/bin/env ruby-preproc
# -*- ruby -*-

[ ... normal ruby code goes here ... ]

If the ruby program which uses this convention is called the-script.rb, then it would simply be invoked as normal:

./the-script.rb args ...

But also, this preprocessor enables the use of a special, optional, initial argument, -E<encoding>. In this case, the specified encoding will be forced instead of the argument list being examined. For example, for any ruby program which is set up to use this preprocessor, the following can be done …

./the-script.rb -EISO-8859-1 args

And if the initial -E<encoding> argument is not given, then the ruby-preproc processor examines all of the command-line arguments that have been specified, and it looks for an encoding that works for every one of them. If such an encoding is found, then the script is run with that encoding being specified.

Here is the code for ruby-preproc (this is an improved version of the original that I posted in this "Answer" yesterday) …

#!/opt/local/rubies/ruby-3.3.0/bin/ruby  
# -*- ruby -*-

# Note that any recent standard ruby executable can be
# used in the initial shebang line.
#
# Also note that the following construct is a way to
# obtain the value of whatever appears in the shebang
# line, so that this file name doesn't need to be
# entered twice in this program:

require 'rbconfig'
ruby_executable = File.join(RbConfig::CONFIG["bindir"],
                            RbConfig::CONFIG["RUBY_INSTALL_NAME"] +
                            RbConfig::CONFIG["EXEEXT"])

$default_enc = 'default'
$tried       = []
$success     = true
$prog        = ARGV[0]

nargs = ARGV.length
if nargs > 1 && ARGV[1] =~ /^-E(.+)$/ then
  # If we're here, then the -E<encoding> parameter
  # appears on the command line. Just use that
  # specified encoding.

  $curr_enc = $1
  $success  = true
  $args     = [ $prog ] + ARGV[2..-1]
else
  # If we're here, no -E<encoding> was specified,
  # so examine the command-line arguments in order
  # to find out whether any of the following listed
  # encodings might work properly for all of these
  # arguments.
  #
  # Put as many encodings into this list as is desired.
  # And I believe it's best to also include $default_enc.

  encodings_to_try = [
    $default_enc,
    'UTF-8',
    'ISO-8859-1',
  ]

  $curr_enc = $default_enc
  $args     = ARGV

  encodings_to_try.each {
    |enc|
    $success  = true
    $curr_enc = enc
    $args.each {
      |arg|
      newarg = arg.dup
      begin
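        if enc != $default_enc then
          # Re-tag the bytes instead of transcoding them:
          # encode! raises on the very bytes being tested.
          newarg.force_encoding(enc)
        end
        newarg =~ /.?/m
      rescue ArgumentError => e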
        if e.to_s.include?('invalid byte sequence')
          $success = false
          break
        end
      end
    }
    if $success then
      break
    else
      $tried << $curr_enc
    end
  }
end

if $success then
  if $curr_enc == $default_enc then
    Process.exec(ruby_executable, *$args)
  else
    Process.exec(ruby_executable, "-E#{$curr_enc}", *$args)
  end
else
  tlen = $tried.length
  if tlen < 1 then
    via = ''
  elsif tlen == 1 then
    via = " via this encoding: #{$tried[0]}"
  else
    via = " via any of these encodings: #{$tried}"
  end
  puts("Unable to run `#{$prog}` because one or more arguments cannot be parsed#{via}")
  Process.exit(1)
end
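The encoding search at the heart of the script above can be sketched more compactly with `String#force_encoding` and `String#valid_encoding?`, which re-tag the bytes and test them without transcoding anything. This is an illustrative variant, not the code the preprocessor actually runs; `first_working_encoding` and `CANDIDATES` are hypothetical names.

```ruby
# Candidate encodings, tried in order (an assumed list for illustration).
CANDIDATES = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

# Return the first encoding under which every argument's bytes are
# valid, or nil if none qualifies. The bytes themselves are never changed.
def first_working_encoding(args, candidates = CANDIDATES)
  candidates.find do |enc|
    args.all? { |arg| arg.dup.force_encoding(enc).valid_encoding? }
  end
end

args = ['rtest.rb', "\xC4foo".b]
enc  = first_working_encoding(args)
puts "ruby -E#{enc.name} ..." if enc
```

Note that ISO-8859-1 assigns a character to every byte value, so once it is in the candidate list the search always finds a match. That makes it a convenient fallback, but it says nothing about whether the name was really written in Latin-1.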
