
I’m using optparse in a ruby program (ruby 2.7.1p83) under Linux. If any of the command-line arguments are filenames with "special" characters in them, the parse! method fails with this error:

invalid byte sequence in UTF-8

This is the code which fails …

parser = OptionParser.new {
  |opts|
  ... etc. ...
}
parser.parse! # error occurs here

I know about the scrub method and other ways to do encoding in ruby. However, the place where the error occurs is in a library routine (OptionParser#parse!), and I have no control over how this library routine deals with strings.

I could pre-process the command-line arguments and replace the special characters in these arguments with an acceptable encoding, but then, in the case where the argument is a file name, I will be unable to open that file later in the program, because the filename I have accepted into the program will have been altered from the file’s original name.

I could do something complicated: pre-traverse the arguments and build a hashmap whose keys are the encoded arguments and whose values are the original arguments, replace the ARGV values with the encoded values, parse the encoded arguments using OptionParser, and then, once OptionParser completes, use the hashmap to replace the encoded arguments in the result with their original values before continuing with the program.

But I’m hoping that there would be a much simpler way to solve this problem in ruby.

Thank you in advance for any ideas or suggestions.

UPDATE: Here is more detailed info …

I wrote the following minimal program called rtest.rb in order to test this:

#!/usr/bin/env run-ruby                                                                                                                               
# -*- ruby -*-                                                                                                                                        

require 'optparse'

parser = OptionParser.new {
}
parser.parse!

Process.exit(0)

I ran it as follows, with the only files present in the current directory being rtest.rb itself, and another file having this name: Äfoo

export LC_TYPE='en_us.UTF-8'
export LC_COLLATE='en_us.UTF-8'
./rtest.rb *

It generated the following error and stack trace …

Traceback (most recent call last):
    7: from /home/hippo/bin/rtest.rb:8:in `<main>'
    6: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1691:in `parse!'
    5: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1666:in `permute!'
    4: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1569:in `order!'
    3: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `parse_in_order'
    2: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `catch'
    1: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `block in parse_in_order'
/opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `===': invalid byte sequence in UTF-8 (ArgumentError)

Here is what appears in the pertinent section of the file /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb . See line 1579

 1572   def parse_in_order(argv = default_argv, setter = nil, &nonopt)  # :nodoc:                                                                     
 1573     opt, arg, val, rest = nil
 1574     nonopt ||= proc {|a| throw :terminate, a}
 1575     argv.unshift(arg) if arg = catch(:terminate) {
 1576       while arg = argv.shift
 1577         case arg
 1578           # long option                                                                                                                           
 1579           when /\A--([^=]*)(?:=(.*))?/m
 1580             opt, rest = $1, $2

In other words, the regex match on the argument is failing due to this encoding issue.
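
The failure can be reproduced without OptionParser at all. The following is only a minimal sketch: it assumes the argument arrives as the single byte 0xC4 followed by foo and is tagged as UTF-8, and the helper name match_error is made up for illustration. The pattern is the long-option regex from optparse.rb.

```ruby
# The byte 0xC4 starts a two-byte UTF-8 sequence, so it cannot be followed
# by "f". The string below is therefore not valid UTF-8.
arg = "\xC4foo"

# Hypothetical helper: returns the error message if the match raises,
# or nil if the match runs normally.
def match_error(str)
  str =~ /\A--([^=]*)(?:=(.*))?/m
  nil
rescue ArgumentError => e
  e.message
end

puts arg.valid_encoding?        # false
puts match_error(arg).inspect   # the ArgumentError message
puts match_error(arg.b).inspect # nil: a binary-tagged copy matches without error
```

The last line suggests that the bytes themselves are not the problem; the match only raises when the string claims to be UTF-8.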

When I have time (not right away, unfortunately), I’ll put some code into that module to do encoding of the arg variable, to see if this might fix the problem.

FURTHER UPDATE: I am running under Ubuntu 20.04, and the version of ruby that’s offered is 2.7.0. I also managed to get 2.7.1 running on my ancient debian 8 box. This error occurs in both environments. I would have to install a newer version of ruby or compile it from source before I could try version 2.7.7 or version 3.x.

YET ANOTHER UPDATE: I had some unexpected spare time, and so I built ruby-3.3.0 from source and re-ran the test. I got the same error!

% /opt/local/rubies/ruby-3.3.0/bin/ruby ./rtest.rb *
/opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `===': invalid byte sequence in UTF-8 (ArgumentError)
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `block in parse_in_order'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `catch'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `parse_in_order'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1630:in `order!'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1739:in `permute!'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1764:in `parse!'
    from ./rtest.rb:8:in `<main>'

However, I now think the error occurs because the filename is encoded in an unusual manner. If I do echo * in that directory, I see this, which is what I expect:

% echo *
Äfoo rtest.rb

However, if I do /bin/ls in the same directory, I see this:

% /bin/ls *
''$'\304''foo'   rtest.rb

And even the OS can’t recognize the file with the name specified as follows …

% /bin/cat 'Äfoo'
/bin/cat: Äfoo: No such file or directory

But if I use the longer, encoded file name, the OS has no trouble accessing the file …

% /bin/cat ''$'\304''foo'
File contents
File contents

The ls command seems to know how to encode the Äfoo filename into ''$'\304''foo', but ruby doesn’t seem to know how to do this.
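
One way to see what is going on is to look at the raw bytes. This is a sketch under the assumption that the name on disk is the byte 0xC4 (octal 304, as in the ls output) followed by foo; the variable names are made up for illustration.

```ruby
# The name as it might arrive from the command line: raw bytes tagged UTF-8.
name = "\xC4foo"

p name.bytes            # [196, 102, 111, 111]
p name.valid_encoding?  # false: 0xC4 alone is not a complete UTF-8 character

# Reinterpret the same bytes as ISO-8859-1, where 0xC4 is the letter Ä.
latin1 = name.dup.force_encoding(Encoding::ISO_8859_1)
p latin1.valid_encoding?  # true

# Transcode to UTF-8 for display; this produces a different byte sequence.
display = latin1.encode(Encoding::UTF_8)
p display        # "Äfoo"
p display.bytes  # [195, 132, 102, 111, 111]
```

Under that assumption, a UTF-8 Äfoo (bytes 195, 132, ...) and the name on disk (byte 196, ...) are different byte strings, which may be why the file cannot be found when its name is given as UTF-8 text.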

2 Answers


  1. Chosen as BEST ANSWER

    NOTE: I prefer my other Answer. However, I'm leaving this Answer in place also, in case anyone is still interested.

    Per the discussion below my original question, and especially per the comments there by @Schwern, it seems that this error is due to an un-parseable and un-encodable set of bytes in the file name that I have been having problems with. Therefore, it's likely to be impossible in ruby to deal properly with files named this way.

    And to be clear, this problem occurs for any string which contains such un-encodable bytes, not just file names.

    Therefore, I am simply checking for such un-parseable strings on the command line, and I'm exiting the ruby script with an error if I encounter any.

    The following is my improved test program which shows how I am now handling this case:

    #!/usr/bin/env ruby                                                                                                               
    # -*- ruby -*-                                                                                                                                        
    
    require 'optparse'
    
    badargs = []
    ARGV.each {
      |arg|
      begin
        # The following test is being used because this                                                                                                   
        # error shows up in a regex match of the argument                                                                                               
        # when done within the OptionParser code. There are
        # no doubt other and possibly better ways to trigger
        # this error, but this is good enough for me for the
        # time being, especially because this is just an
        # illustrative test program.                                                                                                      
        arg =~ /./m
      rescue
        badargs << arg
      end
    }
    
    nbad = badargs.length
    if nbad > 0 then
      if nbad == 1 then
        object = 'this command-line argument'
      else
        object = 'these command-line arguments'
      end
      puts "Unable to parse #{object}: #{badargs}"
      Process.exit(1)
    end
    
    parser = OptionParser.new {
    }
    parser.parse!
    
    Process.exit(0)
    

    I am provisionally treating this as my "Answer", unless something better comes along.
    A better Answer has now indeed come along. See the other Answer of mine here.
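
    The regex-and-rescue test in the program above could probably also be written with String#valid_encoding?, which reports the same condition without raising. This is only a sketch; the method name invalid_args is made up for illustration.

    ```ruby
    # Hypothetical helper: returns the arguments whose bytes are not valid
    # in the encoding they are tagged with.
    def invalid_args(argv)
      argv.reject(&:valid_encoding?)
    end

    bad = invalid_args(ARGV)
    unless bad.empty?
      object = bad.length == 1 ? 'this command-line argument' : 'these command-line arguments'
      # inspect shows invalid bytes as \x escapes instead of raw bytes.
      warn "Unable to parse #{object}: #{bad.inspect}"
      exit 1
    end
    ```

    The check would go before the call to parser.parse!, in the same place as the loop over ARGV in the program above.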


  2. I came up with this slightly hacky workaround that seems to me to be a better Answer than the other one that I posted here.

    This is therefore my preferred Answer.

    I wrote a preprocessor called ruby-preproc which tries to determine, based upon the program’s command-line arguments, which encoding might work for a given ruby program (see below to examine the code for ruby-preproc). Then, all conforming ruby programs would simply need to be written as follows …

    #!/usr/bin/env ruby-preproc
    # -*- ruby -*-
    
    [ ... normal ruby code goes here ... ]
    

    If the ruby program which uses this convention is called the-script.rb, then it would simply be invoked as normal:

    ./the-script.rb args ...
    

    But also, this preprocessor enables the use of a special, optional, initial argument, -E<encoding>. In this case, the specified encoding will be forced instead of the argument list being examined. For example, for any ruby program which is set up to use this preprocessor, the following can be done …

    ./the-script.rb -EISO-8859-1 args
    

    And if the initial -E<encoding> argument is not given, then the ruby-preproc processor examines all of the command-line arguments that have been specified, and it looks for an encoding that works for every one of them. If such an encoding is found, then the script is run with that encoding being specified.
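
    The search for an encoding that works for every argument can be sketched on its own as follows. Note that this is not the code from ruby-preproc: it swaps in force_encoding plus valid_encoding? in place of encode! and a rescued regex match, and the constant CANDIDATE_ENCODINGS and the method first_working_encoding are hypothetical names.

    ```ruby
    # Hypothetical candidate list, tried in order.
    CANDIDATE_ENCODINGS = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

    # Returns the first encoding in which every argument's bytes are valid,
    # or nil if no candidate fits all of them.
    def first_working_encoding(argv, candidates = CANDIDATE_ENCODINGS)
      candidates.find do |enc|
        # dup first: force_encoding modifies its receiver, and the
        # caller's strings should be left untouched.
        argv.all? { |arg| arg.dup.force_encoding(enc).valid_encoding? }
      end
    end

    p first_working_encoding(["\xC4foo", "rtest.rb"])
    p first_working_encoding(["plain", "args"])
    ```

    Every byte value is valid in ISO-8859-1, so with this list it acts as a catch-all and the search never comes back empty. That may or may not be what is wanted.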

    Here is the code for ruby-preproc (this is an improved version of the original that I posted in this "Answer" yesterday) …

    #!/opt/local/rubies/ruby-3.3.0/bin/ruby  
    # -*- ruby -*-
    
    # Note that any recent standard ruby executable can be
    # used in the initial shebang line.
    #
    # Also note that the following construct is a way to
    # obtain the value of whatever appears in the shebang
    # line, so that this file name doesn't need to be
    # entered twice in this program:
    
    require 'rbconfig'
    ruby_executable = File.join(RbConfig::CONFIG["bindir"],
                                RbConfig::CONFIG["RUBY_INSTALL_NAME"] +
                                RbConfig::CONFIG["EXEEXT"])
    
    $default_enc = 'default'
    $tried       = []
    $success     = true
    $prog        = ARGV[0]
    
    nargs = ARGV.length
    if nargs > 1 && ARGV[1] =~ /^-E(.+)$/ then
      # If we're here, then the -E<encoding> parameter
      # appears on the command line. Just use that
      # specified encoding.
    
      $curr_enc = $1
      $success  = true
      $args     = [ $prog ] + ARGV[2..-1]
    else
      # If we're here, no -E<encoding> was specified,
      # so examine the command-line arguments in order
      # to find out whether any of the following listed
      # encodings might work properly for all of these
      # arguments.
      #
      # Put as many encodings into this list as is desired.
      # And I believe it's best to also include $default_enc.
    
      encodings_to_try = [
        $default_enc,
        'UTF-8',
        'ISO-8859-1',
      ]
    
      $curr_enc = $default_enc
      $args     = ARGV
    
      encodings_to_try.each {
        |enc|
        $success  = true
        $curr_enc = enc
        $args.each {
          |arg|
          newarg = arg.dup
          begin
            if enc != $default_enc then
              newarg.encode!(enc)
            end
            newarg =~ /.?/m
          rescue Exception => e
            if e.to_s.include?('invalid byte sequence')
              $success = false
              break
            end
          end
        }
        if $success then
          break
        else
          $tried << $curr_enc
        end
      }
    end
    
    if $success then
      if $curr_enc == $default_enc then
        Process.exec(ruby_executable, *$args)
      else
        Process.exec(ruby_executable, "-E#{$curr_enc}", *$args)
      end
    else
      tlen = $tried.length
      if tlen < 1 then
        via = ''
      elsif tlen == 1 then
        via = " via this encoding: #{$tried[0]}"
      else
        via = " via any of these encodings: #{$tried}"
      end
      puts("Unable to run `#{$prog}` because one or more arguments cannot be parsed#{via}")
      Process.exit(1)
    end
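
    A possibly lighter-weight alternative to a wrapper script, offered only as a sketch: hand OptionParser binary-tagged copies of the arguments, so that its regex matches do not validate them as UTF-8. The method name parse_args_as_bytes and the -v/--verbose option are made up for illustration, and this has not been tried against every OptionParser feature.

    ```ruby
    require 'optparse'

    # Hypothetical helper: parses binary-tagged (ASCII-8BIT) copies of the
    # arguments and returns the options plus the remaining arguments.
    def parse_args_as_bytes(argv)
      options = { verbose: false }
      parser = OptionParser.new do |opts|
        # Example option only; not part of the original program.
        opts.on('-v', '--verbose') { options[:verbose] = true }
      end
      # String#b returns a copy tagged as binary; the bytes are unchanged.
      remaining = parser.parse(argv.map(&:b))
      [options, remaining]
    end

    # Literal arguments stand in for ARGV here.
    options, files = parse_args_as_bytes(['-v', "\xC4foo", 'rtest.rb'])
    p options
    p files.map(&:bytes)
    ```

    In a real script the call would be parse_args_as_bytes(ARGV). Because the bytes are not altered, the returned strings should still name the original files when passed to File.open on Linux.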
    