In a web scanner application, I need to parse a script’s output to extract some information, but I don’t get the same output in the Linux shell as in Java. Let me describe it (this example uses whatweb on one of the websites I need to scan at work, but I have the same problem whenever a script produces colored output in the shell):

Here is what I get in the Linux shell (with some colors): [200] Apache[2.2.9], Cookies[ca67a6ac78ebedd257fb0b4d64ce9388,jfcookie,jfcookie%5Blang%5D,lang], Country[EUROPEAN UNION][EU], HTTPServer[Fedora Linux][Apache/2.2.9 (Fedora)], IP[], Joomla[1.5], Meta-Author[Administrator], MetaGenerator[Joomla! 1.5 - Open Source Content Management], PHP[5.2.6,], Plesk[Lin], Script[text/javascript], Title[Accueil  ], X-Powered-By[PHP/5.2.6, PleskLin]

And here is what I get from Java:

[1m[34m[0m [200] [1m[37mApache[0m[[1m[32m2.2.9[0m], [1m[37mCookies[0m[[1m[33mca67a6ac78ebedd257fb0b4d64ce9388,jfcookie,jfcookie%5Blang%5D,lang[0m], [1m[37mCountry[0m[[1m[33mEUROPEAN UNION[0m][[1m[35mEU[0m], [1m[37mHTTPServer[0m[[1m[31mFedora Linux[0m][[1m[36mApache/2.2.9 (Fedora)[0m], [1m[37mIP[0m[[1m[33m185.13.64.116[0m], [1m[37mJoomla[0m[[1m[32m1.5[0m], [1m[37mMeta-Author[0m[[1m[33mAdministrator[0m], [1m[37mMetaGenerator[0m[[1m[33mJoomla! 1.5 - Open Source Content Management[0m], [1m[37mPHP[0m[[1m[32m5.2.6,[0m], [1m[37mPlesk[0m[[1m[33mLin[0m], [1m[37mScript[0m[[1m[33mtext/javascript[0m], [1m[37mTitle[0m[[32mAccueil [0m], [1m[37mX-Powered-By[0m[[1m[33mPHP/5.2.6, PleskLin[0m]

My guess is that the colors in the Linux shell are generated by those unknown characters, but they are a real pain to parse in Java.

I get this output by running the script in a new thread and doing raw_data += data; (where raw_data is a String) whenever I read a new line of output, and finally sending raw_data to my parser.

How can I avoid getting those annoying characters, so that I get friendlier output like what I see in the Linux shell?



  1. You can use a regex here:

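    String raw_data = ...;
    // ESC (0x1B) starts every color code; backslashes are doubled for the Java string literal
    String cleaned_raw_data = raw_data.replaceAll("\u001B\\[\\d+m", "");

    This removes every sequence that starts with the ESC character (\u001B, invisible in your output) followed by [, ends with m, and has one or more digits (\d+) in between.

    Note that [ is preceded by a \ because [ has a special meaning in regular expressions (it’s a meta-character), and that each \ of the regex is written as \\ inside a Java string literal.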



  2. In your Java code, where you are executing the shell script, you can add an extra sed filter to filter out the shell-control characters.

    # filter out shell control characters
    ./my_script | sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g"

    You can additionally use tr -dc '[:print:]\n' to remove any remaining non-printable characters while keeping the line breaks, like this:

    # filter out shell control characters
    ./my_script |
     sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" |
     tr -dc '[:print:]\n'

    You could even put this in a wrapper script around the original script and call the wrapper instead. That lets you do any other pre-processing before the output reaches the Java program, keeps the Java code free of unrelated clean-up, and lets you focus on the core logic of the application.

    If you can’t add a wrapper script for any reason and would like to apply the filter from Java, note that Java doesn’t support pipes in the command directly. You have to pass the whole pipeline as an argument to bash, like this:

    String[] cmd = {
        "/bin/bash",
        "-c",
        "./my_script | sed -r 's/\\x1B\\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g'"
    };
    Process p = Runtime.getRuntime().exec(cmd);

    Don’t forget to escape every backslash (write \ as \\) when you use the regex in a Java string literal.

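Separately from the sed filter, the regex approach can also be kept entirely in Java. The following is only a minimal sketch, not taken from either answer: it assumes the color codes are standard ANSI CSI sequences (an ESC character, 0x1B, then [, then optional digits and semicolons, then a final letter such as m or K). The class name AnsiStripper and the method strip are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class AnsiStripper {
    // Assumed shape of a color code: ESC, '[', optional digits/semicolons, one final letter.
    private static final Pattern ANSI_CSI =
            Pattern.compile("\u001B\\[[0-9;]*[A-Za-z]");

    /** Returns the text with every matching escape sequence removed. */
    public static String strip(String text) {
        return ANSI_CSI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // A shortened piece of colored output, with the ESC characters written explicitly.
        String colored =
                "\u001B[1m\u001B[37mApache\u001B[0m[\u001B[1m\u001B[32m2.2.9\u001B[0m]";
        System.out.println(strip(colored)); // prints: Apache[2.2.9]
    }
}
```

Calling AnsiStripper.strip(data) on each line before appending it to raw_data would keep the parser input free of color codes without starting an extra sed process. Because the pattern requires the ESC character, literal brackets such as [200] or [2.2.9] in the real output are left untouched.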

