
I am working on a project that requires me to extract the names and positions of the Python packages that were installed with the pip install command.

A webpage contains a code element with multiline text made up of bash commands. I want to write JavaScript code that parses this text and finds the packages and their positions in the text.

For example, if the text is:

$ pip install numpy
pip install --global-option build_ext -t ../ pandas>=1.0.0,<2
sudo apt update
pip uninstall numpy
pip install "requests==12.2.2"

I want to get something like this:

[
    {
        "name": "numpy",
        "position": 14
    },
    {
        "name": "pandas",
        "position": 65
    },
    {
        "name": "requests",
        "position": 131
    }
]

How can I do this in JavaScript?

2

Answers


  1. Chosen as BEST ANSWER

    Here is one possible solution, which tries to use loops instead of regex:

    The idea is to find the lines that contain pip install, since those are the only lines we are interested in. Then we break the command into words and loop over them until we reach the packages part of the command.

    First, we will define the regex for a package. Remember that a package can be something like pip install 'stevedore>=1.3.0,<1.4.0' "MySQL_python==1.2.2":

    const packageArea = /(?<=\s|^)["']?(?<package_part>(?<package_name>\w[\w.-]*)([=<>~!]=?[\w.,<>]+)?)["']?(?=\s|$)/;
    

    Note the named groups: package_part identifies the whole "package with version" string, while package_name extracts just the package name.
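
    As a quick check of the pattern, the sketch below runs it against a few sample words. The regex is written with its backslash escapes (\s and \w) spelled out; the sample words and the describePackage helper are assumptions made for illustration and are not part of the original answer.

```javascript
// Package pattern from the answer, with the \s and \w escapes written out.
const packageArea = /(?<=\s|^)["']?(?<package_part>(?<package_name>\w[\w.-]*)([=<>~!]=?[\w.,<>]+)?)["']?(?=\s|$)/;

// Hypothetical helper: returns the two named groups, or null when the word is not a package.
const describePackage = (word) => {
  const match = word.match(packageArea);
  if (!match) return null;
  return { name: match.groups.package_name, part: match.groups.package_part };
};

console.log(describePackage('numpy'));
console.log(describePackage(`'stevedore>=1.3.0,<1.4.0'`));
console.log(describePackage('"MySQL_python==1.2.2"'));
console.log(describePackage('../')); // a path, not a package name, so null
```

    The surrounding quotes are matched outside package_part, so the reported string never includes them.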


    About the arguments

    We have two types of CLI arguments: options, which take a value, and flags, which do not.

    The problem with options is that the next word is not a package name but the option's value.

    So first I listed all the options of the pip install command that take a value:

    const pipOptionsWithArg = [
      '-c',
      '--constraint',
      '-e',
      '--editable',
      '-t',
      '--target',
      '--platform',
      '--python-version',
      '--implementation',
      '--abi',
      '--root',
      '--prefix',
      '-b',
      '--build',
      '--src',
      '--upgrade-strategy',
      '--install-option',
      '--global-option',
      '--no-binary',
      '--only-binary',
      '--progress-bar',
      '-i',
      '--index-url',
      '--extra-index-url',
      '-f',
      '--find-links',
      '--log',
      '--proxy',
      '--retries',
      '--timeout',
      '--exists-action',
      '--trusted-host',
      '--cert',
      '--client-cert',
      '--cache-dir',
    ];
    

    Then I wrote a function, used later by the parser, that decides what to do when we see an argument:

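    const handleArgument = (argument, restCommandWords) => {
      // Characters consumed by this argument, +1 for the space removed by split
      let index = argument.length + 1;

      // A requirements file: nothing else on this line is treated as a package
      if (argument === '-r' || argument === '--requirement') {
        while (restCommandWords.length > 0) {
          index += restCommandWords.shift().length + 1;
        }
        return index;
      }

      // A flag, or the --option=value form: any value is already part of this word
      if (argument.includes('=') || !pipOptionsWithArg.includes(argument)) {
        return index;
      }

      // The --option value form: the next word is the value, so consume it too
      const value = restCommandWords.shift();
      if (value !== undefined) index += value.length + 1; // the option may be the last word
      return index;
    };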
    

    This function receives the identified argument and the rest of the command, split into words.

    (Here you start to see the "index counter". Since we also need the position of each match, we have to keep track of the current position in the original text.)

    In the last lines of the function, you can see that it handles both --option=something and --option something.
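
    To make the returned counts concrete, here is a reduced sketch of the same idea. The three-entry option list and the sample words are assumptions for illustration, and the -r / --requirement branch is left out for brevity.

```javascript
// Shortened, illustrative list; the full list is given in the answer.
const pipOptionsWithArg = ['-t', '--target', '--global-option'];

// Returns how many characters the argument (and its value, if any) occupies,
// counting one separating space after each consumed word.
const handleArgument = (argument, restCommandWords) => {
  let consumed = argument.length + 1;
  if (argument.includes('=') || !pipOptionsWithArg.includes(argument)) {
    return consumed; // a flag, or --option=value in a single word
  }
  const value = restCommandWords.shift(); // --option value: skip the value too
  if (value !== undefined) consumed += value.length + 1;
  return consumed;
};

const rest = ['../', 'pandas'];
console.log(handleArgument('-t', rest)); // 7: "-t " plus "../ "
console.log(rest); // only 'pandas' is left for the parser
console.log(handleArgument('--upgrade', ['numpy'])); // 10: the flag takes no value
console.log(handleArgument('--target=../', ['numpy'])); // 13: the value is inside the word
```

    Because the function removes the option's value from the word list, the caller never sees it and cannot mistake it for a package.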


    The Parser

    The main parser splits the original text into lines, and then each line into words.

    Every step has to update the global index so that we keep track of our position in the text. This index also lets us search from the current position, so that we do not land on the wrong substring, by calling indexOf(str, counterIndex):

    export const parseCommand = (multilineCommand) => {
      const packages = [];
      let counterIndex = 0;
    
      const lines = multilineCommand.split('\n');
      while (lines.length > 0) {
        const line = lines.shift();
    
        const pipInstallMatch = line.match(/pip +install/);
        if (!pipInstallMatch) {
          counterIndex += line.length + 1; // +1 for the newline
          continue;
        }
    
        const pipInstallLength = pipInstallMatch.index + pipInstallMatch[0].length;
        const argsAndPackagesWords = line.slice(pipInstallLength).split(' ');
        counterIndex += pipInstallLength;
    
        while (argsAndPackagesWords.length > 0) {
          const word = argsAndPackagesWords.shift();
    
          if (!word) {
            counterIndex++;
            continue;
          }
    
          if (word.startsWith('-')) {
            counterIndex += handleArgument(word, argsAndPackagesWords);
            continue;
          }
    
          const packageMatch = word.match(packageArea);
          if (!packageMatch) {
            counterIndex += word.length + 1;
            continue;
          }
    
          const startIndex = multilineCommand.indexOf(packageMatch.groups.package_part, counterIndex);
          packages.push({
            type: 'pypi',
            name: packageMatch.groups.package_name,
            version: undefined,
            startIndex,
            endIndex: startIndex + packageMatch.groups.package_part.length,
          });
    
          counterIndex += word.length + 1;
        }
      }
    
      return packages;
    };
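
    Two details of the parser may be easier to see in isolation: why indexOf is given a start position, and how the result relates to the shape asked for in the question. The following is a minimal sketch. The results array is a hand-written sample in the parser's output format, and the toNamePosition helper is hypothetical; neither comes from the original answer.

```javascript
// The example text from the question.
const text = [
  '$ pip install numpy',
  'pip install --global-option build_ext -t ../ pandas>=1.0.0,<2',
  'sudo apt update',
  'pip uninstall numpy',
  'pip install "requests==12.2.2"',
].join('\n');

// Searching from the current position avoids landing on an earlier occurrence.
console.log(text.indexOf('numpy')); // 14, the first occurrence
console.log(text.indexOf('numpy', 20)); // 112, the one on the "pip uninstall" line

// Hand-written sample results in the parser's output format (an assumption).
const results = [
  { type: 'pypi', name: 'numpy', version: undefined, startIndex: 14, endIndex: 19 },
  { type: 'pypi', name: 'pandas', version: undefined, startIndex: 65, endIndex: 81 },
  { type: 'pypi', name: 'requests', version: undefined, startIndex: 131, endIndex: 147 },
];

// Hypothetical helper: reduce each result to the { name, position } shape from the question.
const toNamePosition = (items) =>
  items.map(({ name, startIndex }) => ({ name, position: startIndex }));

console.log(toNamePosition(results));

// Each index range slices back to the package spec in the original text.
console.log(results.map((r) => text.slice(r.startIndex, r.endIndex)));
```

    Slicing the text with startIndex and endIndex is a cheap way to confirm that the counter has not drifted.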
    

  2. You can find an explanation of my code in this answer.

    Here is another, similar solution that relies more on regex:

    const pipOptionsWithArg = [
      '-c',
      '--constraint',
      '-e',
      '--editable',
      '-t',
      '--target',
      '--platform',
      '--python-version',
      '--implementation',
      '--abi',
      '--root',
      '--prefix',
      '-b',
      '--build',
      '--src',
      '--upgrade-strategy',
      '--install-option',
      '--global-option',
      '--no-binary',
      '--only-binary',
      '--progress-bar',
      '-i',
      '--index-url',
      '--extra-index-url',
      '-f',
      '--find-links',
      '--log',
      '--proxy',
      '--retries',
      '--timeout',
      '--exists-action',
      '--trusted-host',
      '--cert',
      '--client-cert',
      '--cache-dir',
    ];
    const optionWithArgRegex = `( (${pipOptionsWithArg.join('|')})(=| )\\S+)*`;
    const options = /( -[-\w=]+)*/;
    const packageArea = /["']?(?<package_part>(?<package_name>\w[\w.-]*)([=<>~!]=?[\w.,<>]+)?)["']?(?=\s|$)/g;
    const repeatedPackages = `(?<packages>( ${packageArea.source})+)`;
    const whiteSpace = / +/;
    const PIP_COMMAND_REGEX = new RegExp(
      `(?<command>pip install${optionWithArgRegex}${options.source})${repeatedPackages}`.replaceAll(' ', whiteSpace.source),
      'g'
    );
    export const parseCommand = (command) => {
      const matches = Array.from(command.matchAll(PIP_COMMAND_REGEX));
    
      const results = matches.flatMap((match) => {
        const packagesStr = match?.groups.packages;
        if (!packagesStr) return [];
    
        const packagesIndex = command.indexOf(packagesStr, match.index + match.groups.command.length);
    
        return Array.from(packagesStr.matchAll(packageArea))
          .map((packageMatch) => {
            const packagePart = packageMatch.groups.package_part;
            const name = packageMatch.groups.package_name;
    
            const startIndex = packagesIndex + packagesStr.indexOf(packagePart, packageMatch.index);
            const endIndex = startIndex + packagePart.length;
    
            return {
              type: 'pypi',
              name,
              version: undefined,
              startIndex,
              endIndex,
            };
          })
          .filter((result) => result.name !== 'requirements.txt');
      });
    
      return results;
    };
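
    The pattern in this answer is assembled from string pieces, and every literal space in the assembled string is then replaced with " +" so that runs of spaces are tolerated. A reduced sketch of that trick follows. The shortened pattern, the group names flags and word, and the sample command are assumptions for illustration. Note that inside a template literal the backslash of a regex escape has to be doubled.

```javascript
const whiteSpace = / +/;

// Reduced pattern: "pip install", any number of flags, then one bare word.
// "\\w" is needed because "\w" inside a string or template literal is just "w".
const template = `pip install(?<flags>( -[-\\w=]+)*) (?<word>\\w[\\w.-]*)`;

// Every literal space in the template becomes " +" (one or more spaces).
const pipRegex = new RegExp(template.replaceAll(' ', whiteSpace.source));

const match = 'pip   install  --upgrade   numpy'.match(pipRegex);
console.log(match.groups.word); // numpy
console.log(JSON.stringify(match.groups.flags)); // "  --upgrade"
```

    Writing the template with single spaces keeps it readable, at the cost that no piece of the pattern may contain a space that is meant literally.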
    