
I have a giant string (Markdown) that contains something like this:

## Header 1

{~1.0} Lorem ipsum dolor sit amet. Sed congue diam
turpis, {~2.0} vitae congue erat accumsan nec. {~3.0}

{~4.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~5.0}
vitae congue erat accumsan nec. {~6.0}

{~7.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~8.0}
vitae congue erat accumsan nec. {~9.0}

## Header 2

{~10.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~11.0}
vitae congue erat accumsan nec. {~12.0}

{~113.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~14.0}
vitae congue erat accumsan nec. {~15.0}

{~16.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~17.0}
vitae congue erat accumsan nec. {~18.0}

## Header 3

{~19.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~20.0}
vitae congue erat accumsan nec. {~21.0}

{~22.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~23.0}
vitae congue erat accumsan nec. {~24.0}

{~25.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~26.0}
vitae congue erat accumsan nec. {~27.0}

This is a marker {~x.x}

I'll call the combination of a header and one or more paragraphs a "section".

I need to match the first and the last marker of every section.

Currently I’m using the regex /\s?{([^}]*(~\d*(?:\.\d+)?)[^}]*)}\s?/g in JavaScript, which I got from the selected answer to this question, to capture all the markers. Now I need to modify it to capture only the first and the last marker of every ‘section’.

The string comes from user input, so I cannot know in advance how many paragraphs a ‘section’ will have, nor the content of the headers. All I know is that there will be at least one section (that is, one header followed by some number of paragraphs).

3 Answers


  1. This is possible with lookarounds, which JS supports.

    Since we’re reusing the original pattern a lot, let’s store it in a variable:

    const pattern = String.raw`{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)}`;
    

    A string that doesn’t contain the pattern above looks like this, where [^] matches any character (including newlines), similar to a . with the s flag:

    `(?:(?!${pattern})[^])*`
    

    From that, we construct our lookahead and lookbehind:

    // Pattern, anything that doesn't contain pattern, then header or end of string (not end of line).
    const lookahead = `${pattern}(?=(?:(?!${pattern})[^])*(?:^##.+|(?![^])))`;
    
    // Header, anything that doesn't contain pattern, then pattern itself.
    const lookbehind = `(?<=^##.+$(?:(?!${pattern})[^])*)${pattern}`;
    

    Here’s how our final steps go:

    const regex = new RegExp(`${lookbehind}|${lookahead}`, 'gm');
    
    // Filter out unmatched groups.
    [...text.matchAll(regex)].map(match => match.filter(Boolean));
    

    Try it:

    console.config({ maximize: true });
    
    function match(string) {
      const pattern = String.raw`{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)}`;
      const lookahead = `${pattern}(?=(?:(?!${pattern})[^])*(?:^##.+|(?![^])))`;
      const lookbehind = `(?<=^##.+$(?:(?!${pattern})[^])*)${pattern}`;
      const regex = new RegExp(`${lookbehind}|${lookahead}`, 'gm');
      
      console.log(regex); // Just to show you how monstrous it is.
      
      return string.matchAll(regex);
    }
    
    const text = `
    ## Header 1
    
    {~1.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~2.0} vitae congue erat accumsan nec. {~3.0}
    
    {~4.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~5.0} vitae congue erat accumsan nec. {~6.0}
    
    {~7.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~8.0} vitae congue erat accumsan nec. {~9.0}
    
    ## Header 2
    
    {~10.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~11.0} vitae congue erat accumsan nec. {~12.0}
    
    {~113.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~14.0} vitae congue erat accumsan nec. {~15.0}
    
    {~16.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~17.0} vitae congue erat accumsan nec. {~18.0}
    
    ## Header 3
    
    {~19.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~20.0} vitae congue erat accumsan nec. {~21.0}
    
    {~22.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~23.0} vitae congue erat accumsan nec. {~24.0}
    
    {~25.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~26.0} vitae congue erat accumsan nec. {~27.0}
    `.trim();
    
    console.log([...match(text)].map(match => match.filter(Boolean)));
    <script src="https://gh-canon.github.io/stack-snippet-console/console.min.js"></script>
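
To see the "anything that doesn't contain the pattern" building block from this answer in isolation, here is a minimal sketch. The name `noMarker` and the sample strings are made up for illustration, and the braces are escaped here only as a precaution.

```javascript
// Marker pattern with its escapes written out (braces escaped as a precaution).
const pattern = String.raw`\{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)\}`;

// The whole string must consist of characters at which no marker starts.
// [^] matches any character, including newlines.
const noMarker = new RegExp(`^(?:(?!${pattern})[^])*$`);

console.log(noMarker.test('plain text\nacross lines')); // true
console.log(noMarker.test('text {~1.0} more'));          // false
```

The negative lookahead is checked before every single character, which is why the construct gets slow on very long inputs with many near-matches.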
  2. You can achieve the result you want by matching the marker using a tempered greedy token to ensure there is either no marker between ## and this one, or no marker or ## between this one and the next ## or end-of-string:

    ##(?:(?!{~\d+\.\d+}).)*{~(\d+\.\d+)|{~(\d+\.\d+)(?:(?!{~\d+\.\d+}|##).)*(?=##|$)
    

    This matches either:

    • ## : literal ##
    • (?:(?!{~\d+\.\d+}).)* : some number of characters, none of which starts a marker expression ({~\d+\.\d+})
    • {~ : literal {~
    • (\d+\.\d+) : the marker number, captured in group 1

    or:

    • {~ : literal {~
    • (\d+\.\d+) : the marker number, captured in group 2
    • (?:(?!{~\d+\.\d+}|##).)* : some number of characters, none of which starts a marker expression ({~\d+\.\d+}) or a header (##)
    • (?=##|$) : lookahead asserting that what follows is either the start of a header or end-of-string

    Demo on regex101

    In JavaScript:

    const str = `## Header 1
    
        {~1.0} Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris non dui id felis feugiat ornare sit amet facilisis urna. {~2.0} Sed congue diam turpis, vitae congue erat accumsan nec. {~3.0} Aenean non bibendum augue, eget ultricies odio.
    
        ## Header 2
    
        {~4.0} Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris non dui id felis feugiat ornare sit amet facilisis urna. {~5.0} Sed congue diam turpis, vitae congue erat accumsan nec. {~6.0} Aenean non bibendum augue, eget ultricies odio.
    
        ## Header 3
    
        {~7.0} Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris non dui id felis feugiat ornare sit amet facilisis urna. {~8.0} Sed congue diam turpis, vitae congue erat accumsan nec. {~9.0} Aenean non bibendum augue, eget ultricies odio.`
        
    const regex = /##(?:(?!{~\d+\.\d+}).)*{~(\d+\.\d+)|{~(\d+\.\d+)(?:(?!{~\d+\.\d+}|##).)*(?=##|$)/gs
    
    const matches = str.matchAll(regex)
    
    const res = [...matches].map(m => m[1] || m[2])
    
    console.log(res)
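
One behaviour of this pattern worth knowing: a section that contains a single marker contributes one value, not two, because the first alternative consumes it. The sketch below illustrates this on a made-up input (`sample` and `found` are names chosen for this example; braces are escaped here only as a precaution).

```javascript
// Same pattern as in the answer, with escapes written out.
const regex = /##(?:(?!\{~\d+\.\d+\}).)*\{~(\d+\.\d+)|\{~(\d+\.\d+)(?:(?!\{~\d+\.\d+\}|##).)*(?=##|$)/gs;

// Section A has one marker, section B has two.
const sample = [
  '## A',
  '{~1.0} only one marker here',
  '## B',
  '{~2.0} two {~3.0} markers',
].join('\n');

// Group 1 holds a "first" marker, group 2 a "last" marker.
const found = [...sample.matchAll(regex)].map(m => m[1] || m[2]);

console.log(found); // [ '1.0', '2.0', '3.0' ]
```

If you need strict first/last pairs per section, check which capture group matched (`m[1]` versus `m[2]`) instead of relying on positions in the flat array.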
  3. This is my variant, perhaps less regex-heavy than the others, but it works:

    function getNumbers(str) {
      return `\n${str}`.split('\n## ')
        .map(x => [...x.matchAll(/{~(\d|\.)*}/g)].map(x => x[0]))
        .map(x => [x[0], x.slice(-1)]).flat(2).filter(x => x)
        .map(x => +x.replace(/[{}~]/g, ''));
    }
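
A quick usage sketch for `getNumbers`, repeated here with its escapes written out so the block runs on its own. The `sample` input is made up for illustration, and the braces in the regex are escaped only as a precaution.

```javascript
function getNumbers(str) {
  // Prefix a newline so the first header is split off like the others.
  return `\n${str}`.split('\n## ')
    // Collect every marker in each section.
    .map(x => [...x.matchAll(/\{~(\d|\.)*\}/g)].map(x => x[0]))
    // Keep the first and last marker; drop the empty leading chunk.
    .map(x => [x[0], x.slice(-1)]).flat(2).filter(x => x)
    // Strip the braces and tilde, then convert to a number.
    .map(x => +x.replace(/[{}~]/g, ''));
}

const sample = [
  '## Header 1',
  '',
  '{~1.0} a {~2.0} b {~3.0}',
  '',
  '{~4.0} c {~5.0}',
  '',
  '## Header 2',
  '',
  '{~6.0} d {~7.0}',
].join('\n');

console.log(getNumbers(sample)); // [ 1, 5, 6, 7 ]
```

Note that the result holds numbers rather than the original marker text, and a section with only one marker yields that number twice (once as "first", once as "last").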
    