
I have a string:

const str = 'a string, a long string'

I want to break it down into words (no problem here) and then track the index of each word within the original string.

Actual result:

[
  { word: 'a',      idx: 0 },
  { word: 'string', idx: 2 },
  { word: 'a',      idx: 0 },
  { word: 'long',   idx: 12 },
  { word: 'string', idx: 2 }
]

Desired result:

[
  { word: 'a',      idx: 0 },
  { word: 'string', idx: 2 },
  { word: 'a',      idx: 10 },
  { word: 'long',   idx: 12 },
  { word: 'string', idx: 17 }
]

Code so far:

const str = 'a string, a long string'

const segmenter = new Intl.Segmenter([], { granularity: 'word' })

const getWords = str => {
  const segments = segmenter.segment(str)
  return [...segments]
    .filter(s => s.isWordLike)
    .map(s => s.segment)
}

const words = getWords(str)

const result = words.map(word => ({
  word,
  idx: str.indexOf(word)
}))

console.log(result)

3 Answers


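  1. I decomposed your string into an array of objects, each containing the word and its index. Note that this index is the word's position in the resulting array, not its character offset in the original string.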

    const str = 'a string, a long string';
    
    const words = str.split(' ').map((word, index) => ({ word, index }));
    
    console.log(words)
    

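    If you want punctuation marks treated as separate words, you could split with a regex.

    // \p{P} matches any Unicode punctuation character (requires the u flag)
    const words = str.split(/\s+|(?=\p{P})|(?<=\p{P})/u).map((word, index) => ({ word, index }));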
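
    If the character offset is needed rather than the array position, one possible sketch is to keep a cursor and start each indexOf search where the previous word ended, so repeated words are not matched at their first occurrence. The helper name wordsWithOffsets is hypothetical, and splitting on whitespace is an assumption that leaves punctuation attached to the word (for example 'string,').

```javascript
const str = 'a string, a long string';

// Hypothetical helper: pairs each whitespace-separated word with its
// character offset in the original string.
function wordsWithOffsets(text) {
  let cursor = 0; // position just past the previously found word
  return text
    .split(/\s+/)
    .filter(Boolean) // drop empty strings from leading/trailing whitespace
    .map(word => {
      // Search from the cursor so a repeated word is found at its own position
      const idx = text.indexOf(word, cursor);
      cursor = idx + word.length;
      return { word, idx };
    });
}

console.log(wordsWithOffsets(str));
```

    For the sample string this reports the second 'a' at offset 10 and the second 'string' at offset 17.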
    
  2. The objects you’re iterating over, which contain the segment and whether or not it isWordLike, also have the index:

    const str = 'a string, a long string'
    
    const segmenter = new Intl.Segmenter([], { granularity: 'word' })
    
    const getWordsWithIndexes = str => {
      const segments = segmenter.segment(str)
      return [...segments]
        .filter(s => s.isWordLike)
        .map(s => ({ idx: s.index, word: s.segment }))
    }
    
    const result = getWordsWithIndexes(str)
    
    console.log(result)

    Here’s the type definition:

    interface SegmentData {
        /** A string containing the segment extracted from the original input string. */
        segment: string;
        /** The code unit index in the original input string at which the segment begins. */
        index: number;
        /** The complete input string that was segmented. */
        input: string;
        /**
         * A boolean value only if granularity is "word"; otherwise, undefined.
         * If granularity is "word", then isWordLike is true when the segment is word-like (i.e., consists of letters/numbers/ideographs/etc.); otherwise, false.
         */
        isWordLike?: boolean;
    }
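
    As the definition says, index counts UTF-16 code units, not visible characters. The following is a minimal sketch of what that means in practice, assuming a runtime with Intl.Segmenter support; the sample text and the 'en' locale are illustrative choices, not part of the question.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });

// The emoji takes two UTF-16 code units, which shifts later indexes by one
const text = 'hi 😀 there';

const words = [];
for (const { segment, index, isWordLike } of segmenter.segment(text)) {
  if (isWordLike) words.push({ word: segment, idx: index });
}

const last = words.at(-1);
console.log(last);

// Because idx is a code unit offset, it can be passed straight to slice()
console.log(text.slice(last.idx, last.idx + last.word.length));

// Counting by code points instead gives a different position for the same word
console.log([...text].indexOf('t'));
```

    The code unit index is the one that lines up with String.prototype.slice, indexOf, and substring, so it is usually the value you want when referring back to the original string.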
    
    
  3. Another option is to use String.prototype.matchAll, which returns every match together with its index in the original string.

    Something like:

    const str = 'a string, a long string';
    const x = 'astringalongstring';
    
    console.log(getWordStartIndexes(str));
    console.log(getWordStartIndexes(x));
    
    function getWordStartIndexes(str) {
      // Lazily match a run of letters, followed by a non-letter or the end of the string
      const allMatches = str.matchAll(/(\p{L}+?)([^\p{L}]|[,:;]|$)/gu);
      return [...allMatches]
        .map(match => ({ word: match[1], index: match.index }));
    }
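
    A possible simplification, offered as a sketch rather than a drop-in replacement: matching greedy runs of letters with no trailing group yields the same words and indexes for the sample string. The function name getLetterRuns is hypothetical. Since \p{L} covers letters only, digits are skipped and a word such as "don't" would be split at the apostrophe.

```javascript
const str = 'a string, a long string';

// Hypothetical helper: every maximal run of Unicode letters with its start index
function getLetterRuns(text) {
  return [...text.matchAll(/\p{L}+/gu)]
    .map(match => ({ word: match[0], idx: match.index }));
}

console.log(getLetterRuns(str));
```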