I’ve been trying to set up my own Scripture verse regex that can accommodate a wide range of Scripture reference types. I copied some of this regex from another Stack Overflow question. The issue I’m having is that it won’t find a verse that follows a comma after a verse range, as in Hos 10:1-3, 8. It finds Hos 10:1-3, but not the 8. The issue boils down to a \S that I have in there, but I need it so that the regex will find Scripture passages that are next to each other where the second passage starts with a number, like 1 Peter. The regex is:

((?:(I+|1st|2nd|3rd|First|Second|Third|[123])\s)?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Ez|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|He|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jd|Jud|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation|Re|Ap))(.?\s?\d+((?:[:.]\d+)?(\s?[-–—]\s?)?(?:\d+)(?:(,\s?\d+)*)?\S([:.]?\d+)?(,?\s?\d+[–—-]\s?\d+,?\d+)?)?(?:[:.]\d+)?(?:[abcde])?(?:,\d+)*(?:[-–—]\d?\s?)?)(?:[:.]\d+[–-—]\s?\d+,?\s?\d+)?

I have a link to the tester with a lot of the passage types that I want to find: link to regex tester with examples
Does anyone have any suggestions on how to edit this in order to find Hos 10:1-3, 8? I’m pretty new to regex, so I suspect this should be an easy fix for a more seasoned regexer. I’ve tried quite a few different combinations with no success, as each new combination I’ve tried breaks matching of passages that I had already been able to find.

3 Answers


  1. This works for me. I’m not saying it’s perfect, but I wanted to help.

    \b(?:(?:(?:I+|1st|2nd|3rd|First|Second|Third|[123])\s)?(?:Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Ez|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|He|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jd|Jud|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation|Re|Ap)\b)\w*\.?\s*(?:\d+(?::|\.)?\d*(?:\s*-\s*\d*)?(?:,\s*\d(?!\s*\d)*)*)+

    Hard to explain such a long regex so bear with me.

    Consider it as having sections:

    • the prefix section (I, II, 1st, or 2nd etc.)
    • the word section (Nu, Psalms, Tit, etc.)
    • the numbers section (1:4, 1.4, 1.4-2.5, 3:8 etc.)

    Note I’ve used non-capturing groups (?:) in many places because there is no need to refer back to the captured characters. This makes the output cleaner because there is less rubbish being captured.

    \b is a word boundary, added at the very start and at the end of the "word section". This stops any matches that might be caused by typos, i.e. a reference in the middle of a word such as typoJohn3:16typo.

    \w* is zero or more word characters at the end of the word section, to capture the remainder of the word. Useful when Gen 1:1 is spelled out in full as Genesis 1:1. Then…

    \.? is an optional literal period/full stop. Useful if the word section ends with a period, e.g. Gen. 1:1. Then…

    \s* is zero or more space characters.

    Then the "numbers section" (?:\d+(?::|\.)?\d*(?:\s*-\s*\d*)?(?:,\s*\d(?!\s*\d)*)*)+

    This section is itself surrounded by a non-capturing group with +. This ensures that the matching words must have some numbering afterward, otherwise Hello James will match James. Instead, only Hello James 2 would be captured.

    This group begins with one or more digits, \d+. Then…

    Zero or one colon (:) or full stop (.), followed by zero or more digits. This captures 2:23 and 4.23 etc.

    Another non-capturing group, with a hyphen surrounded by zero or more spaces, then zero or more digits. That’s this bit: (?:\s*-\s*\d*)?. Then…

    Check for a comma, then zero or more spaces, then zero or more digits, NOT followed by spaces and a digit. This is helpful because otherwise you won’t get numbers after hyphens after commas, and you will get final commas and spaces. (If you really want to follow how this works, take out the (?!\s*\d) section, one bit at a time, and use my test text below.)

    At this point it’s just zero or many of the non-capturing groups.
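The three sections can be tried out in JavaScript with a deliberately tiny book list. This is only a sketch: BOOKS and findRefs are made-up names for illustration, and the numbers section here is a simplified stand-in (chapter plus optional verse), not the full expression described above.

```javascript
// Hypothetical, heavily shortened book list (the real alternation is far longer).
const BOOKS = ['Genesis', 'Gen', 'James', 'John', 'Cor'];

// prefix section + word section + (simplified) numbers section
const refPattern = new RegExp(
  '\\b(?:(?:[123]|I+)\\s)?' +         // optional prefix: 1, 2, 3, I, II, III
  '(?:' + BOOKS.join('|') + ')\\b' +  // book name, ending on a word boundary
  '\\.?\\s*' +                        // optional period, optional spaces
  '\\d+(?:[:.]\\d+)?',                // chapter, optionally :verse or .verse
  'g'
);

// Return every reference found in the text (empty array if none).
function findRefs(text) {
  return text.match(refPattern) || [];
}

console.log(findRefs('Hello James'));              // []
console.log(findRefs('Hello James 2'));            // [ 'James 2' ]
console.log(findRefs('typoJohn3:16typo'));         // []
console.log(findRefs('Gen. 1:1 and 1 Cor 12:34')); // [ 'Gen. 1:1', '1 Cor 12:34' ]
```

The first two calls show why the numbers section is mandatory, and the third shows the effect of the word boundaries.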

    Tested on the below text (available here for now https://regex101.com/r/jtybY1/1)

    Joel 10:13 The passages Luke 2:32 and Lk 1:23 that we are l Jamese ook Judge ing at tonight 1 Cor 12:34 2 Cor 3:4 are found Jude 6, in Jude 5, Genesis 2:1 - 3:19, 1 John 3:16-17, 1 Peter 1:1, and Romans 10:13, 15, 17. Please turn in Mark 2  sd 2yDan 4 ur Bibles. Ps 109:4,5,6,8.  Isaiah 61.2-3 Mt 5.4. Hi James, sd
    
    Ge 27.27-29,89-40 Heb 11.20 Heb. 12.17 Jonah 3
    
    
    Jd. 5
    Jd 6
    
    1 Cor 12:34 2 Cor 3:4. He 4.12 Re 1.16
    
    Leviticus 16:6 He 5.3 He 7.27
    
    Hos 10:1-3, 8 and 1 John 2:23
    
  2. I will not comment on the part of the regex that matches the name or abbreviation of the book, but focus on the second part that matches the chapter(s) and verse(s):

    • It starts with (.?, but it looks like you want to match a literal point here, so it should be escaped: (\.?

    • You have constructs that could lead to exponential backtracking, as can be seen in this snippet: \d+)?(\s?[-–—]\s?)?(?:\d+. If that middle optional group does not capture any characters, you have \d+ followed by \d+, which means that if there is backtracking coming back to this part of the regex, the distribution of digits will change to every possible split. This should be avoided.

    • The \S cannot be right: this matches any non-space character, which could be punctuation, a letter, … which surely would not be right to match there.

    • There is a lot of unnecessary repetition: Sometimes you have support for a certain pattern that is not supported elsewhere, like for instance the [abcde] suffix. This seems arbitrary. It is better to avoid such duplication where possible.
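A few of these points are easy to check in isolation. The following is a small JavaScript sketch with made-up, cut-down patterns (not the full regex from the question), only meant to show the unescaped dot, the \S, and the way two adjacent \d+ can split a run of digits:

```javascript
// Unescaped "." matches any character; "\." matches only a period.
const unescapedDot = /^Gen.?\s?\d+$/;
const escapedDot = /^Gen\.?\s?\d+$/;

console.log(unescapedDot.test('Gen. 1')); // true
console.log(unescapedDot.test('GenX 1')); // true  ("X" is accepted as "any character")
console.log(escapedDot.test('Gen. 1'));   // true
console.log(escapedDot.test('GenX 1'));   // false

// \S accepts any non-whitespace character, not just digits or separators.
const nonSpace = /^\d+\S$/;
console.log(nonSpace.test('3:')); // true
console.log(nonSpace.test('3x')); // true
console.log(nonSpace.test('3 ')); // false

// Two \d+ with only an optional group between them: the engine has to
// backtrack to decide where the first run of digits ends.
function splitDigits(s) {
  const m = /^(\d+)(\s?-\s?)?(\d+)$/.exec(s);
  return m ? [m[1], m[3]] : null;
}
console.log(splitDigits('1234'));  // [ '123', '4' ]
console.log(splitDigits('12-34')); // [ '12', '34' ]
```

In the '1234' case nothing in the input says where the split belongs; the 123 / 4 result is just the first split the engine reaches after backtracking.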

    Now to the core of the question: how to avoid that a digit before a book name is captured as if it is a chapter/verse belonging to a preceding reference?

    For this you can use a look-ahead. You could for instance assert before matching a space followed by digit as a chapter/verse, that these two characters are not followed by a space and a capital letter, as that would indicate that this digit is really a book number prefix.

    Here is the suggested correction to that second part of your regex:

    \.?\s(?:\d+[:.])?\d+[a-e]?(?:(?:\s?[-–—]|,)(?!\s[1-3]\s+[A-Z])\s?(?:\d+[:.])?\d+[a-e]?)*
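
To try this out, the chapter/verse part can be glued onto a book-name part. The sketch below uses a hypothetical four-book list and a simplified numeric prefix just to keep it short; bookPart, numberPart and findReferences are illustrative names, not part of the original regex.

```javascript
// Hypothetical short book list with a simplified prefix, for illustration only.
const bookPart = '(?:[1-3]\\s)?(?:Hos|Jude|John|Peter)';

// Chapter/verse part from the suggested correction, with backslashes
// doubled because it is written as a JavaScript string.
const numberPart =
  '\\.?\\s(?:\\d+[:.])?\\d+[a-e]?' +
  '(?:(?:\\s?[-–—]|,)(?!\\s[1-3]\\s+[A-Z])\\s?(?:\\d+[:.])?\\d+[a-e]?)*';

const scripture = new RegExp('\\b' + bookPart + numberPart, 'g');

// Return every reference found in the text (empty array if none).
function findReferences(text) {
  return text.match(scripture) || [];
}

console.log(findReferences('Hos 10:1-3, 8 and 1 John 2:23'));
// [ 'Hos 10:1-3, 8', '1 John 2:23' ]
console.log(findReferences('1 Peter 1:1, 2 Peter 3:4'));
// [ '1 Peter 1:1', '2 Peter 3:4' ]
```

In the second call the look-ahead sees a space, a digit from 1 to 3, a space and a capital letter after the comma, so the 2 is left for the next reference instead of being taken as a verse.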
    

    Note however that there are ambiguities like this:

    Jude 1, 2 John 1:5

    …which could mean "(Jude 1), (2 John 1:5)" or "(Jude 1, 2) (John 1:5)". The suggested regex will take the first option. I would assume that in the second case you would need some punctuation or text between the two references, like so:

    Jude 1, 2; John 1:5

    or so:

    Jude 1, 2 and John 1:5
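
These three inputs can be compared directly. This is a minimal sketch that assumes a book part reduced to just Jude and John with an optional 1-3 prefix (an illustrative simplification), followed by the suggested chapter/verse part:

```javascript
// Book part cut down to two names (assumption for brevity);
// the rest is the suggested chapter/verse expression.
const refs = /\b(?:[1-3]\s)?(?:Jude|John)\.?\s(?:\d+[:.])?\d+[a-e]?(?:(?:\s?[-–—]|,)(?!\s[1-3]\s+[A-Z])\s?(?:\d+[:.])?\d+[a-e]?)*/g;

// Return every reference found in the text (empty array if none).
function matchRefs(text) {
  return text.match(refs) || [];
}

console.log(matchRefs('Jude 1, 2 John 1:5'));     // [ 'Jude 1', '2 John 1:5' ]
console.log(matchRefs('Jude 1, 2; John 1:5'));    // [ 'Jude 1, 2', 'John 1:5' ]
console.log(matchRefs('Jude 1, 2 and John 1:5')); // [ 'Jude 1, 2', 'John 1:5' ]
```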

  3. There are a couple of problems with the regex approach.

    One is that book names can be of the form 2 Peter, and this can get muddled with a preceding verse list, as in: Gen 1, 2 Peter.

    For most of the books which can have numerals at the start this can be got round by adding these versions to the list of book names and looking for them explicitly before doing anything else.

    However, this cannot solve the problem of John which can have a preceding numeral or not. No amount of clever parsing can get round this problem and the only person who can solve it is the one who authored the text and knows what they mean.

    There is also the fact that a . can be used at the end of a book name, though this can also be accommodated by adding such strings to the list of book names before searching for them.

    A major problem (IMO) is the difficulty of comprehending regexes which deal with these special cases. For maintainability I’d suggest coding in a less sophisticated way.

    So, I gave up on regex because of the problems and instead here is a JS version – very unpolished and a bit hacky, but it does the job of dealing with the 2 Peter etc problems.

    Authors will have to check themselves for the John problem but it’s unlikely to occur often as the numbers after the preceding , would need to be low. Such a human check will be required however sophisticated and comprehensive the code/regex.

    <style>
      span {
        background: pink;
      }
    </style>
    <div></div>
    <script>
      const books = ['Gen', 'Ge', 'Gn', 'Exo', 'Ex', 'Exod', 'Lev', 'Le', 'Lv', 'Num', 'Nu', 'Nm', 'Nb', 'Deut', 'Dt', 'Josh', 'Jos', 'Jsh', 'Judg', 'Jdg', 'Jg', 'Jdgs', 'Rth', 'Ru', 'Sam', 'Samuel', 'Kings', 'Kgs', 'Kin', 'Chron', 'Chronicles', 'Ezra', 'Ezr', 'Ez', 'Neh', 'Ne', 'Esth', 'Es', 'Job', 'Job', 'Jb', 'Pslm', 'Ps', 'Psalms', 'Psa', 'Psm', 'Pss', 'Prov', 'Pr', 'Prv', 'Eccles', 'Ec', 'Song', 'So', 'Canticles', 'Song of Songs', 'SOS', 'Isa', 'Is', 'Jer', 'Je', 'Jr', 'Lam', 'La', 'Ezek', 'Eze', 'Ezk', 'Dan', 'Da', 'Dn', 'Hos', 'Ho', 'Joel', 'Joe', 'Jl', 'Amos', 'Am', 'Obad', 'Ob', 'Jnh', 'Jon', 'Micah', 'Mic', 'Nah', 'Na', 'Hab', 'Zeph', 'Zep', 'Zp', 'Haggai', 'Hag', 'Hg', 'Zech', 'Zec', 'Zc', 'Mal', 'Mal', 'Ml', 'Matt', 'Mt', 'Mrk', 'Mk', 'Mr', 'Luk', 'Lk', 'John', 'Jn', 'Jhn', 'Acts', 'Ac', 'Rom', 'Ro', 'Rm', 'Co', 'Cor', 'Corinthians', 'Gal', 'Ga', 'Ephes', 'Eph', 'Phil', 'Php', 'Col', 'Col', 'Th', 'Thes', 'Thess', 'Thessalonians', 'Ti', 'Tim', 'Timothy', 'Titus', 'Tit', 'Philem', 'Phm', 'Hebrews', 'Heb', 'He', 'James', 'Jas', 'Jm', 'Pe', 'Pet', 'Pt', 'Peter', 'Jn', 'Jo', 'Joh', 'Jhn', 'John', 'Jude', 'Jd', 'Jud', 'Jud', 'Rev', 'The Revelation', 'Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy', 'Joshua', 'Judges', 'Ruth', 'Samuel', 'Kings', 'Chronicles', 'Ezra', 'Nehemiah', 'Esther', 'Job', 'Psalms', 'Psalm', 'Proverbs', 'Ecclesiastes', 'Song of Solomon', 'Isaiah', 'Jeremiah', 'Lamentations', 'Ezekiel', 'Daniel', 'Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi', 'Matthew', 'Mark', 'Luke', 'John', 'Acts', 'Romans', 'Corinthians', 'Galatians', 'Ephesians', 'Philippians', 'Colossians', 'Thessalonians', 'Timothy', 'Titus', 'Philemon', 'Hebrews', 'James', 'Peter', 'John', 'Revelation', 'Re', 'Ap', 'Jd.', 'Heb.'];
    
      const preStrings = ['III', 'II', 'I', '1st', '2nd', '3rd', 'First', 'Second', 'Third', '1', '2', '3'];
      const preStringed = ['Sam', 'Samuel', 'Kings', 'Kgs', 'Kin', 'Chron', 'Chronicles', 'Corinthians', 'Co', 'Cor', 'Thessalonians', 'Th', 'Thes', 'Thess', 'Timothy', 'Ti', 'Tim', 'Peter', 'Pe', 'Pet', 'Pt', 'John', 'Jn', 'Jhn'];
      let text = `Joel 10:13 The passages Luke 2:32 and Lk 1:23 that we are looking at tonight 1 Cor 12:34 2 Cor 3:4 are found Jude 6, in Jude 5, Genesis 2:1 - 3:19, 1 John 3:16-17, 1 Peter 1:1, and Romans 10:13, 15, 17. Please turn in your Bibles. Ps 109:4,5,6,8.  Isaiah 61.2-3 Mt 5.4
    
    Ge 27.27-29,89-40 Heb 11.20 Heb. 12.17 Jonah 3
    
    Jd. 5
    Jd 6
    
    1 Cor 12:34 2 Cor 3:4. He 4.12 Re 1.16
    
    Leviticus 16:6 He 5.3 He 7.27
    
    Hos 10:1-3, 8 and 1 John 2:23`;
      //add the prestringed versions e.g. 1 Peter
      for (let b = 0; b < preStringed.length; b++) {
        for (let pre = 0; pre < preStrings.length; pre++) {
          books.push(preStrings[pre] + ' ' + preStringed[b]);
        }
      }
      // add the book name with . at the end as this seems to be added sometimes, at least to the shortened forms
      const length = books.length;
      for (let b = 0; b < length; b++) {
        books.push(books[b] + '.');
      }
    
      // sort descending - longer items first
      books.sort((a, b) => b.length - a.length);
      let booksAt = [];
      // go thro' each book finding where it matches in text
      for (let b = 0; b < books.length; b++) {
        const book = books[b];
        let chNoInText = 0;
        while (chNoInText < text.length) {
          let j = text.indexOf(book, chNoInText);
          if (j < 0) break;
          if (((j + book.length) < text.length) && !(text.charAt(j + book.length).match(/^[a-z]+$/))) {
            booksAt.push([book, j]);
            // overwrite this exact occurrence (at index j) with X's to prevent a shorter version matching;
            // text.replace(book, ...) would hit the first occurrence, which may be one that failed the check above
            text = text.slice(0, j) + 'X'.repeat(book.length) + text.slice(j + book.length);
          }
          chNoInText = j + book.length + 1;
        }
      }
      // into ascending order of start position
      booksAt.sort(function(a, b) {
        return a[1] - b[1];
      });
      let newText = '';
      let chNoInText = 0;
      for (let b = 0; b < booksAt.length; b++) {
        while (chNoInText < booksAt[b][1]) { //copy across characters to start of book
          newText += text.charAt(chNoInText);
          chNoInText++;
        }
        newText += '<span>' + booksAt[b][0] + '</span><span>';
        chNoInText += booksAt[b][0].length; //skip the 'fill-in characters
        for (let i = 0; i < 100; i++) {
          chNoInText++;
          const nextCh = text.charAt(chNoInText);
          //test whether are at the end of the chapter(s) and verse(s)
          if (nextCh.match(/^[a-z]+$/)) break;
          if (nextCh.match(/^[A-Z]+$/)) break;
          newText += text.charAt(chNoInText - 1);
        };
        newText += '</span>&nbsp;';
      }
      document.querySelector('div').innerHTML = newText;
    </script>