What is a regex to extract words and punctuation but ignore decimals and numbers?


I have the following sentence:

"We bought 3.5 million shirts."

I want to create an array with all of the words and punctuation, but without the number, including its decimal point.

I have the following regex:


However, this still grabs the decimal point between the digits, as follows:

["We", "bought", ".", "million", "shirts", "."]

I am looking for the following result:

["We", "bought", "million", "shirts", "."]

Notice that the "." from the number is excluded.

How can I still select periods at the end of sentences but not decimal points that occur before a number?

| ruby   | regex   2017-01-04 08:01 3 Answers

Answers ( 3 )

  1. 2017-01-04 08:01

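    Try this:

    str = "We bought 3.5 million shirts."
    # Pattern assembled from the two pieces explained below
    str.scan(/[[:alpha:]]+|[[:punct:]](?![[:digit:]])/)
    # => ["We", "bought", "million", "shirts", "."]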

    How does this work?

    • [[:alpha:]]+ selects one or more letters, i.e. words
    • [[:punct:]](?![[:digit:]]) selects punctuation that is not followed by a digit
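
    As a quick check of the lookahead described above, here is a minimal sketch that joins the two pieces into a single alternation. The combined pattern, the constant name, the helper name and the sample sentence are assumptions made for illustration.

    ```ruby
    # Assumed combination of the two pieces described above.
    WORDS_AND_PUNCT = /[[:alpha:]]+|[[:punct:]](?![[:digit:]])/

    # Hypothetical helper name, used only for this illustration.
    def words_and_punct(text)
      text.scan(WORDS_AND_PUNCT)
    end

    # The "." inside 3.14 is followed by a digit, so it is skipped;
    # the "." that ends the sentence is followed by a space, so it is kept.
    p words_and_punct("Pi is about 3.14. It ends here!")
    # => ["Pi", "is", "about", ".", "It", "ends", "here", "!"]
    ```

    Note that the lookahead applies to every punctuation mark, not only to periods, so a comma directly before a digit would be dropped as well.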
  2. 2017-01-04 09:01

    You can try this:

    a="We bought 3.5 million shirts 15 dolalr.;"
    puts b

    Try it here

    Output array:

  3. 2017-01-04 09:01

    I suggest a small enhancement: replace \D+ with \p{L}+ (or [[:alpha:]]+) so that only runs of one or more letters match, and restrict [[:punct:]] so that it does not match a . followed by a digit, using the negative lookahead (?!\.\d):

    s = "We bought 3.5 million shirts."
    res = s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
    p res # => ["We", "bought", "million", "shirts", "."]

    See the Ruby demo
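
    To make the effect of restricting the lookahead to \.\d more concrete, the sketch below compares this pattern with one that rejects any punctuation followed by a digit (assembled from the pieces described in the first answer, which is an assumption). The constant names and the sample sentence are illustrative only.

    ```ruby
    # Assumed pattern built from the first answer's description:
    # skips ANY punctuation mark that is followed by a digit.
    SKIP_ANY_PUNCT_BEFORE_DIGIT = /[[:alpha:]]+|[[:punct:]](?![[:digit:]])/

    # Pattern from this answer: skips only a "." that is followed by a digit.
    SKIP_DECIMAL_POINT_ONLY = /\p{L}+|(?!\.\d)[[:punct:]]/

    sample = "Sold 1,000 units." # hypothetical input

    p sample.scan(SKIP_ANY_PUNCT_BEFORE_DIGIT) # => ["Sold", "units", "."]
    p sample.scan(SKIP_DECIMAL_POINT_ONLY)     # => ["Sold", ",", "units", "."]
    ```

    Which behaviour is preferable depends on whether thousands separators and similar marks should count as punctuation in the result.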

    Another approach is to first remove all numbers with the regex \d*\.?\d+ and then collect the "words" and punctuation:

    s = "We bought 3.5 million shirts."
    res = s.gsub(/\d*\.?\d+/, '').scan(/\w+|\p{P}/)
    p res # => ["We", "bought", "million", "shirts", "."]

    See this Ruby demo
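
    The two-pass approach above can also be wrapped in a small method. This is only a sketch: the method name and the sample sentences are hypothetical, while the two regexes are the ones used above.

    ```ruby
    # Hypothetical helper; the name is illustrative only.
    def tokens_without_numbers(text)
      # Pass 1: delete integers and decimals such as "15", "3.5" or ".5".
      without_numbers = text.gsub(/\d*\.?\d+/, '')
      # Pass 2: collect runs of word characters and single punctuation marks.
      without_numbers.scan(/\w+|\p{P}/)
    end

    p tokens_without_numbers("We bought 3.5 million shirts for 15 dollars.")
    # => ["We", "bought", "million", "shirts", "for", "dollars", "."]
    ```

    One caveat of this sketch: a thousands separator survives the first pass, so "1,000" would leave a lone "," in the result.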
