Rails & Regex: How to validate for length of a number with or without a hyphen
The problem
I needed to update a Rails model validation for a string attribute. This attribute - let's call it system_number - had an original regex validation guaranteeing that it contained exactly 10 digits.
validates :system_number, format: { with: /\A\d{10}\z/ }, allow_blank: true
The product manager asked that it be updated to allow either 10 characters (only numbers), or 11 characters (numbers and a hyphen), but with the hyphen only allowed after the first four numbers.
I touch regex, oh, like 3 times a year? This means any time I have to use it for something more complicated than "selection starts with XYZ", it's a whirlwind game of trial and error.
And we're off!
The original regex
Skip ahead to "The merged solution" if you don't care about how we arrived at the answer :)
- \A = Start of the string
- \d = Any digit
- {10} = Exactly 10 of whatever precedes (in this case, \d{10} means "exactly 10 digits")
- \z = End of the string
So all together, it checks for exactly 10 digits between the beginning of the string and the end of the string, and nothing else.
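To see the pieces working together outside of Rails, here is a small sketch in plain Ruby. The constant name TEN_DIGITS is made up for illustration; only the pattern itself comes from the validation above.

```ruby
# TEN_DIGITS is a hypothetical name; the pattern is the original validation regex.
TEN_DIGITS = /\A\d{10}\z/

puts TEN_DIGITS.match?("1234567890")    # true: exactly 10 digits
puts TEN_DIGITS.match?("123456789")     # false: only 9 digits
puts TEN_DIGITS.match?("1234-567890")   # false: a hyphen is not a digit
puts TEN_DIGITS.match?("1234567890\n")  # false: \z does not allow a trailing newline
```

The last line is the reason to prefer \A and \z over ^ and $ in Ruby: ^ and $ match at line boundaries, so a multi-line string could slip past them.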
The first solution
validates :system_number, format: { with: /\A\d{10}\z|\A\d{4}-{1}\d{6}\z/ }, allow_blank: true
Here, I kept the original regex and added |, which means "or".
After the "or", I put the code for the "11 characters with a hyphen".
- \A = Start of the string
- \d = Any digit
- {4} = Exactly 4 of whatever precedes (in this case, \d{4} means "exactly 4 digits")
- - = Followed by a hyphen
- {1} = Again, exactly 1, in this case one hyphen
- \d = Any digit
- {6} = Exactly 6 of whatever precedes (in this case, \d{6} means "exactly 6 digits")
- \z = End of the string
So all together, it checks for either exactly 10 digits from start to end, or for an 11-character string with four digits, a hyphen, and then six digits.
And it works!
Via Rubular.com, we can see that it matches a string with 10 characters, all digits:
And a string with 11 characters, where the first four are digits, followed by a hyphen, followed by six digits:
It also doesn't match if the hyphen is located elsewhere, or if it has 11 characters with no hyphen, or if it has fewer than ten characters. Yay!
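If you'd rather check this from irb than from Rubular, a rough equivalent might look like the following. The constant names and sample values are invented for the example; the regex is the one from the validation above.

```ruby
# EITHER_FORMAT is a hypothetical name for the two-branch pattern.
EITHER_FORMAT = /\A\d{10}\z|\A\d{4}-{1}\d{6}\z/

# Sample inputs mapped to whether the validation should accept them.
FIRST_SAMPLES = {
  "1234567890"  => true,  # 10 digits, no hyphen
  "1234-567890" => true,  # hyphen after the first four digits
  "12345-67890" => false, # hyphen in the wrong place
  "12345678901" => false, # 11 characters, no hyphen
  "123456789"   => false  # fewer than ten characters
}

FIRST_SAMPLES.each do |value, expected|
  actual = EITHER_FORMAT.match?(value)
  puts "#{value.inspect} => #{actual} (expected #{expected})"
end
```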
The merged solution
This is all very well and good, but it's a little verbose. We can trim it down to the following:
validates :system_number, format: { with: /\A\d{4}-?\d{6}\z/ }, allow_blank: true
Instead of spelling out both conditions and joining them with an "or" by way of |, we can instead do:
- \A = Start of the string
- \d = Any digit
- {4} = Exactly 4 of whatever precedes (in this case, \d{4} means "exactly 4 digits")
- - = Followed by a hyphen
- ? = Zero or one of whatever precedes (in this case, -? means "zero or one hyphen")
- \d = Any digit
- {6} = Exactly 6 of whatever precedes (in this case, \d{6} means "exactly 6 digits")
- \z = End of the string
So all together, it checks for four digits, followed by an optional hyphen, followed by six digits. Voilà!
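One way to convince yourself that the shorter pattern behaves like the two-branch version is to run both against the same inputs. This is only a spot check over a handful of made-up values, not a proof, and the constant names are hypothetical.

```ruby
# Hypothetical names; both patterns are copied from the validations above.
VERBOSE_FORMAT = /\A\d{10}\z|\A\d{4}-{1}\d{6}\z/
MERGED_FORMAT  = /\A\d{4}-?\d{6}\z/

MERGED_INPUTS = [
  "1234567890",   # accepted by both
  "1234-567890",  # accepted by both
  "12345-67890",  # rejected by both: hyphen in the wrong place
  "1234--567890", # rejected by both: two hyphens
  "12345678901",  # rejected by both: 11 digits
  "123456789",    # rejected by both: too short
  ""              # rejected by both (allow_blank handles blanks in the model)
]

MERGED_INPUTS.each do |value|
  verbose = VERBOSE_FORMAT.match?(value)
  merged  = MERGED_FORMAT.match?(value)
  puts "#{value.inspect}: verbose=#{verbose} merged=#{merged}"
end
```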
Bonus solution: hyphen anywhere within the string
We weren't sure at first whether the hyphen had to come after the first four digits or could sit anywhere within the string, so Ten Forward's VP of Engineering, Brett, took on the challenge of writing a regex for the case where the hyphen can appear anywhere except the beginning or the end.
Here's what that validation looked like:
validates :system_number, format: { with: /\A(?=\d+-?\d+\z)([\d-]{11}|\d{10})(?<!\d{11})\z/ }, allow_blank: true
It's a doozy! Let's break it down.
First section: \A(?=\d+-?\d+\z)
- \A = Start of the string
- (?= = This opens a lookahead, which checks what comes next without consuming any characters. The = marks it as a positive lookahead, i.e. the pattern inside the parentheses must match at this position (versus (?! for a negative lookahead, where the pattern must NOT match). So in this case, the start of the string (\A) must be followed by \d+-?\d+\z.
- \d = Any digit
- + = At least one occurrence of whatever precedes (in this case, \d+ means "one or more digits")
- - = A hyphen
- ? = Zero or one of whatever precedes (in this case, -? means "zero or one hyphen")
- \d = Any digit
- + = At least one occurrence of whatever precedes (in this case, \d+ means "one or more digits")
- \z = End of string
So in total, it matches a string where the start is followed by at least one digit, zero or one hyphen, and at least one other digit.
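To look at the lookahead in isolation, here is a sketch that uses only this first section. LOOKAHEAD_ONLY is a made-up name, and the inputs are deliberately short to show that this section says nothing about length.

```ruby
# LOOKAHEAD_ONLY is a hypothetical name for just the first section of the pattern.
LOOKAHEAD_ONLY = /\A(?=\d+-?\d+\z)/

puts LOOKAHEAD_ONLY.match?("12-3")   # true: digits, one hyphen, digits
puts LOOKAHEAD_ONLY.match?("12")     # true: the hyphen is optional
puts LOOKAHEAD_ONLY.match?("1")      # false: needs a digit on each side of the optional hyphen
puts LOOKAHEAD_ONLY.match?("-123")   # false: starts with a hyphen
puts LOOKAHEAD_ONLY.match?("123-")   # false: ends with a hyphen
puts LOOKAHEAD_ONLY.match?("1-2-3")  # false: more than one hyphen
```

So this section takes care of where the hyphen may go and how many are allowed; the length rules are left to the rest of the pattern.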
Second section: ([\d-]{11}|\d{10})
- [\d-] = The brackets indicate one instance of whatever is inside, so in this case, a single digit (\d) or a hyphen (-)
- {11} = Exactly 11 of whatever precedes (in this case, [\d-]{11} means "exactly 11 characters, where each character is either a digit or hyphen")
- | = Or
- \d = Any digit
- {10} = Exactly 10 of whatever precedes (in this case, \d{10} means "exactly 10 digits")
All together, it means exactly 11 characters (comprising hyphens or digits) or exactly 10 characters (comprising only digits).
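Taken by itself, this second section is quite permissive. The sketch below wraps it in \A and \z (my addition, so it can be tested alone) under the hypothetical name LENGTH_ONLY, and shows a couple of strings it lets through that the full pattern should not.

```ruby
# LENGTH_ONLY is a hypothetical name; \A and \z are added here so the
# second section can be tried without the lookahead and lookbehind.
LENGTH_ONLY = /\A([\d-]{11}|\d{10})\z/

puts LENGTH_ONLY.match?("1234-567890")  # true: 11 characters, digits and a hyphen
puts LENGTH_ONLY.match?("1234567890")   # true: 10 digits
puts LENGTH_ONLY.match?("123-456789")   # false: 10 characters, but one is a hyphen
puts LENGTH_ONLY.match?("-----------")  # true: 11 hyphens slip through
puts LENGTH_ONLY.match?("12345678901")  # true: 11 digits slip through
```

The last two results are why the lookahead (at most one hyphen, never at the edges) and the lookbehind (not 11 digits) are both needed.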
Third section: (?<!\d{11})\z
- (?<! = This opens a "lookbehind". The opposite of a "lookahead", it checks the text immediately before the current position; the ! marks it as negative, so the pattern in the parentheses must not match there. So (?<!\d{11})\z means the \z must not be immediately preceded by \d{11}.
- \d = Any digit
- {11} = Exactly 11 of whatever precedes (in this case, \d{11} means "exactly 11 digits")
- \z = End of the string
As a statement, again, it means the end of the string cannot be preceded by 11 digits - or, in plain English, the string cannot comprise 11 digits.
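Here is the lookbehind tried in isolation, as a rough sketch. NOT_ELEVEN_DIGITS is a hypothetical name, and the pattern is just the third section with nothing in front of it.

```ruby
# NOT_ELEVEN_DIGITS is a hypothetical name for just the third section.
NOT_ELEVEN_DIGITS = /(?<!\d{11})\z/

puts NOT_ELEVEN_DIGITS.match?("12345678901")  # false: the end is preceded by 11 digits
puts NOT_ELEVEN_DIGITS.match?("1234-567890")  # true: the 11 characters before the end include a hyphen
puts NOT_ELEVEN_DIGITS.match?("1234567890")   # true: only 10 digits precede the end
```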
In total:
- The string must start and end with a digit and contain no more than one hyphen;
- It can be 11 characters (comprising hyphens and digits) or 10 characters (comprising just digits);
- And it cannot be 11 digits
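Putting the three sections back together, a quick sanity check of the whole pattern could look like this. The constant names and sample values are invented for illustration; the regex is Brett's, copied from the validation above.

```ruby
# HYPHEN_ANYWHERE is a hypothetical name for the full bonus pattern.
HYPHEN_ANYWHERE = /\A(?=\d+-?\d+\z)([\d-]{11}|\d{10})(?<!\d{11})\z/

# Sample inputs mapped to whether the pattern should accept them.
HYPHEN_SAMPLES = {
  "1234567890"  => true,  # 10 digits, no hyphen
  "1234-567890" => true,  # hyphen after four digits
  "1-234567890" => true,  # hyphen near the start
  "123456789-0" => true,  # hyphen near the end
  "-1234567890" => false, # hyphen at the very beginning
  "1234567890-" => false, # hyphen at the very end
  "12-34-56789" => false, # two hyphens
  "12345678901" => false, # 11 digits
  "123-456789"  => false  # 10 characters, but only 9 digits
}

HYPHEN_SAMPLES.each do |value, expected|
  actual = HYPHEN_ANYWHERE.match?(value)
  puts "#{value.inspect} => #{actual} (expected #{expected})"
end
```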