Friday October 12, 2007 Email regex address validation (aka "Web Developers and Clue, the Empty Set?")
Why, oh why, must I regularly be told by online registration forms that, sorry, the email address I entered is not valid and that I must enter a valid one before I'm allowed to register? There are so many possible problems here:
The most common cause is the first. To describe them as clueless muppets who should never be allowed to code up anything that remotely stands a chance of being something anyone else must ever interact with, would probably be too strong and not generally true, but it's close enough for me for now!
Really, if you're writing some web form that requires you to sanity-check an email address, how hard could it be to google for "email validation regex"? If they really were keen on doing it themselves, surely reading the addr-spec BNF would be a prerequisite? A quick glance at least, surely? Sadly, it appears many (most?) web developers are incapable of such lofty levels of rigour. Worse, it appears some of these idiots have webpages with their incorrect, broken regexes ranked high in the Google results.
The next problems are that it's easy to get regexes wrong, semantically at least, and it's also possible to get too clever. A good example is Stephen Shirley's email regex effort (JavaScript compatible). It's slightly wrong in not allowing foo@host addresses, a small oversight. It's also locale-dependent in two ways. Firstly, 'a-z' can match non-ASCII characters in some locales (i.e. chars Stephen didn't intend to match). Secondly, Stephen's gotten too clever: the range match very sneakily relies on ASCII ordering - cute, but locale-fragile.
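To make the ASCII-ordering point concrete, here is a minimal JavaScript sketch. The range `#-'` is a hypothetical stand-in chosen for illustration, not the range from Stephen's regex. It lists which printable ASCII characters such a range picks up. JavaScript compares character codes inside a range, so the result is stable there; the worry applies to engines (POSIX-style ones, for instance) where a range may follow the locale's collation order instead.

```javascript
// Hypothetical range for illustration only; not taken from Stephen's regex.
const sneakyRange = /^[#-']$/;

// Collect every printable ASCII character that a single-character regex matches.
function charsMatchedBy(regex) {
  const matched = [];
  for (let code = 0x20; code <= 0x7e; code++) {
    const ch = String.fromCharCode(code);
    if (regex.test(ch)) matched.push(ch);
  }
  return matched.join("");
}

// The range silently covers everything from '#' (0x23) to "'" (0x27).
const rangeMatches = charsMatchedBy(sneakyRange);
console.log(rangeMatches);
```

Spelling the characters out one by one is longer, but does not depend on how any engine orders them.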
Before addressing the last possible problem, here's my simpler rendition of Stephen's regex, fixed to be locale-independent, split over multiple lines for readability, JavaScript compatible:
(")?([[:alnum:]!#$%&'*+/=?^_`{|}~-])+
(\.[[:alnum:]!#$%&'*+/=?^_`{|}~-]+)*(")?
@[[:alnum:]-]+(\.[[:alnum:]-])*\.?
I've checked this against RFC2822 and the ECMAScript specification for validity. Stephen's also had a look at it. For PHP, the following should work, using ereg:
$regex = "(\")?[[:alnum:]!#$%&'*+/=?^_`{|}~-]+";
$regex .= "(\.[[:alnum:]!#$%&'*+/=?^_`{|}~-]+)*(\")?";
$regex .= "@[[:alnum:]-]+(\.[[:alnum:]-])*\.?";
if (ereg ($regex, $argv[1]))
....
else
....
However, neither of the above is guaranteed to be correct. It's easy, in trying to fully validate the syntax, to miss some subtleties of syntax or meaning (in what you're trying to validate, or in the form you're describing that syntax in). This suggests it's a bad idea to try.
Which brings us to the next problem: even if one manages to correctly positively validate the syntax, the address need not be functional. The system must still validate the email address, typically with a probe email from which the registrant must retrieve a URL to complete the registration. Only then can the system know the email really is valid. Further, the developer cannot rely on the client-side JavaScript having been run, or not having been subverted: they must sanitise the address server-side too. So the syntax checking really is only for the convenience of the registrant, to prevent them aimlessly waiting for an email that might never come if they've typo-ed their address (and typo-ed it in such a way as to be syntactically invalid!). So really, given the difficulty of getting it right, given that it's for the user's convenience, and given that the system will likely test the address functionally anyway, the syntax check should not be mandatory!
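One way to keep the check advisory rather than mandatory might look like the sketch below. The function name `emailSyntaxWarning` and the warning text are made up for illustration, and the check is deliberately loose: it only looks for an "@" with something on either side, and makes no attempt at RFC 2822 validation. The form would show the warning but still accept the submission, leaving the real test to the probe email.

```javascript
// Hypothetical helper: returns a warning string for the user, or null if
// nothing looks obviously wrong. It never blocks submission by itself.
function emailSyntaxWarning(address) {
  const trimmed = address.trim();
  // Use the last '@' so quoted local parts containing '@' are not flagged.
  const at = trimmed.lastIndexOf("@");
  const hasLocalPart = at >= 1;
  const hasDomain = at !== -1 && at < trimmed.length - 1;
  if (!hasLocalPart || !hasDomain) {
    return "This doesn't look like an email address. Is it typed correctly?";
  }
  return null;
}

// Example: warn, but let the user carry on regardless.
const warning = emailSyntaxWarning("paul.example.org");
if (warning !== null) {
  console.log(warning);
}
```

Because the result is only a hint, a false positive costs the user a second glance rather than a failed registration.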
In conclusion:
Update: Removed the begin and end anchor matches - whether they're appropriate depends on context of input - and added PHP example.
Update2: Add support for quoted-string, and add hyphen to domain parts, as per comments
Update3: See discussion for further corrections (which should reinforce how bad an idea it is to try to enforce syntax checks arbitrarily).
( Oct 12 2007, 04:36:41 PM IST )
Amen, brother. As the proud owner of the domain utterback.name, I have found it very frustrating that there are so many high-tech, bleeding-edge websites that haven't figured out that, for the last 7 years now, a top-level domain can have more than 3 letters.
Posted by Brian Utterback on October 12, 2007 at 06:05 PM IST #
The local portion of the email is allowed by the spec to have " surrounding it, and the top-level domain portion of the regex was missing a one-or-more modifier, so the regex with those corrections would be:
("?)([[:alnum:]!#$%&'*+/=?^_`{|}~-])+(\.[[:alnum:]!#$%&'*+/=?^_`{|}~-]+)*\1@[[:alnum:]]+(\.[[:alnum:]]+)*\.?
Posted by Tim Galeckas on October 24, 2007 at 04:47 PM IST #
Good catch on the quoted string option for local-part.
You're wrong about the domain part though. There is no requirement for a domain name to have at least two components. I.e. user@tld is perfectly valid (and such addresses have been known to exist, IMU).
The regexes above however are wrong for the domain portion, as I have forgotten to allow the - character. (Which just reinforces my point that these attempts to positively validate syntax are quite misguided :) ).
I'll add the quoted-string part, and fix the domain parts in a sec!
Posted by Paul Jakma on October 24, 2007 at 07:19 PM IST #
Best email regex so far.
However, JavaScript doesn't seem to support POSIX [:alnum:], need to use a-zA-Z0-9 instead
but then regex not locale independent! Any suggestions?
On the PHP side, mail() doesn't seem to like quotes around local part of address. So I have removed (")? sections from my regex.
Posted by Sean Kavanagh on February 07, 2008 at 03:00 AM GMT #
Hmm, I thought I had verified in the ECMAScript spec that it supported POSIX-style ranges. Seems I didn't, for I can't find any justification for it. Instead, it seems ECMAScript uses character escapes to indicate classes, e.g. \w is equivalent to [:alnum:]. So the following should work for ECMAScript, can you test?:
(")?([\w!#$%&'*+/=?^_`{|}~-])+
(\.[\w!#$%&'*+/=?^_`{|}~-]+)*(")?
@[\w-]+(\.[\w-])*\.?
If that works, I'll update the main article.
Posted by Paul Jakma on February 07, 2008 at 11:34 AM GMT #
After many hours of trial and error I have got the following to work on my website:
JavaScript version (tested using IE 7):
|
|function isValidEmailAddress(emailAddress) {
| var pattern = new RegExp("^(\")?[\\w!#$%&'*+/=?^_`{|}~-]+"
| + "(\\.[\\w!#$%&'*+/=?^_`{|}~-]+)*\\1"
| + "@((?=[^_])[\\w-])+(\\.((?=[^_])[\\w-])+)*\\.?$");
| return pattern.test(emailAddress);
|}
|
Note 1: local part should have begin AND end quotes if quoted
which I enforce above by replacing second (\")? with \\1
Note 2: \w includes _ which I have excluded above using (?=[^_])
PHP version using POSIX based ereg():
|
|function isValidEmailAddress($emailAddress) {
| $pattern = "^(\")?[[:alnum:]!#$%&'*+/=?^_`{|}~-]+"
| . "(\\.[[:alnum:]!#$%&'*+/=?^_`{|}~-]+)*(\")?"
| . "@[[:alnum:]-]+(\\.[[:alnum:]-]+)*\\.?$";
| return (bool) ereg($pattern, $emailAddress);
|}
|
PHP version using Perl based preg_match():
|
|function isValidEmailAddress($emailAddress) {
| $pattern = "/^(\")?[\\w!#$%&'*+\\/=?^_`{|}~-]+"
| . "(\\.[\\w!#$%&'*+\\/=?^_`{|}~-]+)*(\")?"
| . "@((?=[^_])[\\w-])+(\\.((?=[^_])[\\w-])+)*\\.?$/";
| return (bool) preg_match($pattern, $emailAddress);
|}
|
Note 1: Can't replace second (\")? with \\1 in PHP v4.4.7
In my previous post I said that I couldn't get quoted local parts to work.
The reason they were failing was that the quotes were being backslashed automatically.
I fixed the problem using:
|
| $from = str_replace("\\\"", "\"", $_POST["from"]);
|
They now work.
Note that [:alnum:] is not always equivalent to \w.
On PHP [:alnum:] doesn't match _, while \w does.
Since I have kept to your original syntax in the domain parts,
I had to use a lookahead pattern to exclude _ from \w in JavaScript,
i.e. [[:alnum:]-]+ => ((?=[^_])[\w-])+
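As a rough check of that substitution, the sketch below compares the lookahead form against a hand-written ASCII class on a few made-up labels. It assumes a plain JavaScript regex without the `u` or `i` flags, where \w covers exactly A-Z, a-z, 0-9 and _, so the two forms should agree. The variable names and sample labels are illustrative only.

```javascript
// Domain-label patterns, anchored so each test covers the whole string.
const withLookahead = /^((?=[^_])[\w-])+$/; // the lookahead form quoted above
const explicitClass = /^[A-Za-z0-9-]+$/;    // hand-written ASCII equivalent
const plainWordClass = /^[\w-]+$/;          // \w with no underscore exclusion

const sampleLabels = ["example", "my-host", "host9", "my_host", "_"];

// Record how each pattern treats each sample label.
const labelResults = sampleLabels.map((label) => ({
  label: label,
  lookahead: withLookahead.test(label),
  explicit: explicitClass.test(label),
  plain: plainWordClass.test(label),
}));

for (const row of labelResults) {
  console.log(row.label, row.lookahead, row.explicit, row.plain);
}
```

Only the plain \w form accepts the labels containing an underscore. Since JavaScript compares character codes in ranges such as A-Z, the explicit class is a simpler spelling of the same thing there.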
I also discovered a significant bug in your pattern (fixed in code above).
The second part of the domain group is missing a +:
|
| @[[:alnum:]-]+(\.[[:alnum:]-])*\.?
|
Should be:
|
| @[[:alnum:]-]+(\.[[:alnum:]-]+)*\.?
|
It doesn't really work at all without the +!
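A small JavaScript sketch of the difference, for anyone who wants to see it fail. It uses [A-Za-z0-9-] as an ASCII stand-in for [[:alnum:]-] (which JavaScript does not support), tests only the domain part, and anchors both ends so a partial match cannot hide the problem. The variable names and sample domains are made up for illustration.

```javascript
// ASCII stand-in for [[:alnum:]-], which JavaScript does not support.
const labelChars = "[A-Za-z0-9-]";

// Domain part as first posted (no + inside the repeated group) and as corrected.
const domainOriginal = new RegExp("^@" + labelChars + "+(\\." + labelChars + ")*\\.?$");
const domainFixed = new RegExp("^@" + labelChars + "+(\\." + labelChars + "+)*\\.?$");

const sampleDomains = ["@example.com", "@a.b.c", "@tld", "@mail.example.org."];

// Print each domain with the verdict of the original and the fixed pattern.
for (const domain of sampleDomains) {
  console.log(domain, domainOriginal.test(domain), domainFixed.test(domain));
}
```

Without the +, every label after the first can be only one character long, so the original form accepts "@a.b.c" and "@tld" but rejects "@example.com"; the corrected form accepts all four.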
Posted by Sean Kavanagh on February 08, 2008 at 07:44 PM GMT #