
Regular expression to validate email address

There is no simple regular expression for this problem: see this fully RFC‑822–compliant regex, which is anything but simple. (It was written before the days of grammatical patterns.) The grammar specified in RFC 5322 is too complicated for primitive regular expressions, although the more sophisticated grammatical patterns in Perl, PCRE, and PHP can all manage to correctly parse RFC 5322 without a hitch. Python and C# should also be able to manage it, but they use a different syntax from those first three.

However, if you are forced to use one of the many less powerful pattern-matching languages, then it’s best to use a real parser. But understand that validating an address per the RFC tells you absolutely nothing about whether the person entering it is its true owner. People sign others up to mailing lists this way all the time. Fixing that requires a fancier kind of validation: send the address a message containing a confirmation token, which the recipient must then enter on the same web page where the address was submitted.

That's the only way to know you got the address of the person entering it, which is why most mailing lists now use that mechanism to confirm sign-ups. After all, anybody can put down president@whitehouse.gov, and that will even parse as legal, but it isn’t likely to belong to the person entering it.
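As a rough illustration of the confirmation-token idea, here is a minimal Python sketch. The function names `start_signup` and `confirm_signup` and the in-memory `_pending` store are hypothetical, invented for this example; a real service would persist tokens, expire them, and actually send the mail.

```python
import secrets

# Hypothetical in-memory store mapping token -> address.
# A real service would persist this and expire old entries.
_pending = {}


def start_signup(address):
    """Create a one-time token for the address; the caller mails it out."""
    token = secrets.token_urlsafe(32)
    _pending[token] = address
    return token


def confirm_signup(address, token):
    """Accept the sign-up only if the token was issued for this address."""
    expected = _pending.get(token)
    if expected is None or expected != address:
        return False
    del _pending[token]  # tokens are single use
    return True
```

Only someone who can read mail sent to the address can supply the token, which is the property that syntax validation alone cannot give you.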

For PHP, you should not use the pattern given in "Validate an E-Mail Address with PHP, the Right Way", from which I quote:

There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard.

That is no better than all the other non-RFC patterns. It isn’t smart enough to handle even RFC 822, let alone RFC 5322.

If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone their perfectly valid e-mail address is invalid (a false negative) because your regular expression can't handle it is just rude from the user's perspective. A state engine built for the purpose can both validate e-mail addresses and suggest corrections for invalid ones, because it disassembles the address according to each RFC. This allows for a potentially more pleasing experience, such as:

The specified e-mail address 'myemail@address,com' is invalid. Did you mean 'myemail@address.com'?
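A message like the one above could come from something along these lines. This is only a toy two-state scanner in Python, not an RFC 5322 state engine: the function name `suggest_fix` is hypothetical, and the sketch assumes an unquoted local part, no comments, and that a comma in the domain is a mistyped dot.

```python
from typing import Optional


def suggest_fix(address: str) -> Optional[str]:
    """Return a corrected address if a comma in the domain looks like a typo.

    Returns None when nothing needs fixing or the input is beyond this sketch.
    """
    state = "local"  # which part of the address is being scanned
    local, domain = [], []
    for ch in address:
        if state == "local":
            if ch == "@":
                state = "domain"
            else:
                local.append(ch)
        else:  # state == "domain"
            if ch == "@":
                return None  # a second '@' is outside what this sketch handles
            # Assumption: a comma in the domain is a mistyped dot.
            domain.append("." if ch == "," else ch)
    if state != "domain" or not local or not domain:
        return None
    fixed = "".join(local) + "@" + "".join(domain)
    return fixed if fixed != address else None


addr = "myemail@address,com"
hint = suggest_fix(addr)
if hint:
    print(f"The specified e-mail address '{addr}' is invalid. Did you mean '{hint}'?")
```

A full implementation would track many more states (quoted strings, comments, domain literals) and could offer other corrections; the point is only that walking the address piece by piece gives you enough context to make a suggestion instead of a flat rejection.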

See also Validating Email Addresses, including the comments, or Comparing E-mail Address Validating Regular Expressions.
