Friday, March 02, 2007

How (not) to validate email addresses

A question that programmers often ask is "How do I validate an email address?"

At first glance that appears to be a sensible question. If you're writing a web form or some other application that needs to accept an email address, you might want to detect errors (say, typing fred4my.com instead of fred@my.com) and give the user a chance to correct the error.

But the question of what is a valid email address is much harder than you might expect. The official standard for email (RFC 822 and its successors) accepts a very broad range of email address formats.

[Aside: what's with Google? Try searching for "how to validate email addresses" (without the quotes). I get a 403 error page:

We're sorry...
... but your query looks similar to automated requests from a computer virus or spyware application. [...]

After some experiments, it looks like Google UK blocks (almost) any search containing "email" and "address". But Google Australia doesn't seem to care; and even Google UK will accept the query if it comes from Konqueror's toolbar.]

The best advice for validating email addresses is: Just Say No. At most, check that the email address isn't blank. If you absolutely know that the address can't be a local address, check for the presence of at least one at-sign @. (Yes, you read me right the first time: at least one.) And that's it -- leave the validation up to the mail server. If the mail server can deliver it, it is valid, and if it can't, it isn't.
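As a rough sketch of how little checking that advice calls for, here it is in Python. The function name and the allow_local flag are my own inventions for illustration, not part of any library:

```python
def worth_trying(address: str, allow_local: bool = False) -> bool:
    """Return True if the address is worth handing to the mail server.

    Hypothetical helper: it only rules out input that cannot possibly
    be an address. It does NOT decide whether the address is valid;
    that remains the mail server's job.
    """
    # Reject blank (or whitespace-only) input.
    if not address.strip():
        return False
    # A local address (e.g. "fred" on the same machine) has no at-sign,
    # so only insist on one when local addresses are ruled out.
    if allow_local:
        return True
    # At least one at-sign, not exactly one.
    return "@" in address
```

With this, worth_trying("fred4my.com") is False, while worth_trying("fred@my.com") and worth_trying("fred", allow_local=True) are both True. Everything else is left to the mail server.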

If you want to guard against user typos, get the user to type the address twice, like they do for a password.
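A minimal sketch of that double-entry check, assuming the form hands you the two fields as plain strings (the function name is hypothetical):

```python
def entries_match(first: str, second: str) -> bool:
    """Compare the two copies of the address the user typed.

    Only surrounding whitespace is ignored. The comparison is
    deliberately case-sensitive, because the part before the at-sign
    may be case-sensitive on the receiving server.
    """
    return first.strip() == second.strip()
```

If the two entries differ, ask the user to fix them. If they match, you have caught most typos without judging the address itself.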

But ignorant programmers -- and it's frightening how many programmers fall into that category -- insist on doing incorrect validation. This example shows the danger of false negatives: anyone using this code will wrongly reject perfectly valid email addresses like:

my.name@somedomain.info
professor@ancienthistory.museum
somebody (see me @ the pub) @somewhere.com

Yes, the third one is valid: the part between ( and ) is a comment, and is ignored by any compliant mail server.

Another common mistake is to reject addresses like something+else@domain.com: plus signs are allowed in the user name part.
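To make the false negatives concrete, here is a sketch in Python. The pattern is hypothetical: I made it up to resemble the sort of validator you find in the wild, and it is not taken from any particular site or library. It rejects every one of the valid addresses mentioned above:

```python
import re

# Hypothetical "typical" validator, invented for illustration only:
# letters, digits, dots and underscores, an at-sign, a domain, and a
# top-level domain of two or three letters.
NAIVE_PATTERN = re.compile(r"^[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,3}$", re.IGNORECASE)

# The valid addresses discussed in the text.
VALID_ADDRESSES = [
    "my.name@somedomain.info",
    "professor@ancienthistory.museum",
    "somebody (see me @ the pub) @somewhere.com",
    "something+else@domain.com",
]


def naive_is_valid(address: str) -> bool:
    """Apply the naive pattern; True means the validator accepts it."""
    return NAIVE_PATTERN.match(address) is not None


# Every address in this list is a false negative.
wrongly_rejected = [a for a in VALID_ADDRESSES if not naive_is_valid(a)]

for address in wrongly_rejected:
    print("wrongly rejected:", address)
```

The pattern happily accepts fred@my.com, which is exactly why code like this survives testing: it works for the addresses the programmer thought of.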

And then there are the commercial sites that won't let you register with a Hotmail, Yahoo or Gmail address. Don't get me started on the sheer pig-ignorance and stupidity of that...

But ultimately, even if an email address is syntactically valid (and it is a horrific task to check that!) there's no guarantee that the address is valid until you've actually sent to it successfully. fred@somedomain.com is syntactically valid, but you still have to send an email to that address to find out whether the address is valid or not! That's why using a validator that works for "99% of email addresses" is bad practice -- not only do you needlessly reject the 1% of valid email addresses that your software can't handle, but you still don't know whether the address is valid until you actually try it.
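One common way to "actually try it" is to send a confirmation message and wait for the user to respond. What follows is only a sketch under my own assumptions: the token-in-a-link scheme, the function names, the URL layout and the localhost mail server are all illustrative choices, not something the standard prescribes.

```python
import secrets
import smtplib
from email.message import EmailMessage


def build_confirmation(address: str, base_url: str, sender: str):
    """Build a confirmation message; return (token, message).

    Hypothetical helper. The caller stores the token and treats the
    address as confirmed only when the link is actually visited.
    """
    token = secrets.token_urlsafe(16)  # unguessable random token
    message = EmailMessage()
    message["From"] = sender
    message["To"] = address
    message["Subject"] = "Please confirm your email address"
    message.set_content(
        "Follow this link to confirm your address:\n"
        f"{base_url}?token={token}\n"
    )
    return token, message


def send_confirmation(message: EmailMessage, smtp_host: str = "localhost") -> None:
    """Hand the message to a mail server (assumed to listen on smtp_host)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(message)
```

Note that a successful hand-off to the mail server still proves nothing by itself; only the user following the link shows that the address reaches a real person.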

The only thing worse than people who insist on validating email addresses is people who insist on validating email addresses with a regular expression. To quote Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Somebody, I think in the spirit of George Leigh Mallory ("because it's there"), wrote a regular expression to almost validate email addresses (it can't deal with comments, and naturally it can't tell whether or not the address actually exists). To give you a flavour of this regex, here are the first sixty-five characters of this 6343-character monster:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]

Multiply that by a hundred. Now imagine trying to track down a bug in this beast. How confident are you that the creator of this regex has correctly dealt with all the odd corner cases?

1 comment:

Anonymous said...

Good job! I fully agree that e-mail validators should either follow the standard 100% (which is next to impossible) or not attempt syntactic validation at all. And no, I'd rather do no syntactic e-mail verification than bog down my code with a 6KB regexp. My heart aches for all the CPU cycles wasted on such regexps.

Still, in the case with web apps, trying to validate the address by sending an actual message is not a good approach either. There are several reasons:
1. It can be quite time consuming (DNS resolution timeouts, connection timeouts, slow server response, etc)
2. It introduces another element your code must watch for: a mail server, which is one more point of failure.
3. We all know that e-mail delivery is not always instant. What if the address is perfectly valid, but there is a mail queue, and the delivery takes 15 minutes? What if

So what do we do? IMHO, have the user re-type the address and hope she knows her fingers. It doesn't sound good, but in this case nothing is better than something.