Saturday, December 06, 2014

Lookahead in regular expressions


As I was solving a couple of problems in Checkio, I came across a new concept - lookaheads in regular expressions. At first, it was slightly confusing to understand how lookaheads are different from normal regular expressions. But they actually prove to be very useful in certain cases.

Lookaheads are zero-length assertions as they do not consume characters in the string but only assert whether a match is possible or not. Positive lookaheads are represented as (?=regex) and negative lookaheads as (?!regex).

Example: 
    Hello(?=World) - matches any "Hello" followed by "World"
    Hello(?!World) - matches any "Hello" which is not followed by "World", like "Hello Today" or even simply "Hello"
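To see the two forms in action, here is a small sketch using Python's re module (Python is my choice for illustration; the variable names are made up). Note that in both cases only "Hello" is part of the match, since the lookahead itself consumes nothing:

```python
import re

# Positive lookahead: "Hello" only if "World" comes right after it.
positive = re.compile(r"Hello(?=World)")
# Negative lookahead: "Hello" only if "World" does NOT come right after it.
negative = re.compile(r"Hello(?!World)")

m = positive.search("HelloWorld")
print(m.group())  # the matched text is just "Hello"; "World" is not consumed
print(m.end())    # the match ends right after "Hello"

print(positive.search("Hello Today"))          # no match
print(negative.search("Hello Today").group())  # matches "Hello"
print(negative.search("HelloWorld"))           # no match
```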

The real power of lookaheads arises when you chain two or more lookaheads together.

Consider a password validation which requires the password to have at least one letter and one digit. To construct a regular expression for this case, you can't use something like "(.*\d)(.*[A-Za-z])", as that requires a digit to appear somewhere before a letter. So a password like "123hello" would pass whereas "hello123" would fail the regex. For "hello123", once the regex engine has matched up to a digit in "123", there are no letters left after it, so the match fails.
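A quick check of this order-dependent pattern, sketched in Python (the name "ordered" is just for illustration):

```python
import re

# A digit must appear somewhere before a letter for this to match.
ordered = re.compile(r"(.*\d)(.*[A-Za-z])")

print(bool(ordered.search("123hello")))  # digit comes before a letter
print(bool(ordered.search("hello123")))  # no letter after any digit
```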

Instead, this can be easily achieved by using lookaheads: "(?=.*\d)(?=.*[A-Za-z])". As the regex engine evaluates the first lookahead on the string "hello123", it scans ahead from 'h' until it finds a digit. The first lookahead has now succeeded. But since a lookahead doesn't consume any characters, the regex engine is still positioned at 'h' when it evaluates the second lookahead, which scans ahead again and finds a letter. Now both lookaheads have succeeded and the entire regex matches, regardless of the order in which the digit and the letter appear.
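Here is a minimal Python sketch of the chained lookaheads (the names are hypothetical). Because both lookaheads are zero-length, the overall match is an empty string at position 0; what matters is only whether a match exists:

```python
import re

# Both conditions are checked from the same position, in any order.
has_digit_and_letter = re.compile(r"(?=.*\d)(?=.*[A-Za-z])")

results = {}
for candidate in ("hello123", "123hello", "hello", "12345"):
    results[candidate] = has_digit_and_letter.search(candidate) is not None
    print(candidate, results[candidate])

# The match itself is zero-length: nothing was consumed.
span = has_digit_and_letter.search("hello123").span()
print(span)
```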

For a more complex password validation, we could use:
((?=.*\d)(?=.*[A-Z])(?=.*[a-z])(?=.*[@?&$!]).{8,20})
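which implies the password needs to have a minimum of 8 and a maximum of 20 characters, and at least one digit, one lowercase letter, one uppercase letter and one special character among "@", "?", "&", "$", "!". Note that the maximum of 20 is only enforced when the pattern has to match the whole string, for example by anchoring it with ^ and $. Otherwise a longer password can still match on a 20-character portion of it.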
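A possible way to wire this into a validator in Python is sketched below. The function name is_valid_password and the sample passwords are my own, and I am assuming re.fullmatch is an acceptable substitute for explicit ^ and $ anchors so that the length limits apply to the whole password:

```python
import re

# Four lookaheads check the required character classes;
# .{8,20} then consumes the password and enforces its length.
PASSWORD_RE = re.compile(
    r"(?=.*\d)(?=.*[A-Z])(?=.*[a-z])(?=.*[@?&$!]).{8,20}"
)

def is_valid_password(password):
    """Return True if the whole password satisfies every rule."""
    # fullmatch requires the pattern to cover the entire string.
    return PASSWORD_RE.fullmatch(password) is not None

print(is_valid_password("Passw0rd!"))      # meets every rule
print(is_valid_password("password1!"))     # no uppercase letter
print(is_valid_password("Pw0!"))           # too short
print(is_valid_password("Passw0rd!" * 3))  # 27 characters, too long
```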

