Regular Expressions - Groups, Lookahead, and Patterns
Regular Expression Groups
Group Count
groupCount() returns the number of capturing groups in the pattern. Group 0, the entire match, is not included in this count.
Accessing Groups
- If there is one group, the count is 1.
- To get the value, use group(1).
- group(0) returns the entire matched value.
Example
Pattern pattern = Pattern.compile("(\\d+)-(\\w+)");
Matcher matcher = pattern.matcher("123-abc");
if (matcher.find()) {
matcher.group(0); // "123-abc" (entire match)
matcher.group(1); // "123" (first group)
matcher.group(2); // "abc" (second group)
}
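Because groupCount() excludes group 0, a loop over the capturing groups runs from 1 to groupCount() inclusive. The following is a minimal sketch of that loop; the class and method names (GroupCountDemo, captureGroups) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Returns the capturing groups (1..n) of the first match, or an empty list if nothing matches.
    static List<String> captureGroups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (matcher.find()) {
            // groupCount() does not count group 0, so iterate from 1 up to and including it.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(captureGroups("(\\d+)-(\\w+)", "123-abc")); // [123, abc]
    }
}
```

Note that group(i) can return null when a group exists in the pattern but did not take part in the match, for example in an alternation.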
Lookahead and Lookbehind
Types
| Syntax | Name | Description |
|---|---|---|
| (?<!...) | Negative Lookbehind | Asserts that what precedes does NOT match |
| (?<=...) | Positive Lookbehind | Asserts that what precedes DOES match |
| (?!...) | Negative Lookahead | Asserts that what follows does NOT match |
| (?=...) | Positive Lookahead | Asserts that what follows DOES match |
Examples
(?<=@)\w+ # Match word after @
\w+(?=@) # Match word before @
(?<!un)happy # Match "happy" not preceded by "un"
happy(?!ness) # Match "happy" not followed by "ness"
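To see that lookarounds check context without consuming it, the four patterns above can be run through a small helper. This is only a sketch: the class name LookaroundDemo, the helper findAll, and the sample inputs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects every substring of input that regex matches, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The "@" is asserted but never part of the returned match.
        System.out.println(findAll("(?<=@)\\w+", "user@example.com"));   // [example]
        System.out.println(findAll("\\w+(?=@)", "user@example.com"));    // [user]
        // Only the standalone "happy" qualifies; the one inside "unhappy" is rejected.
        System.out.println(findAll("(?<!un)happy", "happy unhappy"));    // [happy]
        // The "happy" that is followed by "ness" is rejected.
        System.out.println(findAll("happy(?!ness)", "happy happyness")); // [happy]
    }
}
```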
Backreferences (Repeat Groups)
Use \N to match the same text that capturing group N matched: \1 refers to the first captured group, \2 to the second, and so on.
Example
(\w+)\s+\1 # Match repeated words like "the the"
This matches a word, followed by whitespace, followed by the same word.
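In Java the same backreference can be used both to detect and to remove a doubled word. The sketch below is one possible way to do it; it adds \b word boundaries (not part of the pattern above) so that "this is" is not reported as a repeated "is", and the names RepeatedWordDemo, firstRepeatedWord, and collapse are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordDemo {
    // \\1 must match exactly the text that group 1 captured.
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first word that appears twice in a row, or null if there is none.
    static String firstRepeatedWord(String text) {
        Matcher matcher = REPEATED.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Replaces each doubled word with a single copy; "$1" is group 1 in the replacement string.
    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("the the cat"));   // the
        System.out.println(collapse("Paris in the the spring")); // Paris in the spring
    }
}
```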
Password Validation Pattern
Complex Password Rule
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W)(?!.*[^\u0000-\u007F])(?!.*\s).{8,50}$
Requirements
- Contains at least one digit: (?=.*\d)
- Contains at least one lowercase letter: (?=.*[a-z])
- Contains at least one uppercase letter: (?=.*[A-Z])
- Contains at least one special character, meaning any non-word character (the underscore counts as a word character, so it does not qualify): (?=.*\W)
- No non-ASCII characters allowed: (?!.*[^\u0000-\u007F])
- No whitespace allowed: (?!.*\s)
- Length between 8 and 50 characters: .{8,50}
Breakdown
| Part | Meaning |
|---|---|
| ^ | Start of string |
| (?=.*\d) | Must contain digit |
| (?=.*[a-z]) | Must contain lowercase |
| (?=.*[A-Z]) | Must contain uppercase |
| (?=.*\W) | Must contain special char |
| (?!.*[^\u0000-\u007F]) | No non-ASCII chars |
| (?!.*\s) | No whitespace |
| .{8,50} | 8 to 50 characters |
| $ | End of string |
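As a rough illustration, the rule can be wrapped in a small Java helper. The class and method names (PasswordValidator, isValid) are hypothetical; the pattern is the one shown above, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

class PasswordValidator {
    // Compiled once and reused; each lookahead checks one requirement from the start of the string.
    private static final Pattern COMPLEX_PASSWORD = Pattern.compile(
        "^(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\\W)(?!.*[^\\u0000-\\u007F])(?!.*\\s).{8,50}$");

    // Returns true only if every requirement holds; null is treated as invalid.
    static boolean isValid(String candidate) {
        return candidate != null && COMPLEX_PASSWORD.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Abcdef1!")); // true
        System.out.println(isValid("Abc_def1")); // false: "_" is a word character, so no special char
        System.out.println(isValid("Abc def1!")); // false: contains whitespace
    }
}
```

Stacking lookaheads this way keeps each requirement independent, at the cost of a failure that does not say which rule was violated; separate checks per rule are an alternative when a specific error message is needed.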
Common Regex Patterns
Here is a collection of frequently used regex patterns for everyday development tasks:
Email Validation
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Matches standard email addresses. Note that fully RFC-compliant email validation is extremely complex; this pattern covers the vast majority of real-world addresses.
URL Matching
https?://[^\s/$.?#].[^\s]*
Matches HTTP and HTTPS URLs. Handles most common URL formats.
Phone Number (Various Formats)
\+?[\d\s\-().]{7,15}
A flexible pattern that matches phone numbers in multiple formats: +1-234-567-8900, (234) 567-8900, 234.567.8900, etc.
IP Address (IPv4)
\b(?:\d{1,3}\.){3}\d{1,3}\b
Matches IPv4 addresses like 192.168.1.1. Note that this does not validate the range (0-255) of each octet.
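If the 0-255 range matters, one option is to spell out the allowed octet values in an alternation. The sketch below is an assumption-laden variant rather than a drop-in replacement: the names Ipv4Validator, OCTET, and isValid are hypothetical, it validates a whole string instead of searching inside text, and it rejects leading zeros such as "01".

```java
import java.util.regex.Pattern;

class Ipv4Validator {
    // One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Three "octet." prefixes followed by a final octet.
    private static final Pattern IPV4 = Pattern.compile("(?:" + OCTET + "\\.){3}" + OCTET);

    // matches() requires the entire string to be an address; null is treated as invalid.
    static boolean isValid(String candidate) {
        return candidate != null && IPV4.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("192.168.1.1")); // true
        System.out.println(isValid("256.1.1.1"));   // false
    }
}
```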
Date (YYYY-MM-DD)
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])
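Matches dates in ISO 8601 calendar-date format (YYYY-MM-DD). The pattern checks the month (01-12) and day (01-31) ranges but not whether the day exists in that month, so 2023-02-31 still matches.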
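A regex can confirm the shape of a date but not that the day exists in the calendar. One possible approach, sketched below, is to use the pattern above as a cheap pre-check and then let java.time.LocalDate.parse, which resolves ISO dates strictly, reject impossible dates. The names IsoDateValidator, hasIsoShape, and isRealDate are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

class IsoDateValidator {
    private static final Pattern ISO_DATE =
        Pattern.compile("\\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\\d|3[01])");

    // Shape check only: four-digit year, month 01-12, day 01-31.
    static boolean hasIsoShape(String s) {
        return ISO_DATE.matcher(s).matches();
    }

    // Shape check followed by a calendar check (month lengths, leap years).
    static boolean isRealDate(String s) {
        if (!hasIsoShape(s)) {
            return false;
        }
        try {
            LocalDate.parse(s); // throws for dates such as February 31
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasIsoShape("2023-02-31")); // true
        System.out.println(isRealDate("2023-02-31"));  // false
        System.out.println(isRealDate("2024-02-29"));  // true
    }
}
```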
Regex Character Classes
| Class | Meaning | Equivalent |
|---|---|---|
| \d | Any digit | [0-9] |
| \D | Any non-digit | [^0-9] |
| \w | Word character | [a-zA-Z0-9_] |
| \W | Non-word character | [^a-zA-Z0-9_] |
| \s | Whitespace | [ \t\n\r\f] |
| \S | Non-whitespace | [^ \t\n\r\f] |
| . | Any character (except newline) | |
Quantifiers
| Quantifier | Meaning | Example |
|---|---|---|
| * | 0 or more | a* matches “”, “a”, “aa” |
| + | 1 or more | a+ matches “a”, “aa” but not “” |
| ? | 0 or 1 | a? matches “” or “a” |
| {n} | Exactly n | a{3} matches “aaa” |
| {n,} | n or more | a{2,} matches “aa”, “aaa”, etc. |
| {n,m} | Between n and m | a{2,4} matches “aa”, “aaa”, “aaaa” |
Greedy vs Lazy
By default, quantifiers are greedy (match as much as possible). Add ? after a quantifier to make it lazy (match as little as possible):
<.*> # Greedy: matches "<div>content</div>" as one match
<.*?> # Lazy: matches "<div>" and "</div>" separately
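The difference is easy to observe in Java by collecting all matches for each variant. This is a small sketch; the class name GreedyLazyDemo and the helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    // Collects every substring of input that regex matches, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<div>content</div>";
        // Greedy: .* runs to the end of the input, then backs off to the last ">".
        System.out.println(findAll("<.*>", html));  // [<div>content</div>]
        // Lazy: .*? stops at the first ">" it can reach.
        System.out.println(findAll("<.*?>", html)); // [<div>, </div>]
    }
}
```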
Regex Performance Tips
- Be specific: [a-z]+ is faster than .+ because the engine has fewer choices at each step.
- Avoid catastrophic backtracking: Patterns like (a+)+b can cause exponential backtracking on strings like “aaaaaaaaac”. Use atomic groups or possessive quantifiers when available.
- Anchor when possible: Using ^ and $ tells the engine where to start and stop, reducing unnecessary matching attempts.
- Use non-capturing groups: (?:...) instead of (...) when you do not need to capture the group content. This saves memory and processing time.
- Test with realistic data: A regex that works on small inputs may be slow on large inputs. Always test with production-like data sizes.
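Java supports possessive quantifiers, so the backtracking tip can be sketched as follows. The rewrite (?:a++)+b is one possible fix, not the only one; it also swaps the capturing group for a non-capturing one. The names BacktrackingDemo, RISKY, POSSESSIVE, and matchesSafely are hypothetical, and String.repeat assumes Java 11 or later.

```java
import java.util.regex.Pattern;

class BacktrackingDemo {
    // Nested quantifiers: on a near-miss the engine may retry many ways of splitting the run of "a"s.
    static final Pattern RISKY = Pattern.compile("(a+)+b");

    // a++ is possessive: once it has consumed the "a"s it never gives them back,
    // so a failed match is abandoned instead of being retried with different splits.
    static final Pattern POSSESSIVE = Pattern.compile("(?:a++)+b");

    static boolean matchesSafely(String input) {
        return POSSESSIVE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Both patterns accept the same successful input.
        System.out.println(RISKY.matcher("aaab").matches()); // true
        System.out.println(matchesSafely("aaab"));           // true
        // A long near-miss is only run against the possessive version here.
        System.out.println(matchesSafely("a".repeat(40) + "c")); // false
    }
}
```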