Regular expressions in SQL Server 2025
Version 1 - Published December 16, 2025 - I will expand this article in the future with more detailed information about regex.
We finally have native regular expression support in SQL Server 2025, directly in T-SQL.
We have been waiting for this for twenty years. It removes the need for CLR assemblies and simplifies many search cases, such as validating check constraints, extracting data, and transforming complex data.
Native integration allows regex to be used in complex queries, constraints, filtered indexes, and in interoperability with other SQL Server features.
Regular expressions (regex) are a fundamental tool for advanced string manipulation, allowing complex patterns for search, validation, and transformation of text data.
Regular expressions before SQL Server 2025
Before SQL Server 2025, regex handling in SQL Server was very limited.
Native functions like LIKE, PATINDEX, or STRING_SPLIT offer basic search and string-splitting capabilities, but they are far from the power and flexibility of modern regex.
For example, LIKE only supports a simplified syntax with wildcards (%, _), and PATINDEX only finds the position of a simple pattern in a string.
To address these limitations, the most common alternatives were custom SQL CLR functions: you could integrate full regex function libraries in CLR (.NET) assemblies, but that clearly introduces performance issues.
Regex functions in SQL Server 2025
SQL Server 2025 provides five scalar functions and two table functions (TVF, for Table Valued Functions) to work with regular expressions.
Scalar functions return a value per row:
REGEXP_LIKE tests whether a string matches a pattern (boolean), REGEXP_COUNT counts occurrences of a pattern, REGEXP_INSTR locates the position of a match, REGEXP_REPLACE performs replacements, and REGEXP_SUBSTR extracts substrings.
TVF functions generate sets of rows:
REGEXP_MATCHES lists all matches with their positions, while REGEXP_SPLIT_TO_TABLE splits a string using a pattern as a delimiter.
These functions rely on Google’s RE2 library, which guarantees linear execution time relative to input size, avoiding regex denial-of-service attacks (ReDoS) that are possible with other engines. This architecture ensures safe execution even with complex patterns.
Supported platforms include SQL Server 2025 (17.x), Azure SQL Database, Azure SQL Managed Instance (with update policy “SQL Server 2025” or “Always-up-to-date”), and SQL database in Microsoft Fabric.
| Function | Type | Description | Main parameters | Returns |
|---|---|---|---|---|
| REGEXP_LIKE() | Scalar | Validates whether a string matches a regex pattern | String, regex pattern, flags | 1 (true) or 0 (false) |
| REGEXP_REPLACE() | Scalar | Replaces substrings matching the pattern with a replacement string | String, regex pattern, replacement, flags | Modified string |
| REGEXP_SUBSTR() | Scalar | Extracts a substring matching the regex pattern | String, regex pattern, position, occurrence | Extracted substring |
| REGEXP_COUNT() | Scalar | Counts the number of occurrences of the pattern in the string | String, regex pattern | Number of occurrences |
| REGEXP_INSTR() | Scalar | Locates the position of a match | String, regex pattern, position, occurrence | Position (integer) |
| REGEXP_MATCHES() | Table | Returns a table with matching substrings and their positions | String, regex pattern | Table with match details |
| REGEXP_SPLIT_TO_TABLE() | Table | Splits a string into substrings using a regex delimiter | String, regex pattern | Table with substrings and ordinal |
REGEXP_LIKE, REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE require compatibility level 170 or higher. Other regex functions are available at all compatibility levels in SQL Server 2025. To check or change a database compatibility level, use the following commands:
-- Check current level in the current database
SELECT compatibility_level
FROM sys.databases
WHERE name = DB_NAME();
-- Change if needed
ALTER DATABASE [MyDatabase]
SET COMPATIBILITY_LEVEL = 170;
Restrictions
- LOBs (Large Objects): varchar(max) and nvarchar(max) are supported with a 2 MB limit for REGEXP_LIKE, REGEXP_COUNT and REGEXP_INSTR only. The functions REGEXP_REPLACE, REGEXP_SUBSTR, REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE do not support LOBs. This is a limitation of the library.
- Three restrictions limit the use of regex in certain contexts: natively compiled stored procedures (In-Memory OLTP) cannot execute regex functions, memory-optimized tables are incompatible, and regex functions do not respect SQL collations for comparisons (behavior is strictly Unicode-based).
- The maximum pattern length is 8,000 bytes. Explicit quantifiers {n}, {n,m} and {n,} are limited to 1,000 repetitions. RE2 syntax does not support lookahead ((?=...)) or lookbehind ((?<=...)); the absence of those features is precisely what makes RE2 fast and frugal with resources.
For example, the following query fails because it uses a lookbehind:
SELECT LastName,
CONCAT(UPPER(REGEXP_SUBSTR(LastName, '^\w')),
LOWER(REGEXP_SUBSTR(LastName, '(?<=^\w).+'))) AS [InitCap]
FROM Contact.Contact;
The error message is:
Msg 19300, Level 16, State 1, Line 1
An invalid Pattern '(?<=^\w).+' was provided. Error 'invalid perl operator: (?<' occurred during evaluation of the Pattern.
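The same InitCap logic can be expressed without lookbehind, for example with plain string functions instead of regex. A minimal sketch, using the article's Contact.Contact sample table:

```sql
-- InitCap without lookbehind: uppercase the first character,
-- lowercase the rest with ordinary string functions
SELECT LastName,
       CONCAT(UPPER(LEFT(LastName, 1)),
              LOWER(SUBSTRING(LastName, 2, LEN(LastName)))) AS InitCap
FROM Contact.Contact;
```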
REGEXP_LIKE validates matches
REGEXP_LIKE tests whether a string matches a pattern. Its full syntax is:
REGEXP_LIKE ( string_expression, pattern_expression [ , flags ] )
The string_expression parameter accepts char, nchar, varchar or nvarchar types, including LOB types up to 2 MB. The pattern_expression defines the regex pattern with an 8,000-byte limit. The optional flags parameter (30 characters max) controls matching behavior (for example case-insensitivity). The function returns a boolean true or false, like a classic LIKE, and can be used in a predicate.
REGEXP_LIKE requires compatibility level 170 or higher. Without that setting, the query fails with a syntax error.
-- Change compatibility level: 170 (SQL Server 2025)
ALTER DATABASE PachadataTraining
SET COMPATIBILITY_LEVEL = 170;
GO
USE PachadataTraining
GO
-- Email validation
SELECT
ContactId,
Email,
CASE
WHEN REGEXP_LIKE(Email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$')
THEN 'Email valid'
ELSE 'Email invalid'
END AS ValidationEmail
FROM Contact.Contact
WHERE Email IS NOT NULL
ORDER BY ContactId;
-- Performance:
-- With regex: CPU time = 38 ms for 10,319 rows
-- Without regex: CPU time = 8 ms for 10,319 rows
-- WHERE clause with REGEXP_LIKE: filter names with non-alphabetic characters
SELECT
ContactId,
FirstName,
LastName
FROM Contact.Contact
WHERE REGEXP_LIKE(LastName, '[^A-Za-zÀ-ÿ\s\-'']');
Using it in a CHECK constraint ensures data integrity on insert:
ALTER TABLE Contact.Contact
ADD CONSTRAINT CK_Contact_Email_Valid
CHECK (REGEXP_LIKE(Email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'));
ALTER TABLE Contact.Contact
ADD CONSTRAINT CK_Contact_Phone_Valid
CHECK (REGEXP_LIKE(Phone, '^\d{2}[\s.-]?\d{2}[\s.-]?\d{2}[\s.-]?\d{2}[\s.-]?\d{2}$'));
REGEXP_REPLACE to transform strings
REGEXP_REPLACE offers advanced replacement capabilities with support for capture groups. Full syntax:
REGEXP_REPLACE ( string_expression, pattern_expression
[ , string_replacement [ , start [ , occurrence [ , flags ] ] ] ] )
The string_replacement parameter defines the replacement string (empty by default). The start parameter indicates the starting position (default: 1). The occurrence parameter specifies which occurrence to replace: 0 replaces all occurrences, while a positive integer targets only the Nth occurrence.
You can use back references: \1 to \9 refer to captured groups, and & refers to the full matched pattern.
-- Format a phone number with capture groups
DECLARE @phone NVARCHAR(20) = '0612345678';
SELECT REGEXP_REPLACE(
@phone,
'(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})',
'\1 \2 \3 \4 \5'
) AS FormattedPhone;
-- Result: 06 12 34 56 78
-- Mask the last 4 digits of a credit card
SELECT REGEXP_REPLACE(CreditCard, '\d{4}$', '****') AS MaskedCard
FROM Payments;
-- Replace only the 2nd occurrence
SELECT REGEXP_REPLACE('A-B-C-D', '-', '/', 1, 2) AS Result;
-- Result: A-B/C-D
-- Remove all non-numeric characters
SELECT REGEXP_REPLACE(Phone, '[^0-9]', '', 1, 0) AS CleanPhone
FROM Contacts;
REGEXP_SUBSTR extracts substrings
REGEXP_SUBSTR extracts the part of a string matching the specified pattern. It works at any compatibility level.
REGEXP_SUBSTR ( string_expression, pattern_expression
[ , start [ , occurrence [ , flags [ , group ] ] ] ] )
The group parameter allows you to extract a specific capture group: 0 returns the full match (default), while a positive integer returns the Nth captured group.
-- Extract username and domain from an email
SELECT
ContactId,
Email,
REGEXP_SUBSTR(Email, '^[^@]+') AS UserName,
REGEXP_SUBSTR(Email, '@(.+)$', 1, 1, '', 1) AS Domain
FROM Contact.Contact
WHERE Email IS NOT NULL;
-- Extract first name and last name separately
DECLARE @Name varchar(50) = 'Isabelle D''Arco';
SELECT @Name,
REGEXP_SUBSTR(@Name, '\w+', 1, 1) AS FirstName,
REGEXP_SUBSTR(@Name, '[\w'']+', 1, 2) AS LastName;
-- Extract a French postal code
DECLARE @address varchar(50) = '55 AVENUE MARECHAL FOCH 69006 LYON';
SELECT REGEXP_SUBSTR(@address, '\b\d{5}\b', 1, 1) AS FrenchZipCode;
-- First word starting with a vowel
SELECT
REGEXP_SUBSTR(m.text, '\b[aeiouàâäéèêëîïôöùûü]\w*', 1, 1, 'i') AS Word,
m.text
FROM sys.messages m;
REGEXP_COUNT counts occurrences
REGEXP_COUNT counts the number of matches of a pattern in a string:
REGEXP_COUNT ( string_expression, pattern_expression [ , start [ , flags ] ] )
This function supports LOB types ([n]varchar(max)) up to 2 MB and returns an integer.
-- Count digits in a string
SELECT TOP 10
Amount,
REGEXP_COUNT(CAST(Amount as varchar(50)), '\d') AS NumberOfDigits
-- no implicit conversion ...
FROM Enrollment.Invoice;
-- Count words
SELECT verse,
REGEXP_COUNT(verse, '\b\w+\b') AS NbOfWords
FROM AI.Verses;
-- Count keys in a JSON string
SELECT TOP 10
JsonInfo,
REGEXP_COUNT(ClobInfo, '"[^"]+"\s*:') AS NumberOfKeys
FROM Travel.Restaurant;
REGEXP_INSTR locates matches
REGEXP_INSTR returns the position of a match with control over the returned result:
REGEXP_INSTR ( string_expression, pattern_expression
[ , start [ , occurrence [ , return_option [ , flags [ , group ] ] ] ] ] )
The return_option parameter determines whether the function returns the start position (0, default) or end position (1) of the match.
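A quick sketch of return_option on a string literal (the values in the comments follow the documented semantics: 0 returns the first character of the match, 1 the position just after the last character):

```sql
-- Start vs end position of the first match of 'SQL'
DECLARE @s nvarchar(50) = N'I love SQL Server';
SELECT
    REGEXP_INSTR(@s, 'SQL', 1, 1, 0) AS StartPosition, -- 8: position of the 'S'
    REGEXP_INSTR(@s, 'SQL', 1, 1, 1) AS EndPosition;   -- 11: position after the 'L'
```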
-- Find the position of the domain in the email
SELECT
ContactId,
Email,
REGEXP_INSTR(Email, '[^@][A-Za-z0-9.-]+\.[A-Za-z]{2,}$') AS PositionDomain,
REGEXP_INSTR(Email, '@') AS PositionAt
FROM Contact.Contact
WHERE Email IS NOT NULL;
-- Position of spaces in full names
SELECT
ContactId,
CONCAT(FirstName, ' ', LastName) AS FullName,
REGEXP_INSTR(CONCAT(FirstName, ' ', LastName), '\s') AS FirstSpace,
REGEXP_INSTR(CONCAT(FirstName, ' ', LastName), '\s', 1, 2) AS SecondSpace
FROM Contact.Contact
WHERE FirstName IS NOT NULL AND LastName IS NOT NULL;
REGEXP_MATCHES (TVF) lists all matches
REGEXP_MATCHES returns a table with details of each match:
REGEXP_MATCHES ( string_expression, pattern_expression [ , flags ] )
The returned table contains the following columns:
match_id (bigint, sequence), start_position (int), end_position (int), match_value (same type as input), substring_matches (JSON, details of sub-groups).
-- Extract all hashtags from a text
SELECT * FROM REGEXP_MATCHES(
'Discover #SQLServer2025 and #Regex for #DataProcessing',
'#([A-Za-z0-9_]+)'
);
-- Returns 3 rows with details for each hashtag
-- Use with CROSS APPLY on a table
SELECT E.EmployeeID, M.match_id, M.match_value
FROM Employees E
CROSS APPLY REGEXP_MATCHES(E.Skills, '\w+') AS M;
REGEXP_SPLIT_TO_TABLE (TVF) splits strings
REGEXP_SPLIT_TO_TABLE splits a string into rows based on a delimiter pattern:
REGEXP_SPLIT_TO_TABLE ( string_expression, pattern_expression [ , flags ] )
The returned table contains value (substring) and ordinal (1-based position).
-- Split a list of skills
SELECT * FROM REGEXP_SPLIT_TO_TABLE(
'SQL, Python, Java, JavaScript',
',\s*'
);
-- Returns: SQL(1), Python(2), Java(3), JavaScript(4)
-- Split a text into words
SELECT value, ordinal
FROM REGEXP_SPLIT_TO_TABLE('Le chat mange la souris', '\s+');
-- Use with a table
SELECT C.ClientID, S.value AS Tag, S.ordinal
FROM Clients C
CROSS APPLY REGEXP_SPLIT_TO_TABLE(C.Tags, ';\s*') AS S;
These two TVF functions require compatibility level 170 or higher.
RE2 syntax
Metacharacters and special characters: The RE2 engine supports standard metacharacters: . matches any character except newline, ^ anchors the start, $ anchors the end, \ escapes special characters, | represents alternation (logical OR), and brackets [] define character classes.
Perl character classes: Perl shortcuts simplify common patterns: \d equals [0-9] (digits), \D equals [^0-9] (non-digits), \s whitespace (space, tab, newline), \S non-whitespace, \w word characters [0-9A-Za-z_], and \W non-word characters.
POSIX classes: POSIX classes provide a readable alternative to Perl shortcuts: [[:alnum:]] for alphanumerics, [[:alpha:]] for letters, [[:digit:]] for digits, [[:lower:]] and [[:upper:]] for lowercase/uppercase, [[:space:]] for whitespace, [[:word:]] for word characters, and [[:xdigit:]] for hex digits.
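To illustrate that the Perl shortcut and its POSIX equivalent behave identically, a small sketch on a literal:

```sql
-- \d and [[:digit:]] should count the same characters
DECLARE @s varchar(40) = 'Order 123 shipped on 2025-12-16';
SELECT
    REGEXP_COUNT(@s, '\d')           AS PerlDigits,   -- 11
    REGEXP_COUNT(@s, '[[:digit:]]')  AS PosixDigits;  -- 11, same result
```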
Quantifiers: Quantifiers control repetitions: * means zero or more (greedy), + one or more, ? zero or one, {n} exactly n occurrences, {n,m} between n and m, {n,} n or more. Non-greedy (lazy) versions add ?: *?, +?, ??, {n,m}?.
Quantifiers {n}, {n,m} and {n,} are limited to 1,000 repetitions maximum. Quantifiers * and + do not have this restriction.
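The greedy/lazy distinction is easiest to see side by side. A minimal sketch on a literal:

```sql
-- Greedy vs lazy: extract content between double quotes
DECLARE @s varchar(50) = '"first" and "second"';
SELECT
    REGEXP_SUBSTR(@s, '".*"')  AS Greedy, -- '"first" and "second"' (longest possible match)
    REGEXP_SUBSTR(@s, '".*?"') AS Lazy;   -- '"first"' (shortest possible match)
```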
Groups and backreferences: Parentheses create numbered capture groups: (pattern). Non-capturing groups (?:pattern) optimize processing. Named groups (?P&lt;name&gt;pattern) improve readability. Backreferences \1 to \9 let you reuse captures in REGEXP_REPLACE replacement strings; note that RE2 does not support backreferences inside the pattern itself.
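A sketch showing how group numbering interacts with the group parameter of REGEXP_SUBSTR, on an invented invoice-number literal:

```sql
-- Capturing vs non-capturing groups
DECLARE @s varchar(50) = 'Invoice FR-2025-00042';
SELECT
    -- three capture groups: group 2 is the year
    REGEXP_SUBSTR(@s, '(FR)-(\d{4})-(\d{5})',   1, 1, '', 2) AS YearPart, -- '2025'
    -- (?:FR) does not capture, so the year becomes group 1
    REGEXP_SUBSTR(@s, '(?:FR)-(\d{4})-(\d{5})', 1, 1, '', 1) AS YearToo;  -- '2025'
```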
Anchors and word boundaries: Anchors position the match: ^ start of string/line, $ end of string/line, \A absolute start of string, \z absolute end, \b word boundary, \B non-word boundary.
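The effect of \b is clearest when the same substring appears both standalone and inside longer words. A minimal sketch:

```sql
-- Word boundaries vs raw substring matching
DECLARE @s varchar(60) = 'cat concatenate category cat';
SELECT
    REGEXP_COUNT(@s, 'cat')     AS AnySubstring, -- 4 (also matches inside other words)
    REGEXP_COUNT(@s, '\bcat\b') AS WholeWord;    -- 2 (standalone 'cat' only)
```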
Four flags control matching behavior
| Flag | Description | Default |
|---|---|---|
| c | Case-sensitive match | On (default) |
| i | Case-insensitive match | Off |
| m | Multi-line mode: ^ and $ match line boundaries | Off |
| s | Single-line mode: . also matches newlines | Off |
Flags combine in a string: 'im' enables case-insensitivity and multi-line mode. If flags conflict (for example 'ic'), the last specified wins.
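A sketch of the conflict rule on literals (REGEXP_LIKE is a predicate, so it is wrapped in IIF to surface the result in the SELECT list):

```sql
-- Conflicting flags: the last specified flag wins
SELECT
    IIF(REGEXP_LIKE('HELLO', 'hello', 'c'),  1, 0) AS CaseSensitive, -- 0: no match
    IIF(REGEXP_LIKE('HELLO', 'hello', 'ci'), 1, 0) AS LastFlagWins;  -- 1: 'i' overrides 'c'
```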
-- Case-insensitive search
SELECT *
FROM Contact.Contact
WHERE REGEXP_LIKE(LastName, 'Simon', 'i');
-- Multi-line mode to process multi-line text
SELECT REGEXP_REPLACE(Content, '^\s+', '', 1, 0, 'm') AS CleanedText
FROM Articles;
-- Flag combination
SELECT REGEXP_COUNT(MultiLineText, '^line', 1, 'im') AS LineCount
FROM Documents;
Concrete usage examples
Email address validation
-- Full email pattern with TLD validation
DECLARE @PatternEmail NVARCHAR(200) = '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
-- Filter valid emails
SELECT *
FROM Contact.Contact
WHERE REGEXP_LIKE(Email, @PatternEmail, 'i');
-- Data quality report
SELECT
COUNT(*) AS TotalEmails,
SUM(CASE WHEN REGEXP_LIKE(Email, @PatternEmail, 'i') THEN 1 ELSE 0 END) AS ValidEmails,
SUM(CASE WHEN NOT REGEXP_LIKE(Email, @PatternEmail, 'i') THEN 1 ELSE 0 END) AS InvalidEmails
FROM Contact.Contact;
French phone number validation
-- French landline or mobile phone (10 digits starting with 0)
DECLARE @PatternPhoneFR NVARCHAR(100) = '^0[1-9](\s?|-?|\.?)\d{2}(\s?|-?|\.?)\d{2}(\s?|-?|\.?)\d{2}(\s?|-?|\.?)\d{2}$';
-- Validation
SELECT Phone,
CASE WHEN REGEXP_LIKE(Phone, @PatternPhoneFR)
THEN 'Valid' ELSE 'Invalid' END AS Status
FROM Contact.Contact;
-- Normalization (keep digits only)
UPDATE Contact.Contact
SET NormalizedPhone = REGEXP_REPLACE(Phone, '[^0-9]', '', 1, 0)
WHERE REGEXP_LIKE(Phone, '\d');
Cleaning and transforming data
-- Remove multiple spaces
UPDATE Documents
SET Content = REGEXP_REPLACE(Content, '\s+', ' ', 1, 0)
WHERE REGEXP_LIKE(Content, '\s{2,}');
-- Extract structured data from free text
SELECT OriginalText,
REGEXP_SUBSTR(OriginalText, '\b[A-Z]{2}\d{3}[A-Z]{2}\b') AS FrenchPlateNumber,
REGEXP_SUBSTR(OriginalText, '\b\d{5}\b') AS PostalCode,
REGEXP_SUBSTR(OriginalText, '\b0[67]\d{8}\b') AS MobilePhone
FROM Documents;
Masking sensitive data
-- Mask credit card numbers
SELECT
CardNumber,
REGEXP_REPLACE(
CardNumber,
'(\d{4})(\d{4})(\d{4})(\d{4})',
'XXXX-XXXX-XXXX-\4'
) AS MaskedCard
FROM Payments;
-- Partial email masking
SELECT Email,
REGEXP_REPLACE(
Email,
'^(.{2})(.*)(@.+)$',
'\1***\3'
) AS MaskedEmail
FROM Contact.Contact;
Performance analysis
Let’s perform a few basic performance analyses and comparisons.
A simple search query with LIKE on the Posts table of the StackOverflow2013 database.
The Posts table contains 17,142,169 rows over 38 GB of data.
--SET STATISTICS TIME, IO ON;
--ALTER DATABASE [StackOverflow2013] SET COMPATIBILITY_LEVEL = 170;
SELECT COUNT(*)
FROM Posts
WHERE Title LIKE '%SQL Server%';
GO
SELECT COUNT(*)
FROM Posts
WHERE REGEXP_LIKE(Title , 'SQL Server');
GO
SELECT COUNT(*)
FROM Posts
WHERE REGEXP_LIKE(Title , 'SQL Server', 'i');
Plain LIKE consumes 32 seconds of CPU.
Table 'Posts'. Scan count 3, logical reads 4188571, physical reads 8, read-ahead reads 4168983.
SQL Server Execution Times:
CPU time = 32310 ms, elapsed time = 19485 ms.
REGEXP_LIKE consumes 483 seconds of CPU (8 minutes…).
Execution time is 4.3 minutes. Degree of parallelism is 2…
Table 'Posts'. Scan count 3, logical reads 4186772, physical reads 2, read-ahead reads 4169796.
Table 'Worktable'. Scan count 0, logical reads 0.
SQL Server Execution Times:
CPU time = 483825 ms, elapsed time = 258479 ms.
REGEXP_LIKE with the 'i' option (case-insensitive) consumes 999 seconds of CPU (16.6 minutes).
Execution time is 8.7 minutes. Degree of parallelism is still 2. The global benefit of parallelism is clear on such a heavy operation.
Table 'Posts'. Scan count 3, logical reads 4189766, physical reads 2, read-ahead reads 4172078.
Table 'Worktable'. Scan count 0, logical reads 0
SQL Server Execution Times:
CPU time = 999970 ms, elapsed time = 527492 ms.
In this simple search case, REGEXP_LIKE is therefore 15x slower than LIKE for a strict search, and 31x slower for a case-insensitive search, which corresponds to the functionality provided to LIKE by the CI collation on the column (SQL_Latin1_General_CP1_CI_AS).
Queries that involve regex are clearly CPU-bound.
Note that the type is NVARCHAR, so UTF-16.
Let’s see if that aspect is significant in CPU consumption.
SELECT COUNT(*)
FROM Posts
WHERE REGEXP_LIKE(CAST(Title as varchar(250)) , 'SQL Server', 'i');
Not really… REGEXP_LIKE with the ‘i’ option on a column converted to VARCHAR consumes 1013 seconds of CPU (16.8 minutes).
Table 'Posts'. Scan count 3, logical reads 4190377, physical reads 9, read-ahead reads 4171470...
Table 'Worktable'. Scan count 0...
SQL Server Execution Times:
CPU time = 1013526 ms, elapsed time = 537002 ms.
Cardinality estimation
A major issue with regular expressions is cardinality estimation. The SQL Server cardinality estimator cannot properly estimate the number of rows returned by a regular expression because it has no statistical model for it. How could you estimate the number of rows matching a complex regular expression?
The cardinality estimator does know how to estimate cardinality on the LIKE operator based on string statistics, present since SQL Server 2005, which provide an estimate on the first 80 characters of a VARCHAR. This works pretty well.
In our example, we are searching for titles that contain the string ‘SQL Server’. The Title column is of type NVARCHAR(250), so the cardinality estimator uses string statistics for a LIKE.
Let’s look at the cardinality estimate for LIKE, on the IndexScan operator in the execution plan:
<RelOp AvgRowSize="261" EstimateCPU="9.42827" EstimateIO="3088.19"
EstimatedExecutionMode="Row" EstimateRows="100247" EstimatedRowsRead="17142200"
LogicalOp="Clustered Index Scan" Parallel="true" PhysicalOp="Clustered Index Scan"
EstimatedTotalSubtreeCost="3097.62" TableCardinality="17142200">
In practice, the query with LIKE '%SQL Server%' returns 43,467 rows, which is fairly close to the plan estimate of 100,247 rows.
Now look at the cardinality estimate for REGEXP_LIKE, still on the IndexScan operator in the execution plan:
<RelOp AvgRowSize="261" EstimateCPU="9.42827" EstimateIO="3088.19"
EstimatedExecutionMode="Batch" EstimateRows="1542800" EstimatedRowsRead="17142200"
LogicalOp="Clustered Index Scan" Parallel="true" PhysicalOp="Clustered Index Scan"
EstimatedTotalSubtreeCost="3097.62" TableCardinality="17142200">
Here, the cardinality estimator estimates 1,542,800 rows returned by REGEXP_LIKE, which is far from the 43,467 rows actually returned. The estimation error is huge, which can lead to not using an index, or to excessive memory grants for sort or hash operations.
You can also note that the CPU estimate is identical between the two plans, which is obviously totally wrong, as seen in actual CPU consumption.
The cardinality estimation is identical for the case-insensitive regular expression.
In fact, the cardinality estimator seems to use a fixed percentage, here 9% (1,542,800 / 17,142,200). A quick check with REGEXP_LIKE on the Comments table shows the same percentage.
The basic idea is probably to rely on IQP (Intelligent Query Processing) to dynamically adjust cardinality estimation during query execution and correct memory grants, but those features are only available in Enterprise edition, and that is a band-aid.
Query options for cardinality estimation
To allow at least manual correction of some large estimation errors, SQL Server provides two query hint options.
SELECT COUNT(*)
FROM Posts
WHERE REGEXP_LIKE(Title , 'SQL Server', 'i')
OPTION (USE HINT ('ASSUME_FIXED_MIN_SELECTIVITY_FOR_REGEXP'));
<RelOp AvgRowSize="261" EstimateCPU="9.42827" EstimateIO="3088.19"
EstimatedExecutionMode="Row" EstimateRows="85710.8" EstimatedRowsRead="17142200"
LogicalOp="Clustered Index Scan" NodeId="5" Parallel="true" PhysicalOp="Clustered Index Scan"
EstimatedTotalSubtreeCost="3097.62" TableCardinality="17142200">
SELECT COUNT(*)
FROM Posts
WHERE REGEXP_LIKE(Title , 'SQL Server', 'i')
OPTION (USE HINT ('ASSUME_FIXED_MAX_SELECTIVITY_FOR_REGEXP'));
<RelOp AvgRowSize="261" EstimateCPU="9.42827" EstimateIO="3088.19"
EstimatedExecutionMode="Batch" EstimateRows="8571080" EstimatedRowsRead="17142200"
LogicalOp="Clustered Index Scan" NodeId="3" Parallel="true" PhysicalOp="Clustered Index Scan"
EstimatedTotalSubtreeCost="3097.62" TableCardinality="17142200">
In our case, the ASSUME_FIXED_MIN_SELECTIVITY_FOR_REGEXP option yields a cardinality estimate of 85,710 rows, which corresponds to about 0.5% of the table.
The ASSUME_FIXED_MAX_SELECTIVITY_FOR_REGEXP option yields a cardinality estimate of 8,571,080 rows, i.e. 50% of the table.
These options do not allow significant improvement in cardinality estimation, but they at least avoid huge errors.
Impact of regex complexity on performance
Let’s see whether regex complexity causes significant CPU time differences.
ALTER DATABASE [PachadataTraining] SET COMPATIBILITY_LEVEL = 170;
GO
USE PachadataTraining;
GO
SET STATISTICS TIME ON;
GO
SELECT
ContactId,
Email,
CASE
WHEN REGEXP_LIKE(Email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$')
THEN 'Email valid'
ELSE 'Email invalid'
END AS ValidationEmail
FROM Contact.Contact
WHERE Email IS NOT NULL;
-- simpler, less accurate
SELECT
ContactId,
Email,
CASE
WHEN REGEXP_LIKE(Email, '^[^\s@]+@[^\s@]+\.[^\s@]{2,}$')
THEN 'Email valid'
ELSE 'Email invalid'
END AS ValidationEmail
FROM Contact.Contact
WHERE Email IS NOT NULL;
SELECT
ContactId,
Email,
CASE
WHEN REGEXP_LIKE(Email, '^[A-Za-z0-9][A-Za-z0-9._%+-]{0,63}@[A-Za-z0-9][A-Za-z0-9.-]{0,62}\.[A-Za-z]{2,6}$')
THEN 'Email valid'
ELSE 'Email invalid'
END AS ValidationEmail
FROM Contact.Contact
WHERE Email IS NOT NULL;
Here, we test three regex patterns of increasing complexity to validate email addresses in the Contact.Contact table of the PachadataTraining database.
The first regex is relatively simple and validates email addresses in a basic way. It consumes about 21 ms of CPU to process 10,319 rows.
The second regex is simpler and less precise, yet it consumes slightly more CPU, about 23 ms for the same number of rows, probably because its loose negated classes give the engine less precise information to reject non-matches early than explicit character classes and quantifiers do.
The third regex is more complex and validates email addresses more rigorously. It consumes about 27 ms of CPU to process the same 10,319 rows.
Regex complexity has a direct but modest impact on performance.
SARGability of regular expressions
Some regular expressions in SQL Server 2025 have been presented as SARGable. In other words, they can potentially leverage indexes to improve query performance, and be optimized with index seeks rather than scans.
Obviously, this would only be possible when searching for a match at the start of a string, for example with a regex like ^abc, which searches for strings that start with 'abc'.
Let’s try this:
ALTER DATABASE [PachadataTraining] SET COMPATIBILITY_LEVEL = 170;
GO
USE PachadataTraining;
GO
SELECT ContactId
FROM Contact.Contact
WHERE LastName LIKE 'Fera%';
SELECT ContactId
FROM Contact.Contact
WHERE REGEXP_LIKE(LastName , '^Fera', 'i')
OPTION (USE HINT ('ASSUME_FIXED_MIN_SELECTIVITY_FOR_REGEXP'));
Here, the first query uses LIKE 'Fera%', which is SARGable and can use an index on the LastName column. The execution plan shows an Index Seek.
For the test with REGEXP_LIKE('^Fera', 'i'), the execution plan shows an Index Scan, which indicates that the regex was not considered SARGable by the query optimizer, despite the presence of the start anchor ^.
Info
I have also tried with the multi-line flag 'm' and the single-line flag 's', but they did not change anything. The regex remained non-SARGable.
To increase the chances of SARGability, I added the ASSUME_FIXED_MIN_SELECTIVITY_FOR_REGEXP query hint, but it did not help.
I wanted the lowest possible estimate to encourage the optimizer to use a seek. A low estimate was not even strictly required: since the SELECT returns only ContactId, the clustered index key, my nonclustered index on LastName covers the query and no lookup is needed. The optimizer still could not produce a seek.
I wondered if this could be related to the case-insensitive collation of the LastName column (SQL_Latin1_General_CP1_CI_AS).
So I tried with a case-sensitive collation. I created a temporary table with two indexed columns, the first with a binary collation and the second with a case-sensitive UTF-8 collation, to try everything.
ALTER DATABASE [PachadataTraining] SET COMPATIBILITY_LEVEL = 170;
GO
USE PachadataTraining;
GO
-- DROP TABLE #Contact
CREATE TABLE #Contact (
ContactId int PRIMARY KEY,
LastName varchar(50) COLLATE French_BIN2 index ix_temp_LastName,
LastNameUtf8 varchar(50) COLLATE Latin1_General_100_CS_AS_SC_UTF8
index ix_temp_LastNameUtf8
)
GO
INSERT INTO #Contact
SELECT ContactId, LastName, LastName
FROM Contact.Contact;
SELECT ContactId
FROM #Contact
WHERE LastName LIKE 'Fera%';
SELECT ContactId
FROM #Contact
WHERE REGEXP_LIKE(LastName , '^Fera')
OPTION (USE HINT ('ASSUME_FIXED_MIN_SELECTIVITY_FOR_REGEXP'));
SELECT ContactId
FROM #Contact
WHERE LastNameUtf8 LIKE 'Fera%';
SELECT ContactId
FROM #Contact
WHERE REGEXP_LIKE(LastNameUtf8 , '^Fera')
OPTION (USE HINT ('ASSUME_FIXED_MIN_SELECTIVITY_FOR_REGEXP'));
In none of these cases did REGEXP_LIKE produce an Index Seek, whether on a binary collation or a case-sensitive UTF-8 collation.

Performance analysis of a simple query
Let’s take a simple query that uses REGEXP_COUNT to filter post titles containing exactly 5 words. First, the SQL code:
SET STATISTICS TIME ON;
SELECT TOP 10 Title
FROM Posts
WHERE REGEXP_COUNT(Title, '\w+') = 5
ORDER BY Title;
I run it once to observe performance statistics.
SQL Server Execution Times:
CPU time = 320221 ms, elapsed time = 82955 ms.
The query consumes 320 seconds of CPU (more than 5 minutes) to process post titles. Execution time is 83 seconds, and server-level parallelism is set to 4.
I have 16 cores on my machine, so I push parallelism to the maximum:
SELECT TOP 10 Title
FROM Posts
WHERE REGEXP_COUNT(Title, '\w+') = 5
ORDER BY Title
OPTION (MAXDOP 16);
SQL Server Execution Times:
CPU time = 672954 ms, elapsed time = 43590 ms
Execution time drops to 43 seconds, but CPU consumption increases to 672 seconds (11 minutes). Going from MAXDOP 4 to 16 only roughly halves the elapsed time instead of scaling with the fourfold increase in threads. This is explained by the hardware: my processor is an AMD Ryzen 7 3700X with 8 physical cores and 16 logical threads (SMT). SQL Server uses logical threads for parallelism, but total CPU load cannot exceed physical core capacity.
Let’s try 8 cores:
SELECT TOP 10 Title
FROM Posts
WHERE REGEXP_COUNT(Title, '\w+') = 5
ORDER BY Title
OPTION (MAXDOP 8);
SQL Server Execution Times:
CPU time = 431595 ms, elapsed time = 55577 ms.
Execution time is 55 seconds, with CPU consumption of 431 seconds (7 minutes). This is a good compromise between execution time and CPU consumption.
Parallelism clearly helps distribute the heavy CPU load of regex. This is expected behavior for CPU-bound operations, and slightly different from most operational queries we usually run, where we try not to parallelize too much.
Optimization strategies to reduce CPU impact
Pre-filter with fast predicates
The most effective strategy is to reduce the data volume before applying regex:
-- Bad: regex on the whole table
SELECT *
FROM Contact.Contact
WHERE REGEXP_LIKE(Email, '^[a-z]+\d+@gmail\.com$', 'i');
-- 28 ms CPU
-- Good: pre-filter with LIKE, then regex
SELECT *
FROM Contact.Contact
WHERE Email LIKE '%@gmail.com' -- fast filter first
AND REGEXP_LIKE(Email, '^[a-z]+\d+@gmail\.com$', 'i'); -- regex on subset
-- 18 ms CPU
Persist results for frequent validations
For frequently validated columns, computing and storing the result avoids recomputation:
-- Add a persisted computed column
ALTER TABLE Clients ADD ValidEmail AS (
CAST(
CASE WHEN REGEXP_LIKE(Email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
THEN 1 ELSE 0 END
AS BIT
)
) PERSISTED;
-- Index the persisted column
CREATE INDEX IX_Clients_ValidEmail ON Clients(ValidEmail);
Pattern design rules
- Anchor patterns with ^ and $ when possible to avoid costly partial searches
- Use non-greedy quantifiers (*?, +?) to keep matches as short as possible (RE2 does not backtrack, but shorter matches still mean less work)
- Avoid overly generic patterns like .* without anchors
- Prefer explicit character classes like [0-9] over \d for readability
Measure CPU impact
To evaluate regex CPU impact, enable TIME statistics before running a query:
-- Enable statistics for analysis
SET STATISTICS TIME ON;
SELECT * FROM HugeTable
WHERE REGEXP_LIKE(Column, 'pattern');
SET STATISTICS TIME OFF;