Query: Improve translation of String's StartsWith, EndsWith and Contains
PROVIDERS BEWARE:
The LINQ translation for the methods Contains, EndsWith and StartsWith that we have in the Relational package uses the LIKE operator, which may return incorrect results if the value parameter (what we are searching for) contains wildcard characters, e.g. ‘%’ or ‘_’.
This issue addresses SqlServer and Sqlite providers, but all other providers will still use the old translation. Each provider that can be affected by this should implement their own MethodCallTranslators for Contains, EndsWith and StartsWith.
Currently in EF7 a LINQ query like this:
var things = Things.Where(t => t.Name.StartsWith("a"));
Gets translated to SQL like this (note that I am simplifying the query and expanding parameter values for clarity):
SELECT * FROM Things WHERE Name LIKE 'a%' ;
However, in order to return correct results, a LINQ query like this:
var underscoreAThings = Things.Where(t => t.Name.StartsWith("_a"));
Should be translated to SQL like this:
SELECT * FROM Things WHERE Name LIKE '~_a%' ESCAPE '~';
The escaping accounts for SQL wildcard characters in the input string which should not be treated as wildcards (we can add a separate Like() method for passing patterns, but that belongs in a separate work item).
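To make the escaping step concrete, here is a minimal C# sketch. `EscapeLikeValue` is a hypothetical helper (not an existing EF API), and `~` as the escape character simply mirrors the example above. Treating `[` as a wildcard is an assumption that applies to SQL Server's LIKE; other stores may need a different set of characters.

```csharp
using System;
using System.Text;

public static class LikeEscaping
{
    // Hypothetical helper: prefixes every LIKE wildcard, and the escape
    // character itself, with the escape character.
    public static string EscapeLikeValue(string value, char escape = '~')
    {
        var sb = new StringBuilder(value.Length);
        foreach (var c in value)
        {
            // Assumption: '[' opens a character set in SQL Server's LIKE,
            // so it is escaped alongside '%' and '_'.
            if (c == '%' || c == '_' || c == '[' || c == escape)
            {
                sb.Append(escape);
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Pattern for StartsWith("_a"), to be used as: Name LIKE @p ESCAPE '~'
        Console.WriteLine(EscapeLikeValue("_a") + "%"); // prints ~_a%
    }
}
```

The same escaped value could be wrapped as `"%" + escaped` for EndsWith() or `"%" + escaped + "%"` for Contains().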
When the input string is store correlated (e.g. it is another column in the database instead of a parameter or a literal in the query), using LIKE correctly in the translation becomes more difficult, e.g. it would be hard to perform the required escaping in SQL.
In general, for cases in which LIKE doesn’t work well we can fall back to alternative translations that don’t rely on LIKE, e.g. for String.StartsWith():
var underscoreAThings = Things.Where(t => t.Name.StartsWith(t.Prefix));
SELECT * FROM Things WHERE CHARINDEX(Prefix, Name) = 1 OR Prefix='';
Note that CHARINDEX() won’t match an empty string but String.StartsWith("") always returns true, that’s why we add the Prefix = '' condition.
The main disadvantage of this translation is that it is not sargable. That can be addressed with a hybrid translation, e.g.:
SELECT * FROM Things WHERE Name LIKE Prefix+'%' AND (CHARINDEX(Prefix, Name) = 1 OR Prefix = '');
This should be quick to evaluate using an index because the LIKE condition should be able to take advantage of the index to produce fairly selective results and the second condition will filter out false positives returned by LIKE.
Notice that this alternative removes the need to fiddle with the input value: we no longer need to escape wildcards because in the worst case they will produce false positive matches which the CHARINDEX() based condition will still be able to filter out.
Also notice that based on the current query caching design we wouldn’t need to always produce this more complex translation. Instead, we could sniff into the argument of String.StartsWith() and pivot on it to produce different translations, e.g.:
- If the value is opaque (i.e. it comes from the store) or if it contains a wildcard character, then produce the condition based on CHARINDEX()
- If the value does not contain a wildcard character in the first position then we can emit the condition based on LIKE
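The pivot described in the two bullets could be sketched roughly as follows. This is only an illustration: `Translate`, its parameters, and the convention that a `null` constant means "opaque, comes from the store" are all hypothetical, and the second bullet is read here as "the value contains no wildcard at all". The `+` string concatenation follows the SQL Server syntax used in the queries above.

```csharp
#nullable enable
using System;

public static class StartsWithTranslationSketch
{
    static bool ContainsWildcard(string value) =>
        value.IndexOfAny(new[] { '%', '_' }) >= 0;

    // column:        SQL for the string being searched, e.g. "Name"
    // argumentSql:   SQL for the StartsWith() argument, e.g. "@p" or "Prefix"
    // constantValue: the argument's value if known at translation time,
    //                or null if it is opaque (assumed convention)
    public static string Translate(string column, string argumentSql, string? constantValue)
    {
        if (constantValue == null || ContainsWildcard(constantValue))
        {
            // LIKE could give false positives here, so use CHARINDEX()
            // plus the empty-string condition.
            return $"(CHARINDEX({argumentSql}, {column}) = 1 OR {argumentSql} = '')";
        }

        // No wildcards in the value: plain LIKE is safe.
        return $"{column} LIKE {argumentSql} + '%'";
    }

    public static void Main()
    {
        Console.WriteLine(Translate("Name", "@p", "a"));       // Name LIKE @p + '%'
        Console.WriteLine(Translate("Name", "@p", "_a"));      // (CHARINDEX(@p, Name) = 1 OR @p = '')
        Console.WriteLine(Translate("Name", "Prefix", null));  // (CHARINDEX(Prefix, Name) = 1 OR Prefix = '')
    }
}
```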
Similar approaches can be used for String.EndsWith() and String.Contains(). However, for these methods LIKE does not really contribute to the performance since the beginning of the input value cannot be used to perform index lookups, so it should be ok to produce a translation that doesn’t use LIKE at all.
Issue Analytics
- Created 9 years ago
- Reactions:4
- Comments:37 (26 by maintainers)
Top GitHub Comments
@jemiller0 uh, happy to have helped although I didn’t really mean to 😃 Please note that my proposal was to use LIKE AND LEFT(LEN()) and not LEFT(LEN()) alone, simply because I assumed that LEFT(LEN()) wouldn’t be index-optimized. This retains the original logic of using LIKE first for speed, then filtering out false positives with something (CHARINDEX or LEFT(LEN())). So I’m not clear if you’re still doing LIKE AND LEFT(LEN()) or have switched to LEFT(LEN()) on its own. I’ll let the EF Core team comment further on the usefulness of LEFT(LEN()) for SqlServer.
Regarding prepared statements, I don’t think there’s any relevance to the statement type (SELECT, INSERT, UPDATE…). All of them greatly benefit from preparation in PostgreSQL, whereas in SqlServer I’m assuming all statement types are implicitly cached without the need for explicit preparation (here’s a doc page on this). It may be worth testing to confirm actual SqlServer behavior. Regardless, it seems like a good idea to implement preparation in EF Core simply to have all providers benefit from it, and EF Core may be in a good position to know what to prepare and what not to prepare. If you want to continue this conversation it may be better to do so in #5459.
Regarding case sensitivity in PostgreSQL, unquoted identifiers are always folded to lowercase, whereas quoted ones maintain case (this is why the Npgsql EF Core provider systematically quotes all identifiers). Since this is also off-topic, feel free to open an issue in the Npgsql repo to continue discussing.
Thanks for the valuable discussion. To summarize, at least in Npgsql I’m going to have the StartsWith() translator:
As a very minor implementation note, wouldn’t it be slightly better to replace CHARINDEX with LEFT(LEN(<pattern>)), similar to how EndsWith() is currently implemented? This would avoid going through the entire string, searching for the pattern.
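For illustration, the LIKE AND LEFT(LEN()) shape discussed in these comments might be generated like this. `StartsWithViaLeft` is a hypothetical name and the exact SQL is an assumption, not what any provider actually emits. One thing to keep in mind when testing it on SqlServer: LEN() does not count trailing spaces.

```csharp
using System;

public static class LeftLenSketch
{
    // Hypothetical generator: LIKE first (so an index can be used), then
    // LEFT(LEN()) to filter out false positives caused by wildcards.
    public static string StartsWithViaLeft(string column, string patternSql) =>
        $"{column} LIKE {patternSql} + '%' AND LEFT({column}, LEN({patternSql})) = {patternSql}";

    public static void Main()
    {
        Console.WriteLine(StartsWithViaLeft("Name", "Prefix"));
        // Name LIKE Prefix + '%' AND LEFT(Name, LEN(Prefix)) = Prefix
    }
}
```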