match_only_text and general necessity for case insensitive and exact match (search)
Please see the updated outline: https://github.com/elastic/ecs/issues/1837#issuecomment-1073573242
However, the description below can still be used to note some of the shortcomings of match_only_text.
Description
The match_only_text text data type causes undesired search results when performing searches that need to be both accurate and case insensitive. This data type does not appear to be the optimal solution for security/log data.
Searching for "ends with" or "starts with" does not respect positioning and, most importantly, accuracy is lost: when searching for anything other than numbers/letters (i.e. $, ., {, etc.) those search characters are ignored.
I understand the “solution” may be for a custom analyzer or to use wildcard, however the problem is that ECS is being applied without customers/users understanding this issue OR even changing the mappings. Thus the default and most commonly loaded/widely used mappings are giving users inaccurate search results.
The desire for security use cases, and I assume most logging use cases, is a) case insensitivity and b) accurate/exact results.
I would recommend one of two things: a) adopting a community text analyzer (see: https://github.com/neu5ron/es_stk), which is adopted in things such as Security Onion; or b) moving everything defined as match_only_text to wildcard.
Example
Test Data
I loaded some sample values into Elasticsearch with an explicitly defined mapping for match_only_text on the field cli (note: I used the field cli, but it can be any field, whether ECS or not, as long as that mapping is applied).
POST /_bulk
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe -exec a bypass ^T^e^S^t """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe"""}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rundll32.exe -exec a bypass ^T^e^S^t """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe $ -exec a bypass ^T^e^S^t """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe $"""}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rundll32.exe $ -exec a bypass ^T^e^S^t """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rund1132.exe $ -exec a bypass ^T^e^S^t """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rund1132.exe $"""}
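The mapping request itself is not shown above. A minimal sketch of what it might look like follows, written as a Python dict so it can be serialized and sent with any HTTP client. It assumes cli is a wildcard field with a text multi-field of type match_only_text (the pattern ECS uses for fields such as process.command_line), which would explain the cli.text field name used in the query further down; the mapping actually used for the test may differ.

```python
import json

# Hypothetical index body for the test index; the field layout is an assumption,
# not the mapping the issue author actually applied.
index_body = {
    "mappings": {
        "properties": {
            "cli": {
                "type": "wildcard",
                # Multi-field that would be addressed as cli.text in queries.
                "fields": {"text": {"type": "match_only_text"}},
            }
        }
    }
}

# Body for a request such as: PUT /es_stk_test
print(json.dumps(index_body, indent=2))
```

The index would need to exist with this mapping before the bulk request is sent; otherwise dynamic mapping picks the field types.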
Result to Find
The value I want to find is C:\Users\test\rundll32.exe $
Search - in Human Form
I expect to find this result using the following logic (in human form):
- contains rundll32.exe
- followed by a $
Search - in Elastic Form
After converting this logic into an actual Elastic query the syntax looks like:
cli.text:"*rundll32.exe*\$"
Search - Results
This search returns 6 matches when there should be only 1.
There is only one occurrence where rundll32.exe ends with a $.
Not only does it find results that don’t end in $, it returns results that do not contain a $ at all.
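The count of 6 can be reproduced without a cluster. The sketch below is a rough stand-in for what happens to a quoted query against an analyzed text field: both the documents and the query are lowercased and split into letter/digit tokens, and the query tokens are then matched as a phrase. The regex tokenizer is only an approximation of the standard analyzer (which follows Unicode word-boundary rules), and the function names are made up for illustration.

```python
import re

def approx_standard_analyze(text: str) -> list[str]:
    """Crude approximation: lowercase, keep only runs of ASCII letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """True if query_tokens appear consecutively, in order, in doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

# The eight test values from the bulk request above.
DOCS = [
    r"C:\Users\test\rundll32.exe -exec a bypass ^T^e^S^t ",
    r"C:\Users\test\rundll32.exe",
    r"rundll32.exe -exec a bypass ^T^e^S^t ",
    r"C:\Users\test\rundll32.exe $ -exec a bypass ^T^e^S^t ",
    r"C:\Users\test\rundll32.exe $",
    r"rundll32.exe $ -exec a bypass ^T^e^S^t ",
    r"rund1132.exe $ -exec a bypass ^T^e^S^t ",
    r"rund1132.exe $",
]

# The *, \ and $ in the query produce no tokens, so they cannot constrain the match.
query_tokens = approx_standard_analyze(r"*rundll32.exe*\$")
hits = [d for d in DOCS if phrase_match(approx_standard_analyze(d), query_tokens)]

print(query_tokens)
print(len(hits), "of", len(DOCS), "values match")
```

Under this approximation the query reduces to the two tokens rundll32 and exe, so every value containing rundll32.exe matches regardless of whether a $ is present, and only the two rund1132.exe values are excluded.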
Follow along complete test
Top GitHub Comments
Hey @ebeahan thanks for the reply. Especially appreciate some of the background/context for others. An additional article that may help some as well: https://socprime.com/blog/elastic-for-security-analysts-part-1-searching-strings/
I would like to clear up a few things for the discussion going forward. To preface: I apologize for causing confusion by using the word “inaccuracy” (I have since updated the title and some of the wording of the original issue). This should also clear up the confusion I caused about whether it was a standard analyzer issue.
Desired State
For (cyber) security and logging use cases there is a necessity for:
1) Exact match
2) Case insensitive search
3) Aggregations (for visualizations, AKA anything outside of match/search). I think this mostly goes without saying, but as the point gets brought up later I wanted to note it.
4) Aggregation max character length. This is not to be confused with a maximum character length on the ability to just search/match the data; it concerns displaying/returning data in a visualization/ML/etc., which requires the use of an aggregation. This is a separate discussion that has been noted in other ECS issues (https://github.com/elastic/ecs/issues/105) and is alleviated by the use of certain data types; however, when it gets brought up it tends to take away from points 1) and 2).
Elasticsearch ECS Data Types
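| Data type | 1) Exact match | 2) Case insensitive | 3) Aggregations | 4) Aggregation max character length | Additional noteworthy info |
| --- | --- | --- | --- | --- | --- |
| keyword | yes | no | yes | limited, at 32766 | |
| match_only_text | no | yes | no | n/a | |
| wildcard | yes | no | yes | unlimited (relatively), at 2147483647 | |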
ECS Data Type Usage
The 3 data types noted above are used within the ECS templates as follows (going off the main branch as of commit hash 7496470bf422451744cef8308c1782baab8086bf):
| Data type | Number of fields | Notes |
| --- | --- | --- |
| keyword | 985 | max character (byte) length set to 1024 or lower on 984 of the 985 fields |
| match_only_text | 61 | |
| wildcard | 18 | |
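For a sense of proportion, the counts above can be turned into shares. This is plain arithmetic on the three numbers listed; it only considers these three string types, not every ECS field type, so treat the percentages as a rough guide.

```python
# Field counts taken from the figures above.
counts = {"keyword": 985, "match_only_text": 61, "wildcard": 18}

total = sum(counts.values())
# Share of each type among the three string types, in percent, one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} fields, {pct}% of {total}")
```

On these numbers wildcard accounts for a little under 2% of the string fields.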
Data Types for Desired State
1) Exact match
It should go without saying that this is solved through the use of wildcard OR keyword.
2) Case insensitive
Solution is TBD… discussed in depth later. However, we know that keyword is not the solution, and match_only_text seems like it was meant to be an option for case insensitive search, but as outlined it is not useful for security/cyber/logging use cases as it strips/ignores/removes symbols/punctuation/etc.
3) Aggregations
keyword or wildcard. This is relatively solved, as in the current state of ECS keyword or wildcard is already used where appropriate and for the vast majority of fields.
4) Aggregation max character length
Separate topic.
Solving Case Insensitive Search
The recommendation that users can implement their own custom (text) analyzer is great, but it’s just not happening, and in return users have a false sense of security because they are now running searches expecting results that they will never get. Even the use of wildcard as the solution has become problematic. I think it’s evident it is not the solution, given that it is only used 18 times despite its release over 1 1/2 years ago.
You did mention “There are some early ideas about defining schemas for specific use-cases”, but I would like to say this case insensitive search situation has been around for 18+ months. Also, I would assume there are not a ton of ECS implementations outside of cyber. Or more to the point, cyber has always been the main use case (especially given the drivers/employees/etc. within Elastic who created/maintain/oversee/contributed to ECS). Sure, I can see that data normalization and a schema are not something specific to cyber and therefore should obviously not be squandered and ruined for all other possible use cases, so correct me if I am wrong and ECS is not primarily cyber focused in relation to the early adoption ideas.
On the deployments/uses of Elastic that don’t have this solved, one would think that it could be a lack of training, pro services, subscriptions, or the like… However, I have seen this issue across 15 Elastic deployments in just the last year, across just about every possible scenario:
The most common thing is that users are just using the ECS templates and have no clue about case sensitive search restrictions, or, more so, they think ECS has solved it given a few blog posts over the past year about wildcard and match_only_text. Not to mention that even if the ECS templates solved case insensitivity through wildcard, it is only used on 2% of the fields.
The reason I bring up this non-technical background is that I think Elastic not only has the responsibility to solve this, but more than has the means to solve it. Especially given that the (cyber) community has solved this and is more than willing to help. We all just want to get to searching and using the data and helping people do the same.
With that said, is it possible to have a custom analyzer integrated into ECS? I think if it was possible to get the standard analyzer for text field types promoted into its very own data type in the foundational Elasticsearch mappings… then ECS using a custom analyzer does not seem outside the realm of possibility.
Etc
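As an illustration of what such an analyzer could look like, the sketch below defines a custom analyzer built from the built-in keyword tokenizer and lowercase token filter, and applies it to a text field. It is not the es_stk analyzer and not an ECS proposal; the analyzer name lowercase_exact and the index layout are made up for this example. The small Python function mimics its effect locally: the whole value becomes one lowercased token, so symbols such as $ survive.

```python
import json

# Hypothetical settings + mapping; the names are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "lowercase_exact": {
                    "type": "custom",
                    "tokenizer": "keyword",   # emit the whole value as one token
                    "filter": ["lowercase"],  # fold case, keep every symbol
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "cli": {"type": "text", "analyzer": "lowercase_exact"}
        }
    },
}

def lowercase_exact(value: str) -> list[str]:
    """Local imitation of the analyzer above: a single lowercased token."""
    return [value.lower()]

print(json.dumps(index_body, indent=2))
print(lowercase_exact(r"C:\Users\test\RUNDLL32.exe $"))
```

A single-token analyzer like this gives exact, case insensitive matching on the full value; contains, starts-with, and ends-with searches would still need wildcard-style queries on top of it.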
Hey @neu5ron, is there a reason this was closed? It is something that should certainly be resolved, but I do not see PRs that would address the issues you have raised.
I realize that https://github.com/elastic/kibana/issues/134143 is not meant to resolve the issues with match_only_text field types, but I still opened that issue in hopes of improving the experience within Kibana and providing users with the ability to toggle the case_insensitive option.
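For reference, case_insensitive is an existing parameter on term-level queries such as wildcard. The sketch below shows a request body that uses it, assuming cli is mapped as wildcard or keyword, together with a small local imitation of wildcard matching run against the test values from the original issue. The helper function is made up for illustration and only handles * and ?.

```python
import json
import re

# Request body for a search such as: GET /es_stk_test/_search
# Assumes cli is a wildcard or keyword field (an assumption, not the tested mapping).
search_body = {
    "query": {
        "wildcard": {
            "cli": {"value": "*RUNDLL32.exe*$", "case_insensitive": True}
        }
    }
}

def wildcard_to_regex(pattern: str, case_insensitive: bool = False) -> re.Pattern:
    """Translate * and ? into regex equivalents; every other character is literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)

# The eight test values from the original issue.
VALUES = [
    r"C:\Users\test\rundll32.exe -exec a bypass ^T^e^S^t ",
    r"C:\Users\test\rundll32.exe",
    r"rundll32.exe -exec a bypass ^T^e^S^t ",
    r"C:\Users\test\rundll32.exe $ -exec a bypass ^T^e^S^t ",
    r"C:\Users\test\rundll32.exe $",
    r"rundll32.exe $ -exec a bypass ^T^e^S^t ",
    r"rund1132.exe $ -exec a bypass ^T^e^S^t ",
    r"rund1132.exe $",
]

params = search_body["query"]["wildcard"]["cli"]
rx = wildcard_to_regex(params["value"], params["case_insensitive"])
# Wildcard patterns must match the entire stored value, hence fullmatch.
hits = [v for v in VALUES if rx.fullmatch(v)]

print(json.dumps(search_body))
print(hits)
```

Because the pattern is matched against the whole untokenized value, the trailing $ is significant, and only the value that contains rundll32.exe and ends in $ is returned.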