match_only_text and general necessity for case insensitive and exact match (search)

Please see the updated outline at https://github.com/elastic/ecs/issues/1837#issuecomment-1073573242. The description below can still be used to note some of the shortcomings of match_only_text.

Description

The match_only_text data type causes undesired search results when you want to perform accurate searches that are also case insensitive. This data type does not appear to be the best fit for security/log data. Searches for "ends with" or "starts with" do not respect positioning and, most importantly, lack accuracy: when searching for anything other than numbers/letters (e.g. $, ., {), those characters are ignored.

I understand the “solution” may be a custom analyzer or the wildcard type; however, the problem is that ECS is being applied without customers/users understanding this issue or changing the mappings. Thus the default, most widely used mappings are giving users inaccurate search results.

The desire for security use cases, and I assume most logging use cases, is a) case insensitivity and b) accurate/exact results.

I would recommend one of two things: a) adopting a community text analyzer (see https://github.com/neu5ron/es_stk), which has been adopted in projects such as Security Onion; or b) moving everything defined as match_only_text to wildcard.

Example

Test Data

I loaded some sample values into Elasticsearch with an explicitly defined mapping for match_only_text on the field cli (note: I used the field cli, but it can be any field, whether ECS or not, as long as that mapping is applied).

POST /_bulk
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe"""}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rundll32.exe   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """C:\Users\test\rundll32.exe $"""}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rundll32.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rund1132.exe $   -exec       a bypass ^T^e^S^t  """}
{ "index" : { "_index": "es_stk_test"} }
{"cli": """rund1132.exe $"""}

Result to Find

The value I want to find is C:\Users\test\rundll32.exe $

Search - in Human Form

I expect to find this result using the following logic (in human form):

contains rundll32.exe
followed by a $
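
The intent above can be written down as a small reference check. This is only a sketch in Python, not an Elasticsearch query: it assumes wildcard-style semantics where the pattern has to cover the whole value, lowercases both sides to get case insensitivity, and leans on the standard-library fnmatch module. The pattern string and the helper name `matches` are illustrative.

```python
from fnmatch import fnmatchcase

# Sample values from the bulk request (raw strings keep the backslashes literal).
DOCS = [
    r"C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe",
    r"rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $",
    r"rundll32.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $",
]


def matches(value: str, pattern: str) -> bool:
    """Case-insensitive, whole-value wildcard match ('*' = any run of characters)."""
    return fnmatchcase(value.lower(), pattern.lower())


# "contains rundll32.exe" and "followed by a $" at the end of the value.
PATTERN = "*rundll32.exe*$"
hits = [doc for doc in DOCS if matches(doc, PATTERN)]
print(hits)
```

Under these assumptions only the value C:\Users\test\rundll32.exe $ is returned, which is the single result the search is expected to produce.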

Search - in Elastic Form

After converting this logic into an actual Elastic query the syntax looks like: cli.text:"*rundll32.exe*\$"

Search - Results

The search returns 6 matches when there should be only 1. There is only one value where rundll32.exe is followed by a $ at the end. Not only does the search find results that don’t end in $, it also returns results that do not contain a $ at all.
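
One plausible way to see where the 6 matches come from is to model what an analyzed text field does with both the documents and the query. The sketch below is a rough stand-in written in Python, not Elasticsearch itself: it assumes the analyzer lowercases the text and keeps only runs of letters and digits (the real standard analyzer follows Unicode word-break rules and differs in other cases), and it treats the quoted query as a phrase of the surviving tokens. The function names are made up for illustration.

```python
import re

# Sample values from the bulk request (raw strings keep the backslashes literal).
DOCS = [
    r"C:\Users\test\rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe",
    r"rundll32.exe   -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $  -exec       a bypass ^T^e^S^t  ",
    r"C:\Users\test\rundll32.exe $",
    r"rundll32.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $   -exec       a bypass ^T^e^S^t  ",
    r"rund1132.exe $",
]


def analyze(text):
    """Approximate analysis (assumption): lowercase, keep only letter/digit runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens appear as consecutive tokens in doc_tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


# The quoted part of the query; '*', '\' and '$' do not survive analysis.
query_tokens = analyze(r"*rundll32.exe*\$")
hits = [doc for doc in DOCS if phrase_match(analyze(doc), query_tokens)]
print(query_tokens)
print(len(hits))
```

In this model the query collapses to the two tokens rundll32 and exe, so every value containing rundll32.exe matches regardless of the $, and the values with and without the trailing $ analyze to the same tokens.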

Follow along with the complete test

https://github.com/neu5ron/es_stk/wiki#testing-yourself

Issue Analytics

State:
Created 2 years ago
Reactions:1
Comments:7 (4 by maintainers)

Top GitHub Comments

neu5ron commented, Mar 24, 2022

Hey @ebeahan thanks for the reply. Especially appreciate some of the background/context for others. An additional article that may help some as well: https://socprime.com/blog/elastic-for-security-analysts-part-1-searching-strings/

I would like to clear up a few things for the discussion going forward. To preface: I apologize for causing confusion by using the word “inaccuracy” (I have since updated the title and some of the wording of the original issue). This should also clear up the confusion I caused about whether this was a standard analyzer issue.

Desired State

For (cyber) security and logging use cases there is a necessity for:

1) (Search) exact match, for things such as symbols, numbers, punctuation, etc. (i.e. anything that is not just words/letters).
2) (Search) case insensitive.
3) Aggregations (for visualizations, AKA anything outside of match/search). I think this mostly goes without saying, but as the point gets brought up later I wanted to note it.
4) Aggregation max character length. This is not to be confused with the maximum character length related to the ability to just search/match the data; it is about displaying/returning data in a visualization/ML/etc., which requires use of an aggregation. This is a separate discussion that has been noted in other ECS issues (https://github.com/elastic/ecs/issues/105) and is alleviated by the use of certain data types; however, when this discussion gets brought up it tends to take away from points 1) and 2).

Elasticsearch ECS Data Types

keyword

exact match: yes
case insensitive: no
aggregations: yes
aggregation max character length: limited at 32766

additional noteworthy info:

con:
- it is case sensitive; case-insensitive matching requires use of regex (which, worth mentioning, is not the full PCRE spec)
- aggregation max character length. keep for separate topic that should not be used as a weight in the discussion
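
To make the regex workaround concrete, here is a minimal Python sketch that turns a literal string into a case-insensitive pattern by expanding each ASCII letter into a two-character class. The helper name and the exact set of escaped characters are assumptions on my part (the set is meant to mirror the reserved characters of the Lucene regexp syntax), so check it against the documentation for your version before relying on it.

```python
import re

# Characters escaped with a backslash (assumed to match Lucene's reserved set).
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def case_insensitive_regex(literal: str) -> str:
    """Expand each ASCII letter to [xX] and escape reserved characters."""
    parts = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        elif ch in RESERVED:
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)


pattern = case_insensitive_regex("rundll32.exe")
print(pattern)  # [rR][uU][nN][dD][lL][lL]32\.[eE][xX][eE]

# Sanity check with Python's own regex engine, which accepts this subset.
print(bool(re.fullmatch(pattern, "RunDLL32.EXE")))
```

Since a regexp query has to match the whole keyword value, a "contains" search would additionally need .* on both sides of the generated pattern, which is part of why this approach gets unwieldy for everyday searching.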

match_only_text

exact match: no
case insensitive: yes
aggregations: no
aggregation max character length: n/a

additional noteworthy info:

con:
- will not exactly match anything that is not a letter/word (shown in the thread of this GitHub issue)

wildcard

exact match: yes
case insensitive: no
aggregations: yes
aggregation max character length: unlimited (relatively) at 2147483647

additional noteworthy info:

con:
- the case insensitive flag is not available in the main user-facing components of Kibana, such as Discover or Visualization.
pro:
- aggregation max character length. separate topic that should not be used as a weight in the discussion
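
For reference, the case insensitive flag mentioned above is set per query in the query DSL. The Python sketch below only builds and prints such a request body; it assumes a field named cli that is mapped as wildcard (or keyword), and the helper name is made up for illustration.

```python
import json


def wildcard_query(field: str, pattern: str, case_insensitive: bool = True) -> dict:
    """Build a wildcard query body with the case_insensitive parameter set."""
    return {
        "query": {
            "wildcard": {
                field: {
                    "value": pattern,
                    "case_insensitive": case_insensitive,
                }
            }
        }
    }


# Hypothetical field name; the pattern is the search from the original issue.
body = wildcard_query("cli", "*rundll32.exe*$")
print(json.dumps(body, indent=2))
```

A body like this can be sent to the _search endpoint directly, but as noted in the con above, there is no equivalent toggle when typing a search into Discover.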

ECS Data Type Usage

The 3 data types noted above are used within the ECS templates as follows (going off the main branch as of commit hash 7496470bf422451744cef8308c1782baab8086bf):

keyword: 985
- max character (byte) length set to 1024 or lower on 984 of the 985 fields.
match_only_text: 61
wildcard: 18

Data Types for Desired State

1) Exact match

It should go without saying this is solved through the use of wildcard OR keyword

2) Case insensitive

Solution is TBD… discussed in depth later. However, we know that keyword is not the solution and match_only_text seems like it was meant to be an option for case insensitive search, but as outlined it is not useful for security/cyber/logging use cases as it strips/ignores/removes symbols/punctuation/etc.

3) Aggregations

keyword or wildcard

This is relatively solved: in the current state of ECS, keyword or wildcard is already used where appropriate and for the vast majority of fields.

4) Aggregation max character length

separate topic

Solving Case Insensitive Search

The recommendation that users can implement their own custom (text) analyzer is great, but it’s just not happening, and as a result users have a false sense of security because they are running searches expecting results that they will never get. Even the use of wildcard as the solution has become problematic. I think it’s evident it is not the solution, given that it is only used 18 times despite its release over a year and a half ago.

You did mention “There are some early ideas about defining schemas for specific use-cases”, but I would like to say this case insensitive search situation has been around for 18+ months. Also, I would assume there are not a ton of ECS implementations outside of cyber. Or more to the point, cyber has always been the main use case (especially given the drivers/employees/etc. within Elastic who created/maintain/oversee/contributed to ECS). Sure, data normalization and a schema are not specific to cyber and therefore should obviously not be ruined for all other possible use cases, so correct me if I am wrong and ECS is not primarily cyber focused in relation to those early ideas.

On these deployments/uses of Elastic that don’t have this solved, one would think that it could be lack of training, pro services, subscriptions, or the like… However, I have seen this issue across 15 Elastic deployments in just the last year…across just about every possible scenario:

paid elastic subscriptions and OSS
ECE, self, and or Elastic Cloud
Professional Services provided by Elastic themselves, an internal team, or other 3rd Party
Very intelligent professionals who are well versed in (cyber) detection
Very intelligent professionals who are well versed with the Elastic stack
Industries such as Health, Gov/Fed, Transportation, University, Financial, etc

The most common thing is that users are just using the ECS templates and have no clue about case sensitive search restrictions or, more so, they think ECS has solved it given a few blog posts over the past year about wildcard and match_only_text. Not to mention that even if the ECS templates solved case insensitivity through wildcard, it is only used on roughly 2% of the fields.

The reason I bring up this non-technical background is that I think Elastic has not just the responsibility to solve this but also more than enough means to do so. Especially given the (cyber) community has solved this and is more than willing to help. We all just want to get to searching and using the data and helping people do the same.

With that said, is it possible to have a custom analyzer integrated into ECS? I think if it was possible to get the standard analyzer for text field types promoted into its very own data type in the foundational Elasticsearch mappings… then ECS using a custom analyzer does not seem outside the realm of possibility.
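
To give a feel for what such a custom analyzer could look like, here is a deliberately small Python sketch. It is not the es_stk analyzer and not anything ECS ships: it assumes a custom analyzer built from the built-in whitespace tokenizer and lowercase token filter, applied to a plain text field (the analyzer name ci_exact_sketch and the field cli are hypothetical). The `analyze` function is only a local approximation of what that analyzer would emit.

```python
import json

# Hypothetical index definition: whitespace tokenizer + lowercase filter.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ci_exact_sketch": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "cli": {"type": "text", "analyzer": "ci_exact_sketch"}
        }
    },
}


def analyze(text: str) -> list:
    """Local approximation: split on whitespace, then lowercase each token."""
    return text.lower().split()


print(json.dumps(INDEX_BODY, indent=2))
print(analyze(r"C:\Users\test\rundll32.exe $"))
```

With this kind of analysis the $ survives as its own token and the path keeps its punctuation, while case differences are still folded away, which is the combination of points 1) and 2) from the desired state.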

Etc

(data) type	type count
date	60
object	8
keyword	985
wildcard	18
long	118
ip	15
match_only_text	61
geo_point	8
boolean	21
flattened	11
scaled_float	3
constant_keyword	3
float	5
nested	13
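
As a quick check of the proportions, the counts in the table above can be turned into percentages. This is just arithmetic over the numbers as listed (a Python sketch; it assumes the table covers all field types in the templates).

```python
# Field counts per data type, copied from the table above.
FIELD_COUNTS = {
    "date": 60, "object": 8, "keyword": 985, "wildcard": 18, "long": 118,
    "ip": 15, "match_only_text": 61, "geo_point": 8, "boolean": 21,
    "flattened": 11, "scaled_float": 3, "constant_keyword": 3,
    "float": 5, "nested": 13,
}

total = sum(FIELD_COUNTS.values())


def share(data_type: str) -> float:
    """Percentage of all counted fields that use the given data type."""
    return 100 * FIELD_COUNTS[data_type] / total


print(total)
for name in ("keyword", "match_only_text", "wildcard"):
    print(f"{name}: {share(name):.1f}%")
```

On these numbers wildcard accounts for about 1.4% of the 1329 counted fields, keyword for about 74.1%, and match_only_text for about 4.6%.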

rwaight commented, Nov 15, 2022

Hey @neu5ron, is there a reason this was closed? It is something that should certainly be resolved, but I do not see PRs that would address the issues you have raised.

As a heads up, exposing case insensitivity in Kibana or KQL or EQL or Lucene does not fix the issue mentioned above with match_only_text field types.

I realize that https://github.com/elastic/kibana/issues/134143 is not meant to resolve the issues with match_only_text field types, but I still opened that issue in hopes of improving the experience within Kibana and providing users with the ability to toggle the case_insensitive option.