
Normalizers, Analyzers, etc aren't copied from fields when cloning Index

See original GitHub issue

Maybe related to #957?

My definitions of the ‘base’ Index along with all filters, analyzers, normalizers:

from elasticsearch_dsl import Index, normalizer, analyzer, char_filter, token_filter

autocomplete_filter = token_filter(
    'autocomplete', 'edgeNGram',
    min_gram=1,
    max_gram=20)

remove_leading_non_alphanum_char_filter = char_filter(
    'remove_leading_non_alphanum', 'pattern_replace',
    pattern=r"^(\W|_)+",  # raw string, so '\W' is not treated as a Python escape
    replacement="")

sorting_normalizer = normalizer(
    'sorting',
    filter=["lowercase", "asciifolding"])

default_analyzer = analyzer(
    'default',
    tokenizer="standard",
    filter=["standard", "lowercase", "asciifolding", "stop", "snowball"],
    char_filter=["html_strip"])

nostop_analyzer = analyzer(
    'nostop',
    tokenizer="standard",
    filter=["standard", "lowercase", "asciifolding"],
    stopwords=[],
    char_filter=["html_strip", remove_leading_non_alphanum_char_filter])

autocomplete_analyzer = analyzer(
    'autocomplete',
    tokenizer="standard",
    filter=["standard", "lowercase", "asciifolding", autocomplete_filter],
    stopwords=[],
    char_filter=["html_strip", remove_leading_non_alphanum_char_filter])

base_index = Index('base')
base_index.analyzer(default_analyzer)

A simple sample Index and Document that will generate the error that follows:

from elasticsearch_dsl import Document, field

# this clone is NOT an issue; the default analyzer settings etc. all transfer correctly
i = base_index.clone('user-related')

@i.document
class UserDoc(Document):
    name = field.Keyword(
        normalizer=sorting_normalizer,
        fields={"autocomplete": field.Text(
            analyzer=autocomplete_analyzer,
            search_analyzer=nostop_analyzer)
        })

i.to_dict() correctly yields the following:

{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "fields": {
            "autocomplete": {
              "analyzer": "autocomplete",
              "search_analyzer": "nostop",
              "type": "text"
            }
          },
          "normalizer": "sorting",
          "type": "keyword"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "stopwords": [],
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "autocomplete"
          ],
          "char_filter": [
            "html_strip",
            "remove_leading_non_alphanum"
          ]
        },
        "default": {
          "tokenizer": "standard",
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "stop",
            "snowball"
          ],
          "char_filter": [
            "html_strip"
          ]
        },
        "nostop": {
          "type": "custom",
          "tokenizer": "standard",
          "stopwords": [],
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "char_filter": [
            "html_strip",
            "remove_leading_non_alphanum"
          ]
        }
      },
      "normalizer": {
        "sorting": {
          "type": "custom",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "autocomplete": {
          "min_gram": 1,
          "max_gram": 20,
          "type": "edgeNGram"
        }
      },
      "char_filter": {
        "remove_leading_non_alphanum": {
          "replacement": "",
          "type": "pattern_replace",
          "pattern": "^(\\W|_)+"
        }
      }
    }
  }
}

But after cloning, the analysis settings are (incorrectly, I think?) missing. i.clone('user-related-20180905').to_dict() outputs the following (note that it does keep the default analyzer I added to base_index):

{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "keyword",
          "normalizer": "sorting",
          "fields": {
            "autocomplete": {
              "type": "text",
              "search_analyzer": "nostop",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "stop",
            "snowball"
          ],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}

To further illustrate, this is the output from calling .create() on the cloned index:

PUT http://localhost:9200/user-related-20180905-064032 [status:400 request:0.013s]
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/vagrant/esdocs/contrib/esdjango/run.py", line 25, in <module>
    run(DjangoController)
  File "/vagrant/esdocs/utils.py", line 101, in run
    controller.run_operation(cmd_parser=parser, **options)
  File "/vagrant/esdocs/controller.py", line 85, in run_operation
    getattr(self, "index_{}".format(action))(**options)
  File "/vagrant/esdocs/controller.py", line 122, in index_rebuild
    self._index_create(index, name, set_alias=False)
  File "/vagrant/esdocs/controller.py", line 169, in _index_create
    index.create(using=self.using)
  File "/home/ubuntu/.virtualenvs/myproject/lib/python3.5/site-packages/elasticsearch_dsl/index.py", line 220, in create
    self._get_connection(using).indices.create(index=self._name, body=self.to_dict(), **kwargs)
  File "/home/ubuntu/.virtualenvs/myproject/lib/python3.5/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/ubuntu/.virtualenvs/myproject/lib/python3.5/site-packages/elasticsearch/client/indices.py", line 88, in create
    params=params, body=body)
  File "/home/ubuntu/.virtualenvs/myproject/lib/python3.5/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "/home/ubuntu/.virtualenvs/myproject/lib/python3.5/site-packages/elasticsearch/connection/http_urllib3.py", line 186, in perform_request
    self._raise_error(response.status, raw_data)
  File "/home/ubuntu/.virtualenvs/myproject/lib/python3.5/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'analyzer [nostop] not found for field [autocomplete]')

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
honzakral commented, Sep 17, 2018

@i.document is about assigning the Document to the Index, not the other way around. We should certainly make this clearer in the docs. It is only useful when working with an Index object and, with types going away, there is no real reason to do that unless you are doing something very specific.

For your use case, please look again at the alias migration example; that is what I would recommend: no ambiguity, no extra decorators, and it works out of the box without the need to redeploy your application when a new index is introduced. Settings are governed by the template and can be changed at any time as well. If there is something missing from that example, please let me know.

Thank you!
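The alias migration pattern recommended above can be sketched roughly as follows. This is loosely modeled on the alias_migration example in the elasticsearch-dsl repository; ALIAS, PATTERN, and new_index_name are illustrative names, and es is assumed to be an elasticsearch-py client (templates registering the mappings/settings are assumed to exist separately):

```python
from datetime import datetime

ALIAS = "user-related"      # illustrative: the alias your Documents point at
PATTERN = ALIAS + "-*"      # concrete indices match this pattern

def new_index_name(now=None):
    # e.g. 'user-related-20180905064032'
    now = now or datetime.now()
    return "{}-{}".format(ALIAS, now.strftime("%Y%m%d%H%M%S"))

def migrate(es, move_data=True):
    """Create a fresh concrete index and atomically flip the alias to it.

    `es` is an elasticsearch-py client; an index template registered
    elsewhere is assumed to supply the mappings and analysis settings.
    """
    next_index = new_index_name()
    es.indices.create(index=next_index)
    if move_data:
        es.reindex(body={"source": {"index": ALIAS},
                         "dest": {"index": next_index}})
    # atomic swap: remove the alias from all old indices, add it to the new one
    es.indices.update_aliases(body={"actions": [
        {"remove": {"alias": ALIAS, "index": PATTERN}},
        {"add": {"alias": ALIAS, "index": next_index}},
    ]})
```

Documents then declare name = ALIAS in their class Index, so application code never needs to know which concrete index is live.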

0 reactions
jaddison commented, Sep 15, 2018

Thanks @HonzaKral. Again, I understand what you’re saying (although I’ll admit I’d glossed over the ability to pass in an index on Document operations).

That said - and ignoring my use case for now - is it reasonable to expect someone who uses the @<index>.document decorator to have to keep that index around to do operations elsewhere in code?

# in app/search_base.py
base_index = Index('base')
base_index.settings(...)

------------------

# in users/search.py:
from app.search_base import base_index
user_index = base_index.clone('user-related')
user_index.settings(...)  # user specific index settings

@user_index.document
class UserDoc(Document):
    id = field.Long()

------------------

# in users/views.py
from .search import user_index, UserDoc

def user_list(request):
  users = UserDoc.search(index=user_index._name).query(...).execute()
  # OR hardcoding
  users = UserDoc.search(index='user-related').query(...).execute()


def user_details(request, pk):
  users = UserDoc.get(pk, index=user_index._name)
  # OR hardcoding
  users = UserDoc.get(pk, index='user-related')

Where the same code without the decorator is more intuitive and less fragile:

# in search.py:
class UserDoc(Document):
    id = field.Long()

    class Index:
        name = 'user-related'

------------------

# in views.py
from .search import UserDoc

def user_list(request):
  users = UserDoc.search().query(...).execute()

def user_details(request, pk):
  users = UserDoc.get(pk)

If you still disagree, then by all means, please close this ticket. I may not understand why it is this way, but I’m willing to accept that it’s like that for a specific reason (although I would obviously still like to understand why, and why my suggestion isn’t valid - still, it’s your project, it’s of course your call).

The user must pick between:

  • having a base index (no settings repetition, etc) but repetition of the index name everywhere
  • copying the common index settings etc. into every Document's class Index, for the convenience of ignoring the index name throughout application code

Honestly, I won’t be too upset if you just close this ticket without replying. You’ve put up with a lot of my pushback. Cheers - and thanks for all the effort on the Python bindings!
