Failed parsing Markdown training section header with colon (:)
Rasa version: 1.4.2
Rasa X version (if used & relevant):
Python version: 3.7
Operating system (windows, osx, …): OS X
Issue: A recently introduced regex parses Markdown section headers incorrectly in training files. This seems related to the following recently introduced code:
def _find_section_header(self, line: Text) -> Optional[Tuple[Text, Text]]:
    """Checks if the current line contains a section header
    and returns the section and the title."""
    match = re.search(r"##\s*(.+):(.+)", line)
    if match is not None:
        return match.group(1), match.group(2)
    return None
This pattern performs a greedy match: the first group, (.+), extends to the last colon on the line rather than the first.
For a section header such as ## synonym:10:00 am, the section and value are reported as ('synonym:10', '00 am') instead of the expected ('synonym', '10:00 am').
This results in a failure to train.
Proposed Solution:
Change the regex to ##\s*(.+?):(.+)
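The difference between the two patterns can be reproduced outside Rasa. The following is a minimal standalone sketch, not the Rasa implementation: `find_section_header` is a hypothetical free function that mirrors the method quoted above, and the pattern is passed in as a parameter only so both variants can be compared side by side.

```python
import re
from typing import Optional, Tuple

# Pattern quoted in the issue: the first group is greedy.
GREEDY_PATTERN = r"##\s*(.+):(.+)"
# Proposed fix: the first group is lazy and stops at the first colon.
LAZY_PATTERN = r"##\s*(.+?):(.+)"


def find_section_header(line: str, pattern: str) -> Optional[Tuple[str, str]]:
    """Return (section, title) if the line is a section header, else None.

    Hypothetical helper for illustration; it mirrors the logic of the
    method quoted in the issue but is not part of Rasa.
    """
    match = re.search(pattern, line)
    if match is not None:
        return match.group(1), match.group(2)
    return None


header = "## synonym:10:00 am"

# Greedy: the section name swallows everything up to the last colon.
print(find_section_header(header, GREEDY_PATTERN))  # ('synonym:10', '00 am')

# Lazy: the section name ends at the first colon.
print(find_section_header(header, LAZY_PATTERN))    # ('synonym', '10:00 am')

# Headers with a single colon behave the same under both patterns.
print(find_section_header("## intent:greet", LAZY_PATTERN))  # ('intent', 'greet')
```

With the lazy quantifier, only the title may contain colons, which matches the allowed section names listed in the error message ('intent', 'synonym', 'regex', 'lookup'), none of which contains one.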
Error (including full traceback):
Traceback (most recent call last):
  File "/Users/ethan/src/doodle/svc-doodlebot/env/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/__main__.py", line 76, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/cli/train.py", line 76, in train
    kwargs=extract_additional_arguments(args),
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/train.py", line 45, in train
    kwargs=kwargs,
  File "uvloop/loop.pyx", line 1417, in uvloop.loop.Loop.run_until_complete
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/train.py", line 96, in train_async
    kwargs,
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/train.py", line 137, in _train_async_internal
    new_fingerprint = await model.model_fingerprint(file_importer)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/model.py", line 204, in model_fingerprint
    nlu_data = await file_importer.get_nlu_data()
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/importer.py", line 269, in get_nlu_data
    nlu_data = await asyncio.gather(*nlu_data)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/rasa.py", line 60, in get_nlu_data
    return utils.training_data_from_paths(self._nlu_files, language)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/utils.py", line 9, in training_data_from_paths
    training_datas = [loading.load_data(nlu_file, language) for nlu_file in paths]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/importers/utils.py", line 9, in <listcomp>
    training_datas = [loading.load_data(nlu_file, language) for nlu_file in paths]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/loading.py", line 67, in load_data
    data_sets = [_load(f, language) for f in files]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/loading.py", line 67, in <listcomp>
    data_sets = [_load(f, language) for f in files]
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/loading.py", line 138, in _load
    return reader.read(filename, language=language, fformat=fformat)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/formats/readerwriter.py", line 10, in read
    return self.reads(rasa.utils.io.read_file(filename), **kwargs)
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/formats/markdown.py", line 73, in reads
    self._set_current_section(header[0], header[1])
  File "/Users/ethan/src/doodle/svc-doodlebot/env/lib/python3.7/site-packages/rasa/nlu/training_data/formats/markdown.py", line 192, in _set_current_section
    "".format(section, "', '".join(available_sections))
ValueError: Found markdown section 'synonym:10' which is not in the allowed sections 'intent', 'synonym', 'regex', 'lookup'.
Command or request that led to error:
rasa train
Content of configuration file (config.yml) (if relevant):
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: EntitySynonymMapper
  - name: SklearnIntentClassifier
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
  - name: DucklingHTTPExtractor
    url: http://localhost:8000
    locale: en_US
    dimensions:
      - time
      - duration
    timezone: UTC
policies:
  - name: KerasPolicy
  - name: MappingPolicy
Content of domain file (domain.yml) (if relevant):
Content of training NLU Markdown:
## synonym:10:00 am
- @10:00 am
(Some extra before and after)
Top GitHub Comments
Closed in https://github.com/RasaHQ/rasa/pull/4718. Thanks!
Nope, just fork the repo, create a branch for your fix, and open a PR! 😃 I’ll assign you the issue.