anonymize strips spaces and newlines from text
Describe the bug
Anonymizing text eliminates spaces and newlines. For example, this text:
Let us know if you have any questions. Otherwise we can plan to discuss more on next week’s call.
Thanks!
John
John Doe | Marketing Manager, Paid Search | Marketing | Strategic Education, Inc. | 612.977.4912
Is parsed as:
Let us know if you have any questions. Otherwise we can plan to discuss more on <DATE_TIME> call.
Thanks!
<PERSON><PERSON> | Marketing Manager, Paid Search | Marketing | Strategic Education, Inc. | <PHONE_NUMBER>
As you can see, the newlines and spaces between the two PERSON entities were eliminated, leaving the text malformed.
To Reproduce
Code to reproduce:
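from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

def test_new_line():
    # text3 is assumed to hold the sample text shown above
    analyzer = AnalyzerEngine()
    results = analyzer.analyze(text3, "en")
    engine = AnonymizerEngine()
    anony = engine.anonymize(text3, results)
    print(anony.text)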
Expected behavior
The text should maintain the spaces and newlines between entities.
Issue Analytics
- State:
- Created 2 years ago
- Reactions:2
- Comments:7 (4 by maintainers)
Hey @navalev I would like to work on this issue.
Just did a quick check and it seems like this is an issue with how spaCy detects entities in text.
Returns:
So the newline characters are actually part of the detected entity. This might be related to this issue.
Either preprocessing of the text or postprocessing of the list of RecognizerResults could help overcome this issue.
Tested with this environment:
============================== Info about spaCy ==============================
spaCy version 3.0.6
Platform Windows-10-10.0.19041-SP0
Python version 3.7.9
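
The postprocessing idea suggested above could look roughly like the sketch below. It is only an illustration, not the fix that was merged: `Span` is a hypothetical stand-in that mirrors the entity_type/start/end/score fields of a RecognizerResult so the example runs without Presidio installed, and `trim_whitespace` is a made-up helper name. The character offsets in the sample are also assumed, chosen to imitate an entity that swallowed a trailing newline.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a RecognizerResult (same field names)."""
    entity_type: str
    start: int
    end: int
    score: float


def trim_whitespace(text: str, spans: List[Span]) -> List[Span]:
    """Shrink each span so it neither starts nor ends on whitespace."""
    trimmed = []
    for span in spans:
        start, end = span.start, span.end
        # Move the start forward past leading whitespace.
        while start < end and text[start].isspace():
            start += 1
        # Move the end backward past trailing whitespace.
        while end > start and text[end - 1].isspace():
            end -= 1
        # Keep the span only if something other than whitespace remains.
        if start < end:
            trimmed.append(replace(span, start=start, end=end))
    return trimmed


text = "Thanks!\nJohn\nJohn Doe | Marketing Manager"
# Assumed detector output: the first PERSON span includes the newline after "John".
spans = [Span("PERSON", 8, 13, 0.85), Span("PERSON", 13, 21, 0.85)]
fixed = trim_whitespace(text, spans)
```

With the spans trimmed this way, replacing each one with a placeholder would leave the newline between the two names in place. Applying the same step to real analyzer results would mean adjusting `start` and `end` on each RecognizerResult before passing the list to the anonymizer.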