anonymize strips spaces and newlines from text

Describe the bug: Anonymizing text removes spaces and newlines between adjacent entities. For example, this text:

Let us know if you have any questions. Otherwise we can plan to discuss more on next week’s call.

Thanks!
John

John Doe | Marketing Manager, Paid Search | Marketing | Strategic Education, Inc. | 612.977.4912

Is parsed as:

Let us know if you have any questions. Otherwise we can plan to discuss more on <DATE_TIME> call.

Thanks!
<PERSON><PERSON> | Marketing Manager, Paid Search | Marketing | Strategic Education, Inc. | <PHONE_NUMBER>

As you can see, the newlines and spaces between the two person entities were removed, leaving the text malformed.

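To reproduce:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# text3 is assumed to hold the sample email text shown above.
def test_new_line():
    analyzer = AnalyzerEngine()
    results = analyzer.analyze(text=text3, language="en")
    engine = AnonymizerEngine()
    anony = engine.anonymize(text=text3, analyzer_results=results)
    print(anony.text)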

Expected behavior: The text should keep the spaces and newlines between entities.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
bhushan-borole commented, Jul 23, 2021

Hey @navalev I would like to work on this issue.

0 reactions
omri374 commented, Sep 9, 2021

Just did a quick check and it seems like this is an issue with how spaCy detects entities in text.

import spacy

nlp = spacy.load("en_core_web_lg")

text = """
Let us know if you have any questions. Otherwise we can plan to discuss more on next week’s call.

Thanks!
John

John Doe | Marketing Manager, Paid Search | Marketing | Strategic Education, Inc. | 612.977.4912
"""

doc = nlp(text)
[(ent.start, ent.end, ent.text, ent.label_) for ent in doc.ents]

Returns:

[(18, 21, 'next week’s', 'DATE'),
 (27, 29, 'John\n\n', 'PERSON'),
 (29, 33, 'John Doe | Marketing', 'PERSON')]

So the newline characters are actually part of the detected entity. This might be related to this issue.

Either preprocessing of the text or postprocessing of the list of RecognizerResults could help overcome this issue.
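The postprocessing option could look roughly like the sketch below. It is not taken from the thread or from Presidio itself: `Span`, `trim_whitespace`, and `redact` are hypothetical names, and `Span` only stands in for a detection result carrying character offsets and an entity type. Note that `ent.start` and `ent.end` in the spaCy output above are token indices, whereas this sketch assumes character offsets (spaCy exposes those as `ent.start_char` and `ent.end_char`).

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detected entity (character offsets)."""
    start: int
    end: int
    entity_type: str


def trim_whitespace(text: str, spans: List[Span]) -> List[Span]:
    """Shrink each span so it neither starts nor ends on whitespace."""
    trimmed = []
    for span in spans:
        start, end = span.start, span.end
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:  # drop spans that contained only whitespace
            trimmed.append(replace(span, start=start, end=end))
    return trimmed


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with <ENTITY_TYPE>, right to left so offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "Thanks!\nJohn\n\nJohn Doe | Marketing Manager"
# Assumed offsets mimicking the result above, where "John\n\n" is one entity.
spans = [Span(8, 14, "PERSON"), Span(14, 22, "PERSON")]

print(repr(redact(text, spans)))                         # newlines swallowed
print(repr(redact(text, trim_whitespace(text, spans))))  # newlines preserved
```

Trimming only repairs entity boundaries that spill into surrounding whitespace. It does not correct other model errors, such as a PERSON span that extends past the actual name.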

Tested with this environment:

============================== Info about spaCy ==============================

spaCy version 3.0.6
Platform Windows-10-10.0.19041-SP0
Python version 3.7.9
