Incorrect endlines handling with compressed input
Problem: I tried to read gzipped JSONL with smart_open and got stuck: smart_open handles line endings incorrectly. It takes line breaks from inside the JSON content instead of the "real" end of line, which splits a single record across several lines and breaks the JSONL.
Input data:
echo '{"content":"Reptilian and Dragon like Encounters | EveLorgen.com\nEveLorgen.com\n& The Alien Love Bite\nSearch\nMain menu\nSkip to primary content\nHome\nNews\nArticles\nAlien Abduction\nAlien Love Bite Related\nAlien or Demonic\nAnomalous Trauma\nEmotional and Psychic Vampirism\nMedical and Scientific Aspects of Alien Abduction\nMilitary Abduction (MILABS) and Reptilians\nMind Control\nMiscellaneous\nPoetry and Mystic Prose\nPsychology and Relationships\nSpiritual Warfare and the Human Soul\nBooks\nDrawings\nRadio\nVideos\nSubscribe\nBio and Colleagues\nContact\nTestimonials\nPost navigation\n← Previous Next →\nReptilian and Dragon like Encounters\nPosted on May 10, 2013 by eve\t\nThe first article link is another interview with Matt R, on his reptilian encounters and DNA activations. The reptilians are very particular about pedigrees and will follow these bloodlines like hound dogs. Here is an exerpt of Matt’s article:\n“Reptilians often inform their abductees that they are descended from reptilian bloodlines. They are very specific about the nature of this pedigree. So specific, that they are able to determine which abductees is from what reptilian family line. Joe Montaldo actually did a show on this topic a few years ago http://www.youtube.com/watch?v=EiuTqCDE0zY.”\nThe full article can be found here: http://naturalplane.blogspot.com/2013/03/here-be-dragons-new-katrina-abductions.html\nThis next article , “A Summons to Appear: Blood of Dragons, Part 2” is the sequel to Ken Bakeman’s encounter with several entities who engaged him in a forced baptism like ritual. 
The primary beings involved in this baptism encounter is a toad like reptilian, a mantis type creature and eyeless reptilians in robes, as well as the royal large tall Dragon like beings with wings.\nhttp://www.kenbakeman.com/reptilian_baptism_p2.html\nBe Sociable, Share!\nTweet\nThis entry was posted in Alien Abduction, Military Abduction (MILABS) and Reptilians, News and tagged dragon bloodlines, dragons, milabs, reptilians by eve. Bookmark the permalink.\t\nTerms and Conditions of www.evelorgen.com Website Material:\nThe content written by myself or other authors, and people I have interviewed are for information only, intended for the benefit of people seeking truth, freedom, personal growth and expansion of awareness. I may not agree with all content or opinions of other contributing authors or interviewees.\nThis website as an independent “entity” shall not permit and is protected from any malevolent intended attack, to undermine, subvert, harm or intent of any strategy of attack to be permitted to affect myself, family members, colleagues, contributors to my web site or any loved ones that have ties to me on all levels, and all dimensions of time.\nAnyone, or group who goes to my web site to partake in reading the information can not use it to harm anyone or anything, or use it in any way whatsoever for purposes of deception, harm or any agreements of entrapment or snares of any kind. I hold this to be in effect on all levels and dimensions of time.\nDeclaration of NON CONSENT FOR INTERFERENCE:\nLet it be known, I do not consent to any agreement of entrapment that bears intention to deceive, misinform, manipulate, exploit, control, steal, harvest, seduce, harm or negatively influence my being, in mind, soul, spirit, body and physical place of habitation, business, website or published works in any way across all levels, dimensions and time, whether they are fabricated linear or synthetic creations or times on all levels and dimensions.
Through my not consenting, I intend protection from harm and maintain neutrality, so that my presence of being honors Truth, compassion, wisdom, harmony, healing, constant awakening and life, so as to not be trapped, to the best of my ability in every situation.\nI do not consent to false limiting beliefs or false soul “programs” driving my body and consciousness, but rather my highest Spirit’s truth within without limitation as a Creator as integrated mind, soul and spirit of original Primordial consciousness.
Let it be known that by my choice to NOT CONSENT to any agreement of entrapment on any level, on all levels, across all dimensions and for all time, it is in effect now and forevermore. I hold that such is true and in effect, that any such agreement of entrapment, deception, and harmful intention, now be DEEMED null and void based on the intention of its creator to harm and not honor my life, my sovereign being and free will.
No singular or collective entity, or artificial intelligence is under any circumstances given permission (of malintent) to enter my Universe, life, dimensions, levels or time. If there are such attempts to ignore the LAW, they are responsible for one thousand times the consequences of that breach in self-destruction—and are fully legally responsible for their choices. The choice given is to not interfere or accept the consequences as stated. Should you choose to override our LAW, knowing the full terms and conditions stated, I in no way can be held responsible or harmed for any choice that breaches my LAW on any level, on all dimensions across all times and future cycles of time. I claim the Law and I Am the Law. I forbid any singular or collective entities to attempt to breach my Law and Not Consent to my LAW, and therefore am protected from entering any Game, or ANY and ALL Games set out to ensnare me out of my own SOVEREIGN BEING. They will bring upon themselves their own intention in harm.\nI HOLD THIS TO BE IN EFFECT IMMEDIATELY ON ALL LEVELS AND ALL DIMENSIONS OF TIME AND SPACE, PAST PRESENT AND FOR THE FUTURE CYCLES OF TIME.\"\nI do not offer legal, medical, psychiatric or clinical psychological diagnosis and therefore am not liable for any claims against such.\nProudly powered by WordPress\n"}' > 1.txt
cat 1.txt | gzip > 1.txt.gz
Code:
from smart_open import smart_open
import gzip

with smart_open("1.txt", "r") as infile:
    num_lines = sum(1 for _ in infile)
assert num_lines == 1  # correct

with gzip.open("1.txt.gz", "r") as infile:
    num_lines = sum(1 for _ in infile)
assert num_lines == 1  # correct

with smart_open("1.txt.gz", "r") as infile:
    num_lines = sum(1 for _ in infile)
assert num_lines == 1, num_lines  # wrong, num_lines=4
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:5 (1 by maintainers)
My slight preference would be to mimic open, yes. But it's not important either way (not worth any massive refactoring IMO).
I think the culprit here is the Unicode line separator character (U+2028) hiding in your data.
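One way to check whether a piece of text contains such a character is to scan it for everything that str.splitlines() treats as a line boundary apart from \n and \r. This is only a sketch: the helper name find_extra_line_breaks and the sample string are made up for illustration and are not part of smart_open.

```python
# Characters, besides \n and \r, that str.splitlines() treats as line boundaries.
EXTRA_LINE_BREAKS = "\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"


def find_extra_line_breaks(text):
    """Return (index, code point) pairs for each extra line-break character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in EXTRA_LINE_BREAKS
    ]


# Hypothetical sample: one JSONL record with a raw U+2028 inside the string.
sample = '{"content": "forevermore.\u2028No singular"}\n'
hits = find_extra_line_breaks(sample)
print(hits)
```

An empty result means the text should split the same way under both str.splitlines() and plain newline-based iteration.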
The reason why this affects smart_open is: we use the codecs module of the standard library to perform byte-to-text decoding. Here’s how that module performs:
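A minimal sketch of that behaviour, using only the standard library. The payload is a made-up one-record example, and this is not smart_open's actual code. It compares line iteration over a codecs stream reader with line iteration over io.TextIOWrapper, which is what the built-in open() uses in text mode.

```python
import codecs
import io
import json

# Made-up one-record JSONL payload with a raw U+2028 inside the string value.
payload = '{"content": "first part\u2028second part"}\n'.encode("utf-8")

# codecs stream reader: line iteration is built on str.splitlines(),
# which treats U+2028 as a line boundary.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(payload)))

# io.TextIOWrapper: only \n, \r and \r\n end a line.
io_lines = list(io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8"))

print(len(codecs_lines), len(io_lines))

# The line produced by TextIOWrapper is still a complete JSON record.
record = json.loads(io_lines[0])
```

The codecs reader yields more lines than there are records, so each fragment fails to parse as JSON, while the TextIOWrapper line parses cleanly. Assuming the input is UTF-8, reading the file with gzip.open("1.txt.gz", "rt", encoding="utf-8") goes through io.TextIOWrapper and may serve as a workaround.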