Line splitting on \u2028 from S3
Problem description
Be sure your description clearly answers the following questions:
- What are you trying to achieve?
Reading lines of text that contain the character \u2028 from S3 produces different output than reading them from the local disk. I would like the output from S3 to match the output from the local disk.
- What is the expected result?
Lines of text from the local disk are not split on a \u2028 character in a string of text.
- What are you seeing instead?
Lines of text from S3 are split on a \u2028 character in a string of text.
Steps/code to reproduce the problem
In order for us to be able to solve your problem, we have to be able to reproduce it on our end. Without reproducing the problem, it is unlikely that we’ll be able to help you.
Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal (minimal reproducible example).
from pathlib import Path
import smart_open

# create test file (the with block closes the handle so the data is flushed)
with Path('/aws_home/bad_split_text.txt').open('w', encoding='utf-8') as f:
    print('C123 Overview:\u2028 Earl', file=f)
# then upload bad_split_text.txt to S3
c1 = [l for l in smart_open.open('/aws_home/bad_split_text.txt', encoding='utf-8', newline='\n')]
c2 = [l for l in smart_open.open('s3://mybucket/bad_split_text.txt', encoding='utf-8', newline='\n')]
print(repr(c1))  # ['C123 Overview:\u2028 Earl\n']
print(repr(c2))  # ['C123 Overview:\u2028', ' Earl\n']
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
Linux-5.4.0-1029-aws-x86_64-with-glibc2.10
Python 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52) [GCC 7.5.0]
smart_open 3.0.0
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
I found some discussion of this at https://bugs.python.org/issue12855 which points to https://docs.python.org/3/library/stdtypes.html#str.splitlines for the list of code points that codecs treats as newlines. It also confirms my suspicion that the codecs module cannot be used to implement this properly. However, io.TextIOWrapper seems to be available in Python 2.7 as well, so I would suggest just replacing the codecs-based reader with io.TextIOWrapper in _encoding_wrapper().
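The difference described above can be reproduced without S3. The following is a minimal sketch, not smart_open's actual code: it assumes the remote path goes through a codecs stream reader, and it uses io.BytesIO as a stand-in for the downloaded bytes. It compares iterating a codecs reader with iterating an io.TextIOWrapper opened with newline='\n'.

```python
import codecs
import io

# The same text as in the report, encoded the way it would be stored in the file.
data = 'C123 Overview:\u2028 Earl\n'.encode('utf-8')

# codecs stream readers find line ends with str.splitlines(), which treats
# U+2028 (LINE SEPARATOR) as a line boundary.
codecs_lines = list(codecs.getreader('utf-8')(io.BytesIO(data)))

# io.TextIOWrapper with newline='\n' ends a line only at '\n'.
wrapper_lines = list(
    io.TextIOWrapper(io.BytesIO(data), encoding='utf-8', newline='\n')
)

print(repr(codecs_lines))   # ['C123 Overview:\u2028', ' Earl\n']
print(repr(wrapper_lines))  # ['C123 Overview:\u2028 Earl\n']
```

The first result matches what the report shows for S3, and the second matches the local-disk result.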
Thanks. When reading as binary for the remote I can now get the same results as the local file.
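The comment above does not show the code used for the binary workaround, so the following is only one possible shape of it. The helper name iter_text_lines is hypothetical, and io.BytesIO stands in for a stream opened in binary mode (for example with smart_open.open(..., 'rb')). Binary iteration splits on the b'\n' byte only, so decoding each line afterwards leaves \u2028 inside the line.

```python
import io


def iter_text_lines(binary_stream, encoding='utf-8'):
    # Hypothetical helper: iterate a binary stream (lines end at the newline
    # byte only) and decode each line separately.
    for raw_line in binary_stream:
        yield raw_line.decode(encoding)


# Stand-in for the remote object opened in binary mode.
remote = io.BytesIO('C123 Overview:\u2028 Earl\nsecond line\n'.encode('utf-8'))

lines = list(iter_text_lines(remote))
print(repr(lines))  # ['C123 Overview:\u2028 Earl\n', 'second line\n']
```

This works for UTF-8 because the encoded form of U+2028 contains no newline byte. Other encodings, such as UTF-16, would need a different approach.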