Line splitting on \u2028 from S3

Problem description

What are you trying to achieve?

Reading lines of text that contain a \u2028 character from S3 produces different output than reading the same file from local disk. I would like the output from S3 to match the output from local disk.

What is the expected result?

Lines read from local disk are not split on a \u2028 character inside the text.

What are you seeing instead?

Lines read from S3 are split on a \u2028 character inside the text.

Steps/code to reproduce the problem

from pathlib import Path
import smart_open

# create the test file (the with block closes it so the content is flushed)
with Path('/aws_home/bad_split_text.txt').open('w', encoding='utf-8') as f:
    print('C123 Overview:\u2028 Earl', file=f)
# then upload bad_split_text.txt to S3

# read the same content from local disk and from S3
c1 = [l for l in smart_open.open('/aws_home/bad_split_text.txt', encoding='utf-8', newline='\n')]
c2 = [l for l in smart_open.open('s3://mybucket/bad_split_text.txt', encoding='utf-8', newline='\n')]

print(repr(c1)) # ['C123 Overview:\u2028 Earl\n']
print(repr(c2)) # ['C123 Overview:\u2028', ' Earl\n']

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)

Linux-5.4.0-1029-aws-x86_64-with-glibc2.10
Python 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52) [GCC 7.5.0]
smart_open 3.0.0

Issue Analytics

State:
Created 3 years ago
Comments:7 (4 by maintainers)

Top GitHub Comments

markopy commented, Jan 10, 2021

I found some discussion of this at https://bugs.python.org/issue12855 which points to https://docs.python.org/3/library/stdtypes.html#str.splitlines for the list of code points that codecs treats as newlines:

Representation	Description
\n	Line Feed
\r	Carriage Return
\r\n	Carriage Return + Line Feed
\v or \x0b	Line Tabulation
\f or \x0c	Form Feed
\x1c	File Separator
\x1d	Group Separator
\x1e	Record Separator
\x85	Next Line (C1 Control Code)
\u2028	Line Separator
\u2029	Paragraph Separator
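
As a quick check of the table above, the following sketch confirms that str.splitlines() treats every listed sequence as a line boundary, while splitting on '\n' alone leaves \u2028 inside the line. The sample string is the one from the issue; the variable names are only illustrative.

```python
# Every sequence from the table above; str.splitlines() breaks on each of them.
boundaries = ['\n', '\r', '\r\n', '\x0b', '\x0c', '\x1c', '\x1d', '\x1e',
              '\x85', '\u2028', '\u2029']

for sep in boundaries:
    parts = ('left' + sep + 'right').splitlines()
    assert parts == ['left', 'right'], repr(sep)

text = 'C123 Overview:\u2028 Earl\n'

# splitlines() breaks at U+2028 as well as at the trailing line feed.
lines_all = text.splitlines(keepends=True)

# Splitting on '\n' only keeps U+2028 inside the first element.
lines_lf = text.split('\n')

print(repr(lines_all))
print(repr(lines_lf))
```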

It also confirms my suspicion that the codecs module cannot be used to implement this properly. However, io.TextIOWrapper seems to be available in Python 2.7 as well, so I would suggest just replacing

    if encoding is None:
        encoding = DEFAULT_ENCODING

    kw = {'errors': errors} if errors else {}
    if mode[0] == 'r' or mode.endswith('+'):
        fileobj = codecs.getreader(encoding)(fileobj, **kw)
    if mode[0] in ('w', 'a') or mode.endswith('+'):
        fileobj = codecs.getwriter(encoding)(fileobj, **kw)

with

    if encoding is None:
        encoding = DEFAULT_ENCODING
    fileobj = io.TextIOWrapper(fileobj, encoding=encoding, errors=errors, newline=newline)

in _encoding_wrapper().
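
The difference between the two wrappers can be reproduced without S3. This is a minimal sketch, assuming an in-memory io.BytesIO as a stand-in for the remote stream; it is not smart_open's actual code path, only the two standard-library wrappers discussed in this comment.

```python
import codecs
import io

# Same content as the test file from the issue, encoded as UTF-8 bytes.
raw = 'C123 Overview:\u2028 Earl\n'.encode('utf-8')

# The codecs stream reader finds line ends with str.splitlines(),
# so U+2028 terminates a line.
codecs_lines = list(codecs.getreader('utf-8')(io.BytesIO(raw)))

# io.TextIOWrapper with newline='\n' only breaks lines on line feeds.
wrapper_lines = list(
    io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8', newline='\n')
)

print(repr(codecs_lines))   # two elements, split at U+2028
print(repr(wrapper_lines))  # one element, U+2028 kept inside the line
```

The io.BytesIO object is only a placeholder here; any binary file-like object that io.TextIOWrapper accepts should behave the same way.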

chrisfleisch commented, Nov 13, 2020

Thanks. When reading the remote file as binary, I can now get the same results as with the local file.
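
A minimal sketch of that binary-mode workaround, assuming an io.BytesIO stand-in for the object that smart_open.open('s3://mybucket/bad_split_text.txt', 'rb') would return. Iterating a binary stream splits on b'\n' only, so each line can be decoded afterwards.

```python
import io

# Hypothetical stand-in for the remote file opened in binary mode ('rb').
remote = io.BytesIO('C123 Overview:\u2028 Earl\n'.encode('utf-8'))

# Binary iteration yields one bytes object per b'\n'-terminated line;
# decoding afterwards leaves U+2028 inside the line.
lines = [raw_line.decode('utf-8') for raw_line in remote]

print(repr(lines))
```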
