Line splitting on \u2028 from S3

Problem description

Be sure your description clearly answers the following questions:

  • What are you trying to achieve?

Reading lines of text with a character \u2028 from S3 produces different output than from the local disk. I would like the output from S3 to match what happens from local disk.

  • What is the expected result?

Lines of text from the local disk are not split on a \u2028 character in a string of text.

  • What are you seeing instead?

Lines of text from S3 are split on a \u2028 character within a string of text.

Steps/code to reproduce the problem

In order for us to be able to solve your problem, we have to be able to reproduce it on our end. Without reproducing the problem, it is unlikely that we’ll be able to help you.

Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal (minimal reproducible example).

from pathlib import Path
import smart_open

# create test file
with Path('/aws_home/bad_split_text.txt').open('w', encoding='utf-8') as f:
    print('C123 Overview:\u2028 Earl', file=f)
# then upload bad_split_text.txt to S3

with smart_open.open('/aws_home/bad_split_text.txt', encoding='utf-8', newline='\n') as f:
    c1 = list(f)
with smart_open.open('s3://mybucket/bad_split_text.txt', encoding='utf-8', newline='\n') as f:
    c2 = list(f)

print(repr(c1)) # ['C123 Overview:\u2028 Earl\n']
print(repr(c2)) # ['C123 Overview:\u2028', ' Earl\n']

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)

Linux-5.4.0-1029-aws-x86_64-with-glibc2.10
Python 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52) [GCC 7.5.0]
smart_open 3.0.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
markopy commented, Jan 10, 2021

I found some discussion of this at https://bugs.python.org/issue12855 which points to https://docs.python.org/3/library/stdtypes.html#str.splitlines for the list of code points that codecs treats as newlines:

Representation    Description
\n                Line Feed
\r                Carriage Return
\r\n              Carriage Return + Line Feed
\v or \x0b        Line Tabulation
\f or \x0c        Form Feed
\x1c              File Separator
\x1d              Group Separator
\x1e              Record Separator
\x85              Next Line (C1 Control Code)
\u2028            Line Separator
\u2029            Paragraph Separator
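
The table can be checked directly against str.splitlines(). The following is a small sketch using only built-in string methods; the sample string is taken from the reproduction above, and the variable names are illustrative only.

```python
# Every separator from the table above.
SEPARATORS = ['\n', '\r', '\r\n', '\v', '\f', '\x1c', '\x1d', '\x1e',
              '\x85', '\u2028', '\u2029']

# Each separator splits 'a<sep>b' into exactly two lines.
split_counts = {sep: len(('a' + sep + 'b').splitlines()) for sep in SEPARATORS}

text = 'C123 Overview:\u2028 Earl\n'

# str.splitlines() treats U+2028 as a line boundary ...
by_splitlines = text.splitlines()

# ... while splitting on '\n' alone keeps U+2028 inside the line.
by_newline = text.rstrip('\n').split('\n')

print(by_splitlines)
print(by_newline)
```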

That discussion also confirms my suspicion that the codecs module cannot be used to implement this properly. However, io.TextIOWrapper seems to be available in Python 2.7 as well, so I would suggest just replacing

    if encoding is None:
        encoding = DEFAULT_ENCODING

    kw = {'errors': errors} if errors else {}
    if mode[0] == 'r' or mode.endswith('+'):
        fileobj = codecs.getreader(encoding)(fileobj, **kw)
    if mode[0] in ('w', 'a') or mode.endswith('+'):
        fileobj = codecs.getwriter(encoding)(fileobj, **kw)

with

    if encoding is None:
        encoding = DEFAULT_ENCODING
    fileobj = io.TextIOWrapper(fileobj, encoding=encoding, errors=errors, newline=newline)

in _encoding_wrapper()
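
The difference between the two wrappers can be reproduced without S3. The sketch below uses only the standard library and assumes an in-memory io.BytesIO as a stand-in for the remote stream; it does not call smart_open itself, so it illustrates the behaviour of the two wrappers rather than the library's actual code path.

```python
import codecs
import io

raw = 'C123 Overview:\u2028 Earl\n'.encode('utf-8')

# codecs-based reader: iterating calls StreamReader.readline(), which
# finds line ends with str.splitlines() and so also breaks on U+2028.
codecs_lines = list(codecs.getreader('utf-8')(io.BytesIO(raw)))

# io.TextIOWrapper with newline='\n': only '\n' terminates a line.
wrapper_lines = list(
    io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8', newline='\n')
)

print(codecs_lines)
print(wrapper_lines)
```

The codecs result matches the split output reported for the S3 path, and the TextIOWrapper result matches the local-disk output.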

1 reaction
chrisfleisch commented, Nov 13, 2020

Thanks. When I read the remote file as binary, I now get the same results as from the local file.
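
The exact code used for this workaround is not shown in the thread. One way it might look is sketched below: open the object in binary mode and do the decoding yourself with io.TextIOWrapper. The helper name iter_text_lines is hypothetical, and io.BytesIO stands in for the stream that smart_open.open('s3://mybucket/bad_split_text.txt', 'rb') would return.

```python
import io

def iter_text_lines(binary_stream, encoding='utf-8'):
    """Hypothetical helper: decode a binary stream, ending lines only at LF."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline='\n')
    for line in text_stream:
        yield line

# Stand-in for the binary stream of the remote object (assumption: the real
# stream would come from smart_open.open(..., 'rb')).
remote = io.BytesIO('C123 Overview:\u2028 Earl\n'.encode('utf-8'))

lines = list(iter_text_lines(remote))
print(lines)
```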
