pd.read_fwf removes leading and trailing whitespace
Code Sample, a copy-pastable example if possible
from io import StringIO
import pandas as pd
data = u""" a bbb
ccdd """
df = pd.read_fwf(StringIO(data), widths=[3, 3], header=None)
The output is
>>> df.iloc[0,0]
u'a'
Expected Output
u' a '
Problem description
Apparently, leading and trailing whitespace is removed, but I want to keep it. Passing dtype or converters options does not solve the problem. Is this expected behaviour?
I do not think this is intended, because if we implement the same example with pd.read_csv(), whitespace is preserved.
from io import StringIO
import pandas as pd
data = u""" a ,bbb
cc,dd """
df = pd.read_csv(StringIO(data), header=None)
>>> df.iloc[0, 0]
' a '
For consistency, behaviour should be identical.
The problem is also mentioned on Stack Overflow (https://stackoverflow.com/questions/41558138/pandas-read-fwf-removing-leading-and-trailing-whitespace).
Output of pd.show_versions()
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Reactions:3
- Comments:8 (2 by maintainers)
Top GitHub Comments
I know this issue is old and closed, but it’s the only place I could find where it’s been discussed. Is there a way to prevent read_fwf from trimming whitespace? In my particular case I’m trying to split a fixed-width string based on the index of the character in that string. If the first characters are whitespace then this breaks the indexing.
The way I see it, a fixed-width file is one where the character width of each column is fixed, regardless of whether the content of each ‘cell’ occupies the full width or not. Either way, there are databases that follow this format, so it would perhaps be good to have an option to switch stripping on or off.
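For anyone who only needs to split on character positions, one option that does not depend on read_fwf at all is to slice each line yourself. This is just a rough sketch: read_fixed_width is a made-up helper name, not a pandas function, and it assumes every column should keep its padding exactly as it appears in the file.

```python
import pandas as pd


def read_fixed_width(text, widths):
    """Hypothetical helper: slice each line at fixed offsets, keeping padding."""
    # Convert the widths into half-open (start, end) character offsets.
    offsets = []
    start = 0
    for width in widths:
        offsets.append((start, start + width))
        start += width

    rows = []
    for line in text.splitlines():
        # Pad short lines with spaces so every field has its full width.
        padded = line.ljust(start)
        rows.append([padded[a:b] for a, b in offsets])
    return pd.DataFrame(rows)


data = " a bbb\nccdd  "
df = read_fixed_width(data, widths=[3, 3])
print(repr(df.iloc[0, 0]))
```

Since nothing is stripped, character indices into each field line up with the original file.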
I don’t think this is a bug - since fixed width files are by definition white-space padded, stripping that whitespace is a very sane default and probably what most people want.
That said, I think it would be reasonable to add an option to support this.
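In the meantime, a workaround that is sometimes suggested (for example in the Stack Overflow thread linked above) is to pass the existing delimiter argument. read_fwf documents it as the set of filler characters, so leaving the space character out of it should keep the padding. Treat this as a sketch rather than a guaranteed fix: it relies on how read_fwf currently strips fields and I have not checked it against every pandas version.

```python
from io import StringIO

import pandas as pd

data = " a bbb\nccdd  "

# Assumption: read_fwf strips only the characters listed in `delimiter`
# (plus line endings), so omitting the space keeps leading/trailing spaces.
df = pd.read_fwf(StringIO(data), widths=[3, 3], header=None, delimiter="\n\t")
print(repr(df.iloc[0, 0]))
```

Note that with this setting a field made up entirely of spaces is no longer empty, so it will probably not be read as missing.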