
ENH: Possible to add dtype/converters as arguments for pandas.read_xml()?


Is your feature request related to a problem?

I am using pandas to read XML for further processing. However, columns whose values have leading zeros are always converted to numbers, so the original data is lost.
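A minimal sketch of the behavior described above. The sample XML and the choice of parser="etree" (to avoid needing lxml) are assumptions for illustration, not part of the original report.

```python
import io

import pandas as pd

# Hypothetical sample data: zip codes with leading zeros.
xml = """<root>
  <row><zip>08540</zip><dat>123</dat></row>
  <row><zip>08628</zip><dat>456</dat></row>
</root>"""

# With no dtype hints, the zip column is inferred as an integer column,
# so the leading zeros are dropped.
df = pd.read_xml(io.StringIO(xml), parser="etree")

print(df.dtypes)
print(df)
```

Once the values are integers, the original text can only be recovered by padding if every value is known to have the same fixed width.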

Describe the solution you’d like

It would be great to add dtype/converters arguments to pandas.read_xml() to force pandas to interpret certain columns with the given dtype or converters, just like the similar IO readers (read_csv, read_html, etc.).
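For comparison, here is a small sketch of how read_csv already handles this case with its dtype and converters arguments. The CSV content is made up for illustration and mirrors the zip/dat columns used elsewhere in this thread.

```python
import io

import pandas as pd

# Hypothetical CSV with the same kind of leading-zero column.
csv = "zip,dat\n08540,123\n08628,456\n"

# Default inference: zip becomes an integer column and loses its zeros.
inferred = pd.read_csv(io.StringIO(csv))

# dtype keeps the column as text.
as_text = pd.read_csv(io.StringIO(csv), dtype={"zip": str})

# converters receives each raw string, so it can also normalize values.
padded = pd.read_csv(io.StringIO(csv), converters={"zip": lambda v: v.zfill(5)})

print(inferred["zip"].tolist())
print(as_text["zip"].tolist())
print(padded["zip"].tolist())
```

The request is essentially for read_xml to accept the same two keyword arguments.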


API breaking implications

Probably not, this argument could be optional.

Describe alternatives you’ve considered

Write my own code to pull the data node by node from the XML, which results in very poor performance.
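A rough sketch of what such a manual alternative might look like, using the standard-library ElementTree parser. The element names (row, zip, dat) are assumed for illustration and are not taken from the reporter's actual data.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample data with a flat row/child layout.
xml = """<root>
  <row><zip>08540</zip><dat>123</dat></row>
  <row><zip>08628</zip><dat>456</dat></row>
</root>"""

root = ET.fromstring(xml)

# Build one dict per <row>; every value stays a string at this point.
records = [{child.tag: child.text for child in row} for row in root.findall("row")]

# Convert only the columns that really are numeric.
df = pd.DataFrame(records).astype({"dat": "int64"})

print(df)
```

This keeps zip as text because nothing is inferred, at the cost of hand-written traversal code for each document layout.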

Issue Analytics

Created 2 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments


ParfaitG commented, Sep 18, 2021

As a current workaround, consider running XSLT to quote the nodes with leading zeros and then converting on the pandas side. With the default lxml parser, read_xml supports XSLT 1.0 scripts. The XSLT below applies the standard identity template and encloses the text values of the zip nodes in double quotes.

import pandas as pd

xml = \
'''<root>
     <row>
        <zip>08540</zip>
        <dat>123</dat>
     </row>
     <row>
        <zip>08628</zip>
        <dat>456</dat>
     </row>
     <row>
        <zip>27599</zip>
        <dat>789</dat>
     </row>
    </root>'''

xsl = \
'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TEMPLATE TO COPY XML AS IS -->
    <xsl:template match="node()|@*">
       <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
       </xsl:copy>
    </xsl:template>
    
    <!-- ENCLOSE zip NODES WITH DOUBLE QUOTES -->
    <xsl:template match="zip">
      <xsl:copy>
        <xsl:variable name="quot">"</xsl:variable>
        <xsl:value-of select="concat($quot, text(), $quot)"/>
      </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>'''

df = (
    pd.read_xml(xml, stylesheet = xsl)
      .assign(zip = lambda x: x["zip"].str.replace('"', ''))
)

print(df)
#      zip  dat
# 0  08540  123
# 1  08628  456
# 2  27599  789


ParfaitG commented, Sep 15, 2021

Agreed! This is a good feature to add to the running list. Also, read_xml already passes its parsed data to the TextParser shared by the other IO readers.
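For readers arriving later: to my knowledge, pandas 1.5.0 and newer accept a dtype argument (and converters) on read_xml, which would make the XSLT workaround unnecessary. The sketch below assumes such a pandas version is installed; the sample XML and parser="etree" are illustrative choices only.

```python
import io

import pandas as pd

# Hypothetical sample data matching the workaround example's columns.
xml = """<root>
  <row><zip>08540</zip><dat>123</dat></row>
  <row><zip>08628</zip><dat>456</dat></row>
  <row><zip>27599</zip><dat>789</dat></row>
</root>"""

# Assumes pandas >= 1.5.0; older versions raise a TypeError for the
# unexpected dtype keyword.
df = pd.read_xml(io.StringIO(xml), parser="etree", dtype={"zip": str})

print(df)
```

On older versions, the XSLT approach from the comment above remains the fallback.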