Text not in the expected cell

Here is my document:

I want to extract tuples like ("户名", "张三") and ("户号", "1102"), but I cannot locate the values correctly with the following code:

from docx import Document

file = '/path/to/my_doc.docx'

doc = Document(file)
tables = doc.tables
table = tables[0]  # now, I get the table object successfully
# Intuitively, "户    名" is in the first cell of second row. When I try to find it:
row_num = 1
cell_num = 0
row = table.rows[row_num]
cell = row.cells[cell_num]
print(cell.text)  # will print "张三"
# Actually,"户    名"  is in the last cell of first row:
row_num = 0
row = table.rows[row_num]
cell = row.cells[-1]
print(cell.text)  # will print "户    名"

Could anybody please tell me what’s wrong with it? :)

Top GitHub Comments

scanny commented, Jul 12, 2021

Okay, in the XML you sent me, w:tbl/w:tblGrid contains 32 w:gridCol entries. This means the table layout grid is 32 columns wide.

The first row (the long merged heading) only accounts for 31 layout-grid columns. I think that’s the source of the problem: python-docx is interpreting the first cell of the second row as the 32nd cell of the first row.

Basically, the table is just too complex for python-docx to navigate. Word has rules for rendering tables that go beyond the simple-ish uniform layout-grid concept, and those rules are not implemented in python-docx. So pursuing the “row-by-row” “apparent-cell” approach we mentioned above is probably the best course.
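
To check a table for the kind of mismatch described here, the grid width can be compared against the number of grid columns each row actually covers. The following is a minimal sketch, not part of python-docx: `grid_report` is a hypothetical helper name, it takes the XML string produced by `table._element.xml`, and it assumes that a cell without `w:gridSpan` covers one grid column. It ignores `w:gridBefore` and `w:gridAfter` row properties, so treat the result as a hint rather than a verdict.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by w:tbl, w:tr, w:tc, ...
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def grid_report(tbl_xml):
    """Return (grid_width, row_widths) for a w:tbl XML string.

    grid_width is the number of w:gridCol entries; row_widths lists,
    per row, how many grid columns its w:tc elements cover.
    Hypothetical helper; ignores w:gridBefore / w:gridAfter.
    """
    tbl = ET.fromstring(tbl_xml)
    grid_width = len(tbl.findall("w:tblGrid/w:gridCol", NS))
    row_widths = []
    for tr in tbl.findall("w:tr", NS):
        width = 0
        for tc in tr.findall("w:tc", NS):
            span = tc.find("w:tcPr/w:gridSpan", NS)
            # Assumption: a cell without w:gridSpan covers one grid column.
            width += int(span.get(f"{{{W}}}val")) if span is not None else 1
        row_widths.append(width)
    return grid_width, row_widths
```

Calling `grid_report(table._element.xml)` on the table from the question should show which rows cover fewer columns than the grid has; those are the rows where grid-based addressing is likely to shift.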

scanny commented, Jul 9, 2021

@zifeiYv OK, that all looks in order. I suspect that the table is not consistent with the spec and that its underlying layout grid is corrupted, which means python-docx cell addressing won’t work reliably. There is a lot of cell merging in that table, and it can happen that Word can still render the table while the layout-grid addressing in python-docx can’t figure out which cell is which.

Can you dump the whole table to an XML file and attach it? Use print(table._element.xml); the output will be too long to paste into a comment box here.

But whatever we find out there, I think the best available path is for you to resort to “physical” indexing:

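from docx.table import _Cell

def iter_visible_cells(row):
    """Yield a _Cell for each w:tc element physically present in this row."""
    tr = row._tr
    for tc in tr.tc_lst:
        # row.table is the Table object this row belongs to
        yield _Cell(tc, row.table)

row = table.rows[1]  # `table` is the Table object from the question's code
row_cells = list(iter_visible_cells(row))
print(row_cells[0].text)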

The cells accessed this way should work just as well as any other cell and will be much easier to address in a complex layout like this.
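
Once the cell texts of a row are read in physical order, turning them into the tuples asked for in the question could look roughly like this. It is a sketch under one loud assumption: within a row, label cells and value cells alternate (label, value, label, value, ...), which may not hold for every row of the real document. `label_value_pairs` is a hypothetical helper name, not a python-docx function.

```python
def label_value_pairs(texts):
    """Pair up cell texts from one row as (label, value) tuples.

    Assumption: cells alternate label, value, label, value, ...
    A trailing cell without a partner is ignored.
    """
    pairs = []
    # Step through the row two cells at a time.
    for i in range(0, len(texts) - 1, 2):
        # Drop all whitespace inside the label, e.g. padding between characters.
        label = "".join(texts[i].split())
        value = texts[i + 1].strip()
        if label:  # skip pairs whose label cell is empty
            pairs.append((label, value))
    return pairs
```

With the generator above, the input would be something like `[cell.text for cell in iter_visible_cells(row)]`. Labels padded with spaces, such as "户    名", are collapsed to "户名" so they can be matched by exact string comparison.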