How to use Python to iterate over paragraphs, tables and pictures in a Word document?

See original GitHub issue

So far, I’ve found a way to iterate over the paragraphs and tables in a Word document in the order they appear, but I’m stuck on how to read the pictures in sequence as well. Could you help me extend my existing code so that it also iterates over the pictures in document order? Here is my current code:

import docx
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table, _Row
from docx.text.paragraph import Paragraph

path = './test.docx'
doc = docx.Document(path)

def iter_block_items(parent):
    """Yield each Paragraph and Table directly under parent, in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        # Note: the direct children of a row are cells, so nothing is yielded here.
        parent_elm = parent._tr
    else:
        raise ValueError("parent must be a Document, _Cell or _Row")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

for block in iter_block_items(doc):
    # read paragraph
    if isinstance(block, Paragraph):
        print(block.text)
    # read table
    elif isinstance(block, Table):
        print(block.style.name)

Top GitHub Comments

Goodelang commented, May 27, 2019

How about listing the pictures and text this way? I don’t know how to convert WMF and EMF images to other formats, though.

import re
from os.path import basename

from docx import Document

file_name = "D:/2.docx"
doc = Document(file_name)
# Relationship IDs such as "rId7" link a run's drawing to its image part.
pattern = re.compile(r'rId\d+')
a = []  # one list per paragraph, holding text strings and image bytes in order
for graph in doc.paragraphs:
    b = []
    for run in graph.runs:
        if run.text != '':
            b.append(run.text)
            continue
        match = pattern.search(run.element.xml)
        if match is None:
            # Empty run without a relationship reference: nothing to collect.
            continue
        contentID = match.group(0)
        try:
            part = doc.part.related_parts[contentID]
        except KeyError as e:
            print(e)
            continue
        if not part.content_type.startswith('image'):
            continue
        imgName = basename(part.partname)  # file name of the image part; not used below
        imgData = part.blob
        b.append(imgData)
    a.append(b)
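
For comparison, here is a minimal sketch of the same idea that also keeps tables in sequence. It is not the python-docx approach used above: it reads the .docx archive directly with the standard library (zipfile and xml.etree.ElementTree). The function name iter_body_items and the tuple shapes it yields are hypothetical, chosen only for illustration. It assumes pictures are stored as DrawingML a:blip elements with an r:embed attribute, so legacy VML pictures are not picked up, and it flattens each table to its cell text without looking for pictures inside cells.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def iter_body_items(docx_file):
    """Yield ("text", str), ("image", name, bytes) and ("table", rows) in body order.

    Hypothetical helper for illustration; docx_file is a path or a binary file object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        names = set(zf.namelist())
        # Map relationship IDs (e.g. "rId7") to part names inside the archive.
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        targets = {}
        for rel in rels.iter(f"{{{PKG_REL}}}Relationship"):
            if rel.get("TargetMode") == "External":
                continue
            target = rel.get("Target")
            if target.startswith("/"):
                name = target[1:]  # absolute target, relative to the package root
            else:
                name = posixpath.normpath(posixpath.join("word", target))
            targets[rel.get("Id")] = name

        body = ET.fromstring(zf.read("word/document.xml")).find(f"{{{W}}}body")
        for child in body:
            if child.tag == f"{{{W}}}p":
                # iter() walks in document order, so text and pictures interleave.
                for node in child.iter():
                    if node.tag == f"{{{W}}}t" and node.text:
                        yield ("text", node.text)
                    elif node.tag == f"{{{A}}}blip":
                        name = targets.get(node.get(f"{{{R}}}embed"))
                        if name in names:
                            yield ("image", name, zf.read(name))
            elif child.tag == f"{{{W}}}tbl":
                # Flatten the table to a list of rows, each a list of cell texts.
                rows = [
                    ["".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
                     for tc in tr.findall(f"{{{W}}}tc")]
                    for tr in child.findall(f"{{{W}}}tr")
                ]
                yield ("table", rows)
```

Looping over iter_body_items('./test.docx') would then give text fragments, image bytes and tables in the order they occur in the document body. The image name keeps its original extension (for example .png, .wmf or .emf), so the bytes can be written to disk unchanged; converting WMF or EMF to another format is outside the scope of this sketch.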

phillipkent commented, Apr 18, 2019

Here is my code for converting a source DOCX to a new DOCX: gist

I assume that, since you are reading this, you need to convert a large number of ‘structured’ source documents to a new document format (with new styles, company branding, etc.). This code is offered as an example of how to do it; you will need to adapt it to how your source documents are structured and to the structure you want for your new documents.

In my code, the new document is based on a ‘stub’ document which contains preamble text (title page and copyright information) and styles. The new document starts as a copy of the stub; the code then steps through the source document’s paragraphs, tables and images, processes each one, and appends it to the end of the new document.

Source documents are taken from the directory ‘source’ and converted documents are saved to the directory ‘converted’. Two types of source documents are handled: ‘Fiscal Guide’ or ‘Economics Regime’. Each one has its own stub document and different conversion steps. The stub documents are not included here! The code is offered as an example for adaptation to your own uses.