How to use Python iteration to read paragraphs, tables and pictures in Word?

So far, I’ve found a way to iterate over the paragraphs and tables in a Word document in the order they appear, but I’m stuck on how to read pictures in that same sequence. Could you help me extend my original code so that it also reaches the pictures in document order? Here is my current code:

from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table, _Row
from docx.text.paragraph import Paragraph
import docx
path = './test.docx'
doc = docx.Document(path)

def iter_block_items(parent):
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        # note: a row's direct children are cells (w:tc), so nothing is yielded here
        parent_elm = parent._tr
    else:
        raise ValueError("parent must be a Document, _Cell or _Row")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

for block in iter_block_items(doc):
    # read Paragraph
    if isinstance(block, Paragraph):
        print(block.text)
    # read table
    elif isinstance(block, Table):
        print(block.style.name)

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

Goodelang commented, May 27, 2019 (6 reactions)

How about this way of listing pictures and text? I don’t know how to convert WMF and EMF images to other formats, though.

from docx import Document
from os.path import basename
import re

file_name = "D:/2.docx"
doc = Document(file_name)
a = list()  # one entry per paragraph: a list of text strings and image blobs
pattern = re.compile(r'rId\d+')  # relationship ids such as "rId7"
for graph in doc.paragraphs:
    b = list()
    for run in graph.runs:
        if run.text != '':
            b.append(run.text)
        else:
            # a run without text may hold a picture: look for a relationship id in its XML
            match = pattern.search(run.element.xml)
            if match is None:
                continue  # empty run with no relationship id
            contentID = match.group(0)
            try:
                part = doc.part.related_parts[contentID]
            except KeyError as e:
                print(e)
                continue
            if not part.content_type.startswith('image'):
                continue
            imgName = basename(part.partname)  # e.g. image1.png (not used further here)
            imgData = part.blob  # raw image bytes
            b.append(imgData)
    a.append(b)

phillipkent commented, Apr 18, 2019 (2 reactions)

Here is my code for converting a source DOCX to a new DOCX: gist

I assume that, since you are reading this, you need to convert a large number of ‘structured’ source documents to new document formats (with new styles, company branding, etc). This code is offered as an example of how to do it; you will need to adapt it to how your source documents are structured and to what you want your new document structure to be.

In my code, the new document is based on a ‘stub’ document which contains preamble text (title page and copyright information) and styles. The new document starts as the stub, then the source document is stepped through for its paragraphs, tables and images, which are processed and inserted at the end of the new document.

Source documents are taken from the directory ‘source’ and converted documents are saved to the directory ‘converted’. Two types of source documents are handled: ‘Fiscal Guide’ or ‘Economics Regime’. Each one has its own stub document and different conversion steps. The stub documents are not included here! The code is offered as an example for adaptation to your own uses.
