Use Python to parse Microsoft Word documents using PyWin32 Library

Python is like a disease. Once you start coding, your skills with other languages’ syntax will be heavily affected. Anyways, that’s not the topic.

Last year, I had to grade about 50 Word documents by following a grading scheme. It took me roughly 3 hours and left me pretty frustrated. A couple of days ago, while learning Python, I decided to learn it by developing an application, and the first thing that popped into my mind was automating the grading process. So, after following Hadeer’s tutorial on how to install Python on Windows, I started developing the application.

To access the Word documents using Python, you need the following:

  1. Microsoft Word already installed.
  2. PyWin32 library

PyWin32 is a Python library that makes Windows extensions available to Python. In other words, it lets you access various Windows features (Microsoft Office’s features, at least) without using one of Microsoft’s languages like Visual Basic or C#.

Downloading and Installing PyWin32

  1. Go to the library’s download page.
  2. Download the version that suits your computer. There are separate builds for 32-bit and 64-bit computers, as well as for each Python version, so make sure you download the right one. My computer runs 64-bit Windows 7 and Python 2.7.2, so I downloaded pywin32-216.win-amd64py2.7.
  3. Run the downloaded file and follow the wizard. It’s pretty easy.
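
To confirm that the installation worked, a quick check along these lines can help. This is only a sketch: has_module is a hypothetical helper name, and win32com is the package PyWin32 provides for talking to Office. On current Python releases the library is normally installed with "pip install pywin32" instead of a downloaded installer.

```python
def has_module(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # win32com is the part of PyWin32 used in the rest of this post
    if has_module("win32com"):
        print("PyWin32 is installed")
    else:
        print("PyWin32 is missing")
```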

The Python Part

Let’s start coding and go over some concepts:

  1. PyWin32 is a wrapper that lets you use the same methods and properties that are available in Visual Basic for Applications (VBA), but with Python’s syntax.
  2. The Word 2007 developer reference is the place to look, and its most useful part is the Object Model Reference. I had to check both to find out which methods and properties are available, so they’re VERY important.
  3. In either reference you will find examples written in VBA. All you have to do is convert them to Python’s syntax.
  4. Long story short, all you have to do is:
    1. Open the Word application and hide it:
      import win32com.client as win32

      word = win32.gencache.EnsureDispatch('Word.Application')
      word.Visible = False
      

      Now, the word variable holds an Application object, so in the Object Model Reference you will find a class called Application. All of its methods and properties (called the Application Object members) are available. One of those properties is Visible, which I set to False so that all the work happens in the background. If you try it, you will find the Microsoft Word instance (WINWORD.EXE) among the processes in your Task Manager.

    2. Then I want to get all the files in the folder (directory) to start the grading process. To do so in a simple way, I put my Python (.py) file in the same folder as the documents. Then I will loop through each available file and open it in the Word instance that is running in the background.
      import glob
      import os

      for infile in glob.glob(os.path.join('', '*.docx')):
          pass  # My checks here
      

      Note that I am interested only in the .docx files, so I passed “*.docx” as the pattern to glob.glob.

    3. Now “infile” holds the input file name, something like “My Word Document.docx”. To open the input file and start processing:
      doc = word.Documents.Open(os.path.join(os.getcwd(), infile))
      

      os.getcwd() is short for “get current working directory”. Word needs the full path, so os.path.join combines that directory with the file name before the file is opened.

    4. If you want to create a new document use:
      doc = word.Documents.Add()
      
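    5. If you checked the Object Model Reference mentioned earlier, you know that the “doc” variable now holds a Document object. Therefore, all the methods and properties of the Document object are available. I want to check whether there are grammar or spelling problems. Document.CheckGrammar only starts Word’s interactive check and does not report a result, so I count the errors Word has flagged instead:
      if doc.GrammaticalErrors.Count > 0 or doc.SpellingErrors.Count > 0:
           print("Did not pass the grammar and spelling check")
      

      Use doc.SpellingErrors on its own to look at spelling mistakes only.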

    6. According to the grading scheme, the student had to use at least 3 different styles (Heading 1, Heading 2, etc.), 3 font sizes, 3 font types (Tahoma, Times New Roman, etc.), and 3 font effects. To check this, I loop through all the words in the document and apply my checks (P.S. this is not a tutorial on efficiency).
      fonts = []
      sizes = []
      styles = []
      effect = False
      # Font members that are truthy when the effect is applied.
      # Font.Shading is left out: it returns a Shading object, not an on/off value.
      effect_names = ('Bold', 'DoubleStrikeThrough', 'Emboss', 'Italic',
                      'Underline', 'Engrave', 'Shadow', 'StrikeThrough',
                      'Subscript', 'Superscript', 'SmallCaps', 'AllCaps')
      # For every word in the document
      for word_t in doc.Words:
           font = word_t.Font
           if font.Name not in fonts:
                fonts.append(font.Name)
           if font.Size not in sizes:
                sizes.append(font.Size)
           # Compare styles by name, since Style itself is a COM object
           style_name = word_t.Style.NameLocal
           if style_name not in styles:
                styles.append(style_name)
           if any(getattr(font, name) for name in effect_names):
                effect = True
      
    7. Save to a new file:
      doc.SaveAs(os.path.join(os.getcwd(), 'Modified.docx'))
      

      Or save the same file:

      doc.Save()
      
    8. After finishing all your work, you have to quit the Microsoft Word instance we started earlier (we can’t leave it running).
      word.Application.Quit(-1)  # -1 is wdSaveChanges
      
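
The opening, looping and quitting steps above can be tied together roughly as follows. This is a sketch under stated assumptions, not the code from my grader: word_session and docx_paths are hypothetical helper names, the dispatch argument is an invented hook so the helper can be tried without Word, and Quit is called with 0 (wdDoNotSaveChanges) by default because a grading run only reads the documents.

```python
import contextlib
import glob
import os


@contextlib.contextmanager
def word_session(dispatch=None, save_changes=0):
    """Yield a hidden Word Application and always quit it afterwards.

    dispatch is a hypothetical hook: any callable returning an
    Application-like object. save_changes is passed to Quit
    (0 = wdDoNotSaveChanges, -1 = wdSaveChanges).
    """
    if dispatch is None:
        # Imported lazily so the helpers load on machines without pywin32
        import win32com.client as win32
        dispatch = lambda: win32.gencache.EnsureDispatch('Word.Application')
    word = dispatch()
    word.Visible = False
    try:
        yield word
    finally:
        # Runs even if a check raises, so no Word process is left behind
        word.Quit(save_changes)


def docx_paths(folder):
    """Return sorted absolute paths of the .docx files directly inside folder."""
    pattern = os.path.join(os.path.abspath(folder), '*.docx')
    # Skip the "~$..." lock files Word creates next to documents it has open
    return sorted(p for p in glob.glob(pattern)
                  if not os.path.basename(p).startswith('~$'))


# Intended usage (needs Windows, Word and pywin32):
#
# with word_session() as word:
#     for path in docx_paths('.'):
#         doc = word.Documents.Open(path)
#         # checks go here
#         doc.Close(0)
```

The try/finally is the main design choice: if one document makes a check fail, Word still gets closed.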

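The formatting check in step 6 can also be packaged as a function that collects distinct values in sets. Treat this as a sketch: summarize_formatting and meets_scheme are hypothetical names, styles are compared by their NameLocal property, and effects are counted by name so that "3 font effects" can be checked, which goes a bit further than the single effect flag used above.

```python
# Font members treated as "effects"; each is truthy when applied
EFFECTS = ('Bold', 'DoubleStrikeThrough', 'Emboss', 'Italic', 'Underline',
           'Engrave', 'Shadow', 'StrikeThrough', 'Subscript', 'Superscript',
           'SmallCaps', 'AllCaps')


def summarize_formatting(words):
    """Collect the distinct fonts, sizes, styles and effects used by the words.

    words is any iterable of objects shaped like Word's Range
    (for example doc.Words), each with .Font and .Style members.
    """
    summary = {'fonts': set(), 'sizes': set(), 'styles': set(), 'effects': set()}
    for word in words:
        font = word.Font
        summary['fonts'].add(font.Name)
        summary['sizes'].add(font.Size)
        summary['styles'].add(word.Style.NameLocal)
        for name in EFFECTS:
            if getattr(font, name, 0):
                summary['effects'].add(name)
    return summary


def meets_scheme(summary, minimum=3):
    """Return True if every category has at least `minimum` distinct values."""
    return all(len(values) >= minimum for values in summary.values())
```

With a real document this would be called as meets_scheme(summarize_formatting(doc.Words)).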
Tips

  1. Parentheses matter. For a method that takes only optional arguments, doc.CheckGrammar and doc.CheckGrammar() did not give me the same result, and I have no solid explanation for it. In Python, a name without parentheses normally refers to the method itself instead of calling it, so when you need a dependable value it is safer to read a real property such as doc.GrammaticalErrors.Count.
  2. Always check the “Object Members” page. It has all the properties and methods that you will need.
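
One possible explanation for the first tip can be shown with a plain Python stand-in. FakeDocument below is a hypothetical class, not a COM object, and how PyWin32 resolves a name without parentheses may depend on how the Word object was created, so read this as an illustration of ordinary Python behaviour only.

```python
class FakeDocument(object):
    """Plain Python stand-in for a document; not a real Word object."""

    def CheckGrammar(self):
        # Pretend the check failed
        return False


doc = FakeDocument()

referenced = doc.CheckGrammar   # the method object itself, nothing is called
called = doc.CheckGrammar()     # the method actually runs

# A method object is always truthy, whatever the method would return
print(bool(referenced))
print(called)
```

So an "if not doc.CheckGrammar:" test on an object like this can never fire, while "if not doc.CheckGrammar():" can.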

Code Snippets

These are some commented snippets from my application, in case you need them.

Note: In the paragraph alignment check below, you will find I compare against the number 0 to test for left alignment. Why 0? Check the Enumerations. In VBA, you can write something like wdAlignParagraphLeft to check whether a paragraph is left aligned. I couldn’t get that to work in Python, so I just used the numeric value, which is listed in the reference for paragraph alignment.
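
A small way to avoid bare numbers is to give the enumeration values names of your own. This is a sketch: the constant names and non_left_aligned are my own inventions, while the values 0 to 3 are the first four members of Word's WdParagraphAlignment enumeration. PyWin32 should also expose the VBA names through win32com.client.constants once gencache.EnsureDispatch has generated its wrappers, but the sketch sticks to literals so it does not depend on that.

```python
# Values of Word's WdParagraphAlignment enumeration (first four members)
WD_ALIGN_PARAGRAPH_LEFT = 0
WD_ALIGN_PARAGRAPH_CENTER = 1
WD_ALIGN_PARAGRAPH_RIGHT = 2
WD_ALIGN_PARAGRAPH_JUSTIFY = 3

ALIGNMENT_NAMES = {
    WD_ALIGN_PARAGRAPH_LEFT: 'left',
    WD_ALIGN_PARAGRAPH_CENTER: 'center',
    WD_ALIGN_PARAGRAPH_RIGHT: 'right',
    WD_ALIGN_PARAGRAPH_JUSTIFY: 'justify',
}


def non_left_aligned(paragraphs):
    """Return (position, alignment name) for paragraphs that are not left aligned.

    paragraphs is any iterable of objects with a .Format.Alignment value,
    for example doc.Paragraphs. Positions are 1-based.
    """
    result = []
    for position, para in enumerate(paragraphs, start=1):
        alignment = para.Format.Alignment
        if alignment != WD_ALIGN_PARAGRAPH_LEFT:
            result.append((position, ALIGNMENT_NAMES.get(alignment, 'other')))
    return result
```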

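Checking whether images, clip art, shapes, etc. exist


# Floating shapes live in doc.Shapes, pictures placed in line with text in doc.InlineShapes
if doc.Shapes.Count == 0 and doc.InlineShapes.Count == 0:
    print("No images, shapes or clip art found")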

Check whether a table of contents exists


if doc.TablesOfContents.Count == 0:
    print("No table of contents found")

Get all the list types used in the document


lists = []
for list_t in doc.Lists:
    list_type = list_t.Range.ListFormat.ListType
    if list_type not in lists:
        lists.append(list_type)

Paragraph Alignment


for para in doc.Paragraphs:
    # 0 is the numeric value of wdAlignParagraphLeft
    if para.Format.Alignment != 0:
        print("Found a paragraph that is not left aligned")

Some Table checks

# Number of features used (at least 3 to get the full grade)
tableFeatures = 0

if doc.Tables.Count == 0:
    print("No tables were found")

for table in doc.Tables:
    # Border-level checks, evaluated for each border of the table
    for border in table.Borders:
        if border.LineWidth != 1:
            tableFeatures += 1
        if border.LineStyle != 0:
            tableFeatures += 1
    # Table-level checks, evaluated once per table
    if not table.Uniform:
        tableFeatures += 1
    if table.TableDirection != 0:
        tableFeatures += 1
    if table.Spacing != 1:
        tableFeatures += 1
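
Since the grade depends on how many different features were used, a variant that records each feature once may be closer to the scheme. This is a sketch, not my original code: table_feature_names and the feature labels are hypothetical, and the comparison values are simply the ones from the snippet above.

```python
def table_feature_names(table):
    """Return the set of distinct non-default features found on one table.

    table is any object shaped like Word's Table: an iterable .Borders
    plus .Uniform, .TableDirection and .Spacing members.
    """
    features = set()
    for border in table.Borders:
        if border.LineWidth != 1:
            features.add('border width')
        if border.LineStyle != 0:
            features.add('border style')
    if not table.Uniform:
        features.add('non-uniform layout')
    if table.TableDirection != 0:
        features.add('table direction')
    if table.Spacing != 1:
        features.add('cell spacing')
    return features


def used_enough_features(table, minimum=3):
    """Return True if the table uses at least `minimum` distinct features."""
    return len(table_feature_names(table)) >= minimum
```

Because a set is used, a table with eight styled borders still counts "border style" only once.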

Finally

The same approach can be applied to Microsoft Excel. I exported the grading results to an Excel file. You can check the Excel references as well.

Excel Object Model Reference and Excel Developer Reference.
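
Writing results to Excel could look roughly like this. It is a sketch rather than my actual export code: write_rows and export_grades are hypothetical names, and the result rows are whatever your own checks produce. The Excel members used (Workbooks.Add, Worksheets, Cells, SaveAs, Close, Quit) come from the Excel Object Model Reference.

```python
def write_rows(sheet, rows, first_row=1):
    """Write a list of rows into a worksheet-like object, one value per cell."""
    for r, row in enumerate(rows, start=first_row):
        for c, value in enumerate(row, start=1):
            # Excel's Cells(row, column) is 1-based
            sheet.Cells(r, c).Value = value
    return len(rows)


def export_grades(rows, path):
    """Create a workbook through COM, fill the first sheet and save it to path."""
    # Imported lazily so write_rows stays usable without pywin32 installed
    import win32com.client as win32
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    excel.Visible = False
    try:
        workbook = excel.Workbooks.Add()
        write_rows(workbook.Worksheets(1), rows)
        workbook.SaveAs(path)
        workbook.Close(False)
    finally:
        # Same rule as with Word: never leave the instance running
        excel.Quit()
```

As with Word, pass a full path to SaveAs, for example os.path.join(os.getcwd(), 'grades.xlsx').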

References

  1. http://msdn.microsoft.com/en-us/library/bb244391(v=office.12).aspx
  2. http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
  3. http://sourceforge.net/projects/pywin32/

23 thoughts on “Use Python to parse Microsoft Word documents using PyWin32 Library”

  1. Hi

    I’m using this library to convert html files into word (Documents.Add etc).
    All is fine, but when I open the converted Word files, Microsoft Word displays the document in HTML mode and I have to switch it to Page mode. How can I configure that with Python?
    And if you know how to insert page numbers, I would be the happiest :)

    thanks

  2. Great post Galal!

    I wasn’t able to run the script as is because of issues with word object creation. However for anyone else experiencing the same problem, the following works:

    from win32com.client import Dispatch
    word = Dispatch('Word.Application')

    I am using pywin32 build 217.

  3. Hi,

    Do you know if there is a way to basically use a word document template and update it dynamically in a loop and save the multiple updated versions of the template?

    For example, something like:
    for i in range(4):
        doc.Range.Text = i ## I know this statement is wrong, but this is just an example to include different text in all the documents.
        doc.SaveAs(os.path.join(os.getcwd(), "abcd_" + str(i) + ".docx"))

    Does this make sense?
