Python is like a disease. Once you start coding in it, your skills with other languages’ syntax will be heavily affected. Anyway, that’s not the topic.
Last year, I had to grade about 50 Word documents by following a grading scheme. I spent approximately 3 hours grading them and was pretty frustrated. A couple of days ago, while learning Python, I decided to learn it by developing an application, and the first thing that popped into my mind was to automate the grading process. So, after following Hadeer’s tutorial on how to install Python on Windows, I started to develop the application.
To access the Word documents using Python, you need the following:
- Microsoft Word already installed.
- PyWin32 library
PyWin32 is a Python library that makes Windows extensions available to Python. In other words, it lets you access various Windows features, at least Microsoft Office’s features, without using one of Microsoft’s languages like Visual Basic or C#.
Downloading and Installing PyWin32
- Go to the library’s download page.
- Download the version that suits your computer. There are separate builds for 32-bit and 64-bit computers and for each Python version, so make sure you download the right one. My computer runs 64-bit Windows 7 and Python 2.7.2, so I downloaded pywin32-216.win-amd64-py2.7.
- Run the downloaded file and continue with the wizard. It’s pretty easy.
The Python Part
Let’s start coding and cover some concepts:
- PyWin32 is a wrapper that lets you use the same methods and properties that are available in Visual Basic for Applications (VBA), but with Python’s syntax.
- This is the Word 2007 developer reference, and the useful part is the Object Model Reference. I had to check them to know which methods and properties are available, so they’re VERY important.
- In any of the references, you will find some examples written in VBA. All you have to do is convert them to Python syntax.
- Long story short, all you have to do is:
- Open the Word application and hide it:
import win32com.client as win32

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
Now the word variable holds an Application object, so in the Object Model Reference you will find a class called Application. All of its methods and properties (called the Application object members) are available. One of the properties is Visible, which I set to False so that all the work is done in the background. If you try it, you will find the Microsoft Word instance among the processes in your Task Manager.
- Then I want to get all the files in the folder (directory) to start the grading process. To do so in a simple way, I put my Python (.py) file in the same folder as the documents. Then I will loop through each available file and open it in the Word instance that is running in the background.
import glob
import os

for infile in glob.glob(os.path.join('', '*.docx')):
    pass  # My checks here
Note that I am interested only in the docx files so I passed the “*.docx” to the function.
- Now the variable “infile” (input file) holds a file name, something like “My Word Document.docx”. To open the input file and start processing:
doc = word.Documents.Open(os.path.join(os.getcwd(), infile))
os.getcwd() is short for “get current working directory”. Joining it with the file name gives the full path of the file to open.
- If you want to create a new document use:
doc = word.Documents.Add()
- If you checked the Object Model Reference mentioned earlier, you will see that the “doc” variable now holds a Document object. Therefore, all the methods and properties of the document are available. I want to check whether there are grammar or spelling problems:
if not doc.CheckGrammar:
    print("Did not pass the grammar and spelling check")
There is another one called CheckSpelling that checks for spelling mistakes only.
- According to the grading scheme, the student had to use at least 3 different styles (Heading 1, Heading 2, etc.), 3 font sizes, 3 font types (Tahoma, Times New Roman, etc.), and 3 font effects. To check this, I loop through all the words in the document and apply my checks (P.S. this is not an efficiency tutorial).
fonts = []
sizes = []
styles = []
effect = False
# For every word in the document
for word_t in doc.Words:
    if word_t.Font.Name not in fonts:
        fonts.append(word_t.Font.Name)
    if word_t.Font.Size not in sizes:
        sizes.append(word_t.Font.Size)
    if word_t.Style not in styles:
        styles.append(word_t.Style)
    # Any one of these font effects is enough to set the flag
    if (word_t.Font.Bold or word_t.Font.DoubleStrikeThrough
            or word_t.Font.Emboss or word_t.Font.Italic
            or word_t.Font.Underline or word_t.Font.Engrave
            or word_t.Font.Shadow or word_t.Font.Shading
            or word_t.Font.StrikeThrough or word_t.Font.Subscript
            or word_t.Font.Superscript or word_t.Font.SmallCaps
            or word_t.Font.AllCaps):
        effect = True
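- Save to a new file with the document’s SaveAs method, passing the new file path.
- Or save the same file with the document’s Save method.
- After finishing all your work, you have to quit the Microsoft Word instance we initialized earlier with the application’s Quit method (we can’t leave it running).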
- If you’re using a method whose arguments are all optional and you’re not passing any of them, do not add empty parentheses. For example, doc.CheckGrammar does the job, while doc.CheckGrammar() returned a different result for me. I have no explanation for this; it just happens.
- Always check the “Object Members” page. It has all the properties and methods that you will need.
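Putting the steps above together, here is a minimal sketch of the whole loop: open each document, run a few checks, close it, and quit Word at the end. The function names, the `minimum` parameter, and the result keys are hypothetical and only for illustration. It also swaps in a different technique for proofing: instead of doc.CheckGrammar, it reads the Count of the document’s SpellingErrors and GrammaticalErrors collections, which may be easier to interpret.

```python
import glob
import os


def distinct_formatting(doc):
    """Collect the distinct font names and sizes, using sets instead of lists."""
    fonts, sizes = set(), set()
    for word_t in doc.Words:
        fonts.add(word_t.Font.Name)
        sizes.add(word_t.Font.Size)
    return fonts, sizes


def grade_document(doc, minimum=3):
    """Hypothetical check: enough fonts and sizes, plus proofing error counts."""
    fonts, sizes = distinct_formatting(doc)
    return {
        "fonts_ok": len(fonts) >= minimum,
        "sizes_ok": len(sizes) >= minimum,
        # Swapped-in technique: count flagged errors rather than doc.CheckGrammar
        "spelling_errors": doc.SpellingErrors.Count,
        "grammar_errors": doc.GrammaticalErrors.Count,
    }


def grade_folder(word, folder):
    """Open every .docx in `folder`, grade it, and always close it again."""
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.docx"))):
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            results[os.path.basename(path)] = grade_document(doc)
        finally:
            doc.Close(0)  # 0 = wdDoNotSaveChanges
    return results


def main():
    # Needs Windows, Microsoft Word and PyWin32, so it is imported only here.
    import win32com.client as win32

    word = win32.gencache.EnsureDispatch("Word.Application")
    word.Visible = False
    try:
        for name, result in grade_folder(word, os.getcwd()).items():
            print("%s: %s" % (name, result))
    finally:
        word.Quit()  # never leave the hidden Word instance running
```

On a Windows machine, calling main() from the folder that holds the documents should print one result per file. If a check modifies a document, call doc.Save() to overwrite it, or doc.SaveAs() with a new path to write a copy, before closing it.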
Here are some code snippets from my application, with comments, in case you need them.
Note: When checking whether a paragraph is left aligned, for example (see below), you will find that I compare against the number 0. Why 0? Check the Enumerations. In VBA, you can write something like wdAlignParagraphLeft to check whether it’s left aligned. I couldn’t get that to work in Python, so I just used the numeric value, which is found here for paragraph alignment.
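One way to avoid bare numbers is to give the enumeration values names in Python. The sketch below hard-codes the four WdParagraphAlignment values from the Word reference; the constant names and the count_not_left_aligned helper are hypothetical, not part of PyWin32. PyWin32 also exposes generated constants through win32com.client.constants once gencache.EnsureDispatch has built the type library wrappers, so constants.wdAlignParagraphLeft may be worth trying on your setup.

```python
# Values copied from Word's WdParagraphAlignment enumeration.
WD_ALIGN_PARAGRAPH_LEFT = 0
WD_ALIGN_PARAGRAPH_CENTER = 1
WD_ALIGN_PARAGRAPH_RIGHT = 2
WD_ALIGN_PARAGRAPH_JUSTIFY = 3


def count_not_left_aligned(doc):
    """Count the paragraphs whose alignment is anything other than left."""
    count = 0
    for para in doc.Paragraphs:
        if para.Format.Alignment != WD_ALIGN_PARAGRAPH_LEFT:
            count += 1
    return count
```

Here doc is the Document object returned by word.Documents.Open.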
Check whether images, clip art, shapes, etc. exist
if doc.Shapes.Count == 0:
    pass  # No images, shapes, or clip art found
Check whether a table of contents exists
if doc.TablesOfContents.Count == 0:
    pass  # No table of contents found
Get all the list types used in the document
lists = []
for list_t in doc.Lists:
    if list_t.Range.ListFormat.ListType not in lists:
        lists.append(list_t.Range.ListFormat.ListType)
Check whether all paragraphs are left aligned
for para in doc.Paragraphs:
    if not para.Format.Alignment == 0:
        pass  # Not left aligned
Some Table checks
# Number of features used (at least 3 to get the full grade)
tableFeatures = 0
if doc.Tables.Count == 0:
    print("No tables were found")
for table in doc.Tables:
    for border in table.Borders:
        if border.LineWidth != 1:
            tableFeatures += 1
        if border.LineStyle != 0:
            tableFeatures += 1
    if not table.Uniform:
        tableFeatures += 1
    if table.TableDirection != 0:
        tableFeatures += 1
    if table.Spacing != 1:
        tableFeatures += 1
The same approach can be applied to Microsoft Excel. I exported the results of the grading to an Excel file. You can check the references for Excel as well.
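As a rough sketch of what such an export could look like, the helper below writes a header row and one row per document into a worksheet through its Cells property. This is not the code from my application: write_results, the column titles, and the grades.xlsx file name are hypothetical and only for illustration.

```python
import os


def write_results(sheet, header, rows):
    """Write a header row and the result rows to a worksheet, starting at A1."""
    for col, title in enumerate(header, start=1):
        sheet.Cells(1, col).Value = title
    for row_index, row in enumerate(rows, start=2):
        for col, value in enumerate(row, start=1):
            sheet.Cells(row_index, col).Value = value
    return len(rows)


def main():
    # Needs Windows, Microsoft Excel and PyWin32, so it is imported only here.
    import win32com.client as win32

    excel = win32.gencache.EnsureDispatch("Excel.Application")
    excel.Visible = False
    try:
        workbook = excel.Workbooks.Add()
        write_results(workbook.ActiveSheet, ["File", "Grade"], [["a.docx", 8]])
        # Hypothetical output name, saved next to the documents
        workbook.SaveAs(os.path.join(os.getcwd(), "grades.xlsx"))
        workbook.Close(False)
    finally:
        excel.Quit()
```

Excel rows and columns are 1-based, which is why both loops start counting at 1 and the data rows start at 2.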