Remove MS Word “HTML” using PHP

Asked Jul 9, 2012
7 upvotes
Updated April 14, 2026

Direct Answer

HTML Purifier (http://htmlpurifier.org/) will do what you want. This is an advisory response with reference links, ranked 26th of 32 by community upvote score, from 2012.


The Problem (Q-score 4, ranked 26th of 32 in the Word VBA archive)

The scenario as originally posted in 2012

Possible Duplicate:
What is the best free way to clean up Word HTML?
PHP to clean-up pasted Microsoft input

I allow clients to enter notes in a rich text editor, and have only recently upgraded to CKEditor 3.x, which strips MS Word classes, styles, and comments by default when users paste into the editor object. So moving forward I’m all set.

I’ve recently had a need to clean up 5 years’ worth of notes, some of which have MS Word-generated HTML embedded. I need to loop through this body of text and clean it.

I do not need to strip out all span tags, only those identified as written by Microsoft.
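
The requirement above (remove only Microsoft-generated markup, keep ordinary spans) can be sketched with PHP's built-in regex functions. This is a minimal illustration, not the approach the answer recommends: the function name `stripWordMarkup` is hypothetical, and the patterns assume Word's usual output (conditional comments, `o:`/`w:`/`v:`/`m:`/`st1:` namespaced tags, a sole `Mso*` class, and `mso-*` style declarations). Regexes are not a full HTML parser, so test against a sample of real notes before running it over the whole archive.

```php
<?php
// Hypothetical helper: strips markup that Word typically emits and leaves
// everything else (including ordinary <span> tags) untouched.
function stripWordMarkup(string $html): string
{
    $patterns = [
        // Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
        '/<!--\[if[^\]]*\]>.*?<!\[endif\]-->/is',
        // Office-namespaced tags, e.g. <o:p>, </o:p>, <w:WordDocument>
        '/<\/?(?:o|w|v|m|st1):[^>]*>/i',
        // class attributes whose only value is an Mso* class
        '/\sclass\s*=\s*(["\']?)Mso\w*\1/i',
    ];
    $html = preg_replace($patterns, '', $html) ?? $html;

    // Drop mso-* declarations from style attributes, keeping other CSS.
    // Assumes declarations do not contain semicolons inside quoted values.
    $cleaned = preg_replace_callback(
        '/\sstyle\s*=\s*(["\'])(.*?)\1/is',
        function (array $m): string {
            $kept = [];
            foreach (explode(';', $m[2]) as $decl) {
                $decl = trim($decl);
                if ($decl !== '' && stripos($decl, 'mso-') !== 0) {
                    $kept[] = $decl;
                }
            }
            // Remove the attribute entirely if nothing is left.
            return $kept ? ' style=' . $m[1] . implode('; ', $kept) . $m[1] : '';
        },
        $html
    );

    return $cleaned ?? $html;
}
```

To clean the stored notes, call the function on each note's body inside whatever loop reads and writes your records, and keep a backup of the original text until the results have been spot-checked.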

I’ve tried using HTMLCleaner, but it is not removing the MS-generated HTML. http://word2cleanhtml.com does exactly what I want; however, the developers are currently not offering the API for public use (as of July 9, 2012).

I’ve looked for such a class off and on for the last few weeks and am not having much luck. Have any of you found a useful class you’d like to share?

Why community consensus is tight on this one

Across 32 Word VBA entries in the archive, the accepted answer here scores above the median, which suggests voters broadly agree on the fix.


The Verified Solution — solid answer (above median) (+7)

Advisory answer — community consensus with reference links

Note: the verified answer below is a reference / advisory response rather than a copy-ready snippet.

http://htmlpurifier.org/

This will do what you want.


When to Use It — vintage (14+ years old, pre-2013)

Ranked 26th in its category (specialized fit)

This pattern sits in the 63% tail relative to the top answer. Reach for it when your scenario closely matches the question title; otherwise browse the Word VBA archive for a higher-consensus alternative.

What changed between 2012 and 2026

The answer is 14 years old. The question concerns PHP cleanup of pasted Word HTML, not Word VBA macros, so this archive's usual VBA caveats (64-bit API declarations with PtrSafe, macros blocked in downloaded files by Mark-of-the-Web, and the shift toward Office Scripts for web-first workflows) apply only if your workflow also automates Word itself. For the linked library, check its current documentation before relying on a 2012 recommendation.

Frequently Asked Questions

Is this above-median answer still worth copying?

The answer scores +7 against a Word VBA archive median of about 4, and the question itself has 4 upvotes, so the approach has reasonable community support. Because the answer is a link rather than a snippet, there is nothing to copy directly; follow the link instead.

This answer links out — what are the reference links worth following?

Read the first external link for the canonical reference, then search this archive for a top-10 entry in the same category — advisory answers are best paired with a ranked code snippet to close the loop.

This answer is 14 years old. Is it still relevant in 2026?

Published in 2012, 14 years before this page was last regenerated. The recommendation is a link to a PHP library rather than Word VBA code, so the archive's usual re-test items, (1) blocked macros on downloaded files (Mark-of-the-Web), (2) 64-bit API declarations (PtrSafe, LongPtr), and (3) any shift toward Office Scripts for web scenarios, matter only if you also run macros against the source documents.

Which Word VBA pattern ranks just above this one at #25?

The pattern one rank above is “VSTO 2007: how do I determine the page and paragraph number of a Range?”. If your use case overlaps, compare both before committing.

Data source: Community-verified Q&A snapshot. Q-score 4, Answer-score 7, original post 2012, ranked 26th of 32 in the Word VBA archive. Last regenerated April 14, 2026.