The Problem
The scenario as originally posted in 2013
I am trying to read this PDF using iTextSharp in C# and convert it into a Word file. The conversion also needs to preserve the table formatting and fonts in Word.
With an English PDF it works perfectly, but with some Indian languages such as Hindi and Marathi it does not work.
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string Filename)
{
    StringBuilder text = new StringBuilder();
    try
    {
        // Check for the file before opening it; the reader is created only once.
        if (File.Exists(Filename))
        {
            PdfReader pdfReader = new PdfReader(Filename);
            try
            {
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    // A fresh strategy per page keeps text from accumulating across pages.
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                    text.Append(currentText);
                }
            }
            finally
            {
                // Close the reader once, after all pages have been read (not inside the loop).
                pdfReader.Close();
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
    textBox1.Text = text.ToString();
    return text.ToString();
}
The Verified Solution (+9)
I inspected your file with a special focus on your sample “मतद|र” being extracted as “मतदरर” in the topmost line of the document pages.
In a nutshell:
Your document itself provides the information that e.g. the glyphs “मतद|र” in the head line represent the text “मतदरर”. You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.
In detail:
The top line of the first page is generated by the following operations in the page content stream:
/9 280 Tf
(-12"!%$"234%56*5) Tj
The first line selects the font named /9 at a size of 280 (an operation at the beginning of the page scales everything by a factor of 0.05; thus, the effective size is 14 units which you observe in the file).
The second line causes glyphs to be printed. These glyphs are referenced between the parentheses using the custom encoding of that font.
When a program tries to extract the text, it has to deduce the actual characters from these glyph references using information from the font.
The font /9 on the first page of your PDF is defined using these objects:
242 0 obj<<
/Type/Font/Name/9/BaseFont 243 0 R/FirstChar 33/LastChar 94
/Subtype/TrueType/ToUnicode 244 0 R/FontDescriptor 247 0 R/Widths 248 0 R>>
endobj
243 0 obj/CDAC-GISTSurekh-Bold+0
endobj
247 0 obj<<
/Type/FontDescriptor/FontFile2 245 0 R/FontBBox 246 0 R/FontName 243 0 R
/Flags 4/MissingWidth 946/StemV 0/StemH 0/CapHeight 500/XHeight 0
/Ascent 1050/Descent -400/Leading 0/MaxWidth 1892/AvgWidth 946/ItalicAngle 0>>
endobj
So there is no /Encoding element but at least there is a reference to a /ToUnicode map. Thus, a program extracting text has to rely on the given /ToUnicode mapping.
The stream referenced by /ToUnicode contains the following mappings of interest when extracting the text from (-12"!%$"234%56*5):
<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>
(Already here you can see that multiple character codes are mapped to the same Unicode code point…)
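To make the collisions visible, here is a minimal sketch in plain C# that parses the entries quoted above and lists every code point that more than one character code maps to. It does not use iTextSharp; `ToUnicodeSketch`, `Parse`, and `FindCollisions` are hypothetical names introduced only for this illustration, and the sketch assumes each entry has the three-field form (first code, last code, target) shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class ToUnicodeSketch
{
    // The mapping entries quoted in the answer, copied verbatim.
    public static readonly string[] SampleEntries =
    {
        "<21> <21> <0930>", "<22> <22> <0930>", "<24> <24> <091c>",
        "<25> <25> <0020>", "<2a> <2a> <0031>", "<2d> <2d> <092e>",
        "<31> <31> <0924>", "<32> <32> <0926>", "<33> <33> <0926>",
        "<34> <34> <002c>", "<35> <35> <0032>", "<36> <36> <0030>"
    };

    // Parses "<first> <last> <target>" entries into a map of
    // character code -> Unicode code point.
    public static Dictionary<int, int> Parse(IEnumerable<string> entries)
    {
        var map = new Dictionary<int, int>();
        foreach (string entry in entries)
        {
            string[] parts = entry.Split(
                new[] { ' ', '<', '>' }, StringSplitOptions.RemoveEmptyEntries);
            int first = int.Parse(parts[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            int last = int.Parse(parts[1], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            int target = int.Parse(parts[2], NumberStyles.HexNumber, CultureInfo.InvariantCulture);

            // Consecutive codes in a range map to consecutive code points.
            for (int code = first; code <= last; code++)
            {
                map[code] = target + (code - first);
            }
        }
        return map;
    }

    // Returns each code point that is the target of more than one
    // character code, together with the codes that map to it.
    public static Dictionary<int, List<int>> FindCollisions(Dictionary<int, int> map)
    {
        return map.GroupBy(kv => kv.Value)
                  .Where(g => g.Count() > 1)
                  .ToDictionary(g => g.Key,
                                g => g.Select(kv => kv.Key).OrderBy(c => c).ToList());
    }
}
```

For the entries above, `FindCollisions(Parse(SampleEntries))` reports two shared targets: U+0930 (from codes 0x21 and 0x22) and U+0926 (from codes 0x32 and 0x33). Two different glyphs that claim the same code point cannot be told apart after extraction.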
Thus, text extraction must result in:
- = 0x2d -> 0x092e = म
1 = 0x31 -> 0x0924 = त
2 = 0x32 -> 0x0926 = द
" = 0x22 -> 0x0930 = र instead of |
! = 0x21 -> 0x0930 = र
% = 0x25 -> 0x0020 = (space)
$ = 0x24 -> 0x091c = ज
" = 0x22 -> 0x0930 = र
2 = 0x32 -> 0x0926 = द
3 = 0x33 -> 0x0926 = द
4 = 0x34 -> 0x002c = ,
% = 0x25 -> 0x0020 = (space)
5 = 0x35 -> 0x0032 = 2
6 = 0x36 -> 0x0030 = 0
* = 0x2a -> 0x0031 = 1
5 = 0x35 -> 0x0032 = 2
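The lookup above can be reproduced with a short sketch in plain C#. It only mimics the table lookup a text extractor has to perform; it is not how iTextSharp is implemented internally. `HeadingDecodeSketch` and `Decode` are hypothetical names, the dictionary hard-codes the mappings quoted earlier, and replacing unmapped codes with U+FFFD is an assumption made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class HeadingDecodeSketch
{
    // Character code (as it appears in the content stream string)
    // -> Unicode code point, taken from the /ToUnicode entries quoted above.
    private static readonly Dictionary<char, int> ToUnicode = new Dictionary<char, int>
    {
        { '!', 0x0930 }, { '"', 0x0930 }, { '$', 0x091C }, { '%', 0x0020 },
        { '*', 0x0031 }, { '-', 0x092E }, { '1', 0x0924 }, { '2', 0x0926 },
        { '3', 0x0926 }, { '4', 0x002C }, { '5', 0x0032 }, { '6', 0x0030 }
    };

    // Maps every character code of a Tj string operand to the code point
    // the font claims for it.
    public static string Decode(string glyphCodes)
    {
        var text = new StringBuilder();
        foreach (char code in glyphCodes)
        {
            int codePoint;
            // Assumption: codes without a mapping become U+FFFD.
            text.Append(ToUnicode.TryGetValue(code, out codePoint)
                ? char.ConvertFromUtf32(codePoint)
                : "\uFFFD");
        }
        return text.ToString();
    }
}
```

Calling `HeadingDecodeSketch.Decode("-12\"!%$\"234%56*5")` returns "मतदरर जरदद, 2012", the same character sequence as the table above. Any extractor that follows the /ToUnicode map faithfully ends up with this text.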
Thus, the text that iTextSharp (and also Adobe Reader!) extracts from the heading on the first document page is exactly what the document, in its font information, claims is correct.
As the cause for this is the misleading mapping information in the font definition, it is not surprising that there are misinterpretations all over the document.