pdftotext does not output Hebrew characters

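I am using Xpdf's pdftotext to extract text from some Hebrew PDF files on Ubuntu.

On my local machine this works fine. When I tried the same thing on another machine, the Hebrew characters did not show up in the text file. I verified that the language package is installed (see below for why I think so). Where else can I look for the problem?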

 >> tail -2 /etc/xpdf/xpdfrc
 include /etc/xpdf/includes
 >> cat /etc/xpdf/includes
 # This file was automatically generated by /usr/sbin/update-xpdfrc.
 # Instead, add or remove files in /etc/xpdf/ then run
 # /usr/sbin/update-xpdfrc to regenerate this file.
 include /etc/xpdf/xpdfrc-latin2
 include /etc/xpdf/xpdfrc-thai
 include /etc/xpdf/xpdfrc-greek
 include /etc/xpdf/xpdfrc-turkish
 include /etc/xpdf/xpdfrc-arabic
 include /etc/xpdf/xpdfrc-hebrew
 include /etc/xpdf/xpdfrc-cyrillic
 >> cat /etc/xpdf/xpdfrc-hebrew
 #----- begin Hebrew support package (2003-feb-16)
 unicodeMap ISO-8859-8 /usr/share/xpdf/hebrew/ISO-8859-8.unicodeMap
 unicodeMap Windows-1255 /usr/share/xpdf/hebrew/Windows-1255.unicodeMap
 #----- end Hebrew support package
 >> ls /usr/share/xpdf/hebrew/
 ISO-8859-8.unicodeMap Windows-1255.unicodeMap
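For anyone repeating the check above on a second machine, it can be scripted. The following is only a sketch: `check_unicode_maps` is a hypothetical helper name, and it assumes the simple `unicodeMap NAME PATH` line format shown in the listing, with no quoted paths or paths containing spaces.

```shell
#!/bin/sh
# check_unicode_maps FILE
# Hypothetical helper: for every "unicodeMap NAME PATH" line in an xpdfrc
# fragment, report whether PATH is a readable file.
# Returns 0 if all referenced maps are readable, 1 otherwise.
check_unicode_maps() {
    missing=0
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while read -r keyword name map_file _ || [ -n "$keyword" ]; do
        # Skip comments, blank lines, and unrelated directives.
        [ "$keyword" = "unicodeMap" ] || continue
        if [ -r "$map_file" ]; then
            printf 'ok       %s  %s\n' "$name" "$map_file"
        else
            printf 'MISSING  %s  %s\n' "$name" "$map_file"
            missing=1
        fi
    done < "$1"
    return "$missing"
}
```

Calling `check_unicode_maps /etc/xpdf/xpdfrc-hebrew` would then print one line per map. It only confirms that the files exist and are readable, not that pdftotext actually loads them.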

Fortunately, the friendly Ubuntu folks make it easy to install language support. Just enter this command in a shell:

 sudo apt-get install language-support-he language-pack-he

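You will notice that it adds Hebrew support to several other subsystems (e.g. HSpell, Myspell, and PostgreSQL) and also installs some Hebrew fonts.

For good measure, install the following Hebrew fonts: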

 sudo apt-get install culmus culmus-fancy xfonts-efont-unicode xfonts-efont-unicode-ib xfonts-intl-european msttcorefonts

Finally, make sure to specify UTF-8 as the output encoding when running pdftotext, since it may not be picked up automatically from the source:

 pdftotext -enc UTF-8 input.pdf output.txt
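To confirm that the conversion really produced Hebrew text, the output can be scanned for UTF-8 encoded Hebrew letters. This is a minimal sketch, not part of pdftotext: `has_hebrew` is a hypothetical helper name, and it assumes the output file is UTF-8 and that grep accepts byte ranges in the C locale (GNU grep does).

```shell
#!/bin/sh
# has_hebrew FILE
# Hypothetical helper: succeeds if FILE contains at least one UTF-8 encoded
# Hebrew letter (U+05D0..U+05EA).
has_hebrew() {
    # In UTF-8 these letters are the byte 0xD7 followed by a byte in
    # 0x90..0xAA (octal \327, then \220..\252).
    pattern=$(printf '\327[\220-\252]')
    # LC_ALL=C makes grep compare raw bytes, independent of the system locale.
    LC_ALL=C grep -q "$pattern" "$1"
}
```

After running the pdftotext command above, something like `has_hebrew output.txt && echo "Hebrew found" || echo "no Hebrew letters"` gives a quick yes/no answer. The check covers only the basic Hebrew letters, not vowel points or presentation forms.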

You should take a look at TET, the Text Extraction Toolkit from PDFlib.com (by Thomas Merz, author of the "PostScript and PDF Bible").

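TET is primarily a library for use in other PDF processing applications, but they also…

…built a powerful command-line tool on top of it, called "TET iFilter" (free as in beer).
…built an Acrobat plugin (free as in beer).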

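It can extract non-ASCII text from PDF files (including CJK, Hebrew, and Arabic), resolves ligatures back into their original character pairs or triplets, and in general runs circles around Adobe's own text extraction function.

It is available for Windows, Linux, Mac OS X, and various Unix systems.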