I have a PDF file in Persian script. Because the text is Persian (normally encoded as UTF-8), I can't convert it to plain text in Microsoft Word, and copy-pasting the text produces unreadable characters. I have tried a lot of software, such as e-Pdf Converter, but after conversion the characters are still not displayed properly. I even tried OCR, but the same problem appeared. The PDF doesn't have any password or restrictions. Does anyone have any other ideas?
Edit: I also tried creating a file in MS Word and converting it to a PDF; I had the same problem with the resulting PDF file, even though the encoding was known.

Very often, PDF files in non-Latin scripts (especially RTL scripts such as Arabic, Hebrew and Farsi) are generated by software which LTR-ifies the text at the word or sentence-fragment level, or which gets the right glyphs to display but leaves gibberish as the 'logical' text. In these cases there is very little to be done except write a custom back-converter, which is usually not a practical option. However, if you can figure out how the file was created (this is often indicated in the metadata, accessible in common PDF readers), there might be an option to open the file in the application that generated it, or at least you could make your question more specific.
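One way to check how a file was created is to read its Creator and Producer metadata fields. The following is a minimal sketch, assuming the `pdfinfo` command-line tool (shipped with Xpdf and Poppler) is installed; the helper names `parse_pdfinfo` and `pdf_origin` are made up for illustration.

```python
import subprocess
from typing import Dict


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    fields = {}
    for line in output.splitlines():
        # Split at the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pdf_origin(path: str) -> Dict[str, str]:
    """Return the Creator and Producer fields of a PDF (empty if missing)."""
    # Assumption: pdfinfo is on the PATH.
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    info = parse_pdfinfo(result.stdout)
    return {key: info.get(key, "") for key in ("Creator", "Producer")}
```

Calling `pdf_origin("document.pdf")` (a placeholder file name) would report which application and PDF library wrote the file, if the file records them.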
I have recently worked on converting a PDF to editable Persian text. The best solution I have found is to use Google Docs, as follows.

First, convert the PDF pages to images. For this you can use Adobe Acrobat (not the free Adobe Reader); on Linux I use GIMP to open the PDF and choose to open each page as a separate image. It's your own choice.

Then upload the image files to Google Drive. In Google Drive, right-click each image and choose to open it with Google Docs. Wait until Google Docs opens an editable text version of your image, then copy it to Word. I don't know if there is an automated method.
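The same page-to-image-to-OCR idea can be automated locally. Note that this sketch swaps Google Docs for a different technique: it assumes the `pdftoppm` tool (from Poppler) and the `tesseract` OCR engine with its Persian (`fas`) language data are installed. The function names and the `pages` working directory are illustrative only, and OCR quality may differ from what Google Docs produces.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: str, prefix: str, dpi: int = 300) -> List[str]:
    """Command that renders each PDF page to '<prefix>-<n>.png'."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def tesseract_command(image: str, out_base: str, lang: str = "fas") -> List[str]:
    """Command that OCRs one image and writes '<out_base>.txt'."""
    return ["tesseract", image, out_base, "-l", lang]


def ocr_pdf(pdf: str, workdir: str = "pages") -> str:
    """Render every page to PNG, OCR each page, and join the text."""
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, str(out / "page")), check=True)
    texts = []
    for image in sorted(out.glob("page-*.png")):
        base = image.with_suffix("")  # e.g. pages/page-1
        subprocess.run(tesseract_command(str(image), str(base)), check=True)
        texts.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```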
I hope someday to have time to make an application that does this automatically.

I had the same problem converting PDF files to Word. After copy/paste into Word, the formatting changed and caused trouble. I tried several online converters, but they also failed. The only method that worked was as follows:
Open the PDF file with Adobe Acrobat Reader, then choose Print from the File menu. From the printer names, choose Adobe Acrobat. Yes, you are about to create a PDF from a PDF! Open the new PDF file with Google Chrome (drag and drop the file onto Chrome). Now simply select all the text (Ctrl+A) and copy/paste it into a blank Word file.
I have a two-page PDF document with minimal formatting written in Farsi (Persian), which uses an RTL Arabic script. I would like to extract the text into a plain text file in whatever encoding - UTF-8, UTF-16, CP-1256, or ISO-8859-6 would be fine. I'm using KDE 3.2.2 on GNU/Linux. I have access to Adobe Acrobat Reader and all the standard GhostScript tools. Access to a machine running Windows 2000 could be arranged at some inconvenience. Opening the PDF in Acroread, selecting the text with the mouse or with Edit-Select All, and then pasting it into a Unicode-capable text editor (Kate) doesn't seem to work. For example, for the title line I get the following gibberish, which doesn't translate to Farsi text in any of the aforementioned encodings:

('.F' Ė G1'( 1/ GĖF'Ė( 2004 'H1' F'ED1' F'G, 13'13 1/ E3ĖD'Ė3H3 'Ė Ė&'H1' Ė1'/ GĖ'E13

I have also tried using pdf2ps and ps2ascii with no success. Any other suggestions would be appreciated.

Regards, Tristan Miller

I had a friend of mine running Windows open the PDF in Acrobat, select the text, and paste it into Microsoft Word. This worked, except that the characters in each line were reversed. I suspect this has something to do with the LTR/RTL text direction settings, but so far we've been unable to correct the problem. Saving the resulting Word file as ISO-8859-6 text and then using the 'rev' Unix command doesn't reproduce the original document - probably something to do with different Arabic letterforms being used for word-initial and word-terminal letters. In the meantime, if anyone has further suggestions, please let me know!

Regards, Tristan Miller
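The 'rev' problem described above can be sketched in code. This is a minimal illustration, not a general solution: it assumes the pasted text is in visual (left-to-right) order and uses Unicode Arabic presentation forms for the positional letter shapes, which may not match what a given PDF produces. It reverses each line, maps presentation forms back to base letters with NFKC normalization, and then restores embedded Latin or digit runs. The function name and the sample string are made up for the example.

```python
import re
import unicodedata

# Runs of Latin letters/digits (optionally joined by simple punctuation)
# that must keep their left-to-right order inside an RTL line.
_LTR_RUN = re.compile(r"[A-Za-z0-9]+(?:[ .,:/-]+[A-Za-z0-9]+)*")


def visual_to_logical(line: str) -> str:
    """Convert one visually ordered line to logical (reading) order."""
    # 1. Reverse first, so ligatures such as lam-alef decompose in reading order.
    reversed_line = line[::-1]
    # 2. NFKC maps initial/medial/final/isolated forms to the base letters.
    logical = unicodedata.normalize("NFKC", reversed_line)
    # 3. Step 1 also reversed LTR runs (numbers, Latin words); undo that.
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], logical)


# Visual order: "2004", then meem (isolated), lam-alef (final), seen (initial).
sample = "2004 \uFEE1\uFEFC\uFEB3"
print(visual_to_logical(sample))
```

Combining marks, mirrored brackets, and mixed-direction punctuation are not handled here; a full treatment would need the inverse of the Unicode bidirectional algorithm.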
You may need the Acrobat Middle Eastern version for Arabic pasting to work.

Best Regards, Paulo Soares

Uli Wachowitz 7/6/2004, 11:36

Have you tried pdftotext (from my open source Xpdf package)? Linux and Windows binaries are available. It will output UTF-8 ('pdftotext -enc UTF-8 ...'), and it has (somewhat rudimentary) support for right-to-left scripts. I've run into some problems with Arabic PDF files in which the ToUnicode mappings are completely broken, i.e., whatever software creates the PDF files includes incorrect Unicode mapping info. But if you were able to copy text using Acrobat, then pdftotext should work.

Derek

Tristan Miller 8/6/2004, 3:18

In article, Derek B. Noonburg wrote:
Have you tried pdftotext (from my open source Xpdf package)?

This worked almost perfectly - thanks! The only problem was that in some cases where LTR text was embedded in an RTL paragraph in the original document, the LTR text was reversed. But there were only a couple of instances of this, so they're easily fixed by hand.

Regards, Tristan Miller

In article, Herb Martin wrote: PDFs are NOTORIOUSLY POOR at handling non-Roman text in a compatible manner. I have been seeking a general solution to this for Arabic for some time.
If you're reading this thread in soc.culture.arabic, take a look at the same thread in comp.text.pdf. A solution has been posted (pdftotext, part of the Xpdf package) which works at least in my case - perhaps it also works in the general case.

Regards, Tristan Miller
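For anyone who wants to script the pdftotext approach, here is a minimal sketch. It assumes the `pdftotext` binary is installed and on the PATH; the `-enc UTF-8` option is the one quoted in the thread, while the function names and file names are placeholders.

```python
import subprocess
from typing import List


def pdftotext_command(pdf: str, txt: str, encoding: str = "UTF-8") -> List[str]:
    """Build the pdftotext call: pdftotext -enc <encoding> <pdf> <txt>."""
    return ["pdftotext", "-enc", encoding, pdf, txt]


def extract_text(pdf: str, txt: str) -> str:
    """Run pdftotext and return the extracted text as a Python string."""
    subprocess.run(pdftotext_command(pdf, txt), check=True)
    with open(txt, encoding="utf-8") as handle:
        return handle.read()
```

As noted earlier in the thread, the result still depends on the PDF carrying correct ToUnicode mappings, and embedded LTR runs may need fixing by hand.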
I will give it a try - on your recommendation - but without much optimism, as I have tried a variety of 'to text' utilities, code pages/encodings, and cut-and-paste methods and failed to copy most Arabic script from PDFs. (I am trying to build a study sheet/word list feed for the 'flash card' program 'Pauker'.)

- Herb Martin

Herb Martin 8/6/2004, 9:08