I was just doing exactly that from a Motorola (Freescale) PDF for a software simulation. It doesn't handle tabs (are there any in a PDF?) and deletes 'whitespace'. BUT, it didn't take very long to restore the spacing. Not great but easier than typing the whole deal. GG
Don'cha love it when the author turns off the "Text Copy" tool on the document so you can't copy and paste? Why they do that is beyond me. You could print as many copies as you wish, or make infinite copies on a Xerox machine. Why make it difficult to copy a couple of lines of text?
Another moan is when the author uses some weird font that produces garbage characters when you paste into a text editor. I often end up shrinking the editor to a small window that overlays the PDF file, and do a manual copy.
Then there's the text in a scanned image format. No copying, no searches, and it takes a lot of room on the disk.
Hopefully, in 50 years or so, paper will be found only in museums, and everyone will have flexible electronic displays. Since there will be no need to print anything, searches will be easy, and there won't be a need to use special fonts or lock the document for any reason. Life will be easy for engineers.
One thing I notice that's amiss is that there is a carriage return before and after subscripted text. So:
V 50 V DS
Comes out as VDS 50 V
The symbol characters (degrees and ohms) also tend to get translated/screwed up, depending on where you're pasting to. Some lines are also scrambled, so that the ends of some lines end up together on later lines.
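For the symbol problem, one workaround is to map the characters to plain ASCII after pasting. This is only a rough sketch: it assumes the symbols arrive as proper Unicode (if the paste already turned them into garbage, nothing here recovers them), and the function name and the replacement table are made up for illustration.

```python
# Hypothetical replacement table: extend it with whatever symbols your datasheets use.
SYMBOL_MAP = str.maketrans({
    "\u2126": "Ohm",   # OHM SIGN
    "\u03a9": "Ohm",   # GREEK CAPITAL LETTER OMEGA, often used for ohms
    "\u00b0": " deg",  # DEGREE SIGN
    "\u00b5": "u",     # MICRO SIGN
    "\u2022": "-",     # BULLET
})

def to_ascii_symbols(text: str) -> str:
    """Replace common datasheet symbols with ASCII stand-ins."""
    return text.translate(SYMBOL_MAP)

if __name__ == "__main__":
    print(to_ascii_symbols("rDS(ON) = 0.040 \u2126"))
```

The result then pastes the same way into any editor, at the cost of losing the original glyphs.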
Problems in extracting text are mostly a function of the application that created the PDF (Framemaker 5.5 for the Power PC set to LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this case). In this case, if you open the document in Illustrator you can see many individual blocks of text, some of which the copy operation strings together, and others which it misses.
This stuff is fairly easily fixed by a bit of editing-- those dot leaders are irritating to fix. I tried pasting into a text-only application (Ultraedit), Excel, the Open Office text editor and into MS Word, and all came out pretty much the same except for the symbols. It might even be faster than re-typing everything.
Extracting text using GSView in "normal" mode is only slightly better.
Best regards, Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
speff@interlog.com Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com
Quite often I have trouble extracting text from a PDF. I use the Text tool, copy, but on then pasting into my text editor I get garbage. Each individual character gets a return inserted. Typical example is at
formatting link
where I just wanted to extract the details under 'Absolute Maximum Ratings'.
What's the deal here please? If the document is proprietorially protected, wouldn't the Text tool be inaccessible?
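If the paste really does come out with one character per line, a few lines of script can glue it back together faster than hand editing. The sketch below is an assumption about what the garbage looks like (single characters, each on its own line, with a blank or space-only line where a space used to be); the function name is hypothetical.

```python
def collapse_char_lines(text: str) -> str:
    """Join runs of one-character lines back into normal text."""
    out = []  # finished output lines
    run = []  # characters collected from consecutive one-character lines
    for line in text.splitlines():
        if len(line) <= 1:
            # An empty line is assumed to stand in for a space.
            run.append(line if line else " ")
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(line)  # ordinary lines pass through unchanged
    if run:
        out.append("".join(run))
    # Squeeze repeated spaces inside each rebuilt line.
    return "\n".join(" ".join(l.split()) for l in out)

if __name__ == "__main__":
    print(collapse_char_lines("V\nD\nS\n \n5\n0\n \nV"))
```

Lines longer than one character are left alone, so text that pasted correctly is not disturbed.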
What's your text editor? Assuming you're under Windows, perhaps the problem is trying to paste Unicode into an editor that can't handle it. You might try pasting the text into Word or Wordpad to see what happens.
You might also look at xpdf (formatting link). I don't think you can run the PDF viewer under Windows, but the command-line utilities, including a PDF-to-text converter, will work.
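The converter in the xpdf package is called pdftotext. As a hedged sketch of driving it from a script: this assumes pdftotext is installed and on your PATH, uses its `-layout` and `-enc` options and a `-` output name to write to stdout, and the file name `buz11.pdf` and both function names are placeholders, not anything from the datasheet site.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path: str, keep_layout: bool = True) -> list:
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd

def extract_text(pdf_path: str, keep_layout: bool = True) -> str:
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise if pdftotext reports an error
    )
    return result.stdout.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(pdftotext_command("buz11.pdf"))  # placeholder file name
```

From a command prompt the equivalent is simply `pdftotext -layout buz11.pdf out.txt`. Whether the symbols survive still depends on how the PDF was made.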
I'm using Adobe Acrobat 4... I have version 5, but it's been screwed over by zealot programmers, so I only use it to read some stuff that version 4 lacks font capability for.
With version 4 I get spaces with subscripted text, no ; otherwise looks OK.
...Jim Thompson
--
| James E.Thompson, P.E. | mens |
| Analog Innovations, Inc. | et |
| Analog/Mixed-Signal ASIC's and Discrete Systems | manus |
| Phoenix, Arizona Voice:(480)460-2350 | |
| E-mail Address at Website Fax:(480)460-2142 | Brass Rat |
| http://www.analog-innovations.com | 1962 |
I love to cook with wine. Sometimes I even put it in the food.
Using the Column Select tool in my Adobe Reader, I get:
Features • 30A, 50V • r DS(ON) = 0.040 Ω • SOA is Power Dissipation Limited • Nanosecond Switching Speeds • Linear Transfer Characteristics • High Input Impedance • Majority Carrier Device • Related Literature
- TB334 “Guidelines for Soldering Surface Mount Components to PC Boards”
Which is close. Apparently when characters are in the symbol font, a carriage return is inserted. My reader is version 5.0.5.
....but guess I must have used WordPad for the first! Don't recall doing so - but can't think of any other explanation. So that makes pdf reader definitely an improvement.
Thanks. I took a look at PMView but it seems to be just a (versatile) image viewer, rather like several others (e.g. IrfanView), which can also Print to File. Maybe I should explore the second part of your recommendation; what 'virtual PostScript printer' do you use please?
BTW, I have Snagit, which can also capture *text* from many windows, although it fails in the PDF example under discussion.
Barely a day goes by that Slackware doesn't pleasantly surprise me! It seems I got xpdf along with it, and lo and behold:
------------------------
30A, 50V, 0.040 Ohm, N-Channel Power MOSFET This is an N-Channel enhancement mode silicon gate power field effect transistor designed for applications such as switching regulators, switching converters, motor drivers, relay drivers and drivers for high power bipolar switching transistors requiring high speed and low gate drive power. This type can be operated directly from integrated circuits. Formerly developmental type TA9771. Ordering Information PART NUMBER PACKAGE BRAND BUZ11 TO-220AB BUZ11 NOTE: When ordering, use the entire part number.
Features · 30A, 50V · rDS(ON) = 0.040 · SOA is Power Dissipation Limited · Nanosecond Switching Speeds · Linear Transfer Characteristics · High Input Impedance · Majority Carrier Device · Related Literature - TB334 "Guidelines for Soldering Surface Mount Components to PC Boards" Symbol D G S
Can you help me locate specifically the PDF to text converter please? I'm wallowing in files with off-putting and Windows-alien names like 't1lib-1.3.tar.gz'.