OT: Copying text from a PDF

I was just doing exactly that from a Motorola (Freescale) PDF for a software simulation. It doesn't handle tabs (are there any in a PDF?) and deletes 'whitespace'. BUT, it didn't take very long to restore the spacing. Not great but easier than typing the whole deal. GG

Reply to
Glenn Gundlach

Leon Heller wrote: [...]

Don'cha love it when the author turns off the "Text Copy" tool on the document so you can't copy and paste? Why they do that is beyond me. You could print as many copies as you wish, or make infinite copies on a Xerox machine. Why make it difficult to copy a couple of lines of text?

Another moan is when the author uses some weird font that produces garbage characters when you paste into a text editor. I often end up shrinking the editor to a small window that overlays the PDF file, and do a manual copy.

Then there's the text in a scanned image format. No copying, no searches, and it takes a lot of room on the disk.

Hopefully, in 50 years or so, paper will be found only in museums, and everyone will have flexible electronic displays. Since there will be no need to print anything, searches will be easy, and there won't be a need to use special fonts or lock the document for any reason. Life will be easy for engineers.

Sure...

Mike Monett

Reply to
Mike Monett

One thing I notice that's amiss is that there is a carriage return before and after subscripted text. So:

V 50 V DS

Comes out as VDS 50 V

The symbol characters (degrees and ohms) also tend to get translated/screwed up, depending on where you're pasting to. Some lines are also scrambled, so that the ends of some lines end up together on later lines.

Problems in extracting text are mostly a function of the application that created the PDF (in this case FrameMaker 5.5 for the Power PC, set to LaserWriter 8 8.7, and Acrobat Distiller 4.0 for Macintosh). If you open the document in Illustrator you can see many individual blocks of text, some of which the copy operation strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing, although those dot leaders are irritating to fix. I tried pasting into a text-only application (UltraEdit), Excel, the Open Office text editor and MS Word, and all came out pretty much the same except for the symbols. It might even be faster than re-typing everything.
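
For what it's worth, the dot-leader cleanup can be scripted instead of done by hand. This is only a rough sketch in Python: the function name, the choice of separator, and the threshold of four dots (picked so an ordinary "..." in prose is left alone) are all assumptions of mine, not anything from the datasheet or the tools mentioned above.

```python
import re

# Four or more dots, optionally separated by spaces: ". . . . ." or "....."
# (threshold is an assumption, chosen so a plain "..." is not touched)
DOT_LEADER = re.compile(r"\s*(?:\.\s*){4,}")

def collapse_dot_leaders(line: str, separator: str = "\t") -> str:
    """Replace each run of dot leaders in a pasted line with one separator."""
    return DOT_LEADER.sub(separator, line).strip()

pasted = "Drain to Gate Voltage (Note 1) . . . . . . . . . . V DGR 50 V"
print(collapse_dot_leaders(pasted, " | "))
# prints: Drain to Gate Voltage (Note 1) | V DGR 50 V
```

With the default tab separator, the result pastes into Excel as two columns. It does nothing about the mangled symbols or the split subscripts; those still need fixing by eye.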

Extracting text using GSView in "normal" mode is only slightly better.

Best regards, Spehro Pefhany

--
"it's the network..."                          "The Journey is the reward"
speff@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com
Reply to
Spehro Pefhany

Quite often I have trouble extracting text from a PDF. I use the Text tool, copy, but on then pasting into my text editor I get garbage. Each individual character gets a return inserted. Typical example is at

formatting link
where I just wanted to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially protected, wouldn't the Text tool be inaccessible?

--
Terry Pinnell
Hobbyist, West Sussex, UK
Reply to
Terry Pinnell

What's your text editor? Assuming you're under Windows, perhaps the problem is trying to paste Unicode into an editor that can't handle it. You might try pasting the text into Word or Wordpad to see what happens.

You might also look at xpdf,

formatting link
. I don't think you can run the PDF viewer under Windows, but the command-line utilities, including a PDF-to-text converter, will work.
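
If you end up converting more than one datasheet, the converter can be driven from a script. A minimal sketch, assuming the xpdf command-line tools are installed and that `pdftotext` is on the PATH; the helper names and the file names are made up for illustration. The `-layout`, `-f` and `-l` options ask `pdftotext` to keep the physical layout and to limit the page range.

```python
import shutil
import subprocess

def build_pdftotext_command(pdf_path, txt_path, first_page=None,
                            last_page=None, keep_layout=True):
    """Assemble the argument list for the pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")            # keep columns roughly where they were
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd

def extract_text(pdf_path, txt_path, **options):
    """Run pdftotext; raises if the tool is missing or the conversion fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options),
                   check=True)

# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(build_pdftotext_command("datasheet.pdf", "datasheet.txt",
                                       first_page=1, last_page=1)))
```

Calling `extract_text("datasheet.pdf", "datasheet.txt", first_page=1, last_page=1)` would then write the first page as plain text. Whether `-layout` helps or hurts depends on the document, so it is worth trying both ways.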

Matt Roberds

Reply to
mroberds

I'm using Adobe Acrobat 4... I have version 5, but it's been screwed over by zealot programmers, so I only use it to read some stuff that version 4 lacks font capability for.

With version 4 I get spaces with subscripted text, no carriage returns; otherwise looks OK.

...Jim Thompson

--
|  James E.Thompson, P.E.                           |    mens     |
|  Analog Innovations, Inc.                         |     et      |
|  Analog/Mixed-Signal ASIC's and Discrete Systems  |    manus    |
|  Phoenix, Arizona            Voice:(480)460-2350  |             |
|  E-mail Address at Website     Fax:(480)460-2142  |  Brass Rat  |
|       http://www.analog-innovations.com           |    1962     |
             
I love to cook with wine.      Sometimes I even put it in the food.
Reply to
Jim Thompson

I just tried it and it worked OK for me when I pasted the text into the PFE editor. Here are a couple of lines:

Drain to Source Breakdown Voltage (Note 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .V DS

50 V Drain to Gate Voltage (R GS = 20k Ù ) (Note 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V DGR 50 V Continuous Drain Current T C

It's not perfect, but I haven't got a CR after every character.

I often extract text from PDFs when creating PCB parts, and don't have many problems.

Leon

Reply to
Leon Heller

Thanks for all those prompt responses. I'll follow up the suggestions.

Using TextPad here - great editor.

Same result when pasting into various other apps. I shouldn't have said returns after *every* character, but still pretty bad:

formatting link

--
Terry Pinnell
Hobbyist, West Sussex, UK
Wed 1 June 2005, 08:36 UK time
Reply to
Terry Pinnell

Couple of options.

1. Under Adobe Reader 6, use the snapshot tool to copy and paste into Word or Excel.

2. Download an alternative, quicker-to-open PDF reader from

formatting link
and use the text tool and paste into Excel. This will give you a more coherent display, but still not perfect.

Cheers

Reply to
Chris

ooops

formatting link

Reply to
Chris

Thanks. Yes, that is arguably an improvement:

formatting link
compared to Adobe Acrobat Reader (5 in my case; each version seems to get worse to me!):
formatting link
but I see PDF Reader has pasted a fixed size font rather than the original proportional?

--
Terry Pinnell
Hobbyist, West Sussex, UK
Reply to
Terry Pinnell

I use Clipmate

formatting link
which has nice text cleanup. Apparently it was not necessary for:

30A, 50V, 0.040 Ohm, N-Channel Power MOSFET

It showed up as WYSIWYG

--

    Boris Mohar
Reply to
Boris Mohar

Using the Column Select tool in my Adobe Reader, I get:

Features • 30A, 50V • r DS(ON) = 0.040 Ω • SOA is Power Dissipation Limited • Nanosecond Switching Speeds • Linear Transfer Characteristics • High Input Impedance • Majority Carrier Device • Related Literature

- TB334 “Guidelines for Soldering Surface Mount Components to PC Boards”

Which is close. Apparently when characters are in the symbol font, a carriage return is inserted. My reader is version 5.0.5.
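
If the stray returns are the only damage, they can be stripped after pasting. A rough sketch, assuming that a blank line still marks a real paragraph break and that a single space is an acceptable join; the function name is my own.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join the lines of each paragraph; blank lines still separate paragraphs."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    # split()/join collapses every run of whitespace, including the stray returns.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

pasted = "r\r\nDS(ON)\r\n= 0.040\r\n\r\nSOA is Power\r\nDissipation Limited"
print(unwrap_paragraphs(pasted))
```

Note that this joins with a space, so a subscript split off from its symbol comes back as "r DS(ON)" rather than "rDS(ON)"; there is no way to tell from the pasted text alone which returns were subscripts.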

Doug

Reply to
DGoncz

....but guess I must have used WordPad for the first! Don't recall doing so - but can't think of any other explanation. So that makes pdf reader definitely an improvement.

--
Terry Pinnell
Hobbyist, West Sussex, UK
Reply to
Terry Pinnell

Just downloaded it. Thanks. Wouldn't want to run it under 'doze anyway. :-)

BTW, Ghost Script/Ghost View extracts it with no problem. So does Acrobat but it's easier with Ghost.

Ted

Reply to
Ted Edwards

Three suggestions: Get PMView and use the screen capture => convert to 16 color => Save as a .PNG. The file size for the max ratings is

Reply to
Ted Edwards

Thanks. I took a look at PMView but it seems to be just a (versatile) image viewer, rather like several others (e.g. IrfanView), which can also Print to File. Maybe I should explore the second part of your recommendation; what 'virtual PostScript printer' do you use please?

BTW, I have Snagit, which can also capture *text* from many windows, although it fails in the PDF example under discussion.

--
Terry Pinnell
Hobbyist, West Sussex, UK
Reply to
Terry Pinnell

Barely a day goes by that Slackware doesn't pleasantly surprise me! It seems I got xpdf along with it, and lo and behold:

------------------------

30A, 50V, 0.040 Ohm, N-Channel Power MOSFET This is an N-Channel enhancement mode silicon gate power field effect transistor designed for applications such as switching regulators, switching converters, motor drivers, relay drivers and drivers for high power bipolar switching transistors requiring high speed and low gate drive power. This type can be operated directly from integrated circuits. Formerly developmental type TA9771. Ordering Information PART NUMBER PACKAGE BRAND BUZ11 TO-220AB BUZ11 NOTE: When ordering, use the entire part number.

Features · 30A, 50V · rDS(ON) = 0.040 · SOA is Power Dissipation Limited · Nanosecond Switching Speeds · Linear Transfer Characteristics · High Input Impedance · Majority Carrier Device · Related Literature - TB334 "Guidelines for Soldering Surface Mount Components to PC Boards" Symbol D G S

--------------

Cheers! Rich

Reply to
Rich Grise

Thanks for the text paste.

Must say I'm a bit lost on that site

formatting link
Can you help me locate specifically the PDF to text converter please? I'm wallowing in files with off-putting and Windows-alien names like 't1lib-1.3.tar.gz'.

--
Terry Pinnell
Hobbyist, West Sussex, UK
Reply to
Terry Pinnell
