How to eliminate duplicate strings?

Short status: The target is a 68332; the compiler and linker are Microtec C++. The application is written mainly in C++.

We are running out of FLASH memory, and a check of the linker map revealed that 800 Kbyte out of almost 2 Mbyte is used for the strings segment. Quite a lot for an embedded system with no GUI.

Further checks with the cygwin command

strings prom.bin |sort|uniq -c

reveals that most of the strings are RTTI information for C++, and many are repeated 50 or 100 times!

(strings finds printable strings in the binary; sort and uniq are used to sort the strings and count duplicates.)

The raw output of strings is approx. 800K as expected, and if the duplicates are removed it is squeezed to 120K!
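For anyone who wants to repeat the measurement without cygwin, here is a minimal Python sketch of roughly what the pipeline does. The function names and the 4-character minimum are assumptions (4 is the usual default of strings), and it only approximates the real utility: any run of printable ASCII counts, so code bytes that happen to look like text are counted too.

```python
import re
import sys
from collections import Counter

def count_strings(data: bytes, min_len: int = 4) -> Counter:
    """Count runs of at least min_len printable ASCII bytes in a binary image."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return Counter(m.group() for m in pattern.finditer(data))

def summarize(counts: Counter) -> tuple:
    """Return (total_bytes, unique_bytes), adding one terminator byte per string."""
    total = sum((len(s) + 1) * n for s, n in counts.items())
    unique = sum(len(s) + 1 for s in counts)
    return total, unique

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        counts = count_strings(f.read())
    total, unique = summarize(counts)
    print(f"total {total} bytes, unique {unique} bytes")
    # Show the most frequently repeated strings first.
    for s, n in counts.most_common(20):
        print(f"{n:7d} {s.decode('ascii')}")
```

Comparing the two totals gives the same kind of figure as the 800K versus 120K above.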

Is there a way to eliminate the duplicate strings? Logically the linker should be able to analyze what is entered into the strings segment, and eliminate identical strings that are already there.

Since the object format is said to be IEEE, it may be possible to use another linker, e.g. GNU ld, without replacing the compiler (which has its "specialities").

Has anyone tried that?

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

Well, doesn't that almost force the solution: turn off RTTI --- you almost certainly won't be needing that in an embedded system, anyway.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Hans-Bernhard Broeker

Unfortunately, that would require some redesign. The exact amount is not known just now, but there are reasons why it was not turned off.

If we could eliminate the duplicates, we would be up and running without touching the source code!

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

Careful with the assessment that everything found by 'strings' is actually a string. Code can look like text to the 'strings' utility, especially if you feed it a flat binary core image instead of a structured object file format.

Looking at 'size -A' of individual .obj files or the debuggable object file might be a better test, here.

And for actual strings, the linker is probably doing that already. But I'm far from certain that such compression can be done on RTTI tables without breaking them. If it could, wouldn't the compiler/linker vendor have done it already?

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Hans-Bernhard Broeker

Hi Hans-Bernhard

Thanks for your interest in the problem.

I am aware of the false strings in the output. They may account for a few percent, but inspection of the output from strings reveals lots of real strings, which are duplicated.

We actually reverse-converted the S19 file that was produced by the build process, and ran strings on that file. This should eliminate all debug information etc. The size of that output is very close to what the linker map says about the strings segment, so I think we are looking at the real thing.

Microtec claims to use IEEE format, and GNU m68k-elf-objdump can read their .obj files. It shows that there is a binary RTTI segment (which I cannot interpret), but the type strings are in the strings segment. So the RTTI segment is probably a set of pointers into the strings segment.

Thus it should not change anything for the running code if the address of one string is replaced by the address of another identical string (and the first string removed from the binary image). But the linker does _not_ do that at the moment.

Unfortunately, the Microtec dialect of IEEE seems incompatible with GNU m68k-elf-ld, which gave an assert when I tried.

We are now in contact with Microtec support, but there is no solution so far.
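To make the idea concrete, here is a small Python sketch of the merge. It assumes the RTTI table really is just a list of offsets to NUL-terminated strings in one segment, which is my guess and not something the Microtec documentation confirms; merge_strings and its arguments are hypothetical names.

```python
def merge_strings(segment: bytes, pointers: list) -> tuple:
    """Store each distinct NUL-terminated string once and remap the pointers.

    pointers holds byte offsets into segment (a stand-in for the addresses
    in the RTTI table). Returns (new_segment, new_pointers).
    """
    new_segment = bytearray()
    offset_of = {}        # string bytes -> offset in new_segment
    new_pointers = []
    for ptr in pointers:
        end = segment.index(b"\x00", ptr)   # find this string's terminator
        text = segment[ptr:end]
        if text not in offset_of:
            # First time this text is seen: append it to the merged segment.
            offset_of[text] = len(new_segment)
            new_segment += text + b"\x00"
        new_pointers.append(offset_of[text])
    return bytes(new_segment), new_pointers
```

Note that this sketch drops any string that no pointer refers to. A real linker would have to work from the relocation records to know every reference, which is why the linker is the natural place for it.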

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

Uhm, maybe a stupid question, but why compile with RTTI?

Imo RTTI can be handy, but you hardly ever *really* need it. If the application really uses RTTI, maybe a redesign is in order to eliminate the need for it?

PeterV

Peter de Vroomen

As stated earlier in this thread, we do need RTTI. Although this is an embedded system, we use templates and dynamic casts. Nobody really wants to give an estimate of the redesign to take it out.

And yes, you can always make another program than the one you have. But you don't get it for free. ;-)

Ever heard of super tankers breaking apart due to engine failure during a hurricane? ;-) We don't want that to happen to our system.

In such a mission-critical system, the cost of test, verification and approval can be prohibitive.

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

You're using RTTI in a mission-critical system? Wow.

Why?

Steve

Steve at fivetrees

In some cases, the compiler has an option to merge duplicate strings; however, this usually happens only within a single module.

I had a similar problem one time; in this case it was a point of sale terminal. I was asked to make several enhancements to an existing application that was already completely filling the available code space in the terminal.

I noted that there was a fair number of duplicated strings, and that they were spread through several modules.

I wrote a program to scan all of the source files, and identify all strings and the number of occurrences of each. On a second pass, it replaces all literal strings (i.e. not already variables) occurring more than once with character array references, and also generates XSTRINGS.H and XSTRINGS.C, which contain definitions for the string arrays. It also accepts a file listing strings NOT to change, in case you happen to be unlucky enough to be working on a system allowing writable strings and someone actually did that.
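The original program is not shown here, so the following is only a rough Python sketch of the two passes as described. The regular expression, the xstr_ name prefix and all function names are assumptions; it ignores comments, character constants and adjacent-literal concatenation, and it skips preprocessor lines so that #include file names are left alone.

```python
import re
from collections import Counter

# Simplified C string literal: handles backslash escapes only.
STRING_LITERAL = re.compile(r'"(?:[^"\\\n]|\\.)*"')

def literals_in(text: str) -> list:
    """Return the string literals in text, skipping preprocessor lines."""
    found = []
    for line in text.splitlines():
        if not line.lstrip().startswith("#"):
            found.extend(STRING_LITERAL.findall(line))
    return found

def count_literals(sources: dict) -> Counter:
    """Pass 1: count every literal across all files (name -> source text)."""
    counts = Counter()
    for text in sources.values():
        counts.update(literals_in(text))
    return counts

def make_definitions(counts: Counter, keep=frozenset(), prefix="xstr_") -> dict:
    """Give each literal seen more than once (and not in keep) an array name."""
    repeated = sorted(l for l, n in counts.items() if n > 1 and l not in keep)
    return {lit: f"{prefix}{i}" for i, lit in enumerate(repeated)}

def rewrite(text: str, names: dict) -> str:
    """Pass 2: replace the repeated literals with their array names."""
    out = []
    for line in text.splitlines(keepends=True):
        if not line.lstrip().startswith("#"):
            line = STRING_LITERAL.sub(
                lambda m: names.get(m.group(), m.group()), line)
        out.append(line)
    return "".join(out)

def emit_c(names: dict) -> str:
    """Generate the shared definitions (the XSTRINGS.C part)."""
    return "".join(f"const char {n}[] = {lit};\n" for lit, n in names.items())
```

The keep argument plays the role of the "strings NOT to change" file. A matching header with extern declarations would be generated the same way as emit_c.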

You could try something like that - it worked very well for me.

Regards, Dave

--
Dunfield Development Systems          http://www.dunfield.com
Low cost software development tools for embedded systems
Software/firmware development services       Fax:613-256-5821
Dave Dunfield

Our problem is similar: we are also close to the maximum of the hardware platform.

But unless you run the "string fixer" on some intermediate file produced by the compiler's C++ pass, it won't do the job here. Most strings are created in that step, not in the source.

I still think the right place is in the linker, which has all relevant information.

The vendor, Microtec/Mentor Graphics, gave some suggestions on linker options, but they did not change anything. Haven't heard from them for some days, but the problem has got a number. ;-)

A hack to make GNU ld link Microtec's object files, and optimize the strings, may also be a solution. The formats are close, but not identical.

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

You mean we rely on information stored in RAM, or what? So does the underlying RTOS.

The basic decision is to use C++, which some people argue is not "safe". I think the compiler is far better at throwing around pointers to objects and structures than a human programmer is. And the application _is_ that complex. And it works. That is why we don't want to just cook up another solution. This is not the toy business. ;-)

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

For the record, C++ templates don't require the use of RTTI.

-- Michael N. Moran (h) 770 516 7918

5009 Old Field Ct. (c) 678 521 5460 Kennesaw, GA, USA 30144

"So often times it happens, that we live our lives in chains and we never even know we have the key." The Eagles, "Already Gone"

The Beatles were wrong: 1 & 1 & 1 is 1

Michael N. Moran

Just for your info, Microtec support came up with the same "solution": Edit the intermediate assembler files in 325 compilations to add MERGE_START/MERGE_END where appropriate. No definition of appropriate.

:-(

--
mdc at manbw dk  -  MAN B&W Diesel A/S, Copenhagen
www.manbw.com    -  Electronics & software dept.
      -  Speaking for myself only. -
Mogens Dybæk Christensen

[...]

I find it quite surprising that most compilers for embedded programming don't seem to have an automatic optimization mode for this. My in-house developed Pascal/Modula2 compiler does it as one of the first steps in its optimization routines. The final assembler file can look like this snippet:

;
;; String references
;
STR2:
STR4:
STR18:
STR22:
STR0:   .DB "Saving... ",0
STR3:
STR5:
STR19:
STR23:
STR1:   .DB "OK",0
STR6:   .DB "PIN=",0
STR7:   .DB "ID=",0
STR8:   .DB "I=",0
STR9:   .DB " sec",0
STR11:
STR10:  .DB "DL",0
STR13:
STR12:  .DB "OL",0
STR15:
STR21:
STR14:  .DB ", ",0
STR16:  .DB "Calibrating",0
STR17:  .DB " ",0
STR26:
STR20:  .DB "A/D not calibrated!",0
STR24:  .DB "No program saved!",0
STR25:  .DB "Terminal module",0
STR27:  .DB "
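The label aliasing in that listing can be sketched in a few lines of Python. This is only an illustration of the grouping step, not the Pascal/Modula2 compiler's actual code; emit_pooled is a hypothetical name, and the sketch does not escape quote characters inside the text.

```python
def emit_pooled(strings: list) -> str:
    """Emit each distinct text once, preceded by every label that refers to it.

    strings is a list of (label, text) pairs in the order the compiler
    created them.
    """
    labels_for = {}                     # text -> list of labels, in first-seen order
    for label, text in strings:
        labels_for.setdefault(text, []).append(label)
    lines = []
    for text, labels in labels_for.items():
        # All labels but the last become bare aliases for the same address.
        lines.extend(f"{label}:" for label in labels[:-1])
        lines.append(f'{labels[-1]}:\t.DB\t"{text}",0')
    return "\n".join(lines) + "\n"
```

Because every alias is just another label on the same address, the code that references the strings needs no changes.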

Bjarne Bäckström

Oh but they do! The one at hand just failed to use it on the RTTI string tables --- and the workaround they proposed was to turn it on for those, too, by massaging the intermediate asm source a bit.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Hans-Bernhard Broeker
