adsp2187L: assembler or C? [cross-post]

Hi, I have a custom-designed board based on an ADSP-2187L DSP, with 512 KB of flash, 128 KB of SRAM (20 ns cycle) and an FPGA running at 50 MHz.

The DSP's task is mainly processing commands and data from external links (SpaceWire) on a 200~500 us cycle. Since the current software has been written in assembler and there are some concerns about its reliability, I was wondering whether it is worthwhile to consider moving to a higher-level language like C, in order to gain maintainability and readability without losing too much performance.

Thanks.

--
Alessandro Basili
CERN, PH/UGC
Electronic Engineer
Reply to
Alessandro Basili

ADSP218x is a dinosaur with an address space of 16K words; what is all that external memory for? This DSP is inconvenient for C programming as it doesn't support stack frames or base-index addressing.
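To illustrate what those two missing features cost a C compiler, here is a small sketch (hypothetical host C, not ADSP-218x code): the local array lives in a per-call stack frame, and `buf[i]` is a base-plus-index access. On a target without either feature, the compiler has to emulate both with extra instructions.

```c
/* A routine a compiler can only translate efficiently if the target
 * supports stack frames (per-call local storage addressed relative to
 * a frame/stack pointer) and base-index addressing (base register +
 * index register in one operand). */
int sum_squares(int n)
{
    int buf[8];              /* local array: lives in the stack frame */
    int i, total = 0;
    if (n > 8)
        n = 8;
    for (i = 0; i < n; i++)
        buf[i] = i * i;      /* base (buf) + index (i) addressing */
    for (i = 0; i < n; i++)
        total += buf[i];
    return total;
}
```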

Yes, you can program the ADSP218x in C; the VDSP toolset is recommended. Yes, there will be significant overhead. It is impossible to tell whether there is enough speed without actually knowing the application. Why get stuck with 20-year-old technology?

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant

formatting link

Reply to
Vladimir Vassilevsky

The main purpose of the DSP is basically data compression in a physics detector readout system. We have some 300,000 12-bit ADC channels to read out at a frequency of ~2 kHz, out of which only a few signals are meaningful. That's why we use the external memory to store the raw event; we then try to compress it using cluster reconstruction with cuts on the ADC distributions.
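The kind of zero-suppression described above can be sketched as follows (a minimal illustration, not the actual flight code; the names and the simple pedestal cut are made up): scan a channel of raw ADC samples, keep only contiguous runs above a cut, and record each run's position and summed charge.

```c
#include <stddef.h>

typedef struct {
    size_t start;    /* first sample index of the cluster */
    size_t length;   /* number of samples above the cut */
    long   charge;   /* summed ADC counts above the pedestal cut */
} cluster_t;

/* Scan n ADC samples; each contiguous run of samples above `cut`
 * becomes one cluster.  Returns the number of clusters found,
 * writing at most max_out of them into `out`. */
size_t find_clusters(const int *adc, size_t n, int cut,
                     cluster_t *out, size_t max_out)
{
    size_t count = 0, i = 0;
    while (i < n && count < max_out) {
        if (adc[i] > cut) {
            cluster_t c = { i, 0, 0 };
            while (i < n && adc[i] > cut) {
                c.charge += adc[i] - cut;   /* charge above pedestal */
                c.length++;
                i++;
            }
            out[count++] = c;
        } else {
            i++;
        }
    }
    return count;
}
```

Only the clusters are stored or transmitted, which is where the compression comes from when most channels carry no signal.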

By the way, what are "stack frames" and "base-index addressing"?

Maybe I can think about having the framework in C, while the specific compression algorithms (which will be the most critical ones) may still be done in assembler.

Indeed, I believe there is no general answer. But I believe that simply "translating" from assembler to C would be a bad idea, and I should maybe focus on the possibility of restructuring the whole program.

The 20-year-old technology was the one which didn't suffer radiation effects at low-Earth-orbit exposure. At least this was the claim when the choice was made (some 10 years ago already!).

Reply to
Alessandro Basili

(snip)

Much embedded programming is done in C with inline assembler.

With some compilers, you can switch back and forth in the middle of a function. Depending on the processor, keeping track of which registers to use may or may not be a problem, but in most cases it can be done and results in fast code.

-- glen

Reply to
glen herrmannsfeldt

Inline assembler is bad style and a characteristic feature of lame programmers. It combines disadvantages of both C and assembler.

If there is a need to use an assembler, make a separate module in assembler and call it from C as an external function.

The result is incomprehensible, unalterable, undebuggable and unportable write-only code.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant

formatting link

Reply to
Vladimir Vassilevsky

(snip, I wrote)

Except for the "lame" part, I agree.

Usually a good idea, though it depends on the function call overhead. Sometimes it isn't so bad to have only the function statement in C, with all the code in assembler.

Or use #ifdef to select between the assembler code and equivalent C code. It is then portable, but possibly too slow.
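That #ifdef pattern looks roughly like this (a sketch; the `USE_ASM` symbol and the routine name are made up for illustration): one portable C implementation and one optional assembly implementation, selected at build time behind the same interface.

```c
#include <stdint.h>

#ifdef USE_ASM
/* Provided by a separate .s file on targets that have one. */
extern uint32_t checksum(const uint8_t *p, uint32_t n);
#else
/* Portable fallback: same interface, same result, possibly slower. */
uint32_t checksum(const uint8_t *p, uint32_t n)
{
    uint32_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}
#endif
```

Builds that define USE_ASM link the assembly version; everything else compiles the C fallback, so the code stays portable at the cost of speed on targets without the assembly module.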

Well, assembler is pretty much unportable anyway. If you only write the inner loops in assembler then you get speed, with only a small amount of write-only code.

The last time I did it was for a Z280. There are few enough registers that you pretty much don't have to worry about them. It was surprisingly easy to write and debug. A big advantage for the Z280 is the bank switching function call. The compiler gets that right, while keeping track of which bank the code is in.

-- glen

Reply to
glen herrmannsfeldt

"Some concerns about its reliability" is pretty vague. Is it broken? "If it ain't broke, don't fix it."

You want to spend time rewriting working software so it's more maintainable, even though you have no clear problem with it? This sounds like a horrible idea to me.

--
Randy Yates                      % "So now it's getting late,
Digital Signal Labs              %    and those who hesitate
mailto://yates@ieee.org          %    got no one..."
http://www.digitalsignallabs.com % 'Waterfall', *Face The Music*, ELO
Reply to
Randy Yates

Amen, brother! My philosophy exactly!

--
Randy Yates                      % "Bird, on the wing,
Digital Signal Labs              %   goes floating by
mailto://yates@ieee.org          %   but there's a teardrop in his eye..."
http://www.digitalsignallabs.com % 'One Summer Dream', *Face The Music*, ELO
Reply to
Randy Yates


That was how I did it with a large application for the ADSP-2187 that I developed 15 years ago, whose main computation function needed much faster floating point than the IEEE library routines provided. The company I worked for has just upgraded to a Blackfin.

Reply to
Leon

I disagree. It's a matter of taste and style, of course. It is also a matter of the application and the target - for DSPs with typical DSP code, assembly can make a much bigger difference over C than for more common microprocessor code.

Sometimes you can't avoid using assembly. A typical good reason is that you need access to low-level features that can't be expressed in C (such as registers that are accessible only with special instructions), for a few bits of startup code, as part of an interrupt routine (if your compiler does a poor job), or for handling key parts of an RTOS.

A typical bad reason is speed-optimising code. Sometimes it /is/ important to get the best possible speed out of the system within a small section of the code. You may also be writing library code or other heavily-used code, where it can be worth the effort. But in a great many cases when people think they need to write code in assembly for speed, they are wrong - either it is not worth the cost (in terms of development time, maintainability, readability, correctness, robustness, portability, etc.), or the C compiler will actually do a good job if the programmer just learned to use it properly.

So all in all, for most targets and most applications, you don't often need assembly when using modern tools. Since well-written high-level code is /generally/ clearer and more portable than even well-written assembly (though you can write bad code in any language), the preference should always be for using high level coding unless you have overriding reasons for using assembly.

Then you have the choice - separate assembly modules, or inline assembly.

If you are writing large sections of assembly code, then assembly modules make sense. It is clearer to stick to a single language at a time, and use tools suitable for that language, and large "asm(...)" statements are as messy as large multi-line pre-processor macros.

But if you are mixing C and assembly, and want to have minimal assembly, then inline assembly is the way to go, especially if you have good tools. There are some C compilers that can't deal well with inline assembly - they insist on it being restricted to "assembly functions" only, or they turn off all optimisations for functions that use inline assembly. Other compilers work very well with inline assembly - the compiler will handle register and/or stack allocation, and happily include the inline assembly in its optimisation flow. If you are using such a compiler (gcc is a well-known example, but there are commercial tools that work well too), then you will probably get the smallest and fastest code by using inline assembly and letting the compiler do its job. The actual assembly code itself can often be tucked away in "static inline" functions.

Inline assembly lets you mix assembly with the C to get the /best/ of both worlds.

As an example, I once re-wrote the C startup code used by a particular compiler, so that the startup code was in C instead of the original assembler. The original assembler code was quite well-written and clear, but it is difficult to write assembly that is general, clear, and efficient. You often can't write the code to take advantage of particular circumstances (such as optimising based on the values of compile-time constants) without it being messy and full of conditional assembly. But the C compiler will do such optimisations fine. So my C code, along with a couple of lines of inline assembly to set the stack pointer, was much smaller and clearer in source code, and the target code was significantly smaller and faster. The code couldn't have been written in pure C, and the mix with inline assembler was a big improvement over the external assembly module.

If you need to write code requiring a lot of register tracking, then I can see that inline assembly will be messy. But for a lot of uses, when done properly (both by the user, and by the toolchain vendor), inline assembly is much clearer and simpler than external assembly modules, and results in smaller and faster code.

Reply to
David Brown

PS: I highly recommend yasm for x86 assembly.

formatting link

--Randy

--
Randy Yates                      % "...the answer lies within your soul
Digital Signal Labs              %       'cause no one knows which side
mailto://yates@ieee.org          %                   the coin will fall."
http://www.digitalsignallabs.com %  'Big Wheels', *Out of the Blue*, ELO
Reply to
Randy Yates

And yes, I do realize this was about the ADSP2187, but I enjoyed yasm so much I want to contribute to their advertising campaign.

--
Randy Yates                      % "Maybe one day I'll feel her cold embrace,
Digital Signal Labs              %                    and kiss her interface, 
mailto://yates@ieee.org          %            til then, I'll leave her alone."
http://www.digitalsignallabs.com %        'Yours Truly, 2095', *Time*, ELO
Reply to
Randy Yates

In comp.dsp David Brown wrote: (snip, I wrote)

The last few assembly routines I wrote were to use the IA32 RDTSC instruction. It conveniently returns the 64 bit result in EDX:EAX, the usual return registers for (long long) on IA32. The executable instructions are RDTSC and RET.

Yes, those are the cases where inline assembler works best, but also, as mentioned, is ugly, non-portable, etc. You can write just the inner loop, maybe only a few instructions, in assembler with the rest in C.

Last I wrote inline assembler was in Dynamic C for the Rabbit 3000, which is pretty much a Z180. Unlike the Z80, code has a 24-bit address space, with special call and ret instructions to change the high bits. With inline assembler, the C compiler keeps track of the addressing.

For the Z180 there isn't much optimization.

(snip)

For the Z180, there aren't many registers, and some have special uses, so there really isn't much of a problem with register tracking.

-- glen

Reply to
glen herrmannsfeldt

On the x86, you have so few registers that the compiler can't help you much in the allocation, and many instructions have implicit register usage. It also means that the function call overhead is less, since data is on the stack anyway (though the actual processor implementation may keep copies internally in registers). So it really doesn't matter if you have fixed registers in your assembly code, and with a bit of luck the processor will handle the "ret" instruction early on in the instruction pipeline.

But let's think about what actually happens in the processor with the "rdtsc" instruction implemented as an external assembly module, and as inline assembly. Bear with me on the details here - I have never used assembly on the x86, and I haven't tested this code.

Case 1 - external module.

In assembly.s, you have:

readTimestamp:
        rdtsc
        ret

In test.c, you have:

extern uint64_t readTimestamp(void);

void test(void)
{
    // part A
    uint64_t t = readTimestamp();
    // part B
}

First, consider what the compiler knows. All it knows about readTimestamp is that it follows the C calling convention, and returns a value in EDX:EAX. It must assume that the function may change any volatile registers (according to the standard x86 ABI), and it may read or write any memory. This means any values held in local registers in part A, such as loop counters, pointers, etc., must be preserved on the stack before calling readTimestamp(), and restored afterwards. Similarly, any outstanding memory writes must be done, and in part B any values from memory must be re-read. In other words, the call to readTimestamp is a serious block in the flow of the optimiser.

Secondly, consider the processor executing the code. The function call and the return are non-conditional, so they will be executed early in the instruction pipeline. But any jump in the instruction flow means a new block of memory needs to be in the cache, with associated risks of cache misses, page misses, etc.

In other words, it can be quite costly to read the timestamp this way.

With inline assembly, you have just test.c:

static inline uint64_t readTimestamp(void)
{
    uint64_t x;
    asm (" rdtsc " : "=A" (x) : : );
    return x;
}

void test(void)
{
    // part A
    uint64_t t = readTimestamp();
    // part B
}

The use of this inline assembly within the test() function is identical. But now the compiler knows everything about it - it knows that readTimestamp changes EDX:EAX, but leaves EBX and ECX untouched, and neither reads from nor writes to memory. This gives it much more flexibility in optimising the code in test(). The generated code will be nothing more than the single "rdtsc" instruction and whatever register movements are needed, and as it is inline there is no change of flow when the processor is executing it.

For processors with more registers, there is even more to be gained with inline assembly. On the PPC, for example, I used this inline assembly function recently in code that converted a set of values from little-endian to big-endian:

static inline uint32_t readByteSwapped(const uint32_t * p, uint32_t x)
{
    uint32_t y;
    asm (" lwbrx %[y], %[x], %[p] "
         : [y] "=r" (y)
         : [x] "r" (x), [p] "r" (p) );
    return y;
}

By using inline assembly, the compiler can allocate registers and pipeline the byte-swapped reads optimally. If I made this an external assembly routine following the C calling conventions, the code would have been ten times slower - using shifts and masks for the conversion would have been faster.

Assembly is /always/ ugly and non-portable. Being inline or external makes no difference there.

The only cure for a poor compiler is to get a better one, if there is one available.

Reply to
David Brown

In comp.dsp David Brown wrote: (snip, I wrote)

(snip)

(snip)

There is another complication, though, relating to pipelining and RDTSC. In your example, you want Part A to run before RDTSC, and Part B to run after. Without the jump, it is more likely that the processor will be able to execute RDTSC out of order, such that the timing doesn't do what you expect.

(snip)

Most of the time I try to time a whole loop, such that register use isn't so much of a problem. Not quite long enough to time with millisecond TOD clocks, though.

At least for IA32, I have been told that many compilers have a built-in inlined bswap() such that you don't need to write one. (snip)
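That's right for gcc and clang at least: both provide `__builtin_bswap32` and `__builtin_bswap64` intrinsics, so there is usually no need for hand-written byte-swap assembly - the compiler emits the single-instruction form where the target has one. A minimal sketch (the wrapper name is made up):

```c
#include <stdint.h>

/* byteswap32 reverses the four bytes of x, e.g. for converting
 * between little-endian and big-endian representations.  The
 * __builtin_bswap32 intrinsic is gcc/clang-specific. */
uint32_t byteswap32(uint32_t x)
{
    return __builtin_bswap32(x);
}
```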

(snip)

-- glen

Reply to
glen herrmannsfeldt

That is partially true. As you say yourself, "it is more likely" - forcing a jump makes it more likely that part A runs before RDTSC, and part B runs after it. But it doesn't guarantee it.

This is a common problem in situations when you need assembly rather than C, and it is something that a lot of people have trouble understanding. It is a common mistake to think that "volatile" can be used to get it right - without realising that the compiler can do a lot of re-ordering of non-volatile instructions and accesses over and around the volatile ones. Another common mistake is to think you can write it in assembly, and that tricks like forcing an unnecessary jump or call will ensure things are executed in the right order - without realising that the processor can do substantial re-ordering.

In fact, you need both parts - you need compiler-specific functionality to tell the compiler to keep part A and part B separate, and you need target-specific functionality to tell the processor to stall until in-flight instructions are completed.

The compiler feature you need is a "memory barrier", that tells the compiler to complete any outstanding calculations and stores, and that code afterwards must re-read from memory (and thus can't be executed before the memory barrier). With gcc, the simplest memory barrier is a "memory clobber" inline assembly - asm volatile ("" ::: "memory"). Other compilers have different methods, and if you are using an OS that provides barriers, then use them.
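A sketch of that memory-clobber barrier in use (gcc-style; the macro and function names are made up): the empty asm with a "memory" clobber emits no instructions, but forces the compiler to complete pending stores before the barrier and re-read memory after it, so the measured region can't be shuffled past the barriers by the optimiser. Note it says nothing about what the CPU itself may reorder - that still needs a fence instruction.

```c
/* Compiler-only barrier: no instructions emitted, but the compiler
 * may not move memory accesses across it. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

volatile int result;

void timed_work(int n)
{
    int acc = 0, i;
    COMPILER_BARRIER();          /* nothing from above may sink below */
    for (i = 0; i < n; i++)      /* ...the region being measured...   */
        acc += i;
    COMPILER_BARRIER();          /* nothing from below may hoist above */
    result = acc;
}
```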

The cpu feature will be some sort of "sync" instruction. I don't know what that might be on the x86 - it might be quite complicated depending on the exact effect you are trying to achieve. Again, if your OS or compiler has such functions, use them. (They will typically be implemented as macros and inline assembly.)

But in any case, a method that is "more likely to work" is not a solution.

Accurate measurement of timing on a processor as big and complex as modern x86's is far from easy - it is always going to be inaccurate, and a rough average over time is the best you can get.

But it's a fair point - there is no need to make your code smaller or faster than it actually needs to be. Premature optimisation is the root of all evil, after all.

On the other hand, there is no point in making your code needlessly bigger and slower, especially when better alternatives are easier to write.

Different processors and different compilers have different features - this code was specific to the PPC using CodeWarrior (it would also work with gcc). If I were using Diab Data's compiler (I can't remember what they are called now), I'd use its extensions that let you declare data as big- or little-endian explicitly, and let the compiler generate the best code.

Reply to
David Brown

In comp.dsp David Brown wrote: (snip regarding inline or external assembly code using RDTSC)

Yes. There could also be a task switch in between, which should result in a very large increment in the count. I have never seen that in my uses of RDTSC. I once even did it from Java using JNI and it seemed to work just fine.

(snip)

For IA32 the closest I find is SFENCE, Store Fence, which guarantees that all previous stores are complete before following stores are done. (My words after reading the description.) (snip)

Usually averaged over a long loop it is close enough, at least for a specific generation of processors. Also, RDTSC reduces the problem of variable clock rates, counting clock cycles instead of real time.

Yes, I usually don't do that until it really needs to be done.

-- glen

Reply to
glen herrmannsfeldt

thanks. I would like to borrow one arsonist.

Reply to
Ala

Vladimir Vassilevsky wrote in news:7oSdncAnWruvTU_RnZ2dnUVZ snipped-for-privacy@giganews.com:

As someone who has written a considerable amount of 218x code, the short answer is FORGET USING THE C COMPILER! The ADSP-218x architecture predates the emergence of C as a universal language. The registers are not orthogonal, the number of pointers is limited, etc.

The good news is that 218x assembly is very easy to read and understand and looks more like C than traditional assembly. It is not a difficult processor.

I agree with Vladimir that it is essentially obsolete. The only reason I see to use it is if you have a large existing code base.

You could use a new ADSP-21489 or ADSP-21479 fixed/floating point DSP for less money, easier programming in C or assembly, and substantially more performance. I mention SHARC because the assembly language has a similar, but actually easier, structure and could be ported to C as desired.

You could also use any number of Blackfin targets. These are just the ADI choices. There are many other possibilities as well.

Al Clark

formatting link

Reply to
Al Clark
