EFM32 Instruction Execution Times

Your explanation has always been clear, but I am not certain that your facts are straight. I have worked with DSP chips and designed pipelined processors for FPGAs. Your supposition that the data from a register must "enter the pipeline at its input" is an assumption from what I can see. I would have to consult the ARM CM3 architecture reference manual to see for sure. I don't see where you have done this. That is my point. Just as you assumed an invalid value for the pipeline length, you may well be making a wrong assumption about how the pipeline works.

--

Rick
Reply to
rickman

If you want to run fast, you either put your code in RAM, or you let the processor use cache that is available on all but low end processors.
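For what it's worth, with GCC a hot function can be placed in RAM via a section attribute. A minimal sketch, assuming a linker script that provides a ".ramfunc" section located in RAM and startup code that copies it there (both the section name and the copy mechanism are assumptions; vendor scripts differ):

    /* Minimal sketch: run a hot loop from RAM with GCC.
       ASSUMPTION: the linker script defines a ".ramfunc" section
       placed in RAM, and the startup code copies it from flash;
       the name and mechanism vary between vendors. */
    __attribute__((section(".ramfunc"), noinline))
    int hot_dot(const int *s, const int *c, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += s[i] * c[i];
        return acc;
    }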

--

Rick
Reply to
rickman

So where else can data enter the pipeline except at its input? How can you have the result of a 6 cycle operation in less than 6 cycles? That is, on a general ALU; I made the exception for DSP trickeries in my first post.

Dimiter

Reply to
Dimiter_Popoff

I don't know what "DSP trickeries" means. A pipeline does multiple steps; you don't need an input until the step that uses it. The adder for the accumulation only needs the result of the accumulation on the next clock, when it starts the next add. Why would it need the result of the add at the same time as the inputs to the multiply?

Rather than making assumptions about how the ARM instruction set works, why not look it up?

--

Rick
Reply to
rickman

It is not an assumption, any more than the result of adding 1 to 1 being 2 is an assumption. A 6 cycle operation takes 6 cycles; it boils down to that. I made it clear enough; please consult my previous posts.

By DSP trickeries I mean adding extra hardware to hide from the end user the effect of the pipeline length on MAC instructions, which I explained above.

As for looking things up, everyone is free to look up whatever they want; there is no need for me to look up things for other people.

Dimiter

Reply to
Dimiter_Popoff

[quoted text garbled in the archive; the only recoverable fragment is part of a link to an ARM document, ..._m3_r1p1_trm.pdf, apparently the Cortex-M3 r1p1 Technical Reference Manual]

I believe there is a CMSIS-DSP library that implements a large number of optimised DSP functions like filters and FFTs.
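As a rough sketch of what using it looks like on an M3 (fixed point, since the M3 has no FPU) - the Q31 dot product is a real CMSIS-DSP function, but the exact prototype should be checked against arm_math.h in your CMSIS version:

    /* Hedged sketch of a CMSIS-DSP call on a Cortex-M3.
       arm_dot_prod_q31() accumulates two Q31 vectors into a
       64-bit result; verify the prototype against arm_math.h. */
    #include "arm_math.h"

    q31_t samples[64], coeffs[64];
    q63_t result;

    void run_dot(void)
    {
        arm_dot_prod_q31(samples, coeffs, 64, &result);
    }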

-Lasse

Reply to
lasselangwadtchristensen

Lol, ok, if you want to believe stuff you acknowledge you made up, then so be it. I was talking about the ARM processors. You seem to be talking about an imaginary processor that none of the rest of us know anything about.

Enjoy.

--

Rick
Reply to
rickman

Yes.

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

Those issues are well-understood, but you badly mixed up things. First, pipeline depth matters for jumps and interrupt latency, but is irrelevant for most other operations. What matters is the latency of a given operation. In particular, most modern machines manage to have 1 cycle latency for simple integer operations regardless of pipeline length. For example both the Cortex M3 and PC class processors have 1 cycle integer add latency, despite the 3 cycle pipeline in the M3 and pipelines longer than 10 cycles in PCs. Now, the difference between throughput and latency comes from pipelined execution units, but speaking about "pipeline depth" without any qualification is misleading.

Second, your PPC example may be valid, but it is quite unusual to need to keep inputs valid during the execution of an instruction. For example, when computing a dot product on a PC I had to take into account the floating point add and multiply latencies (IIRC both were 4 cycles on my machine). Since my machine at that time had no MAC instruction I had to use separate multiply and add. I had to keep 4 accumulators to hide the add latency. But in the case of the multiply I could immediately reuse the input registers for another multiply. Of course, I took advantage of out-of-order execution to reuse the logical output registers from the multiply. But even on an in-order machine I would just have to keep the outputs of the multiplies in separate registers and could still reuse the input registers. On a machine with a MAC instruction I would just keep enough accumulators and reuse the inputs.
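A minimal C sketch of that accumulator trick (the function name and the unroll factor of 4 are just taken from the description above; with four independent accumulators, each add only depends on a value computed four iterations earlier, so a 4-cycle add latency is fully hidden):

    /* Hedged sketch of latency hiding with 4 accumulators.
       Each acc only depends on its own previous value, so the
       four dependency chains interleave and one add can retire
       every cycle despite a 4-cycle add latency. */
    float dot4(const float *a, const float *b, int n)
    {
        float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            acc0 += a[i]     * b[i];
            acc1 += a[i + 1] * b[i + 1];
            acc2 += a[i + 2] * b[i + 2];
            acc3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)          /* leftover elements */
            acc0 += a[i] * b[i];
        return (acc0 + acc1) + (acc2 + acc3);
    }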

Third, most of the above is irrelevant to the Cortex M3. Namely, the M3 executes at most 1 instruction per cycle. To have the MAC doing useful work one needs to feed enough data to it. On the M3 fetching an argument is a separate instruction, so assuming both arguments come from memory (as will be the case for a long filter) we get at most 1 MAC per 3 cycles (or rather 4 cycles assuming 2 cycles for the MAC). Also, for most M3 instructions latency is the same as throughput. MAC and multiply may be special, but the time to fetch the arguments is likely to hide any extra latency.
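To make the arithmetic concrete, here is a hedged sketch of such an inner loop; a compiler would typically emit two LDRs plus an MLA (or SMLAL for a 64-bit accumulator) per iteration, which is where the 3-4 cycles per tap come from:

    #include <stdint.h>

    /* Sketch of a long-filter inner loop on a Cortex-M3.
       Each tap costs two loads plus the MAC itself, so the loads
       already hide any extra latency the MAC might have. */
    int32_t fir_tap_sum(const int32_t *s, const int32_t *c, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += s[i] * c[i];   /* roughly: LDR, LDR, MLA */
        return acc;
    }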
--
                              Waldek Hebisch
Reply to
Waldek Hebisch

Well, I'll try to explain it one more time for you. These things may be well understood, but you are clearly not among those who have understood them, so here we go again.

To repeat the example I already gave in another post, to do a MAC we need to multiply a sample (S) by a coefficient (C) and add the product to the accumulated result (A). So we have S0*C0+A0=A1, S1*C1+A1=A2 and so on.

Let us say we have a 6 stage pipeline (just to persist with the 6 figure I started my examples with). At the first line we have all 3 inputs - S0, C0 and A0 - readily available and they enter the pipeline. Next clock cycle we need S1 and C1 - which let us say we have or can fetch fast enough - and... oops, A1. But A1 will be at the OUTPUT of the pipeline 6 cycles later, so we will have to wait until then.

I don't think this can be explained any clearer. Obviously the MAC operation can be substituted by any other one which needs the output of the pipeline as an input.

This is plain arithmetic and is valid for any pipelined processor. Those which manage 1 or more MACs per cycle have special hardware to do so - doing one way or another what my source example does, with the registers they use simply hidden from the programming model.
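For reference, the dependency being argued about is the loop-carried one on the accumulator; a minimal C statement of it (illustrative names only):

    /* The loop-carried dependency in question: each iteration's
       add needs the A produced by the previous iteration.  If the
       result really only appeared at the pipeline output after k
       stages, each iteration would stall for up to k-1 cycles;
       with forwarding the add can issue every cycle. */
    long mac_chain(const int *S, const int *C, int n)
    {
        long A = 0;
        for (int i = 0; i < n; i++)
            A = S[i] * C[i] + A;   /* A(i+1) = S(i)*C(i) + A(i) */
        return A;
    }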

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

On 11/8/15 6:12 AM, Dimiter_Popoff wrote: Well, I'll try to explain it one more time for you. [...]

The issue is the assumption that the MAC instruction can't be scheduled to start until all the data it will need at any point during execution is already available.

The scheduler knows that the addend isn't needed until cycle 5, and can see that it will be available then, so it can start the MAC now and get that data later.

Yes, this makes the scheduler more complicated, but it can be (and is) done to keep things running fast. (It might not be done in all cases, but would commonly be done for this sort of case for any processor designed for efficient DSP.)
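A rough cycle picture of that (the stage names and the 6-stage depth are illustrative, not taken from any ARM manual):

    cycle:    1    2    3    4    5    6    7
    MAC #0:   F    D    M1   M2   ACC  WB
    MAC #1:        F    D    M1   M2   ACC  WB

MAC #0 produces A1 in its ACC stage (cycle 5); MAC #1 does not need A1 until its own ACC stage (cycle 6), so the value can be forwarded internally and no stall is required.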

Reply to
Richard Damon

Of course, like I said, DSPs have been dealing with this for decades. But it takes some specific opcode(s); for a MAC between registers (as on "normal" load/store processors) it is impractical to track all the opcodes currently in the pipeline in order to make this sort of decision. That is the main reason why load/store machines are designed with more registers - the idea behind RISC is to leave more work to the software and thus save silicon area. ARM apparently started as a cheaper/lower power tradeoff; it has worked quite well for them and is still working now. The too-few-registers impediment comes into effect only when horsepower begins to matter - and I think they have addressed this in their 64-bit model (I am not familiar with it though).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

When designing a pipelined MAC, making it such that the accumulation operand has to be internally delayed a few cycles before it is used is... strange. Normally, it would be designed in such a way that back-to-back MACs can be issued, and the accumulation operand is just forwarded to the next instruction internally. Creating logic to catch the most recent contents of a "register" would be trivial. I have a hard time seeing how anyone would skip such an easy optimization, with a big payoff, especially for a register constrained architecture. But there might still be such implementations, of course.

BR Jakob

Reply to
jakbru

"A complete conspiracy is a law of nature."

(or words, probably in French, to that effect.)

Reply to
Mel Wilson

We all know this, but Dimiter seems to want to hold onto the idea that the CM3 is constructed the way he imagines it, without considering the possibility that it is different, and refuses to do any work to verify the facts.

At this point I consider his posts on the topic to be trollish and without value.

--

Rick
Reply to
rickman

Your suggesting for days that I look something up is of huge value, sure. So what did you look up?

BTW, did you eventually understand my explanation? I would have expected someone doing logic designs to be a lot quicker in doing so.

Dimiter

Reply to
Dimiter_Popoff

I've already told you I understand your explanation. The issue is not the logic of your idea; the issue is whether your ideas apply to the CM3 or not. As many here have pointed out, the limitations you impose are in no way an inherent part of a pipelined processor. I have already said that pipelined designs typically have exactly the logic needed for a pipeline to be fully useful, which you seem to feel is DSP "trickery". Rather, this is just intelligent design.

I think I'm done with this conversation. It is just going in circles and getting nowhere.

--

Rick
Reply to
rickman

OK, no point continuing indeed. You have your generic assumptions against my experience plus my numeric explanation - you are free to stick to your beliefs of course.

Reply to
Dimiter_Popoff

Unless I'm severely mistaken, most Cortex M3 processors are "low end" and do not sport caches.

At least on the ST parts, not all of the RAM is connected to the processor's instruction bus, so you don't get as much speedup as you'd think. Some do have a magic memory address range that's dual-ported to both buses.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

According to Wikipedia (not always reliable) the CM3 has no cache. There's still a RAM speedup, and I forgot that most CM3 devices use a prefetch to make sure the CPU has instructions when they are needed.

Think about it. Why would they keep speeding up the CPU clock if performance were limited by the Flash alone?

--

Rick
Reply to
rickman
