EFM32 Instruction Execution Times

Hi,

I'm trying to find information on instruction execution times for the Silicon Labs/Energy Micro EFM32 Cortex-M3 processor, in particular the MLA (multiply-accumulate) instruction, but others as well. I've found the instruction in the reference manual, but cycle times are not mentioned anywhere.

This is surreal. Every assembly language reference manual I've ever used includes cycle counts for each instruction. Here they're nowhere to be found.

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

I'm not 100% certain, but I think details like this are the same for all CM3 processors, since all makers of these chips license the same core design from ARM. They can optimize various aspects like cache size, memory and peripherals, but ARM has been moving toward standardizing more and more of the core CPU design, so there is a great deal of consistency across all the instantiations of their design.

Check at the ARM web site for docs on the CM3 core.

I'm curious why you are working with this particular part. I have looked at their devices and not found a lot that makes them stand out in the crowd of CM3s. Their big deal is supposed to be low power, but I didn't find them to be much lower power than the many other CM3s available.

--

Rick
Reply to
rickman

Be prepared for surprises with the MAC instruction on a processor that is not a specialized DSP. Even if they specify 1-cycle throughput, this can be unrealistic given the few registers they have. If they have a 6-stage pipeline, it takes at least 18 registers just for the MAC intermediate results to bypass the data dependencies. IOW, if you just write a loop with a counter you will need at least 6 cycles per multiply-add (plus perhaps some additional time for the mul), simply because every multiply-add needs the result of the previous one before it can add to it.
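
To make that dependency concrete, here is a minimal C sketch (the function and variable names are made up for illustration; a compiler would typically map the inner statement to the MAC instruction):

#include <stdint.h>

/* Naive FIR-style MAC loop: every iteration needs the acc value
 * produced by the previous iteration before it can do its own add,
 * so the accumulate steps cannot overlap in the pipeline. */
int32_t mac_naive(const int16_t *x, const int16_t *c, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)c[i];  /* serial dependency on acc */
    }
    return acc;
}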

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
formatting link
------------------------------------------------------
formatting link

Reply to
Dimiter_Popoff

You might want to rethink that. The accumulate operation (the add) is typically one clock cycle, while the multiply is sometimes multiple cycles. I don't know what the multiply time is in the CM3; I thought it was one cycle as well, but perhaps that is a pipelined time. Regardless, the multiply spits out a result on every clock, which is then added to the accumulator on each clock, producing a MAC result on each clock.

I remember that in the CM4 they claimed to be able to get close to 1 MAC per clock with various optimizations.

--

Rick
Reply to
rickman

The Cortex M4 has a range of additional instructions aimed precisely at DSP operations such as MAC. That is the main difference between the M3 and the M4.

The M3 and M4 have a 3 stage pipeline.

A very quick Google search shows that MLA (32x32 -> 32) on the M3 is 2 cycles. It is 1 cycle on the M4. If you want 64-bit results and accumulates, it is 4 to 7 cycles on the M3 and 1 on the M4. The M4 also has a variety of other DSP-style instructions, including SIMD instructions that do 16-bit or 8-bit MACs in parallel.
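
Roughly, in C terms (a sketch; the MLA/SMLAL mapping is what compilers typically do for these expressions, not something the language guarantees):

#include <stdint.h>

/* 32x32 -> 32 multiply-accumulate: compilers usually emit MLA for this
 * (2 cycles on the M3 per the figures above, 1 cycle on the M4). */
int32_t mac32(int32_t acc, int32_t a, int32_t b)
{
    return acc + a * b;
}

/* 32x32 -> 64 multiply with a 64-bit accumulate: usually SMLAL
 * (multi-cycle on the M3, single-cycle on the M4). */
int64_t mac64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}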

With its very short pipeline, the M4 has enough registers to keep up a good throughput on MAC operations - significantly better than an M3 in many circumstances.

Even an M4 is not going to compete with a dedicated DSP on MAC throughput per clock cycle - but it is /vastly/ easier to work with. The real question is what the OP actually wants to do, and if his M3 (or a replacement M4) is good enough - there is no point in going for a hideous architecture that can do 1 GMAC/s if 1 MMAC/s is more than enough for the application.

Reply to
David Brown

Hi Rick,

Thanks, I will.

If I were choosing the processor from scratch I would almost certainly have chosen the M4, assuming it made sense from a power POV. However, I'm coming in on the tail end of someone else's project, so the choice wasn't mine and was made a while back.

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

Rick, I realized you are asking why this particular CM3. Of course the answer is still that I didn't make the choice.

To pick your brain, why not the EFM32? Is there something that detracts from this SiLabs choice?

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

Hi David,

I don't want to sound ungrateful, but why in the hell must I resort to Google to get this deeply domain-specific information? It belongs in a reference manual.

Turns out Rick was right - it's in the ARM Cortex M3 TRM:

formatting link

The goal is to implement a high-performance filter in few enough cycles to get back to low-power mode and meet a specific battery life goal. Is the CM3 "good enough"? TBD. There are a lot of choices (processing architecture, filter specifications, etc.) that will decide that.

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

Should I assume that I can't talk you into an FPGA design in a low power device?

--

Rick
Reply to
rickman

That would require a board respin. Not good!

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

No, I gave it a once-over when it came out and once or twice since then. My only bone to pick is that they "brag" about the low-power aspects, but the rest of the field has improved right along with Energy Micro, so they don't exactly stand out in this respect. I don't see where SiLabs has added much to the offering, but then I haven't given them a good look in a year or two.

I don't have a lot of CPU projects; I'm more of an FPGA guy. When I do use a CPU, I tend to look for eval boards or the like to get started with, and TI is really good in that regard. I think mostly there just isn't much difference between CPUs in general, and CMx parts specifically, unless there is a particular peripheral you need.

If you learn more about the EFM32 line and find something especially useful or unique, let us know.

--

Rick
Reply to
rickman

Yes, of course. I didn't quite grasp what you were saying. You want to duty-cycle: run the signal processing at full power, then idle at low power. That's exactly how it is done with most CPUs.
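
As a rough illustration of the duty-cycle arithmetic (all numbers below are hypothetical, not EFM32 datasheet figures):

#include <stdio.h>

/* Average current for a duty-cycled load: run the filter at full speed
 * for t_active out of every t_period, sleep the rest of the time.
 * I_avg = I_active * d + I_sleep * (1 - d), with d = t_active / t_period. */
int main(void)
{
    const double i_active_ma = 5.0;    /* hypothetical run-mode current   */
    const double i_sleep_ua  = 1.0;    /* hypothetical deep-sleep current */
    const double t_active_ms = 0.5;    /* time to run the filter block    */
    const double t_period_ms = 10.0;   /* one processing period           */

    double d     = t_active_ms / t_period_ms;
    double i_avg = i_active_ma * d + (i_sleep_ua / 1000.0) * (1.0 - d);

    printf("duty cycle %.1f%%, average current %.3f mA\n", d * 100.0, i_avg);
    return 0;
}

With those made-up numbers the average works out to about 0.25 mA, which is why shaving cycles off the filter (so t_active shrinks) feeds straight into battery life.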

--

Rick
Reply to
rickman

Hah, it appears I am the only one - not only in this group - to have really gone through this.

The multiply does spit out a result every cycle, OK. But this is at the end of the pipeline, so each multiply started 6 cycles (to stay with my 6-stage example) before its result comes out. Now, since we accumulate the result in one register - and that register is also at the input of the pipeline for the multiply-add opcode - a new instruction cannot begin going through the pipeline before the previous one is finished, not without some additional DSP-ish trickery - which "normal" processors do not have, or if they do, they talk about some "DSP engine" or the like.

I had to do this the hard way on the e300 power core: in a straightforward loop, the FMADD (64-bit FP multiply-add) would take something like 20-30 ns (at a 2.5 ns clock period). The latency specified for the FMADD is just 2 cycles, though; I had to bypass the data dependencies by using at least 6 (I did 6, 7 and 8) sets of 3 registers, so the loop would go through all the sets, which all had different destination registers and thus had enough time for the pipeline every cycle. At the end of the loop all 6 (or 7 or 8) destination registers are simply added to get the final result. I have posted it before; hopefully this explanation is better than my previous ones. Here is the source of how this works:

formatting link

Notice that it also saves loads/stores by a factor of 6 (or 8 in the example - I think it is the 8-set case); the measured performance with this was 5.5 ns per FMADD (the theoretical best, with no load/store involved, would have been 5 ns).
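
For what it is worth, the same idea expressed as plain C (just a sketch of the multiple-accumulator trick with four accumulators rather than 6/7/8, not the actual e300 code; n is assumed to be a multiple of 4):

#include <stdint.h>

/* Same dot product, but with four independent accumulators so that
 * consecutive multiply-adds do not depend on each other's result.
 * The partial sums are only combined once, after the loop. */
int32_t mac_unrolled4(const int16_t *x, const int16_t *c, int n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;

    for (int i = 0; i < n; i += 4) {
        a0 += (int32_t)x[i]     * (int32_t)c[i];
        a1 += (int32_t)x[i + 1] * (int32_t)c[i + 1];
        a2 += (int32_t)x[i + 2] * (int32_t)c[i + 2];
        a3 += (int32_t)x[i + 3] * (int32_t)c[i + 3];
    }
    return (a0 + a1) + (a2 + a3);
}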

Now, David said the ARM in question has only 3 stages in its pipeline, so it would take 9 registers to bypass its data dependencies; that might even be doable with the few registers they have. [In fact the above is a good example of why ARM try to keep their pipelines short; their architecture just does not have the registers it takes to keep a longer pipeline full and productive, which is a major architectural limitation for a load/store machine.]

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
formatting link
------------------------------------------------------
formatting link

Reply to
Dimiter_Popoff

No, that is most certainly /not/ why ARM wants to keep these pipelines short. I am not disagreeing with your calculations regarding throughput, latency, and registers on hardware that is not DSP-dedicated (and I don't know what DSP features an M4 really has, as I haven't needed them myself). You've pointed out these issues before, and I think they are often misunderstood - people see the "MAC instruction timing 1 cycle" and think they can get 72 MMACs from a 72 MHz M4. So it is good that you raise awareness here.

But these are primarily control-oriented microcontroller cores - short pipelines mean low latencies, consistent timings, short branch delays, minimal interrupt latency jitter, small core die area, and low power. Being able to improve the throughput of long MAC chains is merely a bonus.

Remember, the M4 core is not in the same class as the e300 - you would do better to compare the e300 to a Cortex A device with NEON SIMD instructions and see how that compares in DSP throughput. (Alternatively, you could compare MACs/s per $, or per mW, to get a fairer match.)

Reply to
David Brown

And guess which link turns up at the top of a google search for "Cortex M3 MLA timing"?

You can argue that Silicon Labs should have information in their own datasheets, or at least pointers to the ARM documents. They probably /do/ have that information there somewhere, if you dig deep enough. But sometimes googling is a lot faster, easier and less stressful than looking in the "right" places.

Without knowing a good deal more, it is impossible to guess. But certainly "run as fast as possible for a short time, then sleep" is the right way to minimise power. Have plenty of capacitors on the board to reduce power spikes to the battery.

Reply to
David Brown

I don't quite understand what you are saying. You seem to be saying the pipeline is 6 clock cycles long, which does not seem to be supported by the facts. Then you propose that the inputs to the instruction have to be available at the *start* of the instruction (not sure what that even means, really, as instructions are fetched, decoded and executed - which one is the "start"?), which is not necessarily true. I don't know that pipelining the MAC instruction requires anything special from the CPU other than the various controls required for pipelining.

I don't know how many clocks it takes to do pipelined MAC instructions on the CM3. I do know they specifically added all the required logic to do fully pipelined MACs on the CM4; the real limitation seems to be memory accesses. Perhaps it was a 16-bit mode where two coefficients would be fetched in one memory operation and two data values were fetched in another, but they were able to reach 1 MAC per clock as long as nothing got in the way.
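
Something like the following inner loop is what gets you there (a plain C sketch; whether the compiler actually uses single 32-bit loads for the int16_t pairs and SMLAD-style dual MACs on the M4 depends on the compiler and flags; n is assumed to be even):

#include <stdint.h>

/* Two 16-bit MACs per loop iteration.  On an M4, a good compiler can
 * fetch each int16_t pair with one 32-bit load and combine the two
 * products with a dual-MAC (SMLAD-type) instruction. */
int32_t mac_pairs(const int16_t *x, const int16_t *c, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        acc += (int32_t)x[i]     * (int32_t)c[i];
        acc += (int32_t)x[i + 1] * (int32_t)c[i + 1];
    }
    return acc;
}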

There is no reason why one processor would be the same as another in this regard. This link seems to be something other than ARM CM3 code. I'm guessing this is your e300 power core.

I'm not in a position to debate this since I am not so familiar with the ARM instruction set, but I don't see any reason to use more registers for a simple instruction like the MAC than are actually required. I have never seen a problem with overlapping register usage in pipelined instruction sets. As long as the register is updated by the time it is used, it all works. Otherwise, what is the point of pipelining?

--

Rick
Reply to
rickman

In addition to everything else that's mentioned, with today's processors you're highly constrained by pipelining & whatnot.

Most of the parts that I've worked with need lots of wait states to run out of flash -- I wouldn't be surprised if the processor spends most of its time twiddling its thumbs waiting on memory.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

Well, I cannot share your certainty about ARM's motivation, but I strongly suspect they _do_ know about the data dependencies and do take them into account when designing.

What I am pointing out is the architectural limitation; the MAC loop is just one good example of how it takes (pipeline depth) times 3 registers, plus address pointers, counters, etc., to keep the pipeline productive.

Of course, as you say, most applications do not need all the resources, and there are architectures much worse than ARM doing fine commercially, etc.; I am not interested in that discussion at all.

My point is about the number of registers a load/store machine needs in order to make use of a given pipeline depth. ARM is fundamentally limited in that respect by having too few registers while being a load/store machine at the same time; there is nothing one can do about these figures.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
formatting link
------------------------------------------------------
formatting link

Reply to
Dimiter_Popoff

I just stick to the same example from the beginning for clarity.

Well, I know that on the surface this is easy to overlook, as I had not thought about it until I had to deal with it. But it is a general issue. Operands enter the pipeline at its input; if one of those operands needs to be the output of the pipeline, guess what - you have to wait for the entire pipeline length to be walked through before you have all the operands for the next operation. Let us try the MAC example: at the pipeline input you need a sample, a coefficient and the accumulated value - S, C and A. Assume that calculating S*C+A takes as many steps as the pipeline is deep, say 6 cycles. Now we start with S0*C0+A0=A1; next cycle we do S0*C0+A1. But A1 will not be available for another 6 cycles, not before S0*C0+A0 makes it to the end of the pipeline. This is called a data dependency.

Well I hope I did explain it well enough this time :-). Pipelining is powerful but like anything else it has its limitations, the above example summarizes it quite well.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
formatting link
------------------------------------------------------
formatting link

Reply to
Dimiter_Popoff

My mistake - obviously this should read

"Now we start with S0*C0+A0=A1; next cycle we do S1*C1+A1."

Dimiter

Reply to
Dimiter_Popoff
