EFM32 Instruction Execution Times

- David Brown

Mon, Nov 9, 2015 8:27 AM

I think it is /possible/ to have cache on a CM3, but it is certainly not common.

There are three points here:

Cortex-M devices use the Thumb-2 instruction set - the aim of this is that a solid majority of instructions are 16-bit. Since these CPUs are single-issue, that means you can run your CPU at a clock speed that is on average about 50-70% higher than the flash, assuming a 32-bit bus.
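As a rough check on the 50-70% figure, here is a minimal sketch. It assumes (this model is not from the post) that the core consumes one instruction per cycle, that the flash delivers one 32-bit word per flash cycle, and that instructions are either 16 or 32 bits long. The function name and the sample fractions are illustrative only. Under this model, roughly 67% to 82% 16-bit instructions corresponds to a CPU clock 50% to 70% above the flash clock.

```python
# Crude fetch-bandwidth model (an assumption, not taken from any datasheet):
# one bus-wide flash read per flash cycle, one instruction consumed per CPU cycle.

def cpu_to_flash_clock_ratio(frac_16bit, bus_bits=32):
    """Average number of instructions delivered per bus-wide flash read."""
    avg_instr_bits = 16 * frac_16bit + 32 * (1 - frac_16bit)
    return bus_bits / avg_instr_bits

# Illustrative instruction mixes (hypothetical values).
for frac in (0.5, 2 / 3, 0.8, 0.9):
    ratio = cpu_to_flash_clock_ratio(frac)
    print(f"{frac:.0%} 16-bit instructions: CPU clock about {ratio - 1:.0%} above flash")
```

The model ignores data accesses to flash, alignment effects and branches, so it only shows where the order of magnitude comes from.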

There are ways to get processors going faster than flash even without a processor instruction cache. In particular, it is common for the flash units in faster microcontrollers to have a small buffer/cache in the flash module. If this is combined with wide access flash, say 64-bit wide, you can easily get streams of instructions at cpu speed (but with a penalty for branches and calls).
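The effect of a wide flash with a small buffer can be sketched in the same spirit. The model below is an assumption made for illustration: a flash read takes `flash_cycles` CPU cycles and returns one line of `line_bits` bits, the next line is prefetched while the current one executes, and every taken branch costs one full flash read. All parameter names and values are hypothetical.

```python
# Crude model of a flash line buffer with sequential prefetch.
# Every parameter value used here is hypothetical.

def cycles_per_instruction(flash_cycles, line_bits, avg_instr_bits, taken_branch_frac):
    """Estimated average CPU cycles per instruction when fetching from flash."""
    instrs_per_line = line_bits / avg_instr_bits
    # Straight-line code: limited either by the core (1 cycle per instruction)
    # or by how fast the flash can supply whole lines.
    sequential = max(1.0, flash_cycles / instrs_per_line)
    # Taken branches miss the buffer and wait for a fresh flash read.
    branch_penalty = taken_branch_frac * flash_cycles
    return sequential + branch_penalty

for line_bits in (32, 64, 128):
    cpi = cycles_per_instruction(flash_cycles=3, line_bits=line_bits,
                                 avg_instr_bits=20, taken_branch_frac=0.1)
    print(f"{line_bits}-bit flash line: about {cpi:.2f} cycles per instruction")
```

With these numbers the 64-bit line already keeps straight-line code at one instruction per cycle, and what remains is the branch penalty mentioned above.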

And here is the main point - manufacturers /don't/ keep speeding up CPU clock speed on the CM3. Most serious manufacturers who make fast CM microcontrollers have moved to the CM4 - some never bothered with the CM3 in the first place. They put caches (and single-precision floating point) on their faster devices.

So yes, CM3 devices /are/ low end - they are now found either in older, legacy parts (in this field, that means more than a couple of years old), or as microcontrollers in integrated chips where the cpu plays a minor role (such as a high-end ADC that happens to have a cpu integrated).

- George Neuner

Mon, Nov 9, 2015 5:52 PM

Data can be inserted into the pipeline at the input of any functional unit. It can also be extracted at the output of any functional unit even for units in the middle of the pipe.

The general method is called "bypass" ... A pipeline does multiple steps; you don't need an input until the step

A MAC unit is a combination of a multiplier and an adder. A very low end MAC may be serial, but most are pipelined. A pipelined unit will have a *zero* cycle forwarding path from the adder's output back into the adder's input, so the result can be used on the very next cycle.

You can't. However, the point of the pipeline is to parallelize operations by overlapping them.

In your example:

    S0 * C0 + A0 = A1
    S1 * C1 + A1 = A2
      :
    Sn * Cn + An = A(n+1)

The output of the MAC can be fed back directly into its adder to be available on the next cycle.

So consider a typical low-end 4-cycle pipelined MAC, combining a 3-cycle multiplier with a 1-cycle adder, that can *start* a new operation on every cycle:

    cycle  operation(s)

      1    S0 * C0 -> T0
      2    S1 * C1 -> T1
      3    S2 * C2 -> T2
      4    S3 * C3 -> T3 , T0 + A0 -> A1
      5    S4 * C4 -> T4 , T1 + A1 -> A2
      6    S5 * C5 -> T5 , T2 + A2 -> A3
      :

The 1st result takes 4 cycles - the length of the pipeline - but once the pipe is primed, it begins to produce a new result on every succeeding cycle.
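The cycle table can be reproduced with a small simulation. This is only a sketch of the schedule as described in this post: it assumes a multiply can start on every cycle, that each product is ready `mul_latency` cycles later, and that the adder's output is forwarded to its own input with no delay. The function name is made up for illustration.

```python
# Sketch of the pipelined MAC schedule described above.
# Assumes one multiply starts per cycle and the accumulated sum is
# forwarded straight back into the adder (zero-cycle forwarding).

def mac_schedule(n_ops, mul_latency=3):
    """Return {cycle: [operations active in that cycle]} for n_ops chained MACs."""
    schedule = {}
    for i in range(n_ops):
        # Multiply i enters the multiplier in cycle i + 1.
        schedule.setdefault(i + 1, []).append(f"S{i} * C{i} -> T{i}")
    for i in range(n_ops):
        # Its product reaches the adder mul_latency cycles later.
        schedule.setdefault(i + 1 + mul_latency, []).append(f"T{i} + A{i} -> A{i + 1}")
    return schedule

for cycle, ops in sorted(mac_schedule(6).items()):
    print(cycle, " , ".join(ops))
```

Each add in cycle k uses the sum produced by the add in cycle k - 1, which is exactly the dependency the forwarding path has to cover.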

In general, a pipe that can start a new operation every N cycles can produce a new result every N cycles.

How often a pipeline can start a new operation often is referred to as the "major cycle" of the pipe. The major cycle of the pipeline may be *far* less than its total length. Reducing the major cycle is the whole point of pipelining an operation.
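Put as arithmetic: if the pipe is `latency` cycles long and can start a new operation every `major_cycle` cycles, then n operations finish after latency + (n - 1) * major_cycle cycles, provided nothing stalls. A minimal sketch of that formula (the function and parameter names are illustrative):

```python
# Total time for n operations through a pipe, assuming no stalls.

def total_cycles(n_ops, latency, major_cycle):
    """Cycles until the last of n_ops results is available."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * major_cycle

n = 100
print("pipelined   (latency 4, major cycle 1):", total_cycles(n, 4, 1))
print("unpipelined (latency 4, major cycle 4):", total_cycles(n, 4, 4))
```

For long runs the latency term becomes negligible and the cost per result approaches the major cycle.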

George

- Dimiter_Popoff

Mon, Nov 9, 2015 7:01 PM

I know what a pipeline is and how it works, and what its whole point is.

Why are there two operations per pipeline stage in your example?

If you want to better understand what data dependencies I refer to, try to draw your scheme for computation of say a factorial.
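The factorial case differs from the MAC case in where the dependency sits: each multiply needs the previous product as an input, so the chain runs through the multiplier itself rather than only through the adder. A minimal sketch, assuming a multiplier whose result can only be consumed after its full latency (the latency value and function name are hypothetical):

```python
# Factorial as a chain of dependent multiplies.
# Assumption: a product is usable only after mul_latency cycles, so the
# next multiply cannot start earlier than that.

def factorial_with_cycle_count(n, mul_latency=3):
    """Return (n!, cycles spent) for a strictly serial multiply chain."""
    result, cycles = 1, 0
    for k in range(2, n + 1):
        result *= k            # needs the previous product before it can start
        cycles += mul_latency  # so each step pays the full multiplier latency
    return result, cycles

value, cycles = factorial_with_cycle_count(5)
print(f"5! = {value}, {cycles} cycles for 4 dependent multiplies")
print("4 independent multiplies in the same pipe:", 3 + (4 - 1), "cycles")
```

Under these assumptions pipelining the multiplier does not help a single dependent chain; it only helps when independent work is available to fill the idle stages.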

Dimiter

- George Neuner

Mon, Nov 9, 2015 8:19 PM

But you don't seem to know what a bypass/forward network is or why your comments about data dependencies and pipelined operations are only *partly* correct.

Read more carefully: those aren't pipeline stages, they are clocks.

The multiplier and adder operate simultaneously, but the adder is not enabled until both inputs are available. Once the pipeline is primed, the adder has inputs available on every clock and so there are 2 results produced per clock - the temporary output from the multiplier and the final accumulated output from the adder.

You are conflating the CPU's instruction decode/execute pipeline with the _separate_ functional unit pipeline of the MAC.

I am very aware of data dependencies. _You_ need to do some reading about modern CPU architectures.

As this thread in particular was about MAC, you need to learn more about how a MAC unit actually is implemented. With a pipelined MAC you do not have to wait for one operation to complete before starting a dependent operation that uses the result.

Bypass/forward networks exist to mitigate pipeline stalls due to data dependencies by delivering data directly from the producer to the consumer *without* passing through the register file. These networks operate inside pipelines and sometimes even between different pipelines.

George

- Dimiter_Popoff

Mon, Nov 9, 2015 9:18 PM

I know forwarding may be done for some opcodes - you don't seem to know that it is not applicable to every opcode, nor, for practical reasons, is it applied to every opcode to which it is applicable.

With the MAC case I spoke of it has not been done simply because there is a way to do it in software by taking advantage of the sufficient number of registers, thus saving silicon area and making the entire operation _more_ efficient (by saving the number of needed load/store operations).
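The software approach described here, using several registers as independent accumulators so that consecutive MAC instructions do not depend on each other, can be sketched as follows. This is an illustration only, not the code for any particular core: the function names are made up, and the cycle estimate assumes one MAC can issue per cycle and that a result is usable `mac_latency` cycles after issue.

```python
# Dot product with interleaved accumulators (software latency hiding).
# Each accumulator stands in for a separate register; consecutive MACs
# go to different accumulators, so none waits on the one just before it.

def dot_interleaved(samples, coeffs, n_acc=2):
    """Sum of samples[i] * coeffs[i], accumulated in n_acc independent partial sums."""
    acc = [0] * n_acc
    for i, (s, c) in enumerate(zip(samples, coeffs)):
        acc[i % n_acc] += s * c
    return sum(acc)  # combine the partial sums once at the end

def cycles_per_mac(mac_latency, n_acc):
    """Rough steady-state cost per MAC under the assumptions stated above."""
    return max(1.0, mac_latency / n_acc)

print(dot_interleaved([1, 2, 3, 4], [5, 6, 7, 8]))
print(cycles_per_mac(mac_latency=2, n_acc=1), cycles_per_mac(mac_latency=2, n_acc=2))
```

With integers the result is identical to a single-accumulator loop; with floating point the changed summation order can alter the rounding slightly.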

Can you please detail your scheme with the names of the user-visible registers where S, C and A are? You have yet to understand it is wrong.

If there is a separate MAC unit, this is a DSP, to which my comments do not apply; I made that exception at the very start. Please read more carefully.

Now try to draw your scheme for factorial computation using a single pipeline.

Evidently not.

I think it is the other way around. I know what forwarding in this context is and I know - as you seem not to know - that this is the exception, not the norm. If everything could be forwarded, the pipeline would be unnecessary (your written scheme demonstrates that you actually think of the pipeline as some kind of FIFO, which it is not; it only bears some resemblance).

I first used a pipelined MAC unit on a DSP some 15 years ago; it did 1 MAC per cycle. What makes you think I do not know this is being done all the time?

And I wrote many times in my previous posts that my MAC comments did NOT apply to specialized DSPs.

Many pipelined processors do not have a specialized DSP unit, and some still have a MAC instruction; the Power Architecture is a major example.

I know of at least one reasonably modern core for which things work exactly as I explained they do; you need to take advantage of the multiple registers the programming model gives you to achieve the specified 2 clocks per 64 bit MAC. And I do know one DSP core which does 1 MAC/cycle in complete detail (complete enough to have written the assembler for it, too).

How many cores do _you_ know in such detail (know as in "know")?

And please, before trying to teach me lessons, make sure you know what you are talking about.

Dimiter

- Randy Yates

Tue, Nov 10, 2015 2:22 PM

I was wondering about that too. Also, is RAM 0-wait and FLASH not?

A one-line instruction cache helps, but I was also wondering whether coefficients (constants) would need to be in RAM for best performance.

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com

- Tim Wescott

Tue, Nov 10, 2015 7:33 PM

Flash generally needs wait states (if you're running the processor above minimum speed) and RAM can be 0-wait (if it's directly connected to the processor's instruction bus and does not have to go through the bridge).

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- Boudewijn Dijkstra

Wed, Nov 11, 2015 12:17 PM

On Sat, 07 Nov 2015 16:47:27 +0100, Randy Yates wrote:

Which reference manual?

ARM is a bit different. What an instruction does is basically the same across the entire architecture, in this case ARMv7-M. This is documented in an Architecture Reference Manual (ARM). How long an instruction takes depends on the implementation, in this case Cortex-M3. This is documented in a Technical Reference Manual (TRM).

--
(Remove the obvious prefix to reply privately.) 
Made with Opera's e-mail program: http://www.opera.com/mail/

- Randy Yates

Wed, Nov 11, 2015 11:39 PM

I was referring to the one distributed by Silicon Labs.

Thanks for the information, Boudewijn. I was not aware of the Architecture Reference Manual.

In my opinion it would have been better for Silicon Labs to have omitted all instruction information in their TRM and referred people to the ARM Technical Reference Manual, rather than listing some pieces there and other pieces in the ARM TRM.

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com