Microblaze instruction timings

Hi,

Can anyone (Goran?) fill in some details of the Microblaze's pipeline for me? Do multi-cycle instructions always take multiple cycles? For example, if a load or shift is followed by an instruction that doesn't use the result of the load or shift, will the load or shift still cost two cycles? What is the branch penalty?

Also, what does the 950 logic-cell figure include? Does it include the caches as well as all of the optional instructions / debug logic?

Cheers, JonB

Reply to
Jon Beniston

Hi,

Multicycle instructions always take multiple cycles. This is due to the pipeline of MicroBlaze. MicroBlaze has only 3 pipeline stages: Instruction Fetch (IF), Operand Fetch (OF) and Execution Stage (EX).

When a multicycle instruction is executing (is in EX), the next instruction is in the OF stage. The pipeline can't move since the EX stage is occupied. More complex handling of the EX stage, to allow more than one instruction in it at the same time, may be possible but would increase the control complexity quite a lot. All pipeline hazards are resolved in hardware, and an increase in complexity might result in lower overall performance since the clock frequency might be lower.
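As a rough illustration of the point above, here is a minimal cycle-count sketch for an in-order pipeline with a single EX stage. The 2-cycle load/shift latency is taken from the question in this thread; the 1-cycle `add`, the `EX_CYCLES` table and the `total_cycles` helper are assumptions made up for the example, and branch penalties and memory wait states are ignored.

```python
# Hypothetical EX-stage latencies in cycles. Only the 2-cycle load/shift
# figure comes from this thread; the 1-cycle add is an assumption.
EX_CYCLES = {"add": 1, "load": 2, "shift": 2}


def total_cycles(program, fill=2):
    """Cycles to run `program` on an in-order pipeline with one EX stage.

    `fill` is the number of cycles before the first instruction reaches EX
    (IF and OF in a 3-stage pipeline). After that, every instruction
    occupies EX for its full latency, whether or not the following
    instruction depends on its result.
    """
    return fill + sum(EX_CYCLES[op] for op in program)


# The adds do not use the loaded value, but the load still costs 2 cycles.
independent = ["load", "add", "add"]
print(total_cycles(independent))  # 2 + (2 + 1 + 1) = 6
```

In this model the dependency between instructions never enters the calculation, which is exactly the behaviour described: the EX stage is simply busy for the full latency.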

The best way to handle multicycle instructions is to increase the number of pipeline stages, but that will increase the area. You always pay for higher performance with more resources. The current MicroBlaze is a good tradeoff between area and performance. It's smaller and at the same time faster than any other soft processor.

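The 950 LUT figure includes the basic features, without caches or debug logic. The caches are quite cheap on LUTs, around 50 LUTs for the instruction cache. The cost is that BRAM is needed to implement the caches.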

Reply to
Goran Bilski

Thanks for the explanation.

Sure.

Does "basic features" include the h/w divider? I've been trying to reproduce the quoted Dhrystone figures on the simulator, and only get 0.63 MIPS/MHz without it. If I add it, I can get 0.77.
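For readers trying to reproduce such figures from a simulator's cycle counts, here is a small sketch of the usual Dhrystone conversion. It assumes the conventional baseline of 1757 Dhrystones per second for 1 DMIPS (the VAX 11/780 figure); the function names are made up for this example, and whether the quoted vendor numbers were derived exactly this way is not stated in the thread.

```python
# Conventional baseline: 1757 Dhrystones/s on the VAX 11/780 counts as 1 DMIPS.
VAX_DHRYSTONES_PER_SEC = 1757


def dmips_per_mhz(cycles_per_iteration):
    """DMIPS/MHz from the average cycle count of one Dhrystone iteration."""
    iterations_per_sec_at_1mhz = 1_000_000 / cycles_per_iteration
    return iterations_per_sec_at_1mhz / VAX_DHRYSTONES_PER_SEC


def cycles_per_iteration(score):
    """Inverse: average cycles per Dhrystone iteration for a given DMIPS/MHz."""
    return 1_000_000 / (VAX_DHRYSTONES_PER_SEC * score)


# The two scores mentioned in the post, expressed as cycles per iteration.
for score in (0.63, 0.77):
    print(score, round(cycles_per_iteration(score)))
```

Under this convention a score of 0.63 corresponds to roughly 900 cycles per Dhrystone iteration, so a saving of a hundred or so cycles per iteration is enough to move the figure noticeably.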

It seems strange that on the Web page the Spartan 3 is rated at 0.8 and the Spartan II is rated at 0.65, yet they are both listed as requiring the same number of logic cells. I would presume that either the performance figure for the Spartan II is too low, or the number of logic cells required by the Spartan 3 and Virtex II to achieve the quoted figure is actually higher.

Incidentally, I've been trying to get the Dhrystone numbers for NIOS as well. Can anybody clarify if their instruction set simulator is cycle accurate? If it is, the figures appear to be 0.64 for a 32-bit implementation and 0.15 for a 16-bit implementation, but I have a feeling that this should be lower.

Cheers, JonB

Reply to
Jon Beniston

Jon,

RTL simulation of Nios instruction execution (using ModelSim or similar) is cycle accurate. This is true whether you're executing out of on-chip memory, SRAM (via a simulation model), or SDRAM (we include a simulation model in the latest Nios kit). For Dhrystone, you can just run in hardware (much faster than running a long simulation) to compare slight changes you make to Nios.

That said, I agree that what you're seeing is a bit high: we've seen 0.4 (SDRAM + cache) to 0.5 DMIPS/MHz (on-chip memory) for 32-bit "classic" Nios. It makes me wonder if there is some difference in the code?

I would also recommend that, whatever benchmark you run, you configure the memory (program/data/cache) as it will be in your final application, to get the most realistic results possible.

Finally, while Dhrystone is pretty popular, the biggest advantage of going with a soft-core CPU (regardless of whose it is) is that you're in an environment where things can be tweaked to make your application much faster. Custom instructions & peripherals can do wonders depending on what your code looks like. One of my colleagues has a cover article in Embedded Systems Programming this month that you may find useful (sorry for the shameless plug..):

formatting link

...of course, you can also wait a bit for Nios II :)

Jesse Kempa Altera Corp. jkempa at altera dot com

Reply to
Jesse Kempa

When I said simulator, I meant the software simulator that comes as part of the GNUPro tools. I don't have access to the RTL.

Do you have any idea what the performance is for a 16-bit core?

Sure.

Cheers, JonB

Reply to
Jon Beniston

Hi,

See below.

Multicycle instructions always take multiple cycles. This is due to the pipeline of MicroBlaze. MicroBlaze has only 3 pipeline stages: Instruction Fetch (IF), Operand Fetch (OF) and Execution Stage (EX).

Thanks for the explanation.

The current MicroBlaze is a good tradeoff between area and performance.

Sure.

The 950 LUT figure includes the basic features, without caches or debug logic. The caches are quite cheap on LUTs, around 50 LUTs for the instruction cache. The cost is that BRAM is needed to implement the caches.

Does "basic features" include the h/w divider? I've been trying to reproduce the quoted Dhrystone figures on the simulator, and only get 0.63 MIPS/MHz without it. If I add it, I can get 0.77.

To get 0.8 MIPS/MHz, you need to enable the HW divider. The size of the HW divider is around 60-80 LUTs. I can't remember exactly, but the implementation is a basic shift-compare design which only needs a compare block and a shift block. The divide takes 35 clock cycles: 2 clock cycles to set up the operands, 32 clock cycles for the division and 1 clock cycle for writing the result.
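A software model may make the 2 + 32 + 1 cycle count easier to see. The sketch below is a generic unsigned shift-compare (restoring) divider that produces one quotient bit per iteration. It is not taken from the MicroBlaze RTL; the function name and the way cycles are tallied are assumptions for illustration only.

```python
def shift_compare_divide(dividend, divisor, width=32):
    """Unsigned restoring division, one quotient bit per iteration.

    Returns (quotient, remainder, cycles), where `cycles` follows the
    breakdown given above: 2 for operand setup, one per bit, 1 for writeback.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")

    remainder = 0
    quotient = 0
    cycles = 2  # operand setup
    for bit in range(width - 1, -1, -1):
        # Shift block: bring the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Compare block: subtract only if the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        cycles += 1
    cycles += 1  # write the result
    return quotient, remainder, cycles


print(shift_compare_divide(100, 7))  # (14, 2, 35)
```

The loop always runs `width` times regardless of the operand values, which is why the latency is a fixed 35 cycles rather than data dependent.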

It seems strange that on the Web page the Spartan 3 is rated at 0.8 and the Spartan II is rated at 0.65, yet they are both listed as requiring the same number of logic cells. I would presume that either the performance figure for the Spartan II is too low, or the number of logic cells required by the Spartan 3 and Virtex II to achieve the quoted figure is actually higher.

The difference is that S3 and VII have embedded multipliers, so MicroBlaze will have a HW multiplier, while the S2 doesn't have them, so multiplication is done in SW (which takes many more clock cycles).
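To give a feel for why a software multiply costs many more cycles, here is a generic shift-and-add multiply sketch. It is not the actual MicroBlaze runtime routine, which is not shown in this thread; the function name and the iteration count it returns are assumptions for illustration. On a real core each loop iteration would itself take several instructions.

```python
def soft_multiply(a, b, width=32):
    """Unsigned shift-and-add multiply.

    Returns (low `width` bits of the product, number of loop iterations).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    iterations = 0
    while b:
        if b & 1:
            # Add the shifted multiplicand when the current multiplier bit is set.
            product = (product + a) & mask
        a = (a << 1) & mask
        b >>= 1
        iterations += 1
    return product, iterations


print(soft_multiply(6, 7))  # (42, 3)
```

The loop runs once per significant bit of the multiplier, so a full 32-bit operand needs 32 iterations, compared with a multiply instruction backed by an embedded multiplier.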

Incidentally, I've been trying to get the Dhrystone numbers for NIOS as well. Can anybody clarify if their instruction set simulator is cycle accurate? If it is, the figures appear to be 0.64 for a 32-bit implementation and 0.15 for a 16-bit implementation, but I have a feeling that this should be lower.

Cheers, JonB

Reply to
Goran Bilski

Hi,

Sorry, I sent the answers as HTML only, so I am resending this as text only.

See below.


To get 0.8 MIPS/MHz, you need to enable the HW divider. The size of the HW divider is around 60-80 LUTs. I can't remember exactly, but the implementation is a basic shift-compare design which only needs a compare block and a shift block. The divide takes 35 clock cycles: 2 clock cycles to set up the operands, 32 clock cycles for the division and 1 clock cycle for writing the result.

The difference is that S3 and VII have embedded multipliers, so MicroBlaze will have a HW multiplier, while the S2 doesn't have them, so multiplication is done in SW (which takes many more clock cycles).

Reply to
Goran Bilski
