PowerPC soft-core?

In XCELL issue 52, page 19, Xilinx claims that:

V4-PowerPC reduces power 10:1 compared to FPGA Fabric built version (of PowerPC)

But that means Xilinx internally has a PowerPC soft-core IP? If they don't, then they could not have measured the power difference :) I wonder why there is no information about the Xilinx soft-core PowerPC at all?

Antti

There's a big difference between a synthesizable ASIC CPU model and an FPGA-optimized CPU model. Back in 2000 I did a design study on implementing the PPC instruction set architecture. Depending upon the specific hard-wired or software-emulated feature sets, and small-or-fast settings, an integer PPC subset requires between 1,000 and 2,000 LUTs, and today it would run at most of the speed of current FPGA-optimized soft CPU cores.

Jan Gray

Yes, we have a soft version of the PowerPC 405. We used it extensively in the development of the Virtex-II Pro family to create and verify IP blocks and development-system tools, to port software, and to provide early-access system boards to external third-party developers. We were able to do this through our contract with IBM and a lot of work within Xilinx.

No, there are no plans to release this, as we do not have the rights to do so, and the size, speed and power would make it unattractive to nearly everyone.

The V-4 PowerPC 405, in comparison, displaces only 672 slices, consumes 0.29 mW/DMIPS (0.44 mW/MHz), runs up to 450 MHz, and places and routes in less than a second. Just try to get a soft processor to match that. :)
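(Working through those numbers: at the full 450 MHz, 0.44 mW/MHz comes out to roughly 200 mW, and 0.44/0.29 implies about 1.5 DMIPS/MHz.)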

Ed McGettigan

I think the PPC cores are a nice feature and are well executed. That said,

672 displaced slices are sufficient to hold two (or three austere) 32-bit pipelined RISC soft cores (requiring, say, 1 BRAM each), each running at ~1/3 of the PPC frequency. So, for some applications (e.g. small-memory-footprint code and data 'controllers' that fit in a BRAM), the hard core is not a big (order of magnitude) win on MIPS/area. I can't speak to power -- the hard processor core is surely much lower power. Properly RPM'd, a compact soft processor core will PAR in negligible time.

Certainly the PPC core(s) are vastly more attractive targets for COTS software tools, OSes, and infrastructure (docs, developer expertise, ...).

See also

formatting link
"... this counterintuitive rule of thumb: one streamlined 32-bit soft CPU core optimized for programmable logic might need only half the silicon area of an elaborate 32-bit hard CPU core!"

Jan Gray

"Jan Gray" schrieb im Newsbeitrag news:%wY%d.1173$ snipped-for-privacy@newsread2.news.atl.earthlink.net...

LOL, mercy, mercy :) MicroBlaze is definitely more than 672/2 slices!

But I think I agree that the rule of thumb is OK!

BTW Jan, I guess you are one of the few who could correctly answer the following FPGA quiz question:

How many slices are needed to implement a frequency divider by 2^37?

ANSWER:
  Number of Slices:            3 out of 1408    0%
  Number of Slice Flip Flops:  2 out of 2816    0%
  Number of 4 input LUTs:      6 out of 2816    0%
  Number of bonded IOBs:       1 out of 140     0%
  Number of GCLKs:             1 out of 16      6%

The above is the synthesis report for divide-by-2^n, n = 21..37; P&R shows 3 slices for V2Pro or 4 slices for S3.

Antti

Hi Antti,

You could use ERIC5... but it does not really compare to a PowerPC ;-)

This makes me curious: is there other stuff like BRAM involved? Otherwise you HAVE to tell us how you do it (I would simply claim that it is not possible...).

Thomas

formatting link

Hi Antti,

As a long-time Altera user, I just remembered that the Xilinx slices support distributed RAM and the like... I suppose you take advantage of that. Still very impressive!

Thomas

Using SRL16s? In one LUT you can have the output go to 1 every 16 clocks, i.e. 2^4. Then, using that as the clock enable of the next one, you can get 2^(4*n) with n the number of LUTs, so that would make 2^(4*6) = 2^24 ... damn, not enough.
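Here is a toy C model of that cascade (all names mine, just to sanity-check the arithmetic):

#include <stdio.h>
#include <stdint.h>

/* Model of the SRL16 cascade: each stage acts as a divide-by-16
   pulse generator; a stage advances only when every stage before
   it pulses (clock-enable chaining). */
#define STAGES 6

int main(void)
{
    uint8_t  pos[STAGES] = {0};   /* position of the '1' in each SRL16 */
    uint64_t clocks = 0;
    int      pulse;
    do {
        clocks++;
        pulse = 1;                /* CE of stage 0 is tied high */
        for (int i = 0; i < STAGES && pulse; i++) {
            pos[i] = (pos[i] + 1) & 15;
            pulse  = (pos[i] == 0);   /* one pulse per 16 enables */
        }
    } while (!pulse);
    /* expect 16^6 = 2^24 = 16777216 clocks per output pulse */
    printf("period = %llu\n", (unsigned long long)clocks);
    return 0;
}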

Sylvain

Thomas Entner wrote: [quoted text snipped]

Nice one! The secret lies in understanding:

  1. Peter Alfke's appnote [direct.xilinx.com/bvdocs/appnotes/xapp210.pdf]
  2. "On Arbitrary Cycle n-Bit LFSRs"
    formatting link
  3. and the LFSR generator in [fpgacpu.org/xsoc/xsoc-beta-093.zip].
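For flavor, a quick C model of the LFSR idea, using the well-known 16-bit Galois taps (the appnote has tap tables for wider registers):

#include <stdio.h>
#include <stdint.h>

/* 16-bit maximal-length Galois LFSR (taps 16,14,13,11 -> mask 0xB400):
   it steps through 2^16 - 1 states, skipping all-zeros.  The XAPP210
   trick adds one gate to splice the missing state back in, giving an
   exact divide-by-2^n from roughly n flip-flops. */
int main(void)
{
    uint16_t lfsr = 1;
    uint32_t period = 0;
    do {
        lfsr = (lfsr >> 1) ^ ((uint16_t)(-(lfsr & 1)) & 0xB400);
        period++;
    } while (lfsr != 1);
    printf("period = %u (2^16 - 1 = 65535)\n", period);
    return 0;
}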

Jan Gray

"Jan Gray" schrieb im Newsbeitrag news:aT40e.1546$ snipped-for-privacy@newsread2.news.atl.earthlink.net...

Hm... the LFSR is a nice try! Well, my bet is that an LFSR-based approach would use at least 1 more slice (possibly 2 more slices)...

formatting link

there is the actual solution :)

Antti

"Thomas Entner" schrieb im Newsbeitrag news:42407bd6$0$28872$ snipped-for-privacy@newsreader01.highway.telekom.at...

Don't claim things are not possible! There is no BRAM involved, and no DSP48 either. A divider can even be implemented with 0 slices when using BRAMs :)

formatting link
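In case it helps the imagination, here is one way a 0-slice BRAM divider can work, modeled in C (my sketch; the linked solution may differ): the BRAM's registered data output feeds its own address port, so the RAM contents become a next-state table and no fabric logic is needed.

#include <stdio.h>
#include <stdint.h>

#define DEPTH 1024                 /* e.g. one port of a block RAM */

int main(void)
{
    /* program the BRAM as a next-state table: state -> state + 1 */
    uint16_t rom[DEPTH];
    for (int i = 0; i < DEPTH; i++)
        rom[i] = (uint16_t)((i + 1) % DEPTH);

    /* registered output feeds back to the address: that is the
       whole circuit -- the counter needs zero LUTs or slice FFs */
    uint16_t state = 0;
    uint32_t clocks = 0;
    do {
        state = rom[state];
        clocks++;
    } while (state != 0);
    printf("period = %u (divide by %d)\n", clocks, DEPTH);
    return 0;
}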

Antti

Great stuff!

I forgot about the distributed RAM and SRL16s when I made my claim ;-) but at least I realized it myself... I sent a second post, but it did not appear in Outlook Express, which I use as my news reader. However, if you look at Sylvain's post, at the very end, you can see that my post has arrived at least somewhere. Very strange... Do others see the same behavior of lost posts?

Thomas

Very nice, Antti! No need to resort to LFSRs after all. Next, we're all looking forward to a (slow) compact nybble-serial MicroBlaze reimplementation. :-)

Jan Gray

Hi Jan

I've wondered about how truly RISC the PPC really is. It is ironic that, although it comes out of the work of John Cocke and the 801, the current PPC seems to be RISC in actual performance but much harder to describe than other RISCs, especially DLX, MIPS or ARM.

RISC ISAs are always characterized by the target technology on which they are first implemented, hence poor FPGA efficiency unless that's where you/we start.

If IBM were starting today on a fresh ISA with the memory wall in mind (100s of dead cycles per cache miss), I would think/hope they would come up with something entirely different.

I would suggest that RISC-ness could well be defined by how easy it is to build an ISA simulator and how close to the hosting platform it runs. The closer it runs to the host, the fewer peripheral things the ISA must do per opcode. Clearly PPC, ARM and SPARC all do a lot more than simple datapath operations, but all were defined with specific HW features in mind, so their soft cores are all big to start with.
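To make that concrete, here is the sort of skeleton I mean, in C (a made-up toy encoding, not any real ISA):

#include <stdint.h>
#include <stdio.h>

/* Toy interpreter core.  The closer each case body is to one host
   instruction, the closer the emulator runs to host speed; an ISA
   that also drags in condition codes, mode checks, etc. bloats
   every case body. */
enum { OP_ADD, OP_LDI, OP_HALT };

static uint32_t run(const uint32_t *code)
{
    uint32_t r[16] = {0}, pc = 0;
    for (;;) {
        uint32_t insn = code[pc++];
        uint32_t op = insn >> 28, d = (insn >> 24) & 15,
                 a = (insn >> 20) & 15, b = (insn >> 16) & 15;
        switch (op) {
        case OP_ADD:  r[d] = r[a] + r[b]; break;   /* ~1 host op */
        case OP_LDI:  r[d] = insn & 0xFFFF; break;
        case OP_HALT: return r[d];
        /* a "heavy" ISA updates flags, checks modes, ... here */
        }
    }
}

int main(void)
{
    const uint32_t prog[] = {
        (OP_LDI  << 28) | (1u << 24) | 20,   /* r1 = 20 */
        (OP_LDI  << 28) | (2u << 24) | 22,   /* r2 = 22 */
        (OP_ADD  << 28) | (3u << 24) | (1u << 20) | (2u << 16),
        (OP_HALT << 28) | (3u << 24),        /* return r3 = 42 */
    };
    printf("%u\n", run(prog));
    return 0;
}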

The PearPC emulator, which runs on x86 to allow running MacOS X, supposedly runs about 500 x86 ops per PPC op (from the Pear site). The 68K emulators seem to run far closer to x86 performance, perhaps because they are far "simpler" to understand. Generally the goal of emulation is to reach about a 10x slowdown. With its JIT, PearPC only achieves something closer to a 50x slowdown, IIRC, and the 68K JITs are still far better.

It amuses me to think that an emulated 68K running MacOS 7 must run orders of magnitude faster (acceptably so on Basilisk) than an emulated PPC running the much heavier OS X. Just where is the world going!

The new Transputer also runs its ISA simulator closer to host speed (60x slower in plain C). Perhaps I should have started by building the fastest possible ISA encoding with an x86 native-asm simulator in mind, but it wouldn't much affect the final HW architecture, just the encodings. Perhaps really fast emulation of a new ISA should be at the top of an architect's todo list, to help propagate new CPUs, and certainly to get something running ASAP.

regards

johnjakson at usa dot com

John,

So true. For example, for PPC, the early implementations (1 um, 3LM) were multiple dice -- the ICU (instruction cache and branch processing unit) was separate from the FXU (fixed-point execution unit). That is why the PPC calling convention is peculiar: the call instruction (bl, branch-and-link) saves the return address in the link register LR (resident in the ICU) instead of a GPR (resident in the FXU). The LR cannot be directly load/store'd, so you have to first move it to a GPR to store it to the frame -- and do the reverse nonsense in the function epilog.
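From memory (so treat the exact offsets as approximate), the LR round-trip looks like this around any non-leaf function:

/* Compile any non-leaf function with -S on PPC32 and you will see
   the LR round-trip; offsets below are approximate, from memory:

       mflr  r0          # LR -> GPR (LR can't be stored directly)
       stwu  r1,-16(r1)  # grow the stack frame
       stw   r0,20(r1)   # GPR -> caller's LR save word
       ...
       lwz   r0,20(r1)   # frame -> GPR
       mtlr  r0          # GPR -> LR
       blr               # return via LR
*/
void callee(void);
void non_leaf(void) { callee(); }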

[Reference: Brian Case, "IBM RS/6000's Complex Implementation Extracts Peak Performance," in Understanding RISC Microprocessors, 1993 (a Microprocessor Report compilation)]

In our memory wall world, the on-chip ISA doesn't matter as much, performance-wise. You can sometimes model the performance of big applications by considering only the memory transactions that appear at the pins, and the particulars of the attached memory subsystem. What happens on chip is (mostly) irrelevant. Certainly during a garbage collection of a huge heap, or the like, all you're doing for tens of seconds is waiting on millions of non-resident cache line fills. And I have seen application code, over a large data structure, that spends several seconds (many billions of cycles) on one data load instruction that consistently misses the cache.
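You can watch the wall from plain C with a dependent pointer chase over a working set far bigger than any cache (sizes here are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)    /* 16M nodes * 8 B each: far beyond any cache */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle visiting every node,
       so each load depends on the previous one and likely misses */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.1f ns per load (p=%zu)\n", s * 1e9 / N, p);
    free(next);
    return 0;
}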

There are well-known latency-tolerance techniques...

For FPGA RISCs, I like to count the multiplexers in the datapath, because FPGA mux implementations are painfully expensive.

As for PearPC only achieving a ~50x slowdown, I think that is more attributable to a difference in the maturity or sophistication of the emulator. It should be possible to translate PPC to x86 with far less than 500:1 growth.

If this meme subtly or overtly causes new ISAs to carry forward legacy ISA mistakes, let us hope it does not catch on. :-)

Jan Gray
