There's a big difference between a synthesizable ASIC CPU model and an FPGA-optimized CPU model. Back in 2000 I did a design study on implementing the PPC instruction set architecture. Depending upon specific hard-wired or software-emulated feature sets, and small-or-fast settings, an integer PPC subset requires between 1000 and 2000 LUTs and today would run at most of the speed of current FPGA-optimized soft CPU cores.
Yes, we have a soft version of the PowerPC 405. We used this extensively in the development of the Virtex-II Pro family in order to create and verify IP blocks and development system tools, to port software, and to provide early-access system boards to external third-party developers. We were able to do this through our contract with IBM and a lot of work within Xilinx.
No, there are no plans to release this as we do not have the rights to do so and the size, speed and power would make it unattractive to nearly everyone.
The V-4 PowerPC 405, in comparison, displaces only 672 slices, consumes 0.29 mW/DMIP (0.44 mW/MHz), runs up to 450 MHz, and places and routes in less than a second. Just try to get a soft processor to match that. :)
I think the PPC cores are a nice feature and are well executed. That said,
672 displaced slices are sufficient to hold two (or three austere) 32-bit pipelined RISC soft cores (requiring say 1 BRAM each), each running at ~1/3 of the PPC freq. So, for some applications (e.g. small memory footprint code and data 'controllers' that fit in a BRAM), the hard core is not a big (order of magnitude) win on MIPS/area. Can't 'speak to power' -- the hard processor core is surely much lower power. Properly RPM'd, a compact soft processor core will PAR in negligible time.
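The arithmetic behind the "not an order of magnitude" claim can be sketched as follows (my assumptions: ~1 instruction per cycle for both hard and soft cores, and the 450 MHz and 1/3-clock figures from above):

```python
# Back-of-envelope MIPS/area comparison, assuming ~1 instr/cycle each.
ppc_mhz = 450                 # hard PPC 405 clock from the post above
soft_mhz = ppc_mhz / 3        # each soft core at ~1/3 the hard core's clock
aggregate_soft = 2 * soft_mhz # two soft cores fit in the same 672 slices

ratio = ppc_mhz / aggregate_soft
print(ratio)  # 1.5 -- a 1.5x win, not a 10x win, for the same area
```

So on throughput per area the hard core wins, but only by ~1.5x under these assumptions, not 10x.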
Certainly the PPC core(s) are vastly more attractive targets for COTS software tools and OSs and infrastructure (docs, developer expertise, ...).
"... this counterintuitive rule of thumb: one streamlined 32-bit soft CPU core optimized for programmable logic might need only half the silicon area of an elaborate 32-bit hard CPU core!"
Using SRL16? In one LUT you can have the output set to 1 every 16 clocks, i.e. 2^4. Using that as the clock enable to another, you can get 2^(4*n) with n the number of LUTs, so that would make 2^(4*6) = 2^24 ... damn, not enough.
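The cascade above can be modeled in a few lines (a behavioral sketch, not RTL; `stage_len` stands in for the SRL16's 16-deep shift register, and each stage's pulse gates the clock enable of the next):

```python
# Behavioral model of cascaded SRL16-style dividers: each stage pulses
# once per `stage_len` enables, so n stages divide by stage_len**n.
def cascade_period(n_stages, stage_len=16):
    return stage_len ** n_stages

print(cascade_period(6))  # 16777216, i.e. 2**24
```

Six LUTs give a 2^24 period, confirming the "damn, not enough" above if the goal was a 2^32 counter.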
I forgot about the distributed RAM and SRL16 when I made my claim ;-) but at least I realized that myself... I sent a second post, but it did not appear in my Outlook Express, which I use as a news-reader. However, if you look at the post of Sylvain, at the very end, you can see that my post has arrived at least somewhere. Very strange... Do others see the same behavior of lost posts?
I've wondered about how truly RISC PPC really is. Ironic that although it comes out of the work of John Cocke and the 801, the current PPC seems to be RISC in actual performance but much harder to describe than other RISCs, esp. DLX, MIPS or ARM.
RISC ISAs are always characterized by the target technology on which they are first implemented, hence poor FPGA efficiency unless that's where you/we start.
If IBM were starting today on a fresh ISA with the memory wall in mind (100s of dead cycles per cache miss), I would think/hope they would come up with something entirely different.
I would suggest that RISC-ness could well be defined by how easy it is to build an ISA simulator and how close that runs to the hosting platform. The closer it runs to host speed, the less peripheral work the ISA must do per opcode. Clearly PPC, ARM, SPARC all do a lot more than simple datapath operations, but all are defined with specific HW features in mind, so their soft cores are all big to start.
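The "easy to simulate" test can be made concrete. Here is a minimal interpreter for a hypothetical 3-operand register machine (the opcodes and encoding are my invention, purely illustrative): each guest opcode maps to roughly one host operation plus dispatch overhead, which is the property being argued for.

```python
# Minimal ISA interpreter sketch for a hypothetical 32-bit register machine.
# Each guest instruction is a tuple (op, dest, src_a, src_b).
def run(program, regs):
    pc = 0
    while pc < len(program):
        op, d, a, b = program[pc]
        if op == "add":
            regs[d] = (regs[a] + regs[b]) & 0xFFFFFFFF  # one host add
        elif op == "sub":
            regs[d] = (regs[a] - regs[b]) & 0xFFFFFFFF  # one host sub
        elif op == "halt":
            break
        pc += 1
    return regs

run([("add", 0, 1, 2), ("halt", 0, 0, 0)], [0, 2, 3])  # regs become [5, 2, 3]
```

An ISA with condition-code side effects, link registers, or complex addressing modes would need several host operations per opcode in the same loop, which is the slowdown being described.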
The PearPC emulator that runs on x86 to allow running MacOSX supposedly runs about 500 x86 ops per PPC op (from the PearPC site). The 68K emulators seem to run far closer to x86 performance, perhaps because the 68K is far "simpler" to understand. Generally the goal of emulation is to reach about a 10x slowdown. PearPC only achieves closer to 50x slower, IIRC, through use of a JIT, while the 68K JITs are still far better.
It amuses me to think that an emulated 68K running MacOS7 must run orders of magnitude faster (acceptably so on Basilisk) than an emulated PPC running the much heavier OSX. Just where is the world going!
The new Transputer also runs its ISA simulator closer to host speed (60x slower in plain C). Perhaps I should have started with the approach of building the fastest possible ISA encoding with an x86 native asm simulator in mind, but it wouldn't much affect the final HW architecture, just the encodings. Perhaps really fast emulation of a new ISA should be at the top of the todo list of architects, to help propagate new cpus, and certainly to get something running ASAP.
So true. For example, for PPC, the early implementations (1 um 3LM) were multiple dice -- the ICU (instruction cache and branch processing unit) was separate from the FXU (fixed point execution unit). That is why the PPC calling convention is peculiar -- the call (bl (branch-and-link)) instruction saves the return address in a link register LR (resident in the ICU) instead of a GPR (resident in the FXU), and which cannot be directly load/store'd, and so you have to first move it to a GPR to store it to the frame -- and the reverse nonsense in the function epilog.
[Reference: Brian Case, IBM RS/6000's Complex Implementation Extracts Peak Performance, in Understanding RISC Microprocessors, 1993 (MPR rollup)]
In our memory wall world, the on-chip ISA doesn't matter as much, performance-wise. You can sometimes model the performance of big applications by considering only the memory transactions that appear at the pins, and the particulars of the attached memory subsystem. What happens on chip is (mostly) irrelevant. Certainly during a garbage collection of a huge heap, or the like, all you're doing for tens of seconds is waiting on millions of non-resident cache line fills. And I have seen application code, over a large data structure, that spends several seconds (many billions of cycles) on one data load instruction that consistently misses the cache.
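A back-of-envelope model makes the point (the clock rate, miss latency, and miss count here are my illustrative assumptions, not measurements):

```python
# Memory-wall arithmetic: in a miss-dominated phase, elapsed time is
# roughly misses * miss_latency, independent of on-chip ISA details.
CLOCK_HZ = 1e9           # assume a 1 GHz core
MISS_CYCLES = 300        # assume ~300 dead cycles per cache miss
misses = 10_000_000      # e.g. a GC sweep touching 10M non-resident lines

seconds = misses * MISS_CYCLES / CLOCK_HZ
print(seconds)  # 3.0 -- seconds spent waiting on fills, whatever the ISA
```

Whether each miss is surrounded by 5 or 50 on-chip instructions barely moves this number, which is why pin-level memory transactions can predict whole-application performance.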
There are well-known latency tolerance techniques...
For FPGA RISCs, I like to count the multiplexers in the datapath, because FPGA mux implementations are painfully expensive.
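A rough cost model shows why the mux count matters (my assumption: 4-input LUTs with no dedicated MUXF resources, so an N:1 mux is a tree of N-1 two-to-one muxes, one LUT per mux per bit; dedicated wide-mux resources roughly halve this):

```python
# Rough LUT cost of a width-bit N:1 multiplexer on a 4-LUT fabric,
# modeled as a tree of (inputs - 1) two-to-one muxes per bit.
def mux_luts(width, inputs):
    return width * (inputs - 1)

print(mux_luts(32, 4))  # 96 -- one 32-bit 4:1 operand mux costs ~96 LUTs
```

At ~96 LUTs for a single 32-bit 4:1 operand mux, a few extra forwarding or result muxes can rival the cost of the ALU itself, which is why FPGA-optimized datapaths minimize them.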
I think that is more attributable to a difference in the maturity or sophistication of the emulator. It should be possible to translate PPC to x86 with far less than 500:1 growth.
If this meme subtly or overtly causes new ISAs to carry forward legacy ISA mistakes, let us hope it does not catch on. :-)