Custom CPU Designs

In the Forth language group there are occasional discussions of custom processors. This is mostly because a simple stack processor can be implemented in an FPGA very easily, taking few resources and running at a good speed. Such a stack processor is a good target for Forth.

I know there are many other types of CPU designs implemented in FPGAs. I'm wondering how often these are used by the typical embedded developer?

--
  Rick C. 

Reply to
Rick C

It's not uncommon when you have an FPGA doing some task to need a processor in there to manage it - for example to handle error conditions or to report status over some software-friendly interface like I2C, or classical printf debugging on a UART.

Xilinx provides Microblaze and Intel/Altera provide NIOS, which are a bit legacy (toolchains aren't that great, etc), although designed to be fairly small. There's increasing interest in small 32-bit RISC-V cores for this niche, since the toolchain is improving and the licensing conditions more liberal (RISC-V itself doesn't design cores, but lots of others - including ourselves - have made open source cores to the RISC-V specification).

I don't see that Forth has much to speak to this niche, given these cores are largely offloading work that's fiddly to do in hardware and people just want to write some (usually) C to mop up the loose ends. I can't see what Forth would gain them for this.

If you're talking about using these processors as a compute overlay for FPGAs, a different thing entirely, I'd suggest that a stack processor is likely not very efficient in terms of throughput. Although I don't have a cite to back that up.

Theo

Reply to
Theo

That matches what I have seen from customers. Very few people design their own cpus - it is rarely worth the effort. (People do it for fun, which is a different matter.) They want off-the-shelf cores and off-the-shelf tools. And they want off-the-shelf programmers to work with them.

There was a time when FPGAs were smaller and more limited, where you might want an absolutely minimal-sized CPU core. Then a stack-based core with a small decoder would be a good choice, and Forth a good fit. Those days are long gone. Modern programmable logic is bigger and has architectural features that are a better fit for "standard" 32-bit RISC cores than older devices. And the design tools make it easy - you pick your core, your peripherals, your memories, your interfaces from the libraries and let the tools make the buses, C header files, and everything else.

I can appreciate that Forth can be a very efficient language, and Forth-oriented cpus can be small and fast. But efficiency of the hardware is not the only goal of a design. I would expect that Forth and stack processors are almost exclusively used by developers who have been using them for the last twenty years already.

Reply to
David Brown

I worked on a project using a NIOS2 CPU core in an Altera FPGA. It was a complete disaster. The CPU's throughput (running at 60MHz) was lower than the 40MHz ARM7 we were replacing. The SDRAM controller's bandwidth was so low it couldn't keep up with 100Mbps full duplex Ethernet. The SW development environment was incredibly horrible and depended on some ancient version of Eclipse (and required a license dongle) and a bunch of bloated tools written by some incompetent contractor. There were Java apps that generated shell scripts that generated TCL programs that generated C headers (or something like that). The JTAG debugging interface often locked up the entire CPU.

After spending a year developing prototype products that would never quite work, we abandoned the NIOS2 in favor of an NXP Cortex M3.

--
Grant
Reply to
Grant Edwards

Someone showed me that RISC-V can be a small solution, but at that size it isn't fast. So the size/performance tradeoff isn't so great. Stack processors tend to be very lean and speed-effective without pipelining. I know my processor was designed to run 1 instruction per clock cycle on all instructions.

Not sure what you mean about Forth and "speaking". Forth is a natural language for a stack-based processor. Often there is a 1-to-1 mapping of instructions to language words. It is also very easy to retarget to new modifications of a processor design.
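To illustrate that 1-to-1 mapping, here is a minimal sketch of a hypothetical stack machine where each Forth word compiles to exactly one instruction. The opcode names are illustrative only, not taken from any particular core:

```python
# Hypothetical sketch: a tiny stack machine in which each Forth word
# corresponds to exactly one instruction (names are illustrative only).

def run(program, stack=None):
    """Execute a list of (opcode, operand) pairs on a data stack."""
    s = stack if stack is not None else []
    for op, arg in program:
        if op == "LIT":        # push a literal onto the stack
            s.append(arg)
        elif op == "DUP":      # Forth DUP: duplicate top of stack
            s.append(s[-1])
        elif op == "SWAP":     # Forth SWAP: exchange top two items
            s[-1], s[-2] = s[-2], s[-1]
        elif op == "ADD":      # Forth + : pop two, push sum
            s.append(s.pop() + s.pop())
        elif op == "MUL":      # Forth * : pop two, push product
            s.append(s.pop() * s.pop())
    return s

# The Forth definition ": SQUARE DUP * ;" becomes one instruction per word:
SQUARE = [("DUP", None), ("MUL", None)]

print(run([("LIT", 7)] + SQUARE))   # [49]
```

Because the compiler's job reduces to emitting one opcode per word, retargeting after an instruction-set tweak is mostly a matter of updating that word-to-opcode table.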

I recall someone had a very large compendium of soft-core processors with size and speed measurements, including a calculation of something like IPS/LUT. It was amazing what some designs could achieve. I wish I knew where to find that now.

--
  Rick C. 

Reply to
Rick C

> very large compendium of soft core processors

formatting link

Several legacy processors are listed:

formatting link

Also look into MISTer, as it supports several legacy systems. None are competitive speed-wise with high-performance uPs.

With LUTs costing less than $0.001 each, some soft-core uPs are inexpensive, free if you have unused LUTs and block RAMs. For debug, changing block RAM contents is much faster than rerunning the FPGA design.

Reply to
jim.brakefield

I don't think the issue with a soft core is very often cost. At least for me it's about board space and integration with the other FPGA functions.

Looks like I was mistaken about the speed/size of the RISC-V core. However... it appears to have been hand-optimized, if I am reading this correctly.

"GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand technology mapped and floorplanned for best performance/area"

That means it can't be ported to other families without much effort to achieve similar results. But still, assuming it drops off to half the numbers, it's still a very good design.

Thanks for the link and also all the work you did on this list.

--
  Rick C. 

Reply to
Rick C

That's a pretty good summary of the experience. A few things have improved slightly:

- if you have the space it's better to use an on-chip BRAM rather than SDRAM, given the SDRAM is often running at ~100MHz x 16 bits, which makes even instruction fetch multi-cycle. DDR is better but the memory controllers are much more complex, and life gets easier when you have cache.

- they've upgraded to a plugin for a modern version of Eclipse rather than a fork from 2005, but it's still Eclipse :( I just drive the shell scripts directly (although the pile of Make they generate isn't that nice). I've never had it need a dongle.

- the JTAG interface is horrible and I've spent way too much time reverse engineering it[1] and working around its foibles, in particular the JTAG UART, which is broken in several ways (google 'JTAG Atlantic' for some workarounds)
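The SDRAM point in the first bullet can be put in rough numbers. This is an illustrative back-of-envelope using the ~100MHz x 16-bit figure quoted above; a real controller loses further bandwidth to refresh, row activation, and turnaround:

```python
# Back-of-envelope sketch (illustrative; ignores refresh/row overhead):
# a 16-bit single-data-rate SDRAM at ~100 MHz feeding a 32-bit soft core.

sdram_clock_hz = 100e6
sdram_bus_bits = 16
peak_bw_bytes = sdram_clock_hz * sdram_bus_bits / 8   # 200 MB/s theoretical peak

cpu_clock_hz = 100e6
fetch_bytes = 4                                       # one 32-bit instruction
demand_bw_bytes = cpu_clock_hz * fetch_bytes          # 400 MB/s for back-to-back fetch

print(peak_bw_bytes / 1e6, demand_bw_bytes / 1e6)     # 200.0 400.0
```

Even at theoretical peak, the bus delivers half of what back-to-back 32-bit fetch would need, before any data traffic at all - hence multi-cycle fetch, and the case for on-chip BRAM or a cache in front of the SDRAM.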

These days a RISC-V CPU in FPGA solves a lot of the horrible proprietaryness, although you still have to glue the toolchain together yourself (git clone riscv-gcc).

But if you can do the job with a hard MCU I can't see why you'd want an FPGA.

Personally I much prefer the ARM cores on FPGAs these days - they're a Proper CPU that Just Works. And the Cortex A-class cores can boot Linux which makes the software development workflow a lot nicer. Although they aren't that beefy (a $10000 Stratix 10 has a quad core A53, which is a Raspberry Pi 3), often hard to get parts with them in, and the bandwidth between ARM and soft logic often isn't very good.

Theo

[1] Did you know the Altera product codenames were based on dragons from How to Train Your Dragon? Interesting what you find out when running strace(1) on the binary...
Reply to
Theo

I'm currently showing 36+ distinct RISC-V cores. There are probably many more: it's a popular item at many universities. Some of them are optimized for low LUT count. See:

formatting link
for a list of FPGA and non-FPGA cores.

GRVI was done by Jan Gray. He is an expert at keeping LUT count low. Is it not open source?

Reply to
jim.brakefield

This is our Tinsel multithreaded RISC-V core, which is written in a high-level HDL (BSV) and not hand mapped:

formatting link
formatting link

To compare:

"Another recent overlay is Gray's GRVI Phalanx [22, 23], a manycore RV32I fabric supporting message-passing via a Hoplite NoC. Gray reports that a single 3-stage GRVI core has an Fmax of 375MHz, uses 320 LUTs, and has a predicted CPI (cycles per instruction) of 1.6. These numbers can be summarised by a single figure of 0.7 MIPS/LUT. By comparison, a single

16-thread pure RV32I Tinsel core (with tightly-coupled data and instruction memories) uses 500 ALMs, clocks at 450MHz, and has a predicted CPI of 1 (there are no pipeline hazards due to multithreading), giving a figure of 0.9 MIPS/LUT. This rough comparison assumes a highly threaded workload, and involves Fmax and LUT counts taken from different FPGA architectures (Virtex Ultrascale versus Stratix V). Unlike GRVI, Tinsel is not appropriate for single-threaded workloads.

Gray hand-maps a remarkable 1,680 GRVI cores clocking at 250MHz onto a modern, large Xilinx XCVU9P FPGA using relationally placed macros. However, the hand-mapped approach is quite fragile, and its effectiveness could be offset when introducing off-the-shelf IP into the design, e.g. DRAM/SRAM controllers, Ethernet MACs, FPUs, or custom accelerators, all of which are likely to reduce regularity. Off-chip memory access, inter-FPGA communication, and floating-point are left for future work. Gray also cites high-level programming support as an important goal for the future, which we have begun to explore in this paper."
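For what it's worth, the MIPS/LUT figure of merit in the quoted comparison is straightforward to reproduce. This sketch treats ALMs and LUTs interchangeably, as the quote's own "rough comparison" does:

```python
def mips_per_lut(fmax_hz, cpi, luts):
    """Figure of merit from the quoted comparison: (Fmax / CPI) / LUTs, in MIPS per LUT."""
    return (fmax_hz / cpi) / luts / 1e6

# GRVI: 375 MHz, CPI 1.6, 320 LUTs  -> ~0.7 MIPS/LUT
grvi = mips_per_lut(375e6, 1.6, 320)

# Tinsel: 450 MHz, CPI 1, 500 ALMs  -> 0.9 MIPS/LUT
tinsel = mips_per_lut(450e6, 1.0, 500)

print(round(grvi, 2), round(tinsel, 2))   # 0.73 0.9
```

Note that CPI matters as much as Fmax here: Tinsel's multithreading buys CPI of 1, which is what closes the gap despite its larger area.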

You likely wouldn't want to use it for the kind of management purposes I described earlier, but it's useful for doing compute on larger FPGAs.

Theo

Reply to
Theo

I don't remember if we ever got it to work reliably at 100M full-duplex. I know it was close, and it worked OK half-duplex. I do remember there was a _lot_ of messing around with interleave, burst length, etc.

IIRC, we eventually got to the point where we only needed the dongle when the HW guys changed something and the register map changed. I ended up writing several scripts so we could use plain old Makefiles after the register maps had been generated.

Once I got a UART working so I could print messages, I just gave up on the JTAG BS. Another interesting quirk was that the Altera USB JTAG interface only worked right with a few specific models of powered USB hubs.

We needed a pretty decent-sized FPGA for custom peripherals and an I/O processor. So, Altera sold somebody on the idea of moving the uC into the FPGA also.

Definitely. The M-class parts are so cheap, there's not much point in thinking about doing it in an FPGA.

I did not know that. :)

--
Grant
Reply to
Grant Edwards

Lattice provide the 8-bit MICO that fits well even on their smallest parts, with a complete C compiler build chain.

The actual software is a tyre fire though, running only on some version of Enterprise Linux that hasn't been updated in almost a decade.

CH

Reply to
Clifford Heath

A "tyre fire"??? That's one I haven't heard before.

I never figured out why the British call the things that make their cars ride smooth by the name of an ancient city.

; )

--
  Rick C. 

Reply to
Rick C

Australians too.

Some US language is ancient English (but modern English has moved on), and sometimes it's the reverse. "Aluminium/Aluminum" is an example where English moved on (to improve standardisation).

CH

Reply to
Clifford Heath

Sorry, can you explain the aluminium/aluminum thing? I know some people pronounce it with an accent (not saying who) but I don't get the English moved on thing.

--
  Rick C. 

Reply to
Rick C

Aluminum is the original name, which Americans retained when the English decided to standardise on the -ium extension that was being used with most other metals already.

That's my understanding anyhow.

CH

Reply to
Clifford Heath

Right, but the idea is to have both.

Reply to
Paul Rubin

OK, thanks

--
  Rick C. 

Reply to
Rick C

Yes, that is correct (AFAIK). This is one of the differences between spoken English and spoken American that always annoys me when I hear it - I don't really know why, and of course it is unfair and biased. The other one that gets me is when Americans pronounce "route" as "rout" instead of "root". A "rout" is when one army chases another army off the battlefield, or a groove cut into a piece of wood. It is not something you do with a network packet or pcb track!

I'm sure Americans find it equally odd or grating when they hear British people "rooting" pcbs and network packets.

:-)

Reply to
David Brown

I've seen the word "rooted" used in a much more vulgar sense in too many British works to think you don't know why that just sounds wrong when applied to PWBs.

--
  Rick C. 

Reply to
Rick C

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.