In the Forth language group there are occasional discussions of custom processors. This is mostly because a simple stack processor can be implemented in an FPGA very easily, using few resources and running at a good speed. Such a stack processor is a good target for Forth.
I know there are many other types of CPU designs which are implemented in FPGAs. I'm wondering how often these are used by the typical embedded developer?
It's not uncommon when you have an FPGA doing some task to need a processor in there to manage it - for example to handle error conditions or to report status over some software-friendly interface like I2C, or classical printf debugging on a UART.
Xilinx provides MicroBlaze and Intel/Altera provide NIOS, which are a bit legacy (the toolchains aren't that great, etc), although designed to be fairly small. There's increasing interest in small 32-bit RISC-V cores for this niche, since the toolchain is improving and the licensing conditions are more liberal (RISC-V itself doesn't design cores, but lots of others - including ourselves - have made open source cores to the RISC-V specification).
I don't see that Forth has much to speak to this niche, given these cores are largely offloading work that's fiddly to do in hardware and people just want to write some (usually) C to mop up the loose ends. I can't see what Forth would gain them for this.
If you're talking about using these processors as a compute overlay for FPGAs, a different thing entirely, I'd suggest that a stack processor is likely not very efficient in terms of throughput, although I don't have a cite to back that up.
I worked on a project using a NIOS2 CPU core in an Altera FPGA. It was a complete disaster. The CPU's throughput (running at 60MHz) was lower than the 40MHz ARM7 we were replacing. The SDRAM controller's bandwidth was so low it couldn't keep up with 100Mbps full duplex Ethernet. The SW development environment was incredibly horrible and depended on some ancient version of Eclipse (and required a license dongle) and a bunch of bloated tools written by some incompetent contractor. There were Java apps that generated shell scripts that generated TCL programs that generated C headers (or something like that). The JTAG debugging interface often locked up the entire CPU.
After spending a year developing prototype products that would never quite work, we abandoned the NIOS2 in favor of an NXP Cortex M3.
That's a pretty good summary of the experience. A few things have improved slightly:
- if you have the space it's better to use an on-chip BRAM rather than SDRAM, given the SDRAM is often running at ~100MHz x 16 bits, which makes even instruction fetch multi-cycle. DDR is better but the memory controllers are much more complex, and life gets easier when you have cache.
- they've upgraded to a plugin for a modern version of Eclipse rather than a fork from 2005, but it's still Eclipse :( I just drive the shell scripts directly (although the pile of Make they generate isn't that nice). I've never had it need a dongle.
- the JTAG interface is horrible and I've spent way too much time reverse engineering it and working around its foibles, in particular the JTAG UART, which is broken in several ways (google 'JTAG Atlantic' for some workarounds)
These days a RISC-V CPU in FPGA solves a lot of the horrible proprietaryness, although you still have to glue the toolchain together yourself (git clone riscv-gcc).
But if you can do the job with a hard MCU I can't see why you'd want an FPGA.
Personally I much prefer the hard ARM cores on FPGAs these days - they're a Proper CPU that Just Works. And the Cortex A-class cores can boot Linux, which makes the software development workflow a lot nicer. They aren't that beefy (a $10000 Stratix 10 has a quad-core A53, roughly a Raspberry Pi 3), parts with them in are often hard to get, and the bandwidth between the ARM and the soft logic often isn't very good.
Did you know the Altera product codenames were based on dragons from How to Train Your Dragon? Interesting what you find out when running strace(1) on the binary...
I don't remember if we ever got it to work reliably at 100M full-duplex. I know it was close, and it worked OK half-duplex. I do remember there was a _lot_ of messing around with interleave, burst length, etc.
IIRC, we eventually got to the point where we only needed the dongle when the HW guys changed something and the register map changed. I ended up writing several scripts so we could use plain old Makefiles after the register maps had been generated.
Once I got a UART working so I could print messages, I just gave up on the JTAG BS. Another interesting quirk was that the Altera USB JTAG interface only worked right with a few specific models of powered USB hubs.
We needed a pretty decent-sized FPGA for custom peripherals and an I/O processor. So, Altera sold somebody on the idea of moving the uC into the FPGA also.
Definitely. The M-class parts are so cheap, there's not much point in thinking about doing it in an FPGA.
I've spent months working around such problems :( We have an application that pushes gigabytes through JTAG UARTs and have learnt all about it...
There's a pile of specific issues:
- the USB 1.1 JTAG is an FT245 chip which basically bitbangs JTAG; it sends a byte containing 4 bits for the 4 JTAG wires. The software is literally saying "clock high, clock low, clock high, clock low" etc. Timing of that is not reliable. Newer development boards have a USB 2.0 programmer where things are a bit better here, but it's still bitbanging.
- being USB 1.1, if you have a cheap USB 2.0 hub it may only have a single Transaction Translator (single-TT), which means all USB 1.1 peripherals share 12Mbps of bandwidth. In our case we have 16 FPGAs all trying to chat using that shared 12Mbps. Starvation occurs and nobody makes any progress. A better multi-TT hub allows each port its own 12Mbps stream within the 480Mbps USB 2.0 bandwidth. Unfortunately when you buy a hub this is never advertised or explained.
- The software daemon that generates the bitbanging data is called jtagd and it's single threaded. It can max out a CPU core bitbanging, and that can lead to unreliability. I had an Atom where it was unusable. I now install i7s in servers with FPGAs, purely to push bits down the JTAG wire.
- To parallelise downloads to multiple FPGAs, I've written some horrible containerisation scripts that lie to each jtagd that there's only one FPGA in the system. Then I can launch 16 jtagds and use all 16 cores in my system to push traffic through the JTAG UARTs.
- Did I mention that programming an FPGA takes about 700MB? So I need to fit at least 8GB of RAM to avoid memory starvation when doing parallel programming (if the system swaps the bitbanging stalls and the FPGA programming fails)
- there's some trouble with jtagd and libudev.so.0 - if you don't have it things seem to work but get unreliable. I just symlink libudev.so.1 on Ubuntu and that seems to fix it.
- the register-level interface of the JTAG UART isn't able to read the state of the input FIFO without also dequeuing the data on it. Writing reliable device drivers is almost impossible. I have a version that wraps the UART in a 16550 register interface to avoid this problem.
- if the FPGA is failing timing, the producer/consumer of the UART can break in interesting ways, which look a lot like there's some problem with the USB hub or similar.
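The destructive-read problem with the JTAG UART data register can be worked around with one byte of lookahead in the driver. A minimal sketch: the register layout here (data in bits [7:0], RVALID in bit 15) follows the documented core but treat the details as illustrative, and the simulated FIFO stands in for what would really be a volatile MMIO read.

```c
#include <stdint.h>

#define JUART_RVALID 0x8000u  /* bit 15: read data was valid */

/* Stand-in for the memory-mapped data register so the sketch runs
 * anywhere; in a real driver this is a volatile MMIO read, and every
 * such read dequeues a byte whether you wanted one or not. */
static uint32_t rx_fifo[16];
static unsigned rx_head, rx_tail;

void sim_rx(uint8_t c) { rx_fifo[rx_tail++ & 15] = c; }

static uint32_t read_data_reg(void)
{
    if (rx_head == rx_tail)
        return 0;                       /* RVALID clear: FIFO empty */
    return JUART_RVALID | rx_fifo[rx_head++ & 15];
}

/* One byte of lookahead so callers get a non-destructive poll. */
static int have_byte;
static uint8_t saved_byte;

int juart_poll(void)                    /* 1 if a byte is waiting */
{
    if (!have_byte) {
        uint32_t r = read_data_reg();
        if (r & JUART_RVALID) {
            saved_byte = (uint8_t)r;    /* stash the dequeued byte */
            have_byte = 1;
        }
    }
    return have_byte;
}

int juart_getc(void)                    /* -1 if no data */
{
    if (!juart_poll())
        return -1;
    have_byte = 0;
    return saved_byte;
}
```

Wrapping this in a 16550-style register interface, as mentioned above, pushes the same lookahead into hardware so unmodified UART drivers work.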
It's a very precarious pile of hardware and software that falls over in numerous ways if pushed at all hard :(
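To make the bitbanging cost from the first bullet concrete: every JTAG clock cycle takes at least two USB bytes, one with TCK low (data set up) and one with TCK high (data sampled). A minimal sketch of that scheme - the bit positions are made up for illustration, the real USB-Blaster byte layout differs:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bit positions within each byte sent to the FT245;
 * the real assignment differs, but the structure is the point. */
#define BIT_TCK 0x01
#define BIT_TMS 0x02
#define BIT_TDI 0x04

/* Append the bytes that clock one JTAG bit: drive TMS/TDI with TCK
 * low, then raise TCK so the target samples them.  Two USB bytes per
 * clock cycle is why this is so slow. */
static size_t jtag_clock_bit(uint8_t *buf, size_t pos, int tms, int tdi)
{
    uint8_t pins = (tms ? BIT_TMS : 0) | (tdi ? BIT_TDI : 0);
    buf[pos++] = pins;            /* TCK low, data set up   */
    buf[pos++] = pins | BIT_TCK;  /* TCK high, data sampled */
    return pos;
}

/* Shift an n-bit value through TDI, LSB first, TMS held low. */
size_t jtag_shift(uint8_t *buf, size_t pos, uint32_t value, int nbits)
{
    for (int i = 0; i < nbits; i++)
        pos = jtag_clock_bit(buf, pos, 0, (value >> i) & 1);
    return pos;
}
```

At 12Mbps of USB 1.1 bandwidth, two bytes per TCK edge pair caps you well below 1MHz of effective JTAG clock before you even account for USB scheduling overhead.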
Theo [adding comp.arch.fpga since this is relevant to those folks]