A Challenge for serialized processor design and implementation

Antti · 2008-03-19T06:28:05+00:00

HiI have been think and part time working towards a goal to make useableand useful serialized processor. The idea is that it should be1) VERY small when implemented in any modern FPGA (less 25% ofsmallest device, 1 BRAM)2) be supported by high level compiler (C ?)3) execute code in-place from either serial flash (Winbond quad speedSPI memory delivers 320 mbit/s!) or from file on sd-cardserial implementation would be smaller and run at higher speeds, so128 clock per machine cycle would already mean 2 MIPS, what would beacceptable for many applications.Parallax basic stamps I executes 2KIPS only, so ultra lite serialprocessor in FPGA with 2 MIPS would be eh, for me its some to dreamoff :)I have poked around this idea for some years, but never got the "finalkick" to really go and do-complete the design and development of thisprocessor.So I decided to offer some bounty for others to maybe motivate to workfor this goal and dream, current list of items available for thedevelopers from my own funding is listed here (I hope to add items andmaybe some $ by the time)there is also very preliminary spec-goal document as wellAntti Lukats

W

Walter Banks 18 years ago

Interesting...

It will take some thought on how to implement the right shift and logical operators.

Walter..

Vote

K

Krzysztof Kepa 18 years ago

I agree with you. However, plethora of taxonomies CISC/RISC/DISC/OISC, SISD/SIMD/MISD/MIMD, hardware/software/configware and so on gives the opportunity that everybody finds what he feels comfortable with...

Cheers, Krzysztof

Vote

R

Ron N. 18 years ago

Implementation details: either you would need 3-address instructions, or some form of load/store operations (indirect to support C pointers) to a small stack or small register set. C subroutine support would also seem to require some sort of push PC, and either pop PC or jmp indirect operations. Also, couldn't a nand or nor operator replace the bitwise-invert and "or" operators above?

Can anyone point out a C compiler that supports an instruction set this minimal? Can something like LLVM target this small an instruction set?

Thanks.

-- rhn A.T nicholson d.0.t C-o-M

Vote

E

Everett M. Greene 18 years ago

You need XOR for add/subtract so it can be one of the operations at little extra cost. Some sort of CALL and RETURN is needed for the processor to be useful. AND is needed as well.

^^^^^^^^^^^^^^^^^^^^^^^^^ as with ARM?

Vote

J

Jim Granville 18 years ago

and a SPI port should be simple :) (at least at one CLK speed)

-jg

Vote

A

Andy Botterill 18 years ago

The ARM can do logical, arithmetic abd rorates. Andy

Vote

W

Walter Banks 18 years ago

Call return can be implemented it is just slow. We have created a compiler for at least one processor without call return

w..

Vote

W

Walter Banks 18 years ago

We have implemented C compilers with missing instructions just not all the instructions alluded to in the post. To implement a C compiler there are about 90 basic sequences that need to be defined. After that 1400 rules or so are needed to generate reasonable code.

LLVM despite its goals is not likely to generally succeed very well in diverse targets. It is an approach many commercial compiler companies including ours has tried and abandoned.

Regards,

-- Walter Banks Byte Craft Limited Tel. (519) 888-6911

formatting link

Vote

D

David Brown 18 years ago

(a & b) == ~(~a | ~b) (a ^ b) == (a | b) & ~(a & b)

Thus you don't need AND or XOR. Similarly, you subtraction.

As for CALL and RETURN, you need some way to implement them, but they don't need to be fundamental operations. There are many ways to create them, especially if you have a way to directly read the current PC. Even if you don't have any way to read or store the PC at run-time, it is still possible to simulate CALL and RETURN. Imagine a machine whose only control flow instructions are a direct jump and some sort of conditional. Every time you want to use a "CALL", you are effectively making a note of the return point, doing a jump to the function, then (with the RETURN) doing a jump to the return point. So the compiler needs to collect all the return points in the program, and give each a unique number. The CALL mechanism will stack that number, and the RETURN mechanism will look up that number in a table to get the required return address. If there is no indirect jump instruction, then the RETURN will be a function full of "if (returnNo == x) goto returnX;" statements (or something faster - perhaps a binary search system). The same mechanism can be used for function pointers, and to overcome other architectural limitations such as limited hardware CALL/RETURN stack size, or paging issues.

Vote

E

Everett M. Greene 18 years ago

I was just taking a shot at the silly rotation of one of the operands that can be done with (nearly) every instruction.

Vote

E

Everett M. Greene 18 years ago

But you haven't produced addition/subtraction yet. The first term of adders/subtractors is a ^ b. If you gate that output to rest of the world, you have XOR with no further fuss. NOT is available via a ^ 1s.

CALL/RETURN can be done in some very crude ways. The method used is a matter of practicality and usefulness.

Anything computable can be done with a Turing machine but it isn't very practical.

Vote

R

Ron N. 18 years ago

You can produce any arithmetical or logical operation out of just NAND gates (or just NOR gates as was actually done in some ECL supercomputer implementations), and the same can be done in software, if efficiency is of no importance. However one big difference is that arithmetic operations imply carry (or borrow) logic, and word-wide logic implies the lack of inter-bit interference (carry), which are both costly to synthesize from the other. So a MISC that operates on words should provide at least one logic operation and one arithmetic operation. Right shift and compare to constant (say zero) are also costly to synthesize, so could also be provided for efficiency and completeness.

IMHO. YMMV.

-- rhn A.T nicholson d.0.t C-o-M

Vote

R

rickman 18 years ago

If you were a customer, I would say nothing, but keep asking related questions until I got an answer I could use. But since you appear to be an engineer, I will point out that you answered my request for clarification with yet another *vague* and *undefined* answer! As long as you use terms like "too large" and "virtually free", I can't know what you mean. There was a Dilbert cartoon about this once. He kept asking his customer to clarify his requirement and the customer kept giving useless answers until finally he insisted that Dilbert should be able to read his mind! I can' read your mind, so I don't know how big "too large" is and I will never know what "virtually free" means. The cost of asking this question keeps the answer from ever being free in any sense. The extra effort required to pull a good answer from you has just raised my bid by 20%. ;^)

So what about existing FPGA soft cores is not "flexible enough"? I am pretty sure they all use either block ram or LOT ram for register files.

Ok, now we are getting a definition of "real compiler"... it is any compiler that is not exactly like the picoblaze C compiler.

But what happens in a couple of years when it is hard to find a 4 GB card and you need at least 26 bit addressing?

Yes, and you won't be able to buy 4 GB cards in another 2 or 3 years. So why bother with the 32 bit requirement? Think big and just go with

64 bits like any "real" processor will do.

We are not talking about wide datapath. We are talking about the size of the registers and the length of time to do simple operations. In a parallel implementation wider data paths cost is in silicon. In a bit serial design a wider data path cost is in time. This process is already very slow and will have limited apps due to the slowness. So why cripple it by successive factors of 2 just because the ALU logic is "free"? An 8 bit design will be 4X faster than a 32 bit design and can be just as powerful since you can easily implement 32 bit instructions by chaining 4 8 bit instructions. Then you have the speed of 8 bits and the flexibility of 32 (or even 64) bits.

Ok, registers in the DataFlash buffers... aren't they used when writing to the flash? Maybe I am not familiar with the DataFlash you use.

Optimal in *what* sense??? You have already spec'd out speed and your registers in DataFlash buffers have reduced the speed to what can be supported by the external memory. So what could "optimal" possibly mean in any sense?

Wait a year or two and the definition of "too large" and "virtually free" will change to allow not only Microblaze processor to fit the requirement, but likely any ARM implementation too!

If this is your definition of "virtually free" then there is *no* solution. There are always designs that use all practical amount of some resource in an FPGA and there are always designs that have tons of free resources. So how can you talk about using "unused" resources as if this was some guaranteed amount? On the other hand, there are going to be an incredibly small number of designs that are using an FPGA and also can't afford an $0.80 CPU.

If you want to consider a definition of "virtually free", then I suggest that you consider the increment between FPGA sizes. In the Spartan 3 the smallest increment is 2300 LUTs after accounting for the Xilinx "expanding universe" inflation factor. If your feature uses one fifth of this amount, then adding it is unlikely to cause the design to be bumped up to the next size of FPGA and so will be "free". This equates to 460 LUTs/FFs in the *smallest* device, the

3S50. As the starting chip size gets larger, the increment gets larger so that even in the lowly 3S400 "virtually free" means 1,638.4 LUTs.

I think that for any but most strict applications, a standard, small soft core processor is "small enough" to be "virtually free".

Vote

R

rickman 18 years ago

Actually, it is not that you *can* produce all logic from NAND gates, in effect that is how it *is* done. The basic logic element consists of transistors configured to be a NAND gate, an inverter (a degenerate form of a NAND gate), a NOR gate (a NAND gate with inverted logic) and transmission gates. So inside of a chip designed at the gate level, there really is no distinction between NAND and NOR gates and there are no OR and AND gates.

Implementations like the Cray computer that used ECL NAND gates did so because of the limitations of ECL packaging at that time. The real difficulty in designing that super computer had to do with packaging and heat dissipation. So Cray attacked those problems first and fit the logic design into those constraints. A regular array of ECL DIPS on common sized circuit boards fit the thermal design well and the only variation needed was the circuit board routing.

Vote

R

referringto 18 years ago

Years ago I thought about using a serial design to implement a CPU in a GAL 16V8 or 20V8. My idea was to map all registers to memory and to use a 64kbit DRAM as a 256x8 memory that is accessed bit wise. Refresh would have been automatically provided by cycling through the registers. Unfortunately that design never got anywhere. The main issues where that all the flipflops were eaten up by state counters and the design got horribly complex. The ALU design itself was of course negligible.

Vote

J

Jim Granville 18 years ago

You would need a few 20V8's to make soemthing useful!

You could revisit that in CPLDs - but a 64kb DRAM might be hard to find tho! :) The 32K Serial SRAM could replace it tho ?

-jg

Vote

R

referringto 18 years ago

Obviously the point was specifically to do something useful with a single 20V8 :) It was the next challenge after maxing out on CPLDs (see

formatting link

4164s are still easy to come by as NOS - actually I have several tubes of them. Of course this is nothing for a new design, but GALs are of similar vintage so nothing seems wrong with marrying them to an antique DRAM.

Actually a better challenge may be to design a CPU with a minimum number standard logic chips. Even though it is highly anachronistic in the age of programmable logic, a hard wired sea-of-gates still has something gratifying to it.

Vote

G

glen herrmannsfeldt 18 years ago

rickman wrote: (snip)

I believe that is pretty much true for TTL, and I believe also for ECL, PMOS and NMOS, but not for CMOS. CMOS can directly implement either NAND or NOR gates. For non-inverted gates CMOS might require inverters at the input or output.

-- glen

Vote

P

Paul Keinanen 18 years ago

ECL is OR/NOR, since it is basically a differential pair (with multiple transistors in parallel with the left transistor for multiple input gates) and the output can be taken (using an emitter follower) from either collector resistor of the differential pair to get OR or NOR functionality. If the output is taken from both sides, there are simultaneous OR and NOR outputs.

In ECL the EXOR gate can be implemented by replacing the differential pair collector resistors with two additional differential pairs as in a Gilbert cell mixer (MC1496 style).

Paul

Vote

J

Jim Granville 18 years ago

This looked nifty, so I've downloaded this and had a quick try :

Could not fit into a 9536, or 9536XL - just too many product terms. (with default settings), so it bumped to the 72MC devices, just over 50% full.

Auto-Fitted in XC2C64, but generated 34MC's. (SIZE optimised), but PT looks OK for a smaller device. However, I did see it generated a number of intermediate Macrocells....

Hmmm... so I switched to Speed, thinking that might avoid some intermediate MCs - and, strangely, Speed optimised now FITS XC2C32A,where SIZE failed :)

Only just, but as your target was 32MC's this counts as a PASS!

[ Need to find something for that extra MC and 3PTs to do :)

- maybe a Debug/Trace LED ?, could select 1 of 3 probe points ? ]

RESOURCES SUMMARY Design Name cpu8bit2 Device XC2C32-3-PC44 Macrocells Used 31/32 (97%) Pterms Used 109/112(98%) Registers Used 24/32 (75%) Pins Used 18/33 (55%) Function Block Inputs Used 70/80 (88%)

You could add this newer device to your 32 MCell supported list. ( and I'd suggest appending the full FIT cpu8bit2.rpt file to the PDFs )

-jg

Vote

A Challenge for serialized processor design and implementation

Join the Discussion

Didn't find your answer?