CPU design

Then perhaps there should be a "NanoBlaze"? ;) (micro, nano, pico)

-Dave Pollum

Reply to
Dave Pollum

Dave Pollum wrote:

Dave,

NanoBlaze is already a registered trademark (R) of Xilinx Inc.

Antti

Reply to
Antti

Damn, I'm working on a 16-bit RISC CPU optimized for S3/V4 and I was thinking of nanoblaze ... gotta find some other name now ...

Sylvain

Reply to
Sylvain Munaut

If it is going to be open source, how about NanoFire ?

-jg

Reply to
Jim Granville

Yes, this is what I'm planning.

I have another idea for a CPU, very RISC-like. The bits of an instruction are something like micro-instructions:

There are two internal 16 bit registers, r1 and r2, on which the core can perform operations, and 6 "normal" 16 bit registers. The first 2 bits of an instruction define the meaning of the rest:

2 bits: operation:
  00 load internal register 1
  01 load internal register 2
  10 execute operation
  11 store internal register 1

I think it is a good idea to use 8 bits for one instruction instead of using non-byte-aligned instructions, so we have 6 bits for the operation. Some useful operations:

6 bits: execute operation:
  r1 = r1 and r2
  r1 = r1 or r2
  r1 = r1 xor r2
  cmp(r1, r2)
  r1 = r1 + r2
  r1 = r1 - r2
  pc = r1
  pc = r1, if c=0
  pc = r1, if c=1
  pc = r1, if z=0
  pc = r1, if z=1

For the load and store micro instructions, we have 6 bits for encoding the place on which the load and store act:

6 bits place:
  1 bit: transfer width (0=8, 1=16 bits)
  2 bits: source/destination:
    00: register
        3 bits: register index
    01: immediate
        1 bit: width of immediate value (0=8, 1=16 bits)
        next 1 or 2 bytes: immediate number (8/16 bits)
    10: memory address in register
        3 bits: register index
    11: address
        1 bit: width of address (0=8, 1=16 bits)
        next 1 or 2 bytes: address (8/16 bits)

The transfer width and the width of the value need not be the same. E.g. 1010xx means that the next byte is loaded into the internal register and the upper 8 bits are set to 0.
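To make the bit layout above concrete, here is a minimal decoder sketch in Python. The MSB-first bit order, the field names and the `decode` function are my assumptions for illustration; the post only gives the field widths, not a final encoding.

```python
# Sketch of a decoder for the 8-bit instruction format described above.
# Assumption: fields are packed MSB first (2-bit kind, then the 6-bit rest).
# All names here are illustrative, not part of the original design.

KINDS = {0b00: "load r1", 0b01: "load r2", 0b10: "execute", 0b11: "store r1"}
PLACES = {
    0b00: "register",
    0b01: "immediate",
    0b10: "memory[register]",
    0b11: "memory[address]",
}


def decode(byte):
    """Split one instruction byte into its fields and return them as a dict."""
    kind = (byte >> 6) & 0b11
    low6 = byte & 0b111111
    if kind == 0b10:
        # "execute": the remaining 6 bits select one of up to 64 operations
        return {"kind": KINDS[kind], "operation": low6}

    fields = {
        "kind": KINDS[kind],
        "transfer_width": 16 if (low6 >> 5) & 1 else 8,
        "place": PLACES[(low6 >> 3) & 0b11],
    }
    place = (low6 >> 3) & 0b11
    rest = low6 & 0b111
    if place in (0b00, 0b10):
        # register or register-indirect: 3-bit register index, no extra bytes
        fields["register"] = rest
        fields["extra_bytes"] = 0
    else:
        # immediate or absolute address: 1 width bit, then 1 or 2 extra bytes
        wide = (rest >> 2) & 1
        fields["operand_width"] = 16 if wide else 8
        fields["extra_bytes"] = 2 if wide else 1
    return fields
```

With this layout, the 1010xx example combined with the "load r1" prefix is the byte 0b00101000 (0x28): a 16 bit transfer from an 8 bit immediate, followed by one operand byte.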

But for this reduced instruction set a compiler would be a good idea. Or different layers of assembler. I'll try to translate my first CPU design, which needed 40 bytes:

; swap 6 byte source and destination MACs
        .base = 0x1000
p1:     .dw 0
p2:     .dw 0
tmp:    .db 0
        move #5, p1
        move #11, p2
loop:   move.b (p1), tmp
        move.b (p2), (p1)
        move.b tmp, (p2)
        sub.b p2, #1
        sub.b p1, #1
        bcc.b loop

With my new instruction set it could be written like this (the normal registers 0 and 1 are constant 0 and 1) :

        load r1 immediate with 5
        store r1 to register 2
        load r1 immediate with 11
        store r1 to register 3
loop:   load r1 from memory address in register 2
        load r2 from memory address in register 3
        store r1 to memory address in register 3
        store r2 to memory address in register 2
        load r1 from register 3
        load r2 from register 1
        operation r1 = r1 - r2
        store r1 to register 3
        load r1 from register 2
        operation r1 = r1 - r2
        store r1 to register 2
        operation pc = loop if c=0

This is 20 bytes long. As you can see, there are micro optimizations possible, like for the last two register decrements, where the subtrahend needs to be loaded only once.
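As a sanity check on the listing and the byte count, here is a toy Python model of it. Everything in it is an assumption for illustration: the tuple mnemonics, `c` being a borrow flag set when a subtraction underflows, an 8 bit immediate load of the loop address before the final branch (which is what makes the count come out at 20), and a "store r2" that the listing uses although the 2 bit prefix table only provides a store for r1.

```python
# Toy model of the micro-instruction listing above (not a real core).
# Assumptions: registers 0 and 1 hold the constants 0 and 1, "c" is a borrow
# flag, loads do not touch the flags, and the branch target is an index into
# PROGRAM rather than a byte address.

PROGRAM = [
    ("imm", "r1", 5), ("st_reg", "r1", 2),
    ("imm", "r1", 11), ("st_reg", "r1", 3),
    # loop starts here (index 4)
    ("ld_mem", "r1", 2), ("ld_mem", "r2", 3),
    ("st_mem", "r1", 3), ("st_mem", "r2", 2),
    ("ld_reg", "r1", 3), ("ld_reg", "r2", 1), ("sub",), ("st_reg", "r1", 3),
    ("ld_reg", "r1", 2), ("sub",), ("st_reg", "r1", 2),
    ("imm", "r1", 4), ("jmp_c0",),   # load loop target, then pc = r1 if c=0
]


def size_in_bytes(program):
    """One byte per micro-instruction, plus one byte per 8 bit immediate."""
    return sum(2 if ins[0] == "imm" else 1 for ins in program)


def run(program, mem):
    """Interpret the program on a mutable byte buffer and return the buffer."""
    reg = [0, 1, 0, 0, 0, 0]          # the six "normal" registers
    acc = {"r1": 0, "r2": 0}          # the two internal registers
    carry = 0
    pc = 0
    while pc < len(program):
        ins = program[pc]
        op = ins[0]
        pc += 1
        if op == "imm":
            acc[ins[1]] = ins[2]
        elif op == "ld_reg":
            acc[ins[1]] = reg[ins[2]]
        elif op == "st_reg":
            reg[ins[2]] = acc[ins[1]]
        elif op == "ld_mem":
            acc[ins[1]] = mem[reg[ins[2]]]        # 8 bit load, upper bits 0
        elif op == "st_mem":
            mem[reg[ins[2]]] = acc[ins[1]] & 0xFF  # 8 bit store
        elif op == "sub":
            diff = acc["r1"] - acc["r2"]
            carry = 1 if diff < 0 else 0           # borrow on underflow
            acc["r1"] = diff & 0xFFFF
        elif op == "jmp_c0":
            if carry == 0:
                pc = acc["r1"]
    return mem
```

Running it on `bytearray(range(12))` swaps the two 6 byte halves, and `size_in_bytes(PROGRAM)` gives 20 under the assumptions above.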

I think this instruction set could be implemented with very few gates compared to other instruction sets, and the memory usage is low, too. Another advantage: 64 different instructions are possible, and orthogonal higher levels are easy to implement with it, because the load and store operations work on all possible places. It would not be the fastest core, but this is no problem for my application.

The only problem is that you need a C compiler or something like this, because writing assembler with this reduced instruction set looks like it will be no fun.

Widening the core from 16 bits to 32 bits or more would be easy to implement with generic parameters.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

sounds good, i do wonder if micron is a trademark too, any other good names??

Minon is a possibility, but it may end up being called indi.

maybe a thread should just list unoccupied namespace :-)

cheers

Reply to
jacko

Since this is a very specific application, do you have a handle on the code size yet ?

Another angle to this, would be to choose the smallest CPU for which a C compiler exists.

Here, Freescale's new RS08 could be a reasonable candidate ?

Or choose another, more complex core and then scan the compiled output to check the opcode usage, and subset that.

-jg

Reply to
Jim Granville

just got quartus II after 1/2 hr seems ok, after setting top level!!

i wonder if the avalon sopc includes usb?

not sure if there is a c compiler for it.

well at least i have a vhdl compiler now which looks good.

must start on the micron design soon.

Reply to
jacko

Yes, I have carry and zero flag. To make the implementation of the core easier, I think I'll use one bit of the instruction set to determine if the flags are updated or not.

Why? I think I can implement a "call" instruction like in 68000:

r2=pc
pc=r1

In the subroutine I can save r2, if I need more call stack.
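A small Python model of this link-register style call may help. The `Core` class, its method names and the list used to park r2 are hypothetical; the post only states the two transfers r2=pc and pc=r1.

```python
# Model of a call that stores the return address in r2 instead of on a stack.
# Assumption: pc already points at the instruction after the call when the
# call executes. The "saved" list stands in for wherever a subroutine parks
# r2 before it calls further; names are illustrative only.

class Core:
    def __init__(self):
        self.pc = 0
        self.r1 = 0
        self.r2 = 0
        self.saved = []

    def call(self):
        """r2 = pc; pc = r1"""
        self.r2 = self.pc
        self.pc = self.r1

    def ret(self):
        """Return by moving the link value into r1 and executing pc = r1."""
        self.r1 = self.r2
        self.pc = self.r1

    def save_link(self):
        """Needed only in subroutines that make further calls."""
        self.saved.append(self.r2)

    def restore_link(self):
        self.r2 = self.saved.pop()
```

A leaf subroutine never touches `save_link`, so the common case costs no memory traffic at all; only nested calls pay for saving r2.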

Interrupts could be implemented by saving the PC register in a special register and restoring it by calling a special return instruction.

64 instructions are possible, so relative branching is a good idea, and I'll use the same concept, with one bit deciding whether a branch is absolute or relative.

I've implemented a simple Forth for Java, and programming in Forth is just different, not more difficult.

The MARC4 from Atmel uses qForth.

Maybe you are right and the core and programs are smaller with Forth, I'll think about it. Really useful is that it is simple to write an interactive read-eval-print loop in Forth (like in Lisp), so that you can program and debug a system over RS232.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

and Java also:

could not resist ;-)

Martin

Reply to
Martin Schoeberl

You have tested both: a "normal" instruction set and a stack machine. For the stack machine you wrote that it is two times faster. What about code size and the size of the core?

I've downloaded your code, and it looks like it is implemented very close to the hardware, instead of using arbitrary VHDL and letting the synthesizer decide how to implement it. A good idea for my implementation :-)

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Mmh, this statement is from a very early version of JOP (about 2001). It was a comparison on the implementation of the Java virtual machine (JVM) in two different types of microcode.

About code size: It's the code (bytecode) that the Java compiler generates plus some class information. Bytecode is efficient, but class information adds to the memory footprint. The size depends on the support of Java libraries.

Core size is configurable, starting from about 1000 LCs. A well balanced version of JOP is about 2000 LCs.

What do you mean by 'very close to the hardware'? I try to avoid vendor-specific library elements as much as possible and stay with plain VHDL. If you mean that the VHDL coding style is more hardware oriented, then I agree. I started directly with an FPGA implementation and did almost no simulation.

Martin

Reply to
Martin Schoeberl

So will you have instructions that save the C and Z values? Imagine doing a cmp instruction and then taking an interrupt: the interrupt handler will also use these flags, so when you return, the interrupted program will use the wrong values.

Reply to
Göran Bilski

Yes, I think an "r1 to flags register" and a "flags register to r1" instruction will be sufficient, a little bit like 6502 txs and tsx.
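The problem and the proposed fix can be sketched in a few lines of Python. The bit positions (c in bit 0, z in bit 1), the compare semantics and all function names are assumptions made for this illustration.

```python
# Why an interrupt handler must save and restore the flags.
# Assumptions: c is a borrow flag, z means "equal", and the flags map to
# bits 0 and 1 of r1. None of this is fixed by the original post.

def flags_to_r1(c, z):
    """'flags register to r1': pack the flags into a register value."""
    return (z << 1) | c


def r1_to_flags(r1):
    """'r1 to flags register': unpack a register value into (c, z)."""
    return r1 & 1, (r1 >> 1) & 1


def cmp_flags(a, b):
    """Return (c, z) as a compare of a and b would set them."""
    return (1 if a < b else 0), (1 if a == b else 0)


def interrupt_handler(flags, save):
    """Run a handler that does its own compare; return the flags seen
    by the interrupted program afterwards."""
    saved = flags_to_r1(*flags) if save else None
    flags = cmp_flags(1, 2)          # the handler clobbers c and z
    if save:
        flags = r1_to_flags(saved)   # restore before returning
    return flags
```

If the main program has just compared 7 with 7, it expects (c, z) = (0, 1). With `save=True` it still sees that after the interrupt; with `save=False` it sees the handler's (1, 0) and takes the wrong branch.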

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Simpler solution - have the microcode FSM push the flags to the stack. It's a simple alteration, and saves a lot of heartache. I have contemplated even pushing the entire context to the stack, since I can burst write from the FSM a lot faster than I can with individual PSH/POP instructions, but I figure that would be overkill.

Reply to
radarman

Quite a while back I designed a small microcontroller for a Xilinx XC4000E series part that used approximately 80 LUTs and ran at, IIRC, 105 MHz; I think it was in a 4020XL. It was a simple RISC machine that was sort of a cross between a PIC microcontroller and an RCA 1802. It had a register file with 16 registers like the 1802, and had a small instruction set similar to a PIC. If I recall correctly, it was a Harvard architecture. The ISA was specifically designed for the FPGA architecture.

Anyway the difficult part about it was that it had no programming tools to support it. We did write a crude assembler for it, but that was about as far as we took it. The point is, the hardware and ISA design is only part of the job. The tools development is as big a piece as the processor design itself.

Reply to
Ray Andraka

I did a google for and saw lots of things

I was looking specifically for Adam Dunkels; he gets a lot of press on OSNews and other sites for his various embedded OS projects.

His uIP stack claims to be the world's smallest TCP/IP stack; it uses 4-5 KB of code space and only a few hundred bytes of RAM. uIP has been ported to a wide range of systems and is used in many commercial projects. He mentions ABB, Altera, BMW, Cisco Systems, Ericsson, GE, HP, Volvo Technology, Xilinx. lwIP is a bigger, faster version of uIP.

Besides uIP he also has a tiny OS, Contiki, and a Protothreads package.

John Jakson transputer_guy

Reply to
JJ

For someone doing a fully custom/own assembler/compiler :

The tiniest CPUs do not need a stack, and interrupts do not need to be re-entrant, so a faster context switch is to re-map the registers, flags (and even the PC?) onto a different area in BRAM. You can share this resource by having interrupts re-map top-down and calls re-map bottom-up, with a hardware trap when they collide :)
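The shared-BRAM idea can be modelled as two bank pointers that grow towards each other. The following Python sketch is only an illustration: the class name, the bank count and the use of an exception for the hardware trap are all assumptions.

```python
# Register banks in one block RAM: calls take banks from the bottom up,
# interrupts take banks from the top down, and a collision traps.
# Assumption: bank 0 belongs to the main program; names are illustrative.

class BankAllocator:
    def __init__(self, banks):
        self.call_top = 0          # highest bank in use by the call chain
        self.int_bottom = banks    # lowest bank in use by interrupts
                                   # (== banks means none in use)

    def call(self):
        """Switch to the next bank upwards; trap if it is already taken."""
        if self.call_top + 1 >= self.int_bottom:
            raise RuntimeError("trap: call and interrupt banks collide")
        self.call_top += 1
        return self.call_top

    def ret(self):
        """Return to the caller's bank (no underflow check in this sketch)."""
        self.call_top -= 1
        return self.call_top

    def interrupt(self):
        """Switch to the next bank downwards; trap if it is already taken."""
        if self.int_bottom - 1 <= self.call_top:
            raise RuntimeError("trap: call and interrupt banks collide")
        self.int_bottom -= 1
        return self.int_bottom

    def reti(self):
        """Release the interrupt bank again."""
        self.int_bottom += 1
```

A context switch then costs only a pointer update instead of a sequence of push and pop instructions, which is the speed argument made above.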

-jg

Reply to
Jim Granville

Yes, this is what I meant, e.g. figures 5.6 to 5.9 of your thesis, where you describe the processor pipeline with gates, which is implemented like this in VHDL. But maybe this is the normal case and I'm just too new to VHDL to write and interconnect components in this way.

Why not? When I was implementing the CRC32 check for my network core, I tested the algorithm with a VHDL testbench (Ethernet packet send and receive now works at 10 Mbit and 100 Mbit on my Spartan 3E starter kit). The turnaround times are faster with simulation, and it is much easier to debug than a synthesized core in hardware. The same was true for my DS2432 ROM id reader, where I wrote the testbench first and then implemented the reader.
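For this kind of testbench it helps to have a software reference to generate expected values. Below is a bitwise CRC-32 in Python using the reflected polynomial 0xEDB88320, the variant used for the Ethernet frame check sequence and by `zlib.crc32`. This is not the VHDL from the post, only a hedged sketch of a reference model; the function names are mine.

```python
import zlib


def crc32_bitwise(data):
    """Bit-serial CRC-32 (reflected, init 0xFFFFFFFF, final XOR 0xFFFFFFFF).

    Processing one bit per loop iteration mirrors what a serial hardware
    implementation does per clock, which makes it easy to compare
    intermediate values against a simulation.
    """
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def matches_zlib(data):
    """Cross-check the bitwise version against the library implementation."""
    return crc32_bitwise(data) == zlib.crc32(data)
```

The standard check string `b"123456789"` gives 0xCBF43926 with both implementations.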

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Once you get into seeing clearly the relationship between features and cost a lot can be removed.

Interrupts can be removed at extremely low cost to applications. Neither the Microchip PIC12 nor the Freescale RS08 has interrupts. In the RS08 C compiler we developed some software IP that, where possible, goes into a power-down mode and launches execution threads that are compiled to run to completion.

The threads are typically short, and as a side effect, run-to-completion makes local re-use easy.

C compilers implemented for small processors work well without either a data stack or a subroutine return stack. Two of the processors we have written compilers for in the last couple of years both used an accessible return register. Flow control analysis in the compiler makes nested subroutines transparent to the user.

The instruction set reduction in the RS08 from the S08 parent had a 4-6% impact on application performance.

Walter..

Reply to
Walter Banks
