[Instruction Set Architecture] Skip on (no) carry

- whygee

Sun, Apr 15, 2007 12:14 AM

Hello,

so I'm playing with

formatting link

and developing a completely new instruction set, along with an architecture, tools etc... in JavaScript (before I translate to C and VHDL). An overall description of the core is available at

formatting link

[note that it is always under construction so some parts don't work]

My question : Do you know of any processor architecture where the carry of the addition is not stored in a condition code register, but (instead) the core skips the next instruction(s) ?

I have recently come to this idea because of many self-imposed limitations, like the existence of only one write port to the register set. And skipping is a nice alternative because the carry bit is often used as a condition for a jump, so the current solution jumps immediately.

I know of many ISAs and architectures but I have not seen this before. Does anyone know a similar approach ? I post this here because this is more likely to be used in other small and embedded CPUs, rather than the large, server-scale CPUs of comp.arch.

YG (you can reply to the address at the top of the first link of this page)

- Tim Wescott

Sun, Apr 15, 2007 4:17 AM

As long as you have a one-instruction bit set, I can synthesize a carry bit so I'm mostly happy. There are times when I have done assembly language coding that I have found it convenient to wait a bit before I checked a condition bit, but I could probably cope with an add-skip-no-carry instruction (ASNC -- odd, but it'd do).

If you're inventing an instruction set, remember that the PowerPC architecture has an EIEIO instruction. Please try to top it.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Posting from Google?  See http://cfaj.freeshell.org/google/

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html

- Paul Taylor

Sun, Apr 15, 2007 6:41 AM

With regard to SHR, SAR, SHL, ROL, ROR, are ROL and ROR _really_ necessary? Not trying to discourage you from implementing them. My reason for asking is that I am playing with a compact 16-bit design, and I looked at those and decided that I could sacrifice them.

Regards,

Paul.

- Jim Granville

Sun, Apr 15, 2007 9:34 AM

I think you have a variable-length skip - which is a good idea. IIRC the COP8 had a packed jump, that was only 1 byte, and of limited reach (but efficient).

Another benefit of a short-skip opcode, is for a core you wish to feed from Serial memory : SPI Flash is getting faster all the time [Winbond have 150MBd streaming], so the sequential access time is reasonable, but a branch is more costly. That means a skip makes sense, as it does not spawn a new address, and for small distances, that is faster than the jump.

Some CPUs have conditional fields in the opcodes, which means they can skip. It tends to be wasteful, as this is not often needed, but the CC bits come along for the ride anyway.

I've also seen Conditional RET encoded, which used an otherwise unused field from the conditional jump variants, and that looked like a useful idea - esp. for assembler coding.

Have you looked at the Lattice Mico8, and PicoBlaze / PacoBlaze SoftCPUs - they have some good 'compact' ideas.

-jg

- Walter Banks

Sun, Apr 15, 2007 11:55 AM

There are quite a few processors that don't have a condition code register. For extended math yours is one approach but you can also use some form of chained multiprecision math. Multiprecision operations with 32 bit processors probably could be dropped with very little impact on most applications.

Conceptually skip and conditional skip are powerful tools that can be used in clever combinations. Generally more skip conditions can be used than conventional conditional branches. A lot of thought needs to be put into what happens with sequential skip instructions. Is a skip treated as a pre-another instruction or a separate instruction?

w..

- Walter Banks

Sun, Apr 15, 2007 12:07 PM

You can certainly sacrifice left operations. Right operations will depend a lot on the rest of the instruction set. A single barrel rotate can replace them all.

w..

- Walter Banks

Sun, Apr 15, 2007 1:01 PM

The COP8 is a remarkably compact instruction set. (We wrote a C compiler for it)

1) It used a lot of the instruction space for jumps and calls. 31 opcodes were used for the branches you referred to; there was also a 2-byte in-page branch and a 3-byte branch anywhere. It had 2- and 3-byte calls.
2) The COP8 has a swap instead of a store (but does have a load). The swap saves a lot of temp space operations in expression evaluation.
3) The COP8 was implemented as a bit-serial ALU, which made swap a very low cost instruction to implement. Most RMW instructions are very low cost in bit serial (INC, DEC, CLR, set to 1, -1 for example).
4) The interrupt service in the COP8 is worth looking at. It is implemented as a combination of minimum hardware and specialized instructions to create a vectored interrupt system. Most of the logic is software.
5) Several processors have software I/O devices; the SX is well known. One that should also be looked at is how the Z8 handled its serial port.

w..

- Paul Taylor

Sun, Apr 15, 2007 1:35 PM

I tentatively decided to have a shrc, shlc and shra - shift right through carry, shift left through carry, and shift right arithmetic. I decided on just those three because I figured that the shift operations don't get used that much - mostly to multiply or divide by two on occasion. However, using this scheme, shifting logically takes two instructions - a clear carry instruction followed by the shrc/shlc instruction. But I can live with that. It means my instruction set is a bit smaller - I have taken this approach through the whole design.
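The two-instruction logical shift described above can be modelled in a few lines. This is only an illustrative Python sketch: the 16-bit width is taken from the earlier mention of a compact 16-bit design, while the function names and the exact `shrc` behaviour (carry enters at the top bit, the bit shifted out becomes the new carry) are assumptions, not details given in the post.

```python
def shrc(value, carry_in, bits=16):
    """Shift right through carry (assumed semantics).

    The old carry enters at the top bit; the bit shifted out of the
    bottom becomes the new carry.
    """
    carry_out = value & 1
    result = (value >> 1) | (carry_in << (bits - 1))
    return result, carry_out


def logical_shift_right(value, bits=16):
    """Logical shift built from two steps: clear carry, then shrc."""
    carry = 0  # the "clear carry" instruction
    return shrc(value, carry, bits)
```

With the carry set, `shrc(0x0001, 1)` yields `(0x8000, 1)`; after clearing the carry, `logical_shift_right(0x8001)` yields `(0x4000, 1)`.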

The reason why I said tentatively above is because I haven't done that much assembly language programming especially of late, and as I progress the design, bad decisions will of course need to be put right.

Regards,

Paul.

- Walter Banks

Sun, Apr 15, 2007 2:25 PM

I have found that ASR is more important for general purpose computing than either LSR or ROR. I have dealt with many processors that had shift with carry and some that did not. Neither case is a particularly big problem for code generation.

I did not explain my barrel shift point earlier. Barrel shift or rotate is a very effective method of field extraction.
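As an illustration of field extraction with a rotate, here is a small Python sketch. The 16-bit width and the names `rotate_right` and `extract_field` are assumptions chosen for the example; the post does not describe a specific instruction.

```python
def rotate_right(value, count, bits=16):
    """Rotate a bits-wide value right by count positions."""
    count %= bits
    mask = (1 << bits) - 1
    return ((value >> count) | (value << (bits - count))) & mask


def extract_field(value, low_bit, width, bits=16):
    """Bring the field down to bit 0 with one rotate, then mask it."""
    return rotate_right(value, low_bit, bits) & ((1 << width) - 1)
```

For example, `extract_field(0x0ABC, 4, 8)` rotates the word right by 4 and masks the low 8 bits, giving `0xAB`. One rotate plus one AND is enough whatever the field position, which is presumably what makes the barrel rotate attractive here.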

w..

- Paul Taylor

Sun, Apr 15, 2007 2:55 PM

Ah, good point. That was not in my mind earlier.

Regards,

Paul.

- Terran Melconian

Sun, Apr 15, 2007 6:52 PM

How about for implementing multiplication and division?

I often use them for serialization and deserialization of I/O data streams when that is being done in software.

- robertwessel2

Sun, Apr 15, 2007 11:02 PM

at

formatting link

- Everett M. Greene

Mon, Apr 16, 2007 4:49 AM

Except for sign-extended and logical shifts.

- Ulf Samuelsson

Mon, Apr 16, 2007 5:40 AM

COP800 , HPC16xxx...

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB

- David R Brooks

Mon, Apr 16, 2007 8:51 AM

Rotates are also found at the core of many cryptographic algorithms, if you see that as a potential application area.

- David Brown

Mon, Apr 16, 2007 9:59 AM

I've done lots of assembly programming for the COP8. It's nice in some ways, and fairly compact, as long as you don't need much speed and can be smart about ram bank switching.

That's nice for the jumps, but it means less space for other features.

The swap instruction is very useful. However, you end up with a lot of swaps followed by loads to simulate a save - it's not clear to me that this is a win.

These are low cost, assuming you are accessing [B] instead of direct access to specific memory locations. With direct memory access, you have (IIRC) 3 bytes and 4 cycles for such operations - and direct memory access is extremely common.

The bit serial nature of the cpu means that every instruction cycle takes 10 clock cycles, so something like a call instruction takes 50 clock cycles.

It's worth looking at, and avoiding. A basic interrupt setup that will save and restore a few critical registers and jump to vectored interrupt routines has an overhead (for the save and restore) of around 80 instruction cycles - that's 800 clock cycles, which is more what you would expect from a PC cpu than a microcontroller. You can get a bit faster if you don't need to save and restore registers.

The COP8 has its advantages as a microcontroller - it's a solid and robust device. But its cpu core is not great.

- whygee

Wed, Apr 18, 2007 11:41 PM

Hello,

I am often away for a few days, and work a lot on

formatting link

It should now work a bit better (still under Mozilla/Firefox only).

I have not received any personal reply, OTOH the posts in this thread are interesting.

-~o0o~-

Tim Wescott wrote :

> [...] I'm mostly happy.

I'm not sure I understand. Can you give an example ?

> [...] convenient
> to wait a bit before I checked a condition bit, but I could probably cope with an
> add-skip-no-carry instruction (ASNC -- odd, but it'd do).

With a skip-on-no-carry, you can set a register or memory location to a value that will be checked later. However, most asm I have done uses the carry immediately, often to do something "else". Bresenham-like algorithms come to my mind, and there are others.
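To illustrate the kind of "use the carry immediately" code meant here, below is a hedged Python sketch of a Bresenham-like stepper: a fixed-width accumulator receives the same increment on every iteration, and the carry out of the addition decides whether a step is taken. The 16-bit width and the name `carry_stepper` are assumptions for the example.

```python
def carry_stepper(increment, steps, bits=16):
    """Return, for each iteration, whether the addition produced a carry.

    The carry is consumed right away: on a skip-on-no-carry machine the
    "take a step" instruction would simply be skipped when it is False.
    """
    mask = (1 << bits) - 1
    acc = 0
    carries = []
    for _ in range(steps):
        total = acc + increment
        acc = total & mask            # keep only the low bits
        carries.append(total > mask)  # carry out of the top bit
    return carries
```

With `increment=0x8000` the carry fires on every second iteration; with `0x4000`, on every fourth.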

Have a look at

formatting link

and tell me what names sound/look weird (or too obscure).

-~o0o~-

Paul Taylor suggested :

I can't name a program that I have written that does not use shifts. I even see the absence of rotation operator in C as a plague.

My VSP (I also discovered later that this name is also used by others, if someone can find a better name, please apply :-P) was designed for interactive/multimedia stream processing (like : ID3 tag parsing) and user I/O (LCD matrix). These applications require a certain amount of bit and byte-level processing. Byte-level is ok (look at the IE group of instructions), SHL provide some necessary functions but i'm still not satisfied when it comes to bit stream insertion/extraction. I'm limited to 2 reads and 1 write (with often the same address).

So yes, these 5 "bit shuffling" instructions are necessary and IMHO not enough. I have probably found an answer in the Cray1 architecture manual, with one clever trick, but I don't know how/if i can implement it here.

-~o0o~-

Walter Banks remarked :

> [...] them all.

I see ROL/ROR/SHL/SHR as different ways to use a shifter. In the code I have written so far, I have not noticed a preference for a specific direction. I have also examined the possibility of having only one rotation direction but this could create problems at the algorithmic level. The opcode space is still quite comfortable and I have seen no way or reason to remove one of these opcodes.

Walter then added :

Right (MIPS, Alpha come to my mind). That's one of the RISC methodology cornerstones. From my point of view, adding a separate register is a lot of trouble, because new specific instructions must be included.

chained ? I don't know this method.

Multiprecision is not the primary purpose. Overflow detection is much more common.

I'm not sure about what you mean but here is an example of VSP code :

; Addition of R2 to the 64-bit value R0:R1
adds2 r2 r0   ; r0 = r0+r2
; The next instruction is skipped if no carry was generated
add 1 r1      ; carry : r1 = r1+1 (long form : 2 half-words)
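The effect of this two-instruction sequence can be checked with a small Python model. It is a sketch under assumptions: the 32-bit register width is inferred from "64-bit value R0:R1", and the function names are made up for the illustration.

```python
WORD_BITS = 32  # assumption: a 64-bit R0:R1 pair implies 32-bit registers
MASK = (1 << WORD_BITS) - 1


def add_with_carry_out(a, b):
    """Return the truncated sum and whether a carry was generated."""
    total = a + b
    return total & MASK, total > MASK


def add_word_to_double(r0, r1, r2):
    """Model 'adds2 r2 r0' followed by 'add 1 r1'."""
    r0, carry = add_with_carry_out(r0, r2)
    if carry:              # no carry: the increment is skipped
        r1 = (r1 + 1) & MASK
    return r0, r1
```

For instance, `add_word_to_double(0xFFFFFFFF, 0, 1)` wraps the low word to 0 and bumps the high word to 1, while `add_word_to_double(5, 7, 3)` leaves the high word untouched.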

The core computes the address of the next instruction at every cycle. Either it's a whole new address (then the prefetch mechanism is critical), or the skip advances a small counter that addresses the prefetch buffer. My idea is to do the following in parallel, during the same cycle:
- the prefetch buffer automatically advances by 1 or 2 half-words (16 or 32 bits)
- the new pointer into the prefetch buffer is computed in the early stages of the pipeline (add 1 or 2 half-words to the given value, plus 1 because skip 0 is equivalent to no-skip)
- the addition is performed and if a carry does not occur, then the above computed pointer is committed into the buffer instead of the automatically advanced pointer.
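One possible reading of this pointer selection, written as a Python sketch. The encoding (a 2-bit field where 0 means "skip 1 half-word") is an interpretation of the description above, and the function and parameter names are hypothetical.

```python
def next_fetch_offset(instr_halfwords, skip_field, carry):
    """Half-words by which the prefetch pointer advances this cycle.

    instr_halfwords: length of the current instruction (1 or 2)
    skip_field:      2-bit field, 0..3, assumed to encode 1..4 half-words
    carry:           True if the addition generated a carry
    """
    sequential = instr_halfwords                     # automatic advance
    skip_target = instr_halfwords + skip_field + 1   # computed in parallel
    # The skip pointer is committed only when no carry occurred.
    return sequential if carry else skip_target
```

With a 1-half-word add whose skip field is 1, a carry advances the pointer by 1 (the next instruction runs), while no carry advances it by 3 (a 2-half-word instruction is skipped).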

But that mechanism will be implemented later, i want to make sure that the instruction set is satisfying now.

-~o0o~-

Terran Melconian asked :

This makes me think that the core has no multiplier, because it is not meant to compute stuff, only to move data around. So if multiplies must be implemented, a bit-by-bit version is a good compromise (complexity/latency/size, because a Booth multiply array is obviously overkill).

I have two options : either create "multiply/divide step" instructions, or build a separate, asynchronous unit (accessible through special registers).

Both have drawbacks :
- mulstep/divstep instructions would use some amount of program space, and occupy the core. Also, I'm not sure how to implement the instructions.
- a separate, asynchronous unit would allow the core to execute other instructions in parallel. The program would write the 2 operands to the input registers, then poll until the multiplier has finished. The problem ? I intend the VSP to become SMT later. So several threads could compete for the access to this "shared" unit.

Any suggestion is welcome (and will be integrated if it is elegant)
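For reference, a bit-by-bit multiply built from a repeated "step" could look like the Python sketch below. This is the textbook unsigned shift-and-add scheme, not the VSP's instruction (which the post says is not designed yet); `mul_step`, `multiply` and the 16-bit operand width are assumptions.

```python
def mul_step(acc, multiplicand, multiplier):
    """One step: add if the multiplier's low bit is set, then shift both."""
    if multiplier & 1:
        acc += multiplicand
    return acc, multiplicand << 1, multiplier >> 1


def multiply(a, b, bits=16):
    """Unsigned multiply using one mul_step per multiplier bit."""
    acc = 0
    for _ in range(bits):
        acc, a, b = mul_step(acc, a, b)
    return acc
```

A 16-bit multiply therefore costs 16 steps, which illustrates the program-space and core-occupancy drawback listed for the first option.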

"Bit banging" is often a major headache. I tried to take this into account.

-~o0o~-

Jim Granville noticed :

There are good reasons for this, on top of the pure coolness factor. The most important aspect is that the instructions are variable-length too (but they are quite simple, anyway). So the decoding logic has probably not yet read or decoded the next instructions, and may not know how long they are. The assembly software must compute the skip length so I thought, if the core can skip 1 or 2 half-words, why not 3 or 4. More would create problems, though, and I'll have to make sure that the prefetching mechanisms can prepare instructions fast enough to keep the instruction buffers filled with at least 2+4+2=8 16-bit words, or 16 bytes, or 128 bits...

Longer skips would create a fetching penalty so i stick to 2 bits.

I have never thought about this, because i think that the most used instructions will be stored in on-chip SRAMs. SPI Flash would be used for bootstrap only, probably with an Alpha-like method (fill the cache from external SPI then let the CPU execute from address 0).

However, off-chip programs are going to exist, and a typical use of the VSP core includes a single (or a couple) SDRAM chip (16-bit wide bus) so your streaming example is easy to translate to SDRAM.

Condition Code Registers ... what a pain...

where ?

I am not trying to make "the most compact code ever". Often, this requires a lot of instruction-specific fields here and there in the instruction word, and their proliferation is harmful for decoding speed and complexity.

For example, the VSP uses only one immediate field (16 bits should be enough for most instructions ;-P)

OTOH I have not found a way to use a single place for the 2-bit skip length field (it's in bits 6-7 in the ADDSx instructions, but in bits 8-9 for conditional skip instructions). Compromises...
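The cost of the two field positions shows up in the decoder as a position select before the mask. A minimal Python sketch, where only the bit positions come from the text and the `is_adds` flag and function name are hypothetical:

```python
def skip_field(instruction_word, is_adds):
    """Extract the 2-bit skip length field from an instruction word.

    Bits 6-7 for ADDSx instructions, bits 8-9 for conditional skips.
    """
    shift = 6 if is_adds else 8
    return (instruction_word >> shift) & 0b11
```

`skip_field(0b11000000, True)` returns 3, and `skip_field(0x0200, False)` returns 2.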

-~o0o~-

Ulf Samuelsson :

> COP800 , HPC16xxx...

This remark made me check what the COP8 is and I have found an instruction that decrements, then skips if the result is zero. That's used for loops and it's similar to one PIC instruction. So all I did was generalise this idea. cool :-)

Thanks everybody for the read,

Yann Guidon

formatting link

- Jim Granville

Thu, Apr 19, 2007 1:03 AM

Yes, but there is a class of design, that could mostly simply-stream-code-from-SPI. Yes, it is slower than SDRAM, but it is a lot cheaper and simpler too!

The SPI memory bandwidths are getting fairly good - 150MBd is nearly 20Mbyte/sec, which is quite ok for many microcontroller tasks. Jumps are more costly, hence this discussion on short-skips.

In such a system, you might want to lock some small code into BRAM, for interrupts, but that could be handled with a simple address compare and a simple duplicate of code - the size needed is so small, you'd just build-time-copy the BRAM mapped stuff, into FPGA config, and also simply leave it in the SPI memory. A next-step would be to have this locked INT memory, and add a BRAM cache that is less memory fixed, but HW complexity of the next-step is higher, and the system is less deterministic.

hmm... I think it was the PicoBlaze that has conditional returns ?

Another feature to look at on Embedded Controller cores, is a Fixed interrupt response time - ie you remove opcode-length jitter, so a timer interrupt will be truly time-locked. Typically, this just means extending the INT reaction times of the shortest opcodes, but it does not impact the longest INT response times.

-jg

- whygee

Thu, Apr 19, 2007 9:17 AM

I'm too much of a speed freak to consider this ;-) My early estimations have found that at 10MHz core speed, the peak memory bandwidth can reach 3x4x10=120MB/s so a 100MHz SDRAM chip with 16 bit wide bus is ok. To compensate for the frequency difference, an 8x clock ratio is possible, so the SDRAM would run at 80MHz for example.
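The arithmetic can be spelled out in a short Python sketch. Reading "3x4x10" as three 4-byte transfers per cycle at 10MHz is an assumption, as is one 16-bit transfer per SDRAM clock (single data rate); the function name is made up for the example.

```python
def peak_bandwidth_mb_s(clock_mhz, bytes_per_transfer, transfers_per_cycle=1):
    """Peak bandwidth in MB/s for a given clock and transfer size."""
    return clock_mhz * bytes_per_transfer * transfers_per_cycle


core_demand = peak_bandwidth_mb_s(10, 4, 3)  # 3 x 4 x 10 = 120 MB/s
sdram_100mhz = peak_bandwidth_mb_s(100, 2)   # 16-bit bus at 100 MHz
sdram_80mhz = peak_bandwidth_mb_s(80, 2)     # 8x the 10 MHz core clock
```

Under these assumptions the 80MHz SDRAM peaks at 160MB/s, which still leaves margin over the 120MB/s the core can demand.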

To compensate the access latency, there is also the possibility to use a SMT approach (with 4 independent threads sharing the core, in the straight-forward CDC6600 PP fashion) and a DDR SDRAM chip would compensate for the increased bandwidth requirements.

but 20MB/s is still too slow for most tasks the VSP is meant to perform (streaming data to/from mass storage into/from signal coprocessors).

"Memory is like an orgasm. It's a lot better if you don't have to fake it." (Seymour Cray)

16KB of on-chip fast SRAM is a good start, which also helps reducing the power drain of the SDRAM interface. And for the prototypes (if any), if the FPGA doesn't have enough room, I still have a collection of 17ns 32KB asynch cache SRAMs from old PC motherboards ;-)

I don't remember.

Note that the VSP's "Q" instruction group has unconditional and conditional versions, which are used both for call and return. (see

formatting link

) The mechanism is a bit... unusual but worth the exploration :-)

By design, each VSP instruction takes the same time so it's not an issue (yes, even for jumps/skips/call/return/whatever, which also explains why it can't be aggressively pipelined and is limited to the tens of MHz ballpark).

But I don't see where a difference of 100 or 200ns in IRQ response time can be critical. If something is so important, I implement it directly in HW :-) And to make small jitter tolerable for data streams, a FIFO usually does the job well. With audio systems (my main target), up to 10µs of jitter is tolerable because FIFOs are everywhere, and delta-sigma converters' latency is often longer.

10µs are enough for 100 instructions at 100ns, an eternity...

Furthermore, the "interrupt response time" is not always a good measure, because it depends on what you count (time to execute the first ISR instruction ?) and many parameters (most are context-specific) play a significant role. For example, the registers usually need to be saved : flushing the register set to memory, loading new values, etc. takes a time proportional to the register set's size. But even that is not always accurate because the interrupt routine could need only a few registers (at least in the beginning) to service the IRQ.

So if acknowledging an IRQ needs 3 registers (just by hypothesis), then 7 instructions are needed (3 save the registers, 3 load new values, and one toggles the acknowledge bit). If the instruction cycle time is 100ns (10MHz), then it takes roughly 800ns (including IRQ signal sampling and associated jitter) to answer. So the jitter is mostly due to the IRQ signal sampling electronics (if the signal comes at the beginning or end of the cycle etc.) Nothing I can reasonably do here.
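The estimate can be reproduced with a small Python sketch. Counting one extra cycle for IRQ sampling is an assumption made here to reconcile 7 instructions at 100ns with "roughly 800ns"; the function name is hypothetical.

```python
def irq_ack_time_ns(registers, cycle_ns=100, sampling_cycles=1):
    """Rough time to acknowledge an IRQ.

    One save and one load per register, plus one instruction that
    toggles the acknowledge bit, plus the assumed sampling delay.
    """
    instructions = 2 * registers + 1
    return (instructions + sampling_cycles) * cycle_ns
```

`irq_ack_time_ns(3)` gives 800, matching the figure above; a routine that needs only one register would answer in about 400ns under the same assumptions.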

So I have thought about the interrupts. However, I have not implemented them yet in the JavaScript simulator. The memory system is much more critical.

have a nice day,

yg