New soft processor core paper published?


Overview:

Currently shipping on more than 2 billion devices/year

Deployed on more than 9 billion devices around the world since 1998

More than 50% of SIM cards deployed in 2011 run Java Card

...

Included in billions of SIM cards, payment cards, ID cards, e-passports, and more

Reply to
Paul Rubin

OK, but that's hardly "most Java", unless you're just counting the number of virtual machines that might run at some point.

Andrew.

Reply to
Andrew Haley

Well there's all sorts of ways to calculate it. If you want total LOC, Android phones may be past servers by now.

Reply to
Paul Rubin

device and let the other processors talk to that one.

Oh dear. It looks like you have vanishingly little experience writing software. That is supported by your statement in another post.

On 26/06/13 01:06, rickman wrote:
> I never understood the difference between thread and
> process until I read the link you provided.

Reply to
Tom Gardner

I'm just putting it out there, people can use it if they want to, or not.

Thomas, with your experience with the ERIC5 series, do you see anything obviously missing from the Hive instruction set? What do you think of the literal sizing?

Reply to
Eric Wallin

Question to the programming types:

Ever seen a signed logical or arithmetic shift distance before? Hive shift distances are signed, which works out quite nicely (the basic shift is shift left, with negative shift distances performing right shifts). This is something I haven't encountered in any opcode listings I've had the pleasure to peruse, so I'm wondering if it is kind of new-ish.
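In Go terms, a rough sketch of what a signed shift distance means (the function name and the 32-bit width are just for illustration, not Hive's actual hardware): a non-negative distance shifts left, a negative one shifts right.

package main

import "fmt"

// shiftSigned sketches a signed shift distance: a non-negative distance
// shifts left, a negative distance shifts right. Distances at or beyond
// the word width simply return zero here.
func shiftSigned(v uint32, d int32) uint32 {
	switch {
	case d >= 32 || d <= -32:
		return 0
	case d >= 0:
		return v << uint(d)
	default:
		return v >> uint(-d)
	}
}

func main() {
	fmt.Printf("%#x\n", shiftSigned(0x0000ff00, 8))  // 0xff0000 (shift left)
	fmt.Printf("%#x\n", shiftSigned(0x0000ff00, -8)) // 0xff (shift right)
}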

Reply to
Eric Wallin

PDP-10 has signed shifts. The manual is available on bitsavers, such as:

AA-H391A-TK_DECsystem-10_DECSYSTEM-20_Processor_Reference_Jun1982.pdf

Shifts use a signed 9 bit value from the computed effective address.
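Roughly, in code, interpreting a 9-bit field as a signed two's-complement shift count looks like this (a Go sketch; how the PDP-10 actually extracts the count from the effective address is simplified away here):

package main

import "fmt"

// signExtend9 sketches treating a 9-bit field as a signed (two's
// complement) shift count: positive shifts left, negative shifts right.
func signExtend9(field uint32) int32 {
	field &= 0x1FF // keep 9 bits
	if field&0x100 != 0 {
		return int32(field) - 0x200
	}
	return int32(field)
}

func main() {
	fmt.Println(signExtend9(0x005)) //  5 -> shift left 5
	fmt.Println(signExtend9(0x1FB)) // -5 -> shift right 5
}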

-- glen

Reply to
glen herrmannsfeldt

Thanks for that Glen!

Reply to
Eric Wallin

Thomas, with your experience with the ERIC5 series, do you see anything obviously missing from the Hive instruction set? What do you think of the literal sizing?

I just took a quick look at your document (time is limited...). What I like is the concept of "in-line" literals. A good extension would be to have the same concept also for calls and jumps (i.e. so you do not have to load the destination address into a register first) and maybe also for other instructions that can work with literals. I also think that you leave some bits unused: e.g. the byt instruction does not use register B, so you would have 3 additional bits in the opcode to make it possible to have an 11b literal instead of an 8b literal (or you could use these 3 bits for other purposes, e.g. A = A + lit8)
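To illustrate the bit-budget idea in code (a Go sketch with made-up field positions, not Hive's real encoding): reclaiming the 3-bit register-select field lets the literal grow from 8 to 11 bits.

package main

import "fmt"

// Hypothetical 16-bit instruction word; the literal field positions below
// are illustrative only, not Hive's actual encoding.
func main() {
	op := uint16(0x0256)

	lit8 := op & 0x00FF  // 8-bit literal field
	lit11 := op & 0x07FF // 8 bits plus the 3 reclaimed register-select bits

	fmt.Printf("lit8  = %d (0x%02X)\n", lit8, lit8)   // 86
	fmt.Printf("lit11 = %d (0x%03X)\n", lit11, lit11) // 598
}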

What others already mentioned is the restricted code-space, but without a C compiler this will never become a real issue ;-)

For your desired application, you could maybe think of options to reduce the resource usage. BTW: The bad habit of Quartus to replace flip-flop chains with memories (you mentioned this somewhere in your document) can be disabled by turning off "auto replace shift registers" somewhere in the synthesis settings of Quartus.

Regards,

Thomas


Reply to
thomas.entner99

What I like is the concept of "in-line" literals. A good extension would be to have the same concept also for calls and jumps (i.e. so you do not have to load the destination address into a register first) and maybe also for other instructions that can work with literals. I also think that you leave some bits unused: e.g. the byt instruction does not use register B, so you would have 3 additional bits in the opcode to make it possible to have an 11b literal instead of an 8b literal (or you could use these 3 bits for other purposes, e.g. A = A + lit8)

Oooh, very nice idea, thanks so much! I gave this some thought and even found some space to shoehorn some opcodes in, but the lit has to come from the data memory port and go back into the control ring to offset / replace the PC, and this would require some combinatorial logic in front of the program memory address port which could slow the entire thing down. I'll definitely give it a try though.

I'm kind of against invading the B stack index/pop for other things, having it always present allows for concurrent stack cleanup.

What others already mentioned is the restricted code-space, but without a C compiler this will never become a real issue ;-)

Hive could be easily edited to have 32 bit addresses, but the use of BRAM for small processor main memory is likely an even stronger restriction on code-space, which is why I don't feel the need for anything beyond 16 bits.

For your desired application, you could maybe think of options to reduce the resource usage. BTW: The bad habit of Quartus to replace flip-flop chains with memories (you mentioned this somewhere in your document) can be disabled by turning off "auto replace shift registers" somewhere in the synthesis settings of Quartus.

Using the "speed" optimization technique for analysis and synthesis avoids this as well.

Reply to
Eric Wallin

Java Card isn't the JVM - it's Java compiled down to whatever CPU is on the card.

Theo

Reply to
Theo Markettos

Transputer?



Reply to
RCIngham

It had a lot going for it, but it was too dogmatic about the development environment. At the time it was respectably fast, but that wasn't sufficient -- particularly since there was so much scope for increasing the speed of uniprocessor machines.

Given that uniprocessors have hit a wall, transputer *concepts* embodied in a completely different form might begin to be fashionable again.

It would also help if people can decide that reliability is important, and that bucketfuls of salt should be on hand when listening to salesmen's protestations that "the software/hardware framework takes care of all of that so you don't have to worry".

Reply to
Tom Gardner

What I like is the concept of "in-line" literals. A good extension would be to have the same concept also for calls and jumps (i.e. so you do not have to load the destination address into a register first) and maybe also for other instructions that can work with literals. I also think that you leave some bits unused: e.g. the byt instruction does not use register B, so you would have 3 additional bits in the opcode to make it possible to have an 11b literal instead of an 8b literal (or you could use these 3 bits for other purposes, e.g. A = A + lit8)

After looking into this yesterday I don't think I'll do it. The in-line value has to be retrieved before it can be used to offset or replace the PC, which is one clock too late for the way the pipeline is currently configured. Using it in other ways like adding wouldn't work unless I used a separate adder, as the ALU add/subtract happens fairly early in the pipe. But I really appreciate this excellent suggestion Thomas, and the time you took to read my paper!

Reply to
Eric Wallin

You mean 'C'? I worked on a large transputer oriented project and they used ANSI 'C' rather than Occam. It got the job done... or should I say "jobs"?

You mean like 144 transputers on a single chip? I'm not sure where processing is headed. I actually just see confusion ahead as all of the existing methods seem to have come to a steep incline if not a brick wall. It may be time for something completely different.

What? Since when did engineers listen to salesmen?

--

Rick
Reply to
rickman

I worked on a large transputer oriented project and they used ANSI 'C' rather than Occam. It got the job done... or should I say "jobs"?

I only looked at the Transputer when it was Occam-only. I liked Occam as an academic language, but at that time it would have been a bit of a pain to do any serious engineering; ISTR that anything other than primitive types wasn't supported in the language. IIRC that was ameliorated later, but by then the opportunity for me (and Inmos) had passed.

I don't know how C fitted onto the Transputer, but I'd only have been interested if "multithreaded" (to use the term loosely) code could have been expressed reasonably easily.

Shame, I'd have loved to use it.

Or Intel's 80-core chip :)

Not that way! Memory bandwidth and latency are key issues - but you knew that!

all of the existing methods seem to have come to a steep incline if not a brick wall

Precisely. My bet is that message passing between independent processor+memory systems has the biggest potential. It matches nicely onto many forms of event-driven industrial and financial applications and, I am told, onto significant parts of HPC. It is also relatively easy to comprehend and debug.

The trick will be to get the sizes of the processor + memory + computation "just right". And desktop/GUI doesn't match that.

Since their PHBs get taken out to the golf course to chat about sport by the salesmen :(

Reply to
Tom Gardner

Yeah, but I think the current programming paradigm is the problem. I think something else needs to come along. The current methods are all based on one, massive von Neumann design and that is what has hit the wall... duh!

Time to think in terms of much smaller entities not totally different from what is found in FPGAs, just processors rather than logic.

An 80 core chip will just be a starting point, but the hard part will *be* getting started.

I think the trick will be in finding ways of dividing up the programs so they can meld to the hardware rather than trying to optimize everything.

Consider a chip where you have literally a trillion operations per second available all the time. Do you really care if half go to waste? I don't! I design FPGAs and I have never felt obliged (not since the early days anyway) to optimize the utility of each LUT and FF. No, it turns out the precious resource in FPGAs is routing and you can't do much but let the tools manage that anyway.

So a fine grained processor array could be very effective if the programming can be divided down to suit. Maybe it takes 10 of these cores to handle 100 Mbps Ethernet, so what? Something like a browser might need to harness a couple of dozen. If the load slacks off and they are idling, so what?

It's a bit different with me. I am my own PHB and I kayak, not golf. I have one disti person who I really enjoy talking to. She tried to help me from time to time, but often she can't do a lot because I'm not buying 1000's of chips. But my quantities have gone up a bit lately, we'll see where it goes.

--

Rick
Reply to
rickman

Have you looked at Tilera's TILEpro64 or Adapteva's Epiphany 64-core processors?

Languages like Erlang and Go use similar concepts (as did Occam on the transputer). But I think the problem is that /in general/ we still don't know how to write parallel or distributed programs. Most of the concepts are from ~40 years back (CSP, guarded commands etc.). We still don't have decent tools. Turning serial programs into parallel versions is manual, laborious, error prone and not very successful.
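For what it's worth, a minimal Go sketch of that CSP-style model: independent workers with private state, communicating only by messages over channels (all names and counts below are purely illustrative).

package main

import "fmt"

// worker is an independent unit of computation with its own local state,
// talking to the rest of the system only through channels.
func worker(id int, in <-chan int, out chan<- string) {
	for n := range in {
		out <- fmt.Sprintf("worker %d processed event %d", id, n)
	}
}

func main() {
	in := make(chan int)
	out := make(chan string)

	for id := 0; id < 4; id++ { // four independent "processor+memory" units
		go worker(id, in, out)
	}

	go func() {
		for n := 0; n < 8; n++ {
			in <- n // distribute events to whichever worker is free
		}
		close(in)
	}()

	for i := 0; i < 8; i++ {
		fmt.Println(<-out)
	}
}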

Reply to
Bakul Shah

No I haven't.

I've been constrained by getting high-availability software to market quickly, on hardware that is demonstrably supported all over the world.

Erlang is certainly interesting from this point of view.

I'm not interested in turning existing serial programs into parallel ones; that way lies madness and failure.

What is more interestingly tractable are "embarrassingly parallel" problems (e.g. massive event processing systems), and completely new approaches (currently typified by big data and map-reduce, but that's just the beginning).

Reply to
Tom Gardner

My suspicion is that, except for compute-bound problems that only require "local" data, that granularity will be too small.

Examples where it will work, e.g. protein folding, will rapidly migrate to CUDA and graphics processors.

Those internal FPGA constraints also have analogues at a larger scale, e.g. IC pinout, backplanes, networks...

The fundamental problem is that in general as you make the granularity smaller, the communications requirements get larger. And vice versa :(

I'm sort-of retired (I got sick of corporate in-fighting, and I have my "drop dead money", so...)

I regard golf as silly, despite having two courses in walking distance. My equivalent of kayaking is flying gliders.

Reply to
Tom Gardner
