New soft processor core paper publisher?

I have a general purpose soft processor core that I developed in Verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPs in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.
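The four-stack scheme described above can be sketched in software. This is a hypothetical model assuming a single stack with an explicit, opcode-controlled pointer; the actual opcode fields and stack depth are defined in the paper, not here:

```python
# Hypothetical model of one of the four indexed LIFO stacks. The pop and
# push flags stand in for explicit pointer-control bits in the opcode;
# nothing is consumed unless the instruction says so.
class IndexedStack:
    def __init__(self, depth=32):
        self.mem = [0] * depth
        self.depth = depth
        self.ptr = 0                      # explicit stack pointer

    def top(self):
        return self.mem[self.ptr % self.depth]

    def access(self, value=None, pop=False, push=False):
        """Read the current top; pop/push move the pointer only when the
        opcode asks; an optional value is written at the new pointer."""
        result = self.top()
        if pop:
            self.ptr = (self.ptr - 1) % self.depth
        if push:
            self.ptr = (self.ptr + 1) % self.depth
        if value is not None:
            self.mem[self.ptr % self.depth] = value
        return result
```

The key contrast with a classic stack machine is that a plain read (no flags set) leaves the stack untouched, so operands are not auto-consumed.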

I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication. Any suggestions on who might be interested in publishing it?

Reply to
tammie.eric

Hi,

I was in a similar position about 5 years ago. My own processor is the ByoRISC, a RISC-like extensible custom processor supporting multiple-input, multiple-output custom instructions.


This reads like a "fourstack" architecture on steroids. It seems good! How do you compare with more classic RISC-like soft cores such as MicroBlaze, Nios-II, LEON, etc.? There is also a classic book on stack-based computers; you really should go through it and reference it in your publication.


I had chosen to publish at VLSI-SoC 2008 (due to proximity; that year it was held in Greece). It is an OK conference, though not well indexed by DBLP and the like. Anyway, here is a link to my submitted version of the paper:

formatting link

The paper was really well received at the conference venue; I got some of my best reviews. However, I didn't have the chance to present the paper in person, because I was in really deep-S in the army and couldn't get a three-day special leave for the conference. (I joined the army at 31, so I was s'thing like an elderly private :). For instance, I was about the same age as all the majors in the camp. Only colonels and people among the permanent staff were older.

By contrast, I had a hard time publishing an extended/long version of the paper as a journal paper. All three publishers objected to the existence of the conference paper, arguing that because of it no journal version was necessary (even with ~40% added material).

My suggestion is to: a) go for the journal paper (e.g. IEEE Trans. on VLSI or ACM TECS if you have s'thing really modern) b) otherwise submit to an FPGA or architecture conference. It depends on where you live, there are numerous European and worldwide conferences with processor-related topics (FPGA-based architectures, GPUs, ASIPs, novel architectures, manycores, etc).

In all cases you may have to adapt your material (e.g. due to page limits) to the conventions of the publisher.

BTW another more recent example is the paper on the iDEA DSP soft-core processor:

formatting link

This looks like a lean, mean architecture well suited to contemporary FPGAs.

Hope these help.

Best regards, Nikolaos Kavvadias

formatting link

Reply to
Nikolaos Kavvadias

Do you also have an assembler, C++ compiler and debugger for this beast? You should have a reference design running on an FPGA board if you want to attract a following. Ideally it should also run Linux.

Why can't you do both? Post the code to opencores.org and then write a paper about it and publish.

John


Reply to
jt_eaton

Thank you for your reply Nikolaos!

"A Four Stack Processor" by Bernd Paysan? I ran across that paper several years ago (thanks!). Very interesting, but with multiple ALUs, access to data below the LIFO tops, TLBs, security, etc. it is much more complex than my processor. It looks like a real bear to program and manage at the lowest level.

The target audience for my processor is an FPGA developer who needs to implement complex functionality that tolerates latency but requires deterministic timing. Hand coding with no toolchain (Verilog initial statement boot code). Simple enough to keep the processor model and current state in one's head (with room to spare). Small enough to fit in the smallest of FPGAs (with room to spare). Not meant at all to run a full-blown OS, but not a trivial processor.

"Stack Computers: The New Wave" by Philip J. Koopman, Jr.? Also ran across that many years ago (thanks!). The main thrust of it seems to be advocacy of single data stack, single return stack, zero operand machines, which I feel (nothing personal) are crap. Easy to design and implement (I've made several while under the spell) but impossible to program in an efficient manner (gobs of real time wasted on stack thrash, the minimization of which leads directly to unreadable procedural coding practices, which leads to catastrophic stack faults).
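For a concrete picture of the stack thrash being complained about here, a toy zero-operand evaluator: reusing a value means spending instructions on pure shuffling (DUP, SWAP, and friends), which an indexed or two-operand machine avoids. The mnemonics are the usual Forth-style ones, not any particular machine's:

```python
# Toy zero-operand stack evaluator. Shuffle ops (DUP, SWAP) do no
# arithmetic; they exist only to rearrange operands for reuse.
def run(program, stack):
    shuffles = 0
    for op in program:
        if op == "DUP":
            stack.append(stack[-1]); shuffles += 1
        elif op == "SWAP":
            stack[-1], stack[-2] = stack[-2], stack[-1]; shuffles += 1
        elif op == "MUL":
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
    return stack, shuffles

# x*x + y*y with x and y on the stack: half the instructions are shuffles.
result, shuffles = run(["DUP", "MUL", "SWAP", "DUP", "MUL", "ADD"], [3, 4])
```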

Hmm. The last thing I want is to have my hands tied when I'm trying to give something away for free. But my paper would likely benefit from external editorial input.

My processor incorporates what I believe are a couple of new innovations (but who ever really knows?) that I'd like to get out there if possible. And I wouldn't mind a bit of personal recognition, if only for my efforts.

IEEE is probably out. I fundamentally disagree with the hoarding of technical papers behind a greedy paywall.

Wow, very nice paper describing a very nice design, thanks!

Reply to
Eric Wallin

Thanks for the response John!

See my response to Nikolaos above. Full-blown OS support was not the development target, but it's not a PicoBlaze either. Somewhere in the middle, mainly for FPGA algorithms that can benefit from serialization.

That's probably the route I'll end up taking.

Reply to
Eric Wallin

Benchmarks. Tell us why we should use your processor. How does it win compared with the alternatives?

How easy is it to program? An assembler or a C compiler is really necessary to make something usable - LLVM may come in handy as a C compiler toolkit; I'm not sure what the equivalent assembler toolkit is.

Actually synthesise the thing. It's hard to take seriously something that's never actually been tested for real, especially if it makes assumptions like having gigabytes of single-cycle-latency memory. Debug it and make sure it works in real hardware.

If you put it on opencores, document document document. There are tons of half-baked projects with lame or nonexistent documentation, that kind of half work on the author's dev system but fall over in real life for one reason or another.

Is it vendor-independent, or does it use Xilinx/Altera/etc special stuff? If so, how easily can that be replaced with an alternative vendor?

Regression tests and test suites. How do we know it's working? Can we work on the code and make sure we don't break anything? What does 'working' mean in the first place?

If you're trying to make an argument in computer architecture you can get away without some of this stuff (a research prototype can have rough edges because it's only there to prove a point, as long as you tell us what they are). Generally you need to tell a convincing story, and either the story is that XYZ is a useful approach to take (so we can throw away the prototype and build something better) or that XYZ is a component people should use (in which case it becomes more convincing if there's more support).

Some lists of well-known conferences:

formatting link
formatting link

Good luck :)

Theo

Reply to
Theo Markettos

Good point. So far I've coded a verification boot code gauntlet that it has passed, as well as restoring division and log2. If I had more code to push through it I could statistically tailor the instruction set (size the immediates, etc.) but I don't. I may at some point, but I may not either. This is mainly for me, to help me implement various projects that require complex computations in an FPGA (I currently need it for a digital Theremin that is under development), but I want to release it so others may examine and possibly use it or help me make it better, or use some of the ideas in there in their own stuff.
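As an aside, restoring division itself fits in a few lines. A software sketch of the classic bit-serial algorithm, not the actual boot code from the gauntlet:

```python
def restoring_divide(dividend, divisor, bits=32):
    """Bit-serial restoring division for unsigned inputs: shift one
    dividend bit into the partial remainder per step, trial-subtract,
    and set the matching quotient bit when the subtract fits."""
    assert 0 < divisor and 0 <= dividend < (1 << bits)
    rem = quo = 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)
        if rem >= divisor:
            rem -= divisor      # subtract fits: quotient bit is 1
            quo |= 1 << i       # otherwise "restore" by doing nothing
    return quo, rem
```

One trial subtract per bit is what makes the best / worst case cycle counts data-dependent in a hardware implementation.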

It's fairly general purpose and I think if you read the paper you might (or might not) find it easy to understand and program by hand using Verilog initial statements. My main goals were that it be simple enough to grasp without tools, complex and fast enough to do real things, have compact opcodes so BRAM isn't wasted, etc. A compiler, OS, etc. are overkill and definitely not the intended target.

There is a middle ground between trivial and full-blown processors (particularly for FPGA logical use). Of all the commercial offerings in this range that I'm aware of, my processor is probably most similar to the Parallax Propeller, which is almost certainly pipeline threaded (though they don't tell you that in the documentation). The Propeller also has a video generator; character, sine, and log tables; and other stuff mine doesn't. But mine has a simpler, more unified overall concept and programming model. It is a true middle ground between register and stack machines.

Not trying to argue from authority, but I've got 10 years of professional HDL experience, and have made several processors in the past for my own edification and had them up and running on Xilinx demo boards. This one hasn't actually run in the flesh yet, but it has gone through the build process many times and has been pretty thoroughly verified, so I would be amazed if there were any issues (famous last words). But I'll run it on a Cyclone IV board before releasing it.

I know what you mean, I never use any code directly from there. To be fair, most of the code I ran across in industry was fairly poor as well. Anyway, I've got a really nice document that took me about a month to write, with lots of drawings, tables, examples, etc. describing the design and my thoughts behind it. Even if people don't particularly like my processor they might be able to get something out of the background info in the paper (FPGA multipliers and RAM, LIFO & ALU design, pipelining, register set construction, etc.).

I was careful to not use vendor specific constructs in the Verilog. The block RAM for main memory and the stacks is inferred, as are the ALU signed multipliers. I spent a long time on the modular partitioning of the code with a strong eye towards verification (as I usually do). The code was developed in Quartus, and has been compiled many, many times, but I haven't run it through XST yet.


I'm probably an odd man out, but I don't agree with a lot of "standard" industry verification methodology. Test benches are fine for really complex code and / or data environments, but there is no substitute for good coding, proper modular partitioning, and thorough hand testing of each module. I've seen too many out of control projects with designers throwing things over various walls, leaving the verification up to the next guy, who usually isn't familiar enough with it to really bang on the sensitive parts. And I kind of hate modelsim.

Anyone that codes should spend a lot of time verifying - I do, and for the most part really enjoy it. The industry has turned this essential activity into something most people loathe, so it just doesn't happen unless people get pushed into doing it. And even then it usually doesn't get done very thoroughly. Co-developing in environments like that is a nightmare.

Thanks, I'll check them out!

Reply to
Eric Wallin

FWIW 'benchmarks' doesn't necessarily mean running SPECfoo at 2.7 times quicker than a 4004, but things like 'how many instructions does it take to write division/FFT/quicksort/whatever' compared with the leading brand. Or how many LEs, BRAMs, mW, etc. Numbers are good (as is publishing the source so we can reproduce them).

Fair enough. If you're making architectural points, you can probably get away with assembly examples. A simple assembler is good for developer sanity, though. Could probably be knocked up in Python reasonably fast.
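To make the "knocked up in Python" point concrete, here is a two-pass assembler skeleton for a made-up 16-bit format (4-bit opcode, 12-bit operand). The mnemonics and encodings are invented for illustration, not the processor under discussion:

```python
# Minimal two-pass assembler sketch. Pass 1 collects label addresses;
# pass 2 encodes one 16-bit word per instruction.
OPCODES = {"NOP": 0x0, "ADD": 0x1, "LIT": 0x2, "JMP": 0x3}

def assemble(lines):
    labels, words, addr = {}, [], 0
    for raw in lines:                       # pass 1: label addresses
        line = raw.split(";")[0].strip()    # strip comments
        if line.endswith(":"):
            labels[line[:-1]] = addr
        elif line:
            addr += 1
    for raw in lines:                       # pass 2: encode
        line = raw.split(";")[0].strip()
        if not line or line.endswith(":"):
            continue
        parts = line.split()
        arg = 0
        if len(parts) > 1:
            tok = parts[1]
            arg = labels[tok] if tok in labels else int(tok, 0)
        words.append((OPCODES[parts[0]] << 12) | (arg & 0xFFF))
    return words
```

Forward references work because all labels are resolved before any encoding happens; that is the whole reason for the two passes.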

I'm just a bit jaded from seeing papers at conferences where somebody wrote some verilog which they only ran in modelsim, and never had to worry about limited BRAM, or meeting timing, or multiple clock domains, or...

This is good. Just a thought - could you angle it as 'how to do processor design' using your processor as a case study? That makes it more of a useful tutorial than 'buy our brand, it's great'...

That's not exactly what I meant... let's say you rearrange the pipelining on your CPU. It turns out you introduce some obscure bug that causes branches to jump to the wrong place if there's a multiply 3 instructions back from the branch. How would you know if you did this, and make sure it didn't happen again? Hand testing modules won't catch that.

It's worse if there's an OS involved, of course. But it can be easy to introduce stupid bugs when you're refactoring something, and waste a lot of time tracking them down.

We use Bluespec so avoid modelsim ;-) (with Jenkins so we run the test suite for every commit. A bit overkill for your needs, perhaps)

I admit the tools don't always make it easy...

Theo

Reply to
Theo Markettos


I have FPGA resource numbers for the Cyclone III target in the paper. Briefly, it consumes ~1800 LEs, 4 18x18 multipliers, 4 BRAMs for the stacks, plus whatever the main memory needs. This is roughly 1/3 of the smallest Cyclone III part. I have a restoring division example in the paper that gives 197 / 293 cycles best / worst case (a thread cycle is 8 200MHz clocks, but there are 8 threads running at this speed so aggregate throughput is potentially 200 MIPs if all threads are busy doing something).
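The throughput arithmetic behind those figures, for anyone checking: with 8 threads round-robining through the pipeline, each thread gets one issue slot every 8 clocks.

```python
# Sanity check on the quoted numbers: a barrel-threaded core issuing
# one instruction per clock, shared round-robin among 8 threads.
clock_mhz = 200
threads = 8
clocks_per_thread_cycle = 8   # each thread issues once per 8 clocks

per_thread_mips = clock_mhz / clocks_per_thread_cycle   # 25 MIPs per thread
aggregate_mips = per_thread_mips * threads              # 200 MIPs aggregate
```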

I've seen lots of papers that claim speed numbers but don't give the speed grade, or tell you what hoops they jumped through to get those speeds. Without that info the speeds are meaningless.

That's certainly possible. At this point I'm writing code for it directly in Verilog using an initial statement text file that gets included in the main memory. Several define statements make this clearer and actually fairly easy. But uploading code to a boot loader would require something like an assembler. I'm really trying to stay away from the need for toolsets.

The paper is kind of that, background and general how-to, but my processor doesn't have caches, branch prediction, pipeline hazards, TLBs, etc., so people wanting to know how to do that stuff will come up totally empty.


It's correct by construction! ;-) Seriously though, there are no hazards to speak of and very little internal state, so branches pretty much either work or they don't. Once basic functionality was confirmed in simulation, I used processor code to check the processor itself, e.g. I wrote some code that checks all branches against all possible branch conditions. Each test increments a count if it passes or decrements if it fails. The final passing number can only be reached if all tests pass. I've got simple code like this to test all of the opcodes. This exercise can help give an early feel for the completeness of the instruction set as well. Verifying something like the Pentium must be one agonizingly painful mountain to climb. Verifying each silicon copy must be a bear as well.
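The pass/fail counter pattern described above can be modeled on the host side. With +1 on pass and -1 on fail, each test moves the score by exactly one, so a final count equal to the number of tests is reachable only when every test passes:

```python
# Host-side model of the self-test counter: increment on pass,
# decrement on fail. score == len(checks) proves every check passed.
def self_test(checks):
    score = 0
    for passed in checks:
        score += 1 if passed else -1
    return score

all_good = [1 + 1 == 2, 5 > 3, (7 & 1) == 1]      # every condition holds
one_bad = [1 + 1 == 2, 5 > 2 + 4, (7 & 1) == 1]   # middle condition fails
```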

Reply to
Eric Wallin

So is there anything like the old Byte magazine (or a web equivalent) where enthusiasts and other non-academic, non-industry types can publish articles on computers / computing hardware?

Reply to
Eric Wallin


That's the ground I have been plowing off and on for the last 10 years.


I assume that you do understand that the point of MISC is that the implementation can be minimized so that the instructions run faster. In theory this makes up for the extra instructions needed to manipulate the stack on occasion. But I understand your interest in minimizing the inconvenience of stack ops. I spent a little time looking at alternatives and am currently looking at a stack CPU design that allows offsets into the stack to get around the extra stack ops. I'm not sure how this compares to your ideas. It is still a dual stack design as I have an interest in keeping the size of the implementation at a minimum. 1800 LEs won't even fit on the FPGAs I am targeting.


I would like to hear about your innovations. As you seem to understand, it is hard to be truly innovative finding new ideas that others have not uncovered. But I think you are certainly in an area that is not thoroughly explored.


I won't argue with that. Even when I was an IEEE member, I never found a document I didn't have to pay for.

When can we expect to see your paper?

--

Rick
Reply to
rickman

Ooo, same here, and my condolences. I caught a break a couple of months ago and have been beavering away on it ever since, and I finally have something that doesn't cause me to vomit when I code for it. Multiple indexed simple stacks with explicit pointer control makes everything a lot easier than a bog standard stack machine. I think the auto-consumption of literally everything, particularly the data, indexes, and pointers you dearly want to use again, is at the bottom of all the crazy people just accept with stack machines. This mechanism works great for manual data entry on HP calculators, but not so much for stack machines IMHO. Auto-consumption also pretty much rules out conditional execution of single instructions.

MISC is interesting, but you have to consider that all ops, including simple stack manipulations, will generally consume as much real time as a multiply, which suddenly makes all of those confusing stack gymnastics you have to perform to dig out your loop index or whatever from underneath your read/write pointer from underneath your data and such overly burdensome.

Indexes into a moving stack - that way lies insanity. Ever hit the roll down button on an HP calculator and get instantly flummoxed? Maybe a compiler can keep track of that kind of stuff, but my weak brain isn't up to the task.

Altera BRAM doesn't go as wide as Xilinx with true dual port. When I was working in Xilinx I was able to use a single BRAM for both the data and return stacks (16 bit data).

I'm not sure anything less than the smallest Cyclone 2 is really worth developing in. A lot of the stuff below that is often more expensive due to the built-in configuration memory and such. There are quite inexpensive Cyclone dev boards on eBay from China.

I haven't seen anything exactly like it, certainly not the way the stacks are implemented. And I deal with extended arithmetic results in an unusual way. In terms of scheduling and pipelining, the Parallax Propeller is probably the closest in architecture (you can infer from the specs and operational model what they don't explicitly tell you in the datasheet).

I was a member too right out of grad school. But, like Janet Jackson sang: "What have they done for me lately?"

It's all but done, just picking around the edges at this point. As soon as the code is verified to my satisfaction I'll release both and post here.

Reply to
Eric Wallin


Have a look at comp.arch, in particular the current discussion about the "belt" in the Mill processor. Start by watching the video.

The Mill is a radical architecture that offers far greater instruction level parallelism than existing processors, partly by having no general purpose registers.

The Mill is irrelevant to FPGA processors; it is aimed at beating x86 machines.

Reply to
Tom Gardner

The Mill looks vaguely interesting (if you're into billion transistor processors) but as you indicated I'm not sure how it is relevant to this thread?

Reply to
Eric Wallin


You wrote "Indexes into a moving stack - that way lies insanity." The Mill's belt is effectively exactly that, and they appear not to have gone insane.

Reply to
Tom Gardner


I bet they would if they tried to hand code it in assembly! ;-)

The first video comment is priceless: "Gandalf?"

Reply to
Eric Wallin


It *is* considerably easier than hand-coding Itanium. With that, you change *any* aspect of the microarchitecture and you go back to the beginning. How do I know? I asked someone who was doing it to assess its performance, and decided to Run Away from anything to do with the Itanium.

How shallow :)

Reply to
Tom Gardner


I was looking at how to improve a stack design a few months ago and came to a similar conclusion. My first attempt at getting around the stack ops was to use registers. I was able to write code that was both smaller and faster since in my design all instructions are one clock cycle so executed instruction count equals number of machine cycles. Well, sort of. My original dual stack design was literally one clock per instruction. In order to work with clocked block ram the register machine would use either both phases of two clocks per machine cycle or four clock cycles.

While pushing ideas around on paper, the J1 design gave me the idea of adjusting the stack pointer as well as using an offset in each instruction. That gave a design that is even faster with fewer instructions. I'm not sure if it is practical in a small opcode. I have been working with 8 and 9 bit opcodes; the latest approach with stack pointer control can fit in 9 bits, but I would be happier with a couple more bits.
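A sketch of the offset idea being described, with invented field names: each instruction carries an operand offset (read below the top without shuffling) plus an explicit stack-pointer delta, in the spirit of the J1's ALU instructions.

```python
# Model of offset-addressed stack access with per-instruction pointer
# control. Field widths and names are illustrative, not a real encoding.
class OffsetStack:
    def __init__(self, depth=32):
        self.mem = [0] * depth
        self.sp = 0

    def read(self, offset):
        """Operand fetch at top-of-stack minus offset; no pointer move."""
        return self.mem[(self.sp - offset) % len(self.mem)]

    def step(self, delta, write=None):
        """Apply the instruction's explicit pointer delta, then
        optionally write the new top-of-stack."""
        self.sp = (self.sp + delta) % len(self.mem)
        if write is not None:
            self.mem[self.sp] = write

s = OffsetStack()
for v in (10, 20, 30):
    s.step(+1, v)              # push three values
third_down = s.read(2)         # reach 10 without any stack shuffling
```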


Programming to facilitate stack optimization is king on a stack machine. I'm not sure how the multiply speed is relevant, but the real question is just how fast an algorithm runs, which has to include all the instructions needed as well as the clock speed. Then it is also important to consider resources used. I think you said your design uses 1800 LEs, which is a *lot* more than a simple two stack design. They aren't always available.


Then I don't know why you are designing CPUs, lol! I like RPN calculators and have trouble using anything else. I also program in Forth so this all works for me.


I expect Xilinx has some patent that Altera can't get around for a couple more years. Lattice seems to be pretty good though. I just would prefer an async read, since that works better in a one clock machine cycle.


I don't know about dev board cost, but I can get a 1280 LUT Lattice part for under $4 in reasonable quantity. That is the area I typically work in. My big problem is packages. I don't want to have to use extra fine pitch on PCBs to avoid the higher costs. BGAs require very fine via holes and fine pitch PCB traces and run the board costs up a bit. None of the FPGA makers support the parts I like very well. VQ100 is my favorite, small but enough pins for most projects.



My mistake was getting involved in the local chapters. Seems IEEE is just a good ol' boys network and is all about status and going along to get along. They don't believe in the written rules, more so the unwritten ones.


Ok, looking forward to it.

--

Rick
Reply to
rickman

Interesting. The J1 strongly influenced me as well.

I decided to stay away from non-power-of-2 widths for instructions and data. Not efficient in standard storage. Having multiple instructions per word I see now as more of a bug than a feature, because you have to index into it to return from a subroutine, and how / where do you store the index?

I feel that this is a fiddly activity that wastes the programmer's time and creates code that is exceedingly difficult to figure out later.

Multiply is relevant because in a 32 bit machine it will likely be THE speed bottleneck, pulling overall timing down. They include non-fabric registering at the I/O of the FPGA multiply hardware to help pipeline it. Same with BRAM - reads really speed up if you use the "free" output registering (in addition to the synchronous register you are generally forced to use).


Quite the contrary, I've used HP calculators religiously since I won one in an HS engineering contest almost 30 years ago. Too bad they don't make the "real" ones anymore (the 35S is the best they can do, it seems; maybe they lost the plans along with those of the Saturn V). But when I hit the roll down button to find a value on the stack, I have to give up on the other stack items due to confusion. I really want to like Forth, but after reading the books and being repeatedly repelled by the syntax and programming model I gave up.

My goal with CPU design was to make one simple enough to program without special tools, but complex enough to do real work, and I think I've finally achieved that.

I like Lattice parts too, and used the original MachXO on many boards in lieu of a CPLD.

But I gave up on single cycle along with two stacks and autoconsumption. Like you say, async read BRAM is hard to come by. Single cycle is also slow and strands a bazillion FFs in the fabric.

I wonder if you've read this article:

formatting link

Moore made a lot of money off of what seem like frivolous lawsuits, which brings him down several notches in my eyes.

Reply to
Eric Wallin

I'm sure HP still has the plans for the Saturn, viz

formatting link

Sorry, couldn't resist.

Nobody /writes/ Forth. They write programs that emit Forth. The most mainstream example of that is printer drivers emitting PostScript.

Reply to
Tom Gardner
