Implementing five stage pipeline

Hi all, I'm thinking of building a 5-stage pipeline RISC.

  1. fetch
  2. decode
  3. execute
  4. buffer
  5. write back

The result of the execution stage is buffered at the +ve edge of the buffer cycle, and this works if we enable data forwarding: the next instruction gets the updated values from the buffer register at its execute stage, and the buffered data is written to memory only at the write-back stage. My doubt is this: if that is true, where do we buffer the output of the execution stage of the second instruction at the +ve edge of its buffer cycle, since the buffer is still holding the result of the previous instruction? I am a beginner at this kind of thing. This is similar to the ARM9 pipeline. What is their way of tackling this situation?
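One way to see the answer is that every stage boundary has its own register, and all of them are rewritten at every clock edge: instruction i's result moves from the execute/buffer register into the buffer/writeback register at exactly the edge where instruction i+1's result is latched into the execute/buffer register, so nothing is overwritten while it is still needed. Here is a minimal Python sketch of that (the toy instruction set and register names are made up for illustration):

```python
# Minimal sketch of a 5-stage pipeline (fetch, decode, execute, buffer,
# writeback) showing that the EX->BUF register is rewritten every cycle:
# instruction i's result moves on to the BUF->WB register at the same
# edge where instruction i+1's result is latched into EX->BUF.

program = [
    ("add", "r1", "r0", "r0"),   # r1 = r0 + r0
    ("add", "r2", "r1", "r1"),   # needs r1 -> forwarded from the buffer reg
    ("add", "r3", "r2", "r1"),
]

regs = {"r0": 5, "r1": 0, "r2": 0, "r3": 0}
ex_buf = None   # register between execute and buffer stages
buf_wb = None   # register between buffer and writeback stages
decoded = None
fetched = None
pc = 0

def read(src):
    # Forwarding: prefer the newest in-flight value over the register file.
    for latch in (ex_buf, buf_wb):
        if latch is not None and latch[0] == src:
            return latch[1]
    return regs[src]

for cycle in range(len(program) + 4):
    # One positive clock edge: every stage latch updates simultaneously,
    # so each in-flight instruction has its own pipeline register.
    if buf_wb is not None:                 # writeback stage
        regs[buf_wb[0]] = buf_wb[1]
    new_buf_wb = ex_buf                    # buffer stage just passes along
    if decoded is not None:                # execute (with forwarding reads)
        _, dst, a, b = decoded
        new_ex_buf = (dst, read(a) + read(b))
    else:
        new_ex_buf = None
    new_decoded = fetched                  # decode
    fetched = program[pc] if pc < len(program) else None
    pc += 1
    buf_wb, ex_buf, decoded = new_buf_wb, new_ex_buf, new_decoded

print(regs)   # r1 = 10, r2 = 20, r3 = 30
```

The second instruction's result is never fighting the first for the buffer register: the first result has already advanced one latch further at the same edge.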
Reply to
vssumesh

Somebody please advise me on this issue, as I am still wondering what to do.

Reply to
vssumesh

It's really a bit strange that a complete beginner would jump into the deep end; in the past we went through some baby steps first.

Are you doing this in Verilog/VHDL or C or some other academic cpu design tool?

You do have the Hennessy-Patterson book right (from MKP)?

This isn't actually in that book, but one radical suggestion I have is to throw this horrible design model away and do a multi-threaded architecture. You replace one set of cycle-sucking performance limits with a much simpler thread engine that ultimately boils down to a 2-or-so-bit counter in state control.

Such a design can run around 2x as fast as the single-threaded design at the circuit level, and that 2x can be traded back for many simplifications to get the same speed but with much less hardware. You definitely don't need register forwarding or hazard detection logic in MTA designs, but you do end up with a couple of threads for the end user to deal with. There are many more issues there.

John

gmail or transputer2 at yahoo

Reply to
JJ

But I just can't do that, because I am building this to imitate the ARM pipeline.

Reply to
vssumesh

I'm not familiar with ARM cores, so I can't help you emulate their architecture. My advice would be to read datasheets for ARM cores, if they're available, but they're probably not. It seems to me that if they gave away their architecture (so the whole world knew it), when they already "give away" their instruction set, they would have no product anymore. I suppose their implementation is probably pretty good, but the architecture is still a major part of the design.

My question for John, though, is: what do you mean by a threaded architecture? I don't see how adding a second core will make the first one run twice as fast. It seems to me each "thread" needs to be independent enough to have its own pipeline, and if each has its own pipeline, register forwarding and hazard logic are going to be needed. If you didn't have those, I suppose you could just bubble the pipeline, but that seems pretty wasteful.

The big problem I see with multi-core FPGA based processors is that it's very easy to be memory bound in an FPGA. Fetch from an SDRAM is only so fast. I know you can put several of them in parallel to improve performance and I suppose that would do it, but the limits are definitely close without some good caching schemes. Unfortunately, it seems that associative caches are very expensive to implement in an FPGA.

-Arlen

Reply to
gallen

The real problem here is that the H & P comp-arch bible, and most books that repeat the same material, don't teach anyone today how to do anything that doesn't look like a DLX, so tough luck if you have to figure it out yourself. This is especially true for MTA design, a much-overlooked technique.

Now MTA (multi-threaded architecture) isn't even new; it goes back to the 50s in the previous century (the one that starts with 1). The idea is really simple, and very familiar to DSP people who do a lot of transposition between parallel and serial DSP, bit-wise v word-wise.

I will elaborate on a simple design that works well for me at 300MHz in V2Pro and still uses only < 500 FF/LUT sites (not the 1000 typically needed for 32b work), though it is not complete in some opcode decodes.

I use a single Blockram to hold 4 sets of state, 128 by 32-bit words each. That is further split for each thread, half for the register file and half for the ICache. The RFile therefore gives 64 regs, and the ICache or queue is 128 16-bit opcodes (or 64 32b opcodes, or 256 8b opcodes). Please don't ever do 8b opcodes!

The primary controller is a 3-bit counter counting through 8 states; b0 is used as the odd/even phase for each instruction slot, and b1,2 are used to distinguish which thread is in effect.

The odd/even bit lets me do 32-bit math over 2 clocks, 16b at a time. It also lets me pair 2 operand reads and a later writeback with an early opcode fetch in 2 cycles, so it's effectively 4-way ported. These reads, writes, and Ifetches are for 3 different threads, though. All Blockram accesses are 32b wide. The datapath takes 32b every other cycle for the x,y inputs and 5 clocks later returns the 32b z result on the opposite phase to the Blockram. On the same phase, another opcode pair is fetched.

The big bang is that the design clocks at the limit of the Blockram, or a 16b add, or 3 LUTs of logic, which is about 2x faster than the usual 32b flat single-threaded pipelines. The usual instruction decision logic that is often crammed into 1 pipeline now straddles 8 pipelines, so very little logic is needed between pipes.

Now thread i+0 reads data operands in clock t0 but writes results back at t5; later, the next opcode for that thread reads operands at t8, the same at t16, and so on.

Thread i+0 uses t0,t5, t8,t13 etc for reg reads & writes
Thread i+1 uses t1,t6, t9,t14 etc
Thread i+2 uses t2,t7, t10,t15 etc
Thread i+3 uses t3,t8, t11,t16 etc

So all threads stay out of each others way, no interlocks, no forwarding, no hazards, no branch prediction, but 4 thread states.
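John's slot table can be checked numerically. The numbers below (operand read at t, writeback at t+5, next instruction of the same thread at t+8) are taken from his post; the helper names are mine:

```python
# Sketch of the 4-thread barrel schedule described above: operand read at
# time t, result writeback at t+5, and the same thread's next read at t+8.
# Because every thread's writeback lands before its own next operand read,
# no forwarding or hazard logic is ever needed.

THREADS = 4
DATAPATH_LATENCY = 5   # clocks from operand read to result writeback
ISSUE_PERIOD = 8       # clocks between instructions of the same thread

def schedule(thread, n_instructions):
    """Yield (read_time, write_time) for each instruction of one thread."""
    for i in range(n_instructions):
        read = thread + i * ISSUE_PERIOD
        yield read, read + DATAPATH_LATENCY

for t in range(THREADS):
    events = list(schedule(t, 4))
    for (r0, w0), (r1, w1) in zip(events, events[1:]):
        # The result is back 3 clocks before this thread reads again.
        assert w0 < r1, "would need forwarding!"
    print(f"thread {t}: {events}")
```

Running it reproduces the t0,t5, t8,t13 pattern for thread 0 and the shifted patterns for the other three, with every writeback strictly before the owning thread's next read.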

I left out a lot of detail; hey, you have to figure this out on your own nickel if you want this sort of design. The cond codes, PC, and other CPU state regs will exist 4 times over; these can use SRL16s, a DP RAM, or a barrel wheel of 4 states moving on mostly 1 phase.

The ARM is a problem, period; you tend to get chased or sued if you get anything done, especially with any intent to give it away or resell it. I don't think it is that great anyway; copying any CPU designed for VLSI into an FPGA leaves a bad taste. Instead, use your own opcode set and look at Jan Gray's site for Lcc hints on porting a compiler, etc.

As for associative caches, doing things the regular way with 1- or 2-way set assoc is very expensive; instead I use hashing, and that makes things look very associative. I also expect to use RLDRAM, but that's another story. One nice thing about the 4-way and especially the 2-phase design is that every opcode takes 8 * 1, or sometimes 2 or 3, actual cycles. RLDRAM can clock at 300MHz and has a latency of 8 cycles per threaded bank, so my DRAM is faster than my min opcode sequence for load/store, and I don't need a DCache. The ICache is there to help the much more predictable I-flow, but it isn't really associative since it's just a queue of opcodes near the PC value. All 4 threads, over many processor copies, see their own private DRAM shared in 1 device.
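The hashing trick can be sketched as follows. A direct-mapped table is indexed by a hash of the address rather than its low bits, so strided addresses that would all land in one set get scattered. The slot count and hash function here are illustrative choices of mine, not John's actual design:

```python
# Sketch of hashed indexing for a direct-mapped cache: strided addresses
# that all collide under plain low-bits indexing get scattered by a hash,
# so the cache behaves more like an associative one.  The multiplicative
# hash below is illustrative, not taken from the design in the post.

SLOTS = 256

def plain_index(addr):
    return addr % SLOTS                     # conventional low-bits index

def hashed_index(addr):
    x = addr * 2654435761 % (1 << 32)       # Knuth multiplicative hash
    return (x >> 16) % SLOTS

def misses(index_fn, addresses):
    tags = [None] * SLOTS                   # one tag per direct-mapped slot
    n = 0
    for a in addresses:
        i = index_fn(a)
        if tags[i] != a:                    # miss: fill the slot
            n += 1
            tags[i] = a
    return n

# Two interleaved streams with a stride equal to the table size: every
# access conflicts under plain indexing, far fewer do under hashing.
trace = [x for i in range(64) for x in (i * SLOTS, i * SLOTS + SLOTS * 1024)]
trace = trace * 4   # revisit the working set a few times
print(misses(plain_index, trace), misses(hashed_index, trace))
```

With plain indexing every one of the 512 accesses misses (all addresses map to slot 0); with the hash most of the working set survives between passes.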

As for multi-core, this design is intended to be replicated a few times and combined with 1 MMU dispensing RLDRAM bandwidth amongst 4N threads. Since there is no memory wall, each thread compares with a scalar x86 at 2GHz/8/4, so 8 PEs come out about the same. Deal with 4N threads and no cache misses, or deal with the broken serial model that dare not miss any cache.

The SDRAM is not actually too slow; it is only 2-3x slower than RLDRAM as latency goes. The problem is that it has no concurrency, so only 1 bank is in flight versus 8, and RLDRAM gets 20x more work done. Threaded DRAM goes with a threaded processor.
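The arithmetic behind this comparison can be laid out with the numbers quoted in the thread (taken from the posts, not from a datasheet): SDRAM serializing one random access per ~40ns versus RLDRAM starting a new bank access every 2.5ns with 20ns latency.

```python
# Back-of-envelope arithmetic for the RLDRAM v SDRAM comparison, using
# the figures quoted in the post (assumed, not checked against datasheets):
# SDRAM does one un-overlapped random access per ~40ns; RLDRAM can issue
# a new bank access every 2.5ns with 20ns latency.

sdram_cycle_ns = 40.0      # one full random access, nothing overlapped
rldram_issue_ns = 2.5      # a new access can start every 2.5ns
rldram_latency_ns = 20.0

banks_in_flight = rldram_latency_ns / rldram_issue_ns   # Little's law
issue_speedup = sdram_cycle_ns / rldram_issue_ns        # raw issue-rate ratio

print(banks_in_flight, issue_speedup)   # 8.0 banks in flight, 16.0x issue rate
```

The raw issue-rate ratio gives 16x; the post's "20x more work done" figure presumably folds in further overheads, so treat it as approximate.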

Think I said enough for now.

John

Reply to
JJ

I understand the basic idea. I can see how it solves a lot of problems because the time between cycles for an individual thread is long enough that you don't have to deal with forwarding or hazards or branch prediction or anything like that. Each thread is something of a multicycle architecture.

Unfortunately it seems that a multi-threaded architecture definitely needs a new programming paradigm. I don't think your standard C program would map well onto that. (If you were running 4 C programs, however, I could see it working quite well). But I suppose that is a different sort of problem to face.

Thanks for the info. I may very well look into an architecture like this at some point.

-Arlen

Reply to
gallen

Exactly, we do this all the time in DSP to break dependencies.

Typically, if a processor is already running some sort of OS with time-sharing of processes, then having to deal with 4 HW threads is not a big deal, except that the threads run at 1/t of the clock. But if many of these PEs are available in each MMU cluster, then that is 4N threads. It gets much more interesting when the MMU introduces its own OS memory-management issues and the language of choice looks like a hybrid of C/C++ with Occam and Verilog.

C gives us structs with data members, usually manipulated by any old functions, with no special logical structure at all.

C++ gives us classes to add member functions to member data for object-oriented programming, but no concurrency or liveness.

V++ (in development) gives us a process, which looks just like a class with an added port list and body code that can instance other process objects, a la Verilog.

// monospace

process pname1 (
    in ...,             // just like Verilog port list, event driven
    out ...,
    ints ...            // data ports not event driven
) {
    data members;       // just like C vars in struct
    function members;   // just like C++ class methods

    wires ...;          // just like Verilog

    process body code;  // just like Verilog module body

    l1: pname2(...);    // just like Verilog instance of another process/module
    l2: pname3(...);    // labels are used to name instances in the hierarchy

    assign ...;         // just like Verilog continuous assigns
    always { ...; }     // just like Verilog event driven parallel logic
}                       // usually endmodule

Now a process hierarchy combines C++ class OO structure with an event-driven, HDL-like structure, with some help from the processor to support many threads or processes etc.: 1) Data, 2) Objects, 3) Processes.
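Since V++ is still in development there is nothing to run, but purely as an illustration the "process = class + ports + labelled child instances + event-driven body" idea might be modelled in Python like this (every name below is hypothetical, invented for this sketch):

```python
# Purely illustrative analogue in Python of the "process = class with a
# port list, child instances, and an event-driven body" idea.  None of
# these names are real APIs; V++ itself is an in-development language.

class Process:
    def __init__(self, **ports):
        self.ports = ports          # named in/out ports, like a port list
        self.children = {}          # labelled child instances

    def instance(self, label, child):
        self.children[label] = child     # like `l1: pname2(...)`
        return child

    def on_event(self, port, value):
        pass                             # like an `always` block body

class Counter(Process):
    def __init__(self, **ports):
        super().__init__(**ports)
        self.count = 0                   # data member, like a struct field

    def on_event(self, port, value):
        if port == "clk":                # event-driven body code
            self.count += 1

top = Process(clk="clk_in")
c = top.instance("l1", Counter(clk="clk_in"))
for _ in range(3):
    c.on_event("clk", 1)
print(c.count)   # 3
```

The labelled `instance` call plays the role of a Verilog module instantiation, and `on_event` stands in for an `always` block; real concurrency and liveness, which John notes C++ lacks, would need actual scheduler support.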

regards

John

transputer guy

Reply to
JJ

John,

Thank you very much for the insightful description. Is there any chance that you could post some HDL to OpenCores? I am certain that others are as interested as myself in playing with a simple MTA.

Stephen

Reply to
Stephen Craven

Stephen

That will depend on future events.

I would like to complete this compiler and finish the remaining opcodes and the MMU first; it's a unified compiler + processor project, as was the original Transputer. One project makes no sense without the other.

I would prefer to make something commercial out of it with some free use for .edu. I am not too worried about time to market since there is lots of work to do and nobody else seems to be interested in doing this. Most seem happy to reinvent the same dead end ST designs and ST languages over and over.

If I do put it out in the open, it could be on OpenCores or wherever, with a BSD/MIT license, but it would be better for me if I can exploit it commercially too.

Updates here, on c.a, c.s.t etc., and osnews; one day a web home too.

John

transputer guy

Reply to
JJ

Hi John,

The (fairly boggo) SDRAM I've used has 4 banks open concurrently, so as long as you scatter your thread data across the banks, you can have low-latency access... or is that not what you mean?

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.trw.com/conekt
Reply to
Martin Thompson

Hi Martin

Modern SDRAM has had 4 banks forever now, but most processor designs, with their miserable controllers and cache hierarchies, haven't had much use for banking. To feed cache refills, bandwidth has been king: copy a big chunk of data from DDR CAS cycles into cache lines and stay away from RAS cycles as much as possible.

Even in the old days, banking was always treated as do-as-little-as-possible, to save power etc. The packaging of DRAMs with muxed address halves, plus the requirement to access fully randomly over the full address range, means several pin transactions per full cycle rather than 1.

According to the Micron website on the benefits of RLDRAM over SDRAM, they suggest that some SDRAM bank overlap is possible, i.e. as soon as RAS has delivered data from one bank, another bank open can start while the previous bank finishes up its bank cycle, so it's like 40,40,40 rather than 40+20,40+20..., so 1.5 times better.

The real benefit comes from unfettered banking, apart from bank collisions: the ability to start cycles every 2.5ns instead of every 40ns. It would have been even better if Micron had had the foresight to use 16, 128, or even 64K banks, although only 8 or so would ever be simultaneously in flight.

The latency of 20ns can be managed by MTA processor design, but the ability to issue every 2.5ns (3.3ns in an FPGA, with 6 clocks of latency) is the charm. 40ns every 40ns just isn't the same. A high issue rate means lots of slower PEs can share it.

If conventional SDRAM can overlap 2 banks during the RAS time, that would be news to me.

regards

John

Reply to
JJ

"vssumesh" wrote in the news message news: snipped-for-privacy@g43g2000cwa.googlegroups.com...

If you insist on understanding the DLX stuff, the following link is an excellent in-depth tutorial on memory architecture and pipelining. H & P for homebrew pipeliners...

formatting link

MIKE

--
www.oho-elektronik.de
OHO-Elektronik
Michael Randelzhofer
FPGA und CPLD Mini Module
Klein aber oho !
Reply to
M.Randelzhofer

Yes, I can see that would be different - I'll go and read up on RLDRAM now :-)

No, I don't think it can :-(

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.trw.com/conekt
Reply to
Martin Thompson
