The real problem here is that the Hennessy & Patterson computer architecture bible, and most books that repeat the same material, don't teach anyone today how to do anything that doesn't look like a DLX, so tough luck if you have to figure it out yourself. This is especially true for MTA design, a much-overlooked technique.
Now MTA (multithreaded architecture) isn't even new; it goes back to the 1950s. The idea is really simple and very familiar to DSP people, who do a lot of transposing between parallel and serial DSP, bit-wise versus word-wise.
I will describe a simple design that works well for me at 300MHz in a Virtex-II Pro and still uses fewer than 500 FF/LUT sites (not the 1000 typically needed for 32b work), though it is not yet complete in some of the opcode decodes.
I use a single Blockram to hold 4 sets of state, 128 32-bit words each. Each thread's share is further split: half for the register file and half for the ICache. The RFile therefore gives 64 regs per thread, and the ICache (really an opcode queue) holds 128 16-bit opcodes (or 64 32b opcodes, or 256 8b opcodes). Please don't ever do 8b opcodes!
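To make the carve-up concrete, here is a small sketch (my own Python model with hypothetical names, not generated from the actual design) of how the single 512-word Blockram address space splits into 4 threads of 128 words each, lower half registers and upper half ICache:

```python
WORDS_PER_THREAD = 128   # 4 threads x 128 x 32b = one 512 x 32b Blockram

def reg_addr(thread, reg):
    """Blockram word address of 32b register `reg` (0..63) for thread 0..3."""
    assert 0 <= thread < 4 and 0 <= reg < 64
    return thread * WORDS_PER_THREAD + reg

def icache_addr(thread, slot):
    """Blockram word address of the 32b word holding 16b opcode
    pair `slot` (0..63) in that thread's ICache queue."""
    assert 0 <= thread < 4 and 0 <= slot < 64
    return thread * WORDS_PER_THREAD + 64 + slot

# e.g. thread 2, register 5 lives at word 261; thread 2's first
# opcode pair lives at word 320
```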
The primary controller is a 3-bit counter stepping through 8 states: b0 gives the odd/even phase for each instruction slot, and b2:b1 distinguish which thread is in effect.
The odd/even bit lets me do 32-bit math over 2 clocks, 16b at a time. It also lets me pair 2 operand reads and a later write-back with an early opcode fetch in 2 cycles, so the Blockram is effectively 4-way ported. Those reads, the write, and the Ifetch belong to 3 different threads, though. All Blockram accesses are 32b wide. The datapath takes 32b every other cycle for the x,y inputs and, 5 clocks later, returns the 32b z result on the opposite phase to the Blockram. On the same phase, another opcode pair is fetched.
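Taking the b0 / b2:b1 split literally, the counter decode is trivial; this is my own Python model of it (the real HDL may order things differently):

```python
def decode_state(count):
    """Split the 3-bit controller counter into (thread, phase)."""
    phase = count & 1            # b0: 0 = even 16b half, 1 = odd 16b half
    thread = (count >> 1) & 3    # b2:b1: which of the 4 threads is in effect
    return thread, phase

# the 8 states visit each thread for 2 consecutive phases:
# count 0..7 -> (0,0) (0,1) (1,0) (1,1) (2,0) (2,1) (3,0) (3,1)
```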
The big win is that the design clocks at the limit of the Blockram, or of a 16b add, or of 3 LUTs of logic, which is about 2x faster than the usual 32b flat single-thread pipelines. The instruction-decision logic that is often crammed into 1 pipeline stage now straddles 8 pipeline stages, so very little logic is needed between pipes.
Now thread i+0 reads its data operands in clock t0 but writes results back at t5; the next opcode for that thread reads operands at t8, writes at t13, and so on.
Thread i+0 uses t0, t5, t8, t13, ... for reg reads & writes.
Thread i+1 uses t1, t6, t9, t14, ...
Thread i+2 uses t2, t7, t10, t15, ...
Thread i+3 uses t3, t8, t11, t16, ...
So all threads stay out of each other's way: no interlocks, no forwarding, no hazards, no branch prediction. The cost is 4 copies of thread state.
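A toy check (my own Python model; the Ifetch port and the odd/even operand pairing are left out) that the read/write timetable above never demands more than the 2 ports a dual-ported Blockram provides:

```python
def blockram_traffic(n_cycles=32):
    """Thread i reads operands at cycles i, i+8, ... and writes
    results back at cycles i+5, i+13, ..., per the schedule above."""
    events = {t: [] for t in range(n_cycles)}
    for thread in range(4):
        for base in range(0, n_cycles, 8):
            rd, wr = base + thread, base + thread + 5
            if rd < n_cycles:
                events[rd].append(("rd", thread))
            if wr < n_cycles:
                events[wr].append(("wr", thread))
    return events

# only cycles that are 0 mod 8 ever see 2 accesses at once, and then
# it is a read for one thread paired with a write for another
assert all(len(e) <= 2 for e in blockram_traffic().values())
```

Note how at t8 thread i+0's operand read shares the Blockram with thread i+3's write-back, which is exactly why the threads never collide with themselves.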
I left out a lot of detail; you have to figure this out on your own nickel if you want this sort of design. The condition codes, PC, and other CPU state regs exist 4 times over; these can use SRL16s, a DP RAM, or a barrel wheel of 4 states rotating mostly on 1 phase.
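One possible reading of the "barrel wheel" option (a sketch of my own, with hypothetical names; the post doesn't spell out the rotation): the 4 threads' PC/flags sit in a 4-deep rotating buffer that advances one notch per thread slot, so only the front entry is ever live:

```python
from collections import deque

class StateBarrel:
    def __init__(self):
        # one (pc, flags) entry per thread, front entry = current thread
        self.wheel = deque([(0, 0)] * 4)

    def step(self, update):
        """Run the front thread's (pc, flags) through `update`,
        then rotate it to the back of the wheel."""
        self.wheel.append(update(*self.wheel.popleft()))

barrel = StateBarrel()
for _ in range(4):                         # one full revolution
    barrel.step(lambda pc, flags: (pc + 2, flags))
# after 4 slots, every thread's PC has advanced by one 16b opcode
```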
The ARM is a problem, period: you tend to get chased or sued if you get anything done, especially with any intent to give it away or resell it. I don't think it is that great anyway; copying a CPU designed for VLSI into an FPGA leaves a bad taste. Instead, use your own opcode set and look at Jan Gray's site for Lcc hints on porting a compiler, etc.
As for associative caches, doing things the regular way with a 1- or 2-way set-associative design is very expensive; instead I use hashing, which makes things look very associative. I also expect to use RLDRAM, but that's another story. One nice thing about the 4-way and especially the 2-phase design is that every opcode takes 8 actual cycles, times 1 or sometimes 2 or 3. RLDRAM can clock at 300MHz and has a latency of 8 cycles per threaded bank, so my DRAM is faster than my minimum opcode sequence for load/store and I don't need a DCache. The ICache is there to help the much more predictable instruction flow, but it isn't really associative since it's just a queue of opcodes near the PC value. All 4 threads, over many processor copies, see their own private DRAM shared in 1 device.
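The post doesn't say which hash is used, so the sketch below is a generic illustration of the idea rather than the author's circuit: a cheap XOR-fold spreads conflicting lines across the whole table, and a single tag compare per lookup replaces the N parallel compares of an N-way set-associative cache:

```python
def hash_index(addr, bits=6):
    """Cheap mixing hash: XOR-fold the address into a `bits`-wide index."""
    h = addr
    h ^= h >> bits
    h ^= h >> (2 * bits)
    return h & ((1 << bits) - 1)

class HashedCache:
    """Direct-mapped table behind a hash, so conflicts land in
    different slots than a plain address-modulo index would give."""
    def __init__(self, bits=6):
        self.bits = bits
        self.tags = [None] * (1 << bits)
        self.data = [None] * (1 << bits)

    def lookup(self, addr):
        i = hash_index(addr, self.bits)
        return self.data[i] if self.tags[i] == addr else None  # None = miss

    def fill(self, addr, value):
        i = hash_index(addr, self.bits)
        self.tags[i], self.data[i] = addr, value
```

The single tag compare is what keeps this cheap in LUTs; the hash just makes pathological stride conflicts much less likely than with a plain direct-mapped index.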
As for multi-core, this design is intended to be replicated a few times and combined with 1 MMU dispensing RLDRAM bandwidth amongst the 4N threads. Since there is no memory wall, each thread compares with a scalar x86 at 2GHz/8/4, so 8 PEs come out about the same. Deal with 4N threads and no cache misses, or deal with a broken serial model that dare not miss any cache.
SDRAM is not actually too slow; as latency goes it is only 2-3x slower than RLDRAM. The problem is that it has no concurrency, so only 1 bank is in flight versus 8, and RLDRAM gets about 20x more work done. Threaded DRAM goes with a threaded processor.
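A back-of-envelope check of that ~20x figure, using only the numbers in the text (RLDRAM latency 8 cycles with 8 banks in flight; SDRAM taken at the 2.5x-worse midpoint of the 2-3x latency range, with only 1 access in flight):

```python
rldram_latency, rldram_in_flight = 8, 8           # cycles, banks in flight
sdram_latency, sdram_in_flight = 2.5 * 8, 1       # midpoint of "2-3x slower"

rldram_throughput = rldram_in_flight / rldram_latency   # accesses per cycle
sdram_throughput = sdram_in_flight / sdram_latency

speedup = rldram_throughput / sdram_throughput    # -> 20.0
```

Latency per access barely matters once enough threads keep all the banks busy; it is the in-flight count that carries the ratio, which is the whole point of pairing threaded DRAM with a threaded processor.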
Think I said enough for now.
John