Mill: FPGA version?

I've been wondering if there could be a successful (not too large, reasonable clock frequency) soft-core FPGA version of a low end variant of the Mill. One issue with FPGAs is that only two-port memory is available. It's not so hard to make extra read ports with these, but extra write ports are a big problem. This pretty much precludes any sane implementation of Tomasulo's algorithm for OOE.

If I understand "the belt" properly, the number of write ports is set to just the number of pipelines (or maybe to two to support the spilling). The pipes need to be able to read from anywhere, but we can do that with extra read ports and one extra mux stage. Anyway, it seems like a nice way to get superscalar on an FPGA.

Perhaps Out of the Box Computing has an FPGA version now already for bring up? Very curious what clock frequency they are getting..

--
/*  jhallen@world.std.com AB1GO */                        /* Joseph H. Allen */ 
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) 
 Click to see the full signature
Reply to
Joseph H Allen
Loading thread data ...

We're working on an FPGA, but it's behind patents, fund raising, and talks, so it won't be soon :-(

As for your technical question, I am a software guy and don't know the innards of FPGAs so I have bounced your question to our hardware team and will post their answer.

Ivan

Reply to
Ivan Godard

The belt is not implemented as SRAM, nor as a shift register. The belt is implemented in the same manner as a conventional register bypass, with bypass muxes.

In the case of a conventional, the bypass mux selects are based upon qualified equals comparators. So if a source of the current operation has the same register number as the result of a previous operation, then that previous value is used as long as it is not old enough to have been written to the register file.

In the case of the belt, the belt result selects are based upon qualified equals comparators also, but what is being compared are belt "temporal location" tags. The register file isn't used roughly 80% of the time, so why use it and its associated "physical location" tags? A register file can take 3 or 4 clocks for a result written to it to show up at a functional unit's source. That turns out to be on the order of average life time of a result's "temporal location" tag before it gets "incremented off the end" of the belt.

So in the case of both a conventional bypass and the belt, each functional unit's source has a multiplexer to select from each possible slot result output. So each slot source has a quick path thru the result select mux for latency 1 results, and nearly a full pipe stage path thru the mux for all the rest.

I hope that answers your question - I wasn't sure if it was regarding "normal" belt operation or spilling/filling for calls, returns, interrupts, traps and the like.

--
Arthur Kahlich 
Chief Engineer 
 Click to see the full signature
Reply to
Ivan Godard

I have wondered about an FPGA implementation of the 360/91, I believe where Tomasulo's algorithm was developed. As far as I know, though, not enough details are available.

-- glen

Reply to
glen herrmannsfeldt

Heh, if you are really brave, start Herrmannsfeldt Corporation (like Amdahl Corporation): make an FPGA implementation of any recent (64-bit) IBM mainframe- probably there is a market even if it's a slow, simple in-order single-issue design. Use the Hercules emulator as the predictor in your design verification suite.

--
/*  jhallen@world.std.com AB1GO */                        /* Joseph H. Allen */ 
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) 
 Click to see the full signature
Reply to
Joseph H Allen

On an FPGA the "all the rest" bypass MUXing is too expensive to do in one stage. The way I know to deal with this is to spread the bypass muxing across a pipeline which feeds the function unit. This bypass pipeline is the same length as the datapath pipeline. The FU results are broadcast to all of these pipeline stages- if a request needs that particular result, it's MUXed in. With one FU, you only need 2:1 MUXes. With N FUs, you need (N+1):1 MUXes in each stage (unless there is clustering..).

Anyway, on an FPGA you have SRAM, so you definitely want to use it for the register file, but you are stuck with 2 addressed ports per SRAM. So I'm envisioning that each FU has its own 2-port register file, plus copies so other FUs can read results (older than in the bypass). In other words, one write port, many read ports is OK.

That you can make a compiler for the implied temporal location write addressing, tells me that you should also be able to make one for a more conventional design where there is explicit physical write addressing, but with the restriction that each FU can only write to 1/N dedicated locations in the register file.

With 4 FUs perhaps the load unit only can write to register 0, 4, 8, 12, etc. ALU1 can only write to register 1, 5, 9, 13, etc.. and so on. It seems this is a reasonable architectural restriction. I'm wondering now if there were any other machines like this this (well except for FP / integer split).

I like also that load completons get the most recent store results to avoid aliasing.. but I thought other systems did this also (basically bypass the dcache). Hmm, now that I think about it it's not fully solving the alias problem. You are not free to move a load completion earlier than any store that you can't prove is alias hazard free.

Thanks!

--
/*  jhallen@world.std.com AB1GO */                        /* Joseph H. Allen */ 
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) 
 Click to see the full signature
Reply to
Joseph H Allen

I think that I can reply to this without putting my foot in it (Art is even more swamped than I am).

Your suggestions would be appropriate were we trying to actually make a usable CPU in an FPGA, but that's not our plan; the Mill FPGA is purely a (faster than software) simulator for a regular hardware Mill. Its purpose is to test out correctness and to speed up software development, and we don't care how fast it runs. Consequently, the FPGA will work

*exactly* the same way that the corresponding chip Mill will, using the same compilers. If that means that each Mill clock requires five, or 50, FPGA stages and the bypass requires multiple SRAMS and muxes, then so be it. It's still going to be 10,000 times as fast as software sim.

Load retires (completions) retain the same ordering vis a vis stores as program order. Only load issue is moved on a Mill. Of course, when the compiler can prove the absence of aliasing then it can also move load retire too, but that's the compiler; the hardware does not do moves, and views a compiler-revised order as "program order". The Mill is not responsible for compiler bugs :-)

Reply to
Ivan Godard

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.