Spartan 3 - available in small quantities?

to be the tools to compile the

might argue that the high speed

assembler?

"AS" from Alfred Arnold is a good assembler covering a wide range of cores, with a choice of Pascal or C sources:

formatting link

And HLA (High Level Assembler) is currently x86 only, but its front end and approach are much closer to higher-level languages (minus the bloat). V2 will allow different back ends for opcode output. Worth watching.

formatting link

This is able to support quite large code efforts and still remain close to the iron.

A benefit of working from the 'best assembler' end is the ease of supporting multiple tiny core instances, which is one of the advantages of such soft cores.

-jg

Jim Granville

Do you not think that the number of ways has to be at least as great as the number of threads? I would expect a significant number of conflict misses (particularly in the I-Cache) if this is not the case. Hit-under-miss is a must. Otherwise all those impressive Mega-Hurtz will just be thrown away stalling for cache refills.
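To put a rough number on the worry, here is a minimal sketch of the worst case: one cache set, LRU replacement, and each thread repeatedly touching its own hot line that happens to map to that set, with threads taking turns in strict round-robin order. The function name, the LRU policy and the access pattern are all assumptions made for illustration, not a description of the actual design.

```python
from collections import OrderedDict

def conflict_misses(num_threads, ways, rounds):
    """Count misses in ONE cache set shared by num_threads threads.

    Assumed worst case: every thread's hot line maps to the same set,
    threads run in strict round-robin order, replacement is LRU.
    """
    cache_set = OrderedDict()  # tag -> None, least recently used first
    misses = 0
    for _ in range(rounds):
        for thread in range(num_threads):
            tag = thread  # distinct tag per thread, same set index
            if tag in cache_set:
                cache_set.move_to_end(tag)         # hit: mark most recent
            else:
                misses += 1
                if len(cache_set) >= ways:
                    cache_set.popitem(last=False)  # evict the LRU line
                cache_set[tag] = None
    return misses

print(conflict_misses(num_threads=4, ways=2, rounds=100))  # 400
print(conflict_misses(num_threads=4, ways=4, rounds=100))  # 4
```

With fewer ways than threads, round-robin plus LRU evicts each line just before its owner comes back, so every access misses. With ways equal to threads, only the 4 cold misses remain. Real code would rarely be this hostile, so treat it as an upper bound.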

Cheers, JonB

Jon Beniston

going to be the tools to compile the

might argue that the high speed

assembler?

running

helps

Although an assembler is only a tiny fraction of the effort of a C compiler, once done it opens the door only just enough to bootstrap up slowly. For a processor to have much wider appeal, it needs the full effort, either porting a compiler or writing one from scratch.

I will probably set the hard type semantics of C aside for a while and just add a quick and dirty codegen that handles C-style assembler and simple single-size expressions, with none of the usual optimizations, and just play dumb. Then baseline C/Verilog/Occam/inline asm can be written that might violate some proper rules. The compiler wouldn't be able to compile itself, but I could get on with testbench and verification. Right now it can analyze itself but doesn't emit anything. It does have a nice #preprocessor built into the lexer that allows C++-like use of definitions with the same name but a varying number of params, which is not described in the lcc book.
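One way such same-name, different-parameter-count definitions could be handled is to key the macro table on the pair (name, number of params). The sketch below is only an illustration of that idea in Python. The names `define` and `expand`, the token-list representation and the `ADD` macro are all hypothetical, and this is not how lcc or the preprocessor described above is actually written.

```python
# Macro table keyed on (name, parameter count), so the same name can be
# defined several times as long as the parameter counts differ.
macros = {}

def define(name, params, body_tokens):
    """Register a macro; body_tokens is a list of already-lexed tokens."""
    macros[(name, len(params))] = (list(params), list(body_tokens))

def expand(name, args):
    """Pick the definition matching the argument count and substitute."""
    params, body = macros[(name, len(args))]
    binding = dict(zip(params, args))
    # Replace each parameter token by its argument, keep other tokens.
    return [binding.get(token, token) for token in body]

define("ADD", ["a", "b"], ["a", "+", "b"])
define("ADD", ["a", "b", "c"], ["a", "+", "b", "+", "c"])

print(" ".join(expand("ADD", ["x", "y"])))       # x + y
print(" ".join(expand("ADD", ["x", "y", "z"])))  # x + y + z
```

A call with an argument count that has no matching definition raises `KeyError` here; a real preprocessor would report a proper diagnostic instead.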

The usual way in the past was to define subsets of the target language and compile for that with the compiler also being restricted to that level. The 1st pass might be an assembler. The compiler could then operate at some level on the target and as the language subset is raised, the compiler gets to use the new features and tests them on the next round. I don't think people do that anymore unless the language is brand new and no compiler exists yet. Once it does exist, it's usually easier to cross port.

This brings up a point: can a new compiler be distributed for money if the design is largely based on previous open code? I will have to go check the license on lcc.

johnjakson_usa_com

john jakson

Hi Jon

Not necessarily. On a conventional HT cpu, the threads would all be independent and would likely fight over the cache set size, so 2-way would probably be a min. Since these threads are supposed to be cooperating as Occam processes would, their opcodes would be local, but that assumes sibling processes run close to each other in time and space. No guarantee of that. In the HW event-driven case, it's much easier to speculate about what will likely happen, as the scheduling model is so much simpler. Even if there are lots of conflicts, what will happen is that the threads will just keep delaying.

In the HW time wheel, there are actually 16 threads waiting to go (or null Ps if fewer are available). These 16 represent the front of the proper P queue, stored as a linked list out in memory space (only some of which might be in cache at any time). The HW only allows the front 4 of those to queue up in the Iop queue. The fetcher steals or forces available cache read slots to keep this full, rotating between the 4 queues, which live inside distributed 16b DP rams by 64 wide. Hence each running thread can buffer up to 16 small ops or 4 extended 64b ops or some mix.
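The "16 small ops or 4 extended 64b ops or some mix" figure can be read as a simple slot budget. The sketch below assumes a small op takes one slot and an extended 64b op takes four, which is inferred from the 16-versus-4 ratio rather than stated outright. The function names are hypothetical.

```python
SLOTS_PER_QUEUE = 16    # per running thread, from the figure in the post
SLOTS_PER_SMALL = 1     # assumption: one slot per small op
SLOTS_PER_EXTENDED = 4  # assumption: an extended 64b op fills four slots

def slots_needed(small_ops, extended_ops):
    """Slots consumed by a mix of small and extended ops."""
    return small_ops * SLOTS_PER_SMALL + extended_ops * SLOTS_PER_EXTENDED

def fits(small_ops, extended_ops):
    """True if the mix can be buffered in one thread's queue."""
    return slots_needed(small_ops, extended_ops) <= SLOTS_PER_QUEUE

print(fits(16, 0))  # True: 16 small ops
print(fits(0, 4))   # True: 4 extended ops
print(fits(8, 2))   # True: a mix that exactly fills the queue
print(fits(9, 2))   # False: one slot too many
```

Intermediate op sizes, if there are any, are not modeled here.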

On a side note, if the cpu were 64b wide, the HT would have to be 8-way, but then the Iqueue HW would be twice as wide too, so that still allows each P to buffer up the same no. of ops. I would have to tweak the HDL code to group the rams for height vs. width, keeping the output always 64b wide. Wider data ops don't really change the opcode fetch rate, since that now looks half as much as before. The fractional cost of executing ops now changes from 9/8 to 17/16 cycles for ALU a>4 ops, but then it will have been P switched already. The bra decision, when it does arrive, will post back the modified ip into the Pid-selected ip field. Pid rides along the datapath pipeline too. Bra pts may be used to do the outer timesharing, but I may leave that to a SW kernel.

Cache misses will probably be treated the same way: if the miss is going to be long, switch to the next P in the side queue. You can imagine a little railway track figure of 8 made up of selective pipelines & muxes holding minimal P state, with something like Johnson logic or a one-hot coded state engine in charge.

One huge difference between this HT processor and the ones you hear about (x86, Alpha etc.): I expect to use RLDRAM as 2nd level cache, which RAS cycles in 20ns, about 1.5 effective cpu cycles of 13.3ns. It is 8-way banked and can support 2.5ns data rates and control. I will probably be limited to the 311MHz rate, and DDR is limited to 622MHz in the specs (a convenient 2x); this is right on the edge of what FPGAs can do and below the RLDRAM2 800MHz std.
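The arithmetic behind those figures is short enough to check directly. The sketch below uses only the numbers quoted above (20ns RAS cycle, 13.3ns effective cpu cycle, 311MHz clock); the function names are hypothetical.

```python
RAS_CYCLE_NS = 20.0            # RLDRAM RAS cycle quoted above
EFFECTIVE_CPU_CYCLE_NS = 13.3  # effective per-thread cpu cycle quoted above
CLOCK_MHZ = 311                # clock rate quoted above

def cpu_cycles_per_ras(ras_ns, cpu_cycle_ns):
    """How many effective cpu cycles one RAS cycle costs."""
    return ras_ns / cpu_cycle_ns

def ddr_rate_mhz(clock_mhz):
    """Double data rate: two transfers per clock."""
    return 2 * clock_mhz

print(round(cpu_cycles_per_ras(RAS_CYCLE_NS, EFFECTIVE_CPU_CYCLE_NS), 1))  # 1.5
print(ddr_rate_mhz(CLOCK_MHZ))                                             # 622
```

So a RAS cycle costs roughly one and a half thread cycles, which is the point of the comparison with conventional DRAM.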

Remember x86 in particular has to be designed to work with very slow RAS el cheapo DDR RAMs, which can be several 100x slower than cpu cycles. Intel can't do a special tweak for RLDRAM since the difference is still very large, maybe 50 or more.

In this cpu I could almost throw the cache out and go direct to RLDRAM as main memory, which is why I am not too concerned about a tiny cache. I will be building an RLDRAM model soon by faking a bunch of 8 Blockrams together with delays and muxes/demuxes. This will let me test out 1-8 cpu models running with faked RLDRAM, all inside a sp-400 part. Further, a 64b 8-way HT cpu would actually cycle slower than RLDRAM, i.e. 26ns.
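Before committing such a model to HDL, the banked-with-delay behaviour can be sketched in software. The following is a rough Python behavioural sketch, not the planned Blockram model: the class name, the choice of low address bits as bank select, and the rule that a bank is simply busy for one 20ns cycle per access are all assumptions.

```python
class BankedMemoryModel:
    """Toy model of an 8-bank memory where each bank has a fixed cycle time."""

    def __init__(self, num_banks=8, bank_cycle_ns=20.0):
        self.num_banks = num_banks
        self.bank_cycle_ns = bank_cycle_ns
        self.bank_free_at = [0.0] * num_banks          # time each bank is idle again
        self.storage = [dict() for _ in range(num_banks)]

    def bank_of(self, address):
        # Assumption: low address bits select the bank.
        return address % self.num_banks

    def access(self, address, now_ns, write_value=None):
        """Issue a read (or a write if write_value is given).

        Returns (completion_time_ns, value at address after the access).
        """
        bank = self.bank_of(address)
        start = max(now_ns, self.bank_free_at[bank])   # wait if the bank is busy
        done = start + self.bank_cycle_ns
        self.bank_free_at[bank] = done
        if write_value is not None:
            self.storage[bank][address] = write_value
        return done, self.storage[bank].get(address, 0)

mem = BankedMemoryModel()
print(mem.access(0, now_ns=0.0, write_value=7))  # (20.0, 7)   bank 0
print(mem.access(1, now_ns=2.5))                 # (22.5, 0)   bank 1, no wait
print(mem.access(8, now_ns=2.5))                 # (40.0, 0)   bank 0 again, must wait
```

Accesses spread across different banks overlap and each completes one bank cycle after issue, while back-to-back hits on the same bank serialize. That is the behaviour the delays and muxes in a hardware model would need to reproduce.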

The real purpose of the cache, which is a unified data-instruction-workspace, is to satisfy the enormous bandwidth requirement of the workspace operations. Register cpus have 1 or more reg files separate from d/i cache, but they have the burden of very costly context swaps. R3 keeps many workspaces in the unified cache and provides 3 ports to the datapath, 2 reads and 2 joined writes, using a pair of DP rams. The instr and data fetch requirements could be met by fast RLDRAM without cache, though some buffering would still be needed. The T9000-style workspace caching is what makes this all work, along with the cpus running close to RLDRAM speed. If R3 ever went ASIC and n x faster, of course the cache would go full custom and bigger by far.

Hope that helps

johnjakson_usa_com

john jakson

Jan Gray did an interesting article on this for Circuit Cellar a few years back, targeting the lcc compiler. The article will still be on

formatting link

Tim
