OpenSPARC released

Anyway, I've been trying to figure out whether this SPARC is superscalar or not. It does not appear to have any dataflow management logic, and the main register file has only 5 ports, so I suspect not.

Yeah, I suspect the existing Verilog is not for a superscalar microarchitecture. Check out:

formatting link

-Shyam

Reply to
Shyam

Please provide any evidence of this assertion :

" The price you pay is very large: unmaintainable, unreadable code which is probably an order of magnitude larger than proper RTL."

This coding style, which you so clearly denigrate as subpar, is actually quite standard in high-end chip development. Some reasons:

1) It is much easier to swap flop models, because no one is allowed to write their own always @ flop blocks. Replacing the flop library is as simple as changing an include.

Attempting to do this in "proper RTL" is what is actually unmaintainable. Experience porting a design from one technology library to the next shows that these coding standards are *necessary* to be able to do this type of work; your "proper RTL" style is inflexible by comparison.
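As a sketch of the idea (the module and signal names here are my own invention, not from the OpenSPARC code): all state lives in instances of a wrapper flop from a library file, and retargeting means pointing the build at a different library file.

```verilog
// dff_lib.v -- one possible library implementation of the wrapper flop.
// Retargeting the design means swapping in a different dff_lib.v;
// no design source changes are needed.
module lib_dff #(parameter W = 1) (
  input              clk,
  input  [W-1:0]     din,
  output reg [W-1:0] q
);
  // The only always @(posedge ...) block in the whole code base
  // lives here, inside the library.
  always @(posedge clk)
    q <= din;
endmodule

// Design code: combinational logic in assigns, state in library instances.
module accum (
  input         clk,
  input  [31:0] a, b,
  output [32:0] result
);
  wire [32:0] next_result = a + b;
  lib_dff #(.W(33)) result_ff (.clk(clk), .din(next_result), .q(result));
endmodule
```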

2) The synthesis tool does not care. Whether you write an inverter feeding a flop, or code up a flop with an inverting input, the synthesizer doesn't care where you placed the code. The final result is exactly the same.

If you had followed these coding guidelines, swapping out this flop could be done by tweaking the include path for the library. Without them, you would have to visit every always block in your design to see whether or not it is actually a flop, and then recode it by hand. (Good if you get paid by the hour, not so good for your employer.)
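A minimal illustration of the equivalence point (my own example, not from the thread): these two modules synthesize to the same netlist, so the placement of the inversion in the source is irrelevant.

```verilog
// Explicit inverter feeding the flop.
module inv_then_ff (input clk, input d, output reg q);
  wire d_n = ~d;
  always @(posedge clk) q <= d_n;
endmodule

// Inversion folded into the flop input; synthesis produces
// the same result as the module above.
module ff_inverting_input (input clk, input d, output reg q);
  always @(posedge clk) q <= ~d;
endmodule
```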

3) Rebalancing logic across clock domain crossings is easier when the logic is separate from the flops. Below, the X's are flops and a, b, c are assign wires:

  before:  X1 --> a --> b --> c --> X2
  after:   X1 --> a --> b --> X2 --> c

The only changes that need to occur are that the input to X2 becomes b instead of c, and the input to c becomes the output of X2 instead of the output of b.

Using "proper RTL", you might have coded a, b, and c inside an always block. You then need to create more wires or modify the always blocks to pull this logic out, and then hook it up. At the end, after you have applied your timing fix, the code is larger.
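A hedged sketch of that retiming move in the instance-assign style (lib_dff stands for whatever wrapper flop such a library provides; all names are illustrative):

```verilog
// Before: the flop x2 captures c, the last of three assign stages.
module retime_before (input clk, input i1, i2, i3, output o);
  wire x1_q;
  lib_dff x1 (.clk(clk), .din(i1), .q(x1_q));
  wire a = x1_q & i2;
  wire b = a ^ i3;
  wire c = ~b;
  lib_dff x2 (.clk(clk), .din(c), .q(o));
endmodule

// After: x2 now captures b, and c is computed past the flop.
// Only two connections changed; no always block was touched.
module retime_after (input clk, input i1, i2, i3, output o);
  wire x1_q, x2_q;
  lib_dff x1 (.clk(clk), .din(i1), .q(x1_q));
  wire a = x1_q & i2;
  wire b = a ^ i3;
  lib_dff x2 (.clk(clk), .din(b), .q(x2_q));
  wire c = ~x2_q;
  assign o = c;
endmodule
```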

Worse yet, you may have made a mistake. These things tend to happen: when you recoded this flop, you may have left some path out and turned it into a latch. These kinds of mistakes are not possible in an instance -- assign -- assign -- instance methodology, because "always @(posedge...)" is not allowed in your code; it belongs inside a library.

Now you may say "But I am smarter than that!", and that is nice for you. But when setting up a coding standard that needs to be used by hundreds of engineers and verified by tools, ad-hoc "proper RTL" methods get left in the dust behind rigid standards that prevent bad stuff from happening in the first place.

Please consider that this chip was probably designed by a group of engineers easily topping 100, and that many compiler, synthesis, and other tools needed to manipulate this code and extract meaningful information from it. Having each engineer write in what you describe as "proper RTL" style is not acceptable in these situations. It is not flexible enough (you have to add lines of code just to make timing fixes), it is error prone (you can write logic that is not possible or available in your library), and it doesn't get you *any* better results.

I fail to see any benefit from your "proper RTL" style. If there is some benefit that would offset the costs I have listed above, I am open to reconsidering. I realize that if you have not been exposed to these ideas before, they may sound like solutions to problems you have not faced. But these problems are common among large-scale ICs that need to be taped out in many technologies and go through extensive ECO timing fixes to achieve maximum performance.

-Art

Reply to
Art Stamness

This only makes sense to me if you have different flops in the same design; otherwise, just rename the flop itself to match the one generated by the synthesis tool.

So if the real problem is that you have different kinds of flops in the same design, why not just attach an attribute to the 'reg' which becomes the flop?

reg [15:0] foo /* synthesis attribute floptype="master_slave_3" */;

There is another advantage to this. Definitions in include files are frequently bad because they are global. It is almost always better to use parameters whenever possible- then you can reuse code in different ways by overriding a parameter:

parameter main_flop="master_slave_3";

reg [15:0] foo /* synthesis attribute floptype=main_flop */;

The synthesis tool doesn't care if you instantiate an inverter, but it certainly does care if you code up a state machine with a 'case' statement.

Also, by instantiating flops, you are not giving the synthesis tool information about the flop: in particular, the flop might have a built in clock-enable pin that you want the synthesis tool to use.
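A sketch of that point (my own example): when the flop is inferred from an always block, the tool is free to map the enable onto a dedicated clock-enable pin; an instantiated plain dff hides that opportunity.

```verilog
// Inferred flop with an enable: synthesis can map "en" directly onto
// a library cell's built-in clock-enable pin, or build a recirculation
// mux if the cell has none -- the choice is left to the tool.
module ce_ff (
  input             clk,
  input             en,
  input      [15:0] d,
  output reg [15:0] q
);
  always @(posedge clk)
    if (en)
      q <= d;
endmodule
```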

Why are you doing this in source code? Can't your synthesis tool rebalance the logic at this micro level?

Human editing is bad. Make the tool do it.

IBM long ago showed that bugs were proportional to the number of lines. Thus anything that reduces the number of lines of code is going to reduce your verification cost, which is substantial.

--
/*  jhallen@world.std.com (192.74.137.5) */               /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
Reply to
Joseph H Allen

Art,

You wrote quite a bit, but I cannot agree with any of your arguments.

Art Stamness wrote:

Care to show some examples? What "high end chip" are you talking about?

If you code your flops in always blocks, this is also true. The flops are simply implied by the synthesis tools.

Again, I don't see how writing always blocks is "unmaintainable." Maybe I haven't had enough experience. Anyhow, suppose you had instantiated a dff that takes clock, din, clr, and set inputs, and synchronous resets are generated from mux logic. Now, a new technology library gives you cells that have a built-in synchronous reset. To port to this new library, you must manually recode all the synchronous reset logic. Compare this to an always block: you don't need to do anything.

If you had coded it in an always block, you wouldn't have to do anything at all! Any synthesis tool will figure out whether the cells have inverting inputs or not. Even if you accidentally put two inverters along the logic path, they will be optimized away via the truth tables constructed by the synthesis tool.
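An illustrative sketch of that porting argument (my own code, not from either poster):

```verilog
// A flop with a synchronous reset written behaviorally. Whether the
// target library offers a dedicated sync-reset pin or needs a mux in
// front of a plain dff, synthesis picks the mapping; this RTL is
// untouched when the library changes.
module sr_ff (
  input            clk,
  input            rst,   // synchronous reset
  input      [7:0] d,
  output reg [7:0] q
);
  always @(posedge clk)
    if (rst)
      q <= 8'b0;
    else
      q <= d;
endmodule
```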

Excuse me? Rebalancing across clock domains? It is never trivial, and what you offered only works when you are balancing within the SAME clock domain.

That's a lot of assumptions, not to mention many synthesis tools rebalance logic for you automatically.

I see the complete opposite. Having 100+ engineers working on the same project requires them to understand each other's code quickly. A netlist style is a NIGHTMARE. RTL was created to be different from a netlist precisely so that it is more readable.

Reply to
Jason Zheng

I think the main point of mine you are missing is that these techniques solve problems which it does not appear you are familiar with.

Adding another layer of indirection in the form of an instantiated flop model solves many of these problems and is an industry standard as far as high end retargetable ASIC coding standards are concerned.

Each solution you have described involves adding more lines of code to the actual RTL source and strips out the layers of indirection. It attaches implementation attributes directly to a design that is intentionally high level, so that it can be retargeted. Those layers of indirection are invisible to the tools that use them (synthesis and simulation), but provide flexibility for the person who needs to model different behavior or change libraries.

Now maybe you have not had the need to do this, in which case it seems superficial and a waste of time. But I can assure you, lots of time and effort was spent in building these components this way for a very good reason.

-Art

Reply to
Art Stamness

haven't had enough experience

Let me explain, then. Here is the "proper RTL" as some others might write it:

wire [31:0] a;
wire [31:0] b;
reg  [32:0] result;

always @(posedge clk)
  result <= a + b;

Reply to
Art Stamness

I fear you're right :-)

From this and other posts I believe you mentioned the following tasks as arguments for the coding style in question:

1) randomization of flip-flop start-up values
2) retargeting a netlist to another technology
3) retiming for performance

My feedback would be that we face a methodology problem. Proper RTL also means proper usage of available abstraction levels. RTL is effective for functional description and verification, but that's it. The task you describe can better be handled as follows:

1) gate-level simulation
2) a synthesis tool used in retargeting mode
3) an advanced synthesis tool working at the gate level

Trying to do such things manually and at the RTL level will naturally get you into trouble, to the point of generating self-fulfilling prophecies ...

There you have it ;-)

Jan

--
Jan Decaluwe - Resources bvba - http://www.jandecaluwe.com
Losbergenlaan 16, B-3010 Leuven, Belgium
     From Python to silicon:
     http://myhdl.jandecaluwe.com
Reply to
Jan Decaluwe

A proper comparison would be to undertake the same project twice, using the two design styles independently, and then compare results. Typically unfeasible, of course. However, in my previous life at the design service company I co-founded (Easics), I have had the occasional opportunity to compare.

A good example is the following. In 1996, we had an industry-first implementation of a complete USB slave (PHY+HUB); Philips was the customer. At one point, Intel released a reference design of the PHY part, and we compared.

Their design was written in, let's say, OpenSPARC style, and had 30+ modules with low level, incomprehensible code. It synthesized to 4000+ gates. Ours had just 3 modules with clear RTL code and synthesized to around 2500 gates.

Small design of course, but that is the trend. We have seen it confirmed on a few other comparison occasions, and there is every indication that things only get worse for larger designs.

Jan

Reply to
Jan Decaluwe

Interesting story.

Some organizations have huge monetary and cultural commitments to certain classical CAE point tools that defy rational discussion.

-- Mike Treseler

Reply to
Mike Treseler

Yes, I agree with you that it is easy to do this particular transformation with your style. However, it's not enough to convince me, as this transformation can also be easily done with the proper RTL style, if not easier. With one or two lines of regular expression code/shell script, I can mine through the entire source code and generate a big initial block, which I conveniently insert in the testbench.
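For instance (purely illustrative, with made-up register paths), the generated block might look like:

```verilog
// Testbench fragment: randomize flip-flop power-up state. A small
// script scans the RTL for 'reg' declarations and emits one
// hierarchical assignment per flop; the paths below are invented.
initial begin
  dut.ctrl.state   = $random;
  dut.dpath.result = $random;
  dut.dpath.acc_q  = $random;
end
```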

To me, portability is greater with more abstraction in the code, because it gives synthesis tools greater freedom to implement the logic.

Reply to
Jason Zheng

haven't had enough experience

And here is what we use in our coding standard:

wire [31:0] a;
wire [31:0] b;
reg  [32:0] result;
reg  [32:0] next_result;

always @(*) begin
  next_result = a + b;
end

always @(posedge clk)
  result[32:0] <= next_result;

Reply to
J o h n _ E a t o n (at) hp . com (no spaces)

Precisely. Anyone who has done FPGA CPU design knows how limiting FPGAs can be, i.e. 20-120 MHz is typical for unfriendly architectures.

If you're not in the very small club of Intel, AMD, and IBM, then even full custom is pretty limiting given the extreme expense of it all. The top tier may still be directly instantiating transistors as well as flops. Transistor-level design can still significantly outperform standard-cell logic using a variety of mostly-NMOS differential techniques, I guess by a factor of 3. At Sun's level, they are much closer to full standard cell with synthesis, at a fraction of the clock of P4s, but they make up for it by going to massive threading and latency hiding to bring out the throughput.

There's the clue. The same can be done in an FPGA CPU with a multithreaded architecture, simplifying the design so that you are not limited by 32-bit carry ripples. In my Transputer design I was seeing 300 MHz on the PEs, because it could use 2 clocks per basic opcode and 8 clocks for 4 thread instructions; a lot of cycle-limiting logic just vanishes, i.e. no hazard logic or register-forwarding paths. The hardware design of the MMU hasn't started, so there is nothing to release.

For information, my PE used 500 LUTs, 1 BlockRAM, and a few hundred lines of RTL Verilog. Given that a V4 can hold up to 554 BlockRAMs, I could instance quite a few of these PEs too. In some ways it is quite similar to the Niagara/SPARC: what's the difference between sliding register files with stack spilling and register files in memory, cached on demand (process swapped) into register caches, as the T9000 did?

If I wanted to see a Niagara core in an FPGA, I think I would go back to the SPARC architecture documents (and maybe LEON) and see if a threaded design could be done from scratch that executes the ISA, but possibly make some very different choices so the FPGA version wouldn't get crippled. I wouldn't be constrained to 1 opcode per clock either; using more clocks lowers PE performance per clock, but allows a much faster clock and much less logic, so more PE cores.

I am surprised that we haven't seen a lot more native FPGA MTA designs, though.

John Jakson Transputer guy

Reply to
JJ

I would have to disagree about hand crafted code going away.

I believe Intel, IBM, Sun, AMD, Cisco, and even HP all still use RTL-based hand-coded design methodologies for almost all of their designs, as far as I am aware.

IP reuse models cut down on the actual number of lines of code written by leveraging pre-built high-level libraries with FIFOs, memories, register files, and other constructs. SoC vendors are also doing less and less design, as more of the code they produce comes in the form of pre-verified IP, and their job moves from verification of the individual components to verification of the integrated system.

I don't see, however, "what's next" as far as getting us away from hand-written code. We can save time by buying others' hand-written code, but what other tools are you aware of that would take the place of Verilog?

-Art

Reply to
Art Stamness

Yes, there is a lot of legacy code out there, and a lot of design houses are still hand-coding more. They are also finding it harder and harder to support this code as chips grow and processes shrink. You start finding teams that are afraid to touch blocks that have been proven in silicon for fear of introducing bugs. One of our standard practices is to put in a new component but leave in the old version, and give firmware the ability to select which one is used at run time.

Creating the design is 20% of the effort; verification is the other 80%. The industry is heading toward systems where you specify the desired behaviour and create the design and the testbench at the same time, from the same source. This goes well beyond providing a richer assortment of register files. We are advancing from "assemblers" to "compilers".

But thats only the start. You want to design a component once and reuse it over and over again. Each user should be able to reconfigure your component's parameters for their exact needs and your tool set must be able to rebuild both the rtl code and the testbench for these new parameters.

Furthermore, the user will want to use this newly configured component as a building block of another component in their chip. The tool set must be able to "plug and play" the RTL code into the chip. This also involves building a chip-level testbench and documentation from all of the included components in a fast and efficient manner.

Each stage should be able to verify that everything below it still works, and you can't spend weeks of engineering time fixing sims and hand-placing components to reach that point. The methodologies must be able to handle it.

John Eaton

Reply to
J o h n _ E a t o n (at) hp . com (no spaces)

I agree that 80% of the effort is in verification. That is why verification tools have improved so dramatically over the last decade, while RTL design tools haven't really changed all that much.

Vera, Specman, and now SystemVerilog Testbench are all nicer, higher-level ways of verifying designs. Constrained-random verification gets more bang for the buck out of your simulation dollar. They are still costly in runtime (read: slow), but are definitely an improvement over the old-style hand-coded testbenches.

Assertion-based verification with OVA, OVL, PSL, and SVA is maturing and becoming more mainstream, and this will definitely help with the reuse of IP. However, the cost of use is high, because this verification runs continuously in your simulation. In fact, just a few temporal assertions added to an interface can halve your runtime performance (from my experience on a 2M-gate microprocessor core). Static assertions are clearly more performance-friendly, but much more limited in their checking.

I know companies like Real Intent have "automatic formal" tools, which extract assertions and formally prove whether scenarios exist in which illegal behavior can be expressed. This also seems like a good thing; unfortunately, I have found that this type of work is typically done by the RTL designer, and it increases his workload rather than decreasing it.

And the holy grail of a "spec -> RTL" tool still doesn't exist, though I agree we have more tools to get us closer. I still don't see any way of describing large ASICs other than hierarchical hand-coded RTL. As for the tools that try to make verification easier, I haven't seen one yet that actually implements your logic, because usually the designer is the one who knows what to optimize for: performance, speed, area, cost, manufacturability, testability . . .

So, at the end of the day, I don't think the job of the RTL designer is going away any time soon. In fact, most of these tools still require more code to be written to describe the correct behavior. I don't think this technology is driving RTL designers out of work.

-Art

Reply to
Art Stamness

I think you're being overly generous to Sun here.

I think this is pretty well known, though no less true. However, as Amdahl put it, "What would you rather use to plow a field? Two oxen or a thousand chickens?" In your world things are of course different, as you're coming from a paradigm of many, many threads. The rest of the world, however, is only slooowly moving to multiple threads.

It is interesting though that by giving up half the speed on single thread performance, you can gain 3-4 times the throughput for free. I'll definitely play with that.

In addition to what I mentioned, there's surely more inertia issues and the complication of multi-threaded software (assuming you can even take advantage of it).

My $0.01 Tommy

Reply to
Tommy Thorn

I've worked on FPGA-based NPs. Multithreading is a no-brainer for this case: each packet can be treated as a separate thread. With enough buffering it's feasible to have 16 threads, which conveniently matches the depth of Xilinx SRL16s. A simple micro-engine will run at 200 MHz with this technique, and ends up being the same physical size as the single-threaded version.

Reply to
Joseph H Allen


Yes, the networking, communications, and DSP industries are thoroughly into the simple ideas of timesharing, latency hiding, etc., and that's where I have been since leaving Inmos 20 years ago. I also use those SRL16s to keep 4 sets of instruction-fetch state over 8 clocks, so that variable-length opcodes can interleave without confusion. Without those, I'd be looking at 50% more LUTs per PE.

As a Transputer person I want as many threads as possible in a thread pool, which can then be effectively allocated on demand to the concurrent language threads. When threads go idle, I would push all threads onto busy PEs and shut down fully idle PEs, so power consumption follows work done. In the Niagara case the goals are different: continuous threaded server loads. One thing I did realize is that after doing all the MTA and MMU work, any old instruction set could be used, even that damned x86, but since this is an FPGA design, it's still better to tune for that and go KISS at speed.

It's really a question of trading the single-threaded Memory Wall problem for a Thread Wall problem: a problem for single-threaded C guys, not so for the CSP people out there. Amdahl's law has done more to set back parallel computing than who knows what. If it's serial, then it's serial, but there is usually room to mix seq & par at many levels. Even if a task has only 2 or 3 threads, MTA still comes out ahead on hardware cost.

The paper I gave on this Transputer design at CPA2005 last September is finally available at wotug.org for anyone that's interested.

Regards

John Jakson

PS Perhaps one day somebody out there could make a nano CC-size FPGA card with some RLDRAM on it as a TRAM replacement.

Reply to
JJ

Yes. This works fine. If others don't use it, that's to your advantage. Note if you name the block, you can declare your wire inside it:

always @(posedge clk) begin : my_block
  reg q; // q is a wire
end

-- Mike Treseler

Reply to
Mike Treseler
