Optimizations, How Much and When?

Question

My projects typically are implemented in small devices and so are often space constrained. So I am in the habit of optimizing my code for implemented size. I've learned techniques that minimize size and tend to use them everywhere rather than following the 20/80 rule of optimizing the 20% of the code where you get 80% of the benefit, or 10/90 or whatever your favorite numbers are.

I remember interviewing at a Japanese-based company once where they worked differently. They were designing large projects and felt it was counterproductive to worry about optimizations of any sort. They wanted fast turnaround on their projects and so just paid more for larger and faster parts, I suppose. In fact, in the interview I was asked my opinion and gave it. A lead engineer responded with a mini-lecture about how that was not productive in their work. I responded with my take, which was that once a few optimal coding techniques were learned, it was not time consuming to make significant gains in logic size, and that these same techniques provided a consistent coding style which allowed faster debug times. Turns out a reply was not expected and in fact went against a cultural difference, resulting in no offer from this company.

I'm wondering where others' opinions and practice fall in this area. I assume consistent coding practices are always encouraged. How often do these practices include techniques to minimize size or increase speed? What techniques are used to improve debugging?

pault.eg · Accepted Answer

> I assume consistent coding practices are always encouraged. How often do these practices include techniques to minimize size or increase speed? What techniques are used to improve debugging?

I size optimised this [link] by leaving a few things out, using an 8-bit ALU rather than a 16-bit ALU, and matching the register file to the machxo3 FPGA architecture so that it was a good fit for the small FPGAs in that series. But that's not work related. That's from a small personal interest in small micros in small FPGAs that can be used with GCC :)

Otherwise I've been lucky enough not to have to worry too much about optimising for size.

David Brown · Answer

This seems a little mixed up. The "20/80" (or whatever) rule is for /speed/ optimisation of software code. The principle is that if your program spends most of its time in 10% or 20% of the code, then that is the only part you need to optimise for speed. The rest of the code can be optimised for size. Optimising only 10% or 20% of your code for size is pointless.

I can only really tell you about optimisation for software designs, rather than for programmable logic, but some things could be equally applicable.

First, you have to distinguish between different types of "optimisation". The word simply means that you have a strong focus on one aspect of the development, nothing more. You can optimise for code speed, development speed, power requirements, safety, flexibility, or any one of dozens of different aspects. Some of these are run-time (like code speed), some are development time (like ease of debugging). You rarely really want to optimise, you just want to...

Rick C · Answer

Lol! I guess you don't really understand HDL.

Yes, that much we understand.

There is your first mistake. Optimizing code can involve tradeoffs between size, speed, power consumption, etc., but can also involve finding ways to improve multiple aspects without tradeoffs.

If you are going to talk about a tradeoff between code development time (thinking) and other parametrics, then you are simply describing the process of optimization. Duh!

Not at all rare that speed optimizations can create issues in other areas. So that's your second mistake.

Which is the basis of the 20/80 rule. Don't spend time optimizing code that isn't going to give a good return.

I think you didn't really read my original post where I mentioned using optimization techniques consistently in my coding as a matter of habit. Rather than brute force an algorithm into code in the most direct way, I try to visualize the hardware that will implement the task and then use HDL to...

Rick C · Answer

On Sunday, January 5, 2020 at 7:40:21 AM UTC-5, [...] wrote:

> ... using an 8-bit ALU rather than a 16-bit ALU, and matching the register file to the machxo3 FPGA architecture so that it was a good fit for the small FPGAs in that series. But that's not work related. That's from a small personal interest in small micros in small FPGAs that can be used with GCC :)

Interesting effort. I'm surprised the result is so small. I'm also surprised the Cyclone V result is smaller than the Artix 7 result. Any idea why the register count varies? Usually the register count is fixed by the code. Did the tools use register splitting for speed?

Does it take a lot of cycles to run code?

David Brown · Answer

My experience with HDL is small and outdated, but not non-existent.

In software, at least in a language like C, you can very roughly approximate that the size of your C files corresponds to the size of the code. So if you pick 20% of your code and reduce its size by half, you have only reduced your overall code size by 10%. If you want to reduce the size of the total code by a meaningful amount, you have to look at most of the code.

Is HDL design so different in your experience? Do you typically find that a mere 10% or 20% of the design accounts for the solid majority of the space?

If you were talking about optimising for speed, then I would understand better - I can easily imagine that the maximum frequency for a design is limited by a few bottleneck points in the code, just as it often is in sequential software.

You seem to want to argue, rather than discuss. I agree that working to improve one aspect (whether you call it "optimising" or not) can often improve other...
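The size arithmetic in this post can be checked with a couple of lines. This is only a sketch of the rough model stated above (total size is the sum of the parts, so savings scale with the fraction of the code touched); the function name `overall_size_reduction` is made up for illustration.

```python
def overall_size_reduction(fraction_touched: float, local_reduction: float) -> float:
    """Fraction of the total size saved when `fraction_touched` of the code
    is shrunk by `local_reduction` (both given as values between 0 and 1).

    Assumes the rough model from the post: size adds up linearly, so the
    untouched part of the code stays exactly as large as before.
    """
    if not (0.0 <= fraction_touched <= 1.0 and 0.0 <= local_reduction <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return fraction_touched * local_reduction


# Halving 20% of the code saves 10% overall.
print(f"{overall_size_reduction(0.20, 0.50):.0%}")

# Even deleting that 20% entirely can never save more than 20%.
print(f"{overall_size_reduction(0.20, 1.00):.0%}")
```

Under this model the fraction touched is a hard ceiling on the saving, which is why the question of how much of the final design that 20% really accounts for matters.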

Rick C · Answer

Yes, in that 20% of the code is something that CAN be optimized for size. Much of the code is straightforward and will produce what it produces with no leeway.

Sometimes, but it's not like sequentially executing code where the times all add up and so much of the time is spent doing a lot of little things and a few longer things. In HDL nearly everything is in parallel. So picture a histogram with a spread of time delays. The longest delay is the clock speed limiting path. Solve that and you bring the clock cycle in a small amount to the next entry in the histogram. Lather, rinse, repeat. As you speed up more of the long paths you will find you have to improve a larger number of paths to get the same clock time improvement, resulting in the same sort of 80/20 rule, but a bit different. It's more like you can address 20% of the delay but then 80% remains intractable. So not really the same thing. It's a lot of work no matter what... unless the delays are...
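The histogram picture can be sketched numerically. The path delays below are invented purely for illustration (they are not from any real timing report), and the helper names are hypothetical; the point is only that the slowest path sets the clock period, and that each tighter target pulls more paths into the set that has to be reworked.

```python
def clock_period_ns(path_delays_ns):
    """The slowest path in the design sets the minimum clock period."""
    return max(path_delays_ns)


def paths_over_target(path_delays_ns, target_period_ns):
    """Count the paths that must be sped up to meet a target clock period."""
    return sum(1 for delay in path_delays_ns if delay > target_period_ns)


# Hypothetical path delays in nanoseconds: one long outlier and a
# crowd of paths bunched just below it.
delays_ns = [10.0, 9.6, 9.5, 9.3, 9.2, 9.2, 9.1, 9.0, 9.0, 9.0, 8.9, 8.9]

print(f"current period: {clock_period_ns(delays_ns)} ns")
for target_ns in (9.8, 9.4, 9.15, 8.95):
    count = paths_over_target(delays_ns, target_ns)
    print(f"target {target_ns} ns: {count} path(s) to improve")
```

With these made-up numbers the count of paths to rework grows from 1 to 3 to 6 to 10 as the target tightens, even though each step buys a smaller improvement in clock period than the one before.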

Theo · Answer

David Brown  wrote:

David Brown · Answer

On 06/01/2020 00:42, Theo wrote:

David Brown · Answer

That is not answering my question. I can well understand that only a small proportion of the code offers much scope for size optimisation - the same is true of sequential software. And in both sequential software and HDL design, a substantial part of the bulk is usually taken by libraries or pre-generated blocks of some kind, which offer little scope for developer optimisation - just basic compiler flags for "optimise for speed" or "optimise for size".

What I asked, was if this 20% of the code that you /can/ optimise for space accounts for a solid majority of the final placed and routed binary? If it does not - as I suspect is the case in most designs - then you can never make more than a dent in the total space no matter how you improve that 20%.

I am not saying that you shouldn't write that 20% as best you can, nor do I disagree that doing so will likely give many other benefits. What I am saying is that optimising it primarily for size is not often a meaningful aim in itself....

pault.eg · Answer

> I'm also surprised the Cyclone V result is smaller than the Artix 7 result. Any idea why the register count varies? Usually the register count is fixed by the code. Did the tools use register splitting for speed?

A state machine is a significant part of the design, so I expect that to be the main reason for differences in register count, depending on how the implementation tools decide to encode the state machine. The number of LUTs and registers also vary depending on tool optimisation settings. The register and LUT counts on the s430 page are for default tool settings.
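To illustrate why the encoding choice alone can move the register count, here is a small sketch comparing the flip-flops needed for the state register under two common encodings. The state count of 12 is an arbitrary example, not the actual number of states in the s430, and the function name is hypothetical. Real tools may pick other encodings as well.

```python
def state_register_bits(num_states: int, encoding: str) -> int:
    """Flip-flops needed to hold the state of an FSM with `num_states` states.

    "binary" packs the state into the fewest bits; "one_hot" spends one
    flip-flop per state in exchange for simpler decode logic.
    """
    if num_states < 1:
        raise ValueError("an FSM needs at least one state")
    if encoding == "one_hot":
        return num_states
    if encoding == "binary":
        # Smallest n with 2**n >= num_states, computed in integer arithmetic.
        return max(1, (num_states - 1).bit_length())
    raise ValueError(f"unknown encoding: {encoding!r}")


# Hypothetical 12-state controller: same source code, different register count.
for encoding in ("binary", "one_hot"):
    print(encoding, state_register_bits(12, encoding))
```

So two tools compiling identical source can legitimately report different register totals if their default state machine encodings differ.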

Yes it does, see the cycle count for a selection of instructions on the s430 page. The design is not a pipelined processor design, which saves logic of course but hurts performance. When I designed it I very roughly aimed at less than 50% resources in the ~1200 LUT/FF machxo3 devices for low power processing tasks with gcc.

Part of the reason it's small is that it doesn't have the 16-bit ALU, but then two clocks are required instead of one for an ALU operation. I seem to recall reading somewhere that some of the earlier Z80s had 4-bit ALU's rather than 8 to save on space, so I think that's been done for a while now :).
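The two-clock trade can be modelled in software as two passes through an 8-bit adder with the carry handed from the low byte to the high byte. This is a behavioural sketch only, with hypothetical function names; it is not taken from the s430 source and says nothing about how that design actually sequences its ALU.

```python
def alu8_add(a: int, b: int, carry_in: int):
    """One pass through an 8-bit adder: returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    return total & 0xFF, total >> 8


def add16_in_two_passes(x: int, y: int):
    """16-bit add built from two 8-bit passes, one per clock in hardware."""
    x &= 0xFFFF
    y &= 0xFFFF
    # Pass 1: low bytes, no carry in.
    low, carry = alu8_add(x & 0xFF, y & 0xFF, 0)
    # Pass 2: high bytes, consuming the carry saved from pass 1.
    high, carry = alu8_add(x >> 8, y >> 8, carry)
    return (high << 8) | low, carry


result, carry_out = add16_in_two_passes(0x12FF, 0x0001)
print(hex(result), carry_out)
```

The adder is half as wide, at the cost of a second cycle and a flip-flop to hold the carry between passes.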

Interestingly on Github there is a NEO430 project that someone else has designed, and the end results there are not too dissimilar to what I've got, noting the optimisations I used.

Rick C · Answer

On Monday, January 6, 2020 at 1:34:47 PM UTC-5, [...] wrote:

> ... surprised the Cyclone V result is smaller than the Artix 7 result. Any idea why the register count varies? Usually the register count is fixed by the code. Did the tools use register splitting for speed?
> ... be the main reason for differences in register count, depending on how the implementation tools decide to encode the state machine. The number of LUTs and registers also vary depending on tool optimisation settings. The register and LUT counts on the s430 page are for default tool settings.
> ... s430 page. The design is not a pipelined processor design, which saves logic of course but hurts performance. When I designed it I very roughly aimed at less than 50% resources in the ~1200 LUT/FF machxo3 devices for low power processing tasks with gcc.
> ... then two clocks are required instead of one for an ALU operation. I seem to recall reading somewhere that some of the earlier Z80s had 4-bit ALU's rather than 8 to save on space, so...
