Optimizations, How Much and When?

My projects typically are implemented in small devices and so are often space constrained. So I am in the habit of optimizing my code for implemented size. I've learned techniques that minimize size and tend to use them everywhere, rather than following the 20/80 rule of optimizing the 20% of the code where you get 80% of the benefit, or 10/90 or whatever your favorite numbers are.

I remember interviewing at a Japanese-based company once where they worked differently. They were designing large projects and felt it was counterproductive to worry about optimizations of any sort. They wanted fast turnaround on their projects and so just paid more for larger and faster parts, I suppose. In fact, in the interview I was asked my opinion and gave it. A lead engineer responded with a mini-lecture about how that was not productive in their work. I responded with my take, which was that once a few optimal coding techniques were learned, it was not time consuming to make significant gains in logic size, and that these same techniques provided a consistent coding style which allowed faster debug times. Turns out a reply was not expected and, in fact, went against a cultural difference, resulting in no offer from this company.

I'm wondering where others' opinions and practice fall in this area. I assume consistent coding practices are always encouraged. How often do these practices include techniques to minimize size or increase speed? What techniques are used to improve debugging?

--

  Rick C. 

  - Get 1,000 miles of free Supercharging 
  - Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

> I assume consistent coding practices are always encouraged. How often do these practices include techniques to minimize size or increase speed? What techniques are used to improve debugging?

I size optimised this

[formatting link]

by leaving a few things out, using an 8-bit ALU rather than a 16-bit ALU, and matching the register file to the machxo3 FPGA architecture so that it was a good fit for the small FPGAs in that series. But that's not work related. That's from a small personal interest in small micros in small FPGAs that can be used with GCC :)

Otherwise I've been lucky enough not to have to worry too much about optimising for size.

Reply to
pault.eg

This seems a little mixed up. The "20/80" (or whatever) rule is for /speed/ optimisation of software code. The principle is that if your program spends most of its time in 10% or 20% of the code, then that is the only part you need to optimise for speed. The rest of the code can be optimised for size. Optimising only 10% or 20% of your code for size is pointless.

I can only really tell you about optimisation for software designs, rather than for programmable logic, but some things could be equally applicable.

First, you have to distinguish between different types of "optimisation". The word simply means that you have a strong focus on one aspect of the development, nothing more. You can optimise for code speed, development speed, power requirements, safety, flexibility, or any one of dozens of different aspects. Some of these are run-time (like code speed), some are development time (like ease of debugging). You rarely really want to optimise; you just want to prioritise how you balance the different aspects of the development.

There are also many ways of considering optimisation for any one aspect. Typically, there are some things that involve lots of work, some things that involve knowledge and experience, and some things that can be automated.

If we take the simple case of "code speed" optimisations, there are perhaps three main possibilities.

  1. You can make a point of writing code that runs quickly. This is a matter of ability, knowledge and experience for the programmer. The programmer knows when to use data of different types, based on what will work well on the target device. He/she knows what algorithms make sense. He/she knows when to use multi-tasking and when to use a single threaded system - and so on. There is rarely any disadvantage in doing this sort of thing, unless it becomes /too/ smart - then it can lead to maintainability issues if the original developer is not available.
  2. You can enable compiler optimisations. This is usually a cheap step - it's just a compiler flag. Aggressive optimisations can make code harder to debug, but enabling basic optimisations typically makes it easier. It also improves static analysis and warnings, which is always a good thing. But there can be problems if the developers are not sufficiently trained or experienced, and tend to write "it worked when I tried it" code rather than knowing that their code is correct.

  3. You can do a lot of work measuring performance and investigating different ways of handling the task at hand. This can lead to the biggest gains in speed - but also takes the most time and effort for developers.

I expect the optimisations you are thinking of for programmable logic follow a similar pattern.

And I think a fair amount of apparent disagreements about "optimisation" comes mainly from a misunderstanding about types of optimisation, and which type is under discussion.

Reply to
David Brown

Lol! I guess you don't really understand HDL.

Yes, that much we understand.

There is your first mistake. Optimizing code can involve tradeoffs between size, speed, power consumption, etc., but can also involve finding ways to improve multiple aspects without tradeoffs.

If you are going to talk about a tradeoff between code development time (thinking) and other parametrics, then you are simply describing the process of optimization. Duh!

It's not at all rare that speed optimizations create issues in other areas. So that's your second mistake.

Which is the basis of the 20/80 rule. Don't spend time optimizing code that isn't going to give a good return.

I think you didn't really read my original post where I mentioned using optimization techniques consistently in my coding as a matter of habit. Rather than brute force an algorithm into code in the most direct way, I try to visualize the hardware that will implement the task and then use HDL to describe that hardware. The alternative is to code the algorithm directly and let the tools try to sort it out in an efficient manner, which often fails because they are constrained to implement exactly what you have coded.

Once I couldn't figure out why I was getting two adder chains when I expected one chain with a carry output. It turned out I had made some tiny distinction between the two pieces of code, so the adders were not identical. So now my habit is to code the actual adder in a separate line of code outside the process that is using it, ensuring it is the same adder for both uses of the results.
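
In VHDL that habit looks something like this - a minimal sketch with invented entity and signal names, not the code from that project. The adder is written once, as a concurrent statement outside the process, and every consumer reads the same signal:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical example: one adder shared by the sum and carry consumers.
    entity shared_adder is
      port (
        clk  : in  std_logic;
        a, b : in  unsigned(7 downto 0);
        sum  : out unsigned(7 downto 0);
        cout : out std_logic
      );
    end entity;

    architecture rtl of shared_adder is
      -- One bit wider than the operands, so the carry-out falls out of
      -- the same addition instead of a second, subtly different adder.
      signal full_sum : unsigned(8 downto 0);
    begin
      -- The single adder, written once, outside any process.
      full_sum <= ('0' & a) + ('0' & b);

      process (clk)
      begin
        if rising_edge(clk) then
          -- Both consumers read the same full_sum, so synthesis cannot
          -- build two slightly different adder chains.
          sum  <= full_sum(7 downto 0);
          cout <= full_sum(8);
        end if;
      end process;
    end architecture;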

This is why it's still the 20/80 rule, since there are large sections of code that don't have much to gain from optimization. But a halving of the size is significant for the sections of code that can benefit.

Reply to
Rick C

> I assume consistent coding practices are always encouraged. How often do these practices include techniques to minimize size or increase speed? What techniques are used to improve debugging?

> I size optimised this by leaving a few things out, using an 8-bit ALU rather than a 16-bit ALU, and matching the register file to the machxo3 FPGA architecture so that it was a good fit for the small FPGAs in that series. But that's not work related. That's from a small personal interest in small micros in small FPGAs that can be used with GCC :)

> Otherwise I've been lucky enough not to have to worry too much about optimising for size.

Interesting effort. I'm surprised the result is so small. I'm also surprised the Cyclone V result is smaller than the Artix 7 result. Any idea why the register count varies? Usually the register count is fixed by the code. Did the tools use register splitting for speed?

Does it take a lot of cycles to run code?

Reply to
Rick C

My experience with HDL is small and outdated, but not non-existent.

In software, at least in a language like C, you can very roughly approximate that the size of your C files corresponds to the size of the code. So if you pick 20% of your code and reduce its size by half, you have only reduced your overall code size by 10%. If you want to reduce the size of the total code by a meaningful amount, you have to look at most of the code.

Is HDL design so different in your experience? Do you typically find that a mere 10% or 20% of the design accounts for the solid majority of the space?

If you were talking about optimising for speed, then I would understand better - I can easily imagine that the maximum frequency for a design is limited by a few bottleneck points in the code, just as it often is in sequential software.

You seem to want to argue, rather than discuss. I agree that working to improve one aspect (whether you call it "optimising" or not) can often improve other parts too. It is less common that there are /no/ tradeoffs, but certainly common that it is worth the cost.

Again, what's with this "mistake" nonsense? Or are you really saying that, unlike in sequential software development, when you do hardware development you find that making one part of your system fast breaks other parts? That sounds counter-intuitive to me, and does not match my limited HDL experience, but you have far more HDL experience than I.

Yes, I know.

Yes, I read it. That sounded very like the "type 1" optimisations I mentioned above. But the description of the Japanese company sounded like "type 2" and "type 3" optimisations. And it sounded like neither you nor that company understood the differences. (Note that I write here "it sounded like".)

I would expect that methodology to work well sometimes, but not always.

Again, let me compare this to the sequential software world - and you can tell me if this does not apply in HDL. When you are dealing with more limited compilers, it can often give the most efficient final results if you try to imagine the generated assembly, and try to arrange source code that would give that assembly. So people would use pointer arithmetic instead of arrays, and write "x << 2" rather than "x * 4".

Reply to
David Brown

Yes, in that 20% of the code is something that CAN be optimized for size. Much of the code is straightforward and will produce what it produces with no leeway.

Sometimes, but it's not like sequentially executing code, where the times all add up and so much of the time is spent doing a lot of little things and a few longer things. In HDL nearly everything is in parallel. So picture a histogram with a spread of time delays. The longest delay is the clock speed limiting path. Solve that and you bring the clock cycle in a small amount, to the next entry in the histogram. Lather, rinse, repeat. As you speed up more of the long paths you will find you have to improve a larger number of paths to get the same clock time improvement, resulting in the same sort of 80/20 rule, but a bit different. It's more like you can address 20% of the delay but then 80% remains intractable. So not really the same thing. It's a lot of work no matter what... unless the delays are because of poor design, which is always possible.

My original post wasn't really about optimizing after code was written. It was about coding styles that achieve a level of optimization across the various parameters before reviewing the results. So maybe that is why you feel like I am arguing with you. We aren't really discussing the same thing. You acknowledge you have little experience in HDL yet seem to be forcing a match to your experience in sequential coding. Try going with the flow here and learn rather than trying to teach something you don't know.

Perhaps I didn't understand your point. Rereading it, I can't say I really get your point other than the fact that speed optimizations can be complicated and hard to maintain. That is also true in HDL: code gets worked and reworked to find optimizations, which can be nebulous and fragile. Logic is always "optimized" by the compiler. Logic is not always simple and straightforward in a single assignment. In a complex CASE statement with conditionals, a simple assignment can produce more complex logic than the writer can appreciate. A small change can improve one thing and make another complex and slow. I use complex conditionals like CASE and nested IFs, but if I want fast logic I separate that out and make it part of the 20% I'm specifically optimizing (which is not the same as my original topic here).

I recall a CPU design I did where I targeted certain enables as fast decodes and put them in a separate module.
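
The general shape of that, as a rough sketch (the entity, address map and widths here are invented, not the actual CPU design): the critical enables become flat concurrent assignments in their own unit, instead of branches buried in the main state machine's CASE:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical example: timing-critical enables kept as shallow logic.
    entity fast_decode is
      port (
        addr   : in  std_logic_vector(15 downto 0);
        ram_en : out std_logic;
        io_en  : out std_logic
      );
    end entity;

    architecture rtl of fast_decode is
    begin
      -- Each enable examines only a few address bits, so it maps to a
      -- single shallow LUT instead of falling out of a wide CASE.
      ram_en <= '1' when addr(15 downto 14) = "00"   else '0';
      io_en  <= '1' when addr(15 downto 12) = "1111" else '0';
    end architecture;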

The Japanese company didn't do optimizations at all. That was the part that surprised me. I tried to discuss it with them but the guy had what we would call an attitude about it. Clearly there was an issue with culture, where either I did not phrase my statements properly or I was simply not supposed to question the guy at all.

I don't think there are type 2 optimizations so much, but then I haven't done much with the tools in years. Maybe they have added levels of optimizations, but I don't think the same things apply to HDL as sequential coding. It's more a matter of payment if anything. You pay for specific features in the tools which allow better results. I don't know for sure; I've always used the free tools. I also mostly work on smaller tasks where I can use the exploratory tools to see what my results are and see if they are matching my intent. That is the part the Japanese company didn't want to do.

Reply to
Rick C


That is not answering my question. I can well understand that only a small proportion of the code offers much scope for size optimisation - the same is true of sequential software. And in both sequential software and HDL design, a substantial part of the bulk is usually taken by libraries or pre-generated blocks of some kind, which offer little scope for developer optimisation - just basic compiler flags for "optimise for speed" or "optimise for size".

What I asked was whether this 20% of the code that you /can/ optimise for space accounts for a solid majority of the final placed and routed binary. If it does not - as I suspect is the case in most designs - then you can never make more than a dent in the total space no matter how you improve that 20%.

I am not saying that you shouldn't write that 20% as best you can, nor do I disagree that doing so will likely give many other benefits. What I am saying is that optimising it primarily for size is not often a meaningful aim in itself.

Yes, I appreciate that.

It is not hugely different for sequential software, and follows a similar pattern - it is often just a small part of the code that limits the speed of the system. There are differences in the details, of course - with HDL it doesn't matter how much you improve the speed of a part that is not the limiting path. (Software can often consist of multiple threads in parallel, which can make it a little more like HDL here - though obviously the scale of the parallelism is very different.)

That is, I think, very much my point. "Optimisation" can mean so many things, and it is vital to understand the differences, and make it clear what you mean by it.

I see this as a discussion, not "teaching" - I am sharing experiences and thoughts, and trying to provoke other thoughts and suggesting other viewpoints. It is not about being right or wrong, or teaching and learning, it is about swapping thoughts and experiences. It is about looking at analogies from related fields - there is a lot to be learned by looking across sequential software development, HDL development, and electronics hardware development.

I have been trying to describe different meanings of "optimisation" - I don't think it makes sense to talk about how much optimisation you should do until it is established which kind of optimisation you are talking about. (I have a much better idea of this now than from your original post - so in that way, the discussion has been helpful.)

Yes, I understand that. This is perhaps one way in which HDL (at least, with Verilog or VHDL) can be harder than sequential software - it is really easy to get large effects from small changes to the code in things like a complex CASE statement. In particular, you can very easily create a latch unintentionally.
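
A minimal VHDL illustration of that latch trap (names invented): in a combinational process, any path that skips an assignment to the output infers a latch, and a default assignment at the top of the process avoids it:

    library ieee;
    use ieee.std_logic_1164.all;

    entity latch_trap is
      port (
        sel : in  std_logic_vector(1 downto 0);
        d   : in  std_logic;
        y   : out std_logic
      );
    end entity;

    architecture rtl of latch_trap is
    begin
      process (sel, d)
      begin
        y <= '0';  -- default assignment: every path now drives y
        case sel is
          when "00"   => y <= d;
          when "01"   => y <= not d;
          when others => null;  -- without the default above, this branch
                                -- would leave y unassigned and infer a latch
        end case;
      end process;
    end architecture;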

The solution to this kind of thing is the same in sequential software and HDL - modularisation, factoring out common code, and using higher-level features or higher-level languages.

(Higher level HDLs, like MyHDL, Spinal, Confluence, etc. invariably make a point of making it very clear when you are generating synchronous or combinatorial logic, as far as I have seen.)

Going the other way, one thing that is easy to do in sequential software (in most languages), but hard to do in HDL, is have multiple places where you assign to the same variable. This is often a source of errors in sequential code - and thinking of your variables like hardware signals or registers that are fed from one place and read from many, can give you much better code.

My guess would be a bit of both - you could easily have been meaning different things when talking about optimisations. And there can certainly be cultural differences - Japanese society is a lot more strictly hierarchical than is usually found in small Western companies.

Certainly it seems common practice for the paid-for versions to have more features. I haven't kept up with FPGA tools for a long time, but I know it was common to have things like parallel place and/or route in the paid versions - leading to faster build times. Whether they also had more or less optimisations, I don't know. But I do remember a certain amount of flags and options for logic generation that could be considered optimisations. However, it is perhaps not really equivalent to optimisation flags in sequential programming tools.

(In the software world, you sometimes get tools where you have to pay, or pay more, for an optimising compiler.)

Reply to
David Brown

> I'm also surprised the Cyclone V result is smaller than the Artix 7 result. Any idea why the register count varies? Usually the register count is fixed by the code. Did the tools use register splitting for speed?

A state machine is a significant part of the design, so I expect that to be the main reason for differences in register count, depending on how the implementation tools decide to encode the state machine. The number of LUTs and registers also varies depending on tool optimisation settings. The register and LUT counts on the s430 page are for default tool settings.
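
A minimal sketch of that effect (an invented example, not the s430 source): an enumerated state type leaves the encoding to the tools, so the same four states might cost two flip-flops (binary) on one family and four (one-hot) on another, which is one way identical code gives different register counts. Most tools also accept an attribute to force the encoding - I believe Vivado's is fsm_encoding and Synplify's is syn_encoding, though the names vary by tool.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical example: the tool is free to pick the state encoding.
    entity fsm_demo is
      port (
        clk, rst, go, done : in  std_logic;
        busy               : out std_logic
      );
    end entity;

    architecture rtl of fsm_demo is
      -- Four states: binary encoding needs 2 registers, one-hot needs 4.
      type state_t is (IDLE, FETCH, EXEC, WRITEBACK);
      signal state : state_t := IDLE;
    begin
      busy <= '0' when state = IDLE else '1';

      process (clk)
      begin
        if rising_edge(clk) then
          if rst = '1' then
            state <= IDLE;
          else
            case state is
              when IDLE      => if go = '1' then state <= FETCH; end if;
              when FETCH     => state <= EXEC;
              when EXEC      => if done = '1' then state <= WRITEBACK; end if;
              when WRITEBACK => state <= IDLE;
            end case;
          end if;
        end if;
      end process;
    end architecture;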

> Does it take a lot of cycles to run code?

Yes it does, see the cycle count for a selection of instructions on the s430 page. The design is not a pipelined processor design, which saves logic of course but hurts performance. When I designed it I very roughly aimed at less than 50% resources in the ~1200 LUT/FF machxo3 devices for low power processing tasks with gcc.

Part of the reason it's small is that it doesn't have the 16-bit ALU, but then two clocks are required instead of one for an ALU operation. I seem to recall reading somewhere that some of the earlier Z80s had 4-bit ALUs rather than 8 to save on space, so I think that's been done for a while now :).
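
The general idea, as a rough sketch (invented names, and certainly not pault.eg's actual implementation): mux the low bytes through the single 8-bit adder on the first clock, register the carry, then feed the high bytes through the same adder on the second clock:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical example: a 16-bit add on an 8-bit adder in two clocks.
    entity alu8_add16 is
      port (
        clk    : in  std_logic;
        start  : in  std_logic;              -- pulse high to begin an add
        a, b   : in  unsigned(15 downto 0);  -- assumed stable for both cycles
        result : out unsigned(15 downto 0)
      );
    end entity;

    architecture rtl of alu8_add16 is
      signal hi_phase   : std_logic := '0';
      signal carry      : std_logic := '0';
      signal res        : unsigned(15 downto 0) := (others => '0');
      signal op_a, op_b : unsigned(7 downto 0);
      signal cin        : unsigned(0 downto 0);
      signal sum9       : unsigned(8 downto 0);
    begin
      result <= res;

      -- Operand muxes: low bytes on the first cycle, high bytes on the second.
      op_a <= a(7 downto 0) when hi_phase = '0' else a(15 downto 8);
      op_b <= b(7 downto 0) when hi_phase = '0' else b(15 downto 8);

      -- The one 8-bit adder, written once outside the process; the saved
      -- carry is folded in only on the high-byte cycle.
      cin(0) <= carry and hi_phase;
      sum9   <= ('0' & op_a) + ('0' & op_b) + cin;

      process (clk)
      begin
        if rising_edge(clk) then
          if start = '1' then
            -- Cycle 1: capture the low byte of the result and the carry.
            res(7 downto 0) <= sum9(7 downto 0);
            carry           <= sum9(8);
            hi_phase        <= '1';
          elsif hi_phase = '1' then
            -- Cycle 2: capture the high byte.
            res(15 downto 8) <= sum9(7 downto 0);
            hi_phase         <= '0';
          end if;
        end if;
      end process;
    end architecture;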

Interestingly, on Github there is a NEO430 project that someone else has designed, and the end results there are not too dissimilar to what I've got, noting the optimisations I used.

Reply to
pault.eg

> I'm also surprised the Cyclone V result is smaller than the Artix 7 result. Any idea why the register count varies? Usually the register count is fixed by the code. Did the tools use register splitting for speed?

> A state machine is a significant part of the design, so I expect that to be the main reason for differences in register count, depending on how the implementation tools decide to encode the state machine. The number of LUTs and registers also varies depending on tool optimisation settings. The register and LUT counts on the s430 page are for default tool settings.

> Yes it does, see the cycle count for a selection of instructions on the s430 page. The design is not a pipelined processor design, which saves logic of course but hurts performance. When I designed it I very roughly aimed at less than 50% resources in the ~1200 LUT/FF machxo3 devices for low power processing tasks with gcc.

> Part of the reason it's small is that it doesn't have the 16-bit ALU, but then two clocks are required instead of one for an ALU operation. I seem to recall reading somewhere that some of the earlier Z80s had 4-bit ALUs rather than 8 to save on space, so I think that's been done for a while now :).

> Interestingly, on Github there is a NEO430 project that someone else has designed, and the end results there are not too dissimilar to what I've got, noting the optimisations I used.

Thanks, it's always interesting to see not just the results, but the goals and motivations for CPU projects.

I think you might be referring to the Z8 rather than the Z80? I guess there were some clones, but I don't think the original Z80 had a 4-bit, double-pumped ALU. Sounds more like something done in a Chinese 4-bit processor built to run Z80 code. The Z8, on the other hand, was all about low selling price, so they may well have minimized the ALU and other logic this way.

Anyone know if the 4-bit MCUs are still dominating the low end of the CPU market, or have the cost differences with the 8-bit devices faded away?

Reply to
Rick C
