Optimizations, How Much and When?

My projects typically are implemented in small devices and so are often space constrained.  So I am in the habit of optimizing my code for implemented size.  I've learned techniques that minimize size and tend to use them everywhere, rather than following the 20/80 rule of optimizing the 20% of the code where you get 80% of the benefit, or 10/90 or whatever your favorite numbers are.

I remember interviewing at a Japanese-based company once where they worked differently.  They were designing large projects and felt it was counterproductive to worry about optimizations of any sort.  They wanted fast turnaround on their projects and so just paid more for larger and faster parts, I suppose.  In fact, in the interview I was asked my opinion and gave it.  A lead engineer responded with a mini-lecture about how that was not productive in their work.  I responded with my take, which was that once a few techniques were learned about optimal coding techniques, it was not time consuming to make significant gains in logic size, and that these same techniques provided a consistent coding style which allowed faster debug times.  Turns out a reply was not expected and, in fact, ran up against a cultural difference, resulting in no offer from this company.

I'm wondering where others' opinions and practice fall in this area.  I assume consistent coding practices are always encouraged.  How often do these practices include techniques to minimize size or increase speed?  What techniques are used to improve debugging?

--  

  Rick C.

  - Get 1,000 miles of free Supercharging
Re: Optimizations, How Much and When?
On Saturday, January 4, 2020 at 7:59:14 PM UTC, Rick C wrote:


I size optimised this https://www.p-code.org/s430/ by leaving a few things out, using an 8-bit ALU rather than a 16-bit ALU, and matching the register file to the machxo3 FPGA architecture so that it was a good fit for the small FPGAs in that series.  But that's not work related.  That's from a small personal interest in small micros in small FPGAs that can be used with GCC :)

Otherwise I've been lucky enough not to have to worry too much about optimising for size.

Re: Optimizations, How Much and When?

Interesting effort.  I'm surprised the result is so small.  I'm also surprised the Cyclone V result is smaller than the Artix 7 result.  Any idea why the register count varies?  Usually the register count is fixed by the code.  Did the tools use register splitting for speed?

Does it take a lot of cycles to run code?  

--  

  Rick C.

  -- Get 1,000 miles of free Supercharging
Re: Optimizations, How Much and When?
On Sunday, January 5, 2020 at 6:33:05 PM UTC, Rick C wrote:

A state machine is a significant part of the design, so I expect that to be the main reason for differences in register count, depending on how the implementation tools decide to encode the state machine.  The number of LUTs and registers also varies depending on tool optimisation settings.  The register and LUT counts on the s430 page are for default tool settings.
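
Most tools will also let you pin the state encoding down with a synthesis attribute rather than leaving it to the defaults.  Roughly like this - treat it as a sketch, since the attribute name and its legal values are tool-specific (e.g. Synplify's syn_encoding, Vivado's fsm_encoding):

  -- in the architecture declarative region, next to the state signal:
  attribute syn_encoding : string;
  attribute syn_encoding of state : signal is "onehot";  -- or "sequential", "gray", "safe"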


Yes it does; see the cycle count for a selection of instructions on the s430 page.  The design is not a pipelined processor design, which saves logic of course but hurts performance.  When I designed it I very roughly aimed at less than 50% resources in the ~1200 LUT/FF machxo3 devices for low power processing tasks with gcc.

Part of the reason it's small is that it doesn't have the 16-bit ALU, but then two clocks are required instead of one for an ALU operation.  I seem to recall reading somewhere that some of the earlier Z80s had 4-bit ALUs rather than 8 to save on space, so I think that's been done for a while now :).

Interestingly, on Github there is a NEO430 project that someone else has designed, and the end results there are not too dissimilar to what I've got, noting the optimisations I used.




Re: Optimizations, How Much and When?

Thanks, it's always interesting to see not just the results, but the goals and motivations for CPU projects.

I think you might be referring to the Z8 rather than the Z80?  I guess there were some clones, but I don't think the original Z80 had a 4-bit, double pumped ALU.  Sounds more like something done in a Chinese 4-bit processor built to run Z80 code.  The Z8, on the other hand, was all about low selling price, so they may well have minimized the ALU and other logic this way.

Anyone know if the 4-bit MCUs are still dominating the low end of the CPU market, or have the cost differences with the 8-bit devices faded away?

--  

  Rick C.

  +- Get 1,000 miles of free Supercharging
Re: Optimizations, How Much and When?
On 04/01/2020 20:59, Rick C wrote:

This seems a little mixed up.  The "20/80" (or whatever) rule is for  
/speed/ optimisation of software code.  The principle is that if your  
program spends most of its time in 10% or 20% of the code, then that is  
the only part you need to optimise for speed.  The rest of the code can  
be optimised for size.  Optimising only 10% or 20% of your code for size  
is pointless.


I can only really tell you about optimisation for software designs,  
rather than for programmable logic, but some things could be equally  
applicable.

First, you have to distinguish between different types of  
"optimisation".  The world simply means that you have a strong focus on  
one aspect of the development, nothing more.  You can optimise for code  
speed, development speed, power requirements, safety, flexibility, or  
any one of dozens of different aspects.  Some of these are run-time  
(like code speed), some are development time (like ease of debugging).  
You rarely really want to optimise, you just want to prioritise how you  
balance the different aspects of the development.

There are also many ways of considering optimisation for any one aspect.  
  Typically, there are some things that involve lots of work, some  
things that involve knowledge and experience, and some things that can  
be automated.

If we take the simple case of "code speed" optimisations, there are  
perhaps three main possibilities.

1. You can make a point of writing code that runs quickly.  This is a  
matter of ability, knowledge and experience for the programmer.  The  
programmer knows when to use data of different types, based on what will  
work well on the target device.  He/she knows what algorithms make  
sense.  He/she knows when to use multi-tasking and when to use a single  
threaded system - and so on.  There is rarely any disadvantage in doing  
this sort of thing, unless it becomes /too/ smart - then it can lead to  
maintainability issues if the original developer is not available.

2. You can enable compiler optimisations.  This is usually a cheap step  
- it's just a compiler flag.  Aggressive optimisations can make code  
harder to debug, but enabling basic optimisations typically makes it  
easier.  It also improves static analysis and warnings, which is always  
a good thing.  But there can be problems if the developers are not  
sufficiently trained or experienced, and tend to write "it worked when I  
tried it" code rather than knowing that their code is correct.

3. You can do a lot of work measuring performance and investigating  
different ways of handling the task at hand.  This can lead to the  
biggest gains in speed - but also takes the most time and effort for  
developers.


I expect the optimisations you are thinking of for programmable logic  
follow a similar pattern.

And I think a fair amount of apparent disagreements about "optimisation"  
comes mainly from a misunderstanding about types of optimisation, and  
which type is under discussion.




Re: Optimizations, How Much and When?
On Sunday, January 5, 2020 at 10:29:53 AM UTC-5, David Brown wrote:

Lol!  I guess you don't really understand HDL.  



Yes, that much we understand.  



There is your first mistake.  Optimizing code can involve tradeoffs between size, speed, power consumption, etc., but can also involve finding ways to improve multiple aspects without tradeoffs.

If you are going to talk about a tradeoff between code development time (thinking) and other parametrics, then you are simply describing the process of optimization.  Duh!



Not at all rare that speed optimizations can create issues in other areas.  So that's your second mistake.



Which is the basis of the 20/80 rule.  Don't spend time optimizing code that isn't going to give a good return.



I think you didn't really read my original post, where I mentioned using optimization techniques consistently in my coding as a matter of habit.  Rather than brute force an algorithm into code in the most direct way, I try to visualize the hardware that will implement the task and then use HDL to describe that hardware.  The alternative is to code the algorithm directly and let the tools try to sort it out in an efficient manner, which often fails because they are constrained to implement exactly what you have coded.

Once I couldn't figure out why I was getting two adder chains when I expected one chain with a carry output.  Seems I had made some tiny distinction between the two pieces of code, so the adders were not identical.  So now my habit is to code the actual adder in a separate line of code outside the process that is using it, assuring it is the same adder for both uses of the results.

This is why it's still the 20/80 rule, since there are large sections of code that don't have much to gain from optimization.  But a halving of the size is significant for the sections of code that can benefit.

--  

  Rick C.

  + Get 1,000 miles of free Supercharging
Re: Optimizations, How Much and When?
On 05/01/2020 19:24, Rick C wrote:

My experience with HDL is small and outdated, but not non-existent.

In software, at least in a language like C, you can very roughly  
approximate that the size of your C files corresponds to the size of the  
code.  So if you pick 20% of your code and reduce its size by half, you  
have only reduced your overall code size by 10%.  If you want to reduce  
the size of the total code by a meaningful amount, you have to look at  
most of the code.

Is HDL design so different in your experience?  Do you typically find  
that a mere 10% or 20% of the design accounts for the solid majority of  
the space?

If you were talking about optimising for speed, then I would understand  
better - I can easily imagine that the maximum frequency for a design is  
limited by a few bottleneck points in the code, just as it often is in  
sequential software.


You seem to want to argue, rather than discuss.  I agree that working to  
improve one aspect (whether you call it "optimising" or not) can often  
improve other parts too.  It is less common that there are /no/  
tradeoffs, but certainly common that it is worth the cost.


Again, what's with this "mistake" nonsense?  Or are you really saying  
that, unlike in sequential software development, when you do hardware  
development you find that making one part of your system fast breaks  
other parts?  That sounds counter-intuitive to me, and does not match my  
limited HDL experience, but you have far more HDL experience than I.


Yes, I know.


Yes, I read it.  That sounded very like the "type 1" optimisations I  
mentioned above.  But the description of the Japanese company sounded  
like "type 2" and "type 3" optimisations.  And it sounded like neither  
you nor that company understood the differences.  (Note that I write  
here "it sounded like".)


I would expect that methodology to work well sometimes, but not always.

Again, let me compare this to the sequential software world - and you  
can tell me if this does not apply in HDL.  When you are dealing with  
more limited compilers, it can often give the most efficient final  
results if you try to imagine the generated assembly, and try to arrange  
source code that would give that assembly.  So people would use pointer  
arithmetic instead of arrays, write "x << 2" instead of "x * 4", and so  
on.  But with better tools, writing the code cleanly and simply gives  
more efficient end results, because the compiler can handle the clear  
code better, and you have the benefits of source code that is easier to  
understand, easier to get right, harder to get wrong, and easier to  
maintain.

Is it so different in the HDL world?  Are the tools still so primitive?


Factoring out common code is regularly a good idea.


But if you halve the size of that 20%, your total is now 90% of the original.  That is not typically a significant gain.

Re: Optimizations, How Much and When?
On Sunday, January 5, 2020 at 5:11:45 PM UTC-5, David Brown wrote:

Yes, in that 20% of the code is something that CAN be optimized for size.  Much of the code is straightforward and will produce what it produces with no leeway.



Sometimes, but it's not like sequentially executing code, where the times all add up and so much of the time is spent doing a lot of little things and a few longer things.  In HDL nearly everything is in parallel.  So picture a histogram with a spread of time delays.  The longest delay is the clock speed limiting path.  Solve that and you bring the clock cycle in a small amount, to the next entry in the histogram.  Lather, rinse, repeat.  As you speed up more of the long paths you will find you have to improve a larger number of paths to get the same clock time improvement, resulting in the same sort of 80/20 rule, but a bit different.  It's more like you can address 20% of the delay but then 80% remains intractable.  So not really the same thing.  It's a lot of work no matter what... unless the delays are because of poor design, which is always possible.



My original post wasn't really about optimizing after code was written.  It was about coding styles that achieve a level of optimization across the various parameters before reviewing the results.  So maybe that is why you feel like I am arguing with you.  We aren't really discussing the same thing.  You acknowledge you have little experience in HDL yet seem to be forcing a match to your experience in sequential coding.  Try going with the flow here and learn, rather than trying to teach something you don't know.



Perhaps I didn't understand your point.  Rereading it, I can't say I really get your point other than the fact that speed optimizations can be complicated and hard to maintain.  That is also true in HDL: code that is worked and reworked to find optimizations can be nebulous and fragile.  Logic is always "optimized" by the compiler.  Logic is not always simple and straightforward in a single assignment.  In a complex CASE statement with conditionals, a simple assignment can produce more complex logic than the writer can appreciate.  A small change can improve one thing and make another complex and slow.  I use complex conditionals like CASE and nested IFs, but if I want fast logic I separate that out and make it part of the 20% I'm specifically optimizing (which is not the same as my original topic here).

I recall a CPU design I did where I targeted certain enables as fast decodes and put them in a separate module.
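
As a sketch of the idea (the names here are illustrative, not from that design), each fast decode lives in its own concurrent assignment, where it stays one level of logic deep instead of being buried in a big CASE:

  -- addr : std_logic_vector(15 downto 0)
  io_en <= '1' when addr(15 downto 12) = x"F" else '0';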
  


The Japanese company didn't do optimizations at all.  That was the part that surprised me.  I tried to discuss it with them, but the guy had what we would call an attitude about it.  Clearly there was an issue with culture, where either I did not phrase my statements properly or I was simply not supposed to question the guy at all.

I don't think there are type 2 optimizations so much, but then I haven't done much with the tools in years.  Maybe they have added levels of optimizations, but I don't think the same things apply to HDL as sequential coding.  It's more a matter of payment if anything.  You pay for specific features in the tools which allow better results.  I don't know for sure; I've always used the free tools.  I also mostly work on smaller tasks where I can use the exploratory tools to see what my results are and see if they are matching my intent.  That is the part the Japanese company didn't want to do.



I don't know, because I really don't get what you are referring to.  How is x << 2 different from x * 4?  I guess there might be an issue if x is a signed number, so treat it as unsigned and they are the same.  I had exactly that sort of issue reading some code I wrote years ago for mulaw conversion.  The starting data was 16-bit signed, of which only 14 bits are used including the sign bit.  I wrote the code to be optimized and tossed out bits as soon as they were no longer useful.  Ultimately the algorithm uses 11 bits to form the exponent and mantissa, and of course the original sign bit.  So the number was converted to absolute value before doing the rest of the algorithm, involving aligning the most significant '1' like a floating point number.  Some of this involved shifting or multiplying depending on how you coded it.  No difference to the tools that I am aware of.

On reviewing the code I got myself wrapped around the axle trying to figure out if I had the headroom to do the bias addition before the clipping, and had to write an app to calculate the mulaw data so I could more thoroughly test this module over some data values that would show the flaw if it were there.  By the time I had the program written the confusion was gone and I realized I had done it correctly the first time.  Whew!  I've got nearly 10,000 of these in the field, and they would have to be brought back for reprogramming the updates!

The point is the optimizations I did made it harder to "maintain" the code.  I don't see any way the compiler would have mattered.



No, no, not common code: two adders instantiated for the same functionality.  If you are decrementing a count and want to detect it being at zero, you can use the carry out.  No need for two counters as long as you code it right.

X <= X - 1;  
and elsewhere
If X = 0  

I also don't want an N-bit compare to all zero bits in the word.
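
Coded once outside the process, it might look something like this (a sketch - the names, widths, and use of ieee.numeric_std are mine):

  -- x : unsigned(N-1 downto 0);  dec : unsigned(N downto 0)
  dec <= ('0' & x) - 1;        -- the one and only subtractor

  -- inside the clocked process, X <= X - 1 becomes:
  x <= dec(N-1 downto 0);

  -- and the zero test is just the borrow out, not an N-bit compare:
  x_zero <= dec(N);            -- '1' exactly when x = 0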



Lol!  If you have 1 MB of code space and your code is 1.001 MB, 90% is very significant.  Maybe that's still too close for comfort in a CPU, but if you can get it into the FPGA at 90% utilization, 90% is golden.

Besides, the numbers don't necessarily add up the way you indicate.  I think I've said before, some code produces more logic than other code.  A perfect example is the multiplier I had to use in the front end of a DSP path vs. the multiplier I used in an audio path.  The former was optimized to use just enough bits to get the job done since it needed to run at close to the clock rate, so it used a fair number of LUTs (the target part had no multipliers).  The latter was in an audio path that could process it in multiple clocks and used a lot less logic, but I had to optimize this one differently.

The optimizations that would be more like what is done in software might be the algorithmic optimizations, where different ways of designing the algorithm are considered.  I don't know.  I don't typically need to optimize any of the software I write.  It all runs pretty well on today's CPUs, even the tiny ones in my FPGAs.  Oh, and I avoid languages like C.  I use Forth or assembly, which is nearly the same thing on my CPUs.

--  

  Rick C.

  -+ Get 1,000 miles of free Supercharging
Re: Optimizations, How Much and When?
On 06/01/2020 00:13, Rick C wrote:

That is not answering my question.  I can well understand that only a
small proportion of the code offers much scope for size optimisation -
the same is true of sequential software.  And in both sequential
software and HDL design, a substantial part of the bulk is usually taken
by libraries or pre-generated blocks of some kind, which offer little
scope for developer optimisation - just basic compiler flags for
"optimise for speed" or "optimise for size".

What I asked, was if this 20% of the code that you /can/ optimise for
space accounts for a solid majority of the final placed and routed
binary?  If it does not - as I suspect is the case in most designs -
then you can never make more than a dent in the total space no matter
how you improve that 20%.

I am not saying that you shouldn't write that 20% as best you can, nor
do I disagree that doing so will likely give many other benefits.  What
I am saying is that optimising it primarily for size is not often a
meaningful aim in itself.


Yes, I appreciate that.


It is not hugely different for sequential software, and follows a
similar pattern - it is often just a small part of the code that limits
the speed of the system.  There are differences in the details, of
course - with HDL it doesn't matter how much you improve the speed of a
part that is not the limiting path.  (Software can often consist of
multiple threads in parallel, which can make it a little more like HDL
here - though obviously the scale of the parallelism is very different.)


That is, I think, very much my point.  "Optimisation" can mean so many
things, and it is vital to understand the differences, and make it clear
what you mean by it.


I see this as a discussion, not "teaching" - I am sharing experiences
and thoughts, and trying to provoke other thoughts and suggesting other
viewpoints.  It is not about being right or wrong, or teaching and
learning, it is about swapping thoughts and experiences.  It is about
looking at analogies from related fields - there is a lot to be learned
by looking across sequential software development, HDL development, and
electronics hardware development.


I have been trying to describe different meanings of "optimisation" - I
don't think it makes sense to talk about how much optimisation you
should do until it is established which kind of optimisation you are
talking about.  (I have a much better idea of this now than from your
original post - so in that way, the discussion has been helpful.)


Yes, I understand that.  This is perhaps one way in which HDL (at least,
with Verilog or VHDL) can be harder than sequential software - it is
really easy to get large effects from small changes to the code in
things like a complex CASE statement.  In particular, you can very
easily create a latch unintentionally.
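
For example (a VHDL sketch with illustrative names; sel is 2 bits wide):

  process (sel, a, b) begin
    case sel is
      when "00"   => y <= a;
      when "01"   => y <= b;
      when others => null;   -- y never assigned here, so it must hold
    end case;                -- its value: a latch is inferred
  end process;

A default assignment to y before the case (or covering every branch)
keeps the process purely combinatorial.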

The solution to this kind of thing is the same in sequential software
and HDL - modularisation, factoring out common code, and using
higher-level features or higher-level languages.

(Higher level HDLs, like MyHDL, Spinal, Confluence, etc. invariably make
a point of making it very clear when you are generating synchronous or
combinatorial logic, as far as I have seen.)

Going the other way, one thing that is easy to do in sequential software
(in most languages), but hard to do in HDL, is have multiple places
where you assign to the same variable.  This is often a source of errors
in sequential code - and thinking of your variables like hardware
signals or registers that are fed from one place and read from many, can
give you much better code.


My guess would be a bit of both - you could easily have been meaning
different things when talking about optimisations.  And there can
certainly be cultural differences - Japanese society is a lot more
strictly hierarchical than is usually found in small Western companies.


Certainly it seems common practice for the paid-for versions to have
more features.  I haven't kept up with FPGA tools for a long time, but I
know it was common to have things like parallel place and/or route in
the paid versions - leading to faster build times.  Whether they also
had more or less optimisations, I don't know.  But I do remember a
certain amount of flags and options for logic generation that could be
considered optimisations.  However, it is perhaps not really equivalent
to optimisation flags in sequential programming tools.

(In the software world, you sometimes get tools where you have to pay,
or pay more, for an optimising compiler.)


In the software world, imagine a small, simple cpu core with no hardware
multiply.  Directly translated, "x << 2" can probably be done with two
"rotate left" instructions.  Directly translated, "x * 4" would call the
library multiplication function.  There would be a huge difference in
the timing of the two implementations.

There was a time when manually writing "x << 2" instead of "x * 4" was a
good idea, at least for some compilers.  However, for any half-decent
compiler, you will get the same results - and thus you should write the
code in the way that is the most obvious and clearest, /not/ in the way
you think matches the assembly you think should be generated.

In the HDL world, "x * 4" might be implemented using a DSP multiplier
block - and "x << 2" might be done just by routing of the signals.  I
would hope that you are able to write "x * 4" in the HDL when that is
the more natural way to express your intent, and the tools will
implement it as though you had written "x << 2".
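
In VHDL that might look like this (a sketch, assuming x and y are
equal-width unsigned signals from ieee.numeric_std):

  y <= resize(x * 4, y'length);   -- arithmetic form
  -- y <= shift_left(x, 2);       -- shift form: the same hardware

Either way no multiplier is needed - a constant power-of-two factor is
just wiring and two constant '0' bits.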

And in reference to signed and unsigned data - this means you can write
your code in clearer arithmetic, without having to be so concerned about
the details of the sign bits.  (With two's complement integers, you can
do your multiplications with a disregard for signedness, but it can be
complicated for division.  Let the compiler figure out how to handle the
sign, or which shift instructions to use, or when biased arithmetic
makes most sense, or when to turn division by a constant into
multiplication by a constant.)

Does that make it a little clearer?


Can't you write this sort of thing at a higher level, using normal
arithmetic, and let the tools figure out the details?  What you are
describing sounds like the equivalent of assembly programming in
sequential software - and it is a long time since that has made sense
for all but the most niche cases.  (I have a long history as an assembly
programmer - and I'm glad I don't have to do so any more.)

As an example in software, consider a "divide by three" function.
Division instructions - assuming the cpu has one - are generally very
slow.  This is what a compiler generates for ARM code:

int div_by_three(int x) {
    return x / 3;
}

div_by_three(int):
        ldr     r3, .L5              @ load 0x55555556 = ceil(2^32 / 3)
        smull   r2, r3, r3, r0       @ r3 = high word of x * 0x55555556
        sub     r0, r3, r0, asr #31  @ sign fixup: adds 1 when x < 0
        bx      lr
.L5:
        .word   1431655766

The compiler transforms the division into a multiplication, with shifts
and biases to get it exactly right, for all values, positive and
negative.  I don't need to go through hours of work figuring out the
numbers to use here, and hours more in comprehensive testing of corner
cases.

But go back a number of years, or switch to poorer tools, and the
compiler would be generating division code - and thus there would be a
lot of scope for manually optimising the code (if the speed mattered, of
course).


Do HDL tools do this kind of arithmetic transformation automatically?


Ah, okay.


Of course there are edge cases where a 10% improvement in code space
makes a world of difference.  It's the same in software.

The difference is, a 50% improvement in the code density of the key 20%
leads to a 10% improvement in the end result for size.  A 50%
improvement in the speed of the key 20% can give close to 50%
improvement in the end result for speed.  A 50% improvement is usually
significant - a 10% improvement usually not.


Of course not - these are very rough, and will certainly not always be
remotely realistic.



Re: Optimizations, How Much and When?

I think the issue in the HDL world is the 'instruction set' is a lot more
complex.  Let's take a software analogy...

Essentially, a compiler is trying to pattern match the high level code you
wrote with a toolbox of lower level pieces ('instructions' in the software
case).  The 1980s RISC movement was all about making those instructions
simpler and faster, so the compiler could use half a dozen to represent an
operation instead of 1-2 CISC instructions.  Over the years compilers have
got better at using those RISC instructions and optimising away redundancy
between HLL statements.

In the FPGA and ASIC world, you have extremely CISC-y 'instructions'.  So
instead of shifts and adds, you have complexity equivalent to VAX's
polynomial multiply instruction.  The compiler now has to read your C code
and decide what to fit to a polynomial multiply.  Maybe it can spot:

p = a*x*x + b*x + c;

(which would become POLY p,x,c,b,a for my fictional polynomial expansion
instruction)

but change that to:

p = a*x*y + b*x + c;

and suddenly it doesn't match any more (it's no longer a polynomial in terms
of x).

So you do need to code thinking about how it might implement your system,
because you do have to convince the compiler to fit your code to the
primitives you have.  If you code slightly differently, such that the
compiler misses your pattern, you end up with a big pile of registers
rather than a BRAM, an expanded multiply rather than a DSP block, or
whatever it might be.

The job of an FPGA compiler is a lot harder than a software compiler, and so
you need to feed it more carefully.
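
For instance, the pattern most tools recognise as a block RAM is a
synchronous-read memory along these lines (a VHDL sketch; names and
types are illustrative - addr an unsigned, ram a signal array of
std_logic_vector):

  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      dout <= ram(to_integer(addr));   -- registered read: BRAM-friendly
    end if;
  end process;

Make the read asynchronous, or try to give the array contents a reset,
and many tools will quietly build it out of registers and LUTs instead.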

Theo

(my mention of the VAX instruction set is but a caricature used for
rhetorical purposes - I've never actually programmed one)

Re: Optimizations, How Much and When?
On 06/01/2020 00:42, Theo wrote:

In assembly, one "high level" statement or expression is essentially
turned into one machine code instruction (though this is not always
true).  For low-level languages with simpler compilers, such as earlier
or more limited C compilers, each section of bit of source code was
translated into a particular set of assembly instructions.  But for most
software languages, and modern compilers for languages like C, it's a
different matter.  The source code describes operations defined on an
abstract machine, and it is up to the compiler to generate assembly that
gives the requested results.


For some kinds of software coding, you will be thinking along the same
lines.  For example, you might code thinking "I'll make sure I use
single-precision floating point here, because my cpu has hardware for
that, and avoid double-precision because that uses slow software emulation".

So yes, I understand what you mean - and I know that is something that
can be very relevant in HDL coding.

And for both sequential software and HDL development, much of this is
down to the knowledge and experience of the programmer - and the
required portability of the code.


Both need careful feeding to be sure of getting correct results (that's
the main point), and to prefer efficient results (a secondary aim, but
still very important).


