modulo 2**32-1 arith

- Ilya Kalistru

Sat, Dec 19, 2015 6:13 PM

I should also apologize for my long absence. The device I'm working on suddenly started to lose its PCIe link (for no apparent reason), and I have to put off all other work until this problem is resolved.

- rickman

Sat, Dec 19, 2015 7:18 PM

- Ilya Kalistru

Sat, Dec 19, 2015 9:00 PM

It's just a way to get a rough estimate of the timing without inserting this block into the actual design or dealing with timing from/to the input ports. Maybe I'll check these designs with set_false_path -from A_in -to A and so on.

It is. But as I said, this particular algorithm uses a nonstandard definition (obviously they wanted to make the hardware implementation easier).

Cool! It produces LUT2 + 16*CARRY4 with a 3.112 ns estimated propagation time (before routing) and uses 64 LUTs.

- rickman

Sat, Dec 19, 2015 9:33 PM

That doesn't make sense. Either it is one definition or the other. What I wrote above is what my code produces. I believe your code can be simply modified to produce the same result: just use Sum_P1(32) in the conditional instead of Sum(32). I don't see how you can use your code if you need a modulo function, since it can produce a result of 2**n-1, which is outside the range of mod 2**n-1 and so is not a valid result.

Since the carry chains are not part of the routing, you will find little additional delay if the placement is good.

--

Rick

- Ilya Kalistru

Sat, Dec 19, 2015 9:38 PM

Here you are: after implementation Rick's Y

- Ilya Kalistru

Sat, Dec 19, 2015 10:10 PM

It's exactly what I did in my second post: there are test results for both definitions of "modulo", with Sum_P1(32) and Sum(32) used as an input of the multiplexer.

Sorry, I don't understand you here. Could you rephrase it more simply or in more detail? I don't need a modulo function; I need a function which is like modulo but with this strange difference. I tested both definitions for the sake of comparison between them, and to be able to compare my "standard modulo" results on Xilinx with KJ's "standard modulo" results on Altera.

I want to check how routing affects performance in the real design once I fix this problem with the bloody PCIe. I hope I won't forget to report my results here.

- rickman

Sun, Dec 20, 2015 4:56 AM

Using the Sum high bit as the selector means the mux can produce an output of 2**n-1. This is not a valid output for a modulo 2**n-1 function. In other words, this is not a valid circuit for computing modulo 2**n-1. It is the wrong circuit.

I can maybe illustrate this better with 8 bits. Mod 255 means the output will range from 0 to 254. So if you use the Sum high bit to control the mux, it will produce a range of 0 to 255, which is not correct. Use the Sum_P1 high bit to control the mux and the output will range from 0 to 254, which is correct.
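The 8-bit case can be checked exhaustively with a small software model. This is only a sketch, not the VHDL discussed in the thread: the function names are made up for illustration, and it assumes both operands are already in the range 0 to 254.

```python
# Software model of the two mux-select choices for an n-bit adder (n = 8).
# Names are hypothetical; this is not the VHDL from the thread.
N = 8
MASK = (1 << N) - 1          # 255, i.e. 2**N - 1

def add_sel_sum(a, b):
    """Mux controlled by the carry-out of Sum = A + B."""
    s = a + b
    s_p1 = a + b + 1
    return (s_p1 & MASK) if (s >> N) & 1 else (s & MASK)

def add_sel_sum_p1(a, b):
    """Mux controlled by the carry-out of Sum_P1 = A + B + 1."""
    s = a + b
    s_p1 = a + b + 1
    return (s_p1 & MASK) if (s_p1 >> N) & 1 else (s & MASK)

inputs = range(MASK)         # assumption: operands are in 0..254
out_sum = {add_sel_sum(a, b) for a in inputs for b in inputs}
out_sum_p1 = {add_sel_sum_p1(a, b) for a in inputs for b in inputs}

print(max(out_sum))      # 255: out of range for mod 255
print(max(out_sum_p1))   # 254: matches (a + b) % 255
```

Under that input assumption the two variants differ only when A + B is exactly 255: the Sum-controlled mux outputs 255 and the Sum_P1-controlled mux outputs 0.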

I'm interested. Placement will have a huge effect. I looked at the routed results for my implementation and saw one route of 4.5 ns. It looks like they shoved the input FFs into the IOBs, so the design can't all be close together, as the IO is scattered around the chip perimeter. I'd need to kill that or add another layer of FFs so the routes of interest are all between internal FFs, which could be co-located. That still doesn't say they *will* be close together. I find tools to be hard to manage in that regard unless you do floor-planning. I much prefer a slow clock... lol

--

Rick

- Ilya Kalistru

Sun, Dec 20, 2015 8:57 AM

OK. It's the wrong function to produce modulo 2**n-1. But I don't need a correct modulo function; I need a function that gives us this: A+B if A+B < 2**32, and A+B+1-2**32 if A+B >= 2**32. It's absolutely different from the modulo function, but it's what they use in the description of the algorithm. It's not my responsibility to change the algorithm, and this algorithm is already used in software implementations. If I changed this function to a correct modulo function, my hardware implementation would be incompatible with the software implementations and wouldn't comply with the government's requirements.

That's why I didn't use post-placement results in my first posts. Later I worked around it. As you can see, I've used A_in, B_in, C_out as ports and A, B, C as FFs. I've also made a rule that the paths from the *_in ports to the A, B FFs and from the C FFs to the C_out ports are "don't care" paths.

set_false_path -from [get_ports A_in*] -to [get_cells A_reg*]
set_false_path -from [get_ports B_in*] -to [get_cells B_reg*]
set_false_path -from [get_cells C_reg*] -to [get_ports C_out*]

This allows the FFs to be placed anywhere on the chip.

I usually don't do any floor-planning at all, because setting the constraints well is usually enough for my designs at ~250 MHz, and the Vivado placement tool does its job well.

- rickman

Sun, Dec 20, 2015 3:56 PM

If that is your requirement, then the other solutions are wrong, like mine.

In your original post you say you want a function to produce "modulo 2**32-1". The description you supply is *not* "modulo 2**32-1". It sounds like you have an existing implementation that may or may not be correct but you have been tasked with duplicating it. If I were tasked with this I would go through channels to get verification that the specification is correct. It wouldn't be the first time there was an error in an implementation that works most of the time, but fails 1 time in 2^32 in this case.

What happens downstream if your function spits out a number that is not in the range of "modulo 2**32-1"?

The problem I found was that the input FFs were shoved into the IOBs (a common optimization). This spreads the FFs around the IOs, which are spaced far apart. You need to tell the tools not to do that *or* place additional FFs in the circuit so the ones in the IOBs are not part of the path being timed. Maybe you already have IOB FFs disabled.

--

Rick

- Richard Damon

Sun, Dec 20, 2015 6:39 PM

This is absolutely a modulo function. It just treats 2^32-1 as == 0 (which it is, modulo 2^32-1).
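A quick way to see this point is to check that the result is always congruent to A + B modulo 2^32-1, even when it is not the canonical representative. The sketch below assumes the piecewise definition used in the thread (A+B below 2^32, otherwise A+B+1-2^32); the function name and the test values are picked only for illustration.

```python
N = 32
M = (1 << N) - 1             # 2**32 - 1

def eac_add(a, b):
    """Assumed definition: A+B if it fits in 32 bits, else A+B+1-2**32."""
    s = a + b
    return s if s < (1 << N) else s + 1 - (1 << N)

# Hypothetical test values, including the edge cases around 2**32 - 1.
cases = [(0, 0), (1, M - 1), (M, 0), (M, 1), (M, M), (123456789, 4000000000)]
for a, b in cases:
    r = eac_add(a, b)
    # The result may be M instead of 0, but it is always congruent mod M.
    assert r % M == (a + b) % M

print(eac_add(1, M - 1))       # 4294967295, not 0
print(eac_add(1, M - 1) % M)   # 0
```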

- Ilya Kalistru

Sun, Dec 20, 2015 7:07 PM

They are. But they give me ideas and approaches.

In my original post I was wrong. I forgot about this oddity and gave you the wrong description. This specification is 26 years old and I have tons of implementations :). I don't think it's a good idea to try to get verification. It's an old standard set by the government, and that would be a very long road, especially when the part of the work you are responsible for isn't done yet.

Nothing special. It's just a way to generate pseudorandom data. It could be done this way or another way, but it should be consistent across all devices. The only difference is security, but I wouldn't check the algorithm after professional mathematicians; that's not my business.

I think it's pointless to shove FFs into the IOBs if there is a rule that we don't care about timing between ports and FFs. Why would the tool do that? I haven't disabled IOB FFs.

- Ilya Kalistru

Sun, Dec 20, 2015 7:10 PM

Maybe. Let's not fight about mathematical definitions. :)

- rickman

Sun, Dec 20, 2015 8:14 PM

That's the problem. Some of the algorithms provided compute mod 2^32-1, while others include 2^32-1 in the output and so *are not* mod 2^32-1. To get mod 2^32-1, the decision whether to add a 1 to the sum (before taking mod 2^32) must be made from the carry of the sum with the 1 already added. The way Ilya is specifying the algorithm, the 1 is not added to the controlling sum, so 2^32-1 will show up in the output instead of being converted to 0 as it should be for the mod function.

--

Rick

- rickman

Sun, Dec 20, 2015 8:25 PM

Ok, if they have been using this method for a long time I guess the issue of being out of range does not cause a problem. But certainly if this were being used in a critical application (such as cryptography) it would be a major concern since I am almost positive the greater algorithm is expecting numbers in the more restricted range. Otherwise why specify it as mod 2^n-1?

In any event, remove the " + 1" from the end of the sum I offered and you should have the behavior you are looking for.

This sounds familiar. I think I was in a conversation about this algorithm in another group sometime not too long ago. The mod 2^n-1 rings that bell.

I think I'm not being clear. In my P&R the input FFs are automatically placed into the IOBs. I don't know if the output FFs are also placed in the IOBs, I didn't dig this info out of the report once I saw the inputs were not right.

This placement distorts the timing numbers because the routes have to be much longer to reach the widely spread IOBs. I don't know if your tool is doing this or not. In my case I would either need to find the setting to prevent placing the FFs in IOBs or I need to add extra FFs to the test design so that the routes I want to time are all between fabric FFs.

This has nothing to do with timing rules. In fact, if you add a timing rule that says to ignore timings between IOB FFs and the internal FFs it may not time your paths at all. But I expect your tool is not using the IOB FFs so you aren't seeing this problem.

Please don't feel you need to continue this conversation if you have gotten from it what you need. I don't want to be bugging you with nagging details.

--

Rick

- Gabor Szakacs

Sun, Dec 20, 2015 10:11 PM

So in fact this is what I originally suggested, i.e. "end around carry." This is often used in checksum algorithms. It's the equivalent of (without the type / size changes):

sum
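As a rough illustration of end-around carry in general, here is a minimal Python sketch. It is not the code from this post; the name `eac_add` and the fixed 32-bit width are assumptions. The carry-out of the N-bit addition is simply added back into the low bits.

```python
N = 32
MASK = (1 << N) - 1

def eac_add(a, b):
    """Add two N-bit values and fold the carry-out back into bit 0."""
    s = a + b                       # up to N+1 bits wide
    return (s & MASK) + (s >> N)    # low N bits plus the carry-out

print(hex(eac_add(0xFFFFFFFF, 0x00000001)))  # 0x1
print(hex(eac_add(0x80000000, 0x7FFFFFFF)))  # 0xffffffff
```

The second addition cannot overflow: when there is a carry-out, the low N bits are at most 2^N-2, so adding 1 still fits in N bits.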

--
Gabor

- rickman

Mon, Dec 21, 2015 4:23 AM

Another problem with this method is that it creates a combinatorial loop which prevents proper static timing analysis. It is very efficient with the logic however.

I'm just very curious about why this is the desired algorithm while it is being called "mod 2^32-1", but obviously we will never know the origin of this.

--

Rick

- Gabor Szakacs

Mon, Dec 21, 2015 4:26 PM

I have always heard this method referred to as "end around carry," however if you are using this as an accumulator, i.e. A is the next data input and B is the result of the previous sum, then it is in effect the same as taking the sum of inputs modulo 2**32 - 1, with the only difference being that the final result can be equal to 2**32 - 1 where it would normally be zero. Intermediate results equal to 2**32-1 do not change the final outcome vs. doing a true modulo operator.
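The accumulator claim can be checked by brute force at a small width. The sketch below is only a model under stated assumptions: a 4-bit width instead of 32, sequences of three inputs, a starting value of 0, and helper names invented for the example.

```python
from itertools import product

N = 4                        # small width so every sequence can be enumerated
MASK = (1 << N) - 1          # 15, i.e. 2**N - 1

def eac_add(a, b):
    """N-bit add with the carry-out folded back in (end-around carry)."""
    s = a + b
    return (s & MASK) + (s >> N)

def accumulate(values):
    """Run the end-around-carry accumulator over a sequence, starting at 0."""
    acc = 0
    for v in values:
        acc = eac_add(acc, v)
    return acc

mismatches = 0
for seq in product(range(MASK + 1), repeat=3):
    got = accumulate(seq)
    want = sum(seq) % MASK
    if got != want:
        # The only difference seen: MASK appears where true modulo gives 0.
        assert got == MASK and want == 0
        mismatches += 1

print(accumulate([7, 8, 1]))   # 1: the intermediate 15 does not disturb the result
```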

--
Gabor

- rickman

Mon, Dec 21, 2015 6:57 PM

I guess I see what you are saying, even if this only applies to an accumulator rather than taking a sum of two arbitrary numbers. When a sum is equal to 2**N-1 the carry is essentially delayed until the next cycle. But... if the next value to be added is 2**N-1 (I don't know if this is possible), a carry will happen, but there should be a second carry which will be missed. So as long as the input domain is restricted to numbers from 0 to 2**N-2, this will work if the final result is adjusted to zero when equal to 2**N-1.

--

Rick

- Ilya Kalistru

Mon, Dec 21, 2015 7:33 PM

Today I tried the new method in the real design and it turned out that it works quite well. The biggest negative timing slack was about -0.047 or so. If I can take advantage of single-cycle operation, I'll try to implement it there. The total delay of the critical path there is about 4 ns, and it's bigger than in my simple test mostly because of worse routing to the LUTs at the beginning and at the end of the dedicated CARRY network.
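For a feel of what that slack means, here is a back-of-the-envelope calculation. It assumes a 4 ns clock period (the ~250 MHz figure mentioned earlier in the thread, not confirmed for this design) and that the slack is in nanoseconds; both are assumptions, not reported results.

```python
clock_period_ns = 4.0      # assumption: 250 MHz target clock
worst_slack_ns = -0.047    # figure quoted in the post, assumed to be in ns

# Negative slack means the critical path is longer than the clock period.
min_period_ns = clock_period_ns - worst_slack_ns
fmax_mhz = 1000.0 / min_period_ns

print(f"{fmax_mhz:.1f} MHz")   # 247.1 MHz
```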

- rickman

Mon, Dec 21, 2015 7:52 PM

Lots of times routing is half the total delay in a design. In this case it may be less because of there just being two routes and the carry chain. But the point is a bad routing delay in just one place can dominate.

--

Rick