Digesting runs of ones or zeros "well"

J

John_H 22 years ago

Greetings,

I need to detect runs. I want to look at 65 bits and show when there are 9 consecutive 1s or 0s from the byte boundaries resulting in 8 values per clock. This should be comfortably done in two logic levels (I need clean logic delays).

The idea is simple but the implementation is tough. I'm working with Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to design violence to get the results that I believe are "best."

Any thoughts on how to do this "better?" (the following code likes fixed fonts)

- John_H ===================================== module testRun ( input clk , input [64:0] bytePlus1 , output reg [ 7:0] runByte /* synthesis xc_props = "INIT=R"

*/ ); // INIT included to force register as FD primitive - bleah

reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg" wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah reg [ 7:0] runByte_; integer i,j,k;

always @(*) begin runBits = -24'h1; runByte_ = -8'h1; k = 0; // overlapping aaa aaaa for( i=0; i

Vote

R

rickman 22 years ago

I can't say I really understand your problem statement. I also don't see how your code is solving the problem. Can you give a better explanation of the problem? What do you mean "from the byte boundaries"? Are you counting only the bit sets of {0..8}, {8..16}, {16..24}...? If so, this seems like an easy problem to implement.

John_H wrote:

Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Vote

J

John_H 22 years ago

You have it right - 8:0, 16:8, 24:16.... In the 41 bits illustrated below I want to note when the sequence across [16:8] (illustrated by the 9 1s below the count) in result[1].

11010101000101010111101111111111110010101 pattern 09876543210987654321098765432109876543210 count 00000000000000000000000011111111100000000 (reference)

The trouble is that while it seems easy to implement, getting the logic to come out clean in the implementation - the "pushing the rope" problem - doesn't comes easily. If one does a loop looking for 8 adjacent equals or the 8-wide AND of 8 XNORs, the realization takes up 3 levels of logic with FDR primitives resulting in 2x-3x the resources and about twice the delay.

The code with the two nested for loops breaks up the 9-bit compare into two

4-bit and one 3-bit compare to verify all the 9 bits are equal to each other and to break the implementation into 2 levels of logic rather than 3+ (the + being from the flop's reset input taking some extra routing delay).

I'd love a cleaner approach that doesn't come off so complex and gets realized in the resources it "should" use. It's tough to get it where I want by pushing on the rope.

are 9

clean

fixed

"INIT=R"

bleah

"reg"

Vote

R

rickman 22 years ago

My experience has been that it does not much matter how you code combinatorial logic like this. The tools run it through a grinder and produce an optimal version (in its own mind). When I want to optimize like this, I either use a "keep" attribute on the wire, or sometimes you can instantiate primitives. For logic I don't think primitives work since gates just get remapped.

But I still don't understand your code. Why does the outer loop range over 64 values. I would code two nested loops where the outer loop ranges over the 8 outputs and the inner loop ranges over the 9 inputs for each output. Or just skip the inner loop and use two outputs from two sets of four inputs feeding a 3 input function and use keeps on the first two output arrays. Maybe that is what you are doing, but I can't figure out the code easily.

I see you are > > >

Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Vote

V

Vinh Pham 22 years ago

John

Pushing rope is a good analogy. Jan Gray, yeah?

When writing HDL, on one extreme you can write your code to explicity tell the synthesizer how to structure your logic. You'd instantiate logic primitives and wire them all up. You're basically doing all the work for the synthesizer. The benefit is total control. The problem is a lot of code to write, which means increase chances of human error and a lot of rewriting when you make changes to your design.

On the other extreme, you have code that's structureless. You imply no logic or wiring inside. It's purely a black box and leaves it up to the synthesizer to figure things out. The benefits are more compact code and "ease" of modification (if your algorithm isn't too convoluted). The problem of course is the synthesizer usually doesn't do exactly what you want. Much like a 2 year old coming up to you with a smile full of pride saying, "I went potty!" only to realize they did it on the floor.

So when the synthesizer isn't smart enough, you need to help it out by putting a little more structure into your code. Basically you're giving the synthesizer hints on what you want. Also hardware engineers tend to better understand algorithms that have a bit of hardware structure to them. We once had a math major write VHDL, and the comment from one engineer was, "Man, his code looks like C." :_D

Unfortunately I don't use Verilog, but hopefully some psuedo code will be enough to give you an idea of what I'm thinking.

I noticed your code was bit based. Since you only are checking on byte boundaries, it can help to write your code byte based. Good luck with this psuedo code (you should see my C).

data[64:0] -- 65-bit input data signal eight_0s_consec_flag[7:0] -- intermediate signals eight_1s_consec_flag[7:0] -- intermediate signals ninth_bit[7:0] -- intermediate signals consec_detect_flag[7:0] -- 8-bit output signal

for byte = 0...7

lsb = byte*8 msb = byte*8+7

if data[msb:lsb] = "00000000" then eight_0s_consec_flag[byte] = '1' else eight_0s_consec_flag[byte] = '0' end if

if data[msb:lsb] = "111111111" then eight_1s_consec_flag[byte] = '1' else eight_1s_consec_flag[byte] = '0' end if

ninth_bit[byte] = data[msb+1]

if ninth_bit[byte] = '1' then consec_detect_flag[byte] = eight_1s_consec_flag else consec_detect_flag[byte] = eight_0s_consec_flag end if;

end loop

Well hopefully that's correct, and makes sense. Let me know if you have any questions or see any problems.

Regards, Vinh

Vote

M

Martin Euredjian 22 years ago

John,

1- How many 65 bit words per second (ms, ns?) do you have to process? 2- Where do the 65 bits come from? (internal, external) 3- Do they get into the FPGA in parallel or serially? 4- Why are you saying that you need two levels of logic? (trying to control delay with combinatorial logic is not a great idea). 5- Why fight with inference? Instantiate what primitives you need.

Two logic levels?

Two LUT's to look at two consecutive nibbles. One LUT to AND the output of the above with the next most significant bit (the ninth bit). That's it. Two levels. 24 LUT's. Is that what you wanted?

-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Martin Euredjian

To send private email:

0_0_0_0 snipped-for-privacy@pacbell.net where "0_0_0_0_" = "martineu"

9

"INIT=R"

bleah

Vote

V

Vinh Pham 22 years ago

Whoopsy, brain-fart. My previous code will create 3 levels of logic. If we didn't have to detect both nine 1s or nine 0s, then it'd work okay.

Here's an idea for one that should generate 2 levels, but it looks uglier. Definately not as compact as rickman's.

data[64:0] -- input signal ninth_bit[7:0] -- intermediate signal run_dibble[31:0] -- intermediate signal run_byte[7:0] -- output signal

for byte 0...7

ninth_bit[byte] = data[(byte+1)*8]

for dibble 0...3

lsb = byte*8 + dibble*2 msb = byte*8 + dibble*2 + 1

if data[lsb] = ninth_bit[byte] AND data[msb] = ninth_bit[byte] then run_dibble[byte*4 + dibble] = 1 else run_dibble[byte*4 + dibble] = 0 end

end loop

lsb = byte*4 msb = byte*4 + 3

if run_dibble[msb:lsb] = "1111" then run_byte[byte] = 1 else run_byte[byte] = 0 end

end loop

Vote

J

John_H 22 years ago

I overuse the syn_keep attribute and I hate the idea of instantiating LUTs. My Carnot skills aren't exactly used regularly.

I've had problems with bit ranges in the past where [i+4:i] is a complaint. Perhaps this isn't an issue with for loops but I've learned to avoid them in general logic. They do work fine in generate blocks, however. I stepped through every bit to make a comparison to the adjacent bit; 3 adjacent comparisons lumped into one variable (with an eventual syn_keep) would give me 4-input functions that should pack into LUTs. The complex end of the inside loop is so that the three "LUTs" per byte are 4-input, 4-input, and 3-input functions.

This works very well for runs of ones only. I need to identify runs of ones or runs of zeros. The technique can be expanded to my needs resulting in runBitsA, B, and C where one of them needs to cover 2 comparisons, not 3 like the others. ...which is really is the approach I was coding but using consecutive bits in a vector rather than {A,B,C} and using the one statement rather than 3 to make the assignments, dealing with the 2 comparison exception by terminating the inside loop early.

Thanks for the help.

Vote

J

John_H 22 years ago

Thanks for noticing :-)

I like the code below with respect to its symmetry - it's a lot easier to read than the stuff I generated. The four 3-input LUTs feed a single 4-input LUT with (only a) little arguement from the synthesizer, I'm sure. It can be done with fewer LUTs by using

4-input LUTs covering 3 compares each but then the symmetry gets lost and the coding gets unpleasant.

I think I have an acceptable solution together that gives me good speed and good utilization which I'll post separately.

Thanks for your thoughs with this.

Vote

J

John_H 22 years ago

This run detection is one part of a 100MHz-200MHz mechanism.

Internal, blindsided from BlockRAMs with a new value per cycle.

Entirely parallel, into the BlockRAMs at full width.

If I go from BlockRAM to registers, I have the (relatively) long Tcko delay for the BlockRAM read and associated routing leaving little time to manipulate the data within the period. If I register the data from the BlockRAM, it's best to generate and use the run values in the next cycle requiring moe logic after I flag the runs, suggesting minimum delay is best.

The logic primitives are what I've avoided. I don't want to use LUT4 primitives with INIT attributes since I might mess up the carnot map. This is why the inference has been broken down into bits that can be retained (with syn_keep or other method).

Almost. The LUTs can't look at full nibbles. Since I need to make sure all bits are equal to each other, there's a "smear." One attempt was to XOR adjacent bits, then to do the 8-wide AND of the result, letting the synth give me the "best" results. It didn't. Thinking about the XOR-to-AND progression, 4 bits are needed to implement 3 bits of the AND, so two 4-bit LUTs and one 3-bit LUT are needed, feeding a 3-input AND.

That's it. Two levels. 32 LUTs. But the synth doesn't like my inferrences. I think I have a solution that "works."

Vote

J

John_H 22 years ago

For anyone interested in how I got things together, I ended up using one generate loop to instantiate 8 MUXF5s. Why MUXF5s?

1) One can make an 8-input AND with 2 LUTs and a little extra delay by having the first 4-bit AND feed the select and the sel==0 input - if the AND is false the result is false, if the AND is true, the result is the other AND.

2) By using a primitive, the logic feeding the primitive's pin isn't optimized across the primitive.

The synthesizer will produce a nice 2-level implementation for 5 compares but not 8 so splitting it up into 5 compares and 3, the MUXF5 used as an AND can give a nice balance of delays. Its slightly more than 2 LUTs of delay, but very slightly. The code looks cleaner and the implementation is tight.

=============================================================== module testRun ( input clk , input [64:0] bytesPlus1 , output reg [ 7:0] runByte );

wire [ 7:0] runMux; wire [63:0] xnorBits = bytesPlus1[63:0] ^~ bytesPlus1[64:1]; // the result of a bit compare a==b is the same as a^~b

genvar h; generate for( h=0; h

Vote

G

Goran Bilski 22 years ago

Hi,

Why not use the carry-chain?

You can do any kind of detection on that primitive and it will save you LUTs

Göran

John_H wrote:

Vote

R

rickman 22 years ago

Actually, I don't think logic primatives will work since the back end mapper can redo logic at will. The keep attribute is what is required to define the LUTs and even that is not guaranteed since it only results in a wire being kept; the LUT can still be split if other logic uses the same inputs.

I don't really see what problem you are trying to solve with that, but then I am not as well versed in verilog compared to my VHDL.

Again, I may not completely understand your problem. This was intended to show you how to solve the problem. To cover the adjacent zeros, you just do the same logic using the OR operator and invert the result.

Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Vote

V

Vinh Pham 22 years ago

Yeah, it hit me when I was laying down to sleep. I wanted to put it off till today but it annoyed me too much.

Great catch. Yes, you can save a LUT per byte by comparing on tribbles instead of dibbles, but the code gets even more ugly than it already is.

Cool, I'm looking forward to seeing it.

Vote

J

John_H 22 years ago

I tried that approach earlier today but I wasn't getting the carry chain I was trying to infer. The Virtex-IIs started getting poorer at getting on/off carry chains timing-wise relative to the general logic resources so I was trying to get general logic to work; I suspect Spartan-3s are similar. If I go straight to register, I would need 4 LUTs to go into the register through the XORCY instead of the natural XORCY so the logic savings isn't a given to achieve the speed but I could keep it to 3 LUTs with a small routing hit.

I believe I'd need to implement all the carry chain primitives through the generate block, including the MUXCYs and XORCY elements because the synthesizer sees that "oh, it's a short chain" and converts my simple arithmetic form to a cascade of LUTs rather than the carry chain. I tried my tricks, I stopped pursuing.

Maybe I'll try to coax it again tomorrow. Thank goodness for that generate!

Vote

G

Goran Bilski 22 years ago

020901030008010704030503 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit

Hi,

What synthesis tool did you use? When I instanciate primitives directly in my code, they tend to stay in the netlist. The synthesis tool usually leaves them alone. If isn't working attach a U_SET attribute to them so the tools thinks it's a RPM which is normally leaves alone. That has been my approach for doing primitives.

Göran

John_H wrote:

Vote

J

John_H 22 years ago

Synplify does a good job of leaving the instantiated primitives in the code, sure. My first issue was that I believe carry chains are longer than 2 levels of LUTs. The second issue is that I was trying to infer

- not instantiate - the adder chain by adding 1 to {1,1,1} when all three LUTs worth of logic are valid, using the sign as my output. The synthesizer turned the inferrence into LUTs which is probably more effective in logic delay. I would need to generate 3 MUXCYs per chain for 8 chains. If I want the quick-register destination, I need to also instantiate the XOR in the sign bit. 4 primitives replicated 8 times for timing which *may* be worse. I chose not to pursue hte instantiations because of the expected lower performance.

Vote

G

Goran Bilski 22 years ago

050707070408040807090405 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit

Hi,

Yes, The crosspoint for using carry-chain or luts are somewhere between

2 and 3 levels.

But I have found that I can continue doing logic operation by continuing the carry-chain. It's easy to do "and","or" operation on the carry-chain.

So it depends on how signals are used afterwards.

But maybe for your problem, simple LUT implementati>

Vote

V

Vinh Pham 22 years ago

Yes, it looks really clean. Nice :_)

Cool use of the XNOR and AND functions to do your compare. A lot more elegant than all my for loops. Learn something new every day.

But something is nagging me. Can't your xnor/and scheme be implemented in two levels of LUTs without the need of a MUXF5? I must be missing something, or maybe the problem is the synthesizer isn't mapping it that way?

The first level LUTs would implement (output = (bit0 xnor bit1) and (bit1 xnor bit2)). You'd have four LUTs in the first level, then the second level LUT would AND all of them together.

Vote

H

Hal Murray 22 years ago

Are Carnot skills needed? I can generally count to 4. A 4 input LUT can implement any function of 4 inputs.

-- The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.

Vote

Digesting runs of ones or zeros "well"

Join the Discussion

Didn't find your answer?