shift register with distributed ram

- CMOS
posted
17 years ago

Sat, Mar 24, 2007 7:26 AM

Is it possible to implement a serial-in, parallel-out shift register from Xilinx distributed RAM? Any guidance is appreciated.

- John McCaskill
posted
17 years ago

Sat, Mar 24, 2007 3:56 PM

I remember seeing an app note on the Xilinx web site dealing with using the JTAG port to initialize BRAMs that showed how to get two bits per LUT plus one more from the FF. I think it was an app note by Ken Chapman showing how to initialize the BRAM in a PicoBlaze. A quick search of the Xilinx web site mentions a program called JTAG_loader that I think uses this technique. I will leave the rest of the searching up to you, unless maybe someone with near-perfect recall just happens to remember the app note I am talking about.

Regards,

John McCaskill


- Peter Alfke
posted
17 years ago

Sat, Mar 24, 2007 3:57 PM

The LUT-based distributed RAM in Xilinx FPGAs can be used as a shift register, called SRL16 or SRL32, with a length (depth) that is dynamically adjustable by the address inputs. But since the LUT has only one output (two in Virtex-5), you cannot use a LUT as a serial-to-parallel converter. Peter Alfke, Xilinx Applications
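To make the single-output limitation concrete, here is a minimal behavioral sketch in Python (not HDL, and not a Xilinx-supplied model). The class name `SRL16E` mirrors the primitive, but the method names and the assumption that INIT bit i preloads stage i are illustrative choices. The point is that all 16 bits are stored, yet only the addressed bit (plus the fixed last stage on the cascadable variant) is visible at any one time.

```python
class SRL16E:
    """Behavioral sketch of a 16-stage addressable shift register in one LUT."""

    def __init__(self, init=0):
        # stages[0] holds the newest bit, stages[15] the oldest.
        # Assumption: bit i of `init` preloads stage i.
        self.stages = [(init >> i) & 1 for i in range(16)]

    def clock(self, d, ce=1):
        """One rising clock edge: shift in bit d when clock enable is high."""
        if ce:
            self.stages = [d & 1] + self.stages[:15]

    def q(self, addr):
        """Selectable output: the only general-purpose tap, chosen by the address."""
        return self.stages[addr & 0xF]

    @property
    def q15(self):
        """Fixed last-stage output, as used for cascading on the SRLC16E."""
        return self.stages[15]


srl = SRL16E()
for bit in [1, 0, 1, 1, 0, 0, 1, 0]:
    srl.clock(bit)
# Changing the address changes the effective depth without disturbing the contents.
tap_newest = srl.q(0)   # bit shifted in on the last edge
tap_older = srl.q(3)    # bit shifted in three edges before that
```

With address A, the data appears A+1 clocks after it was presented, which is the "dynamically adjustable length" described above.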

- John_H
posted
17 years ago

Sat, Mar 24, 2007 5:04 PM

Think of how a serial-in, parallel-out shift register is put together. There is a series of shift elements that shift the data in, with a broadside dump of all the shift registers into output holding registers.

If you implement an n-bit serial-in, parallel out shift register where the most recently shifted in bit is present on the output, you'll need 2n registers.

If you want the top n bits of an m-bit shift register where the most recently shifted bit is m-n bits from the parallel-out data, you can use 2(n-1) independent registers and int((m-n+15)/16) shift registers where the last distributed memory shift register also uses the embedded register for output.

While the tools would not synthesize the stages, you could instantiate an SRLC16E element with an output mux address of 0 to accompany the output registers to get an n-bit serial-in, parallel-out shift register in n LUTs with n embedded output registers.

But since registers are plentiful in the Xilinx series (heck, Lattice even tossed out 25% of the registers in their low-cost family in recognition of this fact), it's probably much better to use the registers (implemented with direct inputs) and leave the LUTs for other combinatorial logic.

The register-only approach also eliminates the problems with whether to reset or not since the distributed RAM shift registers 1) cannot take a reset themselves and 2) make the use of a reset for the embedded output register almost impossible.
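The register-only structure described above (n shift registers plus n holding registers, 2n in total) can be sketched behaviorally in Python. This is only an illustration: the class and signal names (`SipoShiftRegister`, `load`) are made up here, and the holding bank is assumed to capture the shift bank's pre-edge contents, as two flip-flop banks on the same clock would.

```python
class SipoShiftRegister:
    """Sketch of an n-bit serial-in, parallel-out register built from 2n flip-flops."""

    def __init__(self, n):
        self.shift = [0] * n  # n shift flip-flops, index 0 is the newest bit
        self.hold = [0] * n   # n holding flip-flops driving the parallel output

    def clock(self, d, load=False):
        """One clock edge: always shift; dump broadside into hold when load is set."""
        if load:
            # Holding registers see the shift bank as it was before this edge.
            self.hold = list(self.shift)
        self.shift = [d & 1] + self.shift[:-1]

    @property
    def parallel_out(self):
        return list(self.hold)


sipo = SipoShiftRegister(4)
for bit in [1, 0, 1, 1]:
    sipo.clock(bit)
sipo.clock(0, load=True)   # captures the four bits while the next word starts shifting
word = sipo.parallel_out   # stays stable until the next load
```

Because every bit lives in an ordinary flip-flop, a reset could be added to either bank without the restrictions mentioned for the SRL-based version.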

- John_H

- John McCaskill
posted
17 years ago

Sat, Mar 24, 2007 5:21 PM

The SRL16s have two outputs that can be used for serial-to-parallel shifters: the Q15 for cascading, and the selectable output. I finally found the article that describes how to use an SRL plus a FF to build a 6-bit-per-slice serial-to-parallel shifter. It is by Kris Chaplin, not Ken Chapman like I had been thinking.

The article is a TechXclusives "Reconfiguring Block RAMs" at:

formatting link

or:

formatting link

Regards,

John McCaskill


- Marty Ryba
posted
17 years ago

Mon, Mar 26, 2007 1:48 AM

Slightly in another direction...is there a trick to setting up the cascades on the SRL16s to maintain a consistent delay? We strung 8 in a row to get an adjustable 1-bit delay line. It works, but there's a bunch of extra muxes, etc. to get the delay consistent (3 clocks plus whatever tap I pull as output). I'm actually the systems guy and not the VHDL coder (and communication of requirements is always tricky when you're doing something new), but I'm always interested in the mechanics to see if it can be done better (smaller) while still meeting requirements. Especially since I'd really like a 1024 tap delay but I ran out of space (I need tens of these, plus other DSP goodies). Suggestions on other mechanisms to use are also welcome.

Dr. Marty Ryba martin (dot) ryba (at) verizon (dot) net (man, I hate sp*m)

- John_H
posted
17 years ago

Mon, Mar 26, 2007 3:35 AM

Shift registers are clocked. Clocked elements don't have routing consistency issues, they have routing maximum issues. I'd suggest using some Xilinx routing for combinatorial delays in an *extremely* well controlled situation, inverting consecutive stages of a multi-tap delay to reduce pulse width distortion. But a 1024 element delay line?! It sounds like you need a nice, clocked delay. SRLs in series shouldn't have delay issues.

Is it that you're taking the output from a very long clocked shift register? If so, just clock the muxed outputs to get all the SRLs to show up at the output pin at a predictable time.

Often the conceptual problem with unclocked delay lines is figuring out how to get a consistent input path or a consistent output path; the trouble is, both are needed.

What is your desired range and resolution? Acceptable jitter?

- Peter Alfke
posted
17 years ago

Mon, Mar 26, 2007 3:59 AM

- Marty Ryba
posted
17 years ago

Tue, Mar 27, 2007 3:09 AM

Thanks for the suggestions. Just to clarify, it is a clocked delay that I want; everything is on a clock, which functions as a sample counter as well. Say there is a bitstream, and I want two copies of it: one "prompt" copy and one "delayed" copy with the delay being a variable number of samples in some kind of buffer. This is easy with RAM in a GPP (pointer arithmetic), but a GPP is not fast enough for pipelined processing (~30 Mbps on each of 10 or more bit streams). During routine processing, the delay is fixed and I want on each clock the pipeline to shift by one. Now, on subsequent processing cycles, *maintaining the state of the pipeline* I may want to tap a different delay point.

Now, since I want more than 16 taps of delays, I see two approaches (let N be the total delay):

1) The Q15's are connected for cascading, the output of the (N/16)+1 SRL is set to the remainder of the delay, and a mux selects the output pin of the (N/16)+1 SRL to use as output of the block. Based on the delay value, the timing of the appearance of the correct sample depends on how many SRL's it traverses before it exits. So, delay needs to be inserted to make it constant. I'm likely glossing over details I don't quite understand since I didn't code it myself, and most of this design was done 2 years ago. This is how I believe it's implemented right now.

2) The output of each SRL is connected to the next one's input. The first (N/16) of the SRL's addresses would be set to 15 (max delay), the "middle" one would have mod(N,16), and the others would all be set to zero. The block's output is connected to the output of the last SRL. This I think would give a consistent delay of about N+(# of SRL's). The problem is that it would enforce a minimum delay that I would likely have to insert into my "prompt" channel to balance things back out.

3) Other ideas?? For instance, I actually would prefer to be able to create tens of "fingers" of delay without needing separate parallel pipelines but maybe by having them cascade into each other. Any app notes out there that I haven't dug up yet? I'm revisiting this since we're restarting the program and have the opportunity to revamp parts of the design.
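Approach 2 above can be checked with a small behavioral model in Python. This is a sketch under stated assumptions, not the existing VHDL: each SRL's addressed output feeds the next SRL's data input directly, all SRLs share one clock, and a stage with address A delays data by A+1 clocks. The function names `addresses_for` and `cascade_delay` are hypothetical.

```python
def addresses_for(n, num_srls):
    """Address settings from approach 2: full stages at 15, one at n mod 16, rest at 0."""
    full = n // 16
    if full >= num_srls:
        raise ValueError("not enough SRLs for the requested delay")
    return [15] * full + [n % 16] + [0] * (num_srls - full - 1)


def cascade_delay(addrs):
    """Count clock edges for an impulse to travel through the chained SRLs."""
    regs = [[0] * 16 for _ in addrs]
    din = 1
    for edges in range(1, 16 * len(addrs) + 2):
        # Every stage samples its input from the pre-edge output of the stage before it.
        d_inputs = [din] + [regs[i][addrs[i]] for i in range(len(addrs) - 1)]
        regs = [[d] + r[:15] for d, r in zip(d_inputs, regs)]
        din = 0
        if regs[-1][addrs[-1]]:
            return edges
    raise RuntimeError("impulse never reached the output")


minimum = cascade_delay(addresses_for(0, 4))    # every stage at address 0
example = cascade_delay(addresses_for(37, 4))   # addresses 15, 15, 5, 0
```

In this model the total delay is the sum of (address + 1) over all stages, so the floor is one clock per SRL, which is the minimum delay that would have to be matched in the "prompt" channel. It also means a stage set to address 15 already contributes 16 clocks, so the address arithmetic may need a small correction to land on exactly N.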

Marty Ryba semi-mad scientist proud member of the Luxuriant Flowing Hair Club for Scientists (no kidding!)

- Peter Alfke
posted
17 years ago

Tue, Mar 27, 2007 3:34 AM

Marty, forget the GPP and just wrap two counters around a BlockRAM. That can run ten times faster than you need it. Maybe you can then do some time-division multiplexing to save on BlockRAMs...just an idea... Peter Alfke, from home
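As a rough illustration of the two-counters-around-a-RAM idea, here is a behavioral sketch in Python. It is one possible reading of the suggestion, not a description of any Xilinx primitive: the class name `RamDelayLine`, the 1024-word depth, and the `set_delay` method are assumptions, and the read latency of a real block RAM is ignored.

```python
class RamDelayLine:
    """Sketch of a variable delay line: a RAM addressed by a write and a read counter."""

    def __init__(self, depth=1024, delay=1):
        self.depth = depth
        self.mem = [0] * depth
        self.wr = 0               # write counter
        self.rd = 0               # read counter, trails the write counter
        self.set_delay(delay)

    def set_delay(self, delay):
        """Move the tap by reloading the read counter; stored samples are untouched."""
        if not 1 <= delay <= self.depth:
            raise ValueError("delay must be between 1 and depth")
        self.rd = (self.wr - delay) % self.depth

    def clock(self, din):
        """One clock: read the delayed sample, write the new one, advance both counters."""
        dout = self.mem[self.rd]
        self.mem[self.wr] = din
        self.wr = (self.wr + 1) % self.depth
        self.rd = (self.rd + 1) % self.depth
        return dout


line = RamDelayLine(depth=1024, delay=5)
delayed = [line.clock(sample) for sample in range(20)]
line.set_delay(3)            # tap a different point, pipeline state is kept
next_out = line.clock(20)
```

The delay is just the distance between the two counters, so a 1024-tap range costs one RAM rather than 64 chained SRL16s, and retapping only changes the read counter.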


- Ray Andraka
posted
17 years ago

Fri, Mar 30, 2007 9:46 PM

This works nicely as long as you don't need speed. The direct outputs (i.e., not through the slice register) from the SRL16 are quite slow. Even just routing the SRL output directly to a flip-flop in another slice yields a huge minimum clock period hit.

- Peter Alfke
posted
17 years ago

Fri, Mar 30, 2007 10:13 PM

- Ray Andraka
posted
17 years ago

Wed, Apr 4, 2007 2:13 AM

And a killer when the clock runs at 250 MHz. 100MHz is slow in modern FPGAs...slow compared to what the FPGA is capable of with careful design.