For a future project I need to design a programmable delay line. The specifications are: steps of +/- 1 ns, range of 0 to 8 ns. I have no PLLs left so I can't use a high-speed pipeline to design this. Note that the delay should be fixed when temperature varies, this means that I'll probably need some calibration. I think that Xilinx uses something like this to delay the DQS signal in the S3E DDR controller. Has anybody got an idea how I should implement this?
Are you sure you need 1 ns steps? Can n steps (n>=8) that can be shown through internal measurements to cover 8 ns be sufficient? Can you handle some drift with temperature as long as you can adjust the taps when the change makes another tap more "ideal" for a fixed time?
With a Cyclone II you probably have zero chance of a PVT-independent (process, voltage, temperature) delay line, or of a delay line with explicit precision (e.g., 1.0 ns steps).
Can you look at other families? The DQS circuitry you might be recalling could be from the Xilinx Virtex-4 or the Lattice ECP2 family. The Virtex-4 uses calibration, the ECP2 uses DLLs, and neither covers the 8 ns range.
External PCB traces from identical drivers to identical receivers might be your only hope, but it costs real estate.
I can do recalibration in the video blanking time (this is for a video application). The step size doesn't have to be 1 ns, and not all steps need to have the same latency. However, if I program it to 3.5 ns it should stay between 3 and 4 ns over the whole temperature range; a drift of 0.5 ns is acceptable. Calibration can be done during the sweep through the temperature range. The temperature gradient won't be high, so you can assume there are enough calibration cycles.
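A per-blanking-interval calibration step like that could simply pick whichever tap's measured delay lands closest to the programmed target, and flag the case where even the best tap has drifted outside the ±0.5 ns window. A minimal sketch of the selection logic, assuming the per-tap delays come from some hardware measurement (the function and argument names here are hypothetical):

```python
def choose_tap(measured_delays_ns, target_ns, tol_ns=0.5):
    """Pick the tap whose measured delay is closest to the target.

    measured_delays_ns: per-tap delays from the last calibration pass
    target_ns:          the programmed delay (e.g. 3.5)
    tol_ns:             acceptable deviation (0.5 ns per the spec above)
    Returns the tap index, or None if no tap is within tolerance.
    """
    best = min(range(len(measured_delays_ns)),
               key=lambda k: abs(measured_delays_ns[k] - target_ns))
    if abs(measured_delays_ns[best] - target_ns) > tol_ns:
        return None  # chain too fast/slow at this temperature: alarm or extend the chain
    return best
```

Re-running this every blanking interval keeps the selected tap tracking temperature, as long as the drift per frame is small compared to the tolerance.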
Two directions might get you close to where you want to be:
In your Cyclone II, check what your LUT delay is and what your adjacent-LAB routing delay is. You could put together a delay line that sources on the left, travels through LEs/LUTs/LABs to the right, and returns through LEs that select between the rightward-going path and the adjacent leftward-going path. The coarseness is probably too much for your needs, though.
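The delay through such a go-right-then-return chain is just the sum of the per-stage LUT-plus-routing delays out to the selected turnaround point, doubled for the return trip. A toy model of the tap spacing (the per-stage numbers below are invented; real values would come from the Cyclone II timing reports, and the return path is assumed symmetric):

```python
def chain_delay_ns(stage_delays_ns, turn_tap):
    """Round-trip delay when the signal travels `turn_tap` stages to the
    right and returns through the adjacent leftward path, assumed to have
    the same per-stage delay as the rightward one."""
    one_way = sum(stage_delays_ns[:turn_tap])
    return 2 * one_way

# hypothetical per-stage (LUT + adjacent-LAB routing) delays, in ns
stages = [0.65, 0.70, 0.68, 0.66, 0.71]
taps = [chain_delay_ns(stages, k) for k in range(len(stages) + 1)]
```

The resulting tap list makes the coarseness problem concrete: each selectable step moves the delay by two stage delays at once.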
Another approach could use the carry chain, but the carries as implemented in the LABs might not have the regularity for a good, variable delay. One signal feeding the LAB could be "picked up" at various points along the carry chain for finer adjustment than the previous approach.
You can work with combinations of features in your FPGA that give you consistent delays you can use. It depends on what you can get from your silicon as to how fine your resolution is.
The actual delay can be measured by configuring the delay line as a ring oscillator with the injection point into your fixed-output delay line varied as you would vary the live signal. Measuring the ring oscillator frequency compared to your reference will give you the calibration points.
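Turning the frequency count into a delay figure is straightforward arithmetic: count ring-oscillator cycles against a reference window of known length, take the period, and halve it, since the signal makes two traversals of the loop per oscillation in an inverting ring. A sketch with hypothetical names (the count itself would come from a counter clocked by the ring output):

```python
def measured_delay_ns(osc_count, t_ref_ns):
    """Loop delay for one injection point of the ring oscillator.

    osc_count: ring-oscillator cycles counted during the reference window
    t_ref_ns:  length of the reference window in ns (from your known clock)
    """
    period_ns = t_ref_ns / osc_count
    return period_ns / 2  # one-way delay: two loop traversals per cycle
```

For example, 10,000 cycles counted in a 100 us window gives a 10 ns period, i.e. a 5 ns delay through the loop at that injection point.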
Process variation could give you a 4:1 change in your delay line performance as PVT changes, meaning the method would typically need to significantly overbuild relative to "typical" to handle the fastest cases, yet maintain the resolution for the slowest.
With this added information on how it could be put together, do you still want to pursue this delay line?
I have had great results sampling and deskewing multiple low-quality 600 Mb/s data channels in a Spartan-3E with similar techniques, but this front end of mine is overbuilt and doesn't deliver "delay" as you appear to want it.
If you do experiment along the FPGA silicon delay lines, try to keep your signal inverting as it goes through the chain to avoid "duty cycle compression" where the ones and zeros get lopsided in their size. Ring oscillators use multiple inverters in stages for a reason.
If you were talking 2 ns steps I would suggest using the DDR structure of an I/O cell with opposite phases of a 250 MHz clock, if this is going to the outside world. If you were using a more expensive part like a Virtex-4 you could probably get to 1 ns resolution. Spartan-3/3E might manage it, but it is beyond the spec.
Otherwise, and possibly slightly horrible, is to have 4 phases of a 250 MHz clock driving 4 flops, which are put through a LUT acting as an OR. Your logic would have to figure out which flop to drive to produce the necessary delay. This scheme would have some timing variance due to voltage, batch, etc. in the back-end LUT and routing. It would need to be hand-placed within the FPGA using something like FPGA Editor to get closely balanced paths; otherwise you are very likely to get non-linear steps in your delay.
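The control logic for that scheme reduces to mapping a requested delay onto a whole-cycle offset plus one of the four phase-shifted flops. Assuming a 250 MHz clock (4 ns period) with evenly spaced phases, a sketch of the mapping (names are illustrative, not from any real tool):

```python
def phase_select(delay_ns, f_clk_mhz=250.0, n_phases=4):
    """Map a requested delay to (whole clock cycles, phase index).

    At 250 MHz with 4 evenly spaced phases the step is 1 ns, so e.g.
    a 5 ns request becomes one full cycle plus the 90-degree phase.
    """
    period_ns = 1000.0 / f_clk_mhz      # 4 ns at 250 MHz
    step_ns = period_ns / n_phases      # 1 ns per phase step
    steps = round(delay_ns / step_ns)   # quantize to the nearest step
    cycles, phase = divmod(steps, n_phases)
    return cycles, phase
```

Note this only gives the ideal step mapping; the actual delay of each path still varies with the LUT and routing skew discussed above, which is why the hand placement matters.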