You have a few choices.
You can instantiate the SRL16 and FF. Doing that guarantees the proper components, and provided you don't have a reset on the FF, the mapper will pack the two into the same slice even if you don't put RLOCs on them. Without the RLOCs, there is no issue of portability between Xilinx families later than XC4000* and SpartanI. Including RLOCs adds a wrinkle because the RLOC format for Virtex, VirtexE and Spartan2 is different from that for later families, and with Spartan3 or Virtex4 there are restrictions as to which columns can have an SRL16. Either way, you have the choice of using or not using RLOCs.
You can also infer the flip-flop by connecting the global reset to it. If you do this, you MUST make sure every flip-flop in the design also has the global reset connected to it. If you leave any flip-flop out, you wind up with a huge reset net on general routing resources. You also get a signal wired to the reset pin of the flip-flop on the SRL16 output, which in turn forces that flip-flop out of the SRL16's slice.
You can connect the inferred flip-flop reset pin to an instantiated ROC component. This puts the flip-flop reset on the built-in reset network.
Depending on the synthesis tool, you may be able to set an attribute to force the synthesizer to put a flip-flop on the SRL16 output. If you do this, check your result any time you use a different tool or version of the tool. If you have more than one SRL16 chained together, the tools historically have only put the flip-flop on the last one in the chain, which is no better than not using the flip-flop at all.
Depending on the tool, you may also be able to put a keep buffer between the inferred SRL16 and flip-flop to force that signal to be retained. Early on, I had mixed results with this using Synplify. In some versions it worked, in others it didn't (in one version it forced a LUT to be inserted between the SRL16 and the FF, which is the worst possible outcome).
How do I deal with it? I have an IP block that instantiates RLOC'd SRL16s and flip-flops. It takes the desired delay and Virtex family as generics, generates an array of SRL16s and FFs to match the width of the output port, and divides the delay into as many SRL16+FF segments as are needed to create it.
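The delay-splitting arithmetic can be sketched roughly as follows. This is an illustration in Python, not the IP block itself. It assumes each segment is one SRL16 (1 to 16 clocks, selected by a 4-bit address equal to the SRL16 delay minus one) followed by one flip-flop (1 clock), so a segment covers 2 to 17 clocks. Balancing the delay evenly across segments is a choice made for this sketch, and the name split_delay is hypothetical.

```python
SRL_MAX = 16            # an SRL16 delays by 1..16 clocks (address + 1)
SEG_MIN = 2             # 1 clock in the SRL16 plus 1 clock in the flip-flop
SEG_MAX = SRL_MAX + 1   # 16 clocks in the SRL16 plus 1 clock in the flip-flop


def split_delay(total):
    """Split a total delay (in clocks) into SRL16+FF segments.

    Returns a list of (srl_address, segment_delay) pairs, one per segment.
    Assumption of this sketch: the delay is spread as evenly as possible.
    """
    if total < SEG_MIN:
        raise ValueError("need at least 2 clocks for one SRL16+FF segment")
    n_segments = -(-total // SEG_MAX)          # ceiling division
    base, extra = divmod(total, n_segments)    # first 'extra' segments get +1
    segments = []
    for i in range(n_segments):
        seg_delay = base + (1 if i < extra else 0)
        srl_address = seg_delay - 2            # minus the FF, minus 1 for the address offset
        segments.append((srl_address, seg_delay))
    return segments


if __name__ == "__main__":
    for addr, delay in split_delay(40):
        print(f"SRL16 address {addr:2d} + FF = {delay} clocks")
```

For a 40-clock delay this yields three segments whose delays sum to 40, each ending in a flip-flop so no SRL16 output ever drives general routing directly.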
The root of the problem is that the SRL16 has a comparatively very slow clock-to-Q time, which is not a problem as long as the SRL16 is wired only to the flip-flop in the same slice (thereby avoiding adding routing delays to the long clock-to-Q). This is compounded because synthesis tools don't automatically stick a register on the SRL16 output.