Not knowing your exact requirements, the simplest approach I can see would be to keep a running average over a large number of samples and subtract it from the input.
E.g. total = total + sample(n) - sample(n-2048), output = sample(n) - total/2048.
The division would be a bit shift, and the 2k of samples could be held in a ram block.
It is pretty trivial to show that DC would be blocked, and that signals at f/2048 (f being the sample rate) and its harmonics will not be attenuated, as their average over the 2048-sample window would be zero.
It would also be very fast and have minimal latency; however, it would introduce some phase distortion, as the averaging window isn't symmetrical about the current sample.
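For anyone who wants to check the behaviour before committing it to hardware, here is a minimal software sketch of the running-average scheme above. It is Python rather than HDL, the name dc_block is my own, and it assumes integer samples; the deque stands in for the RAM block.

```python
from collections import deque

N = 2048      # window length, a power of two so the divide is a shift
SHIFT = 11    # log2(N)

def dc_block(samples, n=N, shift=SHIFT):
    """Subtract a running average of the last n integer samples."""
    history = deque([0] * n, maxlen=n)   # stands in for the RAM block
    total = 0
    out = []
    for s in samples:
        # total = total + sample(n) - sample(n-2048)
        total += s - history[0]
        history.append(s)                # also drops the oldest sample
        # Divide by 2048 as an arithmetic right shift (rounds towards
        # minus infinity for negative totals).
        out.append(s - (total >> shift))
    return out
```

Feeding it a constant should give zeros once the window has filled, and a zero-mean waveform with a period of exactly 2048 samples sitting on a DC offset should come out with the offset removed and the waveform otherwise unchanged.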
If that is an issue for your application, then a way around it would be to total over 2047 samples, subtract that total from 2047*sample(n-1024), and then divide by 2048 (once again this can be implemented with addition, subtraction and bit shifts). The output will be slightly attenuated (by 1/2048), but no phase distortion would occur. It will also have a latency of 1024 cycles or so.
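A rough Python sketch of this symmetric variant follows, again with a made-up function name and integer samples assumed. One assumption to flag: I have taken the window as sample(n) down to sample(n-2046), which puts the exact centre at sample(n-1023); if your total lags by a cycle, the centre moves to n-1024 as written above.

```python
from collections import deque

def dc_block_symmetric(samples, shift=11):
    """Centred running-average DC blocker over 2**shift - 1 samples."""
    width = (1 << shift) - 1     # 2047 samples in the window
    delay = width // 2           # 1023: distance back to the centre sample
    history = deque([0] * width, maxlen=width)
    total = 0
    out = []
    for s in samples:
        total += s - history[0]  # add newest, remove oldest
        history.append(s)
        centre = history[width - 1 - delay]   # sample(n - delay)
        # 2047*centre built from a shift and a subtract, then divide by 2048.
        scaled = (centre << shift) - centre
        out.append((scaled - total) >> shift)
    return out
```

A quick sanity check is to feed it a single impulse: the response should be the same read forwards or backwards about the centre tap, which is what gives the linear phase, and a constant input should settle to zero.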
Mike