It isn't; that's the point. /If/ you can write in a high-ish level language like CUDA or OpenCL and achieve your goal with commodity hardware you can buy in any city, why would you want to use an FPGA? Why would you want to worry about synthesis, meeting timing, state machines, and debugging with a logic analyser? Maybe your problem doesn't fit the GPU model and is better suited to an FPGA, but you really ought to stop and think first.
It's not the HDL as a language per se (though Verilog's lax syntax makes it easy to introduce bugs: mistype an identifier and you silently get a new one-bit wire rather than a compile error), it's that the abstraction isn't high enough to make useful progress. If somebody wants to experiment with architecture they should be able to do that without having to manage all the underlying complexity, and then go back later and refine the code for performance. But this is awkward in (e.g.) Verilog unless you have a very good test suite - it's too easy to introduce control-flow bugs.
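To make the control-flow point concrete, here's a sketch of a classic trap (toy code, not from any real design): an always block that's meant to be combinational but doesn't assign its output on every path.

    // Toy example of a Verilog control-flow trap. This block is
    // meant to be pure combinational logic, but 'grant' is not
    // assigned on every path, so synthesis infers a latch and the
    // circuit quietly remembers state you never asked for.
    module arbiter (
        input  wire       req0, req1,
        output reg  [1:0] grant
    );
        always @* begin
            if (req0)
                grant = 2'b01;
            else if (req1)
                grant = 2'b10;
            // missing: else grant = 2'b00;
            // legal Verilog, so the bug sails through compilation
        end
    endmodule

It's legal code, so at best you get a synthesis warning buried among thousands of others, and the design misbehaves in ways a testbench may never hit.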
I've read the code of a web browser written in assembler. That's an example where the abstraction was not high enough: hand management of registers and bit-twiddling of memory made it impossible to keep control of the complexity, and development eventually ground to a halt. Likewise Verilog gives you ultimate control, and that's not always what you want when you're just evaluating ideas.
All I'm saying is that Verilog/VHDL offer too low a level of abstraction for architectural exploration. I'm not saying all HDLs/HLS tools are bad, just that you need to pick the right language.
Agreed. Some problems are relatively simple, heavily-parallel compute, and especially if they can be easily pipelined they fit an FPGA nicely. Likewise, if they have Gbps of external I/O (or I/O that isn't in a PC-friendly format), an FPGA will leave a GPU standing.
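To illustrate what 'easily pipelined' means, a toy sketch (made-up widths, overflow ignored): split y = a*b + c into register stages and the FPGA delivers one result every clock once the pipeline fills.

    // Toy pipelined multiply-accumulate: y = (a * b) + c,
    // split into two register stages. Latency is two clocks,
    // but throughput is one result per clock - exactly the
    // shape of problem an FPGA is good at.
    module pipelined_mac (
        input  wire        clk,
        input  wire [15:0] a, b,
        input  wire [31:0] c,
        output reg  [31:0] y
    );
        reg [31:0] prod_q;  // stage 1: registered product
        reg [31:0] c_q;     // c delayed one cycle to stay aligned

        always @(posedge clk) begin
            prod_q <= a * b;         // stage 1: multiply
            c_q    <= c;
            y      <= prod_q + c_q;  // stage 2: add (overflow ignored)
        end
    endmodule

Each stage does a little work every clock; the throughput comes from the pipeline depth, not from a high clock rate.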
However if they need heavy floating point, as a lot of scientific compute does, that starts eating up area rapidly. If they're memory-bound, you're up against the limits of DDR3, which has far less bandwidth than GDDR5. And if you want to do iterative development, multi-hour FPGA synthesis runs are not conducive to it.
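Back-of-envelope, to put numbers on the memory point (assuming a single DDR3-1600 channel against a mid-range GDDR5 card with a 256-bit bus at 6 GT/s, both peak figures): DDR3-1600 gives 1600 MT/s x 8 bytes = 12.8 GB/s per 64-bit channel, while the GDDR5 card gives 6 GT/s x 32 bytes = 192 GB/s. That's more than an order of magnitude before anyone has optimised anything.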
Horses for courses and all that. My point is that you should first do the work to see how your algorithm best fits the technologies available to you (CPU, GPU, FPGA), and refactor it to suit: you may get substantially more performance by reshaping the algorithm for a given technology than by jumping straight in with a naive implementation. Only then implement it, and be prepared to (repeatedly) refactor your architecture again in the light of that experience.
I'm not saying 'FPGA bad, GPU good'; I'm saying that implementing an FPGA design for scientific compute is a lot of work, so you need a clear reason for doing it. Doing it 'to make my Matlab go faster' is not a good enough reason, because there are far less painful ways to achieve that.
Theo