Fencepost errors are also relatively easy to test for. Having a good method of generating test vectors is important.
For things like CUDA or OpenCL you're smack up against the von Neumann bottleneck. I expect the video card makers to embrace interfaces faster than PCIe relatively soon, like M.2 NVMe.
I may get back to it, but for one thing I work on ( VST plugin convolution ), the general-purpose-processor (GPP) approach wins for now.
Other than that, it depends on what you mean by "massively parallel." With the bog-standard open/read/write/close Linux driver ioctl() model and event-driven calls like select()/poll()/epoll(), it gets quite a bit easier.
If that's not good enough, shared memory is a possibility. There are other paradigms.
When I think of libraries, I think of what's available for Fortran.
Writing libraries otherwise isn't much of a business model. Open-source libraries are only as good as the people who steer them.
The Boost library should be a wonderful thing; it is, sometimes, but more often it's just a whacking great overhead.
But it's not like there are healthy markets for "bolt makers" in software. And there's got to be a limit to how good the tools actually are. Turns out I can start using Clang at home; I'll see how impressive it is.
The problem in software is pretty simple: half the practitioners have been doing it for less than five years. Throw in the distractions of developing for the Web and it's even worse.