It does spend a lot of time. It's still cheaper to do a processor context switch (around 500 ns) than to wait on the I/O, even from an NVMe device; the I/O software stack is becoming a major component of the delay, but it's still less than the device response time.
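As a rough sketch of that trade-off (the path and the latency figures in the comments are assumptions for illustration, not a benchmark): a plain blocking read lets the scheduler pay the ~0.5 µs switch and run something else for the tens of microseconds the device needs.

    /* Rough illustration: a blocking direct-I/O read on an NVMe-backed file.
     * While the request is in flight the kernel context-switches to another
     * runnable thread (~0.5 us overhead), which is small next to the
     * ~10-100 us an NVMe completion typically takes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)   /* O_DIRECT needs aligned buffers */
            return 1;

        int fd = open("/mnt/nvme/data.bin", O_RDONLY | O_DIRECT);  /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        /* The calling thread sleeps here; the scheduler runs something else
         * and switches back on the I/O completion. */
        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0) perror("pread");

        close(fd);
        free(buf);
        return 0;
    }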
And never will be. The next technology wave is persistent memory (PM), not NVMe. NVMe sits on a PCIe bus, which is bandwidth and latency limited. PCIe was always about building to a price, not for performance.
There are other, much faster bus technologies in the offing: CCIX, GEN-Z, and OpenCAPI, for example. A good overview of these:
PM sits directly on the memory bus. In other words, you don't do block I/O to these new memory devices; you do loads and stores. Current PM latency is in the high hundreds of ns to single-digit µs, so it sits between DRAM (which is not persistent; single-digit ns to low tens of ns) and NVMe block storage (tens of µs to ms, dependent on the software stack and bus).
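To make the "loads and stores" point concrete, here's a minimal sketch assuming a file on a DAX-mounted filesystem (the path is made up); with real PM you'd persist via cache-line flushes (CLWB/SFENCE) or libpmem's pmem_persist() rather than msync, but the data path is ordinary memory access either way.

    /* Minimal sketch: persistence via ordinary stores to a mapped file,
     * no block I/O in the data path. Path is hypothetical; on a DAX mount
     * the mapping goes straight to the persistent media. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/pmem0/counter", O_CREAT | O_RDWR, 0644);  /* hypothetical DAX file */
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

        long *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        *p += 1;                      /* an ordinary store -- no read()/write() syscall */
        msync(p, 4096, MS_SYNC);      /* force it to media; libpmem's pmem_persist()
                                         does this with cache-line flushes on real PM */

        printf("counter = %ld\n", *p);
        munmap(p, 4096);
        close(fd);
        return 0;
    }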
SNIA (a storage standards organisation) covers a lot of the background
Smaller gate sizes paradoxically increase power consumption, and hence the heat generated, per volume of silicon.
I disagree; big data centers definitely need faster everything. The more data you have and need to number-crunch, the harder it becomes to move it around, and the big push right now is providing very high-speed RDMA-type links (memory to memory, through smart network cards, without involving the CPU) between storage and processors to make this easier.
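For a flavour of what "without involving the CPU" means in practice, here's a hedged sketch of the first step of any RDMA setup with libibverbs: registering (pinning) a buffer so the NIC can read and write it directly. Queue-pair setup and the actual RDMA read/write postings are omitted; this is just the memory-registration piece.

    /* Sketch: register a buffer with an RDMA NIC (libibverbs) so a remote
     * peer's NIC can read/write it without either host CPU copying data.
     * Compile with -libverbs; connection setup is omitted for brevity. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        size_t len = 1 << 20;                 /* 1 MiB buffer to expose */
        void *buf = malloc(len);

        /* Pin the buffer and hand its translation to the NIC; the returned
         * keys are what local work requests and the remote peer use. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { perror("ibv_reg_mr"); return 1; }

        printf("registered %zu bytes, rkey=0x%x\n", len, mr->rkey);

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }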