Hi,
Not sure I want to jump into this (but I couldn't resist ;-) ), but I created a stack RISC processor back in 1990 which was targeting space applications. It had the Ada run-time kernel in hardware and supported 8 tasks in hardware. We handled the memory accesses using a cache; there is no real difference in caching between a stack machine and a register-file based CPU. We did, however, have one operand in the instructions to minimize the program code size. Instead of just operating on the two operands on the stack, one operand was addressed with a stack offset. This removed tons of push instructions and thus minimized the program code space.
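To illustrate the stack-offset idea, here is a minimal Python sketch of a toy stack machine. It is not the 1990 processor's instruction set: the opcode names (pick, add, mul, add_off, mul_off) and the offset convention (0 = top of stack) are assumptions made up for this example. It computes x*y + z from three locals on the stack, first with pure zero-operand arithmetic and then with the one-operand form.

```python
def run(program, stack):
    """Run a list of (opcode, arg) tuples on a copy of `stack`; return the final stack.

    Hypothetical toy ISA. Offsets count down from the top of stack (0 = top).
    """
    s = list(stack)
    for op, arg in program:
        if op == "pick":        # push a copy of the entry `arg` slots below the top
            s.append(s[-1 - arg])
        elif op == "add":       # zero-operand: pop two entries, push their sum
            b = s.pop()
            a = s.pop()
            s.append(a + b)
        elif op == "mul":       # zero-operand: pop two entries, push their product
            b = s.pop()
            a = s.pop()
            s.append(a * b)
        elif op == "add_off":   # one-operand: top = top + entry at stack offset `arg`
            s[-1] = s[-1] + s[-1 - arg]
        elif op == "mul_off":   # one-operand: top = top * entry at stack offset `arg`
            s[-1] = s[-1] * s[-1 - arg]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return s


# Locals on the stack: x (deepest), y, z (top).
LOCALS = [3, 4, 5]

# Zero-operand style: every operand must be copied to the top before use.
PURE = [
    ("pick", 2),   # push x
    ("pick", 2),   # push y (one slot deeper now that x was pushed)
    ("mul", None),
    ("pick", 1),   # push z
    ("add", None),
]

# One-operand style: the second operand is read in place via its stack offset.
ONE_OPERAND = [
    ("pick", 2),     # push x
    ("mul_off", 2),  # top = x * y
    ("add_off", 1),  # top = x*y + z
]

if __name__ == "__main__":
    print(len(PURE), run(PURE, LOCALS)[-1])
    print(len(ONE_OPERAND), run(ONE_OPERAND, LOCALS)[-1])
```

Both programs leave the same result on top of the stack, but the one-operand version needs three instructions instead of five, because each operand that is read in place saves one push-style instruction.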
A processor needs an assembler, simple as that. A debugger is nice to have, but you can develop stuff without it; it just takes longer. A C compiler is needed if you want more users.
With Xilinx 6-LUTs, you can really make small 16-bit RISC machines which are register-file based. Programming a register-based CPU in assembler is much easier than programming a stack machine. A couple of years ago I crafted a 16-bit machine which could be as small as 200 LUTs (4-LUT) but was around 300 LUTs in general. It might be possible to do a 16-bit RISC at around 100 LUTs (6-LUT). So the only benefit I see in a stack machine is more compact code.
Göran Bilski