What I'm doing works.
I have the UART interrupt service routine, which builds a string from edited input characters. When the command string is done, it sets a flag to let the main loop know that a command string is ready, and the main loop in turn calls the command parser, which executes the string. The box I'm doing now has exactly 100 distinct serial commands. Some user commands can call certain canned sequences, stored as strings, which the parser executes re-entrantly. So far so good.
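For comparison, the same arrangement might look roughly like this in C. This is only a sketch: the names (uart_rx_isr, parse_command, main_loop_poll, the "A", "B" and "SEQ" commands, the 64-byte buffer, and ';' as a separator inside canned sequences) are all made up for illustration and are not taken from the actual box. The ISR edits the line and raises a flag; the main loop sees the flag and calls the parser; a canned sequence is a stored string that is fed back into the same parser, which is safe because each level keeps its working copy on the stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CMD_MAX 64                       /* assumed maximum command length */

static volatile bool cmd_ready = false;  /* set by the ISR, cleared by the main loop */
static char cmd_buf[CMD_MAX];            /* finished command, owned by the main loop */
static char edit_buf[CMD_MAX];           /* line currently being edited by the ISR */
static size_t edit_len = 0;
static char trace[64];                   /* records which handlers ran (demo only) */

static void note(char c)                 /* bounded append to the trace string */
{
    size_t n = strlen(trace);
    if (n < sizeof trace - 1) { trace[n] = c; trace[n + 1] = '\0'; }
}

void parse_command(const char *s);

/* Run a canned sequence stored as "cmd;cmd;cmd". Each piece is copied to a
 * local buffer, so the parser can be entered again from inside a command. */
static void run_sequence(const char *seq)
{
    char one[CMD_MAX];
    while (*seq) {
        size_t n = strcspn(seq, ";");
        if (n >= CMD_MAX) n = CMD_MAX - 1;
        memcpy(one, seq, n);
        one[n] = '\0';
        parse_command(one);
        seq += n;
        if (*seq == ';') seq++;
    }
}

static void cmd_a(void)   { note('A'); }
static void cmd_b(void)   { note('B'); }
static void cmd_seq(void) { run_sequence("A;B;A"); }  /* hypothetical canned sequence */

struct cmd_entry { const char *name; void (*fn)(void); };
static const struct cmd_entry cmd_table[] = {
    { "A", cmd_a }, { "B", cmd_b }, { "SEQ", cmd_seq },
};

void parse_command(const char *s)
{
    for (size_t i = 0; i < sizeof cmd_table / sizeof cmd_table[0]; i++) {
        if (strcmp(s, cmd_table[i].name) == 0) { cmd_table[i].fn(); return; }
    }
    note('?');                           /* unknown command */
}

/* Would be called from the UART receive interrupt with each character. */
void uart_rx_isr(char c)
{
    if (c == '\r' || c == '\n') {
        /* A line that ends while the previous one is still pending is dropped. */
        if (edit_len > 0 && !cmd_ready) {
            memcpy(cmd_buf, edit_buf, edit_len);
            cmd_buf[edit_len] = '\0';
            cmd_ready = true;
        }
        edit_len = 0;
    } else if (c == '\b' || c == 0x7F) {
        if (edit_len > 0) edit_len--;    /* backspace or delete: rub out one char */
    } else if (edit_len < CMD_MAX - 1) {
        edit_buf[edit_len++] = c;
    }
}

/* One pass of the main loop: run a pending command, if there is one. */
bool main_loop_poll(void)
{
    if (!cmd_ready) return false;
    parse_command(cmd_buf);
    cmd_ready = false;
    return true;
}
```

On a host machine this can be exercised by feeding characters to uart_rx_isr() by hand and then calling main_loop_poll(); on real hardware the ISR would be hooked to the UART vector and main_loop_poll() would sit inside the forever loop.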
But some command strings can take a long time, perhaps minutes, to execute. So I've assigned one serial character as an abort. When the ISR sees this, it just resets the stack and jumps to the top of the main opside loop, totally dumping whatever was going on and starting fresh. I've asked a couple of my C guys how they would do this in C, and the answers were sort of mumbles.
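One way a C programmer might approximate the abort is with setjmp/longjmp. The sketch below is a cooperative variant, not the same thing as resetting the stack from inside the ISR: the ISR only sets a flag, and the long-running command polls that flag and jumps back to the top of the main loop itself. Calling longjmp directly from an interrupt handler is not something standard C promises will work, so whether the true "ISR dumps everything" behavior is possible depends on the compiler and CPU. All names here (ABORT_CHAR, uart_rx_abort_check, check_abort, slow_command, main_loop_pass) are hypothetical, and the abort character is simulated by a parameter rather than a real interrupt.

```c
#include <setjmp.h>
#include <stdbool.h>

#define ABORT_CHAR 0x03                  /* assumed abort character (Ctrl-C) */

static jmp_buf main_loop_top;            /* saved context at the top of the main loop */
static volatile bool abort_requested = false;
static volatile int steps_done = 0;      /* progress counter (demo only) */

/* Would be called from the UART receive interrupt with each character. */
void uart_rx_abort_check(char c)
{
    if (c == ABORT_CHAR) abort_requested = true;
}

/* Called at safe points inside long-running commands. */
static void check_abort(void)
{
    if (abort_requested) {
        abort_requested = false;
        longjmp(main_loop_top, 1);       /* unwind straight back to the main loop */
    }
}

/* A stand-in for a command that takes a long time. abort_at_step simulates
 * the abort character arriving during that step (-1 means it never arrives). */
static void slow_command(int steps, int abort_at_step)
{
    for (int i = 0; i < steps; i++) {
        if (i == abort_at_step) uart_rx_abort_check(ABORT_CHAR);
        check_abort();
        steps_done++;
    }
}

/* One pass of the main loop. Returns true if the command ran to completion,
 * false if it was abandoned by an abort. */
bool main_loop_pass(int steps, int abort_at_step)
{
    if (setjmp(main_loop_top) != 0) {
        /* Arrived here via longjmp: whatever was running has been dumped. */
        return false;
    }
    slow_command(steps, abort_at_step);
    return true;
}
```

The price of this approach is that every slow command has to call check_abort() often enough, and anything a command left half-done (hardware state, partially updated variables) has to be cleaned up at the top of the loop, just as it would after a stack reset in assembly.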
Another thing I like to do is copy a chunk of code, like the FPGA loader, from EPROM into a block of CPU RAM and execute it from there. That speeds things up by 2:1 or so, since there are no byte-wide bus fetches. Once the loader has run, I can reuse that block of RAM for normal system variables. All that sort of thing is easy in assembly.
All my embedded programs are monolithic, absolute assemblies of a single source file. I can usually archive a project *including all the tools* on a single floppy. The program listings are heavily commented, neatly paged, and start with an automatically generated table of contents, which makes it really easy to see the structure and find whatever you may be looking for. I can revisit a years-old project and find what I want in seconds.
The methodology is ancient, like making bread in wood-fired ovens. But it works.
John