Well, I was wrong about at least one thing: I thought that, with CSAAT=0, CS would be deasserted (high) between consecutive "word transfers" within one "block transfer", but it is not. It was clear to me from the beginning (from the diagrams and the text) that there was a way to keep CS=0 between word transfers, but I thought that it required CSAAT=1, and that is not true. CS stays at 0 between consecutive word transfers of the same block transfer, regardless of the value of CSAAT.
So, yes, I can leave CSAAT=0 permanently, no CPU intervention is needed (other than at the beginning and at the end of each block transfer), and I can use DMA, with two 11-bit word transfers per block transfer.
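To make the two-word scheme concrete, here is a minimal C++ sketch of how the two 11-bit words could be joined back into one 22-bit frame. The function names are made up for illustration. It assumes that the first word received holds the earlier (more significant) bits, and that the 16 result bits are the last 16 of the 22 clocks; both assumptions should be checked against the SPI and ADC datasheets.

```cpp
#include <cstdint>

// Assumption: each SPI word carries 11 valid bits in its low bits.
constexpr unsigned kBitsPerWord = 11;
constexpr std::uint32_t kWordMask = (1u << kBitsPerWord) - 1u;

// Join two 11-bit words into one 22-bit frame.
// Assumption: `first` was received first and holds the more significant bits.
constexpr std::uint32_t join_frame(std::uint16_t first, std::uint16_t second) {
    return ((static_cast<std::uint32_t>(first) & kWordMask) << kBitsPerWord)
         | (static_cast<std::uint32_t>(second) & kWordMask);
}

// Assumption (check the ADC datasheet): the 16 result bits are the last
// 16 bits clocked out, so they sit in the low 16 bits of the frame.
constexpr std::uint16_t extract_sample(std::uint32_t frame) {
    return static_cast<std::uint16_t>(frame & 0xFFFFu);
}

// Compile-time check: the words 0x015 and 0x3CD rebuild the sample 0xABCD.
static_assert(extract_sample(join_frame(0x015, 0x3CD)) == 0xABCD,
              "two 11-bit words rebuild one 16-bit sample");
```

In real code, `first` and `second` would be the two entries that the DMA leaves in the receive buffer after each block transfer.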
This is good, but I think that it could be better. It is difficult to explain, but I'll try:
Imagine my external ADC (with SPI interface) is sampling the analog input at 100 kSa/s (the TI ADS8320 that I mentioned allows that). So, 10 us between samples. Not much. Each sample needs 22 clock cycles inside each assertion of CS=0, so each sample needs one DMA block transfer (with, for instance, two 11-bit word transfers inside). Each DMA block transfer needs CPU intervention, so I need CPU intervention every 10 us. That's a short time: only 480 cycles of my 48 MHz SAM7. Since (as far as I know) a DMA block transfer cannot be triggered directly by a timer overflow or underflow, an interrupt service routine (triggered by a 10 us timer underflow) must be executed for every sample, so that the CPU can manually trigger the DMA block transfer and collect the data. Adding up the overhead of the interrupt context switch and the instructions needed to move data to and from the block buffers and to re-trigger the block transfer, all of it in C++, I think that this may consume a "significant" portion of those 480 cycles. And the CPU is supposed to do something with that data, and some other things too. I see that hog as a killer, or at least as a pity.
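The cycle budget can be written down as a small C++ sketch. The clock and sample rate come from the numbers above; the 200-cycle ISR cost is a purely hypothetical figure picked for illustration, not a measurement.

```cpp
#include <cstdint>

constexpr std::uint32_t kCpuHz = 48000000u;       // 48 MHz SAM7 core clock
constexpr std::uint32_t kSampleRateHz = 100000u;  // 100 kSa/s ADC sample rate

// CPU cycles available between two consecutive samples.
constexpr std::uint32_t cycles_per_sample() {
    return kCpuHz / kSampleRateHz;
}

// Percentage (rounded down) of the per-sample budget used by an ISR
// that costs `isr_cycles` cycles in total.
constexpr std::uint32_t isr_load_percent(std::uint32_t isr_cycles) {
    return (isr_cycles * 100u) / cycles_per_sample();
}

static_assert(cycles_per_sample() == 480, "48 MHz / 100 kSa/s = 480 cycles");

// Hypothetical ISR cost of 200 cycles (context switch, buffer handling,
// DMA re-trigger): roughly 41 % of the CPU would go to the ISR alone.
static_assert(isr_load_percent(200) == 41, "200 / 480 is about 41 %");
```

The real ISR cost would have to be measured on the target, but any figure in the low hundreds of cycles already takes a large share of the 480 available.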
If the SPI in the MCU allowed 22-bit word transfers, and the DMA allowed triggering the next word transfer (inside a block transfer) when a certain timer underflows, then the DMA blocks wouldn't need to be so small. Each analog sample could travel in a single SPI word transfer, and one DMA block could be set up to carry, for instance, 1000 word transfers. That would be one DMA block every 10 ms. The buffer (FIFO) memory would be larger, but much less CPU intervention would be needed. There would be the same number of useful cycles, but far fewer wasted cycles. There would be no need for an interrupt service routine executed every 10 us, which is the killer. That would be a good SPI and a good DMA, in my opinion, and the extra cost in silicon is negligible compared to the added benefit. Why don't most MCUs allow that? Even cheap MCUs could include that. An MCU at the price of a SAM7 should include that, in my opinion.
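A rough C++ sketch of what that wished-for scheme would mean in numbers. The 32-bit storage slot per 22-bit word and the use of two buffers (one being filled by the DMA while the CPU processes the other) are my assumptions; no such SPI/DMA mode exists on the SAM7 as far as I know.

```cpp
#include <cstdint>

constexpr std::uint64_t kRateHz = 100000;       // 100 kSa/s, as above
constexpr std::uint64_t kWordsPerBlock = 1000;  // example block length, as above

// Time covered by one DMA block, in microseconds.
constexpr std::uint64_t block_period_us(std::uint64_t words, std::uint64_t rate_hz) {
    return words * 1000000u / rate_hz;
}

// CPU interventions (interrupts) per second for a given block length.
constexpr std::uint64_t interrupts_per_second(std::uint64_t words, std::uint64_t rate_hz) {
    return rate_hz / words;
}

// Assumption: each 22-bit word is stored in a 32-bit slot, and two
// buffers are used (one filled by DMA, one processed by the CPU).
constexpr std::uint64_t buffer_bytes(std::uint64_t words) {
    return 2u * words * sizeof(std::uint32_t);
}

static_assert(block_period_us(kWordsPerBlock, kRateHz) == 10000, "one block every 10 ms");
static_assert(interrupts_per_second(1, kRateHz) == 100000, "one interrupt per sample");
static_assert(interrupts_per_second(kWordsPerBlock, kRateHz) == 100, "one interrupt per block");
static_assert(buffer_bytes(kWordsPerBlock) == 8000, "two 4000-byte buffers");
```

So the price would be about 8 KB of RAM under these assumptions, in exchange for going from 100000 interrupts per second to 100.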
Best,