I have built a fully functional ARM7 prototype board based on the Atmel AT91R40008 processor. Everything works, but the performance of the processor is approximately 1/10th of what it should be. In a simple internal-SRAM write test, I first copy my code to SRAM, then run out of SRAM and write blocks of 32 bytes to consecutive locations in an unrolled loop, for a total of 9600 bytes (a simple test buffer). I then repeat this loop 8 times so the scope can get a good lock. The original C/C++ code and the disassembled ARM code are below for reference. The key point is that, other than the looping overhead, the instruction stream should be nothing other than fetch, decode, and execute of store-byte instructions with an immediate offset into internal SRAM, of the form:
STRB Rn,[ip,#dd]
In the worst case this should take 1-3 cycles per operation, but I am scoping this and getting a memory write approximately every 40 ("FORTY") cycles! This is bizarre. Of course, the External Bus Interface settings are irrelevant for the internal bus, and I am not pulling on the external nWait pin. I hypothesize that the processor is in some mode after reset that makes it run slower. Maybe it has something to do with the debug interface; I am not sure. Nothing I have found in all 3000+ pages of ARM docs leads me to any conclusions...
As another brief example, here is the C/C++ code for a max-speed I/O toggle. I have a scope on one of the I/O pins, toggle the pin in a loop at max speed, and look at the waveform:
******** C/C++ code
while(1)
    {
    pio_base_ptr[PIO_SODR/4] = 0x00020000;
    pio_base_ptr[PIO_CODR/4] = 0x00020000;
    }
And here's the disassembled ARM code: 5 instructions, yet it is taking nearly 400 clocks to run these 5 instructions! Again, the code is running out of SRAM and that's it. Bizarre?
************* ARM CODE
|L000630.J10.C_Entry|
        LDR      a2,[v2,#4]
        STR      a1,[a2,#&30]!
        LDR      a2,[v2,#4]
        STR      a1,[a2,#&34]!
        B        |L000630.J10.C_Entry|
There are very few resources with HARDCORE info, any insight would be greatly appreciated :)
Desperately seeking a GURU,
Xander. snipped-for-privacy@yahoo.com
*********** C/C++ version of the memory fill
// fill memory up with incremental values
for (t=0; t < 8; t++)
    for (ram_index = 0; ram_index < 9600/1-32; ram_index+=32)
        {
        work_ptr[ram_index+0] = 1;
        work_ptr[ram_index+1] = 2;
        work_ptr[ram_index+2] = 3;
        work_ptr[ram_index+3] = 4;
        work_ptr[ram_index+4] = 1;
        work_ptr[ram_index+5] = 2;
        work_ptr[ram_index+6] = 3;
        work_ptr[ram_index+7] = 4;
        work_ptr[ram_index+8] = 1;
        work_ptr[ram_index+9] = 2;
        work_ptr[ram_index+10] = 3;
        work_ptr[ram_index+11] = 4;
        work_ptr[ram_index+12] = 1;
        work_ptr[ram_index+13] = 2;
        work_ptr[ram_index+14] = 3;
        work_ptr[ram_index+15] = 4;
        work_ptr[ram_index+16] = 1;
        work_ptr[ram_index+17] = 2;
        work_ptr[ram_index+18] = 3;
        work_ptr[ram_index+19] = 4;
        work_ptr[ram_index+20] = 1;
        work_ptr[ram_index+21] = 2;
        work_ptr[ram_index+22] = 3;
        work_ptr[ram_index+23] = 4;
        work_ptr[ram_index+24] = 1;
        work_ptr[ram_index+25] = 2;
        work_ptr[ram_index+26] = 3;
        work_ptr[ram_index+27] = 4;
        work_ptr[ram_index+28] = 1;
        work_ptr[ram_index+29] = 2;
        work_ptr[ram_index+30] = 3;
        work_ptr[ram_index+31] = 4;
        }
********* ARM ASM version of the memory fill
|L000638.J8.C_Entry|
        STR      v2,[v4,#&c5c]
        MOV      a2,#0
        STR      v2,[v4,#&c60]
|L000644.J10.C_Entry|
        MOV      a1,#0
|L000648.J11.C_Entry|
        STRB     v2,[v1,a1]
        ADD      ip,v1,a1
        STRB     a4,[ip,#1]
        STRB     v3,[ip,#2]
        STRB     lr,[ip,#3]
        STRB     v2,[ip,#4]
        STRB     a4,[ip,#5]
        STRB     v3,[ip,#6]
        STRB     lr,[ip,#7]
        STRB     v2,[ip,#8]
        STRB     a4,[ip,#9]
        STRB     v3,[ip,#&a]
        STRB     lr,[ip,#&b]
        STRB     v2,[ip,#&c]
        STRB     a4,[ip,#&d]
        STRB     v3,[ip,#&e]
        STRB     lr,[ip,#&f]
        STRB     v2,[ip,#&10]
        STRB     a4,[ip,#&11]
        STRB     v3,[ip,#&12]
        STRB     lr,[ip,#&13]
        STRB     v2,[ip,#&14]
        STRB     a4,[ip,#&15]
        STRB     v3,[ip,#&16]
        STRB     lr,[ip,#&17]
        STRB     v2,[ip,#&18]
        STRB     a4,[ip,#&19]
        STRB     v3,[ip,#&1a]
        STRB     lr,[ip,#&1b]
        STRB     v2,[ip,#&1c]
        STRB     a4,[ip,#&1d]
        STRB     v3,[ip,#&1e]
        STRB     lr,[ip,#&1f]
        ADD      a1,a1,#&20
        CMP      a1,a3
        BLT      |L000648.J11.C_Entry|
        ADD      a2,a2,#1
        CMP      a2,#8
        BLT      |L000644.J10.C_Entry|
        B        |L000638.J8.C_Entry|