Atmel releasing FLASH AVR32?

"Wilco Dijkstra" skrev i meddelandet news:C_vMh.3351$ snipped-for-privacy@newsfe3-win.ntli.net...

Let's see: to execute this, we can assume the following assembler code:

lsld    1,r0           ; shift the assembled word left one bit
load    mosi,r1        ; sample the MOSI pin
or      r1,r0          ; merge the new bit into r0
waitfor eventflag_1    ; yes - H/W to support event wait

So the multithreaded CPU will complete in 40 x 4 = 160 instructions (one 4-instruction pass for each of the 40 threads).
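For illustration, a hedged C model of the same per-bit loop; read_mosi() and wait_for_event() are hypothetical stand-ins for the pin read and the hardware event wait, not a real vendor API:

#include <stdint.h>

/* Hypothetical intrinsics standing in for the pin read and the
 * hardware event wait assumed above - not a real vendor API. */
extern uint32_t read_mosi(void);           /* sample the MOSI pin: 0 or 1 */
extern void wait_for_event(int event_id);  /* park the thread until the event fires */

#define EVENTFLAG_1 1

/* One SPI slave thread: four operations per bit, mirroring the
 * lsld / load / or / waitfor sequence above. */
uint32_t receive_byte(void)
{
    uint32_t data = 0;
    for (int bit = 0; bit < 8; bit++) {
        wait_for_event(EVENTFLAG_1);   /* sleep until the SPI clock edge */
        data <<= 1;                    /* lsld: make room for the new bit */
        data |= read_mosi();           /* load + or: merge the sampled bit */
    }
    return data;
}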

I'd like to see a single-threaded CPU doing this in 160 instructions.

I think interrupt entry probably takes 5-10 clocks and return from interrupt the same. So 10 clocks * 40 interrupts = 400 clocks of overhead to start with. I think you will run about five times slower, and with more interrupt overhead, much slower still.

No, because a proper multithreaded architecture releases the pipeline to computable threads whenever the current thread does not need to be active.

If you do not need top performance in a single thread, you can greatly simplify the pipeline and thus increase the frequency of the CPU. You are able to mix programs from several sources on a single CPU instead of having several CPUs, because no one knows how to maintain code from different sources. Sometimes you don't even get the source of the firmware.

A classic example would be something implementing a V.22 modem in S/W. You can have the V.22 S/W running in a thread, and you cannot screw up the performance of the modem S/W. By allocating a certain number of MIPS and guaranteeing that the program is not stopped by application S/W running at high priority, you have solved the problem.

Another example: today you can get single chip GPS. To reduce cost, they are ROMmed, and you add an external microcontroller to do the user interface. At this stage, the ARM7 CPU running the GPS S/W needs about 20 MIPS, and there is no plan to let anyone touch the ARM, due to the sensitivity of the S/W.

With a multithreaded CPU you could allocate 20 MIPS for the GPS and run the application S/W on the remaining MIPS.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

Yes, that is possible if you have enough hardware threads. If you do then you don't need traditional interrupts at all anymore and only use polling. I can see that would simplify hardware and software (I'm sure you agree polling is easier).

However few cores support more than 2 hardware threads as having many large register files is a waste of die area and a cycle time limiter. I can imagine keeping contexts in memory, but that makes switching threads expensive, defeating the advantage.

That would be nice, but the reality is that it will take some time before you start executing another thread. You always have the event synchronization time and the thread startup time. You avoid save/restore like in a traditional interrupt, but you still have all the other overheads.

While it is possible to reduce this to a bare minimum (say less than 5 cycles), you can do the same for interrupts. It's just a design tradeoff whether you want the lowest possible latency for added complexity and (likely) lower average performance.

Wilco

Reply to
Wilco Dijkstra

And you say that using two cores (which is the current solution) is less of a waste... Show me a core which runs, let's say, a Bluetooth MAC and a GPS MAC (or a similar combination) in a single thread.

In fact, show me a single-thread core which can do a full duplex S/W UART at as high a speed as a two-thread core.

No, zero-cost context switch cores exist already today (and have existed for 20-30 years).

If we assume that we want a thread to react to an edge on an I/O pin, then there will be a synchronisation delay from the edge to the time when the event has been raised and changed the status of the thread from "event wait" to "computable". During that time, the CPU can execute other threads.

There is no thread startup time when you have a zero-cost context switch architecture - several are around. This means, in the SPI slave example, that after the clock event is raised, all 40 threads suddenly become computable. The CPU will switch thread every clock cycle, so after 40 clock cycles it will have executed one instruction in each of the 40 threads.

Less than 5 cycles = 0 cycles in this case.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

I think I'll leave it here, but observe that's quite a lot of "qualifiers" you've now added to the original statement, including one that seems to shift the definition of Microcontroller ;) You see, not all microcontrollers have such elastic memory timings.

"non trivial" is also vague: most designers that go to the effort to get time invariant code, consider that effort/code non-trivial, but somehow I know you'll qualify that again....

-jg

Reply to
Jim Granville

I didn't say that, but see below. For a small embedded core, having 2 threads is maybe 25% extra area; 4 threads is more likely to be 50%. A faster core replacing 2 smaller cores has around 50% overhead due to the extra complexity needed to reach the faster cycle time. So we have:

1 simple core: 100%
2 simple cores: 200%
1 faster core: 150%
2-way multithreaded core: 188%
4-way multithreaded core: 225%

These are finger in the wind numbers but you can see a heavy multithreaded core will be larger than several simple cores.

Any core that is fast enough will do. Merging two complex pieces of software is obviously non-trivial but it would be equally non-trivial to change them to use multiple threads.

Again, any core will do: using polling you can reach the same speed as a multithreaded core. A good way of doing this is to start with an interrupt, then poll for a while when receiving high-speed data, and revert to interrupts again when there are pauses in the data. This way you don't lock up the CPU except while you are actually receiving data.
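A minimal C sketch of that interrupt-then-poll idea; uart_rx_ready(), uart_read(), the IRQ enable/disable helpers, and the poll budget are all hypothetical stand-ins, not a real device API:

#include <stdint.h>

/* Hypothetical device helpers; the real register interface is
 * device specific. */
extern int     uart_rx_ready(void);   /* non-zero if a byte is waiting */
extern uint8_t uart_read(void);       /* fetch the received byte */
extern void    enable_rx_irq(void);
extern void    disable_rx_irq(void);
extern void    handle_byte(uint8_t b);

#define POLL_SPINS 1000  /* made-up stand-in for roughly one byte time */

/* Receive ISR: take the first byte via the interrupt, then poll while
 * the burst lasts, and fall back to interrupts when the line goes idle. */
void uart_rx_isr(void)
{
    disable_rx_irq();
    handle_byte(uart_read());

    unsigned idle = 0;
    while (idle < POLL_SPINS) {
        if (uart_rx_ready()) {
            handle_byte(uart_read());
            idle = 0;               /* data still streaming: keep polling */
        } else {
            idle++;                 /* count consecutive quiet spins */
        }
    }
    enable_rx_irq();                /* pause in the data: back to interrupts */
}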

Can you mention one? I've seen the Ubicom cores but they switch at the start of the (rather long) pipeline, so it takes many cycles to switch.

Absolutely.

When you say zero-cost context switch, can you tell me how long it would take to execute a "wait_for_event" instruction, have the thread go to sleep, have the event signaled immediately afterwards, and resume execution at the next instruction? On the Ubicom core I believe it takes around 10 cycles, far from zero...

Reply to
Wilco Dijkstra

Why not use *REAL* data?

MIPS 34k core with 9 threads = 2.1 mm2 in 90 nm.
MIPS 24k core with 1 thread = 2.8 mm2 in 130 nm.

It is probably fair to assume that 90 nm gives about 0.5 times the area of 130 nm, so a MIPS 34k would be about 4.2 mm2 in 130 nm, or about 50% larger with 9 threads.

The MIPS 34k is actually a dual core (dual VPE), so you have to deduct for that.

I think you will find that it is more like 10% overhead for a simple core. There is less overhead for a multithreaded "faster" core than for a single-threaded "faster" core, if you accept the limitation that a thread can only run at most 1/2 or 1/3 of the cycles, because you get rid of the feedback muxes. Less logic in the critical datapath = higher frequency.

I think the finger is up somewhere... and that wind ain't nice.

No, you run one thread with the Bluetooth MAC and another for the GPS MAC. No or very little change needed...

Are you sure they cannot switch every clock cycle? The MIPS 34k can. In a simple three-stage pipeline it is a piece of cake to do what I want. The main costs are:

- The PC is changed from a register to an SRAM.
- The register bank becomes a register bank array.
- Multiple PSRs.

and then you have the scheduling, which can be an advanced timer working on a register bank.

Each thread gains a time quantum every n cycles and loses one every time it gets to use the pipeline; you then try to execute the threads that have accumulated the most quanta. Not so hard to implement.
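As a software model of that bookkeeping, here is a hedged C sketch; the thread count, the n=4 refill period, and the computable[] flags are made-up illustration (real hardware would do this combinationally):

#define NUM_THREADS   8
#define REFILL_PERIOD 4   /* each thread gains one quantum every n=4 cycles */

static int quanta[NUM_THREADS];      /* accumulated time quanta per thread */
static int computable[NUM_THREADS];  /* 1 if ready to run; set by event logic */

/* Called once per clock: credit all threads periodically, then issue the
 * computable thread with the most accumulated quanta and charge it one. */
int schedule(unsigned cycle)
{
    if (cycle % REFILL_PERIOD == 0)
        for (int t = 0; t < NUM_THREADS; t++)
            quanta[t]++;

    int best = -1;
    for (int t = 0; t < NUM_THREADS; t++)
        if (computable[t] && (best < 0 || quanta[t] > quanta[best]))
            best = t;

    if (best >= 0)
        quanta[best]--;   /* deduct a quantum for using the pipeline */
    return best;          /* thread to feed into the pipeline, or -1 */
}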

The zero context-switch time is between two different threads. If you explicitly yield a thread, then it can take time to stop/start, but with fine-grained parallelism you execute for one clock and then the next clock another thread executes.

Show me one ;-)

You will not be able to maintain a large number of equally prioritized threads unless you modify the concept of interrupts to be equal to multithreading.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

130->90nm scaling is more like 55-60%, so it is more likely to be 25% larger, not 50%. However consider these are high-end embedded cores with 32KB cache, so the actual core area more than doubles.

Actually it is a single core. A VPE is simply a virtual CPU to make the OS believe there are 2 cores.

Wrong. On a microcontroller with a far simpler pipeline it would be much worse. A while ago we discussed the size of a register file in embedded CPUs; it is 10-15% of a typical core like ARM7. Imagine 9 copies...

That is certainly feasible, but you'll have a hard time getting it past the marketing types who want to show good benchmark results... Single-threaded performance is still important and will be for a long time.

Of course they can switch every clock cycle. But what matters is how fast they can react to asynchronous events such as branch mispredicts, cache misses, wait-for-event, etc. If a thread is scheduled to run but has an unexpected idle cycle, is it possible to immediately switch to another thread and use that cycle? Remember, a bubble may appear at the end of the pipeline but instruction fetch is at the beginning, so it can take a while...

I don't have much information on how threading works on the 34k, but from what little is available, it appears each thread maintains a separate instruction queue. This indicates they can switch pretty quickly. I'd be impressed if it can switch to reclaim idle cycles.

The concept is simple indeed, but the details are non-trivial, especially if you want fast thread switching to use idle cycles.

Yes. But my point is that if it takes time to start/stop threads, then this is equivalent to interrupt latency. You can't claim that interrupt latency is bad for performance but that thread start/stop latency isn't. It lowers the maximum performance of that thread (in your example of 40 SPI devices it lowers the maximum SPI frequency), and if the CPU cannot fill the idle cycles with another thread, it also reduces overall performance.

Any multithreaded CPU with a zero-cost context switch will do. You're claiming those exist, right? So zero-cost interrupt latency exists too.

If I run a main thread and have a higher priority interrupt thread servicing interrupts using 100% of CPU time, do you agree it is identical to an interrupt-based CPU? So an interrupt driven application can be as fast as a multithreaded one.

Wilco

Reply to
Wilco Dijkstra

"Wilco Dijkstra" skrev i meddelandet news:fcUMh.24370$ snipped-for-privacy@newsfe7-gui.ntli.net...

From MIPS homepage: " 2.1 mm2 (core only, extracted from full layout GDSII database)"

I meant per thread. You do not need much more than the register file and prefetch buffer, so 10-15% extra per thread does not seem unreasonable.

A dual-thread 40 MHz CPU can replace two 20 MHz CPUs. A single-thread 40 MHz CPU cannot always replace two 20 MHz CPUs. Let's take an obvious case, where one is running the OSE operating system and the other is running ThreadX. How are you going to do that on a single thread? The combined GPS and Bluetooth stack is a better example: a GPS company would normally not allow anyone to mess with the code running on the ARM. The impact on support and maintenance is too high.

Running a thread with the GPS is much more attractive and would allow the user to run their own threads without affecting the GPS timing enough to be a problem.

Not for a 20 MIPS application, it ain't. No one is interested in how many MIPS the CPU core in a GPS chip has.

Yes; when you have a jump you would immediately make this task non-computable and have another computable thread enter the pipeline. If it becomes computable again the next clock, you can switch it back in.

Why not? The AVR32 removes jumps from the pipeline, so the execution unit only sees arithmetic instructions.

I expect that in normal operation you will switch threads EVERY clock cycle. It becomes more complex if you want dynamic allocation of threads.

A really simple solution would be to have a circular buffer of programmable size. Each entry in the buffer is a thread number. So if you had a 10-entry circular buffer you could have:

1,2,1,3,1,2,1,4,1,5

At 100 MHz, this would give you:
Thread 1: 5 entries = 50 MHz
Thread 2: 2 entries = 20 MHz
Threads 3, 4, 5: 1 entry each = 10 MHz

If a thread is not computable, then you can give the cycle to one of the other threads, or to a debug thread, or to a background thread, or whatever.
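A hedged C model of that slot table, using the 10-entry 1,2,1,3,... schedule from above; the background-thread fallback and the is_computable() helper are hypothetical:

#define SLOTS 10
#define BACKGROUND_THREAD 0   /* hypothetical fallback thread */

/* The 10-entry schedule from the text: thread 1 gets 5 slots (50 MHz at a
 * 100 MHz clock), thread 2 gets 2 (20 MHz), threads 3-5 one each (10 MHz). */
static const int slot_table[SLOTS] = { 1, 2, 1, 3, 1, 2, 1, 4, 1, 5 };

extern int is_computable(int thread);  /* e.g. not blocked in an event wait */

/* Called once per clock: pick the slot's owner, or give the cycle away
 * to the background thread if the owner is not computable. */
int next_thread(unsigned cycle)
{
    int t = slot_table[cycle % SLOTS];
    return is_computable(t) ? t : BACKGROUND_THREAD;
}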

No, but I am saying that latencies do not reduce the total throughput of the CPU. Even with latencies, you get higher utilization of the pipeline as long as there is at least one computable thread: no bubbles in the pipeline, no branch prediction needed. Branch prediction improves the performance of a single thread, but it does not let the CPU execute more instructions in total.

I believe a thread that replaces an interrupt is started already at initialization and then put into an event-wait state. Since there is no context to save/restore, the thread can react much faster than an interrupt-driven device.

I am not claiming that a multithreaded CPU has zero interrupt latency. I am claiming that once it has been decided to switch threads, you can do it without any overhead. It is still going to take time after an event has occurred before that decision has been made.

You were trying to prove that a single-thread core is as good as a multithreaded core, and now you are claiming that a multithreaded core is as good as a multithreaded core, duh!

Again, show me a real CPU with zero-cost interrupt latency.

If you go back to the case where you are servicing 40 slave SPIs, you will NOT get the same throughput in a single-thread machine, simply because you have overhead in servicing the interrupts, and because an interrupt will not preempt another task at the same priority level.

Do you EVER give up a lost cause?

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

On the subject of Multiple cores, and multiple threads, news today shows this is advancing quite quickly. Intel does not seem to think it is a 'waste of die area'.....

Eight cores and 16 threads (probably they mean per-core?) is impressive for what sound like fairly mainstream cores.

formatting link

"Intel's 45-nm high-k process technology offers approximately twice the transistor budget, 20 percent faster transistor switching speed and lower leakage current when compared with the company's 65-nm technology, Gelsinger said.

Nehalem's scalable architecture provides for between one and 16 or more threads utilizing one to eight or more cores, Gelsinger said. He added that Nehalem processors already in design have eight cores and 16 threads. Some Nehalem processors are likely to have more cores, he said, declining to discuss specific product configurations.

Nehalem's architecture provides for simultaneous multi-threading and multi-level shared cache, Gelsinger said."

-jg

Reply to
Jim Granville

Data sheets and info on Eval PCB, etc, are now up at

formatting link

-jg

Reply to
Jim Granville

formatting link

...... and of course the FreeRTOS.org port to go along with it :o)

formatting link

[direct link - without menu frame (horror)]
--
Regards,
Richard.

+ http://www.FreeRTOS.org
A free real time kernel for 8, 16 and 32bit systems.

+ http://www.SafeRTOS.com
An IEC 61508 compliant real time kernel for safety related systems.
Reply to
FreeRTOS.org

:)

Did you try the AVR32 Studio? Any comments?

-jg

Reply to
Jim Granville

All I've done with it is start it up and note that it was Eclipse; I have not used it in anger. I suppose I'm going to have to get into Eclipse (old dog, new tricks), but so far I have not found a way of creating a project in Eclipse that permits files to be included using a relative path (below the project directory).

If it's as good as the 8-bit AVR Studio version then it will be a very useful tool. I don't know if the 8-bit version will be migrated over to Eclipse too?

--
Regards,
Richard.

+ http://www.FreeRTOS.org
A free real time kernel for 8, 16 and 32bit systems.

+ http://www.SafeRTOS.com
An IEC 61508 compliant real time kernel for safety related systems.
Reply to
FreeRTOS.org

If you read what I wrote then you'd know that on a high end CPU it takes far less area than on a low end CPU. However Intel must still think it is a waste of die area, otherwise all their CPUs would have it...

It is required now as 8 cores on a single chip use so much bandwidth that most cores are waiting for external memory most of the time (despite the huge L2 and L3 caches). Switching to a different thread on a cache miss makes sense in this case.

It clearly says 2 threads per core. Any more would be a waste.

Wilco

Reply to
Wilco Dijkstra

Multithreading on a high-end general-purpose CPU brings problems of its own, especially cache thrashing. With an embedded core, where you use tightly coupled high-bandwidth memory for most of the threads, you do not have that problem.

Note I am not advocating symmetric multiprocessing.

I think it is eminently useful for asymmetric multiprocessing, where you have some dedicated tasks which are best implemented in a separate CPU to avoid real-time response conflicts and can be implemented in a low-end 32-bitter.

I think you need to stop trying to explain why a single CPU is better than a multithreaded CPU, because no one is using a single CPU to implement two simultaneously operating software MACs. If you continue, that just proves that you are either ignorant or not listening.

The issue is replacing multiple CPUs/memory subsystems with a single multithreaded CPU addressing a memory subsystem consisting of internal TCM, internal loosely coupled memory (flash?) and external memory.

Look at Sun and the UltraSPARC T1; they certainly do not see the boundaries that you see. I do not think they are limited by Intel's vision... Also, I pointed you at the new MIPS multithreading core. They certainly do not agree with you!

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

Absolutely. The "solution" is to add more cache...

Same solution: more fast on-chip memory.

I'm not quite sure what you're saying here. Are you advocating asymmetric multiprocessing or asymmetric multithreading?

First of all, you're the one that claims one CPU is better than 2... I believe 2 CPUs is better in many cases - multicore is the future. However if you do move to a single (faster) CPU then it doesn't make much difference in terms of realtime response whether that CPU is multithreaded or not. You seem to believe that threads are somehow much better than interrupts - but as I've shown they are equivalent concepts.

That kind of response is not helping your case. If you believe I'm wrong, then why not prove me wrong with some hard facts and data?

Most realtime CPUs have some form of fast internal memory, this is not relevant to multithreading.

The T1 has tiny caches and stalls on a cache miss, unlike any other high-end out-of-order CPU, so it requires more threads to keep going when one thread stalls. It is also designed for highly multithreaded workloads, so having more thread contexts means fewer context switches in software, which can be a big win on workloads running on UNIX/Windows (realtime OSes are far better at these things).

If you do not understand the differences between cores like Itanium-2, Pentium-4, Nehalem, Power5, Power6 (all 2-way multithreaded), and cores like the T1, MIPS34K and Ubicom (8+ -way threaded), then you're not the expert on multithreading you claim to be.

Wilco

Reply to
Wilco Dijkstra

"Wilco Dijkstra" skrev i meddelandet news:e8eRh.2250$ snipped-for-privacy@newsfe4-gui.ntli.net...

No, the solution is to have more associativity in the cache. Having 4GB of direct mapped cache will not help you when two threads start using the same cache line.

If you want to solve the general-purpose symmetric multiprocessing problem by putting the application memory on the chip, you are going to run into significant problems. You are beginning to get out of touch with reality, my dear friend.

I am saying that it is cheaper to use asymmetric multithreading than asymmetric multiprocessing.

In order for interrupts to be equivalent to multithreading, where you can select and execute an instruction from a different interrupt every clock cycle, you have to add additional constraints to your "interrupt" system.

You have to have multiple register files and multiple program counters in the system. You have to add additional hardware to dynamically raise/lower priorities in order to distribute instructions among the different interrupts. Your "interrupt" driven system is likely to be mistaken for a multithreading system.

Your way of discussing is way off; you ignore ALL arguments and requests to prove your point in favour of continued rambling...

You need to show that the given example (multiple SPI slaves) can be handled equally well by an *existing* interrupt-driven system as by an *existing* multithreaded system like the zero context-switch-cost MIPS processor.

The ball is now in your court: can you concentrate on that instead of rambling?

I already did. I showed that a zero context-switch-cost MIPS processor exists. You have not shown that zero-cost interrupts exist.

If we go back to the example:

You have a fixed clock. This is used by a number of SPI masters to provide data to your chip. Your chip implements the SPI slaves, and each SPI slave should run in a separate task/thread or whatever. The communication on each SPI slave channel is totally different and should be developed by separate teams that do not communicate with each other and are not aware of each other. Once per byte, the SPI data is written to memory and an event flag register private to the thread/interrupt is set.
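A hedged C sketch of what one such slave channel's thread might look like under this spec; spi_event_wait(), spi_read_data(), and the buffer handling are hypothetical, not a real device API:

#include <stdint.h>

#define BUF_SIZE 256

/* Hypothetical per-thread hardware interface. */
extern void    spi_event_wait(void);  /* sleep until this slave's byte event */
extern uint8_t spi_read_data(void);   /* the byte the hardware assembled */

/* One SPI slave channel, owned by one thread; each team can write its
 * own protocol handler without knowing about the other channels. */
void spi_slave_thread(void (*handle)(const uint8_t *buf, int len))
{
    uint8_t buf[BUF_SIZE];
    int n = 0;

    for (;;) {
        spi_event_wait();            /* private event flag, set once per byte */
        buf[n++] = spi_read_data();  /* byte the H/W wrote for this channel */
        if (n == BUF_SIZE) {
            handle(buf, n);          /* channel-specific protocol processing */
            n = 0;
        }
    }
}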

They are only aware of the execution environment, which in the interrupt case is the RTOS and how interrupts are handled.

Using one multithreaded and one interrupt-driven processor, with frequencies scaled so the top-level MIPS figures are equivalent, show that you can implement the SPI slaves.

It is the other way around. *Because* you have many threads you CAN stall a thread on a cache miss without affecting the total throughput of the CPU. It is very likely that the T1 pushes through more instructions per clock cycle than a "high-end, branch-predicting, out-of-order" single- or dual-thread CPU.

You seem to want to slip into a discussion of which type of CPU will exhibit the highest MIPS rate for a single thread. That is trying to force open an already open door.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

The IP3000 from Ubicom supports 8 threads in hardware. Their solution seems to me to be a very good solution for multithreading in hardware, where one needs deterministic response from all threads. It looks like they essentially switch between instruction streams in hardware such that from a software point of view each thread runs as if it is the only thread, but running on a CPU with only a percentage of the total speed.

Regards Anton Erasmus

Reply to
Anton Erasmus

No. If you switch between threads in a finegrained way you need to ensure that the working set of each thread stays in the cache. This means the cache needs to be large enough to hold the code and data from several threads. The problem is that L1 caches are often too small even for a single thread...

Associativity is not an issue at all, most caches are already 4 or 8-way set associative. If it were feasible, a 4GB direct mapped cache would not thrash at all as no threads would ever use the same line.

The current trend is clear: more on-chip memory either as caches or tightly coupled memory. And FYI there are no problems with symmetric multiprocessing, people have been doing it for many years. Cache coherency is a well understood problem, even high-end ARMs have it.

Is it really that difficult to understand? Let me explain it in a different way.

Start with the MIPS 34k core, and assign 1 thread to the main task and the others to one interrupt each. Set the thread priority of the interrupt threads to infinite. At this point the CPU behaves exactly like an interrupt driven core that uses special registers on an interrupt (many do so, including ARM). If you can only ever run one thread, you can't mistake this for a multithreaded core.

From the other perspective, in an interrupt-driven core you typically associate a function with each interrupt. There is *nothing* that prevents a CPU from prefetching the first few instructions of some or all interrupt routines. In combination with the use of special registers to avoid save/restore overhead, this can significantly reduce interrupt latency.

Now tell me what the difference is between the above 2 cases. Do you still believe interrupts and threads are not closely related?

Done that; please reread my old posts. I have also shown that any zero-cost context-switch multithreaded CPU (if it exists) can behave like a zero-cost interrupt-based CPU.

However you haven't shown a 40-thread CPU capable of running your example. Without one thread for each interrupt you need to use traditional interrupt handling rather than polling for events. Most embedded systems need more than the 8 interrupts/threads MIPS could handle, especially when combining 2 or more existing cores into 1 as you suggest.

No you didn't. The MIPS core can switch between threads on every cycle, but that doesn't imply zero cost context switch on an interrupt.

There is no such thing as a zero-cost interrupt. There are a few CPUs that can respond extremely quickly (e.g. the Transputer, Forth chips). However, there is a tradeoff between the need for fast execution of normal code and fast interrupt response time.

I've already described 2 ways of doing it, reread my old posts. If you think it is not possible, please explain why exactly you think that, then I'll explain the fallacy in your argument.

For the same amount of hardware, more threads means less space for caches, so more cache misses. More cache misses means you need more threads. Typical chicken and egg situation...

Actually T1 benchmarks are very disappointing: with twice the number of cores and 8 times the number of threads the T1 does not even get close to Opteron or Woodcrest on heavily multithreaded benchmarks...

It doesn't mean the whole idea is bad, I think the next generation will do much better (and so will AMD/Intel). However claiming that an in-order multithreaded CPU will easily outperform an out-of-order CPU on total work done is total rubbish.

No, I wasn't talking about fast single thread performance. My point is that it is a fallacy to think that adding more and more threads is always better. Like so many other things, returns diminish while costs increase. I claim it would be a waste to add more threads on an out-of-order core (max frequency would go down, more cache needed to reclaim performance loss, so not cost effective).

Wilco

Reply to
Wilco Dijkstra

Again you do not read, or you may not be aware of the difference between a direct-mapped cache and a set-associative cache. And your memory is failing as well, since I am proposing tightly coupled memory without any cache for all threads except the "application" thread.

Direct-mapped means that for each memory location there is exactly one location in the cache which can hold that word. Since your cache is not the same size as primary memory, each cache location is shared by a large number of memory locations which can only fit into that one cache location. If all threads happen to access memory locations that map onto the same cache location, you get terrible cache thrashing. Read a book on caches...
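To make the mapping concrete, here is a small illustrative C program for a made-up 16 KB direct-mapped cache with 32-byte lines: any two addresses exactly 16 KB apart index the same line, so two threads touching such addresses evict each other on every access.

#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE 16384u   /* made-up 16 KB direct-mapped cache */
#define LINE_SIZE  32u      /* 32-byte lines -> 512 lines total */

/* In a direct-mapped cache the line index is just a slice of the address,
 * so every address CACHE_SIZE bytes apart competes for the same line. */
static unsigned line_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % (CACHE_SIZE / LINE_SIZE);
}

int main(void)
{
    uint32_t thread_a = 0x00010000;             /* thread A's buffer */
    uint32_t thread_b = thread_a + CACHE_SIZE;  /* thread B's, 16 KB away */

    /* Both map to line 0: interleaved accesses thrash that one line. */
    printf("A -> line %u, B -> line %u\n",
           line_index(thread_a), line_index(thread_b));
    return 0;
}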

And way too expensive, if you can solve it with a multithreaded core connected to TCM.

Tell me how your interrupt system will make the pipeline execute instructions for two interrupts A and B occurring at the same time as

A1:B1:A2:B2:A3:B3:A4:B4:A5:B5:A6:B6:A7:B7

Instead of

B1:B2:B3:B4:B5:B6:B7:A1:A2:A3:A4:A5:A6:A7

Which I believe is the normal way for interrupts to behave...

You may also want to note the time until both threads/interrupts complete.

No, you have not shown that an interrupt-based CPU can interleave instructions the way a multithreaded core can. Your "zero interrupt latency" core does not and will not exist.

Again you refrain from answering. I have shown the MIPS threaded core, and running 40 threads on such a core is a simple extension of the basic concept. If it makes you happier, then try to do it with the 8 threads you can fit into the MIPS core.

I have never tried to prove that there are zero-cost interrupts. That is your idea, which will never fly.

Done earlier in this post. You cannot interleave instructions at a predetermined rate.

If you don't have a cache, you don't get any cache misses.

If you can replace a full core with a thread you always win.

Obviously you are not going to take the time to go through the SPI slave example which proves you wrong. I suspect the reason is that you know you are wrong but are too stiff-headed to admit it, so I consider any future discussion on this subject with you a total waste of time.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson
