New soft processor core paper publisher?

Verilog code for my Hive processor is now up:

formatting link

(Took me most of the freaking day to figure out SVN.)

Reply to
Eric Wallin

Thread A writes an integer value to location Z. It then reads a value at location X, increments it, and writes it back to location X.

Thread B polls location X and sees that it has been incremented. It reads the integer value at Z, performs some function on it, and writes it back to location Z. It then reads a value at Y, increments it, and writes it back to location Y to let thread A know it took, worked on, and replaced the integer at Z.
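
In C-like terms, the handshake described above might look like the sketch below, using volatile as a stand-in for uncached, strictly in-order memory access (the names X, Y, Z simply mirror the description; this is not code from the design):

#include <stdint.h>

/* One-shot mailbox handshake between two threads.  X and Y are
   signal flags, Z is the mailbox; all start at zero. */
volatile int32_t X = 0, Y = 0, Z = 0;

void thread_a(int32_t value)
{
    Z = value;          /* place the integer at location Z    */
    X = X + 1;          /* increment X to signal thread B     */
    while (Y == 0)      /* wait for B's acknowledge at Y      */
        ;
    /* Z now holds the transformed value */
}

void thread_b(int32_t (*f)(int32_t))
{
    while (X == 0)      /* wait for A's signal at X           */
        ;
    Z = f(Z);           /* read Z, work on it, write it back  */
    Y = Y + 1;          /* increment Y to acknowledge         */
}

Note that this exchange needs no atomic instructions because each location has exactly one writer at a time; that single-writer discipline appears to be exactly what is being proposed.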

The threads here are never preempted or otherwise delayed, and I don't see how interrupts are germane, but perhaps I haven't taken everything into account.

Consider a case where *both* threads A and B want to increment a counter at location X. A reads X and finds it contains 10. But before it can write back 11, B reads X, also finds 10, and also writes back 11. Now you've lost a count. Can this happen in your design? If so, you need some sort of atomic update instruction.
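
To make the hazard concrete: a plain increment is really three separate operations (load, add, store), and the other thread can slip in between. A minimal C sketch of the losing interleaving:

volatile int counter;   /* shared location X; holds 10 in the scenario above */

void increment(void)
{
    int tmp = counter;  /* A loads 10                          */
                        /* ...B loads 10 at this point...      */
    counter = tmp + 1;  /* A stores 11                         */
                        /* ...B stores 11: one count is lost   */
}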

Reply to
Bakul Shah

It can happen if the programmer is crazy enough to do it, otherwise not.

Anyone have comments on my paper or the Verilog?

Reply to
Eric Wallin

(snip)

In the core memory days, there was a special solution. Core read is destructive, so after reading the value out it has to be restored. For read-modify-write instructions, one can avoid the restore, and instead rewrite the new value. That assumes that the instruction set has a read-modify-write instruction, a favorite for DEC machines being increment and decrement.

DRAM also has destructive read, but except for the very early days, I don't believe it has been used in that way.

If the architecture does have a read-modify-write instruction, such as increment, it can be designed such that no other thread or I/O can come in between.
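
In C11 notation (used purely as a familiar way to write it down, not as a claim about any particular instruction set), such an interlocked increment is a single indivisible read-modify-write:

#include <stdatomic.h>

atomic_int counter;

void increment(void)
{
    /* The load, add, and store happen as one unit; no other
       thread or I/O access can fall in between. */
    atomic_fetch_add(&counter, 1);
}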

-- glen

Reply to
glen herrmannsfeldt

Concurrent threads need to communicate with each other to cooperate on some common task. Consider two threads adding an item to a linked list or keeping statistics on some events or many such things. You are pretty much required to be "crazy enough"! Any support for mutex would simplify things quite a bit. Without atomic update you have to use some complicated, inefficient algorithm to implement mutexes.
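
For instance, linking a node onto a shared list is a read of the head pointer followed by a write, and the pair must be indivisible. A sketch using a POSIX mutex (pthreads chosen only as familiar notation):

#include <pthread.h>

struct node { struct node *next; int item; };

static struct node *head;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

void push(struct node *n)
{
    pthread_mutex_lock(&list_lock);
    n->next = head;     /* without the lock, two threads can both  */
    head = n;           /* read the old head and one insert is lost */
    pthread_mutex_unlock(&list_lock);
}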

Reply to
Bakul Shah

Just so.

A programmer that doesn't understand that is the equivalent of a hardware engineer that doesn't understand metastability. (When I started out, most people denied the possibility of synchronisation failure due to metastability!)

Mind you, I'd *love* to see a radical overhaul of traditional multicore processors so they took the form of:
- a large number of processors
- each with completely independent memory
- connected by message passing fifos

In the long term that'll be the only way we can continue to scale individual machines: SMP scales for a while, but then cache coherence requirements kill performance.

Reply to
Tom Gardner

This sounds nice in theory, but in practice there can be problems. Scaling with number of processors can quickly become an issue here - lock-free algorithms and fifos work well between two processors, but scale badly with many processors. Independent memory for each processor sounds nice, and can work well for some purposes, but is a poor structure for general-purpose computing.
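
The two-processor case mentioned above is the classic single-producer/single-consumer ring buffer, which needs no locks at all, though only under the assumption that writes reach memory in program order. A hypothetical sketch:

#include <stdint.h>

#define FIFO_SIZE 16u                /* must be a power of two */

/* Only the producer writes 'wr'; only the consumer writes 'rd'. */
static volatile uint32_t wr, rd;
static int32_t buf[FIFO_SIZE];

int fifo_put(int32_t v)              /* producer side */
{
    if (wr - rd == FIFO_SIZE)
        return 0;                    /* full */
    buf[wr % FIFO_SIZE] = v;         /* write the data first...   */
    wr = wr + 1;                     /* ...then publish the index */
    return 1;
}

int fifo_get(int32_t *v)             /* consumer side */
{
    if (wr == rd)
        return 0;                    /* empty */
    *v = buf[rd % FIFO_SIZE];
    rd = rd + 1;
    return 1;
}

With more than two processors on either end the indices themselves become contended, which is where the scaling problem comes in.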

If you want to scale well, you want hardware support for semaphores. And you don't want to divide things up by processor - you want to be able to divide them up by process or thread. Threads should have independent memory areas, which they can access safely and quickly regardless of which cpu they are running on. Otherwise you spend much of your bandwidth just moving data around between your cpu-dependent memory blocks (replacing the cache coherence problems with new memory movement bottlenecks), or your threads have to have very strong affinity to particular cpus and you lose your scaling.

Reply to
David Brown

I'm glad you understand that.

No point in making such a comparison. If you want to understand Eric's chip, then learn about Eric's chip. I certainly don't know enough about the Propeller chip to compare in a meaningful manner.

Just think of each processor executing one instruction every 8 clocks, but with all processors out of phase, so no two complete on the same clock.

Not sure what you mean by "machine cycle". As I said above, there are 8 clocks to the processor machine cycle, but they are all out of phase. So on any given clock cycle only one processor will be updating registers or memory.
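
A toy software model of that schedule, purely for illustration:

#include <stdio.h>

/* Eight threads advance round-robin, one pipeline slot per clock,
   so exactly one thread updates registers or memory on any clock. */
int main(void)
{
    for (unsigned clk = 0; clk < 16; clk++)
        printf("clock %2u: thread %u completes an instruction\n",
               clk, clk % 8);
    return 0;
}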

I believe Eric's point is that the thing that prevents more than one processor from accessing the same memory location is the programmer. Is that not a good enough method?

Have you read the paper? How do you know it's not there?

Ok, so is this discussion over?

If you still have reservations, then learn about the design. If you don't want to invest the time to learn about the design, why are you bothering to object to it?

--

Rick
Reply to
rickman

I agree with all your points. Unfortunately they are equally applicable to the current batch of SMP/NUMA architectures :(

A key point is the granularity of the computation and message passing, and that varies radically between applications.

There are a large number of commercially important workloads that would work well on such a system, ranging from embarrassingly parallel problems such as soft real-time event processing, through some HPC, to big data (think map-reduce).

But I agree it wouldn't be a significant benefit for bog-standard desktop processing - but current machines are more than sufficient for that anyway!

I agree with all those points too.

Reply to
Tom Gardner


programming it afterward, for me anyway, was just too much.

You get a lot more with a small size increase. I think a 32 bit opcode is pushing it for a small FPGA implementation, but a 16 bit opcode gives one a couple of small operand indices, and some reasonably sized immediate instructions (data, conditional jumps, shifts, add) that I find I'm using quite a bit during the testing and verification phase. Data plus operation in a single opcode is hard to beat for efficiency, but it has to earn its keep in the expanded opcode space. With the operand indices you get a free copy/move with most single operand operations, which is another efficiency.
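
As a purely hypothetical illustration of that kind of packing (not Hive's actual encoding), a 16-bit opcode can hold an 8-bit operation with two 4-bit operand indices, or a 4-bit operation with an 8-bit immediate:

#include <stdint.h>

/* Hypothetical formats:
   register form:   [ opcode:8 | b:4 | a:4 ]
   immediate form:  [ op:4 | imm:8 | a:4 ]   */

static inline uint16_t enc_reg(uint8_t opcode, unsigned a, unsigned b)
{
    return (uint16_t)(opcode << 8 | (b & 0xFu) << 4 | (a & 0xFu));
}

static inline uint16_t enc_imm(unsigned op4, int8_t imm, unsigned a)
{
    return (uint16_t)((op4 & 0xFu) << 12 | (uint8_t)imm << 4 | (a & 0xFu));
}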

I'd be honored if you were to present it to SVFIG.

I was going to talk about the CPU design I had been working on, but I think it is going to be more of a survey of CPU designs for FPGAs ending with my spin on how to optimize a design. Your implementation is very different from mine, but the hybrid register/stack approach is similar in intent and results from a similar line of thought.

Turns out I am busier in July than expected, so I will not be able to present at the July meeting. I'll shoot for August. I've been looking at their stuff on the web and they do a pretty good job. I was thinking it was a local group and it would be a small audience, but I think it may be a lot bigger when the web is considered.

--

Rick
Reply to
rickman

The *only* way? lol You think like a programmer. The big assumption you are making, which is no longer valid, is that the processor itself is a precious resource that must be optimized. When x86 and ARM machines put four cores on a chip with one memory interface, they are choking the CPU's airway. Those designs are no longer efficient and the processor is underused. So clearly it is not the precious resource anymore.

Rather than trying to optimize the utilization of the CPU, design needs to proceed with the recognition of the limits of multiprocessors. Treat processors the same way you treat peripheral functions. Dedicate them to tasks. Let them have a job to do and not worry if they are idle part of the time. This results in totally different designs and can result in faster, lower cost and lower power systems.

--

Rick
Reply to
rickman

What assumptions is this based on? Do you know?

What are the alternatives to "mutexes"? How inefficient are they? When do you need to use a mutex?

Have you looked at Eric's design in the least? Do you have any idea of the applications it is targeted to?

--

Rick
Reply to
rickman

I mean it in the same sense as it was used in the posting that I replied to.

I believe Eric's point is that the thing that prevents more than one processor from accessing the same memory location is the programmer. Is that not a good enough method?

I'd prefer it if Eric gave the correct answer rather than someone else's possibly correct answer.

It is a good enough method for some things, and not for others.

If you still have reservations, then learn about the design. If you don't want to invest the time to learn about the design, why are you bothering to object to it?

There are *many* new designs which might be interesting. Nobody has time to look at them all, so they make fast decisions as to whether a design and its designer are credible.

I'm not objecting to it, but I am giving the designer the opportunity to pass the "elevator pitch" test.

Reply to
Tom Gardner

You mean you actually figured it out?

--

Rick
Reply to
rickman

The big assumption you are making, which is no longer valid, is that the processor itself is a precious resource that must be optimized. When x86 and ARM machines put four cores on a chip with one memory interface, they are choking the CPU's airway. Those designs are no longer efficient and the processor is underused. So clearly it is not the precious resource anymore.

I don't think that, and your statements don't follow from my comments.

Rather than trying to optimize the utilization of the CPU, design needs to proceed with the recognition of the limits of multiprocessors. Treat processors the same way you treat peripheral functions. Dedicate them to tasks. Let them have a job to do and not worry if they are idle part of the time. This results in totally different designs and can result in faster, lower cost and lower power systems.

That approach is valuable when and where it works, but can be impractical for many workloads.

Reply to
Tom Gardner

(snip)

If there are 8 processors that never communicate, it would be better to have 8 separate RAM units.

So no thread ever communicates with another one?

Well, read the Wikipedia article on spinlocks and the linked-to article on Peterson's algorithm.

It is more efficient if you have an interlocked write, but can be done with spinlocks, if there is no reordering of writes to memory.

As many processors now do reorder writes, there is need for special instructions.

Otherwise, spinlocks might be good enough.
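
For reference, Peterson's algorithm for two threads looks like this; as noted, it is only correct when loads and stores are not reordered, so machines that reorder writes need the special interlocked instructions instead:

#include <stdbool.h>

static volatile bool flag[2];   /* flag[i]: thread i wants in */
static volatile int  turn;      /* whose turn it is to yield  */

void lock(int self)             /* self is 0 or 1 */
{
    int other = 1 - self;
    flag[self] = true;          /* announce intent               */
    turn = other;               /* ...but let the other go first */
    while (flag[other] && turn == other)
        ;                       /* spin while contended          */
}

void unlock(int self)
{
    flag[self] = false;
}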

-- glen

Reply to
glen herrmannsfeldt

Why is that? What would be "better" about it?

Are we talking about the same thing here? We were talking about the Hive processor.

So your point is?

What would the critical section of code be doing that is critical? Simple interprocess communications is not necessarily "critical".

--

Rick
Reply to
rickman

All threads share the same Von Neumann memory, so of course they can communicate with each other.

If only there were a paper somewhere, written by the designer, freely available to anyone on the web...

Reply to
Eric Wallin

I believe Eric's point is that the thing that prevents more than one processor from accessing the same memory location is the programmer. Is that not a good enough method?

If Rick says anything wrong I'll correct him.

The paper has a bulleted feature list at the very front and a bulleted downsides list at the very back. I tried to write it in an accessible manner for the widest audience. We all like to think aloud now and then, but I'd think a comprehensive design paper would sidestep all of this wild speculation and unnecessary third degree.

formatting link

Reply to
Eric Wallin

(snip, someone wrote)

(then I wrote)

Well, if the RAM really is fast enough not to be in the critical path, then maybe not, but separate RAM means no access limitations.

I was mentioning it for context. For processors that do reorder writes, you can't use Peterson's algorithm.

Without write reordering, it is possible, though maybe not efficient, to communicate without interlocked writes.

"Critical" means that the messages won't get lost due to other threads writing at about the same time. Now, much of networking is based on unreliable "best effort" protocols, and that may also work for communications to threads. But that involves delays and retransmission after timers expire.

-- glen

Reply to
glen herrmannsfeldt
