New soft processor core paper publisher?

processor from accessing the same memory location is the programmer. Is that not a good enough method?

This is not good enough in general. I gave some examples where threads have to read/write the same memory location.

I agree with you that if threads communicate just through FIFOs and there is exactly one reader and one writer, there is no problem. The reader updates the read ptr and watches (but doesn't update) the write ptr. The writer updates the write ptr and watches (but doesn't update) the read ptr. You can use FIFOs like these to implement a mutex, but this is a very expensive way to implement mutexes and doesn't scale.
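That single-reader/single-writer FIFO can be sketched as follows (a minimal Python model, not Hive code; the class and field names are illustrative). The key property is that each pointer has exactly one writer, so no atomic read-modify-write is needed:

```python
# Sketch of a single-producer/single-consumer FIFO with separate
# read/write pointers. The producer writes only wr_ptr; the consumer
# writes only rd_ptr; each merely reads the other's pointer.
class SpscFifo:
    def __init__(self, size=8):
        self.size = size          # capacity in slots
        self.buf = [None] * size
        self.wr_ptr = 0           # advanced only by the producer
        self.rd_ptr = 0           # advanced only by the consumer

    def push(self, item):         # producer side
        if self.wr_ptr - self.rd_ptr == self.size:
            return False          # full: producer watches rd_ptr but never writes it
        self.buf[self.wr_ptr % self.size] = item
        self.wr_ptr += 1
        return True

    def pop(self):                # consumer side
        if self.rd_ptr == self.wr_ptr:
            return None           # empty: consumer watches wr_ptr but never writes it
        item = self.buf[self.rd_ptr % self.size]
        self.rd_ptr += 1
        return item
```

Because no memory location has two writers, this structure is safe on a machine with no atomic instructions at all, which is exactly the property under discussion.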

list at the very back. I tried to write it in an accessible manner for the widest audience. We all like to think aloud now and then, but I'd think a comprehensive design paper would sidestep all of this wild speculation and unnecessary third degree.

I don't think it is a question of "third degree". You did invite feedback!

Adding compare-and-swap or load-linked & store-conditional would make your processor more useful for parallel programming. I am not motivated enough to go through 4500+ lines of Verilog to know how hard that is, but you must already have some bus arbitration logic since all 8 threads can access memory.

I missed this link before. A nicely done document! A top level diagram would be helpful. 64K address space seems too small.

Reply to
Bakul Shah

location Z. It then reads a value at location X, increments it, and writes it back to location X.

incremented. It reads the integer value at Z, performs some function on it, and writes it back to location Z. It then reads a value at Y, increments it, and writes it back to location Y to let thread A know it took, worked on, and replaced the integer at Z.
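The hazard the other posters are pointing at can be replayed deterministically: if both threads read a location before either writes it back, one increment is lost. A minimal Python replay of that interleaving (illustrative only, not Hive code):

```python
# Deterministic replay of a lost update: both "threads" read the old
# value of x before either writes back, so one increment vanishes.
x = 0
a_tmp = x        # thread A reads 0
b_tmp = x        # thread B also reads 0, before A writes back
x = a_tmp + 1    # thread A writes 1
x = b_tmp + 1    # thread B writes 1 -- A's increment is lost
```

With an atomic read-modify-write, or with exactly one writer per location as Eric proposes, this interleaving cannot occur.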

otherwise delayed, and I don't see how interrupts are germane, but perhaps I haven't taken everything into account.

Have a look at

formatting link
section 25.3 et al for one exposition of the kinds of problem that arise.

That exposition is in x86 terms but it applies equally to the 11 other major processor families I've examined over the past 35 years. If there is a reason your processor cannot experience these issues, let us know.

Subsequent chapters on the solutions can be found

formatting link

Reply to
Tom Gardner

A spinlock is not good enough without special instructions; that is why Peterson's, Dekker's, and Szymanski's algorithms exist. Now most processors provide some h/w support for mutexes. Most papers on implementing a mutex with just shared memory are 25+ years old. Now this is just an interesting puzzle!
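For reference, Peterson's two-thread algorithm needs nothing but ordinary shared loads and stores, plus sequentially consistent ordering, which is precisely what it assumes of the hardware. A Python sketch (not Hive assembly; CPython's interpreter happens to provide the required ordering):

```python
import sys
import threading

sys.setswitchinterval(1e-4)  # switch threads often so the spin loops resolve quickly

flag = [False, False]  # flag[i]: thread i wants the critical section
turn = 0               # which thread yields when both want in
counter = 0            # shared state updated inside the critical section

def worker(me, n):
    global turn, counter
    other = 1 - me
    for _ in range(n):
        flag[me] = True
        turn = other                          # politely defer to the other thread
        while flag[other] and turn == other:  # spin until it is safe to enter
            pass
        counter += 1                          # critical section
        flag[me] = False                      # leave the critical section

threads = [threading.Thread(target=worker, args=(i, 200)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On real out-of-order hardware the plain stores to `flag` and `turn` would need memory barriers, which is why the hardware-assisted mutex instructions won out.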

Reply to
Bakul Shah

On Wednesday, June 12, 2013 at 11:17:18 PM UTC+2, Eric Wallin wrote:

The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPs in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.

Hi Eric,

first of all: I like your name, I have designed a soft-core CPU called ERIC5 ;-)

I have read your paper quickly and would like to give you some feedback:

- What is the target application of your processor? Barrel processors can make sense for special (highly parallel) applications but will have the problem that most programmers prefer high single thread performance simply because it is much easier to program.

- If you target general purpose applications in FPGAs, your core will be compared with e.g. Nios II or MICO32 (open source). They are about the same size, are fully 32bit, have high single thread performance and a full design suite. What are the benefits of your core?

- If you want the core to be really used by others, a C-compiler is a MUST. (I learned this with ERIC5 quickly.) This will most likely be much more effort than the core itself...

I know that designing a CPU is a lot of fun and I assume that this was the real motivation (which is perfectly valid, of course). Also it will give you experience in this field and maybe also reputation with future employers or others. However, if you want to make it a commercially successful product (or even more widely used than other CPUs on OpenCores), it will be a long hard way against Nios II, etc.

Regards,

Thomas

formatting link

Reply to
thomas.entner99


It talks about separate threads writing to the same location, which I understand can be a problem with interrupts and without atomic read-modify-write. All I can do is repeat that if you don't program this way, it won't happen. A subroutine can be written so that threads can share a common instance of it, but without using a common memory location to store data associated with the execution of that subroutine (unless the location is memory mapped HW). In Hive, there is a register that when read returns the thread ID, which is unique for each thread. This could be used as an offset for subroutine data locations.
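Eric's thread-ID scheme can be sketched like this (illustrative Python, not Hive code; the sizes and names are made up, and Hive's actual register and memory layout may differ). A shared subroutine stays safe because each thread indexes its own private slice of memory:

```python
# Per-thread scratch memory indexed by a unique thread ID, modeling a
# shared subroutine that never touches another thread's locations.
NUM_THREADS = 8          # Hive runs 8 threads
SCRATCH_PER_THREAD = 4   # illustrative scratch size per thread
memory = [0] * (NUM_THREADS * SCRATCH_PER_THREAD)

def subroutine(thread_id, value):
    # base acts like an offset derived from the thread-ID register
    base = thread_id * SCRATCH_PER_THREAD
    memory[base] = value            # private scratch: no location is shared
    return memory[base] * 2         # stand-in for real work on the data
```

This works for any number of concurrent callers precisely because no two thread IDs map to the same addresses.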

Reply to
Eric Wallin

I didn't see any examples that were essential. You talked about two processes accessing the same data. Why do you need to do that? Just have one process send the data to the other process so only one updates the list.

Doesn't scale? Can you explain?

You don't understand even the most basic concept of how the device works. There is no arbitration logic because there is only one processor that is time shared between 8 processes on a clock cycle basis to match the 8-deep pipeline.

I'm not trying to be snarky, but there are a lot of people posting here who really don't get the idea behind this design.

Too small for what? That is part of what people aren't getting. This is not intended to be even remotely comparable to an ARM or an x86 processor. This is intended to replace a MicroBlaze or a B16 type FPGA core.

--

Rick
Reply to
rickman

make sense for special (highly parallel) applications but will have the problem that most programmers prefer high single thread performance simply because it is much easier to program.

The target application is for an FPGA logic designer who needs processor functionality but doesn't want or need anything too complex. There is no need for a toolchain for instance, and operation has been kept as simple as possible.

compared with e.g. Nios II or MICO32 (open source). They are about the same size, are fully 32bit, have high single thread performance and a full design suite. What are the benefits of your core?

The benefit is it really is free, so you aren't legally bound to vendor silicon (not that all are). And if you hate having yet another toolset between you and what is going on, you're probably SOL with most soft processors as they are quite complex (overly so for many low level applications, IMO). No one will be running Linux on Hive for instance. But running Linux on any soft core seems kind of dumb to me; when you need that much processor you might as well buy an ASIC which is cheaper, faster, etc. and not a blob of logic.

T. (I learned this with ERIC5 quickly.) This will most likely be much more effort than the core itself...

Nope, ain't gonna do it, and you can't make me! :-) A compiler for something this low level is overkill and kind of asking for it IMO.

e real motivation (which is perfectly valid, of course). Also it will give you experience in this field and maybe also reputation with future employers or others. However, if you want to make it a commercial successful product (or even more widely used than other CPUs on opencores), it will be a long hard way against Nios II, etc.

I have like zero interest in Nios et al. Hive is mainly for my real use for serializing low bandwidth FPGA applications that would otherwise underutilize the fast FPGA fabric. But after all the work that went into it I wanted to get it out there for others to use, or perhaps to employ one or more aspects of Hive in their own processor core.

I hope to use Hive in a digital Theremin I've been working on for about a year now. Too soon to really know, but one thread will probably handle the user interface (LCD, rotary encoder, LEDs, etc.), another will probably handle linearization and scaling of the pitch side, another the wavetable and filtering stuff, etc., so I believe I can keep the threads busy. My main fear at this point is that heat from the FPGA will disturb the exquisitely sensitive electronics (there's only about 1 pF difference over the entire playable pitch range). The open project is described in a forum thread over at

formatting link
if anyone is interested (I'm "dewster").

Reply to
Eric Wallin

If that is a constraint on the permissible programming style then it would be good to state that explicitly - to save other people's time, to save you questions, and to save everybody late unpleasant surprises.

That is a very common programming paradigm that people will expect to employ to solve problems they expect to encounter. It would be beneficial for you to demonstrate the coding techniques that you expect to be used to solve their problems. Think of it as an application note :)

I presume "it" = code.

Sounds equivalent to keeping all the data on the thread's stack in most of the other processors I've used.

Works for data that isn't shared between threads.

But what about data that has, of necessity, to be shared between threads? For example a flag indicating whether or not a non-sharable global resource (e.g. some i/o device, or some data structure) is in use or is free to be used.

None of these situations are unique to your processor. They first became a pain point in the 1960s and necessitated development of techniques to resolve the problem. If you've found a way to avoid such problems, write it up and become famous.

Reply to
Tom Gardner

I don't follow your logic, but I bet that is because your logic doesn't apply to this design. Do you understand that there is really only one processor? So what advantage could there be having 8 RAMs?

Ok, so this does not apply to the processor at hand, right?

Your quotes are a bit hard to read. They turn the quoted blank lines into new unquoted lines. Are you using Google by any chance and ripping out all the double spacing or something?

Since this processor doesn't do write reordering, Bob's your uncle!

You are talking very general here and I don't see how it applies to this discussion which is specific to this processor.

--

Rick
Reply to
rickman

I plan to have one and only one thread handling I/O and passing the data on as needed via memory space to one or more other threads. I promise to be careful and not blow up space-time when I write the code. ;-)

Reply to
Eric Wallin

Hive is a barrel processor! Thanks for that term Thomas! I knew the idea wasn't original with me, but I had no idea the concept was so old (the 1964 Cray-designed CDC 6000 series peripheral processors) and has been implemented many times since:

formatting link

Reply to
Eric Wallin

On Wednesday, June 26, 2013 at 1:07:28 AM UTC+2, Eric Wallin wrote:

year now. Too soon to really know, but one thread will probably handle the user interface (LCD, rotary encoder, LEDs, etc.) another will probably handle linearization and scaling of the pitch side, another the wavetable and filtering stuff, etc. so I believe I can keep the threads busy.

OK, I understand your idea behind the processor better now. But I think you are targeting applications that could be realized also with PicoBlaze / Mico8 / ERIC5 which are all MUCH smaller than your design. Of course your design has the benefit of 32b operations.

I guess it makes sense and will be fun for you to use it in your own projects. However, if other people compare Hive with e.g. Nios, most of them will choose Nios because (for them) it looks less painful (both processors are new for them anyway, one can be programmed in C, for the other they have to learn a new assembler language, one is supported by a large company and large community, the other not, etc.).

I just want to point out that there is a lot of competition out there...

Regards,

Thomas

formatting link

Reply to
thomas.entner99

(snip)

Sure, that is pretty common. It is usually related to being reentrant, but not always exactly the same.

Yes, but usually once in a while there needs to be communication between threads. If no other time, to get data through an I/O device, such as separate threads writing to the same user console. (Screen, terminal, serial port, etc.)

-- glen

Reply to
glen herrmannsfeldt

wasn't original with me, but I had no idea the concept was so old (1964 Cray designed CDC 6000 series peripheral processors) and has been implemented many times since:

Yes, the old heroes of super computers and mainframes invented almost everything... E.g. I long assumed that Intel invented all this fancy out-of-order execution stuff, etc., just to learn recently that it was all long there, e.g.:

formatting link

Regards,

Thomas

formatting link

Reply to
thomas.entner99

(snip)

OK, but you need a way to tell the other thread that its data is ready, and a way for that thread to tell the I/O thread that it got the data and is ready for more. And you want to do all that without too much overhead.
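A low-overhead version of that handshake is a one-word mailbox plus a ready flag, where the flag is set only by the sender and cleared only by the receiver, so again no location has two writers. A sketch under those assumptions (the names are made up, not from Hive):

```python
# One-word mailbox with a ready flag. The I/O thread sets "ready"
# after writing data; the worker clears it after reading the data.
mailbox = {"data": 0, "ready": False}

def io_thread_send(value):
    if mailbox["ready"]:
        return False              # worker hasn't taken the last word yet
    mailbox["data"] = value
    mailbox["ready"] = True       # set only by the sender
    return True

def worker_receive():
    if not mailbox["ready"]:
        return None               # nothing new has arrived
    value = mailbox["data"]
    mailbox["ready"] = False      # cleared only by the receiver
    return value
```

The flag doubles as both "data is ready" and, once cleared, "ready for more", which is the round trip glen describes, at the cost of one word of state.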

-- glen

Reply to
glen herrmannsfeldt

If you care to go back through the discussion, I believe he did exactly that, say that two threads should not write to the same address. And we have already discussed that this can be worked around.

Or you just don't share data...

One issue is the use of the word "thread". I never understood the difference between thread and process until I read the link you provided. We don't have to be talking about threads here. I expect the processors will be much more likely to be running separate processes using separate memory. Does that make you happier? Then we can just say they don't share memory other than for communications that are well defined and preclude the conditions that cause problems.

Yes, or more specifically, it works as long as two threads (or processes) don't write to the same locations.

That's easy, don't have *global* I/O devices... let one processor control that I/O device and everyone else asks that processor for I/O support. In fact, that is one of the few ways to actually get benefit from this processor design. It is not all that much better than a single threaded processor in an FPGA. The J1 runs at 100 MIPS and this runs at 200 MIPS. But no one processor does more than 25. So how do you use that? You can assign tasks to processors and let them do separate jobs.

Yes, but none of these apply if you just read his paper...

--

Rick
Reply to
rickman

other process so only

Thereby reducing things to single threading.

Scale to more than two threads. For that it may be better to use one of the other algorithms mentioned in my last article. Still pretty complicated and inefficient.

There is no arbitration

processes on a clock cycle

In this case load-linked, store-conditional may be possible? Load-linked records the loaded address & thread id in a special register. If any other thread tries to *write* to the same address, a subsequent store-conditional fails & the next instruction can test that. You could simplify this further at some loss of efficiency: fail the store if there is *any* store by any other thread!
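The proposed reservation bookkeeping can be modeled in a few lines (a toy Python model, not hardware): load-linked records a (thread, address) reservation; any store by another thread to that address clears the reservation, so the store-conditional fails and the software retries:

```python
# Toy model of load-linked / store-conditional reservation tracking.
memory = {}
reservations = {}  # thread_id -> address it holds a reservation on

def load_linked(tid, addr):
    reservations[tid] = addr            # record (address, thread) as proposed
    return memory.get(addr, 0)

def store(tid, addr, value):
    # Any write invalidates other threads' reservations on this address.
    for t, a in list(reservations.items()):
        if a == addr and t != tid:
            del reservations[t]
    memory[addr] = value

def store_conditional(tid, addr, value):
    if reservations.get(tid) != addr:
        return False                    # reservation lost: caller must retry
    store(tid, addr, value)
    del reservations[tid]               # reservation is consumed by success
    return True
```

The simplified variant Bakul mentions would clear *all* reservations on *any* store, which needs only a single valid bit per thread instead of an address comparator.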

Reply to
Bakul Shah

(snip)

The 360/91 is much more fun, though. Intel has out-of-order execution, but in-order retirement: the results of instructions are committed in order. That takes memory to keep things around for a while.

The 360/91 does out-of-order retirement. It helps that S/360 (except for the 67) doesn't have virtual memory.

When an interrupt comes through, the pipelines have to be flushed of all instructions, at least up to the last one retired. The result is an imprecise interrupt, where the address reported isn't the instruction at fault. (It is where to resume execution after the interrupt, as usual.) Even more, there can be multiple imprecise interrupts, as more can occur before the pipeline is empty.

Much of that went away when VS came in, so that page faults could be serviced appropriately.

The 360/91 was for many years, and maybe still is, a favorite example for books on pipelined architecture.

-- glen

Reply to
glen herrmannsfeldt

Why would communications be a problem? Just let one processor control the I/O device and let the other processors talk to that one.

--

Rick
Reply to
rickman

Maybe you need to define what you mean by thread and process...

I didn't mean explain what "more" means, explain *why* it doesn't scale.

Do you understand the processor design?

--

Rick
Reply to
rickman
