TILE64 embedded multicore processors - Guy Macon

I have been following the development of these processors for the last five years, but only recently have I seen a bunch of marketing material that ranges from simply wrong to outright deceptive.

For an example of being just plain wrong, look at the pretty picture here: [formatting link]

Looks like the corner processors only connect to two other processors, doesn't it? Actually, the topology is a torus, so the far right processor on each row has a wraparound connection to the far left processor on that row. Ditto for top/bottom.

Another claim that is just plain wrong: "In architectures of this sort, you can keep growing and you won't have any serious congestion."

The reality is that it takes one cycle for data to move from a processor to one of its four nearest neighbors, two cycles to reach the four nearest diagonal processors, and eight cycles to reach the processor farthest away -- and those 8 cycles will become 16 cycles on a 256-core design. Note that these 8 or 16 cycles limit the latency of the L3 cache... It is also a basic reality of this architecture that as you scale up to more processors, each one has more data passing through it, causing -- you guessed it -- serious congestion.
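To put numbers on that scaling claim, here is a small Python sketch (my own figures, not anything from Tilera's documentation) of worst-case hop counts under simple dimension-ordered routing, assuming one cycle per hop as described above:

```python
# Sketch (my numbers, not Tilera's): worst-case hop counts under
# simple dimension-ordered (XY) routing on an n x n array,
# assuming one cycle per hop.

def mesh_max_hops(n):
    # Without wraparound, the farthest pair is opposite corners:
    # (n - 1) hops in each dimension.
    return 2 * (n - 1)

def torus_max_hops(n):
    # With wraparound, no trip along an axis exceeds n // 2 hops.
    return 2 * (n // 2)

for n in (8, 16):
    print(f"{n*n:3d} cores: mesh {mesh_max_hops(n):2d} hops,"
          f" torus {torus_max_hops(n):2d} hops")
```

Without the wraparound links the worst case would be even longer (14 hops on an 8x8 mesh); with them it is n/2 hops per axis, which is exactly the 8-to-16-cycle growth described above.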

And, of course, they are trotting out the age-old vaporware pixie dust compiler that will by magic solve all the problems involved with writing code for parallel processing, just like all the previous vaporware pixie dust compilers were supposed to solve all the problems involved with parallel processing.

It is also quite telling that they aren't really revealing all the technical details. Go ahead and try to find out what the instruction set is, whether all those processors can each talk directly to the gigabit ethernet ports on the board they say they are selling, or even the price of that board.

The hype says that this is a "sea change in the computing industry," and the "first significant new chip architectural development in a decade."

The reality is that this is an old idea with a few new twists, suitable for some embedded applications but nothing earthshaking.

--
Guy Macon
Reply to
Guy Macon

Isn't that a really dumb way to make a torus, with one really long link? Why wouldn't you interleave and make the links all two processors long?

Just marketing bogosity.

If they are just now at Hot Chips, they might have first hardware.

Remember all the puffery over Tera? Nowadays the silence is deafening.

Reply to
Del Cecchi

I am having trouble visualizing what you are talking about. Is there a way to organize a row/column array so that every cell links to four neighbor cells (torus topology) without the wraparound link being longer?

Not that I think the extra length matters; the difference in propagation delay is a small fraction of a nanosecond.

Reply to
Guy Macon

In a folded torus, logical neighbors are not physically adjacent. So each tile does not connect to its physical neighbors; instead all links are 2 tiles long and connect to logical neighbors that are one tile farther away.
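A quick way to convince yourself this works is to compute the physical link lengths for a folded ring. This Python sketch (mine; the node ordering is one possible fold, following the squash-the-ring construction) interleaves the two halves of a ring and measures each logical link in units of tile pitch:

```python
# Sketch of the folded-torus idea: squash the logical ring flat and
# interleave the two halves, so the physical order becomes
# 0, n-1, 1, n-2, 2, n-3, ...  (one possible fold; others exist).

def folded_order(n):
    top = list(range(n // 2))                    # 0 .. n/2-1
    bottom = list(range(n - 1, n // 2 - 1, -1))  # n-1 down to n/2
    order = []
    for a, b in zip(top, bottom):
        order += [a, b]
    return order

def max_physical_link(n):
    # Length of each logical ring link i <-> (i+1) mod n,
    # measured in tile pitches along the physical line.
    pos = {node: i for i, node in enumerate(folded_order(n))}
    return max(abs(pos[i] - pos[(i + 1) % n]) for i in range(n))

print(folded_order(8))       # [0, 7, 1, 6, 2, 5, 3, 4]
print(max_physical_link(8))  # 2 -- no link longer than two tiles
```

The same holds at any even size: every link is at most two tile pitches long, regardless of n.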

You can find the details in Dally & Towles section 5.3; this link might work:

formatting link

Wes Felter - snipped-for-privacy@felter.org

Reply to
Wes Felter

Build it as a real torus, then squash it flat. The two sides are now interleaved, but not actually connected except at the edges.

maybe in capacitance?

--
	mac the naïf
Reply to
Alex Colvin

Hi Guy, How does this architecture solve the cache coherent problem?

Reply to
Ivan Wang

I wondered about the same thing for a second or two, here's how I would solve it:

+--+-----+-----+-----+
1  2  3  4  5  6  7  8
+-----+-----+-----+--+

I.e. you get two short connections at the ends, all the others are the same 2-core length. It is of course easy to extend the corner strips so as to make all links the same latency.

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

Lack of cache coherence is a big problem with the architectures PCs use now, but that's largely an artifact of the attempt to make multiple processors look like one processor to the software. Once you decide to start clean you can write code that largely avoids the root cause of cache coherency issues; two processors having read/write access to the same area of memory at the same time. But see below concerning how difficult that is to do...

With this class of multiprocessor systems (I say "this class" because they refuse to explain the details of how the TILE64 works internally) the usual method for systems that don't have any special hardware features to help (snooping, snarfing, etc.) is as follows:

Start by using semaphores to exclusively allocate areas of memory to individual processors. Obviously those areas have no cache coherency problem. Next, allow the processors to negotiate handing off the ownership of areas of memory with data intact. Finally, allow the processors to negotiate areas of memory that are uncached. None of this can be done without taking a hit on performance and scalability, but that's a problem all multiprocessor systems have, and the more processors, the worse it gets.
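As an illustration only -- Tilera has not documented the TILE64's actual mechanism -- here is a toy Python sketch of the first two steps above, where each memory region has exactly one owning processor at a time and ownership is handed off under a lock standing in for a semaphore:

```python
# Toy illustration (not the TILE64's actual, undocumented mechanism):
# each region of memory has exactly one owner at a time, so no two
# processors ever hold a writable cached copy of the same region.

import threading

class OwnedRegion:
    def __init__(self, size, owner):
        self.data = bytearray(size)
        self.owner = owner
        self.lock = threading.Lock()  # stands in for a hardware semaphore

    def write(self, cpu, offset, value):
        # Only the owning processor may touch the region, so its
        # cached copy is trivially coherent.
        assert cpu == self.owner, "only the owner may write"
        self.data[offset] = value

    def hand_off(self, old_owner, new_owner):
        # A real implementation would flush the old owner's cached
        # lines here before transferring ownership, data intact.
        with self.lock:
            assert self.owner == old_owner
            self.owner = new_owner

region = OwnedRegion(64, owner=0)
region.write(0, 0, 42)
region.hand_off(0, 1)     # negotiate ownership transfer
region.write(1, 0, 43)    # now only CPU 1 may write
```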

This is a lot simpler than the MOESI (Modified/Owned/Exclusive/ Shared/Invalid) scheme used in processors such as the AMD64, but MOESI really needs low latency and high bandwidth between the CPUs. Data on a TILE64 can require as many as eight processor-to-processor hops to move between processors.

The basic reality that the marketing fluff cannot hide is that to make an application run well on 64 processors is *hard*. The application has to be structured to take advantage of multiple processors, which is easy with some apps, impossible with others. That's why I said that this is a good candidate for embedded systems. If I am developing a gigabit router the huge effort to take advantage of 64 cores is well worth it if by doing so I can reduce product costs by a few dollars. For a general purpose computer running a wide variety of software, no. That's why the marketing fluff always talks about these magical compilers that will sprinkle pixie dust over existing code and make it run well on 64 processors. And we will all be living on the big rock candy mountain and eating rainbow stew...

Here is another choice bit of fluff:

"Our view is that the battle over instruction set architectures is over," explains Doud. "The processor core is the new transistor, and no one cares about ISAs unless they are coding in assembly language at this point." That is a paraphrase of a saying that Agarwal, who is now chief technology officer at Tilera, has drilled into everyone's heads.

formatting link

I say that assembly language will *always* matter.

(Note that the author of the above - Timothy Prickett Morgan - is *not* writing fluff; only the quotes from Tilera in the article are fluffy.)

--
Guy Macon
Reply to
Guy Macon

'The cache coherent problem' occurs when applications require or share more memory than is local to each processor. Some applications like signal processing chains may easily partition to multiple processors to spread across multiple local memories, but some won't. So, some applications might be easily supported by software to be spread across multiple processors, but many won't, so YMMV. If you were the marketeer, which would you be talking about?

It will always matter to people who write:

1) the processor kernel
2) function libraries
3) critical portions of applications

and others. But these people are dealing at a different level of abstraction from the systems architects. So if you are indoctrinating your systems architects it is probably useful to say that ISAs don't matter while you are deciding the multi-processor interconnect and packaging.

Dale B. Dalrymple

formatting link
http://stores.lulu.com/dbd

Reply to
dbd

Aha! I think I see how this sort of layout works.

Here is an 8x8 torus array the way I would have wired it:

TORUS TOPOLOGY (4x4; each row and each column wraps around):

    :   :   :   :
 ---A---B---C---D---
    |   |   |   |
 ---E---F---G---H---
    |   |   |   |
 ---I---J---K---L---
    |   |   |   |
 ---M---N---O---P---
    :   :   :   :

Each row's dangling left edge connects back to its right edge (A-D, E-H, I-L, M-P), and each column's top connects back to its bottom (A-M, B-N, C-O, D-P). Drawn another way, the wraparound links become long wires routed around the outside of the grid.

With the above topology, each processor can reach any other processor in 4 hops or less, and the hops are either one unit or 3 units of physical distance.
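The 4-hop figure falls out of the torus distance formula: the hop count per axis is min(|d|, n-|d|). A small Python sketch (names are my own) checks it:

```python
# Sketch: hop distance between two tiles on an n x n torus is the
# wraparound-aware Manhattan distance, min(|d|, n - |d|) per axis.

def torus_hops(a, b, n):
    (ax, ay), (bx, by) = a, b
    dx = min(abs(ax - bx), n - abs(ax - bx))
    dy = min(abs(ay - by), n - abs(ay - by))
    return dx + dy

n = 4
worst = max(torus_hops((0, 0), (x, y), n)
            for x in range(n) for y in range(n))
print(worst)  # 4 -- any tile of a 4x4 torus is within 4 hops
```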

If you replace each row and column...

--A---B---C---D--
|               |
-----------------

With this:

  -------
  |     |
A---C B---D
  |     |
  -------

You get shorter maximum distances.

I tried sketching the 4x4 matrix above using this scheme (I am *not* going to attempt it in ASCII art...) and got a maximum of 3 hops (same as before) with the maximum line length reduced from 3 units to 2. It appears to still be a torus.

Unless I am sketching/visualizing this wrong, it looks like an 8x8 matrix has a maximum of 8 hops (same as before) with the maximum line length reduced from 7 units to 2.

  -----   -----   -----
 /     \ /     \ /     \
A---B   C   D   E   F   G---H
     \     / \     / \     /
      -----   -----   -----

(This is the topology Terje Mathisen drew; see above)

...I didn't sketch it, but it looks like it will scale to 16x16 or 32x32 arrays. Simple once you see it!

I have never had occasion to lay out a grid of anything and interconnect them as a torus, but if I ever do, I can minimize the line length instead of doing the long wraparound. Thanks to Del Cecchi and Terje Mathisen for explaining the technique.

--
Guy Macon
Reply to
Guy Macon

That one is going into my Quote File:

"[This is] where it becomes interesting. Having only a few of some resource makes it hard to manage. Having one is easy - there's no choice. Having lots is easy - use what you can. Having a few means you have to choose carefully." -Alex Colvin

Reply to
Guy Macon
+---------------
| Alex Colvin wrote:
| >That's the point where it becomes interesting. Having only a few of some
| >resource makes it hard to manage. Having one is easy - there's no choice.
| >having lots is easy - use what you can. Having a few means you have to
| >choose carefully.
|
| That one is going into my Quote File:
| "[This is] where it becomes interesting. Having only a few
| of some resource makes it hard to manage. Having one is
| easy - there's no choice. Having lots is easy - use what
| you can. Having a few means you have to choose carefully."
| -Alex Colvin
+---------------

This is, of course, directly related to the old saying that the only magic numbers in computer (or systems or operating system) architecture are zero, one, (two[1],) and infinity. See also:

formatting link

formatting link

And let's not forget George Gamow's 1988 book "One, Two, Three, Infinity: Facts and Speculations of Science". But that's probably getting a bit off-topic... ;-}

-Rob

[1] Some pundits [sometimes including yours truly] will insist on adding "two" to the list because there are a few extremely special cases that only work for two of something [e.g., Dekker's Algorithm]. The "c2.com" URL mentioned above includes considerable discussion about whether or not to include "two".

----- Rob Warnock

627 26th Avenue San Mateo, CA 94403 (650)572-2607
Reply to
Rob Warnock

...

Er, my copy dates from the '50s, with a copyright date of 1947. Did he release an update 20 years after his death (I wouldn't put it past him)?

- bill

Reply to
Bill Todd

Why would you think this? How do you suppose the tiles communicate with the outside world if their networks go 'round in circles?

Reply to
jetlagmk2

I won't take credit for that - it's folk wisdom.

In fact, the only [finite] regular cardinals are 0, 1, and 2, where regular cardinals can't be made out of fewer smaller cardinals.

--
	mac the naïf
Reply to
Alex Colvin

The TILE64 I/O is not the same as the TILE64 interprocessor communication. The TILE64 interprocessor communication consists of each processor communicating with its four (not three or two) nearest neighbors and the processors handing data off if it is destined for a processor that is not an immediate neighbor. The I/O (what you call communicating with the outside world) is not revealed in the rather poor descriptions of the TILE64, but is almost certainly not part of the processor-to-processor mesh. I would guess a bus topology for the I/O, but those 16 gigabit ethernet ports on the development board make me wonder whether perhaps they have one ethernet port connected to each group of four processors.

The processors on the right edge do indeed communicate directly with the processors on the left edge. Same with top and bottom. Thus each processor talks to four neighbors, including the ones on the corners and edges. Look carefully at the descriptions.

--
Guy Macon
Reply to
Guy Macon

You might want to read the Architecture Brief:

formatting link

Reply to
jetlagmk2
+---------------
| Rob Warnock wrote:
| > And let's not forget George Gamow's 1988 book "One, Two, Three,
| > Infinity: Facts and Speculations of Science".
|
| Er, my copy dates from the '50s, with a copyright date of 1947. Did he
| release an update 20 years after his death (I wouldn't put it past him)?
+---------------

*D'Oh!!* Of *course* it had to be that long ago, since I myself read it in *high school* in the early 60's!! [I even still remember where the shelf it was on was located, oddly enough, and roughly where it was kept on that shelf.] Like an idiot, I just blindly cut & pasted the publication date listed on the Google Books page:

formatting link
One Two Three . . . Infinity: Facts and Speculations of Science By George Gamow Published 1988 Courier Dover Publications Science/Popular works 352 pages ISBN 0486256642

without actually *reading* the date!! (*Sheesh!*)

Thanks for the catch!

Hmmm... The "Courier Dover" edition... Dover Publications... New American Library 1960 Paperback

The one I read was a hardback, so it must have been older still...

Aha!! The Internet Archive has the 1961 version online for free here:

formatting link

The PDF is 32 MB, but the ".txt" version [caveat: seems to have a rather large number of OCR scanning errors!] is only 629 KB:

formatting link

and the beginning of that suggests that the original version was:

One two three ... infinity

FACTS & SPECULATIONS OF SCIENCE

by George Gamow

PROFESSOR OF PHYSICS UNIVERSITY OF COLORADO ... THE VIKING PRESS * NEW YORK ... COPYRIGHT 1947, 1961 BY GEORGE GAMOW ... REVISED EDITION PUBLISHED IN 1961 BY THE VIKING PRESS, ING. 625 MADISON AVENUE, NEW YORK 22, N.Y.

SECOND PRINTING OCTOBER 1962

PUBLISHED SIMULTANEOUSLY IN CANADA BY THE MACMILLAN COMPANY OF CANADA LIMITED

-Rob

----- Rob Warnock

627 26th Avenue San Mateo, CA 94403 (650)572-2607
Reply to
Rob Warnock

I did. It does *not* describe the architecture of the TILE64.

It is a marketing document, describing various "innovations" without revealing basic architectural details such as how the I/O is connected to the cores, how wide the data paths are, or what the instruction set looks like.

--
Guy Macon
Reply to
Guy Macon

I would consider the width of data paths an implementation artifact, not part of the architecture. In fact, how the I/O is connected is probably also an implementation artifact.

Reply to
Del Cecchi
