The next step: A way to produce flexible gallium arsenide wafers in quantity has been found


Rendering is embarrassingly parallelizable, i.e. you can get nearly an N-fold speedup from N cores. Database updates are another story. When I was at IBM, I used to kid the mainframe guys that if anyone figured out how to parallelize database updates really well, they'd be out of work. Most of them agreed, IIRC.

Multicore isn't very different from ordinary multithreading, except in the details of managing resource contention, where you have to use spinlocks sometimes. My 3D electromagnetic simulator code runs on a Linux or Windows cluster built out of multicore machines. IIUC newer multicore architectures are even more like single-core multithreading, since instead of spinlocks you can put the core to sleep for a while, which is pretty similar to a thread blocking on a mutex.

I found it fairly hard to make a serialization model that worked on symmetric multiprocessors as well as clusters, in Windows, Linux, and OS/2. (I started writing it in 2003, but there's a debugger that I *love* that only runs in OS/2, so I wanted to support it too. RIP.)

To make a portable solution, I eventually used a combination of mutexes and file handles. Linux filesystem semantics are really loosey-goosey--any number of fopen() calls on a file will succeed if they're from the same process, which makes file pointers useless as synchronization elements on Linux, whereas they're pretty good on Windows and especially OS/2. And then there's NFS, but don't get me started.
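A hedged sketch of that mutex-plus-file-handle idea in modern terms, using advisory file locks as the cross-process element (the function names are illustrative; on Linux this is fcntl.flock(), on Windows msvcrt.locking()):

```python
import os
import sys

def lock_file(path):
    """Acquire an exclusive cross-process lock via a file handle (illustrative)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    if sys.platform.startswith("win"):
        import msvcrt
        msvcrt.locking(fd, msvcrt.LK_LOCK, 1)   # lock the first byte
    else:
        import fcntl
        fcntl.flock(fd, fcntl.LOCK_EX)          # whole-file advisory lock
    return fd

def unlock_file(fd):
    if sys.platform.startswith("win"):
        import msvcrt
        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
    else:
        import fcntl
        fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

An in-process mutex would still be paired with this, since advisory locks don't serialize threads within one process: exactly the mutex-plus-file-handle combination described above.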

So in general multicore is a problem mostly for those being dragged kicking and screaming out of the single-thread world. No?

Cheers

Phil Hobbs

(Who wrote his first 32-bit multithreaded app under OS/2 2.0 in April 1992, while Windows was just eye candy on top of DOS, and Linux was still a gleam in its father's eye.)
--
Dr Philip C D Hobbs
Principal
ElectroOptical Innovations
55 Orchard Rd
Briarcliff Manor NY 10510
845-480-2058
hobbs at electrooptical dot net
http://electrooptical.net
Reply to
Phil Hobbs

If we talk about video encoding and compression in general, there are several cases in which a huge parallel processing power would be useful.

In modern compression systems, there are numerous options for how to encode a sequence. It is hard to predict in advance which method will give the best result in terms of quality and transfer or storage requirements. With sufficient computing power, each encoding option can be executed in parallel on the same uncompressed material, and after encoding, the method that gives the best result can be selected on a second-by-second basis.
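A rough sketch of that try-everything-in-parallel idea, with zlib compression levels standing in for real encoder options (the option names and the use of zlib are purely illustrative; a real encoder would compare rate-distortion, not just size):

```python
from multiprocessing import Pool
import zlib

def encode(option):
    name, level, chunk = option
    return name, zlib.compress(chunk, level)

def best_encoding(chunk):
    # Run every candidate "encoder" on the same material in parallel,
    # then keep whichever produced the smallest output.
    options = [("fast", 1, chunk), ("default", 6, chunk), ("best", 9, chunk)]
    with Pool() as pool:
        results = pool.map(encode, options)
    return min(results, key=lambda r: len(r[1]))

if __name__ == "__main__":
    name, data = best_encoding(b"abcabc" * 2000)
    print(name, len(data))
```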

Generating motion vectors requires detecting an object and also detecting where it has moved in the next picture (or where it was in the previous picture). The search in all directions around the current location can be performed in parallel, with the best match then used to generate the motion vector.
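A small sketch of that exhaustive block-matching search (pure-Python sum-of-absolute-differences on toy frames; in a many-core encoder each candidate displacement could be scored on its own core):

```python
def sad(ref, cur, bx, by, dx, dy, n=4):
    """Sum of absolute differences for the n x n block at (bx, by) in `cur`
    against the block displaced by (dx, dy) in the reference frame `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def motion_vector(ref, cur, bx, by, radius=2):
    # Every candidate displacement is independent, so this loop
    # parallelizes trivially across cores.
    candidates = [(dx, dy) for dy in range(-radius, radius + 1)
                           for dx in range(-radius, radius + 1)]
    return min(candidates, key=lambda d: sad(ref, cur, bx, by, d[0], d[1]))
```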

Video sequences consist of several pictures, each of which could be processed by a separate group of processors, at least until intra-coding is used.

For instance, an HDTV 1920x1080 picture can be divided into about 8100 macro-blocks of 16x16 pixels each (1920x1080/256; in practice 120 x 68 = 8160, with the 1080 lines padded to 1088). With only 300 cores, each core would have to handle a slice of macro-blocks :-).
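The macro-block arithmetic, sketched out (real encoders pad 1080 lines to 1088 so the height divides evenly, which is why the raw 1920x1080/256 = 8100 becomes 120 x 68 = 8160 whole blocks):

```python
import math

width, height, mb = 1920, 1080, 16
cols = width // mb                   # 120 macro-block columns
rows = math.ceil(height / mb)        # 68 rows once 1080 is padded to 1088
blocks = cols * rows                 # 8160 macro-blocks per frame
per_core = math.ceil(blocks / 300)   # ~28 macro-blocks per core with 300 cores
print(cols, rows, blocks, per_core)
```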

Reply to
Paul Keinanen

On a sunny day (Fri, 28 May 2010 10:57:50 +0300) it happened Paul Keinanen wrote in :

Thank you for the deep insight. Yes, that does not work for all frame types, hehe.

Thank you for the deep insight. Just a quick question: How do you transfer data between those 300 cores?

I am just eager to see all the theoretical advantages of a 300 core, resulting in a real product that beats a 300x clock single core.

It will never happen. Publish the code! Your chance at fame!

Reply to
Jan Panteltje

On a sunny day (Thu, 27 May 2010 16:37:00 -0500) it happened "Tim Williams" wrote in :

Often video comes in ONE FRAME AT A TIME. So I also do real-time processing. In a real broadcast environment, say HD, you have several HD cameras streaming; encoding is needed, recording is needed.

I grant you that you could indeed chop a stream up once it is recorded and work on sections of it. Not a bad idea actually. But not always easy to do; say you render a sequence with Blender? But for pure transcoding it could work.

As to pipes and streams, it is the best way I know to do multiple operations on signals. It is the Unix way; many have criticised it, and it has always won in the end. First it was text only, filters via grep or awk or sed or whatever, then it was audio, then as speed increased it was video. I was one of the first to use it for video I think, more than 10 years ago. The system has proved itself; I wrote the C code with Moore's law in mind, knowing it would run faster and faster and finally in real time.

Now you go and write the 300 core parallel processing stuff, I am waiting.

You are so clever, I am amazed. I am just waiting for the programs. OTOH I ain't buying more than a 6 core I think. In fact I am not buying anything in the form of a computer now, but I will go for the 300 GHz single core. Been fixing up the house lately :-) Tell you one thing, computahs are much cheaper.

Reply to
Jan Panteltje

A camera generating 1920x1080p60 requires only 125 Mpix/s (8 ns/pixel); assuming 3x10 bits/pixel, this would fit into a 32-bit bus.

Using the producer/consumer model, in which each node only picks the data it is interested in, only a single 32-bit bus connected to all cores would be required, running at 125 MHz. This does not seem too hard these days.
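The bus arithmetic behind that claim, as a quick check (not a design):

```python
pixel_rate = 1920 * 1080 * 60    # 124,416,000 pixels/s, i.e. ~125 Mpix/s
bits_per_pixel = 3 * 10          # three 10-bit components per pixel
ns_per_pixel = 1e9 / pixel_rate  # ~8 ns per pixel
# One pixel per bus word: a single shared 32-bit bus at ~125 MHz suffices.
assert bits_per_pixel <= 32
print(pixel_rate, round(ns_per_pixel, 2))
```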

With 16x16 macro blocks, you need to communicate with 8 neighbor macro-blocks (N, NE, E, SE, S, SW, W, NW).

With slices, you only need to communicate with the slice above and below.

This is a simple case that would benefit from a large number of processors; making a general product that could effectively use the large number of cores is of course much harder.

Reply to
Paul Keinanen

On a sunny day (Fri, 28 May 2010 14:21:10 +0300) it happened Paul Keinanen wrote in :

Yes. Do you remember the 'transputer'?

Reply to
Jan Panteltje

Well in that case, you can't do better than transcoding 1 frame in 1/FPS seconds, so you only need a limited amount of processing power regardless. And it's going to be a lot less power than e.g. transcoding the Library of Congress in an hour.

What is significant about a "sequence with Blender" that can't be evaluated for all time? Are there animations composed of difference equations, rather than predefined equations (e.g. "live" physics simulation)?

Even if so, the geometry can be solved beforehand, or evaluated to a certain point so that each process can evaluate its section of time independently.

Lots of differential problems (think FEA) are subject to useful parallelism (if not always as embarrassingly so as graphics tends to be), so I fail to see how it ends here. Just like ever... it is subject to the skill of the programmer, and how much foresight he has.

Tim

--
Deep Friar: a very philosophical monk.
Website: http://webpages.charter.net/dawill/tmoranwms
Reply to
Tim Williams

On a sunny day (Fri, 28 May 2010 16:33:48 -0500) it happened "Tim Williams" wrote in :

That last remark I fail to see; we are moving towards higher and higher resolutions, more complex encoding schemes, more frames per second plus 3D, and more than one stream at a time.

Well, I dunno how big that is, but one day maybe it will all fit on a postage-stamp-size medium. But that is another subject, an interesting one; it reminds me of the Alien anecdote: an Alien came to earth, looked around a bit, found it all very interesting, and wanted to take the accumulated knowledge of the earthlings home with him (it?) to the home planet. So he got the Encyclopedia Britannica, but it was too big and too heavy to fit in the flying saucer. (I have to change sentence construction constantly to avoid the him / her / it dilemma with that alien; not sure how aliens reproduce. Oh well, back to that problem with weight and size, hehe.) So, anyways, the Alien writes down all of the characters in those books as one long hex ASCII string. Very long. Then he did 1 / number, took a stick, and put a mark on it from one side to represent that ratio, and took the stick home in the (hehe) flying wok or saucer or dishwasher or whatever.

One of the points and advantages of NOT going multi core is that you can use existing programs

formatting link
I can just have it render an AVI movie from something I designed.

Sounds cryptic to me.

When thinking about your idea of chopping up, say, an MPEG file, and 'real time': I was thinking maybe GOPs of 15, so 15 frames; at 25 fps that makes .6 seconds. The read-ahead you need for 300 cores makes 180 seconds of latency (not counting any data transport delays).

3 minutes; it could even be useful for live streams of politicians, as you can then cut in time if they say something stupid, like 'The Internets'.
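The read-ahead arithmetic above, as a quick check:

```python
gop_frames, fps, cores = 15, 25, 300
gop_seconds = gop_frames / fps   # 0.6 s of video per 15-frame GOP
latency = cores * gop_seconds    # 180 s of read-ahead to feed every core a GOP
print(gop_seconds, latency)
```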

Yeah, it is all fun. You need to code some of that stuff, just for the honour, Klingon-like :-)

Reply to
Jan Panteltje


Naw, 80 GHz (U)LVPECL 8-bitters and maybe 12- or 16-bitters. Single 1.5 V supply.

Reply to
JosephKK


On the other hand, it gets increasingly difficult to fully feed a processor past 100 MIPS (at whatever instruction word width); at 1 GHz it is sufficiently intractable as to require multiple levels of cache. For one core that is OK, but with two, four or more cores cache coherency becomes a real problem, and at 80 cores it is intractable. The maximum speeds going on-off chip are limited to about 12 Gbytes/s, and that is nearly intractable. Optical interconnect may give us one more 10x speed increase; then again it may not. Distributed machine architectures are scalable to thousands of cores but are much harder to program and do not work well for all applications.

Reply to
JosephKK


How could you feed it over 30 billion memory transactions per second? For that matter, over 3 billion memory transactions per second? Several sequenced edges are involved, and what about bus width? 64-lane PCIe 2/3? Where are you going to get the RAM?

How are you going to cool it?

Reply to
JosephKK


In GaAs? Don't think so. Just driving the wires at that speed would take insane amounts of power.

Cheers

Phil Hobbs

Reply to
Phil Hobbs


There have been 128-way SMPs with full cache coherency for over a decade, and high performance processors are more like a terabit per second off-chip bandwidth. You do need quite a few I/Os for that! In my previous life, I was working on ways to have hundreds of terabits per second of on-chip bandwidth without causing the chips to melt. You have to get down to below 100 fJ/bit (i.e. 100 uW per Gb/s) to do that.

You can also have machines that are a bit more NUMA-ish(*), i.e. data cached for cores further away takes longer to get. That puts a burden on the compiler and OS to maintain efficiency, but oh, well.

Cheers

Phil

(*) Nonuniform memory access

Reply to
Phil Hobbs

The RAM part does not sound too demanding.

Assuming sufficient on-chip memory, any external DRAM could act as the backing storage in a virtual memory system; assuming 4 KiB pages (for the x86 architecture), a page load would transfer 32 Kib.

A 1 Gib DRAM is arranged as 32 Ki rows x 32 Ki columns. Inside a DRAM, on a read request, the row address is used to present all the bits in a row to the sense amplifiers (which eventually write those bits back to the row). The column address is then used to connect one (or more) sense amplifier(s) to the output pin(s). In video RAMs, the row is parallel-loaded into a shift register which is then clocked out at high speed.

Assuming a 50 ns RAS cycle time, a virtual memory page could be delivered in about 50 ns (compared to several milliseconds for disk-based backing storage), corresponding to roughly 80 GBytes/s.
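The bandwidth figure, checked: one 4 KiB page (one full DRAM row in the layout above) per 50 ns RAS cycle.

```python
page_bytes = 4 * 1024          # one x86 virtual memory page = one DRAM row
ras_cycle = 50e-9              # assumed 50 ns RAS cycle time
rate = page_bytes / ras_cycle  # ~8.2e10 bytes/s, i.e. roughly 80 GB/s
print(rate)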

With such long messages, the low speed of light does not destroy the throughput, even if the DRAM and CPU are at some distance from each other (for cooling etc.).

Transferring 80 GBytes/s over one or a few pins would be challenging even with multilevel coding; however, over optical fiber this could be realistic with multicolour (WDM) systems.

Reply to
Paul Keinanen

Why would driving a 50 ohm transmission line require a huge amount of power? On the receiver side, how much power would be required to _reliably_ detect whether a 0 or a 1 is sent?

Assuming -174 dBm/Hz thermal noise density at room temperature and an 80 GHz bandwidth, the thermal noise power would be -65 dBm; assuming a few dB extra required for binary detection, we are still talking about a few nanowatts at the receiver end.
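The thermal-noise arithmetic, spelled out (the 10 dB detection margin is an assumed figure standing in for "a few dB extra"):

```python
import math

noise_density_dbm_hz = -174     # kT at ~290 K, in dBm/Hz
bandwidth_hz = 80e9
# Noise floor: density plus 10*log10(bandwidth) -> about -65 dBm.
noise_floor_dbm = noise_density_dbm_hz + 10 * math.log10(bandwidth_hz)
margin_db = 10                  # assumed margin for reliable binary detection
# Required receive power in watts: a few nanowatts.
rx_power_w = 10 ** ((noise_floor_dbm + margin_db) / 10) * 1e-3
print(round(noise_floor_dbm, 1), rx_power_w)
```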

Of course at these frequencies, the transmission line skin effect and dielectric losses on a PCB would be considerable, requiring a high transmitter power and hence limiting the transfer distance.

At such high frequencies, a low-loss waveguide would have nearly manageable dimensions for "long distance" communication across the PCB :-).

Reply to
Paul Keinanen

Lines on ICs aren't 50 ohms, they're all RC. There are millions of them, so even with 200 mV swings you'd be talking about 400 watts per million wires. Lava city. Not to mention that the long lines all have repeaters to preserve the bandwidth, which multiplies the power dissipation.

Cheers

Phil Hobbs

Reply to
Phil Hobbs

As the speed goes up, the physical distances must be reduced within which a single synchronous clock can be used and the logic can be treated with a simple RC model.

In the old days a complete 19" box might be considered a single entity clocked by a central clock, with the interconnections analyzed as RC circuits. The interconnection between boxes was handled with serial or parallel transmission lines driven by proper line drivers and receivers.

Later on a single card was a self contained unit with transmission line communication through the backplane.

These days the interconnections between ICs on a PCB are often transmission lines.

For even greater speeds, physically small sections within a single IC chip must be considered as independent entities, interconnected by asynchronous transmission lines to transfer data between them. The popularity of multicore processors is a clear indication of this trend.

Within an independent entity much less than 1 mm² in size, what forces the use of such a huge voltage swing?

At lower speeds with unbalanced logic, ground bounce will finally eat the noise margin. How about some ECL-style gates with true and complement outputs? The ground potential fluctuations would not be significant, thus reducing the required voltage swing and hence the power dissipation.

How many decibels/mm are the losses on a transmission line on the chip?

Reply to
Paul Keinanen

It's a statement of fact. IC lines do not have resistive terminations because you can't stand the dissipation.

Sure--such as a VAX 11/780, circa 1982. Not any time recently.

Almost always, in high performance computers.

That's one of several reasons for multicore, as I mentioned earlier.

Lots of reasons. For one thing, CMOS wants to swing from rail to rail, and you need a VDD of several times kT/e to have any output drive from any FET whatsoever. If the swing is less than VDD, somebody has to be dissipating a bunch of power to reduce it, unless you want to have a DC-DC converter for every single gate.

The higher you make VDD, the faster the logic goes, until it slows down due to limits on the power dissipation density.

That doesn't reduce power dissipation, it just moves it from the load to the driver.

Because the losses per unit length go as sqrt(f), the bandwidth of a given line goes as 1/(length)**2. Repeater spacing depends on clock frequency--I'm not sure what it is in 32-nm CMOS, but it takes several of them to get from one side of a chip to the other.
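A sketch of the length-squared behaviour and why repeaters help (units and coefficients are arbitrary illustration values, not 32-nm numbers):

```python
def wire_delay(length, k=1, rc=1.0, t_gate=0.5):
    """Elmore-style delay of an RC wire split into k repeated segments.
    rc is the delay coefficient per unit length squared; t_gate is the
    delay of each inserted repeater (all in arbitrary units)."""
    segment = length / k
    return k * rc * segment ** 2 + (k - 1) * t_gate

# Unrepeated delay grows quadratically with length;
# repeaters trade gate delays for a roughly linear total.
print(wire_delay(4, k=1), wire_delay(4, k=4))
```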

Cheers

Phil Hobbs

Reply to
Phil Hobbs


Repeaters aren't used because of loss. They're used because the RC delay is too high. The delay of a line goes as roughly the square of its length, so at some point a gate delay becomes less than the difference between (2l)^2 and 2(l^2) plus a gate. I've seen lines with four repeaters. Major work was done to get the tools to just use inverters when there is an even number of repeaters; even larger gain.
Reply to
keithw86

I think that we disagree about what is intraunit and what is interunit communication.

In RF design, the old rule of thumb is that anything longer than about lambda/10 should be treated as a transmission line (in fact a lambda/4 line is an impedance inverter).

In the old days, the lambda/10 limit was not an issue, as long as the equipment was in the same room. However, if you want to operate with 80 GHz clocks (as this thread started), you really have to keep the synchronous clock area about the size of the dot at the end of this sentence.

At such frequencies, communication between the "dots" must be analyzed as transmission lines.
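For scale, the lambda/10 rule at 80 GHz (free-space wavelength; the effective dielectric on a chip or PCB shortens it further):

```python
c = 299_792_458.0               # speed of light, m/s
f = 80e9                        # 80 GHz clock
wavelength = c / f              # ~3.7 mm in free space
tline_limit = wavelength / 10   # ~0.37 mm: longer runs are transmission lines
print(round(wavelength * 1e3, 2), round(tline_limit * 1e3, 3))
```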

Reply to
Paul Keinanen
