A stupid post about Intel's latest computer chip(s)

Come on now, you are not too much of an idiot to understand this:

IBM/INTEL architecture,

REGISTER_1  ( a storage location)
REGISTER_2  ( a storage location)
REGISTER_3  ( a storage location)
( etc. )
REGISTER_16 ( a storage location)

vs. a single-stack enhanced architecture, ( dynamic frequency profiled)

STACK_1 [ 1..8]
STACK_2 [ 1..4]
STACK_3 [ 1..4]

Which one do you believe requires fewer chip-internal hardware wires?

( and, thus, a higher efficiency of "Turing" machine language expression ( and code profile))

( HINT : Have you ever read about minimal ANSI FORTH machines? )

MIMD  Multiple Instruction Multiple Data
VLIW  Very Long Instruction Word
MPP   Massively Parallel Processing ( many SMPs linked together like an interconnecting LEGO(tm)-like block game to add more processing power )
SMP   Symmetric Multiprocessing ( like, multiple cores on a single CPU chip) ( between sixteen and two with IBM/Intel, set at a constant factor of sixteen and derivative of super-scalable application dynamic frequency profile )

I have been shouting news of the VLIW SMP MPP FORTH formula to Washington, and it has been published, since 1996, all around the St. Paul and Minneapolis, Minnesota area.

However, IBM/Intel continues to shout anti-news.

Reply to
A Man Crying Alone In The Wild

Come on now, you are not too much of an idiot to understand this:

IBM/INTEL architecture,

REGISTER_1  ( a storage location)
REGISTER_2  ( a storage location)
REGISTER_3  ( a storage location)
( etc. )
REGISTER_16 ( a storage location)

vs. a single-stack enhanced architecture, ( dynamic frequency profiled)

STACK_1 [ 1..8]
STACK_2 [ 1..4]
STACK_3 [ 1..4]

Which one do you believe requires fewer chip-internal hardware wires?

( and, thus, a higher efficiency of "Turing" machine language expression ( and code profile))

( HINT : Have you ever read about minimal ANSI FORTH machines? )

MIMD  Multiple Instruction Multiple Data
VLIW  Very Long Instruction Word
MPP   Massively Parallel Processing ( many SMPs linked together like an interconnecting LEGO(tm)-like block game to add more processing power )
SMP   Symmetric Multiprocessing ( like, multiple cores on a single CPU chip) ( between sixteen and two with IBM/Intel, set at a constant factor of sixteen and derivative of super-scalable application dynamic frequency profile )

I have been shouting news of the VLIW SMP MPP FORTH formula to Washington, and it has been published, since 1996, all around the St. Paul and Minneapolis, Minnesota area.

However, IBM/Intel continues to shout anti-news.

--
A simple enumeration of basic primitives with a stack enhanced
architecture yields a powerful microprocessor core.  ( For example
ANSI FORTH machine implicit and explicit primitives, JUMP_IF_ZERO JUMP
CALL RETURN LITERAL 0< AND XOR DROP OVER DUP @ ! 2* 2/ >R R> INVERT + )
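To make the claim concrete, here is a minimal sketch of how a few of those primitives ( LITERAL, DUP, DROP, +, AND, JUMP_IF_ZERO ) could be interpreted over a single data stack. This is Python, with a hypothetical instruction encoding chosen purely for illustration, not any particular Forth machine's:

```python
# Minimal interpreter sketch for a handful of Forth-style primitives.
# The (opcode, operand) encoding is hypothetical, for illustration only.

def run(program):
    """Execute a list of (opcode, operand) pairs over a data stack."""
    stack, ip = [], 0
    while ip < len(program):
        op, arg = program[ip]
        ip += 1
        if op == "LITERAL":          # push an immediate value
            stack.append(arg)
        elif op == "DUP":            # duplicate the top of stack
            stack.append(stack[-1])
        elif op == "DROP":           # discard the top of stack
            stack.pop()
        elif op == "+":              # pop two, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "AND":            # pop two, push bitwise AND
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif op == "JUMP_IF_ZERO":   # pop flag; branch when it is zero
            if stack.pop() == 0:
                ip = arg
    return stack

# 3 4 + DUP  leaves [7, 7] on the stack
print(run([("LITERAL", 3), ("LITERAL", 4), ("+", None), ("DUP", None)]))
```

Note that no operand addressing is needed anywhere: every primitive takes its inputs implicitly from the stack, which is the usual argument for the small decode logic of such cores.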
Reply to
cpu16x1832

Come on now, you are not too much of an idiot to understand this:

IBM/INTEL architecture,

REGISTER_1  ( a storage location)
REGISTER_2  ( a storage location)
REGISTER_3  ( a storage location)
( etc. )
REGISTER_16 ( a storage location)

vs. a single-stack enhanced architecture, ( dynamic frequency profiled)

STACK_1 [ 1..8]
STACK_2 [ 1..4]
STACK_3 [ 1..4]

Which one do you believe requires fewer chip-internal hardware wires?

( and, thus, a higher efficiency of "Turing" machine language expression ( and code profile))

( HINT : Have you ever read about minimal ANSI FORTH machines? )

MIMD  Multiple Instruction Multiple Data
VLIW  Very Long Instruction Word
MPP   Massively Parallel Processing ( many SMPs linked together like an interconnecting LEGO(tm)-like block game to add more processing power )
SMP   Symmetric Multiprocessing ( like, multiple cores on a single CPU chip) ( between sixteen and two with IBM/Intel, set at a constant factor of sixteen and derivative of super-scalable application dynamic frequency profile )

I have been shouting news of the VLIW SMP MPP FORTH formula to Washington, and it has been published, since 1996, all around the St. Paul and Minneapolis, Minnesota area.

However, IBM/Intel continues to shout anti-news.

--
A simple enumeration of basic primitives with a stack enhanced
architecture yields a powerful microprocessor core.  ( For example
ANSI FORTH machine implicit and explicit primitives, JUMP_IF_ZERO JUMP
CALL RETURN LITERAL 0< AND XOR DROP OVER DUP @ ! 2* 2/ >R R> INVERT + )
Reply to
A Man Crying Alone In The Wild

Have you been skipping your meds?

Jerry

--
Engineering is the art of making what you want from things you can get.

Reply to
Jerry Avins

Perhaps if you could write in grammatical English, people could understand what you're trying to say.

This is a meaningless question. A stack architecture could be implemented with fewer "wires" (by which I think you mean "gates") if it stores the stack entirely in off-chip memory, but then it would be much slower than a register-based machine. Most reasonable stack machines keep the top N stack entries in on-chip registers, which makes it look pretty similar to a register-based architecture from the point of view of chip resources. On the other hand, a register-based machine could keep its registers in off-chip memory in order to save gates, but this would be a pretty stupid design.

However, counting gates (or "wires") is not the way to determine the efficiency of a chip. In general, chips with more gates are MORE efficient, since they implement a lot of optimizations which are not possible in smaller chips.

But possibly I entirely misunderstood your point, because your posting is very unclear.

Also, when people reply to you and you just repost your original post as a reply to them, it makes it look like you can't understand their replies (or that you're a bot). You should at least respond to the substance of posts that reply to you.

--Mark

Reply to
Mark Nudelman

FORTH never went anywhere for a good reason.

Totally un-maintainable.

Reply to
nospam

Presumably for the same reason that, by your understanding, all machine code is un-maintainable. Maybe read some more to develop your knowledge of computer programming languages and their relationship to machine code.

Regards,

maw

Reply to
A Man Crying Alone In The Wild

Because I took the time to explain the terms clearly, maybe read more about stack machine architecture. In general, most modern stack machine architectures of the last ten years focus upon on-chip "stack" registers.

Here is something you may consider,

formatting link

formatting link

Please be kind enough to quote /entirely/ the information about which you are more knowledgeable, so as to disagree ( or agree) with the finding(s).

Regards,

maw

Reply to
A Man Crying Alone In The Wild

...

A troll is a troll is a troll.

Jerry

Reply to
Jerry Avins

In sci.math, A Man Crying Alone In The Wilderness wrote on 23 Oct 2005 08:11:24 -0700:

Actually, I think the registers are stored in internal flipflops in the microprocessor itself.

An interesting question. Given what little I know about modern chip design (I worked in a fab as a software engineer 15-20 years ago so absorbed a little microelectronics by osmosis :-) ), it would appear to me that it's little if any difference, though it depends on the specifics of how the stack is implemented.

At the software level, most stacks are basically regions of memory, accessed via a stack pointer. A logical implementation (to me, anyway) of a stack machine at the hardware level would be a region of memory, a stack pointer, and a bus of width 4 * wordwidth. This bus would contain at most 4 words and would then be fed to the ALU. The ALU would decide to write at most 2 words back to the bus, which would stick them back on the stack. The stack pointer would have various wiggles on it to "pop 2", "pop 1", "push 2", and "push 1", perhaps.

Note that I'm not really specifying a word size, although most contemporary architectures would be 32 or 64 bits. Ideally one could do this in a "bit slice" fashion; just add more chips for bigger words. However, I'm not sure how well that will work for the more complex instructions.

(It's worth noting here that

formatting link

suggests that the G4 has a 128-bit internal bus.)

The arithmetic instructions would be fairly simple:

ADD: take 2 operands, SP++, shove result, set conditionflags
SUB: take 2 operands, SP++, shove result, set conditionflags
NEG: take 1 operand, shove result, set conditionflags
MUL: take 2 operands, shove product and overflow, set conditionflags
MULA: take 3 operands, SP++, shove product and overflow, set conditionflags
DIV: take 3 operands, SP++, shove quotient and remainder, set conditionflags
MULDIV: take 3 operands, SP+=2, shove results, set conditionflags
MULDIVMOD: take 3 operands, SP++, shove results, set conditionflags
MULADIV: take 4 operands, SP+=3, shove results, set conditionflags
MULADIVMOD: take 4 operands, SP+=2, shove results, set conditionflags
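The net stack effects of a few of these can be sketched with a Python list standing in for the operand stack (top of stack at the end of the list). The 16-bit word size here is an assumption, used only so MUL can split its product into a low word and an overflow word as described:

```python
# Sketch of the arithmetic instructions' net stack effects.
# A Python list models the operand stack; 16-bit words are assumed
# purely to illustrate the product/overflow split for MUL.

WORD = 1 << 16

def ADD(s):            # take 2 operands, shove 1 result: net SP++
    b, a = s.pop(), s.pop()
    s.append((a + b) % WORD)

def SUB(s):            # take 2 operands, shove 1 result: net SP++
    b, a = s.pop(), s.pop()
    s.append((a - b) % WORD)

def MUL(s):            # take 2 operands, shove product and overflow: net SP unchanged
    b, a = s.pop(), s.pop()
    p = a * b
    s.append(p % WORD)   # low word of the product
    s.append(p // WORD)  # overflow (high word)

s = [3, 5]
ADD(s)            # s is now [8]
s = [300, 300]
MUL(s)            # 90000 = 1*65536 + 24464, so s is now [24464, 1]
```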

The logical instructions:

AND, OR, XOR: take 2 operands, SP++, shove result, set conditionflags
NAND, NOR: take 2 operands, SP++, shove result, set conditionflags
NOT: take 1 operand, shove result, set conditionflags
TEST: take 1 operand, SP++, set conditionflags
BITTEST: take 2 operands, SP+=2, set conditionflags

The "bitfiddle" instructions:

LSHIFT: take 2 operands, SP++, shove result, set conditionflags
RSHIFT: take 2 operands, SP++, shove result, set conditionflags
ARSHIFT: take 2 operands, SP++, shove result, set conditionflags (the main difference is the handling of the sign bit)
LSHIFT2: take 3 operands, SP++, shove result, set conditionflags
RSHIFT2: take 3 operands, SP++, shove result, set conditionflags
ARSHIFT2: take 3 operands, SP++, shove result, set conditionflags

The "stackfiddle" instructions:

DUP: take 1 operand, SP--, shove results, set conditionflags
DUP2: take 2 operands, SP-=2, shove results, set conditionflags
DROP: SP++
DROP2: SP+=2
SWAP: take 2 operands, swap 'em, set conditionflags
SWAP2: take 4 operands, swap 'em, set conditionflags
OVER: take 2 operands, SP++, shove result, set conditionflags
OVER2: take 4 operands, SP+=2, shove result, set conditionflags
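For reference, a sketch of the conventional Forth semantics of DUP, DROP, SWAP, and OVER over a Python list (top of stack at the end of the list); note that in the conventional reading, OVER deepens the stack by one, pushing a copy of the second item:

```python
# Conventional semantics of the basic stack-manipulation words,
# modeled on a Python list with the top of stack at the end.

def DUP(s):            # duplicate the top item
    s.append(s[-1])

def DROP(s):           # discard the top item
    s.pop()

def SWAP(s):           # exchange the top two items
    s[-1], s[-2] = s[-2], s[-1]

def OVER(s):           # push a copy of the second item
    s.append(s[-2])

s = [1, 2]
OVER(s)    # [1, 2, 1]
SWAP(s)    # [1, 1, 2]
DUP(s)     # [1, 1, 2, 2]
DROP(s)    # [1, 1, 2]
```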

PICK: This one's tricky. One could implement 1PICK=DUP, 2PICK=OVER, 3PICK, and 4PICK easily enough; beyond that one would have to engineer instructions to move around the stack and a temporary holding register -- or one can add to/subtract from the stack (a sort of DROPn instruction) as opposed to merely incrementing and decrementing it.

PICK2: Similar to PICK except it uses word pairs instead of words.

ROLL: This one's even trickier. Two words are fetched from the stack, then the stack rotated. SP += 2 after the operation but a lot is happening in between instruction start and instruction end. Note that 2 1 ROLL (or was it 1 2 ROLL?) is equivalent to a SWAP.

ROLL2: Similar to ROLL except it uses word pairs instead of words.
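For comparison, standard ANS Forth resolves the "2 1 ROLL or 1 2 ROLL?" question by defining ROLL with a single count u taken from the stack: u ROLL moves the item at depth u to the top, so 1 ROLL is SWAP and 2 ROLL is ROT. A sketch of that standard behavior:

```python
# ANS-Forth-style ROLL over a Python list (top of stack at the end):
# pop a count u, then move the item at depth u to the top of the stack.

def ROLL(s):
    u = s.pop()            # the count sits on top of the stack
    s.append(s.pop(-u - 1))  # extract the item at depth u, push it on top

s = [10, 20]
s.append(1)
ROLL(s)        # 1 ROLL acts as SWAP: [20, 10]
s = [10, 20, 30]
s.append(2)
ROLL(s)        # 2 ROLL acts as ROT: [20, 30, 10]
```

Note that 0 ROLL is a no-op, which is consistent with the depth-u reading.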

LOAD1: Whatever word's following in the instruction stream, push it onto the stack.
LOAD2, LOAD3, LOAD4: Similar to LOAD1 except more data is pushed.

Arbitrary memory fetch:

I'm not sure how to properly structure this, but here's one fairly obvious method:

FETCH: take 2 operands, SP++, push result, set conditionflags
The first operand is a base address, the second an offset. Usage might be along the lines of '5 MADDR FETCH', which picks the word from the location MADDR+5 and pushes it onto the stack.

RFETCH: Same as FETCH except the operands are reversed. This one might be useful in certain structure contexts; e.g. in "MADDR 5 RFETCH", one could define "5 RFETCH" as "GETCFLAGS" and use it everywhere. Of course one could define RFETCH as "SWAP FETCH" anyway.

STORE: take 3 operands, SP+=3.
RSTORE: take 3 operands, SP+=3.

MSWAP: take 3 operands, SP+=2, push result, set conditionflags
This is basically an atomic swap, which is of some importance to proper implementation of locking, semaphores, and monitors.

RMSWAP: take 3 operands, SP+=2, push result, set conditionflags

Program control:

Depending on desire, one might have a separate stack for the control instructions (as I recall, in some Forths, one has R> and >R "words"). One could then do things in a fairly obvious fashion:

JMP: whatever word's following, replace the top of the R-stack with it.
CALL: push the word following onto the R-stack.
RET: DROP for the R-stack.
JMPI: pop the top of the numeric stack and replace the top of the R-stack with it.
SAV: take the top of the R-stack and push it onto the numeric stack.
BRANCH: add the word following to the top of the R-stack. Depending on desired sophistication one can have signed byte offset, int16 offset, and int32 offset variants as well.
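One concrete reading of the CALL/RET mechanics can be sketched as a small fetch-execute loop in Python. This version holds the program counter separately rather than as the top of the R-stack, and the instruction encoding (including the EMIT stand-in for "do some work") is hypothetical:

```python
# Sketch of subroutine control flow with a separate return stack
# (R-stack). Addresses are simply list indices; the encoding is
# hypothetical, for illustration only.

def execute(program):
    ip, rstack, out = 0, [], []
    while True:
        op, arg = program[ip]
        if op == "CALL":
            rstack.append(ip + 1)  # push the return address onto the R-stack
            ip = arg
        elif op == "RET":
            ip = rstack.pop()      # RET is DROP for the R-stack, into ip
        elif op == "JMP":
            ip = arg
        elif op == "EMIT":         # stand-in for ordinary work
            out.append(arg)
            ip += 1
        elif op == "HALT":
            return out

# main: CALL sub, CALL sub, HALT; sub (at address 3): EMIT "x", RET
prog = [("CALL", 3), ("CALL", 3), ("HALT", None),
        ("EMIT", "x"), ("RET", None)]
print(execute(prog))
```

Because the R-stack is separate from the numeric stack, nested calls never disturb the operands in flight, which is exactly why many Forths expose >R and R> as distinct words.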

And of course one has the conditional modified forms, which would simply test for various combinations of the condition flags in this particular design, but other microprocessors actually bother to look at the top word of the stack or the contents of a register.

TRAP and TRAPRET instructions might be useful; these would allow for simple context switching and scheduling. Some PROBE instructions might allow for access to the user's space from the kernel, for checking purposes. I/O would only be permitted from the kernel context; these might include the usual READ and WRITE instructions for port manipulation, and some form of DMA setup which would relinquish the outside data bus for a short time to allow transference between device and physical memory. I'm not sure how I'd handle various issues such as virtual -> physical page translation, and initial program load.

Granted, this is an ad hoc machine design; I'd have to burrow deep into the JVM whitepaper to see how they do it, and there's a few issues regarding name lookup in there. This is also designed without regard to losing machine control; ideally, it would be virtually impossible to "hack" the machine by e.g. LOAD #x JMPI to jump to an undesired (well, undesired by the system or algorithm designer, anyway) location. Or one could obliterate the operand stack (oops) or the program stack (extremely dangerous).

There's some interesting issues regarding clocking. Does one really need a system clock? Some intriguing designs were suggested last decade that basically ran "as fast as possible". Admittedly, these probably dropped into the bit bucket, as DRAM requires a clock anyway.

And then there's pipelining -- basically, doing two things at once. For example, HP PA Risc had some interesting things going on that could execute one extra instruction immediately following a conditional branch, regardless of whether that branch was actually taken or not. The G4 can fetch and execute three instructions per clock cycle.

Small wonder that processors such as the 1802, 6502, and 8088 are relatively puny in transistor number (and capability) whereas modern processors are pushing the 50 million mark, and modern machines the 400 W mark (the original PC-XT used all of 63.5 watts, if that).

But it's not just because the words are bigger (the 1802 had 16-bit address and 8-bit data capability; AMD-64 has 64-bit address and data capability).

Perhaps it's time to step back a bit and contemplate the bigger question: who (or what) should control the machine?

--
#191, ewill3@earthlink.net
It's still legal to go .sigless.
Reply to
The Ghost In The Machine

As balanced for high microprocessor efficiency: an MPP SMP stack machine architecture for FORTH, C, Scheme, Java, you-name-it computer programming language.

It uses simple stack-to-stack messaging, both internally for SMP multi-core and externally for MPP, CPU16-to-CPU16, simply solving MIMD and a host of other SMP multi-core chip design problems, ...

as you may read, a C compiler is almost an IBM/Intel no-brainer,

In general, microprocessor efficiency minimizes transistor count and maximizes utilization of those transistors. Externally, however, under traditional "bandwidth" benchmarking program suites, a 'raw' efficiency will be displayed, even more so where a benchmark relies upon parallel architectures; I guess ten to ONE HUNDRED times faster, for some real-world practical parallel programming benchmark suites. ( hydrodynamic or thermodynamic modeling, etc. )

This model is the most efficient SMP MPP microprocessor model I have referenced, a hybrid of Mr. Moore's work and mine, and, as a final note, I am having difficulty developing my chip model any further than this, URL,

formatting link

Here is 16-bit VLIW protocol reference, ( from dynamic profiling) URL,

formatting link

Regards,

maw

Reply to
A Man Crying Alone In The Wild

I currently use five stacks, for my "Holy Grail" almost all purpose super scalable multi core architecture model,

COPIED FROM ANOTHER POST, URL,

formatting link

" Example extended on-chip stack register map,

sixteen ( 16) return stack elements, eight ( 8) parameter stack elements, four ( 4) Supplementary stack elements ( X, Y), thirty two ( 32) status /machine state logic/ stack elements "

Regards,

maw

Reply to
cpu16x1832

In sci.math, A Man Crying Alone In The Wilderness wrote on 23 Oct 2005 11:16:57 -0700:

Depends on how one pushes and pops the stacks, perhaps. I'll admit I don't see STACK_1 being deep enough. Also, is there a reason for 3 separate stacks? My hypothetical required only two: numeric values and codepointers.

Did you anticipate using something along the lines of dual barrel shift registers? That makes some sense, if it's fast enough; however, there's a lot of issues regarding pipelining with a stack register architecture; basically, the second instruction can't execute until the first one's done playing with the stack. At least in a register-based architecture where one has the code sequence

SUB AX, BX
ADD CX, DX

one could conceivably be executing the SUB instruction and the ADD instruction more or less simultaneously. Ideally, though, the simpler architecture would run at a higher clockrate.

Perhaps if you were to clarify what you mean by "chip internal hardware wiring"? For instance, does that mean:

[1] die size, given a certain transistor size?
[2] total wiring area?
[3] number of vias?
[4] a combination of the above?

Note also that buffer transistors -- those things that have to drive the outside world pins -- are huge compared to the internal wiring. And there's a lot of them. Try to optimize the internal wiring too much and one might just waste space.

Turing machines don't do arithmetic all that well. If one postulates, for example, a decimal number, followed by a blank, followed by another decimal number, followed by an indefinite number of blanks, one could do the following.

state 0, any char but blank: write that char, right, state 0
state 0, blank: write blank, back up, state 1

state 1, char '0': write '0', go to state 2-0
state 1, char '1': write '1', go to state 2-1
....
state 1, char '9': write '9', go to state 2-9

state 2-x, blank: write blank, right, go to state 3-x
state 3-x, any char but blank: write that char, right, stay in this state
state 3-x, blank: write blank, left, go to state 4-x

state 4-x, char 'y': write 'y', right, go to state 5-{x+y} or 6-{x+y-10}

I could go on but it gets pretty tedious. :-) And that's for *addition*; I shudder to think what I would have to do for multiplication or division.

If one postulates two binary numbers as opposed to two decimal ones, the machine gets slightly simpler but it's still pretty tedious.

Of course one could postulate a 2^32+1 character alphabet, and an impossibly huge state matrix, if one wishes. That gets slightly silly, though.
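The state-table style above is easy to make concrete with a tiny Turing-machine simulator. The example below sidesteps the tedium of decimal by adding two *unary* numbers: overwrite the separator with a mark, run to the right end, and erase one mark. The rule encoding is my own, for illustration:

```python
# A tiny Turing machine, written as a (state, symbol) -> (write, move,
# new_state) rule table. This one adds two unary numbers: "111+11"
# becomes "11111" by joining at the '+' and erasing one trailing '1'.

def tm(tape_str, rules, state="s0", blank=" "):
    tape, pos = dict(enumerate(tape_str)), 0
    while state != "halt":
        sym = tape.get(pos, blank)
        write, move, state = rules[(state, sym)]
        tape[pos] = write
        pos += {"R": 1, "L": -1}[move]
    return "".join(tape[i] for i in sorted(tape)).strip()

rules = {
    ("s0", "1"): ("1", "R", "s0"),    # scan right over the first number
    ("s0", "+"): ("1", "R", "s1"),    # join the two numbers with a '1'
    ("s1", "1"): ("1", "R", "s1"),    # run to the right end
    ("s1", " "): (" ", "L", "s2"),    # step back onto the last '1'
    ("s2", "1"): (" ", "R", "halt"),  # erase it: sum is complete
}
print(tm("111+11", rules))
```

Five rules suffice here, versus the dozens sketched above for decimal; the alphabet size drives the table size, which is exactly the point being made.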

Can't say I have. I know a little Forth; it's a strange language, which can modify itself. Very interesting and efficient, but it doesn't do files all that well; the traditional method involves numbered screen loading, as I recall. Of course that was way back then.

Bit-slice architectures have been known for years, if not decades.

Why so low? 64-bit is the way to go, if one can afford the die space. The practical considerations are these:

[1] How many die per year can one fabricate? Note that this is a function of wafer size, yield, transistor size, and process complexity; the smaller the transistors the more sensitive they are to process variations.
[2] How much does each die cost to make?
[3] How much can one sell each die for?
[4] How well does a die actually work in regards to contemporary microprocessors?

I don't see FORTH being limited to 16-bit.

Reply to
The Ghost In The Machine

In sci.math, Mark Nudelman wrote:

>> Come on now, you are not too much of an idiot to understand this,

The two are not unrelated. Cross polysilicate with diffusion, and one has a transistor gate (FET) -- at least, for NMOS, PMOS, or CMOS architectures. Of course in CMOS one has to have another transistor somewhere else, of the opposing type in the transistor wiring "graph": a 2-input NAND gate, in particular, has 2 N-types in series and 2 P-types in parallel.

(With my luck all this is in a FAQ somewhere. :-) )

I for one think he's thinking internal stack. An 8-way numeric stack, though, is rather small, considering that modern micros have 2 MB or more internal memory cache.

I'd probably want to use two 1kword or 4Kword barrel shift registers. The ALU would connect to the top four slices of the barrel. Ideally, the physical layout would in fact look a bit like a barrel, to optimize propagation delay. However, I'm far from expert in this stuff.

One possibility might be to replace the flipflops in a traditional barrel shift register with a DRAM unit (transistor + capacitor); the barrel shift register would then *have* to shift (either forward or backward) every clockpulse transition, perhaps.

Actually, the 486 does exactly this, if one switches contexts. Basically, the registers are shoved into a TSS structure in memory.

I suspect more modern chips have similar capabilities.

I for one would think it depends on what one wants to optimize.

[1] Raw chip speed -- how fast can that sucker go?
[2] Chip power dissipation.
[3] Chip size.
[4] Number of transistors. (This is not quite the same as chip size, since other variables include fanin or fanout per transistor.)
[5] Number of transistor flips during execution of a specific problem (e.g., Eratosthenes' Sieve). Presumably, this is related to [2].

I'm not sure my reply was all that basic. :-) But it's clear he didn't pursue the details thereof.

Reply to
The Ghost In The Machine

In article , The Ghost In The Machine wrote: [...]

[6] How fast it will go running something compiled with a C compiler a mere mortal can design.

In a very pipelined machine, you can get more speed per transistor by making it the compiler's job to make sure that two numbers aren't trying to go down the same bus. If different instructions have all manner of different timings, coming up with the optimum code can be very tricky.

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

Why the Fu*k is this in SED?

Intel= BORING.

martin

Reply to
martin griffith

yep. With NO applications either.

Reply to
nospam

In sci.math, Tim Clacy wrote:

>> Come on now, you are not too much of an idiot to understand this,

Three words: lots of comments. :-P Besides, Forth is generally portrayed as a dictionary of source; if one needs to change a word, it can be changed easily, although one might have to reload the system for the change to take proper effect.

For example:

: GODOIT 1 . ;
: DOITAGAIN CR GODOIT GODOIT CR ;
: GODOIT 2 . ;
DOITAGAIN

would print '11' or ' 1 1', not '22' or ' 2 2'. A quick install of 'gforth' confirms this.
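The behavior gforth confirms follows from how the Forth dictionary works: a word's references are resolved when the word is *compiled*, not when it runs, so DOITAGAIN keeps the old GODOIT. A Python sketch of that early binding (the dictionary here is a stand-in, not gforth's actual implementation):

```python
# Sketch of Forth-style early binding: compiling a word looks up the
# current definitions in the dictionary, so a later redefinition does
# not affect words already compiled against the old one.

out = []
words = {}

words["GODOIT"] = lambda: out.append("1")

# "Compiling" DOITAGAIN captures the binding of GODOIT as it exists now...
godoit = words["GODOIT"]
words["DOITAGAIN"] = lambda: (godoit(), godoit())

# ...so redefining GODOIT afterwards leaves DOITAGAIN unchanged.
words["GODOIT"] = lambda: out.append("2")

words["DOITAGAIN"]()
print(out)
```

The redefinition only shadows the old entry for *future* compilations; anything already compiled still points at the old body, which is also why the text above notes one may have to reload the system for a change to take full effect.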

formatting link

Comments use '()' or backslash; note that these must be separated by a space as they are interpreted by the gforth system:

( this is a comment; it can be multiline but cannot nest )
\ so is this; it extends to the end of the line

(The \ is new to me, but reasonably logical. The () I've seen before.)

A typical definition, for example, notes the stack effects in ():

: square ( n -- n^2 ) dup * ;

And of course it helps to use intuitive tokens as opposed to cryptic one-character affairs.

Reply to
The Ghost In The Machine

Post-script, Lego Mindstorms, Open Firmware...

No need to maintain if it does the job already

Reply to
Tim Clacy

And a Forth interpreter is shipped embedded in the hardware of all Macs and Sun SPARCs. And that is a pretty good indication of the application space.

Casper

Reply to
Casper H.S. Dik

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.