Forth-CPU design

PRBS sequences, maximal length!!!

Don't rotate the word; rotate the XOR bit taps and the insert location!!!
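For context, a maximal-length PRBS comes from an LFSR whose feedback taps form a primitive polynomial, and the trick above avoids shifting the whole word: keep the bits still in a circular buffer and rotate the tap positions and insert location instead. A sketch in Python (16-bit example using the well-known maximal taps 16, 14, 13, 11; the function names are mine, not from the thread):

```python
TAPS = (0, 2, 3, 5)  # bit positions 0, 2, 3, 5 of a 16-bit word
                     # (polynomial x^16 + x^14 + x^13 + x^11 + 1, maximal length)

def shift_step(state):
    """Conventional LFSR step: shift the whole word right, insert at bit 15."""
    fb = 0
    for t in TAPS:
        fb ^= (state >> t) & 1
    return (state >> 1) | (fb << 15)

def rotate_taps_step(bits, head):
    """Same LFSR, but the word never moves: the outgoing bit's slot is
    overwritten with the feedback and the head pointer advances, i.e. the
    taps and the insert location rotate instead of the word."""
    fb = 0
    for t in TAPS:
        fb ^= bits[(head + t) % len(bits)]
    bits[head] = fb                # discarded bit's slot becomes the new MSB
    return (head + 1) % len(bits)
```

Logical bit i of the circular version lives at `bits[(head + i) % 16]`, so both versions step through the same sequence of states; the period is 2^16 - 1 = 65535 for any nonzero seed.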

ok

cheers!

Reply to
jacko

hi sorry, this is corrected

/.s=q /.q.r  ; get 1st pointer to q
/.q.r /.q.r  ; copied 3 16-bit words to r
/=q.r /.s=q  ; save address for write-over and get other address
/.q.s /.q.s  ; get other 3 16-bit words to s
/.q.s /=q.s  ; and save q on s
/.r=q /.s.r  ; restore 1st address
/.s.q /.s.q  ; overwrite 3 16-bit words
/.s.q /.r=q  ; restore 2nd pointer in q (correction)
/.r.q /.r.q  ; overwrite 2nd set
/.r.q /=p.p  ; and a nop (possibly faster pipeline)
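Reading the comments, the sequence appears to exchange two 3-word blocks through the r and s buffers. A minimal Python sketch of that intended effect, assuming a flat array of 16-bit words (the function name is mine):

```python
def swap3(mem, a, b):
    """Swap the 3-word blocks at addresses a and b in word memory `mem`."""
    tmp = mem[a:a + 3]           # copy 1st block out (like the moves to r)
    mem[a:a + 3] = mem[b:b + 3]  # overwrite 1st block with the 2nd (via s)
    mem[b:b + 3] = tmp           # overwrite 2nd block with the saved copy
```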

cheers

Reply to
jacko

If it's just adders you want to avoid, I've seen linear feedback shift registers used for this: very small and cheap.

Andrew.

Reply to
Andrew Haley

Why bother? On FPGAs carry-chain logic is free, fast and the easy path. I guess you are thinking of a custom chip, eh? Even then, is it a real issue for the stack counter to be binary? How large will this be: 4 bits, 6 bits, 8 bits?
Reply to
rickman

CALL. This is because CALL is too slow. New Forths jump indirect at worst, and at best (Chuck's NS4000) don't pay any price at all: they JUMP direct to the next executable!!!

There is no way to salvage the function "CALL". It is a bad idea. Why return when that's not what you want to do?

BTW, NewForth has to simulate a fast "executable to executable" because ARM is slow in all branches. NF does the next best thing: an indirect jump directly to the next executable.

There is no faster way.

Jecel wrote:


[[[ STACKS are trouble, use modern programming pls ]]]
Reply to
werty

If by NS4000 you mean the Novix NC4016 then it is exactly an example of what I was talking about - subroutine call is the most optimized instruction with its opcode taking up a single bit.


There are several kinds of Forth implementations, but at least Chuck Moore's recent efforts seem to favor the subroutine threaded model. Even for models where you don't need an actual call instruction the equivalent functionality is included in the kernel (you need to save stuff to the return stack at some point). With subroutine threading it is indeed a great idea to replace tail calls with jumps.
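In a subroutine-threaded model, the tail-call replacement Jecel mentions is mechanical: a trailing `call X` / `ret` pair becomes `jmp X`. A hypothetical Python sketch of such a peephole pass (the instruction tuples are illustrative, not from any real Forth compiler):

```python
def tail_call_optimize(code):
    """Replace a trailing ('call', X), ('ret',) pair with ('jmp', X).

    `code` is a list of (opcode, *operands) tuples for one compiled word.
    """
    if len(code) >= 2 and code[-1] == ('ret',) and code[-2][0] == 'call':
        return code[:-2] + [('jmp', code[-2][1])]
    return code
```

The callee's own `ret` then returns straight to the original caller, so the jump costs nothing extra.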

Sometimes it isn't what you want to do, but sometimes it is.

What is fast or not depends on the processor architecture and even more on the implementation of the memory hierarchy.

-- Jecel

Reply to
Jecel

Gray codes are useful for reading data, but they need extra hardware for counting and addressing. (Of course, Gray-code addressing is no different from any other scrambling of the address lines to a memory chip. Page accesses aside, the memory doesn't care. It can make a ROM difficult to reverse engineer, particularly if the data lines are scrambled too.)
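For reference, the standard binary-reflected Gray code conversion in both directions (a quick sketch; encoding is one XOR, decoding needs a cascade, which is the "extra hardware for counting and addressing"):

```python
def to_gray(n):
    # adjacent integers differ in exactly one bit of their Gray codes
    return n ^ (n >> 1)

def from_gray(g):
    # undo the XOR cascade: each decoded bit depends on all higher bits
    n = g
    while g > 1:
        g >>= 1
        n ^= g
    return n
```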

Jerry

--
Engineering is the art of making what you want from things you can get.
Reply to
Jerry Avins

I wasn't sure I understood the question. Is it to reduce the worst-case time for the increment? A traditional carry-type Gray-code implementation has a worst case just as bad as binary, but it sounds like he's thinking about dedicating the space for a faster method.

But likely I misunderstood. Why would he need the increment to be fast when something else would surely be slower anyway? And I'd expect a binary increment to take less space.

Reply to
J Thomas

Yes, there are few advantages to a Gray code counter. But one is lower power consumption, because only one bit changes on each increment. Then again, this is unlikely to be noticed in a real chip given the small size of the counter.

Reply to
rickman

If you're getting your code from off-chip, and you need a new address every time you fetch a new instruction, then the power consumption advantage might add up, right?

To my way of thinking this is a braindead way to get code to a processor, but it seems to be standard.

Reply to
J Thomas

Maybe I misunderstood. I thought we were discussing addressing a hardware stack memory. A stack can often be just 6 bits or less, and at that size the power difference would be virtually unmeasurable. Off-chip sequential accesses could be a different story, but then you only need a one-bit address bus to indicate up or down, and put the counter elsewhere. As long as you are doing custom chips, why not optimize the whole thing?

Reply to
rickman

Agreed. I'm saying what you're saying.

Reply to
J Thomas

Generating a new address each time is certainly widespread, but there are small pockets where this is done better. Some game machines have a smarter ALE + counter scheme, and some flash devices have multi-cycle ALE + clock schemes (and serial flash memories all generate an address, then stream the data). There are sync memories that have limited page-burst sizes, but they are complex to use efficiently.

-jg

Reply to
Jim Granville

rickman schrieb:

Whereas a binary counter changes two bits on average (1 + 1/2 + 1/4 + ...). This saves you what, 5 fW? And that is probably more than offset by more switching of internal signals in an FPGA implementation.

Kolja Sulimma

Reply to
Kolja Sulimma

Ripple-carry counter power is independent of counter size (except for clock-tree power).

Kolja

Reply to
Kolja Sulimma

Good example of a rule of thumb gone bad.

If you think about that you will realize that it can't be true... otherwise the extra bits must use no power!

Each added bit toggles half as fast as the last, so the power for a ripple-carry counter is a power series asymptotic to two bits toggling all the time. That is not the same as independent. However, your point is valid that once you get past a few bits, the added power for more bits is very small. The power for a Gray code counter is pegged at one bit toggling, so the difference is always small.
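That power series can be checked directly by counting bit flips over one full count cycle (a quick sketch; toggle count is of course only a proxy for dynamic power):

```python
def toggles(seq):
    """Total number of bit flips between successive values in seq."""
    return sum(bin(a ^ b).count('1') for a, b in zip(seq, seq[1:]))

N = 6
binary = list(range(2 ** N)) + [0]             # full cycle, including the wrap
gray = [i ^ (i >> 1) for i in range(2 ** N)] + [0]

# binary: 2 * 2^N - 2 flips per cycle  -> just under 2 per increment
# Gray:   exactly 2^N flips            -> exactly 1 per increment
```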

Reply to
rickman

rickman schrieb:

Hihi, nice wording...

But it is close. Considering the error that power estimation has on sub-100 nm CMOS, the assumption of constant power is probably within the error bar starting from the second bit.

Kolja Sulimma

Reply to
Kolja Sulimma

One of the nice things about an FPGA processor core is that they are typically expandable. It should be a simple matter to add hardware-mapped stacks to any of these cores, either memory mapped or even mapped to registers. I think there are four operations on the stack: read, write, push and pop. So unless there are some addressing modes that could be usurped, it might require mapping each stack to two registers, one for read/write and one for push/pop.
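As a rough model of that two-register mapping (a pure software sketch; the register names TOS and push/pop are invented for illustration, not from any vendor core):

```python
class MappedStack:
    """Hardware stack exposed through two 'registers':
    TOS      - read or write the top element in place
    push/pop - writing pushes a new element; reading pops the top
    """
    def __init__(self):
        self._mem = []

    @property
    def TOS(self):
        return self._mem[-1]       # read register: top of stack

    @TOS.setter
    def TOS(self, value):
        self._mem[-1] = value      # write register: overwrite top

    def push(self, value):         # write to the push/pop register
        self._mem.append(value)

    def pop(self):                 # read from the push/pop register
        return self._mem.pop()
```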

But then I am not sure the best way to implement Forth is to simulate the hardware of a stack machine. Modern Forth compilers seem to do a durn good job of mapping the Forth functionality to RISC/CISC processors. I guess the question is whether I can make a MISC processor that works as well as a RISC/CISC processor and uses fewer hardware resources.

Thanks for pointing that out. The original FPGA that I designed my Forth core for did not have a vendor-supplied RISC core, but all of the newer ones do. To compare to my MISC code, the RISC would need to load the two addresses and a count, which would be three more instructions and I'm not sure how many bytes, at least 6 more for a total of about 20 bytes. So my MISC's 19 bytes / 13 clock cycles per loop is at least as good as, if not better than, the RISC's 20 or more bytes, if slower in clock cycles.

Reply to
rickman

You miss the point.

You never want to RETURN to a "controlling"/"calling" word in NewForth; you cannot justify it EVER. You NEVER need to RETURN, EVER. You can always run faster, and yet the upper-level word is still in control without ever RETURNING!!!!

Got it? NEVER RETURN...

Example: a high-level word starts the show and has a list of 13 mid-level words it will "run". At the end of the first, instead of returning to the main word, the I.P. looks at the list of 13 in the main word and JUMPS indirect through that list to the next subroutine.... But w/o cost!! The mechanism is fast; it does NOT return to the main word.

That is NOT returning; it is transparently jumping to the next subroutine w/o wasting time returning.

It gets worse as you study the ARM 4 cores... The BRA is actually slower, for the pipeline is slow sending the address to the P.C.! It takes 3 clocks!

The problem is people get stuck on a return stack; they just can't see any other way of doing it clean.... Sorry about that!

Imagine a CPU that has an external list of addresses that is its "program", alias I.T.C. / IndirectThreadedCode. Everything in outer RAM is an address; no executable code in outer RAM. But inside the CPU is more RAM that holds primitives. Those primitives look like an extension of the instruction set. Boot the CPU and it looks for the first address from FFFF. After a few primitives are run from that sequence from outer RAM, there's a conditional, and the primitives change the list from which they are executing. This sequence can of course create new lists in outer RAM as it runs, and create data.
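That scheme is essentially the classic indirect-threaded inner interpreter. A toy model in Python (the primitives and program are made up for illustration; names stand in for the addresses in outer RAM):

```python
def run(program, primitives):
    """Toy indirect-threaded interpreter.

    `program` is a list of primitive names (standing in for addresses).
    Each primitive receives the state, the list, and the instruction
    pointer, and returns the next ip, so a primitive can redirect the
    list walk instead of 'returning' anywhere.
    """
    state = {'out': []}
    ip = 0
    while ip is not None and ip < len(program):
        ip = primitives[program[ip]](state, program, ip)
    return state['out']

def emit_hi(state, program, ip):
    state['out'].append('hi')
    return ip + 1                  # NEXT: fall through to the next address

def skip_one(state, program, ip):
    return ip + 2                  # a 'conditional' that changes the walk

def halt(state, program, ip):
    return None

prims = {'hi': emit_hi, 'skip': skip_one, 'halt': halt}
```

For example, `run(['hi', 'skip', 'hi', 'hi', 'halt'], prims)` produces `['hi', 'hi']`: the `skip` primitive steers past one list entry with no call or return anywhere.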

Do you see any STACKs? Any RETURNs? All CPUs today can do this, and it's faster than their own CALL/RETURN.

Return STACKS are slow, hard to address, unneeded....

NewForth for the ARM 920T will be a free OpSys....

Reply to
werty

You've said you're making a Forth on an ARM core, or something like that. How does that work? Wouldn't a Forth CPU execute Forth instructions natively? Where would the ARM fit in?

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley
