SIXTYFORTH?

With oodles (soon to be tera-oodles?) of RAM available on the RPi3, and likely even more in future releases, is there any point in continuing to cater for bytes, halfwords and words, when everything, including CHAR, can be a 64-bit quantity?

An enjoyable hacker's language such as FORTH becomes less complex with such a provision, and a stand-alone SIXTYFORTH would be a means to break the stranglehold of the bloated Unices.

Although I condemn FORTH as unsuitable for professional designs, especially where its adoption has been driven by those with limited software awareness, it remains an exciting tool with which to dabble.

Reply to
Gareth's Downstairs Computer

See

formatting link

--
Gerry
Reply to
Gerry Jackson

If you'd like to partake in the discussion, then please do so, but don't expect to send us off to do your research for you.

Reply to
Gareth's Downstairs Computer

What research? That's a link to a Forth that does exactly what you suggest; read his text file LSE64.txt, and if you're too lazy to find the appropriate bit, look at lines 13 to 16 - and no, I won't copy it here for you to read. Comments like yours make me less inclined to be helpful!

--
Gerry
Reply to
Gerry Jackson

Yes, performance.

--
https://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell

OK. I'll bite, how would performance be affected?

I've watched int go from 16 to 64 bits and stuff just got faster... :-)

How is 'load a byte' done on a 64-bit processor, other than load [aligned] and shift/mask?

How does a compiler treat char *p, c; for (int i = 0; i < 327; i++) { c = *p++; echo(c); } ?

Is it fetching *p as a 64-bit chunk and manipulating it, or is it retrieving the same 64 bits of memory over and over and taking a different bit each time? Or is it cached and cache-aware? Or does the processor itself have some magic whereby repeated accesses through a pointer incrementing a byte at a time are dealt with differently for 64-bit-aligned and non-aligned addresses?
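A compilable version of that snippet, for reference (echo() is the hypothetical output routine from the fragment above):

void echo(char c);   /* hypothetical output routine from the snippet */

/* The loop above, made compilable. On x86-64 the byte fetch typically
   compiles to a single movzbl (load byte and zero-extend), and on ARM
   to a single ldrb - both ISAs have byte-load instructions, so no
   explicit shift/mask appears in the generated code. Whether memory
   is actually re-read is decided by the cache, not the instructions. */
void echo_all(char *p)
{
    for (int i = 0; i < 327; i++) {
        char c = *p++;
        echo(c);
    }
}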

I honestly would like to know...

Looking at ARM, it appears that addresses can be aligned on 8-bit (i.e. byte) boundaries.

Is the retrieval of a byte slower when it is not aligned on a 64-bit boundary?

One deals with hardware so rarely these days...

Stack Exchange found this:

"Here's what the Intel x86/x64 Reference Manual says about alignments:

4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords

Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. The natural boundaries for words, double words, and quadwords are even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible by eight, respectively. However, to improve the performance of programs, data structures (especially stacks) should be aligned on natural boundaries whenever possible. The reason for this is that the processor requires two memory accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.

Some instructions that operate on double quadwords require memory operands to be aligned on a natural boundary. These instructions generate a general-protection exception (#GP) if an unaligned operand is specified. A natural boundary for a double quadword is any address evenly divisible by 16. Other instructions that operate on double quadwords permit unaligned access (without generating a general-protection exception). However, additional memory bus cycles are required to access unaligned data from memory.

Don't forget, reference manuals are the ultimate source of information of the responsible developer and engineer, so if you're dealing with something well documented such as Intel CPUs, just look up what the reference manual says about the issue."

So that implies that whilst you can get 64-bit chunks aligned on any address, it pays not to.
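As an aside, even reading a 64-bit chunk from an arbitrary address portably in C requires going through memcpy (or byte-wise access); a minimal sketch:

#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a possibly unaligned address. memcpy into
   an aligned local is the portable idiom; compilers reduce it to one
   plain load on targets that tolerate unaligned access, and to
   byte-wise loads where they must. Dereferencing an unaligned pointer
   cast to uint64_t * is undefined behaviour in C. */
uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}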

For ARM

4.2.2. ARMv6 extensions

ARMv6 adds unaligned word and halfword load and store data access support. When enabled, one or more memory accesses are used to generate the required transfer of adjacent bytes transparently, apart from a potentially greater access time where the transaction crosses a word-boundary.

The memory management specification defines a programmable mechanism to enable unaligned access support. This is controlled and programmed using the CP15 register c1 U bit, bit 22.

Non word-aligned load and store multiple, double, semaphore, synchronization, and coprocessor accesses always signal Data Abort with an Alignment fault status code when the U bit is set.

Strict alignment checking is also supported in ARMv6, under control of the CP15 register c1 A bit, [bit 1], and signals a Data Abort with an Alignment fault status code if a 16-bit access is not halfword aligned or a single 32-bit load/store transfer is not word aligned.

ARMv6 alignment fault detection is a mandatory function associated with address generation rather than optionally supported in external memory management hardware.

So unaligned 64-bit accesses are slower.

What I haven't found out is what a processor does with byte accesses.

Is it a case that e.g. it fetches 64 bits and uses the lower 3 address bits to index into the 64-bit quantity and shift it?

And does it repeat the memory access to get the next 8 bits or not?

It seems that addresses are always byte addresses as far as code is concerned, so 64-bit computers must 'lose' the 3 LSBs when doing bus accesses and sort the rest out in microcode.
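The shift/mask scheme described there can at least be written out in C as a model (little-endian assumed; this illustrates the idea, not any particular CPU's internals):

#include <stdint.h>

/* Model of the scheme in question, assuming little-endian layout:
   fetch the aligned 64-bit word containing the byte, then use the
   3 'lost' low address bits to select and shift the wanted byte out. */
uint8_t byte_via_shift_mask(const uint64_t *mem, uint64_t byte_addr)
{
    uint64_t word = mem[byte_addr >> 3];        /* aligned 64-bit fetch */
    unsigned sel  = (unsigned)(byte_addr & 7);  /* low 3 address bits   */
    return (uint8_t)(word >> (sel * 8));        /* shift down, truncate */
}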

--
In a Time of Universal Deceit, Telling the Truth Is a Revolutionary Act. 

- George Orwell
Reply to
The Natural Philosopher

Not really an issue, for, if you're chasing execution time on a 1GHz processor, then get yourself a 2GHz processor.

Reply to
Gareth's Downstairs Computer

Wow. Really?

And what do you do when you run out of GHz to pursue? Tell your clients to get multiple boxes?

Reply to
Ron Aaron

We're nearly at the GHz limit for processors; I don't know the exact figures, but the practical ceiling seems to be around 5GHz. Certainly the power consumption & heat dissipation is a big problem. Intel are shipping multi-core monsters, not faster single cores.

--
Alex
Reply to
Alex McDonald

Exactly my point. CPU speeds haven't increased appreciably for a number of years.

Reply to
Ron Aaron

Main memory is _very slow_ compared to the CPU - the latency of a single read could be 100 CPU cycles or more, during which time your CPU could, at worst, be completely idle.

formatting link
(gives 2012 numbers, but the Pi isn't exactly bleeding-edge hardware, so that doesn't seem inappropriate...)

Since, as you've noticed, our computers have got substantially faster since the 1980s, there must be something addressing this problem, and you're right that it involves caching.

The effect of a memory read, even if only a single byte is requested, is to fill (depending on the technology) up to 64 bytes in the cache[1]. So a subsequent read (of any size) at a nearby address will be much faster than the initial read.

[1] in fact there are usually several levels of cache

In the current world, where each ASCII character is represented by 1 byte, that means that when processing a nontrivial amount of data, you only need to pay that 100+-cycle cost once every 64 characters - so you could run as fast as 1.5 cycles per character. If each character was 8 bytes instead then your best case is 12.5 cycles per character.
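A back-of-envelope check of those figures (assuming, as above, 64-byte cache lines and a miss penalty of about 100 cycles):

#include <stdio.h>

/* Best-case amortised miss cost per character, using the assumed
   figures from the text: 64-byte lines, ~100-cycle miss penalty. */
int main(void)
{
    const double miss_cycles = 100.0, line_bytes = 64.0;

    printf("1-byte chars: %.2f cycles/char\n",
           miss_cycles / (line_bytes / 1.0));  /* ~1.56 */
    printf("8-byte chars: %.2f cycles/char\n",
           miss_cycles / (line_bytes / 8.0));  /* 12.5  */
    return 0;
}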

That's one effect. Another is that the cache is relatively small (for instance the Pi 3 has a 32KB L1 cache). If you make each character 8 times as big as it needs to be then the effect is (roughly speaking) to divide the effectiveness of the cache by the same factor.

The exact size of these effects will depend on what kind of data you're dealing with (there's more to life than ASCII) and what you're doing with it (if you're doing 100s of cycles per character of work anyway then a bit of extra latency is neither here nor there, though the cache occupancy effects may well still be significant).

Elsewhere:

| Not really an issue, for, if you're chasing execution time on
| a 1GHz processor, then get yourself a 2GHz processor.

Won't help. The speed of the CPU is not the problem.

--
https://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell

Thx. I'd forgotten how slow memory is...

--
"A leader is best when people barely know he exists. Of a good leader, who talks little, when his work is done, his aim fulfilled, they will say, 'We did this ourselves.'"

- Lao Tzu, Tao Te Ching
Reply to
The Natural Philosopher

Uh... right. Took me a while to remember that. :-)

--
/~\  cgibbs@kltpzyxm.invalid (Charlie Gibbs) 
\ /  I'm really at ac.dekanfrus if you read it the right way. 
 X   Top-posted messages will probably be ignored.  See RFC1855. 
/ \  Fight low-contrast text in web pages!  http://contrastrebellion.com
Reply to
Charlie Gibbs

That's parallelism, innit?

So now software bloat has nowhere to hide...

--
/~\  cgibbs@kltpzyxm.invalid (Charlie Gibbs) 
\ /  I'm really at ac.dekanfrus if you read it the right way. 
 X   Top-posted messages will probably be ignored.  See RFC1855. 
/ \  Fight low-contrast text in web pages!  http://contrastrebellion.com
Reply to
Charlie Gibbs

As of 2011, the Guinness World Record for the highest CPU clock rate is an overclocked 8.805 GHz AMD Bulldozer-based FX-8150 chip. It surpassed the previous record, an 8.670 GHz AMD FX "Piledriver" chip.[3]

As of mid-2013, the highest clock rate on a production processor is the IBM zEC12, clocked at 5.5 GHz, which was released in August 2012.

Reply to
ray carter

Indeed.

We'll probably get a new generation of compilers that analyse bloatware and completely rewrite it to work as intended, rather than as written...

--
"Progress is precisely that which rules and regulations did not foresee."

- Ludwig von Mises
Reply to
The Natural Philosopher

Virtually all modern processors have multi-level caches.

A reference to an address not yet in cache will result in a cache fault at all levels, causing a main memory access that transfers a cache line of data to the primary (largest, slowest) cache, and the next level cache to receive its (typically smaller) line of data containing the referenced word(s). This continues until the smallest, fastest level 0 cache is loaded with the referenced word, which is usually bypassed directly to the processor's register file (which can be thought of as the ultimate cache, managed by the compiler).

Every doubling of data size effectively halves the size of all caches and data memory, so the performance cost is considerable for any program that stresses any level of cache.
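To put rough numbers on that halving effect (using the 32KB L1 figure quoted earlier for the Pi 3):

#include <stdio.h>

/* Effective L1 capacity in characters at different character widths -
   each doubling of width halves what the cache can hold. */
int main(void)
{
    const unsigned l1_bytes = 32 * 1024;       /* Pi 3 L1 data cache */
    for (unsigned width = 1; width <= 8; width *= 2)
        printf("%u-byte chars: %5u fit in L1\n", width, l1_bytes / width);
    return 0;
}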

And don't expect Moore's "Law" to save you. We are past the point of increasing clock frequency; now all the density improvements just deliver more cores on a chip, so unless you love parallel algorithms, you're out of luck. ;-(

--
-michael - NadaNet 3.1 and AppleCrate II:  http://michaeljmahon.com
Reply to
Michael J. Mahon

Warning - a bit of a waffle... (You can thank uncle Glenmorangie for this...)

FWIW - an evaluation of a move to exascale computing by the US ASCAC sub-committee at the DOE said in 2010: "An exaflop system made entirely out of today's technology would probably cost $100B, require $1B per year to supply the needed power, and its own dedicated power plant to produce that power."

and

"Based on current technology, scaling todays systems to an exaflop level would consume more than a gigawatt of power, roughly the output of the Hoover Dam. Reducing the power requirement by a factor of at least 100 is a challenge for future hardware and software technologies."

(Come along Forth programmers... where are you?) Incidentally, the Hoover Dam produces about 2000 MW these days...

and

Since there's chatter about multi-tasking - how about a system that runs up to 500,000 threads? The US has one. I find that mind-boggling. I don't think round-robin Forth would cut the mustard, though.

Nobody is looking to increase clock speeds beyond today's existing levels for these projects (because we have indeed reached practical limits there). And then there are the latency issues regardless of clocking, of course, as I think Alex said later. And addressing tied to 64-bit words - well, OK - crazy idea suggested, crazy idea shot down... it's good to have crazy ideas to think about just the same. I have a few myself. Some of them may even have legs. (But I'm not sure if I have a biped, an octopus or a centipede.)

64-bit granularity? You'd have a stack of FR4 6 inches thick... (On the other hand - strange nobody mentioned VLIW...)
--

john 

========================= 
http://johntech.co.uk 
=========================
Reply to
john

15 years ago Greg Bailey's customer using an off-the-shelf PC was servicing >1,000 tasks. 30 years ago we had a customer running 100 tasks on an 8085.

It's doable because computers spend most of their time waiting for I/O, and an efficiently implemented Forth round-robin multitasker can take advantage of this. Pre-emptive task schedulers spend many microseconds deciding which task to run and then doing the swap. A task swap in polyFORTH or SwiftX takes 3-4 machine instructions total. That is way faster than any other model.
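For anyone unfamiliar with the model, here is a minimal sketch of the cooperative round-robin idea in C - illustrative only, not the polyFORTH or SwiftX mechanism (those switch in a few instructions precisely because a Forth task's live state is little more than a couple of stack pointers):

#include <stddef.h>

/* Cooperative round-robin scheduling, sketched in C. Each task runs
   until it returns (i.e. yields), typically because it is waiting on
   I/O; the scheduler simply steps round the circle of tasks forever. */
typedef void (*task_fn)(void *state);

struct task {
    task_fn fn;     /* code to run                      */
    void   *state;  /* per-task data                    */
    int     awake;  /* 0 = waiting on I/O, 1 = runnable */
};

void round_robin(struct task *tasks, size_t ntasks)
{
    for (size_t i = 0; ; i = (i + 1) % ntasks)
        if (tasks[i].awake)
            tasks[i].fn(tasks[i].state);
}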

Cheers, Elizabeth

--
Elizabeth D. Rather 
FORTH, Inc. 
6080 Center Drive, Suite 600 
Los Angeles, CA  90045 
USA
Reply to
Elizabeth D. Rather

Perhaps if it posted snide remarks about the programmer to a public forum every time it did this, we could shame people into writing decent code again.

--
/~\  cgibbs@kltpzyxm.invalid (Charlie Gibbs) 
\ /  I'm really at ac.dekanfrus if you read it the right way. 
 X   Top-posted messages will probably be ignored.  See RFC1855. 
/ \  Fight low-contrast text in web pages!  http://contrastrebellion.com
Reply to
Charlie Gibbs
