With oodles (soon to be tera-oodles?) of RAM available
on the RPi3, and in future releases likely to be even more,
is there any point in continuing to cater for bytes, halfwords
and words, when everything, including CHAR, can be a 64-bit cell?
An enjoyable hacker's language such as FORTH becomes less complex
by such provision, and a stand-alone SIXTYFORTH a means to
break the stranglehold of the bloated Unices.
Although I condemn FORTH as being unsuitable for professional
designs, especially where its adaptation resulted from those
with limited software awareness, it remains an exciting tool with
which to dabble.
What research? That link is a Forth that does exactly as you suggest.
Read his text file LSE64.txt, and if you're too lazy to find the
appropriate bit, look at lines 13 to 16. No, I won't copy it here for
you to read. Comments like yours make me less inclined to be helpful!
OK. I'll bite, how would performance be affected?
I've watched int go from 16 to 64 bits and stuff just got faster..:-)
How is 'load a byte' DONE on a 64-bit processor other than load
[aligned] and shift/mask?
How does a compiler treat a byte pointer?
Is it fetching *p as a 64-bit chunk and manipulating it, or is it
retrieving the same 64 bits of memory over and over and taking a
different byte? Or is it cached and cache-aware? Or does the processor
itself have some magic whereby repeated calls to a pointer incrementing
a byte at a time are dealt with differently for 64-bit-aligned and
non-aligned addresses?
I honestly would like to know...
Looking at ARM it appears that address registers can be aligned on 8-bit
boundaries. Is the retrieval of a byte slower when unaligned on 64-bit
boundaries? One deals with hardware so rarely these days...
A Stack Exchange search found this:
"Here's what the Intel x86/x64 Reference Manual says about alignments:
4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords
Words, doublewords, and quadwords do not need to be aligned in
memory on natural boundaries. The natural boundaries for words, double
words, and quadwords are even-numbered addresses, addresses evenly
divisible by four, and addresses evenly divisible by eight,
respectively. However, to improve the performance of programs, data
structures (especially stacks) should be aligned on natural boundaries
whenever possible. The reason for this is that the processor requires
two memory accesses to make an unaligned memory access; aligned accesses
require only one memory access. A word or doubleword operand that
crosses a 4-byte boundary or a quadword operand that crosses an 8-byte
boundary is considered unaligned and requires two separate memory bus
cycles for access.
Some instructions that operate on double quadwords require memory
operands to be aligned on a natural boundary. These instructions
generate a general-protection exception (#GP) if an unaligned operand is
specified. A natural boundary for a double quadword is any address
evenly divisible by 16. Other instructions that operate on double
quadwords permit unaligned access (without generating a
general-protection exception). However, additional memory bus cycles are
required to access unaligned data from memory.
Don't forget, reference manuals are the ultimate source of information
of the responsible developer and engineer, so if you're dealing with
something well documented such as Intel CPUs, just look up what the
reference manual says about the issue."
So that implies that whilst you can get 64-bit chunks aligned on any
address, it pays not to.
4.2.2. ARMv6 extensions
ARMv6 adds unaligned word and halfword load and store data access
support. When enabled, one or more memory accesses are used to generate
the required transfer of adjacent bytes transparently, apart from a
potentially greater access time where the transaction crosses a word
boundary.
The memory management specification defines a programmable mechanism to
enable unaligned access support. This is controlled and programmed using
the CP15 register c1 U bit, bit 22.
Non word-aligned load and store multiple, double, semaphore,
synchronization, and coprocessor accesses always signal Data Abort with
an Alignment fault status code when the U bit is set.
Strict alignment checking is also supported in ARMv6, under control of
the CP15 register c1 A bit, [bit 1], and signals a Data Abort with an
Alignment fault status code if a 16-bit access is not halfword aligned
or a single 32-bit load/store transfer is not word aligned.
ARMv6 alignment fault detection is a mandatory function associated with
address generation rather than optionally supported in external memory
systems.
So 64-bit unaligned accesses are slower.
What I haven't found out is what a processor does with byte access.
Is it a case that e.g. it fetches 64 bits and uses the lower 3 address
bits to index into the 64-bit quantity and shift it?
And does it repeat the memory access to get the next 8 bits or not?
It seems that addresses are always byte addresses as far as code is
concerned, so 64-bit computers must 'lose' the 3 LSBs when doing bus
accesses and sort the rest out in microcode.
In a Time of Universal Deceit, Telling the Truth Is a Revolutionary Act.
- George Orwell
We're nearly at the GHz limit for processors; I don't know the exact
figures, but anything beyond 5GHz seems to be it. Certainly the power
consumption & heat dissipation is a big problem. Intel are shipping
multi core monsters, not faster single cores.
Main memory is _very slow_ compared to the CPU - the latency of a single
read could be 100 CPU cycles or more, during which time your CPU could,
at worst, be completely idle.
gives 2012 numbers but the Pi isn't exactly bleeding edge hardware so
that doesn't seem inappropriate...)
Since, as you've noticed, our computers have got substantially faster
since the 1980s, there must be something addressing this problem, and
you're right that it involves caching.
The effect of a memory read, even if only a single byte is requested, is
to fill (depending on the technology) up to 64 bytes in the cache. So
a subsequent read (of any size) at a nearby address will be much faster
than the initial read.
(In fact there are usually several levels of cache.)
In the current world, where each ASCII character is represented by 1
byte, that means that when processing a nontrivial amount of data, you
only need to pay that 100+-cycle cost once every 64 characters - so you
could run as fast as 1.5 cycles per character. If each character was 8
bytes instead then your best case is 12.5 cycles per character.
That's one effect. Another is that the cache is relatively small (for
instance the Pi 3 has a 32KB L1 cache). If you make each character 8
times as big as it needs to be then the effect is (roughly speaking) to
divide the effectiveness of the cache by the same factor.
The exact size of these effects will depend on what kind of data you're
dealing with (there's more to life than ASCII) and what you're doing
with it (if you're doing 100s of cycles per character of work anyway
then a bit of extra latency is neither here nor there, though the cache
occupancy effects may well still be significant).
| Not really an issue, for, if you're chasing execution time on
| a 1GHz processor, then get yourself a 2GHz processor.
Won't help. The speed of the CPU is not the problem.
As of 2011, the Guinness World Record for the highest CPU clock rate is
an overclocked, 8.805 GHz AMD Bulldozer-based FX-8150 chip. It surpassed
the previous record, an 8.670 GHz AMD FX "Piledriver" chip.
As of mid-2013, the highest clock rate on a production processor is the
IBM zEC12, clocked at 5.5 GHz, which was released in August 2012.
Virtually all modern processors have multi-level caches.
A reference to an address not yet in cache will result in a cache fault at
all levels, causing a main memory access that transfers a cache line of
data to the primary (largest, slowest) cache, and the next level cache to
receive its (typically smaller) line of data containing the referenced
word(s). This continues until the smallest, fastest level 0 cache is
loaded with the referenced word, which is usually bypassed directly to the
processor's register file (which can be thought of as the ultimate cache,
managed by the compiler).
Every doubling of data size effectively halves the size of all caches and
data memory, so the performance cost is considerable for any program that
stresses any level of cache.
And don't expect Moore's "Law" to save you. We are past the point of
increasing clock frequency; now all the density improvements just deliver
more cores on a chip, so unless you love parallel algorithms, you're out of
luck.
-michael - NadaNet 3.1 and AppleCrate II: http://michaeljmahon.com
warning - a bit of a waffle...(You can thank uncle Glenmorangie for this...)
FWIW - an evaluation of a move to exascale computing by the US ASCAC sub-committee
at the DOE said in 2010:
" An exaflop system made entirely out of today's technology would probably cost $100B,
require $1B per year to supply the needed power, and its own dedicated power plant to produce that power."
"Based on current technology, scaling todays systems to an exaflop
level would consume more than a gigawatt of power, roughly the output of the
Hoover Dam. Reducing the power requirement by a factor of at least 100 is a
challenge for future hardware and software technologies."
(Come along Forth programmers ... where are you?)
Incidentally, the Hoover Dam produces about 2000 MW these days...
Since there's chatter about multi-tasking - how about a system that
runs up to 500,000 threads?
The US has one.
I find that mind boggling. I don't think round-robin Forth would cut the mustard.
Nobody is looking to increase clock speeds beyond today's existing levels
(because we have indeed reached practical limits there). And then there are the
latency issues regardless of clocking, of course, as I think Alex said later.
And addressing tied to 64 bit words - well OK - Crazy idea suggested - crazy idea shot down ...
it's good to have crazy ideas to think about just the same. I have a few myself.
Some of them may even have legs. (But I'm not sure if I have a biped, Octopus or centipede)
64bit granularity? You'd have a stack of FR4 6 inches thick....
(On the other hand - strange nobody mentioned VLIW ...)
15 years ago Greg Bailey's customer using an off-the-shelf PC was
servicing >1,000 tasks. 30 years ago we had a customer running 100 tasks
on an 8085.
It's doable because computers spend most of their time waiting for I/O,
and an efficiently implemented Forth round robin multitasker can take
advantage of this. Pre-emptive task schedulers spend many microseconds
deciding what tasks to run and then doing the swap. A task swap in
polyFORTH or SwiftX takes 3-4 machine instructions total. That is way
faster than any pre-emptive scheduler.