Do you know the intended purpose of the call graphs? It seems to me that it would be to match expectations to what was coded. It shouldn't matter who was doing the evaluation; there should have been an accounting of expectations regarding the presence and/or absence of recursion.
Much like a checklist, it doesn't just assure the presence of everything on the list; it can be used to verify the absence of anything not on the list.
IIRC they were doing independent SW verification & validation of the program (and the WCET analysis was also a part of that). But it was many years ago, and I don't remember the details well enough to say much more, nor can I say why the program was recursive in this way, or if it could as easily have been made non-recursive.
Slightly OT, but I have often wondered how primitive a computer architecture can be and still do some useful work. In the tube/discrete/SSI times, there were quite a lot of 1-bit processors. There were at least two types. One was the PLC (Programmable Logic Controller) type replacing relay logic; these typically had at least AND, OR, NOT, (XOR) instructions. The other group was used as truly serial computers with the same instructions as the PLC, but also at least 1-bit SUB (and ADD) instructions to implement all mathematical functions.
However, in the LSI era, there don't seem to be many implementations.
One that immediately comes to mind is the MC14500B PLC building block, from the 1970s, which requires quite a lot of support chips (code memory, PC, I/O chips) to do some useful work.
After much searching, I found the NI (National Instruments) SBA (Serial Boolean Analyser)
from the same era, with 1024 words of (8-bit) instruction ROM, four banks of 30 _bits_ of data memory, and 30 I/O pins in a 40-pin package. For the re-entrance enthusiasts, it contains stack-pointer-relative addressing :-). The I/O pins are 5 V TTL compatible, so a few ULN2803 Darlington buffers may be needed to drive loads typically found in a PLC environment.
Anyone seen more modern 1-bit chips, either for relay replacement or for truly serial computers?
On Sunday, October 21, 2018 at 10:47:26 AM UTC-4, firstname.lastname@example.org wrote:
It is hard for me to imagine applications where a 1 bit processor would be useful. A useful N bit processor can be built in a small number of LUTs. I've built a 16 bit processor in just 600 LUTs and I've seen processors in a bit less.
I discussed this with someone once and he imagined apps where the processing speed requirement was quite low and you can save LUTs with a bit-serial processor. I just don't know how many or why it would matter. Even the smallest FPGAs have thousands of LUTs. It's hard to picture an application where you couldn't spare a few hundred LUTs.
On Sunday, October 21, 2018 at 10:08:06 AM UTC-5, email@example.com wrote:
]> It's hard to picture an application where you couldn't spare a few hundred LUTs.
There are advantages to using several soft core processors, each sized and customized to the need.
]> I've built a 16 bit processor in just 600 LUTs and I've seen processors in a bit less.
There are many under 600 LUTs, including 32-bit designs. I had hoped the full-featured LEM design would be under 100 LUTs. I have done some rough research of what's available for under 600 LUTs:
select: "By Performance Metric"
A big rationale for small soft-core processors is that they replace LUTs (slow-speed logic) with block RAM (instructions). And they are completely deterministic, as opposed to doing the same by time-slicing an ASIC (ARM) processor.
There is not much point in 1-bit processing with modern architectures and FPGAs. But it used to be more useful, for cheap and scalable solutions. You got systems that scaled in parallel, using bit-slice processors to make cpus as wide as you want. And you got serial scaling, giving you practical numbers of bits with minimal die area (like the COP8 microcontrollers).
Circa 1985-1993, Thinking Machines Connection Machine. Circa 1987-1996, MasPar MP series.
The CM-1, 2, 2a, and 200 all were SIMD parallel using 1-bit serial integer-only CPUs. Sizes ranged from 8K CPUs at the low end to 64K CPUs at the high end. Each CPU had 4K *bits* of private RAM, and the CPUs were connected in a multidimensional hypercube network.
The CM-2, 2a, and 200 were augmented with 32-bit FPUs (1 per 32 CPUs), and the 200 featured a higher clock speed.
The MP-1 was SIMD parallel using 4-bit serial integer-only CPUs in sizes from 1K to 16K CPUs. It also had 32-bit FPUs, but I don't remember how many / what ratio. I remember that it had an accumulator register rather than going memory->memory like the CM.
[I can't find much information now about the MP-1 ... unfortunately MasPar didn't last very long in the marketplace. The Wikipedia article has some information about the MP-2, but the MP-2 was a later full 32-bit design, very different from the MP-1.]
My college had both an 8K CM-2 and a 1K MP-1, accessible to those who took various parallel processing electives. I never got to use the MP-1 much - it was new at the end of my time and I only ever played with it a bit. But I spent 2 semesters working with the CM-2.
Even though the CM's clock speed was only ~8MHz, the performance was amazing IF the problem was a good fit to the architecture. E.g., at that time, I owned a 66MHz (dx2) i486. Converted for the CM-2 architecture, O(n^4) array processing on the i486 became O(n) on the CM-2. I had a physics simulation that took over 3 hours on my i486 that ran in ~10 minutes on the CM.
They're back in stock, though the price rose by 21% to $0.046. Also, LCSC seems to now be stocking more Padauk parts, including more dual-core devices. Unfortunately, the programmer seems to be out of stock, and they have neither the flash nor the DIP variants.
Those addresses are shared across all cores. Each core only has its own A, SP, F, PC. How do we handle local variables?
Option 1: Make functions non-reentrant. This requires duplication of code (we need per-thread copies of functions) and link-time analysis to ensure that each thread only calls the function implementation meant for it. Function pointers get complicated.
Option 2: Use an inefficient combination of thread-local storage and stack.
The same applies to the support routines the compiler inserts (e.g. for multiplication); of course those are affected by the same problems.
On 12.10.18 at 20:39, firstname.lastname@example.org wrote:
But static memory allocation would require one copy of each function per thread. And the linker would have to analyze the call graph to always call the correct function for each thread. Function pointers get complicated.
Unfortunately, reentrancy becomes even harder with hardware multithreading: to access the stack, one has to construct a pointer to the stack location in a memory location. That memory location (like any pseudo-register) is then shared among all running instances of the function. So it needs to be protected (e.g. with a spinlock), making access even more inefficient. And that spinlock will cause issues with interrupts (a solution might be to heavily restrict interrupt routines, essentially allowing not much more than setting some global variables).
Then there is the trade-off of using one such memory location per function vs. per program (the latter reducing memory usage, but resulting in less parallelism).
The pseudo-registers one would want to use are not so much a problem for interrupt routines (they would just need saving and thus increase interrupt overhead a bit), but for hardware parallelism. Essentially all access to them would again have to be protected by a spinlock.
All these problems could have relatively easily been avoided by providing an efficient stack-pointer-relative addressing mode. Having a few general-purpose or index registers would have somewhat helped as well.
A low-end Cortex would still be far heavier than a Padauk variant with an sp-relative addressing mode or a few registers added. I think a more multithreading-friendly variant of the Padauk would still be simpler than an STM8. But one could surely create a nice STM8-like processor (with a few STM8 weaknesses fixed) with hardware multithreading.
For a foreground/background monitor, the worst case would be two copies of static data, if both threads use the same subroutine.
A linker for such a small target?
With such a small processor, just track any dependencies manually.
Do you really insist on using function pointers with such small targets?
With two hardware threads, you would need at most two copies of static data.
Why would you want to access the stack?
The stack is usable for handling return addresses, but I guess that a hardware thread must have its own return address stack pointer.
In fact, many minicomputers from the 1960s did not even have a stack at all. The calling program just stored the return address in the first word of the subroutine, and at the end of the subroutine, an indirect jump through that first word returned to the calling program. Of course, this is not re-entrant, and in those days one did not have to worry about multiple CPUs accessing the same routines :-).
BTW, who needs a program counter (PC), many microprograms run without a PC, with the next instruction address stored at the end of the long instruction word :-)
Disabling all interrupts for the duration of some critical operations is often enough, but of course, the number of instructions executed with interrupts disabled should be minimized. In MACRO-11 assembler, the standard practice was to start the comment field with a semicolon normally, with two semicolons when task switching was disabled, and with three semicolons when interrupts were disabled; it was then visually easy to detect when interrupts were disabled and not mess too much with such code sections.
On 08.11.18 at 20:52, email@example.com wrote:
Of course. The support routines the compiler uses reside in some library, the linker links them in if necessary. Also, the larger variants are not that small, with up to 256 B of RAM and 8 KB of ROM.
soft UART, etc.
I want to have C, function pointers are part of it.
Padauk still makes one chip with 8 hardware threads (and it looks to me as if there were more in the past, though they are not currently listed on their website, one can find them e.g. in their IDE).
For reentrancy, so I can use one function implementation for all threads. It would also be useful to dynamically assign threads to hardware threads (so no thread is tied to specific hardware, and some OS schedules them).
Each hardware thread has its own flag register (4 bits), accumulator (8 bits), PC (12 bits) and stack pointer (8 bits).
Disabling interrupts any time a spinlock is held, or a thread is waiting for one, might be too much, especially if there are many threads, so the spinlock is held often.
A linker is required, if the libraries are (for copyright reasons) delivered as binary object code only.
However, if the libraries are delivered as source files and the compiler/assembler has even a rudimentary #include mechanism, just include those library files you need. With an include or macro processor with parameter passing, just invoke the same include file or macro twice with different parameters for different static-variable instances.
Of course, linkers are also needed if very primitive compilation machines are used, such as floppy-based Intellecs or Exorcisers. It could take a day to compile a large program all the way from sources, with multiple floppy changes to get the final absolute file to a single floppy, ready to be burnt into EPROMs for an additional hour or two. In such an environment, recompiling, linking and burning only the source file that changed would speed up program development a lot.
When using a modern PC for compilation, there are no such issues.
On 08.11.18 at 23:35, firstname.lastname@example.org wrote:
Separate compilation and then linking is the normal thing to do, and a common workflow for small devices. This is e.g. how most people use SDCC, a mainstream free compiler targeting various 8-bit architectures.
That doesn't mean it is the only way (and since SDCC does not have link-time optimization it might not be the optimal way either). But it is something people use and expect to work reasonably well.
So for anyone designing an architecture it would be wise to not put too many obstacles into that workflow.