Altera Cyclone replacement

Not sure what these are about. They certainly don't compare third party implementations of their architectures.

Rick C.


Reply to
gnuarm.deletethisbit

Proprietary maybe; when the re-implementation is clean, it's OK. You might also have to re-implement the assembler & C-compiler for license reasons.

I once changed the register implementation of a PicoBlaze. That was not too hard. Its VHDL representation is compiled with the rest of your FPGA circuit.

The problem was, we used PicoBlazes in an original dinosaur Virtex in a space application, and we had to scrub the configuration memory every minute or so. That means reloading the configuration memory to fight the accumulation of bad bits due to radiation etc. It works just like booting the FPGA at power-up - and killing this process one clock before the global reset happens!

The icing on the cake was that the reload circuitry was in the FPGA itself. That's much like exchanging the carpet under your feet.

I have written a nice package of triple-module-redundant standard logic vectors for that, and for other sensitive processing.

tmr_sl and tmr_slv could be used almost like std_logic, and the peculiarities were carefully hidden, like preventing ISE from proudly optimizing the redundancy away. The Xilinx TMR tool was unavailable for European space projects because of ITAR. :-(
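Just to illustrate the idea (this is my own minimal sketch, not Gerhard's package): a triplicated vector type with a bitwise 2-out-of-3 voter, plus a "keep"-style attribute so the synthesizer does not merge the three copies back into one. The names, the fixed 8-bit width and the attribute choice are assumptions for the sketch; a real package would be generic/unconstrained and tool-aware.

library ieee;
use ieee.std_logic_1164.all;

package tmr_pkg is
  -- Fixed width for the sketch; the real thing would be unconstrained
  -- (straightforward in VHDL-2008).
  subtype word is std_logic_vector(7 downto 0);
  type tmr_slv is array (0 to 2) of word;
  function vote(t : tmr_slv) return word;
end package tmr_pkg;

package body tmr_pkg is
  -- Bitwise 2-out-of-3 majority voter.
  function vote(t : tmr_slv) return word is
  begin
    return (t(0) and t(1)) or (t(1) and t(2)) or (t(0) and t(2));
  end function;
end package body tmr_pkg;

library ieee;
use ieee.std_logic_1164.all;
use work.tmr_pkg.all;

entity tmr_reg is
  port (clk : in std_logic; d : in word; q : out word);
end entity;

architecture rtl of tmr_reg is
  signal r : tmr_slv;
  -- Ask the synthesizer not to merge the three copies
  -- (XST/Vivado "keep"; other tools use syn_preserve or similar).
  attribute keep : string;
  attribute keep of r : signal is "true";
begin
  process(clk)
  begin
    if rising_edge(clk) then
      for i in 0 to 2 loop
        r(i) <= d;            -- three independent copies
      end loop;
    end if;
  end process;
  q <= vote(r);               -- voted output
end architecture;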

(Maybe I should do an open source re-implementation in modern VHDL as a WARM THANK YOU. I know now how to make it even better and we could make tamagotchis for the children of Fukushima.)

But I digress. The reason for the PicoBlaze modification was that PicoBlaze uses CLB RAMs for its registers, and these are really snippets of the configuration RAM. So, during each scrubbing of the configuration the CPU forgets its register contents.

Replacing the RAMs with arrays of flip-flops increased the resource consumption, but it was _not_ much slower.
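A hypothetical sketch of that kind of change (not the actual PicoBlaze code): describe the 16 x 8 register file as an array of flip-flops with an explicit read mux, instead of letting it fall into CLB/distributed RAM. Port names and the attribute hint are assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 16 x 8 register file built from flip-flops instead of
-- distributed RAM, so its contents are not part of the configuration
-- bitstream and survive scrubbing.
entity regfile_ff is
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  std_logic_vector(3 downto 0);
    wdata : in  std_logic_vector(7 downto 0);
    raddr : in  std_logic_vector(3 downto 0);
    rdata : out std_logic_vector(7 downto 0));
end entity;

architecture rtl of regfile_ff is
  type reg_array is array (0 to 15) of std_logic_vector(7 downto 0);
  signal regs : reg_array;
  -- Some tools will still map this pattern to LUT RAM; a tool-specific
  -- ram_style/keep attribute may be needed to force flip-flops.
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        regs(to_integer(unsigned(waddr))) <= wdata;
      end if;
    end if;
  end process;

  rdata <= regs(to_integer(unsigned(raddr)));  -- combinational read mux
end architecture;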

best regards, Gerhard

Hoffmann Consulting: ANALOG, RF and DSP Design.

Reply to
Gerhard Hoffmann

As you probably found out yourself, the least painful and the most cost-effective migration path is to Cyclone 10LP. Despite the name 10, it is a relatively old family (60 nm) that is less likely than new chips to have problems with 3/3.3V external I/O. MAX 10 is very cheap at 2 KLUTs. If your design is bigger than that, then Cyclone 10LP would be cheaper.

For relatively big volumes consider Lattice Mach. Their list price is no good, but volume discounts are fantastic. But be ready for a much higher level of pain during development than what you are probably accustomed to with Cyclone.

Reply to
already5chosen

On Thursday, January 31, 2019 at 2:42:54 AM UTC+2, snipped-for-privacy@ieee.org wrote:

...design by migrating to another vendor that isn't likely to get acquired or axed. Xilinx has the single-core Zynq-7000 devices if you want to go with a more main-stream ARM processor sub-system (although likely overkill for whatever your Nios is doing). Otherwise, the Artix-7 and Spartan-7 would be good targets if you want to migrate to a MicroBlaze or some other soft core. The Spartan-7 family is essentially the Artix-7 fabric with the transceivers removed and is offered in 6K to 100K logic cell densities.

... using a MicroBlaze processor isn't "future proofing" anything. It is just shifting from one brand to another with the exact same problems.

... any FPGA company in-house processor and use an open source processor design. Then you can use any FPGA you wish.

... used to replace a MicroBlaze when it became unequal to the task at hand.

... recommendation: "or some other soft core."

... to another is of limited value. MicroBlaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well. But perhaps more importantly, they are far from optimal. That's why I posted the info on the J1 processor. It was invented to replace a MicroBlaze that wasn't up to the task.

... soft core is necessary (yet). How simple is the software running on it? Can it reasonably be ported to HDL, thus ensuring portability? I tend to lean that way unless the SW was simple due to capability limitations in the earlier technologies (e.g., old Cyclone and Nios) and the desire is to add more features that are realizable with new-generation devices and soft (or hard) core capabilities.

... sometimes they are added because of the complexity of expression. Regardless of how simply we can write HDL, the large part of the engineering world perceives HDL as much more complex than other languages and is not willing to port code to an HDL unless absolutely required. So if the code is currently in C, it won't get ported to HDL without a compelling reason.

... perception that FPGAs are difficult to use, expensive, large and power hungry. That is largely true if you use their products only. Lattice has been addressing a newer market with small, low power, inexpensive devices intended for the mobile market. Now if someone would approach the issue of ease of use by something more than throwing an IDE on top of their command line tools, the FPGA market can explode into territory presently dominated by MCUs.

We just need a cheap enough FPGA in a suitable package.

... versions available, but I expect there are open source versions of the NIOS available as well.

openfire_core, openfire2, secretblaze

I am playing with one right now. I already have a half-dozen working variants, each with its own advantage/disadvantage in terms of resource usage (LEs vs M9K) and Fmax. The smallest one is still not as small as Altera's Nios2e and the fastest one is still not as fast as Altera's Nios2f. Beating Nios2e on size is in my [near] future plans; beating Altera's Nios2f on speed and features is of lesser priority.

My cores are less full-featured than even Nios2e. They are intended for one certain niche that I would call "soft MCU". In particular, the only supported program memory is what Altera calls "tightly coupled memory", i.e. embedded dual-ported SRAM blocks with no other master connected. Other limitations are the absence of exceptions and external interrupts. For me it's o.k.; that's how I use Nios2e anyway.
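For illustration, a minimal sketch (my assumption of the arrangement, not the actual core) of such a tightly coupled program memory: one inferred dual-ported RAM block, port A dedicated to instruction fetch, port B used only for loading the code.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of a tightly coupled program memory: one embedded RAM block,
-- port A for instruction fetch, port B for a loader/debug master.
entity tcm is
  generic (AW : natural := 12);            -- 4K x 32
  port (
    clk     : in  std_logic;
    -- port A: instruction fetch
    a_addr  : in  std_logic_vector(AW-1 downto 0);
    a_data  : out std_logic_vector(31 downto 0);
    -- port B: write port (e.g. initial code load)
    b_we    : in  std_logic;
    b_addr  : in  std_logic_vector(AW-1 downto 0);
    b_wdata : in  std_logic_vector(31 downto 0));
end entity;

architecture rtl of tcm is
  type mem_t is array (0 to 2**AW - 1) of std_logic_vector(31 downto 0);
  signal mem : mem_t;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      a_data <= mem(to_integer(unsigned(a_addr)));   -- registered fetch
      if b_we = '1' then
        mem(to_integer(unsigned(b_addr))) <= b_wdata;
      end if;
    end if;
  end process;
end architecture;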

I didn't check if what I am doing is legal. Probably does not matter as long as it's just a repo on github.


The fixed-instruction-width 32-bit subset of the RISC-V ISA is nearly identical to Nios2, down to the level of instruction formats. The biggest difference is a 12-bit immediate in RV vs 16-bit in N2. Not a big deal. So I expect that RV32 cores available in source form can be modified to run Nios2 code in a few days (or, if the original designer is involved, in a few hours).

The bigger difference would be the external interface. In N2 one expects Avalon-MM. I have no idea what the standard bus/fabric is in the world of RV soft cores and how similar it is to AVM.

Reply to
already5chosen

On Wednesday, February 6, 2019 at 6:54:27 AM UTC-5, snipped-for-privacy@yahoo.com wrote:


Should I assume you are not using C to program these CPUs?

If that is correct, have you considered a stack-based CPU? When you refer to CPUs like the RISC-V, I'm thinking they use thousands of LUT4s. Many stack-based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz, and typically are not pipelined.

There is a lot of interest in stack CPUs in the Forth community since typically their assembly language is similar to the Forth virtual machine.

I'm not familiar with Avalon and I don't know what N2 is. A popular bus in the FPGA embedded world is Wishbone.

Rick C.


Reply to
gnuarm.deletethisbit


That would be a wrong assumption. The exact opposite is far closer to reality - I pretty much never use anything but C to program these CPUs.


It depends on the performance one is looking for.

2-2.5 KLUT4s (plus a few embedded memory blocks and multipliers) is the size of a fully pipelined single-issue CPU with direct-mapped instruction and data caches, multiplier and divider that runs at a very decent Fmax, but features no MMU or MPU. On the other end of the spectrum you find the winners of RISC-V core size competitions - under 400 LUTs, but (I would guess, didn't check it) glacially slow in terms of CPI. But Fmax is still decent.

The half-dozen Nios2 cores of mine are in the middle - 700 to 850 LUT4s, CPI ranging from (approximately) 2.1 to 4.7, and Fmax ranging from reasonable to impractically high. But my main goal was (is) a learning experience rather than practicality. In particular, for the majority of variants I set myself the impractical constraint of implementing the register file in a single embedded memory block. Doing it in two blocks is far more practical, but less challenging. The same goes for aiming at very high Fmax - not practical, but fun. Maybe, after I explore another half-dozen or dozen of fun possibilities, I will settle on building the most practical solutions. But it is no less probable that I'll lose interest and/or focus before that. I am not too passionate about the whole thing.


N2 is my shortcut for Nios2.

I have noticed that Wishbone is popular in Lattice circles. But the Altera world is many times bigger than Lattice, and here Avalon is king. Also, when performance matters, Avalon is much better technically.

Reply to
already5chosen

On Thursday, February 7, 2019 at 4:00:57 PM UTC-5, snipped-for-privacy@yahoo.com wrote:


Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real-time requirements. While C doesn't prevent that, I prefer to just code in assembly language and, more importantly, use a CPU design that provides single-cycle execution of all instructions. That's why I like stack processors: they are easy to design, use a very simple instruction set, and the assembly language can be very close to the Forth high-level language.


I'm not familiar with what bus is preferred where. I just know that every project I've looked at on OpenCores using a standard bus used Wishbone. If you say Avalon is better, ok. Is it open source? Can it be used on other than Intel products?

Rick C.


Reply to
gnuarm.deletethisbit

I am not sure what "open source" means in this context. Avalon-MM and Avalon-ST are specifications, i.e. documents. The documents are freely downloadable from the Altera/Intel web site.


The GUI tool that connects together components conforming to the Avalon specs - called SOPC Builder in the 00s, then QSYS, and now Intel Platform Designer or something like that - is a proprietary closed-source program. The code that the tool generates is normal VHDL or, more often, normal Verilog, that contains a copyright statement like this:

// -----------------------------------------------------------
// Legal Notice: (C)2007 Altera Corporation. All rights reserved. Your
// use of Altera Corporation's design tools, logic functions and other
// software and tools, and its AMPP partner logic functions, and any
// output files any of the foregoing (including device programming or
// simulation files), and any associated documentation or information are
// expressly subject to the terms and conditions of the Altera Program
// License Subscription Agreement or other applicable license agreement,
// including, without limitation, that your use is for the sole purpose
// of programming logic devices manufactured by Altera and sold by Altera
// or its authorized distributors. Please refer to the applicable
// agreement for further details.

So, you can't legally use QSYS-generated code with non-Intel devices. But (IANAL) nobody prevents you from writing your own interconnect generation tool. Or from not using any CAD tool at all and just connecting components manually within your HDL. Isn't that mostly what you do with Wishbone components, anyway?
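For illustration, a hand-written Avalon-MM slave can be as small as this sketch (two 32-bit registers, no waitrequest). The port names follow the Avalon-MM signal roles; everything else - entity name, register map, the one cycle of read latency - is an assumption for the sketch.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch: tiny Avalon-MM slave with two 32-bit registers at word
-- addresses 0 and 1.  No waitrequest; readdata is assumed to be
-- declared with one cycle of read latency.
entity avmm_regs is
  port (
    clk           : in  std_logic;
    reset_n       : in  std_logic;
    avs_address   : in  std_logic_vector(0 downto 0);
    avs_write     : in  std_logic;
    avs_writedata : in  std_logic_vector(31 downto 0);
    avs_read      : in  std_logic;
    avs_readdata  : out std_logic_vector(31 downto 0));
end entity;

architecture rtl of avmm_regs is
  signal reg0, reg1 : std_logic_vector(31 downto 0);
begin
  process(clk, reset_n)
  begin
    if reset_n = '0' then
      reg0 <= (others => '0');
      reg1 <= (others => '0');
      avs_readdata <= (others => '0');
    elsif rising_edge(clk) then
      if avs_write = '1' then
        if avs_address = "0" then
          reg0 <= avs_writedata;
        else
          reg1 <= avs_writedata;
        end if;
      end if;
      if avs_read = '1' then            -- readdata valid next cycle
        if avs_address = "0" then
          avs_readdata <= reg0;
        else
          avs_readdata <= reg1;
        end if;
      end if;
    end if;
  end process;
end architecture;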

Reply to
already5chosen

On Thursday, February 7, 2019 at 5:11:20 PM UTC-5, snipped-for-privacy@yahoo.com wrote:


Sorry, I don't know what you are referring to. But my concern with the bus is that it is entirely possible, and not at all uncommon, for such a design to have aspects which are under license. Some time ago it was ruled that the Z80 did not infringe on Intel's 8080 design, but the mnemonics were copyrighted, so Zilog had to develop their own assembler syntax. ARM decided to protect their CPU design with a patent on some aspect of interrupt handling, if I recall correctly. So while there are equivalent CPUs on the market (RISC-V for example), there are no ARM clones even though all the ARM architecture documents are freely available.

The point is I don't know if this Altera bus is protected in some way or not. That's why I was asking. IANAL either.

I think the term open source is pretty clear in all contexts. Lattice has their own CPU designs for use in FPGAs. The difference is they don't care if you use them in a Xilinx chip.

Rick C.


Reply to
gnuarm.deletethisbit

I have no idea of the legal aspects of Avalon (I only ever used it on Altera devices, long ago). But technically it is very similar to Wishbone for many common uses. Things always get complicated when you need priorities, bursts, variable wait states, etc., but for simpler and static connections, I don't remember it as being difficult to mix them. (It was many years ago when I did this, however.)

Reply to
David Brown


Can you quantify the criticality of your real-time requirements?

Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the number of cycles is known up front? Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to the number of cycles per instruction.


1 cycle per instruction, not pipelined, means that the stack can not be implemented in memory block(s). Which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.

Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.

And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like a top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and a full-featured Nios2f at 300 MHz+. But it does look impossible in the low speed grades of budget parts, like the slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that the Lattice Mach series is somewhat slower than even those. The only way that I can see non-pipelined conditional branches working at 100 MHz in low-end devices is if your architecture has a branch delay slot. But that by itself is a sort of pipelining; just instead of being done in HW, it is pipelining exposed to SW.

Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use the off-the-shelf Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.

Reply to
already5chosen

On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, snipped-for-privacy@yahoo.com wrote:


Eh? You are asking my requirement or asking how important it is? Not sure how to answer that question. I can only say that my CPU designs give single-cycle execution, so I can design with them the same way I design the hardware in VHDL.


It increases interrupt latency, which is not a problem if you aren't using interrupts, a common technique for such embedded processors. Otherwise multi-cycle instructions complicate the CPU instruction decoder. Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.


Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds. My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay, oh well) so no branch prediction needed.


Huh? So my block RAM stack is pipelined, or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes the return address to the return stack and the PSW to the data stack in one cycle with no latency, so, like the other instructions, it is single cycle, again making using it like designing with registers in the HDL code.


Or both. To get the block RAMs single cycle, the read and write happen on different phases of the main clock. I think read is on the falling edge while write is on the rising edge like the rest of the logic. Instructions and data are in physically separate memory within the same address map, but there is no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?
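A minimal sketch of that opposite-edge trick as I understand it (my reconstruction, not Rick's actual code): the block RAM is written on the rising edge together with the rest of the logic and read on the falling edge, so the value is available within the same CPU cycle. Widths and names are made up.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: stack memory in a block RAM, written on the rising edge and
-- read on the falling edge so the result is usable within one CPU cycle.
entity stack_ram is
  port (
    clk   : in  std_logic;
    push  : in  std_logic;
    waddr : in  unsigned(4 downto 0);    -- 32-deep stack
    wdata : in  std_logic_vector(15 downto 0);
    raddr : in  unsigned(4 downto 0);
    rdata : out std_logic_vector(15 downto 0));
end entity;

architecture rtl of stack_ram is
  type ram_t is array (0 to 31) of std_logic_vector(15 downto 0);
  signal ram : ram_t;
begin
  write_p : process(clk)
  begin
    if rising_edge(clk) then             -- write with the rest of the logic
      if push = '1' then
        ram(to_integer(waddr)) <= wdata;
      end if;
    end if;
  end process;

  read_p : process(clk)
  begin
    if falling_edge(clk) then            -- read half a cycle later
      rdata <= ram(to_integer(raddr));
    end if;
  end process;
end architecture;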


Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality.


I only use the low grade parts. I haven't used Nios, and this processor won't get to 380 MHz, I'm pretty sure. Pipelining it would be counter to its design goals but might be practical; never thought about it.


Or the instruction is simple and runs fast.


That's where my CPU lies, I think it was 600 LUT4s last time I checked.

Rick C.

Reply to
gnuarm.deletethisbit


How important they are. What happens if a particular instruction most of the time takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?


I don't like interrupts in small systems. Neither in MCUs nor in FPGAs. In MCUs nowadays we have bad-ass DMAs. In FPGAs we can build a bad-ass DMA ourselves. Or throw multiple soft cores at multiple tasks. That's why I am interested in *small* soft cores in the first place.


I see no connection to the decoder. Maybe you mean a microsequencer? Generally, I disagree. At least for very fast clock rates it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases.

Or, maybe, you are thinking about variable-length instructions? That's, again, orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4s budget. I would start to consider VLI for something like 1200 LUT4s.


I don't *want* data caches in the sort of tasks that I do with these small cores. An instruction cache is something else; I am not against them in "hard" MCUs. In the small soft cores that we are discussing right now they are impractical rather than evil. But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I didn't have it implemented in my half-dozen (in the meantime the number is growing). But it is practical, esp. for applications that spend most of the time in very short loops.


I am starting to suspect that you have a very special definition of "not pipelined" that differs from the definition used in the literature.


Harvard is less of a problem when you are in full control of the software stack. When you are not in full control, sometimes compilers like to place data, esp. jump tables for implementing the HLL switch/case construct, in program memory. Still, even with full control of the code generation tools, sometimes you want an architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash. Another, less common, possible reason is saving space by placing code and data in the same memory block, esp. when blocks are relatively big and there are few of them.


It is certainly easier when branching is based on arithmetic flags rather than on the content of a register, as is the case in MIPS derivatives, including Nios2 and RISC-V. But still hard. You have to wait for the instruction to arrive from memory, decode the instruction, do logical operations on flags and select between two alternatives based on the result of the logical operation, all in one cycle. If the branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.

But even if it's somehow doable for PC-relative branches, I don't see how, assuming that the stack is stored in block memory, it is doable for *indirect* jumps. I'd guess you are somehow cutting corners here, most probably by requiring the address of an indirect jump to be in the top-of-stack register that is not in block memory.


Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in the early 00s Altera had a completely different architecture that was called Nios.


I don't doubt that you did it, but answers like that smell of hand-waving.


Does it include single-cycle 32-bit shift/rotate by an arbitrary 5-bit count (5 variations: logical and arithmetic right shift, logical left shift, rotate right, rotate left)? Does it include zero-extended and sign-extended byte and half-word loads (fetches, in your language)? In my cores these two functions combined are the biggest block, bigger than the 32-bit ALU, and comparable in size with the result writeback mux. Also, I assume that your cores have no multiplier, right?

Reply to
already5chosen

On Thursday, February 14, 2019 at 8:38:47 AM UTC-5, snipped-for-privacy@yahoo.com wrote:


Of course, that depends on the application. In some cases it would simply not work correctly because it was designed into the rest of the logic, not entirely unlike an FSM. In other cases it would make the timing indeterminate, which means it would make it harder to design the logic surrounding this piece.


Yup, interrupts can be very bad. But if your requirements are to do one thing in software that has real-time requirements (such as service an ADC/DAC or a fast UART) while the rest of the code is managing functions with much more relaxed real-time requirements, using an interrupt can eliminate a CPU core or the design of a custom DMA with particular features that are easy in software.

There are things that are easy to do in hardware and things that are easy to do in software, with some overlap. Using a single CPU and many interrupts fits into the domain of not so easy to do. That doesn't make simple use of interrupts a bad thing.


The decoder has outputs y(i) = f(x(j)) where x(j) is all the inputs, y(i) is all the outputs, and f() is the function mapping inputs to outputs. If you have multiple states for instructions, the decoding function has more inputs than if you only decode instructions and whatever state flags might be used, such as carry or zero or an interrupt input.

In general this will result in more complex instruction decoding.
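To make that concrete, here is a toy decode process in VHDL; the opcodes and control outputs are invented for the sketch. The point is that y is a pure function of the instruction plus a few flags - adding a multi-cycle state counter would widen the input vector x and multiply the cases.

library ieee;
use ieee.std_logic_1164.all;

-- Toy decoder sketch: y(i) = f(x(j)) with x = {opcode, carry, zero}.
-- Opcode values and control outputs are made up for illustration.
entity decode is
  port (
    opcode      : in  std_logic_vector(3 downto 0);
    carry, zero : in  std_logic;
    alu_add     : out std_logic;
    stack_push  : out std_logic;
    take_branch : out std_logic);
end entity;

architecture comb of decode is
begin
  process(opcode, carry, zero)
  begin
    alu_add     <= '0';
    stack_push  <= '0';
    take_branch <= '0';
    case opcode is
      when "0001" =>                    -- ADD
        alu_add    <= '1';
        stack_push <= '1';
      when "1000" =>                    -- branch if zero
        take_branch <= zero;
      when "1001" =>                    -- branch if carry
        take_branch <= carry;
      when others => null;
    end case;
  end process;
end architecture;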


If by "easier" you mean possible, then yes. That's why they use pipelining, to achieve clock speeds that otherwise can't be met. But it is seldom simple since pipelining is more than just adding registers. Instructions interact, and on branches the pipeline has to be flushed, etc.


Nope, just talking about using multiple clock cycles for instructions. Using a variable number of clock cycles would be more complex in general, and multiple-length instructions even worse... in general. There are always possibilities to simplify some aspect of this by complicating some aspect of that.


Or unneeded. If the program fits in the on-chip memory, no cache is needed. What sort of programming are you doing?

If the jump instruction is one clock cycle and no pipeline, jump prediction is not possible I think.


Ok, not sure what that means. Every instruction takes one clock cycle. While a given instruction is being executed the next instruction is being fetched, but the *actual* next instruction, not the "possible" next instruction. All branches happen during the branch instruction execution, which fetches the correct next instruction.

This guy said I was pipelining the fetch and execute... I see no purpose in calling that pipelining since it carries no baggage of any sort.


There is nothing to prevent loading code into program memory. It's all one address space and can be written to by machine code. So I guess it's not really Harvard, it's just physically separate memory. Since instructions are not a word wide, I think the program memory does not implement a full word width... to be honest, I don't recall. I haven't used this CPU in years. I've been programming in Forth on PCs more recently.

Another stack processor is the J1, which is used in a number of applications and even had a TCP/IP stack implemented in about 8 kW (kB? kinstructions?). You can find info on it with a google search. It is every bit as small as mine and a lot better documented, and programmed in Forth while mine is programmed in assembly which is similar to Forth.


I guess this is where I disagree on the pipelining aspect of my design. I register the current instruction, so the memory fetch is in the previous cycle based on that instruction. So my delay path starts with the instruction, not the instruction pointer. The instruction decode for each section of the CPU is in parallel, of course. The three sections of the CPU are the instruction fetch, the data path and the address path. The data path and address path roughly correspond to the data and return stacks in Forth. In my CPU they can operate separately, and the return stack can perform simple math like increment/decrement/test since it handles addressing memory. In Forth everything is done on the data stack other than holding the return addresses, managing DO loop counts and user-specific operations.

My CPU has both PC-relative addressing and absolute addressing. One way I optimize for speed is by careful management of the low-level implementation. For example, I use an adder as a multiplexor when it's not adding. A+0 is A, 0+B is B, A+B is, well, A+B.
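A sketch of that adder-as-multiplexer trick (signal names made up): each operand is gated by a select, so the same carry chain delivers A, B, or A+B.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: one adder doubling as a 2:1 mux.  sel_a/sel_b decide which
-- operands participate: A+0 = A, 0+B = B, A+B = the sum.
entity add_mux is
  port (
    a, b  : in  unsigned(15 downto 0);
    sel_a : in  std_logic;               -- include A
    sel_b : in  std_logic;               -- include B
    y     : out unsigned(15 downto 0));
end entity;

architecture rtl of add_mux is
  signal a_g, b_g : unsigned(15 downto 0);
begin
  a_g <= a when sel_a = '1' else (others => '0');
  b_g <= b when sel_b = '1' else (others => '0');
  y   <= a_g + b_g;                      -- pass A, pass B, or add
end architecture;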


Indirect addressing??? Indirect addressing requires multiple instructions, yes. The return stack is used for address calculations typically, and that stack is fed directly into the instruction fetch logic... it is the "return" stack (or address unit, your choice) after all.


I haven't used those processors either.


Ok, whatever that means.


There are shift instructions. It does not have a barrel shifter, if that is what you are asking. A barrel shifter is not really a CPU. It is a CPU feature and is large and slow. Why slow down the rest of the CPU with a very slow feature? That is the sort of thing that should be external hardware.

When they design CPU chips, they have already made compromises that require larger, slower logic which requires pipelining. The barrel shifter is perfect for pipelining, so it fits right in.


I don't recall, but I'll say no. I do recall some form of sign extension, but I may be thinking of setting the top of stack by the flags. Forth has words that treat the word on the top of stack as a word, so the mapping is better if this is implemented. I'm not convinced this is really better than using the flags directly in the asm, but for now I'm compromising. I'm not really a compiler writer, so...


Sure, the barrel shifter is O(n^2) like a multiplier. That's why in small CPUs it is often done in loops. Since loops can be made efficient with the right instructions, that's a good way to go. If you really need the optimum speed for barrel shifting, then I guess a large block of logic and pipelining is the way to go.

I needed to implement multiplications, but they are on 24-bit words that are being shifted into and out of a CODEC bit-serially. I found a software shift-and-add to work perfectly well, no need for special hardware.

Bowman was using his J1 for video work (don't recall the details) but the Microblaze was too slow and used too much memory. The J1 did the same functions faster and in less code with generic instructions, nothing unique to the application if I remember correctly... not that the Microblaze is the gold standard.

By "cores" you mean CPUs? Core, actually - remember the interrupt: one CPU, one interrupt. Yes, no hard multiplier as yet. The pure hardware implementation of the CODEC app used shift and add in hardware as well, but new features were needed and space was running out in the small FPGA, 3 kLUTs. The slower, simpler stuff could be ported to software easily for an overall reduction in LUT4 usage along with the new features.

I don't typically try to compete with the functionality of ARMs with my CPU designs. To me they are FPGA logic adjuncts. So I try to make them as simple as the other logic.

I wrote some code for a DDS in software once as a benchmark for CPU instruction set designs. The shortest and fastest I came up with was a hybrid between a stack CPU and a register CPU where objects near the top of stack could be addressed rather than having to always move things around to put the nouns where the verbs could reach them. I have no idea how to program that in anything other than assembly, which would be ok with me. I used an Excel spreadsheet to analyze the 50 to 90 instructions in this routine. It would be interesting to write an assembler that would produce the same outputs.

Rick C.

Reply to
gnuarm.deletethisbit
32-bit RISC MCUs with 32 registers... do you have any actual devices in mind?

Hul

snipped-for-privacy@yahoo.com wrote:

Reply to
Hul Tytus

First, I don't like to answer top-posters. Next time I won't answer.

The discussion was primarily about soft cores. The two most popular soft cores, Nios2 and MicroBlaze, are 32-bit RISCs with 32 registers.

In "hard" MCUs there are MIPS-based products from Microchip.

More recently a few RISC-V MCUs have appeared. Probably more are going to follow.

In the past there were popular PPC-based MCU devices from various vendors. They are less popular today, but still exist. Freescale (now NXP) e200 core variants are designed specifically for MCU applications.


So, not the whole 32-bit MCU world is ARM Cortex-M. Just most of it ;-)

Reply to
already5chosen

I think the best way to get exact performance is to implement a multithreaded architecture. This is not the smallest CPU architecture, but the pipeline will run at very high frequency.

The Multithreaded architecture I have used has a classic three stage pipeline, fetch, decode, execute, so there are three instructions active all the time.

The architecture implements ONLY 1 clock cycle in each stage.

Many CPUs implement multicycle functionality by having state machines inside the decode stage. The decode stage can either control the execute stage (the datapath) directly, by decoding the instruction at the fetch stage output, or it can control the execute stage from one of several state machines implementing things like interrupt entry, interrupt exit etc.

The datapath can easily require 80-120 control signals, so each state machine needs to have the same number of state registers. On top of that you need to multiplex all the state machines together. This is a considerable amount of logic.

I do it a little bit differently. The CPU has an instruction set which is basically 16 bit + immediates. This gives room for 16 registers if you want to have a decent instruction set. 8 bit instruction and 2 x 4 bit register addresses.

The instruction decoder supports an extended 22-bit instruction set. This gives room for a 10-bit extended instruction set and 2 x 6-bit register addresses. The extended register address space is used for two purposes.

  1. To address special registers like the PSR
  2. To address a constant ROM, for a few useful constants.

The fetch stage can fetch instructions from two places.

  1. The instruction queue(2). The instruction queue only supports 16 bit instructions with 16/32 bit immediates.
  2. A small ROM which provides 22 bit instructions (with 22 bit immediates)

Whenever something happens which normally would require a multicycle instruction, the thread makes a subroutine jump (0 clock cycle jump) into the ROM, and executes 22 bit instructions.

A typical use would be an interrupt. To clear the interrupt flag, you want to clear one bit in the PSR.

The instruction ROM contains:

  ANDC PSR, const22   ; AND constantROM[22] with PSR.
                      ; ConstantROM[22] == 0xFFFFFEFF
                      ; Clear bit 9 (I) of PSR

To implement multithreading, I need a single decoder but multiple register banks, one per thread. Several special-purpose registers per thread (like the PSR) are also needed.

I also need multiple instruction queues (one per thread)

To speed up the pipeline, it is important to follow a simple rule: a thread cannot ever execute in a cycle if the instruction depends in any way on the result of the previous instruction. If that rule is followed, you do not need to feed back the result of an ALU operation to the ALU.

The simplest way to follow the rule is to never let a thread execute during two adjacent clock cycles. This limits the performance of a thread to max 1/2 that of what the CPU is capable of but at the same time, there is less logic in the critical path, so you can increase the clock frequency.
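A minimal sketch of that scheduling rule (the idea only, not A.P.Richelieu's design): a free-running round-robin thread counter hands the pipeline to each thread in strict rotation, so with two or more threads a given thread never occupies adjacent cycles and no ALU result forwarding is needed.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: strict round-robin thread selection for a barrel-style
-- multithreaded pipeline.  With N_THREADS >= 2 a given thread never
-- executes in two adjacent cycles.
entity thread_sched is
  generic (N_THREADS : positive := 4);
  port (
    clk, rst  : in  std_logic;
    thread_id : out unsigned(3 downto 0));  -- which thread fetches now
end entity;

architecture rtl of thread_sched is
  signal cnt : unsigned(3 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' or cnt = N_THREADS - 1 then
        cnt <= (others => '0');
      else
        cnt <= cnt + 1;
      end if;
    end if;
  end process;
  thread_id <= cnt;
end architecture;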

Now you suddenly can run code with exact properties. You can say that I want to execute 524 instructions per millisecond, and that is what the CPU will do.

You can let all the interrupts be executed in one thread, so you do not disturb the time critical threads.

The architecture is well suited for FPGA work since you can use standard dual port RAMs for registers.

I use two dual-port RAMs to implement the register banks (each has one read port and one write port). The writes are connected together, so you have in effect a register memory with 1 write port and 2 read ports.

If the CPU architectural model has, let's say, 16 registers x 32 bits, and you use 2 x (256 x 32) dual-port RAMs, you have storage for 16 threads: 2 x (16 CPUs x 16 registers x 32 bits). If you use 512 x 32 bit DPRAMs you have room for 32 threads.
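A sketch of that register-bank arrangement (widths and names are assumptions): two inferred dual-port RAMs with their write ports tied together behave as one register memory with 1 write port and 2 read ports, addressed by {thread, register}.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: 16 threads x 16 registers x 32 bits, built from two dual-port
-- RAMs whose write ports are tied together -> 1 write port, 2 read ports.
entity regbanks is
  port (
    clk              : in  std_logic;
    we               : in  std_logic;
    wthread, wreg    : in  unsigned(3 downto 0);
    wdata            : in  std_logic_vector(31 downto 0);
    rthread          : in  unsigned(3 downto 0);
    ra_reg, rb_reg   : in  unsigned(3 downto 0);
    ra_data, rb_data : out std_logic_vector(31 downto 0));
end entity;

architecture rtl of regbanks is
  type ram_t is array (0 to 255) of std_logic_vector(31 downto 0);
  signal ram_a, ram_b : ram_t;           -- two copies, written identically
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram_a(to_integer(wthread & wreg)) <= wdata;
        ram_b(to_integer(wthread & wreg)) <= wdata;
      end if;
      -- registered reads, one per source operand
      ra_data <= ram_a(to_integer(rthread & ra_reg));
      rb_data <= ram_b(to_integer(rthread & rb_reg));
    end if;
  end process;
end architecture;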

If you want to study a real example, look at the MIPS multithreaded cores.

They decided to build that after I presented my research to their CTO. They had more focus on performance than on real-time control, which is a pity. FPGA designers do not have that limitation.

AP

Reply to
A.P.Richelieu
