Lattice Announces EOL for XP and EC/P Product Lines

I think you mean Lattice offers a part in the QFN32. I only found the XO2-256. A selection guide that isn't even a year old says they offer an iCE40 part in this package, but it doesn't show up in the data sheet from 2013. I guess the product line is still a little schizo. They haven't finished cleaning house and any part or package could be next.

That's a pretty hard "push". I've looked at them, but I don't get a warm fuzzy from a company that makes everything an uphill climb while seeming to think they are making it easy. I've been looking at the PSoC parts since they were new. At that time support was so crude that they held a weekly conference call, and if you joined in you got a one-on-one training session. That progressed through a long development effort aimed at making their parts push-button, and I am pretty sure that won't even come close to working for my needs. I need a small FPGA, maybe 1000 LUTs, to provide the high speed portion of the design. I don't even need a "real" processor; I bet I could live a rich full life (in this design anyway) with an 8051. In fact, that is an option: add an MCU for most of the I/O and processing, then use something like the XO2-256 in a QFN32 to do the high speed stuff. I'm just not sure I can fit the design in 256 LUTs. Maybe the QFN32 is small enough I can use two? 5x5 mm!

Yeah, while the FPGA guys are rather phobic about issuing a lot of package combinations, the MCU folks have tons of them. They have a much tougher problem with all the combos of RAM, Flash, I/O count, clock speed, ... I can see why the FPGA people haven't embraced the idea of combining an MCU with an FPGA; it just doesn't fit their culture.

--

Rick
Reply to
rickman

If the size of the NIOS2 is as small as you say, then that only leaves two issues with using the NIOS2 in my FPGA designs. The first is that I don't need 32-bit data paths, nor the large memory address bus that goes with them. I assume this also means the instructions are not very compact, using more on-chip memory than desired.

But the really big issue with using the NIOS2 is not technical, Altera won't let you use it on anything that isn't an Altera part. So in reality this is a non-starter no matter how good the NIOS2 is technically.

--

Rick
Reply to
rickman

The part code for this is ICE40LP384-SG32. It's showing on price lists, but still with 0 in the stock column. Mouser says 100 are due on 9/30/2013.

Try it and see. I found the XO2-256 seems to pack full quite well, and the tools are OK to use, so you can find out quite quickly. I did a series of capture counters in the XO2-256, and once it worked, I increased the width to fill more of the part. IIRC it got into the 90%+ range with no surprises.

I've been meaning to compare the ICE40LP384 with the XO2-256; since the iCE40 cell is more primitive, it may not actually fit more.

-jg

Reply to
jg

out of 18K memory bits only 2K bits used). It's hard to translate exactly into old-fashioned LUTs, but I'd say - around 700.

a 32-bit CPU - very easy to program in C.

ing complex tightly-coupled logic with high internal complexity to fanout ratio. May be, a bit less, when implementing simple things with lots of registers and high fanout.

the old-fashioned architecture - 676 LCs + 2 M9Ks.

Nios2e into Cyclone2. It is even smaller at 565 LCs + 2 M4Ks.

or fabric, could be an interesting exercise, well suitable for coding competition. But, probably, illegal :(

es then you can design useful 32-bit RISC CPU which would be non-trivially smaller than 600 LCs.

It implements full Nios2 architecture including several parts that you probably don't need. In particular:

SRAM

Nios2e is small. And slow. Nios2s and Nios2f aren't small.

Yes, Nios2 code density is poor. About the same as MIPS32, may be, just a little bit better. Similar to PPC. Measurably worse than "old" ARM. More than 1.5x worse than Thumb2.

I don't understand why. If you code in C then porting the non-hardware-specific parts of your code from Nios2 to any other little-endian 32-bit processor with octet-addressable memory will take very little time. Much, much less than porting the hardware-specific parts of code from, say, one ARM-Cortex SoC or MCU to another ARM-Cortex SoC or MCU. If you thought about it in advance, then even porting to a big-endian 32-bitter is a non-issue. After all, we are talking about a few KLOCs, at worst a few tens of KLOCs. Unless you code in asm, the CPU-related part of porting sounds like an absolute non-issue. Esp. if you use gcc on both of your targets.

Or, may be, you wanted to say that Nios2 is unsuitable if your original design is not based on an Altera FPGA? That, of course, is true. But, then again, why would you *want* to use Nios2 outside of the Altera realm? Other vendors have their own 32-bit soft core solutions. I didn't try them, but would think that in most aspects their solutions are similar to Nios2. Or, as in the case of Microsemi, they have a licensing agreement with ARM which makes Cortex-M1 affordable for low volume products.

In any case, unless the volumes are HUGE, "roll your own soft core" does not sound to me like the right use of a developer's time. The only justification for it that I can see is personal enjoyment.

Reply to
already5chosen

Slow is a relative term. I expect NIOS was designed around the instruction set rather than around the implementation. From your description the s and f versions burn logic to get speed while the e version is the minimum hardware that can do the job. This is not my idea of how to make an embedded core.

I would take the approach of designing a CPU which uses minimal resources as part of its architecture and uses an instruction set that is adequate and efficient rather than being optimized for a language. I am accustomed to writing assembly language code and even micro code for bit slice processors.

I can tell by the terms you use that you are thinking in terms of C programming and larger code bases than what I typically do. In particular the code for this job would be not far removed from the hardware and in fact would need to be written to work very efficiently with the hardware to meet the hard, real time constraints involved. This is not your typical C program.

Yes, you are thinking along very different lines than I am. The idea is not to port the code, but to port the processor. Then there is virtually no work involved other than recompiling the HDL.

Probably not even a single KLOC, lol. All I am doing is replacing some hardware functions with software. Use the ALU and data paths of the CPU to replace the logic and data paths of dedicated hardware. Not tons of work but the timing is important. So once it is written and working and more importantly, verified, I want to never have to touch the code again, just as if it were hardware (well, gateware). So the processor would need to be ported to whatever device this is implemented in.

A CPU design can be as hard or as easy as you want. If you must have C support there is the ZPU, which was designed explicitly for that, but I don't think it is a good match for deterministic real-time apps. I have worked on a couple of versions of a stack-based processor design which is reasonably efficient. I have some new ideas for something a bit more novel. We'll see what happens. This is all due to the EOL from Lattice; we have until November to get a last-time buy in, and a new design won't be needed until those parts are used up. So I've likely got a year or so.

--

Rick
Reply to
rickman

Just goes to show, you have to keep up on the data sheets. They just released a new one last week, 8/22/2013. This one includes the 32-pin QFN. Still, it is the poor stepchild of the family, with no memory at all other than the FFs. Actually, I looked back through my history of data sheets and I must have had a brain cramp; they all show the QFN32.

I have been looking at these parts for some time and I never realized they don't include distributed RAM using the LUTs. This part was not designed by Lattice, so I guess this may still be covered by patent. Lattice has a license on many Xilinx-owned patents because they bought the ORCA line from Lucent, who had gotten all sorts of licensing from Xilinx in a weak moment. Not that this has hurt Xilinx much, but it is so out of character for them. I'll never understand why they licensed their products to Lucent. Maybe some huge customer required a second source for the 3000 and 4000 series. Or maybe it was just a huge wad of cash Lucent waved under their noses. Likely we'll never know.

The point is I'm not nearly as enamored with the iCE40 parts as I was a year ago. They dropped the 600 LUT member of the family and replaced it with this 384 LUT member. At the same time they raised the quiescent current spec for the 1k part from 40 uA to 100 uA. The entire iCE65 product line was dropped (which had even lower static current). They just can't seem to pick a direction and stick with it.

"Try it" is not so simple. The existing design is all logic. To "try it" requires repeating the design with a dichotomy of slow functions in software, fast functions in hardware and interfaces which will allow it all to function as a whole. It's not a huge project, but some of the functions (like a buffer size controlled FLL) might be a bit tricky to get right in software and may need to remain in gateware. Without block RAM this is hard. The beauty of doing it all in the FPGA is that the entire design can be run in one VHDL simulation. If the processor were integrated into the FPGA, then we are back to a single simulation, schweet!

I'll more than likely go with one of the BGA packages, possibly the BGA256 because of the large ball spacing. This gives fairly relaxed design rules to the PCB. That then opens up the possibilities to a wide range of very capable parts. We'll see...

--

Rick
Reply to
rickman

Also of note, the ICE40 Block RAM's two ports consist of one read-only port, and one write-only port; vs. the two independent read+write ports of many other FPGA families.
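
That write-side/read-side split is essentially the "simple dual-port" access pattern. As a minimal Verilog sketch of that style (module name, widths and depth here are illustrative only, not taken from any iCE40 data sheet):

// Hypothetical sketch: one write-only port, one read-only port,
// the access pattern that maps onto a read-port/write-port block RAM.
module sdp_ram #(
    parameter DW = 8,
    parameter AW = 9
) (
    input                   wclk,
    input                   we,
    input      [AW-1:0]     waddr,
    input      [DW-1:0]     wdata,
    input                   rclk,
    input      [AW-1:0]     raddr,
    output reg [DW-1:0]     rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge wclk)
        if (we) mem[waddr] <= wdata;    // write-only side

    always @(posedge rclk)
        rdata <= mem[raddr];            // read-only side
endmodule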

I'd reckon AT&T/Lucent had a large semiconductor patent portfolio with which to apply strategic "leverage" for a favorable cross-licensing agreement.

As a yardstick, a system build for my homebrew RISC, including 4 Kbyte BRAM, UART and I/O, fits snugly into one of the 1280 LUT4 XO2 devices:

   Number of logic LUT4s:        890
   Number of distributed RAM:     66 (132 LUT4s)
   Number of ripple logic:       110 (220 LUT4s)
   Number of shift registers:      0
   Total number of LUT4s:       1242

   Number of block RAMs:  4 out of 7 (57%)

The core proper (32 bit datapath, 16 bit instructions) is currently ~800 LUT4 in its default configuration. [ I miss TBUF's when working on processor datapaths. ]

I don't have the XO2 design checked in, but the similar XP2 version is in the following code repository, under trunk/hdl/systems/evb_lattice_xp2_brevia :

formatting link

The above is still very much a work-in-progress, but far enough along to use for small assembly projects ( note that interrupts are currently broken ).

-Brian

Reply to
Brian Davis

The iCE family of products has a number of shortcomings compared to the larger parts sold elsewhere, but for a reason: the iCE lines are very, very low power. You can't do that if you have a lot of "fat" in the hardware. So they cut to the bone. This is not the only area where the parts are a little short. The question is how much does it matter? For a long time I've heard how brand X or A or whatever is better because of this feature or that feature. So the iCE line has few of these fancy features; how well do designs actually work in them?

Possible, but I don't think so. Any number of folks could have had semiconductor patents and no one else got anything like this. I would speculate that Xilinx needed a second source for some huge customer or maybe they were at a critical point in the company's growth and just needed a bunch of cash (as opposed to cache). Who knows?

The trick to datapaths in CPU designs is to minimize the number of inputs onto a "bus", which is implemented as multiplexers. Minimizing inputs gains speed and minimizes logic. When possible, put the muxes to good use inside some RAM on the chip. I got sidetracked on my last iteration of a CPU design, which was going to use a block RAM as the "register file" and stack in one. Since then I've read about some other designs which use similar ideas, although not identical.
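
As a rough, hypothetical sketch of that combined register-file-plus-stack idea (all names, widths and the address split are made up for illustration, not a worked design), the point being that the address mux feeds a single RAM port instead of building a wide bus mux out of LUTs:

// Hypothetical sketch: one block RAM holds both the register file and
// the stack; an address mux selects between a register number and the
// stack pointer, so the "bus mux" lives inside the RAM port.
module regs_and_stack #(
    parameter DW = 16,
    parameter AW = 8              // 256 words: 16 register slots plus stack space
) (
    input               clk,
    input               we,          // register write (when sel_stack = 0)
    input               sel_stack,   // 0: register access, 1: stack access
    input               push,        // valid when sel_stack = 1
    input               pop,         //   "
    input      [3:0]    regnum,
    input      [DW-1:0] wdata,
    output reg [DW-1:0] rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];
    reg [AW-1:0] sp = 16;            // stack grows upward, above the register slots

    always @(posedge clk) begin
        if (sel_stack) begin
            if (push) begin
                mem[sp] <= wdata;           // push: write at sp, then bump it
                sp      <= sp + 1'b1;
            end else if (pop) begin
                rdata   <= mem[sp - 1'b1];  // pop: read the last pushed word
                sp      <= sp - 1'b1;
            end
        end else begin
            if (we) mem[regnum] <= wdata;   // register file access
            rdata <= mem[regnum];
        end
    end
endmodule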

Why did you roll your own RISC design when each FPGA maker has their own? The Lattice version is even open source.

--

Rick
Reply to
rickman

There was definitely a second source in the XC3000 days, first from MMI (bought by AMD), later AT&T; but I don't remember there being anyone second-sourcing the XC4000.

IIRC, as Xilinx introduced the XC4000, AT&T went their own way with the ORCA, with similar features (distributed RAM, carry chains), but using the Neocad software.

My speculation is that at this juncture, AT&T leveraged rights to the Xilinx FPGA patents.

Back in 1995, the AT&T press release responding to the Neocad acquisition was re-posted here:

formatting link

and stated:

" When AT&T Microelectronics decided not to second source
" the Xilinx 4000 family of FPGAs, we accelerated the
" introduction of the ORCA family.

-----------------

Yes, that's why I miss the TBUF's :)

In the XC4000/Virtex days, the same 32 bit core fit into 300-400 LUT4's, and a good number of TBUF's.

The growth to ~800 LUT4 is split between the TBUF replacement muxes and new instruction set features.

When the YARD core blinked its first LED in 1999, there wasn't much in the way of free vendor RISC IP.

Being a perpetually-unfinished spare-time project, I never got the loose ends tidied up enough to make the sources available until recently.

At the initial announcement, yes; but when I looked a couple years ago, the Lattice Mico source files had been lawyered up with a "Lattice Devices Only" clause, see the comments on this thread:

formatting link

-Brian

Reply to
Brian Davis

Yes, that is what we are discussing. Why did *Xilinx* give out the family jewels to Lucent? We know it happened, the question is *why*?

My understanding is that TBUFs may have been a good idea when LUT delays were 5 ns and routing was another 5 to 10 between LUTs, but as they made the devices denser and faster they found the TBUFs just didn't scale the same way; in fact, the speed got worse! The capacitance being driven didn't go down much, and the TBUFs had to shrink with the process, which meant they had less drive. So they would have actually gotten slower. No, they are gone because TBUFs just aren't your friend when you want to make a dense, fast chip.

Ok, that makes sense. I rolled my first CPU around 2002 and, like you, it may have been used, but still is not finished.

Oh, that is a horse of a different color. So the Lattice CPU designs are out! No big loss. The 8 bitter doesn't have a C compiler (not that I care) and good CPU designs are a dime a dozen... I guess, depending on your definition of "good".

--

Rick
Reply to
rickman

rickman wrote: [snip]

I think TBUFs went away along with "long lines" due to capacitive delay, as you noted. Modern FPGAs use buffered routing, and tristates don't match up with that sort of routing network since the buffered routes become unidirectional. The silicon for line drivers is now much faster than routing prop delays, making the buffered network faster than a single point driving all that line capacitance. So the new parts have drivers in every switch box instead of just pass FETs. I think the original Virtex line was the first to use buffered routing, part of the Dyna-Chip acquisition by Xilinx. They still had long lines and TBUFs, but that went away with Virtex 2.

--
Gabor
Reply to
GaborSzakacs

(snip)

That is probably enough, but it is actually worse than that.

At about 0.8 micron, the wiring has to use a distributed RC model.

Above that, you can treat it as driving a capacitor with a current source. All points are, close enough, at the same voltage, and the only thing that matters is what that voltage is. (LC delay is pretty low.)

Below 0.8 micron, besides the fact that the lines are getting longer, the resistance is also significant. It is then modeled as series resistors and capacitors to ground, all the way down the line. (As well as I remember, the inductance is less significant than the resistance, but I haven't thought about it that closely for a while now.)
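
A rough first-order way to put numbers on that (my approximation and notation, not glen's): for a line of length L with per-unit-length resistance r and capacitance c, driven through an effective driver resistance R_drv,

\[
  t_{\text{lumped}} \approx R_{\text{drv}}\, c\, L
  \qquad\text{vs.}\qquad
  t_{\text{distributed}} \approx R_{\text{drv}}\, c\, L + \tfrac{1}{2}\, r\, c\, L^{2}
\]

The quadratic term is why long, unbuffered lines (and the TBUFs driving them) stopped scaling once the wire resistance mattered.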

-- glen

Reply to
glen herrmannsfeldt

I appreciate the rationale. Yet still I miss their functionality for processor designs. [ "Lament of the TBUF" would make an excellent dirge title ]

I think I once read that the last generation or few of TBUF's were actually implemented with dedicated muxes/wired OR's, or something similar.

I wish that had been continued on a reduced scale, TBUF's every 4 or 8 columns, matching the carry chain pitch, spanning some horizontal fraction of a clock region.

-Brian

Reply to
Brian Davis

(snip)

As far as I know, they are still implemented by the synthesis tools as either OR or AND logic. I don't know any reason to remove that ability, as it doesn't depend on the hardware. Then again, it isn't hard to write the logic directly.
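
For example, a minimal sketch of writing that logic directly (hypothetical names, and assuming the enables are one-hot or all zero):

// Each "driver" is ANDed with its enable and the results are ORed,
// which is the logic synthesis infers for an internal TBUF-style bus.
module andor_bus #(
    parameter DW = 8
) (
    input  [DW-1:0] src_a, src_b, src_c,
    input           en_a,  en_b,  en_c,   // assumed one-hot (or all zeros)
    output [DW-1:0] bus
);
    assign bus = ({DW{en_a}} & src_a) |
                 ({DW{en_b}} & src_b) |
                 ({DW{en_c}} & src_c);
endmodule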

-- glen

Reply to
glen herrmannsfeldt

We do this now in Verilog - declare our read data bus (and similar signals) as "wor" nets. Then you can tie them all together as needed. Saves you the hassle of actually creating/managing individual return data signals, and muxing it all.

The individual modules must take care to drive 0's on the read_data when not in use. Then you're really creating multi-source signals (like past bus structures), but relying on the "wor" to resolve the net.

Works in Xilinx XST and Synplicity. Don't know about others. Don't know if this trick would work in VHDL.
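
A minimal sketch of the scheme, with hypothetical module and signal names (not from Mark's actual design):

// Each peripheral drives zeros on its read data when not selected,
// and the top-level "wor" net resolves the multiple drivers by ORing.
module periph_regs (
    input         sel,            // address decode hit for this module
    input  [31:0] local_data,
    output [31:0] read_data       // hooked to a "wor" net at the top level
);
    assign read_data = sel ? local_data : 32'h0;
endmodule

module top_bus_example (
    input         sel0, sel1,
    input  [31:0] data0, data1,
    output [31:0] cpu_read_data
);
    wor [31:0] read_bus;          // multiply-driven net, resolved by OR

    periph_regs u0 (.sel(sel0), .local_data(data0), .read_data(read_bus));
    periph_regs u1 (.sel(sel1), .local_data(data1), .read_data(read_bus));

    assign cpu_read_data = read_bus;
endmodule

Since the wired-OR does the combining, adding another peripheral is just another instance driving read_bus; the explicit return-data mux never has to be written.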

--Mark

Reply to
Mark Curry

(snip, I wrote)

I think you can also do it with traditional tri-state gates, but it comes out pretty much the same: AND with the enable, and then onto the WOR line.

I can usually read VHDL but don't claim to write it.

-- glen

Reply to
glen herrmannsfeldt

The standard data type (std_logic) is tri-statable in VHDL, so that would be the preferred choice, rather than WAND or WOR. It does come in handy in that a single bidirectional port in RTL can represent both input and output wires, and part of the mux, at the gate level.

Tri-state bidirectional ports allow distributed address decoding in the RTL (give a module the address and a generic to tell it what addresses to respond to), even though at the gate level it will all get optimized together at the muxes.

Some synthesis tools can even "register" tri-state values to allow you to simplify pipelining in the RTL. Synthesis takes care of the details of separating out the tri-state enable from the data, and registering both appropriately.

Andy

Reply to
jonesandy
