Custom CPU Designs

I've spent months working around such problems :( We have an application that pushes gigabytes through JTAG UARTs and have learnt all about it...

There's a pile of specific issues:

- the USB 1.1 JTAG is an FT245 chip which basically bitbangs JTAG; it sends a byte containing 4 bits for the 4 JTAG wires. The software is literally saying "clock high, clock low, clock high, clock low" etc. Timing of that is not reliable. Newer development boards have a USB 2.0 programmer where things are a bit better here, but it's still bitbanging.

- being USB 1.1, if you have a cheap USB 2.0 hub it may only have a single transaction translator (Single-TT), which means all USB 1.1 peripherals share 12Mbps of bandwidth. In our case we have 16 FPGAs all trying to chat using that shared 12Mbps bandwidth. Starvation occurs and nobody makes any progress. A better hub with Multi-TT will allow multiple 12Mbps streams to share the 480Mbps USB 2.0 bandwidth. Unfortunately when you buy a hub this is never advertised or explained.

- The software daemon that generates the bitbanging data is called jtagd and it's single threaded. It can max out a CPU core bitbanging, and that can lead to unreliability. I had an Atom where it was unusable. I now install i7s in servers with FPGAs, purely to push bits down the JTAG wire.

- To parallelise downloads to multiple FPGAs, I've written some horrible containerisation scripts that lie to each jtagd that there's only one FPGA in the system. Then I can launch 16 jtagds and use all 16 cores in my system to push traffic through the JTAG UARTs.

- Did I mention that programming an FPGA takes about 700MB? So I need to fit at least 8GB of RAM to avoid memory starvation when doing parallel programming (if the system swaps the bitbanging stalls and the FPGA programming fails)

- there are some troubles with jtagd and libudev.so.0 - if you don't have it things seem to work but get unreliable. I just symlink libudev.so.1 as libudev.so.0 on Ubuntu and it seems to fix it.

- the register-level interface of the JTAG UART isn't able to read the state of the input FIFO without also dequeuing the data on it. Writing reliable device drivers is almost impossible. I have a version that wraps the UART in a 16550 register interface to avoid this problem.

- if the FPGA is failing timing, the producer/consumer of the UART can break in interesting ways, which look a lot like there's some problem with the USB hub or similar.

It's a very precarious pile of hardware and software that falls over in numerous ways if pushed at all hard :(

Theo [adding comp.arch.fpga since this is relevant to those folks]

Reply to
Theo

Really it's the other way around. A typical programmer these days might not know how to implement a multitasker or OS on a bare machine, but they do know how to spawn processes and use them on a machine with an OS. Organizing a parallel or distributed program is much harder.

Reply to
Paul Rubin

Really? Multitasking is a lot more complex than just spawning tasks. There are potential conditions that can lock up the computer or the tasks. Managing task priorities can be a very complex issue and learning how to do that correctly is an important part of multitasking. In real time systems it becomes potentially the hardest part of a project.

Breaking a design down to assign tasks on various processors is a much simpler matter. It's much like hardware design where you dedicate hardware to perform various actions and simply don't have the many problems of sharing a single CPU among many tasks.

Do I have it wrong? Is multitasking actually simple and the various articles I've read about the complexities overstate the matter?

--
  Rick C. 

  +-- Get 1,000 miles of free Supercharging 
Reply to
Rick C

Multitasking isn't exactly simple, but we (programmers) are used to it by now. The stuff you read about lock hazards is mostly from multi-threading in a single process. If you have processes communicating through channels, there are still ways to mess up, but it's usually simpler than dealing with threads and locks.
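Just to illustrate the "processes and channels" point with a sketch (Python, all names made up): the worker owns its own state and the queues are the only interface, so there are no locks to get wrong.

```python
# Hypothetical sketch: a worker process that communicates only through
# queues, so there is no shared mutable state and nothing to lock.
import multiprocessing as mp

def doubler(inbox, outbox):
    # The worker owns its loop; the queues are the only interface.
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut down cleanly
            break
        outbox.put(item * 2)

if __name__ == "__main__":
    inbox, outbox = mp.Queue(), mp.Queue()
    worker = mp.Process(target=doubler, args=(inbox, outbox))
    worker.start()
    for n in (1, 2, 3):
        inbox.put(n)
    results = [outbox.get() for _ in range(3)]
    inbox.put(None)               # ask the worker to exit
    worker.join()
    print(results)                # [2, 4, 6]
```

You can still deadlock yourself with channels (two processes each waiting on the other's queue), but the failure modes are far fewer than with shared memory and locks.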

Reply to
Paul Rubin

Exactly, the mindset is to use multitasking... but it can still be complex. That's my point... what you are used to is what you use even when it's not the best approach.

Splitting a design to run on independent processors is just as easy if not more so because of the lack of sharing issues.

The stuff you are thinking of with distributed processing is when your application doesn't suit multitasking and it needs to be distributed over a lot of processors to speed it up. That's not the same issue at all as simply getting the job done. That's the sort of stuff they have problems with on supercomputers.

I think we've been down this road before.

--
  Rick C. 

Reply to
Rick C

Nah, it's mostly the same whether you have one processor or many. /If/ you follow one basic rule, that is.

One thing that people often get wrong is they try to control something or set a variable from different places in the code. They get this wrong with single-tasking software, and lose their overview of the control flow and data flow (often without knowing that it is lost). It becomes impractical or impossible to see that the code is actually correct, and you usually can't find your problems in testing.

A common "solution" to this is to ban global variables. This is a mere band-aid, and usually unhelpful - the problem exists whether you set variable "foo" directly or call "set_foo()".

And when you have multiple threads or tasks, the problem is bigger - it is not just different bits of the program that can get mixed up, they can be in different contexts too, and can be interrupted in the middle.

The answer is to think like a hardware designer - think separate modules that communicate by signals. If you have different hardware modules that all access a shared resource, you need a multiplexer or prioritising system, possibly with locks or gates. The same applies in software. A single hardware output can drive many inputs, but an input can only be driven by one output - again you need a multiplexer, combination gate, or other selection system to do otherwise. The same applies in software. Bidirectional or tristate signals can be useful to cut down on resources, but are much harder to get right. The same in software.
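The "single driver per signal" rule can be sketched in software (Python, names illustrative): only one owner thread ever writes the shared value, and a queue plays the role of the multiplexer/arbiter in front of it.

```python
# Sketch of the "hardware way": one owner thread is the only writer
# of the shared state; everyone else sends requests through a queue,
# which acts as the multiplexer/arbiter. All names are illustrative.
import threading
import queue

class Register:
    """Shared state with a single writer thread (the 'output driver')."""
    def __init__(self):
        self.value = 0
        self.requests = queue.Queue()

    def owner_loop(self):
        while True:
            delta = self.requests.get()
            if delta is None:            # sentinel: stop
                break
            self.value += delta          # only this thread ever writes

reg = Register()
owner = threading.Thread(target=reg.owner_loop)
owner.start()

# Many 'input' threads may send requests, but none writes reg.value.
writers = [threading.Thread(target=reg.requests.put, args=(1,))
           for _ in range(100)]
for t in writers:
    t.start()
for t in writers:
    t.join()
reg.requests.put(None)
owner.join()
print(reg.value)                         # 100
```

With 100 threads incrementing a plain shared counter directly you risk lost updates; with a single owner draining a queue the result is always 100, because the writes are serialised by construction rather than by discipline.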

If you have individually compiled programs for different cpus, it is harder to get this wrong - you have no choice but to have a clear interface between the parts. And if you use XC on XMOS, you have an advantage too - the tools won't let you set the same data from different virtual cores. (This used to be an irritating limit for cases when the programmer knows better than the tools about what is safe, but I believe this has improved.) And of course FPGA tools won't let you drive a signal from multiple sources.

Beyond that, you have mostly the same issues. Deadlock, livelock, synchronisation - they are all something you have to consider whether you are making an FPGA design, multi-tasking on one cpu, or running independent tasks on independent processors.

Task prioritising is an important issue. But it is not just for multitasking on a single cpu. If you have a high priority task A that sometimes has to wait for the results from a low priority task B, you have an issue to deal with. That applies whether they are on the same cpu or different ones. On a single cpu, you have the solution of bumping up the priority for task B for a bit (priority inheritance) - on different cpus, you just have to wait.

Reply to
David Brown

In British English, /anything/ can be used to sound vulgar! And the word "root" has several established meanings - most of them perfectly decent. (A common one is "support", as in "rooting for a football team".)

In Glaswegian, any word can be used as an adjective to mean "drunk". "I got absolutely rooted last night" - anyone from Glasgow will know exactly what you mean.

Reply to
David Brown

Multiple task priorities is too often used as a sticking plaster to cure livelock/deadlock problems - or more accurately /appear/ to cure them.

I much prefer to have two priorities: normal and interrupt, and then to have a supervisor of some sort which specifies which task runs at any given time.

Any such supervisor is probably application specific, coded explicitly, and can be inspected/debugged/modified like any ordinary task.

One I've used in the past is to have a "demultiplexer" which directs job fragments into different FIFOs, for processing by worker threads. That was naturally scalable and can be made "highly available", which is mandatory in telecom applications.

Reply to
Tom Gardner

A better solution, I think, is that it should not matter which task is running at any given time - because these tasks run to handle a specific situation then yield waiting for an event. Multiple priorities are convenient to express which tasks should be handled with lower latencies. I'm not keen on multiple tasks that have some kind of time-sharing switching between them, or round-robin pre-emptive multitasking. (Unless by "supervisor" here you just mean a "while (true)" loop that calls the tasks one after the other, which is fine.)
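That "while (true)" supervisor can be as plain as this sketch (Python, names made up): each task runs to completion for its event, then yields back to the loop.

```python
# Minimal sketch of the "big loop" supervisor: a plain loop that
# dispatches pending events in order; each handler runs to completion
# then yields. All names here are illustrative.
from collections import deque

events = deque(["button", "timer", "button"])
log = []

def handle_button():
    log.append("button handled")

def handle_timer():
    log.append("timer handled")

handlers = {"button": handle_button, "timer": handle_timer}

# The supervisor: dispatch whatever events are pending, in order.
while events:
    event = events.popleft()
    handlers[event]()       # task runs for its event, then yields

print(log)
```

Latency-sensitive events can be serviced from interrupts; everything else goes through a loop like this, and there is no pre-emptive switching to reason about.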

Reply to
David Brown

If there are many tasks with event(s) that have arrived, it becomes necessary to choose which to run. The choice of which to run is, of necessity, application dependent.

Applications can emphasise forward progress, mean or 95th percentile latency, earliest deadline first, priority, etc.

In one form or another, yes.

But the word "task" has too many different meanings in this context. Many would map a task onto a process, thread or fibre. That frequently leads to "suboptimal behaviour" sooner or later, and it can be a problem sorting out the cause and effect.

To expand on my previous comment below...

My preference is that an event is encapsulated in an object when it arrives. The scheduler puts the object in a FIFO, and worker threads (approx one per core) take the object and process the event.

Priority can be via multiple queues, and there are other obvious techniques to ensure other properties are met.

Reply to
Tom Gardner

I've never heard of livelock until this discussion. Reading about it I have no idea what the parallel would be in hardware design.

I don't even know of an example of deadlock in hardware design.

So clearly these are not such important issues in hardware design.

I expect the difference is that while hardware design uses signals (in fact that is the term for a "variable" in VHDL, not to be confused with a variable in VHDL which has limited scope and other limitations) it does not have resources to be allocated or seized or whatever the correct term is. If it does for synthesis, I don't know what that would be.

How is that an issue? Isn't task A stalled when it is waiting which allows task B to run?

I recall reading about priority issues such as "priority inversion" some time back. But not having written any multitasking software in a long time it has all vanished. I'm pretty glad of it too. It was just a lot of ugly stuff I thought. My biggest problem writing VHDL is trying to manage the verbosity. But then there are issues in the language I have simply internalized and don't think about.

I have thought many times about the contradiction of my liking VHDL, a rather restrictive (in theory) and verbose language, as well as a much simpler, easy and concise language as Forth. Coding for Forth is like working in my basement with hand tools. Coding in VHDL is like working on a sculpture for a museum. Never the twain shall meet.

--
  Rick C. 

Reply to
Rick C

I have no insight into the licensing contracts (which are likely very confidential), but what I understand is that all Stratix 10 parts have an ARM but relatively few have it enabled. Additionally I understand the licence cost is only paid for parts where it is enabled. From that I surmise that the licence cost is significant; if the cost was minimal then why have a separate SKU without the ARM?

One other possibility is that a separate SKU allows the ARM to be faulty and the part still saleable, but it seems that ballpark 80-90% of the eval boards I see are offering parts without ARMs. Which suggests there's a strong motivation not to use it.

I'm not sure RISC-V is at the level of maturity for baking a Cortex A53 equivalent into a critical product.

Theo

Reply to
Theo

They do the same thing with the FPGA itself. It is not inexpensive to spin the masks for FPGAs at the bleeding edge of semiconductor fabrication technology. So they sell parts with more or less of the part enabled or even just tested (testing cost in an FPGA is not inexpensive). So you buy an FPGA with 50,000 LUTs or you buy one with 25,000 LUTs and it's the same part. The 50,000 LUT part has the entire chip tested, the 25,000 LUT part only tests the section with 25,000 LUTs you will be using. They will get the price even lower if you are buying a large quantity and you give them your design, so they only test the parts of the chip your design uses!

So don't test the CPU and don't pay the license fee. Save some on the license and save more on not testing the CPU and various supporting logic.

I'm told if a chip fails a test, it is tossed. The savings comes from not testing a section to begin with. Testing equipment is not cheap and FPGAs take a lot of time on the beast.

Not sure what you are trying to say, but Microsemi is coming out with a RISC-V FPGA device family this year.

--
  Rick C. 

Reply to
Rick C

Livelock is when bits of a system are running fine, but the overall system is not making progress. Often it is temporary and resolves itself (unlike deadlock).

A hardware example might be if you have a crossbar for multiple bus masters to access a single slave device. You need some way of deciding which master gets access when both want it simultaneously. Perhaps you decide that master A is more important, and always gets priority. Then if master A hogs the bus, master B never gets a chance - livelock.
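The arbitration point can be shown with a toy model (Python, all names made up): with fixed priority, master B starves whenever A requests every cycle; alternating the preference after each grant keeps both moving.

```python
# Toy model of two bus masters sharing one slave through an arbiter.
# Fixed priority starves B when A always requests; a round-robin
# arbiter lets both make progress. Names are illustrative.

def fixed_priority(wants_a, wants_b):
    """Grant per cycle: A always wins when both request."""
    grants = []
    for a, b in zip(wants_a, wants_b):
        if a:
            grants.append("A")
        elif b:
            grants.append("B")
        else:
            grants.append(None)
    return grants

def round_robin(wants_a, wants_b):
    """Alternate who has priority after each grant."""
    grants, prefer_a = [], True
    for a, b in zip(wants_a, wants_b):
        if a and (prefer_a or not b):
            grants.append("A")
            prefer_a = False
        elif b:
            grants.append("B")
            prefer_a = True
        else:
            grants.append(None)
    return grants

always = [True] * 6   # both masters request the bus every cycle
print(fixed_priority(always, always))  # A every cycle: B starves
print(round_robin(always, always))     # A, B, A, B, ...: both progress
```

The same shape appears in software whenever a strict-priority scheduler meets a task that never yields.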

Often these kinds of things are straightforward to avoid as long as you think about what can happen. That applies to software and hardware.

The simplest way to get a deadlock is to have two shared resources, and two processes (hardware modules, software tasks, whatever) that need both the resources, but acquire them in different orders. But you don't usually get such simple cases, as they are so obvious.
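The usual cure for the two-resource case is a single global lock-acquisition order; a sketch (Python, names illustrative):

```python
# Sketch of the classic two-lock situation and its usual fix: both
# tasks need lock_x and lock_y, and the cure is to agree on a single
# global acquisition order (here: always x, then y). If one task
# took y first, each could end up holding one lock and waiting
# forever for the other - deadlock.
import threading

lock_x = threading.Lock()
lock_y = threading.Lock()
results = []

def task(name):
    # Both tasks take the locks in the SAME order, so whichever
    # loses the race simply waits; neither can hold y while
    # wanting x.
    with lock_x:
        with lock_y:
            results.append(name)

t1 = threading.Thread(target=task, args=("task1",))
t2 = threading.Thread(target=task, args=("task2",))
t1.start()
t2.start()
t1.join()
t2.join()
print(sorted(results))   # both tasks finish; no deadlock
```

The hardware analogue is fixing the order in which a module requests grants from two arbiters, rather than letting two modules request them in opposite orders.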

If your designs don't involve much in the way of shared resources, you're not going see them. (The same applies in software.) It is also perfectly possible that the way you design your systems, they naturally don't occur - or that you think of them as bugs, hangs, blocks or stops rather than as "deadlocks". (The same applies in software.) It is also possible, though of course /highly/ unlikely, that your systems /do/ have the risk of deadlocks and they just haven't happened yet. (The same applies in software.)

You can have shared resources in hardware too.

It is perhaps fair to say that the way you design hardware makes shared resources stand out a bit more - you have explicit sharing with cross-switches, multiplexors, etc. And that might mean that deadlock-free solutions are mostly so obvious that you don't see them as a potential problem. I discussed previously about thinking the "hardware way" for shared data in software development - that also makes it very easy to avoid deadlock.

Yes. But it means task A - the high priority task - can't be completed as fast as you had wanted.

I program mostly in C and Python. It's hard to pick two software languages that are further apart - so I understand what you mean here.

Reply to
David Brown

That would make it different from many other large, complex parts where disabling failed sections and even having redundant parts in the design increase overall yields and lower costs. But I guess it depends on a balance between yields, types of failure, and testing costs.

Reply to
David Brown

There are similar multi-core priority-bumping schemes, but they are more complex and have more overhead, of course.

If there are no other tasks on this core, there is no extra delay for A -- it waits just long enough for B to complete the/a result for A.

The real problem occurs when there is some other task C, _not_ logically connected to A or B, with a priority higher than B but lower than A. Task C can then delay B, and therefore also A, for whatever duration C runs. As David said, this priority inversion can be solved by temporarily increasing B's priority to A's priority until B has executed far enough to let A continue.

In the Ada language, the "protected object" inter-task communication mechanism implements this priority juggling automatically. The programmer only has to define the basic task priorities and the "ceiling priorities" for the protected objects.

--
Niklas Holsti 
Tidorum Ltd 
Reply to
Niklas Holsti

Traditional industrial protocols, like Profibus and Modbus (with all their variants), have a quite high overhead. Thus, if a slave only wants to communicate a few bits or a single byte over the network, it will suffer a very low transfer efficiency using standard protocols.

With EtherCAT, it is possible to make very small nodes with only a few bits added/removed from the frame circulating around the industrial plants with only a few bit time additional propagation delay in each node. So it looks good.

However, EtherCAT nodes are still quite expensive, and adding/dropping only a few bits in each node doesn't make economic sense.

What is the point of using multicore processors, if a single core can perform the basic EtherCAT node functionality? You can't cut the multicore chip and distribute it to multiple physically separate nodes :-).

In addition, if there are dozens of series connected twisted pair connectors, what is the electromechanical reliability of each connection? A single fault will prevent the Ethernet frame circulating back to the master.

I much prefer a dual-layer approach, with CANbus (or CAN FD) up to a few meters transferring a few bits or a byte or two around the CAN bus, and using concentrator nodes which communicate to higher level systems using some traditional protocols, transferring perhaps 100 bytes in a single transaction.
Reply to
upsidedown

For something like simple digital I/O, you don't need a uController at all, the Beckhoff ET1100 EtherCAT controller can act as a stand-alone slave device.

What if you also want to run a web server and some other heavy-duty, encrypted, protocols under Linux in your EtherCAT slave? The most practical way to do that is with something like the Renesas RZ/N1D which has an EtherCAT controller, a Cortex M3 optimized for real-time stuff, and a couple Cortex A7 cores for running Linux. [There are other vendors with similar multi-core uControllers.]

If single point of failure is an issue, then you can connect the EtherCAT devices in a loop to get some redundancy.

--
Grant
Reply to
Grant Edwards

Ok, but that is very simple and doesn't sound like an issue that is very hard to deal with. In general I would not even think of it as a category of problems I need to think about. It's just an obvious issue in a given design. There are many of those.

That was my point, I've never designed hardware that had "resources" to allocate. Maybe my designs are just too simple. I do like to keep things simple when I design. I've never not been able to do that when designing hardware.

I'm familiar with deadlock from software design, just don't see it in hardware so far.

I understand the issues and both my hardware and software don't have such problems. My software has never been multitasking. My hardware has always had simple relationships. I guess I just don't design complicated things.

Or just not using multiple tasks accessing the same data. I've never found the need.

If task A is waiting for task B and you don't like the delay, that's bad design. If it has to wait on task B by definition of the problem, then that's a limitation you have to live with. This isn't deadlocking unless task B is also waiting on task A. If you have this problem then you have not decomposed the problem correctly.

--
  Rick C. 

Reply to
Rick C

If you think about it a bit you will see the only real way to have "redundancy" in FPGAs is to excise entire sections of the chip for a single failure. So a 50 kLUT chip will become a 25 kLUT chip if it has a failure(s) in one half. That's all I've heard of. Trying to replace a small section of a chip to retain the full functionality would result in uneven delays and that's a real problem in FPGAs.

--
  Rick C. 

Reply to
Rick C
