Looking for ARM system with RTOS

Bruce Varley · 2013-01-01T01:54:30+00:00

I'm right at the end of my tether in trying to move some real time stuff (MIDI) to Linux. The ARM-9 platform I've chosen, while very nice in other ways, uses a proprietary FPGA interface for all the port inputs, which makes it next to impossible to strip out the OS and replace it with something else, or to go right back to CPU-side code cutting (which is what I've always done in the past). My testing has shown that the Linux platform is right on the limit in terms of speed, and kernel interrupts could easily sink the whole show.

Has anyone got a suggestion for a combination of an ARM SBC platform and an RTOS (preferably free or hobby-level cost) that I could adopt? I've followed the many comments in this newsgroup and others for a while, and also done a lot of browsing and struggling with selection lists, but the supplier websites for ARM seem to be singularly hard to get into unless you're a guru, which I'm not.

I need:
o CPU clock 200MHz or higher.
o 2 serial ports, with access to the logic level lines on at least one (LV OK).
o USB support. Socket support also would be nice, not essential.
o Some sort of file system.
o Guaranteed turnround of 10ms, even lower would be nice. My ARM Linux won't do better than 20.

I don't need:
o ADCs or DACs or audio in/out.
o Monitor/graphics output.

Suggestions that don't turn out will remain appreciated. Any ARM system with RTOS will be a nice addition to my kit. We all have a few in our bits box. Or a website that would facilitate my search. Any input would make my day.
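
For scale, the 10 ms turnround target can be set against the MIDI wire rate. The following is a minimal sketch, assuming the standard MIDI serial format (31250 baud, one start bit, eight data bits, one stop bit) and a three-byte message such as Note On. The function and constant names are illustrative only and do not come from the original post.

```python
MIDI_BAUD = 31_250      # standard MIDI serial rate, bits per second
BITS_PER_BYTE = 10      # 1 start bit + 8 data bits + 1 stop bit


def wire_time_us(num_bytes: int) -> float:
    """Time needed to clock num_bytes over a MIDI serial link, in microseconds."""
    return num_bytes * BITS_PER_BYTE * 1_000_000 / MIDI_BAUD


def messages_per_window(window_ms: float, bytes_per_message: int = 3) -> int:
    """How many back-to-back messages of the given size fit in the window."""
    return int(window_ms * 1000 // wire_time_us(bytes_per_message))


if __name__ == "__main__":
    print("one byte:", wire_time_us(1), "us")
    print("3-byte message:", wire_time_us(3), "us")
    print("messages in 10 ms:", messages_per_window(10))
    print("messages in 20 ms:", messages_per_window(20))
```

One byte occupies the wire for 320 microseconds and a three-byte message for 960 microseconds, so a 10 ms window spans roughly ten back-to-back messages and a 20 ms window about twenty.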

Mark Borgerson 13 years ago

ADSP-21XXX part you listed is floating point. 21XX and 21XXX are also both a bit long in the tooth.

ADSP-BF592 @ 200MHz is ~$6 qty. 1 at digikey. TI also has some fixed point DSP 'controllers' that are similar.

OK. I'm not familiar enough with Analog Devices DSPs to know the difference. I just searched for ADSP-21 on digikey and didn't get any of the fixed point units when selecting for 50MHz units. A return visit showed some ADSP-21xx units, but still at $18 to $20.

I'm also not familiar enough with DSP to know whether the 32-bit FPU on the STM32F4xx series would give you any advantage over the fixed-point DSP chips. IIRC, the Cortex-M4 does have some SIMD instructions useful for DSP work, but I don't know if they use the FPU or not.

Mark Borgerson

Jon Kirwan 13 years ago

The ADSP-21xxx is not even close to the ADSP-21xx and I wasn't using the ADSP-21061KSZ. It was an ADSP-2111 and ADSP-2105. They were MUCH cheaper at the time (circa early 1990s) and the competition elsewhere was effectively zero. Since then there are many more options and many more players and the ADSP-21xx processors I was using probably aren't even available (much, if at all.) If I were doing this today, I'd pick something else.

There was NO floating point on the units I used. A nice barrel shifter (combinatorial, one-cycle) though and I used it for writing my own floating point. Power consumption was quite low --- for the time.

Jon

Jon Kirwan 13 years ago

ALL FPUs are HUGE and consume LOTS OF POWER. You pay in die space, which reduces yield, increases cost, and burns power whether or not you need the FP at the moment. You pay for the beast every single cycle, need it or not.

The wonderful and brilliant idea behind the ADSP-21xx (an integer cpu from top to bottom) was its support for FP in the form of specialized integer ALU functionality. You pay a LOT less if you are writing what amounts to your own FP microcode and have specialized units for the purpose. The main unit they provided was the combinatorial barrel shifter (and a MAC.) You pay a LOT LESS for those two on every cycle, and they take up so much less die space, too.

Besides all that, they have other useful abilities for some applications that no FP unit designer would consider making available in a hardware FP -- they are focused on providing an easy to use FP unit. But actually, the raw guts underneath the hood of an FP unit have purposes OTHER than FP, too. But they don't give you direct access to any of that because that isn't their market. So you pay for a burdensome, massive die with LOTS of under-the-hood functional units to get the job done, and you DO NOT get access to it in raw form so that you can take other advantage of it.

The ADSP-21xx simply exposed a couple of bare-bones bits, kept it small, and let you write the "firmware" you want.

When doing an FFT for example, there are some optimizations to the process that you CANNOT DO with a floating point unit but CAN DO when you have the raw pieces needed to make one, which allow you to perform very fast FFTs.

It was a good idea.

The reason it just isn't done much is that the only clients for such a beast are programmers who have thorough numerical methods experience and are very good at math and writing FP microcode. Which is a TINY market, as they found out.

But the concept, for those of us who CAN do those things, is fantastic.

Jon

Mark Borgerson 13 years ago

HUGE and LOTS OF POWER are relative---especially on chips with 1MB of flash and 192KB of RAM, high and low speed USB, ethernet and camera interfaces. As for increasing cost, the STM32F405 with FPU is within a dollar of the price of the STM32F205 without FPU.

I guess power is also relative. The STM32F405 at full speed uses about 100mA at 3.3V. Shut off most of the peripherals, and the power goes down by half.

The STM32F205, without FPU and at 120MHz instead of 168MHz, uses about 80mA under the same conditions. So it looks like the increase in power to get the FPU is about 25%--but you get about 40% higher clock speed as well.
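
The comparison above can be put on a per-MHz footing. This is a rough sketch using only the round figures quoted in the post (100mA at 168MHz for the STM32F405, 80mA at 120MHz for the STM32F205); they are forum estimates, not datasheet values, and the helper names are made up for illustration.

```python
def ma_per_mhz(current_ma: float, clock_mhz: float) -> float:
    """Supply current normalised by clock frequency."""
    return current_ma / clock_mhz


def relative_increase(base: float, new: float) -> float:
    """Fractional increase going from base to new (0.25 means +25%)."""
    return (new - base) / base


# Figures as quoted in the post above (approximate, same test conditions assumed).
F205 = {"current_ma": 80.0, "clock_mhz": 120.0}   # no FPU
F405 = {"current_ma": 100.0, "clock_mhz": 168.0}  # with FPU

if __name__ == "__main__":
    print("current increase:", relative_increase(F205["current_ma"], F405["current_ma"]))
    print("clock increase:  ", relative_increase(F205["clock_mhz"], F405["clock_mhz"]))
    print("F205 mA/MHz:", ma_per_mhz(**F205))
    print("F405 mA/MHz:", ma_per_mhz(**F405))
```

On these numbers the part with the FPU draws about 0.60 mA/MHz against about 0.67 mA/MHz for the part without, so the extra current tracks the clock rate more than the FPU itself.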

I was running some RTOS test code yesterday and comparing the times to generate tables of sine values with and without the FPU. With the FPU, it was 8 to 10X faster. I can't really say what the power consumption was, since the Discovery board was running off the USB power. I've got an Olimex board where I can measure the current, so I'll give that a try.

The Cortex CM4 does have control bits to enable and disable the FPU, but I don't know their effect on power consumption, or whether you would want to do that on a function-by-function basis.

The easy way to save power on the CM4 is to just shut off the CPU clock until the next interrupt.

I agree that a specialized DSP may have advantages if all you want is number crunching and limited IO at minimum cost and minimum power. However, the OP in this thread was looking for a more complex system.

Which are the bare-bone bits that they exposed?

I guess it's possible to make a living that way. What was the old maxim ---"if it were easy, everyone would do it"

Mark Borgerson

Jon Kirwan 13 years ago

Please keep the context in mind. I've had to remind you before. This is 1990.

Yes, we've been digressing. Or, at least, I have been. You can speak for yourself, of course.

I mentioned them. The combinatorial barrel-shifter, the MAC (which as it was designed was useful for FP work, as well as the regular integer work), and the two specialized DIV instructions they included. (Too long an explanation here, but suffice it that they didn't actually divide -- they implemented a subset step only.)

I liked the crafted balance they took.

Well, the manufacturers are looking for larger audiences and do NOT cater to niche markets until and unless every other better profit center has been exhausted.

At the time, I benefited from a narrow moment when doing a full FP implementation wasn't in the cards, yet the need for fast implementation was. I could implement a floating point complex-in, complex-out FFT that performed its work in less time than their later FP versions of the CPU could do. Because I could take advantage of things. Eventually, the BlackFin and later incarnations advanced in clock rates AND performance and exceeded the older parts they no longer sold. But that's normal progression.

...

I'll give another example of my mindset. The current spate of multi-GHz x86 processors from Intel are fabricated with feature sizes and GTL technology (unless they've got something still newer since I last looked) that would permit the production of a VERY LOW power 100MHz laptop that could easily run for quite some time using nothing more than a few AA batteries. Nothing special. Just cheap Costco alkalines. (In fact, it was done once with the HP Omnibook 300/Win 3.1... but with older feature sizes.) The current technology would wipe the floor with that older HP Omnibook, which itself put Windows completely in ROM (no boot from secondary storage) and would run for weeks on AA batteries available anywhere in the world. I need nothing more than an 80386 using those feature sizes -- no FP -- and running at 66MHz to 100MHz for word processing. The nice thing about that specific Omnibook (and none of the others) is that there was no special battery technology, it weighed almost nothing, included a wonderful pop-out mouse built in, and required nothing special when you closed it. It just shut off all power except and only what was required to retain the static ram. So when I opened it, I was exactly where I left off -- cursor, etc -- with exactly 0 seconds wait. When someone asked me a question, I closed it, answered the question, opened the laptop, and just kept on going. Weight was VERY low -- lower than any laptop I'm aware of today.

But there is no longer a marketplace for this. So I can only get laptops with MUCH MUCH shorter active runtimes, despite huge advances in battery technology (for much more cost) and despite huge advances in FAB technology (which could be used to greatly reduce power consumption from that time.)

Jon

Mark Borgerson 13 years ago

I think all the Cortex M3 and M4s have single cycle barrel shifters and single-cycle multiply. Integer divides can take a few cycles.

Such are the advances in electronics that you get all this capability for less than the cost and power of an 8-bit CPU from 15 years ago.

I'm waiting on delivery of one of the Parallella multicore systems from Adapteva. It has an ARM supervisor running Linux and multicore RISC chips with FPUs. More number crunching power than I should ever need.

For now, I just appreciate the ability of the CM4 to run fairly simple IIR and FIR filters using floating point coefficients I generate with Matlab.

Mark Borgerson

Mark Borgerson 13 years ago

OK, so you can run the CPU with a few milliwatts. Can you do anything other than a reflective LCD display? Lighting up even an 11" display could suck those AA cells dry pretty quickly.

Sounds sort of like a MacBook Air without the WIFI and 11" LCD screen.

I never did get an estimate on battery life for my OLPC laptop. I suspect that the onboard wifi contributed about half the power drain. I also suspect that the older Omnibooks didn't have either Ethernet or Wifi active most of the time. Those two alone will suck up a couple of AA cells pretty quickly.

You can easily run an ARM CM4 on an average power of 15mA. That should give you at least 100 hours off AA cells---it's the peripherals that people expect today that kill the batteries.
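
The 100-hour figure is easy to sanity-check. The sketch below assumes an alkaline AA delivers roughly 2000 mAh at this kind of light load (an assumption; real capacity varies with drain rate and cutoff voltage), ignores regulator losses, and uses a hypothetical 1mA sleep current to show how duty cycling gets a 100mA part down to about 15mA average. None of these helper names or the sleep figure come from the thread.

```python
def average_current_ma(active_ma: float, sleep_ma: float, active_fraction: float) -> float:
    """Time-weighted average current for a CPU that sleeps between bursts of work."""
    return active_fraction * active_ma + (1.0 - active_fraction) * sleep_ma


def runtime_hours(capacity_mah: float, average_ma: float) -> float:
    """Idealised runtime: capacity divided by average drain (no regulator losses)."""
    return capacity_mah / average_ma


ASSUMED_AA_CAPACITY_MAH = 2000.0  # assumption; cells in series share this capacity

if __name__ == "__main__":
    # 100 mA active is the figure quoted earlier in the thread; 1 mA sleep is hypothetical.
    avg = average_current_ma(active_ma=100.0, sleep_ma=1.0, active_fraction=0.14)
    print("average current (mA):", avg)
    print("runtime at 15 mA (h):", runtime_hours(ASSUMED_AA_CAPACITY_MAH, 15.0))
```

With the assumed capacity, a 15mA average gives a bit over 130 hours, which is consistent with "at least 100 hours".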

I've done a lot of low-power stuff---instruments that sit on oceanographic moorings for a year at a time. Displays aren't used and couldn't be continuously powered. The big power sucker is the storage medium---at 200MB per day, alkaline cells wouldn't cut it. We end up using Lithium primary cells, which makes shipping units and batteries a true PITA!

Mark Borgerson

Anders.Montonen 13 years ago

At least on the M3, the interrupt tail-chaining optimization can vary the latency by up to six cycles if I'm reading the documentation right (if there is a pending interrupt when the CPU is leaving an ISR, it will skip unstacking and immediately restacking the CPU registers). I don't know if there is a way to turn off this feature, and I assume it is also present on the M4. The newer and faster Cortexes also have various flash acceleration mechanisms that are growing ever closer to full caches, but those can at least be turned off (at the expense of increased latency).
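
To put the six-cycle figure in time units, here is a small sketch. The clock rates are the 120MHz and 168MHz STM32 figures mentioned elsewhere in the thread, used only as examples, and the function name is illustrative.

```python
def cycles_to_ns(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to nanoseconds at the given core clock."""
    return cycles * 1e9 / clock_hz


if __name__ == "__main__":
    for clock_hz in (120e6, 168e6):
        jitter_ns = cycles_to_ns(6, clock_hz)
        print(f"{clock_hz / 1e6:.0f} MHz: up to {jitter_ns:.1f} ns of tail-chaining jitter")
```

At 120MHz, six cycles is 50 ns, so this source of variation matters only for work that has to be cycle-exact, not for millisecond-scale deadlines.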

-a

Anders.Montonen 13 years ago

I would claim that most consumer USB audio gear still adjusts the DAC clock rate based on the 1ms USB frame timer. The audio devices that use one of the standard-specified feedback methods are considered better and probably even are, but the simple way is good enough for most use.

Buffer-wise, double 1ms buffers are normal on the device side. Host-side buffering requirements depend on a variety of factors, but e.g. on Windows you can get as low as 1-2ms worth with properly written ASIO drivers and a reasonably powerful machine.

-a

Jon Kirwan 13 years ago

The HP Omnibook 300 with Win 3.1 in ROM would run on 4 AA batteries for about 2-3 weeks of regular use. In the early 1990s. Using OLD tech.

Today? An 80386 die space would be practically invisible, would have near 100% yield, and would use lots less power still. The static ram retention would be much less power, as well. And that was the main loss of battery power when the unit was closed. (It would retain SRAM for almost two months.) It also included a capacitor, so that you would have about 10 minutes to replace the AA batteries.

Except with weeks of typical use and months of SRAM retention and absolutely ZERO time delay when opening it up for use, even weeks later. Office and Win 3.1 were both ROM'd.

There were a LOT of Omnibooks. Only 1 of them though was anything at all like the 300. That one stood out among the other Omnibooks like an Ostrich stands out in an ant farm. It had nothing similar to any of the other Omnibooks. It was a complete outlier. Unique.

I still have the Omnibook, by the way. I haven't used it in a while (carefully packed away) but I would guess from memory that the (4) AA batteries give on the order of 100 hours or so of continuous use. The display was NOT color, but grayscale. So that's a difference. Used a 1" hard drive.

Thing is, AA batteries are the ONLY battery you can be most sure of being able to find anywhere in the world. No special requirements, no unique shape to find, no $100 cost. Do you remember ANY Windows laptop that used AA batteries? Ever??

This one does.

And that was 20 years ago.

Jon

Jon Kirwan 13 years ago

I don't think the M4 has a barrel shifter -- not one that is available to the instruction set. The ADSP-21xx could find the leading bit in a 16 bit word in 1 clock, in a 32-bit word in two clocks (two separate instructions.) But during that time, I could also do two memory moves per cycle, as well.

So it normalized and denormalized in 1 to 2 clocks depending on the word size I was using. The number of shifts required (or used) was stored in another register.

If you know of the instructions on the M4 that do that, please let me know.

I love many aspects of today's micros. No question. But some aspects have little market, yet are something I'd use because I have the knowledge to use them. They appear from time to time. It used to be that every programmer was a Ph.D physicist or Ph.D mathematician. The "pyramid" of programmer skills was tiny -- only the apex of today's pyramid existed then because EVERYONE was highly skilled. Now, that pyramid has grown to huge heights with its base including people who have never so much as heard of an ALU. It's a MUCH BIGGER tent, so to speak. But that also means that those making products find they need to cater to the bottom 90% of the pyramid, not the top 10% (which is close enough to zero market to them as to be equivalent.)

hehe.

You are on the "time to market" driving side of things where cost per unit is less of an issue. I have similar pressures (speed, number crunching, IIR and FIR filtering, low power, etc.) But I also _may_ have some other pressures that include those and add some more -- such as very low cost, very small size, long term support by vendors, and so on.

Jon

Jon Kirwan 13 years ago

Latency is tolerable. Variability not so much.

Jon

Mark Borgerson 13 years ago

Doesn't the ability to rotate right by 1 to 32 bits in a single cycle imply a barrel shifter?

I think the Cortex M4 can find the leading bit in a 32-bit register with the CLZ (Count Leading Zeroes) instruction in a single cycle.

The ARM reference suggests a way to normalize a 32-bit word in 2 clocks using the CLZ and shift instructions:

"Use the CLZ Thumb-2 instruction followed by a left shift of Rm by the resulting Rd value to normalize the value of register Rm. Use MOVS, rather than MOV, to flag the case where Rm is zero:

    CLZ r5, r9
    MOVS r9, r9, LSL r5"

[link not preserved]
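
The two-instruction CLZ-plus-shift sequence quoted from the ARM reference can be modelled on a host machine. This is a minimal sketch in Python, not ARM code: `clz32` and `normalize32` are illustrative names, and the model assumes a 32-bit register, with CLZ returning 32 for a zero input and a left shift by 32 producing zero.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def clz32(x: int) -> int:
    """Count leading zero bits of a 32-bit value; returns 32 when x is zero."""
    return 32 - (x & MASK32).bit_length()


def normalize32(x: int) -> Tuple[int, int]:
    """Shift x left until bit 31 is set.

    Returns (mantissa, shift). A zero input yields (0, 32), which is the
    case the MOVS form of the shift flags on real hardware.
    """
    shift = clz32(x)
    mantissa = (x << shift) & MASK32
    return mantissa, shift


if __name__ == "__main__":
    for value in (0x00000001, 0x00012345, 0xFFFFFFFF, 0):
        mantissa, shift = normalize32(value)
        print(f"{value:#010x} -> mantissa {mantissa:#010x}, shift {shift}")
```

The shift count returned alongside the mantissa plays the role of the exponent adjustment, which is the value Jon describes the ADSP-21xx shifter leaving in another register.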

Of course, if you have an FPU, you would generally let it handle normalization and denormalization. IIRC, the CM4 can convert a 32-bit integer to IEEE-754 floating point with a single instruction. You may have to set some global rounding and saturation flags before that.

I don't think that's been true since the very early 1960s. In 1968, I took an undergrad university course that used an early BASIC-like language on a time-shared CDC machine.

By 1974, PDP-8s were widely used at sea and ashore by oceanographers. There was even a PDP-10 available on the top floor of the oceanography building for free use by grad students.

I had a friend back in the early 80's with most of an associate degree who did a lot of Apple II and Macintosh programming. He also went on to work on the math functions in Excel at Microsoft. He was very smart--- but too busy programming to finish a college degree.

The PhDs in math and physics that I've worked with all seemed to want to use Linux for everything! ;-)

I do appreciate being on the low-volume end of things. Spending a few extra bucks for CPUs and batteries is not much of an issue when it costs $200K plus $20K per day for 30 days to deploy a dozen instrument moorings on the equator. I'm not even on the top end of that cost curve---the space guys have the oceanographers beat by orders of magnitude. OTOH, they don't have to worry about their equipment being damaged or destroyed by fishermen who find their equipment a handy place to tie up for the night.

Mark Borgerson

Mark Borgerson 13 years ago

Hmmm. If the process requiring minimal variation was the highest priority, it shouldn't have to worry about variations from tail-chaining. Doesn't that only happen when an interrupt of lower or equal priority is triggered and its handler gets executed after the handler of the higher priority interrupt is finished? I'll have to re-read that section of the STM32 user guide.

Mark Borgerson

Anders.Montonen 13 years ago

A higher-priority interrupt can arrive during exception return. Quoting section B1.5.12 of the ARMv7-M ARM: "The ARMv7-M architecture does not specify the point at which the processor recognizes any asynchronous exception that arrives during an exception. If the processor recognizes a new exception while it is tail-chaining another exception, and the new exception has higher priority than the exception being tail-chained, then the processor can, instead, take the new exception, using late-arrival preemption. It is IMPLEMENTATION DEFINED what conditions, if any, lead to late arrival preemption."

-a

Jon Kirwan 13 years ago

I suppose. The one in the ADSP-21xx requires much more logic. The ADSP-21xx barrel shifter can do both normalization and denormalization in a single cycle. Lane changing alone is, in my mind, only part of the job. Once you have the ability to do a 0-31 lane change, it's a shame to not add the gates for normalization.

If this is a processor with a floating point unit, it's not something I care about. I'd be looking for integer units (as I wouldn't want to waste power on clocking substantial die space when not in use.)

A quick google tells me there is an M4 and an M4F, but then looking at the web page below you point towards, I see that there is a chapter (3.11) called "Floating-point instructions" underneath the heading of "Cortex-M4 Devices Generic User Guide"... so I don't know if all of them include FP or if some do and some don't and which you may be discussing here.

Thanks!

I do specialized floating point which permits me to optimize for the application. Generic FP is great for generic work. Not great for some things where, for example, dynamic range can be traded for precision or vice versa or where I know, a priori, that an entire vector will all share the same exponent. Just as a few real world examples in actual applications already fielded.
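
The shared-exponent case is usually called block floating point. The following is a minimal sketch of the idea, not Jon's code: it assumes signed 16-bit mantissas and a non-empty block, and the function names are invented for illustration. Each value's spare headroom is measured, the whole block is shifted left by the smallest headroom, and a single exponent is kept for the block.

```python
from typing import List, Tuple


def headroom_bits(value: int, width: int = 16) -> int:
    """Left shifts available before a signed value overflows `width` bits."""
    magnitude = value if value >= 0 else ~value
    return (width - 1) - magnitude.bit_length()


def block_normalize(block: List[int], width: int = 16) -> Tuple[List[int], int]:
    """Scale a non-empty block so its largest member fills the word.

    Returns (mantissas, exponent) with value == mantissa * 2**exponent
    for every element, using one exponent for the whole block.
    """
    shift = min(headroom_bits(v, width) for v in block)
    return [v << shift for v in block], -shift


if __name__ == "__main__":
    mantissas, exponent = block_normalize([100, -50, 3])
    print(mantissas, exponent)
    # Reconstruct the original values from the shared exponent.
    print([m * 2.0 ** exponent for m in mantissas])
```

Only one exponent has to be stored and tracked per block, so the inner loop of a filter or FFT stage runs on plain integer multiply-accumulates.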

I want the core tools, but I want to write my own microcode (in effect.) And I want small die space (better yield, lower cost, lower power consumption.) Just give me the basic lower level components of FP.

You are right about the time period. My recollection is similar. Doesn't change the point about the pyramid or the programmer marketplace that today is being addressed by chip vendors and software development tool vendors.

Makes sense.

That also makes sense to me!

Hehe.

Hehe!! Some day I'd really love to hear the stories!! And share some of my own. My only exposure to ocean work was with sound propagation through and between thermal layers (reflections, etc.)

Jon

Jon Kirwan 13 years ago

See this: [link not preserved]

It covers the 2100 Family barrel shifter unit, starting on page 2-22 (section 2.4).

The overview says,

"The shifter provides a complete set of shifting functions for 16-bit inputs, yielding a 32-bit output. These include arithmetic shift, logical shift and normalization. The shifter also performs derivation of exponent and derivation of common exponent for an entire block of numbers. These basic functions can be combined to efficiently implement any degree of numerical format control, including full floating-point representation."

My kind of barrel shifter module. Wouldn't mind a 32x64. But this is quite tolerable.

For interrupt latency (and I was using the timer here and had complete control over the memory system), see: [link not preserved]

In this case, section 3.4.3.1, page 3-19ff.

"For the timer interrupt on these processors, the latency from when the interrupt occurs to when the first instruction of the service routine is executed is only one cycle. This is shown in Figure 3.3. The single cycle of latency is needed to fetch the instruction stored at the interrupt vector location."

My kind of interrupt latency variability.

Jon

Jon Kirwan 13 years ago

Thanks, Anders. I can see that there is interesting reading ahead should I decide to use this architecture for certain applications. I don't mind nuance, so long as it is predictable.

From your points and the above, if a particular implementation of the core is chosen (a specific part from a specific manufacturer) then would it be possible to establish timer interrupts together with crafted software in order to drive I/O pins with guaranteed known latencies?

("Implementation defined" connotes to me that it may actually be defined for some specific implementation.)

To put the question in concrete terms, assume there is a background task running but that I want to use a timer to trigger an ADC sample and hold circuit, followed by another triggering the ADC conversion start, where an exact number of CPU cycles from one to the other is vital... and do this WITHOUT the use of a timer counter output module designed in hardware?

(That isn't a real example. I would normally use the output module's features. But removing that possibility gets at the question I'm asking better without having to describe the real application in detail. So assume no hardware support except for the timer interrupt event.)

Thanks by the way for what you've already added!

Jon

Mark Borgerson 13 years ago

There are control bits that enable the Cortex M4 FPU, but I don't know whether they control the FPU clocks or just access to the registers.

The Cortex M3 chips also have the shift and CLZ instructions, but don't have the floating point unit. IIRC, they are code compatible (and some of the STM32s are pin-compatible). Peripheral registers and the memory map may be different---but the ARM cores are pretty similar.

I replaced an MSP430 with a Cortex M3 in an instrument that measures the frequency output of a pressure sensor. I got about the same power dissipation but 8 times higher resolution due to the difference between measuring period with an 8MHz clock and a 60MHz clock. Battery drain was minimized by shutting down the CPU clock between the input capture interrupts and by shutting off all peripherals except the timers. When 64K of buffer RAM got filled, it was written to the SD card. The MSP430 had to write to SD much more often because it had only about 10K of RAM and a slower SPI-based SD card interface.
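
The resolution gain from the faster timer clock follows directly from the tick size. A rough sketch, assuming a single-period input-capture measurement and a hypothetical 30kHz sensor output (the actual sensor frequency is not given in the post); the function names are illustrative.

```python
def tick_ns(timer_clock_hz: float) -> float:
    """Duration of one timer tick in nanoseconds."""
    return 1e9 / timer_clock_hz


def counts_per_period(timer_clock_hz: float, signal_hz: float) -> float:
    """Timer ticks counted during one period of the measured signal."""
    return timer_clock_hz / signal_hz


ASSUMED_SENSOR_HZ = 30e3  # hypothetical sensor output frequency

if __name__ == "__main__":
    for clock_hz in (8e6, 60e6):
        print(
            f"{clock_hz / 1e6:.0f} MHz: tick {tick_ns(clock_hz):.2f} ns, "
            f"{counts_per_period(clock_hz, ASSUMED_SENSOR_HZ):.0f} counts per period"
        )
    print("resolution ratio:", tick_ns(8e6) / tick_ns(60e6))
```

The ratio of tick sizes is 7.5, in line with the "about 8 times" quoted above. Averaging over many periods improves both cases by the same factor.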

I know there are (were) Kinetis M4 chips without the FPU, but I think all the STM32F4 chips have the FPU. The IAR compiler has flags that allow you to choose hardware or software floating point. I think the GCC compiler does the same. I don't know if you get lower power dissipation if the FPU is present but not used.

A lot of the web blurbs point out that you can make a choice between clocking the CPU for 10 microseconds or the CPU + FPU for 1 microsecond in applications where you can sleep between calculations. I haven't gotten to the point of calculating the power advantages either way in any of my apps. However, you can now get fancy JTAG debug modules that measure power almost on a cycle-by-cycle basis and compute the power stats for you.
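
The trade-off in those blurbs comes down to energy per calculation, which is voltage times current times time. A minimal sketch under loudly stated assumptions: the 80mA and 100mA currents are borrowed from the STM32F205 and STM32F405 figures earlier in the thread purely as stand-ins, the 3.3V supply is taken from the same post, and sleep current between calculations is ignored.

```python
def energy_nj(supply_v: float, current_ma: float, duration_us: float) -> float:
    """Energy in nanojoules: volts * milliamps * microseconds."""
    return supply_v * current_ma * duration_us


def break_even_current_ma(slow_ma: float, slow_us: float, fast_us: float) -> float:
    """Current the fast path could draw and still match the slow path's energy."""
    return slow_ma * slow_us / fast_us


if __name__ == "__main__":
    cpu_only = energy_nj(3.3, 80.0, 10.0)    # stand-in: CPU alone for 10 us
    cpu_fpu = energy_nj(3.3, 100.0, 1.0)     # stand-in: CPU + FPU for 1 us
    print("CPU only (nJ): ", cpu_only)
    print("CPU + FPU (nJ):", cpu_fpu)
    print("break-even current (mA):", break_even_current_ma(80.0, 10.0, 1.0))
```

With these stand-in numbers the FPU path uses about an eighth of the energy, and it would stay ahead until its current reached ten times that of the CPU-only path. Measured currents on a real board would be needed to confirm this for a given application.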

The ChibiOS RTOS I'm playing with has the option to turn off the CPU clock during the idle thread. I'll have to try that out once I get my hobby autonomous navigation app running. I think that app will spend a lot of time in the idle thread between 1Hz GPS updates.

Hmmm, that's a neat idea. I could see that happening with a lot of oceanographic instruments where the data doesn't vary by more than a factor of two over the interval of an FIR filter. (I do the FIR on the raw ADC counts. After demeaning, there's a bit more dynamic range.)

As you've no doubt discovered---the basic lower-level component in much lower volume may cost much more per unit. Those billions of cell phones and tablets have driven ARM SOC chip prices to levels I wouldn't have imagined 5 years ago.

Have you considered FPGAs? You could certainly get the chip you want---but the learning curve might be higher than you'd like.

I never got past fairly simple CPLDs, but an undergrad that soldered boards for me for a few months told me that they used FPGAs in the control systems for the Oregon State Baja racer built as an ME student project.

A properly sized and laid out FPGA seems to be the tool of choice for some applications requiring speed, deterministic behavior, etc. etc. They used to be all over in TVs, cable boxes, DVD players, etc. etc. The newer and faster ARM chips may have displaced many of the FPGAs and ASICs in consumer apps outside the direct video processing path for smaller companies. For Samsung, Apple, and Sony, I suppose custom chips and ASICs are still the way to go.

IIRC, the ARM cores in iPhones and iPads have FPUs--- although I don't know how much need those devices have for floating point. The burden in milliWatts and pennies must not be too high since Apple wants to squeeze out every possible minute of battery life and production cost.

Mark Borgerson
