PowerBasic rocks!

I just tried a test loop to add an array of 64 million 16-bit signed integers into an array of 32-bit signed integers. This is to get an idea of how long a signal averaging (summing, actually) thing might take on a Pentium based SBC.

On my HP winXP desktop, writing a very dumb PowerBasic loop...

FOR Y = 1 TO 10
  FOR X = 1 TO 64000000
    S(X) = S(X) + A(X)
  NEXT
NEXT

(where X, Y, and S() are longs)

this takes 2.25 seconds, or 0.225 seconds to do the 64M sum. That's 3.6 nanoseconds per loop iteration.

I could rewrite this using pointers and it might be faster.

We tried it on a Kontron Mini-ITX SBC, in C with pointers, under Linux but with a wimpier processor, and got about the same run time.
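(For reference, a minimal sketch of what such a C-with-pointers inner loop might look like - the names and the helper function here are placeholders for illustration, not the actual Kontron code:)

#include <stdint.h>
#include <stddef.h>

/* Sum 16-bit samples into a 32-bit accumulator array, pointer style. */
void accumulate(int32_t *sum, const int16_t *samples, size_t n)
{
    const int16_t *src = samples;
    int32_t *dst = sum;

    while (n--)
        *dst++ += *src++;
}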

We'll acquire 64M samples once a second or so, so the signal averaging doesn't look like a showstopper. I was impressed.

There may be some MAC/array instructions buried in the x86 architecture that might be even faster.

John

Reply to
John Larkin

Well, the old array iterator LODSW, followed by ops, followed by STOSW, or STOSD in this case, would go pretty quick. And that's just 386 business; I don't think the Pentiums added anything useful here, but if you have SIMD (MMX / 3DNow! / etc.) instructions, you can do even more. Obviously, you'll have to either cast A() to int32 or scale the values so an int16 S() doesn't overflow. (Incidentally, most SIMD instruction sets offer saturating arithmetic, so over/underflow need not be disastrous, though unsightly.)

I'm guessing PowerBasic does arrays as any other, so maybe you could inline ASM something like...

        lds  si, far ptr(A)   ; simple lea si,A if FLAT model
        les  di, far ptr(S)   ; ditto
        mov  ecx, 64000000
addloop:
        lodsd                 ; get A[SI], increment SI
        add  eax, [di]        ; add S[DI]
        stosd                 ; save S[DI], increment DI
        dec  ecx
        jnz  addloop

If you have to work with A() in int16, then you'll have to clear EAX, then load AX and add EAX in either order, then save the sum. Might not be too bad, if you LODSD to get S[SI], then add A[BX] let's say, but then you need INC BX: INC BX or ADD BX,+2, which is another step. I don't know MMX instructions, so you'll have to look that up yourself. No big deal, you do assembly, right? (Unless you only ever do 68k assembly, in which case you might not be up on your languages... tsk tsk ;-) )

Tim

Reply to
Tim Williams

Why use x86 for this sort of work? That's what DSP chips are designed to do - the last time I looked, an Analog Devices Blackfin processor looked to be quick, cheap and tolerably easy to program.

formatting link

At the risk of teaching my grandmother to suck eggs, DSP chips tend to have multiple buses - Harvard architecture - so you can pull S(x) out of memory, add A(x) and dump the incremented S(x) back into memory in a single processor cycle.

x86 is von Neumann architecture, and you need to access the single bus three times to do that same job, plus a few more processor cycles to add the two numbers and look after the loop counter.

True nerds are supposed to build their own DSP processors in programmable logic devices - that's what comp.arch.fpga was set up to talk about. I built mine in 100k ECL back when DSP chips were a bit too new for comfort, and it had the advantage that nobody saw the structure as programmable, so the software department didn't get to tell me how I should have done it.

-- Bill Sloman, Nijmegen

Reply to
bill.sloman

I used the DOS version of PowerBASIC some 15 years ago for some small projects and even then it was really fast. The bottleneck could be the transfer from the external hardware through all the Windows layers and finally to the Basic program (use big blocks and ring buffers, but I assume you know this already). And maybe you could get in trouble if the GC of PowerBASIC decides to stop the rest of your program for some time (I don't know if it has a concurrent GC), or Windows thinks now is a good time to run the virus scanner while you are trying to do some high-speed signal processing. Maybe it would be better to preprocess the data (and thus reduce the bandwidth) before feeding it to the PC.
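(A bare-bones sketch of the big-block/ring-buffer idea in C - all names and sizes are made up for illustration, nothing is specific to John's hardware. The acquisition side fills whole blocks, and the consumer drains complete records point-by-point into the 32-bit averaging array, so no per-sample handshaking crosses the driver boundary:)

#include <stdint.h>

#define NBUF   8            /* blocks in flight; made-up figure */
#define BLOCK  (1u << 20)   /* samples per record; made-up figure */

static int16_t ring[NBUF][BLOCK];  /* filled by the acquisition side */
static volatile unsigned head;     /* next slot the producer will fill */
static volatile unsigned tail;     /* next slot the consumer will sum */

/* Consumer: sum each complete record point-by-point into sum[].
 * Classic single-producer/single-consumer ring; a real version would
 * need proper memory barriers or atomics. */
static void drain(int32_t *sum)
{
    while (tail != head) {
        const int16_t *blk = ring[tail % NBUF];
        for (unsigned i = 0; i < BLOCK; i++)
            sum[i] += blk[i];
        tail++;
    }
}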

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

We're actually going to run the app on the Kontron, all Linux and C. The ADC data will be DMA'd into cpu ram (PCI Express from our FPGA), summed into the averaging array, and the original data usually written onto a 4-drive striped RAID disk setup. After some bunch of shots, we'll save the summed array, too. I did this test to get a rough idea of whether we had time to do the sums. We do.

It is impressive how fast an x86 running Basic can be. This is PBCC, the 32-bit Console Compiler version.

John

Reply to
John Larkin

A friggin' PDA could probably handle it.

Reply to
Abbey Somebody

Native code compilers are much smarter than they used to be. For a simple loop with a high number of iterations, speculative execution and hardware branch prediction are a winner. Loop unrolling can still gain you a bit more speed most of the time, and using the SSE extensions judiciously will get you another order of magnitude on a good day. Less effective in floating point mode.
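(As a rough illustration of the unrolling point - not anyone's production code, and a decent compiler at -O2/-O3 will often do this by itself:)

#include <stdint.h>
#include <stddef.h>

/* Four-way unrolled version of the sum loop; n is assumed to be a
 * multiple of 4 just to keep the sketch short. */
void accumulate_unrolled(int32_t *sum, const int16_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        sum[i]     += samples[i];
        sum[i + 1] += samples[i + 1];
        sum[i + 2] += samples[i + 2];
        sum[i + 3] += samples[i + 3];
    }
}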

Regards, Martin Brown

Reply to
Martin Brown

John Larkin wrote:
> It is impressive how fast an x86 running Basic can be. This is PBCC,

John, if you are also using Visual Basic and get tired of how slow it is, try Power Basic for Windows. It's inexpensive too.

Reply to
Gary Peek

Sure. Except for DMA acquisition of ADC samples at 128 Mbytes per second. And signal-averaging a 64M word block of samples in 200 msec; and spooling the data to a striped disk array at that rate; and exporting the data over a couple of gigabit Ethernet links. And running Apache and Samba and realtime experimental shot scripts.

Except for those little details, you're absolutely right. Otherwise, you're AlwaysWrong.

John

Reply to
John Larkin

The Blackfin is fine, I use it, but you can get a lot of horsepower in a PC for cheap, it is easy to program and debug, and the tools are available for free or close to it.

With a Blackfin you need to buy the AD tools and a JTAG probe to program it, and if you can't find a board that fits your needs you have to get one built - lots of fine-pitch BGA needed.

And while it is reasonably fast running at 500 MHz, it won't be if you need to go out to external 100 MHz SDRAM.

I'm not sure it is that simple anymore; there may only be one main memory, but there are several levels of cache, and I believe they are split into data and program. Add to that multiple cores and lots of stuff that can be processed and loaded in parallel - and all of that running at possibly several GHz.

-Lasse

Reply to
langwadt

PowerBasic is interesting. If you know your way around, it can be faster than C.

An Intel Atom board can be bought for a few bucks and it will crawl circles around a Blackfin. If you want to outperform a PC for a specific task you need a fast FPGA. Most DSPs have turned into generic processors and most generic processors have DSP instructions. Just look at the ARM instruction set. You'll find the multiply-accumulate.

One of my former employers was the first to do typical DSP processing on a PC which eliminated very expensive DSP boards. They are now among the biggest players in their field because their product is much cheaper.

Using a DSP is something you really need to think about twice. You know why DSP chips must be available for a long time? It's because the software cannot be ported!

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
                     "If it doesn\'t fit, use a bigger hammer!"
--------------------------------------------------------------
Reply to
Nico Coesel

Get back to the slide rule; calculate orders of magnitude *first* and get --> microseconds.

Reply to
Robert Baer

I did a couple of programs using PB for Windows and their PB Forms thing; the combo is sort of like VB. You draw a window with gadgets (pulldowns, radio boxes, stuff like that) and it creates the Basic source to make all that work in Windows.

But real Windows programming is a huge PITA. I prefer to whip things out in the Console Compiler, which is like working text-mode in a DOS box... much easier. It's like the olden days, when mere mortals could still program.

We did our materials control thing in PBCC. About 17K lines of code, compiles to about 400K, and it's blindingly fast.

John

Reply to
John Larkin

That is the point. John Larkin is talking about a 64M array of 16-bit integers being added to an array of 32-bit integers; the bottleneck is likely to be the process of getting the data out of memory and back into memory - two fetches and one store - where the Harvard architecture is three times faster than the von Neumann.

The task here is just adding A(x) to S(x) and storing the sum in S(x) - cache memory for the data doesn't help for this particular task, and program memory wouldn't be a problem.

-- Bill Sloman, Nijmegen

Reply to
bill.sloman

But probably not in this application. John Larkin wanted to do

S(x) = A(x) + S(x) for x = 1 to 64,000,000

which is three memory accesses into bulk memory per addition.

As I said.

Sure. But it's intended to cope with Finite Impulse Response filter calculations, where each bit of new data is multiplied by a series of weighting factors and the products added to a series of accumulated sums; you've got a bunch of weighting factors and a bunch of accumulated sums in the cache, and you only have to access main memory after you've done the series of updates, so main memory isn't usually the bottleneck.
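(For anyone who hasn't met it, the FIR update being described is just a multiply-accumulate per tap. A bare-bones sketch in C, with the tap count and names invented for illustration:)

#include <stdint.h>
#include <stddef.h>

#define NTAPS 64   /* made-up tap count */

/* One FIR output sample: the newest NTAPS inputs, each weighted and
 * summed.  The coefficients and the recent-input window both fit in
 * cache, which is why main memory isn't the bottleneck for this job. */
int32_t fir_sample(const int16_t coeff[NTAPS], const int16_t recent[NTAPS])
{
    int32_t acc = 0;
    for (size_t k = 0; k < NTAPS; k++)
        acc += (int32_t)coeff[k] * recent[k];
    return acc;
}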

But it wouldn't be competitive for the sort of job that John Larkin seems to want to do.

But if the program is as trivial as the example presented, this really doesn't matter.

-- Bill Sloman, Nijmegen

Reply to
bill.sloman

That's not true, since the bus is much wider than the data elements being added, and you get burst-mode transfers that multiply the effect. If the three arrays are aligned just wrong, every write invalidates all the prefetching you just did. It's most important to get this stuff right.

Clifford Heath.

Reply to
Clifford Heath

Why guess? Look at the disassembly in the debugger.

SIMD instructions ought to get you some reasonable speed gain here. Something along the lines of PUNPCKLWD and PADDD on the largest chunk at a time that your (presumed Intel) CPU will permit. Maybe easier to benchmark it in C - most of those compilers have bindings for using the MMX and SSE extensions. Data alignment will matter (16-byte boundary).
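(A rough sketch of that idea using the C intrinsics for SSE2 - not anyone's production code; it assumes 16-byte-aligned arrays and a sample count that is a multiple of 8:)

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Widen eight 16-bit samples to 32 bits and add them into the 32-bit
 * sum array, eight per iteration. */
void accumulate_sse2(int32_t *s, const int16_t *a, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i v    = _mm_load_si128((const __m128i *)(a + i)); /* 8 x int16 */
        __m128i sign = _mm_srai_epi16(v, 15);         /* sign mask per element */
        __m128i lo   = _mm_unpacklo_epi16(v, sign);   /* 4 x int32, sign-extended */
        __m128i hi   = _mm_unpackhi_epi16(v, sign);   /* 4 x int32, sign-extended */
        __m128i s0   = _mm_load_si128((__m128i *)(s + i));
        __m128i s1   = _mm_load_si128((__m128i *)(s + i + 4));
        _mm_store_si128((__m128i *)(s + i),     _mm_add_epi32(s0, lo));
        _mm_store_si128((__m128i *)(s + i + 4), _mm_add_epi32(s1, hi));
    }
}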

Regards, Martin Brown

Reply to
Martin Brown

Sorry; I goofed BIG TIME.

Reply to
Robert Baer

It all depends on how much real memory you have; consider this:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BIG_SIZE 64000000

int main(int argc, char **argv)
{
    int i;
    int64_t *mem;
    int64_t *pmem;
    int32_t *b;
    int32_t *pb;

    fprintf(stderr, "memory needed=%d MB\n",
        (int)(((BIG_SIZE * sizeof(int64_t)) + (BIG_SIZE * sizeof(int32_t))) / 1000000));

    mem = (int64_t *)malloc(BIG_SIZE * sizeof(int64_t));
    if (!mem) {
        fprintf(stderr, "could not allocate space for mem, aborting.\n");
        exit(1);
    }

    b = (int32_t *)malloc(BIG_SIZE * sizeof(int32_t));
    if (!b) {
        fprintf(stderr, "could not allocate space for b, aborting.\n");
        exit(1);
    }

    fprintf(stderr, "mem=%p\n", mem);
    fprintf(stderr, "b=%p\n", b);

    pmem = mem;
    pb = b;

    for (i = 0; i < BIG_SIZE; i++) {
        *pmem += *pb;

        pmem++;
        pb++;
    }

    exit(0);
}

Now if I run that:

grml: ~ # gcc -o test2 test2.c
grml: ~ # ./test2
memory needed=768 MB
mem=0x99556008
b=0x8a131008

and since I only have 385 MB on this machine, it starts swapping big time and takes hours to run ;-)

So you must have at least a GB ???

Reply to
panteltje
