PowerBasic rocks!

Even with your values I still have problems with this :

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

#define BIG_SIZE 64000000

int main(int argc, char **argv)
{
    int i, j;
    int32_t *mem;
    int32_t *pmem;
    int16_t *b;
    int16_t *pb;
    struct timeval *start_timeval;
    struct timeval *current_timeval;
    unsigned long lstart_usecs, lcurrent_usecs, ldiff_usecs;

    fprintf(stderr, "memory needed=%d MB\n",
        (int)(((BIG_SIZE * sizeof(int32_t)) + (BIG_SIZE * sizeof(int16_t))) / 1000000));

    mem = (int32_t *)malloc(BIG_SIZE * sizeof(int32_t));
    if(! mem)
    {
        fprintf(stderr, "could not allocate space for mem, aborting.\n");
        exit(1);
    }

    b = (int16_t *)malloc(BIG_SIZE * sizeof(int16_t));
    if(! b)
    {
        fprintf(stderr, "could not allocate space for b, aborting.\n");
        exit(1);
    }

    fprintf(stderr, "mem=%p\n", mem);
    fprintf(stderr, "b=%p\n", b);

    start_timeval = malloc(sizeof(struct timeval));
    if(! start_timeval)
    {
        fprintf(stderr, "could not allocate space for start_timeval, aborting.\n");
        exit(1);
    }

    current_timeval = malloc(sizeof(struct timeval));
    if(! start_timeval)
    {
        fprintf(stderr, "could not allocate space for current_timeval, aborting.\n");
        exit(1);
    }

    /* get start time */
    gettimeofday(start_timeval, NULL);

    for(j = 0; j < 10; j++)
    {
        pmem = mem;
        pb = b;

        for(i = 0; i < BIG_SIZE; i++)
        {
            *pmem += *pb;

            pmem++;
            pb++;
        }
    }

    /* get elapsed time */
    gettimeofday(current_timeval, NULL);

    /* calculate the difference */
    lcurrent_usecs = current_timeval->tv_usec + (1000000 * current_timeval->tv_sec);

    lstart_usecs = start_timeval->tv_usec + (1000000 * start_timeval->tv_sec);

    ldiff_usecs = lcurrent_usecs - lstart_usecs;

    fprintf(stderr, "Time used is %lu us (%.4f s).\n", ldiff_usecs, (float) ldiff_usecs / 1000000.0);

    fprintf(stderr, "Ready\n");

    exit(0);
}

When I run that on the eeePC with 512MB RAM, I get these times:

eeepc-unknown:/root> ./test2
memory needed=384 MB
mem=0xa8a54008
b=0xa1041008
Time used is 13920337 us (13.9203 s).
Ready

Run repeatedly.

Reply to
panteltje

Oh, three orders of magnitude isn't so big. I've made much bigger errors. For us old farts who remember BASICs that ran at milliseconds per statement, the idea of executing a useful loop iteration in 3 ns is sort of startling. I checked it myself a number of ways, just to make sure it was actually doing all that math.

John

Reply to
John Larkin

[...]

memory needed=384 MB
mem=0xa89b1008
b=0xa0f9e008
Time used is 1647114 us (1.6471 s).
Ready

:)

--

John Devereux
Reply to
John Devereux

[snip program]

On my system*, an example output of ./time-snipped-prog is:

memory needed=384 MB
mem=0x2aaaaaae8010
b=0x2aaab9f0d010
Time used is 2891413 us (2.8914 s).
Ready

-------------------------------------------

*Some extracts from free and per-processor /proc/cpuinfo on my system:

             total       used       free
Mem:       3936312    3537024     399288

model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5200+
cpu MHz    : 1000.000
bogomips   : 2042.00

-----------------------------

The program shown below is shorter and faster than the snipped program (shorter because of formatting, no error checks, and no bother with incrementing pointers, which can get in the way of compiler optimization). Eg, in repeated runs, 1.7760 s was the least time from the snipped program, while
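
His program itself didn't survive the archiving, so as a stand-in, here is only a rough guess at the index-based shape he describes; the #define NPP name is taken from a later post, and the initial array values and output format are invented:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

#define NPP 64000000           /* same name as in his program, per a later post */

int main(void)
{
    int32_t *mem = malloc(NPP * sizeof(int32_t));
    int16_t *b   = malloc(NPP * sizeof(int16_t));
    struct timeval t0, t1;
    long i, j;

    if (!mem || !b) { fprintf(stderr, "malloc failed\n"); return 1; }

    for (i = 0; i < NPP; i++) { mem[i] = 0; b[i] = 1; }   /* he fills the arrays with known values */

    gettimeofday(&t0, NULL);
    for (j = 0; j < 10; j++)
        for (i = 0; i < NPP; i++)
            mem[i] += b[i];        /* plain indexing, nothing for the optimizer to trip over */
    gettimeofday(&t1, NULL);

    fprintf(stderr, "Time used is %ld us\n",
            (long)(t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));

    fprintf(stderr, "spot check: mem[11000000]=%d mem[49000000]=%d\n",
            mem[11000000], mem[49000000]);   /* the elements he says he checks after the loops */
    return 0;
}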

Reply to
James Waldby

On a sunny day (Fri, 15 May 2009 18:24:50 +0100) it happened John Devereux wrote in :

You win!

Reply to
Jan Panteltje

On a sunny day (Fri, 15 May 2009 12:29:21 -0500) it happened James Waldby wrote in :

Causes reboot of my eeePC :-(

Reply to
Jan Panteltje


Can the wider bus transfer data from two different addresses in the same fetch or store operation? In this context, burst mode into fast cache could help, but burst mode isn't constitutionally faster than a DSP number cruncher running flat out.


Pre-fetching doesn't seem to make sense in this context. The example was just read, add and store.

-- Bill Sloman, Nijmegen

Reply to
bill.sloman

Here is a simplified version, rewritten to more straightforwardly translate the original Larkin snippet:

#include <stdio.h>
#include <sys/time.h>

#define SIZE 64000000

int s[SIZE];
int a[SIZE];

int main(int argc, char **argv)
{
    unsigned start_usecs, current_usecs, diff_usecs;
    struct timeval start_timeval, current_timeval;

    /* get start time */
    gettimeofday(&start_timeval, NULL);

    int x, y;

    /* The Loop */
    for(y = 0; y < 10; y++)        /* the pass count was lost in the archive;
                                      10 passes assumed, matching the program above */
        for(x = 0; x < SIZE; x++)
            s[x] += a[x];

    /* everything past the loop header was also truncated in the archive;
       the ending below is a plausible reconstruction along the same lines
       as the program above */
    gettimeofday(&current_timeval, NULL);

    start_usecs   = start_timeval.tv_usec + (1000000 * start_timeval.tv_sec);
    current_usecs = current_timeval.tv_usec + (1000000 * current_timeval.tv_sec);
    diff_usecs    = current_usecs - start_usecs;

    printf("Time used is %u us (%.4f s).\n", diff_usecs, (float) diff_usecs / 1000000.0);

    return 0;
}

Reply to
John Devereux

[...]

I presume you mean it crashes and burns without a sigsegv properly occurring as it should. If so, that must be a system bug or even an eeePC bug. Have any such problems been reported in eeePC circles? Sigsegv is a lot easier to deal with; near the beginning of the program, say #define z fprintf(stderr,"@%3d\n",__LINE__); and then add a z at the beginning of any debatable line, to trace execution.
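
For what it's worth, a minimal example of that trace trick in use; only the z macro itself is from the post above, the buffer and values are invented:

#include <stdio.h>

#define z fprintf(stderr,"@%3d\n",__LINE__);

int main(void)
{
    int buf[100];

z   buf[50] = 42;                  /* each z prints "@<line>" before the statement runs, */
z   buf[50] += buf[50];            /* so the last "@line" on stderr shows where it died   */
z   printf("%d\n", buf[50]);
    return 0;
}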

You could change "#define NPP 64000000" to "#define NPP 640000", 11000000 to 110000, and 49000000 to 490000, or some such numbers, and see if the problem still occurs.

The main difference in effect between your program and mine that I see is that mine initializes the two arrays with specific values. Also, mine references elements 11000000 and 49000000 after the loops, to verify that the adds actually took place. You could add those steps to *your* program, and see if it then reboots your eeePC :-). Note, after malloc'ing current_timeval in your prog, you've error-checked the wrong variable. You could use the ttime() routine from my program, where struct timeval is stack allocated, so malloc and an error check are presumably not needed.
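
The ttime() routine itself isn't quoted in the thread, so the helper below is only a sketch of the point being made: a stack-allocated struct timeval with nothing to malloc and no error check to get wrong. The name is kept, but the seconds-as-double return type is an assumption:

#include <sys/time.h>

static double ttime(void)          /* seconds since the epoch, as a double */
{
    struct timeval tv;             /* stack allocated; no malloc needed */
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* usage:
     double t0 = ttime();
     ... work ...
     fprintf(stderr, "Time used is %.4f s.\n", ttime() - t0);
*/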

--
jiw
Reply to
James Waldby

On a sunny day (Fri, 15 May 2009 19:25:35 +0100) it happened John Devereux wrote in :

I wonder how fast yours is when compiled with the -O4 flag?

Reply to
Jan Panteltje

On a sunny day (Fri, 15 May 2009 13:26:13 -0500) it happened James Waldby wrote in :

Sure it crashes, the eeePC reboots :-)

I think it reboots because you always need to check for a null pointer return from malloc, and stuffing things in address 0[i] is no good :-) But I won't try again, I need this thing. It has months of work on it.

I agree it should not reboot, but then again... the US should not be in Afghanistan either... but it is.

Reply to
Jan Panteltje

[...]

It's the same...

--

John Devereux
Reply to
John Devereux


Modern DRAMs are essentially wide-word block-transfer devices, and modern caches are very smart. The inner "add" loop is a few (five, actually) pipeline-locked instructions working on blocks of input and output values located in data cache. Really smart programmers doing really time-critical stuff - like video games - take cache architecture into account when planning their code.

Considering what a pig the x86 is, they have managed to make it work pretty well.
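
A small experiment, not from the thread, that shows the effect being described: the same number of adds done once sequentially and once with a large stride, so the second pass defeats the cache and the burst transfers. The array size, stride, and all names are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/time.h>

#define N (16*1024*1024)      /* 16M ints = 64 MB, enough to swamp any cache */
#define STRIDE 4096           /* 16 KB jumps between successive touches */

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    int32_t *a = malloc((size_t)N * sizeof(int32_t));
    long i, start, touched = 0;
    double t0, t1, t2;

    if (!a) { fprintf(stderr, "malloc failed\n"); return 1; }
    memset(a, 0, (size_t)N * sizeof(int32_t));   /* fault the pages in before timing */

    t0 = seconds();
    for (i = 0; i < N; i++)                      /* sequential: cache and burst friendly */
        a[i] += 1;
    t1 = seconds();

    for (start = 0; start < STRIDE; start++)     /* strided: nearly every access misses */
        for (i = start; i < N; i += STRIDE) {
            a[i] += 1;
            touched++;
        }
    t2 = seconds();

    fprintf(stderr, "sequential: %.3f s   strided: %.3f s   (%ld adds each)\n",
            t1 - t0, t2 - t1, touched);
    free(a);
    return 0;
}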

John

Reply to
John Larkin


None of this gets around the fact that if you are looking at 128M words of data, the memory is outside the cache, on the other side of a word-wide bus.

Memory access time is the bottle-neck here, and the fastest solution has to be three blocks of memory - one for the new data and two for the accumulated data (which you ping-pong between up-date cycles) with three separate paths to a DSP processor that can add fast enough to match the memory transfer rate.

If you can find memory that is specified for a read-and-write cycle, you don't need the second block of memory for the accumulated data, but your maximum cycle rate is going to be a bit lower than you get with read-only and write-only cycles.
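
A rough software sketch of that ping-pong arrangement, just to make the data flow concrete; the real proposal is three physical memory banks feeding a DSP, and every name below is invented for illustration:

#include <stdint.h>
#include <stddef.h>

#define NSAMPLES 64000000

/* Each update cycle reads the new data and the "old" accumulator bank and
   writes the "new" accumulator bank; afterwards the two accumulator banks
   swap roles -- the ping-pong. */
static void accumulate(const int16_t *new_data,    /* bank 1: fresh samples          */
                       const int32_t *acc_read,    /* bank 2: previous sums (read)   */
                       int32_t *acc_write,         /* bank 3: updated sums (write)   */
                       size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        acc_write[i] = acc_read[i] + new_data[i];
}

/* per update cycle, after accumulate(...):
     swap the acc_read and acc_write pointers, so the freshly written sums
     become the read bank for the next pass. */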

-- Bill Sloman, Nijmegen

Reply to
bill.sloman


The DRAM is 64 bits wide, I think. It's slow to set up, ras-cas and all that, but block transfers to and from cache scream. The CPU is adding cache-to-cache, and the cache logic is shipping in and out big blocks of data.

Any processor that random-accessed DRAM to do this math would take minutes, not milliseconds. Even if it had a Harvard architecture.

I'm summing the 64M samples - and thrashing 384 Mbytes of DRAM - in a quarter of a second, with a Basic program. That's good enough.

The Kontron is a high-end MiniITX board. It has the CPU, BIOS, video, six USB ports, four serial, three ethernet ports (two GbE), dram, flash socket, switching regulators, four SATA connectors that transfer data concurrently, and more stuff I can't remember. We had it running Linux an hour after we took it out of the box. It's about $400. If I'm going to do a couple dozen systems a year, it's a great deal.

It also does the summing in about the same time, from a C program.

John

Reply to
John Larkin

Right. For more insight into the kind of issues that cache behaviour brings, have a read of the implementation details of FFTW:

formatting link

Clifford Heath.

Reply to
Clifford Heath

When you're reading and writing cache lines using burst transfers, there's no need for the bus width to match the word size. The Pentium and up use a 64-bit memory bus; many graphics cards use a 256-bit memory bus.

Sure; a wider bus or multiple buses will transfer data faster. It doesn't actually matter whether you have 1x 16-bit bus + 2x 32-bit buses or a single 80-bit bus. Except that a single 80-bit bus would be more flexible.

Reply to
Nobody


More than one microsecond to read A(x), S(x), add them and store the up-dated S(x)? I did it in 40nsec twenty years ago - albeit not with a 2x64Mword store.

For an earlier project we did look at a rather larger store, and published our study:

J. P. Melot, A. W. Sloman and M. J. Penberth, "Large data buffer for electron beam lithography", Microelectronic Engineering, Volume 6, Issue 1-4 (Dec. 1987), pages 141-146, ISSN 0167-9317.

Melot had worked on stuff for CERN, and was at home in that kind of digital environment. The electron beam microfabricator we were putting together wasn't all that fast, and the word rate didn't need to be much higher than 10MHz, but we were figuring on incorporating on-the-fly error detection and correction - customers get picky if pads go missing in the middle of an integrated circuit.

-- Bill Sloman, Nijmegen

Reply to
bill.sloman

The annoying thing about fftw is that they don't just tell you how long it takes to do an FFT. The speed graphs are in Mflops. What the heck does that mean?

John

Reply to
John Larkin

Hey, this seems to be a little faster...

' down-count experiment...

X = 63999999
ZOT: S(X) = S(X) + A(X)
DECR X
IF X <> 0 THEN GOTO ZOT

Presumably it eliminates an immediate compare.
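
Roughly the same idea in C, assuming the PowerBasic compiler turns DECR plus the test against zero into a decrement-and-branch with no separate compare against the array size; the function and names below are only an illustration:

#include <stdint.h>

#define SIZE 64000000

/* Down-count version of the summing loop: because the loop test is against
   zero, the decrement itself can provide the flags the branch looks at,
   so no compare against the 64000000 limit is needed each time round. */
static void sum_down(int32_t *s, const int16_t *a)
{
    long x;
    for (x = SIZE - 1; x >= 0; x--)
        s[x] += a[x];
}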

John

Reply to
John Larkin
