Distributed microcontroller computing

- K
- kha59
  
  Contact options for registered users
posted
18 years ago

Thu, Oct 13, 2005 8:41 PM

I've got a pure math problem implemented in C that will take about 3 years to solve using all 5pcs available to me (the algorithm is about as efficient as it will get without some major mathematical insights).

The funny thing about this problem is that I can easily split it into smaller problems each of which only requires about 1000 bytes of memory and only uses bitwise operations and additions mod 256.

I was considering the possibility of creating a network of microcontrollers, probably using I2C, with one main processor dishing out smaller sub problems and reporting results back to a PC. To date I have only used Microchip PICs, primarily the 18f4550 and done my programming in MPLAB ASM. I think I could easily implement this setup using these chips but they are too slow (40MHz about 20Mips I think).

I have heard that there are some high speed microcontrollers out there, I had a quick look at Freescales range of PowerPC based embedded processors. These are about $25/800Mips - as far as I can tell this would be significantly cheaper on a per MIP basis than buying computer hardware (even omitting the hdd and using minimal ram).

My questions:

What is the best $/Mips microcontroller out there, preferably with I2C.

What is the best $/Mips microcontroller that I will be able to understand without too much trouble given that I have only used Microchip products so far?
The technical docs for the Freescale embedded processors left me somewhat confused (quite a jump up from microchip microcontrollers!!!) what external components do these devices need to run? In particular is the program code stored in the device or does it obtain this via the bus?

Now I'm not sure about the gory technical details but I assume that given the simplicity of the instructions my program needs to execute the MIPS figures for microcontrollers should be fairly comparable with that for desktops. The main difference I can see here would be regarding branching - I've heard that modern processors go to great lengths to predict which instructions will be executed so that no time is wasted while retrieving program instructions...
Another possibility that would keep me busy for a while is implementing the program in assembler on my PC. Unfortunately every computer that I have access to has a different architecture (Athlon,Sempron,Duron,2xPentium M) so I guess I would have to restrict myself to x86 instruction set? What sort of performance increase do you think this might produce?

Thank you very much for your time and I look forward to hearing from you!

- A
- Alvin Andries
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 13, 2005 8:53 PM

Hi,

Your first description seems to suggest a easily parallellizable algorithm of moderate complexity. This might well be in the ballpark of modern FPGAs, which are inheritly parallel. If you could elaborate a bit more on your algorithm? I once studied a similar problem and the thoughest points were:

1) efficient communication of your system (you probably need to load your data and collect the results) 2) fault tolerance: the more devices you have, the likelier it becomes that one of them will fail (murphy: just 1 second before producing the answer)

Kind regards, Alvin.

- K
- kha59
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 13, 2005 9:19 PM

Thanks Alvin,

The problem is to assign one of 5 colours for each of the integers from

1 upwards such that if any two integers have the same colour the integer that they sum to must have a different colour:

eg. If we have

1-green 2-blue 3-green

then 4 cannot be green or blue as 1+3 = 4 and 2+2 = 4.

A correct sequence of 160 digits for 5 colours is known. I wish to find a sequence of 162 digits.

I'm doing an exhaustive search on a severely restricted subset of all the possible sequences. The sequence is built up one colour at a time until you get to the point where you know that somewhere down the track there will be no possible colour, then you go back one in the sequence and try a different colour etc.

This is quite easy to split up each chip at a time should only increase the sequence by a few digits (due to memory constraints) and report each possible final sequence back to the coordinating chip which dishes each of these out to other chips when they request more work. Of course there needs to be a prioritisation of sequences such that the sequence queue doesn't get too big (ie. always dish out the longest sequences which will be exhaustively searched and therefore removed quickest). All of this stuff isn't too hard.

To the points you made,

1) efficient communication. Each chip needs to get a sequence per unit of work which is no biggy, but it will need to report back each sequence it ends up at..... This could in fact be quite a few - maximum 5^(#of digits sequence was increased by) but usually a lot less. For each of these it needs to return (sequence length)/2 bytes. I think I will need to consider this point in some more detail... 2) fault tolerance. I wish to find a single correct sequence and believe (hope!) that there are many of these (expected running time is time to first sequence not completion time for exhaustive search which would in fact take 100s of years!!!), whilst missing one sequence due to a faulty device/communications would be very bad this wouldn't be disastrous. I'm not trying to say that no sequence of 162 digits exists.

I don't know anything about FPGAs or how these would apply, do you happen to have some useful links?

Thanks!

- E
- eKIK
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 13, 2005 9:23 PM

Since time is an important factor, building and testing a whole net of FPGAs/uC's seems like a step in the wrong direction. Common PC computing power is availible in abundancy. One way to get it is:

Sun sells processing power (and it's quite cheap too :) ).

formatting link

One of (the many) advantages of this approach is that you can write your program in java, which probably will cut down on development time a lot.

Another way of doing it is writing your own BOINC program.

formatting link

This is a system for distributed computing. After you've written your program you might post a message in an appropriate newsgroup and ask for help running it. I know I would help out =)

Hope this helps // eKIK

- K
- kha59
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 13, 2005 9:40 PM

Thanks for the suggestion eKIK,

The idea of this was really more for the fun of setting this up than seriously trying to save money. The sungrid seems a little on the expensive side for me here in New Zealand at $1US/hr.... I can get access to supercomputing at $0.30NZ/hr.... (on 2200Athlons).

Something like BOINC sounds like a good idea, although I think most people that would be inclined to do that are probably already running something like mersenne prime or seti at home...

Cheers!

- E
- eKIK
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 13, 2005 10:01 PM

I think you might be right about BOINC users already running other things. My school has allowed students to use all the schools computers idle time for distributed computing. Perhaps a similar deal can be worked out for you?

I can see the fun in trying to setup the system you originally spoke of. You can thing of FPGAs as "blank microcontrollers". You "program" them using either VHDL or Verilog, which is kind of tricky the first times since it's all executing in parallell. You can check on

formatting link

for information about CPLDs/FPGAs. This is probably the most complicated (and also most fun :) ) way of doing it. One problem is that you have to write the communication interface by yourself as well. SPI and I2C ones are availiable at

formatting link

MIPS (Meaningless Indicator of Processor Speed) is not the best way to measure speed at all times, since most microcontrollers has different CPI (Clockcycles Per Instruction) for different instructions. A LW might take longer than a ADD for example.

Perhaps you should try to write the inner loop of your program in assembly or pseudocode, then have a look at different uC datasheets to see how many cycles the loop actually takes to complete?

Atmel is a good source for semi-high performance microcontrollers. They even have a uC + FPGA hybrid

formatting link

which might be VERY interesting for you.

Regards // eKIK

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 13, 2005 10:50 PM

I can follow green, but blue seems to extend your rule ?

So that's a string of ~160 characters, where each character can be one of 5 values ?

This reads a little like sorting primes. The data set would certainly fit into a (very) small microcontroller = you can even pack into nibbles, and consume just 80 bytes, but the problem with many small uC will be ensuring there are no overlaps, or holes, in their scan coverage.

ie the task is simple enough, but multi-uC management is likely to be a nightmare.

Something like i2c for the backplane is also likely to be a serious bottleneck.

Look at Altera, Lattice, Xilinx - there are many demo/eval boards and tool sets. Also look at the Soft CPUs : Xilinx PicoBlaze, and Lattice Mico8

FPGAs can do hugely parallel tasks, and on a small data set like this, you have no memory bandswidth issues.

With a FPGA, you could do exclusion mapping - that is, do not store the Colour@integer, but instead have an array of N x 5 booleans, which are excluded colours. [ALL 5 => Whoops, go back! ] An FPGA could scan for all ahead exclusions, very efficently indeed. One of the small soft CPUs could manage the re-seed process.

The algorithm is always where the biggest speed gains can be made, especially in efficently mapping the algorithm to the hardware it runs on.

In a FPGA you could set up 'algorithm races', where (eg) you code

4 algorithms in ~1/4 of the chip each, and run it for a couple of days, and compare their Attained String Lengths.

If the present best is a langth of 160, don't just think about 162, look to smash it ! :)

I've added comp.arch.fpga, as this really sounds more like a FPGA+smart algorithm, than a "sea of uC" problem.

-jg

- 4
- 42Bastian Schick
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Fri, Oct 14, 2005 7:27 AM

I would not use I2C but rather UART.

Also look at the small ARMs from Philips (LPC2xxx) and Atmel SAM7.

These are sometimes with very low pin count (48 QFP) and run with

60MHZ (or 64MHZ if Atmel).

Philips just announced the LPC2101 with 2K RAM and 8K Flash for about

1.5USD.

There are some companies selling those chips on a DIL outline PCB with all the needed things like quarz and voltage regulator.

I don't think Freescale has currently something to compete this.

--
42Bastian
Do not email to bastian42@yahoo.com, it's a spam-only account :-)
Use @monlynx.de instead !

- L
- linnix
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Fri, Oct 14, 2005 3:23 PM

I would use pairs of processor coupling with FPGA. The FPGA acts as cross-switch for setting up the vectors (data arrays) as well as parallel processing on the vectors itself. You can setup hundreds or thousands of vector processors in a high end FPGA. You don't need high pin counts on the FPGA, but unfortunately, they are usually proportional to the macrocells.

I would use ethernet.

- A
- Alvin Andries
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Mon, Oct 17, 2005 6:01 PM

Hi,

The blue rule seems ok: 2 + 2 == 4.

I'd certainly favour FPGAs for this. The biggest issue would still be the memory bandwidth: each time you split some work over different processing nodes, you need to communicate the present N charactrs, while testing for an expansion could easily take less than O(N) time (depends on how the algorithm behaves), possibly leaving processing nodes starved for data and while others might be spening more time on communicating than on processing.

My first guess is that a single dual-ported RAM per processing node should suffice. That allows many processing nodes on a modern FPGA. Maybe if I change the memory architecture that I have in mind, some communication overhead could be eliminated. Oops ... I'm trying to solve it! Can I? :-)

Alvin.