proth siever in FPGA?

Hello,

I know FPGAs are not suitable for sieving large numbers, however one set of numbers (Proth numbers, related to the Sierpinski conjecture) would, I think, fit an FPGA quite well.

Sorry if this is a relatively basic or silly question but I'm a novice with FPGAs although I've been reading up on them and playing around with the software.

I'd be dealing with numbers in the 32 to 64 bit range (current limits of sieving are around 50 bit numbers) and hoping to put many sievers on one FPGA (I have a Spartan-3 dev kit with 200k gates).

The basic algorithm is this:-

p - 64 bit number (a prime), although the top bit will never be set
r - 64 bit number (0

Reply to
Alex

Hi Alex,

The single Siever won't be much of a problem, but you will ultimately be limited by your memory bandwidth if you want to implement multiple Sievers.

Oh, definitely doable. Apart from the ultra-basics it will teach you FSM design, pipelining and other useful stuff. Later on you should be able to tack an embedded CPU (preferably NIOS due to its easy bus structure) onto the Siever(s).

However, please use a simple SRAM. They're more expensive, but at least you don't have to complicate things by having to write the SDRAM state machine as well.

Good luck!

Ben

Reply to
Ben Twijnstra

Thanks Ben!

times on >99.9% of runs. Even if I do one loop per clock cycle (highly unrealistic) on the 50MHz Spartan-3 I'm looking at one test per second per siever, and only 512 bits going in and out of the siever per test.

There's no memory access whilst in the main loop, so I don't mind if it's a bit slow grabbing data or returning results.

As for the memory, the main chip (XC3S200FT256) has 12 18-kbit block RAMs (216 kbits total) and the board comes with 1 MByte of SRAM (in two 256K x 16 chips) on the back. No SDRAM to worry about.

Ta. -Alex

Reply to
Alex

Slightly off topic since I don't know this proth sieve

If you are interested in truly huge primes, perhaps the Mersenne prime search might be something to consider as background material.

If you google the comp.arch group for FPGA Mersenne you should find a recent (1-2 months or so) mention of an engine that did huge 64-bit integer FFTs on, I think, 2^24 points; this is used to do huge-number arithmetic in short order. It used 384 18x18 muls on the biggest V2Pro to drive the FFT kernel. The 64-bit math was needed to prevent loss of accuracy in the huge muls needed for prime checking. The math was mostly beyond me but the FPGA HW was quite interesting.

Of course it turns out the memory management was the real problem here, if you can imagine doing a 2^24-point 64-bit precision FFT.

This must be some sort of Uni assignment, right? Better than what I was offered.

johnjakson at usa dot com

Reply to
JJ

Hi Alex, I don't know anything about "sieving" but it sounds like a good first project - the operations inside your loop are straightforward to implement in FPGA logic (comparison/shifting/add/subtract). Although you didn't state it, I'm assuming the overall goal is to achieve the fastest throughput possible given the starter kit hardware constraints, i.e. maximise sieved numbers per second?

Why do you feel you can't do one iteration of that loop per clock cycle? If you pipeline your logic you will get through the loop once per clock cycle, but there will be a latency of 2, 3, 4 etc clock cycles depending on how deeply you need to pipeline. You should be able to achieve well above 50MHz; the max clock rate will ultimately depend on how deeply you pipeline.

I don't necessarily agree with Ben's statement that the memory (external SRAM?) bandwidth will be the limiting factor on throughput. Say each siever uses 100 slices and an XC3S200 provides 1920 slices, then you can have 19 sievers operating in parallel. If each siever takes 1 second to complete the algorithm then every second you need 19x (p read, r read, n write), which is a trivial amount of bandwidth.

If speed is your goal then optimise the algorithm first (although I guess you will need some experience and knowledge of the hardware/FPGA before you can do this). Maybe I didn't understand the algorithm, but if the loop runs to completion 99.9% of the time then there must be a way of exploiting this? Wouldn't it be better to first do a coarse run-through by going e.g. r

Reply to
Andrew FPGA

Thanks for the comments (Andrew & JJ). Let me explain a bit. You asked for this! :-)

I don't need an FFT because I don't need to multiply numbers. The sieving can be performed on 'small' integers (64-bits gives a very large scope) using addition, subtraction, left shifting by one bit and comparisons.

No multiplication and no floating-point makes it much easier.

Proth sieving has some unique characteristics that make it different from the Mersenne primes research.

The main goal is to maximise numbers sieved based on time. This splits into two separate goals:-

- Maximising the speed of each individual siever.

- Maximising the number of sievers on the FPGA.

Obviously pipelining may be required to complete one iteration of the loop within one clock cycle, but then that may increase the number of slices used by each siever, bringing down the number of sievers that I can place on the FPGA.

I'm playing around with a Spartan-3 starter kit as that is the only platform I have for development. This is a hobby project (not University project), if I can get reasonable performance then I may consider moving up to a larger FPGA with more gates (to get more sievers).

I'll only ever be working with dev kits. I don't intend (unless the performance is stunning) ever to build anything with multiple FPGAs, although my inner geek would love to have a box with lots of blinkenlights and serious amounts of processing power.

The 50MHz comment was purely my brain not working. I had glanced at the description and the mention of a 50MHz oscillator had stuck in my brain. I then assumed that the final solution would be limited to 50MHz. I've no idea how quickly an XC3S200 can actually run, or would run with this 'design'. (Could anyone give a rough guess?)

As for the actual problem I want to solve, here's a 'quick' description of Proth sieving and why I want to do it.

Proth numbers are numbers of the form k*2^n+1 where 2^n > k. k should be odd (for obvious reasons).

A Sierpinski number (of the second kind) is a value of k such that k*2^n+1 is composite for all n.

k=78557 will always produce composite numbers for any positive n you pick. This has been proven.

To prove that a k is not a Sierpinski number one must find a value for n such that k*2^n+1 produces a prime number.

The Sierpinski conjecture is that 78557 is the lowest value of k possible. Of the 39278 other odd candidates, primes have been found for all but 9.

The conjecture is regarded as true by most people and so primes for the other 9 k-values are expected to be found.

There's more on the Sierpinski problem here:-

formatting link
(That URL looks a bit suspect but trust me on this. There used to be 17 candidate k-values but the above project has found primes for 8 of them.)

The above site is running a distributed attack (a la SETI, Cancer@Home, etc) to find primes.

The numbers involved are extremely large. The most recent prime was found for k=27653 at n=9167433.

That's 27653*2^9167433+1, a nice 2,759,677 digits in decimal.

Trying to work with those kinds of things on a FPGA would be madness.

Primality of these numbers can be proven with a Proth Test.

formatting link
can be useful too if you want to read up any more on this.

Trying to prove primality of 2.7 million digit numbers is not an easy task. It can take a month on a 3GHz Intel P4. The test involved is: let N=k*2^n+1; if

a^((N-1)/2) mod N = N-1

then N is prime. The value of a used is usually the lowest odd prime that is not a factor of k. Since the remaining k-values being tested are all prime, a=3 usually works.

Now consider the info above. In the above case N is a 2.7 million digit number. (N-1)/2 may have the same, or one digit less than N. Still an insanely large number.

3^99 (a 2 digit exponent) results in a 48 digit number. 3^999 (a 3 digit exponent) results in a 477 digit number.

Now consider 3^N where N has 2.7 million digits.

Big.
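Before moving on, the test itself is easy to sanity-check in software on numbers that do fit in a machine word. Here's a small C sketch (my own illustration, not project code) of Proth's theorem using square-and-multiply modular exponentiation; the function names are mine:

```c
#include <stdint.h>

/* Square-and-multiply: computes base^exp mod m.
   __uint128_t keeps the intermediate product from overflowing
   for moduli up to 64 bits. */
static uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = (uint64_t)(((__uint128_t)result * base) % m);
        base = (uint64_t)(((__uint128_t)base * base) % m);
        exp >>= 1;
    }
    return result;
}

/* Proth's theorem: N = k*2^n+1 (with 2^n > k) is prime if
   a^((N-1)/2) == N-1 (mod N) for a suitable witness a. */
static int proth_test(uint64_t k, unsigned n, uint64_t a)
{
    uint64_t N = (k << n) + 1;
    return modpow(a, (N - 1) / 2, N) == N - 1;
}
```

For instance k=5, n=3 gives N=41; since 3^4 mod 41 = 40, we get 3^20 = (3^4)^5 = 40 (mod 41), so proth_test(5, 3, 3) reports 41 prime.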

Anyway, on to the sieving.

So the project has produced a large number of k,n pairs that need to be tested. Each Proth test takes a long time.

Sieving is the process of removing candidate numbers by trial division with known primes. For example, one of my computers is checking the range of primes from 882000000000000 to 883000000000000 and, amongst others, it's found that 882347066940847 is a factor of 24737*2^80174983+1.

It has found 17 unique factors in the week that it has been running. That's 17 k,n pairs that do not need to be proth tested. Those proth tests would have run for 17 months on the same machine only to not find any primes.

This is the value of sieving.

OK, so an FPGA won't be good at dealing with large (i.e. 10 million bit) numbers. So why am I here clogging up this newsgroup with my ramblings?

It's because Proth numbers (k*2^n+1) have some properties that make sieving easy.

Say I'm checking when k=10223 and my prime of choice is p=882319864863617.

I compute the remainder of (k*2)+1 mod p. Which, when testing such large primes, is simply (k*2)+1.

r = 10223*2+1 = 20447 (that's for n=1)

Starting with (k*2^n)+1

- subtract the 1 we added on =k*2^n

- double the number (k*2^n)*2 = k*2^(n+1)

- add one = (k*2^(n+1))+1

More importantly, these actions can be performed on the remainder without having to keep track of the actual value of k*2^n+1. So:-

r_next = (2*(r-1))+1 mod p = 2r-1 mod p

The 'mod p' part is quite expensive. Since we are only dealing with a doubling of the previous r (which we knew was guaranteed to be less than p) we know that, at most, r could grow to less than 2p. This is why 'mod p' is replaced by a simple subtraction of p if r >= p.

If r is ever equal to 1 then it will always remain 1, we can stop processing (for the FPGA I wouldn't even bother passing this on as a work unit).

If r is ever equal to 0 then it means that p is factor of k*2^n+1 at whatever n the algorithm has iterated to. We can stop here and record the value of n.

So with relatively little effort we can check for factors for very large values of n without having to keep track of the large (multi-million digit) values. The snag is that we have to compute r for every n along the way.

This is why I can't run a coarser run by skipping values. Each iteration must be performed and the value of r checked. There is no way to predict future values of r without computing them.

There's another stop condition, if we ever get back to the r value we started at (when n=1) then we know we can stop. The r values will just follow the same pattern again. At this point I'd like to stop and record the fact that no factor was found (otherwise I would have stopped before) for this k,p pair.
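To make the loop concrete, here's a small C model of one siever (my own software sketch of the algorithm described above, not the eventual HDL): it tracks only r, applies r -> 2r-1 with the conditional subtract in place of mod p, and stops on a factor (r==0), on r==1, or when r returns to its starting value:

```c
#include <stdint.h>

/* One siever run for a single (k, p) pair.
   Returns the n at which p divides k*2^n+1, or 0 if no factor is
   found before n_max or before the r sequence cycles back to its
   starting value. Assumes p < 2^63 so 2r-1 cannot overflow. */
static uint64_t sieve_kp(uint64_t k, uint64_t p, uint64_t n_max)
{
    uint64_t r = (2 * k + 1) % p;   /* r at n = 1 */
    uint64_t r_start = r;
    uint64_t n = 1;

    if (r == 0)
        return n;           /* p already divides k*2+1 */

    while (n < n_max) {
        r = 2 * r - 1;      /* r was 0 < r < p, so now 0 < r <= 2p-3 */
        if (r >= p)
            r -= p;         /* single subtract replaces mod p */
        n++;

        if (r == 0)
            return n;       /* found: p | k*2^n + 1 */
        if (r == 1)
            return 0;       /* r stays at 1 forever: give up */
        if (r == r_start)
            return 0;       /* back in the same loop: no factor */
    }
    return 0;
}
```

For example sieve_kp(3, 5, 100) returns 3, because 3*2^3+1 = 25 is divisible by 5; sieve_kp(1, 7, 100) returns 0, because 2^n+1 is never divisible by 7.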

The modulus operation involved has an important bearing on the size of this loop. Say I'm checking a number against p=3. Consider all of the possible values of r and what they transform to:-

2x-1 mod 3:

0 -> (2*0)-1 mod 3 == 2
1 -> (2*1)-1 mod 3 == 1
2 -> (2*2)-1 mod 3 == 0

The maximum number of values r can have is p, but r=1 is a special case, so the maximum loop size would be p-1.

Some values of p can produce smaller sub-loops.

2x-1 mod 7:

0 -> (2*0)-1 mod 7 == 6
1 -> (2*1)-1 mod 7 == 1
2 -> (2*2)-1 mod 7 == 3
3 -> (2*3)-1 mod 7 == 5
4 -> (2*4)-1 mod 7 == 0
5 -> (2*5)-1 mod 7 == 2
6 -> (2*6)-1 mod 7 == 4

The cycles are:

0 -> 6 -> 4 -> 0
1 -> 1
2 -> 3 -> 5 -> 2

So if my initial value of r had been 2 I could loop endlessly without ever finding a factor (r=0). This is why it's important to check to see if we've returned to the initial value of r.
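This sub-loop behaviour is easy to play with in software. A little C helper (purely illustrative, with a name of my own choosing) walks the x -> 2x-1 mod p map from a starting value and reports the cycle length, confirming e.g. the 2 -> 3 -> 5 -> 2 loop of length 3 for p=7:

```c
#include <stdint.h>

/* Follow x -> 2x-1 (mod p) from start until the value repeats.
   Returns the number of steps needed to come back to start.
   Every value sits on a pure cycle here, since x -> 2x-1 is
   invertible mod an odd p (the inverse map is x -> (x+1)/2). */
static uint64_t cycle_length(uint64_t start, uint64_t p)
{
    uint64_t s0 = start % p;
    uint64_t x = s0;
    uint64_t steps = 0;
    do {
        x = (2 * x + p - 1) % p;   /* +p keeps it non-negative at x==0 */
        steps++;
    } while (x != s0);
    return steps;
}
```

With p=7 this gives cycle lengths 3, 1 and 3 for starting values 0, 1 and 2 respectively, matching the table above.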

The sieving effort involved in the SoB project above has put an arbitrary limit on the upper bound of n of 50 million. Proth testing a k,n pair where n~50M would take a long long time, but sieving up to that range is relatively cheap. If, two years down the road, the project is looking at n values up at that range we'll be thankful that we've sieved up there.

As I said above, the maximum size of the loop is p-1. So for p=882319864863617 this would mean a maximum loop size of 882,319,864,863,616.

This is why 99.9% of the runs will completely iterate up to n=50,000,000.

It's actually closer to 99.999999%. From the numbers above there is a 1 in 17,646,397 chance that p will be a factor of k*2^n+1 with n between 1 and 50,000,000. (This isn't strictly true due to the possibility of smaller sub-loops, but gives you an idea of the numbers involved).

Also, by recording the value of r when I do reach the end of the loop it means I can continue this at a later date. I'd only need r at n=1 (trivial to compute) and the value of r at a given n and I can continue on my merry way.

The faster we can sieve, the faster we can rule out more and more k,n pairs. The fewer pairs we have the faster sieving becomes (although only slightly) and so we can sieve out higher and higher primes.

*THIS* is why I'm interested in an FPGA helping out. They have the capability of being much much faster than traditional CPUs.

If you've read this far then thank you. Reward yourself with a coffee/tea/walk/cigarette/nap/whatever.

Ta, -Alex

Reply to
Alex

snipping all to spare the BW

Well that was quite an interesting read and links too.

Right now it's still over my head since it's not my hobby or work interest, but still it points to the challenges in an FPGA design. But since I know the simple sieve of Eratosthenes very well I'll come back to this later as a possible cpu/memory benchmark.

I think starting with the starter kit is an excellent idea; when you reach the best case solution you can at least speculate what bigger HW could do.

On the other hand you could avoid this HW experiment and keep it in C for quite a while longer. When I model my cpu project I don't use Verilog simulation; I write in a form of HDL C that allows the C model to run in the std VC6 compiler on a std PC. This C RTL (register transfer level) code runs about 20x slower than simple-minded C code that has no HW detail, but it also runs orders of magnitude faster than Verilog simulation. The C RTL code looks so much like Verilog that it can be hand translated, with some grepping, back into Verilog for the actual FPGA synthesis. This might not be so easy for VHDL since its syntax is not like C.

You could reasonably experiment with C RTL engines that are orders of magnitude larger than any FPGA could actually handle, and you still have the PC memory and disk resources. So far you are going backwards by about 20x if you do what I suggest, but then with a simple C RTL translation effort you have Verilog that can be synthesized into any FPGA, provided you understood the architecture of the FPGA. With the C RTL model you can have your simulator count cycles and HW costs without ever going to HW, but you will need to port the C RTL to Verilog to find out what your clock frequency limit is going to be.
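A rough sketch of what such register-level C modelling could look like for the siever (the struct layout and names are illustrative guesses at the style, not anyone's actual code): hold all the 'registers' in a struct and advance them one clock per call, so the structure later maps almost line-for-line onto HDL clocked processes.

```c
#include <stdint.h>

/* Register state of one siever, as it would exist in hardware.
   (The return-to-start cycle check is omitted here for brevity.) */
struct siever {
    uint64_t r;      /* remainder register */
    uint64_t p;      /* the prime under test */
    uint64_t n;      /* iteration counter */
    uint64_t n_max;  /* stop limit, e.g. 50000000 */
    int      done;   /* set when a stop condition is reached */
    uint64_t result; /* n if a factor was found, else 0 */
};

/* Load the registers for a new (k, p) pair. */
static void siever_init(struct siever *s, uint64_t k,
                        uint64_t p, uint64_t n_max)
{
    s->r = (2 * k + 1) % p;   /* r at n = 1 */
    s->p = p;
    s->n = 1;
    s->n_max = n_max;
    s->done = (s->r == 0);
    s->result = s->done ? 1 : 0;
}

/* One clock tick: compute next-state values combinationally,
   then "latch" them, just as an HDL clocked process would. */
static void siever_tick(struct siever *s)
{
    if (s->done)
        return;

    uint64_t r_next = 2 * s->r - 1;     /* the 2r-1 step */
    if (r_next >= s->p)
        r_next -= s->p;                 /* conditional subtract */
    uint64_t n_next = s->n + 1;

    s->r = r_next;
    s->n = n_next;

    if (r_next == 0) {                  /* factor found */
        s->done = 1;
        s->result = n_next;
    } else if (r_next == 1 || n_next >= s->n_max) {
        s->done = 1;                    /* no factor: give up */
        s->result = 0;
    }
}
```

Running ticks until `done` is set plays the part of the simulation loop, and counting the calls gives you the cycle counts mentioned above.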

What I like about this approach is that lots of architectures can be explored, and with my HW experience in hand I know what to expect when I go to synthesis. Exploring architecture is relatively painless in your fav C IDE, and you can evolve the arch towards one that just flies through the FPGA tools. Spending a lot of time in Webpack trying to get optimal results leads to great frustration. You might also look at HandelC, but $$$.

For the long term, I still think big DRAM is going to be a big headache since I see poor locality of reference written all over this project, i.e. SRAM cache will be useless. You could take a look at Micron's RLDRAM, which performs like a DRAM in density and like an SRAM in speed. It does this by allowing accesses to start every 2.5ns, with 8 in flight, and data comes out after 20ns or 8 clocks. If you can arrange for interleaved engines to sync with this, you can get 32MByte of RLDRAM effectively working at 2.5ns, but you need good HW pipeline design, and this is high end FPGA stuff, i.e. V2Pro and above. These accesses can be fully random to any address provided the bank is already done with its previous job.

For Spartan-3 best grade, Webpack 6.3 was giving me 225MHz or so for a 16-bit cpu datapath, with the 16-bit adder and dual port blockram both close to the limit. I use 2 clocks to get a 32-bit datapath with 4-way ported BRAM. Others here may have gotten better results. I mostly work on V2Pro, about 1.5x faster, and it seems Webpack 7.1 has slowed things down some (I dunno why).

If you start adding numbers >16b unpipelined, your perf will go to hell pretty quickly, you need to pipeline either at the 16b or 32b add level.

best of luck to this project

regards

johnjakson at usa dot com

ps If you like I can send you C RTL sample code.

Reply to
JJ

Ok, without looking at it in more detail the math is beyond me at this point. If you want to get a feeling for clock rates achievable on the FPGA, just try some simple stuff and put it through the tools. Create a 16-bit adder in VHDL/Verilog and put it through the tools. Do the same for a comparator. Try 32-bit and 64-bit adders. Make sure you register the inputs and outputs of course. Not a bad way to learn an HDL either. Try something simple in the HDL and have a look to see what the synthesizer created.

Remember that the underlying spartan 3 logic cell has one 4 input lookup table (LUT) and 1 flipflop. The flip flop can be used for pipelining and often it is already an unused resource so effectively it comes for free.

Yes there is always a tradeoff between FPGA resource usage and speed (clock rate). E.g. a bit serial multiplier uses very little logic resource but takes multiple clock cycles to compute the result. Alternatively a fully parallel multiplier can produce a result every clock cycle.

Regards Andrew

Reply to
Andrew FPGA

Hi Alex,

First let me thank you for what has to be a truly unique posting from a new participant to comp.arch.fpga .

-- Clearly stated problem (though missing whether density or speed is the goal, and the expected execution time (i.e. typical loop count per r,p pair))

-- Obvious effort put into defining the data to be manipulated and careful description of what the data looks like

-- Well stated algorithm, with appropriate comments

-- Some initial thoughts on bandwidth to host system with a proposed solution to keep the FPGA engine busy

-- Algorithm presented in terms that cleanly map to hardware constructs

-- Some boundary case analysis (the r-- , r-=p), since special cases are the major pain in hardware algorithm accelerators

-- No silly request that we email you the answer

-- Just really refreshingly clear!

Good plan, get one going. See where the bottlenecks are. See how big it is.

Sure

Depends on your other hardware design experience.

Looks like you need the following types of functional units:

Memory interface, pair of 64 bit registers

another 64 bit register

64 bit counter

64 bit test for 0

Ooops, looks like r is a loadable counter, rather than just a register

64 bit comparator (includes a 64 bit subtractor)

another 64 bit subtractor

64 bit comparator (no subtractor for this one)

plus a controller.

The initial design should not care about timing, just get it to do the algorithm.

Then there are wonderful opportunities for pipelining: -- between the various algorithm statements,

-- within the counters and adders/subtractors (because at 64 bits there may be opportunities that do not exist for smaller structures)

-- Multiple instances of the algorithm processing different r,p pairs, but sharing the pipeline. Barrel processing. This is hinted at, because it looks like your algorithm can't start the next iteration of the loop until the current one is finished. This is also referred to as the C-slow approach by Nicholas Weaver, who posts to this newsgroup on occasion. Try a Google search and this:

formatting link
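A tiny C sketch of the barrel idea (my own illustration, not from the referenced material): keep an array of independent (r, p) contexts and advance them round-robin, one per 'clock', so a deep pipeline always has an unrelated iteration in flight instead of stalling on the 2r-1 dependency.

```c
#include <stdint.h>

/* One interleaved siever context. In hardware, each context would
   occupy a different pipeline stage on any given clock. */
struct ctx {
    uint64_t r, p, n, n_max;
    int done;
    uint64_t result;   /* n of the factor, or 0 if none found */
};

/* Advance one context by one iteration of the 2r-1 loop. */
static void barrel_step(struct ctx *c)
{
    if (c->done)
        return;
    uint64_t r = 2 * c->r - 1;
    if (r >= c->p)
        r -= c->p;
    c->r = r;
    c->n++;
    if (r == 0) {
        c->done = 1;
        c->result = c->n;     /* factor found at this n */
    } else if (r == 1 || c->n >= c->n_max) {
        c->done = 1;
        c->result = 0;        /* give up: no factor */
    }
}

/* Run all contexts to completion, one step per context per pass,
   mimicking the round-robin schedule of a C-slowed pipeline. */
static void barrel_run(struct ctx ctxs[], int nctx)
{
    int busy = 1;
    while (busy) {
        busy = 0;
        for (int i = 0; i < nctx; i++) {
            barrel_step(&ctxs[i]);
            if (!ctxs[i].done)
                busy = 1;
        }
    }
}
```

Each context must be pre-loaded with r = (2k+1) mod p for its own k; the scheduler then never waits on any single context's serial dependency.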

Looks like you have some fun ahead of you.

Philip Freidin Fliptronics

Reply to
Philip Freidin

Thanks. I've been around Usenet long enough to know what to do and, more importantly, what not to do. The project is a learning experience for me, I don't want anyone to give me the answer :-)

And thank you for your comments, they've helped me look in the right directions. I'm a complete novice when it comes to FPGAs and VHDL. So I've bought _The Student's Guide To VHDL_ by Peter J. Ashenden (ISBN 1558605207) based on various recommendations.

I've implemented a very simple siever limited to VHDL's built-in INTEGER type (signed 32-bit). Obviously this isn't much help for the final version (64-bit numbers) but it's helping me understand the whole thing.

The algorithm has changed slightly (and for the better). I have eliminated the need for the decrement of r, and I'm now checking to see if r is set to one of 9 values.

(I've been playing around with this today in a text editor. I only have the Xilinx software at home, so please read past any syntax errors :-)

(Also note that this is purely functional and unoptimised.)

function sieve_it (
    p, c1, c2, c3, c4, c5, c6, c7, c8, c9 : in INTEGER
) return INTEGER is
    variable n : INTEGER := 0;
    variable r : INTEGER := 1;
begin
    while n < 50000000 loop
        r := r*2;   -- double it (max p is 30 bits with signed integers;
                    -- with 64 bit unsigned integers p can go up to 2^63-1)
        n := n+1;   -- increment n
        if r >= p then
            r := r-p;   -- r was 0 < r < p before being doubled, so after
                        -- doubling it must be 0 < r <= 2p-2; we only ever
                        -- need to subtract p once to bring it back into
                        -- the range 0 < r < p.
                        -- Since r is always +ve we never have to deal with
                        -- negative numbers. Should use unsigned integers,
                        -- but leave that until later on.
        end if;
        -- r now holds 2^n mod p.
        -- Check if r (2^n mod p) is equal to one of our target values.
        -- As far as concurrency is concerned, n must have been
        -- incremented by this point; all other operations must be
        -- performed sequentially.
        if r = c1 or r = c2 or r = c3 or r = c4 or r = c5 or
           r = c6 or r = c7 or r = c8 or r = c9 then
            return n;   -- got one, return n
        end if;
    end loop;
    return 0;
end function sieve_it;

Up to 9 'check' values are provided, if not all values are needed then a dummy value (either 0 or p) can be provided as r can never end up at either of these values.

[MATHS BIT]

Computing test cases is easy. Consider k=19249 and p=100003. We are looking to solve the discrete logarithm (= means equivalence):-

19249*2^n+1 = 0 (mod 100003)

=> 19249*2^n = -1 (mod 100003) : subtract 1 from each side

=> 2^n = -1 * 19249^-1 (mod 100003) : divide by 19249

(Don't worry how this bit works!)

19249^-1 = 3486 (mod 100003) : 3486*19249 = 1 (mod 100003)

=> 2^n = (100003-3486) (mod 100003)

=> 2^n = 96517 (mod 100003)

I can compute hundreds of thousands of these test cases per second on a humble PC. This is not the tricky part.

There are 8 other candidate k values (aside from 19249), hence the 9 possible values to check for. Some checking (performed on a PC) can eliminate certain k values (quadratic residues and Legendre symbols) and so not all 9 c values may be needed.

[END OF MATHS BIT]
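For what it's worth, the PC-side computation above can be sketched in C (a rough illustration with my own helper names): an extended Euclid gives k^-1 mod p, the target is then p - k^-1, and a simple doubling loop recovers the n the siever should report.

```c
#include <stdint.h>

/* Modular inverse of a mod p (p prime, a not divisible by p),
   via the extended Euclidean algorithm. Signed arithmetic keeps
   the bookkeeping simple; fine for the small demo values here. */
static int64_t mod_inverse(int64_t a, int64_t p)
{
    int64_t old_r = a % p, r = p;
    int64_t old_s = 1, s = 0;
    while (r != 0) {
        int64_t q = old_r / r;
        int64_t t = r;  r = old_r - q * r;  old_r = t;
        t = s;  s = old_s - q * s;  old_s = t;
    }
    return ((old_s % p) + p) % p;   /* normalise into 0..p-1 */
}

/* Find the first n (1 <= n <= n_max) with 2^n == target (mod p),
   i.e. the n at which p divides k*2^n + 1 when
   target = p - mod_inverse(k, p). Returns 0 if none is found. */
static uint64_t solve_n(uint64_t target, uint64_t p, uint64_t n_max)
{
    uint64_t r = 1;
    for (uint64_t n = 1; n <= n_max; n++) {
        r = (2 * r) % p;
        if (r == target)
            return n;
    }
    return 0;
}
```

With the thread's numbers: mod_inverse(19249, 100003) gives 3486, the target is 100003 - 3486 = 96517, and solve_n(96517, 100003, 50000000) recovers the n at which 19249*2^n+1 is divisible by 100003 (9484, going by the figures below).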

So we call sieve_it with p=100003 and c1=96517 (c2..c9 are all 0).

It should return n=9484 because 19249*2^9484+1 = 0 (mod 100003). (Warning, this number has around 2900 digits so put your calculators down.)

I don't care which value (c1 to c9) matched, in truth I don't really care what the magic n value was. All I want to know is there was a match for any n in the range 0..50M. Here's why...

The hits are so rare that just knowing that a certain p will provide a result is all that is necessary.

Take the range of p from 882T (8.82*10^14) to 883T. I've sieved this range with a software siever and so I know exactly how many factors were found.

On average the prime density is about 1 in 33 in that range. So there are roughly 1E12/33 = 3E10 primes between 882T and 883T. That's 3E10 individual calls to sieve_it() that need to be made. This is why I want to fit as many sievers on the FPGA as possible, and make them as fast as possible.

Out of those 3E10 primes, a mere 866 were factors of the numbers. That's one hit for every ~ 35,000,000 calls to sieve_it.

So the plan is something like this:-

  1. Implement and test the sieve_it function. I have a wrapper entity that sets up vars for the inputs, calls sieve_it and then writes the result out to an output:-

entity proth_sieve is
    port (
        clk, reset : in STD_LOGIC;
        nresult : out INTEGER
    );
end proth_sieve;
...

proth_u_like : process is
begin
    -- assign values to tp, t1..t9
    tp := 100003;
    t1 := 96517;
    result := sieve_it( tp, t1, t2, t3, t4, t5, t6, t7, t8, t9 );
    nresult

Reply to
Alex
