FPGA implementation of AES

- manjunath.rg

Wed, Mar 8, 2006 2:12 PM

We have been doing a project on high-speed AES using subpipelining concepts. We would be happy to find some code that may help us. If anyone in this group has access to any, please help us.

- me_2003

Wed, Mar 8, 2006 2:45 PM

Take a look at the following cores; they might help you.

formatting link

Mordehay.

- manjunath.rg

Thu, Mar 9, 2006 7:54 AM

I saw it, but it is not of much help, as we are doing it based on subpipelining concepts and composite field arithmetic. If you find something of that sort, please do help us. Thanks in advance.

- fpga_toys

Thu, Mar 9, 2006 12:31 PM

Do you have a C-based implementation somewhere as an example?

- me_2003

Thu, Mar 9, 2006 12:50 PM

I've made an implementation of the AES core in an FPGA which works with pipelining, i.e. I use only 4 S-boxes and iterate each 5 times for every round. I cannot give you the code/spec due to IP issues. The nature of the design depends on the speed (i.e. clock cycles) you need for each round and how much memory you can spare (one dual-port BRAM = 2 S-boxes). Hope it helps, Mordehay.

- Hans

Thu, Mar 9, 2006 9:48 PM

See

formatting link

No pipelining but perhaps the testbench can save you some time.

Hans

formatting link

- Allan Herriman

Fri, Mar 10, 2006 6:48 AM

You want to try it in your C -> hardware compiler? I'd be interested in the results.

AES is a public algorithm, and widely available. The original proposal (RIJNDAEL) was written in C, and is designed to give good performance on machines that can manipulate 8 bit chunks o' data (e.g. most modern CPUs), so it is a good match to C.

formatting link

Note that AES is a block cypher. These can be used with or without feedback around the outside. The latency isn't so important when not using feedback, which allows the use of subpipelining to increase the clock rate. Unfortunately, many of the interesting crypto applications use block cyphers with feedback (e.g. CBC, CFB), so the latency affects the throughput, and subpipelining doesn't help.

formatting link
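The latency argument above can be put into numbers with a small clock-count model. This is only a sketch under stated assumptions: the pipeline accepts one block per clock, each stage takes one clock, and there is no extra cost for the feedback mux. The function names are made up for illustration.

```c
#include <stdint.h>

/* Clocks needed to push n_blocks through a pipeline of `depth` stages
 * when the blocks are independent (non-feedback use): the pipeline fills
 * once, then one block completes per clock. */
uint64_t clocks_non_feedback(uint64_t n_blocks, uint64_t depth)
{
    if (n_blocks == 0)
        return 0;
    return depth + n_blocks - 1;
}

/* Clocks needed when every block depends on the previous result
 * (feedback modes such as CBC or CFB): the next block cannot enter
 * until the previous one has left, so the full latency is paid
 * for each block. */
uint64_t clocks_feedback(uint64_t n_blocks, uint64_t depth)
{
    return n_blocks * depth;
}
```

For 1000 blocks and a hypothetical 70-stage pipeline, the model gives 1069 clocks without feedback and 70000 clocks with feedback, which is why deeper subpipelining does not raise throughput in the feedback case.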

Google shows that there are many papers claiming rather fast AES in FPGAs (with some fine print saying they're using a non-feedback mode). I've never seen a feedback mode cypher in a real world application get anything over some Gb/s.

Regards, Allan

- fpga_toys

Fri, Mar 10, 2006 10:27 AM

Me too :)

I'll take a look at it this weekend, as it might make another interesting example for the next FpgaC release. I have a pipelined RSA-72 I did two years ago when looking at building dnet engines that is a monster because of the barrel shifters and LUT RAMs required for retiming. First glance at the referenced materials suggests the problem with AES is going to be 80 or more block RAMs for S-box lookup tables to get any reasonable parallelism. It's not clear there is an easy way to avoid using S-box tables, as the algorithm for creating the table is iterative. The rest of the requirements per round seem pretty timid. I have a couple of ideas to ponder first.

The feedback chaining clearly limits performance unless you have a fair number of independent concurrent streams that can be muxed into the pipeline, like an 11-port mux/switch used to break out a very fast connection.
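A rough way to see why independent streams help is to count clocks for a round-robin schedule. The sketch below is a simplified model, not a description of any real design: it assumes a pipeline of `depth` one-clock stages, that each stream can only issue its next block once its previous block has left the pipeline, and that streams are issued one clock apart. The function name and the depth of 11 used in the note are illustrative assumptions.

```c
#include <stdint.h>

/* Clocks to process `blocks_per_stream` chained blocks on each of
 * `streams` independent feedback-mode streams, interleaved round-robin
 * into one pipeline of `depth` stages. */
uint64_t clocks_interleaved(uint64_t streams, uint64_t blocks_per_stream,
                            uint64_t depth)
{
    if (streams == 0 || blocks_per_stream == 0 || depth == 0)
        return 0;
    if (streams >= depth) {
        /* Enough streams to issue a block every clock: the pipeline
         * stays full and only the initial fill is overhead. */
        return streams * blocks_per_stream + depth - 1;
    }
    /* Too few streams: each stream waits `depth` clocks per block and
     * the other streams ride along one clock behind each other. */
    return blocks_per_stream * depth + streams - 1;
}
```

With a depth of 11, one stream of 100 blocks takes 1100 clocks, while 11 streams of 100 blocks each (1100 blocks in total) take 1110 clocks, so the aggregate throughput is roughly eleven times higher.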

- backhus

Fri, Mar 10, 2006 11:24 AM

Hi Allan, interesting point, but have you thought about what the reasons may be?

Let's do some (approximate) calculations.

Assume you have a single AES round in hardware that runs with a 100MHz clock. Iterating this round takes at least 10 clocks to produce one AES cipher block.

With 128 Bits Data width that gives:

128 * 100e6 / 10 = 1.28e9 bits per second

So that is the limit for the assumed circuit.

Adding a feedback path for block cipher modes will increase the number of clocks needed to create a cipher block.

Let's assume 14 clocks to produce a CBC cipher

Now we have:

128 * 100e6 / 14 = 914.3e6 bits per second

That's all that's possible with the assumed circuit.

How can we increase the throughput?

1) Wait for better silicon that allows higher clock rates.

2) Use more chip space to implement additional rounds and decrease the number of iterations needed in the round. But that may be rather expensive!

3) Improve the round's latency. Make it fast to the limit. (Which is at about 500MHz as some vendors claim for their products ;-) )

Now let's assume our circuit will still run at 100MHz, but the improved round runs at 500MHz. That will reduce the latency of the rounds from ten 100MHz cycles to two. Adding the same 4 cycles of feedback overhead as before gives 6 cycles to create the CBC cipher.

Now we have:

128 * 100e6 / 6 = 2.1e9 bits per second

So, that's the theoretical limit for the assumed circuit. You can exceed it by investing in additional or better (ASIC) silicon, if you have the money.
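The three figures above all come from the same formula, which can be written as a one-line helper. This is just a sketch of the arithmetic in the post; the function name is made up, and the clock counts (10, 14 and 6) are the post's own assumptions.

```c
/* Throughput in bits per second for a block cipher engine that
 * delivers one block of `block_bits` bits every `clocks_per_block`
 * clocks at a clock frequency of `clock_hz`. */
double throughput_bps(double block_bits, double clock_hz,
                      double clocks_per_block)
{
    return block_bits * clock_hz / clocks_per_block;
}
```

Calling `throughput_bps(128, 100e6, 10)` gives 1.28e9, `throughput_bps(128, 100e6, 14)` gives about 914.3e6, and `throughput_bps(128, 100e6, 6)` gives about 2.13e9 bits per second.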

As I understand the original posting, these guys want to spend some work on solution 3 somehow.

My tip to manjunath & co.: Have a look at the standard implementations and the book "The Design of Rijndael", ISBN: 3540425802. Identify the modules and start optimizing the designs to whatever your goal is.

Have a nice synthesis Eilert

- Thomas Womack

Fri, Mar 10, 2006 3:12 PM

There has been a lot of research put into efficient implementations of the S-boxes without using lookup tables;

formatting link

might be an example. I went to a conference in August where

formatting link

was presented, which runs AES at 25Gbits/second on an XC3S2000; the round function is pipelined into seven stages of three levels of LUT each.

Tom

- fpga_toys

Sat, Mar 11, 2006 2:16 AM

Any clue what the specific GF functions and tables are?

- Allan Herriman

Sat, Mar 11, 2006 6:39 AM

Hi Eilert,

That's the idea. Your numbers are a little out though. Using a mature FPGA process (with moderate speed grade) is likely to result in a clock of about 200MHz if hand placement of the sboxes is used.

AES takes 14 rounds per block (with a 256-bit key; a 128-bit key needs 10). It might be possible to have feedback around that block without wasting another clock, but let's assume that it takes 1 extra clock for the feedback mux, which gives 128 bits of result every 15 clocks. This results in a throughput of 1.7Gb/s.

A newer FPGA + fastest speed grade + hand placement of some LUTs might double the numbers. I doubt it could reach a 500MHz clock in an FPGA.

Of course, if one isn't using a feedback mode, many AES engines can be run in parallel for a vast increase in speed. Alternately, the loops can be unrolled for the same effect.

I notice that OC192 / STM64 AES encryptors have been available for a couple of years. I assume these have a single FPGA which produces approx 20Gb/s of crypto material (10Gb/s encrypt, 10Gb/s decrypt + the encrypt and decrypt streams are different so they can't share any hardware).

Regards, Allan

- Thomas Womack

Sat, Mar 11, 2006 8:24 AM

formatting link

at the bottom of page six and top of page seven gives a set of fields that you could use, but I'm afraid I'm not really in the mood to explain GF(2^k) arithmetic in full detail in a Usenet post, and on trying I've found that I can't reconstruct the whole process without a fair amount of work; how much do you know about it to begin with?

It's basically a generalisation of complex numbers to binary arithmetic: start off with W defined so that W^2 = 1+W, and you have [with a,b,c,d single bits]

(a+bW)(c+dW) = ac + (bc+ad)W + bdW^2 = (ac+bd) + (bc+ad+bd)W

(a+bW)^{-1} = (a+b) + bW

so, multiplication and inversion of things of the form a+bW are two LUTs each. You then define X^2 = (something in 1 and W) + (something in 1 and W)*X and repeat the process, using the definition of inversion at the bottom of page 6 of the iacr preprint, to get multiplication of four-bit expressions; you then define Y^2 = (something in 1, X, W) + (something in 1,X,W)*Y and repeat again.
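The two formulas for the smallest field can be checked directly in C. The sketch below covers only GF(2^2) with W^2 = 1+W, exactly as written above; it does not attempt the next two levels of the tower, since their defining polynomials are not given in this post. The bit packing (bit 0 = a, bit 1 = b) and the function names are assumptions made for the example.

```c
#include <stdint.h>

/* An element a + bW of GF(2^2) is stored in two bits:
 * bit 0 holds a, bit 1 holds b. W satisfies W^2 = 1 + W. */

/* (a+bW)(c+dW) = (ac+bd) + (bc+ad+bd)W, with + as XOR and * as AND. */
uint8_t gf4_mul(uint8_t x, uint8_t y)
{
    uint8_t a = x & 1, b = (x >> 1) & 1;
    uint8_t c = y & 1, d = (y >> 1) & 1;
    uint8_t lo = (uint8_t)((a & c) ^ (b & d));
    uint8_t hi = (uint8_t)((b & c) ^ (a & d) ^ (b & d));
    return (uint8_t)(lo | (hi << 1));
}

/* (a+bW)^{-1} = (a+b) + bW. Zero has no inverse; this maps it to zero. */
uint8_t gf4_inv(uint8_t x)
{
    uint8_t a = x & 1, b = (x >> 1) & 1;
    return (uint8_t)((a ^ b) | (b << 1));
}

/* Returns 1 if x * x^{-1} == 1 for every nonzero element, else 0. */
int gf4_check(void)
{
    for (uint8_t x = 1; x < 4; x++) {
        if (gf4_mul(x, gf4_inv(x)) != 1)
            return 0;
    }
    return 1;
}
```

Each output bit depends on at most four input bits, which is consistent with the remark that multiplication and inversion fit in two 4-input LUTs each.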

This is probably easiest done if you can find a spare mathematician to lean over your shoulder while you're doing it, or ask on sci.crypt where there will probably be someone who has the derivation handy: good terms to google on are 'composite extensions' and 'towers of fields'.

Tom

- manjunath.rg

Sat, Mar 11, 2006 8:38 AM

No, I don't think you can implement hardware concepts such as subpipelining, and in non-feedback mode at that, in C so easily. Anyway, if you have a C-to-VHDL converter, do tell me.

- fpga_toys

Sat, Mar 11, 2006 12:47 PM

Pipelines in C are relatively easy at the statement level; it just requires reversing the statement order.

a = 1; b = a; c = b;

propagates 1 to c with sequential execution.

for(;;) { c = b; b = a; a = 1; }

requires three clocks before c obtains the value 1: a pipeline with three clocks of latency that trickles upward.
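The two orderings can be compared in plain C by counting loop iterations. This is a software sketch only: it assumes each loop iteration stands for one clock and that all three variables start at 0. The function names are made up for the example.

```c
/* Reversed statement order: each assignment reads the value its source
 * held at the end of the previous iteration, so a, b, c behave like
 * three pipeline registers. Returns c after `clocks` iterations. */
int pipeline_c_after(int clocks)
{
    int a = 0, b = 0, c = 0;
    for (int i = 0; i < clocks; i++) {
        c = b;  /* last stage takes b from the previous clock */
        b = a;  /* middle stage takes a from the previous clock */
        a = 1;  /* first stage loads the new input */
    }
    return c;
}

/* Original statement order: the value falls straight through to c
 * within a single iteration. Returns c after `clocks` iterations. */
int sequential_c_after(int clocks)
{
    int a = 0, b = 0, c = 0;
    for (int i = 0; i < clocks; i++) {
        a = 1;
        b = a;
        c = b;
    }
    return c;
}
```

`pipeline_c_after(2)` is still 0 and `pipeline_c_after(3)` is 1, while `sequential_c_after(1)` is already 1.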

- fpga_toys

Sat, Mar 11, 2006 1:22 PM

Thanks Tom. My math skills date back to very early 70's, and have not needed to progress past that for most of the hardware/software engineering I've done since. I do have a general interest in crypto stuff, and would probably need a math guy with some patience to walk me thru it.

I did take the Daemen example code for study this morning, and while it is not suitable for FPGA implementation because of all the serialized looping, it was enough to understand the core algorithm pretty quickly. In a couple of hours I rewrote it into a fully unrolled subset C for FpgaC that is highly parallel, and pipelined at each round. The S-boxes are just stubbed out with a define macro, waiting for something reasonable to place in the macro. It appears that it should run at a pretty fair clip once someone can provide a set of C statements for the S-box implementation you have referenced.

FpgaC does suffer a bit from a long-standing problem we inherited from TMCC, which is that it doesn't know how to map F5/F6 muxes for extending 4-LUT equations, and tends to push terms down a little too quickly, forcing a slightly deeper logic tree than optimal. This is also impacting the PCI core I started as demo code a few weeks ago.

So, I'm off rewriting the FpgaC bottom-end code to solve that problem for good. After the mux fixes, it appears FpgaC can compile the AES engine to netlist very well, along with the earlier RSA demo code's barrel shifters.

John

- manjunath.rg

Mon, Mar 13, 2006 3:24 AM

Hello Mordehay, we are just graduate students doing this as a project. I respect your intellectual property, but we are not showing the code to anyone, nor are we using it for other purposes such as paper presentations or contests. So you can be sure that it won't pass on to anyone except me, or rather my group. I hope you will provide us your implementation. Thank you.

- fpga_toys

Wed, Mar 15, 2006 5:57 AM

I've checked the FpgaC version of AES encryption into the FpgaC sf.net subversion archive under examples/crypto/aes with the S-box macro stubbed out, waiting for the code Tom suggested. If someone plays with it and does the code replacement for the S-box arrays, please send me the changes and I'll include them in the example for the next beta release in April with your name attached.

svn co

formatting link

fpgac

I got a start on updating FpgaC to handle platform specific technology mapping so it can use F5/F6/F7 MUXes and do a better job of flattening the combinatorial tree. Should probably have a chance to finish that over the weekend, and get it checked in sometime next week. From looking at the netlists, I suspect that will produce fairly optimal implementation when done.

Any other suggestions?

Have fun, John

- Allan Herriman

Wed, Mar 15, 2006 6:27 AM

How about inferring BRAM for the sboxes? That's what many implementations do. (I'm assuming the point of the exercise is to compare the results of an implementation written in C with one written in a more conventional HDL.)
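For reference, the table-based form of the S-box looks like this in C. The sketch fills a 256-entry table at start-up using the standard construction (multiplicative inverse in GF(2^8) followed by the affine transform), walking through the field with the generator 3. Whether FpgaC, or any particular synthesis tool, would map such an array onto block RAM is not something this thread establishes; the function names here are made up for illustration.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Fill sbox[] with the AES S-box. p steps through the powers of 3 in
 * GF(2^8) and q through the powers of 1/3, so q is always the inverse
 * of p. */
void aes_sbox_init(uint8_t sbox[256])
{
    uint8_t p = 1, q = 1;
    do {
        /* multiply p by 3, reducing modulo x^8 + x^4 + x^3 + x + 1 */
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));

        /* divide q by 3 (the same as multiplying by 0xf6) */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;

        /* affine transform of the inverse, then add the constant 0x63 */
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2)
                              ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);

    /* zero has no inverse; its entry is defined as the constant alone */
    sbox[0] = 0x63;
}
```

After `aes_sbox_init(sbox)`, `sbox[0x53]` is `0xed`, which matches the worked example in the AES standard. Each SubBytes lookup is then a single read, `sbox[x]`, which is the access pattern a ROM or block RAM provides.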

Regards, Allan

- fpga_toys

Wed, Mar 15, 2006 7:28 AM

May seem strange, but the point wasn't to do a head-to-head comparison with other HDLs. I have some interest in crypto algorithms and have been on a search for "projects" which I can use as examples for the FpgaC release. The changes for technology mapping F5/F6 muxes have been on my list since last summer. The AES and PCI examples just drove home the point that it should be done now, not later.

The AES algorithm was easy to unroll by hand once I had an idea of what the core computation was ... the whole project took a little less than two hours start to finish. About an hour to code and debug the define macros for the first round (embedding row shift order), and another hour to cut, paste, and hand edit it as fully unrolled and finish setup of the test bench. I'm a very experienced C coder, so it might take someone else a bit longer. I also visualized the parallelism needed early in the project, and coded to obtain/maintain it.
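To illustrate what embedding the row shift order can mean, here is ShiftRows written as a pure index permutation. This is a generic sketch of the standard AES step, not the macros from the FpgaC example: it assumes the usual column-major state layout (byte index = row + 4 * column), and the function name is made up.

```c
#include <stdint.h>

/* AES ShiftRows on a 16-byte state stored column-major:
 * row r is rotated left by r positions, so output byte (r, c)
 * is input byte (r, (c + r) mod 4). */
void shift_rows(const uint8_t in[16], uint8_t out[16])
{
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            out[r + 4 * c] = in[r + 4 * ((c + r) % 4)];
        }
    }
}
```

Because every source index is a compile-time constant once the loops are unrolled, the step reduces to choosing which byte feeds which S-box input, so in hardware it is wiring rather than logic.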