:: 27-Nov-1998 01:55 (Friday) ::
Did some investigation of integer bit-slice last night and it looks
like it is no real win over BrydDES. Based on clock counts from
s-boxes 1 and 4 it would be only 5% faster, and even that assumes hand
assembly can do 10% better than an optimized compile under DJGPP.
For integer work it looks like it is best to concentrate on BrydDES
for now. First step is to figure out how it works.
On the MMX front, here is how I arrive at a possible 50% improvement.
On a P5 MMX, the s-boxes average 45 clocks per box. There are 8 boxes
per round and an equivalent of 9.348 (of 16) rounds done for each
“slice” of 64 keys.
To those 45 clocks must be added some setup (basically a = e ^ k)
and cleanup, which amount to 16 clocks per box if they can’t be paired.
As these are all loads and stores, they all go to the U pipe and will
not pair unless the s-boxes are rewritten to accommodate it.
So we have a clock count per key of (45 + 16) * 8 * 9.348 / 64 = 71.28.
At 200MHz this gives 2806 kkeys/s. We currently get 1876 kkeys/s.
2806/1876 = 1.50. Q.E.D.
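The arithmetic above can be checked with a short script. This is only a sketch of the estimate, not a measurement; the constant and function names are made up for illustration, and the numbers are the ones quoted in this entry.

```python
# Estimate from this entry: P5 MMX at 200 MHz, 64 keys per slice.
CLOCK_HZ = 200_000_000   # 200 MHz
SBOX_CLOCKS = 45         # average clocks per s-box
OVERHEAD_CLOCKS = 16     # unpaired setup + cleanup clocks per s-box
SBOXES_PER_ROUND = 8
EQUIV_ROUNDS = 9.348     # equivalent rounds (of 16) done per slice
KEYS_PER_SLICE = 64
CURRENT_KKEYS = 1876     # current rate in kkeys/s


def clocks_per_key(sbox_clocks, overhead_clocks):
    """Clocks spent per key for one 64-key slice."""
    per_slice = (sbox_clocks + overhead_clocks) * SBOXES_PER_ROUND * EQUIV_ROUNDS
    return per_slice / KEYS_PER_SLICE


def kkeys_per_second(cpk, clock_hz=CLOCK_HZ):
    """Key rate in thousands of keys per second."""
    return clock_hz / cpk / 1000


cpk = clocks_per_key(SBOX_CLOCKS, OVERHEAD_CLOCKS)
rate = kkeys_per_second(cpk)
print(f"{cpk:.2f} clocks/key, {rate:.0f} kkeys/s, {rate / CURRENT_KKEYS:.2f}x")
# prints: 71.28 clocks/key, 2806 kkeys/s, 1.50x
```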
If the s-boxes can be rewritten it might go as high as 60%, assuming
half of those 16 instructions pair.
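As a rough check on the 60% figure, the same formula can be evaluated for a smaller overhead. The sketch below assumes one particular reading of "half of those 16 instructions pair": eight of the sixteen instructions pair with each other, forming four pairs and saving 4 clocks per box. That reading is an assumption, not something stated in this entry; the function name is also just for illustration. A saving of 8 clocks is shown for comparison.

```python
def speedup(overhead_clocks, sbox_clocks=45, boxes=8, rounds=9.348,
            keys=64, clock_hz=200e6, current_kkeys=1876):
    """Estimated rate relative to the current 1876 kkeys/s."""
    clocks_per_key = (sbox_clocks + overhead_clocks) * boxes * rounds / keys
    kkeys = clock_hz / clocks_per_key / 1000
    return kkeys / current_kkeys


# 16 = nothing pairs; 12 = four pairs formed (assumed reading);
# 8 = every overhead instruction pairs (for comparison only).
for overhead in (16, 12, 8):
    print(f"overhead {overhead:2d} clocks: {speedup(overhead):.2f}x")
# prints:
# overhead 16 clocks: 1.50x
# overhead 12 clocks: 1.60x
# overhead  8 clocks: 1.72x
```

Under that reading, 12 clocks of overhead gives the 60% quoted above.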
All of this applies to the P5 MMX, where the clock counts are
deterministic. On the out-of-order processors (PPro, PII, K6 and K6-2)
the improvement may vary.