Mail Archives: djgpp/1996/11/28/18:02:29
> > What makes you say that? I can't see how this would make it faster...
> > more cache misses, and an extra shift to index non-byte sized quantities.
> > Not to mention the fact that there are more byte sized registers.
> I believe in 32-bit protected mode most dword register ops are faster
> than the equivalent 16-bit ones on a 486 and above. Certainly on a P6
> 16-bit instructions are disproportionately slow.
> In any case I haven't seen djgpp generate any optimizations which utilise
> the byte registers; AFAIK it uses them only in straightforward byte ops.
On the Pentium, the following rules decide which type of instructions
to use:
i) If you are running your code in 32-bit protected mode, use 32-bit and
8-bit data and registers, and avoid 16-bit ones
ii) If you're running in 16-bit protected/real mode, avoid 32-bit
registers
It's all in the Pentium programmer's manual. Go to
http://www.x86.com/
and have a look around there...
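As a rough illustration of rule (i), here is a minimal sketch, assuming a
32-bit target such as DJGPP where int is 32 bits and short is 16 bits. The
function names are made up for the example. The 16-bit accumulator may make
the compiler emit 16-bit operations in 32-bit code; the 32-bit version
keeps the arithmetic in a full register and masks once at the end. A given
compiler may produce the same code for both, so profile before relying on it.

```c
#include <stddef.h>

/* Hypothetical example: 16-bit accumulator. In 32-bit protected mode
   this can force 16-bit arithmetic on every iteration. */
unsigned sum_bytes_16(const unsigned char *buf, size_t n)
{
    unsigned short total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total = (unsigned short)(total + buf[i]);
    return total;
}

/* Same result modulo 65536, but the loop uses a 32-bit accumulator
   (assumes unsigned int is at least 32 bits) and truncates once. */
unsigned sum_bytes_32(const unsigned char *buf, size_t n)
{
    unsigned int total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += buf[i];
    return total & 0xFFFFu;
}
```

Both functions return the same value for any input; only the width of the
working register differs.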
> > > did you actually profile your code to see where the bottlenecks are?
> > Yes. I know exactly where I need to improve.
> I have no idea how good your C coding skills are, so don't be offended,
> but careful C code can speed up a sloppy implementation by ~ 100%:
> on the other hand, there are limits.
> Check your algorithm to see what basic operations are being used
> (specifically multiplies, divides, sqrts etc) and check how many
> operations are duplicated in such a way that they can be removed with
> a little recoding -
> e.g.  a1 = b1/(x*y);          c = x*y;
>       a2 = b2/(x*y);   ===>   a1 = b1/c; a2 = b2/c; etc.
>       a3 = b3/(x*y);
> Simplistic, but you get the point.
Actually, this is even faster if you:
c = 1 / (x * y);
a1 = b1 * c;
a2 = b2 * c;
a3 = b3 * c;
A normal double divide takes 39 cycles; a mul takes 3 cycles.
Using your method, you have 3 divides (117 cycles) and one mul, for 120
cycles.
Using the second method, you have one divide (39 cycles) and four muls
(12 cycles, counting the x * y), or 51... :)
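The same trick generalises to a loop. The sketch below is only an
illustration: the function and parameter names are invented for the
example, and it assumes IEEE doubles. Note that b * (1/c) rounds twice
where b / c rounds once, so the two versions can differ in the last bit
for some inputs.

```c
/* Hypothetical helper: one divide per element. */
void scale_div(const double *b, double *a, int n, double x, double y)
{
    double c = x * y;           /* common subexpression hoisted */
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] / c;
}

/* Hypothetical helper: a single divide, then one multiply per element. */
void scale_recip(const double *b, double *a, int n, double x, double y)
{
    double r = 1.0 / (x * y);   /* the only divide */
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * r;
}
```

With x = 2, y = 4 and b = {8, 16, 24}, both versions give {1, 2, 3}, since
1/8 is exactly representable.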
Leathal.