modes/asm/ghash-*.pl: switch to [more reproducible] performance results
collected with 'apps/openssl speed ghash'.
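
(The figures in the hunks below should be reproducible with the benchmark named above: build the tree and run `apps/openssl speed ghash`; absolute numbers will naturally vary with CPU, compiler and flags.)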
commit d52d5ad147
parent a3b0c44b1b
crypto/modes/asm/ghash-parisc.pl
@@ -12,9 +12,9 @@
 # The module implements "4-bit" GCM GHASH function and underlying
 # single multiplication operation in GF(2^128). "4-bit" means that it
 # uses 256 bytes per-key table [+128 bytes shared table]. On PA-7100LC
-# it processes one byte in 19 cycles, which is more than twice as fast
-# as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for 8
-# cycles, but measured performance on PA-8600 system is ~9 cycles per
+# it processes one byte in 19.6 cycles, which is more than twice as
+# fast as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for
+# 8 cycles, but measured performance on PA-8600 system is ~9 cycles per
 # processed byte. This is ~2.2x faster than 64-bit code generated by
 # vendor compiler (which used to be very hard to beat:-).
 #
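For readers new to the "4-bit" scheme these comments describe, the following is a minimal C sketch of one GF(2^128) multiplication with a 16-entry (256-byte) per-key table and the 16-entry shared remainder table. It mirrors the portable reference code in OpenSSL's crypto/modes/gcm128.c; names and layout here are illustrative, it is not the PA-RISC module itself.

    #include <stdint.h>

    typedef struct { uint64_t hi, lo; } u128;

    /* The shared table: reduction values for the 4 bits that fall off
     * when Z is multiplied by x^4. It depends only on the GCM polynomial,
     * so it is shared between keys; OpenSSL stores it as 16 x 8 bytes,
     * which is the "+128 bytes shared table" above. */
    static const uint16_t rem_4bit[16] = {
        0x0000, 0x1C20, 0x3840, 0x2460, 0x7080, 0x6CA0, 0x48C0, 0x54E0,
        0xE100, 0xFD20, 0xD940, 0xC560, 0x9180, 0x8DA0, 0xA9C0, 0xB5E0
    };

    /* Per-key table: Htable[i] = i*H in GF(2^128), 16 x 16 = 256 bytes.
     * H[0]/H[1] are the big-endian halves of the hash key. */
    static void gcm_init_4bit(u128 Htable[16], const uint64_t H[2])
    {
        u128 V = { H[0], H[1] };
        int i, j;

        Htable[0].hi = Htable[0].lo = 0;
        Htable[8] = V;
        for (i = 4; i > 0; i >>= 1) {   /* V <- V*x, reduced */
            uint64_t T = 0xE100000000000000ULL & (0 - (V.lo & 1));
            V.lo = (V.hi << 63) | (V.lo >> 1);
            V.hi = (V.hi >> 1) ^ T;
            Htable[i] = V;
        }
        for (i = 2; i < 16; i <<= 1)    /* remaining entries by linearity */
            for (j = 1; j < i; ++j) {
                Htable[i + j].hi = Htable[i].hi ^ Htable[j].hi;
                Htable[i + j].lo = Htable[i].lo ^ Htable[j].lo;
            }
    }

    /* Xi <- Xi*H: two table lookups and two shift/fold steps per byte. */
    static void gcm_gmult_4bit(uint8_t Xi[16], const u128 Htable[16])
    {
        int cnt = 15, nlo = Xi[15] & 0xF, nhi = Xi[15] >> 4;
        u128 Z = Htable[nlo];
        uint64_t rem;

        for (;;) {
            rem  = Z.lo & 0xF;          /* Z <- Z*x^4, fold spilled bits */
            Z.lo = (Z.hi << 60) | (Z.lo >> 4);
            Z.hi = (Z.hi >> 4) ^ ((uint64_t)rem_4bit[rem] << 48);
            Z.hi ^= Htable[nhi].hi;
            Z.lo ^= Htable[nhi].lo;

            if (--cnt < 0)
                break;

            nlo = Xi[cnt] & 0xF;
            nhi = Xi[cnt] >> 4;

            rem  = Z.lo & 0xF;
            Z.lo = (Z.hi << 60) | (Z.lo >> 4);
            Z.hi = (Z.hi >> 4) ^ ((uint64_t)rem_4bit[rem] << 48);
            Z.hi ^= Htable[nlo].hi;
            Z.lo ^= Htable[nlo].lo;
        }
        for (cnt = 0; cnt < 8; ++cnt) { /* store big-endian */
            Xi[cnt]     = (uint8_t)(Z.hi >> (56 - 8 * cnt));
            Xi[cnt + 8] = (uint8_t)(Z.lo >> (56 - 8 * cnt));
        }
    }

Each byte of Xi costs two table lookups and two shift-and-fold steps, which is what the per-byte cycle counts above are measuring.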
crypto/modes/asm/ghash-sparcv9.pl
@@ -17,8 +17,8 @@
 #
 #                 gcc 3.3.x    cc 5.2    this assembler
 #
-# 32-bit build    81.0         48.6      11.8    (+586%/+311%)
-# 64-bit build    27.5         20.3      11.8    (+133%/+72%)
+# 32-bit build    81.4         43.3      12.6    (+546%/+244%)
+# 64-bit build    20.2         21.2      12.6    (+60%/+68%)
 #
 # Here is data collected on UltraSPARC T1 system running Linux:
 #
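(On the notation: the bracketed figures are the assembler's speedup over each compiler, i.e. the time ratio minus one. In the new 32-bit row, 81.4/12.6 ≈ 6.46x, hence +546% versus gcc, and 43.3/12.6 ≈ 3.44x, hence +244% versus cc 5.2.)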
crypto/modes/asm/ghash-x86.pl
@@ -21,17 +21,18 @@
 #
 #              gcc 2.95.3(*)   MMX assembler   x86 assembler
 #
-# Pentium      100/112(**)     -               50
-# PIII         63 /77          12.2            24
-# P4           96 /122         18.0            84(***)
-# Opteron      50 /71          10.1            30
-# Core2        54 /68          8.6             18
+# Pentium      105/111(**)     -               50
+# PIII         68 /75          12.2            24
+# P4           125/125         17.8            84(***)
+# Opteron      66 /70          10.1            30
+# Core2        54 /67          8.4             18
 #
 # (*)  gcc 3.4.x was observed to generate few percent slower code,
 #      which is one of reasons why 2.95.3 results were chosen,
 #      another reason is lack of 3.4.x results for older CPUs;
-#      comparison is not completely fair, because C results are
-#      for vanilla "256B" implementations, not "528B";-)
+#      comparison with MMX results is not completely fair, because C
+#      results are for vanilla "256B" implementation, while
+#      assembler results are for "528B";-)
 # (**) second number is result for code compiled with -fPIC flag,
 #      which is actually more relevant, because assembler code is
 #      position-independent;
@@ -44,7 +45,7 @@
 
 # May 2010
 #
-# Add PCLMULQDQ version performing at 2.13 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
 # The question is how close is it to theoretical limit? The pclmulqdq
 # instruction latency appears to be 14 cycles and there can't be more
 # than 2 of them executing at any given time. This means that single
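As a concrete reference for the aggregation discussion in the next hunk, here is a hypothetical C-intrinsics sketch (not the perl-generated code of this module) of one Karatsuba-style carry-less multiplication built from three pclmulqdq instructions; the XOR/shuffle pre- and post-processing around them is the ~5-cycle overhead the comments quantify, and the modulo-reduction is deliberately left out:

    #include <wmmintrin.h>  /* PCLMULQDQ intrinsic */

    /* One 128x128 -> 256-bit carry-less multiplication, Karatsuba-style:
     * three pclmulqdq instructions plus XOR/shuffle pre-/post-processing. */
    static inline void clmul_karatsuba(__m128i a, __m128i b,
                                       __m128i *lo256, __m128i *hi256)
    {
        __m128i lo  = _mm_clmulepi64_si128(a, b, 0x00);  /* a.lo * b.lo */
        __m128i hi  = _mm_clmulepi64_si128(a, b, 0x11);  /* a.hi * b.hi */
        /* pre-processing: (a.lo ^ a.hi), (b.lo ^ b.hi) */
        __m128i af  = _mm_xor_si128(a, _mm_shuffle_epi32(a, 0x4E));
        __m128i bf  = _mm_xor_si128(b, _mm_shuffle_epi32(b, 0x4E));
        __m128i mid = _mm_clmulepi64_si128(af, bf, 0x00);
        /* post-processing: middle term folded in at bit offset 64 */
        mid = _mm_xor_si128(mid, _mm_xor_si128(lo, hi));
        *lo256 = _mm_xor_si128(lo, _mm_slli_si128(mid, 8));
        *hi256 = _mm_xor_si128(hi, _mm_srli_si128(mid, 8));
    }

Aggregating Naggr input blocks keeps the multiplications going while a single reduction is amortized over all of them, which is the (Tmul + Tmod/Naggr) equation discussed below.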
@@ -60,38 +61,36 @@
 # Before we proceed to this implementation let's have closer look at
 # the best-performing code suggested by Intel in their white paper.
 # By tracing inter-register dependencies Tmod is estimated as ~19
-# cycles and Naggr is 4, resulting in 2.05 cycles per processed byte.
-# As implied, this is quite optimistic estimate, because it does not
-# account for Karatsuba pre- and post-processing, which for a single
-# multiplication is ~5 cycles. Unfortunately Intel does not provide
-# performance data for GHASH alone, only for fused GCM mode. But
-# we can estimate it by subtracting CTR performance result provided
-# in "AES Instruction Set" white paper: 3.54-1.38=2.16 cycles per
-# processed byte or 5% off the estimate. It should be noted though
-# that 3.54 is GCM result for 16KB block size, while 1.38 is CTR for
-# 1KB block size, meaning that real number is likely to be a bit
-# further from estimate.
+# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
+# processed byte. As implied, this is quite optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte out of 16KB buffer. Note that
+# the result accounts even for pre-computing of degrees of the hash
+# key H, but its portion is negligible at 16KB buffer size.
 #
 # Moving on to the implementation in question. Tmod is estimated as
 # ~13 cycles and Naggr is 2, giving asymptotic performance of ...
 # 2.16. How is it possible that measured performance is better than
 # optimistic theoretical estimate? There is one thing Intel failed
-# to recognize. By fusing GHASH with CTR former's performance is
-# really limited to above (Tmul + Tmod/Naggr) equation. But if GHASH
-# procedure is detached, the modulo-reduction can be interleaved with
-# Naggr-1 multiplications and under ideal conditions even disappear
-# from the equation. So that optimistic theoretical estimate for this
-# implementation is ... 28/16=1.75, and not 2.16. Well, it's probably
-# way too optimistic, at least for such small Naggr. I'd argue that
-# (28+Tproc/Naggr), where Tproc is time required for Karatsuba pre-
-# and post-processing, is more realistic estimate. In this case it
-# gives ... 1.91 cycles per processed byte. Or in other words,
-# depending on how well we can interleave reduction and one of the
-# two multiplications the performance should be betwen 1.91 and 2.16.
-# As already mentioned, this implementation processes one byte [out
-# of 1KB buffer] in 2.13 cycles, while x86_64 counterpart - in 2.07.
-# x86_64 performance is better, because larger register bank allows
-# to interleave reduction and multiplication better.
+# to recognize. By serializing GHASH with CTR in same subroutine
+# former's performance is really limited to above (Tmul + Tmod/Naggr)
+# equation. But if GHASH procedure is detached, the modulo-reduction
+# can be interleaved with Naggr-1 multiplications at instruction level
+# and under ideal conditions even disappear from the equation. So that
+# optimistic theoretical estimate for this implementation is ...
+# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
+# at least for such small Naggr. I'd argue that (28+Tproc/Naggr),
+# where Tproc is time required for Karatsuba pre- and post-processing,
+# is more realistic estimate. In this case it gives ... 1.91 cycles.
+# Or in other words, depending on how well we can interleave reduction
+# and one of the two multiplications the performance should be between
+# 1.91 and 2.16. As already mentioned, this implementation processes
+# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
+# - in 2.02. x86_64 performance is better, because larger register
+# bank allows to interleave reduction and multiplication better.
 #
 # Does it make sense to increase Naggr? To start with it's virtually
 # impossible in 32-bit mode, because of limited register bank
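(Spelling out the arithmetic behind the hunk above, with Tmul = 28 cycles for the three-multiplication handling of one 16-byte block: Intel's fused estimate is (28 + 19/4)/16 ≈ 2.05 cycles per byte; this implementation's asymptote is (28 + 13/2)/16 ≈ 2.16; a fully hidden reduction gives 28/16 = 1.75; and the suggested realistic estimate with Tproc ≈ 5 is (28 + 5/2)/16 ≈ 1.91.)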
crypto/modes/asm/ghash-x86_64.pl
@@ -20,17 +20,18 @@
 #              gcc 3.4.x(*)    assembler
 #
 # P4           28.6            14.0            +100%
-# Opteron      18.5            7.7             +140%
-# Core2        17.5            8.1(**)         +115%
+# Opteron      19.3            7.7             +150%
+# Core2        17.8            8.1(**)         +120%
 #
 # (*)  comparison is not completely fair, because C results are
-#      for vanilla "256B" implementation, not "528B";-)
+#      for vanilla "256B" implementation, while assembler results
+#      are for "528B";-)
 # (**) it's mystery [to me] why Core2 result is not same as for
 #      Opteron;
 
 # May 2010
 #
-# Add PCLMULQDQ version performing at 2.07 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.02 cycles per processed byte.
 # See ghash-x86.pl for background information and details about coding
 # techniques.
 #