Commit Graph

152 Commits

Author SHA1 Message Date
Eric Biggers
1ca1b91794 crypto: chacha20-generic - refactor to allow varying number of rounds
In preparation for adding XChaCha12 support, rename/refactor
chacha20-generic to support different numbers of rounds.  The
justification for needing XChaCha12 support is explained in more detail
in the patch "crypto: chacha - add XChaCha12 support".

The only difference between ChaCha{8,12,20} are the number of rounds
itself; all other parts of the algorithm are the same.  Therefore,
remove the "20" from all definitions, structures, functions, files, etc.
that will be shared by all ChaCha versions.

Also make ->setkey() store the round count in the chacha_ctx (previously
chacha20_ctx).  The generic code then passes the round count through to
chacha_block().  There will be a ->setkey() function for each explicitly
allowed round count; the encrypt/decrypt functions will be the same.  I
decided not to do it the opposite way (same ->setkey() function for all
round counts, with different encrypt/decrypt functions) because that
would have required more boilerplate code in architecture-specific
implementations of ChaCha and XChaCha.

Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-20 14:26:55 +08:00
Ard Biesheuvel
cc3cc48972 crypto: arm64/aes-blk - ensure XTS mask is always loaded
Commit 2e5d2f33d1 ("crypto: arm64/aes-blk - improve XTS mask handling")
optimized away some reloads of the XTS mask vector, but failed to take
into account that calls into the XTS en/decrypt routines will take a
slightly different code path if a single block of input is split across
different buffers. So let's ensure that the first load occurs
unconditionally, and move the reload to the end so it doesn't occur
needlessly.

Fixes: 2e5d2f33d1 ("crypto: arm64/aes-blk - improve XTS mask handling")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-10-12 14:20:45 +08:00
Eric Biggers
7ff9036a62 crypto: arm64/aes - fix handling sub-block CTS-CBC inputs
In the new arm64 CTS-CBC implementation, return an error code rather
than crashing on inputs shorter than AES_BLOCK_SIZE bytes.  Also set
cra_blocksize to AES_BLOCK_SIZE (like is done in the cts template) to
indicate the minimum input size.

Fixes: dd597fb33f ("crypto: arm64/aes-blk - add support for CTS-CBC mode")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-10-08 13:47:02 +08:00
Ard Biesheuvel
2e5d2f33d1 crypto: arm64/aes-blk - improve XTS mask handling
The Crypto Extension instantiation of the aes-modes.S collection of
skciphers uses only 15 NEON registers for the round key array, whereas
the pure NEON flavor uses 16 NEON registers for the AES S-box.

This means we have a spare register available that we can use to hold
the XTS mask vector, removing the need to reload it at every iteration
of the inner loop.

Since the pure NEON version does not permit this optimization, tweak
the macros so we can factor out this functionality. Also, replace the
literal load with a short sequence to compose the mask vector.

On Cortex-A53, this results in a ~4% speedup.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-21 13:24:50 +08:00
Ard Biesheuvel
dd597fb33f crypto: arm64/aes-blk - add support for CTS-CBC mode
Currently, we rely on the generic CTS chaining mode wrapper to
instantiate the cts(cbc(aes)) skcipher. Due to the high performance
of the ARMv8 Crypto Extensions AES instructions (~1 cycles per byte),
any overhead in the chaining mode layers is amplified, and so it pays
off considerably to fold the CTS handling into the SIMD routines.

On Cortex-A53, this results in a ~50% speedup for smaller input sizes.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-21 13:24:50 +08:00
Ard Biesheuvel
6e7de6af91 crypto: arm64/aes-blk - revert NEON yield for skciphers
The reasoning of commit f10dc56c64 ("crypto: arm64 - revert NEON yield
for fast AEAD implementations") applies equally to skciphers: the walk
API already guarantees that the input size of each call into the NEON
code is bounded to the size of a page, and so there is no need for an
additional TIF_NEED_RESCHED flag check inside the inner loop. So revert
the skcipher changes to aes-modes.S (but retain the mac ones)

This partially reverts commit 0c8f838a52.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-21 13:24:50 +08:00
Ard Biesheuvel
557ecb4543 crypto: arm64/aes-blk - remove pointless (u8 *) casts
For some reason, the asmlinkage prototypes of the NEON routines take
u8[] arguments for the round key arrays, while the actual round keys
are arrays of u32, and so passing them into those routines requires
u8* casts at each occurrence. Fix that.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-21 13:24:50 +08:00
Ard Biesheuvel
2fffee536c crypto: arm64/crct10dif - implement non-Crypto Extensions alternative
The arm64 implementation of the CRC-T10DIF algorithm uses the 64x64 bit
polynomial multiplication instructions, which are optional in the
architecture, and if these instructions are not available, we fall back
to the C routine which is slow and inefficient.

So let's reuse the 64x64 bit PMULL alternative from the GHASH driver that
uses a sequence of ~40 instructions involving 8x8 bit PMULL and some
shifting and masking. This is a lot slower than the original, but it is
still twice as fast as the current [unoptimized] C code on Cortex-A53,
and it is time invariant and much easier on the D-cache.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:37:04 +08:00
Ard Biesheuvel
6c1b0da13e crypto: arm64/crct10dif - preparatory refactor for 8x8 PMULL version
Reorganize the CRC-T10DIF asm routine so we can easily instantiate an
alternative version based on 8x8 polynomial multiplication in a
subsequent patch.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:37:04 +08:00
Ard Biesheuvel
598b7d41e5 crypto: arm64/crc32 - remove PMULL based CRC32 driver
Now that the scalar fallbacks have been moved out of this driver into
the core crc32()/crc32c() routines, we are left with a CRC32 crypto API
driver for arm64 that is based only on 64x64 polynomial multiplication,
which is an optional instruction in the ARMv8 architecture, and is less
and less likely to be available on cores that do not also implement the
CRC32 instructions, given that those are mandatory in the architecture
as of ARMv8.1.

Since the scalar instructions do not require the special handling that
SIMD instructions do, and since they turn out to be considerably faster
on some cores (Cortex-A53) as well, there is really no point in keeping
this code around so let's just remove it.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:37:04 +08:00
Ard Biesheuvel
ed6ed11830 crypto: arm64/aes-modes - get rid of literal load of addend vector
Replace the literal load of the addend vector with a sequence that
performs each add individually. This sequence is only 2 instructions
longer than the original, and 2% faster on Cortex-A53.

This is an improvement by itself, but also works around a Clang issue,
whose integrated assembler does not implement the GNU ARM asm syntax
completely, and does not support the =literal notation for FP registers
(more info at https://bugs.llvm.org/show_bug.cgi?id=38642)

Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:37:04 +08:00
Jason A. Donenfeld
578bdaabd0 crypto: speck - remove Speck
These are unused, undesired, and have never actually been used by
anybody. The original authors of this code have changed their mind about
its inclusion. While originally proposed for disk encryption on low-end
devices, the idea was discarded [1] in favor of something else before
that could really get going. Therefore, this patch removes Speck.

[1] https://marc.info/?l=linux-crypto-vger&m=153359499015659

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:35:03 +08:00
Ard Biesheuvel
c2b24c36e0 crypto: arm64/aes-gcm-ce - fix scatterwalk API violation
Commit 71e52c278c ("crypto: arm64/aes-ce-gcm - operate on
two input blocks at a time") modified the granularity at which
the AES/GCM code processes its input to allow subsequent changes
to be applied that improve performance by using aggregation to
process multiple input blocks at once.

For this reason, it doubled the algorithm's 'chunksize' property
to 2 x AES_BLOCK_SIZE, but retained the non-SIMD fallback path that
processes a single block at a time. In some cases, this violates the
skcipher scatterwalk API, by calling skcipher_walk_done() with a
non-zero residue value for a chunk that is expected to be handled
in its entirety. This results in a WARN_ON() to be hit by the TLS
self test code, but is likely to break other user cases as well.
Unfortunately, none of the current test cases exercises this exact
code path at the moment.

Fixes: 71e52c278c ("crypto: arm64/aes-ce-gcm - operate on two ...")
Reported-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-25 19:50:43 +08:00
Ard Biesheuvel
7fa885e2a2 crypto: arm64/sm4-ce - check for the right CPU feature bit
ARMv8.2 specifies special instructions for the SM3 cryptographic hash
and the SM4 symmetric cipher. While it is unlikely that a core would
implement one and not the other, we should only use SM4 instructions
if the SM4 CPU feature bit is set, and we currently check the SM3
feature bit instead. So fix that.

Fixes: e99ce921c4 ("crypto: arm64 - add support for SM4...")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-25 19:50:41 +08:00
Ard Biesheuvel
22240df7ac crypto: arm64/ghash-ce - implement 4-way aggregation
Enhance the GHASH implementation that uses 64-bit polynomial
multiplication by adding support for 4-way aggregation. This
more than doubles the performance, from 2.4 cycles per byte
to 1.1 cpb on Cortex-A53.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-07 17:51:40 +08:00
Ard Biesheuvel
8e492eff7d crypto: arm64/ghash-ce - replace NEON yield check with block limit
Checking the TIF_NEED_RESCHED flag is disproportionately costly on cores
with fast crypto instructions and comparatively slow memory accesses.

On algorithms such as GHASH, which executes at ~1 cycle per byte on
cores that implement support for 64 bit polynomial multiplication,
there is really no need to check the TIF_NEED_RESCHED particularly
often, and so we can remove the NEON yield check from the assembler
routines.

However, unlike the AEAD or skcipher APIs, the shash/ahash APIs take
arbitrary input lengths, and so there needs to be some sanity check
to ensure that we don't hog the CPU for excessive amounts of time.

So let's simply cap the maximum input size that is processed in one go
to 64 KB.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-07 17:51:39 +08:00
Ard Biesheuvel
30f1a9f53e crypto: arm64/aes-ce-gcm - don't reload key schedule if avoidable
Squeeze out another 5% of performance by minimizing the number
of invocations of kernel_neon_begin()/kernel_neon_end() on the
common path, which also allows some reloads of the key schedule
to be optimized away.

The resulting code runs at 2.3 cycles per byte on a Cortex-A53.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-07 17:38:04 +08:00
Ard Biesheuvel
e0bd888dc4 crypto: arm64/aes-ce-gcm - implement 2-way aggregation
Implement a faster version of the GHASH transform which amortizes
the reduction modulo the characteristic polynomial across two
input blocks at a time.

On a Cortex-A53, the gcm(aes) performance increases 24%, from
3.0 cycles per byte to 2.4 cpb for large input sizes.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-07 17:38:04 +08:00
Ard Biesheuvel
71e52c278c crypto: arm64/aes-ce-gcm - operate on two input blocks at a time
Update the core AES/GCM transform and the associated plumbing to operate
on 2 AES/GHASH blocks at a time. By itself, this is not expected to
result in a noticeable speedup, but it paves the way for reimplementing
the GHASH component using 2-way aggregation.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-07 17:38:04 +08:00
Herbert Xu
3465893d27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Merge crypto-2.6 to pick up NEON yield revert.
2018-08-07 17:37:10 +08:00
Ard Biesheuvel
f10dc56c64 crypto: arm64 - revert NEON yield for fast AEAD implementations
As it turns out, checking the TIF_NEED_RESCHED flag after each
iteration results in a significant performance regression (~10%)
when running fast algorithms (i.e., ones that use special instructions
and operate in the < 4 cycles per byte range) on in-order cores with
comparatively slow memory accesses such as the Cortex-A53.

Given the speed of these ciphers, and the fact that the page based
nature of the AEAD scatterwalk API guarantees that the core NEON
transform is never invoked with more than a single page's worth of
input, we can estimate the worst case duration of any resulting
scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
processing a page's worth of input at 4 cycles per byte results in
a delay of ~250 us, which is a reasonable upper bound.

So let's remove the yield checks from the fused AES-CCM and AES-GCM
routines entirely.

This reverts commit 7b67ae4d5c and
partially reverts commit 7c50136a8a.

Fixes: 7c50136a8a ("crypto: arm64/aes-ghash - yield NEON after every ...")
Fixes: 7b67ae4d5c ("crypto: arm64/aes-ccm - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-07 17:26:23 +08:00
Herbert Xu
c5f5aeef9b Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
Merge mainline to pick up c7513c2a27 ("crypto/arm64: aes-ce-gcm -
add missing kernel_neon_begin/end pair").
2018-08-03 17:55:12 +08:00
Ard Biesheuvel
c7513c2a27 crypto/arm64: aes-ce-gcm - add missing kernel_neon_begin/end pair
Calling pmull_gcm_encrypt_block() requires kernel_neon_begin() and
kernel_neon_end() to be used since the routine touches the NEON
register file. Add the missing calls.

Also, since NEON register contents are not preserved outside of
a kernel mode NEON region, pass the key schedule array again.

Fixes: 7c50136a8a ("crypto: arm64/aes-ghash - yield NEON after every ...")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-07-31 13:20:30 +01:00
Eric Biggers
6b0daa7820 crypto: arm64/sha256 - increase cra_priority of scalar implementations
Commit b73b7ac0a7 ("crypto: sha256_generic - add cra_priority") gave
sha256-generic and sha224-generic a cra_priority of 100, to match the
convention for generic implementations.  But sha256-arm64 and
sha224-arm64 also have priority 100, so their order relative to the
generic implementations became ambiguous.

Therefore, increase their priority to 125 so that they have higher
priority than the generic implementations but lower priority than the
NEON implementations which have priority 150.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-07-27 19:16:38 +08:00
Eric Biggers
e50944e219 crypto: shash - remove useless setting of type flags
Many shash algorithms set .cra_flags = CRYPTO_ALG_TYPE_SHASH.  But this
is redundant with the C structure type ('struct shash_alg'), and
crypto_register_shash() already sets the type flag automatically,
clearing any type flag that was already there.  Apparently the useless
assignment has just been copy+pasted around.

So, remove the useless assignment from all the shash algorithms.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-07-09 00:30:24 +08:00
Jia He
6e88f01206 crypto: arm64/aes-blk - fix and move skcipher_walk_done out of kernel_neon_begin, _end
In a arm64 server(QDF2400),I met a similar might-sleep warning as [1]:
[    7.019116] BUG: sleeping function called from invalid context at
./include/crypto/algapi.h:416
[    7.027863] in_atomic(): 1, irqs_disabled(): 0, pid: 410, name:
cryptomgr_test
[    7.035106] 1 lock held by cryptomgr_test/410:
[    7.039549]  #0:         (ptrval) (&drbg->drbg_mutex){+.+.}, at:
drbg_instantiate+0x34/0x398
[    7.048038] CPU: 9 PID: 410 Comm: cryptomgr_test Not tainted
4.17.0-rc6+ #27
[    7.068228]  dump_backtrace+0x0/0x1c0
[    7.071890]  show_stack+0x24/0x30
[    7.075208]  dump_stack+0xb0/0xec
[    7.078523]  ___might_sleep+0x160/0x238
[    7.082360]  skcipher_walk_done+0x118/0x2c8
[    7.086545]  ctr_encrypt+0x98/0x130
[    7.090035]  simd_skcipher_encrypt+0x68/0xc0
[    7.094304]  drbg_kcapi_sym_ctr+0xd4/0x1f8
[    7.098400]  drbg_ctr_update+0x98/0x330
[    7.102236]  drbg_seed+0x1b8/0x2f0
[    7.105637]  drbg_instantiate+0x2ac/0x398
[    7.109646]  drbg_kcapi_seed+0xbc/0x188
[    7.113482]  crypto_rng_reset+0x4c/0xb0
[    7.117319]  alg_test_drbg+0xec/0x330
[    7.120981]  alg_test.part.6+0x1c8/0x3c8
[    7.124903]  alg_test+0x58/0xa0
[    7.128044]  cryptomgr_test+0x50/0x58
[    7.131708]  kthread+0x134/0x138
[    7.134936]  ret_from_fork+0x10/0x1c

Seems there is a bug in Ard Biesheuvel's commit.
Fixes: 6833817472 ("crypto: arm64/aes-blk - move kernel mode neon
en/disable into loop")

[1] https://www.spinics.net/lists/linux-crypto/msg33103.html

Signed-off-by: jia.he@hxt-semitech.com
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: <stable@vger.kernel.org> # 4.17
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-06-15 23:06:46 +08:00
Adam Langley
c2e415fe75 crypto: clarify licensing of OpenSSL asm code
Several source files have been taken from OpenSSL. In some of them a
comment that "permission to use under GPL terms is granted" was
included below a contradictory license statement. In several cases,
there was no indication that the license of the code was compatible
with the GPLv2.

This change clarifies the licensing for all of these files. I've
confirmed with the author (Andy Polyakov) that a) he has licensed the
files with the GPLv2 comment under that license and b) that he's also
happy to license the other files under GPLv2 too. In one case, the
file is already contained in his CRYPTOGAMS bundle, which has a GPLv2
option, and so no special measures are needed.

In all cases, the license status of code has been clarified by making
the GPLv2 license prominent.

The .S files have been regenerated from the updated .pl files.

This is a comment-only change. No code is changed.

Signed-off-by: Adam Langley <agl@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-31 00:13:44 +08:00
Ard Biesheuvel
6caf7adc5e crypto: arm64/sha512-ce - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:12 +08:00
Ard Biesheuvel
7edc86cb1c crypto: arm64/sha3-ce - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:11 +08:00
Ard Biesheuvel
5b3da65177 crypto: arm64/crct10dif-ce - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:11 +08:00
Ard Biesheuvel
4e530fba69 crypto: arm64/crc32-ce - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:10 +08:00
Ard Biesheuvel
7c50136a8a crypto: arm64/aes-ghash - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:10 +08:00
Ard Biesheuvel
20ab633258 crypto: arm64/aes-bs - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:09 +08:00
Ard Biesheuvel
0c8f838a52 crypto: arm64/aes-blk - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:08 +08:00
Ard Biesheuvel
7b67ae4d5c crypto: arm64/aes-ccm - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:07 +08:00
Ard Biesheuvel
d82f37ab5e crypto: arm64/sha2-ce - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:06 +08:00
Ard Biesheuvel
7df8d16475 crypto: arm64/sha1-ce - yield NEON after every block of input
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-12 00:13:05 +08:00
Ard Biesheuvel
e99ce921c4 crypto: arm64 - add support for SM4 encryption using special instructions
Add support for the SM4 symmetric cipher implemented using the special
SM4 instructions introduced in ARM architecture revision 8.2.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-05 14:52:53 +08:00
Masahiro Yamada
54a702f705 kbuild: mark $(targets) as .SECONDARY and remove .PRECIOUS markers
GNU Make automatically deletes intermediate files that are updated
in a chain of pattern rules.

Example 1) %.dtb.o <- %.dtb.S <- %.dtb <- %.dts
Example 2) %.o <- %.c <- %.c_shipped

A couple of makefiles mark such targets as .PRECIOUS to prevent Make
from deleting them, but the correct way is to use .SECONDARY.

  .SECONDARY
    Prerequisites of this special target are treated as intermediate
    files but are never automatically deleted.

  .PRECIOUS
    When make is interrupted during execution, it may delete the target
    file it is updating if the file was modified since make started.
    If you mark the file as precious, make will never delete the file
    if interrupted.

Both can avoid deletion of intermediate files, but the difference is
the behavior when Make is interrupted; .SECONDARY deletes the target,
but .PRECIOUS does not.

The use of .PRECIOUS is relatively rare since we do not want to keep
partially constructed (possibly corrupted) targets.

Another difference is that .PRECIOUS works with pattern rules whereas
.SECONDARY does not.

  .PRECIOUS: $(obj)/%.lex.c

works, but

  .SECONDARY: $(obj)/%.lex.c

has no effect.  However, for the reason above, I do not want to use
.PRECIOUS which could cause obscure build breakage.

The targets specified as .SECONDARY must be explicit.  $(targets)
contains all targets that need to include .*.cmd files.  So, the
intermediates you want to keep are mostly in there.  Therefore, mark
$(targets) as .SECONDARY.  It means primary targets are also marked
as .SECONDARY, but I do not see any drawback for this.

I replaced some .SECONDARY / .PRECIOUS markers with 'targets'.  This
will make Kbuild search for non-existing .*.cmd files, but this is
not a noticeable performance issue.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Frank Rowand <frowand.list@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2018-04-07 19:04:02 +09:00
Leonard Crestez
6aaf49b495 crypto: arm,arm64 - Fix random regeneration of S_shipped
The decision to rebuild .S_shipped is made based on the relative
timestamps of .S_shipped and .pl files but git makes this essentially
random. This means that the perl script might run anyway (usually at
most once per checkout), defeating the whole purpose of _shipped.

Fix by skipping the rule unless explicit make variables are provided:
REGENERATE_ARM_CRYPTO or REGENERATE_ARM64_CRYPTO.

This can produce nasty occasional build failures downstream, for example
for toolchains with broken perl. The solution is minimally intrusive to
make it easier to push into stable.

Another report on a similar issue here: https://lkml.org/lkml/2018/3/8/1379

Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-23 23:43:19 +08:00
Ard Biesheuvel
1a3713c7cd crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.

Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:58 +08:00
Ard Biesheuvel
870c163a0e crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
CBC MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with other modes in terms of granularity (i.e., it is better to
have all routines yield every 64 bytes and not have an exception for
CBC MAC which yields every 16 bytes)

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:58 +08:00
Ard Biesheuvel
a8f8a69e82 crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes and not have
an exception for CBC encrypt which yields every 16 bytes)

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:57 +08:00
Ard Biesheuvel
55868b45cf crypto: arm64/aes-blk - remove configurable interleave
The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build time configurable (as well as an option to instantiate
all interleaved sequences inline rather than as subroutines)

We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:55 +08:00
Ard Biesheuvel
4bf7e7a19d crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:55 +08:00
Ard Biesheuvel
78ad7b08d8 crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:55 +08:00
Ard Biesheuvel
6833817472 crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:54 +08:00
Ard Biesheuvel
bd2ad885e3 crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:54 +08:00
Eric Biggers
91a2abb78f crypto: arm64/speck - add NEON-accelerated implementation of Speck-XTS
Add a NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
for ARM64.  This is ported from the 32-bit version.  It may be useful on
devices with 64-bit ARM CPUs that don't have the Cryptography
Extensions, so cannot do AES efficiently -- e.g. the Cortex-A53
processor on the Raspberry Pi 3.

It generally works the same way as the 32-bit version, but there are
some slight differences due to the different instructions, registers,
and syntax available in ARM64 vs. in ARM32.  For example, in the 64-bit
version there are enough registers to hold the XTS tweaks for each
128-byte chunk, so they don't need to be saved on the stack.

Benchmarks on a Raspberry Pi 3 running a 64-bit kernel:

   Algorithm                              Encryption     Decryption
   ---------                              ----------     ----------
   Speck64/128-XTS (NEON)                 92.2 MB/s      92.2 MB/s
   Speck128/256-XTS (NEON)                75.0 MB/s      75.0 MB/s
   Speck128/256-XTS (generic)             47.4 MB/s      35.6 MB/s
   AES-128-XTS (NEON bit-sliced)          33.4 MB/s      29.6 MB/s
   AES-256-XTS (NEON bit-sliced)          24.6 MB/s      21.7 MB/s

The code performs well on higher-end ARM64 processors as well, though
such processors tend to have the Crypto Extensions which make AES
preferred.  For example, here are the same benchmarks run on a HiKey960
(with CPU affinity set for the A73 cores), with the Crypto Extensions
implementation of AES-256-XTS added:

   Algorithm                              Encryption     Decryption
   ---------                              -----------    -----------
   AES-256-XTS (Crypto Extensions)        1273.3 MB/s    1274.7 MB/s
   Speck64/128-XTS (NEON)                  359.8 MB/s     348.0 MB/s
   Speck128/256-XTS (NEON)                 292.5 MB/s     286.1 MB/s
   Speck128/256-XTS (generic)              186.3 MB/s     181.8 MB/s
   AES-128-XTS (NEON bit-sliced)           142.0 MB/s     124.3 MB/s
   AES-256-XTS (NEON bit-sliced)           104.7 MB/s      91.1 MB/s

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-16 23:35:41 +08:00
Ard Biesheuvel
fb87127bce crypto: arm64/sha512 - fix/improve new v8.2 Crypto Extensions code
Add a missing symbol export that prevents this code to be built as a
module. Also, move the round constant table to the .rodata section,
and use a more optimized version of the core transform.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-01-26 01:10:36 +11:00