The XTS template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic XTS code themselves via xts_crypt().
Remove the xts-camellia-asm algorithm which did this. Users who request
xts(camellia) and previously would have gotten xts-camellia-asm will now
get xts(ecb-camellia-asm) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-camellia-asm algorithm which did this. Users who request
lrw(camellia) and previously would have gotten lrw-camellia-asm will now
get lrw(ecb-camellia-asm) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-camellia-aesni-avx2 algorithm which did this. Users who
request lrw(camellia) and previously would have gotten
lrw-camellia-aesni-avx2 will now get lrw(ecb-camellia-aesni-avx2)
instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-camellia-aesni algorithm which did this. Users who
request lrw(camellia) and previously would have gotten
lrw-camellia-aesni will now get lrw(ecb-camellia-aesni) instead, which
is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the x86 asm implementation of Triple DES from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the x86 asm implementation of Blowfish from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX implementation of CAST6 from the (deprecated) ablkcipher
and blkcipher interfaces over to the skcipher interface. Note that this
includes replacing the use of ablk_helper with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-cast6-avx algorithm which did this. Users who request
lrw(cast6) and previously would have gotten lrw-cast6-avx will now get
lrw(ecb-cast6-avx) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX implementation of CAST5 from the (deprecated) ablkcipher
and blkcipher interfaces over to the skcipher interface. Note that this
includes replacing the use of ablk_helper with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX implementation of Twofish from the (deprecated)
ablkcipher and blkcipher interfaces over to the skcipher interface.
Note that this includes replacing the use of ablk_helper with
crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-twofish-avx algorithm which did this. Users who request
lrw(twofish) and previously would have gotten lrw-twofish-avx will now
get lrw(ecb-twofish-avx) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the 3-way implementation of Twofish from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The XTS template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic XTS code themselves via xts_crypt().
Remove the xts-twofish-3way algorithm which did this. Users who request
xts(twofish) and previously would have gotten xts-twofish-3way will now
get xts(ecb-twofish-3way) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-twofish-3way algorithm which did this. Users who request
lrw(twofish) and previously would have gotten lrw-twofish-3way will now
get lrw(ecb-twofish-3way) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX and AVX2 implementations of Serpent from the
(deprecated) ablkcipher and blkcipher interfaces over to the skcipher
interface. Note that this includes replacing the use of ablk_helper
with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-serpent-avx algorithm which did this. Users who request
lrw(serpent) and previously would have gotten lrw-serpent-avx will now
get lrw(ecb-serpent-avx) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-serpent-avx2 algorithm which did this. Users who request
lrw(serpent) and previously would have gotten lrw-serpent-avx2 will now
get lrw(ecb-serpent-avx2) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the SSE2 implementation of Serpent from the (deprecated)
ablkcipher and blkcipher interfaces over to the skcipher interface.
Note that this includes replacing the use of ablk_helper with
crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The XTS template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic XTS code themselves via xts_crypt().
Remove the xts-serpent-sse2 algorithm which did this. Users who request
xts(serpent) and previously would have gotten xts-serpent-sse2 will now
get xts(ecb-serpent-sse2) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-serpent-sse2 algorithm which did this. Users who request
lrw(serpent) and previously would have gotten lrw-serpent-sse2 will now
get lrw(ecb-serpent-sse2) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a function to crypto_simd that registers an array of skcipher
algorithms, then allocates and registers the simd wrapper algorithms for
them. It assumes the naming scheme where the names of the underlying
algorithms are prefixed with two underscores.
Also add the corresponding 'unregister' function.
Most of the x86 crypto modules will be able to use these.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add test vectors for Speck64-XTS, generated in userspace using C code.
The inputs were borrowed from the AES-XTS test vectors, with key lengths
adjusted.
xts-speck64-neon passes these tests. However, they aren't currently
applicable for the generic XTS template, as that only supports a 128-bit
block size.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add test vectors for Speck128-XTS, generated in userspace using C code.
The inputs were borrowed from the AES-XTS test vectors.
Both xts(speck128-generic) and xts-speck128-neon pass these tests.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Export the Speck constants and transform context and the ->setkey(),
->encrypt(), and ->decrypt() functions so that they can be reused by the
ARM NEON implementation of Speck-XTS. The generic key expansion code
will be reused because it is not performance-critical and is not
vectorizable, while the generic encryption and decryption functions are
needed as fallbacks and for the XTS tweak encryption.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a generic implementation of Speck, including the Speck128 and
Speck64 variants. Speck is a lightweight block cipher that can be much
faster than AES on processors that don't have AES instructions.
We are planning to offer Speck-XTS (probably Speck128/256-XTS) as an
option for dm-crypt and fscrypt on Android, for low-end mobile devices
with older CPUs such as ARMv7 which don't have the Cryptography
Extensions. Currently, such devices are unencrypted because AES is not
fast enough, even when the NEON bit-sliced implementation of AES is
used. Other AES alternatives such as Twofish, Threefish, Camellia,
CAST6, and Serpent aren't fast enough either; it seems that only a
modern ARX cipher can provide sufficient performance on these devices.
This is a replacement for our original proposal
(https://patchwork.kernel.org/patch/10101451/) which was to offer
ChaCha20 for these devices. However, the use of a stream cipher for
disk/file encryption with no space to store nonces would have been much
more insecure than we thought initially, given that it would be used on
top of flash storage as well as potentially on top of F2FS, neither of
which is guaranteed to overwrite data in-place.
Speck has been somewhat controversial due to its origin. Nevertheless,
it has a straightforward design (it's an ARX cipher), and it appears to
be the leading software-optimized lightweight block cipher currently,
with the most cryptanalysis. It's also easy to implement without side
channels, unlike AES. Moreover, we only intend Speck to be used when
the status quo is no encryption, due to AES not being fast enough.
We've also considered a novel length-preserving encryption mode based on
ChaCha20 and Poly1305. While theoretically attractive, such a mode
would be a brand new crypto construction and would be more complicated
and difficult to implement efficiently in comparison to Speck-XTS.
There is confusion about the byte and word orders of Speck, since the
original paper doesn't specify them. But we have implemented it using
the orders the authors recommended in a correspondence with them. The
test vectors are taken from the original paper but were mapped to byte
arrays using the recommended byte and word orders.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The RSA private key for the first form should have
version, prime1, prime2, exponent1, exponent2, coefficient
values 0.
With non-zero values for prime1,2, exponent 1,2 and coefficient
the Intel QAT driver will assume that values are provided for the
private key second form. This will result in signature verification
failures for modules where QAT device is present and the modules
are signed with rsa,sha256.
Cc: <stable@vger.kernel.org>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Conor McLoughlin <conor.mcloughlin@intel.com>
Reviewed-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The crypto engine could actually only enqueue hash and ablkcipher request.
This patch permit it to enqueue any type of crypto_async_request.
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Tested-by: Fabien Dessenne <fabien.dessenne@st.com>
Tested-by: Fabien Dessenne <fabien.dessenne@st.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
After checking all possible call chains to crypto_report here,
my tool finds that crypto_report is never called in atomic context.
And crypto_report calls crypto_alg_match which calls down_read,
thus it proves again that crypto_report can call functions which may sleep.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
This is found by a static analysis tool named DCNS written by myself.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
After checking all possible call chains to kzalloc here,
my tool finds that this kzalloc is never called in atomic context.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
This is found by a static analysis tool named DCNS written by myself.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
There is no need for ahash_mcryptd_{update,final,finup,digest}(); we
should just call crypto_ahash_*() directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Export and import are mandatory in async hash. As drivers were
rewritten, drop empty wrappers and correct init of ahash transformation.
Signed-off-by: Kamil Konieczny <k.konieczny@partner.samsung.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
- oversize stack frames on mn10300 in sha3-generic
- warning on old compilers in sha3-generic
- API error in sun4i_ss_prng
- potential dead-lock in sun4i_ss_prng
- null-pointer dereference in sha512-mb
- endless loop when DECO acquire fails in caam
- kernel oops when hashing empty message in talitos"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: sun4i_ss_prng - convert lock to _bh in sun4i_ss_prng_generate
crypto: sun4i_ss_prng - fix return value of sun4i_ss_prng_generate
crypto: caam - fix endless loop when DECO acquire fails
crypto: sha3-generic - Use __optimize to support old compilers
compiler-gcc.h: __nostackprotector needs gcc-4.4 and up
compiler-gcc.h: Introduce __optimize function attribute
crypto: sha3-generic - deal with oversize stack frames
crypto: talitos - fix Kernel Oops on hashing an empty file
crypto: sha512-mb - initialize pending lengths correctly
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by kbuild test robot, the optimized SHA3 C implementation
compiles to mn10300 code that uses a disproportionate amount of stack
space, i.e.,
crypto/sha3_generic.c: In function 'keccakf':
crypto/sha3_generic.c:147:1: warning: the frame size of 1232 bytes is larger than 1024 bytes [-Wframe-larger-than=]
As kindly diagnosed by Arnd, this does not only occur when building for
the mn10300 architecture (which is what the report was about) but also
for h8300, and builds for other 32-bit architectures show an increase in
stack space utilization as well.
Given that SHA3 operates on 64-bit quantities, and keeps a state matrix
of 25 64-bit words, it is not surprising that 32-bit architectures with
few general purpose registers are impacted the most by this, and it is
therefore reasonable to implement a workaround that distinguishes between
32-bit and 64-bit architectures.
Arnd figured out that taking the round calculation out of the loop, and
inlining it explicitly but only on 64-bit architectures preserves most
of the performance gain achieved by the rewrite, and also gets rid of
the excessive use of stack space.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull crypto updates from Herbert Xu:
"API:
- Enforce the setting of keys for keyed aead/hash/skcipher
algorithms.
- Add multibuf speed tests in tcrypt.
Algorithms:
- Improve performance of sha3-generic.
- Add native sha512 support on arm64.
- Add v8.2 Crypto Extentions version of sha3/sm3 on arm64.
- Avoid hmac nesting by requiring underlying algorithm to be unkeyed.
- Add cryptd_max_cpu_qlen module parameter to cryptd.
Drivers:
- Add support for EIP97 engine in inside-secure.
- Add inline IPsec support to chelsio.
- Add RevB core support to crypto4xx.
- Fix AEAD ICV check in crypto4xx.
- Add stm32 crypto driver.
- Add support for BCM63xx platforms in bcm2835 and remove bcm63xx.
- Add Derived Key Protocol (DKP) support in caam.
- Add Samsung Exynos True RNG driver.
- Add support for Exynos5250+ SoCs in exynos PRNG driver"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (166 commits)
crypto: picoxcell - Fix error handling in spacc_probe()
crypto: arm64/sha512 - fix/improve new v8.2 Crypto Extensions code
crypto: arm64/sm3 - new v8.2 Crypto Extensions implementation
crypto: arm64/sha3 - new v8.2 Crypto Extensions implementation
crypto: testmgr - add new testcases for sha3
crypto: sha3-generic - export init/update/final routines
crypto: sha3-generic - simplify code
crypto: sha3-generic - rewrite KECCAK transform to help the compiler optimize
crypto: sha3-generic - fixes for alignment and big endian operation
crypto: aesni - handle zero length dst buffer
crypto: artpec6 - remove select on non-existing CRYPTO_SHA384
hwrng: bcm2835 - Remove redundant dev_err call in bcm2835_rng_probe()
crypto: stm32 - remove redundant dev_err call in stm32_cryp_probe()
crypto: axis - remove unnecessary platform_get_resource() error check
crypto: testmgr - test misuse of result in ahash
crypto: inside-secure - make function safexcel_try_push_requests static
crypto: aes-generic - fix aes-generic regression on powerpc
crypto: chelsio - Fix indentation warning
crypto: arm64/sha1-ce - get rid of literal pool
crypto: arm64/sha2-ce - move the round constant table to .rodata section
...
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's alwways been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it to reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
All current SHA3 test cases are smaller than the SHA3 block size, which
means not all code paths are being exercised. So add a new test case to
each variant, and make one of the existing test cases chunked.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
To allow accelerated implementations to fall back to the generic
routines, e.g., in contexts where a SIMD based implementation is
not allowed to run, expose the generic SHA3 init/update/final
routines to other modules.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation of exposing the generic SHA3 implementation to other
versions as a fallback, simplify the code, and remove an inconsistency
in the output handling (endian swabbing rsizw words of state before
writing the output does not make sense)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The way the KECCAK transform is currently coded involves many references
into the state array using indexes that are calculated at runtime using
simple but non-trivial arithmetic. This forces the compiler to treat the
state matrix as an array in memory rather than keep it in registers,
which results in poor performance.
So instead, let's rephrase the algorithm using fixed array indexes only.
This helps the compiler keep the state matrix in registers, resulting
in the following speedup (SHA3-256 performance in cycles per byte):
before after speedup
Intel Core i7 @ 2.0 GHz (2.9 turbo) 100.6 35.7 2.8x
Cortex-A57 @ 2.0 GHz (64-bit mode) 101.6 12.7 8.0x
Cortex-A53 @ 1.0 GHz 224.4 15.8 14.2x
Cortex-A57 @ 2.0 GHz (32-bit mode) 201.8 63.0 3.2x
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ensure that the input is byte swabbed before injecting it into the
SHA3 transform. Use the get_unaligned() accessor for this so that
we don't perform unaligned access inadvertently on architectures
that do not support that.
Cc: <stable@vger.kernel.org>
Fixes: 53964b9ee6 ("crypto: sha3 - Add SHA-3 hash algorithm")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Async hash operations can use result pointer in final/finup/digest,
but not in init/update/export/import, so test it for misuse.
Signed-off-by: Kamil Konieczny <k.konieczny@partner.samsung.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
My last bugfix added -Os on the command line, which unfortunately caused
a build regression on powerpc in some configurations.
I've done some more analysis of the original problem and found slightly
different workaround that avoids this regression and also results in
better performance on gcc-7.0: -fcode-hoisting is an optimization step
that got added in gcc-7 and that for all gcc-7 versions causes worse
performance.
This disables -fcode-hoisting on all compilers that understand the option.
For gcc-7.1 and 7.2 I found the same performance as my previous patch
(using -Os), in gcc-7.0 it was even better. On gcc-8 I could see no
change in performance from this patch. In theory, code hoisting should
not be able make things better for the AES cipher, so leaving it
disabled for gcc-8 only serves to simplify the Makefile change.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Link: https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg30418.html
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
Fixes: 148b974dee ("crypto: aes-generic - build with -Os on gcc-7+")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull crypto fix from Herbert Xu:
"This fixes a NULL pointer dereference in crypto_remove_spawns that can
be triggered through af_alg"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: algapi - fix NULL dereference in crypto_remove_spawns()
Convert salsa20-asm from the deprecated "blkcipher" API to the
"skcipher" API, in the process fixing it up to use the generic helpers.
This allows removing the salsa20_keysetup() and salsa20_ivsetup()
assembly functions, which aren't performance critical; the C versions do
just fine.
This also fixes the same bug that salsa20-generic had, where the state
array was being maintained directly in the transform context rather than
on the stack or in the request context. Thus, if multiple threads used
the same Salsa20 transform concurrently they produced the wrong results.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Export the Salsa20 constants, transform context, and initialization
functions so that they can be reused by the x86 implementation.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert salsa20-generic from the deprecated "blkcipher" API to the
"skcipher" API, in the process fixing it up to be thread-safe (as the
crypto API expects) by maintaining each request's state separately from
the transform context.
Also remove the unnecessary cra_alignmask and tighten validation of the
key size by accepting only 16 or 32 bytes, not anything in between.
These changes bring the code close to the way chacha20-generic does
things, so hopefully it will be easier to maintain in the future.
However, the way Salsa20 interprets the IV is still slightly different;
that was not changed.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
While testing other changes, I discovered that gcc-7.2.1 produces badly
optimized code for aes_encrypt/aes_decrypt. This is especially true when
CONFIG_UBSAN_SANITIZE_ALL is enabled, where it leads to extremely
large stack usage that in turn might cause kernel stack overflows:
crypto/aes_generic.c: In function 'aes_encrypt':
crypto/aes_generic.c:1371:1: warning: the frame size of 4880 bytes is larger than 2048 bytes [-Wframe-larger-than=]
crypto/aes_generic.c: In function 'aes_decrypt':
crypto/aes_generic.c:1441:1: warning: the frame size of 4864 bytes is larger than 2048 bytes [-Wframe-larger-than=]
I verified that this problem exists on all architectures that are
supported by gcc-7.2, though arm64 in particular is less affected than
the others. I also found that gcc-7.1 and gcc-8 do not show the extreme
stack usage but still produce worse code than earlier versions for this
file, apparently because of optimization passes that generally provide
a substantial improvement in object code quality but understandably fail
to find any shortcuts in the AES algorithm.
Possible workarounds include
a) disabling -ftree-pre and -ftree-sra optimizations, this was an earlier
patch I tried, which reliably fixed the stack usage, but caused a
serious performance regression in some versions, as later testing
found.
b) disabling UBSAN on this file or all ciphers, as suggested by Ard
Biesheuvel. This would lead to massively better crypto performance in
UBSAN-enabled kernels and avoid the stack usage, but there is a concern
over whether we should exclude arbitrary files from UBSAN at all.
c) Forcing the optimization level in a different way. Similar to a),
but rather than deselecting specific optimization stages,
this now uses "gcc -Os" for this file, regardless of the
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE/SIZE option. This is a reliable
workaround for the stack consumption on all architecture, and I've
retested the performance results now on x86, cycles/byte (lower is
better) for cbc(aes-generic) with 256 bit keys:
-O2 -Os
gcc-6.3.1 14.9 15.1
gcc-7.0.1 14.7 15.3
gcc-7.1.1 15.3 14.7
gcc-7.2.1 16.8 15.9
gcc-8.0.0 15.5 15.6
This implements the option c) by enabling forcing -Os on all compiler
versions starting with gcc-7.1. As a workaround for PR83356, it would
only be needed for gcc-7.2+ with UBSAN enabled, but since it also shows
better performance on gcc-7.1 without UBSAN, it seems appropriate to
use the faster version here as well.
Side note: during testing, I also played with the AES code in libressl,
which had a similar performance regression from gcc-6 to gcc-7.2,
but was three times slower overall. It might be interesting to
investigate that further and possibly port the Linux implementation
into that.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
Cc: Richard Biener <rguenther@suse.de>
Cc: Jakub Jelinek <jakub@gcc.gnu.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>