linux/arch/x86/lib
Linus Torvalds c228d294f2 x86: explicitly align IO accesses in memcpy_{to,from}io
In commit 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
I made our copy from IO space use a separate copy routine rather than
rely on the generic memcpy.  I did that because our generic memory copy
isn't actually well-defined when it comes to internal access ordering or
alignment, and will in fact depend on various CPUID flags.

In particular, the default memcpy() for a modern Intel CPU will
generally be just a "rep movsb", which works reasonably well for
medium-sized memory copies of regular RAM, since the CPU will turn it
into fairly optimized microcode.

However, for non-cached memory and IO, "rep movs" ends up being
horrendously slow and will just do the architectural "one byte at a
time" accesses implied by the movsb.

At the other end of the spectrum, if you _don't_ end up using the "rep
movsb" code, you'd likely fall back to the software copy, which does
overlapping accesses for the tail, and may copy things backwards.
Again, for regular memory that's fine, for IO memory not so much.

The thinking was that clearly nobody really cared (because things
worked), but some people had seen horrible performance due to the byte
accesses, so let's just revert back to our long ago version that dod
"rep movsl" for the bulk of the copy, and then fixed up the potentially
last few bytes of the tail with "movsw/b".

Interestingly (and perhaps not entirely surprisingly), while that was
our original memory copy implementation, and had been used before for
IO, in the meantime many new users of memcpy_*io() had come about.  And
while the access patterns for the memory copy weren't well-defined (so
arguably _any_ access pattern should work), in practice the "rep movsb"
case had been very common for the last several years.

In particular Jarkko Sakkinen reported that the memcpy_*io() change
resuled in weird errors from his Geminilake NUC TPM module.

And it turns out that the TPM TCG accesses according to spec require
that the accesses be

 (a) done strictly sequentially

 (b) be naturally aligned

otherwise the TPM chip will abort the PCI transaction.

And, in fact, the tpm_crb.c driver did this:

	memcpy_fromio(buf, priv->rsp, 6);
	...
	memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

which really should never have worked in the first place, but back
before commit 170d13ca3a it *happened* to work, because the
memcpy_fromio() would be expanded to a regular memcpy, and

 (a) gcc would expand the first memcpy in-line, and turn it into a
     4-byte and a 2-byte read, and they happened to be in the right
     order, and the alignment was right.

 (b) gcc would call "memcpy()" for the second one, and the machines that
     had this TPM chip also apparently ended up always having ERMS
     ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
     movbs" for that copy.

In other words, basically by pure luck, the code happened to use the
right access sizes in the (two different!) memcpy() implementations to
make it all work.

But after commit 170d13ca3a, both of the memcpy_fromio() calls
resulted in a call to the routine with the consistent memory accesses,
and in both cases it started out transferring with 4-byte accesses.
Which worked for the first copy, but resulted in the second copy doing a
32-bit read at an address that was only 2-byte aligned.

Jarkko is actually fixing the fragile code in the TPM driver, but since
this is an excellent example of why we absolutely must not use a generic
memcpy for IO accesses, _and_ an IO-specific one really should strive to
align the IO accesses, let's do exactly that.

Side note: Jarkko also noted that the driver had been used on ARM
platforms, and had worked.  That was because on 32-bit ARM, memcpy_*io()
ends up always doing byte accesses, and on 64-bit ARM it first does byte
accesses to align to 8-byte boundaries, and then does 8-byte accesses
for the bulk.

So ARM actually worked by design, and the x86 case worked by pure luck.

We *might* want to make x86-64 do the 8-byte case too.  That should be a
pretty straightforward extension, but let's do one thing at a time.  And
generally MMIO accesses aren't really all that performance-critical, as
shown by the fact that for a long time we just did them a byte at a
time, and very few people ever noticed.

Reported-and-tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: David Laight <David.Laight@aculab.com>
Fixes: 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 09:07:48 -08:00
..
.gitignore x86: Gitignore: arch/x86/lib/inat-tables.c 2009-11-04 13:11:28 +01:00
atomic64_32.c x86: Adjust asm constraints in atomic64 wrappers 2012-01-20 17:29:31 -08:00
atomic64_386_32.S x86/debug: Remove perpetually broken, unmaintainable dwarf annotations 2015-06-02 07:57:48 +02:00
atomic64_cx8_32.S x86/debug: Remove perpetually broken, unmaintainable dwarf annotations 2015-06-02 07:57:48 +02:00
cache-smp.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
checksum_32.S x86/extable: Introduce _ASM_EXTABLE_UA for uaccess fixups 2018-09-03 15:12:09 +02:00
clear_page_64.S x86/asm: Trim clear_page.S includes 2018-02-13 17:37:07 +01:00
cmdline.c x86/boot: Add early cmdline parsing for options with arguments 2017-07-18 11:38:06 +02:00
cmpxchg8b_emu.S x86: move exports to actual definitions 2016-08-07 23:47:15 -04:00
cmpxchg16b_emu.S x86/debug: Remove perpetually broken, unmaintainable dwarf annotations 2015-06-02 07:57:48 +02:00
copy_page_64.S License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
copy_user_64.S x86/extable: Introduce _ASM_EXTABLE_UA for uaccess fixups 2018-09-03 15:12:09 +02:00
cpu.c x86/cpu: Rename cpu_data.x86_mask to cpu_data.x86_stepping 2018-02-15 01:15:52 +01:00
csum-copy_64.S x86/extable: Introduce _ASM_EXTABLE_UA for uaccess fixups 2018-09-03 15:12:09 +02:00
csum-partial_64.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
csum-wrappers_64.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
delay.c Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-01-30 13:01:09 -08:00
error-inject.c x86/error_inject: Make just_return_func() globally visible 2018-02-13 14:33:35 +01:00
getuser.S x86/extable: Introduce _ASM_EXTABLE_UA for uaccess fixups 2018-09-03 15:12:09 +02:00
hweight.S License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
inat.c x86: Fix to decode grouped AVX with VEX pp bits 2012-02-11 15:11:35 +01:00
insn-eval.c x86/umip: Fix insn_get_code_seg_params()'s return value 2017-11-23 20:17:59 +01:00
insn.c x86/insn: Add AVX-512 support to the instruction decoder 2016-07-21 09:37:11 -03:00
iomap_copy_64.S x86/debug: Remove perpetually broken, unmaintainable dwarf annotations 2015-06-02 07:57:48 +02:00
iomem.c x86: explicitly align IO accesses in memcpy_{to,from}io 2019-02-01 09:07:48 -08:00
kaslr.c x86/kaslr: Fix incorrect i8254 outb() parameters 2019-01-11 21:35:47 +01:00
Makefile kbuild: remove redundant target cleaning on failure 2019-01-06 09:46:51 +09:00
memcpy_32.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
memcpy_64.S x86/asm/64: Use 32-bit XOR to zero registers 2018-07-03 09:59:29 +02:00
memmove_64.S License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
memset_64.S License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
misc.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mmx_32.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
msr-reg-export.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
msr-reg.S License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
msr-smp.c x86/msr: Make rdmsrl_safe_on_cpu() scheduling safe as well 2018-03-28 10:34:13 +02:00
msr.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
putuser.S x86/extable: Introduce _ASM_EXTABLE_UA for uaccess fixups 2018-09-03 15:12:09 +02:00
retpoline.S Revert "x86/retpoline: Simplify vmexit_fill_RSB()" 2018-02-20 09:38:26 +01:00
rwsem.S locking/arch, x86: Add __down_read_killable() 2017-10-10 11:50:15 +02:00
string_32.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
strstr_32.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
usercopy_32.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
usercopy_64.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
usercopy.c x86/nmi: Fix NMI uaccess race against CR3 switching 2018-08-31 17:08:22 +02:00
x86-opcode-map.txt x86/decoder: Fix and update the opcodes map 2017-12-15 13:45:20 +01:00