Commit Graph

545 Commits

Author SHA1 Message Date
Linus Torvalds
221bb8a46e - ARM: GICv3 ITS emulation and various fixes. Removal of the old
VGIC implementation.
 
 - s390: support for trapping software breakpoints, nested virtualization
 (vSIE), the STHYI opcode, initial extensions for CPU model support.
 
 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
 preliminary to this and the upcoming support for hardware virtualization
 extensions.
 
 - x86: support for execute-only mappings in nested EPT; reduced vmexit
 latency for TSC deadline timer (by about 30%) on Intel hosts; support for
 more than 255 vCPUs.
 
 - PPC: bugfixes.
 
 The ugly bit is the conflicts.  A couple of them are simple conflicts due
 to 4.7 fixes, but most of them are with other trees. There was definitely
 too much reliance on Acked-by here.  Some conflicts are for KVM patches
 where _I_ gave my Acked-by, but the worst are for this pull request's
 patches that touch files outside arch/*/kvm.  KVM submaintainers should
 probably learn to synchronize better with arch maintainers, with the
 latter providing topic branches whenever possible instead of Acked-by.
 This is what we do with arch/x86.  And I should learn to refuse pull
 requests when linux-next sends scary signals, even if that means that
 submaintainers have to rebase their branches.
 
 Anyhow, here's the list:
 
 - arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
 by the nvdimm tree.  This tree adds handle_preemption_timer and
 EXIT_REASON_PREEMPTION_TIMER at the same place.  In general all mentions
 of pcommit have to go.
 
 There is also a conflict between a stable fix and this patch, where the
 stable fix removed the vmx_create_pml_buffer function and its call.
 
 - virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
 This tree adds kvm_io_bus_get_dev at the same place.
 
 - virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
 file was completely removed for 4.8.
 
 - include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
 this is a change that should have gone in through the irqchip tree and
 pulled by kvm-arm.  I think I would have rejected this kvm-arm pull
 request.  The KVM version is the right one, except that it lacks
 GITS_BASER_PAGES_SHIFT.
 
 - arch/powerpc: what a mess.  For the idle_book3s.S conflict, the KVM
 tree is the right one; everything else is trivial.  In this case I am
 not quite sure what went wrong.  The commit that is causing the mess
 (fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
 path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
 and arch/powerpc/kvm/.  It's large, but at 396 insertions/5 deletions
 I guessed that it wasn't really possible to split it and that the 5
 deletions wouldn't conflict.  That wasn't the case.
 
 - arch/s390: also messy.  First is hypfs_diag.c where the KVM tree
 moved some code and the s390 tree patched it.  You have to reapply the
 relevant part of commits 6c22c98637, plus all of e030c1125e, to
 arch/s390/kernel/diag.c.  Or pick the linux-next conflict
 resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
 Second, there is a conflict in gmap.c between a stable fix and 4.8.
 The KVM version here is the correct one.
 
 I have pushed my resolution at refs/heads/merge-20160802 (commit
 3d1f53419842) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
 SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
 U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
 x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
 qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
 69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
 =42ti
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:

 - ARM: GICv3 ITS emulation and various fixes.  Removal of the
   old VGIC implementation.

 - s390: support for trapping software breakpoints, nested
   virtualization (vSIE), the STHYI opcode, initial extensions
   for CPU model support.

 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
   of cleanups, preliminary to this and the upcoming support for
   hardware virtualization extensions.

 - x86: support for execute-only mappings in nested EPT; reduced
   vmexit latency for TSC deadline timer (by about 30%) on Intel
   hosts; support for more than 255 vCPUs.

 - PPC: bugfixes.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
  KVM: PPC: Introduce KVM_CAP_PPC_HTM
  MIPS: Select HAVE_KVM for MIPS64_R{2,6}
  MIPS: KVM: Reset CP0_PageMask during host TLB flush
  MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
  MIPS: KVM: Sign extend MFC0/RDHWR results
  MIPS: KVM: Fix 64-bit big endian dynamic translation
  MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
  MIPS: KVM: Use 64-bit CP0_EBase when appropriate
  MIPS: KVM: Set CP0_Status.KX on MIPS64
  MIPS: KVM: Make entry code MIPS64 friendly
  MIPS: KVM: Use kmap instead of CKSEG0ADDR()
  MIPS: KVM: Use virt_to_phys() to get commpage PFN
  MIPS: Fix definition of KSEGX() for 64-bit
  KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
  kvm: x86: nVMX: maintain internal copy of current VMCS
  KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
  KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
  KVM: arm64: vgic-its: Simplify MAPI error handling
  KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
  KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
  ...
2016-08-02 16:11:27 -04:00
Gerald Schaefer
bc29b7ac1d s390/mm: clean up pte/pmd encoding
The hugetlbfs pte<->pmd conversion functions currently assume that the pmd
bit layout is consistent with the pte layout, which is not really true.

The SW read and write bits are encoded as the sequence "wr" in a pte, but
in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence
is identical in both cases, which results in swapped read and write bits
in the pmd. In practice this is not a problem, because those pmd bits are
only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code
works on (fake) ptes, and the converted pte bits are correct.

There is another variation in pte/pmd encoding which affects dirty
prot-none ptes/pmds. In this case, a pmd has both its HW read-only and
invalid bit set, while it is only the invalid bit for a pte. This also has
no effect in practice, but it should better be consistent.

This patch fixes both inconsistencies by changing the SW read/write bit
layout for pmds as well as the PAGE_NONE encoding for ptes. It also makes
the hugetlbfs conversion functions more robust by introducing a
move_set_bit() macro that uses the pte/pmd bit #defines instead of
constant shifts.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-31 05:27:57 -04:00
Linus Torvalds
0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Kirill A. Shutemov
dcddffd41d mm: do not pass mm_struct into handle_mm_fault
We always have vma->vm_mm around.

Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds
015cd867e5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "There are a couple of new things for s390 with this merge request:

   - a new scheduling domain "drawer" is added to reflect the unusual
     topology found on z13 machines.  Performance tests showed up to 8
     percent gain with the additional domain.

   - the new crc-32 checksum crypto module uses the vector-galois-field
     multiply and sum SIMD instruction to speed up crc-32 and crc-32c.

   - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
     the generic vmlinux.lds linker script definitions.

   - kcov instrumentation support.  A prerequisite for that is the
     inline assembly basic block cleanup, which is the reason for the
     net/iucv/iucv.c change.

   - support for 2GB pages is added to the hugetlbfs backend.

  Then there are two removals:

   - the oprofile hardware sampling support is dead code and is removed.
     The oprofile user space uses the perf interface nowadays.

   - the ETR clock synchronization is removed, this has been superseeded
     be the STP clock synchronization.  And it always has been
     "interesting" code..

  And the usual bug fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
  s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
  s390/smp: clean up a condition
  s390/cio/chp : Remove deprecated create_singlethread_workqueue
  s390/chsc: improve channel path descriptor determination
  s390/chsc: sanitize fmt check for chp_desc determination
  s390/cio: make fmt1 channel path descriptor optional
  s390/chsc: fix ioctl CHSC_INFO_CU command
  s390/cio/device_ops: fix kernel doc
  s390/cio: allow to reset channel measurement block
  s390/console: Make preferred console handling more consistent
  s390/mm: fix gmap tlb flush issues
  s390/mm: add support for 2GB hugepages
  s390: have unique symbol for __switch_to address
  s390/cpuinfo: show maximum thread id
  s390/ptrace: clarify bits in the per_struct
  s390: stack address vs thread_info
  s390: remove pointless load within __switch_to
  s390: enable kcov support
  s390/cpumf: use basic block for ecctr inline assembly
  s390/hypfs: use basic block for diag inline assembly
  ...
2016-07-26 12:22:51 -07:00
David Hildenbrand
f045402984 s390/mm: fix gmap tlb flush issues
__tlb_flush_asce() should never be used if multiple asce belong to a mm.

As this function changes mm logic determining if local or global tlb
flushes will be neded, we might end up flushing only the gmap asce on all
CPUs and a follow up mm asce flushes will only flush on the local CPU,
although that asce ran on multiple CPUs.

The missing tlb flushes will provoke strange faults in user space and even
low address protections in user space, crashing the kernel.

Fixes: 1b948d6cae ("s390/mm,tlb: optimize TLB flushing for zEC12")
Cc: stable@vger.kernel.org # 3.15+
Reported-by: Sascha Silbe <silbe@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-13 10:58:01 +02:00
Gerald Schaefer
d08de8e2d8 s390/mm: add support for 2GB hugepages
This adds support for 2GB hugetlbfs pages on s390.

Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-06 08:46:43 +02:00
Heiko Carstens
931641c639 s390/mm: use basic block for essa inline assembly
Use only simple inline assemblies which consist of a single basic
block if the register asm construct is being used.

Otherwise gcc would generate broken code if the compiler option
--sanitize-coverage=trace-pc would be used.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-28 09:32:29 +02:00
Michal Hocko
10d58bf297 s390: get rid of superfluous __GFP_REPEAT
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.

page_table_alloc then uses the flag for a single page allocation.  This
means that this flag has never been actually useful here because it has
always been used only for PAGE_ALLOC_COSTLY requests.

Link: http://lkml.kernel.org/r/1464599699-30131-14-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
David Hildenbrand
37d9df98b7 KVM: s390: backup the currently enabled gmap when scheduled out
Nested virtualization will have to enable own gmaps. Current code
would enable the wrong gmap whenever scheduled out and back in,
therefore resulting in the wrong gmap being enabled.

This patch reenables the last enabled gmap, therefore avoiding having to
touch vcpu->arch.gmap when enabling a different gmap.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:24 +02:00
David Hildenbrand
01f719176f s390/mm: don't fault everything in read-write in gmap_pte_op_fixup()
Let's not fault in everything in read-write but limit it to read-only
where possible.

When restricting access rights, we already have the required protection
level in our hands. When reading from guest 2 storage (gmap_read_table),
it is obviously PROT_READ. When shadowing a pte, the required protection
level is given via the guest 2 provided pte.

Based on an initial patch by Martin Schwidefsky.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:20 +02:00
David Hildenbrand
5b6c963bce s390/mm: allow to check if a gmap shadow is valid
It will be very helpful to have a mechanism to check without any locks
if a given gmap shadow is still valid and matches the given properties.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:16 +02:00
David Hildenbrand
4a49443924 s390/mm: remember the int code for the last gmap fault
For nested virtualization, we want to know if we are handling a protection
exception, because these can directly be forwarded to the guest without
additional checks.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:08 +02:00
David Hildenbrand
717c05554a s390/mm: limit number of real-space gmap shadows
We have no known user of real-space designation and only support it to
be architecture compliant.

Gmap shadows with real-space designation are never unshadowed
automatically, as there is nothing to protect for the top level table.

So let's simply limit the number of such shadows to one by removing
existing ones on creation of another one.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:07 +02:00
David Hildenbrand
3218f7094b s390/mm: support real-space for gmap shadows
We can easily support real-space designation just like EDAT1 and EDAT2.
So guest2 can provide for guest3 an asce with the real-space control being
set.

We simply have to allocate the biggest page table possible and fake all
levels.

There is no protection to consider. If we exceed guest memory, vsie code
will inject an addressing exception (via program intercept). In the future,
we could limit the fake table level to the gmap page table.

As the top level page table can never go away, such gmap shadows will never
get unshadowed, we'll have to come up with another way to limit the number
of kept gmap shadows.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:55:02 +02:00
David Hildenbrand
18b8980988 s390/mm: support EDAT2 for gmap shadows
If the guest is enabled for EDAT2, we can easily create shadows for
guest2 -> guest3 provided tables that make use of EDAT2.

If guest2 references a 2GB page, this memory looks consecutive for guest2,
but it does not have to be so for us. Therefore we have to create fake
segment and page tables.

This works just like EDAT1 support, so page tables are removed when the
parent table (r3t table entry) is changed.

We don't hve to care about:
- ACCF-Validity Control in RTTE
- Access-Control Bits in RTTE
- Fetch-Protection Bit in RTTE
- Common-Region Bit in RTTE

Just like for EDAT1, all bits might be dropped and there is no guaranteed
that they are active.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:56 +02:00
David Hildenbrand
fd8d4e3ab6 s390/mm: support EDAT1 for gmap shadows
If the guest is enabled for EDAT1, we can easily create shadows for
guest2 -> guest3 provided tables that make use of EDAT1.

If guest2 references a 1MB page, this memory looks consecutive for guest2,
but it might not be so for us. Therefore we have to create fake page tables.

We can easily add that to our existing infrastructure. The invalidation
mechanism will make sure that fake page tables are removed when the parent
table (sgt table entry) is changed.

As EDAT1 also introduced protection on all page table levels, we have to
also shadow these correctly.

We don't have to care about:
- ACCF-Validity Control in STE
- Access-Control Bits in STE
- Fetch-Protection Bit in STE
- Common-Segment Bit in STE

As all bits might be dropped and there is no guaranteed that they are
active ("unpredictable whether the CPU uses these bits", "may be used").
Without using EDAT1 in the shadow ourselfes (STE-format control == 0),
simply shadowing these bits would not be enough. They would be ignored.

Please note that we are using the "fake" flag to make this look consistent
with further changes (EDAT2, real-space designation support) and don't let
the shadow functions handle fc=1 stes.

In the future, with huge pages in the host, gmap_shadow_pgt() could simply
try to map a huge host page if "fake" is set to one and indicate via return
value that no lower fake tables / shadow ptes are required.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:51 +02:00
David Hildenbrand
5b062bd494 s390/mm: prepare for EDAT1/EDAT2 support in gmap shadow
In preparation for EDAT1/EDAT2 support for gmap shadows, we have to store
the requested edat level in the gmap shadow.

The edat level used during shadow translation is a property of the gmap
shadow. Depending on that level, the gmap shadow will look differently for
the same guest tables. We have to store it internally in order to support
it later.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:47 +02:00
David Hildenbrand
0f7f848915 s390/mm: fix races on gmap_shadow creation
Before any thread is allowed to use a gmap_shadow, it has to be fully
initialized. However, for invalidation to work properly, we have to
register the new gmap_shadow before we protect the parent gmap table.

Because locking is tricky, and we have to avoid duplicate gmaps, let's
introduce an initialized field, that signalizes other threads if that
gmap_shadow can already be used or if they have to retry.

Let's properly return errors using ERR_PTR() instead of simply returning
NULL, so a caller can properly react on the error.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:28 +02:00
David Hildenbrand
998f637cc4 s390/mm: avoid races on region/segment/page table shadowing
We have to unlock sg->guest_table_lock in order to call
gmap_protect_rmap(). If we sleep just before that call, another VCPU
might pick up that shadowed page table (while it is not protected yet)
and use it.

In order to avoid these races, we have to introduce a third state -
"origin set but still invalid" for an entry. This way, we can avoid
another thread already using the entry before the table is fully protected.
As soon as everything is set up, we can clear the invalid bit - if we
had no race with the unshadowing code.

Suggested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:27 +02:00
David Hildenbrand
a9d23e71d7 s390/mm: shadow pages with real guest requested protection
We really want to avoid manually handling protection for nested
virtualization. By shadowing pages with the protection the guest asked us
for, the SIE can handle most protection-related actions for us (e.g.
special handling for MVPG) and we can directly forward protection
exceptions to the guest.

PTEs will now always be shadowed with the correct _PAGE_PROTECT flag.
Unshadowing will take care of any guest changes to the parent PTE and
any host changes to the host PTE. If the host PTE doesn't have the
fitting access rights or is not available, we have to fix it up.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:19 +02:00
David Hildenbrand
eea3678d43 s390/mm: flush tlb of shadows in all situations
For now, the tlb of shadow gmap is only flushed when the parent is removed,
not when it is removed upfront. Therefore other shadow gmaps can reuse the
tables without the tlb getting flushed.

Fix this by simply flushing the tlb
1. Before the shadow tables are removed (analogouos to other unshadow functions)
2. When the gmap is freed and therefore the top level pages are freed.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:18 +02:00
Martin Schwidefsky
4be130a084 s390/mm: add shadow gmap support
For a nested KVM guest the outer KVM host needs to create shadow
page tables for the nested guest. This patch adds the basic support
to the guest address space (gmap) code.

For each guest address space the inner KVM host creates, the first
outer KVM host needs to create shadow page tables. The address space
is identified by the ASCE loaded into the control register 1 at the
time the inner SIE instruction for the second nested KVM guest is
executed. The outer KVM host creates the shadow tables starting with
the table identified by the ASCE on a on-demand basis. The outer KVM
host will get repeated faults for all the shadow tables needed to
run the second KVM guest.

While a shadow page table for the second KVM guest is active the access
to the origin region, segment and page tables needs to be restricted
for the first KVM guest. For region and segment and page tables the first
KVM guest may read the memory, but write attempt has to lead to an
unshadow.  This is done using the page invalid and read-only bits in the
page table of the first KVM guest. If the first guest re-accesses one of
the origin pages of a shadow, it gets a fault and the affected parts of
the shadow page table hierarchy needs to be removed again.

PGSTE tables don't have to be shadowed, as all interpretation assist can't
deal with the invalid bits in the shadow pte being set differently than
the original ones provided by the first KVM guest.

Many bug fixes and improvements by David Hildenbrand.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:54:04 +02:00
Martin Schwidefsky
6ea427bbbd s390/mm: add reference counter to gmap structure
Let's use a reference counter mechanism to control the lifetime of
gmap structures. This will be needed for further changes related to
gmap shadows.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:53:59 +02:00
Martin Schwidefsky
b2d73b2a0a s390/mm: extended gmap pte notifier
The current gmap pte notifier forces a pte into to a read-write state.
If the pte is invalidated the gmap notifier is called to inform KVM
that the mapping will go away.

Extend this approach to allow read-write, read-only and no-access
as possible target states and call the pte notifier for any change
to the pte.

This mechanism is used to temporarily set specific access rights for
a pte without doing the heavy work of a true mprotect call.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:46:49 +02:00
Martin Schwidefsky
8ecb1a59d6 s390/mm: use RCU for gmap notifier list and the per-mm gmap list
The gmap notifier list and the gmap list in the mm_struct change rarely.
Use RCU to optimize the reader of these lists.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:46:49 +02:00
Martin Schwidefsky
414d3b0749 s390/kvm: page table invalidation notifier
Pass an address range to the page table invalidation notifier
for KVM. This allows to notify changes that affect a larger
virtual memory area, e.g. for 1MB pages.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-20 09:46:48 +02:00
Heiko Carstens
de3fa841e4 s390/mm: fix compile for PAGE_DEFAULT_KEY != 0
The usual problem for code that is ifdef'ed out is that it doesn't
compile after a while. That's also the case for the storage key
initialisation code, if it would be used (set PAGE_DEFAULT_KEY to
something not zero):

./arch/s390/include/asm/page.h: In function 'storage_key_init_range':
./arch/s390/include/asm/page.h:36:2: error: implicit declaration of function '__storage_key_init_range'

Since the code itself has been useful for debugging purposes several
times, remove the ifdefs and make sure the code gets compiler
coverage. The cost for this is eight bytes.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-14 16:54:05 +02:00
Heiko Carstens
6c22c98637 s390: avoid extable collisions
We have some inline assemblies where the extable entry points to a
label at the end of an inline assembly which is not followed by an
instruction.

On the other hand we have also inline assemblies where the extable
entry points to the first instruction of an inline assembly.

If a first type inline asm (extable point to empty label at the end)
would be directly followed by a second type inline asm (extable points
to first instruction) then we would have two different extable entries
that point to the same instruction but would have a different target
address.

This can lead to quite random behaviour, depending on sorting order.

I verified that we currently do not have such collisions within the
kernel. However to avoid such subtle bugs add a couple of nop
instructions to those inline assemblies which contain an extable that
points to an empty label.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:26 +02:00
Heiko Carstens
d07a980c1b s390: add proper __ro_after_init support
On s390 __ro_after_init is currently mapped to __read_mostly which
means that data marked as __ro_after_init will not be protected.

Reason for this is that the common code __ro_after_init implementation
is x86 centric: the ro_after_init data section was added to rodata,
since x86 enables write protection to kernel text and rodata very
late. On s390 we have write protection for these sections enabled with
the initial page tables. So adding the ro_after_init data section to
rodata does not work on s390.

In order to make __ro_after_init work properly on s390 move the
ro_after_init data, right behind rodata. Unlike the rodata section it
will be marked read-only later after all init calls happened.

This s390 specific implementation adds new __start_ro_after_init and
__end_ro_after_init labels. Everything in between will be marked
read-only after the init calls happened. In addition to the
__ro_after_init data move also the exception table there, since from a
practical point of view it fits the __ro_after_init requirements.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:23 +02:00
Martin Schwidefsky
64f31d5802 s390/mm: simplify the TLB flushing code
ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
decide between a lazy TLB flush vs an immediate TLB flush. The field
contains two 16-bit counters, the number of CPUs that have the mm
attached and can create TLB entries for it and the number of CPUs in
the middle of a page table update.

The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
use the attach counter and a mask check with mm_cpumask(mm) to decide
between a local flush local of the current CPU and a global flush.

For all these functions the decision between lazy vs immediate and
local vs global TLB flush can be based on CPU masks. There are two
masks:  the mm->context.cpu_attach_mask with the CPUs that are actively
using the mm, and the mm_cpumask(mm) with the CPUs that have used the
mm since the last full flush. The decision between lazy vs immediate
flush is based on the mm->context.cpu_attach_mask, to decide between
local vs global flush the mm_cpumask(mm) is used.

With this patch all checks will use the CPU masks, the old counter
mm->context.attach_count with its two 16-bit values is turned into a
single counter mm->context.flush_count that keeps track of the number
of CPUs with incomplete page table updates. The sole user of this
counter is finish_arch_post_lock_switch() which waits for the end of
all page table updates.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:22 +02:00
Martin Schwidefsky
a9809407f6 s390/mm: fix vunmap vs finish_arch_post_lock_switch
The vunmap_pte_range() function calls ptep_get_and_clear() without any
locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
for the page table update. ptep_flush_direct requires that preemption
is disabled, but without any locking this is not the case. If the kernel
preempts the task while the attach_counter is increased an endless loop
in finish_arch_post_lock_switch() will occur the next time the task is
scheduled.

Add explicit preempt_disable()/preempt_enable() calls to the relevant
functions in arch/s390/mm/pgtable.c.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:21 +02:00
Heiko Carstens
0ccb32c983 s390/mm: align swapper_pg_dir to 16k
The segment/region table that is part of the kernel image must be
properly aligned to 16k in order to make the crdte inline assembly
work.
Otherwise it will calculate a wrong segment/region table start address
and access incorrect memory locations if the swapper_pg_dir is not
aligned to 16k.

Therefore define BSS_FIRST_SECTIONS in order to put the swapper_pg_dir
at the beginning of the bss section and also align the bss section to
16k just like other architectures did.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:18 +02:00
Heiko Carstens
37cd944c8d s390/pgtable: add mapping statistics
Add statistics that show how memory is mapped within the kernel
identity mapping. This is more or less the same like git
commit ce0c0e50f9 ("x86, generic: CPA add statistics about state
of direct mapping v4") for x86.

I also intentionally copied the lower case "k" within DirectMap4k vs
the upper case "M" and "G" within the two other lines. Let's have
consistent inconsistencies across architectures.

The output of /proc/meminfo now contains these additional lines:

DirectMap4k:        2048 kB
DirectMap1M:     3991552 kB
DirectMap2G:     4194304 kB

The implementation on s390 is lockless unlike the x86 version, since I
assume changes to the kernel mapping are a very rare event. Therefore
it really doesn't matter if these statistics could potentially be
inconsistent if read while kernel pages tables are being changed.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:16 +02:00
Heiko Carstens
bab247ff5f s390/vmem: simplify vmem code for read-only mappings
For the kernel identity mapping map everything read-writeable and
subsequently call set_memory_ro() to make the ro section read-only.
This simplifies the code a lot.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:16 +02:00
Heiko Carstens
e8a97e42dc s390/pageattr: allow kernel page table splitting
set_memory_ro() and set_memory_rw() currently only work on 4k
mappings, which is good enough for module code aka the vmalloc area.

However we stumbled already twice into the need to make this also work
on larger mappings:
- the ro after init patch set
- the crash kernel resize code

Therefore this patch implements automatic kernel page table splitting
if e.g. set_memory_ro() would be called on parts of a 2G mapping.
This works quite the same as the x86 code, but is much simpler.

In order to make this work and to be architecturally compliant we now
always use the csp, cspg or crdte instructions to replace valid page
table entries. This means that set_memory_ro() and set_memory_rw()
will be much more expensive than before. In order to avoid huge
latencies the code contains a couple of cond_resched() calls.

The current code only splits page tables, but does not merge them if
it would be possible.  The reason for this is that currently there is
no real life scenarion where this would really happen. All current use
cases that I know of only change access rights once during the life
time. If that should change we can still implement kernel page table
merging at a later time.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:15 +02:00
Heiko Carstens
3e76ee99b0 s390/mm: always use PAGE_KERNEL when mapping pages
Always use PAGE_KERNEL when re-enabling pages within the kernel
mapping due to debug pagealloc. Without using this pgprot value
pte_mkwrite() and pte_wrprotect() won't work on such mappings after an
unmap -> map cycle anymore.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:14 +02:00
Heiko Carstens
5aa29975e8 s390/vmem: make use of pte_clear()
Use pte_clear() instead of open-coding it.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:14 +02:00
Heiko Carstens
c126aa83e2 s390/pgtable: get rid of _REGION3_ENTRY_RO
_REGION3_ENTRY_RO is a duplicate of _REGION_ENTRY_PROTECT.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:14 +02:00
Heiko Carstens
2dffdcbac9 s390/vmem: introduce and use SEGMENT_KERNEL and REGION3_KERNEL
Instead of open-coded SEGMENT_KERNEL and REGION3_KERNEL assignments use
defines.  Also to make e.g. pmd_wrprotect() work on the kernel mapping
a couple more flags must be set. Therefore add the missing flags also.

In order to make everything symmetrical this patch also adds software
dirty, young, read and write bits for region 3 table entries.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:13 +02:00
Heiko Carstens
2e9996fcf8 s390/vmem: align segment and region tables to 16k
Usually segment and region tables are 16k aligned due to the way the
buddy allocator works.  This is not true for the vmem code which only
asks for a 4k alignment. In order to be consistent enforce a 16k
alignment here as well.

This alignment will be assumed and therefore is required by the
pageattr code.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:13 +02:00
Christian Borntraeger
1c343f7b0e KVM: s390/mm: Fix CMMA reset during reboot
commit 1e133ab296 ("s390/mm: split arch/s390/mm/pgtable.c") factored
out the page table handling code from __gmap_zap and  __s390_reset_cmma
into ptep_zap_unused and added a simple flag that tells which one of the
function (reset or not) is to be made. This also changed the behaviour,
as it also zaps unused page table entries on reset.
Turns out that this is wrong as s390_reset_cmma uses the page walker,
which DOES NOT take the ptl lock.

The most simple fix is to not do the zapping part on reset (which uses
the walker)

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: 1e133ab296 ("s390/mm: split arch/s390/mm/pgtable.c")
Cc: stable@vger.kernel.org # 4.6+
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-06-13 15:58:09 +02:00
David Hildenbrand
a7e19ab55f KVM: s390: handle missing storage-key facility
Without the storage-key facility, SIE won't interpret SSKE, ISKE and
RRBE for us. So let's add proper interception handlers that will be called
if lazy sske cannot be enabled.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-10 12:07:31 +02:00
David Hildenbrand
1824c723ac KVM: s390: pfmf: support conditional-sske facility
We already indicate that facility but don't implement it in our pfmf
interception handler. Let's add a new storage key handling function for
conditionally setting the guest storage key.

As we will reuse this function later on, let's directly implement returning
the old key via parameter and indicating if any change happened via rc.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-10 12:07:30 +02:00
David Hildenbrand
154c8c19c3 s390/mm: return key via pointer in get_guest_storage_key
Let's just split returning the key and reporting errors. This makes calling
code easier and avoids bugs as happened already.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-10 12:07:28 +02:00
David Hildenbrand
8d6037a7b4 s390/mm: simplify get_guest_storage_key
We can safe a few LOC and make that function easier to understand
by rewriting existing code.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-10 12:07:27 +02:00
Martin Schwidefsky
d3ed1ceeac s390/mm: set and get guest storage key mmap locking
Move the mmap semaphore locking out of set_guest_storage_key
and get_guest_storage_key. This makes the two functions more
like the other ptep_xxx operations and allows to avoid repeated
semaphore operations if multiple keys are read or written.

Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-10 12:07:27 +02:00
David Hildenbrand
c427c42cd6 s390/mm: don't drop errors in get_guest_storage_key
Commit 1e133ab296 ("s390/mm: split arch/s390/mm/pgtable.c") changed
the return value of get_guest_storage_key to an unsigned char, resulting
in -EFAULT getting interpreted as a valid storage key.

Cc: stable@vger.kernel.org # 4.6+
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2016-06-10 12:07:26 +02:00
Michal Hocko
cf0d44d513 s390: fix info leak in do_sigsegv
Aleksa has reported incorrect si_errno value when stracing task which
received SIGSEGV:
[pid 20799] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_errno=2510266, si_addr=0x100000000000000}

The reason seems to be that do_sigsegv is not initializing siginfo
structure defined on the stack completely so it will leak 4B of
the previous stack content. Fix it simply by initializing si_errno
to 0 (same as do_sigbus does already).

Cc: stable # introduced pre-git times
Reported-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-05-23 16:45:25 +02:00
Linus Torvalds
f61a657fdf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "The s390 patches for the 4.7 merge window have the usual bug fixes and
  cleanups, and the following new features:

   - An interface for dasd driver to query if a volume is online to
     another operating system

   - A new ioctl for the dasd driver to verify the format for a range of
     tracks

   - Following the example of x86 the struct fpu is now allocated with
     the task_struct

   - The 'report_error' interface for the PCI bus to send an
     adapter-error notification from user space to the service element
     of the machine"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (29 commits)
  s390/vmem: remove unused function parameter
  s390/vmem: fix identity mapping
  s390: add missing include statements
  s390: add missing declarations
  s390: make couple of variables and functions static
  s390/cache: remove superfluous locking
  s390/cpuinfo: simplify locking and skip offline cpus early
  s390/3270: hangup the 3270 tty after a disconnect
  s390/3270: handle reconnect of a tty with a different size
  s390/3270: avoid endless I/O loop with disconnected 3270 terminals
  s390/3270: fix garbled output on 3270 tty view
  s390/3270: fix view reference counting
  s390/3270: add missing tty_kref_put
  s390/dumpstack: implement and use return_address()
  s390/cpum_sf: Remove superfluous SMP function call
  s390/cpum_cf: Remove superfluous SMP function call
  s390/Kconfig: make z196 the default processor type
  s390/sclp: avoid compile warning in sclp_pci_report
  s390/fpu: allocate 'struct fpu' with the task_struct
  s390/crypto: cleanup and move the header with the cpacf definitions
  ...
2016-05-18 12:17:16 -07:00