When running Active Memory Sharing, pages can get marked as
"loaned" with the hypervisor by the CMM driver. This state gets
cleared by the system firmware when rebooting the partition.
When using kexec to boot a new kernel, this state never gets
cleared and the hypervisor and CMM driver can get out of sync
with respect to the number of pages currently marked "loaned".
Fix this by adding a reboot notifier to the CMM driver to deflate
the balloon and mark all pages as active.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When running Active Memory Sharing, the Collaborative Memory Manager
(CMM) may mark some pages as "loaned" with the hypervisor.
Periodically, the CMM will query the hypervisor for a loan request,
which is a single signed value. When kexec'ing into a kdump kernel,
the CMM driver in the kdump kernel is not aware of the pages the
previous kernel had marked as "loaned", so the hypervisor and the CMM
driver are out of sync. This results in the CMM driver getting a
negative loan request, which can then get treated as a large unsigned
value and can cause kdump to hang due to the CMM driver inflating too
large. Since there really is no clean way for the CMM driver in the
kdump kernel to clean this up, simply disable CMM in the kdump kernel.
This fixes hangs we were seeing doing kdump with AMS.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
ibm_configure_kernel_dump is passed as the token to rtas_call() is
never initialised. This sets it to something sane.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Acked-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Manish Ahuja <mahujam@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
print_dump_header() will be called at least once with a NULL pointer in
a normal boot sequence. If DEBUG is defined then we will dereference
the pointer and crash. Add a quick fix to exit early in the NULL pointer
case.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Acked-by: Manish Ahuja <mahujam@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since "Factor out cpu joining/unjoining the GIQ"
(b4963255ad) the WARN_ON in
xics_set_cpu_giq() is being triggered during boot on JS20 because the
GIQ indicator is not available on that platform. While the warning is
harmless and the system runs normally, it's nicer to check for the
existence of the indicator before trying to manipulate it.
Implement rtas_indicator_present(), which searches the
/rtas/rtas-indicators property for the given indicator token, and use
this function in xics_set_cpu_giq().
Also use a WARN statement in xics_set_cpu_giq to get better
information on failure.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pseries PCI hotplug code has a number of issues, ranging from
incorrect resource setup to crashes, depending on what is added,
when, whether it contains a bridge, etc etc....
This fixes a whole bunch of these, while actually simplifying the code
a bit, using more generic code in the process and factoring out common
code between adding of a PHB, a slot or a device.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
To properly fix PCI hotplug, it's useful to be able to make the fixup
passes on all devices whether they were just hot plugged or already
there.
The EEH code however used to not be very friendly with calling
eeh_add_device_late() multiple time, and not very rebust in the way it
generally tests whether a device is in the expected state vs. the EEH
code.
This improves it, along with cleaning up a couple of debug printk's.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The 'ibm,interrupt-server#-size' properties are not in the cpu nodes,
which is where we currently look for them, but rather live under the
interrupt source controller nodes (which have "ibm,ppc-xics" in their
compatible property).
This moves the code that looks for the ibm,interrupt-server#-size
properties from xics_update_irq_servers() into xics_init_IRQ().
Also this adds a check for mismatched sizes across the interrupt
source controller nodes. Not sure this is necessary as in this case
the firmware might be seriously busted.
This property only appears on POWER6 boxes and is only used in the
set-indicator(gqirm) call, and apparently firmware currently ignores
the value we pass. Nevertheless we need to fix it in case future
firmware versions use it.
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This gets rid of this build warning:
arch/powerpc/platforms/pseries/pci_dlpar.c: In function 'init_phb_dynamic':
arch/powerpc/platforms/pseries/pci_dlpar.c:192: warning: unused variable 'b'
This is one of the very few warnings left in a ppc64_defconfig build and
getting rid of it will make it easier to see future introduced ones (in
fact this was introduced very recently).
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Resources for PHB's that are dynamically added to a system are not
properly allocated in the resource tree.
Not having these resources allocated causes an oops when removing
the PHB when we try to release them.
The diff appears a bit messy, this is mainly due to moving everything
one tab to the left in the pcibios_allocate_bus_resources routine.
The functionality change in this routine is only that the
list_for_each_entry() loop is pulled out and moved to the necessary
calling routine.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
linux/crash_dump.h defines is_kdump_kernel() to be used by code that
needs to know if the previous kernel crashed instead of a (clean) boot
or reboot.
This updates the just added powerpc code to use it. This is needed
for the next commit, which will remove __kdump_flag.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds relocatable kernel support for kdump. With this one can
use the same regular kernel to capture the kdump. A signature (0xfeed1234)
is passed in r6 from panic code to the next kernel through kexec_sequence
and purgatory code. The signature is used to differentiate between
kdump kernel and non-kdump kernels.
The purgatory code compares the signature and sets the __kdump_flag in
head_64.S. During the boot up, kernel code checks __kdump_flag and if it
is set, the kernel will behave as relocatable kdump kernel. This kernel
will boot at the address where it was loaded by kexec-tools ie. at the
address reserved through crashkernel boot parameter.
CONFIG_CRASH_DUMP depends on CONFIG_RELOCATABLE option to build kdump
kernel as relocatable. So the same kernel can be used as production and
kdump kernel.
This patch incorporates the changes suggested by Paul Mackerras to avoid
GOT use and to avoid two copies of the code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Mohan Kumar M <mohan@in.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We used to assume that even numbered threads were the primary
threads, ie those that would be listed and started as a cpu from
open firmware. Replace a left over is even (% 2) check with a check
for it being a primary thread and update the comments.
Tested with a debug print on pseries, identical code found for cell.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The pfn of the memory to be removed should be validated prior to
attempting to remove the memory. In cases where the probe of a
memory section fails during hotplug add, the pfn for the lmb may
not be valid.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A single full sync (mb()) is requrired to order the mmio to the qirr reg
with the set or clear of the message word. However, test_and_clear_bit
has the effect of smp_mb() and we are not doing any other io from here,
so we don't need a mb per bit processed.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Several printks were broken at word boundaries for line length. Some
even referred to old function names. Using __func__ and changing the
text slightly for the format allows these printk formats to fit on one
line.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It is physically per-cpu, and we want the irq layer to treat it that way.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
EOI normally has the side effect of returning the cpu to the base
priority to recieve the next interrupt. This is actually controlled
by the top byte of the xirr register. When we are exiting the
kernel in kexec we must eoi the ipi for the next kernel because we
never return from the handler, but we want to leave interrupt
delivery blocked until the next kernel takes action.
Since the hardware ipi vector is fixed, its easiest to just do the
eoi explicitly.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This factors out processors joining and unjoining the Global Interrupt
Queue into a separate function.
There is a bit of math to calculate the arguments to rtas to join
or leave the global interrupt queue, and a warning on failure
afterwards. Make a helper for the 3 callers.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We only need to check the ibm,interrupt-server#-size property once, not
once per global server and thread.
We can use !CONFIG_SMP cpu masks and hard_smp_processor_id() to avoid an ifdef.
Put the node when breaking out of the loop on lpar systems.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Trim unneeded includes from xics.c. We don't use signals or gfp
flags, we use only OF functions and don't need prom, and the 8259
is now handled by our caller.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The xirr is 32 bits in hardware, but the hypervisor requries the upper
bits of the register to be clear on the hcall. By changing the type
from signed to unsigned int we can drop masking it back to 32 bits.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now that xics_update_irq_servers is called only from init and hotplug
code, it becomes possible to clean up the ordering of functions in the
file, grouping them but the interfaces they implement.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
xics supports only one ipi per cpu, and expects software to use some
queue to know why the interrupt was sent. In Linux, we use a an array
of bitmaps indexed by cpu to identify the message. Currently the bits
are set in smp.c and decoded in xics.c, with the data structure in a
header file. Consolidate the code in xics.c similar to mpic and other
interrupt controllers.
Also, while making the the array static, the message word doesn't need
to be volatile as set_bit and test_clear_bit take care of it for us, and
put it under ifdef smp.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, every time we determine which irq server to use, we check if
default_server, which is the id of the bootcpu, is still online. But
default_server is a hardware cpu, not the logical cpu id needed to index
cpu_online_map.
Since the default server can only go offline during a cpu hotplug event,
explicitly check the default server and choose the new one when we move
irqs away from the cpu being offlined.
This has the added benefit of only needing the boot_cpuid to be updated
and not relying on the cpu being marked offline during migrate_irqs_away.
Also, since xics_update_irq_servers only reads device tree information, we
can call it before xics_init_host in xics_init_IRQ and then default_server
will always be valid when we can reach get_irq_server via the host ops.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When reciving an irq vector that does not have a linux mapping, the kernel
prints a message and calls RTAS to disable the irq source. Previously
the kernel did not EOI the interrupt, causing the source to think it is
still being processed by software. While this does add an additional
layer of protection against interrupt storms had RTAS failed to disable
the source, it also prevents the interrupt from working when a driver
later enables it. (We could alternatively send an EOI on startup, but
that strategy would likely fail on an emulated xics.)
All interrupts should be disabled when the kernel starts, but this can
be observed if a driver does not shutdown an interrupt in its reboot
hook before starting a new kernel with kexec.
Michael reports this can be reproduced trivially by banging the keyboard
while kexec'ing on a P5 LPAR: even though the hvc_console driver request's
the console irq later in boot, the console is non-functional because
we're receiving no console interrupts.
Reported-By: Michael Ellerman
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Testing hotplug memory remove has revealed that we can oops in
pseries_lmb_remove(). The incorrect shift causes a NULL pointer
dereference in the page_zone() inline routine.
I have only been able to reproduce the oops on kernels with large pages
enabled.
Tested on Power5 and Power6 with and without large pages enabled.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
rtas_log_read() doesn't check file flags for O_NONBLOCK and blocks
non-blocking readers of /proc/ppc64/rtas/error_log when there is
no data available. This fixes it.
Also rtas_log_read() returns now with ENODATA to prevent suspending of
process in wait_event_interruptible() when logging facility was
switched off and log is already empty.
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
irq_radix_revmap() currently serves 2 purposes, irq mapping lookup
and insertion which happen in interrupt and process context respectively.
Separate the function into its 2 components, one for lookup only and one
for insertion only.
Fix the only user of the revmap tree (XICS) to use the new functions.
Also, move the insertion into the radix tree of those irqs that were
requested before it was initialized at said tree initialization.
Mutual exclusion between the tree initialization and readers/writers is
handled via a state variable (revmap_trees_allocated) set to 1 when the tree
has been initialized and set to 2 after the already requested irqs have been
inserted in the tree by the init path. This state is checked before any reader
or writer access just like we used to check for tree.gfp_mask != 0 before.
Finally, now that we're not any longer inserting nodes into the radix-tree
in interrupt context, turn the GFP_ATOMIC allocations into GFP_KERNEL ones.
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The return code from invocation of the notifier for
pSeries_reconfig_chain during update of the device tree is not
checked. This causes writes to /proc/ppc64/ofdt to update memory
properties (i.e. ibm,dyamic-reconfiguration-memory) to always
return success, instead of the result of the notifier chain.
This happens specifically when we remove/add memory from the
device tree on machines using memory specified in the
ibm,dynamic-reconfiguration-memory property of the device tree.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch also includes the required removal of (unused) inclusion of
<asm/a.out.h> <linux/a.out.h>'s in the arch/ code for these
architectures.
[dwmw2: updated for 2.6.27-rc]
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This fixes an error building powerpc allmodconfig:
ERROR: "CMO_PageSize" [arch/powerpc/platforms/pseries/cmm.ko] undefined!
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently print_device_node_tree() isn't called but it can be useful for
debugging. Leave the function there but hide it behind '#if 0' to save
it being rewritten. If you want to call it you're already editing this
file anyway. ;P
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
__FUNCTION__ is gcc-specific, use __func__
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
If the firmware page size used for collaborative memory overcommit
is 4k, but the kernel is using 64k pages, the page loaning is currently
broken as it only marks the first 4k page of each 64k page as loaned.
This fixes this to iterate through each 4k page and mark them all as
loaned/active.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
During platform setup, save off the primary/secondary paging space
pool IDs and the page size. Added accessors in hvcall.h for these
variables. This is needed for a subsequent fix.
Submitted-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Noticed due to these wanings:
arch/powerpc/platforms/pseries/cmm.c:298: warning: initialization from incompatible pointer type
arch/powerpc/platforms/pseries/cmm.c:299: warning: initialization from incompatible pointer type
arch/powerpc/platforms/pseries/cmm.c:320: warning: initialization from incompatible pointer type
arch/powerpc/platforms/pseries/cmm.c:320: warning: initialization from incompatible pointer type
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To support Cooperative Memory Overcommitment (CMO), we need to check
for failure from some of the tce hcalls.
These changes for the pseries platform affect the powerpc architecture;
patches for the other affected platforms are included in this patch.
pSeries platform IOMMU code changes:
* platform TCE functions must handle H_NOT_ENOUGH_RESOURCES errors and
return an error.
Architecture IOMMU code changes:
* Calls to ppc_md.tce_build need to check return values and return
DMA_MAPPING_ERROR for transient errors.
Architecture changes:
* struct machdep_calls for tce_build*_pSeriesLP functions need to change
to indicate failure.
* all other platforms will need updates to iommu functions to match the new
calling semantics; they will return 0 on success. The other platforms
default configs have been built, but no further testing was performed.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Adds a collaborative memory manager, which acts as a simple balloon driver
for System p machines that support cooperative memory overcommitment
(CMO).
Adds a platform configuration option for CMO called PPC_SMLPAR.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Newer versions of firmware support page states, which are used by the
collaborative memory manager (future patch) to "loan" pages to the
hypervisor for use by other partitions.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For Cooperative Memory Overcommitment (CMO), set the FW_FEATURE_CMO
flag in powerpc_firmware_features from the rtas ibm,get-system-parameters
table prior to calling iommu_init_early_pSeries.
With this, any CMO specific functionality can be controlled by checking:
firmware_has_feature(FW_FEATURE_CMO)
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch changes the EEH_MAX_FAILS action from panic to printing an
error message. Panicking under under this condition is too harsh.
Although performance will be affected and the device may not recover,
the system is still running, which at the very least should allow for a
more graceful shutdown. The patch also removes the msleep() within a
spinlock, which can lead to a deadlock and is not recommended.
Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
Acked-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Update iommu_alloc() to take the struct dma_attrs and pass them on to
tce_build(). This change propagates down to the tce_build functions of
all the platforms.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Choosing PCI or not at config time is allowed on some
platforms via an if expression in arch/powerpc/Kconfig.
To add a new platform with PCI support selectable at
config time, you must change the if expression. This
patch makes this easier by changing:
bool "PCI support" if <long expression>
to
bool "PCI support" if PPC_PCI_CHOICE
and adding select PPC_PCI_CHOICE to all the config nodes that
were previously in the PCI if expression.
Platforms with unconditional PCI support continue to
just select PCI in their config nodes.
Signed-off-by: John Rigby <jrigby@freescale.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
It is okay for both _PAGE_GUARDED and _PAGE_COHERENT (G and M) to be set
in the same pte. In fact, even if that were not the case, there doesn't
seem to be any place where G is set without also setting I (_PAGE_NO_CACHE),
so the test for I is sufficient as a condition to clear _PAGE_COHERENT
when filling the hash table.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The following patch restores the PERR and SERR bits in the PCI
command register during an EEH device recovery. We have found
at least one case (an Agilent test card) where the PERR/SERR
bits are set to 1 by firmware at boot time, but are not restored
to 1 during EEH recovery. The patch fixes the Agilent card
problem. It has been tested on several other EEH-enabled cards
with no regressions.
Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
Acked-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The current low level hash code on LPAR configurations clears
_PAGE_COHERENT (M) when either _PAGE_GUARDED (G) or _PAGE_NO_CACHE (I)
is set. This conflicts with _PAGE_SAO which has M, I and W bits sets at
once (normally invalid combo) to indicate the new SAO attribute.
This changes the code to allow that case.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>