This adds the back-end for the PMU on the POWER5 processor. This knows
how to use the fixed-function PMC5 and PMC6 (instructions completed and
run cycles). Unlike POWER6, PMC5/6 obey the freeze conditions and can
generate interrupts, so their use doesn't impose any extra restrictions.
POWER5+ is different and is not supported by this patch.
Signed-off-by: Paul Mackerras <paulus@samba.org>
When we introduced VSX, we changed the way FPRs are stored in the
thread_struct. Unfortunately we missed the load/store float double
alignment handler code when updating how we access FPRs in the
thread_struct.
This fixes it and merges the little/big endian cases.
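As a rough sketch of the access pattern involved (assuming the TS_FPR()
accessor that hides the combined FPR/VSR layout), the handler needs
something like:

    /* With VSX, each FPR occupies half of a VSR slot, so taking the
     * address of thread.fpr[reg] directly no longer points at the
     * right doubleword; TS_FPR() encapsulates the layout: */
    unsigned char *ptr = (unsigned char *)&current->thread.TS_FPR(reg);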
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, setting hw_event.exclude_kernel does nothing on the PPC970
variants used in Apple G5 machines, because they have the HV (hypervisor)
bit in the MSR forced to 1, so as far as the PMU is concerned, the
kernel runs in hypervisor mode. Thus we have to use the MMCR0_FCHV
(freeze counters in hypervisor mode) bit rather than the MMCR0_FCS
(freeze counters in supervisor mode) bit.
This checks the MSR.HV bit at startup, and if it is set, we set the
freeze_counters_kernel variable to MMCR0_FCHV (it was initialized to
MMCR0_FCS). We then use that whenever we need to exclude kernel events.
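A minimal sketch of the startup check (the init function name here is
illustrative):

    static unsigned long freeze_counters_kernel = MMCR0_FCS;

    static int __init init_g5_pmu(void)
    {
            if (mfmsr() & MSR_HV)   /* PMU sees the kernel as hypervisor */
                    freeze_counters_kernel = MMCR0_FCHV;
            return 0;
    }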
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix the VSX alignment handler for VSX registers 32-63, which are stored
in the VMX part of the thread_struct, not the FPR part.
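A sketch of the resulting pointer selection in the alignment handler,
given the thread_struct layout described above:

    if (reg < 32)
            ptr = (char *)&current->thread.TS_FPR(reg);     /* FPR part */
    else
            ptr = (char *)&current->thread.vr[reg - 32];    /* VMX part */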
Signed-off-by: Michael Neuling <mikey@neuling.org>
CC: stable@kernel.org (2.6.27 & .28 please)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Impact: new perf_counter feature
This extends the perf_counter_hw_event struct with bits that specify
that events in user, kernel and/or hypervisor mode should not be
counted (i.e. should be excluded), and adds code to program the PMU
mode selection bits accordingly on x86 and powerpc.
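Sketch of the new bits in perf_counter_hw_event (the existing fields are
omitted here):

    struct perf_counter_hw_event {
            /* ... existing fields ... */
            u64     exclude_user   : 1,  /* don't count user-mode events */
                    exclude_kernel : 1,  /* don't count kernel-mode events */
                    exclude_hv     : 1;  /* don't count hypervisor-mode events */
    };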
For software counters, we don't currently have the infrastructure to
distinguish which mode an event occurs in, so we currently fail the
counter initialization if the setting of the hw_event.exclude_* bits
would require us to distinguish. Context switches and CPU migrations
are currently considered to occur in kernel mode.
On x86, this changes the previous policy that only root can count
kernel events. Now non-root users can count kernel events or exclude
them. Non-root users still can't use NMI events, though. On x86 we
don't appear to have any way to control whether hypervisor events are
counted or not, so hw_event.exclude_hv is ignored.
On powerpc, the selection of whether to count events in user, kernel
and/or hypervisor mode is PMU-wide, not per-counter, so this adds a
check that the hw_event.exclude_* settings are the same as other events
on the PMU. Counters being added to a group have to have the same
settings as the other hardware counters in the group. Counters and
groups can only be enabled in hw_perf_group_sched_in or power_perf_enable
if they have the same settings as any other counters already on the
PMU. If we are not running on a hypervisor, the exclude_hv setting
is ignored (by forcing it to 0) since we can't ever get any
hypervisor events.
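A simplified sketch of the PMU-wide consistency check on powerpc (the
real check_excludes() also distinguishes previously-scheduled counters
from newly-added ones):

    static int check_excludes(struct perf_counter **ctrs, int n)
    {
            struct perf_counter_hw_event *first = &ctrs[0]->hw_event;
            int i;

            for (i = 1; i < n; ++i) {
                    struct perf_counter_hw_event *ev = &ctrs[i]->hw_event;

                    if (ev->exclude_user   != first->exclude_user ||
                        ev->exclude_kernel != first->exclude_kernel ||
                        ev->exclude_hv     != first->exclude_hv)
                            return -EAGAIN;
            }
            return 0;
    }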
Signed-off-by: Paul Mackerras <paulus@samba.org>
The new legacy_mem file in sysfs is causing problems with X on machines
that don't support legacy memory access. The way I initially implemented
it, we would fail with -ENXIO when trying to mmap it, thus exposing to
X that we do support the API but there is no legacy memory.
Unfortunately, X's poor error handling causes it to fail to start when
it gets this error.
This implements a workaround hack that maps anonymous memory instead
(using shmem if VM_SHARED is set, just like /dev/zero does).
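A sketch of the fallback, modeled on what /dev/zero's mmap does (the
has_legacy_mem condition is illustrative; the surrounding mmap handler
is elided):

    if (!has_legacy_mem) {
            if (vma->vm_flags & VM_SHARED)
                    return shmem_zero_setup(vma);   /* shared: shmem-backed */
            return 0;                       /* private: anonymous mapping */
    }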
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Impact: fix dynamic ftrace with large modules in PPC64
The math to calculate the offset into the TOC that is taken from reading
the trampoline is incorrect. The bottom half of the offset is a
sign-extended short. The current code was using an OR to create the offset
when it should have been using an addition.
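Sketched, with jmp[] holding the two halfword immediates read from the
trampoline (variable names illustrative):

    /* the high half is a plain immediate; the low half is sign-extended
     * by the addi-style instruction, so it must be added, not OR'd in: */
    offset = ((unsigned)((unsigned short)jmp[0]) << 16) +
             (int)((short)jmp[1]);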
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Recently, a patch left DEBUG enabled in the powerpc common PCI code,
causing an old bug in a pr_debug() statement to show up and trigger
a NULL dereference on some machines.
This fixes the pr_debug() statement and reverts to DEBUG not being
force-enabled in that file.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In the VIO bus code the wrappers for dma alloc_coherent and free_coherent
calls are rounding to IOMMU_PAGE_SIZE. Taking a look at the underlying
calls, the actual mapping is promoted to PAGE_SIZE. Changing the
rounding in these two functions fixes under-reporting the entitlement
used by the system. Without this change, the system could run out of
entitlement before it believes it has done so, incurring mapping failures
at the firmware level.
Also in the VIO bus code, the wrapper for dma map_sg is not exiting in
an error path where it should. Rather than fall through to code for the
success case, this patch adds the return that is needed in the error path.
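Both fixes sketched (argument lists simplified; vio_cmo_alloc/dealloc
are the VIO entitlement accounting helpers):

    /* 1) account entitlement at the granularity the underlying mapping
     *    is actually promoted to: */
    if (vio_cmo_alloc(viodev, roundup(size, PAGE_SIZE)))
            return NULL;

    /* 2) in the map_sg wrapper, return on failure instead of falling
     *    through to the success-path accounting: */
    ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction);
    if (unlikely(!ret)) {
            vio_cmo_dealloc(viodev, alloc_size);
            return ret;
    }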
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Conflicts:
arch/x86/include/asm/pda.h
We merge tip/core/percpu into tip/perfcounters/core because of a
semantic and contextual conflict: the former eliminates the PDA,
while the latter extends it with apic_perf_irqs field.
Resolve the conflict by moving the new field to the irq_cpustat
structure on 64-bit too.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arm, arm/mach-integrator and powerpc were missing
.data.percpu.page_aligned in their percpu output section definitions.
Add it.
Signed-off-by: Tejun Heo <tj@kernel.org>
The PAPR says that the property for specifying the number of SLBs should
be called "slb-size". We currently only look for "ibm,slb-size" because
this is what firmware actually presents.
This patch makes us look for the "slb-size" property as well, in
preference to "ibm,slb-size". This should future-proof us if firmware
changes to match PAPR.
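A sketch of the lookup order, using the flat device tree helpers
available at boot:

    u32 *slb_size_ptr;

    slb_size_ptr = of_get_flat_dt_prop(node, "slb-size", NULL);
    if (slb_size_ptr != NULL) {
            mmu_slb_size = *slb_size_ptr;
            return;
    }
    slb_size_ptr = of_get_flat_dt_prop(node, "ibm,slb-size", NULL);
    if (slb_size_ptr != NULL)
            mmu_slb_size = *slb_size_ptr;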
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Impact: New perf_counter features
A pinned counter group is one that the user wants to have on the CPU
whenever possible, i.e. whenever the associated task is running, for
a per-task group, or always for a per-cpu group. If the system
cannot satisfy that, it puts the group into an error state where
it is not scheduled any more and reads from it return EOF (i.e. 0
bytes read). The group can be released from error state and made
readable again using prctl(PR_TASK_PERF_COUNTERS_ENABLE). When we
have finer-grained enable/disable controls on counters we'll be able
to reset the error state on individual groups.
An exclusive group is one that the user wants to be the only group
using the CPU performance monitor hardware whenever it is on. The
counter group scheduler will not schedule an exclusive group if there
are already other groups on the CPU and will not schedule other groups
onto the CPU if there is an exclusive group scheduled (that statement
does not apply to groups containing only software counters, which can
always go on and which do not prevent an exclusive group from going on).
With an exclusive group, we will be able to let users program PMU
registers at a low level without the concern that those settings will
perturb other measurements.
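Both attributes are requested via new bits in perf_counter_hw_event,
sketched here with the other fields omitted:

    struct perf_counter_hw_event {
            /* ... existing fields ... */
            u64     pinned    : 1,  /* must always be on the PMU */
                    exclusive : 1;  /* only group on the PMU when scheduled */
    };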
Along the way this reorganizes things a little:
- is_software_counter() is moved to perf_counter.h.
- cpuctx->active_oncpu now records the number of hardware counters on
the CPU, i.e. it now excludes software counters. Nothing was reading
cpuctx->active_oncpu before, so this change is harmless.
- A new cpuctx->exclusive field records whether we currently have an
exclusive group on the CPU.
- counter_sched_out moves higher up in perf_counter.c and gets called
from __perf_counter_remove_from_context and __perf_counter_exit_task,
where we used to have essentially the same code.
- __perf_counter_sched_in now goes through the counter list twice, doing
the pinned counters in the first loop and the non-pinned counters in
the second loop, in order to give the pinned counters the best chance
to be scheduled in.
Note that only a group leader can be exclusive or pinned, and that
attribute applies to the whole group. This avoids some awkwardness in
some corner cases (e.g. where a group leader is closed and the other
group members get added to the context list). If we want to relax that
restriction later, we can, and it is easier to relax a restriction than
to apply a new one.
This doesn't yet handle the case where a pinned counter is inherited
and goes into error state in the child - the error state is not
propagated up to the parent when the child exits, and arguably it
should.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This makes sure that we call the platform-specific ppc_md.enable_pmcs
function on each CPU before we try to use the PMU on that CPU. If the
CPU goes off-line and then on-line, we need to do the enable_pmcs call
again, so we use the hw_perf_counter_setup hook to ensure that. It gets
called as each CPU comes online, but it isn't called on the CPU that is
coming up, so this adds the CPU number as an argument to it (there were
no non-empty instances of hw_perf_counter_setup before).
This also arranges to set the pmcregs_in_use field of the lppaca (data
structure shared with the hypervisor) on each CPU when we are using the
PMU and clear it when we are not. This allows the hypervisor to optimize
partition switches by not saving/restoring the PMU registers when we
aren't using the PMU.
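Sketched, using the lppaca accessor (the enable/disable context is
elided):

    /* tell the hypervisor whether PMU state needs saving on switches */
    get_lppaca()->pmcregs_in_use = 1;       /* when counters are active */
    /* ...and set it back to 0 when the last counter is removed */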
Signed-off-by: Paul Mackerras <paulus@samba.org>
We use Doorbell interrupts for IPIs and thus need to make sure we aren't
interrupted while processing the IPI.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Acked-by: Dave Liu <daveliu@freescale.com>
The PowerMac kernel occasionally fails to bring up the secondary CPUs on
SMP; the trigger seems to be fairly random and related to the location
of code and data.
This appears to be due to the initial loading of the TOC value by the
secondary processor which now happens before we clear HID4:RM_CI (Real
Mode Cache Invalidate). This bit should really be cleared before we do
any load or store other than fetching code.
This fix works on the assumption that all SMP 64-bit PowerMacs use
variants of the 970, which fortunately is true. It explicitly clears
that bit, adding an slbia for good measure since RM_CI mode is known to
create bogus ERAT entries.
While at it, I also removed some spurious debug output that was left
enabled by mistake.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The per_cpu__ prefix on DECLARE_PER_CPU'd variables is going away;
rename cache_dir to cache_dir_pcpu.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Convert arch/powerpc/ over to long long based u64:
-#ifdef __powerpc64__
-# include <asm-generic/int-l64.h>
-#else
-# include <asm-generic/int-ll64.h>
-#endif
+#include <asm-generic/int-ll64.h>
This avoids recurring spurious warnings in core kernel code that show up
when people test on their own hardware (i.e. x86 in ~98% of the cases).
This is what x86 uses, and it generally helps keep 64-bit code 32-bit
clean too.
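For example, with a long long based u64 the same printk works
warning-free on both 32-bit and 64-bit builds, with no cast:

    u64 size = 0x100000000ULL;

    printk(KERN_INFO "size = 0x%llx\n", size);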
[Adjusted to not impact user mode (from paulus) - sfr]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Enforce that the crash kernel region never overlaps the current kernel,
as it will be written directly on kexec load.
Also, default to the previous KDUMP_KERNELBASE if the start is 0.
Other architectures (x86, ia64) state that specifying the start address
0 (or omitting it) will result in the kernel allocating it. Before the
relocatable patch in 2.6.28, powerpc would adjust any other start value
to the hardcoded KDUMP_KERNELBASE of 32M.
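A hedged sketch of the overlap check (crashk_res is the reserved region;
_stext/_end bound the running kernel image):

    if (crashk_res.start < __pa(_end) && crashk_res.end >= __pa(_stext)) {
            printk(KERN_WARNING
                   "Crash kernel region overlaps the kernel, ignoring it\n");
            crashk_res.start = crashk_res.end = 0;
            return;
    }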
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We are declaring the dummy section (used to work around a binutils
bug) as PT_NOTE, but we don't have enough bytes for it to be a valid
note header, and kexec userspace complains:
Warning: Elf Note name is not null terminated
Warning: append= option is not passed. Using the first kernel root partition
Warning: Elf Note name is not null terminated
Instead of using the arbitrary value 0xf177 (aka "fill"), declare a
no-name no-description note of type 0.
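For reference, an ELF note header is three 32-bit words followed by the
name and descriptor payloads, so the replacement note is simply all
zeros (struct name here is illustrative):

    struct elf_note_hdr {
            u32 namesz;     /* 0: no name, nothing to null-terminate */
            u32 descsz;     /* 0: no descriptor */
            u32 type;       /* 0 */
    };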
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Impact: cleanup, update to new cpumask API
irq_desc.affinity and irq_desc.pending_mask are now cpumask_var_t's,
so they should be accessed using the new cpumask API.
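The conversion pattern, sketched:

    /* old: direct struct assignment            */
    /*      desc->affinity = cpu_online_map;    */
    /* new: go through the cpumask accessors    */
    cpumask_copy(desc->affinity, cpu_online_mask);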
Signed-off-by: Mike Travis <travis@sgi.com>
This adds the back-end for the PMU on the POWER6 processor.
Fortunately, the event selection hardware is somewhat simpler on
POWER6 than on other POWER family processors, so the constraints
fit into only 32 bits.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds the back-end for the PMU on the PPC970 family.
The PPC970 allows events from the ISU to be selected in two different
ways. Rather than use alternative event codes to express this, we
instead use a single encoding for ISU events and express the
resulting constraint (that you can't select events from all three
of FPU/IFU/VPU, ISU and IDU/STS at the same time, since they all come
in through only 2 multiplexers) using a NAND constraint field, and
work out which multiplexer is used for ISU events at compute_mmcr
time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This provides the architecture-specific functions needed to access
PMU hardware on the 64-bit PowerPC processors. It has been designed
for the IBM POWER family (POWER 4/4+/5/5+/6 and PPC970) but will
hopefully also suit other 64-bit PowerPC machines (although probably
not Cell given how different it is in this area). This doesn't
include back-ends for any specific processors.
This implements a system which allows back-ends to express the
constraints that their hardware has on what events can be counted
simultaneously. The constraints are expressed as a 64-bit mask +
64-bit value for each event, and the encoding is capable of
expressing the constraints arising from having a set of multiplexers
feeding an event bus, with some events being available through
multiple multiplexer settings, such as we get on POWER4 and PPC970.
Furthermore, the back-end can supply alternative event codes for
each event, and the constraint checking code will try all possible
combinations of alternative event codes to try to find a combination
that will fit.
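A deliberately simplified sketch of the core mask+value idea (the real
encoding also supports additive sub-fields for counting events per
multiplexer, omitted here):

    /* each event contributes (mask, value); two events can coexist if
     * their values agree on every bit selected by both masks: */
    static int constraints_compatible(u64 mask_a, u64 value_a,
                                      u64 mask_b, u64 value_b)
    {
            return ((value_a ^ value_b) & (mask_a & mask_b)) == 0;
    }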
Signed-off-by: Paul Mackerras <paulus@samba.org>
Because 64-bit powerpc uses lazy (soft) interrupt disabling, it is
possible for a performance monitor exception to come in when the
kernel thinks interrupts are disabled (i.e. when they are
soft-disabled but hard-enabled). In such a situation the performance
monitor exception handler might have some processing to do (such as
process wakeups) which can't be done in what is effectively an NMI
handler.
This provides a way to defer that work until interrupts get enabled,
either in raw_local_irq_restore() or by returning from an interrupt
handler to code that had interrupts enabled. We have a per-processor
flag that indicates that there is work pending to do when interrupts
subsequently get re-enabled. This flag is checked in the interrupt
return path and in raw_local_irq_restore(), and if it is set,
perf_counter_do_pending() is called to do the pending work.
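Sketch of the check made on the interrupt-enable paths (helper names
follow the powerpc implementation; the flag lives in the paca on
64-bit):

    if (test_perf_counter_pending()) {
            clear_perf_counter_pending();
            perf_counter_do_pending();      /* wakeups etc., now safe */
    }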
Signed-off-by: Paul Mackerras <paulus@samba.org>
The Freescale PowerPC specific gianfar (gig-e) driver uses
cacheable_memzero for performance reasons. We need to export the symbol
to allow the driver to be built as a module.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
tce_entryp is a "u64 *" not an "unsigned long *".
[Split from a large patch -sfr]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
of_get_flat_dt_prop() returns a "void *", so we don't need to cast when
assigning its result to a pointer variable.
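Illustratively:

    u64 *limit;

    /* before: limit = (u64 *)of_get_flat_dt_prop(node, "linux,memory-limit", NULL); */
    limit = of_get_flat_dt_prop(node, "linux,memory-limit", NULL);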
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
X has been failing to start on my quad G5 powermac since commit
1fd0f52583 ("powerpc: Fix domain numbers
in /proc on 64-bit") went in. The reason is that the change allows X
to see the PCI-PCI bridge above the video card (previously it was
obscured by the fact that there were two "00" directories in
/proc/bus/pci), and the pciconfig_iobase system call on the bridge is
failing because of a hack whereby we return information about the
AGP bus when X asks about bus 0. This machine doesn't have an AGP bus
(it has PCI Express) and so the pciconfig_iobase call is returning -1,
which ultimately causes X to fail to start.
This fixes it by checking that we have an AGP bridge before
redirecting the pciconfig_iobase call to return information about the
AGP bus. With this, X starts successfully both on a quad G5 with
PCI Express and on an older dual G5 with AGP.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The current code for providing processor cache information in sysfs
has the following deficiencies:
- several complex functions that are hard to understand
- implicit recursion (cache_desc_release -> kobject_put -> cache_desc_release)
- explicit recursion (create_cache_index_info)
- use of two per-cpu arrays when one would suffice
- duplication of work on systems where CPUs share cache
Also, when I looked at implementing support for a shared_cpu_map
attribute, it was pretty much impossible to handle hotplug without
checking every single online CPU's cache_desc list and fixing things
up... not that this is a hot path, but it would have introduced
O(n^2)-ish behavior during boot. Addressing this involved rethinking
the core data structures used, which didn't lend itself to an
incremental approach.
This implementation maintains a "forest" (potentially more than one
tree) of cache objects which reflects the system's cache topology.
Cache objects are instantiated as needed as CPUs come online. A
per-cpu array is used mainly for sysfs-related bookkeeping; the
objects in the array just point to the appropriate points in the
forest.
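A hedged sketch of the kind of node the forest is built from (the field
set is illustrative):

    struct cache {
            struct device_node *ofnode;     /* OF node for this cache   */
            int level;                      /* 1 = L1, 2 = L2, ...      */
            struct cache *next_local;       /* next bigger cache level  */
            struct list_head list;          /* all caches in the forest */
    };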
This maintains compatibility with the existing code and includes some
enhancements:
- Implement the shared_cpu_map attribute, which is essential for
enabling userspace to discover the system's overall cache topology.
- Use cache-block-size properties if cache-line-size is not available.
I chose to place this implementation in a new file since it would have
roughly doubled the size of sysfs.c, which is already kind of messy.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There's a problem on some embedded platforms when we re-assign
everything on PCI, such as 44x. The generic code tries to avoid
assigning devices to addresses overlapping the low legacy
addresses such as VGA hard decoded areas using constants that
are unfortunately no good for us, as they don't take into account
the address translation we do to access PCI busses.
Thus we end up allocating things like IO BARs to 0, which is
technically legal, but will shadow hard decoded ports for use
by things like VGA cards.
This works around it by attempting to reserve legacy regions
before we try to assign addresses.
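A sketch of one such reservation (resource names and ranges are
illustrative; offset is the per-bus translation):

    res->name  = "Legacy IO";
    res->flags = IORESOURCE_IO | IORESOURCE_BUSY;
    res->start = offset;                    /* bus I/O address 0 */
    res->end   = offset + 0xfff;            /* legacy ports 0x000-0xfff */
    if (request_resource(&ioport_resource, res))
            pr_debug("Cannot reserve legacy IO on bus %d\n", bus->number);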
NOTE: This may have nasty side effects in cases I haven't tested
yet:
- We try to use FW mappings (i.e. powermac) and the FW has allocated
a conflicting address over those legacy regions. This will typically
happen. I would expect the new code to just fail with an informative
message without harm, but I haven't had a chance to test that scenario
yet.
- A device with fixed BARs overlapping those legacy addresses, such
as an IDE controller in legacy mode, is in the system. I don't know
for sure yet what will happen there; I have to test :-)
Ideally, we should change PCIBIOS_MIN_IO/MIN_MEM across the board to
take a bus pointer so they can provide appropriate per-bus translated
values to the generic code, but that's a more invasive patch. I will
do that in the future; in the meantime, this fixes the problem locally.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (98 commits)
PCI PM: Put PM callbacks in the order of execution
PCI PM: Run default PM callbacks for all devices using new framework
PCI PM: Register power state of devices during initialization
PCI PM: Call pci_fixup_device from legacy routines
PCI PM: Rearrange code in pci-driver.c
PCI PM: Avoid touching devices behind bridges in unknown state
PCI PM: Move pci_has_legacy_pm_support
PCI PM: Power-manage devices without drivers during suspend-resume
PCI PM: Add suspend counterpart of pci_reenable_device
PCI PM: Fix poweroff and restore callbacks
PCI: Use msleep instead of cpu_relax during ASPM link retraining
PCI: PCIe portdrv: Add kerneldoc comments to remining core funtions
PCI: PCIe portdrv: Rearrange code so that related things are together
PCI: PCIe portdrv: Fix suspend and resume of PCI Express port services
PCI: PCIe portdrv: Add kerneldoc comments to some core functions
x86/PCI: Do not use interrupt links for devices using MSI-X
net: sfc: Use pci_clear_master() to disable bus mastering
PCI: Add pci_clear_master() as opposite of pci_set_master()
PCI hotplug: remove redundant test in cpq hotplug
PCI: pciehp: cleanup register and field definitions
...
This is a global variable defined in fsl_booke_mmu.c with a value that gets
initialized in assembly code in head_fsl_booke.S.
It's never used.
If some code ever does want to know the number of entries in TLB1, then
"numcams = mfspr(SPRN_TLB1CFG) & 0xfff", is a whole lot simpler than a
global initialized during kernel boot from assembly.
Signed-off-by: Trent Piepho <tpiepho@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Some assembly code in head_fsl_booke.S hard-coded the size of struct tlbcam
to 20 when it indexed the TLBCAM table. Anyone changing the size of struct
tlbcam would not know to expect that.
The kernel already has a system to get the size of C structures into
assembly language files, asm-offsets, so let's use it.
The definition of the struct gets moved to a header, so that asm-offsets.c
can include it.
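Sketched, in asm-offsets.c:

    /* emit the structure size as a constant usable from assembly,
     * replacing the hard-coded 20: */
    DEFINE(TLBCAM_SIZE, sizeof(struct tlbcam));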
Signed-off-by: Trent Piepho <tpiepho@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
trivial: chack -> check typo fix in main Makefile
trivial: Add a space (and a comma) to a printk in 8250 driver
trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
trivial: Fix misspelling of "firmware" in powerpc Makefile
trivial: Fix misspelling of "firmware" in usb.c
trivial: Fix misspelling of "firmware" in qla1280.c
trivial: Fix misspelling of "firmware" in a100u2w.c
trivial: Fix misspelling of "firmware" in megaraid.c
trivial: Fix misspelling of "firmware" in ql4_mbx.c
trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
trivial: Fix misspelling of "firmware" in ipw2100.c
trivial: Fix misspelling of "firmware" in atmel.c
trivial: Fix misspelled firmware in Kconfig
trivial: fix an -> a typos in documentation and comments
trivial: fix then -> than typos in comments and documentation
trivial: update Jesper Juhl CREDITS entry with new email
trivial: fix singal -> signal typo
trivial: Fix incorrect use of "loose" in event.c
trivial: printk: fix indentation of new_text_line declaration
trivial: rtc-stk17ta8: fix sparse warning
...
Use the generic pci_swizzle_interrupt_pin() instead of arch-specific code.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>