linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-06 12:44:14 +08:00

Author	SHA1	Message	Date
Linus Torvalds	2fb9e96cad	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull additional x86 fixes from Peter Anvin: - address a long-standing bug related to when a kernel-spawned process gets a signal on an i386 kernel compiled without CONFIG_VM86. - fix the newly introduced build warning in arch/x86/boot. - fix a typo in the i386 system call table which affects building some libcs. * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86-32: Fix endless loop when processing signals for kernel tasks x86, boot: Correct CFLAGS for hostprogs x86-32: Fix typo for mq_getsetattr in syscall table	2012-03-23 17:10:09 -07:00
Linus Torvalds	8e3ade251b	Merge branch 'akpm' (Andrew's patch-bomb) Merge second batch of patches from Andrew Morton: - various misc things - core kernel changes to prctl, exit, exec, init, etc. - kernel/watchdog.c updates - get_maintainer - MAINTAINERS - the backlight driver queue - core bitops code cleanups - the led driver queue - some core prio_tree work - checkpatch udpates - largeish crc32 update - a new poll() feature for the v4l guys - the rtc driver queue - fatfs - ptrace - signals - kmod/usermodehelper updates - coredump - procfs updates * emailed from Andrew Morton <akpm@linux-foundation.org>: (141 commits) seq_file: add seq_set_overflow(), seq_overflow() proc-ns: use d_set_d_op() API to set dentry ops in proc_ns_instantiate(). procfs: speed up /proc/pid/stat, statm procfs: add num_to_str() to speed up /proc/stat proc: speed up /proc/stat handling fs/proc/kcore.c: make get_sparsemem_vmemmap_info() static coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP coredump: remove VM_ALWAYSDUMP flag kmod: make __request_module() killable kmod: introduce call_modprobe() helper usermodehelper: ____call_usermodehelper() doesn't need do_exit() usermodehelper: kill umh_wait, renumber UMH_* constants usermodehelper: implement UMH_KILLABLE usermodehelper: introduce umh_complete(sub_info) usermodehelper: use UMH_WAIT_PROC consistently signal: zap_pid_ns_processes: s/SEND_SIG_NOINFO/SEND_SIG_FORCED/ signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() signal: cosmetic, s/from_ancestor_ns/force/ in prepare_signal() paths signal: give SEND_SIG_FORCED more power to beat SIGNAL_UNKILLABLE Hexagon: use set_current_blocked() and block_sigmask() ...	2012-03-23 16:59:10 -07:00
Akinobu Mita	0b2f4d4d76	x86: use for_each_clear_bit_from() Use for_each_clear_bit() to iterate over all the cleared bit in a memory region. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-23 16:58:34 -07:00
Akinobu Mita	307b1cd7ec	bitops: rename for_each_set_bit_cont() in favor of analogous list.h function This renames for_each_set_bit_cont() to for_each_set_bit_from() because it is analogous to list_for_each_entry_from() in list.h rather than list_for_each_entry_continue(). This doesn't remove for_each_set_bit_cont() for now. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-23 16:58:33 -07:00
Linus Torvalds	475c77edf8	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci Pull PCI changes (including maintainer change) from Jesse Barnes: "This pull has some good cleanups from Bjorn and Yinghai, as well as some more code from Yinghai to better handle resource re-allocation when enabled. There's also a new initcall_debug feature from Arjan which will print out quirk timing information to help identify slow quirks for fixing or refinement (Yinghai sent in a few patches to do just that once the new debug code landed). Beyond that, I'm handing off PCI maintainership to Bjorn Helgaas. He's been a core PCI and Linux contributor for some time now, and has kindly volunteered to take over. I just don't feel I have the time for PCI review and work that it deserves lately (I've taken on some other projects), and haven't been as responsive lately as I'd like, so I approached Bjorn asking if he'd like to manage things. He's going to give it a try, and I'm confident he'll do at least as well as I have in keeping the tree managed, patches flowing, and keeping things stable." Fix up some fairly trivial conflicts due to other cleanups (mips device resource fixup cleanups clashing with list handling cleanup, ppc iseries removal clashing with pci_probe_only cleanup etc) * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: (112 commits) PCI: Bjorn gets PCI hotplug too PCI: hand PCI maintenance over to Bjorn Helgaas unicore32/PCI: move <asm-generic/pci-bridge.h> include to asm/pci.h sparc/PCI: convert devtree and arch-probed bus addresses to resource powerpc/PCI: allow reallocation on PA Semi powerpc/PCI: convert devtree bus addresses to resource powerpc/PCI: compute I/O space bus-to-resource offset consistently arm/PCI: don't export pci_flags PCI: fix bridge I/O window bus-to-resource conversion x86/PCI: add spinlock held check to 'pcibios_fwaddrmap_lookup()' PCI / PCIe: Introduce command line option to disable ARI PCI: make acpihp use __pci_remove_bus_device instead PCI: export __pci_remove_bus_device PCI: Rename pci_remove_behind_bridge to pci_stop_and_remove_behind_bridge PCI: Rename pci_remove_bus_device to pci_stop_and_remove_bus_device PCI: print out PCI device info along with duration PCI: Move "pci reassigndev resource alignment" out of quirks.c PCI: Use class for quirk for usb host controller fixup PCI: Use class for quirk for ti816x class fixup PCI: Use class for quirk for intel e100 interrupt fixup ...	2012-03-23 14:02:12 -07:00
Linus Torvalds	a20ae85aba	Fixes: - Fix KDB keyboard repeat scan codes and leaked keyboard events - Fix kernel crash with kdb_printf() for users who compile new kdb_printf()'s in early code - Return all segment registers to gdb on x86_64 Features: - KDB/KGDB hook the reboot notifier and end user can control if it stops, detaches or does nothing (updated docs as well) - Notify users who use CONFIG_DEBUG_RODATA to use hw breakpoints -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPbJ17AAoJEIciOldedpOjQeIP/AkUxQFJ7O4aLrLYHl62EHnh spkgkd+nBzIzcKyV73alkrVBaR2WE2822aAPQmAPBP8/X283DZJJjqgDCUNVI1Mf CIZ7g8AQHRnS+bAZmof5Jss4malZn4byLvG/cfpOivrsye+4A8MdrAKKM3pYWNVy 4xABkcEknVEEamdNEhHrcPd+xehretfw7+9mmU8hfjqHb/5cXFB7JwDcf4tF7ozT MDyN4xKtOn1W/ftQl0t6izksCUuPyqKzIfUyAy0j6AwTgsEavXu56S52T1UoB2ZI JcwLn/ZpN4eGCWVodY04R3jzaMtKFb6ImY40wsb5wl3CU3Ecy5syMU6z4fg3cvjH /KE6xWF61j4yiE5lzjeJVtKyxwalthzrr56qU2uEwrsEVmo3SOnW9sm0cwouqa7j /TAMlhZuGgbZGesFwdaUKI5OLGoki+pRQ0Gaq3TsbZwpPC5Bimkq0bIvruruKJCP QWVkEvb5TZgxCFS3AvniePOm7Hc2wD9zXB3OfN3o91pCom0ryDBIthcLlwhVeNCo Jd67pnJJNVULPF/1qVicZihKHxvG3DUb4E9pUcbgJplBke+isi+3eHOvnQrYFjIG iCamE9qvVbsQm/OFV8MOJ5mfPs9R+nb/jNzTO8JDmBc8AL7nRDV3AFGjW68x/KWT ERcqNEGJ4QuVAxfejq76 =SXu9 -----END PGP SIGNATURE----- Merge tag 'for_linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb Pull KGDB/KDB updates from Jason Wessel: "Fixes: - Fix KDB keyboard repeat scan codes and leaked keyboard events - Fix kernel crash with kdb_printf() for users who compile new kdb_printf()'s in early code - Return all segment registers to gdb on x86_64 Features: - KDB/KGDB hook the reboot notifier and end user can control if it stops, detaches or does nothing (updated docs as well) - Notify users who use CONFIG_DEBUG_RODATA to use hw breakpoints" * tag 'for_linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb: kdb: Add message about CONFIG_DEBUG_RODATA on failure to install breakpoint kdb: Avoid using dbg_io_ops until it is initialized kgdb,debug_core: add the ability to control the reboot notifier KDB: Fix usability issues relating to the 'enter' key. kgdb,debug-core,gdbstub: Hook the reboot notifier for debugger detach kgdb: Respect that flush op is optional kgdb: x86: Return all segment registers also in 64-bit mode	2012-03-23 09:29:44 -07:00
Dmitry Adamushko	29a2e2836f	x86-32: Fix endless loop when processing signals for kernel tasks The problem occurs on !CONFIG_VM86 kernels [1] when a kernel-mode task returns from a system call with a pending signal. A real-life scenario is a child of 'khelper' returning from a failed kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ]. kernel_execve() fails due to a pending SIGKILL, which is the result of "kill -9 -1" (at least, busybox's init does it upon reboot). The loop is as follows: * syscall_exit_work: - work_pending: // start_of_the_loop - work_notify_sig: - do_notify_resume() - do_signal() - if (!user_mode(regs)) return; - resume_userspace // TIF_SIGPENDING is still set - work_pending // so we call work_pending => goto // start_of_the_loop More information can be found in another LKML thread: http://www.serverphorums.com/read.php?12,457826 [1] the problem was also seen on MIPS. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Link: http://lkml.kernel.org/r/1332448765.2299.68.camel@dimm Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2012-03-22 13:50:25 -07:00
Jan Kiszka	639077fb69	kgdb: x86: Return all segment registers also in 64-bit mode Even if the content is always 0, gdb expects us to return also ds, es, fs, and gs while in x86-64 mode. Do this to avoid ugly errors on "info registers". [jason.wessel@windriver.com: adjust NUMREGBYTES for two new regs] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2012-03-22 15:07:15 -05:00
Linus Torvalds	28f23d1f3b	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 "urgent" leftovers from Ingo Molnar: "Pending x86/urgent bits that were not high prio enough to warrant -rc-less v3.3-final inclusion." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, efi: Fix pointer math issue in handle_ramdisks() x86/ioapic: Add register level checks to detect bogus io-apic entries x86, mce: Fix rcu splat in drain_mce_log_buffer() x86, memblock: Move mem_hole_size() to .init	2012-03-22 09:44:50 -07:00
Linus Torvalds	2390481546	Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform changes from Ingo Molnar. Removes the Moorestown platform that nobody ever used. * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform: Move APIC ID validity check into platform APIC code x86/olpc/xo15/sci: Enable lid close wakeup control x86/geode/net5501: Add platform driver for Soekris Engineering net5501 x86/geode/alix2: Supplement driver to include GPIO button support x86/mid/powerbtn: Use MSIC read/write instead of ipc_scu x86/mid/thermal: Turn off thermistor x86/mid/thermal: Add msic_thermal alias x86/mid/thermal: Convert to use Intel MSIC API x86/mid/scu_ipc: Remove Moorestown support x86/mid: Kill off Moorestown x86/mrst: Add msic_thermal platform support x86/config: Select MSIC MFD driver on Intel Medfield platform x86/mid: Remove Intel Moorestown x86/mrst: Set ISA bus type for fake MP IRQs x86/ioapic: Use legacy_pic to set correct gsi-irq mapping	2012-03-22 09:43:22 -07:00
Linus Torvalds	754b980077	Merge branch 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MCE changes from Ingo Molnar. * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Fix return value of mce_chrdev_read() when erst is disabled x86/mce: Convert static array of pointers to per-cpu variables x86/mce: Replace hard coded hex constants with symbolic defines x86/mce: Recognise machine check bank signature for data path error x86/mce: Handle "action required" errors x86/mce: Add mechanism to safely save information in MCE handler x86/mce: Create helper function to save addr/misc when needed HWPOISON: Add code to handle "action required" errors. HWPOISON: Clean up memory_failure() vs. __memory_failure()	2012-03-22 09:42:04 -07:00
Linus Torvalds	35cb8d9e18	Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/fpu changes from Ingo Molnar. * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: i387: Split up <asm/i387.h> into exported and internal interfaces i387: Uninline the generic FP helpers that we expose to kernel modules	2012-03-22 09:41:22 -07:00
Linus Torvalds	f06fc0c0de	Merge branch 'x86-eficross-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/eficross (booting 32/64-bit kernel from 64/32-bit EFI) from Ingo Molnar * 'x86-eficross-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, efi: Allow basic init with mixed 32/64-bit efi/kernel x86, efi: Add basic error handling x86, efi: Cleanup config table walking x86, efi: Convert printk to pr_*() x86, efi: Refactor efi_init() a bit	2012-03-22 09:31:31 -07:00
Linus Torvalds	4c64616bb5	Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/debug changes from Ingo Molnar. * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix section warnings x86-64: Fix CFI data for common_interrupt() x86: Properly _init-annotate NMI selftest code x86/debug: Fix/improve the show_msr=<cpus> debug print out	2012-03-22 09:30:39 -07:00
Linus Torvalds	c5c7fb8fbd	Merge branches 'x86-cpu-for-linus', 'x86-boot-for-linus', 'x86-cpufeature-for-linus', 'x86-process-for-linus' and 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull trivial x86 branches from Ingo Molnar: small one-liners to fix up details. * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Remove some noise from boot log when starting cpus * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, boot: Fix port argument to inl() function * 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, cpufeature: Add CPU features from Intel document 319433-012A * 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86_64: Record stack pointer before task execution begins * 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/UV: Lower UV rtc clocksource rating	2012-03-22 09:28:15 -07:00
Linus Torvalds	e17fdf5c67	Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/asm changes from Ingo Molnar * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Include probe_roms.h in probe_roms.c x86/32: Print control and debug registers for kerenel context x86: Tighten dependencies of CPU_SUP_*_32 x86/numa: Improve internode cache alignment x86: Fix the NMI nesting comments x86-64: Improve insn scheduling in SAVE_ARGS_IRQ x86-64: Fix CFI annotations for NMI nesting code bitops: Add missing parentheses to new get_order macro bitops: Optimise get_order() bitops: Adjust the comment on get_order() to describe the size==0 case x86/spinlocks: Eliminate TICKET_MASK x86-64: Handle byte-wise tail copying in memcpy() without a loop x86-64: Fix memcpy() to support sizes of 4Gb and above x86-64: Fix memset() to support sizes of 4Gb and above x86-64: Slightly shorten copy_page()	2012-03-22 09:13:24 -07:00
Xiao Guangrong	b716ad953a	mm: search from free_area_cache for the bigger size If the required size is bigger than cached_hole_size it is better to search from free_area_cache - it is easier to get a free region, specifically for the 64 bit process whose address space is large enough Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-21 17:54:56 -07:00
Andrea Arcangeli	1a5a9906d4	mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through / } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t pmd) 144 { -> 145 pmd_ERROR(pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. \| \| \| \| .-\|---------------------\| \| \| \| \| \| \|<-- B(fault) \| \| \| 2 MB \| \|/////////////////////\|-. huge < \|/////////////////////\| > A(range) page \| \|/////////////////////\|-' \| \| \| \| \| \| '-\|---------------------\| \| \| \| \| '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(&current->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. \| // \| if (pmd_none_or_clear_bad(pmd)) \| { \| if (unlikely(pmd_bad(pmd))) \| pmd_clear_bad \| { \| pmd_ERROR \| // Log "bad pmd ..." message here. \| pmd_clear \| // Clear the page's PMD entry. \| // Thread B incremented the map count \| // in page_add_new_anon_rmap(), but \| // now the page is no longer mapped \| // by a PMD entry (-> inconsistency). \| } \| } \| v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-21 17:54:54 -07:00
Linus Torvalds	c207f3a431	Generialize powerpc's irq_host as irq_domain This branch takes the PowerPC irq_host infrastructure (reverse mapping from Linux IRQ numbers to hardware irq numbering), generalizes it, renames it to irq_domain, and makes it available to all architectures. Originally the plan has been to create an all-new irq_domain implementation which addresses some of the powerpc shortcomings such as not handling 1:1 mappings well, but doing that proved to be far more difficult and invasive than generalizing the working code and refactoring it in-place. So, this branch rips out the 'new' irq_domain and replaces it with the modified powerpc version (in a fully bisectable way of course). It converts all users over to the new API and makes irq_domain selectable on any architecture. No architecture is forced to enable irq_domain, but the infrastructure is required for doing OpenFirmware style irq translations. It will even work on SPARC even though SPARC has it's own mechanism for translating irqs at boot time. MIPS, microblaze, embedded x86 and c6x are converted too. The resulting irq_domain code is probably still too verbose and can be optimized more, but that can be done incrementally and is a task for follow-on patches. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPZ1yiAAoJEEFnBt12D9kB4yIQAJvCfTPL65sCYVD6i9RnVHtR ahwddtd0AtT+UYLU8Xg2fZgVi6cmupDGnqkBixzZD3xxSTERqm7Snqa0ugklfeAi B6Zqf/K17H5hJNaoQ3fkNauow8m7ZYOeEH2vVUvkb3woWS9Wm7OGd+BvcIBgYSGe Aaoumhu7kDxFkii0qz3x/+kvsb6DRp2HtSPWj+APL/kNjdiO4JBOihtcc/lX6d47 bsZLiEMzHUFV4ApJNwqmfDnf54oMrHmrRJxgQHIMjeJC5or9I3Do8wDGe/aTF5xO 5GVpxCQsTlJMjTBWlAFtpTwCJB6y76EHQrHc7WzLlq8OJSsxApOke8M0BzXFrfMy CU7UUpTvNZTLpZibLCEQKemv1+oNOkfFylsHxfek2MCqx0W6W4FHEGV3qE/GtgV9 +vurA9hNNp7VM0FGRGigcUr3woYdHLdEVQrlnL7Z9AgBu1W44MZLaai7iRVZOeCT ZQ9++v2PJJ8vHT8kdkgTdiRpnEhmv84MX/GBT7ilWFEMIVeT5zhGkIBojzNgyzGc 7cvermmM0P8h+unkDgmzmSbDxo0PboqVKeoO71AOBhA6MmR9iom7XkuNdHhoOwy2 4A5xT1srbhJDbuv15BBREBV24TywpZ4a1+4nwQT4L1fXe+HfCxeEWexGcKQMRcIt dAelOHTQ+ZGkOKvXeW05 =ruGA -----END PGP SIGNATURE----- Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6 Pull irq_domain support for all architectures from Grant Likely: "Generialize powerpc's irq_host as irq_domain This branch takes the PowerPC irq_host infrastructure (reverse mapping from Linux IRQ numbers to hardware irq numbering), generalizes it, renames it to irq_domain, and makes it available to all architectures. Originally the plan has been to create an all-new irq_domain implementation which addresses some of the powerpc shortcomings such as not handling 1:1 mappings well, but doing that proved to be far more difficult and invasive than generalizing the working code and refactoring it in-place. So, this branch rips out the 'new' irq_domain and replaces it with the modified powerpc version (in a fully bisectable way of course). It converts all users over to the new API and makes irq_domain selectable on any architecture. No architecture is forced to enable irq_domain, but the infrastructure is required for doing OpenFirmware style irq translations. It will even work on SPARC even though SPARC has it's own mechanism for translating irqs at boot time. MIPS, microblaze, embedded x86 and c6x are converted too. The resulting irq_domain code is probably still too verbose and can be optimized more, but that can be done incrementally and is a task for follow-on patches." * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: (31 commits) dt: fix twl4030 for non-dt compile on x86 mfd: twl-core: Add IRQ_DOMAIN dependency devicetree: Add empty of_platform_populate() for !CONFIG_OF_ADDRESS (sparc) irq_domain: Centralize definition of irq_dispose_mapping() irq_domain/mips: Allow irq_domain on MIPS irq_domain/x86: Convert x86 (embedded) to use common irq_domain ppc-6xx: fix build failure in flipper-pic.c and hlwd-pic.c irq_domain/microblaze: Convert microblaze to use irq_domains irq_domain/powerpc: Replace custom xlate functions with library functions irq_domain/powerpc: constify irq_domain_ops irq_domain/c6x: Use library of xlate functions irq_domain/c6x: constify irq_domain structures irq_domain/c6x: Convert c6x to use generic irq_domain support. irq_domain: constify irq_domain_ops irq_domain: Create common xlate functions that device drivers can use irq_domain: Remove irq_domain_add_simple() irq_domain: Remove 'new' irq_domain in favour of the ppc one mfd: twl-core.c: Fix the number of interrupts managed by twl4030 of/address: add empty static inlines for !CONFIG_OF irq_domain: Add support for base irq and hwirq in legacy mappings ...	2012-03-21 10:27:19 -07:00
Linus Torvalds	c7c66c0cb0	Power management updates for 3.4 Assorted extensions and fixes including: * Introduction of early/late suspend/hibernation device callbacks. * Generic PM domains extensions and fixes. * devfreq updates from Axel Lin and MyungJoo Ham. * Device PM QoS updates. * Fixes of concurrency problems with wakeup sources. * System suspend and hibernation fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIcBAABAgAGBQJPZww5AAoJEKhOf7ml8uNsiBYQAL9YGso7KypZhLspNxvAKuZr iHyme2F7OdOiUfo40DVH5tRuEsQvLOl0S+9ukWLrzQotKBsMfym05jtbGN9m6Ygh Z793sx3eRI3mltekJ9yrOxH6BOBDMWMkwY8ztU/X5aYDNirgJ/qtAjSK4BvWXBrz APeaUReVnLdaNP8SnhHfne/KPsHk++NKZvAAva7E6RwtZn4KV6bfiBPGb8yvY8pP m4cg1S5QEduMy+zQJ8+IlEHR91bt9spUyRwbhw6ZHCNzNeu4iEZT8DVt1O1sIRbO LsNcClqsd40nr781SoF8N9GmGUxlUDr46bS3FSsDkYzn8uyxGEsv00edJZtPwIm5 7nPuYat3Ke1YsON0Kcd/wkBGXqw/Rjfp3F1bnHjpVx/0oM/6MPrFNnIwvpHspejG kN3770idYJ17dLckhcsbYsLdy8yirITILDzvHT0AAaZ9z4Lr9Pm56WwFZLyb/lhR 2cqK8Bb8W9YvcVsKV8YqkyBVrygWMe+c56KoAoUBiSNxvW6LphmXFBj5QiFMs8s8 Xh8H7xU96FKbpNMIAZ1+bpI4zgulQG4xPXI9pKbhMfjaMUgj2zQeO8/t0WlB1M0z +kEUcYHJnXrRrObQuHEFXZdIjy/E0fdUboMIrlLt0gm97OxnG6imPseQp6/leQkC t+L4Aq6TOUofUU86d4cI =IGhc -----END PGP SIGNATURE----- Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates for 3.4 from Rafael Wysocki: "Assorted extensions and fixes including: * Introduction of early/late suspend/hibernation device callbacks. * Generic PM domains extensions and fixes. * devfreq updates from Axel Lin and MyungJoo Ham. * Device PM QoS updates. * Fixes of concurrency problems with wakeup sources. * System suspend and hibernation fixes." * tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits) PM / Domains: Check domain status during hibernation restore of devices PM / devfreq: add relation of recommended frequency. PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on() PM / shmobile: Make CMT driver use pm_genpd_dev_always_on() PM / shmobile: Make TMU driver use pm_genpd_dev_always_on() PM / Domains: Introduce "always on" device flag PM / Domains: Fix hibernation restore of devices, v2 PM / Domains: Fix handling of wakeup devices during system resume sh_mmcif / PM: Use PM QoS latency constraint tmio_mmc / PM: Use PM QoS latency constraint PM / QoS: Make it possible to expose PM QoS latency constraints PM / Sleep: JBD and JBD2 missing set_freezable() PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case PM / Freezer: Remove references to TIF_FREEZE in comments PM / Sleep: Add more wakeup source initialization routines PM / Hibernate: Enable usermodehelpers in hibernate() error path PM / Sleep: Make __pm_stay_awake() delete wakeup source timers PM / Sleep: Fix race conditions related to wakeup source timer function PM / Sleep: Fix possible infinite loop during wakeup source destruction PM / Hibernate: print physical addresses consistently with other parts of kernel ...	2012-03-21 10:15:51 -07:00
Linus Torvalds	9f3938346a	Merge branch 'kmap_atomic' of git://github.com/congwang/linux Pull kmap_atomic cleanup from Cong Wang. It's been in -next for a long time, and it gets rid of the (no longer used) second argument to k[un]map_atomic(). Fix up a few trivial conflicts in various drivers, and do an "evil merge" to catch some new uses that have come in since Cong's tree. * 'kmap_atomic' of git://github.com/congwang/linux: (59 commits) feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename] drbd: remove the second argument of k[un]map_atomic() zcache: remove the second argument of k[un]map_atomic() gma500: remove the second argument of k[un]map_atomic() dm: remove the second argument of k[un]map_atomic() tomoyo: remove the second argument of k[un]map_atomic() sunrpc: remove the second argument of k[un]map_atomic() rds: remove the second argument of k[un]map_atomic() net: remove the second argument of k[un]map_atomic() mm: remove the second argument of k[un]map_atomic() lib: remove the second argument of k[un]map_atomic() power: remove the second argument of k[un]map_atomic() kdb: remove the second argument of k[un]map_atomic() udf: remove the second argument of k[un]map_atomic() ubifs: remove the second argument of k[un]map_atomic() squashfs: remove the second argument of k[un]map_atomic() reiserfs: remove the second argument of k[un]map_atomic() ocfs2: remove the second argument of k[un]map_atomic() ntfs: remove the second argument of k[un]map_atomic() ...	2012-03-21 09:40:26 -07:00
Linus Torvalds	4a52246302	driver core merge for 3.4-rc1 Here's the big driver core merge for 3.4-rc1. Lots of various things here, sysfs fixes/tweaks (with the nlink breakage reverted), dynamic debugging updates, w1 drivers, hyperv driver updates, and a variety of other bits and pieces, full information in the shortlog. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iEYEABECAAYFAk9neCsACgkQMUfUDdst+ylyQwCfY2eizvzw5HhjQs8gOiBRDADe yrgAnj1Zan2QkoCnQIFJNAoxqNX9yAhd =biH6 -----END PGP SIGNATURE----- Merge tag 'driver-core-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches for 3.4-rc1 from Greg KH: "Here's the big driver core merge for 3.4-rc1. Lots of various things here, sysfs fixes/tweaks (with the nlink breakage reverted), dynamic debugging updates, w1 drivers, hyperv driver updates, and a variety of other bits and pieces, full information in the shortlog." * tag 'driver-core-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (78 commits) Tools: hv: Support enumeration from all the pools Tools: hv: Fully support the new KVP verbs in the user level daemon Drivers: hv: Support the newly introduced KVP messages in the driver Drivers: hv: Add new message types to enhance KVP regulator: Support driver probe deferral Revert "sysfs: Kill nlink counting." uevent: send events in correct order according to seqnum (v3) driver core: minor comment formatting cleanups driver core: move the deferred probe pointer into the private area drivercore: Add driver probe deferral mechanism DS2781 Maxim Stand-Alone Fuel Gauge battery and w1 slave drivers w1_bq27000: Only one thread can access the bq27000 at a time. w1_bq27000 - remove w1_bq27000_write w1_bq27000: remove unnecessary NULL test. sysfs: Fix memory leak in sysfs_sd_setsecdata(). intel_idle: Revert change of auto_demotion_disable_flags for Nehalem w1: Fix w1_bq27000 driver-core: documentation: fix up Greg's email address powernow-k6: Really enable auto-loading powernow-k7: Fix CPU family number ...	2012-03-20 11:16:20 -07:00
Linus Torvalds	161f7a7161	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer changes for v3.4 from Ingo Molnar * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) ntp: Fix integer overflow when setting time math: Introduce div64_long cs5535-clockevt: Allow the MFGPT IRQ to be shared cs5535-clockevt: Don't ignore MFGPT on SMP-capable kernels x86/time: Eliminate unused irq0_irqs counter clocksource: scx200_hrt: Fix the build x86/tsc: Reduce the TSC sync check time for core-siblings timer: Fix bad idle check on irq entry nohz: Remove ts->Einidle checks before restarting the tick nohz: Remove update_ts_time_stat from tick_nohz_start_idle clockevents: Leave the broadcast device in shutdown mode when not needed clocksource: Load the ACPI PM clocksource asynchronously clocksource: scx200_hrt: Convert scx200 to use clocksource_register_hz clocksource: Get rid of clocksource_calc_mult_shift() clocksource: dbx500: convert to clocksource_register_hz() clocksource: scx200_hrt: use pr_<level> instead of printk time: Move common updates to a function time: Reorder so the hot data is together time: Remove most of xtime_lock usage in timekeeping.c ntp: Add ntp_lock to replace xtime_locking ...	2012-03-20 10:32:09 -07:00
Linus Torvalds	2ba68940c8	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes for v3.4 from Ingo Molnar * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) printk: Make it compile with !CONFIG_PRINTK sched/x86: Fix overflow in cyc2ns_offset sched: Fix nohz load accounting -- again! sched: Update yield() docs printk/sched: Introduce special printk_sched() for those awkward moments sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer sched: Cleanup cpu_active madness sched: Fix load-balance wreckage sched: Clean up parameter passing of proc_sched_autogroup_set_nice() sched: Ditch per cgroup task lists for load-balancing sched: Rename load-balancing fields sched: Move load-balancing arguments into helper struct sched/rt: Do not submit new work when PI-blocked sched/rt: Prevent idle task boosting sched/wait: Add __wake_up_all_locked() API sched/rt: Document scheduler related skip-resched-check sites sched/rt: Use schedule_preempt_disabled() sched/rt: Add schedule_preempt_disabled() sched/rt: Do not throttle when PI boosting sched/rt: Keep period timer ticking when rt throttling is active ...	2012-03-20 10:31:44 -07:00
Linus Torvalds	9c2b957db1	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events changes for v3.4 from Ingo Molnar: - New "hardware based branch profiling" feature both on the kernel and the tooling side, on CPUs that support it. (modern x86 Intel CPUs with the 'LBR' hardware feature currently.) This new feature is basically a sophisticated 'magnifying glass' for branch execution - something that is pretty difficult to extract from regular, function histogram centric profiles. The simplest mode is activated via 'perf record -b', and the result looks like this in perf report: $ perf record -b any_call,u -e cycles:u branchy $ perf report -b --sort=symbol 52.34% [.] main [.] f1 24.04% [.] f1 [.] f3 23.60% [.] f1 [.] f2 0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow 0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn 0.01% [k] _IO_vfprintf_internal [k] strchrnul 0.01% [k] __printf [k] _IO_vfprintf_internal 0.01% [k] main [k] __printf This output shows from/to branch columns and shows the highest percentage (from,to) jump combinations - i.e. the most likely taken branches in the system. "branches" can also include function calls and any other synchronous and asynchronous transitions of the instruction pointer that are not 'next instruction' - such as system calls, traps, interrupts, etc. This feature comes with (hopefully intuitive) flat ascii and TUI support in perf report. - Various 'perf annotate' visual improvements for us assembly junkies. It will now recognize function calls in the TUI and by hitting enter you can follow the call (recursively) and back, amongst other improvements. - Multiple threads/processes recording support in perf record, perf stat, perf top - which is activated via a comma-list of PIDs: perf top -p 21483,21485 perf stat -p 21483,21485 -ddd perf record -p 21483,21485 - Support for per UID views, via the --uid paramter to perf top, perf report, etc. For example 'perf top --uid mingo' will only show the tasks that I am running, excluding other users, root, etc. - Jump label restructurings and improvements - this includes the factoring out of the (hopefully much clearer) include/linux/static_key.h generic facility: struct static_key key = STATIC_KEY_INIT_FALSE; ... if (static_key_false(&key)) do unlikely code else do likely code ... static_key_slow_inc(); ... static_key_slow_inc(); ... The static_key_false() branch will be generated into the code with as little impact to the likely code path as possible. the static_key_slow_() APIs flip the branch via live kernel code patching. This facility can now be used more widely within the kernel to micro-optimize hot branches whose likelihood matches the static-key usage and fast/slow cost patterns. - SW function tracer improvements: perf support and filtering support. - Various hardenings of the perf.data ABI, to make older perf.data's smoother on newer tool versions, to make new features integrate more smoothly, to support cross-endian recording/analyzing workflows better, etc. - Restructuring of the kprobes code, the splitting out of 'optprobes', and a corner case bugfix. - Allow the tracing of kernel console output (printk). - Improvements/fixes to user-space RDPMC support, allowing user-space self-profiling code to extract PMU counts without performing any system calls, while playing nice with the kernel side. - 'perf bench' improvements - ... and lots of internal restructurings, cleanups and fixes that made these features possible. And, as usual this list is incomplete as there were also lots of other improvements 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits) perf report: Fix annotate double quit issue in branch view mode perf report: Remove duplicate annotate choice in branch view mode perf/x86: Prettify pmu config literals perf report: Enable TUI in branch view mode perf report: Auto-detect branch stack sampling mode perf record: Add HEADER_BRANCH_STACK tag perf record: Provide default branch stack sampling mode option perf tools: Make perf able to read files from older ABIs perf tools: Fix ABI compatibility bug in print_event_desc() perf tools: Enable reading of perf.data files from different ABI rev perf: Add ABI reference sizes perf report: Add support for taken branch sampling perf record: Add support for sampling taken branch perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK x86/kprobes: Split out optprobe related code to kprobes-opt.c x86/kprobes: Fix a bug which can modify kernel code permanently x86/kprobes: Fix instruction recovery on optimized path perf: Add callback to flush branch_stack on context switch perf: Disable PERF_SAMPLE_BRANCH_* when not supported perf/x86: Add LBR software filter support for Intel CPUs ...	2012-03-20 10:29:15 -07:00
Linus Torvalds	0bbfcaff9b	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq/core changes for v3.4 from Ingo Molnar * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Remove paranoid warnons and bogus fixups genirq: Flush the irq thread on synchronization genirq: Get rid of unnecessary IRQTF_DIED flag genirq: No need to check IRQTF_DIED before stopping a thread handler genirq: Get rid of unnecessary irqaction field in task_struct genirq: Fix incorrect check for forced IRQ thread handler softirq: Reduce invoke_softirq() code duplication genirq: Fix long-term regression in genirq irq_set_irq_type() handling x86-32/irq: Don't switch to irq stack for a user-mode irq	2012-03-20 10:28:56 -07:00
Cong Wang	8fd75e1216	x86: remove the second argument of k[un]map_atomic() Acked-by: Avi Kivity <avi@redhat.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <amwang@redhat.com>	2012-03-20 21:48:15 +08:00
Steffen Persvold	943bc7e110	x86: Fix section warnings Fix the following section warnings : WARNING: vmlinux.o(.text+0x49dbc): Section mismatch in reference from the function acpi_map_cpu2node() to the variable .cpuinit.data:__apicid_to_node The function acpi_map_cpu2node() references the variable __cpuinitdata __apicid_to_node. This is often because acpi_map_cpu2node lacks a __cpuinitdata annotation or the annotation of __apicid_to_node is wrong. WARNING: vmlinux.o(.text+0x49dc1): Section mismatch in reference from the function acpi_map_cpu2node() to the function .cpuinit.text:numa_set_node() The function acpi_map_cpu2node() references the function __cpuinit numa_set_node(). This is often because acpi_map_cpu2node lacks a __cpuinit annotation or the annotation of numa_set_node is wrong. WARNING: vmlinux.o(.text+0x526e77): Section mismatch in reference from the function prealloc_protection_domains() to the function .init.text:alloc_passthrough_domain() The function prealloc_protection_domains() references the function __init alloc_passthrough_domain(). This is often because prealloc_protection_domains lacks a __init annotation or the annotation of alloc_passthrough_domain is wrong. Signed-off-by: Steffen Persvold <sp@numascale.com> Link: http://lkml.kernel.org/r/1331810188-24785-1-git-send-email-sp@numascale.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-03-19 12:01:01 +01:00
Daniel J Blueman	fa63030e9c	x86/platform: Move APIC ID validity check into platform APIC code Move APIC ID validity check into platform APIC code, so it can be overridden when needed. For NumaChip systems, always trust MADT, as it's constructed with high APIC IDs. Behaviour verifies on standard x86 systems and on NumaChip systems with this, and compile-tested with allyesconfig. Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com> Reviewed-by: Steffen Persvold <sp@numascale.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1331709454-27966-1-git-send-email-daniel@numascale-asia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-14 09:49:48 +01:00
Ingo Molnar	ea281a9eba	Two miscellaneous MCE fixes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJPX7q/AAoJEKurIx+X31iBIHIQAIjXmz+5cCp/sDjtcoiPvcvu 0OLspJbT4jswGdiFP+Xvn7MEJH2Q+Bj8ejejCrpyXG4R1v/K9zmzDwyAGaJbLiGw T5I9L9/GesYhCoIzyT7IfyNCtjsXMRmdmxLbxvyWxXiw/tT355Reb8Pqsq3Xy9dA 0swO69MC3oJcYG19pLgv+QrtZaK79aiJqe5YmDIIoLYo8jIhnza2bKn3CKvQLV34 8jIKTyo31MOTVZW9qy+pt5kNrnUgtmMBtdEsfVQdrAxKyEXwBd0rFRLHZ0c8YQit QcI0/MyK5la3swtiGnEiJQOvGGP/VIfPklVdw3ahLAthYUC+mof6zDBqpSGheJBQ f0Tbm0k3uWzVVij6Au2i3xi+nKNiPtUlOCcE1EPFvOYh0QOSDy4TPdxRDCZHHPeA hc5nbLTqwxAc8BE3vnDySztCGoeA3/UEOba1Kc4meKyLma7pJ4eNEL9j40XInn/z JRNbfj4BuSgmpV5nZbsb9jU/f15OfOuwZ+UMz//PYuIEI6gK3aYXojc5x1Z++Dot G6GIF4zs1zkzOnvaNeB8KZkmtWvRl/7dZuEM5SMDU+V0lj5xDs3nM4AhYPYSAvX2 knmj//Gr/NE/DpAH0+bZIqzz6eduGc8HV5Y0nWlWvPKO+qs4n9uE0/8VL+5lZcQ8 kojVLi2WTLvhCsnxUV3z =G63l -----END PGP SIGNATURE----- Merge tag 'mce-for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce Apply two miscellaneous MCE fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-14 07:44:48 +01:00
Ingo Molnar	cd593accdc	Linux 3.3-rc7 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJPW8yUAAoJEHm+PkMAQRiGhFIH/RGUPxGmUkJv8EP5I4HDA4dJ c6/PrzZCHs8rxzYzvn7ojXqZGXTOAA5ZgS9A6LkJ2sxMFvgMnkpFi6B4CwMzizS3 vLWo/HNxbiTCNGFfQrhQB8O58uNI8wOBa87lrQfkXkDqN0cFhdjtIxeY1BD9LXIo qbWysGxCcZhJWHapsQ3NZaVJQnIK5vA/+mhyYP4HzbcHI3aWnbIEZ8GQKeY28Ch0 +pct5UQBjZavV5SujaW0Xd65oIiycm8XHAQw6FxQy//DfaabauWgFteR162Q/oew xxUBDOHF3nO1bdteHHaYqxig0j1MbIHsqxTnE/neR8UryF04//1SFF7DYuY/1pg= =SV5V -----END PGP SIGNATURE----- Merge tag 'v3.3-rc7' into x86/mce Merge reason: Update from an ancient -rc1 base to an almost-final stable kernel. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-14 07:44:11 +01:00
Thomas Gleixner	df8d291f28	Merge branch 'linus' into irq/core Reason: Get upstream fixes integrated before further modifications. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2012-03-13 16:35:16 +01:00
Salman Qazi	9993bc635d	sched/x86: Fix overflow in cyc2ns_offset When a machine boots up, the TSC generally gets reset. However, when kexec is used to boot into a kernel, the TSC value would be carried over from the previous kernel. The computation of cycns_offset in set_cyc2ns_scale is prone to an overflow, if the machine has been up more than 208 days prior to the kexec. The overflow happens when we multiply *scale, even though there is enough room to store the final answer. We fix this issue by decomposing tsc_now into the quotient and remainder of division by CYC2NS_SCALE_FACTOR and then performing the multiplication separately on the two components. Refactor code to share the calculation with the previous fix in __cycles_2_ns(). Signed-off-by: Salman Qazi <sqazi@google.com> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-13 16:27:51 +01:00
Ingo Molnar	47258cf3c4	Linux 3.3-rc7 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJPW8yUAAoJEHm+PkMAQRiGhFIH/RGUPxGmUkJv8EP5I4HDA4dJ c6/PrzZCHs8rxzYzvn7ojXqZGXTOAA5ZgS9A6LkJ2sxMFvgMnkpFi6B4CwMzizS3 vLWo/HNxbiTCNGFfQrhQB8O58uNI8wOBa87lrQfkXkDqN0cFhdjtIxeY1BD9LXIo qbWysGxCcZhJWHapsQ3NZaVJQnIK5vA/+mhyYP4HzbcHI3aWnbIEZ8GQKeY28Ch0 +pct5UQBjZavV5SujaW0Xd65oIiycm8XHAQw6FxQy//DfaabauWgFteR162Q/oew xxUBDOHF3nO1bdteHHaYqxig0j1MbIHsqxTnE/neR8UryF04//1SFF7DYuY/1pg= =SV5V -----END PGP SIGNATURE----- Merge tag 'v3.3-rc7' into sched/core Merge reason: merge back final fixes, prepare for the merge window. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-13 16:26:52 +01:00
Suresh Siddha	73d63d038e	x86/ioapic: Add register level checks to detect bogus io-apic entries With the recent changes to clear_IO_APIC_pin() which tries to clear remoteIRR bit explicitly, some of the users started to see "Unable to reset IRR for apic .." messages. Close look shows that these are related to bogus IO-APIC entries which return's all 1's for their io-apic registers. And the above mentioned error messages are benign. But kernel should have ignored such io-apic's in the first place. Check if register 0, 1, 2 of the listed io-apic are all 1's and ignore such io-apic. Reported-by: Álvaro Castillo <midgoon@gmail.com> Tested-by: Jon Dufresne <jon@jondufresne.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: yinghai@kernel.org Cc: kernel-team@fedoraproject.org Cc: Josh Boyer <jwboyer@redhat.com> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1331577393.31585.94.camel@sbsiddha-desk.sc.intel.com [ Performed minor cleanup of affected code. ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-13 05:52:02 +01:00
Ingo Molnar	bea95c152d	Merge branch 'perf/hw-branch-sampling' into perf/core Merge reason: The 'perf record -b' hardware branch sampling feature is ready for upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-12 20:47:05 +01:00
Peter Zijlstra	f9b4eeb809	perf/x86: Prettify pmu config literals I got somewhat tired of having to decode hex numbers.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com> Link: http://lkml.kernel.org/n/tip-0vsy1sgywc4uar3mu1szm0rg@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-12 20:44:54 +01:00
Ingo Molnar	35239e23c6	Merge branch 'perf/urgent' into perf/core Merge reason: We are going to queue up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-12 20:44:11 +01:00
Peter Zijlstra	87e24f4b67	perf/x86: Fix local vs remote memory events for NHM/WSM Verified using the below proglet.. before: [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0 remote write Performance counter stats for './numa 0': 2,101,554 node-stores 2,096,931 node-store-misses 5.021546079 seconds time elapsed [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1 local write Performance counter stats for './numa 1': 501,137 node-stores 199 node-store-misses 5.124451068 seconds time elapsed After: [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0 remote write Performance counter stats for './numa 0': 2,107,516 node-stores 2,097,187 node-store-misses 5.012755149 seconds time elapsed [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1 local write Performance counter stats for './numa 1': 2,063,355 node-stores 165 node-store-misses 5.082091494 seconds time elapsed #define _GNU_SOURCE #include <sched.h> #include <stdio.h> #include <errno.h> #include <sys/mman.h> #include <sys/types.h> #include <dirent.h> #include <signal.h> #include <unistd.h> #include <numaif.h> #include <stdlib.h> #define SIZE (3210241024) volatile int done; void sig_done(int sig) { done = 1; } int main(int argc, char *argv) { cpu_set_t mask, mask2; size_t size; int i, err, t; int nrcpus = 1024; char mem; unsigned long nodemask = 0x01; /* node 0 / DIR node; struct dirent de; int read = 0; int local = 0; if (argc < 2) { printf("usage: %s [0-3]\n", argv[0]); printf(" bit0 - local/remote\n"); printf(" bit1 - read/write\n"); exit(0); } switch (atoi(argv[1])) { case 0: printf("remote write\n"); break; case 1: printf("local write\n"); local = 1; break; case 2: printf("remote read\n"); read = 1; break; case 3: printf("local read\n"); local = 1; read = 1; break; } mask = CPU_ALLOC(nrcpus); size = CPU_ALLOC_SIZE(nrcpus); CPU_ZERO_S(size, mask); node = opendir("/sys/devices/system/node/node0/"); if (!node) perror("opendir"); while ((de = readdir(node))) { int cpu; if (sscanf(de->d_name, "cpu%d", &cpu) == 1) CPU_SET_S(cpu, size, mask); } closedir(node); mask2 = CPU_ALLOC(nrcpus); CPU_ZERO_S(size, mask2); for (i = 0; i < size; i++) CPU_SET_S(i, size, mask2); CPU_XOR_S(size, mask2, mask2, mask); // invert if (!local) mask = mask2; err = sched_setaffinity(0, size, mask); if (err) perror("sched_setaffinity"); mem = mmap(0, SIZE, PROT_READ\|PROT_WRITE, MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0); err = mbind(mem, SIZE, MPOL_BIND, &nodemask, 8sizeof(nodemask), MPOL_MF_MOVE); if (err) perror("mbind"); signal(SIGALRM, sig_done); alarm(5); if (!read) { while (!done) { for (i = 0; i < SIZE; i++) mem[i] = 0x01; } } else { while (!done) { for (i = 0; i < SIZE; i++) t += (volatile char )(mem + i); } } return 0; } Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/n/tip-tq73sxus35xmqpojf7ootxgs@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-12 20:43:41 +01:00
Peter Zijlstra	5fbd036b55	sched: Cleanup cpu_active madness Stepan found: CPU0 CPUn _cpu_up() __cpu_up() boostrap() notify_cpu_starting() set_cpu_online() while (!cpu_active()) cpu_relax() <PREEMPT-out> smp_call_function(.wait=1) /* we find cpu_online() is true / arch_send_call_function_ipi_mask() / wait-forever-more */ <PREEMPT-in> local_irq_enable() cpu_notify(CPU_ONLINE) sched_cpu_active() set_cpu_active() Now the purpose of cpu_active is mostly with bringing down a cpu, where we mark it !active to avoid the load-balancer from moving tasks to it while we tear down the cpu. This is required because we only update the sched_domain tree after we brought the cpu-down. And this is needed so that some tasks can still run while we bring it down, we just don't want new tasks to appear. On cpu-up however the sched_domain tree doesn't yet include the new cpu, so its invisible to the load-balancer, regardless of the active state. So instead of setting the active state after we boot the new cpu (and consequently having to wait for it before enabling interrupts) set the cpu active before we set it online and avoid the whole mess. Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-12 20:43:15 +01:00
Greg Kroah-Hartman	263a5c8e16	Merge 3.3-rc6 into driver-core-next This was done to resolve a conflict in the drivers/base/cpu.c file. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2012-03-09 12:35:53 -08:00
Jan Beulich	a240ada241	x86: Include probe_roms.h in probe_roms.c ... to ensure that declarations and definitions are in sync. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/4F5888F902000078000770F1@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-08 10:57:35 +01:00
Jan Beulich	c7e23289a6	x86/32: Print control and debug registers for kerenel context While for a user mode register dump it may be reasonable to skip those (albeit x86-64 doesn't do so), for kernel mode dumps these should be printed to make sure all information possibly necessary for analysis is available. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/4F58889202000078000770E7@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-08 10:57:35 +01:00
Srivatsa S. Bhat	b11e3d782b	x86, mce: Fix rcu splat in drain_mce_log_buffer() While booting, the following message is seen: [ 21.665087] =============================== [ 21.669439] [ INFO: suspicious RCU usage. ] [ 21.673798] 3.2.0-0.0.0.28.36b5ec9-default #2 Not tainted [ 21.681353] ------------------------------- [ 21.685864] arch/x86/kernel/cpu/mcheck/mce.c:194 suspicious rcu_dereference_index_check() usage! [ 21.695013] [ 21.695014] other info that might help us debug this: [ 21.695016] [ 21.703488] [ 21.703489] rcu_scheduler_active = 1, debug_locks = 1 [ 21.710426] 3 locks held by modprobe/2139: [ 21.714754] #0: (&__lockdep_no_validate__){......}, at: [<ffffffff8133afd3>] __driver_attach+0x53/0xa0 [ 21.725020] #1: [ 21.725323] ioatdma: Intel(R) QuickData Technology Driver 4.00 [ 21.733206] (&__lockdep_no_validate__){......}, at: [<ffffffff8133afe1>] __driver_attach+0x61/0xa0 [ 21.743015] #2: (i7core_edac_lock){+.+.+.}, at: [<ffffffffa01cfa5f>] i7core_probe+0x1f/0x5c0 [i7core_edac] [ 21.753708] [ 21.753709] stack backtrace: [ 21.758429] Pid: 2139, comm: modprobe Not tainted 3.2.0-0.0.0.28.36b5ec9-default #2 [ 21.768253] Call Trace: [ 21.770838] [<ffffffff810977cd>] lockdep_rcu_suspicious+0xcd/0x100 [ 21.777366] [<ffffffff8101aa41>] drain_mcelog_buffer+0x191/0x1b0 [ 21.783715] [<ffffffff8101aa78>] mce_register_decode_chain+0x18/0x20 [ 21.790430] [<ffffffffa01cf8db>] i7core_register_mci+0x2fb/0x3e4 [i7core_edac] [ 21.798003] [<ffffffffa01cfb14>] i7core_probe+0xd4/0x5c0 [i7core_edac] [ 21.804809] [<ffffffff8129566b>] local_pci_probe+0x5b/0xe0 [ 21.810631] [<ffffffff812957c9>] __pci_device_probe+0xd9/0xe0 [ 21.816650] [<ffffffff813362e4>] ? get_device+0x14/0x20 [ 21.822178] [<ffffffff81296916>] pci_device_probe+0x36/0x60 [ 21.828061] [<ffffffff8133ac8a>] really_probe+0x7a/0x2b0 [ 21.833676] [<ffffffff8133af23>] driver_probe_device+0x63/0xc0 [ 21.839868] [<ffffffff8133b01b>] __driver_attach+0x9b/0xa0 [ 21.845718] [<ffffffff8133af80>] ? driver_probe_device+0xc0/0xc0 [ 21.852027] [<ffffffff81339168>] bus_for_each_dev+0x68/0x90 [ 21.857876] [<ffffffff8133aa3c>] driver_attach+0x1c/0x20 [ 21.863462] [<ffffffff8133a64d>] bus_add_driver+0x16d/0x2b0 [ 21.869377] [<ffffffff8133b6dc>] driver_register+0x7c/0x160 [ 21.875220] [<ffffffff81296bda>] __pci_register_driver+0x6a/0xf0 [ 21.881494] [<ffffffffa01fe000>] ? 0xffffffffa01fdfff [ 21.886846] [<ffffffffa01fe047>] i7core_init+0x47/0x1000 [i7core_edac] [ 21.893737] [<ffffffff810001ce>] do_one_initcall+0x3e/0x180 [ 21.899670] [<ffffffff810a9b95>] sys_init_module+0xc5/0x220 [ 21.905542] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b Fix this by using ACCESS_ONCE() instead of rcu_dereference_check_mce() over mcelog.next. Since the access to each entry is controlled by the ->finished field, ACCESS_ONCE() should work just fine. An rcu_dereference is unnecessary here. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2012-03-07 11:44:29 +01:00
Masami Hiramatsu	3f33ab1c0c	x86/kprobes: Split out optprobe related code to kprobes-opt.c Split out optprobe related code to arch/x86/kernel/kprobes-opt.c for maintenanceability. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Suggested-by: Ingo Molnar <mingo@elte.hu> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: systemtap@sourceware.org Cc: anderson@redhat.com Link: http://lkml.kernel.org/r/20120305133222.5982.54794.stgit@localhost.localdomain [ Tidied up the code a tiny bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-06 09:49:49 +01:00
Masami Hiramatsu	464846888d	x86/kprobes: Fix a bug which can modify kernel code permanently Fix a bug in kprobes which can modify kernel code permanently at run-time. In the result, kernel can crash when it executes the modified code. This bug can happen when we put two probes enough near and the first probe is optimized. When the second probe is set up, it copies a byte which is already modified by the first probe, and executes it when the probe is hit. Even worse, the first probe and the second probe are removed respectively, the second probe writes back the copied (modified) instruction. To fix this bug, kprobes always recovers the original code and copies the first byte from recovered instruction. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: systemtap@sourceware.org Cc: anderson@redhat.com Link: http://lkml.kernel.org/r/20120305133215.5982.31991.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-06 09:49:49 +01:00
Masami Hiramatsu	86b4ce3156	x86/kprobes: Fix instruction recovery on optimized path Current probed-instruction recovery expects that only breakpoint instruction modifies instruction. However, since kprobes jump optimization can replace original instructions with a jump, that expectation is not enough. And it may cause instruction decoding failure on the function where an optimized probe already exists. This bug can reproduce easily as below: 1) find a target function address (any kprobe-able function is OK) $ grep __secure_computing /proc/kallsyms ffffffff810c19d0 T __secure_computing 2) decode the function $ objdump -d vmlinux --start-address=0xffffffff810c19d0 --stop-address=0xffffffff810c19eb vmlinux: file format elf64-x86-64 Disassembly of section .text: ffffffff810c19d0 <__secure_computing>: ffffffff810c19d0: 55 push %rbp ffffffff810c19d1: 48 89 e5 mov %rsp,%rbp ffffffff810c19d4: e8 67 8f 72 00 callq ffffffff817ea940 <mcount> ffffffff810c19d9: 65 48 8b 04 25 40 b8 mov %gs:0xb840,%rax ffffffff810c19e0: 00 00 ffffffff810c19e2: 83 b8 88 05 00 00 01 cmpl $0x1,0x588(%rax) ffffffff810c19e9: 74 05 je ffffffff810c19f0 <__secure_computing+0x20> 3) put a kprobe-event at an optimize-able place, where no call/jump places within the 5 bytes. $ su - # cd /sys/kernel/debug/tracing # echo p __secure_computing+0x9 > kprobe_events 4) enable it and check it is optimized. # echo 1 > events/kprobes/p___secure_computing_9/enable # cat ../kprobes/list ffffffff810c19d9 k __secure_computing+0x9 [OPTIMIZED] 5) put another kprobe on an instruction after previous probe in the same function. # echo p __secure_computing+0x12 >> kprobe_events bash: echo: write error: Invalid argument # dmesg \| tail -n 1 [ 1666.500016] Probing address(0xffffffff810c19e2) is not an instruction boundary. 6) however, if the kprobes optimization is disabled, it works. # echo 0 > /proc/sys/debug/kprobes-optimization # cat ../kprobes/list ffffffff810c19d9 k __secure_computing+0x9 # echo p __secure_computing+0x12 >> kprobe_events (no error) This is because kprobes doesn't recover the instruction which is overwritten with a relative jump by another kprobe when finding instruction boundary. It only recovers the breakpoint instruction. This patch fixes kprobes to recover such instructions. With this fix: # echo p __secure_computing+0x9 > kprobe_events # echo 1 > events/kprobes/p___secure_computing_9/enable # cat ../kprobes/list ffffffff810c1aa9 k __secure_computing+0x9 [OPTIMIZED] # echo p __secure_computing+0x12 >> kprobe_events # cat ../kprobes/list ffffffff810c1aa9 k __secure_computing+0x9 [OPTIMIZED] ffffffff810c1ab2 k __secure_computing+0x12 [DISABLED] Changes in v4: - Fix a bug to ensure optimized probe is really optimized by jump. - Remove kprobe_optready() dependency. - Cleanup code for preparing optprobe separation. Changes in v3: - Fix a build error when CONFIG_OPTPROBE=n. (Thanks, Ingo!) To fix the error, split optprobe instruction recovering path from kprobes path. - Cleanup comments/styles. Changes in v2: - Fix a bug to recover original instruction address in RIP-relative instruction fixup. - Moved on tip/master. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: systemtap@sourceware.org Cc: anderson@redhat.com Link: http://lkml.kernel.org/r/20120305133209.5982.36568.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-06 09:49:48 +01:00
Stephane Eranian	d010b3326c	perf: Add callback to flush branch_stack on context switch With branch stack sampling, it is possible to filter by priv levels. In system-wide mode, that means it is possible to capture only user level branches. The builtin SW LBR filter needs to disassemble code based on LBR captured addresses. For that, it needs to know the task the addresses are associated with. Because of context switches, the content of the branch stack buffer may contain addresses from different tasks. We need a callback on context switch to either flush the branch stack or save it. This patch adds a new callback in struct pmu which is called during context switches. The callback is called only when necessary. That is when a system-wide context has, at least, one event which uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for per-thread context. In this version, the Intel x86 code simply flushes (resets) the LBR on context switches (fills it with zeroes). Those zeroed branches are then filtered out by the SW filter. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-05 14:55:42 +01:00
Stephane Eranian	2481c5fa6d	perf: Disable PERF_SAMPLE_BRANCH_* when not supported PERF_SAMPLE_BRANCH_* is disabled for: - SW events (sw counters, tracepoints) - HW breakpoints - ALL but Intel x86 architecture - AMD64 processors Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-05 14:55:42 +01:00
Stephane Eranian	3e702ff6d1	perf/x86: Add LBR software filter support for Intel CPUs This patch adds an internal sofware filter to complement the (optional) LBR hardware filter. The software filter is necessary: - as a substitute when there is no HW LBR filter (e.g., Atom, Core) - to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere) - to provide finer grain filtering (e.g., all processors) Sometimes the LBR HW filter cannot distinguish between two types of branches. For instance, to capture syscall as CALLS, it is necessary to enable the LBR_FAR filter which will also capture JMP instructions. Thus, a second pass is necessary to filter those out, this is what the SW filter can do. The SW filter is built on top of the internal x86 disassembler. It is a best effort filter especially for user level code. It is subject to the availability of the text page of the program. The SW filter is enabled on all Intel processors. It is bypassed when the user is capturing all branches at all priv levels. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-9-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-05 14:55:42 +01:00

1 2 3 4 5 ...

8103 Commits