linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-19 18:24:14 +08:00

Author	SHA1	Message	Date
Martin KaFai Lau	3a08c2fd76	bpf: LRU List Introduce bpf_lru_list which will provide LRU capability to the bpf_htab in the later patch. * General Thoughts: 1. Target use case. Read is more often than update. (i.e. bpf_lookup_elem() is more often than bpf_update_elem()). If bpf_prog does a bpf_lookup_elem() first and then an in-place update, it still counts as a read operation to the LRU list concern. 2. It may be useful to think of it as a LRU cache 3. Optimize the read case 3.1 No lock in read case 3.2 The LRU maintenance is only done during bpf_update_elem() 4. If there is a percpu LRU list, it will lose the system-wise LRU property. A completely isolated percpu LRU list has the best performance but the memory utilization is not ideal considering the work load may be imbalance. 5. Hence, this patch starts the LRU implementation with a global LRU list with batched operations before accessing the global LRU list. As a LRU cache, #read >> #update/#insert operations, it will work well. 6. There is a local list (for each cpu) which is named 'struct bpf_lru_locallist'. This local list is not used to sort the LRU property. Instead, the local list is to batch enough operations before acquiring the lock of the global LRU list. More details on this later. 7. In the later patch, it allows a percpu LRU list by specifying a map-attribute for scalability reason and for use cases that need to prepare for the worst (and pathological) case like DoS attack. The percpu LRU list is completely isolated from each other and the LRU nodes (including free nodes) cannot be moved across the list. The following description is for the global LRU list but mostly applicable to the percpu LRU list also. * Global LRU List: 1. It has three sub-lists: active-list, inactive-list and free-list. 2. The two list idea, active and inactive, is borrowed from the page cache. 3. All nodes are pre-allocated and all sit at the free-list (of the global LRU list) at the beginning. The pre-allocation reasoning is similar to the existing BPF_MAP_TYPE_HASH. However, opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in the LRU map. * Active/Inactive List (of the global LRU list): 1. The active list, as its name says it, maintains the active set of the nodes. We can think of it as the working set or more frequently accessed nodes. The access frequency is approximated by a ref-bit. The ref-bit is set during the bpf_lookup_elem(). 2. The inactive list, as its name also says it, maintains a less active set of nodes. They are the candidates to be removed from the bpf_htab when we are running out of free nodes. 3. The ordering of these two lists is acting as a rough clock. The tail of the inactive list is the older nodes and should be released first if the bpf_htab needs free element. * Rotating the Active/Inactive List (of the global LRU list): 1. It is the basic operation to maintain the LRU property of the global list. 2. The active list is only rotated when the inactive list is running low. This idea is similar to the current page cache. Inactive running low is currently defined as "# of inactive < # of active". 3. The active list rotation always starts from the tail. It moves node without ref-bit set to the head of the inactive list. It moves node with ref-bit set back to the head of the active list and then clears its ref-bit. 4. The inactive rotation is pretty simply. It walks the inactive list and moves the nodes back to the head of active list if its ref-bit is set. The ref-bit is cleared after moving to the active list. If the node does not have ref-bit set, it just leave it as it is because it is already in the inactive list. * Shrinking the Inactive List (of the global LRU list): 1. Shrinking is the operation to get free nodes when the bpf_htab is full. 2. It usually only shrinks the inactive list to get free nodes. 3. During shrinking, it will walk the inactive list from the tail, delete the nodes without ref-bit set from bpf_htab. 4. If no free node found after step (3), it will forcefully get one node from the tail of inactive or active list. Forcefully is in the sense that it ignores the ref-bit. * Local List: 1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to batch enough operations before acquiring the lock of the global LRU. 2. A local list has two sub-lists, free-list and pending-list. 3. During bpf_update_elem(), it will try to get from the free-list of (the current CPU local list). 4. If the local free-list is empty, it will acquire from the global LRU list. The global LRU list can either satisfy it by its global free-list or by shrinking the global inactive list. Since we have acquired the global LRU list lock, it will try to get at most LOCAL_FREE_TARGET elements to the local free list. 5. When a new element is added to the bpf_htab, it will first sit at the pending-list (of the local list) first. The pending-list will be flushed to the global LRU list when it needs to acquire free nodes from the global list next time. * Lock Consideration: The LRU list has a lock (lru_lock). Each bucket of htab has a lock (buck_lock). If both locks need to be acquired together, the lock order is always lru_lock -> buck_lock and this only happens in the bpf_lru_list.c logic. In hashtab.c, both locks are not acquired together (i.e. one lock is always released first before acquiring another lock). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-15 11:50:20 -05:00
David S. Miller	bb598c1b8c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several cases of bug fixes in 'net' overlapping other changes in 'net-next-. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-15 10:54:36 -05:00
Linus Torvalds	e76d21c40b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix off by one wrt. indexing when dumping /proc/net/route entries, from Alexander Duyck. 2) Fix lockdep splats in iwlwifi, from Johannes Berg. 3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH is disabled, from Liping Zhang. 4) Memory leak when nft_expr_clone() fails, also from Liping Zhang. 5) Disable UFO when path will apply IPSEC tranformations, from Jakub Sitnicki. 6) Don't bogusly double cwnd in dctcp module, from Florian Westphal. 7) skb_checksum_help() should never actually use the value "0" for the resulting checksum, that has a special meaning, use CSUM_MANGLED_0 instead. From Eric Dumazet. 8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from Yuval MIntz. 9) Fix SCTP reference counting of associations and transports in sctp_diag. From Xin Long. 10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in a previous layer or similar, so explicitly clear the ipv6 control block in the skb. From Eli Cooper. 11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG Cong. 12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad Shenai. 13) Fix potential access past the end of the skb page frag array in tcp_sendmsg(). From Eric Dumazet. 14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix from David Ahern. 15) Don't return an error in tcp_sendmsg() if we wronte any bytes successfully, from Eric Dumazet. 16) Extraneous unlocks in netlink_diag_dump(), we removed the locking but forgot to purge these unlock calls. From Eric Dumazet. 17) Fix memory leak in error path of __genl_register_family(). We leak the attrbuf, from WANG Cong. 18) cgroupstats netlink policy table is mis-sized, from WANG Cong. 19) Several XDP bug fixes in mlx5, from Saeed Mahameed. 20) Fix several device refcount leaks in network drivers, from Johan Hovold. 21) icmp6_send() should use skb dst device not skb->dev to determine L3 routing domain. From David Ahern. 22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong. 23) We leak new macvlan port in some cases of maclan_common_netlink() errors. Fix from Gao Feng. 24) Similar to the icmp6_send() fix, icmp_route_lookup() should determine L3 routing domain using skb_dst(skb)->dev not skb->dev. Also from David Ahern. 25) Several fixes for route offloading and FIB notification handling in mlxsw driver, from Jiri Pirko. 26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet. 27) Fix long standing regression in ipv4 redirect handling, wrt. validating the new neighbour's reachability. From Stephen Suryaputra Lin. 28) If sk_filter() trims the packet excessively, handle it reasonably in tcp input instead of exploding. From Eric Dumazet. 29) Fix handling of napi hash state when copying channels in sfc driver, from Bert Kenward. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits) mlxsw: spectrum_router: Flush FIB tables during fini net: stmmac: Fix lack of link transition for fixed PHYs sctp: change sk state only when it has assocs in sctp_shutdown bnx2: Wait for in-flight DMA to complete at probe stage Revert "bnx2: Reset device during driver initialization" ps3_gelic: fix spelling mistake in debug message net: ethernet: ixp4xx_eth: fix spelling mistake in debug message ibmvnic: Fix size of debugfs name buffer ibmvnic: Unmap ibmvnic_statistics structure sfc: clear napi_hash state when copying channels mlxsw: spectrum_router: Correctly dump neighbour activity mlxsw: spectrum: Fix refcount bug on span entries bnxt_en: Fix VF virtual link state. bnxt_en: Fix ring arithmetic in bnxt_setup_tc(). Revert "include/uapi/linux/atm_zatm.h: include linux/time.h" tcp: take care of truncations done by sk_filter() ipv4: use new_gw for redirect neigh lookup r8152: Fix error path in open function net: bpqether.h: remove if_ether.h guard net: __skb_flow_dissect() must cap its return value ...	2016-11-14 14:15:53 -08:00
Mickaël Salaün	535e7b4b5e	bpf: Use u64_to_user_ptr() Replace the custom u64_to_ptr() function with the u64_to_user_ptr() macro. Signed-off-by: Mickaël Salaün <mic@digikod.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-14 16:23:08 -05:00
Linus Torvalds	f5c9f9c723	Revert "printk: make reading the kernel log flush pending lines" This reverts commit `bfd8d3f23b`. It turns out that this flushes things much too aggressiverly, and causes lines to break up when the system logger races with new continuation lines being printed. There's a pending patch to make printk() flushing much more straightforward, but it's too invasive for 4.9, so in the meantime let's just not make the system message logging flush continuation lines. They'll be flushed by the final newline anyway. Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-11-14 09:31:52 -08:00
Linus Torvalds	5d69561b7b	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar: "This fixes a genirq regression that resulted in the Intel/Broxton pinctrl/GPIO driver (and possibly others) spewing warnings" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Use irq type from irqdata instead of irqdesc	2016-11-14 08:34:56 -08:00
Daniel Borkmann	c540594f86	bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path Commit `67f8b1dcb9` ("net/mlx4_en: Refactor the XDP forwarding rings scheme") added a bug in that the prog's reference count is not dropped in the error path when mlx4_en_try_alloc_resources() is failing from mlx4_xdp_set(). We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we need to release again. Earlier in the call path, dev_change_xdp_fd() itself holds a reference to the prog as well (hence the '- 1' in the bpf_prog_add()), so a simple atomic_sub() is safe to use here. When an error is propagated, then bpf_prog_put() is called eventually from dev_change_xdp_fd() Fixes: `67f8b1dcb9` ("net/mlx4_en: Refactor the XDP forwarding rings scheme") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-12 23:31:57 -05:00
Linus Torvalds	b9f659b810	Power management fixes for v4.9-rc5 - Prevent the PM core from attempting to suspend parent devices if any of their children, whose suspend callbacks were invoked asynchronously, have failed to suspend during the "late" and "noirq" phases of system-wide suspend of devices (Brian Norris). - Prevent the boot-time system suspend test code from leaking a reference to the RTC device used by it (Johan Hovold). - Fix cpupower to use the return value of one of its library functions correctly and restore the correct behavior of it when used for setting cpufreq tunables broken during the 4.7 development cycle (Laura Abbott). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABCAAGBQJYJloUAAoJEILEb/54YlRxkYgP/RvTRLO+6Hlk0Re/Lm5SS7O4 YjOrcemWkGkA+YJ5Qe4vki28nMzGITIAWuHJK7Nex0A1zFMX3GrNPxxMPHBejfog 7VjeEp77x0h9F8EzWX2YLZaTCSpvQjpHH5/vUQaTIE8a/WtZVbSzfCT4jqAYYHzV rHn52aNwiAKd4kjMTSBJdjTOdhqebRFYRHpBHovoTARcGH4DgEJciQzaEh1O6qAq barOGrDDkB32mULjsBNRX4Wyx1wuJiOpVjiD7sIM75ZomlIDC/RyYc9O37t6GDBC koTTlilV2/i2VoqcU5nQgAmgUOoDz1skGJW6Rsp/d/S2GnZiLBJwma31e187otBL NFl1egL1BJaz5J1sow7YPYBQILzlYGwNhXLiKHoPEshWp5WSn7mIhxxzAou3Llsp oPCXDwuRrMhqYrOdclxiL+Pj6dUapvut3SerZB5C9rGg2JL9Y2gDxrhf/jFqXxNb mBw3kRx94bDb6DNHJPuplRHWrL0SlM2ZXnUa7XwUSt+jIiwMK2la0cAL1P/7Qva9 zJ6SG01BnGEd6OI+y9ZSjITdpTVK10JIuCrTmR9cvWNA+A1vOGCkg6Nj+qunmzFh j6W2O0KQ/moR0ICCPCg96Hjf888mohX7vhGQqDQwpXYlsvWcWvjNpURsfY47yBH9 p70+ziBkqyyIAprJKjLB =ykEg -----END PGP SIGNATURE----- Merge tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two bugs in error code paths in the PM core (system-wide suspend of devices), a device reference leak in the boot-time suspend test code and a cpupower utility regression from the 4.7 cycle. Specifics: - Prevent the PM core from attempting to suspend parent devices if any of their children, whose suspend callbacks were invoked asynchronously, have failed to suspend during the "late" and "noirq" phases of system-wide suspend of devices (Brian Norris). - Prevent the boot-time system suspend test code from leaking a reference to the RTC device used by it (Johan Hovold). - Fix cpupower to use the return value of one of its library functions correctly and restore the correct behavior of it when used for setting cpufreq tunables broken during the 4.7 development cycle (Laura Abbott)" * tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails PM / sleep: fix device reference leak in test_suspend cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set	2016-11-11 16:54:23 -08:00
Rafael J. Wysocki	cd16f3dcdb	Merge branches 'pm-tools-fixes' and 'pm-sleep-fixes' * pm-tools-fixes: cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set * pm-sleep-fixes: PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails PM / sleep: fix device reference leak in test_suspend	2016-11-11 23:24:58 +01:00
Hans de Goede	c6c7d83b9c	Revert "console: don't prefer first registered if DT specifies stdout-path" This reverts commit `05fd007e46` ("console: don't prefer first registered if DT specifies stdout-path"). The reverted commit changes existing behavior on which many ARM boards rely. Many ARM small-board-computers, like e.g. the Raspberry Pi have both a video output and a serial console. Depending on whether the user is using the device as a more regular computer; or as a headless device we need to have the console on either one or the other. Many users rely on the kernel behavior of the console being present on both outputs, before the reverted commit the console setup with no console= kernel arguments on an ARM board which sets stdout-path in dt would look like this: [root@localhost ~]# cat /proc/consoles ttyS0 -W- (EC p a) 4:64 tty0 -WU (E p ) 4:1 Where as after the reverted commit, it looks like this: [root@localhost ~]# cat /proc/consoles ttyS0 -W- (EC p a) 4:64 This commit reverts commit `05fd007e46` ("console: don't prefer first registered if DT specifies stdout-path") restoring the original behavior. Fixes: `05fd007e46` ("console: don't prefer first registered if DT specifies stdout-path") Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-11-11 08:12:37 -08:00
Tobias Klauser	de464375da	bpf: Remove unused but set variables Remove the unused but set variables min_set and max_set in adjust_reg_min_max_vals to fix the following warning when building with 'W=1': kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable] There is no warning about max_set being unused, but since it is only used in the assignment of min_set it can be removed as well. They were introduced in commit `484611357c` ("bpf: allow access into map value arrays") but seem to have never been used. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-09 21:15:00 -05:00
Thomas Gleixner	7ee7e87dfb	genirq: Use irq type from irqdata instead of irqdesc The type flags in the irq descriptor are there for historical reasons and only updated via irq_modify_status() or irq_set_type(). Both functions also update the type flags in irqdata. __setup_irq() is the only left over user of the type flags in the irq descriptor. If __setup_irq() is called with empty irq type flags, then the type flags are retrieved from irqdata. If an interrupt is shared, then the type flags are compared with the type flags stored in the irq descriptor. On x86 the ioapic does not have a irq_set_type() callback because the type is defined in the BIOS tables and cannot be changed. The type is stored in irqdata at setup time without updating the type data in the irq descriptor. As a result the comparison described above fails. There is no point in updating the irq descriptor flags because the only relevant storage is irqdata. Use the type flags from irqdata for both retrieval and comparison in __setup_irq() instead. Aside of that the print out in case of non matching type flags has the old and new type flags arguments flipped. Fix that as well. For correctness sake the flags stored in the irq descriptor should be removed, but this is beyond the scope of this bugfix and will be done in a later patch. Fixes: `4b357daed6` ("genirq: Look-up trigger type if not specified by caller") Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-08 15:15:19 +01:00
Daniel Borkmann	20b2b24f91	bpf: fix map not being uncharged during map creation failure In map_create(), we first find and create the map, then once that suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch a new anon fd through anon_inode_getfd(). The problem is, once the latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct the map via map->ops->map_free(), but without uncharging the previously locked memory first. That means that the user_struct allocation is leaked as well as the accounted RLIMIT_MEMLOCK memory not released. Make the label names in the fix consistent with bpf_prog_load(). Fixes: `aaac3ba95e` ("bpf: charge user for creation of BPF maps and programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 13:22:26 -05:00
Daniel Borkmann	483bed2b0d	bpf: fix htab map destruction when extra reserve is in use Commit `a6ed3ea65d` ("bpf: restore behavior of bpf_map_update_elem") added an extra per-cpu reserve to the hash table map to restore old behaviour from pre prealloc times. When non-prealloc is in use for a map, then problem is that once a hash table extra element has been linked into the hash-table, and the hash table is destroyed due to refcount dropping to zero, then htab_map_free() -> delete_all_elements() will walk the whole hash table and drop all elements via htab_elem_free(). The problem is that the element from the extra reserve is first fed to the wrong backend allocator and eventually freed twice. Fixes: `a6ed3ea65d` ("bpf: restore behavior of bpf_map_update_elem") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-07 13:20:52 -05:00
Linus Torvalds	ffbcbfca84	Merge branches 'sched-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stack vmap fixups from Thomas Gleixner: "Two small patches related to sched_show_task(): - make sure to hold a reference on the task stack while accessing it - remove the thread_saved_pc printout .. and add a sanity check into release_task_stack() to catch problems with task stack references" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Remove pointless printout in sched_show_task() sched/core: Fix oops in sched_show_task() * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fork: Add task stack refcounting sanity check and prevent premature task stack freeing	2016-11-05 11:46:02 -07:00
WANG Cong	243d521261	taskstats: fix the length of cgroupstats_cmd_get_policy cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1], taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1], but their family.maxattr is TASKSTATS_CMD_ATTR_MAX. CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX, so we could end up accessing out-of-bound. Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1, this is safe because the rest are initialized to 0's. Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-11-03 16:55:58 -04:00
Linus Torvalds	8243d55977	sched/core: Remove pointless printout in sched_show_task() In sched_show_task() we print out a useless hex number, not even a symbol, and there's a big question mark whether this even makes sense anyway, I suspect we should just remove it all. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-11-03 07:31:34 +01:00
Tetsuo Handa	382005027f	sched/core: Fix oops in sched_show_task() When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread remains in the task list after its stack pointer was already set to NULL. Therefore, thread_saved_pc() and stack_not_used() in sched_show_task() will trigger NULL pointer dereference if an attempt to dump such thread's traces (e.g. SysRq-t, khungtaskd) is made. Since show_stack() in sched_show_task() calls try_get_task_stack() and sched_show_task() is called from interrupt context, calling try_get_task_stack() from sched_show_task() will be safe as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-11-03 07:27:34 +01:00
Johan Hovold	ceb75787bc	PM / sleep: fix device reference leak in test_suspend Make sure to drop the reference taken by class_find_device() after opening the RTC device. Fixes: `77437fd4e6` (pm: boot time suspend selftest) Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-02 05:10:04 +01:00
Andy Lutomirski	405c075971	fork: Add task stack refcounting sanity check and prevent premature task stack freeing If something goes wrong with task stack refcounting and a stack refcount hits zero too early, warn and leak it rather than potentially freeing it early (and silently). Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-11-01 07:39:17 +01:00
Daniel Borkmann	0f98621bef	bpf, inode: add support for symlinks and fix mtime/ctime While commit `bb35a6ef7d` ("bpf, inode: allow for rename and link ops") added support for hard links that can be used for prog and map nodes, this work adds simple symlink support, which can be used f.e. for directories also when unpriviledged and works with cmdline tooling that understands S_IFLNK anyway. Since the switch in `e27f4a942a` ("bpf: Use mount_nodev not mount_ns to mount the bpf filesystem"), there can be various mount instances with mount_nodev() and thus hierarchy can be flattened to facilitate object sharing. Thus, we can keep bpf tooling also working by repointing paths. Most of the functionality can be used from vfs library operations. The symlink is stored in the inode itself, that is in i_link, which is sufficient in our case as opposed to storing it in the page cache. While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update the directories mtime and ctime, so add a common helper for it called bpf_dentry_finalize() that takes care of it for all cases now. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-31 15:28:11 -04:00
David S. Miller	27058af401	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-30 12:42:58 -04:00
Thomas Graf	ebb676daa1	bpf: Print function name in addition to function id The verifier currently prints raw function ids when printing CALL instructions or when complaining: 5: (85) call 23 unknown func 23 print a meaningful function name instead: 5: (85) call bpf_redirect#23 unknown func bpf_redirect#23 Moves the function documentation to a single comment and renames all helpers names in the list to conform to the bpf_ prefix notation so they can be greped in the kernel source. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-29 15:55:13 -04:00
Linus Torvalds	b546e0c289	Power management fixes for v4.9-rc3 Specifics: - Fix a missing KERN_CONT in a system suspend message by converting the affected code to using pr_info() and pr_cont() instead of the "raw" printk() (Jon Hunter). - Make intel_pstate set the CPU P-state from its .set_policy() callback when the scaling_governor sysfs attribute is set to "performance" so that it interacts with NOHZ_FULL more predictably which was the case before 4.7 (Rafael Wysocki). - Make intel_pstate always request the maximum allowed P-state when the scaling_governor sysfs attribute is set to "performance" to prevent it from effectively ingoring that setting is some situations (Rafael Wysocki). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABCAAGBQJYE+vbAAoJEILEb/54YlRxsqQQAIZLfm5Isp4L0JBt2svtLYuO 5mLgGOfcpsHIpqq5+saKtK+JOcCpm/MJ0PMQdv3WhKkFGXG9++Dz6F2gNNFlE+Sy era/B9NFry1BsQ14xsX8XonYRkQ2tS+CgOw+akSZl9BTVeX57VASqPIzlKI4Auz/ Ru5M8n4QWAD7dPWnqHTFHYbmTnMtr18wn0mNjiC/cLz7TPxmGM9Pe730/bpGKMDC DQpflnpT/ktqV1imDuaiMeuP2re2DSZWMrirZo326UsDtcKN1A4sCO+PPUUt/TTb k6pKsZ9OB9FgMmaRqx5G6wCD3NvuhKa1JOqg3RvAVZ0SmMcl2AFhCaoEJtX+TKvz 3365o2tSQeGc3bmXTb7YqP2SaYm107rOG/M0I2JuA7qNUKxk8MFfbNdaYxTY2ezT D5NF0SCycOMeEXIDTTuZ2oN1NHR6osO7bNFOIacKlcIL8PmEl8fDqX2KdYQze5xh AGsa1Oee7FPrDd2kQIL6HNGgVjd6gtHU9cVjyzQeUv9SU6zwOlu4/BbZI+wIterN L2OfW7w6/KSSyScDZ3TbifeJwpP918bUv578dEP6U/2VKthPBUPY/MYMMQzSuEIn RAQoDShqacIjdyLZtStii28KctMOk0L/edEDbdFmyu2ttNwpB+hk2ieCYWfI8wAa 2SAaYyRGpC2nafsSTv+E =RtkB -----END PGP SIGNATURE----- Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two intel_pstate issues related to the way it works when the scaling_governor sysfs attribute is set to "performance" and fix up messages in the system suspend core code. Specifics: - Fix a missing KERN_CONT in a system suspend message by converting the affected code to using pr_info() and pr_cont() instead of the "raw" printk() (Jon Hunter). - Make intel_pstate set the CPU P-state from its .set_policy() callback when the scaling_governor sysfs attribute is set to "performance" so that it interacts with NOHZ_FULL more predictably which was the case before 4.7 (Rafael Wysocki). - Make intel_pstate always request the maximum allowed P-state when the scaling_governor sysfs attribute is set to "performance" to prevent it from effectively ingoring that setting is some situations (Rafael Wysocki)" * tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Always set max P-state in performance mode PM / suspend: Fix missing KERN_CONT for suspend message cpufreq: intel_pstate: Set P-state upfront in performance mode	2016-10-28 18:29:13 -07:00
Linus Torvalds	b49c3170bf	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc kernel fixes: a virtualization environment related fix, an uncore PMU driver removal handling fix, a PowerPC fix and new events for Knights Landing" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors perf/powerpc: Don't call perf_event_disable() from atomic context perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic perf/x86/intel/cstate: Add C-state residency events for Knights Landing	2016-10-28 16:27:16 -07:00
Linus Torvalds	a8006bd915	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Fix four timer locking races: two were noticed by Linus while reviewing the code while chasing for a corruption bug, and two from fixing spurious USB timeouts" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Prevent base clock corruption when forwarding timers: Prevent base clock rewind when forwarding clock timers: Lock base for same bucket optimization timers: Plug locking race vs. timer migration	2016-10-28 11:26:01 -07:00
Linus Torvalds	965c4b7e1a	Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool, irq and scheduler fixes from Ingo Molnar: "One more objtool fixlet for GCC6 code generation patterns, an irq DocBook fix and an unused variable warning fix in the scheduler" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix rare switch jump table pattern detection * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: doc: Add missing parameter for msi_setup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Remove unused but set variable 'rq'	2016-10-28 10:12:27 -07:00
Jiri Olsa	5aab90ce1e	perf/powerpc: Don't call perf_event_disable() from atomic context The trinity syscall fuzzer triggered following WARN() on powerpc: WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278 ... NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0 LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 Call Trace: [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable) [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 Followed by a lockdep warning: =============================== [ INFO: suspicious RCU usage. ] 4.8.0-rc5+ #7 Tainted: G W ------------------------------- ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ls/2998: #0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0 #1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0 stack backtrace: CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7 Call Trace: [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable) [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180 [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0 [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0 [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380 [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60 [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0 [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 While it looks like the first WARN() is probably valid, the other one is triggered by disabling event via perf_event_disable() from atomic context. The event is disabled here in case we were not able to emulate the instruction that hit the breakpoint. By disabling the event we unschedule the event and make sure it's not scheduled back. But we can't call perf_event_disable() from atomic context, instead we need to use the event's pending_disable irq_work method to disable it. Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-10-28 11:06:25 +02:00
Jiri Olsa	0933840acf	perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic CAI Qian reported a crash in the PMU uncore device removal code, enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option: https://marc.info/?l=linux-kernel&m=147688837328451 The reason for the crash is that perf_pmu_unregister() tries to remove a PMU device which is not added at this point. We add PMU devices only after pmu_bus is registered, which happens in the perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag. The fix is to get the 'pmu_bus_running' flag state at the point the PMU is taken out of the PMU list and remove the device later only if it's set. Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Herring <robh@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-10-28 11:06:25 +02:00
Andrey Konovalov	b274c0bb39	kcov: properly check if we are in an interrupt in_interrupt() returns a nonzero value when we are either in an interrupt or have bh disabled via local_bh_disable(). Since we are interested in only ignoring coverage from actual interrupts, do a proper check instead of just calling in_interrupt(). As a result of this change, kcov will start to collect coverage from within local_bh_disable()/local_bh_enable() sections. Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kees Cook <keescook@chromium.org> Cc: James Morse <james.morse@arm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-27 18:43:42 -07:00
Johannes Berg	56989f6d85	genetlink: mark families as __ro_after_init Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct. In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've add __init annotations to clarifly things), so all can actually be marked __ro_after_init. This protects the data structure from accidental corruption. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	489111e5c2	genetlink: statically initialize families Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	a07ea4d994	genetlink: no longer support using static family IDs Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Linus Torvalds	9c953d639c	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A set of fixes for this series, most notably the fix for the blk-mq software queue regression in from this merge window. Apart from that, a fix for an unlikely hang if a queue is flooded with FUA requests from Ming, and a few small fixes for nbd and badblocks. Lastly, a rename update for the proc softirq output, since the block polling code was made generic" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: update hardware and software queues for sleeping alloc block: flush: fix IO hang in case of flood fua req nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown badblocks: badblocks_set/clear update unacked_exist softirq: Display IRQ_POLL for irq-poll statistics	2016-10-27 10:05:31 -07:00
Linus Torvalds	9dcb8b685f	mm: remove per-zone hashtable of bitlock waitqueues The per-zone waitqueues exist because of a scalability issue with the page waitqueues on some NUMA machines, but it turns out that they hurt normal loads, and now with the vmalloced stacks they also end up breaking gfs2 that uses a bit_wait on a stack object: wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE) where 'gh' can be a reference to the local variable 'mount_gh' on the stack of fill_super(). The reason the per-zone hash table breaks for this case is that there is no "zone" for virtual allocations, and trying to look up the physical page to get at it will fail (with a BUG_ON()). It turns out that I actually complained to the mm people about the per-zone hash table for another reason just a month ago: the zone lookup also hurts the regular use of "unlock_page()" a lot, because the zone lookup ends up forcing several unnecessary cache misses and generates horrible code. As part of that earlier discussion, we had a much better solution for the NUMA scalability issue - by just making the page lock have a separate contention bit, the waitqueue doesn't even have to be looked at for the normal case. Peter Zijlstra already has a patch for that, but let's see if anybody even notices. In the meantime, let's fix the actual gfs2 breakage by simplifying the bitlock waitqueues and removing the per-zone issue. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Tested-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-27 09:27:57 -07:00
Tobias Klauser	f5d6d2da0d	sched/fair: Remove unused but set variable 'rq' Since commit: `8663e24d56` ("sched/fair: Reorder cgroup creation code") ... the variable 'rq' in alloc_fair_sched_group() is set but no longer used. Remove it to fix the following GCC warning when building with 'W=1': kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable] Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-10-27 08:33:52 +02:00
Thomas Gleixner	6bad6bccf2	timers: Prevent base clock corruption when forwarding When a timer is enqueued we try to forward the timer base clock. This mechanism has two issues: 1) Forwarding a remote base unlocked The forwarding function is called from get_target_base() with the current timer base lock held. But if the new target base is a different base than the current base (can happen with NOHZ, sigh!) then the forwarding is done on an unlocked base. This can lead to corruption of base->clk. Solution is simple: Invoke the forwarding after the target base is locked. 2) Possible corruption due to jiffies advancing This is similar to the issue in get_net_timer_interrupt() which was fixed in the previous patch. jiffies can advance between check and assignement and therefore advancing base->clk beyond the next expiry value. So we need to read jiffies into a local variable once and do the checks and assignment with the local copy. Fixes: a683f390b93f("timers: Forward the wheel clock whenever possible") Reported-by: Ashton Holmes <scoopta@gmail.com> Reported-by: Michael Thayer <michael.thayer@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michal Necasek <michal.necasek@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: knut.osmundsen@oracle.com Cc: stable@vger.kernel.org Cc: stern@rowland.harvard.edu Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-10-25 16:32:50 +02:00
Thomas Gleixner	041ad7bc75	timers: Prevent base clock rewind when forwarding clock Ashton and Michael reported, that kernel versions 4.8 and later suffer from USB timeouts which are caused by the timer wheel rework. This is caused by a bug in the base clock forwarding mechanism, which leads to timers expiring early. The scenario which leads to this is: run_timers() while (jiffies >= base->clk) { collect_expired_timers(); base->clk++; expire_timers(); } So base->clk = jiffies + 1. Now the cpu goes idle: idle() get_next_timer_interrupt() nextevt = __next_time_interrupt(); if (time_after(nextevt, base->clk)) base->clk = jiffies; jiffies has not advanced since run_timers(), so this assignment effectively decrements base->clk by one. base->clk is the index into the timer wheel arrays. So let's assume the following state after the base->clk increment in run_timers(): jiffies = 0 base->clk = 1 A timer gets enqueued with an expiry delta of 63 ticks (which is the case with the USB timeout and HZ=250) so the resulting bucket index is: base->clk + delta = 1 + 63 = 64 The timer goes into the first wheel level. The array size is 64 so it ends up in bucket 0, which is correct as it takes 63 ticks to advance base->clk to index into bucket 0 again. If the cpu goes idle before jiffies advance, then the bug in the forwarding mechanism sets base->clk back to 0, so the next invocation of run_timers() at the next tick will index into bucket 0 and therefore expire the timer 62 ticks too early. Instead of blindly setting base->clk to jiffies we must make the forwarding conditional on jiffies > base->clk, but we cannot use jiffies for this as we might run into the following issue: if (time_after(jiffies, base->clk) { if (time_after(nextevt, base->clk)) base->clk = jiffies; jiffies can increment between the check and the assigment far enough to advance beyond nextevt. So we need to use a stable value for checking. get_next_timer_interrupt() has the basej argument which is the jiffies value snapshot taken in the calling code. So we can just that. Thanks to Ashton for bisecting and providing trace data! Fixes: `a683f390b9` ("timers: Forward the wheel clock whenever possible") Reported-by: Ashton Holmes <scoopta@gmail.com> Reported-by: Michael Thayer <michael.thayer@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michal Necasek <michal.necasek@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: knut.osmundsen@oracle.com Cc: stable@vger.kernel.org Cc: stern@rowland.harvard.edu Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-10-25 16:32:50 +02:00
Thomas Gleixner	4da9152a43	timers: Lock base for same bucket optimization Linus stumbled over the unlocked modification of the timer expiry value in mod_timer() which is an optimization for timers which stay in the same bucket - due to the bucket granularity - despite their expiry time getting updated. The optimization itself still makes sense even if we take the lock, because in case that the bucket stays the same, we avoid the pointless queue/enqueue dance. Make the check and the modification of timer->expires protected by the base lock and shuffle the remaining code around so we can keep the lock held when we actually have to requeue the timer to a different bucket. Fixes: `f00c0afdfa` ("timers: Implement optimization for same expiry time in mod_timer()") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org>	2016-10-25 16:27:39 +02:00
Thomas Gleixner	b831275a35	timers: Plug locking race vs. timer migration Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the timer flags. As a consequence the compiler is allowed to reload the flags between the initial check for TIMER_MIGRATION and the following timer base computation and the spin lock of the base. While this has not been observed (yet), we need to make sure that it never happens. Fixes: `0eeda71bc3` ("timer: Replace timer base by a cpu index") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org>	2016-10-25 16:27:39 +02:00
Jon Hunter	1adb469b9b	PM / suspend: Fix missing KERN_CONT for suspend message Commit `4bcc595ccd` (printk: reinstate KERN_CONT for printing continuation lines) exposed a missing KERN_CONT from one of the messages shown on entering suspend. With v4.9-rc1, the 'done.' shown after syncing the filesystems no longer appears as a continuation but a new message with its own timestamp. [ 9.259566] PM: Syncing filesystems ... [ 9.264119] done. Fix this by adding the KERN_CONT log level for the 'done.' part of the message seen after syncing filesystems. While we are at it, convert these suspend printks to pr_info and pr_cont, respectively. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-24 14:38:02 +02:00
Daniel Borkmann	2d0e30c30f	bpf: add helper for retrieving current numa node id Use case is mainly for soreuseport to select sockets for the local numa node, but since generic, lets also add this for other networking and tracing program types. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-22 17:05:52 -04:00
Linus Torvalds	bfb7bfef6f	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "Mostly irqchip driver fixes, plus a symbol export" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/irq: Export irq_set_parent() irqchip/gic: Add missing \n to CPU IF adjustment message irqchip/jcore: Don't show Kconfig menu item for driver irqchip/eznps: Drop pointless static qualifier in nps400_of_init() irqchip/gic-v3-its: Fix entry size mask for GITS_BASER irqchip/gic-v3-its: Fix 64bit GIC{R,ITS}_TYPER accesses	2016-10-22 09:33:51 -07:00
Sagi Grimberg	f660f60667	softirq: Display IRQ_POLL for irq-poll statistics This library was moved to the generic area and was renamed to irq-poll. Hence, update proc/softirqs output accordingly. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-10-21 15:45:47 -06:00
Sudip Mukherjee	3118dac501	kernel/irq: Export irq_set_parent() The TPS65217 driver grew interrupt support which uses irq_set_parent(). While it's not yet clear why this is used in the first place, building the driver as a module fails with: ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined! The correctness of the driver change is still investigated, but for now it's less trouble to export irq_set_parent() than dealing with the build wreckage. [ tglx: Rewrote changelog and made the export GPL ] Fixes: `6556bdacf6` ("mfd: tps65217: Add support for IRQs") Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Marcin Niestroj <m.niestroj@grinn-global.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Lee Jones <lee.jones@linaro.org> Link: http://lkml.kernel.org/r/1475775403-27207-1-git-send-email-sudipm.mukherjee@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-10-21 10:21:38 +02:00
Linus Torvalds	893e2c5c9f	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "This fixes a group scheduling related performance/interactivity regression introduced in v4.8, which affects certain hardware environments where cpu_possible_mask != cpu_present_mask" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix incorrect task group ->load_avg	2016-10-19 10:03:55 -07:00
Linus Torvalds	8835ca59da	printk: suppress empty continuation lines We have a fairly common pattern where you print several things as continuations on one single line in a loop, and then at the end you do printk(KERN_CONT "\n"); to flush the buffered output. But if the output was flushed by something else (concurrent printk activity, or just system logging), we don't want that final flushing to just print an empty line. So just suppress empty continuation lines when they couldn't be merged into the line they are a continuation of. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-19 09:11:24 -07:00
Linus Torvalds	63ae602cea	Merge branch 'gup_flag-cleanups' Merge the gup_flags cleanups from Lorenzo Stoakes: "This patch series adjusts functions in the get_user_pages* family such that desired FOLL_* flags are passed as an argument rather than implied by flags. The purpose of this change is to make the use of FOLL_FORCE explicit so it is easier to grep for and clearer to callers that this flag is being used. The use of FOLL_FORCE is an issue as it overrides missing VM_READ/VM_WRITE flags for the VMA whose pages we are reading from/writing to, which can result in surprising behaviour. The patch series came out of the discussion around commit `38e0885465` ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"), which addressed a BUG_ON() being triggered when a page was faulted in with PROT_NONE set but having been overridden by FOLL_FORCE. do_numa_page() was run on the assumption the page _must_ be one marked for NUMA node migration as an actual PROT_NONE page would have been dealt with prior to this code path, however FOLL_FORCE introduced a situation where this assumption did not hold. See https://marc.info/?l=linux-mm&m=147585445805166 for the patch proposal" Additionally, there's a fix for an ancient bug related to FOLL_FORCE and FOLL_WRITE by me. [ This branch was rebased recently to add a few more acked-by's and reviewed-by's ] * gup_flag-cleanups: mm: replace access_process_vm() write parameter with gup_flags mm: replace access_remote_vm() write parameter with gup_flags mm: replace __access_remote_vm() write parameter with gup_flags mm: replace get_user_pages_remote() write/force parameters with gup_flags mm: replace get_user_pages() write/force parameters with gup_flags mm: replace get_vaddr_frames() write/force parameters with gup_flags mm: replace get_user_pages_locked() write/force parameters with gup_flags mm: replace get_user_pages_unlocked() write/force parameters with gup_flags mm: remove write/force parameters from __get_user_pages_unlocked() mm: remove write/force parameters from __get_user_pages_locked() mm: remove gup_flags FOLL_WRITE games from __get_user_pages()	2016-10-19 08:39:47 -07:00
Lorenzo Stoakes	f307ab6dce	mm: replace access_process_vm() write parameter with gup_flags This removes the 'write' argument from access_process_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-19 08:31:25 -07:00
Lorenzo Stoakes	9beae1ea89	mm: replace get_user_pages_remote() write/force parameters with gup_flags This removes the 'write' and 'force' from get_user_pages_remote() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-19 08:12:02 -07:00

1 2 3 4 5 ...

23437 Commits