All drivers that support modification of the RX flow hash indirection
table initialise it in the same way: RX rings are assigned to table
entries in rotation. Make that default policy explicit by having them
call a ethtool_rxfh_indir_default() function.
In the ethtool core, add support for a zero size value for
ETHTOOL_SRXFHINDIR, which resets the table to this default.
Partly-suggested-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new ethtool operation (get_rxfh_indir_size) to get the
indirectional table size. Use this to validate the user buffer size
before calling get_rxfh_indir or set_rxfh_indir. Use get_rxnfc to get
the number of RX rings, and validate the contents of the new
indirection table before calling set_rxfh_indir. Remove this
validation from drivers.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When establishing a unix connection on stream sockets the
server end receives an skb with socket in its receive queue.
Report who is waiting for these ends to be accepted for
listening sockets via NLA.
There's a lokcing issue with this -- the unix sk state lock is
required to access the peer, and it is taken under the listening
sk's queue lock. Strictly speaking the queue lock should be taken
inside the state lock, but since in this case these two sockets
are different it shouldn't lead to deadlock.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the peer socket inode ID as NLA. With this it's finally
possible to find out the other end of an interesting unix connection.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Actually, the socket path if it's not anonymous doesn't give
a clue to which file the socket is bound to. Even if the path
is absolute, it can be unlinked and then new socket can be
bound to it.
With this NLA it's possible to check which file a particular
socket is really bound to.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the sun_path when requested as NLA. With leading '\0' if
present but without the leading AF_UNIX bits.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket inode is used as a key for lookup. This is effectively
the only really unique ID of a unix socket, but using this for
search currently has one problem -- it is O(number of sockets) :(
Does it worth fixing this lookup or inventing some other ID for
unix sockets?
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Walk the unix sockets table and fill the core response structure,
which includes type, state and inode.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Includes basic module_init/_exit functionality, dump/get_exact stubs
and declares the basic API structures for request and response.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk address is used as a cookie between dump/get_exact calls.
It will be required for unix socket sdumping, so move it from
inet_diag to sock_diag.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've made a mistake when fixing the sock_/inet_diag aliases :(
1. The sock_diag layer should request the family-based alias,
not just the IPPROTO_IP one;
2. The inet_diag layer should request for AF_INET+protocol alias,
not just the protocol one.
Thus fix this.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we restore regulatory settings the world regulatory domain
is properly reset on cfg80211 (or user prefered regulatory domain)
but we were never setting back channel values for drivers that use
WIPHY_FLAG_CUSTOM_REGULATORY. Set these values up again by using
the orig_ channel parameters.
This fixes restoring custom regulatory settings upon disconnect
events.
Cc: compat@orbit-lab.org
Cc: Paul Stewart <pstew@google.com>
Cc: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Cc: Senthilkumar Balasubramanian <senthilb@qca.qualcomm.com>
Signed-off-by: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
By definition WIPHY_FLAG_STRICT_REGULATORY was intended to allow the
wiphy to adjust itself to the country IE power information if the
card had no regulatory data but we had no way to tell cfg80211 that if
the card also had its own custom regulatory domain (these are typically
custom world regulatory domains) that we want to follow the country IE's
noted values for power for each channel. We add support for this and
document it.
This is not a critical fix but a performance optimization for cards
with custom regulatory domains that associate to an AP with sends
out country IEs with a higher EIRP than the one on the custom
regulatory domain. In practice the only driver affected right now
are the Atheros drivers as they are the only drivers using both
WIPHY_FLAG_STRICT_REGULATORY and WIPHY_FLAG_CUSTOM_REGULATORY --
used on cards that have an Atheros world regulatory domain. Cards
that have been programmed to follow a country specifically will not
follow the country IE power. So although not a stable fix distributions
should consider cherry picking this.
Cc: compat@orbit-lab.org
Cc: Paul Stewart <pstew@google.com>
Cc: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Cc: Senthilkumar Balasubramanian <senthilb@qca.qualcomm.com>
Reported-by: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
we found that power save is not getting enabled when we do
change interface in this order STA->IBSS->STA. this is
because ieee80211_setup_sdata clears type-dependent union
Reported-by: Leela Kella <leela@qca.qualcomm.com>
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently BAR, ADDBA and DELBA frames are always sent using AC_VO. If
the TID for which a BA session is established is assigned to a different
queue BAR, ADDBA and DELBA frames can "overtake" frames of the according
BA session.
Hence, always put BA session related frames into the same queue as the
BA sessions data frames.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Now that IBSS no longer needs to insert stations
from atomic context, we can get rid of all the
special cases for that, and even get rid of the
sta_lock (though it needs to stay as tim_lock.)
This makes the station management code much more
straight-forward.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
In order to notify drivers and simplify the station
management code, defer IBSS station insertion to a
work item and don't do it directly while receiving
a frame.
This increases the complexity in IBSS a little bit,
but it's pretty straight forward and it allows us
to reduce the station management complexity (next
patch) considerably.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
No real changes, just note that they are const.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently, each AP interface will send multicast
traffic if any interface has a station entry even
if that station entry is allocated only. With the
new station state management we can easily fix it
by adding a counter that counts each authorized
station only and send multicast traffic only when
the correct interface has at least one authorized
station.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Station entries can have various states, the most
important ones being auth, assoc and authorized.
This patch prepares us for telling the driver about
these states, we don't want to confuse drivers with
strange transitions, so with this we enforce that
they move in the right order between them (back and
forth); some transitions might happen before the
driver even knows about the station, but at least
runtime transitions will be ordered correctly.
As a consequence, IBSS and MESH stations will now
have the ASSOC flag set (so they can transition to
AUTHORIZED), and we can get rid of a special case
in TX processing.
When freeing a station, unwind the state so that
other parts of the code (or drivers later) can rely
on the transitions.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's no need to use RCU here, we can just lock
the station mutex instead. This allows the code
to sleep, which is necessary for later patches.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This is already checked in cfg80211, so no need
to repeat the checks here.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The nl80211 station handling code is a bit messy
and doesn't do a lot of validation. It seems like
this could be an issue for drivers that don't use
mac80211 to validate everything.
As cfg80211 doesn't keep station state, move the
validation of allowing supported_rates to change
for TDLS only in station mode to mac80211.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This was evidently missed in the TDLS patch (07ba55d7).
Cc: Arik Nemtsov <arik@wizery.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We should only dereference the pointer if it's valid, not the other way
round.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is an initial implementation for the NFC Logical Link Control
protocol. It's also known as NFC peer to peer mode.
This is a basic implementation as it lacks SDP (services Discovery
Protocol), frames aggregation support, and frame rejecion parsing.
Follow up patches will implement those missing features.
This code has been tested against a Nexus S phone implementing LLCP 1.0.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Without an API for setting and getting the local and remote general bytes,
drivers won't be able to properly establish a DEP link.
This API also allows them to propagate the remote general bytes they get
from the DEP link establishment up to the LLCP layer.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
NFC-DEP (Data Exchange Protocol) is an NFC MAC layer.
This command allows to enable and disable the DEP link on to which e.g.
LLCP can run.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
rawsock_create() is called with preemption disabled, so we should not
sleep.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The netlink notifier is atomic so we must not sleep in that context.
Also we know that Any netlink packets arriving to us will be purged when
the notifier is called, so we don't need to take the mutex.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This is a factorization of the current rawsock tx skb allocation routine,
as it will be used by the LLCP code.
We also rename nfc_alloc_skb to nfc_alloc_recv_skb for consistency sake.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Our new return also created a memleak. The skb should be freed before
returning an error.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
All nl80211 commands that need only the wiphy
still allow identifying it by giving an interface
index, except, as Kenny pointed out, the testmode
dump support.
Fix this by looking up the wiphy via the ifidx in
this case as well.
Tested-by: Kenny Hsu <kenny.hsu@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
net/ipv4/sysctl_net_ipv4.c:78:6: warning: symbol 'inet_get_ping_group_range_table'
was not declared. Should it be static?
net/ipv4/sysctl_net_ipv4.c:119:31: warning: incorrect type in argument 2
(different signedness)
net/ipv4/sysctl_net_ipv4.c:119:31: expected int *range
net/ipv4/sysctl_net_ipv4.c:119:31: got unsigned int *<noident>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its better to use a predefined size for this small automatic variable.
Removes a sparse error as well :
net/sched/cls_flow.c:288:13: error: bad constant expression
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6d4cdf47d2 (vlan: add 802.1q netpoll support) forgot to declare
as static some private functions.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original code generates a Sparse warning:
net/8021q/vlan_core.c:336:9:
error: incompatible types in comparison expression (different address spaces)
It's ok to dereference __rcu pointers here because we are holding the
RTNL lock. I've added some calls to rtnl_dereference() to silence the
warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before adding a struct rtnl_link_ops into link_ops list, check it doesnt
clash with a prior one.
Based on a previous patch from Alexander Smirnov
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: add missing spin_unlock at ceph_mdsc_build_path()
ceph: fix SEEK_CUR, SEEK_SET regression
crush: fix mapping calculation when force argument doesn't exist
ceph: use i_ceph_lock instead of i_lock
rbd: remove buggy rollback functionality
rbd: return an error when an invalid header is read
ceph: fix rasize reporting by ceph_show_options
After commit 8e2ec63917 ("ipv6: don't
use inetpeer to store metrics for routes.") the test in rt6_alloc_cow()
for setting the ANYCAST flag is now wrong.
'rt' will always now have a plen of 128, because it is set explicitly
to 128 by ip6_rt_copy.
So to restore the semantics of the test, check the destination prefix
length of 'ort'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't just succeed with a route that has a NULL neighbour attached.
This follows the behavior of addrconf_dst_alloc().
Allowing this kind of route to end up with a NULL neigh attached will
result in packet drops on output until the route is somehow
invalidated, since nothing will meanwhile try to lookup the neigh
again.
A statistic is bumped for the case where we see a neigh-less route on
output, but the resulting packet drop is otherwise silent in nature,
and frankly it's a hard error for this to happen and ipv6 should do
what ipv4 does which is say something in the kernel logs.
Signed-off-by: David S. Miller <davem@davemloft.net>
It's simpler to just keep these things out until there is a real user
of them, so we can see what the needs actually are, rather than keep
these things around as useless overhead.
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip address of the vif can be set even before the
vif is up. requiring the vif to be up in the vif
notifier makes the notifer ignore this event, which
causes wrong arp filter configuration later on.
Reported-by: Eyal Shapira <eyal@wizery.com>
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Configure arp filtering on sta reconfiguration.
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
ieee80211_configure_filter code used local->scanning as a boolean
value when it was a bit mask. Bits SCAN_COMPLETED, SCAN_ABORTED
should not set FIF_BCN_PRBRESP_PROMISC filter.
SCAN_HW_SCANNING should not set FIF_BCN_PRBRESP_PROMISC either,
as there is no explicit filter configuration request from
scan code. If a driver requires FIF_BCN_PRBRESP_PROMISC mode
during HW scanning, it's up to the driver to temporary enable it.
Similar mistake was fixed also in ieee80211_hw_config (power
configuration code).
Verified-by: Vitaly Wool <vitaly.wool@sonyericsson.com>
Signed-off-by: Dmitry Tarnyagin <dmitry.tarnyagin@stericsson.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Regulatory updates set by CORE are ignored for custom regulatory cards.
Let us notify the changes to the driver, as some drivers uses core hint
to restore its orig_* reg domain setting.
Cc: Paul Stewart <pstew@google.com>
Signed-off-by: Rajkumar Manoharan <rmanohar@qca.qualcomm.com>
Acked-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Use ieee80211_is_data, ieee80211_is_mgmt and ieee80211_is_first_frag
in the tx status path. This makes the code easier to read and allows us
to remove two local variables: frag and type.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When a station leaves suddenly while ampdu traffic to that station is still
running, there is a possibility that the ampdu pending queues are not freed due
to a race condition leading to memory leaks. In '__sta_info_destroy' when we
attempt to destroy the ampdu sessions in 'ieee80211_sta_tear_down_BA_sessions',
the driver calls 'ieee80211_stop_tx_ba_cb_irqsafe' to delete the ampdu
structures (tid_tx) and splice the pending queues and this job gets queued in
sdata workqueue. However, the sta entry can get destroyed before the above work
gets scheduled and hence the race.
Purging the queues and freeing the tid_tx to avoid the leak. The better solution
would be to fix the race, but that can be taken up in a separate patch.
Signed-off-by: Nishant Sarmukadam <nishants@marvell.com>
Signed-off-by: Yogesh Ashok Powar <yogeshp@marvell.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It is quite possible to run into a race in bss timeout where
the drivers see the bss entry just before notifying cfg80211
of a roaming event but it got timed out by the time rdev->event_work
got scehduled from cfg80211_wq. This would result in the following
WARN-ON() along with the failure to notify the user space of
the roaming. The other situation which is happening with ath6kl
that runs into issue is when the driver reports roam to same AP
event where the AP bss entry already got expired. To fix this,
move cfg80211_get_bss() from __cfg80211_roamed() to cfg80211_roamed().
[158645.538384] WARNING: at net/wireless/sme.c:586
__cfg80211_roamed+0xc2/0x1b1()
[158645.538810] Call Trace:
[158645.538838] [<c1033527>] warn_slowpath_common+0x65/0x7a
[158645.538917] [<c14cfacf>] ? __cfg80211_roamed+0xc2/0x1b1
[158645.538946] [<c103354b>] warn_slowpath_null+0xf/0x13
[158645.539055] [<c14cfacf>] __cfg80211_roamed+0xc2/0x1b1
[158645.539086] [<c14beb5b>] cfg80211_process_rdev_events+0x153/0x1cc
[158645.539166] [<c14bd57b>] cfg80211_event_work+0x26/0x36
[158645.539195] [<c10482ae>] process_one_work+0x219/0x38b
[158645.539273] [<c14bd555>] ? wiphy_new+0x419/0x419
[158645.539301] [<c10486cb>] worker_thread+0xf6/0x1bf
[158645.539379] [<c10485d5>] ? rescuer_thread+0x1b5/0x1b5
[158645.539407] [<c104b3e2>] kthread+0x62/0x67
[158645.539484] [<c104b380>] ? __init_kthread_worker+0x42/0x42
[158645.539514] [<c151309a>] kernel_thread_helper+0x6/0xd
Reported-by: Kalle Valo <kvalo@qca.qualcomm.com>
Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We recently introduced a new return here but it needs an unlock first.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This extension can be used to simulate special link layer
characteristics. Simulate because packet data is not modified, only the
calculation base is changed to delay a packet based on the original
packet size and artificial cell information.
packet_overhead can be used to simulate a link layer header compression
scheme (e.g. set packet_overhead to -20) or with a positive
packet_overhead value an additional MAC header can be simulated. It is
also possible to "replace" the 14 byte Ethernet header with something
else.
cell_size and cell_overhead can be used to simulate link layer schemes,
based on cells, like some TDMA schemes. Another application area are MAC
schemes using a link layer fragmentation with a (small) header each.
Cell size is the maximum amount of data bytes within one cell. Cell
overhead is an additional variable to change the per-cell-overhead
(e.g. 5 byte header per fragment).
Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
cell overhead 5 byte):
tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gred_change_vq() is called under sch_tree_lock(sch).
This means a spinlock is held, and we are not allowed to sleep in this
context.
We might pre-allocate memory using GFP_KERNEL before taking spinlock,
but this is not suitable for stable material.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
kmem_cgroup filesystem. The root cgroup will display a value equal
to RESOURCE_MAX. This is to avoid introducing any locking schemes in
the network paths when cgroups are not being actively used.
All others, will see the maximum memory ever used by this cgroup.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces kmem.tcp.failcnt file, living in the
kmem_cgroup filesystem. Following the pattern in the other
memcg resources, this files keeps a counter of how many times
allocation failed due to limits being hit in this cgroup.
The root cgroup will always show a failcnt of 0.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces kmem.tcp.usage_in_bytes file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by the cgroup.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.
This value is ignored in the root cgroup, and in all others,
caps the value specified by the admin in the net namespaces'
view of tcp_sysctl_mem.
If namespaces are being used, the admin is allowed to set a
value bigger than cgroup's maximum, the same way it is allowed
to set pretty much unlimited values in a real box.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make this values
per group of process somehow. This is achieved in the
patches that follows in this patchset.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this work is to move the memory pressure tcp
controls to a cgroup, instead of just relying on global
conditions.
To avoid excessive overhead in the network fast paths,
the code that accounts allocated memory to a cgroup is
hidden inside a static_branch(). This branch is patched out
until the first non-root cgroup is created. So when nobody
is using cgroups, even if it is mounted, no significant performance
penalty should be seen.
This patch handles the generic part of the code, and has nothing
tcp-specific.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtsu.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem to acessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.
Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same fix as 731abb9cb2 for ipip and sit tunnel.
Commit 1c5cae815d removed an explicit call to dev_alloc_name in
ipip_tunnel_locate and ipip6_tunnel_locate, because register_netdevice
will now create a valid name, however the tunnel keeps a copy of the
name in the private parms structure. Fix this by copying the name back
after register_netdevice has successfully returned.
This shows up if you do a simple tunnel add, followed by a tunnel show:
$ sudo ip tunnel add mode ipip remote 10.2.20.211
$ ip tunnel
tunl0: ip/ip remote any local any ttl inherit nopmtudisc
tunl%d: ip/ip remote 10.2.20.211 local any ttl inherit
$ sudo ip tunnel add mode sit remote 10.2.20.212
$ ip tunnel
sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16
sit%d: ioctl 89f8 failed: No such device
sit%d: ipv6/ip remote 10.2.20.212 local any ttl inherit
Cc: stable@vger.kernel.org
Signed-off-by: Ted Feng <artisdom@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no obvious reason to add a default multicast route for loopback
devices, otherwise there would be a route entry whose dst.error set to
-ENETUNREACH that would blocking all multicast packets.
====================
[ more detailed explanation ]
The problem is that the resulting routing table depends on the sequence
of interface's initialization and in some situation, that would block all
muticast packets. Suppose there are two interfaces on my computer
(lo and eth0), if we initailize 'lo' before 'eth0', the resuting routing
table(for multicast) would be
# ip -6 route show | grep ff00::
unreachable ff00::/8 dev lo metric 256 error -101
ff00::/8 dev eth0 metric 256
When sending multicasting packets, routing subsystem will return the first
route entry which with a error set to -101(ENETUNREACH).
I know the kernel will set the default ipv6 address for 'lo' when it is up
and won't set the default multicast route for it, but there is no reason to
stop 'init' program from setting address for 'lo', and that is exactly what
systemd did.
I am sure there is something wrong with kernel or systemd, currently I preferred
kernel caused this problem.
====================
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wait_for_completion_interruptible_timeout() returns -ERESTARTSYS if
interrupted so completion_rc needs to be signed. The current code
probably returns -ETIMEDOUT if we hit this situation, but after this
patch is applied it will return -ERESTARTSYS.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Don't write more than the requested number of bytes of an batman-adv icmp
packet to the userspace buffer. Otherwise unrelated userspace memory might get
overridden by the kernel.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
The access_ok read check can be directly done in copy_from_user since a failure
of access_ok is handled the same way as an error in __copy_from_user.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Writing a icmp_packet_rr and then reading icmp_packet can lead to kernel
memory corruption, if __user *buf is just below TASK_SIZE.
Signed-off-by: Paul Kot <pawlkt@gmail.com>
[sven@narfation.org: made it checkpatch clean]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap the udp6 lookup into the proper ifdef-s.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet reported, that when inet_diag is built-in the udp_diag also goes
built-in and when ipv6 is a module the udp6 lookup symbol is not found.
LD .tmp_vmlinux1
net/built-in.o: In function `udp_dump_one':
udp_diag.c:(.text+0xa2b40): undefined reference to `__udp6_lib_lookup'
make: *** [.tmp_vmlinux1] Erreur 1
Fix this by making udp diag build mode depend on both -- inet diag and ipv6.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the same as TCP does -- iterate the given udp_table, filter
sockets with bytecode and dump sockets into reply message.
The same filtering as for TCP applies, though only some of the
state bits really matter.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the same as TCP does -- lookup a socket in the given udp_table,
check cookie, fill the reply message with existing inet socket dumping
helper and send one back.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the transport level diag handler module for UDP (and UDP-lite)
sockets and register (empty for now) callbacks in the inet_diag module.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP diag get_exact handler will require them to find a
socket by provided net, [sd]addr-s, [sd]ports and device.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce two callbacks in inet_diag_handler -- one for dumping all
sockets (with filters) and the other one for dumping a single sk.
Replace direct calls to icsk handlers with indirect calls to callbacks
provided by handlers.
Make existing TCP and DCCP handlers use provided helpers for icsk-s.
The UDP diag module will provide its own.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing inet_csk_diag_fill dumps the inet connection sock info
into the netlink inet_diag_message. Prepare this routine to be able
to dump only the inet_sock part of a socket if the icsk part is missing.
This will be used by UDP diag module when dumping UDP sockets.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The upcoming UDP module will require exactly this ability, so just
move the existing code to provide one.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to previous patch: the 1st part locks the inet handler
and will get generalized and the 2nd one dumps icsk-s and will
be used by TCP and DCCP handlers.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 1st part locks the inet handler and the 2nd one dump the
inet connection sock.
In the next patches the 1st part will be generalized to call
the socket dumping routine indirectly (i.e. TCP/UDP/DCCP) and
the 2nd part will be used by TCP and DCCP handlers.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink diag susbsys stores sk address bits in the nl message
as a "cookie" and uses one when dumps details about particular
socket.
The same will be required for udp diag module, so introduce a heler
in inet_diag module
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's an info_size value stored on inet_diag_handler, but for existing
code this value is effectively constant, so just use sizeof(struct tcp_info)
where required.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now RED uses a Q0.32 number to store max_p (max probability), allow
RED/GRED/CHOKE to use/report full resolution at config/dump time.
Old tc binaries are non aware of new attributes, and still set/get Plog.
New tc binary set/get both Plog and max_p for backward compatibility,
they display "probability value" if they get max_p from new kernels.
# tc -d qdisc show dev ...
...
qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
probability 0.09 Scell_log 15
Make sure we avoid potential divides by 0 in reciprocal_value(), if
(max_th - min_th) is big.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 865d9f9f74.
This commit breaks the build with CONFIG_NETPRIO_CGROUP=y so
revert it. It does build as a module though. The SUBSYS macro
in the cgroup core code automatically defines a subsys structure
as extern. Long term we should fix the macro. And I need to
fully build test things.
Tested with CONFIG_NETPRIO_CGROUP={y|m|n} with and without
CONFIG_CGROUPS defined.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Neil Horman <nhorman@tuxdriver.com>
Reported-By: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These tests are off by one because sock_diag_handlers[] only has AF_MAX
elements.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_prio_subsys can be made static this removes the sparse
warning it was throwing.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netpoll support to 802.1q vlan devices. Based on the netpoll support
in the bridging code. Tested on a forced_eth device with netconsole.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adaptative RED AQM for linux, based on paper from Sally FLoyd,
Ramakrishna Gummadi, and Scott Shenker, August 2001 :
http://icir.org/floyd/papers/adaptiveRed.pdf
Goal of Adaptative RED is to make max_p a dynamic value between 1% and
50% to reach the target average queue : (max_th - min_th) / 2
Every 500 ms:
if (avg > target and max_p <= 0.5)
increase max_p : max_p += alpha;
else if (avg < target and max_p >= 0.01)
decrease max_p : max_p *= beta;
target :[min_th + 0.4*(min_th - max_th),
min_th + 0.6*(min_th - max_th)].
alpha : min(0.01, max_p / 4)
beta : 0.9
max_P is a Q0.32 fixed point number (unsigned, with 32 bits mantissa)
Changes against our RED implementation are :
max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
fixed point number, to allow full range described in Adatative paper.
To deliver a random number, we now use a reciprocal divide (thats really
a multiply), but this operation is done once per marked/droped packet
when in RED_BETWEEN_TRESH window, so added cost (compared to previous
AND operation) is near zero.
dump operation gives current max_p value in a new TCA_RED_MAX_P
attribute.
Example on a 10Mbit link :
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
limit 400000 min 30000 max 90000 avpkt 1000 \
burst 55 ecn adaptative bandwidth 10Mbit
# tc -s -d qdisc show dev eth3
...
qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
adaptative ewma 5 max_p=0.113335 Scell_log 15
Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
rate 9749Kbit 831pps backlog 72056b 16p requeues 0
marked 1357 early 35 pdrop 0 other 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>