The phy layer is missing locking for the above two functions - it
has been observed that two threads (userspace and the phy worker
thread) can race, entering the bus ->write or ->read functions
simultaneously.
This causes the FEC driver to initialise a completion while another
thread is waiting on it or while the interrupt is calling complete()
on it, which causes spinlock unlock-without-lock, spinlock lockups,
and completion timeouts.
Fixes: a59a4d192 ("phy: add the EEE support and the way to access to the MMD registers.")
Fixes: 0c1d77dfb ("net: libphy: Add phy specific function to access mmd phy registers")
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Santosh Shilimkar says:
====================
RDS: Few more fixes
As indicated in the earlier series [1], this is a follow-up series which
addresses few issues around the RDS FMR code. With [1] and the subject
series, now I can run many parallel threads with multiple sockets with
N x N traffic. The stress tests has survived overnight runs.
[1] https://lkml.org/lkml/2015/8/22/127
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Memory allocated for 'ibmr' uses kzalloc_node() which already
initialises the memory to zero. There is no need to do
memset() 0 on that memory.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FMR flush is an expensive and time consuming operation. Reduce the
frequency of FMR pool flush by 50% so that more FMR work gets accumulated
for more efficient flushing.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS FMR flush operation and also it races with connect/reconect
which happes a lot with RDS. FMR flush being on common rds_wq aggrevates
the problem. Lets push RDS FMR pool flush work to its own worker.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rds_ib_flush_mr_pool(), dirty_count accounts the clean ones
which is wrong. This can lead to a negative dirty count value.
Lets fix it.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On rds_ib_frag_slab allocation failure, ensure rds_ib_incoming_slab
is not pointing to the detsroyed memory.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before 56ef9c909b40[1] it used to ignore all errors from igmp_join().
That commit enhanced that and made it error out whatever error happened
with igmp_join(), but that's not good because when using multicast
groups vxlan will try to join it multiple times if the socket is reused
and then the 2nd and further attempts will fail with EADDRINUSE.
As we don't track to which groups the socket is already subscribed, it's
okay to just ignore that error.
Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: John Nielsen <lists@jnielsen.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- code restyling and beautification
- use int kernel types instead of C99
- update kereldoc
- prevent potential hlist double deletion of VLAN objects
- fix gw bandwidth calculation
- convert list to hlist when needed
- add lockdep_asserts calls in function with lock requirements
described in kerneldoc
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJV26RQAAoJEOb/4TMchkvfsGkP/jJrS9XAUL6SkYARWU7JB++I
G0tVjzJVzh9aPXMV5uFcqwYYB/j2SvqIsad9MCcRpgsmsEhc52TELZCsY6HtZFLt
0la5mOAkMmjBvW8/gzaGM4nzDc388UYOxW1gKwl+obczUCpEnbyu/R7CiC8zB5d7
oHYfgmYbGTAtoFzwiA5zXdu11QfPJhLMivMVsNt2sf7daBjJP6arFl+hJWWEGE/L
oYfEuiOajFvp2d8H/VXnV43aJHtWHvUis1CqB2dFvTFWkPyWuFpmIwoJWydD/e1s
vjcbq4HdwTwG6Io4DkbmL1YLTONW/Z83FViIHCSDu4TRxQnDrY52WHetbmoE6ARi
EtQfUlU4i39FydZUnMIEmmY2lyU74YeGd3cqqDsYO2j30xJ0tpcih815cXBL9twC
jZZqlk6NNeJ0zLAHTXe6azEbg/jv7TBL0wjcDDgmHUDPZkFDMD6gn8okn26Yi3oY
qd5/HCU2oj8RP9OaHf7kkWo0cg1bwq2ygoRkWcEIwCcUyq3utJJ+8RMI8jUkC3is
siiPPSbEEQW02fl60KRJZeiyVY2Em2fDYR0gVrj4czcIr0HALtKgJ+u5jDYqqtj3
uzGT0YcXeAFYRJgo8yjOnzUz7sM2Tgld4qDI59FegrUilsoW/fwi3uEU32Vzda1P
4aLuUtbkeQfnaLMJCuE2
=hRJW
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Included changes:
- code restyling and beautification
- use int kernel types instead of C99
- update kereldoc
- prevent potential hlist double deletion of VLAN objects
- fix gw bandwidth calculation
- convert list to hlist when needed
- add lockdep_asserts calls in function with lock requirements
described in kerneldoc
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABCgAGBQJV3BEJAAoJEP5prqPJtc/Hb2wH/RFxO+rBh16yZBJjFlPbn2ZQ
VbcngZtowJle9kjTBINCN/8KsjOhdpn9oT8iOxVcrwyxPa/gWcqnz7cip9regabu
6fOnIlmCnomJ9E/9Gt4joqsB14Zlbubn4xU+VJacZRDjXktJSeHGexxYHnAsROC6
V4W2yIySd5T1UvlzSCbbugRLa9c0ROtLj2RxdHrTicbmcyQrA/bvErACFGtlInso
PV6YLFk+ESk3RH0vl2FxUkNC2g7QiKp7zhX9eAuDkEg2CIYCL1sNQt6eAQMPaRbc
o9u60JLbqiXrKbvlmOGBnIPqkBWxbYX3Jo9Qc4bEyS93Gzdn73kMwnFbMgd8Oek=
=JbLo
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-4.2-20150825' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
this is the updated pull request of one patch by me for the peak_usb
driver. It fixes the driver, so that non FD adapters don't provide CAN
FD bittimings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
posted_index is RO in firmware. We need not do ioread everytime to get
posted index. Store posted index locally.
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the renesas ethernet driver directory is compiled if SH_ETH is
configured rather than NET_VENDOR_RENESAS. Although incorrect that was
quite harmless as until recently as SH_ETH configured the only driver in
the renesas directory. However, as of c156633f13 ("Renesas Ethernet AVB
driver proper") the renesas directory includes another driver, configured
by RAVB, and it makes little sense for it to have a hidden dependency on
SH_ETH.
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
[horms: rewrote changelog]
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The r8169 driver collects statistical information returned by
@get_stats64 by counting them in the driver itself, even though many
(but not all) of the values are already collected by tally counters
(TCs) in the NIC. Some of these TC values are not returned by
@get_stats64. Especially the received multicast packages are missing
from /proc/net/dev.
Rectify this by fetching the TCs and returning them from
rtl8169_get_stats64.
The counters collected in the driver obviously disappear as soon as the
driver is unloaded so after a driver is loaded the counters always start
at 0. The TCs on the other hand are only reset by a power cycle. Without
further considerations the values collected by the driver would not match
up against the TC values.
This patch introduces a new function rtl8169_reset_counters which
resets the TCs. Also, since rtl8169_reset_counters shares most of
its code with rtl8169_update_counters, refactor the shared code into
two new functions rtl8169_map_counters and rtl8169_unmap_counters.
Unfortunately chip versions prior to RTL_GIGA_MAC_VER_19 don't allow
to reset the TCs programatically. Therefore introduce an addition to
the rtl8169_private struct and a function rtl8169_init_counter_offsets
to store the TCs at first rtl_open. Use these values as offsets in
rtl8169_get_stats64. Propagate a failure to reset *and* update the
counters up to rtl_open and emit a warning message, if so.
Signed-off-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
>> net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types)
net/rds/ib_recv.c:382:28: expected int [signed] can_wait
net/rds/ib_recv.c:382:28: got restricted gfp_t
net/rds/ib_recv.c:828:23: sparse: cast to restricted __le64
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shreyas Bhatewara would no longer maintain the vmxnet3 driver. Taking over
the role of vmxnet3 maintainer.
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a tunnel is deleted, the cached dst entry should be released.
This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
unregister_netdevice: waiting for lo to become free. Usage count = 3
CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vxlan_get_sk_family inline function was added after the last #endif,
making multiple inclusion of net/vxlan.h fail. Move it to the proper place.
Reported-by: Mark Rustad <mark.d.rustad@intel.com>
Fixes: 705cc62f67 ("vxlan: provide access function for vxlan socket address family")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix following warnings.
.//net/core/skbuff.c:407: warning: No description found
for parameter 'len'
.//net/core/skbuff.c:407: warning: Excess function parameter
'length' description in '__netdev_alloc_skb'
.//net/core/skbuff.c:476: warning: No description found
for parameter 'len'
.//net/core/skbuff.c:476: warning: Excess function parameter
'length' description in '__napi_alloc_skb'
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let packets move from one netns to the other at PPP encapsulation and
decapsulation time.
PPP units and channels remain in the netns in which they were
originally created. Only the net_device may move to a different
namespace. Cross netns handling is thus transparent to lower PPP
layers (PPPoE, L2TP, etc.).
PPP devices are automatically unregistered when their netns gets
removed. So read() and poll() on the unit file descriptor will
respectively receive EOF and POLLHUP. Channels aren't affected.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Claim the emac sram ourselves, rather then relying on the bootloader
having mapped the sram to the emac controller during boot.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove various inlined functions not referenced in the kernel.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid multiply/division operations on the data path,
we hold a {channel, tc}==>txq mapping table.
We held this mapping table inside the channel object that is
being destroyed upon some configuration operations (e.g MTU change).
So in case ndo_select_queue occurs during such a configuration operation,
it may access a NULL channel pointer, resulting in kernel panic.
To fix this issue we moved the {channel, tc}==>txq mapping table
outside the channel object so that it will be available also
during such configuration operations.
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return a negative error code on failure.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
{ ... return ret; }
|
ret = 0
)
... when != ret = e1
when != &ret
*if(...)
{
... when != ret = e2
when forall
return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return a negative error code on failure.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
{ ... return ret; }
|
ret = 0
)
... when != ret = e1
when != &ret
*if(...)
{
... when != ret = e2
when forall
return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Propagate error code on failure.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
{ ... return ret; }
|
ret = 0
)
... when != ret = e1
when != &ret
*if(...)
{
... when != ret = e2
when forall
return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi,
After commit f70ced0917 (blk-mq: support per-distpatch_queue flush
machinery), the mtip32xx driver may oops upon module load due to walking
off the end of an array in mtip_init_cmd. On initialization of the
flush_rq, init_request is called with request_index >= the maximum queue
depth the driver supports. For mtip32xx, this value is used to index
into an array. What this means is that the driver will walk off the end
of the array, and either oops or cause random memory corruption.
The problem is easily reproduced by doing modprobe/rmmod of the mtip32xx
driver in a loop. I can typically reproduce the problem in about 30
seconds.
Now, in the case of mtip32xx, it actually doesn't support flush/fua, so
I think we can simply return without doing anything. In addition, no
other mq-enabled driver does anything with the request_index passed into
init_request(), so no other driver is affected. However, I'm not really
sure what is expected of drivers. Ming, what did you envision drivers
would do when initializing the flush requests?
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Santosh Shilimkar says:
====================
RDS: Assorted bug fixes
We would like to improve RDS upstream support and in that context, I
started playing with it. But run into number of issues including as
basic is RDS IB RDMA doesn't work. As part of the debug, I ended up
creating the $subject series which has bunch of assorted fixes. At
least with this series I can run RDS IB RDMA and other tests
successfully.
Some of these fixes have been done by Chris Meson, Andy Grover and
Zach Brown while at Oracle. There are still more kinks with FMR and
error handling and I plan to address them in a follow up series.
Series generated against Linus's master(v4.2-rc-7) but also applies
against next-next cleanly. Its tested on Oracle hardware with IB
fabric for both bcopy as well as RDMA mode. I don't have access
to iWARP hardware so any testing help on iWARP hardware appreciated.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Connection could have been dropped while the route is being resolved
so check for valid cm_id before initiating the connection.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_queue_rm() allows for the "current datagram" being queued
to exceed SO_SNDBUF thresholds by checking bytes queued without
counting in length of current datagram. (Since sk_sndbuf is set
to twice requested SO_SNDBUF value as a kernel heuristic this
is usually fine!)
If this "current datagram" squeezing past the threshold is itself
many times the size of the sk_sndbuf threshold itself then even
twice the SO_SNDBUF does not save us and it gets queued but
cannot be transmitted. Threads block and deadlock and device
becomes unusable. The check for this datagram not exceeding
SNDBUF thresholds (EMSGSIZE) is not done on this datagram as
that check is only done if queueing attempt fails.
(Datagrams that follow this datagram fail queueing attempts, go
through the check and eventually trip EMSGSIZE error but zero
length datagrams silently fail!)
This fix moves the check for datagrams exceeding SNDBUF limits
before any processing or queueing is attempted and returns EMSGSIZE
early in the rds_sndmsg() code. This change also ensures that all
datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits
and the large datagrams that exceed those limits do not get to
rds_send_queue_rm() code for processing.
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_drop_to() is used during socket tear down to find all the
messages on the socket and flush them . It can race with the
acking code unless it takes the m_rs_lock on each and every message.
This plugs a hole where we didn't take m_rs_lock on any message that
didn't have the RDS_MSG_ON_CONN set. Taking m_rs_lock avoids
double frees and other memory corruptions as the ack code trusts
the message m_rs pointer on a socket that had actually been freed.
We must take m_rs_lock to access m_rs. Because of lock nesting and
rs access, we also need to acquire rs_lock.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During connection resets, we are destroying the rdma id too soon. We can't
destroy it when it is still in use. So lets move rdma_destroy_id() after
we clear the rings.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the asserion level since its not fatal and can be hit
in normal execution paths. There is no need to take the
system down.
We keep the WARN_ON() to detect the condition if we get
here with bad pages.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WR(Work Requests )always generate a WC(Work Completion) with
signaled send. Default RDS ib code is setup for un-signaled
completion. Since RDS connction is persistent, we can end up
sending the data even after large-send when the remote end is
not active(for any reason).
By doing a signaled send at least once per large-send,
we can at least detect the problem in work completion
handler there by avoiding sending more data to
inactive remote.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_send_xmit() marks the rds message map flag after
xmit_[rdma/atomic]() which is clearly wrong. We need
to maintain the ownership between transport and rds.
Also take care of error path.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helps to detect the accidental processes/apps trying to destroy
the RDS socket which they are sharing with other processes/apps.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure we don't keep sending the data if the link is congested.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we get an ENOMEM during rds_ib_recv_refill, we might never come
back and refill again later. Patch makes sure to kick krdsd into
helping out.
To achieve this we add RDS_RECV_REFILL flag and update in the refill
path based on that so that at least some therad will keep posting
receive buffers.
Since krdsd and softirq both might race for refill, we decide to
schedule on work queue based on ring_low instead of ring_empty.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the ip address tables hasn't changed, there is no need to remove
them only to be added back again.
Lets fix it.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Destroy ib state early during shutdown. Otherwise we can get callbacks
after the QP isn't really able to handle them.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We were still seeing rare occurrences of the WARN_ON(recv->r_frag) which
indicates that the recv refill path was finding allocated frags in ring
entries that were marked free. These were usually followed by OOM crashes.
They only seem to be occurring in the presence of completion errors and
connection resets.
This patch ensures that we free the frag as we mark the ring entry free.
This should stop the refill path from finding allocated frags in ring
entries that were marked free.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rds_cmsg_rdma_args() 'ret' is used by rds_pin_pages() which returns
number of pinned pages on success. And the same value is returned to the
caller of rds_cmsg_rdma_args() on success which is not intended.
Commit f4a3fc03c1 ("RDS: Clean up error handling in rds_cmsg_rdma_args")
removed the 'ret = 0' line which broke RDS RDMA mode.
Fix it by restoring the return value on rds_pin_pages() success
keeping the clean-up in place.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e79729123f ("writeback: don't issue wb_writeback_work if clean")
updated writeback path to avoid kicking writeback work items if there
are no inodes to be written out; unfortunately, the avoidance logic
was too aggressive and broke sync_inodes_sb().
* sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME
inodes dont't contribute to bdi/wb_has_dirty_io() tests and were
being skipped over.
* inodes are taken off wb->b_dirty/io/more_io lists after writeback
starts on them. sync_inodes_sb() skipping wait_sb_inodes() when
bdi_has_dirty_io() breaks it by making it return while writebacks
are in-flight.
This patch fixes the breakages by
* Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs().
The callers are already testing the condition.
* Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that
it always calls into bdi_split_work_to_wbs() and wait_sb_inodes().
* Making bdi_split_work_to_wbs() consider the b_dirty_time list for
WB_SYNC_ALL writebacks.
Kudos to Eryu, Dave and Jan for tracking down the issue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: e79729123f ("writeback: don't issue wb_writeback_work if clean")
Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.com
Reported-and-bisected-by: Eryu Guan <eguan@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.com>
Cc: Ted Ts'o <tytso@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.
At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.
We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :
- After cwnd reduction, it is safer to ramp up more slowly,
as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
cwnd every other RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 18ee49ddb0 ("phylib: rename mii_bus::dev to mii_bus::parent")
changed the parent of PHY devices from the bus to the bus parent.
Then, commit 4dea547fef ("phylib: rework to prepare for OF
registration of PHYs") moved the code into phy_device.c
At this point, it is somewhat unclear why the change was seen as
necessary. But, when we look at the device model tree in
/sys/devices, it is clearly incorrect. The PHYs should be children of
their MDIO bus.
Change the PHY's parent device to be the MDIO bus device.
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: David Daney <david.daney@cavium.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Directs route lookups to VRF table. Compiles out if NET_VRF is not
enabled. With this patch able to successfully bring up ipsec tunnels
in VRFs, even with duplicate network configuration.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.
With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.
Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.
As Neal pointed out, we also need to perform the check
if/when receive window opens.
Tested:
Following packetdrill test demonstrates the problem
// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>