When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.
The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.
As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Pull networking fixes from David Miller:
1) Fix namespace init and cleanup in phonet to fix some oopses, from
Eric W. Biederman.
2) Missing kfree_skb() in AF_KEY, from Julia Lawall.
3) Refcount leak and source address handling fix in l2tp from James
Chapman.
4) Memory leak fix in CAIF from Tomasz Gregorek.
5) When routes are cloned from ipv6 addrconf routes, we don't process
expirations properly. Fix from Gao Feng.
6) Fix panic on DMA errors in atl1 driver, from Tony Zelenoff.
7) Only enable interrupts in 8139cp driver after we've registered the
IRQ handler. From Jason Wang.
8) Fix too many reads of KS_CIDER register in ks8851 during probe,
fixing crashes on spurious interrupts. From Matt Renzelmann.
9) Missing include in ath5k driver and missing iounmap on probe
failure, from Jonathan Bither.
10) Fix RX packet handling in smsc911x driver, from Will Deacon.
11) Fix ixgbe WoL on fiber by leaving the laser on during shutdown.
12) ks8851 needs MAX_RECV_FRAMES increased otherwise the internal MAC
buffers are easily overflown. Fix from Davide Cimingahi.
13) Fix memory leaks in peak_usb CAN driver, from Jesper Juhl.
14) gred packet scheduler can dump in WRED more when doing a netlink
dump. Fix from David Ward.
15) Fix MTU in USB smsc75xx driver, from Stephane Fillod.
16) Dummy device needs ->ndo_uninit handler to properly handle
->ndo_init failures. From Hiroaki SHIMODA.
17) Fix TX fragmentation in ath9k driver, from Sujith Manoharan.
18) Missing RTNL lock in ixgbe PM resume, from Benjamin Poirier.
19) Missing iounmap in farsync WAN driver, from Julia Lawall.
20) With LRO/GRO, tcp_grow_window() is easily tricked into not growing
the receive window properly, and this hurts performance. Fix from
Eric Dumazet.
21) Network namespace init failure can leak net_generic data, fix from
Julian Anastasov.
22) Fix skb_over_panic due to mis-accounting in TCP for partially ACK'd
SKBs. From Eric Dumazet.
23) New IDs for qmi_wwan driver, from Bjørn Mork.
24) Fix races in ax25_exit(), from Eric W. Biederman.
25) IPV6 TCP doesn't handle TCP_MAXSEG socket option properly, copy over
logic from the IPV4 side. From Neal Cardwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
tcp: fix TCP_MAXSEG for established IPv6 passive sockets
drivers/net: Do not free an IRQ if its request failed
drop_monitor: allow more events per second
ks8851: Fix request_irq/free_irq mismatch
net/hyperv: Adding cancellation to ensure rndis filter is closed
ks8851: Fix mutex deadlock in ks8851_net_stop()
net ax25: Reorder ax25_exit to remove races.
icplus: fix interrupt for IC+ 101A/G and 1001LF
net: qmi_wwan: support Sierra Wireless MC77xx devices in QMI mode
bnx2x: off by one in bnx2x_ets_e3b0_sp_pri_to_cos_set()
ksz884x: don't copy too much in netdev_set_mac_address()
tcp: fix retransmit of partially acked frames
netns: do not leak net_generic data on failed init
net/sock.h: fix sk_peek_off kernel-doc warning
tcp: fix tcp_grow_window() for large incoming frames
drivers/net/wan/farsync.c: add missing iounmap
davinci_mdio: Fix MDIO timeout check
ipv6: clean up rt6_clean_expires
ipv6: fix rt6_update_expires
arcnet: rimi: Fix device name in debug output
...
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ea5d404655
broke build for the vhost test module used
by tools/virtio. Fix it up.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We shouldn't hold any locks on release path. Pass a flag to
vhost_dev_cleanup to use the lockdep info correctly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
This is a tiny, but important, patch to vhost.
Vhost's worker thread only called schedule() when it had no work to do, and
it wanted to go to sleep. But if there's always work to do, e.g., the guest
is running a network-intensive program like netperf with small message sizes,
schedule() was *never* called. This had several negative implications (on
non-preemptive kernels):
1. Passing time was not properly accounted to the "vhost" process (ps and
top would wrongly show it using zero CPU time).
2. Sometimes error messages about RCU timeouts would be printed, if the
core running the vhost thread didn't schedule() for a very long time.
3. Worst of all, a vhost thread would "hog" the core. If several vhost
threads need to share the same core, typically one would get most of the
CPU time (and its associated guest most of the performance), while the
others hardly get any work done.
The trivial solution is to add
if (need_resched())
schedule();
After doing every piece of work. This will not do the heavy schedule() all
the time, just when the timer interrupt decided a reschedule is warranted
(so need_resched returns true).
Thanks to Abel Gordon for this patch.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.
Also:
- use C99 style initialization.
- add missing entry in documentation for loop-control
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The meth for calculating the # of outstanding buffers gives
incorrect results when vq->upend_idx wraps around zero.
Fix that.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On backend change, we flushed out outstanding skbs
but forgot to update the used ring, so that
done entries were left in the ubuf_info ring.
As a result we lose heads or complete incorrect ones,
crashing the guest or leaking memory.
Fix by updating the used ring.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
As we now only update used ring after enabling
the backend, we can write flags with __put_user:
as that's done on data path, it matters.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix get/put refcount imbalance with zero copy,
which caused qemu to hang forever on guest driver unload.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We need to log writes when updating used flags and avail event
fields. Otherwise the guest may see a stale value after migration and
miss notifying the host.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Move the used ring initialization after backend was set. This
makes it possible to disable the backend and tweak the used ring,
then restart. This will also make it possible to log the used ring
write correctly.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>From: Shirley Ma <mashirle@us.ibm.com>
This adds experimental zero copy support in vhost-net,
disabled by default. To enable, set
experimental_zcopytx module option to 1.
This patch maintains the outstanding userspace buffers in the
sequence it is delivered to vhost. The outstanding userspace buffers
will be marked as done once the lower device buffers DMA has finished.
This is monitored through last reference of kfree_skb callback. Two
buffer indices are used for this purpose.
The vhost-net device passes the userspace buffers info to lower device
skb through message control. DMA done status check and guest
notification are handled by handle_tx: in the worst case is all buffers
in the vq are in pending/done status, so we need to notify guest to
release DMA done buffers first before we get any new buffers from the
vq.
One known problem is that if the guest stops submitting
buffers, buffers might never get used until some
further action, e.g. device reset. This does not
seem to affect linux guests.
Signed-off-by: Shirley <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support the new event index feature. When acked,
utilize it to reduce the # of interrupts sent to the guest.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
- Documentation/kvm/ to Documentation/virtual/kvm
- Documentation/uml/ to Documentation/virtual/uml
- Documentation/lguest/ to Documentation/virtual/lguest
throughout the kernel source tree.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Use of skb_queue_empty(&sock->sk->sk_receive_queue)
without taking the sk_receive_queue.lock is unsafe
or useless. Take it out.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost takes a sock lock to try and prevent
the skb from being pulled from the receive queue
after skb_peek. However this is not the right lock to use for that,
sk_receive_queue.lock is. Fix that up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Codes duplication were found between the handling of mergeable and big
buffers, so this patch tries to unify them. This could be easily done
by adding a quota to the get_rx_bufs() which is used to limit the
number of buffers it returns (for mergeable buffer, the quota is
simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
previous handle_rx_mergeable() could be resued also for big buffers.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
No need to check the support of mergeable buffer inside the recevie
loop as the whole handle_rx()_xx is in the read critical region. So
this patch move it ahead of the receiving loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
copy_from_user is pretty high on perf top profile,
replacing it with __copy_from_user helps.
It's also safe because we do access_ok checks during setup.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Minor cleanup of vhost.c and net.c to match coding style.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When built with rcu checks enabled, vhost triggers
bogus warnings as vhost features are read without
dev->mutex sometimes, and private pointer is read
with our kind of rcu where work serves as a
read side critical section.
Fixing it properly is not trivial.
Disable the warnings by stubbing out the checks for now.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
To detect that a sequence number is done, we are doing math on unsigned
integers so the result is unsigned too. Not what was intended for the <=
comparison. The result is user stuck forever in flush call.
Convert to int to fix this.
Further, get rid of ({}) to make code clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This adds a test module for vhost infrastructure.
Intentionally not tied to kbuild to prevent people
from installing and loading it accidentally.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We really store a page offset in write_address,
so rename it write_page to avoid confusion.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix two bugs in dirty page logging:
When counting pages we should increase address by 1 instead of
VHOST_PAGE_SIZE. Make log_write() correctly process requests
that cross pages with write_address not starting at page boundary.
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Incorrect rcu check was used as rcu isn't done
under mutex here. Force check to 1 for now,
to stop it from complaining.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We do access_ok checks on all ring values on an ioctl,
so we don't need to redo them on each access.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Delete successive assignments to the same location.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression i;
@@
*i = ...;
i = ...;
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
access_ok() returns 1 if it's OK otherwise it should return 0.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Qemu supports up to UIO_MAXIOV s/g so we have to match that because guest
drivers may rely on this.
Allocate indirect and log arrays dynamically to avoid using too much contigious
memory and make the length of hdr array to match the header length since each
iovec entry has a least one byte.
Test with copying large files w/ and w/o migration in both linux and windows
guests.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The log eventfd signalling got put in dead code.
We didn't notice because qemu currently does polling
instead of eventfd select.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In mergeable buffer case, we use headcount, log_num
and seg as indexes in same-size arrays, and
we know that headcount <= seg and
log_num equals either 0 or seg.
Therefore, the right thing to do is range-check seg,
not headcount as we do now: these will be different
if guest chains s/g descriptors (this does not
happen now, but we can not trust the guest).
Long term, we should add BUG_ON checks to verify
two other indexes are what we think they should be.
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost should set worker to NULL on cgroups attach failure,
so that we won't try to destroy the worker again on close.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Since 2.6.36-rc1, non-root users of vhost-net fail to attach
if they are in any cgroups.
The reason is that when qemu uses vhost, vhost wants to attach
its thread to all cgroups that qemu has. But we got the API backwards,
so a non-priveledged process (Qemu) tried to control
the priveledged one (vhost), which fails.
Fix this by switching to the new cgroup_attach_task_all,
and running it from the vhost thread.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Its currently illegal to call kthread_stop(NULL)
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also add rcu_dereference_protected() for code paths where locks are held.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...
Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
This adds support for mergeable buffers in vhost-net: this is needed
for older guests without indirect buffer support, as well
as for zero copy with some devices.
Includes changes by Michael S. Tsirkin to make the
patch as low risk as possible (i.e., close to no changes
when feature is disabled).
Signed-off-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Apply the cgroup of the owner task to the created vhost worker.
Based on patches from Sridhar Samudrala's and Tejun Heo.
Later we'll need to also apply cpumask and probably priority
of the owner process.
Discussion on the best way to do this is still ongoing.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Replace vhost_workqueue with per-vhost kthread. Other than callback
argument change from struct work_struct * to struct vhost_work *,
there's no visible change to vhost_poll_*() interface.
This conversion is to make each vhost use a dedicated kthread so that
resource control via cgroup can be applied.
Partially based on Sridhar Samudrala's patch.
* Updated to use sub structure vhost_work instead of directly using
vhost_poll at Michael's suggestion.
* Added flusher wake_up() optimization at Michael's suggestion.
Changes by MST:
* Converted atomics/barrier use to a spinlock.
* Create thread on SET_OWNER
* Fix flushing
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
Conflicts:
drivers/vhost/net.c
net/bridge/br_device.c
Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.
Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (24 commits)
bridge: Partially disable netpoll support
tcp: fix crash in tcp_xmit_retransmit_queue
IPv6: fix CoA check in RH2 input handler (mip6_rthdr_input())
ibmveth: lost IRQ while closing/opening device leads to service loss
rt2x00: Fix lockdep warning in rt2x00lib_probe_dev()
vhost: avoid pr_err on condition guest can trigger
ipmr: Don't leak memory if fib lookup fails.
vhost-net: avoid flush under lock
net: fix problem in reading sock TX queue
net/core: neighbour update Oops
net: skb_tx_hash() fix relative to skb_orphan_try()
rfs: call sock_rps_record_flow() in tcp_splice_read()
xfrm: do not assume that template resolving always returns xfrms
hostap_pci: set dev->base_addr during probe
axnet_cs: use spin_lock_irqsave in ax_interrupt
dsa: Fix Kconfig dependencies.
act_nat: not all of the ICMP packets need an IP header payload
r8169: incorrect identifier for a 8168dp
Phonet: fix skb leak in pipe endpoint accept()
Bluetooth: Update sec_level/auth_type for already existing connections
...
Guest can trigger packet truncation by posting
a very short buffer and disabling buffer merging.
Convert pr_err to pr_debug to avoid log from filling
up when this happens.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We flush under vq mutex when changing backends.
This creates a deadlock as workqueue being flushed
needs this lock as well.
https://bugzilla.redhat.com/show_bug.cgi?id=612421
Drop the vq mutex before flush: we have the device mutex
which is sufficient to prevent another ioctl from touching
the vq.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (35 commits)
NET: SB1250: Initialize .owner
vxge: show startup message with KERN_INFO
ll_temac: Fix missing iounmaps
bridge: Clear IPCB before possible entry into IP stack
bridge br_multicast: BUG: unable to handle kernel NULL pointer dereference
net: Fix definition of netif_vdbg() when VERBOSE_DEBUG is defined
net/ne: fix memory leak in ne_drv_probe()
xfrm: fix xfrm by MARK logic
virtio_net: fix oom handling on tx
virtio_net: do not reschedule rx refill forever
s2io: resolve statistics issues
linux/net.h: fix kernel-doc warnings
net: decreasing real_num_tx_queues needs to flush qdisc
sched: qdisc_reset_all_tx is calling qdisc_reset without qdisc_lock
qlge: fix a eeh handler to not add a pending timer
qlge: Replacing add_timer() to mod_timer()
usbnet: Set parent device early for netdev_printk()
net: Revert "rndis_host: Poll status channel before control channel"
netfilter: ip6t_REJECT: fix a dst leak in ipv6 REJECT
drivers: bluetooth: bluecard_cs.c: Fixed include error, changed to linux/io.h
...
patch 'break out of polling loop on error' caused
a minor performance regression on my machine: recover
that performance by adding a bunch of unlikely annotations
in the error handling.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When ring parsing fails, we currently handle this
as ring empty condition. This means that we enable
kicks and recheck ring empty: if this not empty,
we re-start polling which of course will fail again.
Instead, let's return a negative error code and stop polling.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
10, 233 is allocated officially to /dev/kmview which is shipping in
Ubuntu and Debian distributions. vhost_net seem to have borrowed it
without making a proper request and this causes regressions in the other
distributions.
vhost_net can use a dynamic minor so use that instead. Also update the
file with a comment to try and avoid future misunderstandings.
cc: stable@kernel.org
Signed-off-by: Alan Cox <device@lanana.org>
[ We should have caught this before 2.6.34 got released. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
netlink: bug fix: wrong size was calculated for vfinfo list blob
netlink: bug fix: don't overrun skbs on vf_port dump
xt_tee: use skb_dst_drop()
netdev/fec: fix ifconfig eth0 down hang issue
cnic: Fix context memory init. on 5709.
drivers/net: Eliminate a NULL pointer dereference
drivers/net/hamradio: Eliminate a NULL pointer dereference
be2net: Patch removes redundant while statement in loop.
ipv6: Add GSO support on forwarding path
net: fix __neigh_event_send()
vhost: fix the memory leak which will happen when memory_access_ok fails
vhost-net: fix to check the return value of copy_to/from_user() correctly
vhost: fix to check the return value of copy_to/from_user() correctly
vhost: Fix host panic if ioctl called with wrong index
net: fix lock_sock_bh/unlock_sock_bh
net/iucv: Add missing spin_unlock
net: ll_temac: fix checksum offload logic
net: ll_temac: fix interrupt bug when interrupt 0 is used
sctp: dubious bitfields in sctp_transport
ipmr: off by one in __ipmr_fill_mroute()
...
We need to free newmem when vhost_set_memory() fails to complete.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
copy_to/from_user() returns the number of bytes that could not be copied.
So we need to check if it is not zero, and in that case, we should return
the error number -EFAULT rather than directly return the return value from
copy_to/from_user().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
copy_to/from_user() returns the number of bytes that could not be copied.
So we need to check if it is not zero, and in that case, we should return
the error number -EFAULT rather than directly return the return value from
copy_to/from_user().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Delete a label and goto from vhost_net_set_backend
Inverting a test allows a label and goto to be eliminated.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The C99 specification states in section 6.11.5:
The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Missed a boundary value check in vhost_set_vring. The host panics if
idx == nvqs is used in ioctl commands in vhost_virtqueue_init.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.
- Make SHRT_MIN of type s16, not int, for consistency.
[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to memory-barriers.txt, an smp memory barrier in guest
should always be paired with an smp memory barrier in host,
and I quote "a lack of appropriate pairing is almost certainly an
error". In case of vhost, failure to flush out used index
update before looking at the interrupt disable flag
could result in missed interrupts, resulting in
networking hang under stress.
This might happen when flags read bypasses used index write.
So we see interrupts disabled and do not interrupt, at the
same time guest writes flags value to enable interrupt,
reads an old used index value, thinks that
used ring is empty and waits for interrupt.
Note: the barrier we pair with here is in
drivers/virtio/virtio_ring.c, function
vring_enable_cb.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Juan Quintela <quintela@redhat.com>
vq_memory_access_ok needs to check whether mem == NULL
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Stanse found a locking problem in vhost_set_vring:
several returns from VHOST_SET_VRING_KICK, VHOST_SET_VRING_CALL,
VHOST_SET_VRING_ERR with the vq->mutex held.
Fix these up.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Laurent Chavey <chavey@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A thinko in code means we never trigger interrupt
mitigation. Fix this.
Reported-by: Juan Quintela <quintela@redhat.com>
Reported-by: Unai Uribarri <unai.uribarri@optenet.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
An error could cause vhost_net_set_backend to exit without unlocking
vq->mutex. Fix this.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
guest to remote communication with vhost net sometimes stops until
guest driver is restarted. This happens when we get guest kick precisely
when the backend send queue is full, as a result handle_tx() returns without
polling backend. This patch fixes this by restarting tx poll on this condition.
Signed-off-by: Sridhar Samudrala <samudrala@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Tom Lendacky <toml@us.ibm.com>
get_user_pages_fast returns number of pages on success, negative value
on failure, but never 0. Fix vhost code to match this logic.
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vq log eventfd context pointer needs to be initialized, otherwise
operation may fail or oops if log is enabled but log eventfd not set by
userspace. When log_ctx for device is created, it is copied to the vq.
This reset was missing.
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost was dong some complex math to get
offset to log at, and got it wrong by a couple of bytes,
while in fact it's simple: get address where we write,
subtract start of buffer, add log base.
Do it this way.
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This adds support for passing a macvtap file descriptor into
vhost-net, much like we already do for tun/tap.
Most of the new code is taken from the respective patch
in the tun driver and may get consolidated in the future.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vhost-net only uses memory barriers to control SMP effects
(communication with userspace potentially running on a different CPU),
so it should use SMP barriers and not mandatory barriers for memory
access ordering, as suggested by Documentation/memory-barriers.txt
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/built-in.o: In function `get_tun_socket':
net.c:(.text+0x15436e): undefined reference to `tun_get_socket'
If tun is a module, vhost must be a module, too.
If tun is built-in or disabled, vhost can be built-in.
Note: TUN || !TUN might look a bit strange until you realize
that boolean logic rules do not apply for tristate variables.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>