The static initial tables are pretty large, and after the net
namespace has been instantiated, they just hang around for nothing.
This commit removes them and creates tables on-demand at runtime when
needed.
Size shrinks by 7735 bytes (x86_64).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The respective xt_table structures already have most of the metadata
needed for hook setup. Add a 'priority' field to struct xt_table so
that xt_hook_link() can be called with a reduced number of arguments.
So should we be having more tables in the future, it comes at no
static cost (only runtime, as before) - space saved:
6807373->6806555.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Rewrite COMPAT_XT_ALIGN in terms of dummy structure hack.
Compat counters logically have nothing to do with it.
Use ALIGN() macro while I'm at it for same types.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.
Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instanciating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expectation hashtable size was simply glued to a variable with no code
to rehash expectations, so it was a bug to allow writing to it.
Make "expect_hashsize" readonly.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
nf_conntrack_cachep is currently shared by all netns instances, but
because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.
If we use a shared slab cache, one object can instantly flight between
one hash table (netns ONE) to another one (netns TWO), and concurrent
reader (doing a lookup in netns ONE, 'finding' an object of netns TWO)
can be fooled without notice, because no RCU grace period has to be
observed between object freeing and its reuse.
We dont have this problem with UDP/TCP slab caches because TCP/UDP
hashtables are global to the machine (and each object has a pointer to
its netns).
If we use per netns conntrack hash tables, we also *must* use per netns
conntrack slab caches, to guarantee an object can not escape from one
namespace to another one.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[Patrick: added unique slab name allocation]
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
As discovered by Jon Masters <jonathan@jonmasters.org>, the "untracked"
conntrack, which is located in the data section, might be accidentally
freed when a new namespace is instantiated while the untracked conntrack
is attached to a skb because the reference count it re-initialized.
The best fix would be to use a seperate untracked conntrack per
namespace since it includes a namespace pointer. Unfortunately this is
not possible without larger changes since the namespace is not easily
available everywhere we need it. For now move the untracked conntrack
initialization to the init_net setup function to make sure the reference
count is not re-initialized and handle cleanup in the init_net cleanup
function to make sure namespaces can exit properly while the untracked
conntrack is in use in other namespaces.
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new target for the raw table, which can be used to specify conntrack
parameters for specific connections, f.i. the conntrack helper.
The target attaches a "template" connection tracking entry to the skb, which
is used by the conntrack core when initializing a new conntrack.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Support initializing selected parameters of new conntrack entries from a
"conntrack template", which is a specially marked conntrack entry attached
to the skb.
Currently the helper and the event delivery masks can be initialized this
way.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add two masks for conntrack end expectation events to struct nf_conntrack_ecache
and use them to filter events. Their default value is "all events" when the
event sysctl is on and "no events" when it is off. A following patch will add
specific initializations. Expectation events depend on the ecache struct of
their master conntrack.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Split up the IPCT_STATUS event into an IPCT_REPLY event, which is generated
when the IPS_SEEN_REPLY bit is set, and an IPCT_ASSURED event, which is
generated when the IPS_ASSURED bit is set.
In combination with a following patch to support selective event delivery,
this can be used for "sparse" conntrack replication: start replicating the
conntrack entry after it reached the ASSURED state and that way it's SYN-flood
resistant.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Make sure not to assign a helper for a different network or transport
layer protocol to a connection.
Additionally change expectation deletion by helper to compare the name
directly - there might be multiple helper registrations using the same
name, currently one of them is chosen in an unpredictable manner and
only those expectations are removed.
Signed-off-by: Patrick McHardy <kaber@trash.net>
As noticed by Shin Hong <hongshin@gmail.com>, there is a race between
htable_find_get() and htable_put():
htable_put(): htable_find_get():
spin_lock_bh(&hashlimit_lock);
<search entry>
atomic_dec_and_test(&hinfo->use)
atomic_inc(&hinfo->use)
spin_unlock_bh(&hashlimit_lock)
return hinfo;
spin_lock_bh(&hashlimit_lock);
hlist_del(&hinfo->node);
spin_unlock_bh(&hashlimit_lock);
htable_destroy(hinfo);
The entire locking concept is overly complicated, tables are only
created/referenced and released in process context, so a single
mutex works just fine. Remove the hashinfo_spinlock and atomic
reference count and use the mutex to protect table lookups/creation
and reference count changes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
The TCPMSS target is dropping SYN packets where:
1) There is data, or
2) The data offset makes the TCP header larger than the packet.
Both of these result in an error level printk. This printk has been
removed.
This change avoids dropping SYN packets containing data. If there
is also no MSS option (as well as data), one will not be added
because of possible complications due to the increased packet size.
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
In a string like "v:SIP/2.0..." it was checking for !isalpha('S') when it
meant to be inspecting the ':'.
Patch by Greg Alexander <greqcs@galexander.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch remove variable part from a debug message to have
message concatenation from syslog.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Make hashtable per-netns.
Make proc files per-netns.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Make recent table list per-netns.
Make proc files per-netns.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Simply pass hashtable to seqfile iterators, proc entry itself is not needed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
sky2: Fix oops in sky2_xmit_frame() after TX timeout
Documentation/3c509: document ethtool support
af_packet: Don't use skb after dev_queue_xmit()
vxge: use pci_dma_mapping_error to test return value
netfilter: ebtables: enforce CAP_NET_ADMIN
e1000e: fix and commonize code for setting the receive address registers
e1000e: e1000e_enable_tx_pkt_filtering() returns wrong value
e1000e: perform 10/100 adaptive IFS only on parts that support it
e1000e: don't accumulate PHY statistics on PHY read failure
e1000e: call pci_save_state() after pci_restore_state()
netxen: update version to 4.0.72
netxen: fix set mac addr
netxen: fix smatch warning
netxen: fix tx ring memory leak
tcp: update the netstamp_needed counter when cloning sockets
TI DaVinci EMAC: Handle emac module clock correctly.
dmfe/tulip: Let dmfe handle DM910x except for SPARC on-board chips
ixgbe: Fix compiler warning about variable being used uninitialized
netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq()
mv643xx_eth: don't include cache padding in rx desc buffer size
...
Fix trivial conflict in drivers/scsi/cxgb3i/cxgb3i_offload.c
commit 8a27f7c90f
changed the output style of %pi4 to use fixed
width leading zero IP addresses "001.002.003.004".
It's useful when printing multiple lines of
addresses, but was a change in output style for
some existing uses.
Using %pI4 restores the previous output style.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use the same format string as net/ipv4/netfilter/nf_nat_ftp.c
to encode an ipv4 address and port.
Both uses should be a single common function.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As noticed by Dan Carpenter <error27@gmail.com>, update_nl_seq()
currently contains an out of bounds read of the seq_aft_nl array
when looking for the oldest sequence number position.
Fix it to only compare valid positions.
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
I was very frustrated about the fact that I have to recompile the kernel
to change the hash size. So, I created this patch.
If IPVS is built-in you can append ip_vs.conn_tab_bits=?? to kernel
command line, or, if you built IPVS as modules, you can add
options ip_vs conn_tab_bits=??.
To keep everything backward compatible, you still can select the size at
compile time, and that will be used as default.
It has been about a year since this patch was originally posted
and subsequently dropped on the basis of insufficient test data.
Mark Bergsma has provided the following test results which seem
to strongly support the need for larger hash table sizes:
We do however run into the same problem with the default setting (212 =
4096 entries), as most of our LVS balancers handle around a million
connections/SLAB entries at any point in time (around 100-150 kpps
load). With only 4096 hash table entries this implies that each entry
consists of a linked list of 256 connections *on average*.
To provide some statistics, I did an oprofile run on an 2.6.31 kernel,
with both the default 4096 table size, and the same kernel recompiled
with IP_VS_CONN_TAB_BITS set to 18 (218 = 262144 entries). I built a
quick test setup with a part of Wikimedia/Wikipedia's live traffic
mirrored by the switch to the test host.
With the default setting, at ~ 120 kpps packet load we saw a typical %si
CPU usage of around 30-35%, and oprofile reported a hot spot in
ip_vs_conn_in_get:
samples % image name app name
symbol name
1719761 42.3741 ip_vs.ko ip_vs.ko ip_vs_conn_in_get
302577 7.4554 bnx2 bnx2 /bnx2
181984 4.4840 vmlinux vmlinux __ticket_spin_lock
128636 3.1695 vmlinux vmlinux ip_route_input
74345 1.8318 ip_vs.ko ip_vs.ko ip_vs_conn_out_get
68482 1.6874 vmlinux vmlinux mwait_idle
After loading the recompiled kernel with 218 entries, %si CPU usage
dropped in half to around 12-18%, and oprofile looks much healthier,
with only 7% spent in ip_vs_conn_in_get:
samples % image name app name
symbol name
265641 14.4616 bnx2 bnx2 /bnx2
143251 7.7986 vmlinux vmlinux __ticket_spin_lock
140661 7.6576 ip_vs.ko ip_vs.ko ip_vs_conn_in_get
94364 5.1372 vmlinux vmlinux mwait_idle
86267 4.6964 vmlinux vmlinux ip_route_input
[ horms@verge.net.au: trivial up-port and minor style fixes ]
Signed-off-by: Catalin(ux) M. BOIE <catab@embedromix.ro>
Cc: Mark Bergsma <mark@wikimedia.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The ipvs code has a nifty system for doing the size of ioctl command
copies; it defines an array with values into which it indexes the cmd
to find the right length.
Unfortunately, the ipvs code forgot to check if the cmd was in the
range that the array provides, allowing for an index outside of the
array, which then gives a "garbage" result into the length, which
then gets used for copying into a stack buffer.
Fix this by adding sanity checks on these as well as the copy size.
[ horms@verge.net.au: adjusted limit to IP_VS_SO_GET_MAX ]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
We can initialize the random hash bytes on checkentry. This is
preferable since it is outside the hot path.
Reference: http://bugzilla.netfilter.org/show_bug.cgi?id=621
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
"It is deliberately not done in the init function, since we might not
have sufficient random while booting."
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Moving rnd_inited into the hole after the uint8 lets go of the uint32
rnd_inited was using, plus the padding that would follow the int group.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Remove the private version of the greatest common divider to use
lib/gcd.c, the latter also implementing the a < b case.
[akpm@linux-foundation.org: repair neighboring whitespace because the diff looked odd]
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: Julius Volz <juliusv@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (26 commits)
net: sh_eth alignment fix for sh7724 using NET_IP_ALIGN V2
ixgbe: allow tx of pre-formatted vlan tagged packets
ixgbe: Fix 82598 premature copper PHY link indicatation
ixgbe: Fix tx_restart_queue/non_eop_desc statistics counters
bcm63xx_enet: fix compilation failure after get_stats_count removal
packet: dont call sleeping functions while holding rcu_read_lock()
tcp: Revert per-route SACK/DSACK/TIMESTAMP changes.
ipvs: zero usvc and udest
netfilter: fix crashes in bridge netfilter caused by fragment jumps
ipv6: reassembly: use seperate reassembly queues for conntrack and local delivery
sky2: leave PCI config space writeable
sky2: print Optima chip name
x25: Update maintainer.
ipvs: fix synchronization on connection close
netfilter: xtables: document minimal required version
drivers/net/bonding/: : use pr_fmt
can: CAN_MCP251X should depend on HAS_DMA
drivers/net/usb: Correct code taking the size of a pointer
drivers/net/cpmac.c: Correct code taking the size of a pointer
drivers/net/sfc: Correct code taking the size of a pointer
...
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.
It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
text data bss dec hex filename
64688 584 592 65864 10148 (TOTALS-BEFORE)
64641 584 592 65817 10119 (TOTALS-AFTER)
Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".
Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
drivers/leds/led-class.c
drivers/leds/ledtrig-timer.c
drivers/video/output.c
@@
expression str;
@@
( // ignore skip_spaces cases
while (*str && isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure that any otherwise uninitialised fields of usvc are zero.
This has been obvserved to cause a problem whereby the port of
fwmark services may end up as a non-zero value which causes
scheduling of a destination server to fail for persisitent services.
As observed by Deon van der Merwe <dvdm@truteq.co.za>.
This fix suggested by Julian Anastasov <ja@ssi.bg>.
For good measure also zero udest.
Cc: Deon van der Merwe <dvdm@truteq.co.za>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
commit 9d3a0de makes slaves expire as they would do on the master
with much shorter timeouts. But it introduces another problem:
When we close a connection, on master server the connection became
CLOSE_WAIT/TIME_WAIT, it was synced to slaves, but if master is
finished within it's timeouts (CLOSE), it will not be synced to
slaves. Then slaves will be kept on CLOSE_WAIT/TIME_WAIT until
timeout reaches. Thus we should also sync with CLOSE.
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not including net/atm/
Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generated with the following semantic patch
@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)
@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)
applied over {include,net,drivers/net}.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>