Commit Graph

1906 Commits

Author SHA1 Message Date
Rumen G. Bogdanovski
7a4fbb1fa4 [IPVS]: Flag synced connections and expose them in proc
This patch labels the sync-created connections with IP_VS_CONN_F_SYNC
flag and creates /proc/net/ip_vs_conn_sync to enable monitoring of the
origin of the connections, if they are local or created by the
synchronization.

Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:21 -08:00
Ilpo Järvinen
20de20beba [TCP]: Correct DSACK check placing
Previously one of the in-block skip branches was missing it.

Also, drop it from tail-fully-processed case because the next
iteration will do exactly the same thing, i.e., process the
SACK block that contains the DSACK information.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:15 -08:00
Eric Dumazet
8dbde28d97 [NET]: NET_CLS_ROUTE : convert ip_rt_acct to per_cpu variables
ip_rt_acct needs 4096 bytes per cpu to perform some accounting.
It is actually allocated as a single huge array [4096*NR_CPUS]
(rounded up to a power of two)

Converting it to a per cpu variable is wanted to :
 - Save space on machines were num_possible_cpus() < NR_CPUS
 - Better NUMA placement (each cpu gets memory on its node)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:08 -08:00
Ilpo Järvinen
68f8353b48 [TCP]: Rewrite SACK block processing & sack_recv_cache use
Key points of this patch are:

  - In case new SACK information is advance only type, no skb
    processing below previously discovered highest point is done
  - Optimize cases below highest point too since there's no need
    to always go up to highest point (which is very likely still
    present in that SACK), this is not entirely true though
    because I'm dropping the fastpath_skb_hint which could
    previously optimize those cases even better. Whether that's
    significant, I'm not too sure.

Currently it will provide skipping by walking. Combined with
RB-tree, all skipping would become fast too regardless of window
size (can be done incrementally later).

Previously a number of cases in TCP SACK processing fails to
take advantage of costly stored information in sack_recv_cache,
most importantly, expected events such as cumulative ACK and new
hole ACKs. Processing on such ACKs result in rather long walks
building up latencies (which easily gets nasty when window is
huge). Those latencies are often completely unnecessary
compared with the amount of _new_ information received, usually
for cumulative ACK there's no new information at all, yet TCP
walks whole queue unnecessary potentially taking a number of
costly cache misses on the way, etc.!

Since the inclusion of highest_sack, there's a lot information
that is very likely redundant (SACK fastpath hint stuff,
fackets_out, highest_sack), though there's no ultimate guarantee
that they'll remain the same whole the time (in all unearthly
scenarios). Take advantage of this knowledge here and drop
fastpath hint and use direct access to highest SACKed skb as
a replacement.

Effectively "special cased" fastpath is dropped. This change
adds some complexity to introduce better coveraged "fastpath",
though the added complexity should make TCP behave more cache
friendly.

The current ACK's SACK blocks are compared against each cached
block individially and only ranges that are new are then scanned
by the high constant walk. For other parts of write queue, even
when in previously known part of the SACK blocks, a faster skip
function is used (if necessary at all). In addition, whenever
possible, TCP fast-forwards to highest_sack skb that was made
available by an earlier patch. In typical case, no other things
but this fast-forward and mandatory markings after that occur
making the access pattern quite similar to the former fastpath
"special case".

DSACKs are special case that must always be walked.

The local to recv_sack_cache copying could be more intelligent
w.r.t DSACKs which are likely to be there only once but that
is left to a separate patch.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:07 -08:00
Ilpo Järvinen
fd6dad616d [TCP]: Earlier SACK block verification & simplify access to them
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:07 -08:00
Ilpo Järvinen
9e10c47cb9 [TCP]: Create tcp_sacktag_one().
Worker function that implements the main logic of
the inner-most loop of tcp_sacktag_write_queue().

Idea was originally presented by David S. Miller.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:06 -08:00
Ilpo Järvinen
b7d4815f35 [TCP]: Prior_fackets can be replaced by highest_sack seq
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:05 -08:00
Ilpo Järvinen
9f58f3b721 [TCP]: Make lost retrans detection more self-contained
Highest_sack_end_seq is no longer calculated in the loop,
thus it can be pushed to the worker function altogether
making that function independent of the sacktag.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:04 -08:00
Ilpo Järvinen
a47e5a988a [TCP]: Convert highest_sack to sk_buff to allow direct access
It is going to replace the sack fastpath hint quite soon... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:03 -08:00
Ilpo Järvinen
85cc391c0e [TCP]: non-FACK SACK follows conservative SACK loss recovery
Many assumptions that are true when no reordering or other
strange events happen are not a part of the RFC3517. FACK
implementation is based on such assumptions. Previously (before
the rewrite) the non-FACK SACK was basically doing fast rexmit
and then it times out all skbs when first cumulative ACK arrives,
which cannot really be called SACK based recovery :-).

RFC3517 SACK disables these things:
- Per SKB timeouts & head timeout entry to recovery
- Marking at least one skb while in recovery (RFC3517 does this
  only for the fast retransmission but not for the other skbs
  when cumulative ACKs arrive in the recovery)
- Sacktag's loss detection flavors B and C (see comment before
  tcp_sacktag_write_queue)

This does not implement the "last resort" rule 3 of NextSeg, which
allows retransmissions also when not enough SACK blocks have yet
arrived above a segment for IsLost to return true [RFC3517].

The implementation differs from RFC3517 in these points:
- Rate-halving is used instead of FlightSize / 2
- Instead of using dupACKs to trigger the recovery, the number
  of SACK blocks is used as FACK does with SACK blocks+holes
  (which provides more accurate number). It seems that the
  difference can affect negatively only if the receiver does not
  generate SACK blocks at all even though it claimed to be
  SACK-capable.
- Dupthresh is not a constant one. Dynamical adjustments include
  both holes and sacked segments (equal to what FACK has) due to
  complexity involved in determining the number sacked blocks
  between highest_sack and the reordered segment. Thus it's will
  be an over-estimate.

Implementation note:

tcp_clean_rtx_queue doesn't need a lost_cnt tweak because head
skb at that point cannot be SACKED_ACKED (nor would such
situation last for long enough to cause problems).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:03 -08:00
Ilpo Järvinen
f577111302 [TCP]: Extend reordering detection to cover CA_Loss partially
This implements more accurately what is stated in sacktag's
overall comment:

  "Both of these heuristics are not used in Loss state, when
   we cannot account for retransmits accurately."

When CA_Loss state is entered, the state changer ensures that
undo_marker is only set if no TCPCB_RETRANS skbs were found,
thus having non-zero undo_marker in CA_Loss basically tells
that the R-bits still accurately reflect the current state
of TCP.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:02 -08:00
Ilpo Järvinen
b9d86585dc [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s
All intermediate conditions include it already, make them
simpler as well.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:01 -08:00
Rainer Jochem
62013dbb84 [IPV4] ipconfig: Implement DHCP Class-identifier
From : Rainer Jochem <rainer.jochem@mpi-sb.mpg.de>

Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:59 -08:00
David S. Miller
294b4baf29 [IPSEC]: Kill afinfo->nf_post_routing
After changeset:

	[NETFILTER]: Introduce NF_INET_ hook values

It always evaluates to NF_INET_POST_ROUTING.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:55 -08:00
Patrick McHardy
6e23ae2a48 [NETFILTER]: Introduce NF_INET_ hook values
The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:55 -08:00
Herbert Xu
1bf06cd2e3 [IPSEC]: Add async resume support on input
This patch adds support for async resumptions on input.  To do so, the
transform would return -EINPROGRESS and subsequently invoke the
function xfrm_input_resume to resume processing.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:54 -08:00
Herbert Xu
60d5fcfb19 [IPSEC]: Remove nhoff from xfrm_input
The nhoff field isn't actually necessary in xfrm_input.  For tunnel
mode transforms we now throw away the output IP header so it makes no
sense to fill in the nexthdr field.  For transport mode we can now let
the function transport_finish do the setting and it knows where the
nexthdr field is.

The only other thing that needs the nexthdr field to be set is the
header extraction code.  However, we can simply move the protocol
extraction out of the generic header extraction.

We want to minimise the amount of info we have to carry around between
transforms as this simplifies the resumption process for async crypto.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:53 -08:00
Herbert Xu
0ebea8ef35 [IPSEC]: Move state lock into x->type->input
This patch releases the lock on the state before calling
x->type->input.  It also adds the lock to the spots where they're
currently needed.

Most of those places (all except mip6) are expected to disappear with
async crypto.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:52 -08:00
Herbert Xu
668dc8af31 [IPSEC]: Move integrity stat collection into xfrm_input
Similar to the moving out of the replay processing on the output, this
patch moves the integrity stat collectin from x->type->input into
xfrm_input.

This would eventually allow transforms such as AH/ESP to be lockless.

The error value EBADMSG (currently unused in the crypto layer) is used
to indicate a failed integrity check.  In future this error can be
directly returned by the crypto layer once we switch to aead
algorithms.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:51 -08:00
Herbert Xu
716062fd4c [IPSEC]: Merge most of the input path
As part of the work on asynchronous cryptographic operations, we need
to be able to resume from the spot where they occur.  As such, it
helps if we isolate them to one spot.

This patch moves most of the remaining family-specific processing into
the common input code.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:50 -08:00
Herbert Xu
862b82c6f9 [IPSEC]: Merge most of the output path
As part of the work on asynchrnous cryptographic operations, we need
to be able to resume from the spot where they occur.  As such, it
helps if we isolate them to one spot.

This patch moves most of the remaining family-specific processing into
the common output code.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:48 -08:00
Herbert Xu
c439cb2e4b [IPV4]: Add ip_local_out
Most callers of the LOCAL_OUT chain will set the IP packet length and
header checksum before doing so.  They also share the same output
function dst_output.

This patch creates a new function called ip_local_out which does all
of that and converts the appropriate users over to it.

Apart from removing duplicate code, it will also help in merging the
IPsec output path once the same thing is done for IPv6.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:47 -08:00
Herbert Xu
227620e295 [IPSEC]: Separate inner/outer mode processing on input
With inter-family transforms the inner mode differs from the outer
mode.  Attempting to handle both sides from the same function means
that it needs to handle both IPv4 and IPv6 which creates duplication
and confusion.

This patch separates the two parts on the input path so that each
function deals with one family only.

In particular, the functions xfrm4_extract_inut/xfrm6_extract_inut
moves the pertinent fields from the IPv4/IPv6 IP headers into a
neutral format stored in skb->cb.  This is then used by the inner mode
input functions to modify the inner IP header.  In this way the input
function no longer has to know about the outer address family.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:46 -08:00
Herbert Xu
36cf9acf93 [IPSEC]: Separate inner/outer mode processing on output
With inter-family transforms the inner mode differs from the outer
mode.  Attempting to handle both sides from the same function means
that it needs to handle both IPv4 and IPv6 which creates duplication
and confusion.

This patch separates the two parts on the output path so that each
function deals with one family only.

In particular, the functions xfrm4_extract_output/xfrm6_extract_output
moves the pertinent fields from the IPv4/IPv6 IP headers into a
neutral format stored in skb->cb.  This is then used by the outer mode
output functions to write the outer IP header.  In this way the output
function no longer has to know about the inner address family.

Since the extract functions are only called by tunnel modes (the only
modes that can support inter-family transforms), I've also moved the
xfrm*_tunnel_check_size calls into them.  This allows the correct ICMP
message to be sent as opposed to now where you might call icmp_send
with an IPv6 packet and vice versa.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:45 -08:00
Herbert Xu
29bb43b4ec [INET]: Give outer DSCP directly to ip*_copy_dscp
This patch changes the prototype of ipv4_copy_dscp and ipv6_copy_dscp so
that they directly take the outer DSCP rather than the outer IP header.
This will help us to unify the code for inter-family tunnels.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:45 -08:00
Herbert Xu
e40b328615 [IPSEC]: Forbid BEET + ipcomp for now
While BEET can theoretically work with IPComp the current code can't
do that because it tries to construct a BEET mode tunnel type which
doesn't (and cannot) exist.  In fact as it is it won't even attach a
tunnel object at all for BEET which is bogus.

To support this fully we'd also need to change the policy checks on
input to recognise a plain tunnel as a legal variant of an optional
BEET transform.

This patch simply fails such constructions for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:43 -08:00
Herbert Xu
25ee3286dc [IPSEC]: Merge common code into xfrm_bundle_create
Half of the code in xfrm4_bundle_create and xfrm6_bundle_create are
common.  This patch extracts that logic and puts it into
xfrm_bundle_create.  The rest of it are then accessed through afinfo.

As a result this fixes the problem with inter-family transforms where
we treat every xfrm dst in the bundle as if it belongs to the top
family.

This patch also fixes a long-standing error-path bug where we may free
the xfrm states twice.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:43 -08:00
Herbert Xu
66cdb3ca27 [IPSEC]: Move flow construction into xfrm_dst_lookup
This patch moves the flow construction from the callers of
xfrm_dst_lookup into that function.  It also changes xfrm_dst_lookup
so that it takes an xfrm state as its argument instead of explicit
addresses.

This removes any address-specific logic from the callers of
xfrm_dst_lookup which is needed to correctly support inter-family
transforms.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:42 -08:00
Herbert Xu
fff6938880 [IPSEC]: Make sure idev is consistent with dev in xfrm_dst
Previously we took the device from the bottom route and idev from the
top route.  This is bad because idev may well point to a different
device.  This patch changes it so that we get the idev from the device
directly.

It also makes it an error if either dev or idev is NULL.  This is
consistent with the rest of the routing code which also treats these
cases as errors.

I've removed the err initialisation in xfrm6_policy.c because it
achieves no purpose and hid a bug when an initial version of this
patch neglected to set err to -ENODEV (fortunately the IPv4 version
warned about it).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:40 -08:00
Herbert Xu
45ff5a3f9a [IPSEC]: Set dst->input to dst_discard
The input function should never be invoked on IPsec dst objects.  This
is because we don't apply IPsec on input until after we've made the
routing decision.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:40 -08:00
Herbert Xu
8ce68ceb55 [IPSEC]: Only set neighbour on top xfrm dst
The neighbour field is only used by dst_confirm which only ever happens on
the top-most xfrm dst.  So it's a waste to duplicate for every other xfrm
dst.  This patch moves its setting out of the loop so that only the top one
gets set.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:39 -08:00
Herbert Xu
352e512c32 [NET]: Eliminate duplicate copies of dst_discard
We have a number of copies of dst_discard scattered around the place
which all do the same thing, namely free a packet on the input or
output paths.

This patch deletes all of them except dst_discard and points all the
users to it.

The only non-trivial bit is decnet where it returns an error.
However, conceptually this is identical to the blackhole functions
used in IPv4 and IPv6 which do not return errors.  So they should
either all return errors or all return zero.  For now I've stuck with
the majority and picked zero as the return value.

It doesn't really matter in practice since few if any driver would
react differently depending on a zero return value or NET_RX_DROP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:37 -08:00
Herbert Xu
b4ce92775c [IPV6]: Move nfheader_len into rt6_info
The dst member nfheader_len is only used by IPv6.  It's also currently
creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.

It also reorders the fields in rt6_info to minimize holes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:37 -08:00
Pavel Emelyanov
b24b8a247f [NET]: Convert init_timer into setup_timer
Many-many code in the kernel initialized the timer->function
and  timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.

The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:35 -08:00
Wang Chen
33c732c361 [IPV4]: Add raw drops counter.
Add raw drops counter for IPv4 in /proc/net/raw .

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:33 -08:00
Adrian Bunk
6ff7751d06 [TCP]: Make tcp_splice_data_recv() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:32 -08:00
Jens Axboe
9c55e01c0c [TCP]: Splice receive support.
Support for network splice receive.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:31 -08:00
Rolf Manderscheid
a9e527e3f9 IPoIB: improve IPv4/IPv6 to IB mcast mapping functions
An IPoIB subnet on an IB fabric that spans multiple IB subnets can't
use link-local scope in multicast GIDs.  The existing routines that
map IP/IPv6 multicast addresses into IB link-level addresses hard-code
the scope to link-local, and they also leave the partition key field
uninitialised.  This patch adds a parameter (the link-level broadcast
address) to the mapping routines, allowing them to initialise both the
scope and the P_Key appropriately, and fixes up the call sites.

The next step will be to add a way to configure the scope for an IPoIB
interface.

Signed-off-by: Rolf Manderscheid <rvm@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:37 -08:00
Herbert Xu
f945fa7ad9 [INET]: Fix truesize setting in ip_append_data
As it is ip_append_data only counts page fragments to the skb that
allocated it.  As such it means that the first skb gets hit with a
4K charge even though it might have only used a fraction of it while
all subsequent skb's that use the same page gets away with no charge
at all.

This bug was exposed by the UDP accounting patch.

[ The wmem_alloc bumping needs to be moved with the truesize,
  noticed by Takahiro Yasui.  -DaveM ]

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-23 03:11:43 -08:00
David S. Miller
1e34a11d55 [IPV4]: Add missing skb->truesize increment in ip_append_page().
And as noted by Takahiro Yasui, we thus need to bump the
sk->sk_wmem_alloc at this spot as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-23 03:11:40 -08:00
Wang Chen
5b4d383a1a [ICMP]: ICMP_MIB_OUTMSGS increment duplicated
Commit "96793b482540f3a26e2188eaf75cb56b7829d3e3" (Add ICMPMsgStats
MIB (RFC 4293)) made a mistake.

In that patch, David L added a icmp_out_count() in
ip_push_pending_frames(), remove icmp_out_count() from
icmp_reply(). But he forgot to remove icmp_out_count() from
icmp_send() too.  Since icmp_send and icmp_reply will call
icmp_push_reply, which will call ip_push_pending_frames, a duplicated
increment happened in icmp_send.

This patch remove the icmp_out_count from icmp_send too.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-21 03:39:45 -08:00
Eric Dumazet
8d3f099abe [IPV4] FIB_HASH : Avoid unecessary loop in fn_hash_dump_zone()
I noticed "ip route list" was slower than "cat /proc/net/route" on a
machine with a full Internet routing table (214392 entries : Special
thanks to Robert ;) )

This is similar to problem reported in commit
d8c9283089 ("[IPV4] ROUTE: ip_rt_dump()
is unecessary slow")

Fix is to avoid scanning the begining of fz_hash table, but directly
seek to the right offset.

Before patch :

time ip route >/tmp/ROUTE

real    0m1.285s
user    0m0.712s
sys     0m0.436s

After patch

# time ip route >/tmp/ROUTE

real    0m0.835s
user    0m0.692s
sys     0m0.124s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-20 20:31:39 -08:00
Joonwoo Park
6725033fa2 [IPV4] fib_trie: fix duplicated route issue
http://bugzilla.kernel.org/show_bug.cgi?id=9493

The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.

Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-20 20:31:38 -08:00
Joonwoo Park
bd566e7525 [IPV4] fib_hash: fix duplicated route issue
http://bugzilla.kernel.org/show_bug.cgi?id=9493

The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.

Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-20 20:31:37 -08:00
Eric Dumazet
0bcceadceb [IPV4] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
In rt_cache_get_next(), no need to guard seq->private by a
rcu_dereference() since seq is private to the thread running this
function. Reading seq.private once (as guaranted bu rcu_dereference())
or several time if compiler really is dumb enough wont change the
result.

But we miss real spots where rcu_dereference() are needed, both in
rt_cache_get_first() and rt_cache_get_next()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-10 03:55:57 -08:00
Brice Goglin
877364e60e [LRO] Fix lro_mgr->features checks
lro_mgr->features contains a bitmask of LRO_F_* values which are
defined as power of two, not as bit indexes.
They must be checked with x&LRO_F_FOO, not with test_bit(LRO_F_FOO,&x).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Andrew Gallatin <gallatin@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08 23:30:18 -08:00
Eric Dumazet
d8c9283089 [IPV4] ROUTE: ip_rt_dump() is unecessary slow
I noticed "ip route list cache x.y.z.t" can be *very* slow.

While strace-ing -T it I also noticed that first part of route cache
is fetched quite fast :

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.000047>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\
202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.000042>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\
202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3740 <0.000055>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\
202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.000043>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\
202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3732 <0.000053>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3708 <0.000052>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3680 <0.000041>

while the part at the end of the table is more expensive:

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3656 <0.003857>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.003891>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.003765>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3700 <0.003879>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3676 <0.003797>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3724 <0.003856>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\202GXm\0\0\2  \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.003848>

The following patch corrects this performance/latency problem,
removing quadratic behavior.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08 23:30:16 -08:00
Amos Waterland
92ffb85dd3 [IPV4] ipconfig: Fix regression in ip command line processing
The recent changes for ip command line processing fixed some problems
but unfortunately broke some common usage scenarios.  In current
2.6.24-rc6 the following command line results in no IP address
assignment, which is surely a regression:

 ip=10.0.2.15::10.0.2.2:255.255.255.0::eth0:off

Please find below a patch that works for all cases I can find.

Signed-off-by: Amos Waterland <apw@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08 23:29:58 -08:00
Herbert Xu
f844c74fe0 [IPV4] raw: Strengthen check on validity of iph->ihl
We currently check that iph->ihl is bounded by the real length and that
the real length is greater than the minimum IP header length.  However,
we did not check the caes where iph->ihl is less than the minimum IP
header length.

This breaks because some ip_fast_csum implementations assume that which
is quite reasonable.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08 23:29:57 -08:00
Mark McLoughlin
44344b2a85 [INET]: Fix netdev renaming and inet address labels
When re-naming an interface, the previous secondary address
labels get lost e.g.

  $> brctl addbr foo
  $> ip addr add 192.168.0.1 dev foo
  $> ip addr add 192.168.0.2 dev foo label foo:00
  $> ip addr show dev foo | grep inet
    inet 192.168.0.1/32 scope global foo
    inet 192.168.0.2/32 scope global foo:00
  $> ip link set foo name bar
  $> ip addr show dev bar | grep inet
    inet 192.168.0.1/32 scope global bar
    inet 192.168.0.2/32 scope global bar:2

Turns out to be a simple thinko in inetdev_changename() - clearly we
want to look at the address label, rather than the device name, for
a suffix to retain.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-04 03:55:34 -08:00
Gavin McCullagh
2072c228c9 [TCP]: use non-delayed ACK for congestion control RTT
When a delayed ACK representing two packets arrives, there are two RTT
samples available, one for each packet.  The first (in order of seq
number) will be artificially long due to the delay waiting for the
second packet, the second will trigger the ACK and so will not itself
be delayed.

According to rfc1323, the SRTT used for RTO calculation should use the
first rtt, so receivers echo the timestamp from the first packet in
the delayed ack.  For congestion control however, it seems measuring
delayed ack delay is not desirable as it varies independently of
congestion.

The patch below causes seq_rtt and last_ackt to be updated with any
available later packet rtts which should have less (and hopefully
zero) delack delay.  The rtt value then gets passed to
ca_ops->pkts_acked().

Where TCP_CONG_RTT_STAMP was set, effort was made to supress RTTs from
within a TSO chunk (!fully_acked), using only the final ACK (which
includes any TSO delay) to generate RTTs.  This patch removes these
checks so RTTs are passed for each ACK to ca_ops->pkts_acked().

For non-delay based congestion control (cubic, h-tcp), rtt is
sometimes used for rtt-scaling.  In shortening the RTT, this may make
them a little less aggressive.  Delay-based schemes (eg vegas, veno,
illinois) should get a cleaner, more accurate congestion signal,
particularly for small cwnds.  The congestion control module can
potentially also filter out bad RTTs due to the delayed ack alarm by
looking at the associated cnt which (where delayed acking is in use)
should probably be 1 if the alarm went off or greater if the ACK was
triggered by a packet.

Signed-off-by: Gavin McCullagh <gavin.mccullagh@nuim.ie>
Acked-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-29 19:11:21 -08:00
Simon Horman
9cecd07c3f [IPV4] Fix ip=dhcp regression
David Brownell pointed out a regression in my recent "Fix ip command
line processing" patch. It turns out to be a fairly blatant oversight on
my part whereby ic_enable is never set, and thus autoconfiguration is
never enabled. Clearly my testing was broken :-(

The solution that I have is to set ic_enable to 1 if we hit
ip_auto_config_setup(), which basically means that autoconfiguration is
activated unless told otherwise. I then flip ic_enable to 0 if ip=off,
ip=none, ip=::::::off or ip=::::::none using ic_proto_name();

The incremental patch is below, let me know if a non-incremental version
is prepared, as I did as for the original patch to be reverted pending a
fix.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-28 13:39:11 -08:00
Simon Horman
a6c05c3d06 [IPV4]: Fix ip command line processing.
Recently the documentation in Documentation/nfsroot.txt was
update to note that in fact ip=off and ip=::::::off as the
latter is ignored and the default (on) is used.

This was certainly a step in the direction of reducing confusion.
But it seems to me that the code ought to be fixed up so that
ip=::::::off actually turns off ip autoconfiguration.

This patch also notes more specifically that ip=on (aka ip=::::::on)
is the default.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-26 19:36:36 -08:00
Patrick McHardy
fae718ddaf [NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility
Some users do "modprobe ip_conntrack hashsize=...". Since we have the
module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the
hashsize parameter is unknown for nf_conntrack_ipv4 however and makes
it fail.

Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4.

Note: the nf_conntrack message in the ringbuffer will display an
incorrect hashsize since nf_conntrack is first pulled in as a
dependency and calculates the size itself, then it gets changed
through a call to nf_conntrack_set_hashsize().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-26 19:36:33 -08:00
Denis V. Lunev
d883a03671 [IPV4]: OOPS with NETLINK_FIB_LOOKUP netlink socket
[ Regression added by changeset:
	cd40b7d398
	[NET]: make netlink user -> kernel interface synchronious
  -DaveM ]

nl_fib_input re-reuses incoming skb to send the reply. This means that this
packet will be freed twice, namely in:
- netlink_unicast_kernel
- on receive path
Use clone to send as a cure, the caller is responsible for kfree_skb on error.

Thanks to Alexey Dobryan, who originally found the problem.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-21 02:01:53 -08:00
Joe Perches
e00ccd4a78 [NETFILTER] ipv4: Spelling fixes
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-20 14:05:03 -08:00
Timo Teras
1d06916747 [IPV4] ip_gre: set mac_header correctly in receive path
mac_header update in ipgre_recv() was incorrectly changed to
skb_reset_mac_header() when it was introduced.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-20 00:10:33 -08:00
Mark Ryden
e0260feddf [IPV4] ARP: Remove not used code
In arp_process() (net/ipv4/arp.c), there is unused code: definition
and assignment of tha (target hw address ).

Signed-off-by: Mark Ryden <markryde@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-19 23:38:11 -08:00
Satoru SATOH
488faa2ae3 [IPV4]: Make tcp_input_metrics() get minimum RTO via tcp_rto_min()
tcp_input_metrics() refers to the built-time constant TCP_RTO_MIN
regardless of configured minimum RTO with iproute2.

Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-16 14:00:19 -08:00
Amos Waterland
f33e1d9fa2 [IPV4]: Updates to nfsroot documentation
The difference between ip=off and ip=::::::off has been a cause of much
confusion.  Document how each behaves, and do not contradict ourselves by
saying that "off" is the default when in fact "any" is the default and is
descibed as being so lower in the file.

Signed-off-by: Amos Waterland <apw@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-14 13:54:40 -08:00
Patrick McHardy
a18aa31b77 [NETFILTER]: ip_tables: fix compat copy race
When copying entries to user, the kernel makes two passes through the
data, first copying all the entries, then fixing up names and counters.
On the second pass it copies the kernel and match data from userspace
to the kernel again to find the corresponding structures, expecting
that kernel pointers contained in the data are still valid.

This is obviously broken, fix by avoiding the second pass completely
and fixing names and counters while dumping the ruleset, using the
kernel-internal data structures.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-14 13:54:35 -08:00
Thomas Graf
2017a72c07 [IPv4] ESP: Discard dummy packets introduced in rfc4303
RFC4303 introduces dummy packets with a nexthdr value of 59
to implement traffic confidentiality. Such packets need to
be dropped silently and the payload may not be attempted to
be parsed as it consists of random chunk.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-11 02:45:26 -08:00
Pavel Emelyanov
a4e65d36a9 [IPV4]: Swap the ifa allocation with the"ipv4_devconf_setall" call
According to Herbert, the ipv4_devconf_setall should be called
only when the ifa is added to the device. However, failed
ifa allocation may bring things into inconsistent state.

Move the call to ipv4_devconf_setall after the ifa allocation.

Fits both net-2.6 (with offsets) and net-2.6.25 (cleanly).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-11 02:45:25 -08:00
Denis V. Lunev
56c99d0415 [IPV4]: Remove prototype of ip_rt_advice
ip_rt_advice has been gone, so no need to keep prototype and debug message.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-07 01:07:38 -08:00
Mitsuru Chinen
7f53878dc2 [IPv4]: Reply net unreachable ICMP message
IPv4 stack doesn't reply any ICMP destination unreachable message
with net unreachable code when IP detagrams are being discarded
because of no route could be found in the forwarding path.
Incidentally, IPv6 stack replies such ICMPv6 message in the similar
situation.

Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-07 01:07:24 -08:00
Andrew Gallatin
621544eb8c [LRO]: fix lro_gen_skb() alignment
Add a field to the lro_mgr struct so that drivers can specify how much
padding is required to align layer 3 headers when a packet is copied
into a freshly allocated skb by inet_lro.c:lro_gen_skb().  Without
padding, skbs generated by LRO will cause alignment warnings on
architectures which require strict alignment (seen on sparc64).

Myri10GE is updated to use this field.

Signed-off-by: Andrew Gallatin <gallatin@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:32 -08:00
Ilpo Jrvinen
4e67d876ce [TCP]: NAGLE_PUSH seems to be a wrong way around
The comment in tcp_nagle_test suggests that. This bug is very
very old, even 2.4.0 seems to have it.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:31 -08:00
Ilpo Jrvinen
52d3408150 [TCP]: Move prior_in_flight collect to more robust place
The previous location is after sacktag processing, which affects
counters tcp_packets_in_flight depends on. This may manifest as
wrong behavior if new SACK blocks are present and all is clear
for call to tcp_cong_avoid, which in the case of
tcp_reno_cong_avoid bails out early because it thinks that
TCP is not limited by cwnd.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:30 -08:00
Ilpo Jrvinen
3e6f049e0c [TCP] FRTO: Use of existing funcs make code more obvious & robust
Though there's little need for everything that tcp_may_send_now
does (actually, even the state had to be adjusted to pass some
checks FRTO does not want to occur), it's more robust to let it
make the decision if sending is allowed. State adjustments
needed:
- Make sure snd_cwnd limit is not hit in there
- Disable nagle (if necessary) through the frto_counter == 2

The result of check for frto_counter in argument to call for
tcp_enter_frto_loss can just be open coded, therefore there
isn't need to store the previous frto_counter past
tcp_may_send_now.

In addition, returns can then be combined.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:29 -08:00
Pavel Emelyanov
4ac63ad6c5 [IPVS]: Fix sched registration race when checking for name collision.
The register_ip_vs_scheduler() checks for the scheduler with the
same name under the read-locked __ip_vs_sched_lock, then drops,
takes it for writing and puts the scheduler in list.

This is racy, since we can have a race window between the lock
being re-locked for writing.

The fix is to search the scheduler with the given name right under
the write-locked __ip_vs_sched_lock.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:27 -08:00
Pavel Emelyanov
a014bc8f0f [IPVS]: Don't leak sysctl tables if the scheduler registration fails.
In case we load lblc or lblcr module we can leak some sysctl
tables if the call to register_ip_vs_scheduler() fails.

I've looked at the register_ip_vs_scheduler() code and saw, that
the only reason to fail is the name collision, so I think that
with some 3rd party schedulers this becomes a relevant issue. No?

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:26 -08:00
Herbert Xu
d523a328fb [INET]: Fix inet_diag dead-lock regression
The inet_diag register fix broke inet_diag module loading because the
loaded module had to take the same mutex that's already held by the
loader in order to register the new handler.

This patch fixes it by introducing a separate mutex to protect the
handling of handlers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-12-03 15:51:25 +11:00
Stephen Hemminger
a357dde9df [TCP] illinois: Incorrect beta usage
Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly:
  The parameter  beta  in the paper specifies the amount to decrease
  *by*:  that is, on loss,
     W <-  W -  beta*W
  but in   tcp_illinois_ssthresh() uses  beta  as the amount
  to decrease  *to*: W <- beta*W

This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network,
hurting performance. Note: since the base beta value is .5, it has no
impact on a congested network.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-30 01:10:55 +11:00
Pavel Emelyanov
076931989f [INET]: Fix inet_diag register vs rcv race
The following race is possible when one cpu unregisters the handler
while other one is trying to receive a message and call this one:

CPU1:                                                 CPU2:
inet_diag_rcv()                                       inet_diag_unregister()
  mutex_lock(&inet_diag_mutex);
  netlink_rcv_skb(skb, &inet_diag_rcv_msg);
    if (inet_diag_table[nlh->nlmsg_type] == 
                               NULL) /* false handler is still registered */
    ...
    netlink_dump_start(idiagnl, skb, nlh,
                           inet_diag_dump, NULL);
           cb = kzalloc(sizeof(*cb), GFP_KERNEL);
                   /* sleep here freeing memory 
                    * or preempt
                    * or sleep later on nlk->cb_mutex
                    */
                                                         spin_lock(&inet_diag_register_lock);
                                                         inet_diag_table[type] = NULL;
    ...                                                  spin_unlock(&inet_diag_register_lock);
                                                         synchronize_rcu();
                                                         /* CPU1 is sleeping - RCU quiescent
                                                          * state is passed
                                                          */
                                                         return;
    /* inet_diag_dump is finally called: */
    inet_diag_dump()
      handler = inet_diag_table[cb->nlh->nlmsg_type];
      BUG_ON(handler == NULL); 
      /* OOPS! While we slept the unregister has set
       * handler to NULL :(
       */

Grep showed, that the register/unregister functions are called
from init/fini module callbacks for tcp_/dccp_diag, so it's OK
to use the inet_diag_mutex to synchronize manipulations with the
inet_diag_table and the access to it.

Besides, as Herbert pointed out, asynchronous dumps should hold 
this mutex as well, and thus, we provide the mutex as cb_mutex one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-30 00:08:14 +11:00
Adrian Bunk
3660019e5f [IPV4]: Remove bogus ifdef mess in arp_process
The #ifdef's in arp_process() were not only a mess, they were also wrong 
in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or 
CONFIG_NETDEV_10000=y) cases.

Since they are not required this patch removes them.

Also removed are some #ifdef's around #include's that caused compile 
errors after this change.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-26 23:17:53 +08:00
Ilpo Järvinen
7f9c33e515 [TCP] MTUprobe: Cleanup send queue check (no need to loop)
The original code has striking complexity to perform a query
which can be reduced to a very simple compare.

FIN seqno may be included to write_seq but it should not make
any significant difference here compared to skb->len which was
used previously. One won't end up there with SYN still queued.

Use of write_seq check guarantees that there's a valid skb in
send_head so I removed the extra check.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: John Heffner <jheffner@psc.edu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-23 19:10:56 +08:00
Ilpo Järvinen
91cc17c0e5 [TCP]: MTUprobe: receiver window & data available checks fixed
It seems that the checked range for receiver window check should
begin from the first rather than from the last skb that is going
to be included to the probe. And that can be achieved without
reference to skbs at all, snd_nxt and write_seq provides the
correct seqno already. Plus, it SHOULD account packets that are
necessary to trigger fast retransmit [RFC4821].

Location of snd_wnd < probe_size/size_needed check is bogus
because it will cause the other if() match as well (due to
snd_nxt >= snd_una invariant).

Removed dead obvious comment.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-23 19:08:16 +08:00
Pavel Emelyanov
d535a916cd [IPVS]: Fix compiler warning about unused register_ip_vs_protocol
This is silly, but I have turned the CONFIG_IP_VS to m,
to check the compilation of one (recently sent) fix
and set all the CONFIG_IP_VS_PROTO_XXX options to n to
speed up the compilation.

In this configuration the compiler warns me about

  CC [M]  net/ipv4/ipvs/ip_vs_proto.o
net/ipv4/ipvs/ip_vs_proto.c:49: warning: 'register_ip_vs_protocol' defined but not used

Indeed. With no protocols selected there are no
calls to this function - all are compiled out with
ifdefs.

Maybe the best fix would be to surround this call with
ifdef-s or tune the Kconfig dependences, but I think that
marking this register function as __used is enough. No?

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:44:01 -08:00
Jonas Danielsson
b4a9811c42 [ARP]: Fix arp reply when sender ip 0
Fix arp reply when received arp probe with sender ip 0.

Send arp reply with target ip address 0.0.0.0 and target hardware
address set to hardware address of requester. Previously sent reply
with target ip address and target hardware address set to same as
source fields.

Signed-off-by: Jonas Danielsson <the.sator@gmail.com>
Acked-by: Alexey Kuznetov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:38:16 -08:00
YOSHIFUJI Hideaki
354faf0977 [IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:31 -08:00
YOSHIFUJI Hideaki
a80cc20da4 [IPV4] TCPMD5: Omit redundant NULL check for kfree() argument.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:06 -08:00
Evgeniy Polyakov
1f305323ff [NETFILTER]: Fix kernel panic with REDIRECT target.
When connection tracking entry (nf_conn) is about to copy itself it can
have some of its extension users (like nat) as being already freed and
thus not required to be copied.

Actually looking at this function I suspect it was copied from
nf_nat_setup_info() and thus bug was introduced.

Report and testing from David <david@unsolicited.net>.

[ Patrick McHardy states:

	I now understand whats happening:

	- new connection is allocated without helper
	- connection is REDIRECTed to localhost
	- nf_nat_setup_info adds NAT extension, but doesn't initialize it yet
	- nf_conntrack_alter_reply performs a helper lookup based on the
	   new tuple, finds the SIP helper and allocates a helper extension,
	   causing reallocation because of too little space
	- nf_nat_move_storage is called with the uninitialized nat extension

	So your fix is entirely correct, thanks a lot :)  ]

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 04:27:35 -08:00
Joe Perches
464c4f184a [IPV4]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:46:29 -08:00
Sam Jansen
5487796f0c [TCP]: Problem bug with sysctl_tcp_congestion_control function
From: "Sam Jansen" <sjansen@google.com>

sysctl_tcp_congestion_control seems to have a bug that prevents it
from actually calling the tcp_set_default_congestion_control
function. This is not so apparent because it does not return an error
and generally the /proc interface is used to configure the default TCP
congestion control algorithm.  This is present in 2.6.18 onwards and
probably earlier, though I have not inspected 2.6.15--2.6.17.

sysctl_tcp_congestion_control calls sysctl_string and expects a successful
return code of 0. In such a case it actually sets the congestion control
algorithm with tcp_set_default_congestion_control. Otherwise, it returns the
value returned by sysctl_string. This was correct in 2.6.14, as sysctl_string
returned 0 on success. However, sysctl_string was updated to return 1 on
success around about 2.6.15 and sysctl_tcp_congestion_control was not updated.
Even though sysctl_tcp_congestion_control returns 1, do_sysctl_strategy
converts this return code to '0', so the caller never notices the error.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:28:21 -08:00
Ilpo Jrvinen
6e42141009 [TCP] MTUprobe: fix potential sk_send_head corruption
When the abstraction functions got added, conversion here was
made incorrectly. As a result, the skb may end up pointing
to skb which got included to the probe skb and then was freed.
For it to trigger, however, skb_transmit must fail sending as
well.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:24:09 -08:00
Simon Horman
9055fa1f3d [IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED
Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED,
I stronly doubt that anyone is using the sys_sysctl interface to
these variables.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:51:13 -08:00
Simon Horman
9e103fa6bd [IPVS]: Fix sysctl warnings about missing strategy in schedulers
sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy

Switch these entried over to use CTL_UNNUMBERED as clearly
the sys_syscal portion wasn't working.

This is along the same lines as Christian Borntraeger's patch that fixes
up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:50:21 -08:00
Christian Borntraeger
611cd55b15 [IPVS]: Fix sysctl warnings about missing strategy
Running the latest git code I get the following messages during boot:
sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy
[...]		  
sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy

I removed the binary sysctl handler for those messages and also removed
the definitions in ip_vs.h. The alternative would be to implement a 
proper strategy handler, but syscall sysctl is deprecated.

There are other sysctl definitions that are commented out or work with 
the default sysctl_data strategy. I did not touch these. 

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:49:25 -08:00
Eric Dumazet
483b23ffa3 [NET]: Corrects a bug in ip_rt_acct_read()
It seems that stats of cpu 0 are counted twice, since
for_each_possible_cpu() is looping on all possible cpus, including 0

Before percpu conversion of ip_rt_acct, we should also remove the
assumption that CPU 0 is online (or even possible)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-18 18:47:38 -08:00
Eric Dumazet
d90bf5a976 [NET]: rt_check_expire() can take a long time, add a cond_resched()
On commit 39c90ece75:

	[IPV4]: Convert rt_check_expire() from softirq processing to workqueue.

we converted rt_check_expire() from softirq to workqueue, allowing the
function to perform all work it was supposed to do.

When the IP route cache is big, rt_check_expire() can take a long time
to run.  (default settings : 20% of the hash table is scanned at each
invocation)

Adding cond_resched() helps giving cpu to higher priority tasks if
necessary.

Using a "if (need_resched())" test before calling "cond_resched();" is
necessary to avoid spending too much time doing the resched check.
(My tests gave a time reduction from 88 ms to 25 ms per
rt_check_expire() run on my i686 test machine)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 16:14:05 -08:00
Ilpo Jrvinen
e1cd8f78f8 [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
I broke this in commit 3de96471bd:

	[TCP]: Wrap-safed reordering detection FRTO check

tcp_process_frto should always see a valid frto_highmark. An invalid
frto_highmark (zero) is very likely what ultimately caused a seqno
compare in tcp_frto_enter_loss to do the wrong leading to the LOST-bit
leak.

Having LOST-bits integry ensured like done after commit
23aeeec365:

	[TCP] FRTO: Plug potential LOST-bit leak

won't hurt. It may still be useful in some other, possibly legimate,
scenario.

Reported by Chazarain Guillaume <guichaz@yahoo.fr>.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 15:55:09 -08:00
Ilpo Jrvinen
96a2d41a3e [TCP]: Make sure write_queue_from does not begin with NULL ptr
NULL ptr can be returned from tcp_write_queue_head to cached_skb
and then assigned to skb if packets_out was zero. Without this,
system is vulnerable to a carefully crafted ACKs which obviously
is remotely triggerable.

Besides, there's very little that needs to be done in sacktag
if there weren't any packets outstanding, just skipping the rest
doesn't hurt.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 15:47:18 -08:00
Ilpo Jrvinen
23aeeec365 [TCP] FRTO: Plug potential LOST-bit leak
It might be possible that, in some extreme scenario that
I just cannot now construct in my mind, end_seq <=
frto_highmark check does not match causing the lost_out
and LOST bits become out-of-sync due to clearing and
recounting in the loop.

This may fix LOST-bit leak reported by Chazarain Guillaume
<guichaz@yahoo.fr>.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:03:13 -08:00
Ilpo Jrvinen
746aa32d28 [TCP] FRTO: Limit snd_cwnd if TCP was application limited
Otherwise TCP might violate packet ordering principles that FRTO
is based on. If conventional recovery path is chosen, this won't
be significant at all. In practice, any small enough value will
be sufficient to provide proper operation for FRTO, yet other
users of snd_cwnd might benefit from a "close enough" value.

FRTO's formula is now equal to what tcp_enter_cwr() uses.

FRTO used to check application limitedness a bit differently but
I changed that in commit 575ee7140d
and as a result checking for application limitedness became
completely non-existing.

Signed-off-by: Ilpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:01:23 -08:00
Li Zefan
e0bf9cf15f [NETFILTER]: nf_nat: fix memset error
The size passing to memset is the size of a pointer.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 02:57:16 -08:00
Pavel Emelyanov
d71209ded2 [INET]: Use list_head-s in inetpeer.c
The inetpeer.c tracks the LRU list of inet_perr-s, but makes
it by hands. Use the list_head-s for this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:27:28 -08:00
Adrian Bunk
22649d1afb [IPVS]: Remove unused exports.
This patch removes the following unused EXPORT_SYMBOL's:
- ip_vs_try_bind_dest
- ip_vs_find_dest

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:25:24 -08:00
Denis V. Lunev
2994c63863 [INET]: Small possible memory leak in FIB rules
This patch fixes a small memory leak. Default fib rules can be deleted by
the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by
	ip rule flush

Such a rule will not be freed as the ref-counter has 2 on start and becomes
clearly unreachable after removal.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:12:03 -08:00
Pavel Emelyanov
358352b8b8 [INET]: Cleanup the xfrm4_tunnel_(un)register
Both check for the family to select an appropriate tunnel list.
Consolidate this check and make the for() loop more readable.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:48:54 -08:00
Pavel Emelyanov
99f933263a [INET]: Add missed tunnel64_err handler
The tunnel64_protocol uses the tunnel4_protocol's err_handler and
thus calls the tunnel4_protocol's handlers.

This is not very good, as in case of (icmp) error the wrong error
handlers will be called (e.g. ipip ones instead of sit) and this
won't be noticed at all, because the error is not reported.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:47:39 -08:00