path: root/net/ipv4
Age        Commit message        Author
2015-06-11tcp: fill shinfo->gso_size at last momentEric Dumazet
In commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location") we stored gso_segs in a temporary cache hot location. This patch does the same for gso_size. This allows us to save two cache line misses in the tcp xmit path for the last packet that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
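A minimal sketch of the idea, with illustrative field names (the real patch keeps the cached values in tcp_skb_cb, inside skb->cb[]): hold the GSO fields in cache-hot per-skb storage and copy them into skb_shinfo() only when the skb is actually transmitted.

    /* Hedged sketch, not the kernel's exact layout. */
    #include <linux/skbuff.h>

    struct tcp_cb_sketch {          /* lives in cache-hot skb->cb[] */
        u16 tcp_gso_segs;
        u16 tcp_gso_size;
    };

    static void tcp_finalize_gso(struct sk_buff *skb,
                                 const struct tcp_cb_sketch *cb)
    {
        /* Touch the (possibly cold) shared info area only now. */
        skb_shinfo(skb)->gso_segs = cb->tcp_gso_segs;
        skb_shinfo(skb)->gso_size = cb->tcp_gso_size;
    }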
2015-06-11tcp: tcp_set_skb_tso_segs() no longer need struct sock parameterEric Dumazet
tcp_set_skb_tso_segs() & tcp_init_tso_segs() no longer use the sock pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: fill shinfo->gso_type at last momentEric Dumazet
Our goal is to touch skb_shinfo(skb) only when absolutely needed, to avoid two cache line misses in the TCP output path for the last skb that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...). A packet is GSO only when skb_shinfo(skb)->gso_size is not zero. We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for non-GSO packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: reserve tcp_skb_mss() to tcp stackEric Dumazet
tcp_gso_segment() and tcp_gro_receive() are not strictly part of the TCP stack. They should not assume tcp_skb_mss(skb) is in fact skb_shinfo(skb)->gso_size. This will allow us to change tcp_skb_mss() in the following patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: add CDG congestion controlKenneth Klette Jonassen
CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies the TCP sender in order to [1]:

  o Use the delay gradient as a congestion signal.
  o Back off with an average probability that is independent of the RTT.
  o Coexist with flows that use loss-based congestion control, i.e., flows that are unresponsive to the delay signal.
  o Tolerate packet loss unrelated to congestion. (Disabled by default.)

Its FreeBSD implementation was presented for the ICCRG in July 2012; slides are available at http://www.ietf.org/proceedings/84/iccrg.html

Running the experiment scenarios in [1] suggests that our implementation achieves more goodput compared with FreeBSD 10.0 senders, although it also causes more queueing delay for a given backoff factor. The loss tolerance heuristic is disabled by default due to safety concerns for its use in the Internet [2, p. 45-46]. We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce the probability of slow start overshoot.

[1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
[2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux." MSc thesis. Department of Informatics, University of Oslo, 2015.

Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
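An illustrative computation of a smoothed delay gradient in the spirit of CDG; the names and the smoothing constant here are assumptions, not the actual tcp_cdg.c code:

    #include <linux/types.h>

    /* Hedged sketch: per-interval gradient of the minimum RTT,
     * smoothed with a moving average; a persistently positive
     * gradient signals a growing queue.
     */
    struct cdg_sketch {
        s32 grad_smoothed;   /* smoothed gradient (us per interval) */
        s32 prev_rtt_min;    /* min RTT seen in the previous interval */
    };

    static s32 cdg_gradient(struct cdg_sketch *ca, s32 rtt_min)
    {
        s32 grad = rtt_min - ca->prev_rtt_min;

        ca->prev_rtt_min = rtt_min;
        /* moving average over 2^3 samples; 3 is an assumed constant */
        ca->grad_smoothed += (grad - ca->grad_smoothed) >> 3;
        return ca->grad_smoothed;
    }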
2015-06-11tcp: export tcp_enter_cwr()Kenneth Klette Jonassen
Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this function so that CDG can be compiled as a module. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10net: tcp: dctcp_update_alpha() fixes.Eric Dumazet
dctcp_alpha can be read from dctcp_get_info() without synchronization, so use WRITE_ONCE() to prevent the compiler from using dctcp_alpha as a temporary variable. Also, playing with a small dctcp_shift_g (like 1) can expose an overflow with 32bit values shifted 9 times before the divide. Use a u64 field to avoid this problem, and perform the divide only if acked_bytes_ecn is not zero. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
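A sketch along the lines of the fix (the struct and the 1024 scale mirror dctcp but are reconstructed here, not quoted from the patch):

    #include <linux/kernel.h>
    #include <linux/math64.h>

    struct dctcp_sketch {
        u32 dctcp_alpha;        /* EWMA of ECN-marked fraction, <<10 scale */
        u64 acked_bytes_ecn;    /* u64 so the shift below cannot overflow */
        u32 acked_bytes_total;
    };

    /* Assumes shift_g <= 10, as for the dctcp_shift_g module param. */
    static void dctcp_update_alpha_sketch(struct dctcp_sketch *ca, u32 shift_g)
    {
        u64 bytes_ecn = ca->acked_bytes_ecn;
        u32 alpha = ca->dctcp_alpha;

        alpha -= alpha >> shift_g;
        if (bytes_ecn) {                 /* divide only when needed */
            bytes_ecn <<= (10 - shift_g);
            do_div(bytes_ecn, max(1U, ca->acked_bytes_total));
            alpha = min(alpha + (u32)bytes_ecn, 1024U);
        }
        /* lockless readers consume this value: publish it atomically */
        WRITE_ONCE(ca->dctcp_alpha, alpha);
    }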
2015-06-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2015-06-07fib_trie: coding style: Use pointer after checkFiro Yang
As Alexander Duyck pointed out:

    struct tnode {
        ...
        struct key_vector kv[1];
    };

The kv[1] member of struct tnode is an array; referencing it through a null pointer will not crash the system, like this:

    struct tnode *p = NULL;
    struct key_vector *kv = p->kv;

As such, p->kv doesn't actually dereference anything; it is simply a means of getting the offset to the array from the pointer p. This patch makes the code more regular, to avoid making people feel odd when they look at it. Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07tcp: get_cookie_sock() consolidationEric Dumazet
IPv4 and IPv6 share the same implementation of get_cookie_sock(), and there is no point in inlining it. We add a tcp_ prefix to the common helper name and export it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07tcp: remove redundant checks IIEric Dumazet
For the same reasons as in commit 12e25e1041d0 ("tcp: remove redundant checks"), we can remove the redundant checks done for timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-06inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitationsEric Dumazet
When an application needs to force a source IP on an active TCP socket it has to use bind(IP, port=x). As most applications do not want to deal with already used ports, x is often set to 0, meaning the kernel is in charge of finding an available port. But the kernel does not yet know if this socket is going to be a listener or be connected, and it has very limited choices (no full knowledge of the final 4-tuple for a connect()). With a limited ephemeral port range (about 32K ports), it is very easy to fill the space.

This patch adds a new SOL_IP socket option, asking the kernel to ignore the 0 port provided by the application in bind(IP, port=0) and only remember the given IP address. The port will be automatically chosen at connect() time, in a way that allows sharing a source port as long as the 4-tuples are unique. This new feature is available for both IPv4 and IPv6 (Thanks Neal).

Tested: Wrote a test program and checked its behavior on IPv4 and IPv6. strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by connect(). Also getsockname() shows that the port is still 0 right after bind() but properly allocated after connect():

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
    setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
    connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

IPv6 test:

    socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
    setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
    connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

I was able to bind()/connect() a million concurrent IPv4 sockets, instead of ~32000 before the patch:

    lpaa23:~# ulimit -n 1000010
    lpaa23:~# ./bind --connect --num-flows=1000000 &
    1000000 sockets
    lpaa23:~# grep TCP /proc/net/sockstat
    TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

Check that a given source port is indeed used by many different connections:

    lpaa23:~# ss -t src :40000 | head -10
    State   Recv-Q Send-Q  Local Address:Port    Peer Address:Port
    ESTAB   0      0           127.0.0.2:40000   127.0.202.33:44983
    ESTAB   0      0           127.0.0.2:40000   127.2.27.240:44983
    ESTAB   0      0           127.0.0.2:40000     127.2.98.5:44983
    ESTAB   0      0           127.0.0.2:40000  127.0.124.196:44983
    ESTAB   0      0           127.0.0.2:40000   127.2.139.38:44983
    ESTAB   0      0           127.0.0.2:40000    127.1.59.80:44983
    ESTAB   0      0           127.0.0.2:40000    127.3.6.228:44983
    ESTAB   0      0           127.0.0.2:40000    127.0.38.53:44983
    ESTAB   0      0           127.0.0.2:40000   127.1.197.10:44983

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
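A minimal user-space sketch of the new option (error handling elided; the fallback define is defensive, in case installed headers predate this patch):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef IP_BIND_ADDRESS_NO_PORT
    #define IP_BIND_ADDRESS_NO_PORT 24    /* from <linux/in.h> */
    #endif

    int bound_source_ip_socket(void)
    {
        int one = 1;
        struct sockaddr_in src = { .sin_family = AF_INET };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        inet_pton(AF_INET, "127.0.0.2", &src.sin_addr);

        /* Remember only the source IP; defer port selection. */
        setsockopt(fd, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));
        bind(fd, (struct sockaddr *)&src, sizeof(src));

        /* A later connect() picks a shareable source port. */
        return fd;
    }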
2015-06-04tcp: double default TSQ output bytes limitWei Liu
The Xen virtual network driver has higher latency than a physical NIC. Having only 128K as the limit for TSQ introduced a 30% regression in guest throughput. This patch raises the limit to 256K, which reduces the regression to 8%. This buys us more time to work out a proper solution in the long run. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04tcp: remove redundant checksEric Dumazet
tcp_v4_rcv() checks the following before calling tcp_v4_do_rcv():

    if (th->doff < sizeof(struct tcphdr) / 4)
        goto bad_packet;
    if (!pskb_may_pull(skb, th->doff * 4))
        goto discard_it;

So the following check in tcp_v4_do_rcv() is redundant, and its "goto csum_err;" is wrong anyway:

    if (skb->len < tcp_hdrlen(skb) || ...)
        goto csum_err;

A second check can be removed after the no_tcp_socket label for the same reason. The same tests can be removed in tcp_v6_do_rcv(). Note: short tcp frames are not properly accounted in the tcpInErrs MIB, because a pskb_may_pull() failure simply drops the incoming skb; we might fix this in a separate patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04ipv4/udp: Verify multicast group is ours in udp_v4_early_demux()Shawn Bohrer
Commit 421b3885bf6d ("udp: ipv4: Add udp early demux") introduced a regression that allowed sockets bound to INADDR_ANY to receive packets from multicast groups that the socket had not joined. For example, a socket that had joined 224.168.2.9 could also receive packets from 225.168.2.9 despite not having joined that group, if ip_early_demux is enabled. Fix this by calling ip_check_mc_rcu() in udp_v4_early_demux() to verify that the multicast packet is indeed ours. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts:
    drivers/net/phy/amd-xgbe-phy.c
    drivers/net/wireless/iwlwifi/Kconfig
    include/net/mac80211.h

iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31tcp: fix child sockets to use system default congestion control if not setNeal Cardwell
Linux 3.17 and earlier are explicitly engineered so that if the app doesn't specifically request a CC module on a listener before the SYN arrives, then the child gets the system default CC when the connection is established. See tcp_init_congestion_control() in 3.17 or earlier, which says "if no choice made yet assign the current value set as default". The change ("net: tcp: assign tcp cong_ops when tcp sk is created") altered these semantics, so that children got their parent listener's congestion control even if the system default had changed after the listener was created. This commit returns to those original semantics from 3.17 and earlier, since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]: Congestion control initialization."), and some Linux congestion control workflows depend on that. In summary, if a listener socket specifically sets TCP_CONGESTION to "x", or the route locks the CC module to "x", then the child gets "x". Otherwise the child gets current system default from net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and earlier, and this commit restores that. Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created") Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31udp: fix behavior of wrong checksumsEric Dumazet
We have two problems in the UDP stack related to bogus checksums:

1) We return -EAGAIN to the application even if the receive queue is not empty. This breaks applications using edge-triggered epoll().

2) Under UDP flood, we can loop forever without yielding to other processes, potentially hanging the host, especially on non-SMP.

This patch is an attempt to make things better. We might in the future add extra support for rt applications wanting to better control time spent doing a recv() in a hostile environment. For example we could validate checksums before queuing packets in the socket receive queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) Default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all options.

2) Allow binding a table to a net_device. This introduces the internal NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding. This is required by the next patch.

3) Add the 'netdev' table family. This new table allows you to create ingress filter basechains, providing access to the existing nf_tables features from ingress.

4) Kill unused argument from compat_find_calc_{match,target} in ip_tables and ip6_tables, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-30net: limit tcp/udp rmem/wmem to SOCK_{RCV,SND}BUF_MINSorin Dumitru
This is similar to commit b1cb59cf2efe ("net: sysctl_net_core: check SNDBUF and RCVBUF for min length"). I don't think too-small values can cause crashes in the case of udp and tcp, but I've seen this set to values small enough to trigger awful performance. It also makes the setting consistent across all the wmem/rmem sysctls. Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-28Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecDavid S. Miller
Steffen Klassert says:

====================
pull request (net): ipsec 2015-05-28

1) Fix a race in xfrm_state_lookup_byspi: we need to take the refcount before we release xfrm_state_lock. From Li RongQing.

2) Fix IV generation on ESN state. We used just the low-order sequence numbers for IV generation on ESN; as a result the IV can repeat on the same state. Fix this by using the high-order sequence number bits too, and make sure to always initialize the high-order bits with zero. These patches are serious stable candidates. Fixes from Herbert Xu.

3) Fix the skb->mark handling on vti. We don't reset skb->mark in skb_scrub_packet anymore, so vti must take care to restore the original value back after it was used to look up the vti policy and state. Fixes from Alexander Duyck.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-28ip_vti/ip6_vti: Preserve skb->mark after rcv_cb callAlexander Duyck
The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified after completing the function. This resulted in the original skb->mark value being lost. Since we only need skb->mark to be set for xfrm_policy_check we can pull the assignment into the rcv_cb calls and then just restore the original mark after xfrm_policy_check has been completed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-28ip_vti/ip6_vti: Do not touch skb->mark on xmitAlexander Duyck
Instead of modifying skb->mark we can simply modify the flowi_mark that is generated as a result of the xfrm_decode_session. By doing this we don't need to actually touch the skb->mark and it can be preserved as it passes out through the tunnel. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-28esp4: Switch to new AEAD interfaceHerbert Xu
This patch makes use of the new AEAD interface which uses a single SG list instead of separate lists for the AD and plain text. The IV generation is also now carried out through normal AEAD methods. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-05-27tcp/dccp: warn user for preferred ip_local_port_rangeEric Dumazet
After commit 07f4c90062f8f ("tcp/dccp: try to not exhaust ip_local_port_range in connect()") it is advised to have an even number of ports described in /proc/sys/net/ipv4/ip_local_port_range. This means the start/end values should have different parity. Let's warn sysadmins of this, so that they can update their settings if they want to. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27tcp: connect() from bound sockets can be fasterEric Dumazet
__inet_hash_connect() does not use its third argument (port_offset) if socket was already bound to a source port. No need to perform useless but expensive md5 computations. Reported-by: Crestez Dan Leonard <cdleonard@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27tcp/dccp: try to not exhaust ip_local_port_range in connect()Eric Dumazet
A long-standing problem on busy servers is the tiny available TCP port range (/proc/sys/net/ipv4/ip_local_port_range) and the default sequential allocation of source ports in the connect() system call. If a host has a lot of active TCP sessions, chances are very high that all ports are in use by at least one flow, and subsequent bind(0) attempts fail, or have to scan a big portion of the space to find a slot.

In this patch, I changed the starting point in __inet_hash_connect() so that we try to favor even [1] ports, leaving odd ports for bind() users. We still perform a sequential search, so there is no guarantee, but if connect() targets are very different, the end result is that we leave more ports available to bind(), and we spread them all over the range, lowering the time for both connect() and bind() to find a slot.

This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range is even, i.e., if the start/end values have different parity. Therefore, the default /proc/sys/net/ipv4/ip_local_port_range was changed to 32768 - 60999 (instead of 32768 - 61000).

There is no change in security aspects here; only some poor hashing schemes could eventually be impacted by this change.

[1] The odd/even property depends on the parity of the ip_local_port_range values.

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
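An illustrative sketch of the bias described above (function and variable names are assumptions, not the kernel code): round the hash-derived starting offset down to an even value, so connect()'s sequential search begins on an even port and leaves odd ports for bind().

    /* Hedged sketch of the even-start bias in __inet_hash_connect(). */
    static unsigned int connect_port_start(unsigned int low,
                                           unsigned int high,
                                           unsigned int hash)
    {
        unsigned int range = high - low + 1;
        unsigned int offset = hash % range;

        return low + (offset & ~1U);   /* start on an even offset */
    }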
2015-05-27ip_fragment: don't forward defragmented DF packetFlorian Westphal
We currently always send fragments without the DF bit set. Thus, given the following setup:

    mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
       A          R1             R2            B

where R1 and R2 run Linux with netfilter defragmentation/conntrack enabled, if host A sends a fragmented packet _with_ DF set to B, R1 will respond with an icmp too big error if one of these fragments exceeds 1400 bytes. However, if R1 receives fragment sizes 1200 and 100, it will forward the reassembled packet without refragmenting, i.e. R2 will send an icmp error in response to a packet that was never sent, citing an mtu that the original sender never exceeded. The other minor issue is that a refragmentation on R1 will conceal the MTU of R2-B, since refragmentation does not set the DF bit on the fragments.

This modifies ip_fragment so that we track the largest fragment size seen both for DF and non-DF packets, and set frag_max_size to the largest value. If the DF fragment size is larger than or equal to the non-DF one, we will consider the packet a path mtu probe: we set the DF bit on the reassembled skb and also tag it with a new IPCB flag to force refragmentation even if the skb fits the outdev mtu. We will also set the DF bit on each fragment in this case. Joint work with Hannes Frederic Sowa. Reported-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
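A sketch of the bookkeeping described above (illustrative names; the real patch extends the inet frag queue and IPCB):

    #include <linux/types.h>

    /* Track the largest fragment seen, with and without DF. */
    struct frag_sizes_sketch {
        u16 max_df_size;    /* largest fragment that had DF set */
        u16 max_size;       /* largest fragment overall */
    };

    static void frag_account(struct frag_sizes_sketch *q,
                             u16 frag_len, bool df)
    {
        if (df && frag_len > q->max_df_size)
            q->max_df_size = frag_len;
        if (frag_len > q->max_size)
            q->max_size = frag_len;
    }

    /* At reassembly completion: a PMTU probe keeps DF and is
     * refragmented with DF set on every fragment.
     */
    static bool is_pmtu_probe(const struct frag_sizes_sketch *q)
    {
        return q->max_df_size >= q->max_size;
    }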
2015-05-27net: ipv4: avoid repeated calls to ip_skb_dst_mtu helperFlorian Westphal
ip_skb_dst_mtu is a small inline helper, but it's called in several places.

             text  data  bss    dec   hex  filename
    before: 17061    44    0  17105  42d1  net/ipv4/ip_output.o
    after:  16805    44    0  16849  41d1  net/ipv4/ip_output.o

Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include.David S. Miller
We used to get this indirectly, I suppose, but no longer do. Either way, an explicit include should have been done in the first place:

    net/ipv4/fib_trie.c: In function '__node_free_rcu':
    >> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
         vfree(n);
         ^
    net/ipv4/fib_trie.c: In function 'tnode_alloc':
    >> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
         return vzalloc(size);
         ^
    >> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast
    cc1: some warnings being treated as errors

Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26tcp: tcp_tso_autosize() minimum is one packetEric Dumazet
By making sure sk->sk_gso_max_segs' minimal value is one, and sysctl_tcp_min_tso_segs' minimal value is one as well, tcp_tso_autosize() will return a non-zero value. We can then revert 843925f33fcc293d80acf2c5c8a78adf3344d49b ("tcp: Do not apply TSO segment limit to non-TSO packets") and save a few cpu cycles in the fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
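The clamping logic, reconstructed as a hedged sketch (not the exact tcp_tso_autosize() body): with both lower bounds at one, the result cannot be zero.

    #include <linux/kernel.h>

    static u32 tso_autosize_sketch(u32 bytes, u32 mss_now,
                                   u32 gso_max_segs, u32 min_tso_segs)
    {
        u32 segs = bytes / mss_now;

        segs = max(segs, min_tso_segs);          /* sysctl clamped >= 1 */
        return min(segs, max(gso_max_segs, 1U)); /* sk_gso_max_segs >= 1 */
    }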
2015-05-26tcp: fix/cleanup inet_ehash_locks_alloc()Eric Dumazet
If the tcp ehash table is constrained to a very small number of buckets (e.g. boot parameter thash_entries=128), then we can crash if the spinlock array has more entries. While we are at it, un-inline inet_ehash_locks_alloc() and make the following changes:

- Budget 2 cache lines per cpu worth of 'spinlocks'.
- Try to kmalloc() the array to avoid extra TLB pressure. (Most servers at Google allocate 8192 bytes for this hash table.)
- Get rid of various #ifdefs.

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
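A sketch of the allocation strategy described above (simplified; per the commit text, the real function also clamps the lock count to the number of hash buckets):

    #include <linux/slab.h>
    #include <linux/spinlock.h>
    #include <linux/vmalloc.h>

    /* Try kmalloc() first for TLB friendliness, fall back to vmalloc(). */
    static spinlock_t *ehash_locks_alloc_sketch(unsigned int nlocks)
    {
        size_t size = nlocks * sizeof(spinlock_t);
        spinlock_t *locks;

        locks = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
        if (!locks)
            locks = vmalloc(size);
        return locks;   /* caller must spin_lock_init() each entry */
    }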
2015-05-26netfilter: remove unused comefrom hookmask argumentFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-25ip: reject too-big defragmented DF-skb when forwardingFlorian Westphal
Send an icmp pmtu error if we find that the largest fragment of a df-skb exceeds the output path mtu. The ip output path will still catch this later on, but we can avoid the forward/postrouting hook traversal by rejecting right away. This is what ipv6 already does. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25net: make skb_splice_bits more configurableHannes Frederic Sowa
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets. AF_UNIX sockets don't use lock_sock/release_sock, and thus we have to use a callback to make the locking and unlocking configurable. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25net: skbuff: add skb_append_pagefrags and use itHannes Frederic Sowa
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts:
    drivers/net/ethernet/cadence/macb.c
    drivers/net/phy/phy.c
    include/linux/skbuff.h
    net/ipv4/tcp.c
    net/switchdev/switchdev.c

Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22ipv4: fill in table id when replacing a routeMichal Kubeček
When replacing an IPv4 route, tb_id member of the new fib_alias structure is not set in the replace code path so that the new route is ignored. Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree, they are:

1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash. This problem is due to the wrong order in the per-net registration and netlink socket events. Patch from Francesco Ruggeri.

2) Make sure that counters that userspace passes us are higher than 0 in all the x_tables frontends. Discovered via Trinity, patch from Dave Jones.

3) Revert a patch for br_netfilter to rely on the conntrack status bits. This breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22ipv4: Avoid crashing in ip_errorEric W. Biederman
ip_error does not check if in_dev is NULL before dereferencing it. The following sequence of calls is possible:

    CPU A                          CPU B
    ip_rcv_finish
        ip_route_input_noref()
            ip_route_input_slow()
                                   inetdev_destroy()
    dst_input()

With the result that a network device can be destroyed while processing an input packet. A crash was triggered with only unicast packets in flight, and forwarding enabled on the only network device. The error condition was created by the removal of the network device. As such it is likely that the error code was -EHOSTUNREACH, and the action taken by ip_error (if in_dev had been accessible) would have been to not increment any counters and to try, and likely fail, to send an icmp error, as the network device is going away. Therefore handle this weird case by just dropping the packet if !in_dev. It will result in dropping the packet sooner, and will not result in an actual change of behavior. Fixes: 251da4130115b ("ipv4: Cache ip_error() routes even when not forwarding.") Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22tcp: fix a potential deadlock in tcp_get_info()Eric Dumazet
Taking the socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds &hashinfo->ehash_locks[i] while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused, as all TCP sockets share the same lockdep classes.

    [  523.722504] ======================================================
    [  523.728706] [ INFO: possible circular locking dependency detected ]
    [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
    [  523.739202] -------------------------------------------------------
    [  523.745474] ss/18032 is trying to acquire lock:
    [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
    [  523.758129]
    [  523.758129] but task is already holding lock:
    [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
    [  523.774661]
    [  523.774661] which lock already depends on the new lock.
    [  523.774661]
    [  523.782850] the existing dependency chain (in reverse order) is:
    [  523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
    [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
    [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
    [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
    [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
    [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
    [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
    [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
    [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
    [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
    [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
    [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420

Let's use the u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are a nop for them. Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
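The u64_sync pattern in a self-contained sketch (the struct is illustrative; the real patch wraps tp->bytes_acked and friends):

    #include <linux/u64_stats_sync.h>

    struct stats_sketch {
        u64 bytes_acked;
        struct u64_stats_sync syncp;   /* u64_stats_init() at setup */
    };

    /* Writer side, e.g. in the packet processing path. */
    static void stats_add(struct stats_sketch *s, u64 delta)
    {
        u64_stats_update_begin(&s->syncp);
        s->bytes_acked += delta;
        u64_stats_update_end(&s->syncp);
    }

    /* Lockless reader side, e.g. in a get_info() handler.
     * On 64bit arches the seqcount compiles away entirely.
     */
    static u64 stats_read(const struct stats_sketch *s)
    {
        unsigned int start;
        u64 val;

        do {
            start = u64_stats_fetch_begin(&s->syncp);
            val = s->bytes_acked;
        } while (u64_stats_fetch_retry(&s->syncp, start));
        return val;
    }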
2015-05-21tcp: add tcpi_segs_in and tcpi_segs_out to tcp_infoMarcelo Ricardo Leitner
This patch tracks the total number of inbound and outbound segments on a TCP socket. One may use these numbers to get an idea of connection quality when compared against the retransmissions. RFC 4898 names these tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut. These are 32bit fields, and can be fetched both from the TCP_INFO getsockopt() if one has a handle on a TCP socket, or from the inet_diag netlink facility (an iproute2/ss patch will follow). Note that tp->segs_out was placed near tp->snd_nxt for good data locality and minimal performance impact, while tp->segs_in was placed near tp->bytes_received for the same reason. Joint work with Eric Dumazet. Note that received SYNs are accounted on the listener, but sent SYNACKs are not accounted. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
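Reading the new counters from user space, as a sketch (uses the kernel uapi struct, since a libc's tcp_info may predate these fields):

    #include <linux/tcp.h>        /* struct tcp_info with tcpi_segs_* */
    #include <netinet/in.h>       /* IPPROTO_TCP */
    #include <stdio.h>
    #include <sys/socket.h>

    static void print_segs(int fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
            printf("segs_in=%u segs_out=%u\n",
                   ti.tcpi_segs_in, ti.tcpi_segs_out);
    }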
2015-05-21tcp: improve REUSEADDR/NOREUSEADDR cohabitationEric Dumazet
inet_csk_get_port()'s randomization effort tends to spread sockets over the whole available range (ip_local_port_range). This is unfortunate because SO_REUSEADDR sockets have fewer requirements than non-SO_REUSEADDR ones. If an application uses the SO_REUSEADDR hint, it is to try to allow source ports to be shared. So instead of picking a random port number anywhere in ip_local_port_range, let's try the first half of the range first. This gives more chances to use the upper half of the range for the sockets with strong requirements (those not using SO_REUSEADDR). Note this patch does not add a new sysctl, and only changes the way we try to pick a port number. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21inet_hashinfo: remove bsocket counterEric Dumazet
We no longer need the bsocket atomic counter, as inet_csk_get_port() calls bind_conflict() regardless of its value, after commit 2b05ad33e1e624e ("tcp: bind() fix autoselection to share ports"). This patch removes the overhead of maintaining this counter and of double inet_csk_get_port() calls under pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21tcp: ensure epoll edge trigger wakeup when write queue is emptyJason Baron
We currently rely on the setting of SOCK_NOSPACE in the write() path to ensure that we wake up any epoll edge trigger waiters when acks return to free space in the write queue. However, if we fail to allocate even a single skb in the write queue, we could end up waiting indefinitely. Fix this by explicitly issuing a wakeup when we detect the condition of an empty write queue and a return value of -EAGAIN. This allows userspace to re-try as we expect this to be a temporary failure. I've tested this approach by artificially making sk_stream_alloc_skb() return NULL periodically. In that case, epoll edge trigger waiters will hang indefinitely in epoll_wait() without this patch. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
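The user-space pattern this fix makes safe, sketched below (assumes fd was registered with EPOLLOUT | EPOLLET): on -EAGAIN, waiting for the next edge can no longer hang, because the kernel now issues a wakeup even when the failure happened with an empty write queue.

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    static int write_all_et(int epfd, int fd, const char *buf, size_t len)
    {
        size_t off = 0;

        while (off < len) {
            ssize_t n = write(fd, buf + off, len - off);

            if (n > 0) {
                off += (size_t)n;
            } else if (n < 0 && errno == EAGAIN) {
                struct epoll_event ev;

                /* Wait for the next EPOLLOUT edge, then retry. */
                if (epoll_wait(epfd, &ev, 1, -1) < 0)
                    return -1;
            } else {
                return -1;
            }
        }
        return 0;
    }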
2015-05-21tcp: add a force_schedule argument to sk_stream_alloc_skb()Eric Dumazet
In commit 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger") we fixed a possible hang of TCP sockets under memory pressure, by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule() if no packet is in the socket write queue. It turns out there are other cases where we want to force memory scheduling: tcp_fragment() & tso_fragment() need to split a big TSO packet into two smaller ones. If we block here because of TCP memory pressure, we can effectively block the TCP socket from sending new data. If no further ACK is coming, this hang would be definitive, and the socket has no chance to effectively reduce its memory usage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20netfilter: ensure number of counters is >0 in do_replace()Dave Jones
After improving setsockopt() coverage in trinity, I started triggering vmalloc failures pretty reliably from this code path:

    warn_alloc_failed+0xe9/0x140
    __vmalloc_node_range+0x1be/0x270
    vzalloc+0x4b/0x50
    __do_replace+0x52/0x260 [ip_tables]
    do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
    nf_setsockopt+0x65/0x90
    ip_setsockopt+0x61/0xa0
    raw_setsockopt+0x16/0x60
    sock_common_setsockopt+0x14/0x20
    SyS_setsockopt+0x71/0xd0

It turns out we don't validate that the num_counters field in the struct we pass in from userspace is initialized. The same problem also exists in ebtables, arptables, ipv6, and the compat variants. Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-19tcp: add rfc3168, section 6.1.1.1. fallbackDaniel Borkmann
This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn via routing table") and adds RFC 3168, section 6.1.1.1. fallback for outgoing ECN connections. In other words, this work adds a retry with a non-ECN setup SYN packet on the first timeout, as suggested by the RFC:

    [...] A host that receives no reply to an ECN-setup SYN within
    the normal SYN retransmission timeout interval MAY resend the SYN
    and any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode, that is, the Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend ECN sysctl to allow server-side only ECN"):

1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

If we did not have the fallback implemented, the middlebox drop point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, such additional setup latency would occur for a rather small percentage of sites: it was found at the end of 2014 that ~56% of IPv4 and 65% of IPv6 servers on the Alexa 1 million list would negotiate ECN (aka the tcp_ecn=2 default), and 0.42% of these webservers will fail to connect when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the fallback would mitigate with a slight latency trade-off. A recent related paper on this topic: Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth, Gorry Fairhurst, and Richard Scheffenegger: "Enabling Internet-Wide Deployment of Explicit Congestion Notification." Proc. PAM 2015, New York. http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC 3168, section 6.1.1.1. fallback on timeout. For users explicitly not wanting this, which can be the case in DC deployments, we add a net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback. tp->ecn_flags are not cleared in tcp_ecn_clear_syn() on output; rather, we let tcp_ecn_rcv_synack() take that over on the input path in case a SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent ECN from being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch> Cc: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19tcp: don't over-send F-RTO probesYuchung Cheng
After sending the new data packets to probe (step 2), F-RTO may incorrectly send more probes if the next ACK advances SND_UNA and does not sack new packets. However, F-RTO (RFC 5682) probes at most once. This bug may cause the sender to always send new data instead of repairing holes, inducing longer HoL blocking on the receiver for the application. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19tcp: only undo on partial ACKs in CA_LossYuchung Cheng
Undo based on TCP timestamps should only happen on ACKs that advance SND_UNA, according to the Eifel algorithm in RFC 3522:

    Section 3.2:

      (4) If the value of the Timestamp Echo Reply field of the
          acceptable ACK's Timestamps option is smaller than the
          value of RetransmitTS, then proceed to step (5),

    Section Terminology:

      We use the term 'acceptable ACK' as defined in [RFC793]. That
      is an ACK that acknowledges previously unacknowledged data.

This is because upon receiving an out-of-order packet, the receiver returns the last timestamp that advances RCV_NXT, not the current timestamp of the packet in the DUPACK. Without checking the flag, the DUPACK will cause tcp_packet_delayed() to return true and tcp_try_undo_loss() will revert the cwnd reduction. Note that we already check this condition in CA_Recovery, by only calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set, or tcp_try_undo_recovery() if snd_una crosses high_seq. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>