summaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)Author
2016-04-07GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOUAlexander Duyck
This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30gro: Allow tunnel stacking in the case of FOU/GUEAlexander Duyck
This patch should fix the issues seen with a recent fix to prevent tunnel-in-tunnel frames from being generated with GRO. The fix itself is correct for now as long as we do not add any devices that support NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the potential to mess things up due to the fact that the outer transport header points to the outer UDP header and not the GRE header as would be expected. Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28netfilter: ipv4: fix NULL dereferenceLiping Zhang
Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") use sock_net(skb->sk) to get the net namespace, but we can't assume that sk_buff->sk is always exist, so when it is NULL, oops will happen. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: enforce nul-terminated table name from getsockopt ↵Pablo Neira Ayuso
GET_ENTRIES Make sure the table names via getsockopt GET_ENTRIES is nul-terminated in ebtables and all the x_tables variants and their respective compat code. Uncovered by KASAN. Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: fix unconditional helperFlorian Westphal
Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Problem is that mark_source_chains should not have been called -- the rule doesn't have a next entry, so its supposed to return an absolute verdict of either ACCEPT or DROP. However, the function conditional() doesn't work as the name implies. It only checks that the rule is using wildcard address matching. However, an unconditional rule must also not be using any matches (no -m args). The underflow validator only checked the addresses, therefore passing the 'unconditional absolute verdict' test, while mark_source_chains also tested for presence of matches, and thus proceeeded to the next (not-existent) rule. Unify this so that all the callers have same idea of 'unconditional rule'. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: make sure e->next_offset covers remaining blob sizeFlorian Westphal
Otherwise this function may read data beyond the ruleset blob. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: validate e->target_offset earlyFlorian Westphal
We should check that e->target_offset is sane before mark_source_chains gets called since it will fetch the target entry for loop detection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-27Fix returned tc and hoplimit values for route with IPv6 encapsulationQuentin Armitage
For a route with IPv6 encapsulation, the traffic class and hop limit values are interchanged when returned to userspace by the kernel. For example, see below. ># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000 ># ip route show table 1000 192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link Signed-off-by: Quentin Armitage <quentin@armitage.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-23net: Reset encap_level to avoid resetting features on inner IP headersAlexander Duyck
This patch corrects an oversight in which we were allowing the encap_level value to pass from the outer headers to the inner headers. As a result we were incorrectly identifying UDP or GRE tunnels as also making use of ipip or sit when the second header actually represented a tunnel encapsulated in either a UDP or GRE tunnel which already had the features masked. Fixes: 76443456227097179c1482 ("net: Move GSO csum into SKB_GSO_CB") Reported-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22ipv4: initialize flowi4_flags before calling fib_lookup()Lance Richardson
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst() before calling fib_lookup(), which means fib_table_lookup() is using non-deterministic data at this line: if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) { Fix by initializing the entire fl4 structure, which will prevent similar issues as fields are added in the future by ensuring that all fields are initialized to zero unless explicitly initialized to another value. Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups") Suggested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22ipv4: fix broadcast packets receptionPaolo Abeni
Currently, ingress ipv4 broadcast datagrams are dropped since, in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on bcast packets. This patch addresses the issue, invoking ip_check_mc_rcu() only for mcast packets. Fixes: 6e5403093261 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21net: ipv4: Fix truncated timestamp returned by inet_current_timestamp()Deepa Dinamani
The millisecond timestamps returned by the function is converted to network byte order by making a call to htons(). htons() only returns __be16 while __be32 is required here. This was identified by the sparse warning from the buildbot: net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return expression (different base types) net/ipv4/af_inet.c:1405:16: expected restricted __be32 net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident> Change the function to use htonl() to return the correct __be32 type instead so that the millisecond value doesn't get truncated. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnd Bergmann <arnd@arndb.de> Fixes: 822c868532ca ("net: ipv4: Convert IP network timestamps to be y2038 safe") Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot] Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20tunnels: Remove encapsulation offloads on decap.Jesse Gross
If a packet is either locally encapsulated or processed through GRO it is marked with the offloads that it requires. However, when it is decapsulated these tunnel offload indications are not removed. This means that if we receive an encapsulated TCP packet, aggregate it with GRO, decapsulate, and retransmit the resulting frame on a NIC that does not support encapsulation, we won't be able to take advantage of hardware offloads even though it is just a simple TCP packet at this point. This fixes the problem by stripping off encapsulation offload indications when packets are decapsulated. The performance impacts of this bug are significant. In a test where a Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated, and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a result of avoiding unnecessary segmentation at the VM tap interface. Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com> Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE") Signed-off-by: Jesse Gross <jesse@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20tunnels: Don't apply GRO to multiple layers of encapsulation.Jesse Gross
When drivers express support for TSO of encapsulated packets, they only mean that they can do it for one layer of encapsulation. Supporting additional levels would mean updating, at a minimum, more IP length fields and they are unaware of this. No encapsulation device expresses support for handling offloaded encapsulated packets, so we won't generate these types of frames in the transmit path. However, GRO doesn't have a check for multiple levels of encapsulation and will attempt to build them. UDP tunnel GRO actually does prevent this situation but it only handles multiple UDP tunnels stacked on top of each other. This generalizes that solution to prevent any kind of tunnel stacking that would cause problems. Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack") Signed-off-by: Jesse Gross <jesse@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20ipip: Properly mark ipip GRO packets as encapsulated.Jesse Gross
ipip encapsulated packets can be merged together by GRO but the result does not have the proper GSO type set or even marked as being encapsulated at all. Later retransmission of these packets will likely fail if the device does not support ipip offloads. This is similar to the issue resolved in IPv6 sit in feec0cb3 ("ipv6: gro: support sit protocol"). Reported-by: Patrick Boutilier <boutilpj@ednet.ns.ca> Fixes: 9667e9bb ("ipip: Add gro callbacks to ipip offload") Tested-by: Patrick Boutilier <boutilpj@ednet.ns.ca> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jesse Gross <jesse@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Highlights: 1) Support more Realtek wireless chips, from Jes Sorenson. 2) New BPF types for per-cpu hash and arrap maps, from Alexei Starovoitov. 3) Make several TCP sysctls per-namespace, from Nikolay Borisov. 4) Allow the use of SO_REUSEPORT in order to do per-thread processing of incoming TCP/UDP connections. The muxing can be done using a BPF program which hashes the incoming packet. From Craig Gallek. 5) Add a multiplexer for TCP streams, to provide a messaged based interface. BPF programs can be used to determine the message boundaries. From Tom Herbert. 6) Add 802.1AE MACSEC support, from Sabrina Dubroca. 7) Avoid factorial complexity when taking down an inetdev interface with lots of configured addresses. We were doing things like traversing the entire address less for each address removed, and flushing the entire netfilter conntrack table for every address as well. 8) Add and use SKB bulk free infrastructure, from Jesper Brouer. 9) Allow offloading u32 classifiers to hardware, and implement for ixgbe, from John Fastabend. 10) Allow configuring IRQ coalescing parameters on a per-queue basis, from Kan Liang. 11) Extend ethtool so that larger link mode masks can be supported. From David Decotigny. 12) Introduce devlink, which can be used to configure port link types (ethernet vs Infiniband, etc.), port splitting, and switch device level attributes as a whole. From Jiri Pirko. 13) Hardware offload support for flower classifiers, from Amir Vadai. 14) Add "Local Checksum Offload". Basically, for a tunneled packet the checksum of the outer header is 'constant' (because with the checksum field filled into the inner protocol header, the payload of the outer frame checksums to 'zero'), and we can take advantage of that in various ways. From Edward Cree" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits) bonding: fix bond_get_stats() net: bcmgenet: fix dma api length mismatch net/mlx4_core: Fix backward compatibility on VFs phy: mdio-thunder: Fix some Kconfig typos lan78xx: add ndo_get_stats64 lan78xx: handle statistics counter rollover RDS: TCP: Remove unused constant RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket net: smc911x: convert pxa dma to dmaengine team: remove duplicate set of flag IFF_MULTICAST bonding: remove duplicate set of flag IFF_MULTICAST net: fix a comment typo ethernet: micrel: fix some error codes ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it bpf, dst: add and use dst_tclassid helper bpf: make skb->tc_classid also readable net: mvneta: bm: clarify dependencies cls_bpf: reset class and reuse major in da ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c ldmvsw: Add ldmvsw.c driver code ...
2016-03-18ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use itDaniel Borkmann
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX for this, which makes the code a bit more generic and allows to remove BPF_TUNLEN_MAX from eBPF code. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-17tcp/dccp: remove obsolete WARN_ON() in icmp handlersEric Dumazet
Now SYN_RECV request sockets are installed in ehash table, an ICMP handler can find a request socket while another cpu handles an incoming packet transforming this SYN_RECV request socket into an ESTABLISHED socket. We need to remove the now obsolete WARN_ON(req->sk), since req->sk is set when a new child is created and added into listener accept queue. If this race happens, the ICMP will do nothing special. Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ben Lazarus <blazarus@google.com> Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-17Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "Here is the crypto update for 4.6: API: - Convert remaining crypto_hash users to shash or ahash, also convert blkcipher/ablkcipher users to skcipher. - Remove crypto_hash interface. - Remove crypto_pcomp interface. - Add crypto engine for async cipher drivers. - Add akcipher documentation. - Add skcipher documentation. Algorithms: - Rename crypto/crc32 to avoid name clash with lib/crc32. - Fix bug in keywrap where we zero the wrong pointer. Drivers: - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver. - Add PIC32 hwrng driver. - Support BCM6368 in bcm63xx hwrng driver. - Pack structs for 32-bit compat users in qat. - Use crypto engine in omap-aes. - Add support for sama5d2x SoCs in atmel-sha. - Make atmel-sha available again. - Make sahara hashing available again. - Make ccp hashing available again. - Make sha1-mb available again. - Add support for multiple devices in ccp. - Improve DMA performance in caam. - Add hashing support to rockchip" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits) crypto: qat - remove redundant arbiter configuration crypto: ux500 - fix checks of error code returned by devm_ioremap_resource() crypto: atmel - fix checks of error code returned by devm_ioremap_resource() crypto: qat - Change the definition of icp_qat_uof_regtype hwrng: exynos - use __maybe_unused to hide pm functions crypto: ccp - Add abstraction for device-specific calls crypto: ccp - CCP versioning support crypto: ccp - Support for multiple CCPs crypto: ccp - Remove check for x86 family and model crypto: ccp - memset request context to zero during import lib/mpi: use "static inline" instead of "extern inline" lib/mpi: avoid assembler warning hwrng: bcm63xx - fix non device tree compatibility crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode. crypto: qat - The AE id should be less than the maximal AE number lib/mpi: Endianness fix crypto: rockchip - add hash support for crypto engine in rk3288 crypto: xts - fix compile errors crypto: doc - add skcipher API documentation crypto: doc - update AEAD AD handling ...
2016-03-15tags: Fix DEFINE_PER_CPU expansionsPeter Zijlstra
$ make tags GEN tags ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1" ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1" ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1" ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1" ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1" ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1" ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1" ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1" ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1" Which are all the result of the DEFINE_PER_CPU pattern: scripts/tags.sh:200: '/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/' scripts/tags.sh:201: '/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/' The below cures them. All except the workqueue one are within reasonable distance of the 80 char limit. TJ do you have any preference on how to fix the wq one, or shall we just not care its too long? Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS/OVS updates for net-next The following patchset contains Netfilter/IPVS fixes and OVS NAT support, more specifically this batch is composed of: 1) Fix a crash in ipset when performing a parallel flush/dump with set:list type, from Jozsef Kadlecsik. 2) Make sure NFACCT_FILTER_* netlink attributes are in place before accessing them, from Phil Turnbull. 3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP helper, from Arnd Bergmann. 4) Add workaround to IPVS to reschedule existing connections to new destination server by dropping the packet and wait for retransmission of TCP syn packet, from Julian Anastasov. 5) Allow connection rescheduling in IPVS when in CLOSE state, also from Julian. 6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni. 7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef. 8) Check match/targetinfo netlink attribute size in nft_compat, patch from Florian Westphal. 9) Check for integer overflow on 32-bit systems in x_tables, from Florian Westphal. Several patches from Jarno Rajahalme to prepare the introduction of NAT support to OVS based on the Netfilter infrastructure: 10) Schedule IP_CT_NEW_REPLY definition for removal in nf_conntrack_common.h. 11) Simplify checksumming recalculation in nf_nat. 12) Add comments to the openvswitch conntrack code, from Jarno. 13) Update the CT state key only after successful nf_conntrack_in() invocation. 14) Find existing conntrack entry after upcall. 15) Handle NF_REPEAT case due to templates in nf_conntrack_in(). 16) Call the conntrack helper functions once the conntrack has been confirmed. 17) And finally, add the NAT interface to OVS. The batch closes with: 18) Cleanup to use spin_unlock_wait() instead of spin_lock()/spin_unlock(), from Nicholas Mc Guire. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14net: diag: add a scheduling point in inet_diag_dump_icsk()Eric Dumazet
On loaded TCP servers, looking at millions of sockets can hold cpu for many seconds, if the lookup condition is very narrow. (eg : ss dst 1.2.3.4 ) Better add a cond_resched() to allow other processes to access the cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14netfilter: Allow calling into nat helper without skb_dst.Jarno Rajahalme
NAT checksum recalculation code assumes existence of skb_dst, which becomes a problem for a later patch in the series ("openvswitch: Interface with NAT."). Simplify this by removing the check on skb_dst, as the checksum will be dealt with later in the stack. Suggested-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-14tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/InMartin KaFai Lau
Per RFC4898, they count segments sent/received containing a positive length data segment (that includes retransmission segments carrying data). Unlike tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments carrying no data (e.g. pure ack). The patch also updates the segs_in in tcp_fastopen_add_skb() so that segs_in >= data_segs_in property is kept. Together with retransmission data, tcpi_data_segs_out gives a better signal on the rxmit rate. v6: Rebase on the latest net-next v5: Eric pointed out that checking skb->len is still needed in tcp_fastopen_add_skb() because skb can carry a FIN without data. Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in() helper is used. Comment is added to the fastopen case to explain why segs_in has to be reset and tcp_segs_in() has to be called before __skb_pull(). v4: Add comment to the changes in tcp_fastopen_add_skb() and also add remark on this case in the commit message. v3: Add const modifier to the skb parameter in tcp_segs_in() v2: Rework based on recent fix by Eric: commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Cc: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13GSO/UDP: Use skb->len instead of udph->len to determine length of original skbAlexander Duyck
It is possible for tunnels to end up generating IP or IPv6 datagrams that are larger than 64K and expecting to be segmented. As such we need to deal with length values greater than 64K. In order to accommodate this we need to update the code to work with a 32b length value instead of a 16b one. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13ipv4: Don't do expensive useless work during inetdev destroy.David S. Miller
When an inetdev is destroyed, every address assigned to the interface is removed. And in this scenerio we do two pointless things which can be very expensive if the number of assigned interfaces is large: 1) Address promotion. We are deleting all addresses, so there is no point in doing this. 2) A full nf conntrack table purge for every address. We only need to do this once, as is already caught by the existing masq_dev_notifier so masq_inet_event() can skip this. Reported-by: Solar Designer <solar@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
2016-03-13netconf: add macro to represent all attributesZhang Shengju
This patch adds macro NETCONFA_ALL to represent all type of netconf attributes for IPv4 and IPv6. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13gro: Defer clearing of flush bit in tunnel pathsAlexander Duyck
This patch updates the GRO handlers for GRE, VXLAN, GENEVE, and FOU so that we do not clear the flush bit until after we have called the next level GRO handler. Previously this was being cleared before parsing through the list of frames, however this resulted in several paths where either the bit needed to be reset but wasn't as in the case of FOU, or cases where it was being set as in GENEVE. By just deferring the clearing of the bit until after the next level protocol has been parsed we can avoid any unnecessary bit twiddling and avoid bugs. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09tcp: Add tcp_inq to get available receive bytes on socketTom Herbert
Create a common kernel function to get the number of bytes available on a TCP socket. This is based on code in INQ getsockopt and we now call the function for that getsockopt. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Remove useless debug message when deleting IPVS service, from Yannick Brosseau. 2) Get rid of compilation warning when CONFIG_PROC_FS is unset in several spots of the IPVS code, from Arnd Bergmann. 3) Add prandom_u32 support to nft_meta, from Florian Westphal. 4) Remove unused variable in xt_osf, from Sudip Mukherjee. 5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook since fixing af_packet defragmentation issues, from Joe Stringer. 6) On-demand hook registration for iptables from netns. Instead of registering the hooks for every available netns whenever we need one of the support tables, we register this on the specific netns that needs it, patchset from Florian Westphal. 7) Add missing port range selection to nf_tables masquerading support. BTW, just for the record, there is a typo in the description of 5f6c253ebe93b0 ("netfilter: bridge: register hooks only when bridge interface is added") that refers to the cluster match as deprecated, but it is actually the CLUSTERIP target (which registers hooks inconditionally) the one that is scheduled for removal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08bpf, vxlan, geneve, gre: fix usage of dst_cache on xmitDaniel Borkmann
The assumptions from commit 0c1d70af924b ("net: use dst_cache for vxlan device"), 468dfffcd762 ("geneve: add dst caching support") and 3c1cb4d2604c ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage when ip_tunnel_info is used is unfortunately not always valid as assumed. While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre with different remote dsts, tos, etc, therefore they cannot be assumed as packet independent. Right now vxlan, geneve, gre would cache the dst for eBPF and every packet would reuse the same entry that was first created on the initial route lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have a different one. Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test in vxlan needs to be handeled differently in this context as it is currently inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable() helper is added for the three tunnel cases, which checks if we can use dst cache. Fixes: 0c1d70af924b ("net: use dst_cache for vxlan device") Fixes: 468dfffcd762 ("geneve: add dst caching support") Fixes: 3c1cb4d2604c ("net/ipv4: add dst cache support for gre lwtunnels") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of overlapping changes, as well as one instance (vxlan) of a bug fix in 'net' overlapping with code movement in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07tcp: fix tcpi_segs_in after connection establishmentEric Dumazet
If final packet (ACK) of 3WHS is lost, it appears we do not properly account the following incoming segment into tcpi_segs_in While we are at it, starts segs_in with one, to count the SYN packet. We do not yet count number of SYN we received for a request sock, we might add this someday. packetdrill script showing proper behavior after fix : // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.020 < P. 1:1001(1000) ack 1 win 32792 +0 accept(3, ..., ...) = 4 +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }% Fixes: 2efd055c53c06 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07arp: correct return value of arp_rcvZhang Shengju
Currently, arp_rcv() always return zero on a packet delivery upcall. To make its behavior more compliant with the way this API should be used, this patch changes this to let it return NET_RX_SUCCESS when the packet is proper handled, and NET_RX_DROP otherwise. v1->v2: If sanity check is failed, call kfree_skb() instead of consume_skb(), then return the correct return value. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03mld, igmp: Fix reserved tailroom calculationBenjamin Poirier
The current reserved_tailroom calculation fails to take hlen and tlen into account. skb: [__hlen__|__data____________|__tlen___|__extra__] ^ ^ head skb_end_offset In this representation, hlen + data + tlen is the size passed to alloc_skb. "extra" is the extra space made available in __alloc_skb because of rounding up by kmalloc. We can reorder the representation like so: [__hlen__|__data____________|__extra__|__tlen___] ^ ^ head skb_end_offset The maximum space available for ip headers and payload without fragmentation is min(mtu, data + extra). Therefore, reserved_tailroom = data + extra + tlen - min(mtu, data + extra) = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen) = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen) Compare the second line to the current expression: reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset) and we can see that hlen and tlen are not taken into account. The min() in the third line can be expanded into: if mtu < skb_tailroom - tlen: reserved_tailroom = skb_tailroom - mtu else: reserved_tailroom = tlen Depending on hlen, tlen, mtu and the number of multicast address records, the current code may output skbs that have less tailroom than dev->needed_tailroom or it may output more skbs than needed because not all space available is used. Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs") Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02net/ipv4: remove left over dead codeEric Engestrom
8cc785f6f429c2a3fb81745dc142cbd72a462c4a ("net: ipv4: make the ping /proc code AF-independent") removed the code using it, but renamed this variable instead of removing it. Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02netfilter: nft_masq: support port rangePablo Neira Ayuso
Complete masquerading support by allowing port range selection. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02netfilter: xtables: don't hook tables by defaultFlorian Westphal
delay hook registration until the table is being requested inside a namespace. Historically, a particular table (iptables mangle, ip6tables filter, etc) was registered on module load. When netns support was added to iptables only the ip/ip6tables ruleset was made namespace aware, not the actual hook points. This means f.e. that when ipt_filter table/module is loaded on a system, then each namespace on that system has an (empty) iptables filter ruleset. In other words, if a namespace sends a packet, such skb is 'caught' by netfilter machinery and fed to hooking points for that table (i.e. INPUT, FORWARD, etc). Thanks to Eric Biederman, hooks are no longer global, but per namespace. This means that we can avoid allocation of empty ruleset in a namespace and defer hook registration until we need the functionality. We register a tables hook entry points ONLY in the initial namespace. When an iptables get/setockopt is issued inside a given namespace, we check if the table is found in the per-namespace list. If not, we attempt to find it in the initial namespace, and, if found, create an empty default table in the requesting namespace and register the needed hooks. Hook points are destroyed only once namespace is deleted, there is no 'usage count' (it makes no sense since there is no 'remove table' operation in xtables api). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02netfilter: xtables: prepare for on-demand hook registerFlorian Westphal
This change prepares for upcoming on-demand xtables hook registration. We change the protoypes of the register/unregister functions. A followup patch will then add nf_hook_register/unregister calls to the iptables one. Once a hook is registered packets will be picked up, so all assignments of the form net->ipv4.iptable_$table = new_table have to be moved to ip(6)t_register_table, else we can see NULL net->ipv4.iptable_$table later. This patch doesn't change functionality; without this the actual change simply gets too big. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-02netfilter: nf_defrag_ipv4: Drop redundant ip_send_check()Joe Stringer
Since commit 0848f6428ba3 ("inet: frags: fix defragmented packet's IP header for af_packet"), ip_send_check() would be called twice for defragmentation that occurs from netfilter ipv4 defrag hooks. Remove the extra call. Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-01net: remove skb_sender_cpu_clear()WANG Cong
After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id cohabitation") skb_sender_cpu_clear() becomes empty and can be removed. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01net: ipv4: tcp_probe: Replace timespec with timespec64Deepa Dinamani
TCP probe log timestamps use struct timespec which is not y2038 safe. Even though timespec might be good enough here as it is used to represent delta time, the plan is to get rid of all uses of timespec in the kernel. Replace with struct timespec64 which is y2038 safe. Prints still use unsigned long format and type. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01net: ipv4: Convert IP network timestamps to be y2038 safeDeepa Dinamani
ICMP timestamp messages and IP source route options require timestamps to be in milliseconds modulo 24 hours from midnight UT format. Add inet_current_timestamp() function to support this. The function returns the required timestamp in network byte order. Timestamp calculation is also changed to call ktime_get_real_ts64() which uses struct timespec64. struct timespec64 is y2038 safe. Previously it called getnstimeofday() which uses struct timespec. struct timespec is not y2038 safe. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26GSO: Provide software checksum of tunneled UDP fragmentation offloadAlexander Duyck
On reviewing the code I realized that GRE and UDP tunnels could cause a kernel panic if we used GSO to segment a large UDP frame that was sent through the tunnel with an outer checksum and hardware offloads were not available. In order to correct this we need to update the feature flags that are passed to the skb_segment function so that in the event of UDP fragmentation being requested for the inner header the segmentation function will correctly generate the checksum for the payload if we cannot segment the outer header. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26net: l3mdev: prefer VRF master for source address selectionDavid Lamparter
When selecting an address in context of a VRF, the vrf master should be preferred for address selection. If it isn't, the user has a hard time getting the system to select to their preference - the code will pick the address off the first in-VRF interface it can find, which on a router could well be a non-routable address. Signed-off-by: David Lamparter <equinox@diac24.net> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> [dsa: Fixed comment style and removed extra blank link ] Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-26net: l3mdev: address selection should only consider devices in L3 domainDavid Ahern
David Lamparter noted a use case where the source address selection fails to pick an address from a VRF interface - unnumbered interfaces. Relevant commands from his script: ip addr add 9.9.9.9/32 dev lo ip link set lo up ip link add name vrf0 type vrf table 101 ip rule add oif vrf0 table 101 ip rule add iif vrf0 table 101 ip link set vrf0 up ip addr add 10.0.0.3/32 dev vrf0 ip link add name dummy2 type dummy ip link set dummy2 master vrf0 up --> note dummy2 has no address - unnumbered device ip route add 10.2.2.2/32 dev dummy2 table 101 ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02 tcpdump -ni dummy2 & And using ping instead of his socat example: $ ping -I vrf0 -c1 10.2.2.2 ping: Warning: source address might be selected on device other than vrf0. PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data. >From tcpdump: 12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64 Note the source address is from lo and is not a VRF local address. With this patch: $ ping -I vrf0 -c1 10.2.2.2 PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data. >From tcpdump: 12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64 Now the source address comes from vrf0. The ipv4 function for selecting source address takes a const argument. Removing the const requires touching a lot of places, so instead l3mdev_master_ifindex_rcu is changed to take a const argument and then do the typecast to non-const as required by netdev_master_upper_dev_get_rcu. This is similar to what l3mdev_fib_table_rcu does. IPv6 for unnumbered interfaces appears to be selecting the addresses properly. Cc: David Lamparter <david@opensourcerouting.org> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-24ipv4: only create late gso-skb if skb is already set up with CHECKSUM_PARTIALHannes Frederic Sowa
Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL skbs down. This can easily happen with UDP+IPv4 sockets with the first MSG_MORE write smaller than the MTU, second write is a sendfile. Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path, were we calculate the checksum manually during copying. Commit d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets") started to exposes this bug. Fixes: d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets") Reported-by: Jiri Benc <jbenc@redhat.com> Cc: Jiri Benc <jbenc@redhat.com> Reported-by: Wakko Warner <wakko@animx.eu.org> Cc: Wakko Warner <wakko@animx.eu.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-24soreuseport: fix merge conflict in tcp bindCraig Gallek
One of the validation checks for the new array-based TCP SO_REUSEPORT validation was unintentionally dropped in ea8add2b1903. This adds it back. Lack of this check allows the user to allocate multiple sock_reuseport structures (leaking all but the first). Fixes: ea8add2b1903 ("tcp/dccp: better use of ephemeral ports in bind()") Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23tunnel: Clear IPCB(skb)->opt before dst_link_failure calledBernie Harris
IPCB may contain data from previous layers (in the observed case the qdisc layer). In the observed scenario, the data was misinterpreted as ip header options, which later caused the ihl to be set to an invalid value (<5). This resulted in an infinite loop in the mips implementation of ip_fast_csum. This patch clears IPCB(skb)->opt before dst_link_failure can be called for various types of tunnels. This change only applies to encapsulated ipv4 packets. The code introduced in 11c21a30 which clears all of IPCB has been removed to be consistent with these changes, and instead the opt field is cleared unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to SIT, GRE, and IPIP tunnels. The relevant vti, l2tp, and pptp functions already contain similar code for clearing the IPCB. Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23tcp: convert cached rtt from usec to jiffies when feeding initial rtoKonstantin Khlebnikov
Currently it's converted into msecs, thus HZ=1000 intact. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution") Signed-off-by: David S. Miller <davem@davemloft.net>