aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)Author
2019-12-02Merge tag 'v4.19.86' into v4.19/standard/baseBruce Ashfield
This is the 4.19.86 stable release # gpg: Signature made Sun 24 Nov 2019 02:21:09 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-12-02Merge tag 'v4.19.85' into v4.19/standard/baseBruce Ashfield
This is the 4.19.85 stable release # gpg: Signature made Wed 20 Nov 2019 12:47:54 PM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-11-24tcp: start receiver buffer autotuning soonerYuchung Cheng
[ Upstream commit 041a14d2671573611ffd6412bc16e2f64469f7fb ] Previously receiver buffer auto-tuning starts after receiving one advertised window amount of data. After the initial receiver buffer was raised by patch a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), the reciver buffer may take too long to start raising. To address this issue, this patch lowers the initial bytes expected to receive roughly the expected sender's initial window. Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24tcp: up initial rmem to 128KB and SYN rwin to around 64KBYuchung Cheng
[ Upstream commit a337531b942bd8a03e7052444d7e36972aac2d92 ] Previously TCP initial receive buffer is ~87KB by default and the initial receive window is ~29KB (20 MSS). This patch changes the two numbers to 128KB and ~64KB (rounding down to the multiples of MSS) respectively. The patch also simplifies the calculations s.t. the two numbers are directly controlled by sysctl tcp_rmem[1]: 1) Initial receiver buffer budget (sk_rcvbuf): while this should be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf() always override and set a larger size when a new connection establishes. 2) Initial receive window in SYN: previously it is set to 20 packets if MSS <= 1460. The number 20 was based on the initial congestion window of 10: the receiver needs twice amount to avoid being limited by the receive window upon out-of-order delivery in the first window burst. But since this only applies if the receiving MSS <= 1460, connection using large MTU (e.g. to utilize receiver zero-copy) may be limited by the receive window. With this patch TCP memory configuration is more straight-forward and more properly sized to modern high-speed networks by default. Several popular stacks have been announcing 64KB rwin in SYNs as well. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20netfilter: masquerade: don't flush all conntracks if only one address ↵Tan Hu
deleted on device [ Upstream commit 097f95d319f817e651bd51f8846aced92a55a6a1 ] We configured iptables as below, which only allowed incoming data on established connections: iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT iptables -t mangle -P PREROUTING DROP When deleting a secondary address, current masquerade implements would flush all conntracks on this device. All the established connections on primary address also be deleted, then subsequent incoming data on the connections would be dropped wrongly because it was identified as NEW connection. So when an address was delete, it should only flush connections related with the address. Signed-off-by: Tan Hu <tan.hu@zte.com.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20ip_gre: fix parsing gre header in ipgre_errHaishuang Yan
[ Upstream commit b0350d51f001e6edc13ee4f253b98b50b05dd401 ] gre_parse_header stops parsing when csum_err is encountered, which means tpi->key is undefined and ip_tunnel_lookup will return NULL improperly. This patch introduce a NULL pointer as csum_err parameter. Even when csum_err is encountered, it won't return error and continue parsing gre header as expected. Fixes: 9f57c67c379d ("gre: Remove support for sharing GRE protocol hook.") Reported-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20ipmr: Fix skb headroom in ipmr_get_route().Guillaume Nault
[ Upstream commit 7901cd97963d6cbde88fa25a4a446db3554c16c6 ] In route.c, inet_rtm_getroute_build_skb() creates an skb with no headroom. This skb is then used by inet_rtm_getroute() which may pass it to rt_fill_info() and, from there, to ipmr_get_route(). The later might try to reuse this skb by cloning it and prepending an IPv4 header. But since the original skb has no headroom, skb_push() triggers skb_under_panic(): skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:108! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 RIP: 0010:skb_panic+0xbf/0xd0 Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286 RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0 R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000 R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0 FS: 00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_push+0x7e/0x80 ipmr_get_route+0x459/0x6fa rt_fill_info+0x692/0x9f0 inet_rtm_getroute+0xd26/0xf20 rtnetlink_rcv_msg+0x45d/0x630 netlink_rcv_skb+0x1a5/0x220 rtnetlink_rcv+0x15/0x20 netlink_unicast+0x305/0x3a0 netlink_sendmsg+0x575/0x730 sock_sendmsg+0xb5/0xc0 ___sys_sendmsg+0x497/0x4f0 __sys_sendmsg+0xcb/0x150 __x64_sys_sendmsg+0x48/0x50 do_syscall_64+0xd2/0xac0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Actually the original skb used to have enough headroom, but the reserve_skb() call was lost with the introduction of inet_rtm_getroute_build_skb() by commit 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE"). We could reserve some headroom again in inet_rtm_getroute_build_skb(), but this function shouldn't be responsible for handling the special case of ipmr_get_route(). Let's handle that directly in ipmr_get_route() by calling skb_realloc_headroom() instead of skb_clone(). Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-17Merge tag 'v4.19.84' into v4.19/standard/baseBruce Ashfield
This is the 4.19.84 stable release # gpg: Signature made Tue 12 Nov 2019 01:21:53 PM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-11-17Merge tag 'v4.19.83' into v4.19/standard/baseBruce Ashfield
This is the 4.19.83 stable release # gpg: Signature made Sun 10 Nov 2019 05:27:57 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-11-17Merge tag 'v4.19.81' into v4.19/standard/baseBruce Ashfield
This is the 4.19.81 stable release # gpg: Signature made Tue 29 Oct 2019 04:20:10 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-11-12ipv4: Fix table id reference in fib_sync_down_addrDavid Ahern
[ Upstream commit e0a312629fefa943534fc46f7bfbe6de3fdaf463 ] Hendrik reported routes in the main table using source address are not removed when the address is removed. The problem is that fib_sync_down_addr does not account for devices in the default VRF which are associated with the main table. Fix by updating the table id reference. Fixes: 5a56a0b3a45d ("net: Don't delete routes in different VRFs") Reported-by: Hendrik Donner <hd@os-cillation.de> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10ipv4: fix route update on metric change.Paolo Abeni
[ Upstream commit 0b834ba00ab5337e938c727e216e1f5249794717 ] Since commit af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes"), when updating an IP address with a different metric, the associated connected route is updated, too. Still, the mentioned commit doesn't handle properly some corner cases: $ ip addr add dev eth0 192.168.1.0/24 $ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2 $ ip addr add dev eth0 192.168.3.1/24 $ ip addr change dev eth0 192.168.1.0/24 metric 10 $ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10 $ ip addr change dev eth0 192.168.3.1/24 metric 10 $ ip -4 route 192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0 192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1 192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10 Only the last route is correctly updated. The problem is the current test in fib_modify_prefix_metric(): if (!(dev->flags & IFF_UP) || ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) || ipv4_is_zeronet(prefix) || prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32) Which should be the logical 'not' of the pre-existing test in fib_add_ifaddr(): if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) && (prefix != addr || ifa->ifa_prefixlen < 32)) To properly negate the original expression, we need to change the last logical 'or' to a logical 'and'. Fixes: af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes") Reported-and-suggested-by: Beniamino Galvani <bgalvani@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10net: use skb_queue_empty_lockless() in busy poll contextsEric Dumazet
[ Upstream commit 3f926af3f4d688e2e11e7f8ed04e277a14d4d4a4 ] Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10net: use skb_queue_empty_lockless() in poll() handlersEric Dumazet
[ Upstream commit 3ef7cf57c72f32f61e97f8fa401bc39ea1f1a5d4 ] Many poll() handlers are lockless. Using skb_queue_empty_lockless() instead of skb_queue_empty() is more appropriate. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10udp: use skb_queue_empty_lockless()Eric Dumazet
[ Upstream commit 137a0dbe3426fd7bcfe3f8117b36a87b3590e4eb ] syzbot reported a data-race [1]. We should use skb_queue_empty_lockless() to document that we are not ensuring a mutual exclusion and silence KCSAN. [1] BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470 __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline] udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057 udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074 udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233 __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300 udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1: skb_queue_empty include/linux/skbuff.h:1494 [inline] __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653 udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10udp: fix data-race in udp_set_dev_scratch()Eric Dumazet
[ Upstream commit a793183caa9afae907a0d7ddd2ffd57329369bf5 ] KCSAN reported a data-race in udp_set_dev_scratch() [1] The issue here is that we must not write over skb fields if skb is shared. A similar issue has been fixed in commit 89c22d8c3b27 ("net: Fix skb csum races when peeking") While we are at it, use a helper only dealing with udp_skb_scratch(skb)->csum_unnecessary, as this allows udp_set_dev_scratch() to be called once and thus inlined. [1] BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1: udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308 __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556 first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579 udp_poll+0xea/0x110 net/ipv4/udp.c:2720 sock_poll+0xed/0x250 net/socket.c:1256 vfs_poll include/linux/poll.h:90 [inline] do_select+0x7d0/0x1020 fs/select.c:534 core_sys_select+0x381/0x550 fs/select.c:677 do_pselect.constprop.0+0x11d/0x160 fs/select.c:759 __do_sys_pselect6 fs/select.c:784 [inline] __se_sys_pselect6 fs/select.c:769 [inline] __x64_sys_pselect6+0x12e/0x170 fs/select.c:769 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0: udp_skb_csum_unnecessary include/net/udp.h:358 [inline] udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310 inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10net: annotate accesses to sk->sk_incoming_cpuEric Dumazet
[ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ] This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10inet: stop leaking jiffies on the wireEric Dumazet
[ Upstream commit a904a0693c189691eeee64f6c6b188bd7dc244e9 ] Historically linux tried to stick to RFC 791, 1122, 2003 for IPv4 ID field generation. RFC 6864 made clear that no matter how hard we try, we can not ensure unicity of IP ID within maximum lifetime for all datagrams with a given source address/destination address/protocol tuple. Linux uses a per socket inet generator (inet_id), initialized at connection startup with a XOR of 'jiffies' and other fields that appear clear on the wire. Thiemo Nagel pointed that this strategy is a privacy concern as this provides 16 bits of entropy to fingerprint devices. Let's switch to a random starting point, this is just as good as far as RFC 6864 is concerned and does not leak anything critical. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thiemo Nagel <tnagel@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10erspan: fix the tun_info options_len check for erspanXin Long
[ Upstream commit 2eb8d6d2910cfe3dc67dc056f26f3dd9c63d47cd ] The check for !md doens't really work for ip_tunnel_info_opts(info) which only does info + 1. Also to avoid out-of-bounds access on info, it should ensure options_len is not less than erspan_metadata in both erspan_xmit() and ip6erspan_tunnel_xmit(). Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29ipv4: Return -ENETUNREACH if we can't create route but saddr is validStefano Brivio
[ Upstream commit 595e0651d0296bad2491a4a29a7a43eae6328b02 ] ...instead of -EINVAL. An issue was found with older kernel versions while unplugging a NFS client with pending RPCs, and the wrong error code here prevented it from recovering once link is back up with a configured address. Incidentally, this is not an issue anymore since commit 4f8943f80883 ("SUNRPC: Replace direct task wakeups from softirq context"), included in 5.2-rc7, had the effect of decoupling the forwarding of this error by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin Coddington. To the best of my knowledge, this isn't currently causing any further issue, but the error code doesn't look appropriate anyway, and we might hit this in other paths as well. In detail, as analysed by Gonzalo Siero, once the route is deleted because the interface is down, and can't be resolved and we return -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(), as the socket error seen by tcp_write_err(), called by tcp_retransmit_timer(). In turn, tcp_write_err() indirectly calls xs_error_report(), which wakes up the RPC pending tasks with a status of -EINVAL. This is then seen by call_status() in the SUN RPC implementation, which aborts the RPC call calling rpc_exit(), instead of handling this as a potentially temporary condition, i.e. as a timeout. Return -EINVAL only if the input parameters passed to ip_route_output_key_hash_rcu() are actually invalid (this is the case if the specified source address is multicast, limited broadcast or all zeroes), but return -ENETUNREACH in all cases where, at the given moment, the given source address doesn't allow resolving the route. While at it, drop the initialisation of err to -ENETUNREACH, which was added to __ip_route_output_key() back then by commit 0315e3827048 ("net: Fix behaviour of unreachable, blackhole and prohibit routes"), but actually had no effect, as it was, and is, overwritten by the fib_lookup() return code assignment, and anyway ignored in all other branches, including the if (fl4->saddr) one: I find this rather confusing, as it would look like -ENETUNREACH is the "default" error, while that statement has no effect. Also note that after commit fc75fc8339e7 ("ipv4: dont create routes on down devices"), we would get -ENETUNREACH if the device is down, but -EINVAL if the source address is specified and we can't resolve the route, and this appears to be rather inconsistent. Reported-by: Stefan Walter <walteste@inf.ethz.ch> Analysed-by: Benjamin Coddington <bcodding@redhat.com> Analysed-by: Gonzalo Siero <gsierohu@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29ipv4: fix race condition between route lookup and invalidationWei Wang
[ Upstream commit 5018c59607a511cdee743b629c76206d9c9e6d7b ] Jesse and Ido reported the following race condition: <CPU A, t0> - Received packet A is forwarded and cached dst entry is taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set() <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables from multiple ISPs"), route is added / deleted and rt_cache_flush() is called <CPU B, t2> - Received packet B tries to use the same cached dst entry from t0, but rt_cache_valid() is no longer true and it is replaced in rt_cache_route() by the newer one. This calls dst_dev_put() on the original dst entry which assigns the blackhole netdev to 'dst->dev' <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due to 'dst->dev' being the blackhole netdev There are 2 issues in the v4 routing code: 1. A per-netns counter is used to do the validation of the route. That means whenever a route is changed in the netns, users of all routes in the netns needs to redo lookup. v6 has an implementation of only updating fn_sernum for routes that are affected. 2. When rt_cache_valid() returns false, rt_cache_route() is called to throw away the current cache, and create a new one. This seems unnecessary because as long as this route does not change, the route cache does not need to be recreated. To fully solve the above 2 issues, it probably needs quite some code changes and requires careful testing, and does not suite for net branch. So this patch only tries to add the deleted cached rt into the uncached list, so user could still be able to use it to receive packets until it's done. Fixes: 95c47f9cf5e0 ("ipv4: call dst_dev_put() properly") Signed-off-by: Wei Wang <weiwan@google.com> Reported-by: Ido Schimmel <idosch@idosch.org> Reported-by: Jesse Hathaway <jesse@mbuki-mvuki.org> Tested-by: Jesse Hathaway <jesse@mbuki-mvuki.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-10Merge tag 'v4.19.78' into v4.19/standard/baseBruce Ashfield
This is the 4.19.78 stable release # gpg: Signature made Mon 07 Oct 2019 12:57:29 PM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-10Merge tag 'v4.19.77' into v4.19/standard/baseBruce Ashfield
This is the 4.19.77 stable release # gpg: Signature made Sat 05 Oct 2019 07:10:18 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-10Merge tag 'v4.19.76' into v4.19/standard/baseBruce Ashfield
This is the 4.19.76 stable release # gpg: Signature made Tue 01 Oct 2019 02:26:13 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-10Merge tag 'v4.19.75' into v4.19/standard/baseBruce Ashfield
This is the 4.19.75 stable release # gpg: Signature made Sat 21 Sep 2019 01:17:15 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-10Merge tag 'v4.19.74' into v4.19/standard/baseBruce Ashfield
This is the 4.19.74 stable release # gpg: Signature made Thu 19 Sep 2019 03:09:41 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-07udp: only do GSO if # of segs > 1Josh Hunt
[ Upstream commit 4094871db1d65810acab3d57f6089aa39ef7f648 ] Prior to this change an application sending <= 1MSS worth of data and enabling UDP GSO would fail if the system had SW GSO enabled, but the same send would succeed if HW GSO offload is enabled. In addition to this inconsistency the error in the SW GSO case does not get back to the application if sending out of a real device so the user is unaware of this failure. With this change we only perform GSO if the # of segments is > 1 even if the application has enabled segmentation. I've also updated the relevant udpgso selftests. Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Josh Hunt <johunt@akamai.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07udp: fix gso_segs calculationsJosh Hunt
[ Upstream commit 44b321e5020d782ad6e8ae8183f09b163be6e6e2 ] Commit dfec0ee22c0a ("udp: Record gso_segs when supporting UDP segmentation offload") added gso_segs calculation, but incorrectly got sizeof() the pointer and not the underlying data type. In addition let's fix the v6 case. Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Fixes: dfec0ee22c0a ("udp: Record gso_segs when supporting UDP segmentation offload") Signed-off-by: Josh Hunt <johunt@akamai.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07net: ipv4: avoid mixed n_redirects and rate_tokens usagePaolo Abeni
[ Upstream commit b406472b5ad79ede8d10077f0c8f05505ace8b6d ] Since commit c09551c6ff7f ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets") we use 'n_redirects' to account for redirect packets, but we still use 'rate_tokens' to compute the redirect packets exponential backoff. If the device sent to the relevant peer any ICMP error packet after sending a redirect, it will also update 'rate_token' according to the leaking bucket schema; typically 'rate_token' will raise above BITS_PER_LONG and the redirect packets backoff algorithm will produce undefined behavior. Fix the issue using 'n_redirects' to compute the exponential backoff in ip_rt_send_redirect(). Note that we still clear rate_tokens after a redirect silence period, to avoid changing an established behaviour. The root cause predates git history; before the mentioned commit in the critical scenario, the kernel stopped sending redirects, after the mentioned commit the behavior more randomic. Reported-by: Xiumei Mu <xmu@redhat.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: c09551c6ff7f ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07erspan: remove the incorrect mtu limit for erspanHaishuang Yan
[ Upstream commit 0e141f757b2c78c983df893e9993313e2dc21e38 ] erspan driver calls ether_setup(), after commit 61e84623ace3 ("net: centralize net_device min/max MTU checking"), the range of mtu is [min_mtu, max_mtu], which is [68, 1500] by default. It causes the dev mtu of the erspan device to not be greater than 1500, this limit value is not correct for ipgre tap device. Tested: Before patch: # ip link set erspan0 mtu 1600 Error: mtu greater than device maximum. After patch: # ip link set erspan0 mtu 1600 # ip -d link show erspan0 21: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1600 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 0 Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05tcp: better handle TCP_USER_TIMEOUT in SYN_SENT stateEric Dumazet
[ Upstream commit a66b10c05ee2d744189e9a2130394b070883d289 ] Yuchung Cheng and Marek Majkowski independently reported a weird behavior of TCP_USER_TIMEOUT option when used at connect() time. When the TCP_USER_TIMEOUT is reached, tcp_write_timeout() believes the flow should live, and the following condition in tcp_clamp_rto_to_user_timeout() programs one jiffie timers : remaining = icsk->icsk_user_timeout - elapsed; if (remaining <= 0) return 1; /* user timeout has passed; fire ASAP */ This silly situation ends when the max syn rtx count is reached. This patch makes sure we honor both TCP_SYNCNT and TCP_USER_TIMEOUT, avoiding these spurious SYN packets. Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Reported-by: Marek Majkowski <marek@cloudflare.com> Cc: Jon Maxwell <jmaxwell37@gmail.com> Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2 Acked-by: Jon Maxwell <jmaxwell37@gmail.com> Tested-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Marek Majkowski <marek@cloudflare.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-01net: don't warn in inet diag when IPV6 is disabledStephen Hemminger
[ Upstream commit 1e64d7cbfdce4887008314d5b367209582223f27 ] If IPV6 was disabled, then ss command would cause a kernel warning because the command was attempting to dump IPV6 socket information. The fix is to just remove the warning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202249 Fixes: 432490f9d455 ("net: ip, diag -- Add diag interface for raw sockets") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21udp: correct reuseport selection with connected socketsWillem de Bruijn
[ Upstream commit acdcecc61285faed359f1a3568c32089cc3a8329 ] UDP reuseport groups can hold a mix unconnected and connected sockets. Ensure that connections only receive all traffic to their 4-tuple. Fast reuseport returns on the first reuseport match on the assumption that all matches are equal. Only if connections are present, return to the previous behavior of scoring all sockets. Record if connections are present and if so (1) treat such connected sockets as an independent match from the group, (2) only return 2-tuple matches from reuseport and (3) do not return on the first 2-tuple reuseport match to allow for a higher scoring match later. New field has_conns is set without locks. No other fields in the bitmap are modified at runtime and the field is only ever set unconditionally, so an RMW cannot miss a change. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-19tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWRNeal Cardwell
[ Upstream commit af38d07ed391b21f7405fa1f936ca9686787d6d2 ] Fix tcp_ecn_withdraw_cwr() to clear the correct bit: TCP_ECN_QUEUE_CWR. Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about the behavior of data receivers, and deciding whether to reflect incoming IP ECN CE marks as outgoing TCP th->ece marks. The TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders, and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function is only called from tcp_undo_cwnd_reduction() by data senders during an undo, so it should zero the sender-side state, TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of incoming CE bits on incoming data packets just because outgoing packets were spuriously retransmitted. The bug has been reproduced with packetdrill to manifest in a scenario with RFC3168 ECN, with an incoming data packet with CE bit set and carrying a TCP timestamp value that causes cwnd undo. Before this fix, the IP CE bit was ignored and not reflected in the TCP ECE header bit, and sender sent a TCP CWR ('W') bit on the next outgoing data packet, even though the cwnd reduction had been undone. After this fix, the sender properly reflects the CE bit and does not set the W bit. Note: the bug actually predates 2005 git history; this Fixes footer is chosen to be the oldest SHA1 I have tested (from Sep 2007) for which the patch applies cleanly (since before this commit the code was in a .h file). Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-15Merge tag 'v4.19.72' into v4.19/standard/baseBruce Ashfield
This is the 4.19.72 stable release
2019-09-15Merge tag 'v4.19.70' into v4.19/standard/baseBruce Ashfield
This is the 4.19.70 stable release
2019-09-15Merge tag 'v4.19.67' into v4.19/standard/baseBruce Ashfield
This is the 4.19.67 stable release
2019-09-15Merge tag 'v4.19.66' into v4.19/standard/baseBruce Ashfield
This is the 4.19.66 stable release
2019-09-15Merge tag 'v4.19.64' into v4.19/standard/baseBruce Ashfield
This is the 4.19.64 stable release
2019-09-15Merge tag 'v4.19.62' into v4.19/standard/baseBruce Ashfield
This is the 4.19.62 stable release
2019-09-10tcp: remove empty skb from write queue in error casesEric Dumazet
[ Upstream commit fdfc5c8594c24c5df883583ebd286321a80e0a67 ] Vladimir Rutsky reported stuck TCP sessions after memory pressure events. Edge Trigger epoll() user would never receive an EPOLLOUT notification allowing them to retry a sendmsg(). Jason tested the case of sk_stream_alloc_skb() returning NULL, but there are other paths that could lead both sendmsg() and sendpage() to return -1 (EAGAIN), with an empty skb queued on the write queue. This patch makes sure we remove this empty skb so that Jason code can detect that the queue is empty, and call sk->sk_write_space(sk) accordingly. Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Reported-by: Vladimir Rutsky <rutsky@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10tcp: inherit timestamp on mtu probeWillem de Bruijn
[ Upstream commit 888a5c53c0d8be6e98bc85b677f179f77a647873 ] TCP associates tx timestamp requests with a byte in the bytestream. If merging skbs in tcp_mtu_probe, migrate the tstamp request. Similar to MSG_EOR, do not allow moving a timestamp from any segment in the probe but the last. This to avoid merging multiple timestamps. Tested with the packetdrill script at https://github.com/wdebruij/packetdrill/commits/mtu_probe-1 Link: http://patchwork.ozlabs.org/patch/1143278/#2232897 Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-06ipv4/icmp: fix rt dst dev null pointer dereferenceHangbin Liu
[ Upstream commit e2c693934194fd3b4e795635934883354c06ebc9 ] In __icmp_send() there is a possibility that the rt->dst.dev is NULL, e,g, with tunnel collect_md mode, which will cause kernel crash. Here is what the code path looks like, for GRE: - ip6gre_tunnel_xmit - ip6gre_xmit_ipv4 - __gre6_xmit - ip6_tnl_xmit - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE - icmp_send - net = dev_net(rt->dst.dev); <-- here The reason is __metadata_dst_init() init dst->dev to NULL by default. We could not fix it in __metadata_dst_init() as there is no dev supplied. On the other hand, the reason we need rt->dst.dev is to get the net. So we can just try get it from skb->dev when rt->dst.dev is NULL. v4: Julian Anastasov remind skb->dev also could be NULL. We'd better still use dst.dev and do a check to avoid crash. v3: No changes. v2: fix the issue in __icmp_send() instead of updating shared dst dev in {ip_md, ip6}_tunnel_xmit. Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16netfilter: Fix rpfilter dropping vrf packets by mistakeMiaohe Lin
[ Upstream commit b575b24b8eee37f10484e951b62ce2a31c579775 ] When firewalld is enabled with ipv4/ipv6 rpfilter, vrf ipv4/ipv6 packets will be dropped. Vrf device will pass through netfilter hook twice. One with enslaved device and another one with l3 master device. So in device may dismatch witch out device because out device is always enslaved device.So failed with the check of the rpfilter and drop the packets by mistake. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-09ipip: validate header length in ipip_tunnel_xmitHaishuang Yan
[ Upstream commit 47d858d0bdcd47cc1c6c9eeca91b091dd9e55637 ] We need the same checks introduced by commit cb9f1b783850 ("ip: validate header length on virtual device xmit") for ipip tunnel. Fixes: cb9f1b783850b ("ip: validate header length on virtual device xmit") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULLXin Long
commit 5684abf7020dfc5f0b6ba1d68eda3663871fce52 upstream. iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28tcp: Reset bytes_acked and bytes_received when disconnectingChristoph Paasch
[ Upstream commit e858faf556d4e14c750ba1e8852783c6f9520a0e ] If an app is playing tricks to reuse a socket via tcp_disconnect(), bytes_acked/received needs to be reset to 0. Otherwise tcp_info will report the sum of the current and the old connection.. Cc: Eric Dumazet <edumazet@google.com> Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Fixes: bdd1f9edacb5 ("tcp: add tcpi_bytes_received to tcp_info") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ] Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28tcp: be more careful in tcp_fragment()Eric Dumazet
[ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ] Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented. We should allow these flows to make progress. This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit. It also adds the some room due to the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15 Note for < 4.15 backports : tcp_rtx_queue_tail() will probably look like : static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) { struct sk_buff *skb = tcp_send_head(sk); return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); } Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Cc: Jonathan Looney <jtl@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28ipv4: don't set IPv6 only flags to IPv4 addressesMatteo Croce
[ Upstream commit 2e60546368165c2449564d71f6005dda9205b5fb ] Avoid the situation where an IPV6 only flag is applied to an IPv4 address: # ip addr add 192.0.2.1/24 dev dummy0 nodad home mngtmpaddr noprefixroute # ip -4 addr show dev dummy0 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet 192.0.2.1/24 scope global noprefixroute dummy0 valid_lft forever preferred_lft forever Or worse, by sending a malicious netlink command: # ip -4 addr show dev dummy0 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet 192.0.2.1/24 scope global nodad optimistic dadfailed home tentative mngtmpaddr noprefixroute stable-privacy dummy0 valid_lft forever preferred_lft forever Signed-off-by: Matteo Croce <mcroce@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>