summaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)Author
2021-06-15tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.Kuniyuki Iwashima
As with the preceding patch, this patch changes reqsk_timer_handler() to call reuseport_migrate_sock() and inet_reqsk_clone() to migrate in-flight requests at retransmitting SYN+ACKs. If we can select a new listener and clone the request, we resume setting the SYN+ACK timer for the new req. If we can set the timer, we call inet_ehash_insert() to unhash the old req and put the new req into ehash. The noteworthy point here is that by unhashing the old req, another CPU processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and drop the final ACK packet. However, the new timer will recover this situation. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-7-kuniyu@amazon.co.jp
2021-06-15tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.Kuniyuki Iwashima
When we call close() or shutdown() for listening sockets, each child socket in the accept queue are freed at inet_csk_listen_stop(). If we can get a new listener by reuseport_migrate_sock() and clone the request by inet_reqsk_clone(), we try to add it into the new listener's accept queue by inet_csk_reqsk_queue_add(). If it fails, we have to call __reqsk_free() to call sock_put() for its listener and free the cloned request. After putting the full socket into ehash, tcp_v[46]_syn_recv_sock() sets NULL to ireq_opt/pktopts in struct inet_request_sock, but ipv6_opt can be non-NULL. So, we have to set NULL to ipv6_opt of the old request to avoid double free. Note that we do not update req->rsk_listener and instead clone the req to migrate because another path may reference the original request. If we protected it by RCU, we would need to add rcu_read_lock() in many places. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-6-kuniyu@amazon.co.jp
2021-06-15tcp: Keep TCP_CLOSE sockets in the reuseport group.Kuniyuki Iwashima
When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb. By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0. When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour: - we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15net: Introduce net.ipv4.tcp_migrate_req.Kuniyuki Iwashima
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled or eBPF program is attached, we will be able to migrate child sockets from a listener to another in the same reuseport group after close() or shutdown() syscalls. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
2021-06-14ipv4: Fix device used for dst_alloc with local routesDavid Ahern
Oliver reported a use case where deleting a VRF device can hang waiting for the refcnt to drop to 0. The root cause is that the dst is allocated against the VRF device but cached on the loopback device. The use case (added to the selftests) has an implicit VRF crossing due to the ordering of the FIB rules (lookup local is before the l3mdev rule, but the problem occurs even if the FIB rules are re-ordered with local after l3mdev because the VRF table does not have a default route to terminate the lookup). The end result is is that the FIB lookup returns the loopback device as the nexthop, but the ingress device is in a VRF. The mismatch causes the dst alloc against the VRF device but then cached on the loopback. The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu): pick the dst alloc device based the fib lookup result but with checks that the result has a nexthop device (e.g., not an unreachable or prohibit entry). Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Reported-by: Oliver Herms <oliver.peter.herms@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10ping: Check return value of function 'ping_queue_rcv_skb'Zheng Yongjun
Function 'ping_queue_rcv_skb' not always return success, which will also return fail. If not check the wrong return value of it, lead to function `ping_rcv` return success. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09udp: fix race between close() and udp_abort()Paolo Abeni
Kaustubh reported and diagnosed a panic in udp_lib_lookup(). The root cause is udp_abort() racing with close(). Both racing functions acquire the socket lock, but udp{v6}_destroy_sock() release it before performing destructive actions. We can't easily extend the socket lock scope to avoid the race, instead use the SOCK_DEAD flag to prevent udp_abort from doing any action when the critical race happens. Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org> Fixes: 5d77dca82839 ("net: diag: support SOCK_DESTROY for UDP sockets") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09inet: annotate data race in inet_send_prepare() and inet_dgram_connect()Eric Dumazet
Both functions are known to be racy when reading inet_num as we do not want to grab locks for the common case the socket has been bound already. The race is resolved in inet_autobind() by reading again inet_num under the socket lock. syzbot reported: BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0: udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308 udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89 inet_autobind net/ipv4/af_inet.c:183 [inline] inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1: inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0x9db4 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09xfrm: remove description from xfrm_type structFlorian Westphal
Its set but never read. Reduces size of xfrm_type to 64 bytes on 64bit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-08net: ipv4: Remove unneed BUG() functionZheng Yongjun
When 'nla_parse_nested_deprecated' failed, it's no need to BUG() here, return -EINVAL is ok. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08net: ipv4: fix memory leak in netlbl_cipsov4_add_stdNanyong Sun
Reported by syzkaller: BUG: memory leak unreferenced object 0xffff888105df7000 (size 64): comm "syz-executor842", pid 360, jiffies 4294824824 (age 22.546s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000e67ed558>] kmalloc include/linux/slab.h:590 [inline] [<00000000e67ed558>] kzalloc include/linux/slab.h:720 [inline] [<00000000e67ed558>] netlbl_cipsov4_add_std net/netlabel/netlabel_cipso_v4.c:145 [inline] [<00000000e67ed558>] netlbl_cipsov4_add+0x390/0x2340 net/netlabel/netlabel_cipso_v4.c:416 [<0000000006040154>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 net/netlink/genetlink.c:739 [<00000000204d7a1c>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] [<00000000204d7a1c>] genl_rcv_msg+0x2bf/0x4f0 net/netlink/genetlink.c:800 [<00000000c0d6a995>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504 [<00000000d78b9d2c>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811 [<000000009733081b>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] [<000000009733081b>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340 [<00000000d5fd43b8>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929 [<000000000a2d1e40>] sock_sendmsg_nosec net/socket.c:654 [inline] [<000000000a2d1e40>] sock_sendmsg+0x139/0x170 net/socket.c:674 [<00000000321d1969>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350 [<00000000964e16bc>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404 [<000000001615e288>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433 [<000000004ee8b6a5>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47 [<00000000171c7cee>] entry_SYSCALL_64_after_hwframe+0x44/0xae The memory of doi_def->map.std pointing is allocated in netlbl_cipsov4_add_std, but no place has freed it. It should be freed in cipso_v4_doi_free which frees the cipso DOI resource. Fixes: 96cb8e3313c7a ("[NetLabel]: CIPSOv4 and Unlabeled packet integration") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07ipv4: Fix spelling mistakesZheng Yongjun
Fix some spelling mistakes in comments: Dont ==> Don't timout ==> timeout incomming ==> incoming necesarry ==> necessary substract ==> subtract Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Bug fixes overlapping feature additions and refactoring, mostly. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-04tcp: export timestamp helpers for mptcpFlorian Westphal
MPTCP is builtin, so no need to add EXPORT_SYMBOL()s. It will be used to support SO_TIMESTAMP(NS) ancillary messages in the mptcp receive path. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03icmp: fix lib conflict with trinityAndreas Roeseler
Including <linux/in.h> and <netinet/in.h> in the dependencies breaks compilation of trinity due to multiple definitions. <linux/in.h> is only used in <linux/icmp.h> to provide the definition of the struct in_addr, but this can be substituted out by using the datatype __be32. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-03net: tcp better handling of reordering then loss casesYuchung Cheng
This patch aims to improve the situation when reordering and loss are ocurring in the same flight of packets. Previously the reordering would first induce a spurious recovery, then the subsequent ACK may undo the cwnd (based on the timestamps e.g.). However the current loss recovery does not proceed to invoke RACK to install a reordering timer. If some packets are also lost, this may lead to a long RTO-based recovery. An example is https://groups.google.com/g/bbr-dev/c/OFHADvJbTEI The solution is to after reverting the recovery, always invoke RACK to either mount the RACK timer to fast retransmit after the reordering window, or restarts the recovery if new loss is identified. Hence it is possible the sender may go from Recovery to Disorder/Open to Recovery again in one ACK. Reported-by: mingkun bian <bianmingkun@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02net: ipconfig: Don't override command-line hostnames or domainsJosh Triplett
If the user specifies a hostname or domain name as part of the ip= command-line option, preserve it and don't overwrite it with one supplied by DHCP/BOOTP. For instance, ip=::::myhostname::dhcp will use "myhostname" rather than ignoring and overwriting it. Fix the comment on ic_bootp_string that suggests it only copies a string "if not already set"; it doesn't have any such logic. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Support for SCTP chunks matching on nf_tables, from Phil Sutter. 2) Skip LDMXCSR, we don't need a valid MXCSR state. From Stefano Brivio. 3) CONFIG_RETPOLINE for nf_tables set lookups, from Florian Westphal. 4) A few Kconfig leading spaces removal, from Juerg Haefliger. 5) Remove spinlock from xt_limit, from Jason Baron. 6) Remove useless initialization in xt_CT, oneliner from Yang Li. 7) Tree-wide replacement of netlink_unicast() by nfnetlink_unicast(). 8) Reduce footprint of several structures: xt_action_param, nft_pktinfo and nf_hook_state, from Florian. 10) Add nft_thoff() and nft_sk() helpers and use them, also from Florian. 11) Fix documentation in nf_tables pipapo avx2, from Florian Westphal. 12) Fix clang-12 fmt string warnings, also from Florian. ====================
2021-06-01net: Return the correct errno codeZheng Yongjun
When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-29netfilter: nf_tables: add and use nft_sk helperFlorian Westphal
This allows to change storage placement later on without changing readers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-05-19 The following pull-request contains BPF updates for your *net-next* tree. We've added 43 non-merge commits during the last 11 day(s) which contain a total of 74 files changed, 3717 insertions(+), 578 deletions(-). The main changes are: 1) syscall program type, fd array, and light skeleton, from Alexei. 2) Stop emitting static variables in skeleton, from Andrii. 3) Low level tc-bpf api, from Kumar. 4) Reduce verifier kmalloc/kfree churn, from Lorenz. ====================
2021-05-19net: Add notifications when multipath hash field changeIdo Schimmel
In-kernel notifications are already sent when the multipath hash policy itself changes, but not when the multipath hash fields change. Add these notifications, so that interested listeners (e.g., switch ASIC drivers) could perform the necessary configuration. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18cipso: correct comments of cipso_v4_cache_invalidate()Zheng Yejian
Since cipso_v4_cache_invalidate() has no return value, so drop related descriptions in its comments. Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv4: Add custom multipath hash policyIdo Schimmel
Add a new multipath hash policy where the packet fields used for hash calculation are determined by user space via the fib_multipath_hash_fields sysctl that was introduced in the previous patch. The current set of available packet fields includes both outer and inner fields, which requires two invocations of the flow dissector. Avoid unnecessary dissection of the outer or inner flows by skipping dissection if none of the outer or inner fields are required. In accordance with the existing policies, when an skb is not available, packet fields are extracted from the provided flow key. In which case, only outer fields are considered. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv4: Add a sysctl to control multipath hash fieldsIdo Schimmel
A subsequent patch will add a new multipath hash policy where the packet fields used for multipath hash calculation are determined by user space. This patch adds a sysctl that allows user space to set these fields. The packet fields are represented using a bitmask and are common between IPv4 and IPv6 to allow user space to use the same numbering across both protocols. For example, to hash based on standard 5-tuple: # sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037 net.ipv4.fib_multipath_hash_fields = 0x0037 The kernel rejects unknown fields, for example: # sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000 sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument More fields can be added in the future, if needed. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18ipv4: Calculate multipath hash inside switch statementIdo Schimmel
A subsequent patch will add another multipath hash policy where the multipath hash is calculated directly by the policy specific code and not outside of the switch statement. Prepare for this change by moving the multipath hash calculation inside the switch statement. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18skmsg: Remove unused parameters of sk_msg_wait_data()Cong Wang
'err' and 'flags' are not used, we can just get rid of them. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210517022348.50555-1-xiyou.wangcong@gmail.com
2021-05-17ipv4: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17net: Remove the member netns_okYejune Deng
Every protocol has the 'netns_ok' member and it is euqal to 1. The 'if (!prot->netns_ok)' always false in inet_add_protocol(). Signed-off-by: Yejune Deng <yejunedeng@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17ip: Treat IPv4 segment's lowest address as unicastSeth David Schoen
Treat only the highest, not the lowest, IPv4 address within a local subnet as a broadcast address. Signed-off-by: Seth David Schoen <schoen@loyalty.org> Suggested-by: John Gilmore <gnu@toad.com> Acked-by: Dave Taht <dave.taht@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14tcp: add tracepoint for checksum errorsJakub Kicinski
Add a tracepoint for capturing TCP segments with a bad checksum. This makes it easy to identify sources of bad frames in the fleet (e.g. machines with faulty NICs). It should also help tools like IOvisor's tcpdrop.py which are used today to get detailed information about such packets. We don't have a socket in many cases so we must open code the address extraction based just on the skb. v2: add missing export for ipv6=m Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14esp: drop unneeded assignment in esp4_gro_receive()Yang Li
Making '!=' operation with 0 directly after calling the function xfrm_parse_spi() is more efficient, assignment to err is redundant. Eliminate the following clang_analyzer warning: net/ipv4/esp4_offload.c:41:7: warning: Although the value stored to 'err' is used in the enclosing expression, the value is never actually read from 'err' No functional change, only more efficient. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-05-11 The following pull-request contains BPF updates for your *net* tree. We've added 13 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 817 insertions(+), 382 deletions(-). The main changes are: 1) Fix multiple ringbuf bugs in particular to prevent writable mmap of read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo. 2) Fix verifier alu32 known-const subregister bound tracking for bitwise operations and/or/xor, from Daniel Borkmann. 3) Reject trampoline attachment for functions with variable arguments, and also add a deny list of other forbidden functions, from Jiri Olsa. 4) Fix nested bpf_bprintf_prepare() calls used by various helpers by switching to per-CPU buffers, from Florent Revest. 5) Fix kernel compilation with BTF debug info on ppc64 due to pahole missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau. 6) Add a kconfig entry to provide an option to disallow unprivileged BPF by default, from Daniel Borkmann. 7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY() macro is not available, from Arnaldo Carvalho de Melo. 8) Migrate test_tc_redirect to test_progs framework as prep work for upcoming skb_change_head() fix & selftest, from Jussi Maki. 9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not present, from Ian Rogers. 10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame size, from Magnus Karlsson. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11bpf: Limit static tcp-cc functions in the .BTF_ids list to x86Martin KaFai Lau
During the discussion in [0]. It was pointed out that static functions in ppc64 is prefixed with ".". For example, the 'readelf -s vmlinux.ppc': 89326: c000000001383280 24 NOTYPE LOCAL DEFAULT 31 cubictcp_init 89327: c000000000c97c50 168 FUNC LOCAL DEFAULT 2 .cubictcp_init The one with FUNC type is ".cubictcp_init" instead of "cubictcp_init". The "." seems to be done by arch/powerpc/include/asm/ppc_asm.h. This caused that pahole cannot generate the BTF for these tcp-cc kernel functions because pahole only captures the FUNC type and "cubictcp_init" is not. It then failed the kernel compilation in ppc64. This behavior is only reported in ppc64 so far. I tried arm64, s390, and sparc64 and did not observe this "." prefix and NOTYPE behavior. Since the kfunc call is only supported in the x86_64 and x86_32 JIT, this patch limits those tcp-cc functions to x86 only to avoid unnecessary compilation issue in other ARCHs. In the future, we can examine if it is better to change all those functions from static to extern. [0] https://lore.kernel.org/bpf/4e051459-8532-7b61-c815-f3435767f8a0@kernel.org/ Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Michal Suchánek <msuchanek@suse.de> Cc: Jiri Slaby <jslaby@suse.com> Cc: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20210508005011.3863757-1-kafai@fb.com
2021-05-10rtnetlink: avoid RCU read lock when holding RTNLCong Wang
When we call af_ops->set_link_af() we hold a RCU read lock as we retrieve af_ops from the RCU protected list, but this is unnecessary because we already hold RTNL lock, which is the writer lock for protecting rtnl_af_ops, so it is safer than RCU read lock. Similar for af_ops->validate_link_af(). This was not a problem until we begin to take mutex lock down the path of ->set_link_af() in __ipv6_dev_mc_dec() recently. We can just drop the RCU read lock there and assert RTNL lock. Reported-and-tested-by: syzbot+7d941e89dd48bcf42573@syzkaller.appspotmail.com Fixes: 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data") Tested-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Add SECMARK revision 1 to fix incorrect layout that prevents from remove rule with this target, from Phil Sutter. 2) Fix pernet exit path spat in arptables, from Florian Westphal. 3) Missing rcu_read_unlock() for unknown nfnetlink callbacks, reported by syzbot, from Eric Dumazet. 4) Missing check for skb_header_pointer() NULL pointer in nfnetlink_osf. 5) Remove BUG_ON() after skb_header_pointer() from packet path in several conntrack helper and the TCP tracker. 6) Fix memleak in the new object error path of userdata. 7) Avoid overflows in nft_hash_buckets(), reported by syzbot, also from Eric. 8) Avoid overflows in 32bit arches, from Eric. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nftables: avoid potential overflows on 32bit arches netfilter: nftables: avoid overflows in nft_hash_buckets() netfilter: nftables: Fix a memleak from userdata error path in new objects netfilter: remove BUG_ON() after skb_header_pointer() netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check netfilter: nfnetlink: add a missing rcu_read_unlock() netfilter: arptables: use pernet ops struct during unregister netfilter: xt_SECMARK: add new revision to fix structure layout ==================== Link: https://lore.kernel.org/r/20210507174739.1850-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-06tcp: Specify cmsgbuf is user pointer for receive zerocopy.Arjun Roy
A prior change (1f466e1f15cf) introduces separate handling for ->msg_control depending on whether the pointer is a kernel or user pointer. However, while tcp receive zerocopy is using this field, it is not properly annotating that the buffer in this case is a user pointer. This can cause faults when the improper mechanism is used within put_cmsg(). This patch simply annotates tcp receive zerocopy's use as explicitly being a user pointer. Fixes: 7eeba1706eba ("tcp: Add receive timestamp support for receive zerocopy.") Signed-off-by: Arjun Roy <arjunroy@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210506223530.2266456-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-04net: Only allow init netns to set default tcp cong to a restricted algoJonathon Reinhart
tcp_set_default_congestion_control() is netns-safe in that it writes to &net->ipv4.tcp_congestion_control, but it also sets ca->flags |= TCP_CONG_NON_RESTRICTED which is not namespaced. This has the unintended side-effect of changing the global net.ipv4.tcp_allowed_congestion_control sysctl, despite the fact that it is read-only: 97684f0970f6 ("net: Make tcp_allowed_congestion_control readonly in non-init netns") Resolve this netns "leak" by only allowing the init netns to set the default algorithm to one that is restricted. This restriction could be removed if tcp_allowed_congestion_control were namespace-ified in the future. This bug was uncovered with https://github.com/JonathonReinhart/linux-netns-sysctl-verify Fixes: 6670e1524477 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control") Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-03netfilter: arptables: use pernet ops struct during unregisterFlorian Westphal
Like with iptables and ebtables, hook unregistration has to use the pernet ops struct, not the template. This triggered following splat: hook not found, pf 3 num 0 WARNING: CPU: 0 PID: 224 at net/netfilter/core.c:480 __nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480 [..] nf_unregister_net_hook net/netfilter/core.c:502 [inline] nf_unregister_net_hooks+0x117/0x160 net/netfilter/core.c:576 arpt_unregister_table_pre_exit+0x67/0x80 net/ipv4/netfilter/arp_tables.c:1565 Fixes: f9006acc8dfe5 ("netfilter: arp_tables: pass table pointer via nf_hook_ops") Reported-by: syzbot+dcccba8a1e41a38cb9df@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-28icmp: standardize naming of RFC 8335 PROBE constantsAndreas Roeseler
The current definitions of constants for PROBE, currently defined only in the net-next kernel branch, are inconsistent, with some beginning with ICMP and others with simply EXT. This patch attempts to standardize the naming conventions of the constants for PROBE before their release into a stable Kernel, and to update the relevant definitions in net/ipv4/icmp.c. Similarly, the definitions for the code field (previously ICMP_EXT_MAL_QUERY, etc) use the same prefixes as the type field. This patch adds _CODE_ to the prefix to clarify the distinction of these constants. Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Acked-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210427153635.2591-1-andreas.a.roeseler@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) The various ip(6)table_foo incarnations are updated to expect that the table is passed as 'void *priv' argument that netfilter core passes to the hook functions. This reduces the struct net size by 2 cachelines on x86_64. From Florian Westphal. 2) Add cgroupsv2 support for nftables. 3) Fix bridge log family merge into nf_log_syslog: Missing unregistration from netns exit path, from Phil Sutter. 4) Add nft_pernet() helper to access nftables pernet area. 5) Add struct nfnl_info to reduce nfnetlink callback footprint and to facilite future updates. Consolidate nfnetlink callbacks. 6) Add CONFIG_NETFILTER_XTABLES_COMPAT Kconfig knob, also from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-26netfilter: allow to turn off xtables compat layerFlorian Westphal
The compat layer needs to parse untrusted input (the ruleset) to translate it to a 64bit compatible format. We had a number of bugs in this department in the past, so allow users to turn this feature off. Add CONFIG_NETFILTER_XTABLES_COMPAT kconfig knob and make it default to y to keep existing behaviour. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: arp_tables: pass table pointer via nf_hook_opsFlorian Westphal
Same change as previous patch. Only difference: no need to handle NULL template_ops parameter, the only caller (arptable_filter) always passes non-NULL argument. This removes all remaining accesses to net->ipv4.arptable_filter. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: ip_tables: pass table pointer via nf_hook_opsFlorian Westphal
iptable_x modules rely on 'struct net' to contain a pointer to the table that should be evaluated. In order to remove these pointers from struct net, pass them via the 'priv' pointer in a similar fashion as nf_tables passes the rule data. To do that, duplicate the nf_hook_info array passed in from the iptable_x modules, update the ops->priv pointers of the copy to refer to the table and then change the hookfn implementations to just pass the 'priv' argument to the traverser. After this patch, the xt_table pointers can already be removed from struct net. However, changes to struct net result in re-compile of the entire network stack, so do the removal after arptables and ip6tables have been converted as well. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: xt_nat: pass table to hookfnFlorian Westphal
This changes how ip(6)table nat passes the ruleset/table to the evaluation loop. At the moment, it will fetch the table from struct net. This change stores the table in the hook_ops 'priv' argument instead. This requires to duplicate the hook_ops for each netns, so they can store the (per-net) xt_table structure. The dupliated nat hook_ops get stored in net_generic data area. They are free'd in the namespace exit path. This is a pre-requisite to remove the xt_table/ruleset pointers from struct net. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: x_tables: remove paranoia testsFlorian Westphal
No need for these. There is only one caller, the xtables core, when the table is registered for the first time with a particular network namespace. After ->table_init() call, the table is linked into the tables[af] list, so next call to that function will skip the ->table_init(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: arptables: unregister the tables by nameFlorian Westphal
and again, this time for arptables. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: iptables: unregister the tables by nameFlorian Westphal
xtables stores the xt_table structs in the struct net. This isn't needed anymore, the structures could be passed via the netfilter hook 'private' pointer to the hook functions, which would allow us to remove those pointers from struct net. As a first step, reduce the number of accesses to the net->ipv4.ip6table_{raw,filter,...} pointers. This allows the tables to get unregistered by name instead of having to pass the raw address. The xt_table structure cane looked up by name+address family instead. This patch is useless as-is (the backends still have the raw pointer address), but it lowers the bar to remove those. It also allows to put the 'was table registered in the first place' check into ip_tables.c rather than have it in each table sub module. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: x_tables: remove ipt_unregister_tableFlorian Westphal
Its the same function as ipt_unregister_table_exit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>