aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)Author
2022-11-15Merge tag 'v6.0.8' into v6.0/standard/basev6.0/standard/xilinx-zynqmpv6.0/standard/x86v6.0/standard/tiny/x86v6.0/standard/tiny/common-pcv6.0/standard/tiny/basev6.0/standard/ti-am335xv6.0/standard/qemuppcv6.0/standard/qemuarm64v6.0/standard/preempt-rt/x86v6.0/standard/preempt-rt/bcm-2xxx-rpiv6.0/standard/preempt-rt/basev6.0/standard/nxp-ls20xxv6.0/standard/edgerouterv6.0/standard/beaglebonev6.0/standard/baseBruce Ashfield
This is the 6.0.8 stable release # gpg: Signature made Thu 10 Nov 2022 12:18:24 PM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-11-15Merge tag 'v6.0.7' into v6.0/standard/baseBruce Ashfield
This is the 6.0.7 stable release # gpg: Signature made Thu 03 Nov 2022 11:00:42 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-11-15Merge tag 'v6.0.6' into v6.0/standard/baseBruce Ashfield
This is the 6.0.6 stable release # gpg: Signature made Sat 29 Oct 2022 04:08:43 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-11-15Merge tag 'v6.0.4' into v6.0/standard/baseBruce Ashfield
This is the 6.0.4 stable release # gpg: Signature made Wed 26 Oct 2022 06:23:01 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-11-15Merge tag 'v6.0.3' into v6.0/standard/baseBruce Ashfield
This is the 6.0.3 stable release # gpg: Signature made Fri 21 Oct 2022 06:40:49 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2022-11-10net: also flag accepted sockets supporting msghdr originated zerocopyStefan Metzmacher
commit 71b7786ea478f3c4611deff4d2b9676b0c17c56b upstream. Without this only the client initiated tcp sockets have SOCK_SUPPORT_ZC. The listening socket on the server also has it, but the accepted connections didn't, which meant IORING_OP_SEND[MSG]_ZC will always fails with -EOPNOTSUPP. Fixes: e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy") Cc: <stable@vger.kernel.org> # 6.0 CC: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/io-uring/20221024141503.22b4e251@kernel.org/T/#m38aa19b0b825758fb97860a38ad13122051f9dda Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-10net: remove SOCK_SUPPORT_ZC from sockmapPavel Begunkov
commit fee9ac06647e59a69fb7aec58f25267c134264b4 upstream. sockmap replaces ->sk_prot with its own callbacks, we should remove SOCK_SUPPORT_ZC as the new proto doesn't support msghdr::ubuf_info. Cc: <stable@vger.kernel.org> # 6.0 Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: e993ffe3da4bc ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-10net/ulp: remove SOCK_SUPPORT_ZC from tls socketsPavel Begunkov
commit e276d62dcfdee6582486e8b8344dd869518e14be upstream. Remove SOCK_SUPPORT_ZC when we're setting ulp as it might not support msghdr::ubuf_info, e.g. like TLS replacing ->sk_prot with a new set of handlers. Cc: <stable@vger.kernel.org> # 6.0 Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: e993ffe3da4bc ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-04nh: fix scope used to find saddr when adding non gw nhNicolas Dichtel
[ Upstream commit bac0f937c343d651874f83b265ca8f5070ed4f06 ] As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but fib_info_update_nhc_saddr() needs the scope of the route, which is the scope "before" fib_nh_scope, ie fib_nh_scope - 1. This patch fixes the problem described in commit 747c14307214 ("ip: fix dflt addr selection for connected nexthop"). Fixes: 597cfe4fc339 ("nexthop: Add support for IPv4 nexthops") Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/ Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-04tcp: fix indefinite deferral of RTO with SACK renegingNeal Cardwell
[ Upstream commit 3d2af9cce3133b3bc596a9d065c6f9d93419ccfb ] This commit fixes a bug that can cause a TCP data sender to repeatedly defer RTOs when encountering SACK reneging. The bug is that when we're in fast recovery in a scenario with SACK reneging, every time we get an ACK we call tcp_check_sack_reneging() and it can note the apparent SACK reneging and rearm the RTO timer for srtt/2 into the future. In some SACK reneging scenarios that can happen repeatedly until the receive window fills up, at which point the sender can't send any more, the ACKs stop arriving, and the RTO fires at srtt/2 after the last ACK. But that can take far too long (O(10 secs)), since the connection is stuck in fast recovery with a low cwnd that cannot grow beyond ssthresh, even if more bandwidth is available. This fix changes the logic in tcp_check_sack_reneging() to only rearm the RTO timer if data is cumulatively ACKed, indicating forward progress. This avoids this kind of nearly infinite loop of RTO timer re-arming. In addition, this meets the goals of tcp_check_sack_reneging() in handling Windows TCP behavior that looks temporarily like SACK reneging but is not really. Many thanks to Jakub Kicinski and Neil Spring, who reported this issue and provided critical packet traces that enabled root-causing this issue. Also, many thanks to Jakub Kicinski for testing this fix. Fixes: 5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK reneging") Reported-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Neil Spring <ntspring@fb.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Tested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-04tcp: fix a signed-integer-overflow bug in tcp_add_backlog()Lu Wei
[ Upstream commit ec791d8149ff60c40ad2074af3b92a39c916a03f ] The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and in tcp_add_backlog(), the variable limit is caculated by adding sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value of int and overflow. This patch reduces the limit budget by halving the sndbuf to solve this issue since ACK packets are much smaller than the payload. Fixes: c9c3321257e1 ("tcp: add tcp_add_backlog()") Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-29netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces.Guillaume Nault
[ Upstream commit 1fcc064b305a1aadeff0d4bff961094d27660acd ] Currently netfilter's rpfilter and fib modules implicitely initialise ->flowic_uid with 0. This is normally the root UID. However, this isn't the case in user namespaces, where user ID 0 is mapped to a different kernel UID. By initialising ->flowic_uid with sock_net_uid(), we get the root UID of the user namespace, thus keeping the same behaviour whether or not we're running in a user namepspace. Note, this is similar to commit 8bcfd0925ef1 ("ipv4: add missing initialization for flowi4_uid"), which fixed the rp_filter sysctl. Fixes: 622ec2c9d524 ("net: core: add UID to flows, rules, and routes") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-29netfilter: rpfilter/fib: Populate flowic_l3mdev fieldPhil Sutter
[ Upstream commit acc641ab95b66b813c1ce856c377a2bbe71e7f52 ] Use the introduced field for correct operation with VRF devices instead of conditionally overwriting flowic_oif. This is a partial revert of commit b575b24b8eee3 ("netfilter: Fix rpfilter dropping vrf packets by mistake"), implementing a simpler solution. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Stable-dep-of: 1fcc064b305a ("netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces.") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-29udp: Update reuse->has_conns under reuseport_lock.Kuniyuki Iwashima
[ Upstream commit 69421bf98482d089e50799f45e48b25ce4a8d154 ] When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket. However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now. For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking. cpu1 cpu2 +----+ +----+ __ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0] For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case. 1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk. 2) New listen() overflows reuse->socks[] and call reuseport_grow(). 3) reuse->max_socks overflows u16 with the new listener. 4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock(). shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set(). [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/ Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26net: flag sockets supporting msghdr originated zerocopyPavel Begunkov
commit e993ffe3da4bcddea0536b03be1031bf35cd8d85 upstream. We need an efficient way in io_uring to check whether a socket supports zerocopy with msghdr provided ubuf_info. Add a new flag into the struct socket flags fields. Cc: <stable@vger.kernel.org> # 6.0 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-21net-next: Fix IP_UNICAST_IF option behavior for connected socketsRichard Gobert
[ Upstream commit 0e4d354762cefd3e16b4cff8988ff276e45effc4 ] The IP_UNICAST_IF socket option is used to set the outgoing interface for outbound packets. The IP_UNICAST_IF socket option was added as it was needed by the Wine project, since no other existing option (SO_BINDTODEVICE socket option, IP_PKTINFO socket option or the bind function) provided the needed characteristics needed by the IP_UNICAST_IF socket option. [1] The IP_UNICAST_IF socket option works well for unconnected sockets, that is, the interface specified by the IP_UNICAST_IF socket option is taken into consideration in the route lookup process when a packet is being sent. However, for connected sockets, the outbound interface is chosen when connecting the socket, and in the route lookup process which is done when a packet is being sent, the interface specified by the IP_UNICAST_IF socket option is being ignored. This inconsistent behavior was reported and discussed in an issue opened on systemd's GitHub project [2]. Also, a bug report was submitted in the kernel's bugzilla [3]. To understand the problem in more detail, we can look at what happens for UDP packets over IPv4 (The same analysis was done separately in the referenced systemd issue). When a UDP packet is sent the udp_sendmsg function gets called and the following happens: 1. The oif member of the struct ipcm_cookie ipc (which stores the output interface of the packet) is initialized by the ipcm_init_sk function to inet->sk.sk_bound_dev_if (the device set by the SO_BINDTODEVICE socket option). 2. If the IP_PKTINFO socket option was set, the oif member gets overridden by the call to the ip_cmsg_send function. 3. If no output interface was selected yet, the interface specified by the IP_UNICAST_IF socket option is used. 4. If the socket is connected and no destination address is specified in the send function, the struct ipcm_cookie ipc is not taken into consideration and the cached route, that was calculated in the connect function is being used. Thus, for a connected socket, the IP_UNICAST_IF sockopt isn't taken into consideration. This patch corrects the behavior of the IP_UNICAST_IF socket option for connect()ed sockets by taking into consideration the IP_UNICAST_IF sockopt when connecting the socket. In order to avoid reconnecting the socket, this option is still ignored when applied on an already connected socket until connect() is called again by the Richard Gobert. Change the __ip4_datagram_connect function, which is called during socket connection, to take into consideration the interface set by the IP_UNICAST_IF socket option, in a similar way to what is done in the udp_sendmsg function. [1] https://lore.kernel.org/netdev/1328685717.4736.4.camel@edumazet-laptop/T/ [2] https://github.com/systemd/systemd/issues/11935#issuecomment-618691018 [3] https://bugzilla.kernel.org/show_bug.cgi?id=210255 Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220829111554.GA1771@debian Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-21tcp: annotate data-race around tcp_md5sig_pool_populatedEric Dumazet
[ Upstream commit aacd467c0a576e5e44d2de4205855dc0fe43f6fb ] tcp_md5sig_pool_populated can be read while another thread changes its value. The race has no consequence because allocations are protected with tcp_md5sig_mutex. This patch adds READ_ONCE() and WRITE_ONCE() to document the race and silence KCSAN. Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-21once: add DO_ONCE_SLOW() for sleepable contextsEric Dumazet
[ Upstream commit 62c07983bef9d3e78e71189441e1a470f0d1e653 ] Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time. This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") get_random_once() uses DO_ONCE(), which block hard irqs for the duration of the operation. This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock for operations where we prefer to stay in process context. Then __inet_hash_connect() can use get_random_slow_once() to populate its perturbation table. Fixes: 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") Fixes: 190cc82489f4 ("tcp: change source port randomizarion at connect() time") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-21tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell
[ Upstream commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14 ] This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-21netfilter: nft_fib: Fix for rpath check with VRF devicesPhil Sutter
[ Upstream commit 2a8a7c0eaa8747c16aa4a48d573aa920d5c00a5c ] Analogous to commit b575b24b8eee3 ("netfilter: Fix rpfilter dropping vrf packets by mistake") but for nftables fib expression: Add special treatment of VRF devices so that typical reverse path filtering via 'fib saddr . iif oif' expression works as expected. Fixes: f6d0cbcf09c50 ("netfilter: nf_tables: add fib expression") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-21esp: choose the correct inner protocol for GSO on inter address family tunnelsSabrina Dubroca
[ Upstream commit 26dbd66eab8080be51759e48280da04015221e22 ] Commit 23c7f8d7989e ("net: Fix esp GSO on inter address family tunnels.") is incomplete. It passes to skb_eth_gso_segment the protocol for the outer IP version, instead of the inner IP version, so we end up calling inet_gso_segment on an inner IPv6 packet and ipv6_gso_segment on an inner IPv4 packet and the packets are dropped. This patch completes the fix by selecting the correct protocol based on the inner mode's family. Fixes: c35fe4106b92 ("xfrm: Add mode handlers for IPsec on layer 2") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-03Resolve jiffies wrapping about arpLi Wang
When jiffies wraps, it must be larger than the value of "updated". The solution will enhance the condition of "time_after". Signed-off-by: Li Wang <li.wang@windriver.com>
2022-09-22udp: Use WARN_ON_ONCE() in udp_read_skb()Peilin Ye
Prevent udp_read_skb() from flooding the syslog. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/20220921005915.2697-1-yepeilin.cs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical sectionIdo Schimmel
These functions expect to be called from RCU read-side critical section, but this only happens when invoked from the data path via ip{,6}_mr_input(). They can also be invoked from process context in response to user space adding a multicast route which resolves a cache entry with queued packets [1][2]. Fix by adding missing rcu_read_lock() / rcu_read_unlock() in these call paths. [1] WARNING: suspicious RCU usage 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted ----------------------------- net/ipv4/ipmr.c:84 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by smcrouted/246: #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip_mroute_setsockopt+0x11c/0x1420 stack backtrace: CPU: 0 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x91/0xb9 vif_dev_read+0xbf/0xd0 ipmr_queue_xmit+0x135/0x1ab0 ip_mr_forward+0xe7b/0x13d0 ipmr_mfc_add+0x1a06/0x2ad0 ip_mroute_setsockopt+0x5c1/0x1420 do_ip_setsockopt+0x23d/0x37f0 ip_setsockopt+0x56/0x80 raw_setsockopt+0x219/0x290 __sys_setsockopt+0x236/0x4d0 __x64_sys_setsockopt+0xbe/0x160 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [2] WARNING: suspicious RCU usage 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted ----------------------------- net/ipv6/ip6mr.c:69 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by smcrouted/246: #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip6_mroute_setsockopt+0x6b9/0x2630 stack backtrace: CPU: 1 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x91/0xb9 vif_dev_read+0xbf/0xd0 ip6mr_forward2.isra.0+0xc9/0x1160 ip6_mr_forward+0xef0/0x13f0 ip6mr_mfc_add+0x1ff2/0x31f0 ip6_mroute_setsockopt+0x1825/0x2630 do_ipv6_setsockopt+0x462/0x4440 ipv6_setsockopt+0x105/0x140 rawv6_setsockopt+0xd8/0x690 __sys_setsockopt+0x236/0x4d0 __x64_sys_setsockopt+0xbe/0x160 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: ebc3197963fc ("ipmr: add rcu protection over (struct vif_device)->dev") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20tcp: read multiple skbs in tcp_read_skb()Cong Wang
Before we switched to ->read_skb(), ->read_sock() was passed with desc.count=1, which technically indicates we only read one skb per ->sk_data_ready() call. However, for TCP, this is not true. TCP at least has sk_rcvlowat which intentionally holds skb's in receive queue until this watermark is reached. This means when ->sk_data_ready() is invoked there could be multiple skb's in the queue, therefore we have to read multiple skbs in tcp_read_skb() instead of one. Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Reported-by: Peilin Ye <peilin.ye@bytedance.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20220912173553.235838-1-xiyou.wangcong@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-16tcp: Use WARN_ON_ONCE() in tcp_read_skb()Peilin Ye
Prevent tcp_read_skb() from flooding the syslog. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-06tcp: fix early ETIMEDOUT after spurious non-SACK RTONeal Cardwell
Fix a bug reported and analyzed by Nagaraj Arankal, where the handling of a spurious non-SACK RTO could cause a connection to fail to clear retrans_stamp, causing a later RTO to very prematurely time out the connection with ETIMEDOUT. Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent report: (*1) Send one data packet on a non-SACK connection (*2) Because no ACK packet is received, the packet is retransmitted and we enter CA_Loss; but this retransmission is spurious. (*3) The ACK for the original data is received. The transmitted packet is acknowledged. The TCP timestamp is before the retrans_stamp, so tcp_may_undo() returns true, and tcp_try_undo_loss() returns true without changing state to Open (because tcp_is_sack() is false), and tcp_process_loss() returns without calling tcp_try_undo_recovery(). Normally after undoing a CA_Loss episode, tcp_fastretrans_alert() would see that the connection has returned to CA_Open and fall through and call tcp_try_to_open(), which would set retrans_stamp to 0. However, for non-SACK connections we hold the connection in CA_Loss, so do not fall through to call tcp_try_to_open() and do not set retrans_stamp to 0. So retrans_stamp is (erroneously) still non-zero. At this point the first "retransmission event" has passed and been recovered from. Any future retransmission is a completely new "event". However, retrans_stamp is erroneously still set. (And we are still in CA_Loss, which is correct.) (*4) After 16 minutes (to correspond with tcp_retries2=15), a new data packet is sent. Note: No data is transmitted between (*3) and (*4) and we disabled keep alives. The socket's timeout SHOULD be calculated from this point in time, but instead it's calculated from the prior "event" 16 minutes ago (step (*2)). (*5) Because no ACK packet is received, the packet is retransmitted. (*6) At the time of the 2nd retransmission, the socket returns ETIMEDOUT, prematurely, because retrans_stamp is (erroneously) too far in the past (set at the time of (*2)). This commit fixes this bug by ensuring that we reuse in tcp_try_undo_loss() the same careful logic for non-SACK connections that we have in tcp_try_undo_recovery(). To avoid duplicating logic, we factor out that logic into a new tcp_is_non_sack_preventing_reopen() helper and call that helper from both undo functions. Fixes: da34ac7626b5 ("tcp: only undo on partial ACKs in CA_Loss") Reported-by: Nagaraj Arankal <nagaraj.p.arankal@hpe.com> Link: https://lore.kernel.org/all/SJ0PR84MB1847BE6C24D274C46A1B9B0EB27A9@SJ0PR84MB1847.NAMPRD84.PROD.OUTLOOK.COM/ Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220903121023.866900-1-ncardwell.kernel@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-02Merge tag 'rxrpc-fixes-20220901' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc fixes Here are some fixes for AF_RXRPC: (1) Fix the handling of ICMP/ICMP6 packets. This is a problem due to rxrpc being switched to acting as a UDP tunnel, thereby allowing it to steal the packets before they go through the UDP Rx queue. UDP tunnels can't get ICMP/ICMP6 packets, however. This patch adds an additional encap hook so that they can. (2) Fix the encryption routines in rxkad to handle packets that have more than three parts correctly. The problem is that ->nr_frags doesn't count the initial fragment, so the sglist ends up too short. (3) Fix a problem with destruction of the local endpoint potentially getting repeated. (4) Fix the calculation of the time at which to resend. jiffies_to_usecs() gives microseconds, not nanoseconds. (5) Fix AFS to work out when callback promises and locks expire based on the time an op was issued rather than the time the first reply packet arrives. We don't know how long the server took between calculating the expiry interval and transmitting the reply. (6) Given (5), rxrpc_get_reply_time() is no longer used, so remove it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02tcp: TX zerocopy should not sense pfmemalloc statusEric Dumazet
We got a recent syzbot report [1] showing a possible misuse of pfmemalloc page status in TCP zerocopy paths. Indeed, for pages coming from user space or other layers, using page_is_pfmemalloc() is moot, and possibly could give false positives. There has been attempts to make page_is_pfmemalloc() more robust, but not using it in the first place in this context is probably better, removing cpu cycles. Note to stable teams : You need to backport 84ce071e38a6 ("net: introduce __skb_fill_page_desc_noacc") as a prereq. Race is more probable after commit c07aea3ef4d4 ("mm: add a signature in struct page") because page_is_pfmemalloc() is now using low order bit from page->lru.next, which can change more often than page->index. Low order bit should never be set for lru.next (when used as an anchor in LRU list), so KCSAN report is mostly a false positive. Backporting to older kernel versions seems not necessary. [1] BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:105 [inline] lru_add_fn+0x440/0x520 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 folio_batch_add_and_move mm/swap.c:263 [inline] folio_add_lru+0xf1/0x140 mm/swap.c:490 filemap_add_folio+0xf8/0x150 mm/filemap.c:948 __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981 pagecache_get_page+0x26/0x190 mm/folio-compat.c:104 grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116 ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988 generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738 ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270 ext4_file_write_iter+0x2e3/0x1210 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x468/0x760 fs/read_write.c:578 ksys_write+0xe8/0x1a0 fs/read_write.c:631 __do_sys_write fs/read_write.c:643 [inline] __se_sys_write fs/read_write.c:640 [inline] __x64_sys_write+0x3e/0x50 fs/read_write.c:640 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1740 [inline] __skb_fill_page_desc include/linux/skbuff.h:2422 [inline] skb_fill_page_desc include/linux/skbuff.h:2443 [inline] tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018 do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075 tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline] tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1249 __do_sys_sendfile64 fs/read_write.c:1317 [inline] __se_sys_sendfile64 fs/read_write.c:1303 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea0004a1d288 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b5d05-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022 Fixes: c07aea3ef4d4 ("mm: add a signature in struct page") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-01rxrpc: Fix ICMP/ICMP6 error handlingDavid Howells
Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing it to siphon off UDP packets early in the handling of received UDP packets thereby avoiding the packet going through the UDP receive queue, it doesn't get ICMP packets through the UDP ->sk_error_report() callback. In fact, it doesn't appear that there's any usable option for getting hold of ICMP packets. Fix this by adding a new UDP encap hook to distribute error messages for UDP tunnels. If the hook is set, then the tunnel driver will be able to see ICMP packets. The hook provides the offset into the packet of the UDP header of the original packet that caused the notification. An alternative would be to call the ->error_handler() hook - but that requires that the skbuff be cloned (as ip_icmp_error() or ipv6_cmp_error() do, though isn't really necessary or desirable in rxrpc's case is we want to parse them there and then, not queue them). Changes ======= ver #3) - Fixed an uninitialised variable. ver #2) - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals. Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com>
2022-08-31tcp: make global challenge ack rate limitation per net-ns and default disabledEric Dumazet
Because per host rate limiting has been proven problematic (side channel attacks can be based on it), per host rate limiting of challenge acks ideally should be per netns and turned off by default. This is a long due followup of following commits: 083ae308280d ("tcp: enable per-socket rate limiting of all 'challenge acks'") f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock") 75ff39ccc1bd ("tcp: make challenge acks less predictable") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-31tcp: annotate data-race around challenge_timestampEric Dumazet
challenge_timestamp can be read an written by concurrent threads. This was expected, but we need to annotate the race to avoid potential issues. Following patch moves challenge_timestamp and challenge_count to per-netns storage to provide better isolation. Fixes: 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-31ip: fix triggering of 'icmp redirect'Nicolas Dichtel
__mkroute_input() uses fib_validate_source() to trigger an icmp redirect. My understanding is that fib_validate_source() is used to know if the src address and the gateway address are on the same link. For that, fib_validate_source() returns 1 (same link) or 0 (not the same network). __mkroute_input() is the only user of these positive values, all other callers only look if the returned value is negative. Since the below patch, fib_validate_source() didn't return anymore 1 when both addresses are on the same network, because the route lookup returns RT_SCOPE_LINK instead of RT_SCOPE_HOST. But this is, in fact, right. Let's adapat the test to return 1 again when both addresses are on the same link. CC: stable@vger.kernel.org Fixes: 747c14307214 ("ip: fix dflt addr selection for connected nexthop") Reported-by: kernel test robot <yujie.liu@intel.com> Reported-by: Heng Qi <hengqi@linux.alibaba.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220829100121.3821-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel borkmann says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 14 day(s) which contain a total of 13 files changed, 61 insertions(+), 24 deletions(-). The main changes are: 1) Fix BPF verifier's precision tracking around BPF ring buffer, from Kumar Kartikeya Dwivedi. 2) Fix regression in tunnel key infra when passing FLOWI_FLAG_ANYSRC, from Eyal Birger. 3) Fix insufficient permissions for bpf_sys_bpf() helper, from YiFei Zhu. 4) Fix splat from hitting BUG when purging effective cgroup programs, from Pu Lehui. 5) Fix range tracking for array poke descriptors, from Daniel Borkmann. 6) Fix corrupted packets for XDP_SHARED_UMEM in aligned mode, from Magnus Karlsson. 7) Fix NULL pointer splat in BPF sockmap sk_msg_recvmsg(), from Liu Jian. 8) Add READ_ONCE() to bpf_jit_limit when reading from sysctl, from Kuniyuki Iwashima. 9) Add BPF selftest lru_bug check to s390x deny list, from Daniel Müller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: Fix data-races around sysctl_devconf_inherit_init_net.Kuniyuki Iwashima
While reading sysctl_devconf_inherit_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: Fix data-races around sysctl_max_skb_frags.Kuniyuki Iwashima
While reading sysctl_max_skb_frags, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: Fix data-races around sysctl_optmem_max.Kuniyuki Iwashima
While reading sysctl_optmem_max, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: Fix data-races around sysctl_[rw]mem_(max|default).Kuniyuki Iwashima
While reading sysctl_[rw]mem_(max|default), they can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-18ip_tunnel: Respect tunnel key's "flow_flags" in IP tunnelsEyal Birger
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key") added a "flow_flags" member to struct ip_tunnel_key which was later used by the commit in the fixes tag to avoid dropping packets with sources that aren't locally configured when set in bpf_set_tunnel_key(). VXLAN and GENEVE were made to respect this flag, ip tunnels like IPIP and GRE were not. This commit fixes this omission by making ip_tunnel_init_flow() receive the flow flags from the tunnel key in the relevant collect_md paths. Fixes: b8fff748521c ("bpf: Set flow flag to allow any source IP in bpf_tunnel_key") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Paul Chaignon <paul@isovalent.com> Link: https://lore.kernel.org/bpf/20220818074118.726639-1-eyal.birger@gmail.com
2022-08-18tcp: handle pure FIN case correctlyCong Wang
When skb->len==0, the recv_actor() returns 0 too, but we also use 0 for error conditions. This patch amends this by propagating the errors to tcp_read_skb() so that we can distinguish skb->len==0 case from error cases. Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()") Reported-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18tcp: refactor tcp_read_skb() a bitCong Wang
As tcp_read_skb() only reads one skb at a time, the while loop is unnecessary, we can turn it into an if. This also simplifies the code logic. Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()Cong Wang
tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it assumes the skb is not yet dequeued. This is no longer true for tcp_read_skb() case where we dequeue the skb first. Fix this by introducing a helper __tcp_cleanup_rbuf() which does not require any skb and calling it in tcp_read_skb(). Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()") Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18tcp: fix sock skb accounting in tcp_read_skb()Cong Wang
Before commit 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()"), skb was not dequeued from receive queue hence when we close TCP socket skb can be just flushed synchronously. After this commit, we have to uncharge skb immediately after being dequeued, otherwise it is still charged in the original sock. And we still need to retain skb->sk, as eBPF programs may extract sock information from skb->sk. Therefore, we have to call skb_set_owner_sk_safe() here. Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com Tested-by: Stanislav Fomichev <sdf@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-04Merge tag 'driver-core-6.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / kernfs updates from Greg KH: "Here is the set of driver core and kernfs changes for 6.0-rc1. The "biggest" thing in here is some scalability improvements for kernfs for large systems. Other than that, included in here are: - arch topology and cache info changes that have been reviewed and discussed a lot. - potential error path cleanup fixes - deferred driver probe cleanups - firmware loader cleanups and tweaks - documentation updates - other small things All of these have been in the linux-next tree for a while with no reported problems" * tag 'driver-core-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (63 commits) docs: embargoed-hardware-issues: fix invalid AMD contact email firmware_loader: Replace kmap() with kmap_local_page() sysfs docs: ABI: Fix typo in comment kobject: fix Kconfig.debug "its" grammar kernfs: Fix typo 'the the' in comment docs: driver-api: firmware: add driver firmware guidelines. (v3) arch_topology: Fix cache attributes detection in the CPU hotplug path ACPI: PPTT: Leave the table mapped for the runtime usage cacheinfo: Use atomic allocation for percpu cache attributes drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist MAINTAINERS: Change mentions of mpm to olivia docs: ABI: sysfs-devices-soc: Update Lee Jones' email address docs: ABI: sysfs-class-pwm: Update Lee Jones' email address Documentation/process: Add embargoed HW contact for LLVM Revert "kernfs: Change kernfs_notify_list to llist." ACPI: Remove the unused find_acpi_cpu_cache_topology() arch_topology: Warn that topology for nested clusters is not supported arch_topology: Add support for parsing sockets in /cpu-map arch_topology: Set cluster identifier in each core/thread from /cpu-map arch_topology: Limit span of cpu_clustergroup_mask() ...
2022-08-01udp: Remove redundant __udp_sysctl_init() call from udp_init().Kuniyuki Iwashima
__udp_sysctl_init() is called for init_net via udp_sysctl_ops. While at it, we can rename __udp_sysctl_init() to udp_sysctl_init(). Fixes: 1e8029515816 ("udp: Move the udp sysctl to namespace.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-27tcp: md5: fix IPv4-mapped supportEric Dumazet
After the blamed commit, IPv4 SYN packets handled by a dual stack IPv6 socket are dropped, even if perfectly valid. $ nstat | grep MD5 TcpExtTCPMD5Failure 5 0.0 For a dual stack listener, an incoming IPv4 SYN packet would call tcp_inbound_md5_hash() with @family == AF_INET, while tp->af_specific is pointing to tcp_sock_ipv6_specific. Only later when an IPv4-mapped child is created, tp->af_specific is changed to tcp_sock_ipv6_mapped_specific. Fixes: 7bbb765b7349 ("net/tcp: Merge TCP-MD5 inbound callbacks") Reported-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Dmitry Safonov <dima@arista.com> Tested-by: Leonard Crestez <cdleonard@gmail.com> Link: https://lore.kernel.org/r/20220726115743.2759832-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26tcp: allow tls to decrypt directly from the tcp rcv queueJakub Kicinski
Expose TCP rx queue accessor and cleanup, so that TLS can decrypt directly from the TCP queue. The expectation is that the caller can access the skb returned from tcp_recv_skb() and up to inq bytes worth of data (some of which may be in ->next skbs) and then call tcp_read_done() when data has been consumed. The socket lock must be held continuously across those two operations. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-25Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-07-20 1) Don't set DST_NOPOLICY in IPv4, a recent patch made this superfluous. From Eyal Birger. 2) Convert alg_key to flexible array member to avoid an iproute2 compile warning when built with gcc-12. From Stephen Hemminger. 3) xfrm_register_km and xfrm_unregister_km do always return 0 so change the type to void. From Zhengchao Shao. 4) Fix spelling mistake in esp6.c From Zhang Jiaming. 5) Improve the wording of comment above XFRM_OFFLOAD flags. From Petr Vaněk. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-25ipv4: Fix data-races around sysctl_fib_notify_on_flag_change.Kuniyuki Iwashima
While reading sysctl_fib_notify_on_flag_change, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 680aea08e78c ("net: ipv4: Emit notification when fib hardware flags are changed") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>