path: root/net
Age    Commit message    Author
2021-11-28Merge tag 'v5.14.19' into v5.14/standard/baseBruce Ashfield
This is the 5.14.19 stable release # gpg: Signature made Wed 17 Nov 2021 05:06:08 AM EST # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key # Conflicts: # arch/arm/Makefile
2021-11-17SUNRPC: Partial revert of commit 6f9f17287e78Trond Myklebust
commit ea7a1019d8baf8503ecd6e3ec8436dec283569e6 upstream. The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in xprt_transmit()") was that cond_resched() is expensive and unnecessary when there has been just a single send. The point of cond_resched() is to ensure that tasks that should pre-empt this one get a chance to do so when it is safe to do so. The code prior to commit 6f9f17287e78 failed to take into account that it was keeping a rpc_task pinned for longer than it needed to, and so rather than doing a full revert, let's just move the cond_resched. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-179p/net: fix missing error check in p9_check_errorsDominique Martinet
commit 27eb4c3144f7a5ebef3c9a261d80cb3e1fa784dc upstream. Link: https://lkml.kernel.org/r/99338965-d36c-886e-cd0e-1d8fff2b4746@gmail.com Reported-by: syzbot+06472778c97ed94af66d@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-17bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed modeDaniel Borkmann
[ Upstream commit 8520e224f547cd070c7c8f97b1fc6d58cff7ccaa ] Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used. Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") embedded per-socket cgroup information into sock->sk_cgrp_data and in order to save 8 bytes in struct sock made both mutually exclusive, that is, when cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2 falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp). The assumption made was "there is no reason to mix the two and this is in line with how legacy and v2 compatibility is handled" as stated in bd1060a1d671. However, with Kubernetes more widely supporting cgroups v2 as well nowadays, this assumption no longer holds, and the possibility of the v1/v2 mixed mode with the v2 root fallback being hit becomes a real security issue. Many of the cgroup v2 BPF programs are also used for policy enforcement, just to pick _one_ example, that is, to programmatically deny socket related system calls like connect(2) or bind(2). A v2 root fallback would implicitly cause a policy bypass for the affected Pods. In production environments, we have recently seen this case due to various circumstances: i) a different 3rd party agent and/or ii) a container runtime such as [0] in the user's environment configuring legacy cgroup v1 net_cls tags, which triggered implicitly mentioned root fallback. Another case is Kubernetes projects like kind [1] which create Kubernetes nodes in a container and also add cgroup namespaces to the mix, meaning programs which are attached to the cgroup v2 root of the cgroup namespace get attached to a non-root cgroup v2 path from init namespace point of view. And the latter's root is out of reach for agents on a kind Kubernetes node to configure. Meaning, any entity on the node setting cgroup v1 net_cls tag will trigger the bypass despite cgroup v2 BPF programs attached to the namespace root. Generally, this mutual exclusiveness does not hold anymore in today's user environments and makes cgroup v2 usage from BPF side fragile and unreliable. This fix adds proper struct cgroup pointer for the cgroup v2 case to struct sock_cgroup_data in order to address these issues; this implicitly also fixes the tradeoffs being made back then with regards to races and refcount leaks as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF programs always operate as expected. [0] https://github.com/nestybox/sysbox/ [1] https://kind.sigs.k8s.io/ Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net, neigh: Enable state migration between NUD_PERMANENT and NTF_USEDaniel Borkmann
[ Upstream commit 3dc20f4762c62d3b3f0940644881ed818aa7b2f5 ] Currently, it is not possible to migrate a neighbor entry between NUD_PERMANENT state and NTF_USE flag with a dynamic NUD state from a user space control plane. Similarly, it is not possible to add/remove NTF_EXT_LEARNED flag from an existing neighbor entry in combination with NTF_USE flag. This is due to the latter directly calling into neigh_event_send() without any meta data updates as happening in __neigh_update(). Thus, to enable this use case, extend the latter with a NEIGH_UPDATE_F_USE flag where we break the NUD_PERMANENT state in particular so that a latter neigh_event_send() is able to re-resolve a neighbor entry. Before fix, NUD_PERMANENT -> NUD_* & NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] As can be seen, despite the admin-triggered replace, the entry remains in the NUD_PERMANENT state. After fix, NUD_PERMANENT -> NUD_* & NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn STALE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] After the fix, the admin-triggered replace switches to a dynamic state from the NTF_USE flag which triggered a new neighbor resolution. Likewise, we can transition back from there, if needed, into NUD_PERMANENT. Similar before/after behavior can be observed for below transitions: Before fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] After fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [..] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net/smc: fix sk_refcnt underflow on linkdown and fallbackDust Li
[ Upstream commit e5d5aadcf3cd59949316df49c27cb21788d7efe4 ] We got the following WARNING when running ab/nginx test with RDMA link flapping (up-down-up). The reason is when smc_sock fallback and at linkdown happens simultaneously, we may got the following situation: __smc_lgr_terminate() --> smc_conn_kill() --> smc_close_active_abort() smc_sock->sk_state = SMC_CLOSED sock_put(smc_sock) smc_sock was set to SMC_CLOSED and sock_put() been called when terminate the link group. But later application call close() on the socket, then we got: __smc_release(): if (smc_sock->fallback) smc_sock->sk_state = SMC_CLOSED sock_put(smc_sock) Again we set the smc_sock to CLOSED through it's already in CLOSED state, and double put the refcnt, so the following warning happens: refcount_t: underflow; use-after-free. WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0 Modules linked in: CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403 Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014 RIP: 0010:refcount_warn_saturate+0x8d/0xf0 Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48 RSP: 0018:ffffc90000527e50 EFLAGS: 00010286 RAX: 0000000000000026 RBX: ffff8881300df2c0 RCX: 0000000000000027 RDX: 0000000000000000 RSI: ffff88813bd58040 RDI: ffff88813bd58048 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000001 R10: ffff8881300df2c0 R11: ffffc90000527c78 R12: ffff8881300df340 R13: ffff8881300df930 R14: ffff88810b3dad80 R15: ffff8881300df4f8 FS: 00007f739de8fb80(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000a01b008 CR3: 0000000111b64003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: smc_release+0x353/0x3f0 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0x93/0x230 task_work_run+0x65/0xa0 exit_to_user_mode_prepare+0xf9/0x100 syscall_exit_to_user_mode+0x27/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This patch adds check in __smc_release() to make sure we won't do an extra sock_put() and set the socket to CLOSED when its already in CLOSED state. Fixes: 51f1de79ad8e (net/smc: replace sock_put worker by socket refcounting) Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
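The core of the fix is making the CLOSED transition idempotent: drop the socket reference only when the state actually changes. A minimal userspace C sketch of that guard (names and the refcount model are illustrative stand-ins, not the kernel's):

    #include <assert.h>
    #include <stdio.h>

    enum state { SMC_ACTIVE, SMC_CLOSED };

    struct sock_model {
            enum state st;
            int refcnt;
    };

    /* Drop a reference and move to CLOSED, but only on the first call;
     * a second caller (e.g. close() after a link-down abort) is a no-op. */
    static void close_and_put_once(struct sock_model *sk)
    {
            if (sk->st == SMC_CLOSED)
                    return;          /* already closed: do not put the ref again */
            sk->st = SMC_CLOSED;
            sk->refcnt--;            /* stands in for sock_put() */
    }

    int main(void)
    {
            struct sock_model sk = { .st = SMC_ACTIVE, .refcnt = 2 };

            close_and_put_once(&sk); /* link-down abort path */
            close_and_put_once(&sk); /* later __smc_release() path */
            assert(sk.refcnt == 1);  /* no underflow: only one put happened */
            printf("refcnt=%d\n", sk.refcnt);
            return 0;
    }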
2021-11-17vsock: prevent unnecessary refcnt inc for nonblocking connectEiichi Tsukata
[ Upstream commit c7cd82b90599fa10915f41e3dd9098a77d0aa7b6 ] Currently vsock_connect() increments the sock refcount for a nonblocking socket each time it's called, which can lead to a memory leak if it's called multiple times, because the connect timeout function decrements the sock refcount only once. Fix it by making vsock_connect() return -EALREADY immediately when the sock state is already SS_CONNECTING. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
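The fix amounts to bailing out of a repeated nonblocking connect() before taking another reference. A hedged sketch of that control flow in plain C (state names mirror the commit text; the refcount is a stand-in for sock_hold()):

    #include <assert.h>
    #include <errno.h>

    enum sock_state { SS_UNCONNECTED, SS_CONNECTING, SS_CONNECTED };

    struct vsock_model {
            enum sock_state state;
            int refcnt;             /* references held for the pending connect timer */
    };

    static int connect_nonblock(struct vsock_model *sk)
    {
            if (sk->state == SS_CONNECTING)
                    return -EALREADY;   /* do not take another reference */

            sk->state = SS_CONNECTING;
            sk->refcnt++;               /* one ref for the connect timeout work */
            return -EINPROGRESS;
    }

    int main(void)
    {
            struct vsock_model sk = { .state = SS_UNCONNECTED, .refcnt = 1 };

            assert(connect_nonblock(&sk) == -EINPROGRESS);
            assert(connect_nonblock(&sk) == -EALREADY);  /* second call adds no ref */
            assert(sk.refcnt == 2);                      /* only one ref was added */
            return 0;
    }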
2021-11-17net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_anyEric Dumazet
[ Upstream commit 6dc25401cba4d428328eade8ceae717633fdd702 ] 1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls ktime_mono_to_any() with out-of-bound value. 2) if q->tk_offset is changed in taprio_parse_clockid(), taprio_get_time() might also call ktime_mono_to_any() with out-of-bound value as sysbot found: UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27 index 3 is out of range for type 'ktime_t *[3]' CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151 __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291 ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908 get_tcp_tstamp net/sched/sch_taprio.c:322 [inline] get_packet_txtime net/sched/sch_taprio.c:353 [inline] taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420 taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485 dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785 __dev_xmit_skb net/core/dev.c:3869 [inline] __dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194 batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108 batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline] batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline] batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Fixes: 7ede7b03484b ("taprio: make clock reference conversions easier") Fixes: 54002066100b ("taprio: Adjust timestamps for TCP packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vedang Patel <vedang.patel@intel.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
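The underlying problem is indexing a fixed-size offsets array with TK_OFFS_MAX. A small C sketch of the defensive conversion (the enum and array are simplified stand-ins for the timekeeping code):

    #include <assert.h>
    #include <stdint.h>

    enum tk_offsets { TK_OFFS_REAL, TK_OFFS_BOOT, TK_OFFS_TAI, TK_OFFS_MAX };

    /* Simplified stand-in for the per-clock offsets added to CLOCK_MONOTONIC. */
    static const int64_t offsets[TK_OFFS_MAX] = { 1000, 2000, 3000 };

    static int64_t mono_to_any_checked(int64_t mono_ns, int offs)
    {
            /* TK_OFFS_MAX (or anything out of range) means "no conversion":
             * return the monotonic value instead of reading past the array. */
            if (offs < 0 || offs >= TK_OFFS_MAX)
                    return mono_ns;
            return mono_ns + offsets[offs];
    }

    int main(void)
    {
            assert(mono_to_any_checked(10, TK_OFFS_TAI) == 3010);
            assert(mono_to_any_checked(10, TK_OFFS_MAX) == 10); /* no OOB read */
            return 0;
    }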
2021-11-17bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_regJussi Maki
[ Upstream commit b2c4618162ec615a15883a804cce7e27afecfa58 ] The current conversion of skb->data_end reads like this: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r2 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r2 +112) ; r11 = skb->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r2 +116) 563: (1f) r1 -= r11 But similar to the case in 84f44df664e9 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg"), the code will read an incorrect skb->len when src == dst. In this case we end up generating this xlated code: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r1 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r1 +112) ; r11 = (skb->data)->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r1 +116) 563: (1f) r1 -= r11 ... where line 560 is the reading 4B of (skb->data + 112) instead of the intended skb->len Here the skb pointer in r1 gets set to skb->data and the later deref for skb->len ends up following skb->data instead of skb. This fixes the issue similarly to the patch mentioned above by creating an additional temporary variable and using to store the register when dst_reg = src_reg. We name the variable bpf_temp_reg and place it in the cb context for sk_skb. Then we restore from the temp to ensure nothing is lost. Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code") Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and collidingJohn Fastabend
[ Upstream commit e0dc3b93bd7bcff8c3813d1df43e0908499c7cf0 ] Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling progress, e.g. offset and length of the skb. First this is poorly named and inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at this layer. But, more importantly strparser is using the following to access its metadata. (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data)) Where _strp_msg is defined as: struct _strp_msg { struct strp_msg strp; /* 0 8 */ int accum_len; /* 8 4 */ /* size: 12, cachelines: 1, members: 2 */ /* last cacheline: 12 bytes */ }; So we use 12 bytes of ->data[] in struct. However in BPF code running parser and verdict the user has read capabilities into the data[] array as well. Its not too problematic, but we should not be exposing internal state to BPF program. If its really needed then we can use the probe_read() APIs which allow reading kernel memory. And I don't believe cb[] layer poses any API breakage by moving this around because programs can't depend on cb[] across layers. In order to fix another issue with a ctx rewrite we need to stash a temp variable somewhere. To make this work cleanly this patch builds a cb struct for sk_skb types called sk_skb_cb struct. Then we can use this consistently in the strparser, sockmap space. Additionally we can start allowing ->cb[] write access after this. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
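The patch describes carving a dedicated control-block layout for sk_skb programs that still leaves room for the strparser state and a temporary register slot. A hedged sketch of such a layout with a compile-time size check (field names and sizes are illustrative; skb->cb is 48 bytes):

    #include <stdint.h>

    /* Stand-ins for the structures named in the commit message. */
    struct strp_msg   { int full_len; int offset; };            /* 8 bytes  */
    struct _strp_msg  { struct strp_msg strp; int accum_len; }; /* 12 bytes */

    /* Illustrative sk_skb control block: strparser progress first (so the
     * existing offset-based accesses keep working), plus a scratch slot for
     * the temporary register used when dst_reg == src_reg in the data_end
     * rewrite from the previous patch in the series. */
    struct sk_skb_cb {
            struct _strp_msg strp;      /* strparser progress          */
            uint64_t temp_reg;          /* scratch for the ctx rewrite */
    };

    /* skb->cb[] is 48 bytes; everything stashed there must fit. */
    _Static_assert(sizeof(struct sk_skb_cb) <= 48, "sk_skb_cb must fit in skb->cb");

    int main(void) { return 0; }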
2021-11-17bpf, sockmap: Fix race in ingress receive verdict with redirect to selfJohn Fastabend
[ Upstream commit c5d2177a72a1659554922728fc407f59950aa929 ] A socket in a sockmap may have different combinations of programs attached depending on configuration. There can be no programs in which case the socket acts as a sink only. There can be a TX program in this case a BPF program is attached to sending side, but no RX program is attached. There can be an RX program only where sends have no BPF program attached, but receives are hooked with BPF. And finally, both TX and RX programs may be attached. Giving us the permutations: None, Tx, Rx, and TxRx To date most of our use cases have been TX case being used as a fast datapath to directly copy between local application and a userspace proxy. Or Rx cases and TxRX applications that are operating an in kernel based proxy. The traffic in the first case where we hook applications into a userspace application looks like this: AppA redirect AppB Tx <-----------> Rx | | + + TCP <--> lo <--> TCP In this case all traffic from AppA (after 3whs) is copied into the AppB ingress queue and no traffic is ever on the TCP recieive_queue. In the second case the application never receives, except in some rare error cases, traffic on the actual user space socket. Instead the send happens in the kernel. AppProxy socket pool sk0 ------------->{sk1,sk2, skn} ^ | | | | v ingress lb egress TCP TCP Here because traffic is never read off the socket with userspace recv() APIs there is only ever one reader on the sk receive_queue. Namely the BPF programs. However, we've started to introduce a third configuration where the BPF program on receive should process the data, but then the normal case is to push the data into the receive queue of AppB. AppB recv() (userspace) ----------------------- tcp_bpf_recvmsg() (kernel) | | | | | | ingress_msgQ | | | RX_BPF | | | v v sk->receive_queue This is different from the App{A,B} redirect because traffic is first received on the sk->receive_queue. Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg queue for any data handled by the BPF rx program and returned with PASS code so that it was enqueued on the ingress msg queue. Then if no data exists on that queue it checks the socket receive queue. Unfortunately, this is the same receive_queue the BPF program is reading data off of. So we get a race. Its possible for the recvmsg() hook to pull data off the receive_queue before the BPF hook has a chance to read it. It typically happens when an application is banging on recv() and getting EAGAINs. Until they manage to race with the RX BPF program. To fix this we note that before this patch at attach time when the socket is loaded into the map we check if it needs a TX program or just the base set of proto bpf hooks. Then it uses the above general RX hook regardless of if we have a BPF program attached at rx or not. This patch now extends this check to handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above race when an RX program is attached we use a new hook that is nearly identical to the old one except now we do not let the recv() call skip the RX BPF program. Now only the BPF program pulls data from sk->receive_queue and recv() only pulls data from the ingress msgQ post BPF program handling. With this resolved our AppB from above has been up and running for many hours without detecting any errors. We do this by correlating counters in RX BPF events and the AppB to ensure data is never skipping the BPF program. 
Selftests were not able to detect this because we only run them for a short period of time on well-ordered send/recvs, so we don't get any of the noise we see in real application environments. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17bpf, sockmap: Remove unhash handler for BPF sockmap usageJohn Fastabend
[ Upstream commit b8b8315e39ffaca82e79d86dde26e9144addf66b ] We do not need to handle unhash from the BPF side; we can simply wait for the close to happen. The original concern was that a socket could transition from the ESTABLISHED state to a new state while the BPF hook was still attached. But we convinced ourselves this is no longer possible, and we also improved BPF sockmap to handle listen sockets, so this is no longer a problem. More importantly, though, there are cases where unhash is called when data is in the receive queue. The BPF unhash logic will flush this data, which is wrong. To be correct it should keep the data in the receive queue and allow a receiving application to continue reading it. This may happen when tcp_abort() is received, for example. Instead of complicating the logic in unhash, simply moving all this to the tcp_close() hook solves it. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net: dsa: felix: fix broken VLAN-tagged PTP under VLAN-aware bridgeVladimir Oltean
[ Upstream commit 92f62485b3715882cd397b0cbd80a96d179b86d6 ] Normally it is expected that the dsa_device_ops :: rcv() method finishes parsing the DSA tag and consumes it, then never looks at it again. But commit c0bcf537667c ("net: dsa: ocelot: add hardware timestamping support for Felix") added support for RX timestamping in a very unconventional way. On this switch, a partial timestamp is available in the DSA header, but the driver got away with not parsing that timestamp right away, but instead delayed that parsing for a little longer: dsa_switch_rcv(): nskb = cpu_dp->rcv(skb, dev); <------------- not here -> ocelot_rcv() ... skb = nskb; skb_push(skb, ETH_HLEN); skb->pkt_type = PACKET_HOST; skb->protocol = eth_type_trans(skb, skb->dev); ... if (dsa_skb_defer_rx_timestamp(p, skb)) <--- but here -> felix_rxtstamp() return 0; When in felix_rxtstamp(), this driver accounted for the fact that eth_type_trans() happened in the meanwhile, so it got a hold of the extraction header again by subtracting (ETH_HLEN + OCELOT_TAG_LEN) bytes from the current skb->data. This worked for quite some time but was quite fragile from the very beginning. Not to mention that having DSA tag parsing split in two different files, under different folders (net/dsa/tag_ocelot.c vs drivers/net/dsa/ocelot/felix.c) made it quite non-obvious for patches to come that they might break this. Finally, the blamed commit does the following: at the end of ocelot_rcv(), it checks whether the skb payload contains a VLAN header. If it does, and this port is under a VLAN-aware bridge, that VLAN ID might not be correct in the sense that the packet might have suffered VLAN rewriting due to TCAM rules (VCAP IS1). So we consume the VLAN ID from the skb payload using __skb_vlan_pop(), and take the classified VLAN ID from the DSA tag, and construct a hwaccel VLAN tag with the classified VLAN, and the skb payload is VLAN-untagged. The big problem is that __skb_vlan_pop() does: memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN); __skb_pull(skb, VLAN_HLEN); aka it moves the Ethernet header 4 bytes to the right, and pulls 4 bytes from the skb headroom (effectively also moving skb->data, by definition). So for felix_rxtstamp()'s fragile logic, all bets are off now. Instead of having the "extraction" pointer point to the DSA header, it actually points to 4 bytes _inside_ the extraction header. Corollary, the last 4 bytes of the "extraction" header are in fact 4 stale bytes of the destination MAC address from the Ethernet header, from prior to the __skb_vlan_pop() movement. So of course, RX timestamps are completely bogus when the system is configured in this way. The fix is actually very simple: just don't structure the code like that. For better or worse, the DSA PTP timestamping API does not offer a straightforward way for drivers to present their RX timestamps, but other drivers (sja1105) have established a simple mechanism to carry their RX timestamp from dsa_device_ops :: rcv() all the way to dsa_switch_ops :: port_rxtstamp() and even later. That mechanism is to simply save the partial timestamp to the skb->cb, and complete it later. Question: why don't we simply populate the skb's struct skb_shared_hwtstamps from ocelot_rcv(), and bother with this complication of propagating the timestamp to felix_rxtstamp()? Answer: dsa_switch_ops :: port_rxtstamp() answers the question whether PTP packets need sleepable context to retrieve the full RX timestamp. 
Currently felix_rxtstamp() answers "no, thanks" to that question, and calls ocelot_ptp_gettime64() from softirq atomic context. This is understandable, since Felix VSC9959 is a PCIe memory-mapped switch, so hardware access does not require sleeping. But the felix driver is preparing for the introduction of other switches where hardware access is over a slow bus like SPI or MDIO: https://lore.kernel.org/lkml/20210814025003.2449143-1-colin.foster@in-advantage.com/ So I would like to keep this code structure, so the rework needed when that driver will need PTP support will be minimal (answer "yes, I need deferred context for this skb's RX timestamp", then the partial timestamp will still be found in the skb->cb. Fixes: ea440cd2d9b2 ("net: dsa: tag_ocelot: use VLAN information from tagging header when available") Reported-by: Po Liu <po.liu@nxp.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net: dsa: tag_ocelot: break circular dependency with ocelot switch lib driverVladimir Oltean
[ Upstream commit deab6b1cd9789bb9bd466d5e76aecb8b336259b4 ] As explained here: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/ DSA tagging protocol drivers cannot depend on symbols exported by switch drivers, because this creates a circular dependency that breaks module autoloading. The tag_ocelot.c file depends on the ocelot_ptp_rew_op() function exported by the common ocelot switch lib. This function looks at OCELOT_SKB_CB(skb) and computes how to populate the REW_OP field of the DSA tag, for PTP timestamping (the command: one-step/two-step, and the TX timestamp identifier). None of that requires deep insight into the driver, it is quite stateless, as it only depends upon the skb->cb. So let's make it a static inline function and put it in include/linux/dsa/ocelot.h, a file that despite its name is used by the ocelot switch driver for populating the injection header too - since commit 40d3f295b5fe ("net: mscc: ocelot: use common tag parsing code with DSA"). With that function declared as static inline, its body is expanded inside each call site, so the dependency is broken and the DSA tagger can be built without the switch library, upon which the felix driver depends. Fixes: 39e5308b3250 ("net: mscc: ocelot: support PTP Sync one-step timestamping") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net: vlan: fix a UAF in vlan_dev_real_dev()Ziyang Xuan
[ Upstream commit 563bcbae3ba233c275c244bfce2efe12938f5363 ] The real_dev of a vlan net_device may be freed after unregister_vlan_dev(). Access the real_dev continually by vlan_dev_real_dev() will trigger the UAF problem for the real_dev like following: ================================================================== BUG: KASAN: use-after-free in vlan_dev_real_dev+0xf9/0x120 Call Trace: kasan_report.cold+0x83/0xdf vlan_dev_real_dev+0xf9/0x120 is_eth_port_of_netdev_filter.part.0+0xb1/0x2c0 is_eth_port_of_netdev_filter+0x28/0x40 ib_enum_roce_netdev+0x1a3/0x300 ib_enum_all_roce_netdevs+0xc7/0x140 netdevice_event_work_handler+0x9d/0x210 ... Freed by task 9288: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xfc/0x130 slab_free_freelist_hook+0xdd/0x240 kfree+0xe4/0x690 kvfree+0x42/0x50 device_release+0x9f/0x240 kobject_put+0x1c8/0x530 put_device+0x1b/0x30 free_netdev+0x370/0x540 ppp_destroy_interface+0x313/0x3d0 ... Move the put_device(real_dev) to vlan_dev_free(). Ensure real_dev not be freed before vlan_dev unregistered. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+e4df4e1389e28972e955@syzkaller.appspotmail.com Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17ethtool: fix ethtool msg len calculation for pause statsJakub Kicinski
[ Upstream commit 1aabe578dd86e9f2867c4db4fba9a15f4ba1825d ] ETHTOOL_A_PAUSE_STAT_MAX is the MAX attribute id, so we need to subtract the non-stat attributes and add one to get the count (IOW -2+1 == -1). Otherwise we'll see: ethnl cmd 21: calculated reply length 40, but consumed 52 Fixes: 9a27a33027f2 ("ethtool: add standard pause stats") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
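The count bug is pure arithmetic: the MAX id is inclusive and the first two ids (UNSPEC and PAD) are not statistics, so the number of stats is MAX - 2 + 1. A tiny C check of that reasoning (attribute values are illustrative, mirroring the usual UNSPEC/PAD layout):

    #include <assert.h>

    /* Illustrative attribute ids in the usual netlink layout:
     * index 0 is UNSPEC, index 1 is PAD, the rest are real statistics. */
    enum {
            PAUSE_STAT_UNSPEC,      /* 0 - not a statistic */
            PAUSE_STAT_PAD,         /* 1 - not a statistic */
            PAUSE_STAT_TX_FRAMES,   /* 2 */
            PAUSE_STAT_RX_FRAMES,   /* 3 */
            __PAUSE_STAT_CNT,
            PAUSE_STAT_MAX = __PAUSE_STAT_CNT - 1,
    };

    int main(void)
    {
            /* Number of real statistics: subtract the two non-stat ids,
             * add one because MAX is an inclusive id, i.e. MAX - 2 + 1. */
            int n_stats = PAUSE_STAT_MAX - 2 + 1;

            assert(n_stats == 2);                    /* TX_FRAMES and RX_FRAMES */
            assert(n_stats == PAUSE_STAT_MAX - 1);   /* the "-2+1 == -1" in the log */
            return 0;
    }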
2021-11-17netfilter: nfnetlink_queue: fix OOB when mac header was clearedFlorian Westphal
[ Upstream commit 5648b5e1169ff1d6d6a46c35c0b5fbebd2a5cbb2 ] On 64bit platforms the MAC header is set to 0xffff on allocation and also when a helper like skb_unset_mac_header() is called. dev_parse_header may call skb_mac_header() which assumes valid mac offset: BUG: KASAN: use-after-free in eth_header_parse+0x75/0x90 Read of size 6 at addr ffff8881075a5c05 by task nf-queue/1364 Call Trace: memcpy+0x20/0x60 eth_header_parse+0x75/0x90 __nfqnl_enqueue_packet+0x1a61/0x3380 __nf_queue+0x597/0x1300 nf_queue+0xf/0x40 nf_hook_slow+0xed/0x190 nf_hook+0x184/0x440 ip_output+0x1c0/0x2a0 nf_reinject+0x26f/0x700 nfqnl_recv_verdict+0xa16/0x18b0 nfnetlink_rcv_msg+0x506/0xe70 The existing code only works if the skb has a mac header. Fixes: 2c38de4c1f8da7 ("netfilter: fix looped (broad|multi)cast's MAC handling") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
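The fix boils down to only copying the hardware header when one was actually recorded. A userspace C sketch of that guard, using 0xffff as the "unset" marker described in the commit message (the struct and helper names are simplified stand-ins):

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define MAC_HEADER_UNSET 0xffffU
    #define ETH_ALEN 6

    struct skb_model {
            uint16_t mac_header;        /* offset of the MAC header, or 0xffff */
            unsigned char data[64];
    };

    static int mac_header_was_set(const struct skb_model *skb)
    {
            return skb->mac_header != MAC_HEADER_UNSET;
    }

    /* Copy the source MAC into hwaddr only if the skb really has a MAC header. */
    static int copy_hw_addr(const struct skb_model *skb, unsigned char *hwaddr)
    {
            if (!mac_header_was_set(skb))
                    return 0;           /* nothing to report, no out-of-bounds read */
            memcpy(hwaddr, skb->data + skb->mac_header + ETH_ALEN, ETH_ALEN);
            return ETH_ALEN;
    }

    int main(void)
    {
            struct skb_model skb = { .mac_header = MAC_HEADER_UNSET };
            unsigned char hw[ETH_ALEN];

            assert(copy_hw_addr(&skb, hw) == 0);    /* cleared header: skip the copy */
            skb.mac_header = 0;
            assert(copy_hw_addr(&skb, hw) == ETH_ALEN);
            return 0;
    }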
2021-11-17nfsd: don't alloc under spinlock in rpc_parse_scope_idJ. Bruce Fields
[ Upstream commit 9b6e27d01adcec58e046c624874f8a124e8b07ec ] Dan Carpenter says: The patch d20c11d86d8f: "nfsd: Protect session creation and client confirm using client_lock" from Jul 30, 2014, leads to the following Smatch static checker warning: net/sunrpc/addr.c:178 rpc_parse_scope_id() warn: sleeping in atomic context Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: d20c11d86d8f ("nfsd: Protect session creation and client...") Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17skmsg: Lose offset info in sk_psock_skb_ingressLiu Jian
[ Upstream commit 7303524e04af49a47991e19f895c3b8cdc3796c7 ] If sockmap enables strparser, the offset info is lost in sk_psock_skb_ingress(). If the length determined by the parse_msg function is not skb->len, the skb will be converted to sk_msg multiple times, and the userspace app will get the data multiple times. Fix this by getting the offset and length from strp_msg. And as Cong suggested, add one bit in skb->_sk_redir to distinguish whether strparser is enabled or disabled. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17udp6: allow SO_MARK ctrl msg to affect routingJakub Kicinski
[ Upstream commit 42dcfd850e514b229d616a53dec06d0f2533217c ] Commit c6af0c227a22 ("ip: support SO_MARK cmsg") added propagation of SO_MARK from cmsg to skb->mark. For IPv4 and raw sockets the mark also affects route lookup, but in case of IPv6 the flow info is initialized before cmsg is parsed. Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg") Reported-and-tested-by: Xintong Hu <huxintong@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net: bridge: fix uninitialized variables when BRIDGE_CFM is disabledIvan Vecera
[ Upstream commit 829e050eea69c7442441b714b6f5b339b5b8c367 ] Function br_get_link_af_size_filtered() calls br_cfm_{,peer}_mep_count(), which return a count. When BRIDGE_CFM is not enabled these functions simply return -EOPNOTSUPP but do not modify the count parameter, and the calling function then works with uninitialized variables. Modify these inline functions to return zero in the count parameter. Fixes: b6d0425b816e ("bridge: cfm: Netlink Notifications.") Cc: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
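The shape of the fix is to make the disabled-config stubs still write a well-defined value through their output parameter. A small C illustration of that pattern (function and config names are stand-ins for the CFM helpers):

    #include <errno.h>
    #include <stddef.h>
    #include <stdio.h>

    /* #define CONFIG_BRIDGE_CFM 1 */

    #ifdef CONFIG_BRIDGE_CFM
    static int cfm_mep_count(size_t *count)
    {
            *count = 3;                 /* a real implementation would count MEPs */
            return 0;
    }
    #else
    /* Stub for builds without the feature: still initialize *count, so callers
     * that ignore the error code don't read an uninitialized value. */
    static int cfm_mep_count(size_t *count)
    {
            *count = 0;
            return -EOPNOTSUPP;
    }
    #endif

    int main(void)
    {
            size_t count;               /* deliberately left uninitialized by the caller */
            int err = cfm_mep_count(&count);

            printf("err=%d count=%zu\n", err, count);   /* count is always defined */
            return 0;
    }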
2021-11-17sctp: return true only for pathmtu update in sctp_transport_pl_toobigXin Long
[ Upstream commit 75cf662c64dd8543f56c329c69eba18141c8fd9f ] sctp_transport_pl_toobig() is supposed to return true only if there is a pathmtu update, so that sctp_icmp_frag_needed() will call sctp_assoc_sync_pmtu() and sctp_retransmit(). This patch fixes these return places in sctp_transport_pl_toobig(). Fixes: 836964083177 ("sctp: do state transition when receiving an icmp TOOBIG packet") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17sctp: allow IP fragmentation when PLPMTUD enters Error stateXin Long
[ Upstream commit 40171248bb8934537fec8fbaf718e57c8add187c ] Currently when PLPMTUD enters Error state, transport pathmtu will be set to MIN_PLPMTU(512) while probe is continuing with BASE_PLPMTU(1200). It will cause pathmtu to stay in a very small value, even if the real pmtu is some value like 1000. RFC8899 doesn't clearly say how to set the value in Error state. But one possibility could be keep using BASE_PLPMTU for the real pmtu, but allow to do IP fragmentation when it's in Error state. As it says in rfc8899#section-5.4: Some paths could be unable to sustain packets of the BASE_PLPMTU size. The Error State could be implemented to provide robustness to such paths. This allows fallback to a smaller than desired PLPMTU rather than suffer connectivity failure. This could utilize methods such as endpoint IP fragmentation to enable the PL sender to communicate using packets smaller than the BASE_PLPMTU. This patch is to set pmtu to BASE_PLPMTU instead of MIN_PLPMTU for Error state in sctp_transport_pl_send/toobig(), and set packet ipfragok for non-probe packets when it's in Error state. Fixes: 1dc68c194571 ("sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path") Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()Jon Maxwell
[ Upstream commit cf12e6f9124629b18a6182deefc0315f0a73a199 ] v1: Implement a more general statement as recommended by Eric Dumazet. The sequence number will be advanced, so this check will fix the FIN case and other cases. A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that the write_queue was not empty as determined by tcp_write_queue_empty() but the sk_buff containing the FIN flag had been freed and the socket was zombied in that state. Corresponding pcaps show no FIN from the Linux kernel on the wire. Some instrumentation was added to the kernel and it was found that there is a timing window where tcp_sendmsg() can run after tcp_send_fin(). tcp_sendmsg() will hit an error, for example: 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩ 1270 ▹ ▹ goto do_error;↩ tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent. If the other side sends a FIN packet the socket will transition to CLOSING and remain that way until the system is rebooted. Fix this by checking for the FIN flag in the sk_buff and don't free it if that is the case. Testing confirmed that fixed the issue. Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz> Reported-by: Simon Stier <simon.stier@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
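The "more general statement" from the commit is to treat an skb as removable only when its sequence space is empty, which automatically keeps a FIN (which consumes one sequence number) on the queue. A C sketch of that condition (the skb model is simplified; seq/end_seq follow the usual TCP meaning):

    #include <assert.h>
    #include <stdint.h>

    struct skb_model {
            uint32_t seq;       /* first sequence number covered by this skb */
            uint32_t end_seq;   /* seq + payload length (+1 if FIN is set)   */
            int has_fin;
    };

    /* Only drop the skb if it occupies no sequence space at all.
     * An empty skb carrying a FIN has end_seq == seq + 1, so it is kept. */
    static int can_remove_empty_skb(const struct skb_model *skb)
    {
            return skb->seq == skb->end_seq;
    }

    int main(void)
    {
            struct skb_model empty = { .seq = 100, .end_seq = 100, .has_fin = 0 };
            struct skb_model fin   = { .seq = 100, .end_seq = 101, .has_fin = 1 };

            assert(can_remove_empty_skb(&empty));   /* truly empty: safe to free */
            assert(!can_remove_empty_skb(&fin));    /* FIN must stay queued */
            return 0;
    }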
2021-11-17net: dsa: avoid refcount warnings when ->port_{fdb,mdb}_del returns errorVladimir Oltean
[ Upstream commit 232deb3f9567ce37d99b8616a6c07c1fc0436abf ] At present, when either of ds->ops->port_fdb_del() or ds->ops->port_mdb_del() return a non-zero error code, we attempt to save the day and keep the data structure associated with that switchdev object, as the deletion procedure did not complete. However, the way in which we do this is suspicious to the checker in lib/refcount.c, who thinks it is buggy to increment a refcount that became zero, and that this is indicative of a use-after-free. Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net: stream: don't purge sk_error_queue in sk_stream_kill_queues()Jakub Kicinski
[ Upstream commit 24bcbe1cc69fa52dc4f7b5b2456678ed464724d8 ] sk_stream_kill_queues() can be called on close when there are still outstanding skbs to transmit. Those skbs may try to queue notifications to the error queue (e.g. timestamps). If sk_stream_kill_queues() purges the queue without taking its lock the queue may get corrupted, and skbs leaked. This shows up as a warning about an rmem leak: WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x... The leak is always a multiple of 0x300 bytes (the value is in %rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of an empty sk_buff. Indeed if we dump the socket state at the time of the warning the sk_error_queue is often (but not always) corrupted. The ->next pointer points back at the list head, but not the ->prev pointer. Indeed we can find the leaked skb by scanning the kernel memory for something that looks like an skb with ->sk = socket in question, and ->truesize = 0x300. The contents of ->cb[] of the skb confirms the suspicion that it is indeed a timestamp notification (as generated in __skb_complete_tx_timestamp()). Removing purging of sk_error_queue should be okay, since inet_sock_destruct() does it again once all socket refs are gone. Eric suggests this may cause sockets that go thru disconnect() to maintain notifications from the previous incarnations of the socket, but that should be okay since the race was there anyway, and disconnect() is not exactly dependable. Thanks to Jonathan Lemon and Omar Sandoval for help at various stages of tracing the issue. Fixes: cb9eff097831 ("net: new user space API for time stamping of incoming and outgoing packets") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17tcp: switch orphan_count to bare per-cpu countersEric Dumazet
[ Upstream commit 19757cebf0c5016a1f36f7fe9810a9f0b33c0832 ] Use of percpu_counter structure to track count of orphaned sockets is causing problems on modern hosts with 256 cpus or more. Stefan Bach reported a serious spinlock contention in real workloads, that I was able to reproduce with a netfilter rule dropping incoming FIN packets. 53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath | ---queued_spin_lock_slowpath | --53.51%--_raw_spin_lock_irqsave | --53.51%--__percpu_counter_sum tcp_check_oom | |--39.03%--__tcp_close | tcp_close | inet_release | inet6_release | sock_close | __fput | ____fput | task_work_run | exit_to_usermode_loop | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __GI___libc_close | --14.48%--tcp_out_of_resources tcp_write_timeout tcp_retransmit_timer tcp_write_timer_handler tcp_write_timer call_timer_fn expire_timers __run_timers run_timer_softirq __softirqentry_text_start As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter batch for dst entries accounting"), default batch size is too big for the default value of tcp_max_orphans (262144). But even if we reduce batch sizes, there would still be cases where the estimated count of orphans is beyond the limit, and where tcp_too_many_orphans() has to call the expensive percpu_counter_sum_positive(). One solution is to use plain per-cpu counters, and have a timer to periodically refresh this cache. Updating this cache every 100ms seems about right, tcp pressure state is not radically changing over shorter periods. percpu_counter was nice 15 years ago while hosts had less than 16 cpus, not anymore by current standards. v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m, reported by kernel test robot <lkp@intel.com> Remove unused socket argument from tcp_too_many_orphans() Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stefan Bach <sfb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
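The scheme described, plain per-cpu counters plus a periodically refreshed cached total, can be modeled in a few lines of userspace C. The sketch below uses per-slot counters and a timestamped cache to show why readers no longer need an exact, lock-heavy sum (the 100 ms period comes from the commit text; everything else is illustrative):

    #include <stdio.h>
    #include <time.h>

    #define NR_SLOTS 4              /* stand-in for NR_CPUS */
    #define REFRESH_NS 100000000L   /* refresh the cached sum every 100 ms */

    static long per_slot[NR_SLOTS]; /* plain per-cpu counters, written locally */
    static long cached_sum;         /* cheap, slightly stale global estimate */
    static long last_refresh_ns;

    static long now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1000000000L + ts.tv_nsec;
    }

    static void orphan_inc(int slot) { per_slot[slot]++; }   /* no shared lock */
    static void orphan_dec(int slot) { per_slot[slot]--; }

    /* Readers use the cached value; it is refreshed at most every 100 ms,
     * so the expensive walk over all slots is bounded, not per-call. */
    static long orphan_count_estimate(void)
    {
            long now = now_ns();

            if (now - last_refresh_ns >= REFRESH_NS) {
                    long sum = 0;
                    for (int i = 0; i < NR_SLOTS; i++)
                            sum += per_slot[i];
                    cached_sum = sum;
                    last_refresh_ns = now;
            }
            return cached_sum;
    }

    int main(void)
    {
            orphan_inc(0);
            orphan_inc(1);
            orphan_dec(0);
            printf("estimated orphans: %ld\n", orphan_count_estimate());
            return 0;
    }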
2021-11-17net, neigh: Fix NTF_EXT_LEARNED in combination with NTF_USEDaniel Borkmann
[ Upstream commit e4400bbf5b15750e1b59bf4722d18d99be60c69f ] The NTF_EXT_LEARNED neigh flag is usually propagated back to user space upon dump of the neighbor table. However, when used in combination with NTF_USE flag this is not the case despite exempting the entry from the garbage collector. This results in inconsistent state since entries are typically marked in neigh->flags with NTF_EXT_LEARNED, but here they are not. Fix it by propagating the creation flag to ___neigh_create(). Before fix: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] After fix: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] Fixes: 9ce33e46531d ("neighbour: support for NTF_EXT_LEARNED flag") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17netfilter: nft_dynset: relax superfluous check on set updatesPablo Neira Ayuso
[ Upstream commit 7b1394892de8d95748d05e3ee41e85edb4abbfa1 ] Relax this condition to make add and update commands idempotent for sets with no timeout. The eval function already checks if the set element timeout is available and updates it if the update command is used. Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17cfg80211: always free wiphy specific regdomainJohannes Berg
[ Upstream commit e53e9828a8d2c6545e01ff9711f1221f2fd199ce ] In the (somewhat unlikely) event that we allocate a wiphy, then add a regdomain to it, and then fail registration, we leak the regdomain. Fix this by just always freeing it at the end, in the normal cases we'll free (and NULL) it during wiphy_unregister(). This happened when the wiphy settings were bad, and since they can be controlled by userspace with hwsim, syzbot was able to find this issue. Reported-by: syzbot+1638e7c770eef6b6c0d0@syzkaller.appspotmail.com Fixes: 3e0c3ff36c4c ("cfg80211: allow multiple driver regulatory_hints()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20210927131105.68b70cef4674.I4b9f0aa08c2af28555963b9fe3d34395bb72e0cc@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17rxrpc: Fix _usecs_to_jiffies() by using usecs_to_jiffies()Jiasheng Jiang
[ Upstream commit acde891c243c1ed85b19d4d5042bdf00914f5739 ] Directly using _usecs_to_jiffies() might be unsafe, so it's better to use usecs_to_jiffies() instead, because the result of _usecs_to_jiffies() can be larger than MAX_JIFFY_OFFSET when the input is not checked. Fixes: c410bf01933e ("Fix the excessive initial retransmission timeout") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
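The difference between the two helpers is the overflow check: usecs_to_jiffies() clamps at MAX_JIFFY_OFFSET, the underscore variant does not. A small C model of that clamping (constants are illustrative, HZ=1000 assumed):

    #include <assert.h>

    #define HZ 1000ULL                          /* illustrative tick rate */
    #define MAX_JIFFY_OFFSET 0x7fffffffULL      /* illustrative clamp value */

    /* Raw conversion: no input validation, can produce absurd values. */
    static unsigned long long raw_usecs_to_jiffies(unsigned long long usec)
    {
            return usec / (1000000ULL / HZ);    /* 1000 us per jiffy at HZ=1000 */
    }

    /* Checked conversion: clamp oversized results to MAX_JIFFY_OFFSET. */
    static unsigned long long checked_usecs_to_jiffies(unsigned long long usec)
    {
            unsigned long long j = raw_usecs_to_jiffies(usec);

            return j > MAX_JIFFY_OFFSET ? MAX_JIFFY_OFFSET : j;
    }

    int main(void)
    {
            unsigned long long huge = ~0ULL;    /* e.g. an unvalidated timeout */

            assert(raw_usecs_to_jiffies(huge) > MAX_JIFFY_OFFSET);       /* unsafe */
            assert(checked_usecs_to_jiffies(huge) == MAX_JIFFY_OFFSET);  /* clamped */
            return 0;
    }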
2021-11-17net: net_namespace: Fix undefined member in key_remove_domain()Yajun Deng
[ Upstream commit aed0826b0cf2e488900ab92193893e803d65c070 ] The key_domain member in struct net only exists if CONFIG_KEYS is defined, so uses of key_domain should be guarded by the same define. Fixes: 9b242610514f ("keys: Network namespace domain tag") Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
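The fix pattern is simply to compile the key_domain access only when the member exists. A minimal C illustration of guarding a struct member and its users with the same config symbol (struct and function names are simplified stand-ins):

    #include <stddef.h>
    #include <stdio.h>

    /* #define CONFIG_KEYS 1 */

    struct net_model {
            int user_ns;                /* always present */
    #ifdef CONFIG_KEYS
            void *key_domain;           /* only exists when CONFIG_KEYS is set */
    #endif
    };

    static void key_remove_domain(struct net_model *net)
    {
    #ifdef CONFIG_KEYS
            /* Only touch the member under the same guard that defines it,
             * otherwise the build breaks when CONFIG_KEYS is disabled. */
            net->key_domain = NULL;
    #else
            (void)net;                  /* nothing to clean up */
    #endif
    }

    int main(void)
    {
            struct net_model net = { .user_ns = 0 };

            key_remove_domain(&net);
            printf("cleanup done\n");
            return 0;
    }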
2021-11-17Bluetooth: fix init and cleanup of sco_conn.timeout_workDesmond Cheong Zhi Xi
[ Upstream commit 49d8a5606428ca0962d09050a5af81461ff90fbb ] Before freeing struct sco_conn, all delayed timeout work should be cancelled. Otherwise, sco_sock_timeout could potentially use the sco_conn after it has been freed. Additionally, sco_conn.timeout_work should be initialized when the connection is allocated, not when the channel is added. This is because an sco_conn can create channels with multiple sockets over its lifetime, which happens if sockets are released but the connection isn't deleted. Fixes: ba316be1b6a0 ("Bluetooth: schedule SCO timeouts with delayed_work") Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17netfilter: conntrack: set on IPS_ASSURED if flows enters internal stream statePablo Neira Ayuso
[ Upstream commit b7b1d02fc43925a4d569ec221715db2dfa1ce4f5 ] The internal stream state sets the timeout to 120 seconds 2 seconds after the creation of the flow, attach this internal stream state to the IPS_ASSURED flag for consistent event reporting. Before this patch: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Note IPS_ASSURED for the flow not yet in the internal stream state. after this update: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 120 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Before this patch, short-lived UDP flows never entered IPS_ASSURED, so they were already candidate flow to be deleted by early_drop under stress. Before this patch, IPS_ASSURED is set on regardless the internal stream state, attach this internal stream state to IPS_ASSURED. packet #1 (original direction) enters NEW state packet #2 (reply direction) enters ESTABLISHED state, sets on IPS_SEEN_REPLY paclet #3 (any direction) sets on IPS_ASSURED (if 2 seconds since the creation has passed by). Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17gre/sit: Don't generate link-local addr if addr_gen_mode is IN6_ADDR_GEN_MODE_NONEStephen Suryaputra
[ Upstream commit 61e18ce7348bfefb5688a8bcd4b4d6b37c0f9b2a ] When addr_gen_mode is set to IN6_ADDR_GEN_MODE_NONE, the link-local addr should not be generated. But it isn't the case for GRE (as well as GRE6) and SIT tunnels. Make it so that tunnels consider the addr_gen_mode, especially for IN6_ADDR_GEN_MODE_NONE. Do this in add_v4_addrs() to cover both GRE and SIT only if the addr scope is link. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Antonio Quartulli <a@unstable.cc> Link: https://lore.kernel.org/r/20211020200618.467342-1-ssuryaextr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net-sysfs: try not to restart the syscall if it will fail eventuallyAntoine Tenart
[ Upstream commit 146e5e733310379f51924111068f08a3af0db830 ] Due to deadlocks in the networking subsystem spotted 12 years ago[1], a workaround was put in place[2] to avoid taking the rtnl lock when it was not available and restarting the syscall (back to VFS, letting userspace spin). The following construction is found a lot in the net sysfs and sysctl code: if (!rtnl_trylock()) return restart_syscall(); This can be problematic when multiple userspace threads use such interfaces in a short period, making them to spin a lot. This happens for example when adding and moving virtual interfaces: userspace programs listening on events, such as systemd-udevd and NetworkManager, do trigger actions reading files in sysfs. It gets worse when a lot of virtual interfaces are created concurrently, say when creating containers at boot time. Returning early without hitting the above pattern when the syscall will fail eventually does make things better. While it is not a fix for the issue, it does ease things. [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/ https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/ and https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/ [2] Rightfully, those deadlocks are *hard* to solve. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17net: sched: update default qdisc visibility after Tx queue cnt changesJakub Kicinski
[ Upstream commit 1e080f17750d1083e8a32f7b350584ae1cd7ff20 ] mq / mqprio make the default child qdiscs visible. They only do so for the qdiscs which are within real_num_tx_queues when the device is registered. Depending on order of calls in the driver, or if user space changes config via ethtool -L the number of qdiscs visible under tc qdisc show will differ from the number of queues. This is confusing to users and potentially to system configuration scripts which try to make sure qdiscs have the right parameters. Add a new Qdisc_ops callback and make relevant qdiscs TTRT. Note that this uncovers the "shortcut" created by commit 1f27cde313d7 ("net: sched: use pfifo_fast for non real queues") The default child qdiscs beyond initial real_num_tx are always pfifo_fast, no matter what the sysfs setting is. Fixing this gets a little tricky because we'd need to keep a reference on whatever the default qdisc was at the time of creation. In practice this is likely an non-issue the qdiscs likely have to be configured to non-default settings, so whatever user space is doing such configuration can replace the pfifos... now that it will see them. Reported-by: Matthew Massey <matthewmassey@fb.com> Reviewed-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-17Bluetooth: fix use-after-free error in lock_sock_nested()Wang ShaoBo
[ Upstream commit 1bff51ea59a9afb67d2dd78518ab0582a54a472c ] A use-after-free error in lock_sock_nested is reported:

[  179.140137][ T3731] =====================================================
[  179.142675][ T3731] BUG: KMSAN: use-after-free in lock_sock_nested+0x280/0x2c0
[  179.145494][ T3731] CPU: 4 PID: 3731 Comm: kworker/4:2 Not tainted 5.12.0-rc6+ #54
[  179.148432][ T3731] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[  179.151806][ T3731] Workqueue: events l2cap_chan_timeout
[  179.152730][ T3731] Call Trace:
[  179.153301][ T3731]  dump_stack+0x24c/0x2e0
[  179.154063][ T3731]  kmsan_report+0xfb/0x1e0
[  179.154855][ T3731]  __msan_warning+0x5c/0xa0
[  179.155579][ T3731]  lock_sock_nested+0x280/0x2c0
[  179.156436][ T3731]  ? kmsan_get_metadata+0x116/0x180
[  179.157257][ T3731]  l2cap_sock_teardown_cb+0xb8/0x890
[  179.158154][ T3731]  ? __msan_metadata_ptr_for_load_8+0x10/0x20
[  179.159141][ T3731]  ? kmsan_get_metadata+0x116/0x180
[  179.159994][ T3731]  ? kmsan_get_shadow_origin_ptr+0x84/0xb0
[  179.160959][ T3731]  ? l2cap_sock_recv_cb+0x420/0x420
[  179.161834][ T3731]  l2cap_chan_del+0x3e1/0x1d50
[  179.162608][ T3731]  ? kmsan_get_metadata+0x116/0x180
[  179.163435][ T3731]  ? kmsan_get_shadow_origin_ptr+0x84/0xb0
[  179.164406][ T3731]  l2cap_chan_close+0xeea/0x1050
[  179.165189][ T3731]  ? kmsan_internal_unpoison_shadow+0x42/0x70
[  179.166180][ T3731]  l2cap_chan_timeout+0x1da/0x590
[  179.167066][ T3731]  ? __msan_metadata_ptr_for_load_8+0x10/0x20
[  179.168023][ T3731]  ? l2cap_chan_create+0x560/0x560
[  179.168818][ T3731]  process_one_work+0x121d/0x1ff0
[  179.169598][ T3731]  worker_thread+0x121b/0x2370
[  179.170346][ T3731]  kthread+0x4ef/0x610
[  179.171010][ T3731]  ? process_one_work+0x1ff0/0x1ff0
[  179.171828][ T3731]  ? kthread_blkcg+0x110/0x110
[  179.172587][ T3731]  ret_from_fork+0x1f/0x30
[  179.173348][ T3731]
[  179.173752][ T3731] Uninit was created at:
[  179.174409][ T3731]  kmsan_internal_poison_shadow+0x5c/0xf0
[  179.175373][ T3731]  kmsan_slab_free+0x76/0xc0
[  179.176060][ T3731]  kfree+0x3a5/0x1180
[  179.176664][ T3731]  __sk_destruct+0x8af/0xb80
[  179.177375][ T3731]  __sk_free+0x812/0x8c0
[  179.178032][ T3731]  sk_free+0x97/0x130
[  179.178686][ T3731]  l2cap_sock_release+0x3d5/0x4d0
[  179.179457][ T3731]  sock_close+0x150/0x450
[  179.180117][ T3731]  __fput+0x6bd/0xf00
[  179.180787][ T3731]  ____fput+0x37/0x40
[  179.181481][ T3731]  task_work_run+0x140/0x280
[  179.182219][ T3731]  do_exit+0xe51/0x3e60
[  179.182930][ T3731]  do_group_exit+0x20e/0x450
[  179.183656][ T3731]  get_signal+0x2dfb/0x38f0
[  179.184344][ T3731]  arch_do_signal_or_restart+0xaa/0xe10
[  179.185266][ T3731]  exit_to_user_mode_prepare+0x2d2/0x560
[  179.186136][ T3731]  syscall_exit_to_user_mode+0x35/0x60
[  179.186984][ T3731]  do_syscall_64+0xc5/0x140
[  179.187681][ T3731]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  179.188604][ T3731] =====================================================

In our case there are two threads, A and B, with the following call chains:

Context: Thread A:
l2cap_chan_timeout()
  l2cap_chan_close()
    l2cap_chan_del()
      l2cap_sock_teardown_cb()

Context: Thread B:
__se_sys_shutdown()
  l2cap_sock_shutdown()
    l2cap_chan_close()
      l2cap_sock_teardown_cb()

Once l2cap_sock_teardown_cb() has executed, the sock is marked SOCK_ZAPPED and can be treated as killable in l2cap_sock_kill() once sock_orphan() has executed. At this point the sock may be closed through sock_close(), which ends up calling l2cap_sock_kill(), as in Thread C:

Context: Thread C:
sock_close()
  l2cap_sock_release()
    sock_orphan()
    l2cap_sock_kill()    # frees the sock if refcnt is 1

If C completes, then once A or B reaches l2cap_sock_teardown_cb() again, a use-after-free happens. We should set chan->data to NULL when the sock is destructed, to signal that the teardown operation is no longer allowed in l2cap_sock_teardown_cb(), and we should also avoid killing an already-killed socket in l2cap_sock_close_cb(). Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
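As a rough illustration of the idea (a sketch of the approach described above, not the verbatim upstream diff; surrounding cleanup code is omitted):

static void l2cap_sock_destruct(struct sock *sk)
{
        struct l2cap_chan *chan = l2cap_pi(sk)->chan;

        if (chan) {
                /* The sock is going away: forbid any further teardown
                 * through this channel by clearing its back-pointer.
                 */
                chan->data = NULL;
                l2cap_chan_put(chan);
        }
        /* ... existing skb purging and other cleanup ... */
}

static void l2cap_sock_teardown_cb(struct l2cap_chan *chan, int err)
{
        struct sock *sk = chan->data;

        if (!sk)        /* sock already destructed, nothing to tear down */
                return;

        lock_sock(sk);
        /* ... existing teardown logic ... */
        release_sock(sk);
}

With a guard like this, concurrent timeout/shutdown paths see chan->data == NULL and return early instead of locking a freed sock.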
2021-11-17Bluetooth: sco: Fix lock_sock() blockage by memcpy_from_msg()Takashi Iwai
[ Upstream commit 99c23da0eed4fd20cae8243f2b51e10e66aa0951 ] sco_send_frame() also takes lock_sock() around the memcpy_from_msg() call, which may be blocked endlessly by a task using the userfaultfd technique, resulting in a hung-task watchdog trigger. Just like the similar fix for hci_sock_sendmsg() in commit 92c685dc5de0 ("Bluetooth: reorganize functions..."), this patch moves the memcpy_from_msg() out of lock_sock() to address the hang. This should be the last piece for fixing CVE-2021-3640 after a few already queued fixes. Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
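A minimal sketch of the reordering this describes, assuming the send helper is changed to take a pre-built skb (details of sco_send_frame() are simplified here and not the exact upstream code):

static int sco_sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
{
        struct sock *sk = sock->sk;
        struct sk_buff *skb;
        int err;

        /* Copy user data before taking the socket lock, so a stalled
         * copy (e.g. stalled via userfaultfd) cannot block other
         * lock_sock() holders.
         */
        skb = bt_skb_send_alloc(sk, len, msg->msg_flags & MSG_DONTWAIT, &err);
        if (!skb)
                return err;

        if (memcpy_from_msg(skb_put(skb, len), msg, len)) {
                kfree_skb(skb);
                return -EFAULT;
        }

        lock_sock(sk);
        if (sk->sk_state == BT_CONNECTED)
                err = sco_send_frame(sk, skb);  /* assumed: consumes skb on success */
        else
                err = -ENOTCONN;
        release_sock(sk);

        if (err < 0)
                kfree_skb(skb);                 /* assumed: skb not consumed on error */
        return err < 0 ? err : len;
}

Copying before locking trades a possibly wasted allocation (if the state changes) for never sleeping in a user copy while holding the socket lock.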
2021-11-17can: j1939: j1939_can_recv(): ignore messages with invalid source addressZhang Changzhong
commit a79305e156db3d24fcd8eb649cdb3c3b2350e5c2 upstream. According to SAE-J1939-82 2015 (A.3.6 Row 2), a receiver should never send TP.CM_CTS to the global address, so we can add a check in j1939_can_recv() to drop messages with an invalid source address. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/all/1635431907-15617-3-git-send-email-zhangchangzhong@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
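A hedged sketch of the kind of check described above; the validity helper and the exact placement are illustrative, not necessarily the upstream identifiers:

/* J1939_NO_ADDR (0xff) is the global/broadcast address; a frame whose
 * source address is the global address cannot be a valid unicast sender.
 */
static bool j1939_sa_is_valid(u8 sa)
{
        return sa != J1939_NO_ADDR;     /* simplified validity rule */
}

static void j1939_can_recv(struct sk_buff *skb, void *data)
{
        struct j1939_sk_buff_cb *skcb = j1939_skb_to_cb(skb);

        /* ... existing header parsing fills skcb->addr ... */

        if (!j1939_sa_is_valid(skcb->addr.sa))
                return;         /* drop frames with an invalid source address */

        /* ... hand the frame to the session/transport layer ... */
}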
2021-11-17can: j1939: j1939_tp_cmd_recv(): ignore abort message in the BAM transportZhang Changzhong
commit c0f49d98006f2db3333b917caac65bce2af9865c upstream. This patch prevents the BAM transport from being closed by a received abort message, as specified in SAE-J1939-82 2015 (A.3.3 Row 4). Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/all/1635431907-15617-2-git-send-email-zhangchangzhong@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
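A sketch of the guard such a fix adds in the abort-handling path; the function and parameter names here are illustrative and j1939_cb_is_broadcast() is used as the "this transfer is a BAM (broadcast) session" predicate:

static void j1939_xtp_rx_abort(struct j1939_priv *priv, struct sk_buff *skb,
                               bool transmitter)
{
        struct j1939_sk_buff_cb *skcb = j1939_skb_to_cb(skb);

        /* A TP.Conn_Abort must not tear down a Broadcast Announce Message
         * (BAM) transfer, per SAE-J1939-82 2015 (A.3.3 Row 4).
         */
        if (j1939_cb_is_broadcast(skcb))
                return;         /* ignore aborts for BAM sessions */

        /* ... existing abort processing for RTS/CTS sessions ... */
}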
2021-11-17net/smc: Correct spelling mistake to TCPF_SYN_RECVWen Gu
[ Upstream commit f3a3a0fe0b644582fa5d83dd94b398f99fc57914 ] TCPF_SYN_RECV should be used here instead of TCP_SYN_RECV. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
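For context on why the spelling matters: TCP_SYN_RECV is a state value, while TCPF_SYN_RECV is the corresponding bit flag (1 << TCP_SYN_RECV) from net/tcp_states.h, so masks built from TCPF_* constants must be tested with the flag form. A small sketch (the helper name is hypothetical):

static bool clcsock_is_handshaking(const struct sock *sk)
{
        /* Correct: test the state bit against TCPF_* flags. Using the raw
         * TCP_SYN_RECV value here would build the wrong mask.
         */
        return (1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV);
}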
2021-11-17net/smc: Fix smc_link->llc_testlink_time overflowTony Lu
[ Upstream commit c4a146c7cf5e8ad76231523b174d161bf152c6e7 ] The value of llc_testlink_time is set to the value stored in net->ipv4.sysctl_tcp_keepalive_time when the link group is initialized. sysctl_tcp_keepalive_time is already in jiffies, so we don't need to multiply it by HZ; doing so overflows smc_link->llc_testlink_time and causes a flood of test_link sends. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
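A minimal sketch of the unit fix, with the function and field names following the description above (treat this as an illustration rather than the exact upstream hunk):

static void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
{
        struct net *net = sock_net(smc->clcsock->sk);

        /* sysctl_tcp_keepalive_time is already expressed in jiffies;
         * multiplying it by HZ (as before the fix) overflows
         * llc_testlink_time and floods test_link messages.
         */
        lgr->llc_testlink_time = net->ipv4.sysctl_tcp_keepalive_time;

        /* ... remaining link group LLC initialization ... */
}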
2021-11-17net: multicast: calculate csum of looped-back and forwarded packetsCyril Strejc
[ Upstream commit 9122a70a6333705c0c35614ddc51c274ed1d3637 ] While testing a user-space application that transmits UDP multicast datagrams and relies on multicast routing to send them out of defined network interfaces, I found that a multicast router does not fill in the UDP checksum of locally produced, looped-back and forwarded UDP datagrams if the original output NIC the datagrams are sent to has UDP TX checksum offload enabled. The datagrams then leave the NIC they have been forwarded to malformed. This happens because:

1. If TX checksum offload is enabled on the output NIC, the UDP checksum is not calculated by the kernel and is not filled into the skb data.
2. dev_loopback_xmit(), which is called solely by ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY unconditionally.
3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE"), the ip_summed value is preserved during forwarding.
4. If ip_summed != CHECKSUM_PARTIAL, no checksum is calculated during packet egress.

The minimal fix in dev_loopback_xmit():

1. Preserves skb->ip_summed == CHECKSUM_PARTIAL, which is the case when the original output NIC has TX checksum offload enabled. The effects are: a) if the forwarding destination interface supports TX checksum offloading, the NIC driver is responsible for filling in the checksum; b) if it does NOT, the checksum is filled in by the kernel before the skb is handed to the NIC driver; c) for local delivery, checksum validation is skipped as in the CHECKSUM_UNNECESSARY case, thanks to skb_csum_unnecessary().
2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. For CHECKSUM_NONE the behavior is thus unmodified, and this keeps skipping checksum validation on local delivery of a looped-back packet.

Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
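A hedged sketch of the minimal change described above in dev_loopback_xmit(); the unchanged parts of the function are elided:

int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
{
        /* ... unchanged loopback setup (mac header reset, PACKET_LOOPBACK) ... */

        /* Preserve CHECKSUM_PARTIAL so the forwarding path (or the kernel,
         * if the egress NIC lacks offload) can still fill in the checksum;
         * only CHECKSUM_NONE is upgraded to CHECKSUM_UNNECESSARY, which
         * keeps local-delivery validation skipped as before.
         */
        if (skb->ip_summed == CHECKSUM_NONE)
                skb->ip_summed = CHECKSUM_UNNECESSARY;

        /* ... hand the skb back to the receive path ... */
        return 0;
}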
2021-11-17bpf: Prevent increasing bpf_jit_limit above maxLorenz Bauer
[ Upstream commit fadb7ff1a6c2c565af56b4aacdd086b067eed440 ] Restrict bpf_jit_limit to the maximum supported by the arch's JIT. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com Signed-off-by: Sasha Levin <sashal@kernel.org>
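A sketch of roughly what clamping the writable limit to the arch maximum can look like; the bpf_jit_limit_max name and the exact default calculation are assumptions here, not necessarily the upstream code:

static long bpf_jit_limit_max;          /* assumed name for the ceiling */

static int __init bpf_jit_charge_init(void)
{
        /* arch-provided ceiling for executable JIT allocations */
        bpf_jit_limit_max = bpf_jit_alloc_exec_limit();

        /* default limit: a quarter of the ceiling, page aligned */
        bpf_jit_limit = min_t(u64, round_up(bpf_jit_limit_max >> 2, PAGE_SIZE),
                              LONG_MAX);
        return 0;
}
pure_initcall(bpf_jit_charge_init);

The sysctl entry for bpf_jit_limit can then use bpf_jit_limit_max as its upper bound, so writes above what the JIT can actually allocate are rejected.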
2021-11-04Merge tag 'v5.14.16' into v5.14/standard/baseBruce Ashfield
This is the 5.14.16 stable release # gpg: Signature made Tue 02 Nov 2021 02:51:07 PM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-11-02sctp: add vtag check in sctp_sf_ootbXin Long
[ Upstream commit 9d02831e517aa36ee6bdb453a0eb47bd49923fe3 ] sctp_sf_ootb() is called when processing a DATA chunk in the closed state, and it is also used in many other places. The vtag in the chunk's sctphdr should be verified; otherwise, as later in the chunk length check, it may send an abort with the existing asoc's vtag, which can be exploited to craft a malicious chunk that terminates an SCTP asoc. When the vtag from the chunk fails verification, this patch sets asoc to NULL, so that the abort sent later carries the vtag from the received chunk. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
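The pattern behind this fix can be sketched as below; sctp_vtag_verify() compares the chunk's vtag with the association's, and on failure asoc is dropped so any abort is built from the received chunk's vtag. The body is abbreviated and the final return is a placeholder:

enum sctp_disposition sctp_sf_ootb(struct net *net,
                                   const struct sctp_endpoint *ep,
                                   const struct sctp_association *asoc,
                                   const union sctp_subtype type,
                                   void *arg, struct sctp_cmd_seq *commands)
{
        struct sctp_chunk *chunk = arg;

        /* Added guard (sketch): if the chunk's vtag does not match this
         * association, treat it as if no association exists, so the abort
         * reflects the vtag from the received chunk rather than the
         * existing asoc's vtag.
         */
        if (!sctp_vtag_verify(chunk, asoc))
                asoc = NULL;

        /* ... existing OOTB processing and chunk length checks ... */
        return SCTP_DISPOSITION_CONSUME;        /* placeholder return */
}

The same verify-then-drop-asoc pattern underlies the sctp_sf_do_8_5_1_E_sa() and sctp_sf_violation() vtag fixes below.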
2021-11-02sctp: add vtag check in sctp_sf_do_8_5_1_E_saXin Long
[ Upstream commit ef16b1734f0a176277b7bb9c71a6d977a6ef3998 ] sctp_sf_do_8_5_1_E_sa() is called when processing a SHUTDOWN_ACK chunk in the cookie_wait and cookie_echoed states. The vtag in the chunk's sctphdr should be verified; otherwise, as later in the chunk length check, it may send an abort with the existing asoc's vtag, which can be exploited to craft a malicious chunk that terminates an SCTP asoc. Note that when the vtag from the SHUTDOWN-ACK chunk fails verification, a SHUTDOWN COMPLETE message will still be sent back to the peer, but with the vtag from the SHUTDOWN-ACK chunk, as described in 5) of rfc4960#section-8.4. While at it, also remove the unnecessary chunk length check from sctp_sf_shut_8_4_5(), as it is already done in both places that call sctp_sf_shut_8_4_5(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-02sctp: add vtag check in sctp_sf_violationXin Long
[ Upstream commit aa0f697e45286a6b5f0ceca9418acf54b9099d99 ] sctp_sf_violation() is called when processing a HEARTBEAT_ACK chunk in the cookie_wait state, and it is also used in some other places. The vtag in the chunk's sctphdr should be verified; otherwise, as later in the chunk length check, it may send an abort with the existing asoc's vtag, which can be exploited to craft a malicious chunk that terminates an SCTP asoc. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-02sctp: fix the processing for COOKIE_ECHO chunkXin Long
[ Upstream commit a64b341b8695e1c744dd972b39868371b4f68f83 ] 1. In the closed state, in sctp_sf_do_5_1D_ce(): when asoc is NULL, the packet made for the abort will use the chunk's vtag in sctp_ootb_pkt_new(). But when an asoc exists, the vtag from the chunk should be verified before peer.i.init_tag is used to make the abort packet in sctp_ootb_pkt_new(), and the chunk should simply be discarded if the vtag is not correct. 2. In the other states, in sctp_sf_do_5_2_4_dupcook(): an asoc always exists, but the duplicate cookie_echo's vtag will be handled by sctp_tietags_compare(), which then takes the appropriate actions; so before that we only verify the vtag for the abort sent for an invalid chunk length. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
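A hedged sketch of the closed-state part of this fix in sctp_sf_do_5_1D_ce(); the duplicate-cookie path in sctp_sf_do_5_2_4_dupcook() is handled separately via sctp_tietags_compare() as described above, the body here is abbreviated, and the final return is a placeholder:

enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
                                         const struct sctp_endpoint *ep,
                                         const struct sctp_association *asoc,
                                         const union sctp_subtype type,
                                         void *arg, struct sctp_cmd_seq *commands)
{
        struct sctp_chunk *chunk = arg;

        /* When an asoc exists, only trust its peer.i.init_tag for building
         * an abort if the received chunk's vtag actually matches it;
         * otherwise just discard the chunk.
         */
        if (asoc && !sctp_vtag_verify(chunk, asoc))
                return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);

        /* ... existing COOKIE ECHO processing ... */
        return SCTP_DISPOSITION_CONSUME;        /* placeholder return */
}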