aboutsummaryrefslogtreecommitdiffstats
path: root/net/mptcp/protocol.c
AgeCommit message (Collapse)Author
2024-02-26mptcp: fix double-free on socket dismantleDavide Caratti
when MPTCP server accepts an incoming connection, it clones its listener socket. However, the pointer to 'inet_opt' for the new socket has the same value as the original one: as a consequence, on program exit it's possible to observe the following splat: BUG: KASAN: double-free in inet_sock_destruct+0x54f/0x8b0 Free of addr ffff888485950880 by task swapper/25/0 CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 6.8.0-rc1+ #609 Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0 07/26/2013 Call Trace: <IRQ> dump_stack_lvl+0x32/0x50 print_report+0xca/0x620 kasan_report_invalid_free+0x64/0x90 __kasan_slab_free+0x1aa/0x1f0 kfree+0xed/0x2e0 inet_sock_destruct+0x54f/0x8b0 __sk_destruct+0x48/0x5b0 rcu_do_batch+0x34e/0xd90 rcu_core+0x559/0xac0 __do_softirq+0x183/0x5a4 irq_exit_rcu+0x12d/0x170 sysvec_apic_timer_interrupt+0x6b/0x80 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x16/0x20 RIP: 0010:cpuidle_enter_state+0x175/0x300 Code: 30 00 0f 84 1f 01 00 00 83 e8 01 83 f8 ff 75 e5 48 83 c4 18 44 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc fb 45 85 ed <0f> 89 60 ff ff ff 48 c1 e5 06 48 c7 43 18 00 00 00 00 48 83 44 2b RSP: 0018:ffff888481cf7d90 EFLAGS: 00000202 RAX: 0000000000000000 RBX: ffff88887facddc8 RCX: 0000000000000000 RDX: 1ffff1110ff588b1 RSI: 0000000000000019 RDI: ffff88887fac4588 RBP: 0000000000000004 R08: 0000000000000002 R09: 0000000000043080 R10: 0009b02ea273363f R11: ffff88887fabf42b R12: ffffffff932592e0 R13: 0000000000000004 R14: 0000000000000000 R15: 00000022c880ec80 cpuidle_enter+0x4a/0xa0 do_idle+0x310/0x410 cpu_startup_entry+0x51/0x60 start_secondary+0x211/0x270 secondary_startup_64_no_verify+0x184/0x18b </TASK> Allocated by task 6853: kasan_save_stack+0x1c/0x40 kasan_save_track+0x10/0x30 __kasan_kmalloc+0xa6/0xb0 __kmalloc+0x1eb/0x450 cipso_v4_sock_setattr+0x96/0x360 netlbl_sock_setattr+0x132/0x1f0 selinux_netlbl_socket_post_create+0x6c/0x110 selinux_socket_post_create+0x37b/0x7f0 security_socket_post_create+0x63/0xb0 __sock_create+0x305/0x450 __sys_socket_create.part.23+0xbd/0x130 __sys_socket+0x37/0xb0 __x64_sys_socket+0x6f/0xb0 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x6e/0x76 Freed by task 6858: kasan_save_stack+0x1c/0x40 kasan_save_track+0x10/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x12c/0x1f0 kfree+0xed/0x2e0 inet_sock_destruct+0x54f/0x8b0 __sk_destruct+0x48/0x5b0 subflow_ulp_release+0x1f0/0x250 tcp_cleanup_ulp+0x6e/0x110 tcp_v4_destroy_sock+0x5a/0x3a0 inet_csk_destroy_sock+0x135/0x390 tcp_fin+0x416/0x5c0 tcp_data_queue+0x1bc8/0x4310 tcp_rcv_state_process+0x15a3/0x47b0 tcp_v4_do_rcv+0x2c1/0x990 tcp_v4_rcv+0x41fb/0x5ed0 ip_protocol_deliver_rcu+0x6d/0x9f0 ip_local_deliver_finish+0x278/0x360 ip_local_deliver+0x182/0x2c0 ip_rcv+0xb5/0x1c0 __netif_receive_skb_one_core+0x16e/0x1b0 process_backlog+0x1e3/0x650 __napi_poll+0xa6/0x500 net_rx_action+0x740/0xbb0 __do_softirq+0x183/0x5a4 The buggy address belongs to the object at ffff888485950880 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes inside of 64-byte region [ffff888485950880, ffff8884859508c0) The buggy address belongs to the physical page: page:0000000056d1e95e refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888485950700 pfn:0x485950 flags: 0x57ffffc0000800(slab|node=1|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0057ffffc0000800 ffff88810004c640 ffffea00121b8ac0 dead000000000006 raw: ffff888485950700 0000000000200019 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888485950780: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888485950800: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888485950880: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ^ ffff888485950900: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888485950980: 00 00 00 00 00 01 fc fc fc fc fc fc fc fc fc fc Something similar (a refcount underflow) happens with CALIPSO/IPv6. Fix this by duplicating IP / IPv6 options after clone, so that ip{,6}_sock_destruct() doesn't end up freeing the same memory area twice. Fixes: cf7da0d66cc1 ("mptcp: Create SUBFLOW socket for incoming connections") Cc: stable@vger.kernel.org Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240223-upstream-net-20240223-misc-fixes-v1-8-162e87e48497@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-26mptcp: fix snd_wnd initialization for passive socketPaolo Abeni
Such value should be inherited from the first subflow, but passive sockets always used 'rsk_rcv_wnd'. Fixes: 6f8a612a33e4 ("mptcp: keep track of advertised windows right edge") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240223-upstream-net-20240223-misc-fixes-v1-5-162e87e48497@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-26mptcp: push at DSS boundariesPaolo Abeni
when inserting not contiguous data in the subflow write queue, the protocol creates a new skb and prevent the TCP stack from merging it later with already queued skbs by setting the EOR marker. Still no push flag is explicitly set at the end of previous GSO packet, making the aggregation on the receiver side sub-optimal - and packetdrill self-tests less predictable. Explicitly mark the end of not contiguous DSS with the push flag. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240223-upstream-net-20240223-misc-fixes-v1-4-162e87e48497@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-18mptcp: fix data races on local_idPaolo Abeni
The local address id is accessed lockless by the NL PM, add all the required ONCE annotation. There is a caveat: the local id can be initialized late in the subflow life-cycle, and its validity is controlled by the local_id_valid flag. Remove such flag and encode the validity in the local_id field itself with negative value before initialization. That allows accessing the field consistently with a single read operation. Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: corner case locking for rx path fields initializationPaolo Abeni
Most MPTCP-level related fields are under the mptcp data lock protection, but are written one-off without such lock at MPC complete time, both for the client and the server Leverage the mptcp_propagate_state() infrastructure to move such initialization under the proper lock client-wise. The server side critical init steps are done by mptcp_subflow_fully_established(): ensure the caller properly held the relevant lock, and avoid acquiring the same lock in the nested scopes. There are no real potential races, as write access to such fields is implicitly serialized by the MPTCP state machine; the primary goal is consistency. Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: fix more tx path fields initializationPaolo Abeni
The 'msk->write_seq' and 'msk->snd_nxt' are always updated under the msk socket lock, except at MPC handshake completiont time. Builds-up on the previous commit to move such init under the relevant lock. There are no known problems caused by the potential race, the primary goal is consistency. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: fix rcv space initializationPaolo Abeni
mptcp_rcv_space_init() is supposed to happen under the msk socket lock, but active msk socket does that without such protection. Leverage the existing mptcp_propagate_state() helper to that extent. We need to ensure mptcp_rcv_space_init will happen before mptcp_rcv_space_adjust(), and the release_cb does not assure that: explicitly check for such condition. While at it, move the wnd_end initialization out of mptcp_rcv_space_init(), it never belonged there. Note that the race does not produce ill effect in practice, but change allows cleaning-up and defying better the locking model. Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: drop the push_pending fieldPaolo Abeni
Such field is there to avoid acquiring the data lock in a few spots, but it adds complexity to the already non trivial locking schema. All the relevant call sites (mptcp-level re-injection, set socket options), are slow-path, drop such field in favor of 'cb_flags', adding the relevant locking. This patch could be seen as an improvement, instead of a fix. But it simplifies the next patch. The 'Fixes' tag has been added to help having this series backported to stable. Fixes: e9d09baca676 ("mptcp: avoid atomic bit manipulation when possible") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-01mptcp: fix data re-injection from stale subflowPaolo Abeni
When the MPTCP PM detects that a subflow is stale, all the packet scheduler must re-inject all the mptcp-level unacked data. To avoid acquiring unneeded locks, it first try to check if any unacked data is present at all in the RTX queue, but such check is currently broken, as it uses TCP-specific helper on an MPTCP socket. Funnily enough fuzzers and static checkers are happy, as the accessed memory still belongs to the mptcp_sock struct, and even from a functional perspective the recovery completed successfully, as the short-cut test always failed. A recent unrelated TCP change - commit d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables") - exposed the issue, as the tcp field reorganization makes the mptcp code always skip the re-inection. Fix the issue dropping the bogus call: we are on a slow path, the early optimization proved once again to be evil. Fixes: 1e1d9d6f119c ("mptcp: handle pending data on closed subflow") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/468 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240131-upstream-net-20240131-mptcp-ci-issues-v1-1-4c1c11e571ff@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-02mptcp: use mptcp_set_stateGeliang Tang
This patch replaces all the 'inet_sk_state_store()' calls under net/mptcp with the new helper mptcp_set_state(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/460 Signed-off-by: Geliang Tang <geliang.tang@linux.dev> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02mptcp: add CurrEstab MIB counter supportGeliang Tang
Add a new MIB counter named MPTCP_MIB_CURRESTAB to count current established MPTCP connections, similar to TCP_MIB_CURRESTAB. This is useful to quickly list the number of MPTCP connections without having to iterate over all of them. This patch adds a new helper function mptcp_set_state(): if the state switches from or to ESTABLISHED state, this newly added counter is incremented. This helper is going to be used in the following patch. Similar to MPTCP_INC_STATS(), a new helper called MPTCP_DEC_STATS() is also needed to decrement a MIB counter. Signed-off-by: Geliang Tang <geliang.tang@linux.dev> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-26mptcp: don't overwrite sock_ops in mptcp_is_tcpsk()Davide Caratti
Eric Dumazet suggests: > The fact that mptcp_is_tcpsk() was able to write over sock->ops was a > bit strange to me. > mptcp_is_tcpsk() should answer a question, with a read-only argument. re-factor code to avoid overwriting sock_ops inside that function. Also, change the helper name to reflect the semantics and to disambiguate from its dual, sk_is_mptcp(). While at it, collapse mptcp_stream_accept() and mptcp_accept() into a single function, where fallback / non-fallback are separated into a single sk_is_mptcp() conditional. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/432 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-17mptcp: fix inconsistent state on fastopen racePaolo Abeni
The netlink PM can race with fastopen self-connect attempts, shutting down the first subflow via: MPTCP_PM_CMD_DEL_ADDR -> mptcp_nl_remove_id_zero_address -> mptcp_pm_nl_rm_subflow_received -> mptcp_close_ssk and transitioning such subflow to FIN_WAIT1 status before the syn-ack packet is processed. The MPTCP code does not react to such state change, leaving the connection in not-fallback status and the subflow handshake uncompleted, triggering the following splat: WARNING: CPU: 0 PID: 10630 at net/mptcp/subflow.c:1405 subflow_data_ready+0x39f/0x690 net/mptcp/subflow.c:1405 Modules linked in: CPU: 0 PID: 10630 Comm: kworker/u4:11 Not tainted 6.6.0-syzkaller-14500-g1c41041124bd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 Workqueue: bat_events batadv_nc_worker RIP: 0010:subflow_data_ready+0x39f/0x690 net/mptcp/subflow.c:1405 Code: 18 89 ee e8 e3 d2 21 f7 40 84 ed 75 1f e8 a9 d7 21 f7 44 89 fe bf 07 00 00 00 e8 0c d3 21 f7 41 83 ff 07 74 07 e8 91 d7 21 f7 <0f> 0b e8 8a d7 21 f7 48 89 df e8 d2 b2 ff ff 31 ff 89 c5 89 c6 e8 RSP: 0018:ffffc90000007448 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888031efc700 RCX: ffffffff8a65baf4 RDX: ffff888043222140 RSI: ffffffff8a65baff RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000007 R10: 000000000000000b R11: 0000000000000000 R12: 1ffff92000000e89 R13: ffff88807a534d80 R14: ffff888021c11a00 R15: 000000000000000b FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa19a0ffc81 CR3: 000000007a2db000 CR4: 00000000003506f0 DR0: 000000000000d8dd DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <IRQ> tcp_data_ready+0x14c/0x5b0 net/ipv4/tcp_input.c:5128 tcp_data_queue+0x19c3/0x5190 net/ipv4/tcp_input.c:5208 tcp_rcv_state_process+0x11ef/0x4e10 net/ipv4/tcp_input.c:6844 tcp_v4_do_rcv+0x369/0xa10 net/ipv4/tcp_ipv4.c:1929 tcp_v4_rcv+0x3888/0x3b30 net/ipv4/tcp_ipv4.c:2329 ip_protocol_deliver_rcu+0x9f/0x480 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2e4/0x510 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip_local_deliver+0x1b6/0x550 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:461 [inline] ip_rcv_finish+0x1c4/0x2e0 net/ipv4/ip_input.c:449 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip_rcv+0xce/0x440 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core+0x115/0x180 net/core/dev.c:5527 __netif_receive_skb+0x1f/0x1b0 net/core/dev.c:5641 process_backlog+0x101/0x6b0 net/core/dev.c:5969 __napi_poll.constprop.0+0xb4/0x540 net/core/dev.c:6531 napi_poll net/core/dev.c:6600 [inline] net_rx_action+0x956/0xe90 net/core/dev.c:6733 __do_softirq+0x21a/0x968 kernel/softirq.c:553 do_softirq kernel/softirq.c:454 [inline] do_softirq+0xaa/0xe0 kernel/softirq.c:441 </IRQ> <TASK> __local_bh_enable_ip+0xf8/0x120 kernel/softirq.c:381 spin_unlock_bh include/linux/spinlock.h:396 [inline] batadv_nc_purge_paths+0x1ce/0x3c0 net/batman-adv/network-coding.c:471 batadv_nc_worker+0x9b1/0x10e0 net/batman-adv/network-coding.c:722 process_one_work+0x884/0x15c0 kernel/workqueue.c:2630 process_scheduled_works kernel/workqueue.c:2703 [inline] worker_thread+0x8b9/0x1290 kernel/workqueue.c:2784 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 </TASK> To address the issue, catch the racing subflow state change and use it to cause the MPTCP fallback. Such fallback is also used to cause the first subflow state propagation to the msk socket via mptcp_set_connected(). After this change, the first subflow can additionally propagate the TCP_FIN_WAIT1 state, so rename the helper accordingly. Finally, if the state propagation is delayed to the msk release callback, the first subflow can change to a different state in between. Cache the relevant target state in a new msk-level field and use such value to update the msk state at release time. Fixes: 1e777f39b4d7 ("mptcp: add MSG_FASTOPEN sendmsg flag support") Cc: stable@vger.kernel.org Reported-by: <syzbot+c53d4d3ddb327e80bc51@syzkaller.appspotmail.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/458 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-14mptcp: fix possible NULL pointer dereference on closePaolo Abeni
After the blamed commit below, the MPTCP release callback can dereference the first subflow pointer via __mptcp_set_connected() and send buffer auto-tuning. Such pointer is always expected to be valid, except at socket destruction time, when the first subflow is deleted and the pointer zeroed. If the connect event is handled by the release callback while the msk socket is finally released, MPTCP hits the following splat: general protection fault, probably for non-canonical address 0xdffffc00000000f2: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000790-0x0000000000000797] CPU: 1 PID: 26719 Comm: syz-executor.2 Not tainted 6.6.0-syzkaller-10102-gff269e2cd5ad #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 RIP: 0010:mptcp_subflow_ctx net/mptcp/protocol.h:542 [inline] RIP: 0010:__mptcp_propagate_sndbuf net/mptcp/protocol.h:813 [inline] RIP: 0010:__mptcp_set_connected+0x57/0x3e0 net/mptcp/subflow.c:424 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8a62323c RDX: 00000000000000f2 RSI: ffffffff8a630116 RDI: 0000000000000790 RBP: ffff88803334b100 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000034 R12: ffff88803334b198 R13: ffff888054f0b018 R14: 0000000000000000 R15: ffff88803334b100 FS: 0000000000000000(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbcb4f75198 CR3: 000000006afb5000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> mptcp_release_cb+0xa2c/0xc40 net/mptcp/protocol.c:3405 release_sock+0xba/0x1f0 net/core/sock.c:3537 mptcp_close+0x32/0xf0 net/mptcp/protocol.c:3084 inet_release+0x132/0x270 net/ipv4/af_inet.c:433 inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:485 __sock_release+0xae/0x260 net/socket.c:659 sock_close+0x1c/0x20 net/socket.c:1419 __fput+0x270/0xbb0 fs/file_table.c:394 task_work_run+0x14d/0x240 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0xa92/0x2a20 kernel/exit.c:876 do_group_exit+0xd4/0x2a0 kernel/exit.c:1026 get_signal+0x23ba/0x2790 kernel/signal.c:2900 arch_do_signal_or_restart+0x90/0x7f0 arch/x86/kernel/signal.c:309 exit_to_user_mode_loop kernel/entry/common.c:168 [inline] exit_to_user_mode_prepare+0x11f/0x240 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x60 kernel/entry/common.c:296 do_syscall_64+0x4b/0x110 arch/x86/entry/common.c:88 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7fb515e7cae9 Code: Unable to access opcode bytes at 0x7fb515e7cabf. RSP: 002b:00007fb516c560c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: 000000000000003c RBX: 00007fb515f9c120 RCX: 00007fb515e7cae9 RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000006 RBP: 00007fb515ec847a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007fb515f9c120 R15: 00007ffc631eb968 </TASK> To avoid sparkling unneeded conditionals, address the issue explicitly checking msk->first only in the critical place. Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning") Cc: stable@vger.kernel.org Reported-by: <syzbot+9dfbaedb6e6baca57a32@syzkaller.appspotmail.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/454 Reported-by: Eric Dumazet <edumazet@google.com> Closes: https://lore.kernel.org/netdev/CANn89iLZUA6S2a=K8GObnS62KK6Jt4B7PsAs7meMFooM8xaTgw@mail.gmail.com/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Link: https://lore.kernel.org/r/20231114-upstream-net-20231113-mptcp-misc-fixes-6-7-rc2-v1-2-7b9cd6a7b7f4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-14mptcp: deal with large GSO sizePaolo Abeni
After the blamed commit below, the TCP sockets (and the MPTCP subflows) can build egress packets larger than 64K. That exceeds the maximum DSS data size, the length being misrepresent on the wire and the stream being corrupted, as later observed on the receiver: WARNING: CPU: 0 PID: 9696 at net/mptcp/protocol.c:705 __mptcp_move_skbs_from_subflow+0x2604/0x26e0 CPU: 0 PID: 9696 Comm: syz-executor.7 Not tainted 6.6.0-rc5-gcd8bdf563d46 #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 netlink: 8 bytes leftover after parsing attributes in process `syz-executor.4'. RIP: 0010:__mptcp_move_skbs_from_subflow+0x2604/0x26e0 net/mptcp/protocol.c:705 RSP: 0018:ffffc90000006e80 EFLAGS: 00010246 RAX: ffffffff83e9f674 RBX: ffff88802f45d870 RCX: ffff888102ad0000 netlink: 8 bytes leftover after parsing attributes in process `syz-executor.4'. RDX: 0000000080000303 RSI: 0000000000013908 RDI: 0000000000003908 RBP: ffffc90000007110 R08: ffffffff83e9e078 R09: 1ffff1100e548c8a R10: dffffc0000000000 R11: ffffed100e548c8b R12: 0000000000013908 R13: dffffc0000000000 R14: 0000000000003908 R15: 000000000031cf29 FS: 00007f239c47e700(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f239c45cd78 CR3: 000000006a66c006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Call Trace: <IRQ> mptcp_data_ready+0x263/0xac0 net/mptcp/protocol.c:819 subflow_data_ready+0x268/0x6d0 net/mptcp/subflow.c:1409 tcp_data_queue+0x21a1/0x7a60 net/ipv4/tcp_input.c:5151 tcp_rcv_established+0x950/0x1d90 net/ipv4/tcp_input.c:6098 tcp_v6_do_rcv+0x554/0x12f0 net/ipv6/tcp_ipv6.c:1483 tcp_v6_rcv+0x2e26/0x3810 net/ipv6/tcp_ipv6.c:1749 ip6_protocol_deliver_rcu+0xd6b/0x1ae0 net/ipv6/ip6_input.c:438 ip6_input+0x1c5/0x470 net/ipv6/ip6_input.c:483 ipv6_rcv+0xef/0x2c0 include/linux/netfilter.h:304 __netif_receive_skb+0x1ea/0x6a0 net/core/dev.c:5532 process_backlog+0x353/0x660 net/core/dev.c:5974 __napi_poll+0xc6/0x5a0 net/core/dev.c:6536 net_rx_action+0x6a0/0xfd0 net/core/dev.c:6603 __do_softirq+0x184/0x524 kernel/softirq.c:553 do_softirq+0xdd/0x130 kernel/softirq.c:454 Address the issue explicitly bounding the maximum GSO size to what MPTCP actually allows. Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/450 Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Link: https://lore.kernel.org/r/20231114-upstream-net-20231113-mptcp-misc-fixes-6-7-rc2-v1-1-7b9cd6a7b7f4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27mptcp: use mptcp_get_ext helperGeliang Tang
Use mptcp_get_ext() helper defined in protocol.h instead of open-coding it in mptcp_sendmsg_frag(). Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-6-db8f25f798eb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27mptcp: drop useless ssk in pm_subflow_check_nextGeliang Tang
The code using 'ssk' parameter of mptcp_pm_subflow_check_next() has been dropped in commit "95d686517884 (mptcp: fix subflow accounting on close)". So drop this useless parameter ssk. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-4-db8f25f798eb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: refactor sndbuf auto-tuningPaolo Abeni
The MPTCP protocol account for the data enqueued on all the subflows to the main socket send buffer, while the send buffer auto-tuning algorithm set the main socket send buffer size as the max size among the subflows. That causes bad performances when at least one subflow is sndbuf limited, e.g. due to very high latency, as the MPTCP scheduler can't even fill such buffer. Change the send-buffer auto-tuning algorithm to compute the main socket send buffer size as the sum of all the subflows buffer size. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-9-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: consolidate sockopt synchronizationPaolo Abeni
Move the socket option synchronization for active subflows at subflow creation time. This allows removing the now unused unlocked variant of such helper. While at that, clean-up a bit the mptcp_subflow_create_socket() errors path. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-7-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: use copy_from_iter helpers on transmitPaolo Abeni
The perf traces show an high cost for the MPTCP transmit path memcpy. It turn out that the helper currently in use carries quite a bit of unneeded overhead, e.g. to map/unmap the memory pages. Moving to the 'copy_from_iter' variant removes such overhead and additionally gains the no-cache support. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-6-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: give rcvlowat some lovePaolo Abeni
The MPTCP protocol allow setting sk_rcvlowat, but the value there is currently ignored. Additionally, the default subflows sk_rcvlowat basically disables per subflow delayed ack: the MPTCP protocol move the incoming data from the subflows into the msk socket as soon as the TCP stacks invokes the subflow data_ready callback. Later, when __tcp_ack_snd_check() takes action, the subflow-level copied_seq matches rcv_nxt, and that mandate for an immediate ack. Let the mptcp receive path be aware of such threshold, explicitly tracking the amount of data available to be ready and checking vs sk_rcvlowat in mptcp_poll() and before waking-up readers. Additionally implement the set_rcvlowat() callback, to properly handle the rcvbuf auto-tuning on sk_rcvlowat changes. Finally to properly handle delayed ack, force the subflow level threshold to 0 and instead explicitly ask for an immediate ack when the msk level th is not reached. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-5-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25mptcp: add a new sysctl for make after break timeoutPaolo Abeni
The MPTCP protocol allows sockets with no alive subflows to stay in ESTABLISHED status for and user-defined timeout, to allow for later subflows creation. Currently such timeout is constant - TCP_TIMEWAIT_LEN. Let the user-space configure them via a newly added sysctl, to better cope with busy servers and simplify (make them faster) the relevant pktdrill tests. Note that the new know does not apply to orphaned MPTCP socket waiting for the data_fin handshake completion: they always wait TCP_TIMEWAIT_LEN. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-1-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19mptcp: avoid sending RST when closing the initial subflowGeliang Tang
When closing the first subflow, the MPTCP protocol unconditionally calls tcp_disconnect(), which in turn generates a reset if the subflow is established. That is unexpected and different from what MPTCP does with MPJ subflows, where resets are generated only on FASTCLOSE and other edge scenarios. We can't reuse for the first subflow the same code in place for MPJ subflows, as MPTCP clean them up completely via a tcp_close() call, while must keep the first subflow socket alive for later re-usage, due to implementation constraints. This patch adds a new helper __mptcp_subflow_disconnect() that encapsulates, a logic similar to tcp_close, issuing a reset only when the MPTCP_CF_FASTCLOSE flag is set, and performing a clean shutdown otherwise. Fixes: c2b2ae3925b6 ("mptcp: handle correctly disconnect() failures") Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-4-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19mptcp: more conservative check for zero probesPaolo Abeni
Christoph reported that the MPTCP protocol can find the subflow-level write queue unexpectedly not empty while crafting a zero-window probe, hitting a warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 188 at net/mptcp/protocol.c:1312 mptcp_sendmsg_frag+0xc06/0xe70 Modules linked in: CPU: 0 PID: 188 Comm: kworker/0:2 Not tainted 6.6.0-rc2-g1176aa719d7a #47 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:mptcp_sendmsg_frag+0xc06/0xe70 net/mptcp/protocol.c:1312 RAX: 47d0530de347ff6a RBX: 47d0530de347ff6b RCX: ffff8881015d3c00 RDX: ffff8881015d3c00 RSI: 47d0530de347ff6b RDI: 47d0530de347ff6b RBP: 47d0530de347ff6b R08: ffffffff8243c6a8 R09: ffffffff82042d9c R10: 0000000000000002 R11: ffffffff82056850 R12: ffff88812a13d580 R13: 0000000000000001 R14: ffff88812b375e50 R15: ffff88812bbf3200 FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000695118 CR3: 0000000115dfc001 CR4: 0000000000170ef0 Call Trace: <TASK> __subflow_push_pending+0xa4/0x420 net/mptcp/protocol.c:1545 __mptcp_push_pending+0x128/0x3b0 net/mptcp/protocol.c:1614 mptcp_release_cb+0x218/0x5b0 net/mptcp/protocol.c:3391 release_sock+0xf6/0x100 net/core/sock.c:3521 mptcp_worker+0x6e8/0x8f0 net/mptcp/protocol.c:2746 process_scheduled_works+0x341/0x690 kernel/workqueue.c:2630 worker_thread+0x3a7/0x610 kernel/workqueue.c:2784 kthread+0x143/0x180 kernel/kthread.c:388 ret_from_fork+0x4d/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:304 </TASK> The root cause of the issue is that expectations are wrong: e.g. due to MPTCP-level re-injection we can hit the critical condition. Explicitly avoid the zero-window probe when the subflow write queue is not empty and drop the related warnings. Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/444 Fixes: f70cad1085d1 ("mptcp: stop relying on tcp_tx_skb_cache") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-3-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13tcp: allow again tcp_disconnect() when threads are waitingPaolo Abeni
As reported by Tom, .NET and applications build on top of it rely on connect(AF_UNSPEC) to async cancel pending I/O operations on TCP socket. The blamed commit below caused a regression, as such cancellation can now fail. As suggested by Eric, this change addresses the problem explicitly causing blocking I/O operation to terminate immediately (with an error) when a concurrent disconnect() is executed. Instead of tracking the number of threads blocked on a given socket, track the number of disconnect() issued on such socket. If such counter changes after a blocking operation releasing and re-acquiring the socket lock, error out the current operation. Fixes: 4faeee0cf8a5 ("tcp: deny tcp_disconnect() when threads are waiting") Reported-by: Tom Deseyn <tdeseyn@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05mptcp: fix delegated action racesPaolo Abeni
The delegated action infrastructure is prone to the following race: different CPUs can try to schedule different delegated actions on the same subflow at the same time. Each of them will check different bits via mptcp_subflow_delegate(), and will try to schedule the action on the related per-cpu napi instance. Depending on the timing, both can observe an empty delegated list node, causing the same entry to be added simultaneously on two different lists. The root cause is that the delegated actions infra does not provide a single synchronization point. Address the issue reserving an additional bit to mark the subflow as scheduled for delegation. Acquiring such bit guarantee the caller to own the delegated list node, and being able to safely schedule the subflow. Clear such bit only when the subflow scheduling is completed, ensuring proper barrier in place. Additionally swap the meaning of the delegated_action bitmask, to allow the usage of the existing helper to set multiple bit at once. Fixes: bcd97734318d ("mptcp: use delegate action to schedule 3rd ack retrans") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-09-18mptcp: fix dangling connection hang-upPaolo Abeni
According to RFC 8684 section 3.3: A connection is not closed unless [...] or an implementation-specific connection-level send timeout. Currently the MPTCP protocol does not implement such timeout, and connection timing-out at the TCP-level never move to close state. Introduces a catch-up condition at subflow close time to move the MPTCP socket to close, too. That additionally allows removing similar existing inside the worker. Finally, allow some additional timeout for plain ESTABLISHED mptcp sockets, as the protocol allows creating new subflows even at that point and making the connection functional again. This issue is actually present since the beginning, but it is basically impossible to solve without a long chain of functional pre-requisites topped by commit bbd49d114d57 ("mptcp: consolidate transition to TCP_CLOSE in mptcp_do_fastclose()"). When backporting this current patch, please also backport this other commit as well. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/430 Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-18mptcp: rename timer related helper to less confusing namesPaolo Abeni
The msk socket uses to different timeout to track close related events and retransmissions. The existing helpers do not indicate clearly which timer they actually touch, making the related code quite confusing. Change the existing helpers name to avoid such confusion. No functional change intended. This patch is linked to the next one ("mptcp: fix dangling connection hang-up"). The two patches are supposed to be backported together. Cc: stable@vger.kernel.org # v5.11+ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-18mptcp: process pending subflow error on closePaolo Abeni
On incoming TCP reset, subflow closing could happen before error propagation. That in turn could cause the socket error being ignored, and a missing socket state transition, as reported by Daire-Byrne. Address the issues explicitly checking for subflow socket error at close time. To avoid code duplication, factor-out of __mptcp_error_report() a new helper implementing the relevant bits. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/429 Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-18mptcp: move __mptcp_error_report in protocol.cPaolo Abeni
This will simplify the next patch ("mptcp: process pending subflow error on close"). No functional change intended. Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-01mptcp: annotate data-races around msk->rmem_fwd_allocEric Dumazet
msk->rmem_fwd_alloc can be read locklessly. Add mptcp_rmem_fwd_alloc_add(), similar to sk_forward_alloc_add(), and appropriate READ_ONCE()/WRITE_ONCE() annotations. Fixes: 6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-01net: annotate data-races around sk->sk_forward_allocEric Dumazet
Every time sk->sk_forward_alloc is read locklessly, add a READ_ONCE(). Add sk_forward_alloc_add() helper to centralize updates, to reduce number of WRITE_ONCE(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-22mptcp: register default schedulerGeliang Tang
This patch defines the default packet scheduler mptcp_sched_default. Register it in mptcp_sched_init(), which is invoked in mptcp_proto_init(). Skip deleting this default scheduler in mptcp_unregister_scheduler(). Set msk->sched to the default scheduler when the input parameter of mptcp_init_sched() is NULL. Invoke mptcp_sched_default_get_subflow in get_send() and get_retrans() if the defaut scheduler is set or msk->sched is NULL. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-10-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: use get_retrans wrapperGeliang Tang
This patch adds the multiple subflows support for __mptcp_retrans(). Use get_retrans() wrapper instead of mptcp_subflow_get_retrans() in it. Check the subflow scheduled flags to test which subflow or subflows are picked by the scheduler, use them to send data. Move msk_owned_by_me() and fallback checks into get_retrans() wrapper from mptcp_subflow_get_retrans(). Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-9-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: use get_send wrapperGeliang Tang
This patch adds the multiple subflows support for __mptcp_push_pending and __mptcp_subflow_push_pending. Use get_send() wrapper instead of mptcp_subflow_get_send() in them. Check the subflow scheduled flags to test which subflow or subflows are picked by the scheduler, use them to send data. Move msk_owned_by_me() and fallback checks into get_send() wrapper from mptcp_subflow_get_send(). This commit allows the scheduler to set the subflow->scheduled bit in multiple subflows, but it does not allow for sending redundant data. Multiple scheduled subflows will send sequential data on each subflow. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-8-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add scheduler wrappersGeliang Tang
This patch defines two packet scheduler wrappers mptcp_sched_get_send() and mptcp_sched_get_retrans(), invoke get_subflow() of msk->sched in them. Set data->reinject to true in mptcp_sched_get_retrans(), set it false in mptcp_sched_get_send(). If msk->sched is NULL, use default functions mptcp_subflow_get_send() and mptcp_subflow_get_retrans() to send data. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-7-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add sched in mptcp_sockGeliang Tang
This patch adds a new struct member sched in struct mptcp_sock. And two helpers mptcp_init_sched() and mptcp_release_sched() to init and release it. Init it with the sysctl scheduler in mptcp_init_sock(), copy the scheduler from the parent in mptcp_sk_clone(), and release it in __mptcp_destroy_sock(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-5-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: drop last_snd and MPTCP_RESET_SCHEDULERGeliang Tang
Since the burst check conditions have moved out of the function mptcp_subflow_get_send(), it makes all msk->last_snd useless. This patch drops them as well as the macro MPTCP_RESET_SCHEDULER. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-2-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: refactor push_pending logicGeliang Tang
To support redundant package schedulers more easily, this patch refactors __mptcp_push_pending() logic from: For each dfrag: While sends succeed: Call the scheduler (selects subflow and msk->snd_burst) Update subflow locks (push/release/acquire as needed) Send the dfrag data with mptcp_sendmsg_frag() Update already_sent, snd_nxt, snd_burst Update msk->first_pending Push/release on final subflow -> While first_pending isn't empty: Call the scheduler (selects subflow and msk->snd_burst) Update subflow locks (push/release/acquire as needed) For each pending dfrag: While sends succeed: Send the dfrag data with mptcp_sendmsg_frag() Update already_sent, snd_nxt, snd_burst Update msk->first_pending Break if required by msk->snd_burst / etc Push/release on final subflow Refactors __mptcp_subflow_push_pending logic from: For each dfrag: While sends succeed: Call the scheduler (selects subflow and msk->snd_burst) Send the dfrag data with mptcp_subflow_delegate(), break Send the dfrag data with mptcp_sendmsg_frag() Update dfrag->already_sent, msk->snd_nxt, msk->snd_burst Update msk->first_pending -> While first_pending isn't empty: Call the scheduler (selects subflow and msk->snd_burst) Send the dfrag data with mptcp_subflow_delegate(), break Send the dfrag data with mptcp_sendmsg_frag() For each pending dfrag: While sends succeed: Send the dfrag data with mptcp_sendmsg_frag() Update already_sent, snd_nxt, snd_burst Update msk->first_pending Break if required by msk->snd_burst / etc Move the duplicate code from __mptcp_push_pending() and __mptcp_subflow_push_pending() into a new helper function, named __subflow_push_pending(). Simplify __mptcp_push_pending() and __mptcp_subflow_push_pending() by invoking this helper. Also move the burst check conditions out of the function mptcp_subflow_get_send(), check them in __subflow_push_pending() in the inner "for each pending dfrag" loop. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-1-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16inet: move inet->defer_connect to inet->inet_flagsEric Dumazet
Make room in struct inet_sock by removing this bit field, using one available bit in inet_flags instead. Also move local_port_range to fill the resulting hole, saving 8 bytes on 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: Remove unnecessary test for __mptcp_init_sock()Kuniyuki Iwashima
__mptcp_init_sock() always returns 0 because mptcp_init_sock() used to return the value directly. But after commit 18b683bff89d ("mptcp: queue data for mptcp level retransmission"), __mptcp_init_sock() need not return value anymore. Let's remove the unnecessary test for __mptcp_init_sock() and make it return void. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: get rid of msk->subflowPaolo Abeni
Such field is now unused just as a flag to control the first subflow deletion at close() time. Introduce a new bit flag for that and finally drop the mentioned field. As an intended side effect, now the first subflow sock is not freed before close() even for passive sockets. The msk has no open/active subflows if the first one is closed and the subflow list is singular, update accordingly the state check in mptcp_stream_accept(). Among other benefits, the subflow removal, reduces the amount of memory used on the client side for each mptcp connection, allows passive sockets to go through successful accept()/disconnect()/connect() and makes return error code consistent for failing both passive and active sockets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: change the mpc check helper to return a skPaolo Abeni
After the previous patch the __mptcp_nmpc_socket helper is used only to ensure that the MPTCP socket is a suitable status - that is, the mptcp capable handshake is not started yet. Change the return value to the relevant subflow sock, to finally remove the last references to first subflow socket in the MPTCP stack. As a bonus, we can get rid of a few local variables in different functions. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid unneeded indirection in mptcp_stream_accept()Paolo Abeni
We are going to remove the first subflow socket soon, so avoid the additional indirection at accept() time. Instead access directly the first subflow sock, and update mptcp_accept() to operate on it. This allows dropping a duplicated check in mptcp_accept(). No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional indirection in mptcp_poll()Paolo Abeni
We are going to remove the first subflow socket soon, so avoid the additional indirection at poll() time. Instead access directly the first subflow sock. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional indirection in mptcp_listen()Paolo Abeni
We are going to remove the first subflow socket soon, so avoid the additional indirection via at listen() time. Instead call directly the recently introduced helper on the first subflow sock. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: mptcp: avoid additional indirection in mptcp_bind()Paolo Abeni
We are going to remove the first subflow socket soon, so avoid the additional indirection via at bind() time. Instead call directly the recently introduced helpers on the first subflow sock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid subflow socket usage in mptcp_get_port()Paolo Abeni
We are going to remove the first subflow socket soon, so avoid accessing it in mptcp_get_port(). Instead, access directly the first subflow sock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional __inet_stream_connect() callPaolo Abeni
The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is __inet_stream_connect(). We are going to remove the first subflow socket soon, so avoid the additional indirection via at connect time, calling directly into the sock-level connect() ops. The sk-level connect never return -EINPROGRESS, cleanup the error path accordingly. Additionally, the ssk status on error is always TCP_CLOSE. Avoid unneeded access to the subflow sk state. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid unneeded mptcp_token_destroy() callsPaolo Abeni
The MPTCP protocol currently clears the msk token both at connect() and listen() time. That is needed to deal with failing connect() calls that can create a new token while leaving the sk in TCP_CLOSE,SS_UNCONNECTED status and thus allowing later connect() and/or listen() calls. Let's deal with such failures explicitly, cleaning the token in a timely manner and avoid the confusing early mptcp_token_destroy(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>