aboutsummaryrefslogtreecommitdiffstats
path: root/net/core
AgeCommit message (Collapse)Author
2019-10-10Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-10-07net: Unpublish sk from sk_reuseport_cb before call_rcuMartin KaFai Lau
[ Upstream commit 8c7138b33e5c690c308b2a7085f6313fdcb3f616 ] The "reuse->sock[]" array is shared by multiple sockets. The going away sk must unpublish itself from "reuse->sock[]" before making call_rcu() call. However, this unpublish-action is currently done after a grace period and it may cause use-after-free. The fix is to move reuseport_detach_sock() to sk_destruct(). Due to the above reason, any socket with sk_reuseport_cb has to go through the rcu grace period before freeing it. It is a rather old bug (~3 yrs). The Fixes tag is not necessary the right commit but it is the one that introduced the SOCK_RCU_FREE logic and this fix is depending on it. Fixes: a4298e4522d6 ("net: add SOCK_RCU_FREE socket flag") Cc: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-24Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-09-21bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0Ilya Leoshkevich
[ Upstream commit 2c238177bd7f4b14bdf7447cc1cd9bb791f147e6 ] test_select_reuseport fails on s390 due to verifier rejecting test_select_reuseport_kern.o with the following message: ; data_check.eth_protocol = reuse_md->eth_protocol; 18: (69) r1 = *(u16 *)(r6 +22) invalid bpf_context access off=22 size=2 This is because on big-endian machines casts from __u32 to __u16 are generated by referencing the respective variable as __u16 with an offset of 2 (as opposed to 0 on little-endian machines). The verifier already has all the infrastructure in place to allow such accesses, it's just that they are not explicitly enabled for eth_protocol field. Enable them for eth_protocol field by using bpf_ctx_range instead of offsetof. Ditto for ip_protocol, bind_inany and len, since they already allow narrowing, and the same problem can arise when working with them. Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21flow_dissector: Fix potential use-after-free on BPF_PROG_DETACHJakub Sitnicki
[ Upstream commit db38de39684dda2bf307f41797db2831deba64e9 ] Call to bpf_prog_put(), with help of call_rcu(), queues an RCU-callback to free the program once a grace period has elapsed. The callback can run together with new RCU readers that started after the last grace period. New RCU readers can potentially see the "old" to-be-freed or already-freed pointer to the program object before the RCU update-side NULLs it. Reorder the operations so that the RCU update-side resets the protected pointer before the end of the grace period after which the program will be freed. Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21udp: correct reuseport selection with connected socketsWillem de Bruijn
[ Upstream commit acdcecc61285faed359f1a3568c32089cc3a8329 ] UDP reuseport groups can hold a mix unconnected and connected sockets. Ensure that connections only receive all traffic to their 4-tuple. Fast reuseport returns on the first reuseport match on the assumption that all matches are equal. Only if connections are present, return to the previous behavior of scoring all sockets. Record if connections are present and if so (1) treat such connected sockets as an independent match from the group, (2) only return 2-tuple matches from reuseport and (3) do not return on the first 2-tuple reuseport match to allow for a higher scoring match later. New field has_conns is set without locks. No other fields in the bitmap are modified at runtime and the field is only ever set unconditionally, so an RMW cannot miss a change. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-21net/sched: fix race between deactivation and dequeue for NOLOCK qdiscPaolo Abeni
[ Upstream commit d518d2ed8640c1cbbbb6f63939e3e65471817367 ] The test implemented by some_qdisc_is_busy() is somewhat loosy for NOLOCK qdisc, as we may hit the following scenario: CPU1 CPU2 // in net_tx_action() clear_bit(__QDISC_STATE_SCHED...); // in some_qdisc_is_busy() val = (qdisc_is_running(q) || test_bit(__QDISC_STATE_SCHED, &q->state)); // here val is 0 but... qdisc_run(q) // ... CPU1 is going to run the qdisc next As a conseguence qdisc_run() in net_tx_action() can race with qdisc_reset() in dev_qdisc_reset(). Such race is not possible for !NOLOCK qdisc as both the above bit operations are under the root qdisc lock(). After commit 021a17ed796b ("pfifo_fast: drop unneeded additional lock on dequeue") the race can cause use after free and/or null ptr dereference, but the root cause is likely older. This patch addresses the issue explicitly checking for deactivation under the seqlock for NOLOCK qdisc, so that the qdisc_run() in the critical scenario becomes a no-op. Note that the enqueue() op can still execute concurrently with dev_qdisc_reset(), but that is safe due to the skb_array() locking, and we can't avoid that for NOLOCK qdiscs. Fixes: 021a17ed796b ("pfifo_fast: drop unneeded additional lock on dequeue") Reported-by: Li Shuang <shuali@redhat.com> Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-19Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-09-19net: sock_map, fix missing ulp check in sock hash caseJohn Fastabend
[ Upstream commit 44580a0118d3ede95fec4dce32df5f75f73cd663 ] sock_map and ULP only work together when ULP is loaded after the sock map is loaded. In the sock_map case we added a check for this to fail the load if ULP is already set. However, we missed the check on the sock_hash side. Add a ULP check to the sock_hash update path. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: syzbot+7a6ee4d0078eac6bf782@syzkaller.appspotmail.com Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-19net: gso: Fix skb_segment splat when splitting gso_size mangled skb having ↵Shmulik Ladkani
linear-headed frag_list [ Upstream commit 3dcbdb134f329842a38f0e6797191b885ab00a00 ] Historically, support for frag_list packets entering skb_segment() was limited to frag_list members terminating on exact same gso_size boundaries. This is verified with a BUG_ON since commit 89319d3801d1 ("net: Add frag_list support to skb_segment"), quote: As such we require all frag_list members terminate on exact MSS boundaries. This is checked using BUG_ON. As there should only be one producer in the kernel of such packets, namely GRO, this requirement should not be difficult to maintain. However, since commit 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper"), the "exact MSS boundaries" assumption no longer holds: An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but leaves the frag_list members as originally merged by GRO with the original 'gso_size'. Example of such programs are bpf-based NAT46 or NAT64. This lead to a kernel BUG_ON for flows involving: - GRO generating a frag_list skb - bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room() - skb_segment() of the skb See example BUG_ON reports in [0]. In commit 13acc94eff12 ("net: permit skb_segment on head_frag frag_list skb"), skb_segment() was modified to support the "gso_size mangling" case of a frag_list GRO'ed skb, but *only* for frag_list members having head_frag==true (having a page-fragment head). Alas, GRO packets having frag_list members with a linear kmalloced head (head_frag==false) still hit the BUG_ON. This commit adds support to skb_segment() for a 'head_skb' packet having a frag_list whose members are *non* head_frag, with gso_size mangled, by disabling SG and thus falling-back to copying the data from the given 'head_skb' into the generated segmented skbs - as suggested by Willem de Bruijn [1]. Since this approach involves the penalty of skb_copy_and_csum_bits() when building the segments, care was taken in order to enable this solution only when required: - untrusted gso_size, by testing SKB_GSO_DODGY is set (SKB_GSO_DODGY is set by any gso_size mangling functions in net/core/filter.c) - the frag_list is non empty, its item is a non head_frag, *and* the headlen of the given 'head_skb' does not match the gso_size. [0] https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/ https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/ [1] https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/ Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-19net: Fix null de-reference of device refcountSubash Abhinov Kasiviswanathan
[ Upstream commit 10cc514f451a0f239aa34f91bc9dc954a9397840 ] In event of failure during register_netdevice, free_netdev is invoked immediately. free_netdev assumes that all the netdevice refcounts have been dropped prior to it being called and as a result frees and clears out the refcount pointer. However, this is not necessarily true as some of the operations in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for invocation after a grace period. The IPv4 callback in_dev_rcu_put tries to access the refcount after free_netdev is called which leads to a null de-reference- 44837.761523: <6> Unable to handle kernel paging request at virtual address 0000004a88287000 44837.761651: <2> pc : in_dev_finish_destroy+0x4c/0xc8 44837.761654: <2> lr : in_dev_finish_destroy+0x2c/0xc8 44837.762393: <2> Call trace: 44837.762398: <2> in_dev_finish_destroy+0x4c/0xc8 44837.762404: <2> in_dev_rcu_put+0x24/0x30 44837.762412: <2> rcu_nocb_kthread+0x43c/0x468 44837.762418: <2> kthread+0x118/0x128 44837.762424: <2> ret_from_fork+0x10/0x1c Fix this by waiting for the completion of the call_rcu() in case of register_netdevice errors. Fixes: 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.") Cc: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-15Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-09-10net: fix skb use after free in netpollFeng Sun
[ Upstream commit 2c1644cf6d46a8267d79ed95cb9b563839346562 ] After commit baeababb5b85d5c4e6c917efe2a1504179438d3b ("tun: return NET_XMIT_DROP for dropped packets"), when tun_net_xmit drop packets, it will free skb and return NET_XMIT_DROP, netpoll_send_skb_on_dev will run into following use after free cases: 1. retry netpoll_start_xmit with freed skb; 2. queue freed skb in npinfo->txq. queue_process will also run into use after free case. hit netpoll_send_skb_on_dev first case with following kernel log: [ 117.864773] kernel BUG at mm/slub.c:306! [ 117.864773] invalid opcode: 0000 [#1] SMP PTI [ 117.864774] CPU: 3 PID: 2627 Comm: loop_printmsg Kdump: loaded Tainted: P OE 5.3.0-050300rc5-generic #201908182231 [ 117.864775] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 117.864775] RIP: 0010:kmem_cache_free+0x28d/0x2b0 [ 117.864781] Call Trace: [ 117.864781] ? tun_net_xmit+0x21c/0x460 [ 117.864781] kfree_skbmem+0x4e/0x60 [ 117.864782] kfree_skb+0x3a/0xa0 [ 117.864782] tun_net_xmit+0x21c/0x460 [ 117.864782] netpoll_start_xmit+0x11d/0x1b0 [ 117.864788] netpoll_send_skb_on_dev+0x1b8/0x200 [ 117.864789] __br_forward+0x1b9/0x1e0 [bridge] [ 117.864789] ? skb_clone+0x53/0xd0 [ 117.864790] ? __skb_clone+0x2e/0x120 [ 117.864790] deliver_clone+0x37/0x50 [bridge] [ 117.864790] maybe_deliver+0x89/0xc0 [bridge] [ 117.864791] br_flood+0x6c/0x130 [bridge] [ 117.864791] br_dev_xmit+0x315/0x3c0 [bridge] [ 117.864792] netpoll_start_xmit+0x11d/0x1b0 [ 117.864792] netpoll_send_skb_on_dev+0x1b8/0x200 [ 117.864792] netpoll_send_udp+0x2c6/0x3e8 [ 117.864793] write_msg+0xd9/0xf0 [netconsole] [ 117.864793] console_unlock+0x386/0x4e0 [ 117.864793] vprintk_emit+0x17e/0x280 [ 117.864794] vprintk_default+0x29/0x50 [ 117.864794] vprintk_func+0x4c/0xbc [ 117.864794] printk+0x58/0x6f [ 117.864795] loop_fun+0x24/0x41 [printmsg_loop] [ 117.864795] kthread+0x104/0x140 [ 117.864795] ? 0xffffffffc05b1000 [ 117.864796] ? kthread_park+0x80/0x80 [ 117.864796] ret_from_fork+0x35/0x40 Signed-off-by: Feng Sun <loyou85@gmail.com> Signed-off-by: Xiaojun Zhao <xiaojunzhao141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-09Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-09-09Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-09-06tcp: make sure EPOLLOUT wont be missedEric Dumazet
[ Upstream commit ef8d8ccdc216f797e66cb4a1372f5c4c285ce1e4 ] As Jason Baron explained in commit 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure"), it is crucial we properly set SOCK_NOSPACE when needed. However, Jason patch had a bug, because the 'nonblocking' status as far as sk_stream_wait_memory() is concerned is governed by MSG_DONTWAIT flag passed at sendmsg() time : long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(), and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE cleared, if sk->sk_sndtimeo has been set to a small (but not zero) value. This patch removes the 'noblock' variable since we must always set SOCK_NOSPACE if -EAGAIN is returned. It also renames the do_nonblock label since we might reach this code path even if we were in blocking mode. Fixes: 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Reported-by: Vladimir Rutsky <rutsky@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29bpf: sockmap, only create entry if ulp is not already enabledJohn Fastabend
[ Upstream commit 0e858739c2d2eedeeac1d35bfa0ec3cc2a7190d8 ] Sockmap does not currently support adding sockets after TLS has been enabled. There never was a real use case for this so it was never added. But, we lost the test for ULP at some point so add it here and fail the socket insert if TLS is enabled. Future work could make sockmap support this use case but fixup the bug here. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29bpf: sockmap, synchronize_rcu before free'ing mapJohn Fastabend
[ Upstream commit 2bb90e5cc90e1d09f631aeab041a9cf913a5bbe5 ] We need to have a synchronize_rcu before free'ing the sockmap because any outstanding psock references will have a pointer to the map and when they use this could trigger a use after free. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29bpf: sockmap, sock_map_delete needs to use xchgJohn Fastabend
[ Upstream commit 45a4521dcbd92e71c9e53031b40e34211d3b4feb ] __sock_map_delete() may be called from a tcp event such as unhash or close from the following trace, tcp_bpf_close() tcp_bpf_remove() sk_psock_unlink() sock_map_delete_from_link() __sock_map_delete() In this case the sock lock is held but this only protects against duplicate removals on the TCP side. If the map is free'd then we have this trace, sock_map_free xchg() <- replaces map entry sock_map_unref() sk_psock_put() sock_map_del_link() The __sock_map_delete() call however uses a read, test, null over the map entry which can result in both paths trying to free the map entry. To fix use xchg in TCP paths as well so we avoid having two references to the same map entry. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-27Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/intel-x86Bruce Ashfield
2019-08-25net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski
[ Upstream commit 414776621d1006e57e80e6db7fdc3837897aaa64 ] sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-25bpf: fix access to skb_shared_info->gso_segsEric Dumazet
commit 06a22d897d82f12776d44dbf0850f5895469cb2a upstream. It is possible we reach bpf_convert_ctx_access() with si->dst_reg == si->src_reg Therefore, we need to load BPF_REG_AX before eventually mangling si->src_reg. syzbot generated this x86 code : 3: 55 push %rbp 4: 48 89 e5 mov %rsp,%rbp 7: 48 81 ec 00 00 00 00 sub $0x0,%rsp // Might be avoided ? e: 53 push %rbx f: 41 55 push %r13 11: 41 56 push %r14 13: 41 57 push %r15 15: 6a 00 pushq $0x0 17: 31 c0 xor %eax,%eax 19: 48 8b bf c0 00 00 00 mov 0xc0(%rdi),%rdi 20: 44 8b 97 bc 00 00 00 mov 0xbc(%rdi),%r10d 27: 4c 01 d7 add %r10,%rdi 2a: 48 0f b7 7f 06 movzwq 0x6(%rdi),%rdi // Crash 2f: 5b pop %rbx 30: 41 5f pop %r15 32: 41 5e pop %r14 34: 41 5d pop %r13 36: 5b pop %rbx 37: c9 leaveq 38: c3 retq Fixes: d9ff286a0f59 ("bpf: allow BPF programs access skb_shared_info->gso_segs field") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-12Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/baseBruce Ashfield
2019-08-12Merge branch 'v5.2/standard/base' into v5.2/standard/preempt-rt/baseBruce Ashfield
2019-08-09net: fix bpf_xdp_adjust_head regression for generic-XDPJesper Dangaard Brouer
[ Upstream commit 065af355470519bd184019a93ac579f22b036045 ] When generic-XDP was moved to a later processing step by commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") a regression was introduced when using bpf_xdp_adjust_head. The issue is that after this commit the skb->network_header is now changed prior to calling generic XDP and not after. Thus, if the header is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header also need to be updated again. Fix by calling skb_reset_network_header(). Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") Reported-by: Brandon Cazander <brandon.cazander@multapplied.net> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09net: fix ifindex collision during namespace removalJiri Pirko
[ Upstream commit 55b40dbf0e76b4bfb9d8b3a16a0208640a9a45df ] Commit aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.") introduced a possibility to hit a BUG in case device is returning back to init_net and two following conditions are met: 1) dev->ifindex value is used in a name of another "dev%d" device in init_net. 2) dev->name is used by another device in init_net. Under real life circumstances this is hard to get. Therefore this has been present happily for over 10 years. To reproduce: $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff 3: enp0s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff $ ip netns add ns1 $ ip -n ns1 link add dummy1ns1 type dummy $ ip -n ns1 link add dummy2ns1 type dummy $ ip link set enp0s2 netns ns1 $ ip -n ns1 link set enp0s2 name dummy0 [ 100.858894] virtio_net virtio0 dummy0: renamed from enp0s2 $ ip link add dev4 type dummy $ ip -n ns1 a 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: dummy1ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 16:63:4c:38:3e:ff brd ff:ff:ff:ff:ff:ff 3: dummy2ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether aa:9e:86:dd:6b:5d brd ff:ff:ff:ff:ff:ff 4: dummy0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff 4: dev4: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 5a:e1:4a:b6:ec:f8 brd ff:ff:ff:ff:ff:ff $ ip netns del ns1 [ 158.717795] default_device_exit: failed to move dummy0 to init_net: -17 [ 158.719316] ------------[ cut here ]------------ [ 158.720591] kernel BUG at net/core/dev.c:9824! [ 158.722260] invalid opcode: 0000 [#1] SMP KASAN PTI [ 158.723728] CPU: 0 PID: 56 Comm: kworker/u2:1 Not tainted 5.3.0-rc1+ #18 [ 158.725422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 [ 158.727508] Workqueue: netns cleanup_net [ 158.728915] RIP: 0010:default_device_exit.cold+0x1d/0x1f [ 158.730683] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e [ 158.736854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282 [ 158.738752] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000 [ 158.741369] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64 [ 158.743418] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c [ 158.745626] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000 [ 158.748405] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72 [ 158.750638] FS: 0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000 [ 158.752944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 158.755245] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0 [ 158.757654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 158.760012] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 158.762758] Call Trace: [ 158.763882] ? dev_change_net_namespace+0xbb0/0xbb0 [ 158.766148] ? devlink_nl_cmd_set_doit+0x520/0x520 [ 158.768034] ? dev_change_net_namespace+0xbb0/0xbb0 [ 158.769870] ops_exit_list.isra.0+0xa8/0x150 [ 158.771544] cleanup_net+0x446/0x8f0 [ 158.772945] ? unregister_pernet_operations+0x4a0/0x4a0 [ 158.775294] process_one_work+0xa1a/0x1740 [ 158.776896] ? pwq_dec_nr_in_flight+0x310/0x310 [ 158.779143] ? do_raw_spin_lock+0x11b/0x280 [ 158.780848] worker_thread+0x9e/0x1060 [ 158.782500] ? process_one_work+0x1740/0x1740 [ 158.784454] kthread+0x31b/0x420 [ 158.786082] ? __kthread_create_on_node+0x3f0/0x3f0 [ 158.788286] ret_from_fork+0x3a/0x50 [ 158.789871] ---[ end trace defd6c657c71f936 ]--- [ 158.792273] RIP: 0010:default_device_exit.cold+0x1d/0x1f [ 158.795478] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e [ 158.804854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282 [ 158.807865] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000 [ 158.811794] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64 [ 158.816652] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c [ 158.820930] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000 [ 158.825113] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72 [ 158.829899] FS: 0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000 [ 158.834923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 158.838164] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0 [ 158.841917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 158.845149] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fix this by checking if a device with the same name exists in init_net and fallback to original code - dev%d to allocate name - in case it does. This was found using syzkaller. Fixes: aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ] Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28net: neigh: fix multiple neigh timer schedulingLorenzo Bianconi
[ Upstream commit 071c37983d99da07797294ea78e9da1a6e287144 ] Neigh timer can be scheduled multiple times from userspace adding multiple neigh entries and forcing the neigh timer scheduling passing NTF_USE in the netlink requests. This will result in a refcount leak and in the following dump stack: [ 32.465295] NEIGH: BUG, double timer add, state is 8 [ 32.465308] CPU: 0 PID: 416 Comm: double_timer_ad Not tainted 5.2.0+ #65 [ 32.465311] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014 [ 32.465313] Call Trace: [ 32.465318] dump_stack+0x7c/0xc0 [ 32.465323] __neigh_event_send+0x20c/0x880 [ 32.465326] ? ___neigh_create+0x846/0xfb0 [ 32.465329] ? neigh_lookup+0x2a9/0x410 [ 32.465332] ? neightbl_fill_info.constprop.0+0x800/0x800 [ 32.465334] neigh_add+0x4f8/0x5e0 [ 32.465337] ? neigh_xmit+0x620/0x620 [ 32.465341] ? find_held_lock+0x85/0xa0 [ 32.465345] rtnetlink_rcv_msg+0x204/0x570 [ 32.465348] ? rtnl_dellink+0x450/0x450 [ 32.465351] ? mark_held_locks+0x90/0x90 [ 32.465354] ? match_held_lock+0x1b/0x230 [ 32.465357] netlink_rcv_skb+0xc4/0x1d0 [ 32.465360] ? rtnl_dellink+0x450/0x450 [ 32.465363] ? netlink_ack+0x420/0x420 [ 32.465366] ? netlink_deliver_tap+0x115/0x560 [ 32.465369] ? __alloc_skb+0xc9/0x2f0 [ 32.465372] netlink_unicast+0x270/0x330 [ 32.465375] ? netlink_attachskb+0x2f0/0x2f0 [ 32.465378] netlink_sendmsg+0x34f/0x5a0 [ 32.465381] ? netlink_unicast+0x330/0x330 [ 32.465385] ? move_addr_to_kernel.part.0+0x20/0x20 [ 32.465388] ? netlink_unicast+0x330/0x330 [ 32.465391] sock_sendmsg+0x91/0xa0 [ 32.465394] ___sys_sendmsg+0x407/0x480 [ 32.465397] ? copy_msghdr_from_user+0x200/0x200 [ 32.465401] ? _raw_spin_unlock_irqrestore+0x37/0x40 [ 32.465404] ? lockdep_hardirqs_on+0x17d/0x250 [ 32.465407] ? __wake_up_common_lock+0xcb/0x110 [ 32.465410] ? __wake_up_common+0x230/0x230 [ 32.465413] ? netlink_bind+0x3e1/0x490 [ 32.465416] ? netlink_setsockopt+0x540/0x540 [ 32.465420] ? __fget_light+0x9c/0xf0 [ 32.465423] ? sockfd_lookup_light+0x8c/0xb0 [ 32.465426] __sys_sendmsg+0xa5/0x110 [ 32.465429] ? __ia32_sys_shutdown+0x30/0x30 [ 32.465432] ? __fd_install+0xe1/0x2c0 [ 32.465435] ? lockdep_hardirqs_off+0xb5/0x100 [ 32.465438] ? mark_held_locks+0x24/0x90 [ 32.465441] ? do_syscall_64+0xf/0x270 [ 32.465444] do_syscall_64+0x63/0x270 [ 32.465448] entry_SYSCALL_64_after_hwframe+0x49/0xbe Fix the issue unscheduling neigh_timer if selected entry is in 'IN_TIMER' receiving a netlink request with NTF_USE flag set Reported-by: Marek Majkowski <marek@cloudflare.com> Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-22net: Add a mutex around devnet_rename_seqSebastian Andrzej Siewior
On RT write_seqcount_begin() disables preemption and device_rename() allocates memory with GFP_KERNEL and grabs later the sysfs_mutex mutex. Serialize with a mutex and add use the non preemption disabling __write_seqcount_begin(). To avoid writer starvation, let the reader grab the mutex and release it when it detects a writer in progress. This keeps the normal case (no reader on the fly) fast. [ tglx: Instead of replacing the seqcount by a mutex, add the mutex ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22net/core: protect users of napi_alloc_cache against reentranceSebastian Andrzej Siewior
On -RT the code running in BH can not be moved to another CPU so CPU local variable remain local. However the code can be preempted and another task may enter BH accessing the same CPU using the same napi_alloc_cache variable. This patch ensures that each user of napi_alloc_cache uses a local lock. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-07-22net: Another local_irq_disable/kmalloc headacheThomas Gleixner
Replace it by a local lock. Though that's pretty inefficient :( Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22net: Remove preemption disabling in netif_rx()Priyanka Jain
1)enqueue_to_backlog() (called from netif_rx) should be bind to a particluar CPU. This can be achieved by disabling migration. No need to disable preemption 2)Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might gets scheduled out, so it expects non atomic context. 3)When CONFIG_PREEMPT_RT_FULL is not defined, migrate_enable(), migrate_disable() maps to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. -Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com> Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com> Cc: <rostedt@goodmis.orgn> Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22net: Have __napi_schedule_irqoff() disable interrupts on RTSteven Rostedt
A customer hit a crash where the napi sd->poll_list became corrupted. The customer had the bnx2x driver, which does a __napi_schedule_irqoff() in its interrupt handler. Unfortunately, when running with CONFIG_PREEMPT_RT_FULL, this interrupt handler is run as a thread and is preemptable. The call to ____napi_schedule() must be done with interrupts disabled to protect the per cpu softnet_data's "poll_list, which is protected by disabling interrupts (disabling preemption is enough when all interrupts are threaded and local_bh_disable() can't preempt)." As bnx2x isn't the only driver that does this, the safest thing to do is to make __napi_schedule_irqoff() call __napi_schedule() instead when CONFIG_PREEMPT_RT_FULL is enabled, which will call local_irq_save() before calling ____napi_schedule(). Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-07-22net/Qdisc: use a seqlock instead seqcountSebastian Andrzej Siewior
The seqcount disables preemption on -RT while it is held which can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while writer is active. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-07-22net: dev: always take qdisc's busylock in __dev_xmit_skb()Sebastian Andrzej Siewior
The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority then the task with the higher priority won't be able to submit packets to the NIC directly instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we take always the busylock we ensure that the RT task can boost the low-prio task and submit the packet. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-07-22net: Use skbufhead with raw lockThomas Gleixner
Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22net/core: use local_bh_disable() in netif_rx_ni()Sebastian Andrzej Siewior
In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided be putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-07-22softirq: Check preemption after reenabling interruptsThomas Gleixner
raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the rasie_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22hrtimer: consolidate hrtimer_init() + hrtimer_init_sleeper() callsSebastian Andrzej Siewior
hrtimer_init_sleeper() calls require a prior initialisation of the hrtimer object with hrtimer_init(). Lets make the initialisation of the hrtimer object part of hrtimer_init_sleeper(). To remain consistent consider init_on_stack as well. Beside adapting the hrtimer_init_sleeper[_on_stack]() functions, call sites need to be updated as well. Link: http://lkml.kernel.org/r/20180703092541.2870-1-anna-maria@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> [anna-maria: Updating the commit message, add staging/android/vsoc.c] Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
2019-06-18net: remove duplicate fetch in sock_getsockoptJingYi Hou
In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou <houjingyi647@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...
2019-06-16neigh: fix use-after-free read in pneigh_get_nextEric Dumazet
Nine years ago, I added RCU handling to neighbours, not pneighbours. (pneigh are not commonly used) Unfortunately I missed that /proc dump operations would use a common entry and exit point : neigh_seq_start() and neigh_seq_stop() We need to read_lock(tbl->lock) or risk use-after-free while iterating the pneigh structures. We might later convert pneigh to RCU and revert this patch. sysbot reported : BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158 Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825 CPU: 1 PID: 9825 Comm: syz-executor.0 Not tainted 5.2.0-rc4+ #32 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158 neigh_seq_next+0xdb/0x210 net/core/neighbour.c:3240 seq_read+0x9cf/0x1110 fs/seq_file.c:258 proc_reg_read+0x1fc/0x2c0 fs/proc/inode.c:221 do_loop_readv_writev fs/read_write.c:714 [inline] do_loop_readv_writev fs/read_write.c:701 [inline] do_iter_read+0x4a4/0x660 fs/read_write.c:935 vfs_readv+0xf0/0x160 fs/read_write.c:997 kernel_readv fs/splice.c:359 [inline] default_file_splice_read+0x475/0x890 fs/splice.c:414 do_splice_to+0x127/0x180 fs/splice.c:877 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:954 do_splice_direct+0x1da/0x2a0 fs/splice.c:1063 do_sendfile+0x597/0xd00 fs/read_write.c:1464 __do_sys_sendfile64 fs/read_write.c:1525 [inline] __se_sys_sendfile64 fs/read_write.c:1511 [inline] __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4592c9 Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4aab51dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004592c9 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005 RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000080000000 R11: 0000000000000246 R12: 00007f4aab51e6d4 R13: 00000000004c689d R14: 00000000004db828 R15: 00000000ffffffff Allocated by task 9827: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3660 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3669 kmalloc include/linux/slab.h:552 [inline] pneigh_lookup+0x19c/0x4a0 net/core/neighbour.c:731 arp_req_set_public net/ipv4/arp.c:1010 [inline] arp_req_set+0x613/0x720 net/ipv4/arp.c:1026 arp_ioctl+0x652/0x7f0 net/ipv4/arp.c:1226 inet_ioctl+0x2a0/0x340 net/ipv4/af_inet.c:926 sock_do_ioctl+0xd8/0x2f0 net/socket.c:1043 sock_ioctl+0x3ed/0x780 net/socket.c:1194 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9824: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kfree+0xcf/0x220 mm/slab.c:3755 pneigh_ifdown_and_unlock net/core/neighbour.c:812 [inline] __neigh_ifdown+0x236/0x2f0 net/core/neighbour.c:356 neigh_ifdown+0x20/0x30 net/core/neighbour.c:372 arp_ifdown+0x1d/0x21 net/ipv4/arp.c:1274 inetdev_destroy net/ipv4/devinet.c:319 [inline] inetdev_event+0xa14/0x11f0 net/ipv4/devinet.c:1544 notifier_call_chain+0xc2/0x230 kernel/notifier.c:95 __raw_notifier_call_chain kernel/notifier.c:396 [inline] raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:403 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1749 call_netdevice_notifiers_extack net/core/dev.c:1761 [inline] call_netdevice_notifiers net/core/dev.c:1775 [inline] rollback_registered_many+0x9b9/0xfc0 net/core/dev.c:8178 rollback_registered+0x109/0x1d0 net/core/dev.c:8220 unregister_netdevice_queue net/core/dev.c:9267 [inline] unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9260 unregister_netdevice include/linux/netdevice.h:2631 [inline] __tun_detach+0xd8a/0x1040 drivers/net/tun.c:724 tun_detach drivers/net/tun.c:741 [inline] tun_chr_close+0xe0/0x180 drivers/net/tun.c:3451 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:168 prepare_exit_to_usermode arch/x86/entry/common.c:199 [inline] syscall_return_slowpath arch/x86/entry/common.c:279 [inline] do_syscall_64+0x58e/0x680 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888097f2a700 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes inside of 64-byte region [ffff888097f2a700, ffff888097f2a740) The buggy address belongs to the page: page:ffffea00025fca80 refcount:1 mapcount:0 mapping:ffff8880aa400340 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea000250d548 ffffea00025726c8 ffff8880aa400340 raw: 0000000000000000 ffff888097f2a000 0000000100000020 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ^ ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16net: handle 802.1P vlan 0 packets properlyGovindarajulu Varadarajan
When stack receives pkt: [802.1P vlan 0][802.1AD vlan 100][IPv4], vlan_do_receive() returns false if it does not find vlan_dev. Later __netif_receive_skb_core() fails to find packet type handler for skb->protocol 801.1AD and drops the packet. 801.1P header with vlan id 0 should be handled as untagged packets. This patch fixes it by checking if vlan_id is 0 and processes next vlan header. Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-06-15 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix stack layout of JITed x64 bpf code, from Alexei. 2) fix out of bounds memory access in bpf_sk_storage, from Arthur. 3) fix lpm trie walk, from Jonathan. 4) fix nested bpf_perf_event_output, from Matt. 5) and several other fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15bpf: Fix out of bounds memory access in bpf_sk_storageArthur Fabre
bpf_sk_storage maps use multiple spin locks to reduce contention. The number of locks to use is determined by the number of possible CPUs. With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used. When updating elements, the correct lock is determined with hash_ptr(). Calling hash_ptr() with 0 bits is undefined behavior, as it does: x >> (64 - bits) Using the value results in an out of bounds memory access. In my case, this manifested itself as a page fault when raw_spin_lock_bh() is called later, when running the self tests: ./tools/testing/selftests/bpf/test_verifier 773 775 [ 16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8 Force the minimum number of locks to two. Signed-off-by: Arthur Fabre <afabre@cloudflare.com> Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-14net: add high_order_alloc_disable sysctl/static keyEric Dumazet
>From linux-3.7, (commit 5640f7685831 "net: use a per task frag allocator") TCP sendmsg() has preferred using order-3 allocations. While it gives good results for most cases, we had reports that heavy uses of TCP over loopback were hitting a spinlock contention in page allocations/freeing. This commits adds a sysctl so that admins can opt-in for order-0 allocations. Hopefully mm layer might optimize order-3 allocations in the future since it could give us a nice boost (see 8 lines of following benchmark) The following benchmark shows a win when more than 8 TCP_STREAM threads are running (56 x86 cores server in my tests) for thr in {1..30} do sysctl -wq net.core.high_order_alloc_disable=0 T0=`./super_netperf $thr -H 127.0.0.1 -l 15` sysctl -wq net.core.high_order_alloc_disable=1 T1=`./super_netperf $thr -H 127.0.0.1 -l 15` echo $thr:$T0:$T1 done 1: 49979: 37267 2: 98745: 76286 3: 141088: 110051 4: 177414: 144772 5: 197587: 173563 6: 215377: 208448 7: 241061: 234087 8: 267155: 263373 9: 295069: 297402 10: 312393: 335213 11: 340462: 368778 12: 371366: 403954 13: 412344: 443713 14: 426617: 473580 15: 474418: 507861 16: 503261: 538539 17: 522331: 563096 18: 532409: 567084 19: 550824: 605240 20: 525493: 641988 21: 564574: 665843 22: 567349: 690868 23: 583846: 710917 24: 588715: 736306 25: 603212: 763494 26: 604083: 792654 27: 602241: 796450 28: 604291: 797993 29: 611610: 833249 30: 577356: 841062 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12net: ethtool: Allow matching on vlan DEI bitMaxime Chevallier
Using ethtool, users can specify a classification action matching on the full vlan tag, which includes the DEI bit (also previously called CFI). However, when converting the ethool_flow_spec to a flow_rule, we use dissector keys to represent the matching patterns. Since the vlan dissector key doesn't include the DEI bit, this information was silently discarded when translating the ethtool flow spec in to a flow_rule. This commit adds the DEI bit into the vlan dissector key, and allows propagating the information to the driver when parsing the ethtool flow spec. Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator") Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12bpf: net: Set sk_bpf_storage back to NULL for cloned skMartin KaFai Lau
The cloned sk should not carry its parent-listener's sk_bpf_storage. This patch fixes it by setting it back to NULL. Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-08Merge tag 'spdx-5.2-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull yet more SPDX updates from Greg KH: "Another round of SPDX header file fixes for 5.2-rc4 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being added, based on the text in the files. We are slowly chipping away at the 700+ different ways people tried to write the license text. All of these were reviewed on the spdx mailing list by a number of different people. We now have over 60% of the kernel files covered with SPDX tags: $ ./scripts/spdxcheck.py -v 2>&1 | grep Files Files checked: 64533 Files with SPDX: 40392 Files with errors: 0 I think the majority of the "easy" fixups are now done, it's now the start of the longer-tail of crazy variants to wade through" * tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429 ...
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-06-07 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix several bugs in riscv64 JIT code emission which forgot to clear high 32-bits for alu32 ops, from Björn and Luke with selftests covering all relevant BPF alu ops from Björn and Jiong. 2) Two fixes for UDP BPF reuseport that avoid calling the program in case of __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption that skb->data is pointing to transport header, from Martin. 3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog workqueue, and a missing restore of sk_write_space when psock gets dropped, from Jakub and John. 4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it breaks standard applications like DNS if reverse NAT is not performed upon receive, from Daniel. 5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6 fails to verify that the length of the tuple is long enough, from Lorenz. 6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for {un,}successful probe) as that is expected to be propagated as an fd to load_sk_storage_btf() and thus closing the wrong descriptor otherwise, from Michal. 7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir. 8) Minor misc fixes in docs, samples and selftests, from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>