aboutsummaryrefslogtreecommitdiffstats
path: root/net/core
AgeCommit message (Collapse)Author
2021-10-06af_unix: fix races in sk_peer_pid and sk_peer_cred accessesEric Dumazet
[ Upstream commit 35306eb23814444bd4021f8a1c3047d3cb0c8b2b ] Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred. In order to fix this issue, this patch adds a new spinlock that needs to be used whenever these fields are read or written. Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently reading sk->sk_peer_pid which makes no sense, as this field is only possibly set by AF_UNIX sockets. We will have to clean this in a separate patch. This could be done by reverting b48596d1dc25 "Bluetooth: L2CAP: Add get_peer_pid callback" or implementing what was truly expected. Fixes: 109f6e39fa07 ("af_unix: Allow SO_PEERCRED to work across namespaces.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jann Horn <jannh@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-27Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-09-27Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-09-22flow_dissector: Fix out-of-bounds warningsGustavo A. R. Silva
[ Upstream commit 323e0cb473e2a8706ff162b6b4f4fa16023c9ba7 ] Fix the following out-of-bounds warnings: net/core/flow_dissector.c: In function '__skb_flow_dissect': >> net/core/flow_dissector.c:1104:4: warning: 'memcpy' offset [24, 39] from the object at '<unknown>' is out of the bounds of referenced subobject 'saddr' with type 'struct in6_addr' at offset 8 [-Warray-bounds] 1104 | memcpy(&key_addrs->v6addrs, &iph->saddr, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1105 | sizeof(key_addrs->v6addrs)); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from include/linux/ipv6.h:5, from net/core/flow_dissector.c:6: include/uapi/linux/ipv6.h:133:18: note: subobject 'saddr' declared here 133 | struct in6_addr saddr; | ^~~~~ >> net/core/flow_dissector.c:1059:4: warning: 'memcpy' offset [16, 19] from the object at '<unknown>' is out of the bounds of referenced subobject 'saddr' with type 'unsigned int' at offset 12 [-Warray-bounds] 1059 | memcpy(&key_addrs->v4addrs, &iph->saddr, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1060 | sizeof(key_addrs->v4addrs)); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from include/linux/ip.h:17, from net/core/flow_dissector.c:5: include/uapi/linux/ip.h:103:9: note: subobject 'saddr' declared here 103 | __be32 saddr; | ^~~~~ The problem is that the original code is trying to copy data into a couple of struct members adjacent to each other in a single call to memcpy(). So, the compiler legitimately complains about it. As these are just a couple of members, fix this by copying each one of them in separate calls to memcpy(). This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). Link: https://github.com/KSPP/linux/issues/109 Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/d5ae2e65-1f18-2577-246f-bada7eee6ccd@intel.com/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15netns: protect netns ID lookups with RCUGuillaume Nault
commit 2dce224f469f060b9998a5a869151ef83c08ce77 upstream. __peernet2id() can be protected by RCU as it only calls idr_for_each(), which is RCU-safe, and never modifies the nsid table. rtnl_net_dumpid() can also do lockless lookups. It does two nested idr_for_each() calls on nsid tables (one direct call and one indirect call because of rtnl_net_dumpid_one() calling __peernet2id()). The netnsid tables are never updated. Therefore it is safe to not take the nsid_lock and run within an RCU-critical section instead. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
2021-09-08Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-09-03rtnetlink: Return correct error on changing device netnsAndrey Ignatov
[ Upstream commit 96a6b93b69880b2c978e1b2be9cae6970b605008 ] Currently when device is moved between network namespaces using RTM_NEWLINK message type and one of netns attributes (FLA_NET_NS_PID, IFLA_NET_NS_FD, IFLA_TARGET_NETNSID) but w/o specifying IFLA_IFNAME, and target namespace already has device with same name, userspace will get EINVAL what is confusing and makes debugging harder. Fix it so that userspace gets more appropriate EEXIST instead what makes debugging much easier. Before: # ./ifname.sh + ip netns add ns0 + ip netns exec ns0 ip link add l0 type dummy + ip netns exec ns0 ip link show l0 8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 66:90:b5:d5:78:69 brd ff:ff:ff:ff:ff:ff + ip link add l0 type dummy + ip link show l0 10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 6e:c6:1f:15:20:8d brd ff:ff:ff:ff:ff:ff + ip link set l0 netns ns0 RTNETLINK answers: Invalid argument After: # ./ifname.sh + ip netns add ns0 + ip netns exec ns0 ip link add l0 type dummy + ip netns exec ns0 ip link show l0 8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 1e:4a:72:e3:e3:8f brd ff:ff:ff:ff:ff:ff + ip link add l0 type dummy + ip link show l0 10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether f2:fc:fe:2b:7d:a6 brd ff:ff:ff:ff:ff:ff + ip link set l0 netns ns0 RTNETLINK answers: File exists The problem is that do_setlink() passes its `char *ifname` argument, that it gets from a caller, to __dev_change_net_namespace() as is (as `const char *pat`), but semantics of ifname and pat can be different. For example, __rtnl_newlink() does this: net/core/rtnetlink.c 3270 char ifname[IFNAMSIZ]; ... 3286 if (tb[IFLA_IFNAME]) 3287 nla_strscpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ); 3288 else 3289 ifname[0] = '\0'; ... 3364 if (dev) { ... 3394 return do_setlink(skb, dev, ifm, extack, tb, ifname, status); 3395 } , i.e. do_setlink() gets ifname pointer that is always valid no matter if user specified IFLA_IFNAME or not and then do_setlink() passes this ifname pointer as is to __dev_change_net_namespace() as pat argument. But the pat (pattern) in __dev_change_net_namespace() is used as: net/core/dev.c 11198 err = -EEXIST; 11199 if (__dev_get_by_name(net, dev->name)) { 11200 /* We get here if we can't use the current device name */ 11201 if (!pat) 11202 goto out; 11203 err = dev_get_valid_name(net, dev, pat); 11204 if (err < 0) 11205 goto out; 11206 } As the result the `goto out` path on line 11202 is neven taken and instead of returning EEXIST defined on line 11198, __dev_change_net_namespace() returns an error from dev_get_valid_name() and this, in turn, will be EINVAL for ifname[0] = '\0' set earlier. Fixes: d8a5ec672768 ("[NET]: netlink support for moving devices between network namespaces.") Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-22Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-08-18net: linkwatch: fix failure to restore device state across suspend/resumeWilly Tarreau
[ Upstream commit 6922110d152e56d7569616b45a1f02876cf3eb9f ] After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed that my Ethernet port to which a bond and a VLAN interface are attached appeared to remain up after resuming from suspend with the cable unplugged (and that problem still persists with 5.10-LTS). It happens that the following happens: - the network driver (e1000e here) prepares to suspend, calls e1000e_down() which calls netif_carrier_off() to signal that the link is going down. - netif_carrier_off() adds a link_watch event to the list of events for this device - the device is completely stopped. - the machine suspends - the cable is unplugged and the machine brought to another location - the machine is resumed - the queued linkwatch events are processed for the device - the device doesn't yet have the __LINK_STATE_PRESENT bit and its events are silently dropped - the device is resumed with its link down - the upper VLAN and bond interfaces are never notified that the link had been turned down and remain up - the only way to provoke a change is to physically connect the machine to a port and possibly unplug it. The state after resume looks like this: $ ip -br li | egrep 'bond|eth' bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> Placing an explicit call to netdev_state_change() either in the suspend or the resume code in the NIC driver worked around this but the solution is not satisfying. The issue in fact really is in link_watch that loses events while it ought not to. It happens that the test for the device being present was added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") in 4.20 to avoid an access to devices that are not present. Instead of dropping events, this patch proceeds slightly differently by postponing their handling so that they happen after the device is fully resumed. Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") Link: https://lists.openwall.net/netdev/2018/03/15/62 Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-08Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-08-08net: Fix zero-copy head len calculation.Pravin B Shelar
[ Upstream commit a17ad0961706244dce48ec941f7e476a38c0e727 ] In some cases skb head could be locked and entire header data is pulled from skb. When skb_zerocopy() called in such cases, following BUG is triggered. This patch fixes it by copying entire skb in such cases. This could be optimized incase this is performance bottleneck. ---8<--- kernel BUG at net/core/skbuff.c:2961! invalid opcode: 0000 [#1] SMP PTI CPU: 2 PID: 0 Comm: swapper/2 Tainted: G OE 5.4.0-77-generic #86-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:skb_zerocopy+0x37a/0x3a0 RSP: 0018:ffffbcc70013ca38 EFLAGS: 00010246 Call Trace: <IRQ> queue_userspace_packet+0x2af/0x5e0 [openvswitch] ovs_dp_upcall+0x3d/0x60 [openvswitch] ovs_dp_process_packet+0x125/0x150 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] netdev_port_receive+0x87/0x130 [openvswitch] netdev_frame_hook+0x4b/0x60 [openvswitch] __netif_receive_skb_core+0x2b4/0xc90 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x18/0x60 process_backlog+0xa9/0x160 net_rx_action+0x142/0x390 __do_softirq+0xe1/0x2d6 irq_exit+0xae/0xb0 do_IRQ+0x5a/0xf0 common_interrupt+0xf/0xf Code that triggered BUG: int skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-02Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-07-31net: annotate data race around sk_ll_usecEric Dumazet
[ Upstream commit 0dbffbb5335a1e3aa6855e4ee317e25e669dd302 ] sk_ll_usec is read locklessly from sk_can_busy_loop() while another thread can change its value in sock_setsockopt() This is correct but needs annotations. BUG: KCSAN: data-race in __skb_try_recv_datagram / sock_setsockopt write to 0xffff88814eb5f904 of 4 bytes by task 14011 on cpu 0: sock_setsockopt+0x1287/0x2090 net/core/sock.c:1175 __sys_setsockopt+0x14f/0x200 net/socket.c:2100 __do_sys_setsockopt net/socket.c:2115 [inline] __se_sys_setsockopt net/socket.c:2112 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88814eb5f904 of 4 bytes by task 14001 on cpu 1: sk_can_busy_loop include/net/busy_poll.h:41 [inline] __skb_try_recv_datagram+0x14f/0x320 net/core/datagram.c:273 unix_dgram_recvmsg+0x14c/0x870 net/unix/af_unix.c:2101 unix_seqpacket_recvmsg+0x5a/0x70 net/unix/af_unix.c:2067 ____sys_recvmsg+0x15d/0x310 include/linux/uio.h:244 ___sys_recvmsg net/socket.c:2598 [inline] do_recvmmsg+0x35c/0x9f0 net/socket.c:2692 __sys_recvmmsg net/socket.c:2771 [inline] __do_sys_recvmmsg net/socket.c:2794 [inline] __se_sys_recvmmsg net/socket.c:2787 [inline] __x64_sys_recvmmsg+0xcf/0x150 net/socket.c:2787 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0x00000101 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 14001 Comm: syz-executor.3 Not tainted 5.13.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-07-19net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RTSebastian Andrzej Siewior
[ Upstream commit 8380c81d5c4fced6f4397795a5ae65758272bbfd ] __napi_schedule_irqoff() is an optimized version of __napi_schedule() which can be used where it is known that interrupts are disabled, e.g. in interrupt-handlers, spin_lock_irq() sections or hrtimer callbacks. On PREEMPT_RT enabled kernels this assumptions is not true. Force- threaded interrupt handlers and spinlocks are not disabling interrupts and the NAPI hrtimer callback is forced into softirq context which runs with interrupts enabled as well. Chasing all usage sites of __napi_schedule_irqoff() is a whack-a-mole game so make __napi_schedule_irqoff() invoke __napi_schedule() for PREEMPT_RT kernels. The callers of ____napi_schedule() in the networking core have been audited and are correct on PREEMPT_RT kernels as well. Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14Merge branch 'v5.4/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-07-14bpf: Do not change gso_size during bpf_skb_change_proto()Maciej Żenczykowski
[ Upstream commit 364745fbe981a4370f50274475da4675661104df ] This is technically a backwards incompatible change in behaviour, but I'm going to argue that it is very unlikely to break things, and likely to fix *far* more then it breaks. In no particular order, various reasons follow: (a) I've long had a bug assigned to myself to debug a super rare kernel crash on Android Pixel phones which can (per stacktrace) be traced back to BPF clat IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later on during transmit deep in the GSO engine, AFAICT precisely because of this change to gso_size, though I've never been able to manually reproduce it. I believe it may be related to the particular network offload support of attached USB ethernet dongle being used for tethering off of an IPv6-only cellular connection. The reason might be we end up with more segments than max permitted, or with a GSO packet with only one segment... (either way we break some assumption and hit a BUG_ON) (b) There is no check that the gso_size is > 20 when reducing it by 20, so we might end up with a negative (or underflowing) gso_size or a gso_size of 0. This can't possibly be good. Indeed this is probably somehow exploitable (or at least can result in a kernel crash) by delivering crafted packets and perhaps triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS) is related to MTU, but not directly derived from it: gso_size/MSS may be significantly smaller then one would get by deriving from local MTU. And on some NICs (which do loose MTU checking on receive, it may even potentially be larger, for example my work pc with 1500 MTU can receive 1520 byte frames [and sometimes does due to bugs in a vendor plat46 implementation]). Indeed even just going from 21 to 1 is potentially problematic because it increases the number of segments by a factor of 21 (think DoS, or some other crash due to too many segments). (c) It's always safe to not increase the gso_size, because it doesn't result in the max packet size increasing. So the skb_increase_gso_size() call was always unnecessary for correctness (and outright undesirable, see later). As such the only part which is potentially dangerous (ie. could cause backwards compatibility issues) is the removal of the skb_decrease_gso_size() call. (d) If the packets are ultimately destined to the local device, then there is absolutely no benefit to playing around with gso_size. It only matters if the packets will egress the device. ie. we're either forwarding, or transmitting from the device. (e) This logic only triggers for packets which are GSO. It does not trigger for skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a GSO packet (and you don't even know what the MTU is, so you can't even fix it). As such your transmit path must *already* be able to handle an MTU 20 bytes larger then your receive path (for IPv4 to IPv6 translation) - and indeed 28 bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size() call doesn't actually increase the size of the packets your transmit side must be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/ route MTUs must already be set correctly. Since for example with an IPv4 egress MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames, so you need a 1520 byte device MTU. This means if your IPv6 device's egress MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the need to handle fragments). This is to handle normal non-GSO packets. Thus the reduction is simply not needed for GSO packets, because when they're correctly built, they will already be the right size. (f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP segments) should not be modified, so that TCP's MSS counting works correctly (this matters for congestion control). If protocol conversion changes the gso_size, then the number of TCP segments may increase or decrease. Packet loss after protocol conversion can result in partial loss of MSS segments that the sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs in the middle of the segments it sent? (g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee that gso_size (and thus resulting packet MTU) is changed. ie. you must assume it won't be changed. (h) changing gso_size is outright buggy for UDP GSO packets, where framing matters (I believe that's also the case for SCTP, but it's already excluded by [g]). So the only remaining case is TCP, which also doesn't want it (see [f]). (i) see also the reasoning on the previous attempt at fixing this (commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e) which shows that the current behaviour causes TCP packet loss: In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the coalesced packet payload can be > MSS, but < MSS + 20. bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload length. After then tcp_gso_segment checks for the payload length if it is <= MSS. The condition is causing the packet to be dropped. tcp_gso_segment(): [...] mss = skb_shinfo(skb)->gso_size; if (unlikely(skb->len <= mss)) goto out; [...] Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary and buggy, and decreasing can go negative. Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-9vQ@mail.gmail.com Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30Merge branch 'v5.4/base' into v5.4/standard/ti-j72xBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2021-06-30net: ethtool: clear heap allocations for ethtool functionAustin Kim
[ Upstream commit 80ec82e3d2c1fab42eeb730aaa7985494a963d3f ] Several ethtool functions leave heap uncleared (potentially) by drivers. This will leave the unused portion of heap unchanged and might copy the full contents back to userspace. Signed-off-by: Austin Kim <austindh.kim@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-23Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2021-06-23Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-06-23net: make get_net_ns return error if NET_NS is disabledChangbin Du
[ Upstream commit ea6932d70e223e02fea3ae20a4feff05d7c1ea9a ] There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_operations is not implemented in this case. [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [7.670268] pgd = 32b54000 [7.670544] [00000010] *pgd=00000000 [7.671861] Internal error: Oops: 5 [#1] SMP ARM [7.672315] Modules linked in: [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16 [7.673309] Hardware name: Generic DT based system [7.673642] PC is at nsfs_evict+0x24/0x30 [7.674486] LR is at clear_inode+0x20/0x9c The same to tun SIOCGSKNS command. To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is disabled. Meanwhile move it to right place net/core/net_namespace.c. Signed-off-by: Changbin Du <changbin.du@gmail.com> Fixes: c62cce2caee5 ("net: add an ioctl to get a socket network namespace") Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-23rtnetlink: Fix regression in bridge VLAN configurationIdo Schimmel
[ Upstream commit d2e381c4963663bca6f30c3b996fa4dbafe8fcb5 ] Cited commit started returning errors when notification info is not filled by the bridge driver, resulting in the following regression: # ip link add name br1 type bridge vlan_filtering 1 # bridge vlan add dev br1 vid 555 self pvid untagged RTNETLINK answers: Invalid argument As long as the bridge driver does not fill notification info for the bridge device itself, an empty notification should not be considered as an error. This is explained in commit 59ccaaaa49b5 ("bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify"). Fix by removing the error and add a comment to avoid future bugs. Fixes: a8db57c1d285 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-18fib: Return the correct errno codeZheng Yongjun
[ Upstream commit 59607863c54e9eb3f69afc5257dfe71c38bb751e ] When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-18rtnetlink: Fix missing error code in rtnl_bridge_notify()Jiapeng Chong
[ Upstream commit a8db57c1d285c758adc7fb43d6e2bad2554106e1 ] The error code is missing in this code scenario, add the error code '-EINVAL' to the return value 'err'. Eliminate the follow smatch warning: net/core/rtnetlink.c:4834 rtnl_bridge_notify() warn: missing error code 'err'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-15Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-06-10Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-06-10neighbour: allow NUD_NOARP entries to be forced GCedDavid Ahern
commit 7a6b1ab7475fd6478eeaf5c9d1163e7a18125c8f upstream. IFF_POINTOPOINT interfaces use NUD_NOARP entries for IPv6. It's possible to fill up the neighbour table with enough entries that it will overflow for valid connections after that. This behaviour is more prevalent after commit 58956317c8de ("neighbor: Improve garbage collection") is applied, as it prevents removal from entries that are not NUD_FAILED, unless they are more than 5s old. Fixes: 58956317c8de (neighbor: Improve garbage collection) Reported-by: Kasper Dupont <kasperd@gjkwv.06.feb.2021.kasperd.net> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-03neighbour: Prevent Race condition in neighbour subsytemChinmay Agarwal
commit eefb45eef5c4c425e87667af8f5e904fbdd47abf upstream. Following Race Condition was detected: <CPU A, t0>: Executing: __netif_receive_skb() ->__netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which takes a reference on neighbour entry 'n'. Moves further along, arp_process() and calls neigh_update()-> __neigh_update(). Neighbour entry is unlocked just before a call to neigh_update_gc_list. This unlocking paves way for another thread that may take a reference on the same and mark it dead and remove it from gc_list. <CPU B, t1> - neigh_flush_dev() is under execution and calls neigh_mark_dead(n) marking the neighbour entry 'n' as dead. Also n will be removed from gc_list. Moves further along neigh_flush_dev() and calls neigh_cleanup_and_release(n), but since reference count increased in t1, 'n' couldn't be destroyed. <CPU A, t3>- Code hits neigh_update_gc_list, with neighbour entry set as dead. <CPU A, t4> - arp_process() finally calls neigh_release(n), destroying the neighbour entry and we have a destroyed ntry still part of gc_list. Fixes: eb4e8fac00d1("neighbour: Prevent a dead entry from updating gc_list") Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-03bpf: Set mac_len in bpf_skb_change_headJussi Maki
[ Upstream commit 84316ca4e100d8cbfccd9f774e23817cb2059868 ] The skb_change_head() helper did not set "skb->mac_len", which is problematic when it's used in combination with skb_redirect_peer(). Without it, redirecting a packet from a L3 device such as wireguard to the veth peer device will cause skb->data to point to the middle of the IP header on entry to tcp_v4_rcv() since the L2 header is not pulled correctly due to mac_len=0. Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210519154743.2554771-2-joamaki@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03net: sched: fix tx action reschedule issue with stopped queueYunsheng Lin
[ Upstream commit dcad9ee9e0663d74a89b25b987f9c7be86432812 ] The netdev qeueue might be stopped when byte queue limit has reached or tx hw ring is full, net_tx_action() may still be rescheduled if STATE_MISSED is set, which consumes unnecessary cpu without dequeuing and transmiting any skb because the netdev queue is stopped, see qdisc_run_end(). This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if netdev queue is stopped during qdisc_run(), the net_tx_action() is rescheduled again when netdev qeueue is restarted, see netif_tx_wake_queue(). As there is time window between netif_xmit_frozen_or_stopped() checking and STATE_MISSED clearing, between which STATE_MISSED may set by net_tx_action() scheduled by netif_tx_wake_queue(), so set the STATE_MISSED again if netdev queue is restarted. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Reported-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03net: sched: fix tx action rescheduling issue during deactivationYunsheng Lin
[ Upstream commit 102b55ee92f9fda4dde7a45d2b20538e6e3e3d1e ] Currently qdisc_run() checks the STATE_DEACTIVATED of lockless qdisc before calling __qdisc_run(), which ultimately clear the STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED is set before clearing STATE_MISSED, there may be rescheduling of net_tx_action() at the end of qdisc_run_end(), see below: CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate) . . . . set STATE_MISSED . . __netif_schedule() . . . set STATE_DEACTIVATED . . qdisc_reset() . . . .<--------------- . synchronize_net() clear __QDISC_STATE_SCHED | . . . | . . . | . some_qdisc_is_busy() . | . return *false* . | . . test STATE_DEACTIVATED | . . __qdisc_run() *not* called | . . . | . . test STATE_MISS | . . __netif_schedule()--------| . . . . . . . . __qdisc_run() is not called by net_tx_atcion() in CPU0 because CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule is called at the end of qdisc_run_end(), causing tx action rescheduling problem. qdisc_run() called by net_tx_action() runs in the softirq context, which should has the same semantic as the qdisc_run() called by __dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a synchronize_net() between STATE_DEACTIVATED flag being set and qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely bail out for the deactived lockless qdisc in net_tx_action(), and qdisc_reset() will reset all skb not dequeued yet. So add the rcu_read_lock() explicitly to protect the qdisc_run() and do the STATE_DEACTIVATED checking in net_tx_action() before calling qdisc_run_begin(). Another option is to do the checking in the qdisc_run_end(), but it will add unnecessary overhead for non-tx_action case, because __dev_queue_xmit() will not see qdisc with STATE_DEACTIVATED after synchronize_net(), the qdisc with STATE_DEACTIVATED can only be seen by net_tx_action() because of __netif_schedule(). The STATE_DEACTIVATED checking in qdisc_run() is to avoid race between net_tx_action() and qdisc_reset(), see: commit d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc"). As the bailout added above for deactived lockless qdisc in net_tx_action() provides better protection for the race without calling qdisc_run() at all, so remove the STATE_DEACTIVATED checking in qdisc_run(). After qdisc_reset(), there is no skb in qdisc to be dequeued, so clear the STATE_MISSED in dev_reset_queue() too. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> V8: Clearing STATE_MISSED before calling __netif_schedule() has avoid the endless rescheduling problem, but there may still be a unnecessary rescheduling, so adjust the commit log. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03net: really orphan skbs tied to closing skPaolo Abeni
[ Upstream commit 098116e7e640ba677d9e345cbee83d253c13d556 ] If the owing socket is shutting down - e.g. the sock reference count already dropped to 0 and only sk_wmem_alloc is keeping the sock alive, skb_orphan_partial() becomes a no-op. When forwarding packets over veth with GRO enabled, the above causes refcount errors. This change addresses the issue with a plain skb_orphan() call in the critical scenario. Fixes: 9adc89af724f ("net: let skb_orphan_partial wake-up waiters.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-21Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-05-19mm: fix struct page layout on 32-bit systemsMatthew Wilcox (Oracle)
commit 9ddb3c14afba8bc5950ed297f02d4ae05ff35cd1 upstream. 32-bit architectures which expect 8-byte alignment for 8-byte integers and need 64-bit DMA addresses (arm, mips, ppc) had their struct page inadvertently expanded in 2019. When the dma_addr_t was added, it forced the alignment of the union to 8 bytes, which inserted a 4 byte gap between 'flags' and the union. Fix this by storing the dma_addr_t in one or two adjacent unsigned longs. This restores the alignment to that of an unsigned long. We always store the low bits in the first word to prevent the PageTail bit from being inadvertently set on a big endian platform. If that happened, get_user_pages_fast() racing against a page which was freed and reallocated to the page_pool could dereference a bogus compound_head(), which would be hard to trace back to this cause. Link: https://lkml.kernel.org/r/20210510153211.1504886-1-willy@infradead.org Fixes: c25fff7171be ("mm: add dma_addr_t to struct page") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Matteo Croce <mcroce@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19ethtool: ioctl: Fix out-of-bounds warning in store_link_ksettings_for_user()Gustavo A. R. Silva
[ Upstream commit c1d9e34e11281a8ba1a1c54e4db554232a461488 ] Fix the following out-of-bounds warning: net/ethtool/ioctl.c:492:2: warning: 'memcpy' offset [49, 84] from the object at 'link_usettings' is out of the bounds of referenced subobject 'base' with type 'struct ethtool_link_settings' at offset 0 [-Warray-bounds] The problem is that the original code is trying to copy data into a some struct members adjacent to each other in a single call to memcpy(). This causes a legitimate compiler warning because memcpy() overruns the length of &link_usettings.base. Fix this by directly using &link_usettings and _from_ as destination and source addresses, instead. This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). Link: https://github.com/KSPP/linux/issues/109 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target()Gustavo A. R. Silva
[ Upstream commit 1e3d976dbb23b3fce544752b434bdc32ce64aabc ] Fix the following out-of-bounds warning: net/core/flow_dissector.c:835:3: warning: 'memcpy' offset [33, 48] from the object at 'flow_keys' is out of the bounds of referenced subobject 'ipv6_src' with type '__u32[4]' {aka 'unsigned int[4]'} at offset 16 [-Warray-bounds] The problem is that the original code is trying to copy data into a couple of struct members adjacent to each other in a single call to memcpy(). So, the compiler legitimately complains about it. As these are just a couple of members, fix this by copying each one of them in separate calls to memcpy(). This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). Link: https://github.com/KSPP/linux/issues/109 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-17Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-05-14gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment checkAlexander Lobakin
[ Upstream commit 7ad18ff6449cbd6beb26b53128ddf56d2685aa93 ] Commit 38ec4944b593 ("gro: ensure frag0 meets IP header alignment") did the right thing, but missed the fact that napi_gro_frags() logics calls for skb_gro_reset_offset() *before* pulling Ethernet header to the skb linear space. That said, the introduced check for frag0 address being aligned to 4 always fails for it as Ethernet header is obviously 14 bytes long, and in case with NET_IP_ALIGN its start is not aligned to 4. Fix this by adding @nhoff argument to skb_gro_reset_offset() which tells if an IP header is placed right at the start of frag0 or not. This restores Fast GRO for napi_gro_frags() that became very slow after the mentioned commit, and preserves the introduced check to avoid silent unaligned accesses. From v1 [0]: - inline tiny skb_gro_reset_offset() to let the code be optimized more efficively (esp. for the !NET_IP_ALIGN case) (Eric); - pull in Reviewed-by from Eric. [0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me Fixes: 38ec4944b593 ("gro: ensure frag0 meets IP header alignment") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-26Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-04-21gro: ensure frag0 meets IP header alignmentEric Dumazet
commit 38ec4944b593fd90c5ef42aaaa53e66ae5769d04 upstream. After commit 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head") Guenter Roeck reported one failure in his tests using sh architecture. After much debugging, we have been able to spot silent unaligned accesses in inet_gro_receive() The issue at hand is that upper networking stacks assume their header is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN bytes before the Ethernet header to make that happen. This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path if the fragment is not properly aligned. Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN as 0, this extra check will be a NOP for them. Note that if frag0 is not used, GRO will call pskb_may_pull() as many times as needed to pull network and transport headers. Fixes: 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head") Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-21neighbour: Disregard DEAD dst in neigh_updateTong Zhu
[ Upstream commit d47ec7a0a7271dda08932d6208e4ab65ab0c987c ] After a short network outage, the dst_entry is timed out and put in DST_OBSOLETE_DEAD. We are in this code because arp reply comes from this neighbour after network recovers. There is a potential race condition that dst_entry is still in DST_OBSOLETE_DEAD. With that, another neighbour lookup causes more harm than good. In best case all packets in arp_queue are lost. This is counterproductive to the original goal of finding a better path for those packets. I observed a worst case with 4.x kernel where a dst_entry in DST_OBSOLETE_DEAD state is associated with loopback net_device. It leads to an ethernet header with all zero addresses. A packet with all zero source MAC address is quite deadly with mac80211, ath9k and 802.11 block ack. It fails ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx queue (ath_tx_complete_aggr). BAW (block ack window) is not updated. BAW logic is damaged and ath9k transmission is disabled. Signed-off-by: Tong Zhu <zhutong@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-15Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-04-14net: let skb_orphan_partial wake-up waiters.Paolo Abeni
commit 9adc89af724f12a03b47099cd943ed54e877cd59 upstream. Currently the mentioned helper can end-up freeing the socket wmem without waking-up any processes waiting for more write memory. If the partially orphaned skb is attached to an UDP (or raw) socket, the lack of wake-up can hang the user-space. Even for TCP sockets not calling the sk destructor could have bad effects on TSQ. Address the issue using skb_orphan to release the sk wmem before setting the new sock_efree destructor. Additionally bundle the whole ownership update in a new helper, so that later other potential users could avoid duplicate code. v1 -> v2: - use skb_orphan() instead of sort of open coding it (Eric) - provide an helper for the ownership change (Eric) Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-12Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-04-07bpf: Remove MTU check in __bpf_skb_max_lenJesper Dangaard Brouer
commit 6306c1189e77a513bf02720450bb43bd4ba5d8ae upstream. Multiple BPF-helpers that can manipulate/increase the size of the SKB uses __bpf_skb_max_len() as the max-length. This function limit size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grow the packet size, then it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Further more, current MTU check in __bpf_skb_max_len uses the MTU from ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is in-effect due to this being called from softirq context see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287788936.790810.2937823995775097177.stgit@firesoul Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-07flow_dissector: fix TTL and TOS dissection on IPv4 fragmentsDavide Caratti
[ Upstream commit d2126838050ccd1dadf310ffb78b2204f3b032b9 ] the following command: # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \ $tcflags dst_ip 192.0.2.2 ip_ttl 63 action drop doesn't drop all IPv4 packets that match the configured TTL / destination address. In particular, if "fragment offset" or "more fragments" have non zero value in the IPv4 header, setting of FLOW_DISSECTOR_KEY_IP is simply ignored. Fix this dissecting IPv4 TTL and TOS before fragment info; while at it, add a selftest for tc flower's match on 'ip_ttl' that verifies the correct behavior. Fixes: 518d8a2e9bad ("net/flow_dissector: add support for dissection of misc ip header fields") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-04Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield
2021-03-30can: dev: Move device back to init netns on owning netns deleteMartin Willi
commit 3a5ca857079ea022e0b1b17fc154f7ad7dbc150f upstream. When a non-initial netns is destroyed, the usual policy is to delete all virtual network interfaces contained, but move physical interfaces back to the initial netns. This keeps the physical interface visible on the system. CAN devices are somewhat special, as they define rtnl_link_ops even if they are physical devices. If a CAN interface is moved into a non-initial netns, destroying that netns lets the interface vanish instead of moving it back to the initial netns. default_device_exit() skips CAN interfaces due to having rtnl_link_ops set. Reproducer: ip netns add foo ip link set can0 netns foo ip netns delete foo WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60 CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1 Workqueue: netns cleanup_net [<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14) [<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8) [<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114) [<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac) [<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60) [<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380) [<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438) [<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8) [<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c) [<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c) To properly restore physical CAN devices to the initial netns on owning netns exit, introduce a flag on rtnl_link_ops that can be set by drivers. For CAN devices setting this flag, default_device_exit() considers them non-virtual, applying the usual namespace move. The issue was introduced in the commit mentioned below, as at that time CAN devices did not have a dellink() operation. Fixes: e008b5fc8dc7 ("net: Simplfy default_device_exit and improve batching.") Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-08Merge branch 'v5.4/standard/base' into v5.4/standard/ti-j72xBruce Ashfield