summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2019-07-03Linux 4.19.57v4.19.57Greg Kroah-Hartman
2019-07-03arm64: insn: Fix ldadd instruction encodingJean-Philippe Brucker
commit c5e2edeb01ae9ffbdde95bdcdb6d3614ba1eb195 upstream. GCC 8.1.0 reports that the ldadd instruction encoding, recently added to insn.c, doesn't match the mask and couldn't possibly be identified: linux/arch/arm64/include/asm/insn.h: In function 'aarch64_insn_is_ldadd': linux/arch/arm64/include/asm/insn.h:280:257: warning: bitwise comparison always evaluates to false [-Wtautological-compare] Bits [31:30] normally encode the size of the instruction (1 to 8 bytes) and the current instruction value only encodes the 4- and 8-byte variants. At the moment only the BPF JIT needs this instruction, and doesn't require the 1- and 2-byte variants, but to be consistent with our other ldr and str instruction encodings, clear the size field in the insn value. Fixes: 34b8ab091f9ef57a ("bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reported-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03usb: dwc3: Reset num_trbs after skippingThinh Nguyen
commit c7152763f02e05567da27462b2277a554e507c89 upstream. Currently req->num_trbs is not reset after the TRBs are skipped and processed from the cancelled list. The gadget driver may reuse the request with an invalid req->num_trbs, and DWC3 will incorrectly skip trbs. To fix this, simply reset req->num_trbs to 0 after skipping through all of them. Fixes: c3acd5901414 ("usb: dwc3: gadget: use num_trbs when skipping TRBs on ->dequeue()") Signed-off-by: Thinh Nguyen <thinhn@synopsys.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Cc: Sasha Levin <sashal@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skbXin Long
commit c3bcde026684c62d7a2b6f626dc7cf763833875c upstream. udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03RDMA: Directly cast the sockaddr union to sockaddrJason Gunthorpe
commit 641114d2af312d39ca9bbc2369d18a5823da51c6 upstream. gcc 9 now does allocation size tracking and thinks that passing the member of a union and then accessing beyond that member's bounds is an overflow. Instead of using the union member, use the entire union with a cast to get to the sockaddr. gcc will now know that the memory extends the full size of the union. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03futex: Update comments and docs about return values of arch futex codeWill Deacon
commit 427503519739e779c0db8afe876c1b33f3ac60ae upstream. The architecture implementations of 'arch_futex_atomic_op_inuser()' and 'futex_atomic_cmpxchg_inatomic()' are permitted to return only -EFAULT, -EAGAIN or -ENOSYS in the case of failure. Update the comments in the asm-generic/ implementation and also a stray reference in the robust futex documentation. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf, arm64: use more scalable stadd over ldxr / stxr loop in xaddDaniel Borkmann
commit 34b8ab091f9ef57a2bb3c8c8359a0a03a8abf2f9 upstream. Since ARMv8.1 supplement introduced LSE atomic instructions back in 2016, lets add support for STADD and use that in favor of LDXR / STXR loop for the XADD mapping if available. STADD is encoded as an alias for LDADD with XZR as the destination register, therefore add LDADD to the instruction encoder along with STADD as special case and use it in the JIT for CPUs that advertise LSE atomics in CPUID register. If immediate offset in the BPF XADD insn is 0, then use dst register directly instead of temporary one. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03arm64: futex: Avoid copying out uninitialised stack in failed cmpxchg()Will Deacon
commit 8e4e0ac02b449297b86498ac24db5786ddd9f647 upstream. Returning an error code from futex_atomic_cmpxchg_inatomic() indicates that the caller should not make any use of *uval, and should instead act upon on the value of the error code. Although this is implemented correctly in our futex code, we needlessly copy uninitialised stack to *uval in the error case, which can easily be avoided. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_errMartin KaFai Lau
commit 4ac30c4b3659efac031818c418beb51e630d512d upstream. __udp6_lib_err() may be called when handling icmpv6 message. For example, the icmpv6 toobig(type=2). __udp6_lib_lookup() is then called which may call reuseport_select_sock(). reuseport_select_sock() will call into a bpf_prog (if there is one). reuseport_select_sock() is expecting the skb->data pointing to the transport header (udphdr in this case). For example, run_bpf_filter() is pulling the transport header. However, in the __udp6_lib_err() path, the skb->data is pointing to the ipv6hdr instead of the udphdr. One option is to pull and push the ipv6hdr in __udp6_lib_err(). Instead of doing this, this patch follows how the original commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") was done in IPv4, which has passed a NULL skb pointer to reuseport_select_sock(). Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Cc: Craig Gallek <kraig@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: udp: Avoid calling reuseport's bpf_prog from udp_groMartin KaFai Lau
commit 257a525fe2e49584842c504a92c27097407f778f upstream. When the commit a6024562ffd7 ("udp: Add GRO functions to UDP socket") added udp[46]_lib_lookup_skb to the udp_gro code path, it broke the reuseport_select_sock() assumption that skb->data is pointing to the transport header. This patch follows an earlier __udp6_lib_err() fix by passing a NULL skb to avoid calling the reuseport's bpf_prog. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: fix unconnected udp hooksDaniel Borkmann
commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream. Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: fix nested bpf tracepoints with per-cpu dataMatt Mullins
commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream. BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as they do not increment bpf_prog_active while executing. This enables three levels of nesting, to support - a kprobe or raw tp or perf event, - another one of the above that irq context happens to call, and - another one in nmi context (at most one of which may be a kprobe or perf event). Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") Signed-off-by: Matt Mullins <mmullins@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: lpm_trie: check left child of last leftmost node for NULLJonathan Lemon
commit da2577fdd0932ea4eefe73903f1130ee366767d2 upstream. If the leftmost parent node of the tree has does not have a child on the left side, then trie_get_next_key (and bpftool map dump) will not look at the child on the right. This leads to the traversal missing elements. Lookup is not affected. Update selftest to handle this case. Reproducer: bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \ value 1 entries 256 name test_lpm flags 1 bpftool map update pinned /sys/fs/bpf/lpm key 8 0 0 0 0 0 value 1 bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0 0 128 value 2 bpftool map dump pinned /sys/fs/bpf/lpm Returns only 1 element. (2 expected) Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE") Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bpf: simplify definition of BPF_FIB_LOOKUP related flagsMartynas Pumputis
commit b1d6c15b9d824a58c5415673f374fac19e8eccdf upstream. Previously, the BPF_FIB_LOOKUP_{DIRECT,OUTPUT} flags in the BPF UAPI were defined with the help of BIT macro. This had the following issues: - In order to use any of the flags, a user was required to depend on <linux/bits.h>. - No other flag in bpf.h uses the macro, so it seems that an unwritten convention is to use (1 << (nr)) to define BPF-related flags. Fixes: 87f5fc7e48dd ("bpf: Provide helper to do forwarding lookups in kernel FIB table") Signed-off-by: Martynas Pumputis <m@lambda.lt> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03tun: wake up waitqueues after IFF_UP is setFei Li
[ Upstream commit 72b319dc08b4924a29f5e2560ef6d966fa54c429 ] Currently after setting tap0 link up, the tun code wakes tx/rx waited queues up in tun_net_open() when .ndo_open() is called, however the IFF_UP flag has not been set yet. If there's already a wait queue, it would fail to transmit when checking the IFF_UP flag in tun_sendmsg(). Then the saving vhost_poll_start() will add the wq into wqh until it is waken up again. Although this works when IFF_UP flag has been set when tun_chr_poll detects; this is not true if IFF_UP flag has not been set at that time. Sadly the latter case is a fatal error, as the wq will never be waken up in future unless later manually setting link up on purpose. Fix this by moving the wakeup process into the NETDEV_UP event notifying process, this makes sure IFF_UP has been set before all waited queues been waken up. Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03tipc: check msg->req data len in tipc_nl_compat_bearer_disableXin Long
[ Upstream commit 4f07b80c973348a99b5d2a32476a2e7877e94a05 ] This patch is to fix an uninit-value issue, reported by syzbot: BUG: KMSAN: uninit-value in memchr+0xce/0x110 lib/string.c:981 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x130/0x2a0 mm/kmsan/kmsan.c:622 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:310 memchr+0xce/0x110 lib/string.c:981 string_is_valid net/tipc/netlink_compat.c:176 [inline] tipc_nl_compat_bearer_disable+0x2a1/0x480 net/tipc/netlink_compat.c:449 __tipc_nl_compat_doit net/tipc/netlink_compat.c:327 [inline] tipc_nl_compat_doit+0x3ac/0xb00 net/tipc/netlink_compat.c:360 tipc_nl_compat_handle net/tipc/netlink_compat.c:1178 [inline] tipc_nl_compat_recv+0x1b1b/0x27b0 net/tipc/netlink_compat.c:1281 TLV_GET_DATA_LEN() may return a negtive int value, which will be used as size_t (becoming a big unsigned long) passed into memchr, cause this issue. Similar to what it does in tipc_nl_compat_bearer_enable(), this fix is to return -EINVAL when TLV_GET_DATA_LEN() is negtive in tipc_nl_compat_bearer_disable(), as well as in tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats(). v1->v2: - add the missing Fixes tags per Eric's request. Fixes: 0762216c0ad2 ("tipc: fix uninit-value in tipc_nl_compat_bearer_enable") Fixes: 8b66fee7f8ee ("tipc: fix uninit-value in tipc_nl_compat_link_reset_stats") Reported-by: syzbot+30eaa8bf392f7fafffaf@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03tipc: change to use register_pernet_deviceXin Long
[ Upstream commit c492d4c74dd3f87559883ffa0f94a8f1ae3fe5f5 ] This patch is to fix a dst defcnt leak, which can be reproduced by doing: # ip net a c; ip net a s; modprobe tipc # ip net e s ip l a n eth1 type veth peer n eth1 netns c # ip net e c ip l s lo up; ip net e c ip l s eth1 up # ip net e s ip l s lo up; ip net e s ip l s eth1 up # ip net e c ip a a 1.1.1.2/8 dev eth1 # ip net e s ip a a 1.1.1.1/8 dev eth1 # ip net e c tipc b e m udp n u1 localip 1.1.1.2 # ip net e s tipc b e m udp n u1 localip 1.1.1.1 # ip net d c; ip net d s; rmmod tipc and it will get stuck and keep logging the error: unregister_netdevice: waiting for lo to become free. Usage count = 1 The cause is that a dst is held by the udp sock's sk_rx_dst set on udp rx path with udp_early_demux == 1, and this dst (eventually holding lo dev) can't be released as bearer's removal in tipc pernet .exit happens after lo dev's removal, default_device pernet .exit. "There are two distinct types of pernet_operations recognized: subsys and device. At creation all subsys init functions are called before device init functions, and at destruction all device exit functions are called before subsys exit function." So by calling register_pernet_device instead to register tipc_net_ops, the pernet .exit() will be invoked earlier than loopback dev's removal when a netns is being destroyed, as fou/gue does. Note that vxlan and geneve udp tunnels don't have this issue, as the udp sock is released in their device ndo_stop(). This fix is also necessary for tipc dst_cache, which will hold dsts on tx path and I will introduce in my next patch. Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03team: Always enable vlan tx offloadYueHaibing
[ Upstream commit ee4297420d56a0033a8593e80b33fcc93fda8509 ] We should rather have vlan_tci filled all the way down to the transmitting netdevice and let it do the hw/sw vlan implementation. Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03sctp: change to hold sk after auth shkey is created successfullyXin Long
[ Upstream commit 25bff6d5478b2a02368097015b7d8eb727c87e16 ] Now in sctp_endpoint_init(), it holds the sk then creates auth shkey. But when the creation fails, it doesn't release the sk, which causes a sk defcnf leak, Here to fix it by only holding the sk when auth shkey is created successfully. Fixes: a29a5bd4f5c3 ("[SCTP]: Implement SCTP-AUTH initializations.") Reported-by: syzbot+afabda3890cc2f765041@syzkaller.appspotmail.com Reported-by: syzbot+276ca1c77a19977c0130@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03net: stmmac: set IC bit when transmitting frames with HW timestampRoland Hii
[ Upstream commit d0bb82fd60183868f46c8ccc595a3d61c3334a18 ] When transmitting certain PTP frames, e.g. SYNC and DELAY_REQ, the PTP daemon, e.g. ptp4l, is polling the driver for the frame transmit hardware timestamp. The polling will most likely timeout if the tx coalesce is enabled due to the Interrupt-on-Completion (IC) bit is not set in tx descriptor for those frames. This patch will ignore the tx coalesce parameter and set the IC bit when transmitting PTP frames which need to report out the frame transmit hardware timestamp to user space. Fixes: f748be531d70 ("net: stmmac: Rework coalesce timer and fix multi-queue races") Signed-off-by: Roland Hii <roland.king.guan.hii@intel.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03net: stmmac: fixed new system time seconds value calculationRoland Hii
[ Upstream commit a1e5388b4d5fc78688e5e9ee6641f779721d6291 ] When ADDSUB bit is set, the system time seconds field is calculated as the complement of the seconds part of the update value. For example, if 3.000000001 seconds need to be subtracted from the system time, this field is calculated as 2^32 - 3 = 4294967296 - 3 = 0x100000000 - 3 = 0xFFFFFFFD Previously, the 0x100000000 is mistakenly written as 100000000. This is further simplified from sec = (0x100000000ULL - sec); to sec = -sec; Fixes: ba1ffd74df74 ("stmmac: fix PTP support for GMAC4") Signed-off-by: Roland Hii <roland.king.guan.hii@intel.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03net: remove duplicate fetch in sock_getsockoptJingYi Hou
[ Upstream commit d0bae4a0e3d8c5690a885204d7eb2341a5b4884d ] In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou <houjingyi647@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03net/packet: fix memory leak in packet_set_ring()Eric Dumazet
[ Upstream commit 55655e3d1197fff16a7a05088fb0e5eba50eac55 ] syzbot found we can leak memory in packet_set_ring(), if user application provides buggy parameters. Fixes: 7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loopStephen Suryaputra
[ Upstream commit 38c73529de13e1e10914de7030b659a2f8b01c3b ] In commit 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic"), the dif argument to __raw_v4_lookup() is coming from the returned value of inet_iif() but the change was done only for the first lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex. Fixes: 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic") Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03bonding: Always enable vlan tx offloadYueHaibing
[ Upstream commit 30d8177e8ac776d89d387fad547af6a0f599210e ] We build vlan on top of bonding interface, which vlan offload is off, bond mode is 802.3ad (LACP) and xmit_hash_policy is BOND_XMIT_POLICY_ENCAP34. Because vlan tx offload is off, vlan tci is cleared and skb push the vlan header in validate_xmit_vlan() while sending from vlan devices. Then in bond_xmit_hash, __skb_flow_dissect() fails to get information from protocol headers encapsulated within vlan, because 'nhoff' is points to IP header, so bond hashing is based on layer 2 info, which fails to distribute packets across slaves. This patch always enable bonding's vlan tx offload, pass the vlan packets to the slave devices with vlan tci, let them to handle vlan implementation. Fixes: 278339a42a1b ("bonding: propogate vlan_features to bonding master") Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03af_packet: Block execution of tasks waiting for transmit to complete in ↵Neil Horman
AF_PACKET [ Upstream commit 89ed5b519004a7706f50b70f611edbd3aaacff2c ] When an application is run that: a) Sets its scheduler to be SCHED_FIFO and b) Opens a memory mapped AF_PACKET socket, and sends frames with the MSG_DONTWAIT flag cleared, its possible for the application to hang forever in the kernel. This occurs because when waiting, the code in tpacket_snd calls schedule, which under normal circumstances allows other tasks to run, including ksoftirqd, which in some cases is responsible for freeing the transmitted skb (which in AF_PACKET calls a destructor that flips the status bit of the transmitted frame back to available, allowing the transmitting task to complete). However, when the calling application is SCHED_FIFO, its priority is such that the schedule call immediately places the task back on the cpu, preventing ksoftirqd from freeing the skb, which in turn prevents the transmitting task from detecting that the transmission is complete. We can fix this by converting the schedule call to a completion mechanism. By using a completion queue, we force the calling task, when it detects there are no more frames to send, to schedule itself off the cpu until such time as the last transmitted skb is freed, allowing forward progress to be made. Tested by myself and the reporter, with good results Change Notes: V1->V2: Enhance the sleep logic to support being interruptible and allowing for honoring to SK_SNDTIMEO (Willem de Bruijn) V2->V3: Rearrage the point at which we wait for the completion queue, to avoid needing to check for ph/skb being null at the end of the loop. Also move the complete call to the skb destructor to avoid needing to modify __packet_set_status. Also gate calling complete on packet_read_pending returning zero to avoid multiple calls to complete. (Willem de Bruijn) Move timeo computation within loop, to re-fetch the socket timeout since we also use the timeo variable to record the return code from the wait_for_complete call (Neil Horman) V3->V4: Willem has requested that the control flow be restored to the previous state. Doing so lets us eliminate the need for the po->wait_on_complete flag variable, and lets us get rid of the packet_next_frame function, but introduces another complexity. Specifically, but using the packet pending count, we can, if an applications calls sendmsg multiple times with MSG_DONTWAIT set, each set of transmitted frames, when complete, will cause tpacket_destruct_skb to issue a complete call, for which there will never be a wait_on_completion call. This imbalance will lead to any future call to wait_for_completion here to return early, when the frames they sent may not have completed. To correct this, we need to re-init the completion queue on every call to tpacket_snd before we enter the loop so as to ensure we wait properly for the frames we send in this iteration. Change the timeout and interrupted gotos to out_put rather than out_status so that we don't try to free a non-existant skb Clean up some extra newlines (Willem de Bruijn) Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03eeprom: at24: fix unexpected timeout under high loadWang Xin
commit 9a9e295e7c5c0409c020088b0ae017e6c2b7df6e upstream. Within at24_loop_until_timeout the timestamp used for timeout checking is recorded after the I2C transfer and sleep_range(). Under high CPU load either the execution time for I2C transfer or sleep_range() could actually be larger than the timeout value. Worst case the I2C transfer is only tried once because the loop will exit due to the timeout although the EEPROM is now ready. To fix this issue the timestamp is recorded at the beginning of each iteration. That is, before I2C transfer and sleep. Then the timeout is actually checked against the timestamp of the previous iteration. This makes sure that even if the timeout is reached, there is still one more chance to try the I2C transfer in case the EEPROM is ready. Example: If you have a system which combines high CPU load with repeated EEPROM writes you will run into the following scenario. - System makes a successful regmap_bulk_write() to EEPROM. - System wants to perform another write to EEPROM but EEPROM is still busy with the last write. - Because of high CPU load the usleep_range() will sleep more than 25 ms (at24_write_timeout). - Within the over-long sleeping the EEPROM finished the previous write operation and is ready again. - at24_loop_until_timeout() will detect timeout and won't try to write. Signed-off-by: Wang Xin <xin.wang7@cn.bosch.com> Signed-off-by: Mark Jonas <mark.jonas@de.bosch.com> Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03irqchip/mips-gic: Use the correct local interrupt map registersPaul Burton
commit 6d4d367d0e9ffab4d64a3436256a6a052dc1195d upstream. The MIPS GIC contains a block of registers used to map local interrupts to a particular CPU interrupt pin. Since these registers are found at a consecutive range of addresses we access them using an index, via the (read|write)_gic_v[lo]_map accessor functions. We currently use values from enum mips_gic_local_interrupt as those indices. Unfortunately whilst enum mips_gic_local_interrupt provides the correct offsets for bits in the pending & mask registers, the ordering of the map registers is subtly different... Compared with the ordering of pending & mask bits, the map registers move the FDC from the end of the list to index 3 after the timer interrupt. As a result the performance counter & software interrupts are therefore at indices 4-6 rather than indices 3-5. Notably this causes problems with performance counter interrupts being incorrectly mapped on some systems, and presumably will also cause problems for FDC interrupts. Introduce a function to map from enum mips_gic_local_interrupt to the index of the corresponding map register, and use it to ensure we access the map registers for the correct interrupts. Signed-off-by: Paul Burton <paul.burton@mips.com> Fixes: a0dc5cb5e31b ("irqchip: mips-gic: Simplify gic_local_irq_domain_map()") Fixes: da61fcf9d62a ("irqchip: mips-gic: Use irq_cpu_online to (un)mask all-VP(E) IRQs") Reported-and-tested-by: Archer Yan <ayan@wavecomp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03SUNRPC: Clean up initialisation of the struct rpc_rqstTrond Myklebust
commit 9dc6edcf676fe188430e8b119f91280bbf285163 upstream. Move the initialisation back into xprt.c. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Yihao Wu <wuyihao@linux.alibaba.com> Cc: Caspar Zhang <caspar@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03cpu/speculation: Warn on unsupported mitigations= parameterGeert Uytterhoeven
commit 1bf72720281770162c87990697eae1ba2f1d917a upstream. Currently, if the user specifies an unsupported mitigation strategy on the kernel command line, it will be ignored silently. The code will fall back to the default strategy, possibly leaving the system more vulnerable than expected. This may happen due to e.g. a simple typo, or, for a stable kernel release, because not all mitigation strategies have been backported. Inform the user by printing a message. Fixes: 98af8452945c5565 ("cpu/speculation: Add 'mitigations=' cmdline option") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03NFS/flexfiles: Use the correct TCP timeout for flexfiles I/OTrond Myklebust
commit 68f461593f76bd5f17e87cdd0bea28f4278c7268 upstream. Fix a typo where we're confusing the default TCP retrans value (NFS_DEF_TCP_RETRANS) for the default TCP timeout value. Fixes: 15d03055cf39f ("pNFS/flexfiles: Set reasonable default ...") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPTSean Christopherson
commit b6b80c78af838bef17501416d5d383fedab0010a upstream. SVM's Nested Page Tables (NPT) reuses x86 paging for the host-controlled page walk. For 32-bit KVM, this means PAE paging is used even when TDP is enabled, i.e. the PAE root array needs to be allocated. Fixes: ee6268ba3a68 ("KVM: x86: Skip pae_root shadow allocation if tdp enabled") Cc: stable@vger.kernel.org Reported-by: Jiri Palecek <jpalecek@web.de> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Jiri Palecek <jpalecek@web.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03x86/resctrl: Prevent possible overrun during bitmap operationsReinette Chatre
commit 32f010deab575199df4ebe7b6aec20c17bb7eccd upstream. While the DOC at the beginning of lib/bitmap.c explicitly states that "The number of valid bits in a given bitmap does _not_ need to be an exact multiple of BITS_PER_LONG.", some of the bitmap operations do indeed access BITS_PER_LONG portions of the provided bitmap no matter the size of the provided bitmap. For example, if find_first_bit() is provided with an 8 bit bitmap the operation will access BITS_PER_LONG bits from the provided bitmap. While the operation ensures that these extra bits do not affect the result, the memory is still accessed. The capacity bitmasks (CBMs) are typically stored in u32 since they can never exceed 32 bits. A few instances exist where a bitmap_* operation is performed on a CBM by simply pointing the bitmap operation to the stored u32 value. The consequence of this pattern is that some bitmap_* operations will access out-of-bounds memory when interacting with the provided CBM. This same issue has previously been addressed with commit 49e00eee0061 ("x86/intel_rdt: Fix out-of-bounds memory access in CBM tests") but at that time not all instances of the issue were fixed. Fix this by using an unsigned long to store the capacity bitmask data that is passed to bitmap functions. Fixes: e651901187ab ("x86/intel_rdt: Introduce "bit_usage" to display cache allocations details") Fixes: f4e80d67a527 ("x86/intel_rdt: Resctrl files reflect pseudo-locked information") Fixes: 95f0b77efa57 ("x86/intel_rdt: Initialize new resource group with sane defaults") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: stable <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/58c9b6081fd9bf599af0dfc01a6fdd335768efef.1560975645.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03x86/microcode: Fix the microcode load on CPU hotplug for realThomas Gleixner
commit 5423f5ce5ca410b3646f355279e4e937d452e622 upstream. A recent change moved the microcode loader hotplug callback into the early startup phase which is running with interrupts disabled. It missed that the callbacks invoke sysfs functions which might sleep causing nice 'might sleep' splats with proper debugging enabled. Split the callbacks and only load the microcode in the early startup phase and move the sysfs handling back into the later threaded and preemptible bringup phase where it was before. Fixes: 78f4e932f776 ("x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906182228350.1766@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03x86/speculation: Allow guests to use SSBD even if host does notAlejandro Jimenez
commit c1f7fec1eb6a2c86d01bc22afce772c743451d88 upstream. The bits set in x86_spec_ctrl_mask are used to calculate the guest's value of SPEC_CTRL that is written to the MSR before VMENTRY, and control which mitigations the guest can enable. In the case of SSBD, unless the host has enabled SSBD always on mode (by passing "spec_store_bypass_disable=on" in the kernel parameters), the SSBD bit is not set in the mask and the guest can not properly enable the SSBD always on mitigation mode. This has been confirmed by running the SSBD PoC on a guest using the SSBD always on mitigation mode (booted with kernel parameter "spec_store_bypass_disable=on"), and verifying that the guest is vulnerable unless the host is also using SSBD always on mode. In addition, the guest OS incorrectly reports the SSB vulnerability as mitigated. Always set the SSBD bit in x86_spec_ctrl_mask when the host CPU supports it, allowing the guest to use SSBD whether or not the host has chosen to enable the mitigation in any of its modes. Fixes: be6fcb5478e9 ("x86/bugs: Rework spec_ctrl base and mask logic") Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: bp@alien8.de Cc: rkrcmar@redhat.com Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1560187210-11054-1-git-send-email-alejandro.j.jimenez@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03scsi: vmw_pscsi: Fix use-after-free in pvscsi_queue_lck()Jan Kara
commit 240b4cc8fd5db138b675297d4226ec46594d9b3b upstream. Once we unlock adapter->hw_lock in pvscsi_queue_lck() nothing prevents just queued scsi_cmnd from completing and freeing the request. Thus cmd->cmnd[0] dereference can dereference already freed request leading to kernel crashes or other issues (which one of our customers observed). Store cmd->cmnd[0] in a local variable before unlocking adapter->hw_lock to fix the issue. CC: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03dm log writes: make sure super sector log updates are written in orderzhangyi (F)
commit 211ad4b733037f66f9be0a79eade3da7ab11cbb8 upstream. Currently, although we submit super bios in order (and super.nr_entries is incremented by each logged entry), submit_bio() is async so each super sector may not be written to log device in order and then the final nr_entries may be smaller than it should be. This problem can be reproduced by the xfstests generic/455 with ext4: QA output created by 455 -Silence is golden +mark 'end' does not exist Fix this by serializing submission of super sectors to make sure each is written to the log disk in order. Fixes: 0e9cebe724597 ("dm: add log writes target") Cc: stable@vger.kernel.org Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03mm/page_idle.c: fix oops because end_pfn is larger than max_pfnColin Ian King
commit 7298e3b0a149c91323b3205d325e942c3b3b9ef6 upstream. Currently the calcuation of end_pfn can round up the pfn number to more than the actual maximum number of pfns, causing an Oops. Fix this by ensuring end_pfn is never more than max_pfn. This can be easily triggered when on systems where the end_pfn gets rounded up to more than max_pfn using the idle-page stress-ng stress test: sudo stress-ng --idle-page 0 BUG: unable to handle kernel paging request at 00000000000020d8 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:page_idle_get_page+0xc8/0x1a0 Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be <49> 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48 RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202 RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700 RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276 R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080 R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400 FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0 Call Trace: page_idle_bitmap_write+0x8c/0x140 sysfs_kf_bin_write+0x5c/0x70 kernfs_fop_write+0x12e/0x1b0 __vfs_write+0x1b/0x40 vfs_write+0xab/0x1b0 ksys_write+0x55/0xc0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com Fixes: 33c3fc71c8cf ("mm: introduce idle page tracking") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHugeNaoya Horiguchi
commit faf53def3b143df11062d87c12afe6afeb6f8cc7 upstream. madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline for hugepages with overcommitting enabled. That was caused by the suboptimal code in current soft-offline code. See the following part: ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { ... } else { /* * We set PG_hwpoison only when the migration source hugepage * was successfully dissolved, because otherwise hwpoisoned * hugepage remains on free hugepage list, then userspace will * find it as SIGBUS by allocation failure. That's not expected * in soft-offlining. */ ret = dissolve_free_huge_page(page); if (!ret) { if (set_hwpoison_free_buddy_page(page)) num_poisoned_pages_inc(); } } return ret; Here dissolve_free_huge_page() returns -EBUSY if the migration source page was freed into buddy in migrate_pages(), but even in that case we actually has a chance that set_hwpoison_free_buddy_page() succeeds. So that means current code gives up offlining too early now. dissolve_free_huge_page() checks that a given hugepage is suitable for dissolving, where we should return success for !PageHuge() case because the given hugepage is considered as already dissolved. This change also affects other callers of dissolve_free_huge_page(), which are cleaned up together. [n-horiguchi@ah.jp.nec.com: v3] Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Chen, Jerry T <jerry.t.chen@intel.com> Tested-by: Chen, Jerry T <jerry.t.chen@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Cc: "Chen, Jerry T" <jerry.t.chen@intel.com> Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() failsNaoya Horiguchi
commit b38e5962f8ed0d2a2b28a887fc2221f7f41db119 upstream. The pass/fail of soft offline should be judged by checking whether the raw error page was finally contained or not (i.e. the result of set_hwpoison_free_buddy_page()), but current code do not work like that. It might lead us to misjudge the test result when set_hwpoison_free_buddy_page() fails. Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may not offline the original page and will not return an error. Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Cc: "Chen, Jerry T" <jerry.t.chen@intel.com> Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03clk: socfpga: stratix10: fix divider entry for the emac clocksDinh Nguyen
commit 74684cce5ebd567b01e9bc0e9a1945c70a32f32f upstream. The fixed dividers for the emac clocks should be 2 not 4. Cc: stable@vger.kernel.org Signed-off-by: Dinh Nguyen <dinguyen@kernel.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03fs/binfmt_flat.c: make load_flat_shared_library() workJann Horn
commit 867bfa4a5fcee66f2b25639acae718e8b28b25a5 upstream. load_flat_shared_library() is broken: It only calls load_flat_file() if prepare_binprm() returns zero, but prepare_binprm() returns the number of bytes read - so this only happens if the file is empty. Instead, call into load_flat_file() if the number of bytes read is non-negative. (Even if the number of bytes is zero - in that case, load_flat_file() will see nullbytes and return a nice -ENOEXEC.) In addition, remove the code related to bprm creds and stop using prepare_binprm() - this code is loading a library, not a main executable, and it only actually uses the members "buf", "file" and "filename" of the linux_binprm struct. Instead, call kernel_read() directly. Link: http://lkml.kernel.org/r/20190524201817.16509-1-jannh@google.com Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses") Signed-off-by: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemaskzhong jiang
commit 29b190fa774dd1b72a1a6f19687d55dc72ea83be upstream. mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE mempoclicies when the tasks's cpuset's mems_allowed changes. For policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES, it works by remapping the policy's allowed nodes (stored in v.nodes) using the previous value of mems_allowed (stored in w.cpuset_mems_allowed) as the domain of map and the new mems_allowed (passed as nodes) as the range of the map (see the comment of bitmap_remap() for details). The result of remapping is stored back as policy's nodemask in v.nodes, and the new value of mems_allowed should be stored in w.cpuset_mems_allowed to facilitate the next rebind, if it happens. However, 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") introduced a bug where the result of remapping is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's allowed nodes can evolve in an unexpected way after a series of rebinding due to cpuset mems_allowed changes, possibly binding to a wrong node or a smaller number of nodes which may e.g. overload them. This patch fixes the bug so rebinding again works as intended. [vbabka@suse.cz: new changlog] Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03fs/proc/array.c: allow reporting eip/esp for all coredumping threadsJohn Ogness
commit cb8f381f1613cafe3aec30809991cd56e7135d92 upstream. 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") stopped reporting eip/esp and fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") reintroduced the feature to fix a regression with userspace core dump handlers (such as minicoredumper). Because PF_DUMPCORE is only set for the primary thread, this didn't fix the original problem for secondary threads. Allow reporting the eip/esp for all threads by checking for PF_EXITING as well. This is set for all the other threads when they are killed. coredump_wait() waits for all the tasks to become inactive before proceeding to invoke a core dumper. Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de Fixes: fd7d56270b526ca3 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reported-by: Jan Luebbe <jlu@pengutronix.de> Tested-by: Jan Luebbe <jlu@pengutronix.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanupJack Pham
commit bd6742249b9ca918565e4e3abaa06665e587f4b5 upstream OUT endpoint requests may somtimes have this flag set when preparing to be submitted to HW indicating that there is an additional TRB chained to the request for alignment purposes. If that request is removed before the controller can execute the transfer (e.g. ep_dequeue/ep_disable), the request will not go through the dwc3_gadget_ep_cleanup_completed_request() handler and will not have its needs_extra_trb flag cleared when dwc3_gadget_giveback() is called. This same request could be later requeued for a new transfer that does not require an extra TRB and if it is successfully completed, the cleanup and TRB reclamation will incorrectly process the additional TRB which belongs to the next request, and incorrectly advances the TRB dequeue pointer, thereby messing up calculation of the next requeust's actual/remaining count when it completes. The right thing to do here is to ensure that the flag is cleared before it is given back to the function driver. A good place to do that is in dwc3_gadget_del_and_unmap_request(). Fixes: c6267a51639b ("usb: dwc3: gadget: align transfers to wMaxPacketSize") Cc: Fei Yang <fei.yang@intel.com> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: linux-usb@vger.kernel.org Cc: stable@vger.kernel.org # 4.19.y Signed-off-by: Jack Pham <jackp@codeaurora.org> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> (cherry picked from commit bd6742249b9ca918565e4e3abaa06665e587f4b5) Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-03usb: dwc3: gadget: remove wait_end_transferFelipe Balbi
commit fec9095bdef4e7c988adb603d0d4f92ee735d4a1 upstream Now that we have a list of cancelled requests, we can skip over TRBs when END_TRANSFER command completes. Cc: Fei Yang <fei.yang@intel.com> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: linux-usb@vger.kernel.org Cc: stable@vger.kernel.org # 4.19.y Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> (cherry picked from commit fec9095bdef4e7c988adb603d0d4f92ee735d4a1) Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-03usb: dwc3: gadget: move requests to cancelled_listFelipe Balbi
commit d4f1afe5e896c18ae01099a85dab5e1a198bd2a8 upstream Whenever we have a request in flight, we can move it to the cancelled list and later simply iterate over that list and skip over any TRBs we find. Cc: Fei Yang <fei.yang@intel.com> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: linux-usb@vger.kernel.org Cc: stable@vger.kernel.org # 4.19.y Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> (cherry picked from commit d4f1afe5e896c18ae01099a85dab5e1a198bd2a8) Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-03usb: dwc3: gadget: introduce cancelled_listFelipe Balbi
commit d5443bbf5fc8f8389cce146b1fc2987cdd229d12 upstream This list will host cancelled requests who still have TRBs being processed. Cc: Fei Yang <fei.yang@intel.com> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: linux-usb@vger.kernel.org Cc: stable@vger.kernel.org # 4.19.y Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> (cherry picked from commit d5443bbf5fc8f8389cce146b1fc2987cdd229d12) Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-03usb: dwc3: gadget: extract dwc3_gadget_ep_skip_trbs()Felipe Balbi
commit 7746a8dfb3f9c91b3a0b63a1d5c2664410e6498d upstream Extract the logic for skipping over TRBs to its own function. This makes the code slightly more readable and makes it easier to move this call to its final resting place as a following patch. Cc: Fei Yang <fei.yang@intel.com> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: linux-usb@vger.kernel.org Cc: stable@vger.kernel.org # 4.19.y Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> (cherry picked from commit 7746a8dfb3f9c91b3a0b63a1d5c2664410e6498d) Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-03usb: dwc3: gadget: use num_trbs when skipping TRBs on ->dequeue()Felipe Balbi
commit c3acd59014148470dc58519870fbc779785b4bf7 upstream Now that we track how many TRBs a request uses, it's easier to skip over them in case of a call to usb_ep_dequeue(). Let's do so and simplify the code a bit. Cc: Fei Yang <fei.yang@intel.com> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: linux-usb@vger.kernel.org Cc: stable@vger.kernel.org # 4.19.y Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> (cherry picked from commit c3acd59014148470dc58519870fbc779785b4bf7) Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>