aboutsummaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)Author
2019-05-25fuse: Add FOPEN_STREAM to use stream_open()Kirill Smelkov
commit bbd84f33652f852ce5992d65db4d020aba21f882 upstream. Starting from commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") files opened even via nonseekable_open gate read and write via lock and do not allow them to be run simultaneously. This can create read vs write deadlock if a filesystem is trying to implement a socket-like file which is intended to be simultaneously used for both read and write from filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") for details and e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus. To avoid such deadlock it was tempting to adjust fuse_finish_open to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the opened handler is having stream-like semantics; does not use file position and thus the kernel is free to issue simultaneous read and write request on opened file handle. This patch together with stream_open() should be added to stable kernels starting from v3.14+. This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fuse_finish_open ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25PCI: Work around Pericom PCIe-to-PCI bridge Retrain Link erratumStefan Mätje
commit 4ec73791a64bab25cabf16a6067ee478692e506d upstream. Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode (conventional PCI on primary side, PCIe on downstream side), the Retrain Link bit needs to be cleared manually to allow the link training to complete successfully. If it is not cleared manually, the link training is continuously restarted and no devices below the PCI-to-PCIe bridge can be accessed. That means drivers for devices below the bridge will be loaded but won't work and may even crash because the driver is only reading 0xffff. See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for details. Devices known as affected so far are: PI7C9X110, PI7C9X111SL, PI7C9X130. Add a new flag, clear_retrain_link, in struct pci_dev. Quirks for affected devices set this bit. Note that pcie_retrain_link() lives in aspm.c because that's currently the only place we use it, but this erratum is not specific to ASPM, and we may retrain links for other reasons in the future. Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu> [bhelgaas: apply regardless of CONFIG_PCIEASPM] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> CC: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25x86/mpx, mm/core: Fix recursive munmap() corruptionDave Hansen
commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream. This is a bit of a mess, to put it mildly. But, it's a bug that only seems to have showed up in 4.20 but wasn't noticed until now, because nobody uses MPX. MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds tables that protect other areas of memory. When memory is unmapped, there is also a need to unmap the MPX bounds tables. Barring this, unused bounds tables can eat 80% of the address space. But, the recursive do_munmap() that gets called vi arch_unmap() wreaks havoc with __do_munmap()'s state. It can result in freeing populated page tables, accessing bogus VMA state, double-freed VMAs and more. See the "long story" further below for the gory details. To fix this, call arch_unmap() before __do_unmap() has a chance to do anything meaningful. Also, remove the 'vma' argument and force the MPX code to do its own, independent VMA lookup. == UML / unicore32 impact == Remove unused 'vma' argument to arch_unmap(). No functional change. I compile tested this on UML but not unicore32. == powerpc impact == powerpc uses arch_unmap() well to watch for munmap() on the VDSO and zeroes out 'current->mm->context.vdso_base'. Moving arch_unmap() makes this happen earlier in __do_munmap(). But, 'vdso_base' seems to only be used in perf and in the signal delivery that happens near the return to userspace. I can not find any likely impact to powerpc, other than the zeroing happening a little earlier. powerpc does not use the 'vma' argument and is unaffected by its removal. I compile-tested a 64-bit powerpc defconfig. == x86 impact == For the common success case this is functionally identical to what was there before. For the munmap() failure case, it's possible that some MPX tables will be zapped for memory that continues to be in use. But, this is an extraordinarily unlikely scenario and the harm would be that MPX provides no protection since the bounds table got reset (zeroed). I can't imagine anyone doing this: ptr = mmap(); // use ptr ret = munmap(ptr); if (ret) // oh, there was an error, I'll // keep using ptr. Because if you're doing munmap(), you are *done* with the memory. There's probably no good data in there _anyway_. This passes the original reproducer from Richard Biener as well as the existing mpx selftests/. The long story: munmap() has a couple of pieces: 1. Find the affected VMA(s) 2. Split the start/end one(s) if neceesary 3. Pull the VMAs out of the rbtree 4. Actually zap the memory via unmap_region(), including freeing page tables (or queueing them to be freed). 5. Fix up some of the accounting (like fput()) and actually free the VMA itself. This specific ordering was actually introduced by: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") during the 4.20 merge window. The previous __do_munmap() code was actually safe because the only thing after arch_unmap() was remove_vma_list(). arch_unmap() could not see 'vma' in the rbtree because it was detached, so it is not even capable of doing operations unsafe for remove_vma_list()'s use of 'vma'. Richard Biener reported a test that shows this in dmesg: [1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551 [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576 What triggered this was the recursive do_munmap() called via arch_unmap(). It was freeing page tables that has not been properly zapped. But, the problem was bigger than this. For one, arch_unmap() can free VMAs. But, the calling __do_munmap() has variables that *point* to VMAs and obviously can't handle them just getting freed while the pointer is still in use. I tried a couple of things here. First, I tried to fix the page table freeing problem in isolation, but I then found the VMA issue. I also tried having the MPX code return a flag if it modified the rbtree which would force __do_munmap() to re-walk to restart. That spiralled out of control in complexity pretty fast. Just moving arch_unmap() and accepting that the bonkers failure case might eat some bounds tables seems like the simplest viable fix. This was also reported in the following kernel bugzilla entry: https://bugzilla.kernel.org/show_bug.cgi?id=203123 There are some reports that this commit triggered this bug: dd2283f2605 ("mm: mmap: zap pages with read mmap_sem in munmap") While that commit certainly made the issues easier to hit, I believe the fundamental issue has been with us as long as MPX itself, thus the Fixes: tag below is for one of the original MPX commits. [ mingo: Minor edits to the changelog and the patch. ] Reported-by: Richard Biener <rguenther@suse.de> Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-um@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: stable@vger.kernel.org Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25udlfb: introduce a rendering mutexMikulas Patocka
commit babc250e278eac7b0e671bdaedf833759b43bb78 upstream. Rendering calls may be done simultaneously from the workqueue, dlfb_ops_write, dlfb_ops_ioctl, dlfb_ops_set_par and dlfb_dpy_deferred_io. The code is robust enough so that it won't crash on concurrent rendering. However, concurrent rendering may cause display corruption if the same pixel is simultaneously being rendered. In order to avoid this corruption, this patch adds a mutex around the rendering calls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Bernie Thompson <bernie@plugable.com> Cc: Ladislav Michl <ladis@linux-mips.org> Cc: <stable@vger.kernel.org> [b.zolnierkie: replace "dlfb:" with "uldfb:" in the patch summary] Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25udlfb: fix sleeping inside spinlockMikulas Patocka
commit 6b11f9d8433b471fdd3ebed232b43a4b723be6ff upstream. If a framebuffer device is used as a console, the rendering calls (copyarea, fillrect, imageblit) may be done with the console spinlock held. On udlfb, these function call dlfb_handle_damage that takes a blocking semaphore before acquiring an URB. In order to fix the bug, this patch changes the calls copyarea, fillrect and imageblit to offload USB work to a workqueue. A side effect of this patch is 3x improvement in console scrolling speed because the device doesn't have to be updated after each copyarea call. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Bernie Thompson <bernie@plugable.com> Cc: Ladislav Michl <ladis@linux-mips.org> Cc: <stable@vger.kernel.org> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25RDMA/mlx5: Use get_zeroed_page() for clock_infoJason Gunthorpe
commit ddcdc368b1033e19fd3a5f750752e10e28a87826 upstream. get_zeroed_page() returns a virtual address for the page which is better than allocating a struct page and doing a permanent kmap on it. Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25of: fix clang -Wunsequenced for be32_to_cpu()Phong Tran
commit 440868661f36071886ed360d91de83bd67c73b4f upstream. Now, make the loop explicit to avoid clang warning. ./include/linux/of.h:238:37: warning: multiple unsequenced modifications to 'cell' [-Wunsequenced] r = (r << 32) | be32_to_cpu(*(cell++)); ^~ ./include/linux/byteorder/generic.h:95:21: note: expanded from macro 'be32_to_cpu' ^ ./include/uapi/linux/byteorder/little_endian.h:40:59: note: expanded from macro '__be32_to_cpu' ^ ./include/uapi/linux/swab.h:118:21: note: expanded from macro '__swab32' ___constant_swab32(x) : \ ^ ./include/uapi/linux/swab.h:18:12: note: expanded from macro '___constant_swab32' (((__u32)(x) & (__u32)0x000000ffUL) << 24) | \ ^ Signed-off-by: Phong Tran <tranmanphong@gmail.com> Reported-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://github.com/ClangBuiltLinux/linux/issues/460 Suggested-by: David Laight <David.Laight@ACULAB.COM> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: stable@vger.kernel.org [robh: fix up whitespace] Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25dcache: sort the freeing-without-RCU-delay mess for good.Al Viro
commit 5467a68cbf6884c9a9d91e2a89140afb1839c835 upstream. For lockless accesses to dentries we don't have pinned we rely (among other things) upon having an RCU delay between dropping the last reference and actually freeing the memory. On the other hand, for things like pipes and sockets we neither do that kind of lockless access, nor want to deal with the overhead of an RCU delay every time a socket gets closed. So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags made sure it would happen. We tried to avoid setting it unless we knew we need it. Unfortunately, that had led to recurring class of bugs, in which we missed the need to set it. We only really need it for dentries that are created by d_alloc_pseudo(), so let's not bother with trying to be smart - just make having an RCU delay the default. The ones that do *not* get it set the replacement flag (DCACHE_NORCU) and we'd better use that sparingly. d_alloc_pseudo() is the only such user right now. FWIW, the race that finally prompted that switch had been between __lock_parent() of immediate subdirectory of what's currently the root of a disconnected tree (e.g. from open-by-handle in progress) racing with d_splice_alias() elsewhere picking another alias for the same inode, either on outright corrupted fs image, or (in case of open-by-handle on NFS) that subdirectory having been just moved on server. It's not easy to hit, so the sky is not falling, but that's not the first race on similar missed cases and the logics for settinf DCACHE_RCUACCESS has gotten ridiculously convoluted. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25net: test nouarg before dereferencing zerocopy pointersWillem de Bruijn
[ Upstream commit 185ce5c38ea76f29b6bd9c7c8c7a5e5408834920 ] Zerocopy skbs without completion notification were added for packet sockets with PACKET_TX_RING user buffers. Those signal completion through the TP_STATUS_USER bit in the ring. Zerocopy annotation was added only to avoid premature notification after clone or orphan, by triggering a copy on these paths for these packets. The mechanism had to define a special "no-uarg" mode because packet sockets already use skb_uarg(skb) == skb_shinfo(skb)->destructor_arg for a different pointer. Before deferencing skb_uarg(skb), verify that it is a real pointer. Fixes: 5cd8d46ea1562 ("packet: copy user buffers before orphan or clone") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25ipv6: prevent possible fib6 leaksEric Dumazet
[ Upstream commit 61fb0d01680771f72cc9d39783fb2c122aaad51e ] At ipv6 route dismantle, fib6_drop_pcpu_from() is responsible for finding all percpu routes and set their ->from pointer to NULL, so that fib6_ref can reach its expected value (1). The problem right now is that other cpus can still catch the route being deleted, since there is no rcu grace period between the route deletion and call to fib6_drop_pcpu_from() This can leak the fib6 and associated resources, since no notifier will take care of removing the last reference(s). I decided to add another boolean (fib6_destroying) instead of reusing/renaming exception_bucket_flushed to ease stable backports, and properly document the memory barriers used to implement this fix. This patch has been co-developped with Wei Wang. Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Wei Wang <weiwan@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin Lau <kafai@fb.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22jbd2: fix potential double freeChengguang Xu
commit 0d52154bb0a700abb459a2cbce0a30fc2549b67e upstream. When failing from creating cache jbd2_inode_cache, we will destroy the previously created cache jbd2_handle_cache twice. This patch fixes this by moving each cache initialization/destruction to its own separate, individual function. Signed-off-by: Chengguang Xu <cgxu519@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22mfd: max77620: Fix swapped FPS_PERIOD_MAX_US valuesDmitry Osipenko
commit ea611d1cc180fbb56982c83cd5142a2b34881f5c upstream. The FPS_PERIOD_MAX_US definitions are swapped for MAX20024 and MAX77620, fix it. Cc: stable <stable@vger.kernel.org> Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63LSteve Twiss
commit 6b4814a9451add06d457e198be418bf6a3e6a990 upstream. Mismatch between what is found in the Datasheets for DA9063 and DA9063L provided by Dialog Semiconductor, and the register names provided in the MFD registers file. The changes are for the OTP (one-time-programming) control registers. The two naming errors are OPT instead of OTP, and COUNT instead of CONT (i.e. control). Cc: Stable <stable@vger.kernel.org> Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22hugetlb: use same fault hash key for shared and private mappingsMike Kravetz
commit 1b426bac66e6cc83c9f2d92b96e4e72acf43419a upstream. hugetlb uses a fault mutex hash table to prevent page faults of the same pages concurrently. The key for shared and private mappings is different. Shared keys off address_space and file index. Private keys off mm and virtual address. Consider a private mappings of a populated hugetlbfs file. A fault will map the page from the file and if needed do a COW to map a writable page. Hugetlbfs hole punch uses the fault mutex to prevent mappings of file pages. It uses the address_space file index key. However, private mappings will use a different key and could race with this code to map the file page. This causes problems (BUG) for the page cache remove code as it expects the page to be unmapped. A sample stack is: page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) kernel BUG at mm/filemap.c:169! ... RIP: 0010:unaccount_page_cache_page+0x1b8/0x200 ... Call Trace: __delete_from_page_cache+0x39/0x220 delete_from_page_cache+0x45/0x70 remove_inode_hugepages+0x13c/0x380 ? __add_to_page_cache_locked+0x162/0x380 hugetlbfs_fallocate+0x403/0x540 ? _cond_resched+0x15/0x30 ? __inode_security_revalidate+0x5d/0x70 ? selinux_file_permission+0x100/0x130 vfs_fallocate+0x13f/0x270 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There seems to be another potential COW issue/race with this approach of different private and shared keys as noted in commit 8382d914ebf7 ("mm, hugetlb: improve page-fault scalability"). Since every hugetlb mapping (even anon and private) is actually a file mapping, just use the address_space index key for all mappings. This results in potentially more hash collisions. However, this should not be the common case. Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5 Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned ↵Dan Williams
addresses commit fce86ff5802bac3a7b19db171aa1949ef9caac31 upstream. Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls pmdp_set_access_flags(). That helper enforces a pmd aligned @address argument via VM_BUG_ON() assertion. Update the implementation to take a 'struct vm_fault' argument directly and apply the address alignment fixup internally to fix crash signatures like: kernel BUG at arch/x86/mm/pgtable.c:515! invalid opcode: 0000 [#1] SMP NOPTI CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1 [..] RIP: 0010:pmdp_set_access_flags+0x48/0x50 [..] Call Trace: vmf_insert_pfn_pmd+0x198/0x350 dax_iomap_fault+0xe82/0x1190 ext4_dax_huge_fault+0x103/0x1f0 ? __switch_to_asm+0x40/0x70 __handle_mm_fault+0x3f6/0x1370 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 handle_mm_fault+0xda/0x200 __do_page_fault+0x249/0x4f0 do_page_fault+0x32/0x110 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Piotr Balcer <piotr.balcer@intel.com> Tested-by: Yan Ma <yan.ma@intel.com> Tested-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Chandan Rajendra <chandan@linux.ibm.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16bpf: only test gso type on gso packetsWillem de Bruijn
[ Upstream commit 4c3024debf62de4c6ac6d3cb4c0063be21d4f652 ] BPF can adjust gso only for tcp bytestreams. Fail on other gso types. But only on gso packets. It does not touch this field if !gso_size. Fixes: b90efd225874 ("bpf: only adjust gso_size on bytestream protocols") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2019-05-16IB/mlx5: Fix scatter to CQE in DCT QP creationGuy Levi
[ Upstream commit 7249c8ea227a582c14f63e9e8853eb7369122f10 ] When scatter to CQE is enabled on a DCT QP it corrupts the mailbox command since it tried to treat it as as QP create mailbox command instead of a DCT create command. The corrupted mailbox command causes userspace to malfunction as the device doesn't create the QP as expected. A new mlx5 capability is exposed to user-space which ensures that it will not enable the feature on DCT without this fix in the kernel. Fixes: 5d6ff1babe78 ("IB/mlx5: Support scatter to CQE for DC transport type") Signed-off-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-16netfilter: ctnetlink: don't use conntrack/expect object addresses as idFlorian Westphal
[ Upstream commit 3c79107631db1f7fd32cf3f7368e4672004a3010 ] else, we leak the addresses to userspace via ctnetlink events and dumps. Compute an ID on demand based on the immutable parts of nf_conn struct. Another advantage compared to using an address is that there is no immediate re-use of the same ID in case the conntrack entry is freed and reallocated again immediately. Fixes: 3583240249ef ("[NETFILTER]: nf_conntrack_expect: kill unique ID") Fixes: 7f85f914721f ("[NETFILTER]: nf_conntrack: kill unique ID") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-16KVM: fix spectrev1 gadgetsPaolo Bonzini
[ Upstream commit 1d487e9bf8ba66a7174c56a0029c54b1eca8f99c ] These were found with smatch, and then generalized when applicable. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-16x86/reboot, efi: Use EFI reboot for Acer TravelMate X514-51TJian-Hong Pan
[ Upstream commit 0082517fa4bce073e7cf542633439f26538a14cc ] Upon reboot, the Acer TravelMate X514-51T laptop appears to complete the shutdown process, but then it hangs in BIOS POST with a black screen. The problem is intermittent - at some points it has appeared related to Secure Boot settings or different kernel builds, but ultimately we have not been able to identify the exact conditions that trigger the issue to come and go. Besides, the EFI mode cannot be disabled in the BIOS of this model. However, after extensive testing, we observe that using the EFI reboot method reliably avoids the issue in all cases. So add a boot time quirk to use EFI reboot on such systems. Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=203119 Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: Daniel Drake <drake@endlessm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Cc: linux@endlessm.com Link: http://lkml.kernel.org/r/20190412080152.3718-1-jian-hong@endlessm.com [ Fix !CONFIG_EFI build failure, clarify the code and the changelog a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-16bfq: update internal depth state when queue depth changesJens Axboe
[ Upstream commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e ] A previous commit moved the shallow depth and BFQ depth map calculations to be done at init time, moving it outside of the hotter IO path. This potentially causes hangs if the users changes the depth of the scheduler map, by writing to the 'nr_requests' sysfs file for that device. Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if the depth changes, so that the scheduler can update its internal state. Tested-by: Kai Krakow <kai@kaishome.de> Reported-by: Paolo Valente <paolo.valente@linaro.org> Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-14cpu/speculation: Add 'mitigations=' cmdline optionJosh Poimboeuf
commit 98af8452945c55652de68536afdde3b520fec429 upstream Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-14x86/speculation/mds: Add sysfs reporting for MDSThomas Gleixner
commit 8a4b06d391b0a42a373808979b5028f5c84d9c6a upstream Add the sysfs reporting file for MDS. It exposes the vulnerability and mitigation state similar to the existing files for the other speculative hardware vulnerabilities. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-10Bluetooth: Align minimum encryption key size for LE and BR/EDR connectionsMarcel Holtmann
commit d5bb334a8e171b262e48f378bd2096c0ea458265 upstream. The minimum encryption key size for LE connections is 56 bits and to align LE with BR/EDR, enforce 56 bits of minimum encryption key size for BR/EDR connections as well. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-10nvmet: fix discover log page when offsets are usedKeith Busch
[ Upstream commit d808b7f759b50acf0784ce6230ffa63e12ef465d ] The nvme target hadn't been taking the Get Log Page offset parameter into consideration, and so has been returning corrupted log pages when offsets are used. Since many tools, including nvme-cli, split the log request to 4k, we've been breaking discovery log responses when more than 3 subsystems exist. Fix the returned data by internally generating the entire discovery log page and copying only the requested bytes into the user buffer. The command log page offset type has been modified to a native __le64 to make it easier to extract the value from a command. Signed-off-by: Keith Busch <keith.busch@intel.com> Tested-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-10blk-mq: introduce blk_mq_complete_request_sync()Ming Lei
[ Upstream commit 1b8f21b74c3c9c82fce5a751d7aefb7cc0b8d33d ] In NVMe's error handler, follows the typical steps of tearing down hardware for recovering controller: 1) stop blk_mq hw queues 2) stop the real hw queues 3) cancel in-flight requests via blk_mq_tagset_busy_iter(tags, cancel_request, ...) cancel_request(): mark the request as abort blk_mq_complete_request(req); 4) destroy real hw queues However, there may be race between #3 and #4, because blk_mq_complete_request() may run q->mq_ops->complete(rq) remotelly and asynchronously, and ->complete(rq) may be run after #4. This patch introduces blk_mq_complete_request_sync() for fixing the above race. Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Bart Van Assche <bvanassche@acm.org> Cc: James Smart <james.smart@broadcom.com> Cc: linux-nvme@lists.infradead.org Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-10KEYS: trusted: fix -Wvarags warningndesaulniers@google.com
[ Upstream commit be24b37e22c20cbaa891971616784dd0f35211e8 ] Fixes the warning reported by Clang: security/keys/trusted.c:146:17: warning: passing an object that undergoes default argument promotion to 'va_start' has undefined behavior [-Wvarargs] va_start(argp, h3); ^ security/keys/trusted.c:126:37: note: parameter of type 'unsigned char' is declared here unsigned char *h2, unsigned char h3, ...) ^ Specifically, it seems that both the C90 (4.8.1.1) and C11 (7.16.1.4) standards explicitly call this out as undefined behavior: The parameter parmN is the identifier of the rightmost parameter in the variable parameter list in the function definition (the one just before the ...). If the parameter parmN is declared with ... or with a type that is not compatible with the type that results after application of the default argument promotions, the behavior is undefined. Link: https://github.com/ClangBuiltLinux/linux/issues/41 Link: https://www.eskimo.com/~scs/cclass/int/sx11c.html Suggested-by: David Laight <David.Laight@aculab.com> Suggested-by: Denis Kenzior <denkenz@gmail.com> Suggested-by: James Bottomley <jejb@linux.vnet.ibm.com> Suggested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: James Morris <james.morris@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-10linux/kernel.h: Use parentheses around argument in u64_to_user_ptr()Jann Horn
[ Upstream commit a0fe2c6479aab5723239b315ef1b552673f434a3 ] Use parentheses around uses of the argument in u64_to_user_ptr() to ensure that the cast doesn't apply to part of the argument. There are existing uses of the macro of the form u64_to_user_ptr(A + B) which expands to (void __user *)(uintptr_t)A + B (the cast applies to the first operand of the addition, the addition is a pointer addition). This happens to still work as intended, the semantic difference doesn't cause a difference in behavior. But I want to use u64_to_user_ptr() with a ternary operator in the argument, like so: u64_to_user_ptr(A ? B : C) This currently doesn't work as intended. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Cc: Andrei Vagin <avagin@openvz.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: NeilBrown <neilb@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qiaowei Ren <qiaowei.ren@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190329214652.258477-1-jannh@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-10ASoC: dpcm: prevent snd_soc_dpcm use after freeKaiChieh Chuang
[ Upstream commit a9764869779081e8bf24da07ac040e8f3efcf13a ] The dpcm get from fe_clients/be_clients may be free before use Add a spin lock at snd_soc_card level, to protect the dpcm instance. The lock may be used in atomic context, so use spin lock. Use irq spin lock version, since the lock may be used in interrupts. possible race condition between void dpcm_be_disconnect( ... list_del(&dpcm->list_be); list_del(&dpcm->list_fe); kfree(dpcm); ... and for_each_dpcm_fe() for_each_dpcm_be*() race condition example Thread 1: snd_soc_dapm_mixer_update_power() -> soc_dpcm_runtime_update() -> dpcm_be_disconnect() -> kfree(dpcm); Thread 2: dpcm_fe_dai_trigger() -> dpcm_be_dai_trigger() -> snd_soc_dpcm_can_be_free_stop() -> if (dpcm->fe == fe) Excpetion Scenario: two FE link to same BE FE1 -> BE FE2 -> Thread 1: switch of mixer between FE2 -> BE Thread 2: pcm_stop FE1 Exception: Unable to handle kernel paging request at virtual address dead0000000000e0 pc=<> [<ffffff8960e2cd10>] dpcm_be_dai_trigger+0x29c/0x47c sound/soc/soc-pcm.c:3226 if (dpcm->fe == fe) lr=<> [<ffffff8960e2f694>] dpcm_fe_dai_do_trigger+0x94/0x26c Backtrace: [<ffffff89602dba80>] notify_die+0x68/0xb8 [<ffffff896028c7dc>] die+0x118/0x2a8 [<ffffff89602a2f84>] __do_kernel_fault+0x13c/0x14c [<ffffff89602a27f4>] do_translation_fault+0x64/0xa0 [<ffffff8960280cf8>] do_mem_abort+0x4c/0xd0 [<ffffff8960282ad0>] el1_da+0x24/0x40 [<ffffff8960e2cd10>] dpcm_be_dai_trigger+0x29c/0x47c [<ffffff8960e2f694>] dpcm_fe_dai_do_trigger+0x94/0x26c [<ffffff8960e2edec>] dpcm_fe_dai_trigger+0x3c/0x44 [<ffffff8960de5588>] snd_pcm_do_stop+0x50/0x5c [<ffffff8960dded24>] snd_pcm_action+0xb4/0x13c [<ffffff8960ddfdb4>] snd_pcm_drop+0xa0/0x128 [<ffffff8960de69bc>] snd_pcm_common_ioctl+0x9d8/0x30f0 [<ffffff8960de1cac>] snd_pcm_ioctl_compat+0x29c/0x2f14 [<ffffff89604c9d60>] compat_SyS_ioctl+0x128/0x244 [<ffffff8960283740>] el0_svc_naked+0x34/0x38 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: KaiChieh Chuang <kaichieh.chuang@mediatek.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-08clk: x86: Add system specific quirk to mark clocks as criticalDavid Müller
commit 7c2e07130090ae001a97a6b65597830d6815e93e upstream. Since commit 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL"), the pmc_plt_clocks of the Bay Trail SoC are unconditionally gated off. Unfortunately this will break systems where these clocks are used for external purposes beyond the kernel's knowledge. Fix it by implementing a system specific quirk to mark the necessary pmc_plt_clks as critical. Fixes: 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL") Signed-off-by: David Müller <dave.mueller@gmx.ch> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-08fs: stream_open - opener for stream-like files so that read and write can ↵Kirill Smelkov
run simultaneously without deadlock [ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ] Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f2655e3 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
2019-05-08USB: core: Fix bug caused by duplicate interface PM usage counterAlan Stern
commit c2b71462d294cf517a0bc6e4fd6424d7cee5596f upstream. The syzkaller fuzzer reported a bug in the USB hub driver which turned out to be caused by a negative runtime-PM usage counter. This allowed a hub to be runtime suspended at a time when the driver did not expect it. The symptom is a WARNING issued because the hub's status URB is submitted while it is already active: URB 0000000031fb463e submitted while active WARNING: CPU: 0 PID: 2917 at drivers/usb/core/urb.c:363 The negative runtime-PM usage count was caused by an unfortunate design decision made when runtime PM was first implemented for USB. At that time, USB class drivers were allowed to unbind from their interfaces without balancing the usage counter (i.e., leaving it with a positive count). The core code would take care of setting the counter back to 0 before allowing another driver to bind to the interface. Later on when runtime PM was implemented for the entire kernel, the opposite decision was made: Drivers were required to balance their runtime-PM get and put calls. In order to maintain backward compatibility, however, the USB subsystem adapted to the new implementation by keeping an independent usage counter for each interface and using it to automatically adjust the normal usage counter back to 0 whenever a driver was unbound. This approach involves duplicating information, but what is worse, it doesn't work properly in cases where a USB class driver delays decrementing the usage counter until after the driver's disconnect() routine has returned and the counter has been adjusted back to 0. Doing so would cause the usage counter to become negative. There's even a warning about this in the USB power management documentation! As it happens, this is exactly what the hub driver does. The kick_hub_wq() routine increments the runtime-PM usage counter, and the corresponding decrement is carried out by hub_event() in the context of the hub_wq work-queue thread. This work routine may sometimes run after the driver has been unbound from its interface, and when it does it causes the usage counter to go negative. It is not possible for hub_disconnect() to wait for a pending hub_event() call to finish, because hub_disconnect() is called with the device lock held and hub_event() acquires that lock. The only feasible fix is to reverse the original design decision: remove the duplicate interface-specific usage counter and require USB drivers to balance their runtime PM gets and puts. As far as I know, all existing drivers currently do this. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-and-tested-by: syzbot+7634edaea4d0b341c625@syzkaller.appspotmail.com CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-05sctp: avoid running the sctp state machine recursivelyXin Long
[ Upstream commit fbd019737d71e405f86549fd738f81e2ff3dd073 ] Ying triggered a call trace when doing an asconf testing: BUG: scheduling while atomic: swapper/12/0/0x10000100 Call Trace: <IRQ> [<ffffffffa4375904>] dump_stack+0x19/0x1b [<ffffffffa436fcaf>] __schedule_bug+0x64/0x72 [<ffffffffa437b93a>] __schedule+0x9ba/0xa00 [<ffffffffa3cd5326>] __cond_resched+0x26/0x30 [<ffffffffa437bc4a>] _cond_resched+0x3a/0x50 [<ffffffffa3e22be8>] kmem_cache_alloc_node+0x38/0x200 [<ffffffffa423512d>] __alloc_skb+0x5d/0x2d0 [<ffffffffc0995320>] sctp_packet_transmit+0x610/0xa20 [sctp] [<ffffffffc098510e>] sctp_outq_flush+0x2ce/0xc00 [sctp] [<ffffffffc098646c>] sctp_outq_uncork+0x1c/0x20 [sctp] [<ffffffffc0977338>] sctp_cmd_interpreter.isra.22+0xc8/0x1460 [sctp] [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp] [<ffffffffc099443d>] sctp_primitive_ASCONF+0x3d/0x50 [sctp] [<ffffffffc0977384>] sctp_cmd_interpreter.isra.22+0x114/0x1460 [sctp] [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp] [<ffffffffc097b3a4>] sctp_assoc_bh_rcv+0xf4/0x1b0 [sctp] [<ffffffffc09840f1>] sctp_inq_push+0x51/0x70 [sctp] [<ffffffffc099732b>] sctp_rcv+0xa8b/0xbd0 [sctp] As it shows, the first sctp_do_sm() running under atomic context (NET_RX softirq) invoked sctp_primitive_ASCONF() that uses GFP_KERNEL flag later, and this flag is supposed to be used in non-atomic context only. Besides, sctp_do_sm() was called recursively, which is not expected. Vlad tried to fix this recursive call in Commit c0786693404c ("sctp: Fix oops when sending queued ASCONF chunks") by introducing a new command SCTP_CMD_SEND_NEXT_ASCONF. But it didn't work as this command is still used in the first sctp_do_sm() call, and sctp_primitive_ASCONF() will be called in this command again. To avoid calling sctp_do_sm() recursively, we send the next queued ASCONF not by sctp_primitive_ASCONF(), but by sctp_sf_do_prm_asconf() in the 1st sctp_do_sm() directly. Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-04ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASKAndrei Vagin
[ Upstream commit fcfc2aa0185f4a731d05a21e9f359968fdfd02e7 ] There are a few system calls (pselect, ppoll, etc) which replace a task sigmask while they are running in a kernel-space When a task calls one of these syscalls, the kernel saves a current sigmask in task->saved_sigmask and sets a syscall sigmask. On syscall-exit-stop, ptrace traps a task before restoring the saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by saved_sigmask, when the task returns to user-space. This patch fixes this problem. PTRACE_GETSIGMASK returns saved_sigmask if it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag. Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com Fixes: 29000caecbe8 ("ptrace: add ability to get/set signal-blocked mask") Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
2019-05-04net/sched: don't dereference a->goto_chain to read the chain indexDavide Caratti
[ Upstream commit fe384e2fa36ca084a456fd30558cccc75b4b3fbd ] callers of tcf_gact_goto_chain_index() can potentially read an old value of the chain index, or even dereference a NULL 'goto_chain' pointer, because 'goto_chain' and 'tcfa_action' are read in the traffic path without caring of concurrent write in the control path. The most recent value of chain index can be read also from a->tcfa_action (it's encoded there together with TC_ACT_GOTO_CHAIN bits), so we don't really need to dereference 'goto_chain': just read the chain id from the control action. Fixes: e457d86ada27 ("net: sched: add couple of goto_chain helpers") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
2019-05-04xsk: fix umem memory leak on cleanupBjörn Töpel
[ Upstream commit 044175a06706d516aa42874bb44dbbfc3c4d20eb ] When the umem is cleaned up, the task that created it might already be gone. If the task was gone, the xdp_umem_release function did not free the pages member of struct xdp_umem. It turned out that the task lookup was not needed at all; The code was a left-over when we moved from task accounting to user accounting [1]. This patch fixes the memory leak by removing the task lookup logic completely. [1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/ Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/ Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt") Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
2019-05-04fs: prevent page refcount overflow in pipe_buf_getMatthew Wilcox
commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb upstream. Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-04mm: add 'try_get_page()' helper functionLinus Torvalds
commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 upstream. This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it). Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count. The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened. NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count). Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-04mm: make page ref count overflow check tighter and more explicitLinus Torvalds
commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 upstream. We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count. That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON). Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory. Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02net/ncsi: handle overflow when incrementing mac addressTao Ren
[ Upstream commit 1c5c12ee308aacf635c8819cd4baa3bd58f8a8b7 ] Previously BMC's MAC address is calculated by simply adding 1 to the last byte of network controller's MAC address, and it produces incorrect result when network controller's MAC address ends with 0xFF. The problem can be fixed by calling eth_addr_inc() function to increment MAC address; besides, the MAC address is also validated before assigning to BMC. Fixes: cb10c7c0dfd9 ("net/ncsi: Add NCSI Broadcom OEM command") Signed-off-by: Tao Ren <taoren@fb.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02net: netrom: Fix error cleanup path of nr_proto_initYueHaibing
commit d3706566ae3d92677b932dd156157fd6c72534b1 upstream. Syzkaller report this: BUG: unable to handle kernel paging request at fffffbfff830524b PGD 237fe8067 P4D 237fe8067 PUD 237e64067 PMD 1c9716067 PTE 0 Oops: 0000 [#1] SMP KASAN PTI CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 5.0.0+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:__list_add_valid+0x21/0xe0 lib/list_debug.c:23 Code: 8b 0c 24 e9 17 fd ff ff 90 55 48 89 fd 48 8d 7a 08 53 48 89 d3 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 0f 85 8b 00 00 00 48 8b 53 08 48 39 f2 75 35 48 89 f2 RSP: 0018:ffff8881ea2278d0 EFLAGS: 00010282 RAX: dffffc0000000000 RBX: ffffffffc1829250 RCX: 1ffff1103d444ef4 RDX: 1ffffffff830524b RSI: ffffffff85659300 RDI: ffffffffc1829258 RBP: ffffffffc1879250 R08: fffffbfff0acb269 R09: fffffbfff0acb269 R10: ffff8881ea2278f0 R11: fffffbfff0acb268 R12: ffffffffc1829250 R13: dffffc0000000000 R14: 0000000000000008 R15: ffffffffc187c830 FS: 00007fe0361df700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff830524b CR3: 00000001eb39a001 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: __list_add include/linux/list.h:60 [inline] list_add include/linux/list.h:79 [inline] proto_register+0x444/0x8f0 net/core/sock.c:3375 nr_proto_init+0x73/0x4b3 [netrom] ? 0xffffffffc1628000 ? 0xffffffffc1628000 do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fe0361dec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003 RBP: 00007fe0361dec70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0361df6bc R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004 Modules linked in: netrom(+) ax25 fcrypt pcbc af_alg arizona_ldo1 v4l2_common videodev media v4l2_dv_timings hdlc ide_cd_mod snd_soc_sigmadsp_regmap snd_soc_sigmadsp intel_spi_platform intel_spi mtd spi_nor snd_usbmidi_lib usbcore lcd ti_ads7950 hi6421_regulator snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_rt5663 snd_soc_core snd_pcm_dmaengine snd_compress snd_soc_rl6231 mac80211 rtc_rc5t583 spi_slave_time leds_pwm hid_gt683r hid industrialio_triggered_buffer kfifo_buf industrialio ir_kbd_i2c rc_core led_class_flash dwc_xlgmac snd_ymfpci gameport snd_mpu401_uart snd_rawmidi snd_ac97_codec snd_pcm ac97_bus snd_opl3_lib snd_timer snd_seq_device snd_hwdep snd soundcore iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ide_pci_generic piix aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ide_core psmouse input_leds i2c_piix4 serio_raw intel_agp intel_gtt ata_generic agpgart pata_acpi parport_pc rtc_cmos parport floppy sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: rxrpc] Dumping ftrace buffer: (ftrace buffer empty) CR2: fffffbfff830524b ---[ end trace 039ab24b305c4b19 ]--- If nr_proto_init failed, it may forget to call proto_unregister, tiggering this issue.This patch rearrange code of nr_proto_init to avoid such issues. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02drm/ttm: fix re-init of global structuresChristian König
commit bd4264112f93045704731850c5e4d85db981cd85 upstream. When a driver unloads without unloading TTM we don't correctly clear the global structures leading to errors on re-init. Next step should probably be to remove the global structures and kobjs all together, but this is tricky since we need to maintain backward compatibility. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Karol Herbst <kherbst@redhat.com> Tested-by: Karol Herbst <kherbst@redhat.com> CC: stable@vger.kernel.org # 5.0.x Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02tracing: Fix buffer_ref pipe opsJann Horn
commit b987222654f84f7b4ca95b3a55eca784cb30235b upstream. This fixes multiple issues in buffer_pipe_buf_ops: - The ->steal() handler must not return zero unless the pipe buffer has the only reference to the page. But generic_pipe_buf_steal() assumes that every reference to the pipe is tracked by the page's refcount, which isn't true for these buffers - buffer_pipe_buf_get(), which duplicates a buffer, doesn't touch the page's refcount. Fix it by using generic_pipe_buf_nosteal(), which refuses every attempted theft. It should be easy to actually support ->steal, but the only current users of pipe_buf_steal() are the virtio console and FUSE, and they also only use it as an optimization. So it's probably not worth the effort. - The ->get() and ->release() handlers can be invoked concurrently on pipe buffers backed by the same struct buffer_ref. Make them safe against concurrency by using refcount_t. - The pointers stored in ->private were only zeroed out when the last reference to the buffer_ref was dropped. As far as I know, this shouldn't be necessary anyway, but if we do it, let's always do it. Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02netfilter: nf_tables: bogus EBUSY when deleting set after flushPablo Neira Ayuso
[ Upstream commit 273fe3f1006ea5ebc63d6729e43e8e45e32b256a ] Set deletion after flush coming in the same batch results in EBUSY. Add set use counter to track the number of references to this set from rules. We cannot rely on the list of bindings for this since such list is still populated from the preparation phase. Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-27x86/kprobes: Verify stack frame on kretprobeMasami Hiramatsu
commit 3ff9c075cc767b3060bdac12da72fc94dd7da1b8 upstream. Verify the stack frame pointer on kretprobe trampoline handler, If the stack frame pointer does not match, it skips the wrong entry and tries to find correct one. This can happen if user puts the kretprobe on the function which can be used in the path of ftrace user-function call. Such functions should not be probed, so this adds a warning message that reports which function should be blacklisted. Tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/155094059185.6137.15527904013362842072.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27coredump: fix race condition between mmget_not_zero()/get_task_mm() and core ↵Andrea Arcangeli
dumping commit 04f5866e41fb70690e28397487d8bd8eea7d712a upstream. The core dumping code has always run without holding the mmap_sem for writing, despite that is the only way to ensure that the entire vma layout will not change from under it. Only using some signal serialization on the processes belonging to the mm is not nearly enough. This was pointed out earlier. For example in Hugh's post from Jul 2017: https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils "Not strictly relevant here, but a related note: I was very surprised to discover, only quite recently, how handle_mm_fault() may be called without down_read(mmap_sem) - when core dumping. That seems a misguided optimization to me, which would also be nice to correct" In particular because the growsdown and growsup can move the vm_start/vm_end the various loops the core dump does around the vma will not be consistent if page faults can happen concurrently. Pretty much all users calling mmget_not_zero()/get_task_mm() and then taking the mmap_sem had the potential to introduce unexpected side effects in the core dumping code. Adding mmap_sem for writing around the ->core_dump invocation is a viable long term fix, but it requires removing all copy user and page faults and to replace them with get_dump_page() for all binary formats which is not suitable as a short term fix. For the time being this solution manually covers the places that can confuse the core dump either by altering the vma layout or the vma flags while it runs. Once ->core_dump runs under mmap_sem for writing the function mmget_still_valid() can be dropped. Allowing mmap_sem protected sections to run in parallel with the coredump provides some minor parallelism advantage to the swapoff code (which seems to be safe enough by never mangling any vma field and can keep doing swapins in parallel to the core dumping) and to some other corner case. In order to facilitate the backporting I added "Fixes: 86039bd3b4e6" however the side effect of this same race condition in /proc/pid/mem should be reproducible since before 2.6.12-rc2 so I couldn't add any other "Fixes:" because there's no hash beyond the git genesis commit. Because find_extend_vma() is the only location outside of the process context that could modify the "mm" structures under mmap_sem for reading, by adding the mmget_still_valid() check to it, all other cases that take the mmap_sem for reading don't need the new check after mmget_not_zero()/get_task_mm(). The expand_stack() in page fault context also doesn't need the new check, because all tasks under core dumping are frozen. Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Jann Horn <jannh@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27nfc: nci: Potential off by one in ->pipes[] arrayDan Carpenter
[ Upstream commit 6491d698396fd5da4941980a35ca7c162a672016 ] This is similar to commit e285d5bfb7e9 ("NFC: Fix the number of pipes") where we changed NFC_HCI_MAX_PIPES from 127 to 128. As the comment next to the define explains, the pipe identifier is 7 bits long. The highest possible pipe is 127, but the number of possible pipes is 128. As the code is now, then there is potential for an out of bounds array access: net/nfc/nci/hci.c:297 nci_hci_cmd_received() warn: array off by one? 'ndev->hci_dev->pipes[pipe]' '0-127 == 127' Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27net/tls: prevent bad memory access in tls_is_sk_tx_device_offloaded()Jakub Kicinski
[ Upstream commit b4f47f3848eb70986f75d06112af7b48b7f5f462 ] Unlike '&&' operator, the '&' does not have short-circuit evaluation semantics. IOW both sides of the operator always get evaluated. Fix the wrong operator in tls_is_sk_tx_device_offloaded(), which would lead to out-of-bounds access for for non-full sockets. Fixes: 4799ac81e52a ("tls: Add rx inline crypto offload") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27net/tls: don't leak partially sent record in device modeJakub Kicinski
[ Upstream commit 35b71a34ada62c9573847a324bf06a133fe11b11 ] David reports that tls triggers warnings related to sk->sk_forward_alloc not being zero at destruction time: WARNING: CPU: 5 PID: 6831 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110 WARNING: CPU: 5 PID: 6831 at net/ipv4/af_inet.c:160 inet_sock_destruct+0x15b/0x170 When sender fills up the write buffer and dies from SIGPIPE. This is due to the device implementation not cleaning up the partially_sent_record. This is because commit a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") moved the partial record cleanup to the SW-only path. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Reported-by: David Beckett <david.beckett@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27failover: allow name change on IFF_UP slave interfacesSi-Wei Liu
[ Upstream commit 8065a779f17e94536a1c4dcee4f9d88011672f97 ] When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to listen for the rename events on failover slaves. Userspace component interacting with slaves is expected to be changed to operate on failover master interface instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and userspace components should only deal with failover master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>