summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2019-04-03mm/migrate.c: add missing flush_dcache_page for non-mapped page migrateLars Persson
commit d2b2c6dd227ba5b8a802858748ec9a780cb75b47 upstream. Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL and SIGSEGV that could not be traced back to a userspace code bug. They had all the magic signs of an I/D cache coherency issue. Now recently we noticed that the /proc/sys/vm/compact_memory interface was quite efficient at provoking this class of userspace crashes. Studying the code in mm/migrate.c there is a distinction made between migrating a page that is mapped at the instant of migration and one that is not mapped. Our problem turned out to be the non-mapped pages. For the non-mapped page the code performs a copy of the page content and all relevant meta-data of the page without doing the required D-cache maintenance. This leaves dirty data in the D-cache of the CPU and on the 1004K cores this data is not visible to the I-cache. A subsequent page-fault that triggers a mapping of the page will happily serve the process with potentially stale code. What about ARM then, this bug should have seen greater exposure? Well ARM became immune to this flaw back in 2010, see commit c01778001a4f ("ARM: 6379/1: Assume new page cache pages have dirty D-cache"). My proposed fix moves the D-cache maintenance inside move_to_new_page to make it common for both cases. Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com Fixes: 97ee0524614 ("flush cache before installing new page at migraton") Signed-off-by: Lars Persson <larper@axis.com> Reviewed-by: Paul Burton <paul.burton@mips.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specifiedYang Shi
commit a7f40cfe3b7ada57af9b62fd28430eeb4a7cfcb7 upstream. When MPOL_MF_STRICT was specified and an existing page was already on a node that does not follow the policy, mbind() should return -EIO. But commit 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") broke the rule. And commit c8633798497c ("mm: mempolicy: mbind and migrate_pages support thp migration") didn't return the correct value for THP mbind() too. If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it reaches queue_pages_to_pte_range() or queue_pages_pmd() to check if an existing page was already on a node that does not follow the policy. And, non-migratable vma may be used, return -EIO too if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified. Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c [akpm@linux-foundation.org: tweak code comment] Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.alibaba.com Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Cyril Hrubis <chrubis@suse.cz> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03mm: add support for kmem caches in DMA32 zoneNicolas Boichat
commit 6d6ea1e967a246f12cfe2f5fb743b70b2e608d4a upstream. Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables", v6. This is a followup to the discussion in [1], [2]. IOMMUs using ARMv7 short-descriptor format require page tables (level 1 and 2) to be allocated within the first 4GB of RAM, even on 64-bit systems. For L1 tables that are bigger than a page, we can just use __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still use GFP_DMA). For L2 tables that only take 1KB, it would be a waste to allocate a full page, so we considered 3 approaches: 1. This series, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. [3] This series is the most memory-efficient approach. stable@ note: We confirmed that this is a regression, and IOMMU errors happen on 4.19 and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek platforms (and maybe others?). [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html [3] https://patchwork.codeaurora.org/patch/671639/ This patch (of 3): IOMMUs using ARMv7 short-descriptor format require page tables to be allocated within the first 4GB of RAM, even on 64-bit systems. On arm64, this is done by passing GFP_DMA32 flag to memory allocation functions. For IOMMU L2 tables that only take 1KB, it would be a waste to allocate a full page using get_free_pages, so we considered 3 approaches: 1. This patch, adding support for GFP_DMA32 slab caches. 2. genalloc, which requires pre-allocating the maximum number of L2 page tables (4096, so 4MB of memory). 3. page_frag, which is not very memory-efficient as it is unable to reuse freed fragments until the whole page is freed. This change makes it possible to create a custom cache in DMA32 zone using kmem_cache_create, then allocate memory using kmem_cache_alloc. We do not create a DMA32 kmalloc cache array, as there are currently no users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK. This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32 kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and unnecessary). Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Sasha Levin <Alexander.Levin@microsoft.com> Cc: Huaisheng Ye <yehs1@lenovo.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yong Wu <yong.wu@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Tomasz Figa <tfiga@google.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23mm/memory.c: do_fault: avoid usage of stale vm_area_structJan Stancek
commit fc8efd2ddfed3f343c11b693e87140ff358d7ff5 upstream. LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8. This is a stress test, where one thread mmaps/writes/munmaps memory area and other thread is trying to read from it: CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51 Hardware name: IBM 2964 N63 400 (z/VM 6.4.0) Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8) Call Trace: ([<0000000000000000>] (null)) [<00000000001adae4>] lock_acquire+0xec/0x258 [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98 [<000000000012a780>] page_table_free+0x48/0x1a8 [<00000000002f6e54>] do_fault+0xdc/0x670 [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0 [<00000000002fb138>] handle_mm_fault+0x1b0/0x320 [<00000000001248cc>] do_dat_exception+0x19c/0x2c8 [<000000000080e5ee>] pgm_check_handler+0x19e/0x200 page_table_free() is called with NULL mm parameter, but because "0" is a valid address on s390 (see S390_lowcore), it keeps going until it eventually crashes in lockdep's lock_acquire. This crash is reproducible at least since 4.14. Problem is that "vmf->vma" used in do_fault() can become stale. Because mmap_sem may be released, other threads can come in, call munmap() and cause "vma" be returned to kmem cache, and get zeroed/re-initialized and re-used: handle_mm_fault | __handle_mm_fault | do_fault | vma = vmf->vma | do_read_fault | __do_fault | vma->vm_ops->fault(vmf); | mmap_sem is released | | | do_munmap() | remove_vma_list() | remove_vma() | vm_area_free() | # vma is released | ... | # same vma is allocated | # from kmem cache | do_mmap() | vm_area_alloc() | memset(vma, 0, ...) | pte_free(vma->vm_mm, ...); | page_table_free | spin_lock_bh(&mm->context.lock);| <crash> | Cache mm_struct to avoid using potentially stale "vma". [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23mm/vmalloc: fix size check for remap_vmalloc_range_partial()Roman Penyaev
commit 401592d2e095947344e10ec0623adbcd58934dd4 upstream. When VM_NO_GUARD is not set area->size includes adjacent guard page, thus for correct size checking get_vm_area_size() should be used, but not area->size. This fixes possible kernel oops when userspace tries to mmap an area on 1 page bigger than was allocated by vmalloc_user() call: the size check inside remap_vmalloc_range_partial() accounts non-existing guard page also, so check successfully passes but vmalloc_to_page() returns NULL (guard page does not physically exist). The following code pattern example should trigger an oops: static int oops_mmap(struct file *file, struct vm_area_struct *vma) { void *mem; mem = vmalloc_user(4096); BUG_ON(!mem); /* Do not care about mem leak */ return remap_vmalloc_range(vma, mem, 0); } And userspace simply mmaps size + PAGE_SIZE: mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0); Possible candidates for oops which do not have any explicit size checks: *** drivers/media/usb/stkwebcam/stk-webcam.c: v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0); Or the following one: *** drivers/video/fbdev/core/fbmem.c static int fb_mmap(struct file *file, struct vm_area_struct * vma) ... res = fb->fb_mmap(info, vma); Where fb_mmap callback calls remap_vmalloc_range() directly without any explicit checks: *** drivers/video/fbdev/vfb.c static int vfb_mmap(struct fb_info *info, struct vm_area_struct *vma) { return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff); } Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23mm: hwpoison: fix thp split handing in soft_offline_in_use_page()zhongjiang
commit 46612b751c4941c5c0472ddf04027e877ae5990f upstream. When soft_offline_in_use_page() runs on a thp tail page after pmd is split, we trigger the following VM_BUG_ON_PAGE(): Memory failure: 0x3755ff: non anonymous thp __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000 Soft offlining pfn 0x34d805 at process virtual address 0x20fff000 page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1 flags: 0x2fffff80000000() raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) ------------[ cut here ]------------ kernel BUG at ./include/linux/mm.h:519! soft_offline_in_use_page() passed refcount and page lock from tail page to head page, which is not needed because we can pass any subpage to split_huge_page(). Naoya had fixed a similar issue in c3901e722b29 ("mm: hwpoison: fix thp split handling in memory_failure()"). But he missed fixing soft offline. Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com Fixes: 61f5d698cc97 ("mm: re-enable THP") Signed-off-by: zhongjiang <zhongjiang@huawei.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23tmpfs: fix uninitialized return value in shmem_linkDarrick J. Wong
[ Upstream commit 29b00e609960ae0fcff382f4c7079dd0874a5311 ] When we made the shmem_reserve_inode call in shmem_link conditional, we forgot to update the declaration for ret so that it always has a known value. Dan Carpenter pointed out this deficiency in the original patch. Fixes: 1062af920c07 ("tmpfs: fix link accounting when a tmpfile is linked in") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Matej Kupljen <matej.kupljen@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23tmpfs: fix link accounting when a tmpfile is linked inDarrick J. Wong
[ Upstream commit 1062af920c07f5b54cf5060fde3339da6df0cf6b ] tmpfs has a peculiarity of accounting hard links as if they were separate inodes: so that when the number of inodes is limited, as it is by default, a user cannot soak up an unlimited amount of unreclaimable dcache memory just by repeatedly linking a file. But when v3.11 added O_TMPFILE, and the ability to use linkat() on the fd, we missed accommodating this new case in tmpfs: "df -i" shows that an extra "inode" remains accounted after the file is unlinked and the fd closed and the actual inode evicted. If a user repeatedly links tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they are deleted. Just skip the extra reservation from shmem_link() in this case: there's a sense in which this first link of a tmpfile is then cheaper than a hard link of another file, but the accounting works out, and there's still good limiting, so no need to do anything more complicated. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils Fixes: f4e0c30c191 ("allow the temp files created by open() to be linked to") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Matej Kupljen <matej.kupljen@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23mm: handle lru_add_drain_all for UP properlyMichal Hocko
[ Upstream commit 6ea183d60c469560e7b08a83c9804299e84ec9eb ] Since for_each_cpu(cpu, mask) added by commit 2d3854a37e8b767a ("cpumask: introduce new API, without changing anything") did not evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n, lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without INIT_WORK().") by unconditionally calling flush_work() [1]. Workaround this issue by using CONFIG_SMP=n specific lru_add_drain_all implementation. There is no real need to defer the implementation to the workqueue as the draining is going to happen on the local cpu. So alias lru_add_drain_all to lru_add_drain which does all the necessary work. [akpm@linux-foundation.org: fix various build warnings] [1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocsJann Horn
[ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ] The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum number of references that we might need to create in the fastpath later, the bump-allocation fastpath only has to modify the non-atomic bias value that tracks the number of extra references we hold instead of the atomic refcount. The maximum number of allocations we can serve (under the assumption that no allocation is made with size 0) is nc->size, so that's the bias used. However, even when all memory in the allocation has been given away, a reference to the page is still held; and in the `offset < 0` slowpath, the page may be reused if everyone else has dropped their references. This means that the necessary number of references is actually `nc->size+1`. Luckily, from a quick grep, it looks like the only path that can call page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which requires CAP_NET_ADMIN in the init namespace and is only intended to be used for kernel testing and fuzzing. To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the `offset < 0` path, below the virt_to_page() call, and then repeatedly call writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI, with a vector consisting of 15 elements containing 1 byte each. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23Revert "mm: use early_pfn_to_nid in page_ext_init"Qian Cai
[ Upstream commit 2f1ee0913ce58efe7f18fbd518bd54c598559b89 ] This reverts commit fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init"). When booting a system with "page_owner=on", start_kernel page_ext_init invoke_init_callbacks init_section_page_ext init_page_owner init_early_allocated_pages init_zones_in_node init_pages_in_zone lookup_page_ext page_to_nid The issue here is that page_to_nid() will not work since some page flags have no node information until later in page_alloc_init_late() due to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds access with an invalid nid. UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50 index 7 is out of range for type 'zone [5]' Also, kernel will panic since flags were poisoned earlier with, CONFIG_DEBUG_VM_PGFLAGS=y CONFIG_NODE_NOT_IN_PAGE_FLAGS=n start_kernel setup_arch pagetable_init paging_init sparse_init sparse_init_nid memblock_alloc_try_nid_raw It did not handle it well in init_pages_in_zone() which ends up calling page_to_nid(). page:ffffea0004200000 is uninitialized and poisoned raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) page_owner info is not active (free page?) kernel BUG at include/linux/mm.h:990! RIP: 0010:init_page_owner+0x486/0x520 This means that assumptions behind commit fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") are incomplete. Therefore, revert the commit for now. A proper way to move the page_owner initialization to sooner is to hook into memmap initialization. Link: http://lkml.kernel.org/r/20190115202812.75820-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <yang.shi@linaro.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23mm/gup: fix gup_pmd_range() for daxYu Zhao
[ Upstream commit 414fd080d125408cb15d04ff4907e1dd8145c8c7 ] For dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13mm, memory_hotplug: fix off-by-one in is_pageblock_removableMichal Hocko
[ Upstream commit 891cb2a72d821f930a39d5900cb7a3aa752c1d5b ] Rong Chen has reported the following boot crash: PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 RIP: 0010:page_mapping+0x12/0x80 Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48 RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202 RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000 RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13 R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000 R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001 FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0 Call Trace: __dump_page+0x14/0x2c0 is_mem_section_removable+0x24c/0x2c0 removable_show+0x87/0xa0 dev_attr_show+0x25/0x60 sysfs_kf_seq_show+0xba/0x110 seq_read+0x196/0x3f0 __vfs_read+0x34/0x180 vfs_read+0xa0/0x150 ksys_read+0x44/0xb0 do_syscall_64+0x5e/0x4a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe and bisected it down to commit efad4e475c31 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone"). The reason for the crash is that the mapping is garbage for poisoned (uninitialized) page. This shouldn't happen as all pages in the zone's boundary should be initialized. Later debugging revealed that the actual problem is an off-by-one when evaluating the end_page. 'start_pfn + nr_pages' resp 'zone_end_pfn' refers to a pfn after the range and as such it might belong to a differen memory section. This along with CONFIG_SPARSEMEM then makes the loop condition completely bogus because a pointer arithmetic doesn't work for pages from two different sections in that memory model. Fix the issue by reworking is_pageblock_removable to be pfn based and only use struct page where necessary. This makes the code slightly easier to follow and we will remove the problematic pointer arithmetic completely. Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org Fixes: efad4e475c31 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: <rong.a.chen@intel.com> Tested-by: <rong.a.chen@intel.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zoneMikhail Zaslonko
[ Upstream commit 24feb47c5fa5b825efb0151f28906dfdad027e61 ] If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct pages access from test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers. Here are the the panic examples: CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: test_pages_in_a_zone+0xde/0x160 show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix this by checking whether the pfn to check is within the zone. [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13mm, memory_hotplug: is_mem_section_removable do not pass the end of a zoneMichal Hocko
[ Upstream commit efad4e475c312456edb3c789d0996d12ed744c13 ] Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2. Mikhail Zaslonko has posted fixes for the two bugs quite some time ago [1]. I have pushed back on those fixes because I believed that it is much better to plug the problem at the initialization time rather than play whack-a-mole all over the hotplug code and find all the places which expect the full memory section to be initialized. We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") merged and cause a regression [2][3]. The reason is that there might be memory layouts when two NUMA nodes share the same memory section so the merged fix is simply incorrect. In order to plug this hole we really have to be zone range aware in those handlers. I have split up the original patch into two. One is unchanged (patch 2) and I took a different approach for `removable' crash. [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948 [3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz This patch (of 2): Mikhail has reported the following VM_BUG_ON triggered when reading sysfs removable state of a memory block: page:000003d08300c000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: is_mem_section_removable+0xb4/0x190 show_mem_removable+0x9a/0xd8 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: is_mem_section_removable+0xb4/0x190 Kernel panic - not syncing: Fatal exception: panic_on_oops The reason is that the memory block spans the zone boundary and we are stumbling over an unitialized struct page. Fix this by enforcing zone range in is_mem_section_removable so that we never run away from a zone. Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05hugetlbfs: fix races and page leaks during migrationMike Kravetz
commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream. hugetlb pages should only be migrated if they are 'active'. The routines set/clear_page_huge_active() modify the active state of hugetlb pages. When a new hugetlb page is allocated at fault time, set_page_huge_active is called before the page is locked. Therefore, another thread could race and migrate the page while it is being added to page table by the fault code. This race is somewhat hard to trigger, but can be seen by strategically adding udelay to simulate worst case scheduling behavior. Depending on 'how' the code races, various BUG()s could be triggered. To address this issue, simply delay the set_page_huge_active call until after the page is successfully added to the page table. Hugetlb pages can also be leaked at migration time if the pages are associated with a file in an explicitly mounted hugetlbfs filesystem. For example, consider a two node system with 4GB worth of huge pages available. A program mmaps a 2G file in a hugetlbfs filesystem. It then migrates the pages associated with the file from one node to another. When the program exits, huge page counts are as follows: node0 1024 free_hugepages 1024 nr_hugepages node1 0 free_hugepages 1024 nr_hugepages Filesystem Size Used Avail Use% Mounted on nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool That is as expected. 2G of huge pages are taken from the free_hugepages counts, and 2G is the size of the file in the explicitly mounted filesystem. If the file is then removed, the counts become: node0 1024 free_hugepages 1024 nr_hugepages node1 1024 free_hugepages 1024 nr_hugepages Filesystem Size Used Avail Use% Mounted on nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool Note that the filesystem still shows 2G of pages used, while there actually are no huge pages in use. The only way to 'fix' the filesystem accounting is to unmount the filesystem If a hugetlb page is associated with an explicitly mounted filesystem, this information in contained in the page_private field. At migration time, this information is not preserved. To fix, simply transfer page_private from old to new page at migration time if necessary. There is a related race with removing a huge page from a file and migration. When a huge page is removed from the pagecache, the page_mapping() field is cleared, yet page_private remains set until the page is actually freed by free_huge_page(). A page could be migrated while in this state. However, since page_mapping() is not set the hugetlbfs specific routine to transfer page_private is not called and we leak the page count in the filesystem. To fix that, check for this condition before migrating a huge page. If the condition is detected, return EBUSY for the page. Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> [mike.kravetz@oracle.com: v2] Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com [mike.kravetz@oracle.com: update comment and changelog] Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-05mm: enforce min addr even if capable() in expand_downwards()Jann Horn
commit 0a1d52994d440e21def1c2174932410b4f2a98a1 upstream. security_mmap_addr() does a capability check with current_cred(), but we can reach this code from contexts like a VFS write handler where current_cred() must not be used. This can be abused on systems without SMAP to make NULL pointer dereferences exploitable again. Fixes: 8869477a49c3 ("security: protect from stack expansion into low vm addresses") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-05writeback: synchronize sync(2) against cgroup writeback membership switchesTejun Heo
[ Upstream commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 ] sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For example, if an inode switches to another wb while sync_inodes_sb() is in progress, the new wb might not be visible to bdi_split_work_to_wbs() at all or the inode might jump from a wb which hasn't issued writebacks yet to one which already has. This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is guaranteed to see all the target wbs and inodes can't jump wbs to escape syncing. v2: Fixed misplaced rwsem init. Spotted by Jiufei. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jiufei Xue <xuejiufei@gmail.com> Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27numa: change get_mempolicy() to use nr_node_ids instead of MAX_NUMNODESRalph Campbell
commit 050c17f239fd53adb55aa768d4f41bc76c0fe045 upstream. The system call, get_mempolicy() [1], passes an unsigned long *nodemask pointer and an unsigned long maxnode argument which specifies the length of the user's nodemask array in bits (which is rounded up). The manual page says that if the maxnode value is too small, get_mempolicy will return EINVAL but there is no system call to return this minimum value. To determine this value, some programs search /proc/<pid>/status for a line starting with "Mems_allowed:" and use the number of digits in the mask to determine the minimum value. A recent change to the way this line is formatted [2] causes these programs to compute a value less than MAX_NUMNODES so get_mempolicy() returns EINVAL. Change get_mempolicy(), the older compat version of get_mempolicy(), and the copy_nodes_to_user() function to use nr_node_ids instead of MAX_NUMNODES, thus preserving the defacto method of computing the minimum size for the nodemask array and the maxnode argument. [1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html [2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redhat.com Link: http://lkml.kernel.org/r/20190211180245.22295-1-rcampbell@nvidia.com Fixes: 4fb8e5b89bcbbbb ("include/linux/nodemask.h: use nr_node_ids (not MAX_NUMNODES) in __nodemask_pr_numnodes()") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20Revert "mm: slowly shrink slabs with a relatively small number of objects"Dave Chinner
commit a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96 upstream. This reverts commit 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects"). This change changes the agressiveness of shrinker reclaim, causing small cache and low priority reclaim to greatly increase scanning pressure on small caches. As a result, light memory pressure has a disproportionate affect on small caches, and causes large caches to be reclaimed much faster than previously. As a result, it greatly perturbs the delicate balance of the VFS caches (dentry/inode vs file page cache) such that the inode/dentry caches are reclaimed much, much faster than the page cache and this drives us into several other caching imbalance related problems. As such, this is a bad change and needs to be reverted. [ Needs some massaging to retain the later seekless shrinker modifications.] Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com Fixes: 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects") Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Wolfgang Walter <linux@stwm.de> Cc: Roman Gushchin <guro@fb.com> Cc: Spock <dairinin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-12mm/page_alloc.c: don't call kasan_free_pages() at deferred mem initWaiman Long
[ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ] When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred pages initialization can take a long time. Below were the reported init times on a 8-socket 96-core 4TB IvyBridge system. 1) Non-debug kernel without CONFIG_KASAN [ 8.764222] node 1 initialised, 132086516 pages in 7027ms 2) Debug kernel with CONFIG_KASAN [ 146.288115] node 1 initialised, 132075466 pages in 143052ms So the page init time in a debug kernel was 20X of the non-debug kernel. The long init time can be problematic as the page initialization is done with interrupt disabled. In this particular case, it caused the appearance of following warning messages as well as NMI backtraces of all the cores that were doing the initialization. [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252 [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253 [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253 [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253 [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253 [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253 [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253 [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2) The long init time was mainly caused by the call to kasan_free_pages() to poison the newly initialized pages. On a 4TB system, we are talking about almost 500GB of memory probably on the same node. In reality, we may not need to poison the newly initialized pages before they are ever allocated. So KASAN poisoning of freed pages before the completion of deferred memory initialization is now disabled. Those pages will be properly poisoned when they are allocated or freed after deferred pages initialization is done. With this change, the new page initialization time became: [ 21.948010] node 1 initialised, 132075466 pages in 18702ms This was still about double the non-debug kernel time, but was much better than before. Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12percpu: convert spin_lock_irq to spin_lock_irqsave.Dennis Zhou
[ Upstream commit 6ab7d47bcbf0144a8cb81536c2cead4cde18acfe ] From Michael Cree: "Bisection lead to commit b38d08f3181c ("percpu: restructure locking") as being the cause of lockups at initial boot on the kernel built for generic Alpha. On a suggestion by Tejun Heo that: So, the only thing I can think of is that it's calling spin_unlock_irq() while irq handling isn't set up yet. Can you please try the followings? 1. Convert all spin_[un]lock_irq() to spin_lock_irqsave/unlock_irqrestore()." Fixes: b38d08f3181c ("percpu: restructure locking") Reported-and-tested-by: Michael Cree <mcree@orcon.net.nz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-06mm: migrate: don't rely on __PageMovable() of newpage after unlocking itDavid Hildenbrand
commit e0a352fabce61f730341d119fbedf71ffdb8663f upstream. We had a race in the old balloon compaction code before b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature") refactored it that became visible after backporting 195a8c43e93d ("virtio-balloon: deflate via a page list") without the refactoring. The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature"). d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") was backported to 3.12, so the broken kernels are stable kernels [3.12 - 4.7]. There was a subtle race between dropping the page lock of the newpage in __unmap_and_move() and checking for __is_movable_balloon_page(newpage). Just after dropping this page lock, virtio-balloon could go ahead and deflate the newpage, effectively dequeueing it and clearing PageBalloon, in turn making __is_movable_balloon_page(newpage) fail. This resulted in dropping the reference of the newpage via putback_lru_page(newpage) instead of put_page(newpage), leading to page->lru getting modified and a !LRU page ending up in the LRU lists. With 195a8c43e93d ("virtio-balloon: deflate via a page list") backported, one would suddenly get corrupted lists in release_pages_balloon(): - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100 Nowadays this race is no longer possible, but it is hidden behind very ugly handling of __ClearPageMovable() and __PageMovable(). __ClearPageMovable() will not make __PageMovable() fail, only PageMovable(). So the new check (__PageMovable(newpage)) will still hold even after newpage was dequeued by virtio-balloon. If anybody would ever change that special handling, the BUG would be introduced again. So instead, make it explicit and use the information of the original isolated page before migration. This patch can be backported fairly easy to stable kernels (in contrast to the refactoring). Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Vratislav Bendel <vbendel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vratislav Bendel <vbendel@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [3.12 - 4.7] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06mm: hwpoison: use do_send_sig_info() instead of force_sig()Naoya Horiguchi
commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream. Currently memory_failure() is racy against process's exiting, which results in kernel crash by null pointer dereference. The root cause is that memory_failure() uses force_sig() to forcibly kill asynchronous (meaning not in the current context) processes. As discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM fixes, this is not a right thing to do. OOM solves this issue by using do_send_sig_info() as done in commit d2d393099de2 ("signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this patch is suggesting to do the same for hwpoison. do_send_sig_info() properly accesses to siglock with lock_task_sighand(), so is free from the reported race. I confirmed that the reported bug reproduces with inserting some delay in kill_procs(), and it never reproduces with this patch. Note that memory_failure() can send another type of signal using force_sig_mceerr(), and the reported race shouldn't happen on it because force_sig_mceerr() is called only for synchronous processes (i.e. BUS_MCEERR_AR happens only when some process accesses to the corrupted memory.) Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06mm, oom: fix use-after-free in oom_kill_processShakeel Butt
commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream. Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepagesOscar Salvador
commit eeb0efd071d821a88da3fbd35f2d478f40d3b2ea upstream. This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm, page_alloc: fix has_unmovable_pages for HugePages"). Gigantic hugepages cross several memblocks, so it can be that the page we get in scan_movable_pages() is a page-tail belonging to a 1G-hugepage. If that happens, page_hstate()->size_to_hstate() will return NULL, and we will blow up in hugepage_migration_supported(). The splat is as follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1350 Comm: bash Tainted: G E 5.0.0-rc1-mm1-1-default+ #27 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:__offline_pages+0x6ae/0x900 Call Trace: memory_subsys_offline+0x42/0x60 device_offline+0x80/0xa0 state_store+0xab/0xc0 kernfs_fop_write+0x102/0x180 __vfs_write+0x26/0x190 vfs_write+0xad/0x1b0 ksys_write+0x42/0x90 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E) [akpm@linux-foundation.org: fix brace layout, per David. Reduce indentation] Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06oom, oom_reaper: do not enqueue same task twiceTetsuo Handa
commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream. Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAITAndrea Arcangeli
commit 1ac25013fb9e4ed595cd608a406191e93520881e upstream. hugetlb needs the same fix as faultin_nopage (which was applied in commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't in the FOLL_NOWAIT case. Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31Revert "mm, memory_hotplug: initialize struct pages for the full memory section"Michal Hocko
commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream. This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684. The underlying assumption that one sparse section belongs into a single numa node doesn't hold really. Robert Shteynfeld has reported a boot failure. The boot log was not captured but his memory layout is as follows: Early memory node ranges node 1: [mem 0x0000000000001000-0x0000000000090fff] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff] node 1: [mem 0x0000000100000000-0x0000001423ffffff] node 0: [mem 0x0000001424000000-0x0000002023ffffff] This means that node0 starts in the middle of a memory section which is also in node1. memmap_init_zone tries to initialize padding of a section even when it is outside of the given pfn range because there are code paths (e.g. memory hotplug) which assume that the full worth of memory section is always initialized. In this particular case, though, such a range is already intialized and most likely already managed by the page allocator. Scribbling over those pages corrupts the internal state and likely blows up when any of those pages gets used. Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com> Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") Cc: stable@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-26mm/swap: use nr_node_ids for avail_lists in swap_info_structAaron Lu
[ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ] Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26mm/page-writeback.c: don't break integrity writeback on ->writepage() errorBrian Foster
[ Upstream commit 3fa750dcf29e8606e3969d13d8e188cc1c0f511d ] write_cache_pages() is used in both background and integrity writeback scenarios by various filesystems. Background writeback is mostly concerned with cleaning a certain number of dirty pages based on various mm heuristics. It may not write the full set of dirty pages or wait for I/O to complete. Integrity writeback is responsible for persisting a set of dirty pages before the writeback job completes. For example, an fsync() call must perform integrity writeback to ensure data is on disk before the call returns. write_cache_pages() unconditionally breaks out of its processing loop in the event of a ->writepage() error. This is fine for background writeback, which had no strict requirements and will eventually come around again. This can cause problems for integrity writeback on filesystems that might need to clean up state associated with failed page writeouts. For example, XFS performs internal delayed allocation accounting before returning a ->writepage() error, where applicable. If the current writeback happens to be associated with an unmount and write_cache_pages() completes the writeback prematurely due to error, the filesystem is unmounted in an inconsistent state if dirty+delalloc pages still exist. To handle this problem, update write_cache_pages() to always process the full set of pages for integrity writeback regardless of ->writepage() errors. Save the first encountered error and return it to the caller once complete. This facilitates XFS (or any other fs that expects integrity writeback to process the entire set of dirty pages) to clean up its internal state completely in the event of persistent mapping errors. Background writeback continues to exit on the first error encountered. [akpm@linux-foundation.org: fix typo in comment] Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-16mm: page_mapped: don't assume compound page is huge or THPJan Stancek
commit 8ab88c7169b7fba98812ead6524b9d05bc76cf00 upstream. LTP proc01 testcase has been observed to rarely trigger crashes on arm64: page_mapped+0x78/0xb4 stable_page_flags+0x27c/0x338 kpageflags_read+0xfc/0x164 proc_reg_read+0x7c/0xb8 __vfs_read+0x58/0x178 vfs_read+0x90/0x14c SyS_read+0x60/0xc0 The issue is that page_mapped() assumes that if compound page is not huge, then it must be THP. But if this is 'normal' compound page (COMPOUND_PAGE_DTOR), then following loop can keep running (for HPAGE_PMD_NR iterations) until it tries to read from memory that isn't mapped and triggers a panic: for (i = 0; i < hpage_nr_pages(page); i++) { if (atomic_read(&page[i]._mapcount) >= 0) return true; } I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only with a custom kernel module [1] which: - allocates compound page (PAGEC) of order 1 - allocates 2 normal pages (COPY), which are initialized to 0xff (to satisfy _mapcount >= 0) - 2 PAGEC page structs are copied to address of first COPY page - second page of COPY is marked as not present - call to page_mapped(COPY) now triggers fault on access to 2nd COPY page at offset 0x30 (_mapcount) [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c Fix the loop to iterate for "1 << compound_order" pages. Kirrill said "IIRC, sound subsystem can producuce custom mapped compound pages". Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages") Signed-off-by: Jan Stancek <jstancek@redhat.com> Debugged-by: Laszlo Ersek <lersek@redhat.com> Suggested-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16mm, memcg: fix reclaim deadlock with writebackMichal Hocko
commit 63f3655f950186752236bb88a22f8252c11ce394 upstream. Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the ext4 writeback task1: wait_on_page_bit+0x82/0xa0 shrink_page_list+0x907/0x960 shrink_inactive_list+0x2c7/0x680 shrink_node_memcg+0x404/0x830 shrink_node+0xd8/0x300 do_try_to_free_pages+0x10d/0x330 try_to_free_mem_cgroup_pages+0xd5/0x1b0 try_charge+0x14d/0x720 memcg_kmem_charge_memcg+0x3c/0xa0 memcg_kmem_charge+0x7e/0xd0 __alloc_pages_nodemask+0x178/0x260 alloc_pages_current+0x95/0x140 pte_alloc_one+0x17/0x40 __pte_alloc+0x1e/0x110 alloc_set_pte+0x5fe/0xc20 do_fault+0x103/0x970 handle_mm_fault+0x61e/0xd10 __do_page_fault+0x252/0x4d0 do_page_fault+0x30/0x80 page_fault+0x28/0x30 task2: __lock_page+0x86/0xa0 mpage_prepare_extent_to_map+0x2e7/0x310 [ext4] ext4_writepages+0x479/0xd60 do_writepages+0x1e/0x30 __writeback_single_inode+0x45/0x320 writeback_sb_inodes+0x272/0x600 __writeback_inodes_wb+0x92/0xc0 wb_writeback+0x268/0x300 wb_workfn+0xb4/0x390 process_one_work+0x189/0x420 worker_thread+0x4e/0x4b0 kthread+0xe6/0x100 ret_from_fork+0x41/0x50 He adds "task1 is waiting for the PageWriteback bit of the page that task2 has collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED bit the page which tasks1 has locked" More precisely task1 is handling a page fault and it has a page locked while it charges a new page table to a memcg. That in turn hits a memory limit reclaim and the memcg reclaim for legacy controller is waiting on the writeback but that is never going to finish because the writeback itself is waiting for the page locked in the #PF path. So this is essentially ABBA deadlock: lock_page(A) SetPageWriteback(A) unlock_page(A) lock_page(B) lock_page(B) pte_alloc_pne shrink_page_list wait_on_page_writeback(A) SetPageWriteback(B) unlock_page(B) # flush A, B to clear the writeback This accumulating of more pages to flush is used by several filesystems to generate a more optimal IO patterns. Waiting for the writeback in legacy memcg controller is a workaround for pre-mature OOM killer invocations because there is no dirty IO throttling available for the controller. There is no easy way around that unfortunately. Therefore fix this specific issue by pre-allocating the page table outside of the page lock. We have that handy infrastructure for that already so simply reuse the fault-around pattern which already does this. There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations from under a fs page locked but they should be really rare. I am not aware of a better solution unfortunately. [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()] [akpm@linux-foundation.org: coding-style fixes] [mhocko@kernel.org: enhance comment, per Johannes] Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Liu Bo <bo.liu@linux.alibaba.com> Debugged-by: Liu Bo <bo.liu@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16mm/usercopy.c: no check page span for stack objectsQian Cai
commit 7bff3c06997374fb9b9991536a547b840549a813 upstream. It is easy to trigger this with CONFIG_HARDENED_USERCOPY_PAGESPAN=y, usercopy: Kernel memory overwrite attempt detected to spans multiple pages (offset 0, size 23)! kernel BUG at mm/usercopy.c:102! For example, print_worker_info char name[WQ_NAME_LEN] = { }; char desc[WORKER_DESC_LEN] = { }; probe_kernel_read(name, wq->name, sizeof(name) - 1); probe_kernel_read(desc, worker->desc, sizeof(desc) - 1); __copy_from_user_inatomic check_object_size check_heap_object check_page_span This is because on-stack variables could cross PAGE_SIZE boundary, and failed this check, if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) == ((unsigned long)end & (unsigned long)PAGE_MASK))) ptr = FFFF889007D7EFF8 end = FFFF889007D7F00E Hence, fix it by checking if it is a stack object first. [keescook@chromium.org: improve comments after reorder] Link: http://lkml.kernel.org/r/20190103165151.GA32845@beast Link: http://lkml.kernel.org/r/20181231030254.99441-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16slab: alien caches must not be initialized if the allocation of the alien ↵Christoph Lameter
cache failed commit 09c2e76ed734a1d36470d257a778aaba28e86531 upstream. Callers of __alloc_alien() check for NULL. We must do the same check in __alloc_alien_cache to avoid NULL pointer dereferences on allocation failures. Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com Fixes: 49dfc304ba241 ("slab: use the lock on alien_cache, instead of the lock on array_cache") Fixes: c8522a3a5832b ("Slab: introduce alloc_alien") Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: syzbot+d6ed4ec679652b4fd4e4@syzkaller.appspotmail.com Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13memcg, oom: notify on oom killer invocation from the charge pathMichal Hocko
commit 7056d3a37d2c6aaaab10c13e8e69adc67ec1fc65 upstream. Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") has moved the oom handling back to the charge path. While doing so the notification was left behind in mem_cgroup_oom_synchronize. Fix the issue by replicating the oom hierarchy locking and the notification. Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Burt Holzman <burt@fnal.gov> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13mm, swap: fix swapoff with KSM pagesHuang Ying
commit 7af7a8e19f0c5425ff639b0f0d2d244c2a647724 upstream. KSM pages may be mapped to the multiple VMAs that cannot be reached from one anon_vma. So during swapin, a new copy of the page need to be generated if a different anon_vma is needed, please refer to comments of ksm_might_need_to_copy() for details. During swapoff, unuse_vma() uses anon_vma (if available) to locate VMA and virtual address mapped to the page, so not all mappings to a swapped out KSM page could be found. So in try_to_unuse(), even if the swap count of a swap entry isn't zero, the page needs to be deleted from swap cache, so that, in the next round a new page could be allocated and swapin for the other mappings of the swapped out KSM page. But this contradicts with the THP swap support. Where the THP could be deleted from swap cache only after the swap count of every swap entry in the huge swap cluster backing the THP has reach 0. So try_to_unuse() is changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") to check that before delete a page from swap cache, but this has broken KSM swapoff too. Fortunately, KSM is for the normal pages only, so the original behavior for KSM pages could be restored easily via checking PageTransCompound(). That is how this patch works. The bug is introduced by e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out"), which is merged by v4.14-rc1. So I think we should backport the fix to from 4.14 on. But Hugh thinks it may be rare for the KSM pages being in the swap device when swapoff, so nobody reports the bug so far. Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPLDan Williams
commit 02917e9f8676207a4c577d4d94eae12bf348e9d7 upstream. At Maintainer Summit, Greg brought up a topic I proposed around EXPORT_SYMBOL_GPL usage. The motivation was considerations for when EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional step of reclassifying an existing export. Specifically, I wanted to make the case that although the line is fuzzy and hard to specify in abstract terms, it is nonetheless clear that devm_memremap_pages() and HMM (Heterogeneous Memory Management) have crossed it. The devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the beginning, and HMM as a derivative of that functionality should have naturally picked up that designation as well. Contrary to typical rules, the HMM infrastructure was merged upstream with zero in-tree consumers. There was a promise at the time that those users would be merged "soon", but it has been over a year with no drivers arriving. While the Nouveau driver is about to belatedly make good on that promise it is clear that HMM was targeted first and foremost at an out-of-tree consumer. HMM is derived from devm_memremap_pages(), a facility Christoph and I spearheaded to support persistent memory. It combines a device lifetime model with a dynamically created 'struct page' / memmap array for any physical address range. It enables coordination and control of the many code paths in the kernel built to interact with memory via 'struct page' objects. With HMM the integration goes even deeper by allowing device drivers to hook and manipulate page fault and page free events. One interpretation of when EXPORT_SYMBOL is suitable is when it is exporting stable and generic leaf functionality. The devm_memremap_pages() facility continues to see expanding use cases, peer-to-peer DMA being the most recent, with no clear end date when it will stop attracting reworks and semantic changes. It is not suitable to export devm_memremap_pages() as a stable 3rd party driver API due to the fact that it is still changing and manipulates core behavior. Moreover, it is not in the best interest of the long term development of the core memory management subsystem to permit any external driver to effectively define its own system-wide memory management policies with no encouragement to engage with upstream. I am also concerned that HMM was designed in a way to minimize further engagement with the core-MM. That, with these hooks in place, device-drivers are free to implement their own policies without much consideration for whether and how the core-MM could grow to meet that need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the core-MM should be allowed the opportunity and stimulus to change and address these new use cases as first class functionality. Original changelog: hmm_devmem_add(), and hmm_devmem_add_resource() duplicated devm_memremap_pages() and are now simple now wrappers around the core facility to inject a dev_pagemap instance into the global pgmap_radix and hook page-idle events. The devm_memremap_pages() interface is base infrastructure for HMM. HMM has more and deeper ties into the kernel memory management implementation than base ZONE_DEVICE which is itself a EXPORT_SYMBOL_GPL facility. Originally, the HMM page structure creation routines copied the devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the implementations was discussed during the initial review: http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled this cleanup to move forward. In addition to the integration with devm_memremap_pages() HMM depends on other GPL-only symbols: mmu_notifier_unregister_no_release percpu_ref region_intersects __class_create It goes further to consume / indirectly expose functionality that is not exported to any other driver: alloc_pages_vma walk_page_range HMM is derived from devm_memremap_pages(), and extends deep core-kernel fundamentals. Similar to devm_memremap_pages(), mark its entry points EXPORT_SYMBOL_GPL(). [logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()] Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com>, Cc: Michal Hocko <mhocko@suse.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13mm, hmm: use devm semantics for hmm_devmem_{add, remove}Dan Williams
commit 58ef15b765af0d2cbe6799ec564f1dc485010ab8 upstream. devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13hwpoison, memory_hotplug: allow hwpoisoned pages to be offlinedMichal Hocko
commit b15c87263a69272423771118c653e9a1d0672caa upstream. We have received a bug report that an injected MCE about faulty memory prevents memory offline to succeed on 4.4 base kernel. The underlying reason was that the HWPoison page has an elevated reference count and the migration keeps failing. There are two problems with that. First of all it is dubious to migrate the poisoned page because we know that accessing that memory is possible to fail. Secondly it doesn't make any sense to migrate a potentially broken content and preserve the memory corruption over to a new location. Oscar has found out that 4.4 and the current upstream kernels behave slightly differently with his simply testcase === int main(void) { int ret; int i; int fd; char *array = malloc(4096); char *array_locked = malloc(4096); fd = open("/tmp/data", O_RDONLY); read(fd, array, 4095); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked)); if (ret) perror("mlock"); sleep (20); ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON); if (ret) perror("madvise"); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; return 0; } === + offline this memory. In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU list kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340 kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170 kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40 kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0 kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160 kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100 kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0 kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70 kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs] kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200 kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0 kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660 kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140 kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120 kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50 kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0 kernel: [<ffffffff81215f08>] do_execve+0x28/0x30 kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130 kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80 And that latter confuses the hotremove path because an LRU page is attempted to be migrated and that fails due to an elevated reference count. It is quite possible that the reuse of the HWPoisoned page is some kind of fixed race condition but I am not really sure about that. With the upstream kernel the failure is slightly different. The page doesn't seem to have LRU bit set but isolate_movable_page simply fails and do_migrate_range simply puts all the isolated pages back to LRU and therefore no progress is made and scan_movable_pages finds same set of pages over and over again. Fix both cases by explicitly checking HWPoisoned pages before we even try to get reference on the page, try to unmap it if it is still mapped. As explained by Naoya: : Hwpoison code never unmapped those for no big reason because : Ksm pages never dominate memory, so we simply didn't have strong : motivation to save the pages. Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU HWPoison pages which shouldn't happen but I couldn't convince myself about that. Naoya has noted the following: : Theoretically no such gurantee, because try_to_unmap() doesn't have a : guarantee of success and then memory_failure() returns immediately : when hwpoison_user_mappings fails. : Or the following code (comes after hwpoison_user_mappings block) also impli= : es : that the target page can still have PageLRU flag. : : /* : * Torn down by someone else? : */ : if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) { : action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED); : res =3D -EBUSY; : goto out; : } : : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in : current version of your patch. Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.com> Debugged-by: Oscar Salvador <osalvador@suse.com> Tested-by: Oscar Salvador <osalvador@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29mm: don't miss the last page because of round-off errorRoman Gushchin
commit 68600f623d69da428c6163275f97ca126e1a8ec5 upstream. I've noticed, that dying memory cgroups are often pinned in memory by a single pagecache page. Even under moderate memory pressure they sometimes stayed in such state for a long time. That looked strange. My investigation showed that the problem is caused by applying the LRU pressure balancing math: scan = div64_u64(scan * fraction[lru], denominator), where denominator = fraction[anon] + fraction[file] + 1. Because fraction[lru] is always less than denominator, if the initial scan size is 1, the result is always 0. This means the last page is not scanned and has no chances to be reclaimed. Fix this by rounding up the result of the division. In practice this change significantly improves the speed of dying cgroups reclaim. [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments] Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29mm, page_alloc: fix has_unmovable_pages for HugePagesOscar Salvador
commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream. While playing with gigantic hugepages and memory_hotplug, I triggered the following #PF when "cat memoryX/removable": BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:has_unmovable_pages+0x154/0x210 Call Trace: is_mem_section_removable+0x7d/0x100 removable_show+0x90/0xb0 dev_attr_show+0x1c/0x50 sysfs_kf_seq_show+0xca/0x1b0 seq_read+0x133/0x380 __vfs_read+0x26/0x180 vfs_read+0x89/0x140 ksys_read+0x42/0x90 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is we do not pass the Head to page_hstate(), and so, the call to compound_order() in page_hstate() returns 0, so we end up checking all hstates's size to match PAGE_SIZE. Obviously, we do not find any hstate matching that size, and we return NULL. Then, we dereference that NULL pointer in hugepage_migration_supported() and we got the #PF from above. Fix that by getting the head page before calling page_hstate(). Also, since gigantic pages span several pageblocks, re-adjust the logic for skipping pages. While are it, we can also get rid of the round_up(). [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal] Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29mm: thp: fix flags for pmd migration when splitPeter Xu
commit 2e83ee1d8694a61d0d95a5b694f2e61e8dde8627 upstream. When splitting a huge migrating PMD, we'll transfer all the existing PMD bits and apply them again onto the small PTEs. However we are fetching the bits unconditionally via pmd_soft_dirty(), pmd_write() or pmd_yound() while actually they don't make sense at all when it's a migration entry. Fix them up. Since at it, drop the ifdef together as not needed. Note that if my understanding is correct about the problem then if without the patch there is chance to lose some of the dirty bits in the migrating pmd pages (on x86_64 we're fetching bit 11 which is part of swap offset instead of bit 2) and it could potentially corrupt the memory of an userspace program which depends on the dirty bit. Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29mm, memory_hotplug: initialize struct pages for the full memory sectionMikhail Zaslonko
commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream. If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct page access from is_mem_section_removable() or test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers: Here are the the panic examples: CONFIG_DEBUG_VM=y CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: ( test_pages_in_a_zone+0xde/0x160) show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops kernel parameter mem=3075M -------------------------- page:000003d08300c000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: ( is_mem_section_removable+0xb4/0x190) show_mem_removable+0x9a/0xd8 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: is_mem_section_removable+0xb4/0x190 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix the problem by initializing the last memory section of each zone in memmap_init_zone() till the very end, even if it goes beyond the zone end. Michal said: : This has alwways been problem AFAIU. It just went unnoticed because we : have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop : zeroing memory during allocation in vmemmap") and so the above test : would simply skip these ranges as belonging to zone 0 or provided a : garbage. : : So I guess we do care for post f7f99100d8d9 kernels mostly and : therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during : allocation in vmemmap") Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17mm/page_alloc.c: fix calculation of pgdat->nr_zonesWei Yang
[ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ] init_currently_empty_zone() will adjust pgdat->nr_zones and set it to 'zone_idx(zone) + 1' unconditionally. This is correct in the normal case, while not exact in hot-plug situation. This function is used in two places: * free_area_init_core() * move_pfn_range_to_zone() In the first case, we are sure zone index increase monotonically. While in the second one, this is under users control. One way to reproduce this is: ---------------------------- 1. create a virtual machine with empty node1 -m 4G,slots=32,maxmem=32G \ -smp 4,maxcpus=8 \ -numa node,nodeid=0,mem=4G,cpus=0-3 \ -numa node,nodeid=1,mem=0G,cpus=4-7 2. hot-add cpu 3-7 cpu-add [3-7] 2. hot-add memory to nod1 object_add memory-backend-ram,id=ram0,size=1G device_add pc-dimm,id=dimm0,memdev=ram0,node=1 3. online memory with following order echo online_movable > memory47/state echo online > memory40/state After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1) instead of (ZONE_MOVABLE + 1). Michal said: "Having an incorrect nr_zones might result in all sorts of problems which would be quite hard to debug (e.g. reclaim not considering the movable zone). I do not expect many users would suffer from this it but still this is trivial and obviously right thing to do so backporting to the stable tree shouldn't be harmful (last famous words)" Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-08userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not setAndrea Arcangeli
commit dcf7fe9d89763a28e0f43975b422ff141fe79e43 upstream. Set the page dirty if VM_WRITE is not set because in such case the pte won't be marked dirty and the page would be reclaimed without writepage (i.e. swapout in the shmem case). This was found by source review. Most apps (certainly including QEMU) only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't modify the memory in the first place. This is for correctness and it could help the non cooperative use case to avoid unexpected data loss. Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com Reviewed-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08userfaultfd: shmem: add i_size checksAndrea Arcangeli
commit e2a50c1f64145a04959df2442305d57307e5395a upstream. With MAP_SHARED: recheck the i_size after taking the PT lock, to serialize against truncate with the PT lock. Delete the page from the pagecache if the i_size_read check fails. With MAP_PRIVATE: check the i_size after the PT lock before mapping anonymous memory or zeropages into the MAP_PRIVATE shmem mapping. A mostly irrelevant cleanup: like we do the delete_from_page_cache() pagecache removal after dropping the PT lock, the PT lock is a spinlock so drop it before the sleepable page lock. Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Reported-by: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmemAndrea Arcangeli
commit 5b51072e97d587186c2f5390c8c9c1fb7e179505 upstream. Userfaultfd did not create private memory when UFFDIO_COPY was invoked on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file, even when that had not been opened for writing. Though, fortunately, that could only happen where there was a hole in the file. Fix the shmem-backed implementation of UFFDIO_COPY to create private memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation was already correct. This change is visible to userland, if userfaultfd has been used in unintended ways: so it introduces a small risk of incompatibility, but is necessary in order to respect file permissions. An app that uses UFFDIO_COPY for anything like postcopy live migration won't notice the difference, and in fact it'll run faster because there will be no copy-on-write and memory waste in the tmpfs pagecache anymore. Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like before. The real zeropage can also be built on a MAP_PRIVATE shmem mapping through UFFDIO_ZEROPAGE and that's safe because the zeropage pte is never dirty, in turn even an mprotect upgrading the vma permission from PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable. Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08userfaultfd: use ENOENT instead of EFAULT if the atomic copy user failsAndrea Arcangeli
commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream. Patch series "userfaultfd shmem updates". Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the lack of the VM_MAYWRITE check and the lack of i_size checks. Then looking into the above we also fixed the MAP_PRIVATE case. Hugh by source review also found a data loss source if UFFDIO_COPY is used on shmem MAP_SHARED PROT_READ mappings (the production usages incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't happen in those production usages like with QEMU). The whole patchset is marked for stable. We verified QEMU postcopy live migration with guest running on shmem MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE. Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a punch hole if the guest mapping is filebacked and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and for the anon backend). This patch (of 5): We internally used EFAULT to communicate with the caller, switch to ENOENT, so EFAULT can be used as a non internal retval. Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: <stable@vger.kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-05mm: use swp_offset as key in shmem_replace_page()Yu Zhao
commit c1cb20d43728aa9b5393bd8d489bc85c142949b2 upstream. We changed the key of swap cache tree from swp_entry_t.val to swp_offset. We need to do so in shmem_replace_page() as well. Hugh said: "shmem_replace_page() has been wrong since the day I wrote it: good enough to work on swap "type" 0, which is all most people ever use (especially those few who need shmem_replace_page() at all), but broken once there are any non-0 swp_type bits set in the higher order bits" Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache") Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>