aboutsummaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2021-09-08Merge tag 'v5.13.14' into v5.13/standard/baseBruce Ashfield
This is the 5.13.14 stable release # gpg: Signature made Fri 03 Sep 2021 04:28:42 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-09-03mm/memory_hotplug: fix potential permanent lru cache disableMiaohe Lin
commit 946746d1ad921e5f493b536533dda02ea22ca609 upstream. If offline_pages failed after lru_cache_disable(), it forgot to do lru_cache_enable() in error path. So we would have lru cache disabled permanently in this case. Link: https://lkml.kernel.org/r/20210821094246.10149-3-linmiaohe@huawei.com Fixes: d479960e44f2 ("mm: disable LRU pagevec during the migration temporarily") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-01Merge tag 'v5.13.13' into v5.13/standard/baseBruce Ashfield
Linux 5.13.13 # gpg: Signature made Thu 26 Aug 2021 08:50:30 AM EDT # gpg: using RSA key E27E5D8A3403A2EF66873BBCDEA66FF797772CDC # gpg: Can't check signature: No public key
2021-08-26hugetlb: don't pass page cache pages to restore_reserve_on_errorMike Kravetz
[ Upstream commit c7b1850dfb41d0b4154aca8dbc04777fbd75616f ] syzbot hit kernel BUG at fs/hugetlbfs/inode.c:532 as described in [1]. This BUG triggers if the HPageRestoreReserve flag is set on a page in the page cache. It should never be set, as the routine huge_add_to_page_cache explicitly clears the flag after adding a page to the cache. The only code other than huge page allocation which sets the flag is restore_reserve_on_error. It will potentially set the flag in rare out of memory conditions. syzbot was injecting errors to cause memory allocation errors which exercised this specific path. The code in restore_reserve_on_error is doing the right thing. However, there are instances where pages in the page cache were being passed to restore_reserve_on_error. This is incorrect, as once a page goes into the cache reservation information will not be modified for the page until it is removed from the cache. Error paths do not remove pages from the cache, so even in the case of error, the page will remain in the cache and no reservation adjustment is needed. Modify routines that potentially call restore_reserve_on_error with a page cache page to no longer do so. Note on fixes tag: Prior to commit 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality") the routine would not process page cache pages because the HPageRestoreReserve flag is not set on such pages. Therefore, this issue could not be trigggered. The code added by commit 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality") is needed and correct. It exposed incorrect calls to restore_reserve_on_error which is the root cause addressed by this commit. [1] https://lore.kernel.org/linux-mm/00000000000050776d05c9b7c7f0@google.com/ Link: https://lkml.kernel.org/r/20210818213304.37038-1-mike.kravetz@oracle.com Fixes: 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error functionality") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: <syzbot+67654e51e54455f1c585@syzkaller.appspotmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26mm/hwpoison: retry with shake_page() for unhandlable pagesNaoya Horiguchi
[ Upstream commit fcc00621d88b274b5dffd8daeea71d0e4c28b84e ] HWPoisonHandlable() sometimes returns false for typical user pages due to races with average memory events like transfers over LRU lists. This causes failures in hwpoison handling. There's retry code for such a case but does not work because the retry loop reaches the retry limit too quickly before the page settles down to handlable state. Let get_any_page() call shake_page() to fix it. [naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached] Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26mm,hwpoison: make get_hwpoison_page() call get_any_page()Naoya Horiguchi
[ Upstream commit 0ed950d1f28142ccd9a9453c60df87853530d778 ] __get_hwpoison_page() could fail to grab refcount by some race condition, so it's helpful if we can handle it by retrying. We already have retry logic, so make get_hwpoison_page() call get_any_page() when called from memory_failure(). As a result, get_hwpoison_page() can return negative values (i.e. error code), so some callers are also changed to handle error cases. soft_offline_page() does nothing for -EBUSY because that's enough and users in userspace can easily handle it. unpoison_memory() is also unchanged because it's broken and need thorough fixes (will be done later). Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaimJohannes Weiner
[ Upstream commit f56ce412a59d7d938b81de8878faef128812482c ] We've noticed occasional OOM killing when memory.low settings are in effect for cgroups. This is unexpected and undesirable as memory.low is supposed to express non-OOMing memory priorities between cgroups. The reason for this is proportional memory.low reclaim. When cgroups are below their memory.low threshold, reclaim passes them over in the first round, and then retries if it couldn't find pages anywhere else. But when cgroups are slightly above their memory.low setting, page scan force is scaled down and diminished in proportion to the overage, to the point where it can cause reclaim to fail as well - only in that case we currently don't retry, and instead trigger OOM. To fix this, hook proportional reclaim into the same retry logic we have in place for when cgroups are skipped entirely. This way if reclaim fails and some cgroups were scanned with diminished pressure, we'll try another full-force cycle before giving up and OOMing. [akpm@linux-foundation.org: coding-style fixes] Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Leon Yang <lnyng@fb.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-22Merge tag 'v5.13.12' into v5.13/standard/baseBruce Ashfield
This is the 5.13.12 stable release # gpg: Signature made Wed 18 Aug 2021 03:07:34 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-08-18kasan, slub: reset tag when printing addressKuan-Ying Lee
commit 340caf178ddc2efb0294afaf54c715f7928c258e upstream. The address still includes the tags when it is printed. With hardware tag-based kasan enabled, we will get a false positive KASAN issue when we access metadata. Reset the tag before we access the metadata. Link: https://lkml.kernel.org/r/20210804090957.12393-3-Kuan-Ying.Lee@mediatek.com Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-08Merge tag 'v5.13.8' into v5.13/standard/baseBruce Ashfield
This is the 5.13.8 stable release # gpg: Signature made Wed 04 Aug 2021 06:48:09 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-08-04mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()Wang Hai
commit 121dffe20b141c9b27f39d49b15882469cbebae7 upstream. When I use kfree_rcu() to free a large memory allocated by kmalloc_node(), the following dump occurs. BUG: kernel NULL pointer dereference, address: 0000000000000020 [...] Oops: 0000 [#1] SMP [...] Workqueue: events kfree_rcu_work RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline] RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline] RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363 [...] Call Trace: kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293 kfree_bulk include/linux/slab.h:413 [inline] kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300 process_one_work+0x207/0x530 kernel/workqueue.c:2276 worker_thread+0x320/0x610 kernel/workqueue.c:2422 kthread+0x13d/0x160 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 When kmalloc_node() a large memory, page is allocated, not slab, so when freeing memory via kfree_rcu(), this large memory should not be used by memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for slab. Using page_objcgs_check() instead of page_objcgs() in memcg_slab_free_hook() to fix this bug. Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data") Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04mm: memcontrol: fix blocking rstat function called from atomic cgroup1 ↵Johannes Weiner
thresholding code commit 30def93565e5ba08676aa2b9083f253fc586dbed upstream. Dan Carpenter reports: The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr 29, 2021, leads to the following static checker warning: kernel/cgroup/rstat.c:200 cgroup_rstat_flush() warn: sleeping in atomic context mm/memcontrol.c 3572 static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 3573 { 3574 unsigned long val; 3575 3576 if (mem_cgroup_is_root(memcg)) { 3577 cgroup_rstat_flush(memcg->css.cgroup); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is from static analysis and potentially a false positive. The problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold() which holds an rcu_read_lock(). And the cgroup_rstat_flush() function can sleep. 3578 val = memcg_page_state(memcg, NR_FILE_PAGES) + 3579 memcg_page_state(memcg, NR_ANON_MAPPED); 3580 if (swap) 3581 val += memcg_page_state(memcg, MEMCG_SWAP); 3582 } else { 3583 if (!swap) 3584 val = page_counter_read(&memcg->memory); 3585 else 3586 val = page_counter_read(&memcg->memsw); 3587 } 3588 return val; 3589 } __mem_cgroup_threshold() indeed holds the rcu lock. In addition, the thresholding code is invoked during stat changes, and those contexts have irqs disabled as well. If the lock breaking occurs inside the flush function, it will result in a sleep from an atomic context. Use the irqsafe flushing variant in mem_cgroup_usage() to fix this. Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-02Merge tag 'v5.13.6' into v5.13/standard/baseBruce Ashfield
This is the 5.13.6 stable release # gpg: Signature made Wed 28 Jul 2021 08:40:51 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-07-28mm: fix the deadlock in finish_fault()Qi Zheng
commit e4dc3489143f84f7ed30be58b886bb6772f229b9 upstream. Commit 63f3655f9501 ("mm, memcg: fix reclaim deadlock with writeback") fix the following ABBA deadlock by pre-allocating the pte page table without holding the page lock. lock_page(A) SetPageWriteback(A) unlock_page(A) lock_page(B) lock_page(B) pte_alloc_one shrink_page_list wait_on_page_writeback(A) SetPageWriteback(B) unlock_page(B) # flush A, B to clear the writeback Commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") reworked the relevant code but ignored this race. This will cause the deadlock above to appear again, so fix it. Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regionsMike Rapoport
commit 79e482e9c3ae86e849c701c846592e72baddda5a upstream. Commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()") didn't take into account that when there is movable_node parameter in the kernel command line, for_each_mem_range() would skip ranges marked with MEMBLOCK_HOTPLUG. The page table setup code in POWER uses for_each_mem_range() to create the linear mapping of the physical memory and since the regions marked as MEMORY_HOTPLUG are skipped, they never make it to the linear map. A later access to the memory in those ranges will fail: BUG: Unable to handle kernel data access on write at 0xc000000400000000 Faulting instruction address: 0xc00000000008a3c0 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7 NIP: c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040 REGS: c000000008a57770 TRAP: 0300 Not tainted (5.13.0) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 84222202 XER: 20040000 CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0 GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000 GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200 GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300 GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000 GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0 GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680 NIP clear_user_page+0x50/0x80 LR __handle_mm_fault+0xc88/0x1910 Call Trace: __handle_mm_fault+0xc44/0x1910 (unreliable) handle_mm_fault+0x130/0x2a0 __get_user_pages+0x248/0x610 __get_user_pages_remote+0x12c/0x3e0 get_arg_page+0x54/0xf0 copy_string_kernel+0x11c/0x210 kernel_execve+0x16c/0x220 call_usermodehelper_exec_async+0x1b0/0x2f0 ret_from_kernel_thread+0x5c/0x70 Instruction dump: 79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050 7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec ---[ end trace 490b8c67e6075e09 ]--- Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the traversal fixes this issue. Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100 Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org Fixes: b10d6bca8720 ("arch, drivers: replace for_each_membock() with for_each_mem_range()") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interactionSergei Trofimovich
commit 69e5d322a2fb86173fde8bad26e8eb38cad1b1e9 upstream. To reproduce the failure we need the following system: - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0 - kernel config: * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y * CONFIG_INIT_ON_FREE_DEFAULT_ON=y * CONFIG_PAGE_POISONING=y Resulting in: 0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G U O 5.13.1-gentoo-x86_64 #1 Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021 Call Trace: dump_stack+0x64/0x7c __kernel_unpoison_pages.cold+0x48/0x84 post_alloc_hook+0x60/0xa0 get_page_from_freelist+0xdb8/0x1000 __alloc_pages+0x163/0x2b0 __get_free_pages+0xc/0x30 pgd_alloc+0x2e/0x1a0 mm_init+0x185/0x270 dup_mm+0x6b/0x4f0 copy_process+0x190d/0x1b10 kernel_clone+0xba/0x3b0 __do_sys_clone+0x8f/0xb0 do_syscall_64+0x68/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches") init_on_alloc never enabled static branch by default. It could only be enabed explicitly by init_mem_debugging_and_hardening(). But after commit 51cba1ebc60d, a static branch could already be enabled by default. There was no code to ever disable it. That caused page_poison=1 / init_on_free=1 conflict. This change extends init_mem_debugging_and_hardening() to also disable static branch disabling. Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org Fixes: 51cba1ebc60d ("init_on_alloc: Optimize static branches") Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Kees Cook <keescook@chromium.org> Reported-by: Mikhail Morfikov <mmorfikov@gmail.com> Reported-by: <bowsingbetee@pm.me> Tested-by: <bowsingbetee@protonmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28kfence: skip all GFP_ZONEMASK allocationsAlexander Potapenko
commit 236e9f1538523d3d380dda1cc99571d587058f37 upstream. Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot be fulfilled by KFENCE, because KFENCE memory pool is located in a zone different from the requested one. Because callers of kmem_cache_alloc() may actually rely on the allocation to reside in the requested zone (e.g. memory allocations done with __GFP_DMA must be DMAable), skip all allocations done with GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and SLAB_CACHE_DMA32). Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28kfence: move the size check to the beginning of __kfence_alloc()Alexander Potapenko
commit 235a85cb32bb123854ad31de46fdbf04c1d57cda upstream. Check the allocation size before toggling kfence_allocation_gate. This way allocations that can't be served by KFENCE will not result in waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating anything. Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25Merge tag 'v5.13.5' into v5.13/standard/baseBruce Ashfield
This is the 5.13.5 stable release # gpg: Signature made Sun 25 Jul 2021 08:37:43 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-07-25mm/userfaultfd: fix uffd-wp special cases for fork()Peter Xu
commit 8f34f1eac3820fc2722e5159acceb22545b30b0d upstream. We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") previously, but it's not doing it all right.. A few fixes around the code path: 1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather than the new vma. That's overlooked in b569a1760782, so it won't work as expected. Thanks to the recent rework on fork code (7a4830c380f3a8b3), we can easily get the new vma now, so switch the checks to that. 2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd is a migration huge pmd. When it happens, instead of using pmd_uffd_wp(), we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately. 3. Forget to carry over uffd-wp bit for a write migration huge pmd entry. This also happens in copy_huge_pmd(), where we converted a write huge migration entry into a read one. 4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes. 5. In copy_present_page() when COW is enforced when fork(), we also need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new vma, and when the pte to be copied has uffd-wp bit set. Remove the comment in copy_present_pte() about this. It won't help a huge lot to only comment there, but comment everywhere would be an overkill. Let's assume the commit messages would help. [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit] Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25mm/thp: simplify copying of huge zero page pmd when forkPeter Xu
commit 5fc7a5f6fd04bc18f309d9f979b32ef7d1d0a997 upstream. Patch series "mm/uffd: Misc fix for uffd-wp and one more test". This series tries to fix some corner case bugs for uffd-wp on either thp or fork(). Then it introduced a new test with pagemap/pageout. Patch layout: Patch 1: cleanup for THP, it'll slightly simplify the follow up patches Patch 2-4: misc fixes for uffd-wp here and there; please refer to each patch Patch 5: add pagemap support for uffd-wp Patch 6: add pagemap/pageout test for uffd-wp The last test introduced can also verify some of the fixes in previous patches, as the test will fail without the fixes. However it's not easy to verify all the changes in patch 2-4, but hopefully they can still be properly reviewed. Note that if considering the ongoing uffd-wp shmem & hugetlbfs work, patch 5 will be incomplete as it's missing e.g. hugetlbfs part or the special swap pte detection. However that's not needed in this series, and since that series is still during review, this series does not depend on that one (the last test only runs with anonymous memory, not file-backed). So this series can be merged even before that series. This patch (of 6): Huge zero page is handled in a special path in copy_huge_pmd(), however it should share most codes with a normal thp page. Trying to share more code with it by removing the special path. The only leftover so far is the huge zero page refcounting (mm_get_huge_zero_page()), because that's separately done with a global counter. This prepares for a future patch to modify the huge pmd to be installed, so that we don't need to duplicate it explicitly into huge zero page case too. Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brian Geffon <bgeffon@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Wang Qing <wangqing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25Revert "mm/shmem: fix shmem_swapin() race with swapoff"Greg Kroah-Hartman
This reverts commit a533a21b692fc15a6aadfa827b29c7d9989109ca which is commit 2efa33fc7f6ec94a3a538c1a264273c889be2b36 upstream. It should not have been added to the stable trees, sorry about that. Link: https://lore.kernel.org/r/YPVgaY6uw59Fqg5x@casper.infradead.org Reported-by: From: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25Revert "swap: fix do_swap_page() race with swapoff"Greg Kroah-Hartman
This reverts commit c3b39134bbd088b7dce5e5f342ccd6bb9142fd18 which is commit 2799e77529c2a25492a4395db93996e3dacd762d upstream. It should not have been added to the stable trees, sorry about that. Link: https://lore.kernel.org/r/YPVgaY6uw59Fqg5x@casper.infradead.org Reported-by: From: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20Merge branch 'v5.13/base' into v5.13/standard/baseBruce Ashfield
2021-07-20mm/hugetlb: fix refs calculation from unaligned @vaddrJoao Martins
commit d08af0a59684e18a51aa4bfd24c658994ea3fc5b upstream. Commit 82e5d378b0e47 ("mm/hugetlb: refactor subpage recording") refactored the count of subpages but missed an edge case when @vaddr is not aligned to PAGE_SIZE e.g. when close to vma->vm_end. It would then errousnly set @refs to 0 and record_subpages_vmas() wouldn't set the @pages array element to its value, consequently causing the reported null-deref by syzbot. Fix it by aligning down @vaddr by PAGE_SIZE in @refs calculation. Link: https://lkml.kernel.org/r/20210713152440.28650-1-joao.m.martins@oracle.com Fixes: 82e5d378b0e47 ("mm/hugetlb: refactor subpage recording") Reported-by: syzbot+a3fcd59df1b372066f5a@syzkaller.appspotmail.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19Merge branch 'v5.13/base' into v5.13/standard/baseBruce Ashfield
2021-07-19mm/mremap: hold the rmap lock in write mode when moving page table entries.Aneesh Kumar K.V
commit 97113eb39fa7972722ff490b947d8af023e1f6a2 upstream. To avoid a race between rmap walk and mremap, mremap does take_rmap_locks(). The lock was taken to ensure that rmap walk don't miss a page table entry due to PTE moves via move_pagetables(). The kernel does further optimization of this lock such that if we are going to find the newly added vma after the old vma, the rmap lock is not taken. This is because rmap walk would find the vmas in the same order and if we don't find the page table attached to older vma we would find it with the new vma which we would iterate later. As explained in commit eb66ae030829 ("mremap: properly flush TLB before releasing the page") mremap is special in that it doesn't take ownership of the page. The optimized version for PUD/PMD aligned mremap also doesn't hold the ptl lock. This can result in stale TLB entries as show below. This patch updates the rmap locking requirement in mremap to handle the race condition explained below with optimized mremap:: Optmized PMD move CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pte_ptl) lock(pmd_ptl) pmd = *old_pmd pmd_clear(old_pmd) flush_tlb_range(old_addr) *new_pmd = pmd *new_addr = 10; and fills TLB with new addr and old pfn unlock(pmd_ptl) ptep_clear_flush() old pfn is free. Stale TLB entry Optimized PUD move also suffers from a similar race. Both the above race condition can be fixed if we force mremap path to take rmap lock. Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions") Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions") Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14Merge tag 'v5.13.2' into v5.13/standard/baseBruce Ashfield
This is the 5.13.2 stable release # gpg: Signature made Wed 14 Jul 2021 11:08:07 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2021-07-14Merge tag 'v5.13.1' into v5.13/standard/baseBruce Ashfield
Linux 5.13.1
2021-07-14kfence: unconditionally use unbound work queueMarco Elver
[ Upstream commit ff06e45d3aace3f93d23956c1e655224f363ebe2 ] Unconditionally use unbound work queue, and not just if wq_power_efficient is true. Because if the system is idle, KFENCE may wait, and by being run on the unbound work queue, we permit the scheduler to make better scheduling decisions and not require pinning KFENCE to the same CPU upon waking up. Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Hillf Danton <hdanton@sina.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/zswap.c: fix two bugs in zswap_writeback_entry()Miaohe Lin
[ Upstream commit 46b76f2e09dc35f70aca2f4349eb0d158f53fe93 ] In the ZSWAP_SWAPCACHE_FAIL and ZSWAP_SWAPCACHE_EXIST case, we forgot to call zpool_unmap_handle() when zpool can't sleep. And we might sleep in zswap_get_swap_cache_page() while zpool can't sleep. To fix all of these, zpool_unmap_handle() should be done before zswap_get_swap_cache_page() when zpool can't sleep. Link: https://lkml.kernel.org/r/20210522092242.3233191-4-linmiaohe@huawei.com Fixes: fc6697a89f56 ("mm/zswap: add the flag can_sleep_mapped") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Seth Jennings <sjenning@redhat.com> Cc: Tian Tao <tiantao6@hisilicon.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm: migrate: fix missing update page_private to hugetlb_page_subpoolMuchun Song
[ Upstream commit 6acfb5ba150cf75005ce85e0e25d79ef2fec287c ] Since commit d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") converts page.private for hugetlb specific page flags. We should use hugetlb_page_subpool() to get the subpool pointer instead of page_private(). This 'could' prevent the migration of hugetlb pages. page_private(hpage) is now used for hugetlb page specific flags. At migration time, the only flag which could be set is HPageVmemmapOptimized. This flag will only be set if the new vmemmap reduction feature is enabled. In addition, !page_mapping() implies an anonymous mapping. So, this will prevent migration of hugetb pages in anonymous mappings if the vmemmap reduction feature is enabled. In addition, that if statement checked for the rare race condition of a page being migrated while in the process of being freed. Since that check is now wrong, we could leak hugetlb subpool usage counts. The commit forgot to update it in the page migration routine. So fix it. [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy] Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> [arm64] Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/z3fold: use release_z3fold_page_locked() to release locked z3fold pageMiaohe Lin
[ Upstream commit 28473d91ff7f686d58047ff55f2fa98ab59114a4 ] We should use release_z3fold_page_locked() to release z3fold page when it's locked, although it looks harmless to use release_z3fold_page() now. Link: https://lkml.kernel.org/r/20210619093151.1492174-7-linmiaohe@huawei.com Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/z3fold: fix potential memory leak in z3fold_destroy_pool()Miaohe Lin
[ Upstream commit dac0d1cfda56472378d330b1b76b9973557a7b1d ] There is a memory leak in z3fold_destroy_pool() as it forgets to free_percpu pool->unbuddied. Call free_percpu for pool->unbuddied to fix this issue. Link: https://lkml.kernel.org/r/20210619093151.1492174-6-linmiaohe@huawei.com Fixes: d30561c56f41 ("z3fold: use per-cpu unbuddied lists") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14hugetlb: remove prep_compound_huge_page cleanupMike Kravetz
[ Upstream commit 48b8d744ea841b8adf8d07bfe7a2d55f22e4d179 ] Patch series "Fix prep_compound_gigantic_page ref count adjustment". These patches address the possible race between prep_compound_gigantic_page and __page_cache_add_speculative as described by Jann Horn in [1]. The first patch simply removes the unnecessary/obsolete helper routine prep_compound_huge_page to make the actual fix a little simpler. The second patch is the actual fix and has a detailed explanation in the commit message. This potential issue has existed for almost 10 years and I am unaware of anyone actually hitting the race. I did not cc stable, but would be happy to squash the patches and send to stable if anyone thinks that is a good idea. [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ This patch (of 2): I could not think of a reliable way to recreate the issue for testing. Rather, I 'simulated errors' to exercise all the error paths. The routine prep_compound_huge_page is a simple wrapper to call either prep_compound_gigantic_page or prep_compound_page. However, it is only called from gather_bootmem_prealloc which only processes gigantic pages. Eliminate the routine and call prep_compound_gigantic_page directly. Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Youquan Song <youquan.song@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/huge_memory.c: don't discard hugepage if other processes are mapping itMiaohe Lin
[ Upstream commit babbbdd08af98a59089334eb3effbed5a7a0cf7f ] If other processes are mapping any other subpages of the hugepage, i.e. in pte-mapped thp case, page_mapcount() will return 1 incorrectly. Then we would discard the page while other processes are still mapping it. Fix it by using total_mapcount() which can tell whether other processes are still mapping it. Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/huge_memory.c: add missing read-only THP checking in ↵Miaohe Lin
transparent_hugepage_enabled() [ Upstream commit e6be37b2e7bddfe0c76585ee7c7eee5acc8efeab ] Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS"), read-only THP file mapping is supported. But it forgot to add checking for it in transparent_hugepage_enabled(). To fix it, we add checking for read-only THP file mapping and also introduce helper transhuge_vma_enabled() to check whether thp is enabled for specified vma to reduce duplicated code. We rename transparent_hugepage_enabled to transparent_hugepage_active to make the code easier to follow as suggested by David Hildenbrand. [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable] Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/page_alloc: fix counting of managed_pagesLiu Shixin
[ Upstream commit f7ec104458e00d27a190348ac3a513f3df3699a4 ] commit f63661566fad ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty") clears out zone->lowmem_reserve[] if zone is empty. But when zone is not empty and sysctl_lowmem_reserve_ratio[i] is set to zero, zone_managed_pages(zone) is not counted in the managed_pages either. This is inconsistent with the description of lowmem_reserve, so fix it. Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com Fixes: f63661566fad ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reported-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm: memcg/slab: properly set up gfp flags for objcg pointer arrayWaiman Long
[ Upstream commit 41eb5df1cbc9b302fc263ad7c9f38cfc38b4df61 ] Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4. Since the merging of the new slab memory controller in v5.9, the page structure stores a pointer to objcg pointer array for slab pages. When the slab has no used objects, it can be freed in free_slab() which will call kfree() to free the objcg pointer array in memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer array is the last used object in its slab, that slab may then be freed which may caused kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. In fact, we have a reproducer that can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following kfree() loop recursively called 74 times: [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560 [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228 [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0 [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560 : While investigating this issue, I also found an issue on the allocation side. If the objcg pointer array happen to come from the same slab or a circular dependency linkage is formed with multiple slabs, those affected slabs can never be freed again. This patch series addresses these two issues by introducing a new set of kmalloc-cg-<n> caches split from kmalloc-<n> caches. The new set will only contain non-reclaimable and non-dma objects that are accounted in memory cgroups whereas the old set are now for unaccounted objects only. By making this split, all the objcg pointer arrays will come from the kmalloc-<n> caches, but those caches will never hold any objcg pointer array. As a result, deeply nested kfree() call and the unfreeable slab problems are now gone. This patch (of 4): Since the merging of the new slab memory controller in v5.9, the page structure may store a pointer to obj_cgroup pointer array for slab pages. Currently, only the __GFP_ACCOUNT bit is masked off. However, the array is not readily reclaimable and doesn't need to come from the DMA buffer. So those GFP bits should be masked off as well. Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure that it is consistently applied no matter where it is called. Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/shmem: fix shmem_swapin() race with swapoffMiaohe Lin
[ Upstream commit 2efa33fc7f6ec94a3a538c1a264273c889be2b36 ] When I was investigating the swap code, I found the below possible race window: CPU 1 CPU 2 ----- ----- shmem_swapin swap_cluster_readahead if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) { swapoff .. si->swap_file = NULL; .. struct inode *inode = si->swap_file->f_mapping->host;[oops!] Close this race window by using get/put_swap_device() to guard against concurrent swapoff. Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14swap: fix do_swap_page() race with swapoffMiaohe Lin
[ Upstream commit 2799e77529c2a25492a4395db93996e3dacd762d ] When I was investigating the swap code, I found the below possible race window: CPU 1 CPU 2 ----- ----- do_swap_page if (data_race(si->flags & SWP_SYNCHRONOUS_IO) swap_readpage if (data_race(sis->flags & SWP_FS_OPS)) { swapoff .. p->swap_file = NULL; .. struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping;[oops!] Note that for the pages that are swapped in through swap cache, this isn't an issue. Because the page is locked, and the swap entry will be marked with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been unlocked. Fix this race by using get/put_swap_device() to guard against concurrent swapoff. Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm: mmap_lock: use local locks instead of disabling preemptionNicolas Saenz Julienne
[ Upstream commit 832b50725373e8c46781b7d4db104ec9cf564a6b ] mmap_lock will explicitly disable/enable preemption upon manipulating its local CPU variables. This is to be expected, but in this case, it doesn't play well with PREEMPT_RT. The preemption disabled code section also takes a spin-lock. Spin-locks in RT systems will try to schedule, which is exactly what we're trying to avoid. To mitigate this, convert the explicit preemption handling to local_locks. Which are RT aware, and will disable migration instead of preemption when PREEMPT_RT=y. The faulty call trace looks like the following: __mmap_lock_do_trace_*() preempt_disable() get_mm_memcg_path() cgroup_path() kernfs_path_from_node() spin_lock_irqsave() /* Scheduling while atomic! */ Link: https://lkml.kernel.org/r/20210604163506.2103900-1-nsaenzju@redhat.com Fixes: 2b5067a8143e3 ("mm: mmap_lock: add tracepoints around lock acquisition ") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/debug_vm_pgtable: ensure THP availability via has_transparent_hugepage()Anshuman Khandual
[ Upstream commit 65ac1a60a57e2c55f2ac37f27095f6b012295e81 ] On certain platforms, THP support could not just be validated via the build option CONFIG_TRANSPARENT_HUGEPAGE. Instead has_transparent_hugepage() also needs to be called upon to verify THP runtime support. Otherwise the debug test will just run into unusable THP helpers like in the case of a 4K hash config on powerpc platform [1]. This just moves all pfn_pmd() and pfn_pud() after THP runtime validation with has_transparent_hugepage() which prevents the mentioned problem. [1] https://bugzilla.kernel.org/show_bug.cgi?id=213069 Link: https://lkml.kernel.org/r/1621397588-19211-1-git-send-email-anshuman.khandual@arm.com Fixes: 787d563b8642 ("mm/debug_vm_pgtable: fix kernel crash by checking for THP support") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mm/gup: fix try_grab_compound_head() race with split_huge_page()Jann Horn
commit c24d37322548a6ec3caec67100d28b9c1f89f60a upstream. try_grab_compound_head() is used to grab a reference to a page from get_user_pages_fast(), which is only protected against concurrent freeing of page tables (via local_irq_save()), but not against concurrent TLB flushes, freeing of data pages, or splitting of compound pages. Because no reference is held to the page when try_grab_compound_head() is called, the page may have been freed and reallocated by the time its refcount has been elevated; therefore, once we're holding a stable reference to the page, the caller re-checks whether the PTE still points to the same page (with the same access rights). The problem is that try_grab_compound_head() has to grab a reference on the head page; but between the time we look up what the head page is and the time we actually grab a reference on the head page, the compound page may have been split up (either explicitly through split_huge_page() or by freeing the compound page to the buddy allocator and then allocating its individual order-0 pages). If that happens, get_user_pages_fast() may end up returning the right page but lifting the refcount on a now-unrelated page, leading to use-after-free of pages. To fix it: Re-check whether the pages still belong together after lifting the refcount on the head page. Move anything else that checks compound_head(page) below the refcount increment. This can't actually happen on bare-metal x86 (because there, disabling IRQs locks out remote TLB flushes), but it can happen on virtualized x86 (e.g. under KVM) and probably also on arm64. The race window is pretty narrow, and constantly allocating and shattering hugepages isn't exactly fast; for now I've only managed to reproduce this in an x86 KVM guest with an artificially widened timing window (by adding a loop that repeatedly calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so that PV TLB flushes are used instead of IPIs). As requested on the list, also replace the existing VM_BUG_ON_PAGE() with a warning and bailout. Since the existing code only performed the BUG_ON check on DEBUG_VM kernels, ensure that the new code also only performs the check under that configuration - I don't want to mix two logically separate changes together too much. The macro VM_WARN_ON_ONCE_PAGE() doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef block. An alternative would be to change the VM_WARN_ON_ONCE_PAGE() definition for !DEBUG_VM such that it always returns false, but since that would differ from the behavior of the normal WARN macros, it might be too confusing for readers. Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14mm/page_alloc: fix memory map initialization for descending nodesMike Rapoport
commit 122e093c1734361dedb64f65c99b93e28e4624f4 upstream. On systems with memory nodes sorted in descending order, for instance Dell Precision WorkStation T5500, the struct pages for higher PFNs and respectively lower nodes, could be overwritten by the initialization of struct pages corresponding to the holes in the memory sections. For example for the below memory layout [ 0.245624] Early memory node ranges [ 0.248496] node 1: [mem 0x0000000000001000-0x0000000000090fff] [ 0.251376] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff] [ 0.254256] node 1: [mem 0x0000000100000000-0x0000001423ffffff] [ 0.257144] node 0: [mem 0x0000001424000000-0x0000002023ffffff] the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in the middle of a section and will be considered as a hole during the initialization of the last section in node 1. The wrong initialization of the memory map causes panic on boot when CONFIG_DEBUG_VM is enabled. Reorder loop order of the memory map initialization so that the outer loop will always iterate over populated memory regions in the ascending order and the inner loop will select the zone corresponding to the PFN range. This way initialization of the struct pages for the memory holes will be always done for the ranges that are actually not populated. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073 Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org Fixes: 0740a50b9baa ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Boris Petkov <bp@alien8.de> Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-07mm/page_alloc: correct return value of populated elements if bulk array is ↵Mel Gorman
populated commit ff4b2b4014cbffb3d32b22629252f4dc8616b0fe upstream. Dave Jones reported the following This made it into 5.13 final, and completely breaks NFSD for me (Serving tcp v3 mounts). Existing mounts on clients hang, as do new mounts from new clients. Rebooting the server back to rc7 everything recovers. The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements") returns the wrong value if the array is already populated which is interpreted as an allocation failure. Dave reported this fixes his problem and it also passed a test running dbench over NFS. Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Dave Jones <davej@codemonkey.org.uk> Tested-by: Dave Jones <davej@codemonkey.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-28aufs5: aufs-mmapBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2021-06-24mm/page_alloc: do bulk array bounds check after checking populated elementsMel Gorman
Dan Carpenter reported the following The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface to the bulk page allocator" from Apr 29, 2021, leads to the following static checker warning: mm/page_alloc.c:5338 __alloc_pages_bulk() warn: potentially one past the end of array 'page_array[nr_populated]' The problem can occur if an array is passed in that is fully populated. That potentially ends up allocating a single page and storing it past the end of the array. This patch returns 0 if the array is fully populated. Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Mel Gorman <mgorman@techsinguliarity.net> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing arrayRasmus Villemoes
In the event that somebody would call this with an already fully populated page_array, the last loop iteration would do an access beyond the end of page_array. It's of course extremely unlikely that would ever be done, but this triggers my internal static analyzer. Also, if it really is not supposed to be invoked this way (i.e., with no NULL entries in page_array), the nr_populated<nr_pages check could simply be removed instead. Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/hwpoison: do not lock page again when me_huge_page() successfully recoversNaoya Horiguchi
Currently me_huge_page() temporary unlocks page to perform some actions then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that page allocation code detects this page lock on buddy page and printed out "BUG: Bad page state" message. check_new_page_bad() does not consider a page with __PG_HWPOISON as bad page, so this flag works as kind of filter, but this filtering doesn't work in this case because the "bad page" is not the actual hwpoisoned page. So stop locking page again. Actions to be taken depend on the page type of the error, so page unlocking should be done in ->action() callbacks. So let's make it assumed and change all existing callbacks that way. Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>