summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2020-08-26mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possiblePeter Xu
commit 75802ca66354a39ab8e35822747cd08b3384a99a upstream. This is found by code observation only. Firstly, the worst case scenario should assume the whole range was covered by pmd sharing. The old algorithm might not work as expected for ranges like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the expected range should be (0, 2g). Since at it, remove the loop since it should not be required. With that, the new code should be faster too when the invalidating range is huge. Mike said: : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only : adjust to (0, 1g+2m) which is incorrect. : : We should cc stable. The original reason for adjusting the range was to : prevent data corruption (getting wrong page). Since the range is not : always adjusted correctly, the potential for corruption still exists. : : However, I am fairly confident that adjust_range_if_pmd_sharing_possible : is only gong to be called in two cases: : : 1) for a single page : 2) for range == entire vma : : In those cases, the current code should produce the correct results. : : To be safe, let's just cc stable. Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26mm, page_alloc: fix core hung in free_pcppages_bulk()Charan Teja Reddy
commit 88e8ac11d2ea3acc003cf01bb5a38c8aa76c3cfd upstream. The following race is observed with the repeated online, offline and a delay between two successive online of memory blocks of movable zone. P1 P2 Online the first memory block in the movable zone. The pcp struct values are initialized to default values,i.e., pcp->high = 0 & pcp->batch = 1. Allocate the pages from the movable zone. Try to Online the second memory block in the movable zone thus it entered the online_pages() but yet to call zone_pcp_update(). This process is entered into the exit path thus it tries to release the order-0 pages to pcp lists through free_unref_page_commit(). As pcp->high = 0, pcp->count = 1 proceed to call the function free_pcppages_bulk(). Update the pcp values thus the new pcp values are like, say, pcp->high = 378, pcp->batch = 63. Read the pcp's batch value using READ_ONCE() and pass the same to free_pcppages_bulk(), pcp values passed here are, batch = 63, count = 1. Since num of pages in the pcp lists are less than ->batch, then it will stuck in while(list_empty(list)) loop with interrupts disabled thus a core hung. Avoid this by ensuring free_pcppages_bulk() is called with proper count of pcp list pages. The mentioned race is some what easily reproducible without [1] because pcp's are not updated for the first memory block online and thus there is a enough race window for P2 between alloc+free and pcp struct values update through onlining of second memory block. With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself. This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet). [1]: https://patchwork.kernel.org/patch/11696389/ Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type") Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: <stable@vger.kernel.org> [2.6+] Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26mm: include CMA pages in lowmem_reserve at bootDoug Berger
commit e08d3fdfe2dafa0331843f70ce1ff6c1c4900bf4 upstream. The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function. The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file. The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting. The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot. This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged. In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout cma: Reserved 256 MiB at 0x0000000030000000 Zone ranges: DMA [mem 0x0000000000000000-0x000000002fffffff] Normal empty HighMem [mem 0x0000000030000000-0x000000003fffffff] would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily. Funnily enough $ cat /proc/sys/vm/lowmem_reserve_ratio would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve. This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values. Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jason Baron <jbaron@akamai.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()Hugh Dickins
[ Upstream commit f3f99d63a8156c7a4a6b20aac22b53c5579c7dc1 ] syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in __khugepaged_enter(): yes, when one thread is about to dump core, has set core_state, and is waiting for others, another might do something calling __khugepaged_enter(), which now crashes because I lumped the core_state test (known as "mmget_still_valid") into khugepaged_test_exit(). I still think it's best to lump them together, so just in this exceptional case, check mm->mm_users directly instead of khugepaged_test_exit(). Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Eric Dumazet <edumazet@google.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26khugepaged: khugepaged_test_exit() check mmget_still_valid()Hugh Dickins
[ Upstream commit bbe98f9cadff58cdd6a4acaeba0efa8565dabe65 ] Move collapse_huge_page()'s mmget_still_valid() check into khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP only, and earned its mmget_still_valid() check because it inserts a huge pmd entry in place of the page table's pmd entry; whereas collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp() merely clears the page table's pmd entry. But core dumping without mmap lock must have been as open to mistaking a racily cleared pmd entry for a page table at physical page 0, as exit_mmap() was. And we certainly have no interest in mapping as a THP once dumping core. Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21khugepaged: retract_page_tables() remember to test exitHugh Dickins
commit 18e77600f7a1ed69f8ce46c9e11cad0985712dfa upstream. Only once have I seen this scenario (and forgot even to notice what forced the eventual crash): a sequence of "BUG: Bad page map" alerts from vm_normal_page(), from zap_pte_range() servicing exit_mmap(); pmd:00000000, pte values corresponding to data in physical page 0. The pte mappings being zapped in this case were supposed to be from a huge page of ext4 text (but could as well have been shmem): my belief is that it was racing with collapse_file()'s retract_page_tables(), found *pmd pointing to a page table, locked it, but *pmd had become 0 by the time start_pte was decided. In most cases, that possibility is excluded by holding mmap lock; but exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged checks khugepaged_test_exit() after acquiring mmap lock: khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so, for example. But retract_page_tables() did not: fix that. The fix is for retract_page_tables() to check khugepaged_test_exit(), after acquiring mmap lock, before doing anything to the page table. Getting the mmap lock serializes with __mmput(), which briefly takes and drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on mm_users makes sure we don't touch the page table once exit_mmap() might reach it, since exit_mmap() will be proceeding without mmap lock, not expecting anyone to be racing with it. Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21mm/page_counter.c: fix protection usage propagationMichal Koutný
commit a6f23d14ec7d7d02220ad8bb2774be3322b9aeec upstream. When workload runs in cgroups that aren't directly below root cgroup and their parent specifies reclaim protection, it may end up ineffective. The reason is that propagate_protected_usage() is not called in all hierarchy up. All the protected usage is incorrectly accumulated in the workload's parent. This means that siblings_low_usage is overestimated and effective protection underestimated. Even though it is transitional phenomenon (uncharge path does correct propagation and fixes the wrong children_low_usage), it can undermine the intended protection unexpectedly. We have noticed this problem while seeing a swap out in a descendant of a protected memcg (intermediate node) while the parent was conveniently under its protection limit and the memory pressure was external to that hierarchy. Michal has pinpointed this down to the wrong siblings_low_usage which led to the unwanted reclaim. The fix is simply updating children_low_usage in respective ancestors also in the charging path. Fixes: 230671533d64 ("mm: memory.low hierarchical behavior") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> [4.18+] Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-19mm/mmap.c: Add cond_resched() for exit_mmap() CPU stallsPaul E. McKenney
[ Upstream commit 0a3b3c253a1eb2c7fe7f34086d46660c909abeb3 ] A large process running on a heavily loaded system can encounter the following RCU CPU stall warning: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190 (t=21013 jiffies g=1005461 q=132576) NMI backtrace for cpu 3 CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1 Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016 Call Trace: <IRQ> dump_stack+0x46/0x60 nmi_cpu_backtrace.cold.3+0x13/0x50 ? lapic_can_unplug_cpu.cold.27+0x34/0x34 nmi_trigger_cpumask_backtrace+0xba/0xca rcu_dump_cpu_stacks+0x99/0xc7 rcu_sched_clock_irq.cold.87+0x1aa/0x397 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x28/0x60 tick_sched_timer+0x37/0x70 __hrtimer_run_queues+0xfe/0x270 hrtimer_interrupt+0xf4/0x210 smp_apic_timer_interrupt+0x5e/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:kmem_cache_free+0x223/0x300 Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65 RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030 RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18 RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00 ? remove_vma+0x4f/0x60 remove_vma+0x4f/0x60 exit_mmap+0xd6/0x160 mmput+0x4a/0x110 do_exit+0x278/0xae0 ? syscall_trace_enter+0x1d3/0x2b0 ? handle_mm_fault+0xaa/0x1c0 do_group_exit+0x3a/0xa0 __x64_sys_exit_group+0x14/0x20 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run for a very long time given a large process. This commit therefore adds a cond_resched() to this loop, providing RCU any needed quiescent states. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-29mm: memcg/slab: fix memory leak at non-root kmem_cache destroyMuchun Song
commit d38a2b7a9c939e6d7329ab92b96559ccebf7b135 upstream. If the kmem_cache refcount is greater than one, we should not mark the root kmem_cache as dying. If we mark the root kmem_cache dying incorrectly, the non-root kmem_cache can never be destroyed. It resulted in memory leak when memcg was destroyed. We can use the following steps to reproduce. 1) Use kmem_cache_create() to create a new kmem_cache named A. 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B, so the refcount of B is just increased. 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just decrease the B's refcount but mark the B as dying. 4) Create a new memory cgroup and alloc memory from the kmem_cache B. It leads to create a non-root kmem_cache for allocating memory. 5) When destroy the memory cgroup created in the step 4), the non-root kmem_cache can never be destroyed. If we repeat steps 4) and 5), this will cause a lot of memory leak. So only when refcount reach zero, we mark the root kmem_cache as dying. Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200716165103.83462-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-29mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlockRoman Gushchin
[ Upstream commit 63b02ef7dc4ec239df45c018ac0adbd02ba30a0c ] Currently the memcg_params.dying flag and the corresponding workqueue used for the asynchronous deactivation of kmem_caches is synchronized using the slab_mutex. It makes impossible to check this flag from the irq context, which will be required in order to implement asynchronous release of kmem_caches. So let's switch over to the irq-save flavor of the spinlock-based synchronization. Link: http://lkml.kernel.org/r/20190611231813.3148843-8-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Waiman Long <longman@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-29mm/memcg: fix refcount error while moving and swappingHugh Dickins
commit 8d22a9351035ef2ff12ef163a1091b8b8cf1e49c upstream. It was hard to keep a test running, moving tasks between memcgs with move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s refcount is discovered to be 0 (supposedly impossible), so it is then forced to REFCOUNT_SATURATED, and after thousands of warnings in quick succession, the test is at last put out of misery by being OOM killed. This is because of the way moved_swap accounting was saved up until the task move gets completed in __mem_cgroup_clear_mc(), deferred from when mem_cgroup_move_swap_account() actually exchanged old and new ids. Concurrent activity can free up swap quicker than the task is scanned, bringing id refcount down 0 (which should only be possible when offlining). Just skip that optimization: do that part of the accounting immediately. Fixes: 615d66c37c75 ("mm: memcontrol: fix memcg id ref counter on swap charge move") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007071431050.4726@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09mm/slub: fix stack overruns with SLUB_STATSQian Cai
[ Upstream commit a68ee0573991e90af2f1785db309206408bad3e5 ] There is no need to copy SLUB_STATS items from root memcg cache to new memcg cache copies. Doing so could result in stack overruns because the store function only accepts 0 to clear the stat and returns an error for everything else while the show method would print out the whole stat. Then, the mismatch of the lengths returns from show and store methods happens in memcg_propagate_slab_attrs(): else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf)) buf = mbuf; max_attr_size is only 2 from slab_attr_store(), then, it uses mbuf[64] in show_stat() later where a bounch of sprintf() would overrun the stack variable. Fix it by always allocating a page of buffer to be used in show_stat() if SLUB_STATS=y which should only be used for debug purpose. # echo 1 > /sys/kernel/slab/fs_cache/shrink BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0 Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: number+0x421/0x6e0 vsnprintf+0x451/0x8e0 sprintf+0x9e/0xd0 show_stat+0x124/0x1d0 alloc_slowpath_show+0x13/0x20 __kmem_cache_create+0x47a/0x6b0 addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame: process_one_work+0x0/0xb90 this frame has 1 object: [32, 72) 'lockdep_map' Memory state around the buggy address: ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 ^ ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00 ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: __kmem_cache_create+0x6ac/0x6b0 Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Glauber Costa <glauber@scylladb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09mm/slub.c: fix corrupted freechain in deactivate_slab()Dongli Zhang
[ Upstream commit 52f23478081ae0dcdb95d1650ea1e7d52d586829 ] The slub_debug is able to fix the corrupted slab freelist/page. However, alloc_debug_processing() only checks the validity of current and next freepointer during allocation path. As a result, once some objects have their freepointers corrupted, deactivate_slab() may lead to page fault. Below is from a test kernel module when 'slub_debug=PUF,kmalloc-128 slub_nomerge'. The test kernel corrupts the freepointer of one free object on purpose. Unfortunately, deactivate_slab() does not detect it when iterating the freechain. BUG: unable to handle page fault for address: 00000000123456f8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... ... RIP: 0010:deactivate_slab.isra.92+0xed/0x490 ... ... Call Trace: ___slab_alloc+0x536/0x570 __slab_alloc+0x17/0x30 __kmalloc+0x1d9/0x200 ext4_htree_store_dirent+0x30/0xf0 htree_dirblock_to_tree+0xcb/0x1c0 ext4_htree_fill_tree+0x1bc/0x2d0 ext4_readdir+0x54f/0x920 iterate_dir+0x88/0x190 __x64_sys_getdents+0xa6/0x140 do_syscall_64+0x49/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Therefore, this patch adds extra consistency check in deactivate_slab(). Once an object's freepointer is corrupted, all following objects starting at this object are isolated. [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n] Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joe Jin <joe.jin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09mm: fix swap cache node allocation maskHugh Dickins
[ Upstream commit 243bce09c91b0145aeaedd5afba799d81841c030 ] Chris Murphy reports that a slightly overcommitted load, testing swap and zram along with i915, splats and keeps on splatting, when it had better fail less noisily: gnome-shell: page allocation failure: order:0, mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1 Call Trace: dump_stack+0x64/0x88 warn_alloc.cold+0x75/0xd9 __alloc_pages_slowpath.constprop.0+0xcfa/0xd30 __alloc_pages_nodemask+0x2df/0x320 alloc_slab_page+0x195/0x310 allocate_slab+0x3c5/0x440 ___slab_alloc+0x40c/0x5f0 __slab_alloc+0x1c/0x30 kmem_cache_alloc+0x20e/0x220 xas_nomem+0x28/0x70 add_to_swap_cache+0x321/0x400 __read_swap_cache_async+0x105/0x240 swap_cluster_readahead+0x22c/0x2e0 shmem_swapin+0x8e/0xc0 shmem_swapin_page+0x196/0x740 shmem_getpage_gfp+0x3a2/0xa60 shmem_read_mapping_page_gfp+0x32/0x60 shmem_get_pages+0x155/0x5e0 [i915] __i915_gem_object_get_pages+0x68/0xa0 [i915] i915_vma_pin+0x3fe/0x6c0 [i915] eb_add_vma+0x10b/0x2c0 [i915] i915_gem_do_execbuffer+0x704/0x3430 [i915] i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915] drm_ioctl_kernel+0x86/0xd0 [drm] drm_ioctl+0x206/0x390 [drm] ksys_ioctl+0x82/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5b/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported on 5.7, but it goes back really to 3.1: when shmem_read_mapping_page_gfp() was implemented for use by i915, and allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but missed swapin's "& GFP_KERNEL" mask for page tree node allocation in __read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits from what page cache uses, but GFP_RECLAIM_MASK is now what's needed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085 Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Chris Murphy <lists@colorremedies.com> Analyzed-by: Vlastimil Babka <vbabka@suse.cz> Analyzed-by: Matthew Wilcox <willy@infradead.org> Tested-by: Chris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> [3.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-30mm/slab: use memzero_explicit() in kzfree()Waiman Long
commit 8982ae527fbef170ef298650c15d55a9ccd33973 upstream. The kzfree() function is normally used to clear some sensitive information, like encryption keys, in the buffer before freeing it back to the pool. Memset() is currently used for buffer clearing. However unlikely, there is still a non-zero probability that the compiler may choose to optimize away the memory clearing especially if LTO is being used in the future. To make sure that this optimization will never happen, memzero_explicit(), which is introduced in v3.18, is now used in kzfree() to future-proof it. Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Howells <dhowells@redhat.com> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22mm: initialize deferred pages with interrupts enabledPavel Tatashin
commit 3d060856adfc59afb9d029c233141334cfaba418 upstream. Initializing struct pages is a long task and keeping interrupts disabled for the duration of this operation introduces a number of problems. 1. jiffies are not updated for long period of time, and thus incorrect time is reported. See proposed solution and discussion here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com 2. It prevents farther improving deferred page initialization by allowing intra-node multi-threading. We are keeping interrupts disabled to solve a rather theoretical problem that was never observed in real world (See 3a2d7fa8a3d5). Let's keep interrupts enabled. In case we ever encounter a scenario where an interrupt thread wants to allocate large amount of memory this early in boot we can deal with that by growing zone (see deferred_grow_zone()) by the needed amount before starting deferred_init_memmap() threads. Before: [ 1.232459] node 0 initialised, 12058412 pages in 1ms After: [ 1.632580] node 0 initialised, 12051227 pages in 436ms Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages") Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Yiqian Wei <yiwei@redhat.com> Cc: <stable@vger.kernel.org> [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()Andrea Arcangeli
commit c444eb564fb16645c172d550359cb3d75fe8a040 upstream. Write protect anon page faults require an accurate mapcount to decide if to break the COW or not. This is implemented in the THP path with reuse_swap_page() -> page_trans_huge_map_swapcount()/page_trans_huge_mapcount(). If the COW triggers while the other processes sharing the page are under a huge pmd split, to do an accurate reading, we must ensure the mapcount isn't computed while it's being transferred from the head page to the tail pages. reuse_swap_cache() already runs serialized by the page lock, so it's enough to add the page lock around __split_huge_pmd_locked too, in order to add the missing serialization. Note: the commit in "Fixes" is just to facilitate the backporting, because the code before such commit didn't try to do an accurate THP mapcount calculation and it instead used the page_count() to decide if to COW or not. Both the page_count and the pin_count are THP-wide refcounts, so they're inaccurate if used in reuse_swap_page(). Reverting such commit (besides the unrelated fix to the local anon_vma assignment) would have also opened the window for memory corruption side effects to certain workloads as documented in such commit header. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Jann Horn <jannh@google.com> Reported-by: Jann Horn <jannh@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22mm/slub: fix a memory leak in sysfs_slab_add()Wang Hai
commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2 upstream. syzkaller reports for memory leak when kobject_init_and_add() returns an error in the function sysfs_slab_add() [1] When this happened, the function kobject_put() is not called for the corresponding kobject, which potentially leads to memory leak. This patch fixes the issue by calling kobject_put() even if kobject_init_and_add() fails. [1] BUG: memory leak unreferenced object 0xffff8880a6d4be88 (size 8): comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s) hex dump (first 8 bytes): 70 69 64 5f 33 00 ff ff pid_3... backtrace: kstrdup+0x35/0x70 mm/util.c:60 kstrdup_const+0x3d/0x50 mm/util.c:82 kvasprintf_const+0x112/0x170 lib/kasprintf.c:48 kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xd8/0x170 lib/kobject.c:473 sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811 __kmem_cache_create+0x50a/0x570 mm/slub.c:4384 create_cache+0x113/0x1e0 mm/slab_common.c:407 kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505 kmem_cache_create+0xd/0x10 mm/slab_common.c:564 create_pid_cachep kernel/pid_namespace.c:54 [inline] create_pid_namespace kernel/pid_namespace.c:96 [inline] copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148 create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95 unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229 ksys_unshare+0x3d2/0x770 kernel/fork.c:2969 __do_sys_unshare kernel/fork.c:3037 [inline] __se_sys_unshare kernel/fork.c:3035 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035 do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295 Fixes: 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200602115033.1054-1-wanghai38@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22mm: add kvfree_sensitive() for freeing sensitive data objectsWaiman Long
[ Upstream commit d4eaa2837851db2bfed572898bfc17f9a9f9151e ] For kvmalloc'ed data object that contains sensitive information like cryptographic keys, we need to make sure that the buffer is always cleared before freeing it. Using memset() alone for buffer clearing may not provide certainty as the compiler may compile it away. To be sure, the special memzero_explicit() has to be used. This patch introduces a new kvfree_sensitive() for freeing those sensitive data objects allocated by kvmalloc(). The relevant places where kvfree_sensitive() can be used are modified to use it. Fixes: 4f0882491a14 ("KEYS: Avoid false positive ENOMEM error on key read") Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Uladzislau Rezki <urezki@gmail.com> Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-07mm: Fix mremap not considering huge pmd devmapFan Yang
commit 5bfea2d9b17f1034a68147a8b03b9789af5700f9 upstream. The original code in mm/mremap.c checks huge pmd by: if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) { However, a DAX mapped nvdimm is mapped as huge page (by default) but it is not transparent huge page (_PAGE_PSE | PAGE_DEVMAP). This commit changes the condition to include the case. This addresses CVE-2020-10757. Fixes: 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd") Cc: <stable@vger.kernel.org> Reported-by: Fan Yang <Fan_Yang@sjtu.edu.cn> Signed-off-by: Fan Yang <Fan_Yang@sjtu.edu.cn> Tested-by: Fan Yang <Fan_Yang@sjtu.edu.cn> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-03mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()Liviu Dudau
commit 6ade20327dbb808882888ed8ccded71e93067cf9 upstream. find_vmap_area() can return a NULL pointer and we're going to dereference it without checking it first. Use the existing find_vm_area() function which does exactly what we want and checks for the NULL pointer. Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk Fixes: f3c01d2f3ade ("mm: vmalloc: avoid racy handling of debugobjects in vunmap") Signed-off-by: Liviu Dudau <liviu@dudau.co.uk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20shmem: fix possible deadlocks on shmlock_user_lockHugh Dickins
[ Upstream commit ea0dfeb4209b4eab954d6e00ed136bc6b48b380d ] Recent commit 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole") has allowed syzkaller to probe deeper, uncovering a long-standing lockdep issue between the irq-unsafe shmlock_user_lock, the irq-safe xa_lock on mapping->i_pages, and shmem inode's info->lock which nests inside xa_lock (or tree_lock) since 4.8's shmem_uncharge(). user_shm_lock(), servicing SysV shmctl(SHM_LOCK), wants shmlock_user_lock while its caller shmem_lock() holds info->lock with interrupts disabled; but hugetlbfs_file_setup() calls user_shm_lock() with interrupts enabled, and might be interrupted by a writeback endio wanting xa_lock on i_pages. This may not risk an actual deadlock, since shmem inodes do not take part in writeback accounting, but there are several easy ways to avoid it. Requiring interrupts disabled for shmlock_user_lock would be easy, but it's a high-level global lock for which that seems inappropriate. Instead, recall that the use of info->lock to guard info->flags in shmem_lock() dates from pre-3.1 days, when races with SHMEM_PAGEIN and SHMEM_TRUNCATE could occur: nowadays it serves no purpose, the only flag added or removed is VM_LOCKED itself, and calls to shmem_lock() an inode are already serialized by the caller. Take info->lock out of the chain and the possibility of deadlock or lockdep warning goes away. Fixes: 4595ef88d136 ("shmem: make shmem_inode_info::lock irq-safe") Reported-by: syzbot+c8a8197c8852f566b9d9@syzkaller.appspotmail.com Reported-by: syzbot+40b71e145e73f78f81ad@syzkaller.appspotmail.com Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2004161707410.16322@eggly.anvils Link: https://lore.kernel.org/lkml/000000000000e5838c05a3152f53@google.com/ Link: https://lore.kernel.org/lkml/0000000000003712b305a331d3b1@google.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-14mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()David Hildenbrand
commit e84fe99b68ce353c37ceeecc95dce9696c976556 upstream. Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02mm: shmem: disable interrupt when acquiring info->lock in userfaultfd_copy pathYang Shi
commit 94b7cc01da5a3cc4f3da5e0ff492ef008bb555d6 upstream. Syzbot reported the below lockdep splat: WARNING: possible irq lock inversion dependency detected 5.6.0-rc7-syzkaller #0 Not tainted -------------------------------------------------------- syz-executor.0/10317 just changed the state of lock: ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline] ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: shmem_mfill_atomic_pte+0x1012/0x21c0 mm/shmem.c:2407 but this lock was taken by another, SOFTIRQ-safe lock in the past: (&(&xa->xa_lock)->rlock#5){..-.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&info->lock)->rlock); local_irq_disable(); lock(&(&xa->xa_lock)->rlock#5); lock(&(&info->lock)->rlock); <Interrupt> lock(&(&xa->xa_lock)->rlock#5); *** DEADLOCK *** The full report is quite lengthy, please see: https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2004152007370.13597@eggly.anvils/T/#m813b412c5f78e25ca8c6c7734886ed4de43f241d It is because CPU 0 held info->lock with IRQ enabled in userfaultfd_copy path, then CPU 1 is splitting a THP which held xa_lock and info->lock in IRQ disabled context at the same time. If softirq comes in to acquire xa_lock, the deadlock would be triggered. The fix is to acquire/release info->lock with *_irq version instead of plain spin_{lock,unlock} to make it softirq safe. Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Reported-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/r/1587061357-122619-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29mm/ksm: fix NULL pointer dereference when KSM zero page is enabledMuchun Song
commit 56df70a63ed5d989c1d36deee94cae14342be6e9 upstream. find_mergeable_vma() can return NULL. In this case, it leads to a crash when we access vm_mm(its offset is 0x40) later in write_protect_page. And this case did happen on our server. The following call trace is captured in kernel 4.19 with the following patch applied and KSM zero page enabled on our server. commit e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring") So add a vma check to fix it. BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 Oops: 0000 [#1] SMP NOPTI CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9 RIP: try_to_merge_one_page+0xc7/0x760 Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4 60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8 <49> 8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48 RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246 RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000 RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000 RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577 R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000 R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40 FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ksm_scan_thread+0x115e/0x1960 kthread+0xf5/0x130 ret_from_fork+0x1f/0x30 [songmuchun@bytedance.com: if the vma is out of date, just exit] Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com [akpm@linux-foundation.org: add the conventional braces, replace /** with /*] Fixes: e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring") Co-developed-by: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Markus Elfring <Markus.Elfring@web.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com Link: http://lkml.kernel.org/r/20200414132905.83819-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29mm/hugetlb: fix a addressing exception caused by huge_pte_offsetLongpeng
commit 3c1d7e6ccb644d517a12f73a7ff200870926f865 upstream. Our machine encountered a panic(addressing exception) after run for a long time and the calltrace is: RIP: hugetlb_fault+0x307/0xbe0 RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286 RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48 RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48 RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080 R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8 R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074 FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: follow_hugetlb_page+0x175/0x540 __get_user_pages+0x2a0/0x7e0 __get_user_pages_unlocked+0x15d/0x210 __gfn_to_pfn_memslot+0x3c5/0x460 [kvm] try_async_pf+0x6e/0x2a0 [kvm] tdp_page_fault+0x151/0x2d0 [kvm] ... kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm] kvm_vcpu_ioctl+0x309/0x6d0 [kvm] do_vfs_ioctl+0x3f0/0x540 SyS_ioctl+0xa1/0xc0 system_call_fastpath+0x22/0x27 For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it may return a wrong 'pmdp' if there is a race. Please look at the following code snippet: ... pud = pud_offset(p4d, addr); if (sz != PUD_SIZE && pud_none(*pud)) return NULL; /* hugepage or swap? */ if (pud_huge(*pud) || !pud_present(*pud)) return (pte_t *)pud; pmd = pmd_offset(pud, addr); if (sz != PMD_SIZE && pmd_none(*pmd)) return NULL; /* hugepage or swap? */ if (pmd_huge(*pmd) || !pmd_present(*pmd)) return (pte_t *)pmd; ... The following sequence would trigger this bug: - CPU0: sz = PUD_SIZE and *pud = 0 , continue - CPU0: "pud_huge(*pud)" is false - CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT) - CPU0: "!pud_present(*pud)" is false, continue - CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp However, we want CPU0 to return NULL or pudp in this case. We must make sure there is exactly one dereference of pud and pmd. Signed-off-by: Longpeng <longpeng2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200413010342.771-1-longpeng2@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29vmalloc: fix remap_vmalloc_range() bounds checksJann Horn
commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream. remap_vmalloc_range() has had various issues with the bounds checks it promises to perform ("This function checks that addr is a valid vmalloc'ed area, and that it is big enough to cover the vma") over time, e.g.: - not detecting pgoff<<PAGE_SHIFT overflow - not detecting (pgoff<<PAGE_SHIFT)+usize overflow - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same vmalloc allocation - comparing a potentially wildly out-of-bounds pointer with the end of the vmalloc region In particular, since commit fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer dereferences by calling mmap() on a BPF map with a size that is bigger than the distance from the start of the BPF map to the end of the address space. This could theoretically be used as a kernel ASLR bypass, by using whether mmap() with a given offset oopses or returns an error code to perform a binary search over the possible address range. To allow remap_vmalloc_range_partial() to verify that addr and addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, pass the offset to remap_vmalloc_range_partial() instead of adding it to the pointer in remap_vmalloc_range(). In remap_vmalloc_range_partial(), fix the check against get_vm_area_size() by using size comparisons instead of pointer comparisons, and add checks for pgoff. Fixes: 833423143c3a ("[PATCH] mm: introduce remap_vmalloc_range()") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21mm/vmalloc.c: move 'area->pages' after if statementAustin Kim
commit 7ea362427c170061b8822dd41bafaa72b3bcb9ad upstream. If !area->pages statement is true where memory allocation fails, area is freed. In this case 'area->pages = pages' should not executed. So move 'area->pages = pages' after if statement. [akpm@linux-foundation.org: give area->pages the same treatment] Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15 Signed-off-by: Austin Kim <austindh.kim@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17mm: Use fixed constant in page_frag_alloc instead of size + 1Alexander Duyck
commit 8644772637deb121f7ac2df690cbf83fa63d3b70 upstream. This patch replaces the size + 1 value introduced with the recent fix for 1 byte allocs with a constant value. The idea here is to reduce code overhead as the previous logic would have to read size into a register, then increment it, and write it back to whatever field was being used. By using a constant we can avoid those memory reads and arithmetic operations in favor of just encoding the maximum value into the operation itself. Fixes: 2c2ade81741c ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs") Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17slub: improve bit diffusion for freelist ptr obfuscationKees Cook
commit 1ad53d9fa3f6168ebcf48a50e08b170432da2257 upstream. Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak in that the ptr and ptr address were usually so close that the first XOR would result in an almost entirely 0-byte value[1], leaving most of the "secret" number ultimately being stored after the third XOR. A single blind memory content exposure of the freelist was generally sufficient to learn the secret. Add a swab() call to mix bits a little more. This is a cheap way (1 cycle) to make attacks need more than a single exposure to learn the secret (or to know _where_ the exposure is in memory). kmalloc-32 freelist walk, before: ptr ptr_addr stored value secret ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d) ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d) ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d) ... after: ptr ptr_addr stored value secret ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d) ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d) ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d) ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d) ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d) [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation") Reported-by: Silvio Cesare <silvio.cesare@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [kees: Backport to v4.19 which doesn't call kasan_reset_untag()] Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-13mm: mempolicy: require at least one nodeid for MPOL_PREFERREDRandy Dunlap
commit aa9f7d5172fac9bf1f09e678c35e287a40a7b7dd upstream. Using an empty (malformed) nodelist that is not caught during mount option parsing leads to a stack-out-of-bounds access. The option string that was used was: "mpol=prefer:,". However, MPOL_PREFERRED requires a single node number, which is not being provided here. Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's nodeid. Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display") Reported-by: Entropy Moe <3ntr0py1337@gmail.com> Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25x86/mm: split vmalloc_sync_all()Joerg Roedel
commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream. Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25mm, slub: prevent kmalloc_node crashes and memory leaksVlastimil Babka
commit 0715e6c516f106ed553828a671d30ad9a3431536 upstream. Sachin reports [1] a crash in SLUB __slab_alloc(): BUG: Kernel NULL pointer dereference on read at 0x000073b0 Faulting instruction address: 0xc0000000003d55f4 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1 NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000 REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000 CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500 GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620 GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000 GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000 GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002 GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122 GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8 GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180 NIP ___slab_alloc+0x1f4/0x760 LR __slab_alloc+0x34/0x60 Call Trace: ___slab_alloc+0x334/0x760 (unreliable) __slab_alloc+0x34/0x60 __kmalloc_node+0x110/0x490 kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x108/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2ec/0x4d0 cgroup_mkdir+0x228/0x5f0 kernfs_iop_mkdir+0x90/0xf0 vfs_mkdir+0x110/0x230 do_mkdirat+0xb0/0x1a0 system_call+0x5c/0x68 This is a PowerPC platform with following NUMA topology: available: 2 nodes (0-1) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 1 size: 35247 MB node 1 free: 30907 MB node distances: node 0 1 0: 10 40 1: 40 10 possible numa nodes: 0-31 This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node" [2] which effectively calls kmalloc_node for each possible node. SLUB however only allocates kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node for other nodes since commit a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node"). This is however not true in this configuration where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up accessing non-allocated kmem_cache_node. A related issue was reported by Bharata (originally by Ramachandran) [3] where a similar PowerPC configuration, but with mainline kernel without patch [2] ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This seems to have the same underlying issue with node_to_mem_node() not behaving as expected, and might probably also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4]. This patch should fix both issues by not relying on node_to_mem_node() anymore and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is attempted for a node that's not online, or has no usable memory. The "usable memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly the condition that SLUB uses to allocate kmem_cache_node structures. The check in get_partial() is removed completely, as the checks in ___slab_alloc() are now sufficient to prevent get_partial() being reached with an invalid node. [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/ [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/ [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/ Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christopher Lameter <cl@linux.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25mm: slub: be more careful about the double cmpxchg of freelistLinus Torvalds
commit 5076190daded2197f62fe92cf69674488be44175 upstream. This is just a cleanup addition to Jann's fix to properly update the transaction ID for the slub slowpath in commit fd4d9c7d0c71 ("mm: slub: add missing TID bump.."). The transaction ID is what protects us against any concurrent accesses, but we should really also make sure to make the 'freelist' comparison itself always use the same freelist value that we then used as the new next free pointer. Jann points out that if we do all of this carefully, we could skip the transaction ID update for all the paths that only remove entries from the lists, and only update the TID when adding entries (to avoid the ABA issue with cmpxchg and list handling re-adding a previously seen value). But this patch just does the "make sure to cmpxchg the same value we used" rather than then try to be clever. Acked-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_eventChunguang Xu
commit 7d36665a5886c27ca4c4d0afd3ecc50b400f3587 upstream. An eventfd monitors multiple memory thresholds of the cgroup, closes them, the kernel deletes all events related to this eventfd. Before all events are deleted, another eventfd monitors the memory threshold of this cgroup, leading to a crash: BUG: kernel NULL pointer dereference, address: 0000000000000004 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0 Oops: 0002 [#1] SMP PTI CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3 Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014 Workqueue: events memcg_event_remove RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190 RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001 RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001 RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010 R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880 R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000 FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0 Call Trace: memcg_event_remove+0x32/0x90 process_one_work+0x172/0x380 worker_thread+0x49/0x3f0 kthread+0xf8/0x130 ret_from_fork+0x35/0x40 CR2: 0000000000000004 We can reproduce this problem in the following ways: 1. We create a new cgroup subdirectory and a new eventfd, and then we monitor multiple memory thresholds of the cgroup through this eventfd. 2. closing this eventfd, and __mem_cgroup_usage_unregister_event () will be called multiple times to delete all events related to this eventfd. The first time __mem_cgroup_usage_unregister_event() is called, the kernel will clear all items related to this eventfd in thresholds-> primary. Since there is currently only one eventfd, thresholds-> primary becomes empty, so the kernel will set thresholds-> primary and hresholds-> spare to NULL. If at this time, the user creates a new eventfd and monitor the memory threshold of this cgroup, kernel will re-initialize thresholds-> primary. Then when __mem_cgroup_usage_unregister_event () is called for the second time, because thresholds-> primary is not empty, the system will access thresholds-> spare, but thresholds-> spare is NULL, which will trigger a crash. In general, the longer it takes to delete all events related to this eventfd, the easier it is to trigger this problem. The solution is to check whether the thresholds associated with the eventfd has been cleared when deleting the event. If so, we do nothing. [akpm@linux-foundation.org: fix comment, per Kirill] Fixes: 907860ed381a ("cgroups: make cftype.unregister_event() void-returning") Signed-off-by: Chunguang Xu <brookxu@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20mm: slub: add missing TID bump in kmem_cache_alloc_bulk()Jann Horn
commit fd4d9c7d0c71866ec0c2825189ebd2ce35bd95b8 upstream. When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu freelist of length M, and N > M > 0, it will first remove the M elements from the percpu freelist, then call ___slab_alloc() to allocate the next element and repopulate the percpu freelist. ___slab_alloc() can re-enable IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc() to properly commit the freelist head change. Fix it by unconditionally bumping c->tid when entering the slowpath. Cc: stable@vger.kernel.org Fixes: ebe909e0fdb3 ("slub: improve bulk alloc strategy") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18net: memcg: late association of sock to memcgShakeel Butt
[ Upstream commit d752a4986532cb6305dfd5290a614cde8072769d ] If a TCP socket is allocated in IRQ context or cloned from unassociated (i.e. not associated to a memcg) in IRQ context then it will remain unassociated for its whole life. Almost half of the TCPs created on the system are created in IRQ context, so, memory used by such sockets will not be accounted by the memcg. This issue is more widespread in cgroup v1 where network memory accounting is opt-in but it can happen in cgroup v2 if the source socket for the cloning was created in root memcg. To fix the issue, just do the association of the sockets at the accept() time in the process context and then force charge the memory buffer already used and reserved by the socket. Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18cgroup: memcg: net: do not associate sock with unrelated cgroupShakeel Butt
[ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ] We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11mm: fix possible PMD dirty bit lost in set_pmd_migration_entry()Huang Ying
commit 8a8683ad9ba48b4b52a57f013513d1635c1ca5c4 upstream. In set_pmd_migration_entry(), pmdp_invalidate() is used to change PMD atomically. But the PMD is read before that with an ordinary memory reading. If the THP (transparent huge page) is written between the PMD reading and pmdp_invalidate(), the PMD dirty bit may be lost, and cause data corruption. The race window is quite small, but still possible in theory, so need to be fixed. The race is fixed via using the return value of pmdp_invalidate() to get the original content of PMD, which is a read/modify/write atomic operation. So no THP writing can occur in between. The race has been introduced when the THP migration support is added in the commit 616b8371539a ("mm: thp: enable thp migration in generic path"). But this fix depends on the commit d52605d7cb30 ("mm: do not lose dirty and accessed bits in pmdp_invalidate()"). So it's easy to be backported after v4.16. But the race window is really small, so it may be fine not to backport the fix at all. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking ↵Mel Gorman
page tables prot_numa commit 8b272b3cbbb50a6a8e62d8a15affd473a788e184 upstream. : A user reported a bug against a distribution kernel while running a : proprietary workload described as "memory intensive that is not swapping" : that is expected to apply to mainline kernels. The workload is : read/write/modifying ranges of memory and checking the contents. They : reported that within a few hours that a bad PMD would be reported followed : by a memory corruption where expected data was all zeros. A partial : report of the bad PMD looked like : : [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2) : [ 5195.341184] ------------[ cut here ]------------ : [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35! : .... : [ 5195.410033] Call Trace: : [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930 : [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30 : [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310 : [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90 : [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2 : [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40 : [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10 : : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD : was a false detection. The bug does not trigger if automatic NUMA : balancing or transparent huge pages is disabled. : : The bug is due a race in change_pmd_range between a pmd_trans_huge and : pmd_nond_or_clear_bad check without any locks held. During the : pmd_trans_huge check, a parallel protection update under lock can have : cleared the PMD and filled it with a prot_numa entry between the transhuge : check and the pmd_none_or_clear_bad check. : : While this could be fixed with heavy locking, it's only necessary to make : a copy of the PMD on the stack during change_pmd_range and avoid races. A : new helper is created for this as the check if quite subtle and the : existing similar helpful is not suitable. This passed 154 hours of : testing (usually triggers between 20 minutes and 24 hours) without : detecting bad PMDs or corruption. A basic test of an autonuma-intensive : workload showed no significant change in behaviour. Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922 the race window aforementioned is still open, and we have reports of Linpack test reporting bad residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD. host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7) .... host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096 The issue is observed on a v4.18-based distribution kernel, but the race window is expected to be applicable to mainline kernels, as well. [akpm@linux-foundation.org: fix comment typo, per Rafael] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05mm, thp: fix defrag setting if newline is not usedDavid Rientjes
commit f42f25526502d851d0e3ca1e46297da8aafce8a7 upstream. If thp defrag setting "defer" is used and a newline is *not* used when writing to the sysfs file, this is interpreted as the "defer+madvise" option. This is because we do prefix matching and if five characters are written without a newline, the current code ends up comparing to the first five bytes of the "defer+madvise" option and using that instead. Use the more appropriate sysfs_streq() that handles the trailing newline for us. Since this doubles as a nice cleanup, do it in enabled_store() as well. The current implementation relies on prefix matching: the number of bytes compared is either the number of bytes written or the length of the option being compared. With a newline, "defer\n" does not match "defer+"madvise"; without a newline, however, "defer" is considered to match "defer+madvise" (prefix matching is only comparing the first five bytes). End result is that writing "defer" is broken unless it has an additional trailing character. This means that writing "madv" in the past would match and set "madvise". With strict checking, that no longer is the case but it is unlikely anybody is currently doing this. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05mm/huge_memory.c: use head to check huge zero pageWei Yang
commit cb829624867b5ab10bc6a7036d183b1b82bfe9f8 upstream. The page could be a tail page, if this is the case, this BUG_ON will never be triggered. Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28mm/vmscan.c: don't round up scan size for online memory cgroupGavin Shan
commit 76073c646f5f4999d763f471df9e38a5a912d70d upstream. Commit 68600f623d69 ("mm: don't miss the last page because of round-off error") makes the scan size round up to @denominator regardless of the memory cgroup's state, online or offline. This affects the overall reclaiming behavior: the corresponding LRU list is eligible for reclaiming only when its size logically right shifted by @sc->priority is bigger than zero in the former formula. For example, the inactive anonymous LRU list should have at least 0x4000 pages to be eligible for reclaiming when we have 60/12 for swappiness/priority and without taking scan/rotation ratio into account. After the roundup is applied, the inactive anonymous LRU list becomes eligible for reclaiming when its size is bigger than or equal to 0x1000 in the same condition. (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1 ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1 aarch64 has 512MB huge page size when the base page size is 64KB. The memory cgroup that has a huge page is always eligible for reclaiming in that case. The reclaiming is likely to stop after the huge page is reclaimed, meaing the further iteration on @sc->priority and the silbing and child memory cgroups will be skipped. The overall behaviour has been changed. This fixes the issue by applying the roundup to offlined memory cgroups only, to give more preference to reclaim memory from offlined memory cgroup. It sounds reasonable as those memory is unlikedly to be used by anyone. The issue was found by starting up 8 VMs on a Ampere Mustang machine, which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB memory. It took 264 seconds for all VMs to be completely up and 784MB swap is consumed after that. With this patch applied, it took 236 seconds and 60MB swap to do same thing. So there is 10% performance improvement for my case. Note that KSM is disable while THP is enabled in the testing. total used free shared buff/cache available Mem: 16196 10065 2049 16 4081 3749 Swap: 8175 784 7391 total used free shared buff/cache available Mem: 16196 11324 3656 24 1215 2936 Swap: 8175 60 8115 Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [4.20+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()Vasily Averin
commit 75866af62b439859d5146b7093ceb6b482852683 upstream. for_each_mem_cgroup() increases css reference counter for memory cgroup and requires to use mem_cgroup_iter_break() if the walk is cancelled. Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11mm/page_alloc.c: fix uninitialized memmaps on a partially populated last sectionDavid Hildenbrand
[ Upstream commit e822969cab48b786b64246aad1a3ba2a774f5d23 ] Patch series "mm: fix max_pfn not falling on section boundary", v2. Playing with different memory sizes for a x86-64 guest, I discovered that some memmaps (highest section if max_mem does not fall on the section boundary) are marked as being valid and online, but contain garbage. We have to properly initialize these memmaps. Looking at /proc/kpageflags and friends, I found some more issues, partially related to this. This patch (of 3): If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (esp., one of the main reasons sub-section hotadd of devmem was added). The issue is, that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online". pfn_to_online_page() will succeed, but the memmap contains garbage. E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m 4160M" - (see tools/vm/page-types.c): [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe [ 200.477500] #PF: supervisor read access in kernel mode [ 200.478334] #PF: error_code(0x0000) - not-present page [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 [ 200.479557] Oops: 0000 [#4] SMP NOPTI [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93 [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410 [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202 [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000 [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246 [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001 [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08 [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000 [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0 [ 200.488897] Call Trace: [ 200.489115] kpageflags_read+0xe9/0x140 [ 200.489447] proc_reg_read+0x3c/0x60 [ 200.489755] vfs_read+0xc2/0x170 [ 200.490037] ksys_pread64+0x65/0xa0 [ 200.490352] do_syscall_64+0x5c/0xa0 [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe But it can be triggered much easier via "cat /proc/kpageflags > /dev/null" after cold/hot plugging a DIMM to such a system: [root@localhost ~]# cat /proc/kpageflags > /dev/null [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe [ 111.517907] #PF: supervisor read access in kernel mode [ 111.518333] #PF: error_code(0x0000) - not-present page [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 This patch fixes that by at least zero-ing out that memmap (so e.g., page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages") tried to fix a similar issue, but forgot to consider this special case. After this patch, there are still problems to solve. E.g., not all of these pages falling into a memory hole will actually get initialized later and set PageReserved - they are only zeroed out - but at least the immediate crashes are gone. A follow-up patch will take care of this. Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> [4.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-11mm: return zero_resv_unavail optimizationPavel Tatashin
[ Upstream commit ec393a0f014eaf688a3dbe8c8a4cbb52d7f535f9 ] When checking for valid pfns in zero_resv_unavail(), it is not necessary to verify that pfns within pageblock_nr_pages ranges are valid, only the first one needs to be checked. This is because memory for pages are allocated in contiguous chunks that contain pageblock_nr_pages struct pages. Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-11mm: zero remaining unavailable struct pagesNaoya Horiguchi
[ Upstream commit 907ec5fca3dc38d37737de826f06f25b063aa08e ] Patch series "mm: Fix for movable_node boot option", v3. This patch series contains a fix for the movable_node boot option issue which was introduced by commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). The commit breaks the option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. First and second patch fix the original issue which the commit tried to fix, then revert the commit. This patch (of 3): There is a kernel panic that is triggered when reading /proc/kpageflags on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]': BUG: unable to handle kernel paging request at fffffffffffffffe PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 RIP: 0010:stable_page_flags+0x27/0x3c0 Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7 RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0 RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001 R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0 R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10 FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0 Call Trace: kpageflags_read+0xc7/0x120 proc_reg_read+0x3c/0x60 __vfs_read+0x36/0x170 vfs_read+0x89/0x130 ksys_pread64+0x71/0x90 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7efc42e75e23 Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24 According to kernel bisection, this problem became visible due to commit f7f99100d8d9 which changes how struct pages are initialized. Memblock layout affects the pfn ranges covered by node/zone. Consider that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the default (no memmap= given) memblock layout is like below: MEMBLOCK configuration: memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x4 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0 memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... If you give memmap=1G!4G (so it just covers memory[0x2]), the range [0x100000000-0x13fffffff] is gone: MEMBLOCK configuration: memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x3 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... This causes shrinking node 0's pfn range because it is calculated by the address range of memblock.memory. So some of struct pages in the gap range are left uninitialized. We have a function zero_resv_unavail() which does zeroing the struct pages outside memblock.memory, but currently it covers only the reserved unavailable range (i.e. memblock.memory && !memblock.reserved). This patch extends it to cover all unavailable range, which fixes the reported issue. Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-11mm: move_pages: report the number of non-attempted pagesYang Shi
commit 5984fabb6e82d9ab4e6305cb99694c85d46de8ae upstream. Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the semantic of move_pages() has changed to return the number of non-migrated pages if they were result of a non-fatal reasons (usually a busy page). This was an unintentional change that hasn't been noticed except for LTP tests which checked for the documented behavior. There are two ways to go around this change. We can even get back to the original behavior and return -EAGAIN whenever migrate_pages is not able to migrate pages due to non-fatal reasons. Another option would be to simply continue with the changed semantic and extend move_pages documentation to clarify that -errno is returned on an invalid input or when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the number of pages that couldn't have been migrated due to ephemeral reasons (e.g. page is pinned or locked for other reasons). This patch implements the second option because this behavior is in place for some time without anybody complaining and possibly new users depending on it. Also it allows to have a slightly easier error handling as the caller knows that it is worth to retry when err > 0. But since the new semantic would be aborted immediately if migration is failed due to ephemeral reasons, need include the number of non-attempted pages in the return value too. Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11mm/memory_hotplug: fix remove_memory() lockdep splatDan Williams
commit f1037ec0cc8ac1a450974ad9754e991f72884f48 upstream. The daxctl unit test for the dax_kmem driver currently triggers the (false positive) lockdep splat below. It results from the fact that remove_memory_block_devices() is invoked under the mem_hotplug_lock() causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs active state tracking). It is a false positive because the sysfs attribute path triggering the memory remove is not the same attribute path associated with memory-block device. sysfs_break_active_protection() is not applicable since there is no real deadlock conflict, instead move memory-block device removal outside the lock. The mem_hotplug_lock() is not needed to synchronize the memory-block device removal vs the page online state, that is already handled by lock_device_hotplug(). Specifically, lock_device_hotplug() is sufficient to allow try_remove_memory() to check the offline state of the memblocks and be assured that any in progress online attempts are flushed / blocked by kernfs_drain() / attribute removal. The add_memory() path safely creates memblock devices under the mem_hotplug_lock(). There is no kernfs active state synchronization in the memblock device_register() path, so nothing to fix there. This change is only possible thanks to the recent change that refactored memory block device removal out of arch_remove_memory() (commit 4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before arch_remove_memory()"), and David's due diligence tracking down the guarantees afforded by kernfs_drain(). Not flagged for -stable since this only impacts ongoing development and lockdep validation, not a runtime issue. ====================================================== WARNING: possible circular locking dependency detected 5.5.0-rc3+ #230 Tainted: G OE ------------------------------------------------------ lt-daxctl/6459 is trying to acquire lock: ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80 but task is already holding lock: ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (mem_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 get_online_mems+0x3e/0xb0 kmem_cache_create_usercopy+0x2e/0x260 kmem_cache_create+0x12/0x20 ptlock_cache_init+0x20/0x28 start_kernel+0x243/0x547 secondary_startup_64+0xb6/0xc0 -> #1 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 cpus_read_lock+0x3e/0xb0 online_pages+0x37/0x300 memory_subsys_online+0x17d/0x1c0 device_online+0x60/0x80 state_store+0x65/0xd0 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (kn->count#241){++++}: check_prev_add+0x98/0xa40 validate_chain+0x576/0x860 __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 __kernfs_remove+0x25f/0x2e0 kernfs_remove_by_name_ns+0x41/0x80 remove_files.isra.0+0x30/0x70 sysfs_remove_group+0x3d/0x80 sysfs_remove_groups+0x29/0x40 device_remove_attrs+0x39/0x70 device_del+0x16a/0x3f0 device_unregister+0x16/0x60 remove_memory_block_devices+0x82/0xb0 try_remove_memory+0xb5/0x130 remove_memory+0x26/0x40 dev_dax_kmem_remove+0x44/0x6a [kmem] device_release_driver_internal+0xe4/0x1c0 unbind_store+0xef/0x120 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(mem_hotplug_lock.rw_sem); lock(cpu_hotplug_lock.rw_sem); lock(mem_hotplug_lock.rw_sem); lock(kn->count#241); *** DEADLOCK *** No fixes tag as this has been a long standing issue that predated the addition of kernfs lockdep annotations. Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-05mm/migrate.c: also overwrite error when it is bigger than zeroWei Yang
[ Upstream commit dfe9aa23cab7880a794db9eb2d176c00ed064eb6 ] If we get here after successfully adding page to list, err would be 1 to indicate the page is queued in the list. Current code has two problems: * on success, 0 is not returned * on error, if add_page_for_migratioin() return 1, and the following err1 from do_move_pages_to_node() is set, the err1 is not returned since err is 1 And these behaviors break the user interface. Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node"). Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>