path: root/mm
Age         Commit message                                                        Author
2020-03-21  Merge branch 'linux-5.2.y' into v5.2/standard/base  (Bruce Ashfield)
2020-03-20  mm/migrate.c: also overwrite error when it is bigger than zero  (Wei Yang)
commit dfe9aa23cab7880a794db9eb2d176c00ed064eb6 upstream.

If we get here after successfully adding the page to the list, err would be 1 to indicate the page is queued in the list.

Current code has two problems:

  * on success, 0 is not returned
  * on error, if add_page_for_migration() returns 1, and the following err1 from do_move_pages_to_node() is set, err1 is not returned since err is 1

And these behaviors break the user interface.

Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
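A simplified sketch of the error-propagation rule this fix describes, for orientation only; flush_queued_pages() is a hypothetical stand-in for the do_move_pages_to_node()/store_status() flush and the function is not the upstream code. The point is that a positive "page queued" indicator has to be folded to 0 on success and must not mask a later failure.

  /* Illustrative only: err == 1 means "page queued for migration". */
  static int do_move_pages_sketch(void)
  {
  	int err = 0, err1;

  	/* ... per-page loop; add_page_for_migration() sets err to 1 when a page is queued ... */

  	err1 = flush_queued_pages();		/* hypothetical helper */
  	/*
  	 * Report err1 unless a real (negative) error is already pending.
  	 * This also turns the "queued" value 1 into 0 when the flush succeeds.
  	 */
  	if (err >= 0)
  		err = err1;
  	return err;
  }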
2020-03-20  mm/hugetlb: defer freeing of huge pages if in non-task context  (Waiman Long)
commit c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d upstream.

The following lockdep splat was observed when a certain hugetlbfs test was run:

  ================================
  WARNING: inconsistent lock state
  4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
  --------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
  {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    __nr_hugepages_store_common+0x11b/0xb30
    hugetlb_sysctl_handler_common+0x209/0x2d0
    proc_sys_call_handler+0x37f/0x450
    vfs_write+0x157/0x460
    ksys_write+0xb8/0x170
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf
  irq event stamp: 691296
  hardirqs last enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
  hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
  softirqs last enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
  softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(hugetlb_lock);
    <Interrupt>
      lock(hugetlb_lock);

   *** DEADLOCK ***
  :
  Call Trace:
   <IRQ>
   __lock_acquire+0x146b/0x48c0
   lock_acquire+0x14f/0x3b0
   _raw_spin_lock+0x30/0x70
   free_huge_page+0x36f/0xaa0
   bio_check_pages_dirty+0x2fc/0x5c0
   clone_endio+0x17f/0x670 [dm_mod]
   blk_update_request+0x276/0xe50
   scsi_end_request+0x7b/0x6a0
   scsi_io_completion+0x1c6/0x1570
   blk_done_softirq+0x22e/0x350
   __do_softirq+0x23d/0xad8
   irq_exit+0x23e/0x2b0
   do_IRQ+0x11a/0x200
   common_interrupt+0xf/0xf
   </IRQ>

Both the hugetlb_lock and the subpool lock can be acquired in free_huge_page(). One way to solve the problem is to make both locks irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is held for a linear scan of ALL hugetlb pages during a cgroup reparenting operation. So it is just too long to have irq disabled unless we can break hugetlb_lock down into finer-grained locks with shorter lock hold times.

Another alternative is to defer the freeing to a workqueue job. This patch implements the deferred freeing by adding a free_hpage_workfn() work function to do the actual freeing. The free_huge_page() call in a non-task context saves the page to be freed in the hpage_freelist linked list in a lockless manner using the llist APIs. The generic workqueue is used to process the work, but a dedicated workqueue can be used instead if it is desirable to have the huge page freed ASAP.

Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of llist APIs which simplify the code.

Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
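A minimal sketch of the deferral pattern the commit describes, using the llist and workqueue APIs. The names free_hpage_workfn and hpage_freelist come from the commit message; __free_huge_page() is assumed to be the internal helper that does the real freeing, and the hunk is illustrative rather than the upstream diff.

  static LLIST_HEAD(hpage_freelist);

  static void free_hpage_workfn(struct work_struct *work)
  {
  	struct llist_node *node = llist_del_all(&hpage_freelist);

  	while (node) {
  		struct page *page = container_of((void *)node, struct page, mapping);

  		node = node->next;
  		page->mapping = NULL;
  		__free_huge_page(page);		/* actual freeing, now in task context */
  	}
  }
  static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

  void free_huge_page(struct page *page)
  {
  	/*
  	 * In softirq/interrupt context taking hugetlb_lock (not irq-safe)
  	 * could deadlock, so queue the page and let a worker free it.
  	 */
  	if (!in_task()) {
  		if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
  			schedule_work(&free_hpage_work);
  		return;
  	}
  	__free_huge_page(page);
  }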
2020-03-20  arm64: Revert support for execute-only user mappings  (Catalin Marinas)
commit 24cecc37746393432d994c0dbc251fb9ac7c5d72 upstream. The ARMv8 64-bit architecture supports execute-only user permissions by clearing the PTE_USER and PTE_UXN bits, practically making it a mostly privileged mapping but from which user running at EL0 can still execute. The downside, however, is that the kernel at EL1 inadvertently reading such mapping would not trip over the PAN (privileged access never) protection. Revert the relevant bits from commit cab15ce604e5 ("arm64: Introduce execute-only page access permissions") so that PROT_EXEC implies PROT_READ (and therefore PTE_USER) until the architecture gains proper support for execute-only user mappings. Fixes: cab15ce604e5 ("arm64: Introduce execute-only page access permissions") Cc: <stable@vger.kernel.org> # 4.9.x- Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20  mm/gup: fix memory leak in __gup_benchmark_ioctl  (Navid Emamdoost)
commit a7c46c0c0e3d62f2764cd08b90934cd2aaaf8545 upstream. In the implementation of __gup_benchmark_ioctl() the allocated pages should be released before returning in case of an invalid cmd. Release pages via kvfree(). [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1] Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com Fixes: 714a3a1ebafe ("mm/gup_benchmark.c: add additional pinning methods") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20  mm: move_pages: return valid node id in status if the page is already on the target node  (Yang Shi)
commit e0153fc2c7606f101392b682e720a7a456d6c766 upstream.

Felix Abecassis reports move_pages() would return random status if the pages are already on the target node by the below test program:

  int main(void)
  {
  	const long node_id = 1;
  	const long page_size = sysconf(_SC_PAGESIZE);
  	const int64_t num_pages = 8;

  	unsigned long nodemask = 1 << node_id;
  	long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
  	if (ret < 0)
  		return (EXIT_FAILURE);

  	void **pages = malloc(sizeof(void*) * num_pages);
  	for (int i = 0; i < num_pages; ++i) {
  		pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
  				MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, -1, 0);
  		if (pages[i] == MAP_FAILED)
  			return (EXIT_FAILURE);
  	}

  	ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
  	if (ret < 0)
  		return (EXIT_FAILURE);

  	int *nodes = malloc(sizeof(int) * num_pages);
  	int *status = malloc(sizeof(int) * num_pages);
  	for (int i = 0; i < num_pages; ++i) {
  		nodes[i] = node_id;
  		status[i] = 0xd0; /* simulate garbage values */
  	}

  	ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
  	printf("move_pages: %ld\n", ret);
  	for (int i = 0; i < num_pages; ++i)
  		printf("status[%d] = %d\n", i, status[i]);
  }

Then running the program would return nonsense status values:

  $ ./move_pages_bug
  move_pages: 0
  status[0] = 208
  status[1] = 208
  status[2] = 208
  status[3] = 208
  status[4] = 208
  status[5] = 208
  status[6] = 208
  status[7] = 208

This is because the status is not set if the page is already on the target node, but move_pages() should return valid status as long as it succeeds. The valid status may be errno or node id. We can't simply initialize the status array to zero since the pages may not be on node 0. Fix it by updating status with the node id which the page is already on.

Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Felix Abecassis <fabecassis@nvidia.com> Tested-by: Felix Abecassis <fabecassis@nvidia.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-03-20  mm/zsmalloc.c: fix the migrated zspage statistics  (Chanho Min)
commit ac8f05da5174c560de122c499ce5dfb5d0dfbee5 upstream.

When a zspage is migrated to the other zone, the zone page state should be updated as well, otherwise the NR_ZSPAGES counter for each zone shows wrong counts, including in /proc/zoneinfo in practice.

Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com Fixes: 91537fee0013 ("mm: add NR_ZSMALLOC to vmstat") Signed-off-by: Chanho Min <chanho.min@lge.com> Signed-off-by: Jinsuk Choi <jjinsuk.choi@lge.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
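The fix boils down to transferring the per-zone NR_ZSPAGES count when a migration crosses zone boundaries. A hedged sketch of that accounting step, placed in the zspage migration path (not necessarily the exact upstream hunk):

  	if (page_zone(newpage) != page_zone(page)) {
  		dec_zone_page_state(page, NR_ZSPAGES);
  		inc_zone_page_state(newpage, NR_ZSPAGES);
  	}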
2020-03-20  mm: drop mmap_sem before calling balance_dirty_pages() in write fault  (Johannes Weiner)
commit 89b15332af7c0312a41e50846819ca6613b58b4c upstream.

One of our services is observing hanging ps/top/etc under heavy write IO, and the task states show this is an mmap_sem priority inversion:

A write fault is holding the mmap_sem in read-mode and waiting for (heavily cgroup-limited) IO in balance_dirty_pages():

  balance_dirty_pages+0x724/0x905
  balance_dirty_pages_ratelimited+0x254/0x390
  fault_dirty_shared_page.isra.96+0x4a/0x90
  do_wp_page+0x33e/0x400
  __handle_mm_fault+0x6f0/0xfa0
  handle_mm_fault+0xe4/0x200
  __do_page_fault+0x22b/0x4a0
  page_fault+0x45/0x50

Somebody tries to change the address space, contending for the mmap_sem in write-mode:

  call_rwsem_down_write_failed_killable+0x13/0x20
  do_mprotect_pkey+0xa8/0x330
  SyS_mprotect+0xf/0x20
  do_syscall_64+0x5b/0x100
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The waiting writer locks out all subsequent readers to avoid lock starvation, and several threads can be seen hanging like this:

  call_rwsem_down_read_failed+0x14/0x30
  proc_pid_cmdline_read+0xa0/0x480
  __vfs_read+0x23/0x140
  vfs_read+0x87/0x130
  SyS_read+0x42/0x90
  do_syscall_64+0x5b/0x100
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

To fix this, do what we do for cache read faults already: drop the mmap_sem before calling into anything IO bound, in this case the balance_dirty_pages() function, and return VM_FAULT_RETRY.

Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
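A hedged sketch of the fault-path pattern described above (not the exact upstream diff): when the caller allows the fault to be retried, pin the file, drop mmap_sem around the IO-bound throttling, and return VM_FAULT_RETRY.

  	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
  		struct file *fpin = get_file(vmf->vma->vm_file);

  		up_read(&vmf->vma->vm_mm->mmap_sem);
  		balance_dirty_pages_ratelimited(mapping);	/* may block on cgroup IO limits */
  		fput(fpin);
  		return VM_FAULT_RETRY;		/* caller re-takes mmap_sem and retries */
  	}
  	balance_dirty_pages_ratelimited(mapping);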
2020-03-20  shmem: pin the file in shmem_fault() if mmap_sem is dropped  (Kirill A. Shutemov)
commit 8897c1b1a1795cab23d5ac13e4e23bf0b5f4e0c6 upstream.

syzbot found the following crash:

  BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
  Read of size 8 at addr ffff8880a5cf2c50 by task syz-executor.0/26173
  CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
   trace_lock_acquire include/trace/events/lock.h:13 [inline]
   lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
   _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
   spin_lock include/linux/spinlock.h:338 [inline]
   shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
   __do_fault+0x111/0x540 mm/memory.c:3083
   do_shared_fault mm/memory.c:3535 [inline]
   do_fault mm/memory.c:3613 [inline]
   handle_pte_fault mm/memory.c:3840 [inline]
   __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
   handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
   do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
   __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
   do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
   page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202

It happens if the VMA got unmapped under us while we dropped mmap_sem and inode got freed. Pinning the file if we drop mmap_sem fixes the issue.

Link: http://lkml.kernel.org/r/20190927083908.rhifa4mmaxefc24r@box Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: syzbot+03ee87124ee05af991bd@syzkaller.appspotmail.com Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
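The essence of the fix, as a hedged sketch (names follow the commit context, the hunk itself is illustrative, not the upstream diff): take a reference on vma->vm_file before shmem_fault() drops mmap_sem, so the inode cannot be freed while the fault is parked waiting for a hole-punch to finish.

  	struct file *fpin = NULL;

  	if (fault_flags & FAULT_FLAG_ALLOW_RETRY) {
  		fpin = get_file(vma->vm_file);	/* keeps the inode alive without mmap_sem */
  		up_read(&vma->vm_mm->mmap_sem);
  		ret = VM_FAULT_RETRY;
  	}
  	/* ... wait for the hole-punch to complete ... */
  	if (fpin)
  		fput(fpin);
  	return ret;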
2020-02-05  Merge tag 'v5.2.31' into v5.2/standard/base  (Bruce Ashfield)
This is the 5.2.31 stable release

# gpg: Signature made Fri 31 Jan 2020 04:58:15 PM EST
# gpg: using RSA key EBCE84042C07D1D6
# gpg: Can't check signature: No public key
2020-01-31  mm/shmem.c: cast the type of unmap_start to u64  (Chen Jun)
commit aa71ecd8d86500da6081a72da6b0b524007e0627 upstream.

In a 64-bit system, sb->s_maxbytes of the shmem filesystem is MAX_LFS_FILESIZE, which equals LLONG_MAX.

If offset > LLONG_MAX - PAGE_SIZE, then offset + len < LLONG_MAX in shmem_fallocate, which passes the checking in vfs_fallocate:

  /* Check for wrap through zero too */
  if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
  	return -EFBIG;

loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate then causes an overflow.

Syzkaller reports an overflow problem in mm/shmem:

  UBSAN: Undefined behaviour in mm/shmem.c:2014:10
  signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
  CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
  Hardware name: linux, dummy-virt (DT)
  Call trace:
   dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
   show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
   __dump_stack lib/dump_stack.c:15 [inline]
   ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
   handle_overflow+0x158/0x1b0 lib/ubsan.c:195
   shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
   vfs_fallocate+0x238/0x428 fs/open.c:312
   SYSC_fallocate fs/open.c:335 [inline]
   SyS_fallocate+0x54/0xc8 fs/open.c:239

The highest bit of unmap_start will be set to 1 (overflow) when shmem_falloc.start is calculated as shmem_falloc.start = unmap_start >> PAGE_SHIFT. Fix it by casting the type of unmap_start to u64 when it is right shifted.

This bug was found in LTS Linux 4.1. It also seems to exist in mainline.

Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com Signed-off-by: Chen Jun <chenjun102@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
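A small user-space demonstration of why the shift has to be done on an unsigned value. The constant below is hypothetical, simply modelling a wrapped unmap_start with its sign bit set; on common compilers the signed shift sign-extends, while the u64 cast yields the intended page index.

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
  	/* Pretend round_up() wrapped and left the sign bit set
  	 * (implementation-defined conversion, fine on the usual
  	 * two's-complement targets). */
  	long long unmap_start = (long long)0x8000000000000000ULL;

  	printf("signed  >> 12: %lld\n", unmap_start >> 12);		/* negative */
  	printf("(u64)   >> 12: %llu\n", (uint64_t)unmap_start >> 12);	/* intended index */
  	return 0;
  }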
2020-01-31  mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction  (Roman Gushchin)
commit a264df74df38855096393447f1b8f386069a94b9 upstream.

Christian reported a warning like the following obtained during running some KVM-related tests on s390:

  WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
  Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
  CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
  Hardware name: IBM 2964 NC9 712 (LPAR)
  Workqueue: events sysfs_slab_remove_workfn
  Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
  Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
             0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
             0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
             0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
  Krnl Code: 0000001529746842: f0a0000407fe  srp  4(11,%r0),2046,0
             0000001529746848: 47000700      bc   0,1792
            #000000152974684c: a7f40001      brc  15,152974684e
            >0000001529746850: a7f4fff2      brc  15,1529746834
             0000001529746854: 0707          bcr  0,%r7
             0000001529746856: 0707          bcr  0,%r7
             0000001529746858: eb8ff0580024  stmg %r8,%r15,88(%r15)
             000000152974685e: a738ffff      lhi  %r3,-1
  Call Trace:
  ([<000003e009263d00>] 0x3e009263d00)
   [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
   [<0000001529b04882>] kobject_put+0xaa/0xe8
   [<000000152918cf28>] process_one_work+0x1e8/0x428
   [<000000152918d1b0>] worker_thread+0x48/0x460
   [<00000015291942c6>] kthread+0x126/0x160
   [<0000001529b22344>] ret_from_fork+0x28/0x30
   [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
  Last Breaking-Event-Address:
   [<000000152974684c>] percpu_ref_exit+0x4c/0x58
  ---[ end trace b035e7da5788eb09 ]---

The problem occurs because kmem_cache_destroy() is called immediately after deleting of a memcg, so it races with the memcg kmem_cache deactivation.

flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is supposed to guarantee that all deactivation processes are finished, but failed to do so. It waits for an rcu grace period, after which all children kmem_caches should be deactivated. During the deactivation percpu_ref_kill() is called for non root kmem_cache refcounters, but it requires yet another rcu grace period to finish the transition to the atomic (dead) state.

So in a rare case when not all children kmem_caches are destroyed at the moment when the root kmem_cache is about to be gone, we need to wait another rcu grace period before destroying the root kmem_cache.

This issue can be triggered only with dynamically created kmem_caches which are used with memcg accounting. In this case per-memcg child kmem_caches are created. They are deactivated from the cgroup removing path. If the destruction of the root kmem_cache is racing with the removal of the cgroup (both are quite complicated multi-stage processes), the described issue can occur. The only known way to trigger it in the real life, is to unload some kernel module which creates a dedicated kmem_cache, used from different memory cgroups with GFP_ACCOUNT flag. If the unloading happens immediately after calling rmdir on the corresponding cgroup, there is some chance to trigger the issue.
Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-31  mm, memfd: fix COW issue on MAP_PRIVATE and F_SEAL_FUTURE_WRITE mappings  (Nicolas Geoffray)
commit 05d351102dbe4e103d6bdac18b1122cd3cd04925 upstream.

F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE: A private mapping created after the memfd file that gets sealed with F_SEAL_FUTURE_WRITE loses the copy-on-write at fork behavior, meaning children and parent share the same memory, even though the mapping is private.

The reason is the code below:

  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
  {
  	struct shmem_inode_info *info = SHMEM_I(file_inode(file));

  	if (info->seals & F_SEAL_FUTURE_WRITE) {
  		/*
  		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
  		 * "future write" seal active.
  		 */
  		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
  			return -EPERM;

  		/*
  		 * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED
  		 * read-only mapping, take care to not allow mprotect to revert
  		 * protections.
  		 */
  		vma->vm_flags &= ~(VM_MAYWRITE);
  	}
  	...
  }

And for the mm to know if a mapping is copy-on-write:

  static inline bool is_cow_mapping(vm_flags_t flags)
  {
  	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
  }

The patch fixes the issue by making the mprotect revert protection happen only for shared mappings. For private mappings, using mprotect will have no effect on the seal behavior.

The F_SEAL_FUTURE_WRITE feature was introduced in v5.1 so v5.3.x stable kernels would need a backport.

[akpm@linux-foundation.org: reflow comment, per Christoph]

Link: http://lkml.kernel.org/r/20191107195355.80608-1-joel@joelfernandes.org Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd") Signed-off-by: Nicolas Geoffray <ngeoffray@google.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
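A hedged user-space reproducer sketch for the behavior described above: a MAP_PRIVATE mapping of a memfd sealed with F_SEAL_FUTURE_WRITE should stay copy-on-write across fork(); with the bug, the child's write leaks into the parent. It assumes a libc that exposes memfd_create() and F_SEAL_FUTURE_WRITE, and error handling is omitted.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
  	int fd = memfd_create("seal-test", MFD_CLOEXEC | MFD_ALLOW_SEALING);

  	ftruncate(fd, 4096);
  	fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);

  	/* A private writable mapping created after sealing is still allowed. */
  	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

  	strcpy(p, "parent");
  	if (fork() == 0) {
  		strcpy(p, "child");	/* should hit CoW, invisible to the parent */
  		_exit(0);
  	}
  	wait(NULL);
  	printf("parent sees: %s\n", p);	/* "parent" on a fixed kernel */
  	return 0;
  }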
2020-01-15  Merge tag 'v5.2.29' into v5.2/standard/base  (Bruce Ashfield)
This is the 5.2.29 stable release

# gpg: Signature made Fri 10 Jan 2020 09:45:31 AM EST
# gpg: using RSA key EBCE84042C07D1D6
# gpg: Can't check signature: No public key
2020-01-09  mm/ksm.c: don't WARN if page is still mapped in remove_stable_node()  (Andrey Ryabinin)
commit 9a63236f1ad82d71a98aa80320b6cb618fb32f44 upstream.

It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in remove_stable_node() when it races with __mmput() and squeezes in between ksm_exit() and exit_mmap().

  WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150
  Call Trace:
   remove_all_stable_nodes+0x12b/0x330
   run_store+0x4ef/0x7b0
   kernfs_fop_write+0x200/0x420
   vfs_write+0x154/0x450
   ksys_write+0xf9/0x1d0
   do_syscall_64+0x99/0x510
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Remove the warning as there is nothing scary going on.

Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com Fixes: cbf86cfe04a6 ("ksm: remove old stable nodes more thoroughly") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09  mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span()  (David Hildenbrand)
commit 7ce700bf11b5e2cb84e4352bbdf2123a7a239c84 upstream.

Let's limit shrinking to !ZONE_DEVICE so we can fix the current code. We should never try to touch the memmap of offline sections where we could have uninitialized memmaps and could trigger BUGs when calling page_to_nid() on poisoned pages.

There is no reliable way to distinguish an uninitialized memmap from an initialized memmap that belongs to ZONE_DEVICE, as we don't have anything like SECTION_IS_ONLINE we can use similar to pfn_to_online_section() for !ZONE_DEVICE memory. E.g., set_zone_contiguous() similarly relies on pfn_to_online_section() and will therefore never set a ZONE_DEVICE zone consecutive. Stopping to shrink the ZONE_DEVICE therefore results in no observable changes, besides /proc/zoneinfo indicating different boundaries - something we can totally live with.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug"), the memmap was initialized with 0 and the node with the right value. So the zone might be wrong but not garbage. After that commit, both the zone and the node will be garbage when touching uninitialized memmaps.

Toshiki reported a BUG (race between delayed initialization of ZONE_DEVICE memmaps without holding the memory hotplug lock and concurrent zone shrinking). https://lkml.org/lkml/2019/11/14/1040

"Iteration of create and destroy namespace causes the panic as below:

  kernel BUG at mm/page_alloc.c:535!
  CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
  Call Trace:
   memmap_init_zone_device+0x165/0x17c
   memremap_pages+0x4c1/0x540
   devm_memremap_pages+0x1d/0x60
   pmem_attach_disk+0x16b/0x600 [nd_pmem]
   nvdimm_bus_probe+0x69/0x1c0
   really_probe+0x1c2/0x3e0
   driver_probe_device+0xb4/0x100
   device_driver_attach+0x4f/0x60
   bind_store+0xc9/0x110
   kernfs_fop_write+0x116/0x190
   vfs_write+0xa5/0x1a0
   ksys_write+0x59/0xd0
   do_syscall_64+0x5b/0x180
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

While creating a namespace and initializing memmap, if you destroy the namespace and shrink the zone, it will initialize the memmap outside the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page) in set_pfnblock_flags_mask()."

This BUG is also mitigated by this commit, where we for now stop to shrink the ZONE_DEVICE zone until we can do it in a safe and clean way.
Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Damian Tometzki <damian.tometzki@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jun Yao <yaojun8558363@gmail.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Rich Felker <dalias@libc.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [PG: Use 4.19.87-stable verstion of the backport for v5.2.x codebase.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09  mm/memory_hotplug: fix updating the node span  (David Hildenbrand)
commit 656d571193262a11c2daa4012e53e4d645bbce56 upstream. We recently started updating the node span based on the zone span to avoid touching uninitialized memmaps. Currently, we will always detect the node span to start at 0, meaning a node can easily span too many pages. pgdat_is_empty() will still work correctly if all zones span no pages. We should skip over all zones without spanned pages and properly handle the first detected zone that spans pages. Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node span cannot easily be inspected and tested. The node span gives no real guarantees when an architecture supports memory hotplug, meaning it can easily contain holes or span pages of different nodes. The node span is not really used after init on architectures that support memory hotplug. E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in mm/kmemleak.c:kmemleak_scan(). These users seem to be fine. Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com Fixes: 00d6c019b5bc ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09  mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()  (David Hildenbrand)
commit 00d6c019b5bc175cee3770e0e659f2b5f4804ea5 upstream. We might use the nid of memmaps that were never initialized. For example, if the memmap was poisoned, we will crash the kernel in pfn_to_nid() right now. Let's use the calculated boundaries of the separate zones instead. This now also avoids having to iterate over a whole bunch of subsections again, after shrinking one zone. Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug"), the memmap was initialized to 0 and the node was set to the right value. After that commit, the node might be garbage. We'll have to fix shrink_zone_span() next. Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Damian Tometzki <damian.tometzki@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jun Yao <yaojun8558363@gmail.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Rich Felker <dalias@libc.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [PG: use 4.19.86-stable version of the backport.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09  mm/page_io.c: do not free shared swap slots  (Vinayak Menon)
commit 5df373e95689b9519b8557da7c5bd0db0856d776 upstream.

The following race is observed due to which a process faulting on a swap entry finds the page neither in swapcache nor swap. This causes zram to give a zero filled page that gets mapped to the process, resulting in a user space crash later.

Consider parent and child processes Pa and Pb sharing the same swap slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set. Virtual address 'VA' of Pa and Pb points to the shared swap entry.

  Pa                                      Pb

  fault on VA                             fault on VA
  do_swap_page                            do_swap_page
  lookup_swap_cache fails                 lookup_swap_cache fails
                                          Pb scheduled out
  swapin_readahead (deletes zram entry)
  swap_free (makes swap_count 1)
                                          Pb scheduled in
                                          swap_readpage (swap_count == 1)
                                          Takes SWP_SYNCHRONOUS_IO path
                                          zram entry absent
                                          zram gives a zero filled page

Fix this by making sure that the swap slot is freed only when the swap count drops down to one.

Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org Fixes: aa8d22a11da9 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference") Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Suggested-by: Minchan Kim <minchan@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09  mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()  (Roman Gushchin)
commit 0362f326d86c645b5e96b7dbc3ee515986ed019d upstream.

An exiting task might belong to an offline cgroup. In this case an attempt to grab a cgroup reference from the task can end up with an infinite loop in hugetlb_cgroup_charge_cgroup(), because neither will the cgroup become online, nor will the task be migrated to a live cgroup.

Fix this by switching over to css_tryget(). As css_tryget_online() can't guarantee that the cgroup won't go offline, in most cases the check doesn't make sense. In this particular case users of hugetlb_cgroup_charge_cgroup() are not affected by this change.

A similar problem is described by commit 18fa84a2db0e ("cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()").

Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09  mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()  (Roman Gushchin)
commit 00d484f354d85845991b40141d40ba9e5eb60faf upstream.

We've encountered a rcu stall in get_mem_cgroup_from_mm():

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017 (t=21031 jiffies g=324821 q=95837)
  NMI backtrace for cpu 33
  <...>
  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
  <...>
   __memcg_kmem_charge+0x55/0x140
   __alloc_pages_nodemask+0x267/0x320
   pipe_write+0x1ad/0x400
   new_sync_write+0x127/0x1c0
   __kernel_write+0x4f/0xf0
   dump_emit+0x91/0xc0
   writenote+0xa0/0xc0
   elf_core_dump+0x11af/0x1430
   do_coredump+0xc65/0xee0
   get_signal+0x132/0x7c0
   do_signal+0x36/0x640
   exit_to_usermode_loop+0x61/0xd0
   do_syscall_64+0xd4/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is caused by an exiting task which is associated with an offline memcg. We're iterating over and over in the do {} while (!css_tryget_online()) loop, but obviously the memcg won't become online and the exiting task won't be migrated to a live memcg.

Let's fix it by switching from css_tryget_online() to css_tryget(). As css_tryget_online() cannot guarantee that the memcg won't go offline, the check is usually useless, except some rare cases when for example it determines if something should be presented to a user.

A similar problem is described by commit 18fa84a2db0e ("cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()").

Johannes:
: The bug aside, it doesn't matter whether the cgroup is online for the
: callers. It used to matter when offlining needed to evacuate all charges
: from the memcg, and so needed to prevent new ones from showing up, but we
: don't care now.

Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
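A hedged sketch of the loop after the change, based on the commit description rather than a verbatim copy of get_mem_cgroup_from_mm(): the retry only requires that a reference can be taken, not that the memcg is still online.

  	struct mem_cgroup *memcg;

  	rcu_read_lock();
  	do {
  		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
  		if (unlikely(!memcg))
  			memcg = root_mem_cgroup;
  	} while (!css_tryget(&memcg->css));	/* was css_tryget_online() */
  	rcu_read_unlock();
  	return memcg;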
2020-01-09  mm: mempolicy: fix the wrong return value and potential pages leak of mbind  (Yang Shi)
commit a85dfc305a21acfc48fa28a0fa0a0cb6ad496120 upstream.

Commit d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value of mbind() for a couple of corner cases. But, it altered the errno for some other cases; for example, mbind() should return -EFAULT when part or all of the memory range specified by nodemask and maxnode points outside your accessible address space, or when there was an unmapped hole in the memory range specified by addr and len.

Fix this by preserving the errno returned by queue_pages_range(). And, the pagelist may not be empty even though queue_pages_range() returns an error; put those pages back on the LRU since mbind_range() is not called to really apply the policy, so those pages should not be migrated. This is also the old behavior before the problematic commit.

Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.19 and 5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-29  Merge tag 'v5.2.28' into v5.2/standard/base  (Bruce Ashfield)
This is the 5.2.28 stable release

# gpg: Signature made Sun 29 Dec 2019 05:29:45 PM EST
# gpg: using RSA key EBCE84042C07D1D6
# gpg: Can't check signature: No public key
2019-12-29  mm/filemap.c: don't initiate writeback if mapping has no dirty pages  (Konstantin Khlebnikov)
commit c3aab9a0bd91b696a852169479b7db1ece6cbf8c upstream.

Functions like filemap_write_and_wait_range() should do nothing if the inode has no dirty pages or pages currently under writeback. But they anyway construct a struct writeback_control and this does some atomic operations if CONFIG_CGROUP_WRITEBACK=y - on the fast path it locks inode->i_lock and updates the state of writeback ownership; on the slow path it might be more work. Currently this path is safely avoided only when the inode mapping has no pages.

For example generic_file_read_iter() calls filemap_write_and_wait_range() at each O_DIRECT read - a pretty hot path.

This patch skips starting new writeback if the mapping has no dirty tags set. If writeback is already in progress filemap_write_and_wait_range() will wait for it.

Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
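A hedged sketch of the early-out this change adds (the exact placement and companion checks upstream may differ): skip starting writeback when the mapping has no pages tagged dirty, and let the existing wait handle IO already in flight.

  	/* In the write-out path, before constructing a writeback_control: */
  	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
  		return 0;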
2019-12-29  mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y  (Ville Syrjälä)
commit ec649c9d454ea372dcf16cccf48250994f1d7788 upstream.

I got some khugepaged spew on a 32bit x86:

  BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
  INFO: lockdep is turned off.
  CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
  Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203 07/08/2009
  Call Trace:
   dump_stack+0x66/0x8e
   ___might_sleep.cold.96+0x95/0xa6
   __might_sleep+0x2e/0x80
   collapse_huge_page.isra.51+0x5ac/0x1360
   khugepaged+0x9a9/0x20f0
   kthread+0xf5/0x110
   ret_from_fork+0x2e/0x38

Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic() vs. mmu_notifier_invalidate_range_start(). Let's do the naive approach and just reorder the two operations.

Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com Fixes: 810e24e009cf71 ("mm/mmu_notifiers: annotate with might_sleep()") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-29  mm, vmstat: hide /proc/pagetypeinfo from normal users  (Michal Hocko)
commit abaed0112c1db08be15a784a2c5c8a8b3063cdd3 upstream. /proc/pagetypeinfo is a debugging tool to examine internal page allocator state wrt to fragmentation. It is not very useful for any other use so normal users really do not need to read this file. Waiman Long has noticed that reading this file can have negative side effects because zone->lock is necessary for gathering data and that a) interferes with the page allocator and its users and b) can lead to hard lockups on large machines which have very long free_list. Reduce both issues by simply not exporting the file to regular users. Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Waiman Long <longman@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Jann Horn <jannh@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
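The change itself is tiny; a hedged sketch of the registration with the tightened mode, assuming the file is created via proc_create_seq() with a seq_operations named pagetypeinfo_op:

  	/* 0400 instead of 0444: only root may open /proc/pagetypeinfo. */
  	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);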
2019-12-29  mm, meminit: recalculate pcpu batch and high limits after init completes  (Mel Gorman)
commit 3e8fc0075e24338b1117cdff6a79477427b8dbed upstream.

Deferred memory initialisation updates zone->managed_pages during the initialisation phase but before that finishes, the per-cpu page allocator (pcpu) calculates the number of pages allocated/freed in batches as well as the maximum number of pages allowed on a per-cpu list. As zone->managed_pages is not up to date yet, the pcpu initialisation calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with the degree of severity depending on how many CPUs share a local zone and the size of the zone. A private report indicated that kernel build times were excessive with extremely high system CPU usage. A perf profile indicated that a large chunk of time was lost on zone->lock contention.

This patch recalculates the pcpu batch and high values after deferred initialisation completes for every populated zone in the system. It was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation workload -- allmodconfig and all available CPUs.

mmtests configuration: config-workload-kernbench-max. Configuration was modified to build on a fresh XFS partition.

  kernbench
                               5.4.0-rc3              5.4.0-rc3
                                 vanilla           resetpcpu-v2
  Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
  Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
  Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
  Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
  Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
  Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)

                       5.4.0-rc3    5.4.0-rc3
                         vanilla resetpcpu-v2
  Duration User       39766.24     49221.79
  Duration System     44298.10     13361.67
  Duration Elapsed      519.11       388.87

The patch reduces system CPU usage by 69.86% and total build time by 26.65%. The variance of system CPU usage is also much reduced.

Before, this was the breakdown of batch and high values over all zones:

  256    batch: 1
  256    batch: 63
  512    batch: 7
  256    high:  0
  256    high:  378
  512    high:  42

512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch:

  256    batch: 1
  768    batch: 63
  256    high:  0
  768    high:  378

[mgorman@techsingularity.net: fix merge/linkage snafu]

Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
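A hedged sketch of the recalculation step described above: once deferred initialisation has finished, walk every populated zone and recompute its per-cpu batch and high limits from the now-final zone size.

  	/* After deferred memmap init completes (e.g. from page_alloc_init_late()): */
  	struct zone *zone;

  	for_each_populated_zone(zone)
  		zone_pcp_update(zone);	/* re-derives pcp->batch and pcp->high */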
2019-12-29  mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges  (Johannes Weiner)
commit 869712fd3de5a90b7ba23ae1272278cddc66b37b upstream.

While upgrading from 4.16 to 5.2, we noticed these allocation errors in the log of the new kernel:

  SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
    node 0: slabs: 5, objs: 170, free: 0
    slab_out_of_memory+1
    ___slab_alloc+969
    __slab_alloc+14
    kmem_cache_alloc+346
    inet_twsk_alloc+60
    tcp_time_wait+46
    tcp_fin+206
    tcp_data_queue+2034
    tcp_rcv_state_process+784
    tcp_v6_do_rcv+405
    __release_sock+118
    tcp_close+385
    inet_release+46
    __sock_release+55
    sock_close+17
    __fput+170
    task_work_run+127
    exit_to_usermode_loop+191
    do_syscall_64+212
    entry_SYSCALL_64_after_hwframe+68

accompanied by an increase in machines going completely radio silent under memory pressure.

One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account sock objects to kmemcg"), which made these slab caches subject to cgroup memory accounting and control.

The problem with that is that cgroups, unlike the page allocator, do not maintain dedicated atomic reserves. As a cgroup's usage hovers at its limit, atomic allocations - such as done during network rx - can fail consistently for extended periods of time. The kernel is not able to operate under these conditions.

We don't want to revert the culprit patch, because it indeed tracks a potentially substantial amount of memory used by a cgroup.

We also don't want to implement dedicated atomic reserves for cgroups. There is no point in keeping a fixed margin of unused bytes in the cgroup's memory budget to accommodate a consumer that is impossible to predict - we'd be wasting memory and getting into configuration headaches, not unlike what we have going with min_free_kbytes. We do this for physical mem because we have to, but cgroups are an accounting game.

Instead, account these privileged allocations to the cgroup, but let them bypass the configured limit if they have to. This way, we get the benefits of accounting the consumed memory and have it exert pressure on the rest of the cgroup, but like with the page allocator, we shift the burden of reclaiming on behalf of atomic allocations onto the regular allocations that can block.

Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [4.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
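A hedged sketch of the charge-path behaviour described above (not necessarily the verbatim upstream hunk): atomic allocations are still charged, but are allowed to overshoot the limit instead of failing.

  	/* In the memcg charge slow path, where reclaim/OOM would otherwise be needed: */
  	if (gfp_mask & __GFP_ATOMIC)
  		goto force;	/* charge over the limit rather than fail */
  	...
  force:
  	/* Overcharge deliberately; regular, blocking allocations reclaim it back. */
  	page_counter_charge(&memcg->memory, nr_pages);
  	if (do_memsw_account())
  		page_counter_charge(&memcg->memsw, nr_pages);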
2019-11-25  Merge tag 'v5.2.24' into v5.2/standard/base  (Bruce Ashfield)
This is the 5.2.24 stable release

# gpg: Signature made Mon 25 Nov 2019 12:09:44 PM EST
# gpg: using RSA key EBCE84042C07D1D6
# gpg: Can't check signature: No public key
2019-11-25  mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once  (Jane Chu)
commit 3d7fed4ad8ccb691d217efbb0f934e6a4df5ef91 upstream.

Mmap /dev/dax more than once, then read the poison location using an address from one of the mappings. The other mappings, due to not having the page mapped in, will cause SIGKILLs to be delivered to the process. SIGKILL succeeds over SIGBUS, so the user process loses the opportunity to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue, MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so isn't always an option.

Details -

  ndctl inject-error --block=10 --count=1 namespace6.0
  ./read_poison -x dax6.0 -o 5120 -m 2
  mmaped address 0x7f5bb6600000
  mmaped address 0x7f3cf3600000
  doing local read at address 0x7f3cf3601400
  Killed

Console messages in instrumented kernel -

  mce: Uncorrected hardware memory error in user-access at edbe201400
  Memory failure: tk->addr = 7f5bb6601000
  Memory failure: address edbe201: call dev_pagemap_mapping_shift
  dev_pagemap_mapping_shift: page edbe201: no PUD
  Memory failure: tk->size_shift == 0
  Memory failure: Unable to find user space address edbe201 in read_poison
  Memory failure: tk->addr = 7f3cf3601000
  Memory failure: address edbe201: call dev_pagemap_mapping_shift
  Memory failure: tk->size_shift = 21
  Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
    => to deliver SIGKILL
  Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
    => to deliver SIGBUS

Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25  hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()  (David Hildenbrand)
commit f231fe4235e22e18d847e05cbe705deaca56580a upstream. Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. Let's make sure that we only consider online memory (managed by the buddy) that has initialized memmaps. ZONE_DEVICE is not applicable. page_zone() will call page_to_nid(), which will trigger VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This can be the case when an offline memory block (e.g., never onlined) is spanned by a zone. Note: As explained by Michal in [1], alloc_contig_range() will verify the range. So it boils down to the wrong access in this function. [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
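A hedged sketch of the check the fix converges on: only touch memmaps of pages that are actually online, via pfn_to_online_page(), instead of a bare pfn_valid()/pfn_to_page() pair.

  	for (i = start_pfn; i < end_pfn; i++) {
  		struct page *page = pfn_to_online_page(i);

  		if (!page)		/* offline or uninitialized memmap: not usable */
  			return false;
  		/* ... existing reserved/huge-page checks on 'page' ... */
  	}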
2019-11-25  mm: memblock: do not enforce current limit for memblock_phys* family  (Mike Rapoport)
commit f3057ad767542be7bbac44e548cb44017178a163 upstream.

Until commit 92d12f9544b7 ("memblock: refactor internal allocation functions") the maximal address for memblock allocations was forced to memblock.current_limit only for the allocation functions returning virtual address. The changes introduced by that commit moved the limit enforcement into the allocation core and as a result the allocation functions returning physical address also started to limit allocations to memblock.current_limit.

This caused breakage of etnaviv GPU driver:

  etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
  etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
  etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
  etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
  etnaviv-gpu 130000.gpu: command buffer outside valid memory window
  etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
  etnaviv-gpu 134000.gpu: command buffer outside valid memory window
  etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
  etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0

Restore the behaviour of memblock_phys* family so that these functions will not enforce memblock.current_limit.

Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org Fixes: 92d12f9544b7 ("memblock: refactor internal allocation functions") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reported-by: Adam Ford <aford173@gmail.com> Tested-by: Adam Ford <aford173@gmail.com> [imx6q-logicpd] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fabio Estevam <festevam@gmail.com> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25mm: memcg: get number of pages on the LRU list in memcgroup base on ↵Honglei Wang
lru_zone_size commit b11edebbc967ebf5c55b8f9e1d5bb6d68ec3a7fd upstream. Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") has made lruvec_page_state to use per-cpu counters instead of calculating it directly from lru_zone_size with an idea that this would be more effective. Tim has reported that this is not really the case for their database benchmark which is showing an opposite results where lruvec_page_state is taking up a huge chunk of CPU cycles (about 25% of the system time which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload is running on a larger machine (96cpus), it has many cgroups (500) and it is heavily direct reclaim bound. Tim Chen said: : The problem can also be reproduced by running simple multi-threaded : pmbench benchmark with a fast Optane SSD swap (see profile below). : : : 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size : | : |--3.07%--lruvec_lru_size : | | : | |--2.11%--cpumask_next : | | | : | | --1.66%--find_next_bit : | | : | --0.57%--call_function_interrupt : | | : | --0.55%--smp_call_function_interrupt : | : |--1.59%--0x441f0fc3d009 : | _ops_rdtsc_init_base_freq : | access_histogram : | page_fault : | __do_page_fault : | handle_mm_fault : | __handle_mm_fault : | | : | --1.54%--do_swap_page : | swapin_readahead : | swap_cluster_readahead : | | : | --1.53%--read_swap_cache_async : | __read_swap_cache_async : | alloc_pages_vma : | __alloc_pages_nodemask : | __alloc_pages_slowpath : | try_to_free_pages : | do_try_to_free_pages : | shrink_node : | shrink_node_memcg : | | : | |--0.77%--lruvec_lru_size : | | : | --0.76%--inactive_list_is_low : | | : | --0.76%--lruvec_lru_size : | : --1.50%--measure_read : page_fault : __do_page_fault : handle_mm_fault : __handle_mm_fault : do_swap_page : swapin_readahead : swap_cluster_readahead : | : --1.48%--read_swap_cache_async : __read_swap_cache_async : alloc_pages_vma : __alloc_pages_nodemask : __alloc_pages_slowpath : try_to_free_pages : do_try_to_free_pages : shrink_node : shrink_node_memcg : | : |--0.75%--inactive_list_is_low : | | : | --0.75%--lruvec_lru_size : | : --0.73%--lruvec_lru_size The likely culprit is the cache traffic the lruvec_page_state_local generates. Dave Hansen says: : I was thinking purely of the cache footprint. If it's reading : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192 : bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would : be 18k of data for the whole system and the caching would be pretty : efficient and all 18k would probably survive a tight page fault loop in : the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't : fit in the L1 and probably wouldn't survive a tight page fault loop if : both logical threads were banging on different cgroups. : : It's just a theory, but it's why I noted the number of cgroups when I : initially saw this show up in profiles Fix the regression by partially reverting the said commit and calculate the lru size explicitly. 
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Reported-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: <stable@vger.kernel.org> [5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
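A simplified sketch of the approach described above for a memcg lruvec (helper use assumed from mainline, function name made up; not the exact patch): sum the per-zone LRU sizes that are already maintained at charge/uncharge time instead of folding the per-cpu vmstat counters on every call.

    static unsigned long lruvec_lru_size_sketch(struct lruvec *lruvec,
                                                enum lru_list lru, int zone_idx)
    {
        unsigned long size = 0;
        int zid;

        /* lru_zone_size[] is updated when pages move on/off the LRU, so no
         * per-cpu counter walk (and none of its cache traffic) is needed */
        for (zid = 0; zid <= zone_idx; zid++)
            size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
        return size;
    }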
2019-11-25mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()Vlastimil Babka
commit a2e9a5afce080226edbf1882d63d99bf32070e9e upstream. Florian and Dave reported [1] a NULL pointer dereference in __reset_isolation_pfn(). While the exact cause is unclear, staring at the code revealed two bugs, which might be related. One bug is that if a zone starts in the middle of a pageblock, block_page might correspond to a different pfn than block_pfn, and then the pfn_valid_within() checks will check different pfns than those accessed via struct page. This might result in accessing an uninitialized page in CONFIG_HOLES_IN_ZONE configs. The other bug is that end_page refers to the first page of the next pageblock and not the last page of the current pageblock. The online and valid check is then wrong and with sections, the while (page < end_page) loop might wander off actual struct page arrays. [1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/ Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz Fixes: 6b0868c820ff ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Florian Weimer <fw@deneb.enyo.de> Reported-by: Dave Chinner <david@fromorbit.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
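A sketch of the two corrections described above (illustrative fragment, not the full function): the struct page is derived from the pageblock-aligned pfn so the two stay in sync, and the end is the last pfn of the current pageblock rather than the first pfn of the next one.

    /* start of the pageblock, clamped to the zone boundary */
    block_pfn = max(pageblock_start_pfn(pfn), zone->zone_start_pfn);
    block_page = pfn_to_online_page(block_pfn);
    if (block_page) {
        page = block_page;
        pfn = block_pfn;            /* keep pfn and page consistent */
    }

    /* last pfn of this pageblock, not the first pfn of the next one */
    block_pfn = min(pageblock_end_pfn(pfn) - 1, zone_end_pfn(zone) - 1);
    end_page = pfn_to_online_page(block_pfn);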
2019-11-25mm/page_owner: don't access uninitialized memmaps when reading ↵Qian Cai
/proc/pagetypeinfo commit a26ee565b6cd8dc2bf15ff6aa70bbb28f928b773 upstream. Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. For example, when not onlining a memory block that is spanned by a zone and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and CONFIG_PAGE_POISONING, we can trigger a kernel BUG: :/# echo 1 > /sys/devices/system/memory/memory40/online :/# echo 1 > /sys/devices/system/memory/memory42/online :/# cat /proc/pagetypeinfo > test.file page:fffff2c585200000 is uninitialized and poisoned raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) There is not page extension available. ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1107! invalid opcode: 0000 [#1] SMP NOPTI Please note that this change does not affect ZONE_DEVICE, because pagetypeinfo_showmixedcount_print() is called from mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and ZONE_DEVICE is never populated (zone->present_pages always 0). [david@redhat.com: move check to outer loop, add comment, rephrase description] Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
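A sketch of the outer-loop check described above (illustrative, not the exact hunk): walk the zone pageblock by pageblock and skip any block whose first pfn is not backed by an online, initialized memmap.

    for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone);
         pfn += pageblock_nr_pages) {
        page = pfn_to_online_page(pfn);
        if (!page)
            continue;               /* offline section: memmap may be garbage */

        /* ... count the pageblock's mixed pages as before ... */
    }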
2019-11-25mm/slub: fix a deadlock in show_slab_objects()Qian Cai
commit e4f8e513c3d353c134ad4eef9fd0bba12406c7c8 upstream. A long time ago we fixed a similar deadlock in show_slab_objects() [1]. However, it is apparently due to the commits like 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path") and 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by just reading files in /sys/kernel/slab which will generate a lockdep splat below. Since the "mem_hotplug_lock" here is only to obtain a stable online node mask while racing with NUMA node hotplug, in the worst case, the results may me miscalculated while doing NUMA node hotplug, but they shall be corrected by later reads of the same files. WARNING: possible circular locking dependency detected ------------------------------------------------------ cat/5224 is trying to acquire lock: ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at: show_slab_objects+0x94/0x3a8 but task is already holding lock: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (kn->count#45){++++}: lock_acquire+0x31c/0x360 __kernfs_remove+0x290/0x490 kernfs_remove+0x30/0x44 sysfs_remove_dir+0x70/0x88 kobject_del+0x50/0xb0 sysfs_slab_unlink+0x2c/0x38 shutdown_cache+0xa0/0xf0 kmemcg_cache_shutdown_fn+0x1c/0x34 kmemcg_workfn+0x44/0x64 process_one_work+0x4f4/0x950 worker_thread+0x390/0x4bc kthread+0x1cc/0x1e8 ret_from_fork+0x10/0x18 -> #1 (slab_mutex){+.+.}: lock_acquire+0x31c/0x360 __mutex_lock_common+0x16c/0xf78 mutex_lock_nested+0x40/0x50 memcg_create_kmem_cache+0x38/0x16c memcg_kmem_cache_create_func+0x3c/0x70 process_one_work+0x4f4/0x950 worker_thread+0x390/0x4bc kthread+0x1cc/0x1e8 ret_from_fork+0x10/0x18 -> #0 (mem_hotplug_lock.rw_sem){++++}: validate_chain+0xd10/0x2bcc __lock_acquire+0x7f4/0xb8c lock_acquire+0x31c/0x360 get_online_mems+0x54/0x150 show_slab_objects+0x94/0x3a8 total_objects_show+0x28/0x34 slab_attr_show+0x38/0x54 sysfs_kf_seq_show+0x198/0x2d4 kernfs_seq_show+0xa4/0xcc seq_read+0x30c/0x8a8 kernfs_fop_read+0xa8/0x314 __vfs_read+0x88/0x20c vfs_read+0xd8/0x10c ksys_read+0xb0/0x120 __arm64_sys_read+0x54/0x88 el0_svc_handler+0x170/0x240 el0_svc+0x8/0xc other info that might help us debug this: Chain exists of: mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(kn->count#45); lock(slab_mutex); lock(kn->count#45); lock(mem_hotplug_lock.rw_sem); *** DEADLOCK *** 3 locks held by cat/5224: #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8 #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0 #2: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0 stack backtrace: Call trace: dump_backtrace+0x0/0x248 show_stack+0x20/0x2c dump_stack+0xd0/0x140 print_circular_bug+0x368/0x380 check_noncircular+0x248/0x250 validate_chain+0xd10/0x2bcc __lock_acquire+0x7f4/0xb8c lock_acquire+0x31c/0x360 get_online_mems+0x54/0x150 show_slab_objects+0x94/0x3a8 total_objects_show+0x28/0x34 slab_attr_show+0x38/0x54 sysfs_kf_seq_show+0x198/0x2d4 kernfs_seq_show+0xa4/0xcc seq_read+0x30c/0x8a8 kernfs_fop_read+0xa8/0x314 __vfs_read+0x88/0x20c vfs_read+0xd8/0x10c ksys_read+0xb0/0x120 __arm64_sys_read+0x54/0x88 el0_svc_handler+0x170/0x240 el0_svc+0x8/0xc I think it is important to mention that this doesn't expose the show_slab_objects to use-after-free. 
There is only a single path that might really race here and that is the slab hotplug notifier callback __kmem_cache_shrink (via slab_mem_going_offline_callback) but that path doesn't really destroy kmem_cache_node data structures. [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock] Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw Fixes: 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path") Fixes: 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
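A sketch of the idea (illustrative only; the counter field is an assumption based on SLUB's debug bookkeeping): iterate the online nodes without taking mem_hotplug_lock and accept that a racing node hotplug can skew one read.

    /*
     * No get_online_mems()/put_online_mems() around this loop: the worst
     * case during a racing node hot-add/remove is a briefly stale total,
     * which the next read of the sysfs file recomputes.
     */
    for_each_online_node(node) {
        struct kmem_cache_node *n = get_node(s, node);

        if (n)
            total += atomic_long_read(&n->nr_slabs);
    }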
2019-11-25mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()David Hildenbrand
commit 96c804a6ae8c59a9092b3d5dd581198472063184 upstream. We should check for pfn_to_online_page() to not access uninitialized memmaps. Reshuffle the code so we don't have to duplicate the error message. Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
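A sketch of the reshuffled check described above (the error message wording is an assumption): resolve the pfn through pfn_to_online_page() before touching the struct page, and bail out once if that fails.

    p = pfn_to_online_page(pfn);
    if (!p) {
        /* no initialized memmap for this pfn (offline or ZONE_DEVICE range) */
        pr_err("Memory failure: %#lx: memory outside kernel control\n", pfn);
        return -ENXIO;
    }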
2019-11-10Merge tag 'v5.2.22' into v5.2/standard/baseBruce Ashfield
This is the 5.2.22 stable release # gpg: Signature made Sat 09 Nov 2019 08:56:23 PM EST # gpg: using RSA key EBCE84042C07D1D6 # gpg: Can't check signature: No public key
2019-11-09mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()Dan Carpenter
commit 518a86713078168acd67cf50bc0b45d54b4cce6c upstream. The "mode" and "level" variables are enums and in this context GCC will treat them as unsigned ints so the error handling is never triggered. I also removed the bogus initializer because it isn't required any more and it's sort of confusing. [akpm@linux-foundation.org: reduce implicit and explicit typecasting] [akpm@linux-foundation.org: fix return value, add comment, per Matthew] Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda Fixes: 3cadfa2b9497 ("mm/vmpressure.c: convert to use match_string() helper") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Enrico Weigelt <info@metux.net> Cc: Kate Stewart <kstewart@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
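A sketch of the corrected pattern (simplified; array and enum names follow mm/vmpressure.c but this is not the exact hunk): keep match_string()'s int return value in an int so a negative error stays visible, and only then assign it to the enum-typed variable.

    int ret;

    /* match_string() returns the index on success or -EINVAL on failure;
     * storing it straight into the enum makes the "< 0" test always false */
    ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
    if (ret < 0)
        goto out;
    mode = ret;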
2019-11-09mm/page_alloc.c: fix a crash in free_pages_prepare()Qian Cai
commit 234fdce892f905cbc2674349a9eb4873e288e5b3 upstream. On architectures like s390, arch_free_page() could mark the page unused (set_page_unused()) and any access later would trigger a kernel panic. Fix it by moving arch_free_page() after all possible accessing calls. Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f 000003d080012100 000003d080013fc0 0000000000000000 0000000000100000 00000000275cca48 0000000000000100 0000000000000008 000003d080010000 00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0 Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6 0000000026c2b962: e32014000036 pfd 2,1024(%r1) #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1) >0000000026c2b96e: 41101100 la %r1,256(%r1) 0000000026c2b972: a737fff8 brctg %r3,26c2b962 0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1) 0000000026c2b97c: e31003400004 lg %r1,832 0000000026c2b982: ebff1430016a asi 5168(%r1),-1 Call Trace: __free_pages_ok+0x16a/0x5d8) memblock_free_all+0x206/0x290 mem_init+0x58/0x120 start_kernel+0x2b0/0x570 startup_continue+0x6a/0xc0 INFO: lockdep is turned off. Last Breaking-Event-Address: __free_pages_ok+0x372/0x5d8 Kernel panic - not syncing: Fatal exception: panic_on_oops 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C In the past, only kernel_poison_pages() would trigger this but it needs "page_poison=on" kernel cmdline, and I suspect nobody tested that on s390. Recently, kernel_init_free_pages() (commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")) was added and could trigger this as well. [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option") Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: <stable@vger.kernel.org> [5.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
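A sketch of the ordering fix described above (illustrative fragment of the free path, not the full free_pages_prepare()):

    /* touch the page contents first, while the mapping is guaranteed valid */
    kernel_poison_pages(page, 1 << order, 0);
    kernel_init_free_pages(page, 1 << order);

    /* only now let the architecture mark the page unused; on s390 this can
     * make any later access to the page fault */
    arch_free_page(page, order);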
2019-11-09mm/z3fold.c: claim page in the beginning of freeVitaly Wool
commit 5b6807de11445c05b537df8324f5d7ab1c2782f9 upstream. There's a hard-to-reproduce race in z3fold between z3fold_free() and z3fold_reclaim_page(). z3fold_reclaim_page() can claim the page after z3fold_free() has checked whether the page was claimed, and z3fold_free() will then schedule this page for compaction, which may in turn lead to random page faults (since that page would have been reclaimed by then). Fix that by claiming the page at the beginning of z3fold_free() and clearing the claim at the end. [vitalywool@gmail.com: v2] Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Markus Linnala <markus.linnala@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
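A sketch of the claim/unclaim described above (flag usage assumed from mm/z3fold.c; heavily simplified):

    /* entry of z3fold_free(), sketch: claim the page before anything else so
     * a concurrent z3fold_reclaim_page() cannot take it away underneath us */
    bool page_claimed = test_and_set_bit(PAGE_CLAIMED, &page->private);

    /* ... free/compaction decisions run only if the claim was ours ... */

    /* on the paths where the page stays in the pool, drop the claim again */
    clear_bit(PAGE_CLAIMED, &page->private);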
2019-11-09usercopy: Avoid HIGHMEM pfn warningKees Cook
commit 314eed30ede02fa925990f535652254b5bad6b65 upstream. When running on a system with >512MB RAM with a 32-bit kernel built with: CONFIG_DEBUG_VIRTUAL=y CONFIG_HIGHMEM=y CONFIG_HARDENED_USERCOPY=y all execve()s will fail due to argv copying into kmap()ed pages, and on usercopy checking the calls ultimately of virt_to_page() will be looking for "bad" kmap (highmem) pointers due to CONFIG_DEBUG_VIRTUAL=y: ------------[ cut here ]------------ kernel BUG at ../arch/x86/mm/physaddr.c:83! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc8 #6 Hardware name: Dell Inc. Inspiron 1318/0C236D, BIOS A04 01/15/2009 EIP: __phys_addr+0xaf/0x100 ... Call Trace: __check_object_size+0xaf/0x3c0 ? __might_sleep+0x80/0xa0 copy_strings+0x1c2/0x370 copy_strings_kernel+0x2b/0x40 __do_execve_file+0x4ca/0x810 ? kmem_cache_alloc+0x1c7/0x370 do_execve+0x1b/0x20 ... The check is from arch/x86/mm/physaddr.c: VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn); Due to the kmap() in fs/exec.c: kaddr = kmap(kmapped_page); ... if (copy_from_user(kaddr+offset, str, bytes_to_copy)) ... Now we can fetch the correct page to avoid the pfn check. In both cases, hardened usercopy will need to walk the page-span checker (if enabled) to do sanity checking. Reported-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Fixes: f5509cc18daa ("mm: Hardened usercopy") Cc: Matthew Wilcox <willy@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/201909171056.7F2FFD17@keescook Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
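A sketch of the highmem-aware page lookup described above (close to what the changelog describes, but simplified):

    /* kmap_to_page() resolves a kmap()ed highmem address to its real page and
     * falls back to virt_to_page() for ordinary lowmem addresses, so the
     * CONFIG_DEBUG_VIRTUAL pfn check is never fed a highmem pointer */
    page = compound_head(kmap_to_page((void *)ptr));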
2019-10-10Merge tag 'v5.2.19' into v5.2/standard/baseBruce Ashfield
This is the 5.2.19 stable release # gpg: Signature made Sat 05 Oct 2019 07:14:17 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-10Merge tag 'v5.2.18' into v5.2/standard/baseBruce Ashfield
This is the 5.2.18 stable release # gpg: Signature made Tue 01 Oct 2019 03:01:46 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2019-10-05mm: Handle MADV_WILLNEED through vfs_fadvise()Jan Kara
commit 692fe62433d4ca47605b39f7c416efd6679ba694 upstream. Currently handling of MADV_WILLNEED hint calls directly into readahead code. Handle it by calling vfs_fadvise() instead so that filesystem can use its ->fadvise() callback to acquire necessary locks or otherwise prepare for the request. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Boaz Harrosh <boazh@netapp.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
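A sketch of the dispatch described above (function name made up, mmap_sem handling omitted; illustrative only): the MADV_WILLNEED range is converted to a byte range in the file and handed to vfs_fadvise(), so the filesystem's ->fadvise() hook runs.

    static long madvise_willneed_sketch(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end)
    {
        struct file *file = vma->vm_file;
        loff_t offset;

        if (!file)
            return -EBADF;          /* anonymous ranges are handled elsewhere */

        /* convert the VMA range into a byte range within the file */
        offset = (loff_t)(start - vma->vm_start)
                        + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

        /* let the filesystem prepare/lock before readahead is started */
        return vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
    }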
2019-10-05fs: Export generic_fadvise()Jan Kara
commit cf1ea0592dbf109e7e7935b7d5b1a47a1ba04174 upstream. Filesystems will need to call this function from their fadvise handlers. CC: stable@vger.kernel.org Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
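A sketch of how a filesystem might use the newly exported helper from its ->fadvise() handler (the filesystem name and its locking are assumptions):

    static int examplefs_fadvise(struct file *file, loff_t start, loff_t len,
                                 int advice)
    {
        /* take whatever filesystem locks the readahead path needs here */
        return generic_fadvise(file, start, len, advice);
    }

    const struct file_operations examplefs_file_ops = {
        /* ... other operations ... */
        .fadvise = examplefs_fadvise,
    };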
2019-10-05memcg, kmem: do not fail __GFP_NOFAIL chargesMichal Hocko
commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8 upstream. Thomas has noticed the following NULL ptr dereference when using cgroup v1 kmem limit: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42 Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015 RIP: 0010:create_empty_buffers+0x24/0x100 Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149 RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00 RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0 R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000 R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000 FS: 00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0 Call Trace: create_page_buffers+0x4d/0x60 __block_write_begin_int+0x8e/0x5a0 ? ext4_inode_attach_jinode.part.82+0xb0/0xb0 ? jbd2__journal_start+0xd7/0x1f0 ext4_da_write_begin+0x112/0x3d0 generic_perform_write+0xf1/0x1b0 ? file_update_time+0x70/0x140 __generic_file_write_iter+0x141/0x1a0 ext4_file_write_iter+0xef/0x3b0 __vfs_write+0x17e/0x1e0 vfs_write+0xa5/0x1a0 ksys_write+0x57/0xd0 do_syscall_64+0x55/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg fails __GFP_NOFAIL charge when the kmem limit is reached. This is a wrong behavior because nofail allocations are not allowed to fail. Normal charge path simply forces the charge even if that means to cross the limit. Kmem accounting should be doing the same. Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Thomas Lindroth <thomas.lindroth@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
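A sketch of the behaviour described above (simplified from the kmem charge path, not the exact hunk): a __GFP_NOFAIL charge is forced past the kmem limit instead of returning an error.

    if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
        if (gfp & __GFP_NOFAIL) {
            /* nofail allocations must not fail: charge over the limit */
            page_counter_charge(&memcg->kmem, nr_pages);
            return 0;
        }
        cancel_charge(memcg, nr_pages);
        return -ENOMEM;
    }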
2019-10-05memcg, oom: don't require __GFP_FS when invoking memcg OOM killerTetsuo Handa
commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream. Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move out_of_memory back to the charge path") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to out_of_memory"). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? 
mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? 
mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
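A sketch of the resulting check in out_of_memory() (simplified; the memcg test is written with a helper like is_memcg_oom()): the !__GFP_FS bail-out is kept for global OOM but no longer applies to memcg OOM, which would otherwise retry forever.

    /*
     * The OOM killer does not compensate for IO-less reclaim, so skipping it
     * for !__GFP_FS stays in place for global OOM -- but a memcg OOM would
     * livelock without it, so memcg OOM is allowed to proceed.
     */
    if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
        return true;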
2019-10-05mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zoneYafang Shao
commit a94b525241c0fff3598809131d7cfcfe1d572d8c upstream. total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and COMPACTFREE_SCANNED in compact_zone(). We should clear them before scanning a new zone. In proc-triggered compaction, we forgot to clear them. [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()] Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko] [vbabka@suse.cz: squash compact_zone() list_head init as well] Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone] Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com Fixes: 7f354a548d1c ("mm, compaction: add vmstats for kcompactd work") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Yafang Shao <shaoyafang@didiglobal.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
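A sketch of the initialization described above at the top of compact_zone() (field names from struct compact_control; simplified):

    /* these feed the COMPACTMIGRATE_SCANNED / COMPACTFREE_SCANNED vmstat
     * counters, so they must start from zero for every zone compacted */
    cc->total_migrate_scanned = 0;
    cc->total_free_scanned = 0;
    cc->nr_migratepages = 0;
    cc->nr_freepages = 0;
    INIT_LIST_HEAD(&cc->freepages);
    INIT_LIST_HEAD(&cc->migratepages);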
2019-10-05z3fold: fix memory leak in kmem cacheVitaly Wool
commit 63398413c00c7836ea87a1fa205c91d2199b25cf upstream. Currently there is a leak in init_z3fold_page() -- it allocates handles from kmem cache even for headless pages, but then they are never used and never freed, so eventually kmem cache may get exhausted. This patch provides a fix for that. Link: http://lkml.kernel.org/r/20190917185352.44cf285d3ebd9e64548de5de@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Tested-by: Markus Linnala <markus.linnala@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
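A sketch of the leak fix described above (simplified; the helper name is assumed from mm/z3fold.c): allocate the handle slots only for pages that will actually use them.

    /* init_z3fold_page(), sketch: headless pages never use handle slots, so
     * don't allocate them (previously they were allocated and then leaked) */
    struct z3fold_buddy_slots *slots = headless ? NULL : alloc_slots(pool, gfp);

    if (!headless && !slots)
        return NULL;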