summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2016-10-22vfs,mm: fix a dead loop in truncate_inode_pages_range()Wei Fang
commit c2a9737f45e27d8263ff9643f994bda9bac0b944 upstream. We triggered a deadloop in truncate_inode_pages_range() on 32 bits architecture with the test case bellow: ... fd = open(); write(fd, buf, 4096); preadv64(fd, &iovec, 1, 0xffffffff000); ftruncate(fd, 0); ... Then ftruncate() will not return forever. The filesystem used in this case is ubifs, but it can be triggered on many other filesystems. When preadv64() is called with offset=0xffffffff000, a page with index=0xffffffff will be added to the radix tree of ->mapping. Then this page can be found in ->mapping with pagevec_lookup(). After that, truncate_inode_pages_range(), which is called in ftruncate(), will fall into an infinite loop: - find a page with index=0xffffffff, since index>=end, this page won't be truncated - index++, and index become 0 - the page with index=0xffffffff will be found again The data type of index is unsigned long, so index won't overflow to 0 on 64 bits architecture in this case, and the dead loop won't happen. Since truncate_inode_pages_range() is executed with holding lock of inode->i_rwsem, any operation related with this lock will be blocked, and a hung task will happen, e.g.: INFO: task truncate_test:3364 blocked for more than 120 seconds. ... call_rwsem_down_write_failed+0x17/0x30 generic_file_write_iter+0x32/0x1c0 ubifs_write_iter+0xcc/0x170 __vfs_write+0xc4/0x120 vfs_write+0xb2/0x1b0 SyS_write+0x46/0xa0 The page with index=0xffffffff added to ->mapping is useless. Fix this by checking the read position before allocating pages. Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com Signed-off-by: Wei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-20mm: remove gup_flags FOLL_WRITE games from __get_user_pages()Linus Torvalds
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream. This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07mm, kasan: account for object redzone in SLUB's nearest_obj()Alexander Potapenko
commit c146a2b98eb5898eb0fab15a332257a4102ecae9 upstream. When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Previously, when KASAN had detected an error on an object from a cache with SLAB_RED_ZONE set, the actual start address of the object was miscalculated, which led to random stacks having been reported. When looking up the nearest SLUB object for a given address, correctly calculate its offset if SLAB_RED_ZONE is enabled for that cache. Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support") Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-07mm,ksm: fix endless looping in allocating memory when ksm enablezhong jiang
commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream. I hit the following hung task when runing a OOM LTP test case with 4.1 kernel. Call trace: [<ffffffc000086a88>] __switch_to+0x74/0x8c [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc [<ffffffc000a1c09c>] schedule+0x3c/0x94 [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350 [<ffffffc000a1e32c>] down_write+0x64/0x80 [<ffffffc00021f794>] __ksm_exit+0x90/0x19c [<ffffffc0000be650>] mmput+0x118/0x11c [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74 [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4 [<ffffffc0000d0f34>] get_signal+0x444/0x5e0 [<ffffffc000089fcc>] do_signal+0x1d8/0x450 [<ffffffc00008a35c>] do_notify_resume+0x70/0x78 The oom victim cannot terminate because it needs to take mmap_sem for write while the lock is held by ksmd for read which loops in the page allocator ksm_do_scan scan_get_next_rmap_item down_read get_next_rmap_item alloc_rmap_item #ksmd will loop permanently. There is no way forward because the oom victim cannot release any memory in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve this problem because it would release the memory asynchronously. Nevertheless we can relax alloc_rmap_item requirements and use __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan would just retry later after the lock got dropped. Such a patch would be also easy to backport to older stable kernels which do not have oom_reaper. While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed by the allocation failure. Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Hugh Dickins <hughd@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-30mm/kasan: don't reduce quarantine in atomic contextsAndrey Ryabinin
commit 4b3ec5a3f4b1d5c9d64b9ab704042400d050d432 upstream. Currently we call quarantine_reduce() for ___GFP_KSWAPD_RECLAIM (implied by __GFP_RECLAIM) allocation. So, basically we call it on almost every allocation. quarantine_reduce() sometimes is heavy operation, and calling it with disabled interrupts may trigger hard LOCKUP: NMI watchdog: Watchdog detected hard LOCKUP on cpu 2irq event stamp: 1411258 Call Trace: <NMI> dump_stack+0x68/0x96 watchdog_overflow_callback+0x15b/0x190 __perf_event_overflow+0x1b1/0x540 perf_event_overflow+0x14/0x20 intel_pmu_handle_irq+0x36a/0xad0 perf_event_nmi_handler+0x2c/0x50 nmi_handle+0x128/0x480 default_do_nmi+0xb2/0x210 do_nmi+0x1aa/0x220 end_repeat_nmi+0x1a/0x1e <<EOE>> __kernel_text_address+0x86/0xb0 print_context_stack+0x7b/0x100 dump_trace+0x12b/0x350 save_stack_trace+0x2b/0x50 set_track+0x83/0x140 free_debug_processing+0x1aa/0x420 __slab_free+0x1d6/0x2e0 ___cache_free+0xb6/0xd0 qlist_free_all+0x83/0x100 quarantine_reduce+0x177/0x1b0 kasan_kmalloc+0xf3/0x100 Reduce the quarantine_reduce iff direct reclaim is allowed. Fixes: 55834c59098d("mm: kasan: initial memory quarantine implementation") Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Dave Jones <davej@codemonkey.org.uk> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-30kasan: avoid overflowing quarantine size on low memory systemsAlexander Potapenko
commit c3cee372282cb6bcdf19ac1457581d5dd5ecb554 upstream. If the total amount of memory assigned to quarantine is less than the amount of memory assigned to per-cpu quarantines, |new_quarantine_size| may overflow. Instead, set it to zero. [akpm@linux-foundation.org: cleanup: use WARN_ONCE return value] Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-30mm: delete unnecessary and unsafe init_tlb_ubc()Hugh Dickins
commit b385d21f27d86426472f6ae92a231095f7de2a8d upstream. init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically initialized with zeroes in the init_task, and copied from parent to child while it is quiescent in arch_dup_task_struct(); so I went to delete it. But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to check that it was always empty at that point, and found them firing: because memcg reclaim can recurse into global reclaim (when allocating biosets for swapout in my case), and arrive back at the init_tlb_ubc() in shrink_node_memcg(). Resetting tlb_ubc.flush_required at that point is wrong: if the upper level needs a deferred TLB flush, but the lower level turns out not to, we miss a TLB flush. But fortunately, that's the only part of the protocol that does not nest: with the initialization removed, cpumask collects bits from upper and lower levels, and flushes TLB when needed. Fixes: 72b252aed506 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-30mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accountingJohannes Weiner
commit db2ba40c277dc545bab531671c3f45ac0afea6f8 upstream. During cgroup2 rollout into production, we started encountering css refcount underflows and css access crashes in the memory controller. Splitting the heavily shared css reference counter into logical users narrowed the imbalance down to the cgroup2 socket memory accounting. The problem turns out to be the per-cpu charge cache. Cgroup1 had a separate socket counter, but the new cgroup2 socket accounting goes through the common charge path that uses a shared per-cpu cache for all memory that is being tracked. Those caches are safe against scheduling preemption, but not against interrupts - such as the newly added packet receive path. When cache draining is interrupted by network RX taking pages out of the cache, the resuming drain operation will put references of in-use pages, thus causing the imbalance. Disable IRQs during all per-cpu charge cache operations. Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller") Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-30mm: fix the page_swap_info() BUG_ON checkSantosh Shilimkar
commit c8de641b1e9c5489aa6ca57b7836acd68e7563f1 upstream. Commit 62c230bc1790 ("mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages") replaced the swap_aops dirty hook from __set_page_dirty_no_writeback() with swap_set_page_dirty(). For normal cases without these special SWP flags code path falls back to __set_page_dirty_no_writeback() so the behaviour is expected to be the same as before. But swap_set_page_dirty() makes use of the page_swap_info() helper to get the swap_info_struct to check for the flags like SWP_FILE, SWP_BLKDEV etc as desired for those features. This helper has BUG_ON(!PageSwapCache(page)) which is racy and safe only for the set_page_dirty_lock() path. For the set_page_dirty() path which is often needed for cases to be called from irq context, kswapd() can toggle the flag behind the back while the call is getting executed when system is low on memory and heavy swapping is ongoing. This ends up with undesired kernel panic. This patch just moves the check outside the helper to its users appropriately to fix kernel panic for the described path. Couple of users of helpers already take care of SwapCache condition so I skipped them. Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Joe Perches <joe@perches.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jens Axboe <axboe@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24mm, mempolicy: task->mempolicy must be NULL before dropping final referenceDavid Rientjes
commit c11600e4fed67ae4cd6a8096936afd445410e8ed upstream. KASAN allocates memory from the page allocator as part of kmem_cache_free(), and that can reference current->mempolicy through any number of allocation functions. It needs to be NULL'd out before the final reference is dropped to prevent a use-after-free bug: BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140 ... Call Trace: dump_stack kasan_object_err kasan_report_error __asan_report_load2_noabort alloc_pages_current <-- use after free depot_save_stack save_stack kasan_slab_free kmem_cache_free __mpol_put <-- free do_exit This patch sets current->mempolicy to NULL before dropping the final reference. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB") Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24mm, oom: prevent premature OOM killer invocation for high order requestMichal Hocko
commit 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f upstream. There have been several reports about pre-mature OOM killer invocation in 4.7 kernel when order-2 allocation request (for the kernel stack) invoked OOM killer even during basic workloads (light IO or even kernel compile on some filesystems). In all reported cases the memory is fragmented and there are no order-2+ pages available. There is usually a large amount of slab memory (usually dentries/inodes) and further debugging has shown that there are way too many unmovable blocks which are skipped during the compaction. Multiple reporters have confirmed that the current linux-next which includes [1] and [2] helped and OOMs are not reproducible anymore. A simpler fix for the late rc and stable is to simply ignore the compaction feedback and retry as long as there is a reclaim progress and we are not getting OOM for order-0 pages. We already do that for CONFING_COMPACTION=n so let's reuse the same code when compaction is enabled as well. [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection") Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Tested-by: Olaf Hering <olaf@aepfle.de> Tested-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com> Cc: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07mm: silently skip readahead for DAX inodesRoss Zwisler
commit 11bd969fdefea3ac0cb9791224f1e09784e21e58 upstream. For DAX inodes we need to be careful to never have page cache pages in the mapping->page_tree. This radix tree should be composed only of DAX exceptional entries and zero pages. ltp's readahead02 test was triggering a warning because we were trying to insert a DAX exceptional entry but found that a page cache page had already been inserted into the tree. This page was being inserted into the radix tree in response to a readahead(2) call. Readahead doesn't make sense for DAX inodes, but we don't want it to report a failure either. Instead, we just return success and don't do any work. Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07soft_dirty: fix soft_dirty during THP splitAndrea Arcangeli
commit 804dd150468cfd920d92d4b3cf00536fedef3902 upstream. While adding proper userfaultfd_wp support with bits in pagetable and swap entry to avoid false positives WP userfaults through swap/fork/ KSM/etc, I've been adding a framework that mostly mirrors soft dirty. So I noticed in one place I had to add uffd_wp support to the pagetables that wasn't covered by soft_dirty and I think it should have. Example: in the THP migration code migrate_misplaced_transhuge_page() pmd_mkdirty is called unconditionally after mk_huge_pmd. entry = mk_huge_pmd(new_page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); That sets soft dirty too (it's a false positive for soft dirty, the soft dirty bit could be more finegrained and transfer the bit like uffd_wp will do.. pmd/pte_uffd_wp() enforces the invariant that when it's set pmd/pte_write is not set). However in the THP split there's no unconditional pmd_mkdirty after mk_huge_pmd and pte_swp_mksoft_dirty isn't called after the migration entry is created. The code sets the dirty bit in the struct page instead of setting it in the pagetable (which is fully equivalent as far as the real dirty bit is concerned, as the whole point of pagetable bits is to be eventually flushed out of to the page, but that is not equivalent for the soft-dirty bit that gets lost in translation). This was found by code review only and totally untested as I'm working to actually replace soft dirty and I don't have time to test potential soft dirty bugfixes as well :). Transfer the soft_dirty from pmd to pte during THP splits. This fix avoids losing the soft_dirty bit and avoids userland memory corruption in the checkpoint. Fixes: eef1b3ba053aa6 ("thp: implement split_huge_pmd()") Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@virtuozzo.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-07mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lockChris Wilson
commit 6039892396d845b18228935561960441900cffca upstream. With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a kmem_cache_node is destroyed the call_rcu() may trigger a slab allocation to fill the debug object pool (__debug_object_init:fill_pool). Everywhere but during kmem_cache_destroy(), discard_slab() is performed outside of the kmem_cache_node->list_lock and avoids a lockdep warning about potential recursion: ============================================= [ INFO: possible recursive locking detected ] 4.8.0-rc1-gfxbench+ #1 Tainted: G U --------------------------------------------- rmmod/8895 is trying to acquire lock: (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430 but task is already holding lock: (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&n->list_lock)->rlock); lock(&(&n->list_lock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 5 locks held by rmmod/8895: #0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0 #1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0 #2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80 #3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220 #4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320 stack backtrace: CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1 Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015 Call Trace: __lock_acquire+0x1646/0x1ad0 lock_acquire+0xb2/0x200 _raw_spin_lock+0x36/0x50 get_partial_node.isra.63+0x47/0x430 ___slab_alloc.constprop.67+0x1a7/0x3b0 __slab_alloc.isra.64.constprop.66+0x43/0x80 kmem_cache_alloc+0x236/0x2d0 __debug_object_init+0x2de/0x400 debug_object_activate+0x109/0x1e0 __call_rcu.constprop.63+0x32/0x2f0 call_rcu+0x12/0x20 discard_slab+0x3d/0x40 __kmem_cache_shutdown+0xdb/0x320 shutdown_cache+0x19/0x60 kmem_cache_destroy+0x1ae/0x220 i915_gem_load_cleanup+0x14/0x40 [i915] i915_driver_unload+0x151/0x180 [i915] i915_pci_remove+0x14/0x20 [i915] pci_device_remove+0x34/0xb0 __device_release_driver+0x95/0x140 driver_detach+0xb6/0xc0 bus_remove_driver+0x53/0xd0 driver_unregister+0x27/0x50 pci_unregister_driver+0x25/0x70 i915_exit+0x1a/0x1e2 [i915] SyS_delete_module+0x193/0x1f0 entry_SYSCALL_64_fastpath+0x1c/0xac Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file") Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk Reported-by: Dave Gordon <david.s.gordon@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Gordon <david.s.gordon@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20mm/hugetlb: avoid soft lockup in set_max_huge_pages()Jia He
commit 649920c6ab93429b94bc7c1aa7c0e8395351be32 upstream. In powerpc servers with large memory(32TB), we watched several soft lockups for hugepage under stress tests. The call traces are as follows: 1. get_page_from_freelist+0x2d8/0xd50 __alloc_pages_nodemask+0x180/0xc20 alloc_fresh_huge_page+0xb0/0x190 set_max_huge_pages+0x164/0x3b0 2. prep_new_huge_page+0x5c/0x100 alloc_fresh_huge_page+0xc8/0x190 set_max_huge_pages+0x164/0x3b0 This patch fixes such soft lockups. It is safe to call cond_resched() there because it is out of spin_lock/unlock section. Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com Signed-off-by: Jia He <hejianet@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20block: fix bdi vs gendisk lifetime mismatchDan Williams
commit df08c32ce3be5be138c1dbfcba203314a3a7cd6f upstream. The name for a bdi of a gendisk is derived from the gendisk's devt. However, since the gendisk is destroyed before the bdi it leaves a window where a new gendisk could dynamically reuse the same devt while a bdi with the same name is still live. Arrange for the bdi to hold a reference against its "owner" disk device while it is registered. Otherwise we can hit sysfs duplicate name collisions like the following: WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1' Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015 0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351 0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8 Call Trace: [<ffffffff8134caec>] dump_stack+0x63/0x87 [<ffffffff8108c351>] __warn+0xd1/0xf0 [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80 [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80 [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90 [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320 [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0 [<ffffffff8134ff55>] kobject_add+0x75/0xd0 [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f [<ffffffff8148b0a5>] device_add+0x125/0x610 [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100 [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20 [<ffffffff811b775c>] bdi_register+0x8c/0x180 [<ffffffff811b7877>] bdi_register_dev+0x27/0x30 [<ffffffff813317f5>] add_disk+0x175/0x4a0 Reported-by: Yi Zhang <yizhan@redhat.com> Tested-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixed up missing 0 return in bdi_register_owner(). Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-16Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"Michal Hocko
commit 4e390b2b2f34b8daaabf2df1df0cf8f798b87ddb upstream. This reverts commit f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"). There has been a report about OOM killer invoked when swapping out to a dm-crypt device. The primary reason seems to be that the swapout out IO managed to completely deplete memory reserves. Ondrej was able to bisect and explained the issue by pointing to f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"). The reason is that the swapout path is not throttled properly because the md-raid layer needs to allocate from the generic_make_request path which means it allocates from the PF_MEMALLOC context. dm layer uses mempool_alloc in order to guarantee a forward progress which used to inhibit access to memory reserves when using page allocator. This has changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements") which has dropped the __GFP_NOMEMALLOC protection when the memory pool is depleted. If we are running out of memory and the only way forward to free memory is to perform swapout we just keep consuming memory reserves rather than throttling the mempool allocations and allowing the pending IO to complete up to a moment when the memory is depleted completely and there is no way forward but invoking the OOM killer. This is less than optimal. The original intention of f9054c70d28b was to help with the OOM situations where the oom victim depends on mempool allocation to make a forward progress. David has mentioned the following backtrace: schedule schedule_timeout io_schedule_timeout mempool_alloc __split_and_process_bio dm_request generic_make_request submit_bio mpage_readpages ext4_readpages __do_page_cache_readahead ra_submit filemap_fault handle_mm_fault __do_page_fault do_page_fault page_fault We do not know more about why the mempool is depleted without being replenished in time, though. In any case the dm layer shouldn't depend on any allocations outside of the dedicated pools so a forward progress should be guaranteed. If this is not the case then the dm should be fixed rather than papering over the problem and postponing it to later by accessing more memory reserves. mempools are a mechanism to maintain dedicated memory reserves to guaratee forward progress. Allowing them an unbounded access to the page allocator memory reserves is going against the whole purpose of this mechanism. Bisected by Ondrej Kozina. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NeilBrown <neilb@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16mm: memcontrol: fix memcg id ref counter on swap charge moveVladimir Davydov
commit 615d66c37c755c49ce022c9e5ac0875d27d2603d upstream. Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") swap entries do not pin memcg->css.refcnt directly. Instead, they pin memcg->id.ref. So we should adjust the reference counters accordingly when moving swap charges between cgroups. Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16mm: memcontrol: fix swap counter leak on swapout from offline cgroupVladimir Davydov
commit 1f47b61fb4077936465dcde872a4e5cc4fe708da upstream. An offline memory cgroup might have anonymous memory or shmem left charged to it and no swap. Since only swap entries pin the id of an offline cgroup, such a cgroup will have no id and so an attempt to swapout its anon/shmem will not store memory cgroup info in the swap cgroup map. As a result, memcg->swap or memcg->memsw will never get uncharged from it and any of its ascendants. Fix this by always charging swapout to the first ancestor cgroup that hasn't released its id yet. [hannes@cmpxchg.org: add comment to mem_cgroup_swapout] [vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()] Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-23mm: memcontrol: fix cgroup creation failure after many small jobsJohannes Weiner
The memory controller has quite a bit of state that usually outlives the cgroup and pins its CSS until said state disappears. At the same time it imposes a 16-bit limit on the CSS ID space to economically store IDs in the wild. Consequently, when we use cgroups to contain frequent but small and short-lived jobs that leave behind some page cache, we quickly run into the 64k limitations of outstanding CSSs. Creating a new cgroup fails with -ENOSPC while there are only a few, or even no user-visible cgroups in existence. Although pinning CSSs past cgroup removal is common, there are only two instances that actually need an ID after a cgroup is deleted: cache shadow entries and swapout records. Cache shadow entries reference the ID weakly and can deal with the CSS having disappeared when it's looked up later. They pose no hurdle. Swap-out records do need to pin the css to hierarchically attribute swapins after the cgroup has been deleted; though the only pages that remain swapped out after offlining are tmpfs/shmem pages. And those references are under the user's control, so they are manageable. This patch introduces a private 16-bit memcg ID and switches swap and cache shadow entries over to using that. This ID can then be recycled after offlining when the CSS remains pinned only by objects that don't specifically need it. This script demonstrates the problem by faulting one cache page in a new cgroup and deleting it again: set -e mkdir -p pages for x in `seq 128000`; do [ $((x % 1000)) -eq 0 ] && echo $x mkdir /cgroup/foo echo $$ >/cgroup/foo/cgroup.procs echo trex >pages/$x echo $$ >/cgroup/cgroup.procs rmdir /cgroup/foo done When run on an unpatched kernel, we eventually run out of possible IDs even though there are no visible cgroups: [root@ham ~]# ./cssidstress.sh [...] 65000 mkdir: cannot create directory '/cgroup/foo': No space left on device After this patch, the IDs get released upon cgroup destruction and the cache and css objects get released once memory reclaim kicks in. [hannes@cmpxchg.org: init the IDR] Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups") Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: John Garcia <john.garcia@mesosphere.io> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Cc: <stable@vger.kernel.org> [3.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm: workingset: printk missing log level, use pr_info()Anton Blanchard
Commit 612e44939c3c ("mm: workingset: eviction buckets for bigmem/lowbit machines") added a printk without a log level. Quieten it by using pr_info(). Link: http://lkml.kernel.org/r/1466982072-29836-2-git-send-email-anton@ozlabs.org Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm: thp: refix false positive BUG in page_move_anon_rmap()Hugh Dickins
The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's worth: the syzkaller fuzzer hit it again. It's still wrong for some THP cases, because linear_page_index() was never intended to apply to addresses before the start of a vma. That's easily fixed with a signed long cast inside linear_page_index(); and Dmitry has tested such a patch, to verify the false positive. But why extend linear_page_index() just for this case? when the avoidance in page_move_anon_rmap() has already grown ugly, and there's no reason for the check at all (nothing else there is using address or index). Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE, remove CONFIG_DEBUG_VM PageTransHuge adjustment. And one more thing: should the compound_head(page) be done inside or outside page_move_anon_rmap()? It's usually pushed down to the lowest level nowadays (and mm/memory.c shows no other explicit use of it), so I think it's better done in page_move_anon_rmap() than by caller. Fixes: 0798d3c022dc ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()") Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm: rmap: call page_check_address() with sync enabled to avoid racy checkNaoya Horiguchi
The previous patch addresses the race between split_huge_pmd_address() and someone changing the pmd. The fix is only for splitting of normal thp (i.e. pmd-mapped thp,) and for splitting of pte-mapped thp there still is the similar race. For splitting pte-mapped thp, the pte's conversion is done by try_to_unmap_one(TTU_MIGRATION). This function checks page_check_address() to get the target pte, but it can return NULL under some race, leading to VM_BUG_ON() in freeze_page(). Fortunately, page_check_address() already has an argument to decide whether we do a quick/racy check or not, so let's flip it when called from freeze_page(). Link: http://lkml.kernel.org/r/1466990929-7452-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm: thp: move pmd check inside ptl for freeze_page()Naoya Horiguchi
I found a race condition triggering VM_BUG_ON() in freeze_page(), when running a testcase with 3 processes: - process 1: keep writing thp, - process 2: keep clearing soft-dirty bits from virtual address of process 1 - process 3: call migratepages for process 1, The kernel message is like this: kernel BUG at /src/linux-dev/mm/huge_memory.c:3096! invalid opcode: 0000 [#1] SMP Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000 RIP: 0010:[<ffffffff811f8e06>] [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590 RSP: 0018:ffff88007cdd3b70 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000 RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000 R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000 R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080 FS: 00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0 Call Trace: ? list_del+0xd/0x30 queue_pages_pte_range+0x4d1/0x590 __walk_page_range+0x204/0x4e0 walk_page_range+0x71/0xf0 queue_pages_range+0x75/0x90 ? queue_pages_hugetlb+0x190/0x190 ? new_node_page+0xc0/0xc0 ? change_prot_numa+0x40/0x40 migrate_to_node+0x71/0xd0 do_migrate_pages+0x1c3/0x210 SyS_migrate_pages+0x261/0x290 entry_SYSCALL_64_fastpath+0x1a/0xa4 Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff RIP split_huge_page_to_list+0x496/0x590 I'm not sure of the full scenario of the reproduction, but my debug showed that split_huge_pmd_address(freeze=true) returned without running main code of pmd splitting because pmd_present(*pmd) in precheck somehow returned 0. If this happens, the subsequent try_to_unmap() fails and returns non-zero (because page_mapcount() still > 0), and finally VM_BUG_ON() fires. This patch tries to fix it by prechecking pmd state inside ptl. Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm, meminit: ensure node is online before checking whether pages are ↵Mel Gorman
uninitialised early_page_uninitialised looks up an arbitrary PFN. While a machine without node 0 will boot with "mm, page_alloc: Always return a valid node from early_pfn_to_nid", it works because it assumes that nodes are always in PFN order. This is not guaranteed so this patch adds robustness by always checking if the node being checked is online. Link: http://lkml.kernel.org/r/1468008031-3848-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm, meminit: always return a valid node from early_pfn_to_nidMel Gorman
early_pfn_to_nid can return node 0 if a PFN is invalid on machines that has no node 0. A machine with only node 1 was observed to crash with the following message: BUG: unable to handle kernel paging request at 000000000002a3c8 PGD 0 Modules linked in: Hardware name: Supermicro H8DSP-8/H8DSP-8, BIOS 080011 06/30/2006 task: ffffffff81c0d500 ti: ffffffff81c00000 task.ti: ffffffff81c00000 RIP: reserve_bootmem_region+0x6a/0xef CR2: 000000000002a3c8 CR3: 0000000001c06000 CR4: 00000000000006b0 Call Trace: free_all_bootmem+0x4b/0x12a mem_init+0x70/0xa3 start_kernel+0x25b/0x49b The problem is that early_page_uninitialised uses the early_pfn_to_nid helper which returns node 0 for invalid PFNs. No caller of early_pfn_to_nid cares except early_page_uninitialised. This patch has early_pfn_to_nid always return a valid node. Link: http://lkml.kernel.org/r/1468008031-3848-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15kasan/quarantine: fix bugs on qlist_move_cache()Joonsoo Kim
There are two bugs on qlist_move_cache(). One is that qlist's tail isn't set properly. curr->next can be NULL since it is singly linked list and NULL value on tail is invalid if there is one item on qlist. Another one is that if cache is matched, qlist_put() is called and it will set curr->next to NULL. It would cause to stop the loop prematurely. These problems come from complicated implementation so I'd like to re-implement it completely. Implementation in this patch is really simple. Iterate all qlist_nodes and put them to appropriate list. Unfortunately, I got this bug sometime ago and lose oops message. But, the bug looks trivial and no need to attach oops. Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation") Link: http://lkml.kernel.org/r/1467766348-22419-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Kuthonuzo Luruo <poll.stdin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15madvise_free, thp: fix madvise_free_huge_pmd return value after splittingHuang Ying
madvise_free_huge_pmd should return 0 if the fallback PTE operations are required. In madvise_free_huge_pmd, if part pages of THP are discarded, the THP will be split and fallback PTE operations should be used if splitting succeeds. But the original code will make fallback PTE operations skipped, after splitting succeeds. Fix that via make madvise_free_huge_pmd return 0 after splitting successfully, so that the fallback PTE operations will be done. Link: http://lkml.kernel.org/r/1467135452-16688-1-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15mm, compaction: prevent VM_BUG_ON when terminating freeing scannerDavid Rientjes
It's possible to isolate some freepages in a pageblock and then fail split_free_page() due to the low watermark check. In this case, we hit VM_BUG_ON() because the freeing scanner terminated early without a contended lock or enough freepages. This should never have been a VM_BUG_ON() since it's not a fatal condition. It should have been a VM_WARN_ON() at best, or even handled gracefully. Regardless, we need to terminate anytime the full pageblock scan was not done. The logic belongs in isolate_freepages_block(), so handle its state gracefully by terminating the pageblock loop and making a note to restart at the same pageblock next time since it was not possible to complete the scan this time. [rientjes@google.com: don't rescan pages in a pageblock] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1607111244150.83138@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606291436300.145590@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Minchan Kim <minchan@kernel.org> Tested-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-10tmpfs: fix regression hang in fallocate undoHugh Dickins
The well-spotted fallocate undo fix is good in most cases, but not when fallocate failed on the very first page. index 0 then passes lend -1 to shmem_undo_range(), and that has two bad effects: (a) that it will undo every fallocation throughout the file, unrestricted by the current range; but more importantly (b) it can cause the undo to hang, because lend -1 is treated as truncation, which makes it keep on retrying until every page has gone, but those already fully instantiated will never go away. Big thank you to xfstests generic/269 which demonstrates this. Fixes: b9b4bb26af01 ("tmpfs: don't undo fallocate past its last page") Cc: stable@vger.kernel.org Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm/page_owner: avoid null pointer dereferenceSudip Mukherjee
We have dereferenced page_ext before checking it. Lets check it first and then used it. Fixes: f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites") Link: http://lkml.kernel.org/r/1465249059-7883-1-git-send-email-sudipm.mukherjee@gmail.com Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm, compaction: abort free scanner if split failsDavid Rientjes
If the memory compaction free scanner cannot successfully split a free page (only possible due to per-zone low watermark), terminate the free scanner rather than continuing to scan memory needlessly. If the watermark is insufficient for a free page of order <= cc->order, then terminate the scanner since all future splits will also likely fail. This prevents the compaction freeing scanner from scanning all memory on very large zones (very noticeable for zones > 128GB, for instance) when all splits will likely fail while holding zone->lock. compaction_alloc() iterating a 128GB zone has been benchmarked to take over 400ms on some systems whereas any free page isolated and ready to be split ends up failing in split_free_page() because of the low watermark check and thus the iteration continues. The next time compaction occurs, the freeing scanner will likely start at the end of the zone again since no success was made previously and we get the same lengthy iteration until the zone is brought above the low watermark. All thp page faults can take >400ms in such a state without this fix. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm: prevent KASAN false positives in kmemleakDmitry Vyukov
When kmemleak dumps contents of leaked objects it reads whole objects regardless of user-requested size. This upsets KASAN. Disable KASAN checks around object dump. Link: http://lkml.kernel.org/r/1466617631-68387-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm/hugetlb: clear compound_mapcount when freeing gigantic pagesGerald Schaefer
While working on s390 support for gigantic hugepages I ran into the following "Bad page state" warning when freeing gigantic pages: BUG: Bad page state in process bash pfn:580001 page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0 flags: 0x7fffc0000000000() page dumped because: non-NULL mapping This is because page->compound_mapcount, which is part of a union with page->mapping, is initialized with -1 in prep_compound_gigantic_page(), and not cleared again during destroy_compound_gigantic_page(). Fix this by clearing the compound_mapcount in destroy_compound_gigantic_page() before clearing compound_head. Interestingly enough, the warning will not show up on x86_64, although this should not be architecture specific. Apparently there is an endianness issue, combined with the fact that the union contains both a 64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as members. The resulting bogus page->mapping on x86_64 therefore contains 00000000ffffffff instead of ffffffff00000000 on s390, which will falsely trigger the PageAnon() check in free_pages_prepare() because page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures like x86_64 in this case (the page is not compound anymore, ->compound_head was already cleared before). As a result, page->mapping will be cleared before doing the checks in free_pages_check(). Not sure if the bogus "PageAnon() returning true" on x86_64 for the first tail page of a gigantic page (at this stage) has other theoretical implications, but they would also be fixed with this patch. Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm/swap.c: flush lru pvecs on compound page arrivalLukasz Odzioba
Currently we can have compound pages held on per cpu pagevecs, which leads to a lot of memory unavailable for reclaim when needed. In the systems with hundreads of processors it can be GBs of memory. On of the way of reproducing the problem is to not call munmap explicitly on all mapped regions (i.e. after receiving SIGTERM). After that some pages (with THP enabled also huge pages) may end up on lru_add_pvec, example below. void main() { #pragma omp parallel { size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS , -1, 0); if (p != MAP_FAILED) memset(p, 0, size); //munmap(p, size); // uncomment to make the problem go away } } When we run it with THP enabled it will leave significant amount of memory on lru_add_pvec. This memory will be not reclaimed if we hit OOM, so when we run above program in a loop: for i in `seq 100`; do ./a.out; done many processes (95% in my case) will be killed by OOM. The primary point of the LRU add cache is to save the zone lru_lock contention with a hope that more pages will belong to the same zone and so their addition can be batched. The huge page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like a safer option when compared to a potential excess in the caching which can be quite large and much harder to fix because lru_add_drain_all is way to expensive and it is not really clear what would be a good moment to call it. Similarly we can reproduce the problem on lru_deactivate_pvec by adding: madvise(p, size, MADV_FREE); after memset. This patch flushes lru pvecs on compound page arrival making the problem less severe - after applying it kill rate of above example drops to 0%, due to reducing maximum amount of memory held on pvec from 28MB (with THP) to 56kB per CPU. Suggested-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Ming Li <mingli199x@qq.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24memcg: css_alloc should return an ERR_PTR value on errorTejun Heo
mem_cgroup_css_alloc() was returning NULL on failure while cgroup core expected it to return an ERR_PTR value leading to the following NULL deref after a css allocation failure. Fix it by return ERR_PTR(-ENOMEM) instead. I'll also update cgroup core so that it can handle NULL returns. mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO) CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123 ... Call Trace: dump_stack+0x68/0xa1 warn_alloc_failed+0xd6/0x130 __alloc_pages_nodemask+0x4c6/0xf20 alloc_pages_current+0x66/0xe0 alloc_kmem_pages+0x14/0x80 kmalloc_order_trace+0x2a/0x1a0 __kmalloc+0x291/0x310 memcg_update_all_caches+0x6c/0x130 mem_cgroup_css_alloc+0x590/0x610 cgroup_apply_control_enable+0x18b/0x370 cgroup_mkdir+0x1de/0x2e0 kernfs_iop_mkdir+0x55/0x80 vfs_mkdir+0xb9/0x150 SyS_mkdir+0x66/0xd0 do_syscall_64+0x53/0x120 entry_SYSCALL64_slow_path+0x25/0x25 ... BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 IP: init_and_link_css+0x37/0x220 PGD 34b1e067 PUD 3a109067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014 task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000 RIP: 0010:[<ffffffff810f2ca7>] [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220 RSP: 0018:ffff8800666d7d90 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008 RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400 R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010 FS: 00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: cgroup_apply_control_enable+0x1ac/0x370 cgroup_mkdir+0x1de/0x2e0 kernfs_iop_mkdir+0x55/0x80 vfs_mkdir+0xb9/0x150 SyS_mkdir+0x66/0xd0 do_syscall_64+0x53/0x120 entry_SYSCALL64_slow_path+0x25/0x25 Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8 RIP init_and_link_css+0x37/0x220 RSP <ffff8800666d7d90> CR2: 00000000000000d0 ---[ end trace a2d8836ae1e852d1 ]--- Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24memcg: mem_cgroup_migrate() may be called with irq disabledTejun Heo
mem_cgroup_migrate() uses local_irq_disable/enable() but can be called with irq disabled from migrate_page_copy(). This ends up enabling irq while holding a irq context lock triggering the following lockdep warning. Fix it by using irq_save/restore instead. ================================= [ INFO: inconsistent lock state ] 4.7.0-rc1+ #52 Tainted: G W --------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8 {IN-SOFTIRQ-W} state was registered at: __lock_acquire+0x5b6/0x1930 lock_acquire+0xee/0x270 _raw_spin_lock_irqsave+0x66/0xb0 aio_complete+0x98/0x328 dio_complete+0xe4/0x1e0 blk_update_request+0xd4/0x450 scsi_end_request+0x48/0x1c8 scsi_io_completion+0x272/0x698 blk_done_softirq+0xca/0xe8 __do_softirq+0xc8/0x518 irq_exit+0xee/0x110 do_IRQ+0x6a/0x88 io_int_handler+0x11a/0x25c __mutex_unlock_slowpath+0x144/0x1d8 __mutex_unlock_slowpath+0x140/0x1d8 kernfs_iop_permission+0x64/0x80 __inode_permission+0x9e/0xf0 link_path_walk+0x6e/0x510 path_lookupat+0xc4/0x1a8 filename_lookup+0x9c/0x160 user_path_at_empty+0x5c/0x70 SyS_readlinkat+0x68/0x140 system_call+0xd6/0x270 irq event stamp: 971410 hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588 hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0 softirqs last enabled at (970526): __do_softirq+0x460/0x518 softirqs last disabled at (970519): irq_exit+0xee/0x110 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ctx->completion_lock)->rlock); <Interrupt> lock(&(&ctx->completion_lock)->rlock); *** DEADLOCK *** 3 locks held by kcompactd0/151: #0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8 #1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8 #2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8 stack backtrace: CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52 Call Trace: show_trace+0xea/0xf0 show_stack+0x72/0xf0 dump_stack+0x9a/0xd8 print_usage_bug.part.27+0x2d4/0x2e8 mark_lock+0x17e/0x758 mark_held_locks+0xa2/0xd0 trace_hardirqs_on_caller+0x140/0x1c0 mem_cgroup_migrate+0x266/0x370 aio_migratepage+0x16a/0x1e8 move_to_new_page+0xb0/0x260 migrate_pages+0x8f4/0x9f0 compact_zone+0x4dc/0xdc8 kcompactd_do_work+0x1aa/0x358 kcompactd+0xba/0x2c8 kthread+0x10a/0x110 kernel_thread_starter+0x6/0xc kernel_thread_starter+0x0/0xc INFO: lockdep is turned off. Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com Fixes: 74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24hugetlb: fix nr_pmds accounting with shared page tablesKirill A. Shutemov
We account HugeTLB's shared page table to all processes who share it. The accounting happens during huge_pmd_share(). If somebody populates pud entry under us, we should decrease pagetable's refcount and decrease nr_pmds of the process. By mistake, I increase nr_pmds again in this case. :-/ It will lead to "BUG: non-zero nr_pmds on freeing mm: 2" on process' exit. Let's fix this by increasing nr_pmds only when we're sure that the page table will be used. Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: zhongjiang <zhongjiang@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24Revert "mm: disable fault around on emulated access bit architecture"Kirill A. Shutemov
This reverts commit d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb. After revert of 5c0a85fad949 ("mm: make faultaround produce old ptes") faultaround doesn't have dependencies on hardware accessed bit, so let's revert this one too. Link: http://lkml.kernel.org/r/1465893750-44080-3-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: "Huang, Ying" <ying.huang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24Revert "mm: make faultaround produce old ptes"Kirill A. Shutemov
This reverts commit 5c0a85fad949212b3e059692deecdeed74ae7ec7. The commit causes ~6% regression in unixbench. Let's revert it for now and consider other solution for reclaim problem later. Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: "Huang, Ying" <ying.huang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim maskMel Gorman
Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd") modified __GFP_WAIT to explicitly identify the difference between atomic callers and those that were unwilling to sleep. Later the definition was removed entirely. The GFP_RECLAIM_MASK is the set of flags that affect watermark checking and reclaim behaviour but __GFP_ATOMIC was never added. Without it, atomic users of the slab allocator strip the __GFP_ATOMIC flag and cannot access the page allocator atomic reserves. This patch addresses the problem. The user-visible impact depends on the workload but potentially atomic allocations unnecessarily fail without this path. Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Marcin Wojtas <mw@semihalf.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm: mempool: kasan: don't poot mempool objects in quarantineAndrey Ryabinin
Currently we may put reserved by mempool elements into quarantine via kasan_kfree(). This is totally wrong since quarantine may really free these objects. So when mempool will try to use such element, use-after-free will happen. Or mempool may decide that it no longer need that element and double-free it. So don't put object into quarantine in kasan_kfree(), just poison it. Rename kasan_kfree() to kasan_poison_kfree() to respect that. Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in kasan_unpoison_element() because those functions may update allocation stacktrace. This would be wrong for the most of the remove_element call sites. (The only call site where we may want to update alloc stacktrace is in mempool_alloc(). Kmemleak solves this by calling kmemleak_update_trace(), so we could make something like that too. But this is out of scope of this patch). Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation") Link: http://lkml.kernel.org/r/575977C3.1010905@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24tmpfs: don't undo fallocate past its last pageAnthony Romano
When fallocate is interrupted it will undo a range that extends one byte past its range of allocated pages. This can corrupt an in-use page by zeroing out its first byte. Instead, undo using the inclusive byte range. Fixes: 1635f6a74152f1d ("tmpfs: undo fallocation on failure") Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com Signed-off-by: Anthony Romano <anthony.romano@coreos.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Brandon Philips <brandon@ifup.co> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24oom_reaper: avoid pointless atomic_inc_not_zero usage.Tetsuo Handa
Since commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") changed to use find_lock_task_mm() for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0 because find_lock_task_mm() returns a task_struct with ->mm != NULL. Therefore, we can safely use atomic_inc(). Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm,oom_reaper: don't call mmput_async() without atomic_inc_not_zero()Tetsuo Handa
Commit e2fe14564d33 ("oom_reaper: close race with exiting task") reduced frequency of needlessly selecting next OOM victim, but was calling mmput_async() when atomic_inc_not_zero() failed. Link: http://lkml.kernel.org/r/1464423365-5555-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-23mm: Export migrate_page_move_mapping and migrate_page_copyRichard Weinberger
Export these symbols such that UBIFS can implement ->migratepage. Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Christoph Hellwig <hch@lst.de>
2016-06-13Merge branch 'for-4.7-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu fixes from Tejun Heo: "While adding GFP_ATOMIC support to the percpu allocator, the synchronization for the fast-path which doesn't require external allocations was separated into pcpu_lock. Unfortunately, it incorrectly decoupled async paths and percpu chunks could get destroyed while still being operated on. This contains two patches to fix the bug" * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix synchronization between synchronous map extension and chunk destruction percpu: fix synchronization between chunk->map_extend_work and chunk destruction
2016-06-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current series. This contains: - Two fixes for xen-blkfront, from Bob Liu. - A bug fix for NVMe, releasing only the specific resources we requested. - Fix for a debugfs flags entry for nbd, from Josef. - Plug fix from Omar, fixing up a case of code being switched between two functions. - A missing bio_put() for the new discard callers of submit_bio_wait(), fixing a regression causing a leak of the bio. From Shaun. - Improve dirty limit calculation precision in the writeback code, fixing a case where setting a limit lower than 1% of memory would end up being zero. From Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: NVMe: Only release requested regions xen-blkfront: fix resume issues after a migration xen-blkfront: don't call talk_to_blkback when already connected to blkback nbd: pass the nbd pointer for flags debugfs block: missing bio_put following submit_bio_wait blk-mq: really fix plug list flushing for nomerge queues writeback: use higher precision calculation in domain_dirty_limits()
2016-06-09mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEEDOleg Drokin
I noticed that the logic in the fadvise64_64 syscall is incorrect for partial pages. While first page of the region is correctly skipped if it is partial, the last page of the region is mistakenly discarded. This leads to problems for applications that read data in non-page-aligned chunks discarding already processed data between the reads. A somewhat misguided application that does something like write(XX bytes (non-page-alligned)); drop the data it just wrote; repeat gets a significant penalty in performance as a result. Link: http://lkml.kernel.org/r/1464917140-1506698-1-git-send-email-green@linuxhacker.ru Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_allWang Sheng-Hui
This patch is based on https://patchwork.ozlabs.org/patch/574623/. Tejun submitted commit 23d11a58a9a6 ("workqueue: skip flush dependency checks for legacy workqueues") for the legacy create*_workqueue() interface. But some workq created by alloc_workqueue still reports warning on memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set: workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu ------------[ cut here ]------------ WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c ... check_flush_dependency+0xb4/0x10c flush_work+0x54/0x140 lru_add_drain_all+0x138/0x188 migrate_prep+0xc/0x18 alloc_contig_range+0xf4/0x350 cma_alloc+0xec/0x1e4 dma_alloc_from_contiguous+0x38/0x40 __dma_alloc+0x74/0x25c nvme_alloc_queue+0xcc/0x36c nvme_reset_work+0x5c4/0xda8 process_one_work+0x128/0x2ec worker_thread+0x58/0x434 kthread+0xd4/0xe8 ret_from_fork+0x10/0x50 That's because lru_add_drain_all() will schedule the drain work on system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM. Introduce a dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all(), aiding in getting memory freed. Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>