summaryrefslogtreecommitdiffstats
path: root/mm/vmscan.c
AgeCommit message (Collapse)Author
2018-06-05mm: fix the NULL mapping case in __isolate_lru_page()Hugh Dickins
commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream. George Boole would have noticed a slight error in 4.16 commit 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page"). Fix it, to match both the comment above it, and the original behaviour. Although anonymous pages are not marked PageDirty at first, we have an old habit of calling SetPageDirty when a page is removed from swap cache: so there's a category of ex-swap pages that are easily migratable, but were inadvertently excluded from compaction's async migration in 4.16. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Ivan Kalvachev <ikalvachev@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30mm: fix races between address_space dereference and free in page_evicatableHuang Ying
[ Upstream commit e92bb4dd9673945179b1fc738c9817dd91bfb629 ] When page_mapping() is called and the mapping is dereferenced in page_evicatable() through shrink_active_list(), it is possible for the inode to be truncated and the embedded address space to be freed at the same time. This may lead to the following race. CPU1 CPU2 truncate(inode) shrink_active_list() ... page_evictable(page) truncate_inode_page(mapping, page); delete_from_page_cache(page) spin_lock_irqsave(&mapping->tree_lock, flags); __delete_from_page_cache(page, NULL) page_cache_tree_delete(..) ... mapping = page_mapping(page); page->mapping = NULL; ... spin_unlock_irqrestore(&mapping->tree_lock, flags); page_cache_free_page(mapping, page) put_page(page) if (put_page_testzero(page)) -> false - inode now has no pages and can be freed including embedded address_space mapping_unevictable(mapping) test_bit(AS_UNEVICTABLE, &mapping->flags); - we've dereferenced mapping which is potentially already free. Similar race exists between swap cache freeing and page_evicatable() too. The address_space in inode and swap cache will be freed after a RCU grace period. So the races are fixed via enclosing the page_mapping() and address_space usage in rcu_read_lock/unlock(). Some comments are added in code to make it clear what is protected by the RCU read lock. Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26mm: pin address_space before dereferencing it while isolating an LRU pageMel Gorman
[ Upstream commit 69d763fc6d3aee787a3e8c8c35092b4f4960fa5d ] Minchan Kim asked the following question -- what locks protects address_space destroying when race happens between inode trauncation and __isolate_lru_page? Jan Kara clarified by describing the race as follows CPU1 CPU2 truncate(inode) __isolate_lru_page() ... truncate_inode_page(mapping, page); delete_from_page_cache(page) spin_lock_irqsave(&mapping->tree_lock, flags); __delete_from_page_cache(page, NULL) page_cache_tree_delete(..) ... mapping = page_mapping(page); page->mapping = NULL; ... spin_unlock_irqrestore(&mapping->tree_lock, flags); page_cache_free_page(mapping, page) put_page(page) if (put_page_testzero(page)) -> false - inode now has no pages and can be freed including embedded address_space if (mapping && !mapping->a_ops->migratepage) - we've dereferenced mapping which is potentially already free. The race is theoretically possible but unlikely. Before the delete_from_page_cache, truncate_cleanup_page is called so the page is likely to be !PageDirty or PageWriteback which gets skipped by the only caller that checks the mappping in __isolate_lru_page. Even if the race occurs, a substantial amount of work has to happen during a tiny window with no preemption but it could potentially be done using a virtual machine to artifically slow one CPU or halt it during the critical window. This patch should eliminate the race with truncation by try-locking the page before derefencing mapping and aborting if the lock was not acquired. There was a suggestion from Huang Ying to use RCU as a side-effect to prevent mapping being freed. However, I do not like the solution as it's an unconventional means of preserving a mapping and it's not a context where rcu_read_lock is obviously protecting rcu data. Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Minchan Kim <minchan@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28mm/vmscan: wake up flushers for legacy cgroups tooAndrey Ryabinin
commit 1c610d5f93c709df56787f50b3576704ac271826 upstream. Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty pages on the LRU") added flusher invocation to shrink_inactive_list() when many dirty pages on the LRU are encountered. However, shrink_inactive_list() doesn't wake up flushers for legacy cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path") removed the only source of flusher's wake up in legacy mem cgroup reclaim path. This leads to premature OOM if there is too many dirty pages in cgroup: # mkdir /sys/fs/cgroup/memory/test # echo $$ > /sys/fs/cgroup/memory/test/tasks # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # dd if=/dev/zero of=tmp_file bs=1M count=100 Killed dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0 Call Trace: dump_stack+0x46/0x65 dump_header+0x6b/0x2ac oom_kill_process+0x21c/0x4a0 out_of_memory+0x2a5/0x4b0 mem_cgroup_out_of_memory+0x3b/0x60 mem_cgroup_oom_synchronize+0x2ed/0x330 pagefault_out_of_memory+0x24/0x54 __do_page_fault+0x521/0x540 page_fault+0x45/0x50 Task in /test killed as a result of limit of /test memory: usage 51200kB, limit 51200kB, failcnt 73 memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0 kmem: usage 296kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Wake up flushers in legacy cgroup reclaim too. Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Tested-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.Tetsuo Handa
commit bb422a738f6566f7439cd347d54e321e4fe92a9f upstream. Syzbot caught an oops at unregister_shrinker() because combination of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault injection made register_shrinker() fail and the caller of register_shrinker() did not check for failure. ---------- [ 554.881422] FAULT_INJECTION: forcing a failure. [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0 [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82 [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 554.881445] Call Trace: [ 554.881459] dump_stack+0x194/0x257 [ 554.881474] ? arch_local_irq_restore+0x53/0x53 [ 554.881486] ? find_held_lock+0x35/0x1d0 [ 554.881507] should_fail+0x8c0/0xa40 [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0 [ 554.881537] ? check_noncircular+0x20/0x20 [ 554.881546] ? find_next_zero_bit+0x2c/0x40 [ 554.881560] ? ida_get_new_above+0x421/0x9d0 [ 554.881577] ? find_held_lock+0x35/0x1d0 [ 554.881594] ? __lock_is_held+0xb6/0x140 [ 554.881628] ? check_same_owner+0x320/0x320 [ 554.881634] ? lock_downgrade+0x990/0x990 [ 554.881649] ? find_held_lock+0x35/0x1d0 [ 554.881672] should_failslab+0xec/0x120 [ 554.881684] __kmalloc+0x63/0x760 [ 554.881692] ? lock_downgrade+0x990/0x990 [ 554.881712] ? register_shrinker+0x10e/0x2d0 [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320 [ 554.881737] register_shrinker+0x10e/0x2d0 [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0 [ 554.881755] ? _down_write_nest_lock+0x120/0x120 [ 554.881765] ? memcpy+0x45/0x50 [ 554.881785] sget_userns+0xbcd/0xe20 (...snipped...) [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 554.898732] general protection fault: 0000 [#1] SMP KASAN [ 554.898737] Dumping ftrace buffer: [ 554.898741] (ftrace buffer empty) [ 554.898743] Modules linked in: [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82 [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000 [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150 [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246 [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0 [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004 [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000 [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98 [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0 [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000 [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 [ 554.898818] Call Trace: [ 554.898828] unregister_shrinker+0x79/0x300 [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750 [ 554.898844] ? down_write+0x87/0x120 [ 554.898851] ? deactivate_super+0x139/0x1b0 [ 554.898857] ? down_read+0x150/0x150 [ 554.898864] ? check_same_owner+0x320/0x320 [ 554.898875] deactivate_locked_super+0x64/0xd0 [ 554.898883] deactivate_super+0x141/0x1b0 ---------- Since allowing register_shrinker() callers to call unregister_shrinker() when register_shrinker() failed can simplify error recovery path, this patch makes unregister_shrinker() no-op when register_shrinker() failed. Also, reset shrinker->nr_deferred in case unregister_shrinker() was by error called twice. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Glauber Costa <glauber@scylladb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-06mm, THP, swap: add THP swapping out fallback countingHuang Ying
When swapping out THP (Transparent Huge Page), instead of swapping out the THP as a whole, sometimes we have to fallback to split the THP into normal pages before swapping, because no free swap clusters are available, or cgroup limit is exceeded, etc. To count the number of the fallback, a new VM event THP_SWPOUT_FALLBACK is added, and counted when we fallback to split the THP. Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, THP, swap: delay splitting THP after swapped outHuang Ying
In this patch, splitting transparent huge page (THP) during swapping out is delayed from after adding the THP into the swap cache to after swapping out finishes. After the patch, more operations for the anonymous THP reclaiming, such as writing the THP to the swap device, removing the THP from the swap cache could be batched. So that the performance of anonymous THP swapping out could be improved. This is the second step for the THP swap support. The plan is to delay splitting the THP step by step and avoid splitting the THP finally. With the patchset, the swap out throughput improves 42% (from about 5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case with 16 processes. At the same time, the IPI (reflect TLB flushing) reduced about 78.9%. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. Link: http://lkml.kernel.org/r/20170724051840.2309-12-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, vmscan: do not loop on too_many_isolated for everMichal Hocko
Tetsuo Handa has reported[1][2][3] that direct reclaimers might get stuck in too_many_isolated loop basically for ever because the last few pages on the LRU lists are isolated by the kswapd which is stuck on fs locks when doing the pageout or slab reclaim. This in turn means that there is nobody to actually trigger the oom killer and the system is basically unusable. too_many_isolated has been introduced by commit 35cd78156c49 ("vmscan: throttle direct reclaim when too many pages are isolated already") to prevent from pre-mature oom killer invocations because back then no reclaim progress could indeed trigger the OOM killer too early. But since the oom detection rework in commit 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim retry loop considers all the reclaimable pages and throttles the allocation at that layer so we can loosen the direct reclaim throttling. Make shrink_inactive_list loop over too_many_isolated bounded and returns immediately when the situation hasn't resolved after the first sleep. Replace congestion_wait by a simple schedule_timeout_interruptible because we are not really waiting on the IO congestion in this path. Please note that this patch can theoretically cause the OOM killer to trigger earlier while there are many pages isolated for the reclaim which makes progress only very slowly. This would be obvious from the oom report as the number of isolated pages are printed there. If we ever hit this should_reclaim_retry should consider those numbers in the evaluation in one way or another. [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp [mhocko@suse.com: switch to uninterruptible sleep] Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: track actual nr_scanned during shrink_slab()Chris Wilson
Some shrinkers may only be able to free a bunch of objects at a time, and so free more than the requested nr_to_scan in one pass. Whilst other shrinkers may find themselves even unable to scan as many objects as they counted, and so underreport. Account for the extra freed/scanned objects against the total number of objects we intend to scan, otherwise we may end up penalising the slab far more than intended. Similarly, we want to add the underperforming scan to the deferred pass so that we try harder and harder in future passes. Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10locking/lockdep: Rework FS_RECLAIM annotationPeter Zijlstra
A while ago someone, and I cannot find the email just now, asked if we could not implement the RECLAIM_FS inversion stuff with a 'fake' lock like we use for other things like workqueues etc. I think this should be possible which allows reducing the 'irq' states and will reduce the amount of __bfs() lookups we do. Removing the 1 IRQ state results in 4 less __bfs() walks per dependency, improving lockdep performance. And by moving this annotation out of the lockdep code it becomes easier for the mm people to extend. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: iamjoonsoo.kim@lge.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-12mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm, vmscan: avoid thrashing anon lru when free + file is lowDavid Rientjes
The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do not swap anon pages just because free+file is low'") reintroduces is to prefer swapping anonymous memory rather than trashing the file lru. If the anonymous inactive lru for the set of eligible zones is considered low, however, or the length of the list for the given reclaim priority does not allow for effective anonymous-only reclaiming, then avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the small list and leave unreclaimed memory on the file lrus. If the inactive list is insufficient, fallback to balanced reclaim so the file lru doesn't remain untouched. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: vmstat: move slab statistics from zone to node countersJohannes Weiner
Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [hannes@cmpxchg.org: fix oops] Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm: per-cgroup memory reclaim statsRoman Gushchin
Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm, THP, swap: enable THP swap optimization only if has compound mapHuang Ying
If there is no compound map for a THP (Transparent Huge Page), it is possible that the map count of some sub-pages of the THP is 0. So it is better to split the THP before swapping out. In this way, the sub-pages not mapped will be freed, and we can avoid the unnecessary swap out operations for these sub-pages. Link: http://lkml.kernel.org/r/20170515112522.32457-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm, THP, swap: check whether THP can be split firstlyHuang Ying
To swap out THP (Transparent Huage Page), before splitting the THP, the swap cluster will be allocated and the THP will be added into the swap cache. But it is possible that the THP cannot be split, so that we must delete the THP from the swap cache and free the swap cluster. To avoid that, in this patch, whether the THP can be split is checked firstly. The check can only be done racy, but it is good enough for most cases. With the patch, the swap out throughput improves 3.6% (from about 4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case with 8 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. Link: http://lkml.kernel.org/r/20170515112522.32457-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for can_split_huge_page()] Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm, THP, swap: move anonymous THP split logic to vmscanMinchan Kim
The add_to_swap aims to allocate swap_space(ie, swap slot and swapcache) so if it fails due to lack of space in case of THP or something(hdd swap but tries THP swapout) *caller* rather than add_to_swap itself should split the THP page and retry it with base page which is more natural. Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm, THP, swap: unify swap slot free functions to put_swap_pageMinchan Kim
Now, get_swap_page takes struct page and allocates swap space according to page size(ie, normal or THP) so it would be more cleaner to introduce put_swap_page which is a counter function of get_swap_page. Then, it calls right swap slot free function depending on page's size. [ying.huang@intel.com: minor cleanup and fix] Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06mm/vmscan.c: fix unsequenced modification and access warningNick Desaulniers
Clang and its -Wunsequenced emits a warning mm/vmscan.c:2961:25: error: unsequenced modification and access to 'gfp_mask' [-Wunsequenced] .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)), ^ While it is not clear to me whether the initialization code violates the specification (6.7.8 par 19 (ISO/IEC 9899) looks like it disagrees) the code is quite confusing and worth cleaning up anyway. Fix this by reusing sc.gfp_mask rather than the updated input gfp_mask parameter. Link: http://lkml.kernel.org/r/20170510154030.10720-1-nick.desaulniers@gmail.com Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-23mm/vmscan: Adjust system_state checksThomas Gleixner
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in kswapd_run() to handle the extra states. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184736.119158930@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-12mm: vmscan: scan until it finds eligible pagesMinchan Kim
Although there are a ton of free swap and anonymous LRU page in elgible zones, OOM happened. balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0 CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: oom_kill_process+0x21d/0x3f0 out_of_memory+0xd8/0x390 __alloc_pages_slowpath+0xbc1/0xc50 __alloc_pages_nodemask+0x1a5/0x1c0 pte_alloc_one+0x20/0x50 __pte_alloc+0x1e/0x110 __handle_mm_fault+0x919/0x960 handle_mm_fault+0x77/0x120 __do_page_fault+0x27a/0x550 trace_do_page_fault+0x43/0x150 do_async_page_fault+0x2c/0x90 async_page_fault+0x28/0x30 Mem-Info: active_anon:424716 inactive_anon:65314 isolated_anon:0 active_file:52 inactive_file:46 isolated_file:0 unevictable:0 dirty:27 writeback:0 unstable:0 slab_reclaimable:3967 slab_unreclaimable:4125 mapped:133 shmem:43 pagetables:1674 bounce:0 free:4637 free_pcp:225 free_cma:0 Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 992 992 1952 DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB lowmem_reserve[]: 0 0 0 959 Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB 378 total pagecache pages 17 pages in swap cache Swap cache stats: add 17325, delete 17302, find 0/27 Free swap = 978940kB Total swap = 1048572kB 524157 pages RAM 0 pages HighMem/MovableOnly 12629 pages reserved 0 pages cma reserved 0 pages hwpoisoned [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br [ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd With investigation, skipping page of isolate_lru_pages makes reclaim void because it returns zero nr_taken easily so LRU shrinking is effectively nothing and just increases priority aggressively. Finally, OOM happens. The problem is that get_scan_count determines nr_to_scan with eligible zones so although priority drops to zero, it couldn't reclaim any pages if the LRU contains mostly ineligible pages. get_scan_count: size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx); size = size >> sc->priority; Assumes sc->priority is 0 and LRU list is as follows. N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H (Ie, small eligible pages are in the head of LRU but others are almost ineligible pages) In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from tail of the LRU are not eligible pages. If get_scan_count counts skipped pages, it doesn't reclaim any pages remained after scanning 4 pages so it ends up OOM happening. This patch makes isolate_lru_pages try to scan pages until it encounters eligible zones's pages. [akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text] Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"") Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08mm: introduce memalloc_noreclaim_{save,restore}Vlastimil Babka
The previous patch ("mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC") has shown that simply setting and clearing PF_MEMALLOC in current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim. Let's introduce helpers that support proper nesting by saving the previous stat of the flag, similar to the existing memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing of PF_MEMALLOC within mm to the new helpers. There are no known issues with the converted code, but the change makes it more robust. Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Boris Brezillon <boris.brezillon@free-electrons.com> Cc: Chris Leech <cleech@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: use node page state naming scheme for memcgJohannes Weiner
The memory controllers stat function names are awkwardly long and arbitrarily different from the zone and node stat functions. The current interface is named: mem_cgroup_read_stat() mem_cgroup_update_stat() mem_cgroup_inc_stat() mem_cgroup_dec_stat() mem_cgroup_update_page_stat() mem_cgroup_inc_page_stat() mem_cgroup_dec_page_stat() This patch renames it to match the corresponding node stat functions: memcg_page_state() [node_page_state()] mod_memcg_state() [mod_node_state()] inc_memcg_state() [inc_node_state()] dec_memcg_state() [dec_node_state()] mod_memcg_page_state() [mod_node_page_state()] inc_memcg_page_state() [inc_node_page_state()] dec_memcg_page_state() [dec_node_page_state()] Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: re-use node VM page state enumJohannes Weiner
The current duplication is a high-maintenance mess, and it's painful to add new items or query memcg state from the rest of the VM. This increases the size of the stat array marginally, but we should aim to track all these stats on a per-cgroup level anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: clean up memory.events counting functionJohannes Weiner
We only ever count single events, drop the @nr parameter. Rename the function accordingly. Remove low-information kerneldoc. Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: vmscan: fix IO/refault regression in cache workingset transitionJohannes Weiner
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") we noticed bigger IO spikes during changes in cache access patterns. The patch in question shrunk the inactive list size to leave more room for the current workingset in the presence of streaming IO. However, workingset transitions that previously happened on the inactive list are now pushed out of memory and incur more refaults to complete. This patch disables active list protection when refaults are being observed. This accelerates workingset transitions, and allows more of the new set to establish itself from memory, without eating into the ability to protect the established workingset during stable periods. The workloads that were measurably affected for us were hit pretty bad by it, with refault/majfault rates doubling and tripling during cache transitions, and the machines sustaining half-hour periods of 100% IO utilization, where they'd previously have sub-minute peaks at 60-90%. Stateful services that handle user data tend to be more conservative with kernel upgrades. As a result we hit most page cache issues with some delay, as was the case here. The severity seemed to warrant a stable tag. Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: make ttu's return booleanMinchan Kim
try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for boolean return. This patch changes it. Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: remove SWAP_AGAIN in ttuMinchan Kim
In 2002, [1] introduced SWAP_AGAIN. At that time, try_to_unmap_one used spin_trylock(&mm->page_table_lock) so it's really easy to contend and fail to hold a lock so SWAP_AGAIN to keep LRU status makes sense. However, now we changed it to mutex-based lock and be able to block without skip pte so there is few of small window to return SWAP_AGAIN so remove SWAP_AGAIN and just return SWAP_FAIL. [1] c48c43e, minimal rmap Link: http://lkml.kernel.org/r/1489555493-14659-7-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: remove SWAP_MLOCK in ttuMinchan Kim
ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL because it means the page is not-swappable so it should move to another LRU list(active or unevictable). putback friends will move it to right list depending on the page's LRU flag. Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: remove SWAP_DIRTY in ttuMinchan Kim
If we found lazyfree page is dirty, try_to_unmap_one can just SetPageSwapBakced in there like PG_mlocked page and just return with SWAP_FAIL which is very natural because the page is not swappable right now so that vmscan can activate it. There is no point to introduce new return value SWAP_DIRTY in try_to_unmap at the moment. Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/vmscan: more restrictive condition for retry in do_try_to_free_pagesYisheng Xie
By reviewing code, I find that when enter do_try_to_free_pages, the may_thrash is always clear, and it will retry shrink zones to tap cgroup's reserves memory by setting may_thrash when the former shrink_zones reclaim nothing. However, when memcg is disabled or on legacy hierarchy, or there do not have any memcg protected by low limit, it should not do this useless retry at all, for we do not have any cgroup's reserves memory to tap, and we have already done hard work but made no progress, which as Michal pointed out in former version, we are trying hard to control the retry logical of page alloctor, and the current additional round of reclaim is just lame. Therefore, to avoid this unneeded retrying and make code more readable, we remove the may_thrash field in scan_control, instead, introduce memcg_low_reclaim and memcg_low_skipped, and only retry when memcg_low_skipped, by setting memcg_low_reclaim. [xieyisheng1@huawei.com: remove may_thrash field, introduce mem_cgroup_reclaim] Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.com Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Michal Hocko <mhocko@kernel.org> Suggested-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idxMel Gorman
kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks what zone it was recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0. However, the decision on whether to sleep is made on kswapd_classzone_idx which is 0 without a recent wakeup request and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again shortly in the near future there are two consequences -- the pgdat bits that control congestion are cleared prematurely and direct reclaim is more likely as kswapd slept prematurely. This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there has been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones. simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results. Otherwise this patch looks artifically excellent 4.11.0-rc1 4.11.0-rc1 4.11.0-rc1 vanilla clear-v2 keepawake-v2 Amean p50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%) Amean p95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%) Amean p99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%) Amean p50-Write 1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%) Amean p95-Write 412901.57 ( 0.00%) 34874.98 ( 91.55%) 1362.62 ( 99.67%) Amean p99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%) Amean p50-Allocation 78714.31 ( 0.00%) 84246.26 ( -7.03%) 74729.74 ( 5.06%) Amean p95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%) Amean p99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%) With this patch on top, write and allocation latencies are massively improved. The read latencies are slightly impaired but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim. The vmstats are a bit of a mix but the relevant ones are as follows; 4.10.0-rc7 4.10.0-rc7 4.10.0-rc7 mmots-20170209 clear-v1r25keepawake-v1r25 Swap Ins 0 0 0 Swap Outs 0 608 0 Direct pages scanned 6910672 3132699 6357298 Kswapd pages scanned 57036946 82488665 56986286 Kswapd pages reclaimed 55993488 63474329 55939113 Direct pages reclaimed 6905990 2964843 6352115 Kswapd efficiency 98% 76% 98% Kswapd velocity 12494.375 17597.507 12488.065 Direct efficiency 99% 94% 99% Direct velocity 1513.835 668.306 1393.148 Page writes by reclaim 0.000 4410243.000 0.000 Page writes file 0 4409635 0 Page writes anon 0 608 0 Page reclaim immediate 1036792 14175203 1042571 4.11.0-rc1 4.11.0-rc1 4.11.0-rc1 vanilla clear-v2 keepawake-v2 Swap Ins 0 12 0 Swap Outs 0 838 0 Direct pages scanned 6579706 3237270 6256811 Kswapd pages scanned 61853702 79961486 54837791 Kswapd pages reclaimed 60768764 60755788 53849586 Direct pages reclaimed 6579055 2987453 6256151 Kswapd efficiency 98% 75% 98% Page writes by reclaim 0.000 4389496.000 0.000 Page writes file 0 4388658 0 Page writes anon 0 838 0 Page reclaim immediate 1073573 14473009 982507 Swap-outs are equivalent to baseline. Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when it switches from preparing the files for the actual test itself. It's a lot of file IO followed by a lot of allocs that reclaims heavily for a brief window. While direct reclaim is lower with clear-v2, it is due to kswapd scanning aggressively and trying to reclaim the world which is not the right thing to do. With the patches applied, there is still direct reclaim but the phase change from "creating work files" to starting multiple threads that allocate a lot of anonymous memory faster than kswapd can reclaim. Scanning/reclaim efficiency is restored by this patch. Page writes from reclaim context are back at 0 which is ideal. Pages immediately reclaimed after IO completes is slightly improved but it is expected this will vary slightly. On UMA, there is almost no change so this is not expected to be a universal win. [mgorman@suse.de: fix ->kswapd_classzone_idx initialization] Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shantanu Goel <sgoel01@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm, vmscan: only clear pgdat congested/dirty/writeback state when balancedMel Gorman
A pgdat tracks if recent reclaim encountered too many dirty, writeback or congested pages. The flags control whether kswapd writes pages back from reclaim context, tags pages for immediate reclaim when IO completes, whether processes block on wait_iff_congested and whether kswapd blocks when too many pages marked for immediate reclaim are encountered. The state is cleared in a check function with side-effects. With the patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the timing of when the bits get cleared changed. Due to the way the check works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA allocation because it does not account for lowmem reserves properly. For the simoop workload, kswapd is not stalling when it should due to the premature clearing, writing pages from reclaim context like crazy and generally being unhelpful. This patch resets the pgdat bits related to page reclaim only when kswapd is going to sleep. The comparison with simoop is then 4.11.0-rc1 4.11.0-rc1 4.11.0-rc1 vanilla fixcheck-v2 clear-v2 Amean p50-Read 21670074.18 ( 0.00%) 20464344.18 ( 5.56%) 19786774.76 ( 8.69%) Amean p95-Read 25456267.64 ( 0.00%) 25721423.64 ( -1.04%) 24101956.27 ( 5.32%) Amean p99-Read 29369064.73 ( 0.00%) 30174230.76 ( -2.74%) 27691872.71 ( 5.71%) Amean p50-Write 1390.30 ( 0.00%) 1395.28 ( -0.36%) 1011.91 ( 27.22%) Amean p95-Write 412901.57 ( 0.00%) 37737.74 ( 90.86%) 34874.98 ( 91.55%) Amean p99-Write 6668722.09 ( 0.00%) 666489.04 ( 90.01%) 575449.60 ( 91.37%) Amean p50-Allocation 78714.31 ( 0.00%) 86286.22 ( -9.62%) 84246.26 ( -7.03%) Amean p95-Allocation 175533.51 ( 0.00%) 351812.27 (-100.42%) 400058.43 (-127.91%) Amean p99-Allocation 247003.02 ( 0.00%) 6291171.56 (-2447.00%) 10905600.00 (-4315.17%) Read latency is improved, write latency is mostly improved but allocation latency is regressed. kswapd is still reclaiming inefficiently, pages are being written back from writeback context and a host of other issues. However, given the change, it needed to be spelled out why the side-effect was moved. Link: http://lkml.kernel.org/r/20170309075657.25121-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shantanu Goel <sgoel01@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm, vmscan: fix zone balance check in prepare_kswapd_sleepShantanu Goel
Patch series "Reduce amount of time kswapd sleeps prematurely", v2. The series is unusual in that the first patch fixes one problem and introduces other issues that are noted in the changelog. Patch 2 makes a minor modification that is worth considering on its own but leaves the kernel in a state where it behaves badly. It's not until patch 3 that there is an improvement against baseline. This was mostly motivated by examining Chris Mason's "simoop" benchmark which puts the VM under similar pressure to HADOOP. It has been reported that the benchmark has regressed severely during the last number of releases. While I cannot reproduce all the same problems Chris experienced due to hardware limitations, there was a number of problems on a 2-socket machine with a single disk. simoop latencies 4.11.0-rc1 4.11.0-rc1 vanilla keepawake-v2 Amean p50-Read 21670074.18 ( 0.00%) 22668332.52 ( -4.61%) Amean p95-Read 25456267.64 ( 0.00%) 26738688.00 ( -5.04%) Amean p99-Read 29369064.73 ( 0.00%) 30991404.52 ( -5.52%) Amean p50-Write 1390.30 ( 0.00%) 924.91 ( 33.47%) Amean p95-Write 412901.57 ( 0.00%) 1362.62 ( 99.67%) Amean p99-Write 6668722.09 ( 0.00%) 16854.04 ( 99.75%) Amean p50-Allocation 78714.31 ( 0.00%) 74729.74 ( 5.06%) Amean p95-Allocation 175533.51 ( 0.00%) 101609.74 ( 42.11%) Amean p99-Allocation 247003.02 ( 0.00%) 125765.57 ( 49.08%) These are latencies. Read/write are threads reading fixed-size random blocks from a simulated database. The allocation latency is mmaping and faulting regions of memory. The p50, 95 and p99 reports the worst latencies for 50% of the samples, 95% and 99% respectively. For example, the report indicates that while the test was running 99% of writes completed 99.75% faster. It's worth noting that on a UMA machine that no difference in performance with simoop was observed so milage will vary. It's noted that there is a slight impact to read latencies but it's mostly due to IO scheduler decisions and offset by the large reduction in other latencies. This patch (of 3): The check in prepare_kswapd_sleep needs to match the one in balance_pgdat since the latter will return as soon as any one of the zones in the classzone is above the watermark. This is specially important for higher order allocations since balance_pgdat will typically reset the order to zero relying on compaction to create the higher order pages. Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone balance check fails. It was first reported against 4.9.7 that kswapd is failing to wake up kcompactd due to a mismatch in the zone balance check between balance_pgdat() and prepare_kswapd_sleep(). balance_pgdat() returns as soon as a single zone satisfies the allocation but prepare_kswapd_sleep() requires all zones to do +the same. This causes prepare_kswapd_sleep() to never succeed except in the order == 0 case and consequently, wakeup_kcompactd() is never called. For the machine that originally motivated this patch, the state of compaction from /proc/vmstat looked this way after a day and a half +of uptime: compact_migrate_scanned 240496 compact_free_scanned 76238632 compact_isolated 123472 compact_stall 1791 compact_fail 29 compact_success 1762 compact_daemon_wake 0 After applying the patch and about 10 hours of uptime the state looks like this: compact_migrate_scanned 59927299 compact_free_scanned 2021075136 compact_isolated 640926 compact_stall 4 compact_fail 2 compact_success 2 compact_daemon_wake 5160 Further notes from Mel that motivated him to pick this patch up and resend it; It was observed for the simoop workload (pressures the VM similar to HADOOP) that kswapd was failing to keep ahead of direct reclaim. The investigation noted that there was a need to rationalise kswapd decisions to reclaim with kswapd decisions to sleep. With this patch on a 2-socket box, there was a 49% reduction in direct reclaim scanning. However, the impact otherwise is extremely negative. Kswapd reclaim efficiency dropped from 98% to 76%. simoop has three latency-related metrics for read, write and allocation (an anonymous mmap and fault). 4.11.0-rc1 4.11.0-rc1 vanilla fixcheck-v2 Amean p50-Read 21670074.18 ( 0.00%) 20464344.18 ( 5.56%) Amean p95-Read 25456267.64 ( 0.00%) 25721423.64 ( -1.04%) Amean p99-Read 29369064.73 ( 0.00%) 30174230.76 ( -2.74%) Amean p50-Write 1390.30 ( 0.00%) 1395.28 ( -0.36%) Amean p95-Write 412901.57 ( 0.00%) 37737.74 ( 90.86%) Amean p99-Write 6668722.09 ( 0.00%) 666489.04 ( 90.01%) Amean p50-Allocation 78714.31 ( 0.00%) 86286.22 ( -9.62%) Amean p95-Allocation 175533.51 ( 0.00%) 351812.27 (-100.42%) Amean p99-Allocation 247003.02 ( 0.00%) 6291171.56 (-2447.00%) Of greater concern is that the patch causes swapping and page writes from kswapd context rose from 0 pages to 4189753 pages during the hour the workload ran for. By and large, the patch has very bad behaviour but easily missed as the impact on a UMA machine is negligible. This patch is included with the data in case a bisection leads to this area. This patch is also a pre-requisite for the rest of the series. Link: http://lkml.kernel.org/r/20170309075657.25121-2-mgorman@techsingularity.net Signed-off-by: Shantanu Goel <sgoel01@yahoo.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: introduce memalloc_nofs_{save,restore} APIMichal Hocko
GFP_NOFS context is used for the following 5 reasons currently: - to prevent from deadlocks when the lock held by the allocation context would be needed during the memory reclaim - to prevent from stack overflows during the reclaim because the allocation is performed from a deep context already - to prevent lockups when the allocation context depends on other reclaimers to make a forward progress indirectly - just in case because this would be safe from the fs POV - silence lockdep false positives Unfortunately overuse of this allocation context brings some problems to the MM. Memory reclaim is much weaker (especially during heavy FS metadata workloads), OOM killer cannot be invoked because the MM layer doesn't have enough information about how much memory is freeable by the FS layer. In many cases it is far from clear why the weaker context is even used and so it might be used unnecessarily. We would like to get rid of those as much as possible. One way to do that is to use the flag in scopes rather than isolated cases. Such a scope is declared when really necessary, tracked per task and all the allocation requests from within the context will simply inherit the GFP_NOFS semantic. Not only this is easier to understand and maintain because there are much less problematic contexts than specific allocation requests, this also helps code paths where FS layer interacts with other layers (e.g. crypto, security modules, MM etc...) and there is no easy way to convey the allocation context between the layers. Introduce memalloc_nofs_{save,restore} API to control the scope of GFP_NOFS allocation context. This is basically copying memalloc_noio_{save,restore} API we have for other restricted allocation context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is just an alias for PF_FSTRANS which has been xfs specific until recently. There are no more PF_FSTRANS users anymore so let's just drop it. PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags is renamed to current_gfp_context because it now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve their semantic. kmem_flags_convert() doesn't need to evaluate the flag anymore. This patch shouldn't introduce any functional changes. Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS) usage as much as possible and only use a properly documented memalloc_nofs_{save,restore} checkpoints where they are appropriate. [akpm@linux-foundation.org: fix comment typo, reflow comment] Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: reclaim MADV_FREE pagesShaohua Li
When memory pressure is high, we free MADV_FREE pages. If the pages are not dirty in pte, the pages could be freed immediately. Otherwise we can't reclaim them. We put the pages back to anonumous LRU list (by setting SwapBacked flag) and the pages will be reclaimed in normal swapout way. We use normal page reclaim policy. Since MADV_FREE pages are put into inactive file list, such pages and inactive file pages are reclaimed according to their age. This is expected, because we don't want to reclaim too many MADV_FREE pages before used once pages. Based on Minchan's original patch [minchan@kernel.org: clean up lazyfree page handling] Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: delete unnecessary TTU_* flagsShaohua Li
Patch series "mm: fix some MADV_FREE issues", v5. We are trying to use MADV_FREE in jemalloc. Several issues are found. Without solving the issues, jemalloc can't use the MADV_FREE feature. - Doesn't support system without swap enabled. Because if swap is off, we can't or can't efficiently age anonymous pages. And since MADV_FREE pages are mixed with other anonymous pages, we can't reclaim MADV_FREE pages. In current implementation, MADV_FREE will fallback to MADV_DONTNEED without swap enabled. But in our environment, a lot of machines don't enable swap. This will prevent our setup using MADV_FREE. - Increases memory pressure. page reclaim bias file pages reclaim against anonymous pages. This doesn't make sense for MADV_FREE pages, because those pages could be freed easily and refilled with very slight penality. Even page reclaim doesn't bias file pages, there is still an issue, because MADV_FREE pages and other anonymous pages are mixed together. To reclaim a MADV_FREE page, we probably must scan a lot of other anonymous pages, which is inefficient. In our test, we usually see oom with MADV_FREE enabled and nothing without it. - Accounting. There are two accounting problems. We don't have a global accounting. If the system is abnormal, we don't know if it's a problem from MADV_FREE side. The other problem is RSS accounting. MADV_FREE pages are accounted as normal anon pages and reclaimed lazily, so application's RSS becomes bigger. This confuses our workloads. We have monitoring daemon running and if it finds applications' RSS becomes abnormal, the daemon will kill the applications even kernel can reclaim the memory easily. To address the first the two issues, we can either put MADV_FREE pages into a separate LRU list (Minchan's previous patches and V1 patches), or put them into LRU_INACTIVE_FILE list (suggested by Johannes). The patchset use the second idea. The reason is LRU_INACTIVE_FILE list is tiny nowadays and should be full of used once file pages. So we can still efficiently reclaim MADV_FREE pages there without interference with other anon and active file pages. Putting the pages into inactive file list also has an advantage which allows page reclaim to prioritize MADV_FREE pages and used once file pages. MADV_FREE pages are put into the lru list and clear SwapBacked flag, so PageAnon(page) && !PageSwapBacked(page) will indicate a MADV_FREE pages. These pages will directly freed without pageout if they are clean, otherwise normal swap will reclaim them. For the third issue, the previous post adds global accounting and a separate RSS count for MADV_FREE pages. The problem is we never get accurate accounting for MADV_FREE pages. The pages are mapped to userspace, can be dirtied without notice from kernel side. To get accurate accounting, we could write protect the page, but then there is extra page fault overhead, which people don't want to pay. Jemalloc guys have concerns about the inaccurate accounting, so this post drops the accounting patches temporarily. The info exported to /proc/pid/smaps for MADV_FREE pages are kept, which is the only place we can get accurate accounting right now. This patch (of 6): Johannes pointed out TTU_LZFREE is unnecessary. It's true because we always have the flag set if we want to do an unmap. For cases we don't do an unmap, the TTU_LZFREE part of code should never run. Also the TTU_UNMAP is unnecessary. If no other flags set (for example, TTU_MIGRATION), an unmap is implied. The patch includes Johannes's cleanup and dead TTU_ACTION macro removal code Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03Revert "mm, vmscan: account for skipped pages as a partial scan"Johannes Weiner
This reverts commit d7f05528eedb047efe2288cff777676b028747b6. Now that reclaimability of a node is no longer based on the ratio between pages scanned and theoretically reclaimable pages, we can remove accounting tricks for pages skipped due to zone constraints. Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()Johannes Weiner
NR_PAGES_SCANNED counts number of pages scanned since the last page free event in the allocator. This was used primarily to measure the reclaimability of zones and nodes, and determine when reclaim should give up on them. In that role, it has been replaced in the preceding patches by a different mechanism. Being implemented as an efficient vmstat counter, it was automatically exported to userspace as well. It's however unlikely that anyone outside the kernel is using this counter in any meaningful way. Remove the counter and the unused pgdat_reclaimable(). Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: don't avoid high-priority reclaim on memcg limit reclaimJohannes Weiner
Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought to avoid high reclaim priorities for memcg by forcing it to scan a minimum amount of pages when lru_pages >> priority yielded nothing. This was done at a time when reclaim decisions like dirty throttling were tied to the priority level. Nowadays, the only meaningful thing still tied to priority dropping below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally allowed to write. But that is from an era where direct reclaim was still allowed to call ->writepage, and kswapd nowadays avoids writes until it's scanned every clean page in the system. Potential changes to how quick sc->may_writepage could trigger are of little concern. Remove the force_scan stuff, as well as the ugly multi-pass target calculation that it necessitated. Link: http://lkml.kernel.org/r/20170228214007.5621-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: don't avoid high-priority reclaim on unreclaimable nodesJohannes Weiner
Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought to avoid high reclaim priorities for kswapd by forcing it to scan a minimum amount of pages when lru_pages >> priority yielded nothing. Commit b95a2f2d486d ("mm: vmscan: convert global reclaim to per-memcg LRU lists"), due to switching global reclaim to a round-robin scheme over all cgroups, had to restrict this forceful behavior to unreclaimable zones in order to prevent massive overreclaim with many cgroups. The latter patch effectively neutered the behavior completely for all but extreme memory pressure. But in those situations we might as well drop the reclaimers to lower priority levels. Remove the check. Link: http://lkml.kernel.org/r/20170228214007.5621-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: remove seemingly spurious reclaimability check from laptop_mode gatingJohannes Weiner
Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes") allowed laptop_mode=1 to start writing not just when the priority drops to DEF_PRIORITY - 2 but also when the node is unreclaimable. That appears to be a spurious change in this patch as I doubt the series was tested with laptop_mode, and neither is that particular change mentioned in the changelog. Remove it, it's still recent. Link: http://lkml.kernel.org/r/20170228214007.5621-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttlingJohannes Weiner
PF_MEMALLOC direct reclaimers get throttled on a node when the sum of all free pages in each zone fall below half the min watermark. During the summation, we want to exclude zones that don't have reclaimables. Checking the same pgdat over and over again doesn't make sense. Fixes: 599d0c954f91 ("mm, vmscan: move LRU lists to node") Link: http://lkml.kernel.org/r/20170228214007.5621-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: fix 100% CPU kswapd busyloop on unreclaimable nodesJohannes Weiner
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and cleanups". Jia reported a scenario in which the kswapd of a node indefinitely spins at 100% CPU usage. We have seen similar cases at Facebook. The kernel's current method of judging its ability to reclaim a node (or whether to back off and sleep) is based on the amount of scanned pages in proportion to the amount of reclaimable pages. In Jia's and our scenarios, there are no reclaimable pages in the node, however, and the condition for backing off is never met. Kswapd busyloops in an attempt to restore the watermarks while having nothing to work with. This series reworks the definition of an unreclaimable node based not on scanning but on whether kswapd is able to actually reclaim pages in MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria the page allocator uses for giving up on direct reclaim and invoking the OOM killer. If it cannot free any pages, kswapd will go to sleep and leave further attempts to direct reclaim invocations, which will either make progress and re-enable kswapd, or invoke the OOM killer. Patch #1 fixes the immediate problem Jia reported, the remainder are smaller fixlets, cleanups, and overall phasing out of the old method. Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(), and directly related to #5, but in itself not relevant to the series. If the whole series is too ambitious for 4.11, I would consider the first three patches fixes, the rest cleanups. This patch (of 9): Jia He reports a problem with kswapd spinning at 100% CPU when requesting more hugepages than memory available in the system: $ echo 4000 >/proc/sys/vm/nr_hugepages top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01 Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3 At that time, there are no reclaimable pages left in the node, but as kswapd fails to restore the high watermarks it refuses to go to sleep. Kswapd needs to back away from nodes that fail to balance. Up until commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes") kswapd had such a mechanism. It considered zones whose theoretically reclaimable pages it had reclaimed six times over as unreclaimable and backed away from them. This guard was erroneously removed as the patch changed the definition of a balanced node. However, simply restoring this code wouldn't help in the case reported here: there *are* no reclaimable pages that could be scanned until the threshold is met. Kswapd would stay awake anyway. Introduce a new and much simpler way of backing off. If kswapd runs through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single page, make it back off from the node. This is the same number of shots direct reclaim takes before declaring OOM. Kswapd will go to sleep on that node until a direct reclaimer manages to reclaim some pages, thus proving the node reclaimable again. [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count] Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org [shakeelb@google.com: fix condition for throttle_direct_reclaim] Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Jia He <hejianet@gmail.com> Tested-by: Jia He <hejianet@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-02sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h>Ingo Molnar
Update the .c files that depend on these APIs. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24mm, vmscan: clear PGDAT_WRITEBACK when zone is balancedMel Gorman
Hillf Danton pointed out that since commit 1d82de618dd ("mm, vmscan: make kswapd reclaim in terms of nodes") that PGDAT_WRITEBACK is no longer cleared. It was not noticed as triggering it requires pages under writeback to cycle twice through the LRU and before kswapd gets stalled. Historically, such issues tended to occur on small machines writing heavily to slow storage such as a USB stick. Once kswapd stalls, direct reclaim stalls may be higher but due to the fact that memory pressure is required, it would not be very noticable. Michal Hocko suggested removing the flag entirely but the conservative fix is to restore the intended PGDAT_WRITEBACK behaviour and clear the flag when a suitable zone is balanced. Fixes: 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes") Link: http://lkml.kernel.org/r/20170203203222.gq7hk66yc36lpgtb@suse.de Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm: vmscan: move dirty pages out of the way until they're flushedJohannes Weiner
We noticed a performance regression when moving hadoop workloads from 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout activity initiated by kswapd as well as frequent bursts of allocation stalls and direct reclaim scans. Even lowering the dirty ratios to the equivalent of less than 1% of memory would not eliminate the issue, suggesting that dirty pages concentrate where the scanner is looking. This can be traced back to recent efforts of thrash avoidance. Where 3.10 would not detect refaulting pages and continuously supply clean cache to the inactive list, a thrashing workload on 4.0+ will detect and activate refaulting pages right away, distilling used-once pages on the inactive list much more effectively. This is by design, and it makes sense for clean cache. But for the most part our workload's cache faults are refaults and its use-once cache is from streaming writes. We end up with most of the inactive list dirty, and we don't go after the active cache as long as we have use-once pages around. But waiting for writes to avoid reclaiming clean cache that *might* refault is a bad trade-off. Even if the refaults happen, reads are faster than writes. Before getting bogged down on writeback, reclaim should first look at *all* cache in the system, even active cache. To accomplish this, activate pages that are dirty or under writeback when they reach the end of the inactive LRU. The pages are marked for immediate reclaim, meaning they'll get moved back to the inactive LRU tail as soon as they're written back and become reclaimable. But in the meantime, by reducing the inactive list to only immediately reclaimable pages, we allow the scanner to deactivate and refill the inactive list with clean cache from the active list tail to guarantee forward progress. [hannes@cmpxchg.org: update comment] Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm: vmscan: only write dirty pages that the scanner has seen twiceJohannes Weiner
Dirty pages can easily reach the end of the LRU while there are still clean pages to reclaim around. Don't let kswapd write them back just because there are a lot of them. It costs more CPU to find the clean pages, but that's almost certainly better than to disrupt writeback from the flushers with LRU-order single-page writes from reclaim. And the flushers have been woken up by that point, so we spend IO capacity on flushing and CPU capacity on finding the clean cache. Only start writing dirty pages if they have cycled around the LRU twice now and STILL haven't been queued on the IO device. It's possible that the dirty pages are so sparsely distributed across different bdis, inodes, memory cgroups, that the flushers take forever to get to the ones we want reclaimed. Once we see them twice on the LRU, we know that's the quicker way to find them, so do LRU writeback. Link: http://lkml.kernel.org/r/20170123181641.23938-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm: vmscan: remove old flusher wakeup from direct reclaim pathJohannes Weiner
Direct reclaim has been replaced by kswapd reclaim in pretty much all common memory pressure situations, so this code most likely doesn't accomplish the described effect anymore. The previous patch wakes up flushers for all reclaimers when we encounter dirty pages at the tail end of the LRU. Remove the crufty old direct reclaim invocation. Link: http://lkml.kernel.org/r/20170123181641.23938-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>