summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2015-06-22mm/memory_hotplug.c: set zone->wait_table to null after freeing itGu Zheng
commit 85bd839983778fcd0c1c043327b14a046e979b39 upstream. Izumi found the following oops when hot re-adding a node: BUG: unable to handle kernel paging request at ffffc90008963690 IP: __wake_up_bit+0x20/0x70 Oops: 0000 [#1] SMP CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80 Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015 task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000 RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70 RSP: 0018:ffff880017b97be8 EFLAGS: 00010246 RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9 RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648 RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800 R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000 FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0 Call Trace: unlock_page+0x6d/0x70 generic_write_end+0x53/0xb0 xfs_vm_write_end+0x29/0x80 [xfs] generic_perform_write+0x10a/0x1e0 xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs] xfs_file_write_iter+0x79/0x120 [xfs] __vfs_write+0xd4/0x110 vfs_write+0xac/0x1c0 SyS_write+0x58/0xd0 system_call_fastpath+0x12/0x76 Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48 RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70 RSP <ffff880017b97be8> CR2: ffffc90008963690 Reproduce method (re-add a node):: Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic) This seems an use-after-free problem, and the root cause is zone->wait_table was not set to *NULL* after free it in try_offline_node. When hot re-add a node, we will reuse the pgdat of it, so does the zone struct, and when add pages to the target zone, it will init the zone first (including the wait_table) if the zone is not initialized. The judgement of zone initialized is based on zone->wait_table: static inline bool zone_is_initialized(struct zone *zone) { return !!zone->wait_table; } so if we do not set the zone->wait_table to *NULL* after free it, the memory hotplug routine will skip the init of new zone when hot re-add the node, and the wait_table still points to the freed memory, then we will access the invalid address when trying to wake up the waiting people after the i/o operation with the page is done, such as mentioned above. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06mm, numa: really disable NUMA balancing by default on single node machinesMel Gorman
commit b0dc2b9bb4ab782115b964310518ee0b17784277 upstream. NUMA balancing is meant to be disabled by default on UMA machines but the check is using nr_node_ids (highest node) instead of num_online_nodes (online nodes). The consequences are that a UMA machine with a node ID of 1 or higher will enable NUMA balancing. This will incur useless overhead due to minor faults with the impact depending on the workload. These are the impact on the stats when running a kernel build on a single node machine whose node ID happened to be 1: vanilla patched NUMA base PTE updates 5113158 0 NUMA huge PMD updates 643 0 NUMA page range updates 5442374 0 NUMA hint faults 2109622 0 NUMA hint local faults 2109622 0 NUMA hint local percent 100 100 NUMA pages migrated 0 0 Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17mm: soft-offline: fix num_poisoned_pages counting on concurrent eventsNaoya Horiguchi
commit 602498f9aa43d4951eece3fd6ad95a6d0a78d537 upstream. If multiple soft offline events hit one free page/hugepage concurrently, soft_offline_page() can handle the free page/hugepage multiple times, which makes num_poisoned_pages counter increased more than once. This patch fixes this wrong counting by checking TestSetPageHWPoison for normal papes and by checking the return value of dequeue_hwpoisoned_huge_page() for hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17writeback: use |1 instead of +1 to protect against div by zeroTejun Heo
commit 464d1387acb94dc43ba772b35242345e3d2ead1b upstream. mm/page-writeback.c has several places where 1 is added to the divisor to prevent division by zero exceptions; however, if the original divisor is equivalent to -1, adding 1 leads to division by zero. There are three places where +1 is used for this purpose - one in pos_ratio_polynom() and two in bdi_position_ratio(). The second one in bdi_position_ratio() actually triggered div-by-zero oops on a machine running a 3.10 kernel. The divisor is x_intercept - bdi_setpoint + 1 == span + 1 span is confirmed to be (u32)-1. It isn't clear how it ended up that but it could be from write bandwidth calculation underflow fixed by c72efb658f7c ("writeback: fix possible underflow in write bandwidth calculation"). At any rate, +1 isn't a proper protection against div-by-zero. This patch converts all +1 protections to |1. Note that bdi_update_dirty_ratelimit() was already using |1 before this patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17mm/memory-failure: call shake_page() when error hits thp tail pageNaoya Horiguchi
commit 09789e5de18e4e442870b2d700831f5cb802eb05 upstream. Currently memory_failure() calls shake_page() to sweep pages out from pcplists only when the victim page is 4kB LRU page or thp head page. But we should do this for a thp tail page too. Consider that a memory error hits a thp tail page whose head page is on a pcplist when memory_failure() runs. Then, the current kernel skips shake_pages() part, so hwpoison_user_mappings() returns without calling split_huge_page() nor try_to_unmap() because PageLRU of the thp head is still cleared due to the skip of shake_page(). As a result, me_huge_page() runs for the thp, which is broken behavior. One effect is a leak of the thp. And another is to fail to isolate the memory error, so later access to the error address causes another MCE, which kills the processes which used the thp. This patch fixes this problem by calling shake_page() for thp tail case. Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUSLinus Torvalds
commit 9c145c56d0c8a0b62e48c8d71e055ad0fb2012ba upstream. The stack guard page error case has long incorrectly caused a SIGBUS rather than a SIGSEGV, but nobody actually noticed until commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") because that error case was never actually triggered in any normal situations. Now that we actually report the error, people noticed the wrong signal that resulted. So far, only the test suite of libsigsegv seems to have actually cared, but there are real applications that use libsigsegv, so let's not wait for any of those to break. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29vm: add VM_FAULT_SIGSEGV handling supportLinus Torvalds
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream. The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [shengyong: Backport to 3.14 - adjust context - ignore modification for arch nios2, because 3.14 does not support it - add SIGSEGV handling to powerpc/cell spu_fault.c, because 3.14 does not separate it to copro_fault.c - add SIGSEGV handling to mm/memory.c, because 3.14 does not separate it to gup.c ] Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()Naoya Horiguchi
commit 9ab3b598d2dfbdb0153ffa7e4b1456bbff59a25d upstream. A race condition starts to be visible in recent mmotm, where a PG_hwpoison flag is set on a migration source page *before* it's back in buddy page poo= l. This is problematic because no page flag is supposed to be set when freeing (see __free_one_page().) So the user-visible effect of this race is that it could trigger the BUG_ON() when soft-offlining is called. The root cause is that we call lru_add_drain_all() to make sure that the page is in buddy, but that doesn't work because this function just schedule= s a work item and doesn't wait its completion. drain_all_pages() does drainin= g directly, so simply dropping lru_add_drain_all() solves this problem. [n-horiguchi@ah.jp.nec.com: resolve conflict to apply on v3.11.10] Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: <stable@vger.kernel.org> [3.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19writeback: fix possible underflow in write bandwidth calculationTejun Heo
commit c72efb658f7c8b27ca3d0efb5cfd5ded9fcac89e upstream. From 1ebf33901ecc75d9496862dceb1ef0377980587c Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Mon, 23 Mar 2015 00:08:19 -0400 2f800fbd777b ("writeback: fix dirtied pages accounting on redirty") introduced account_page_redirty() which reverts stat updates for a redirtied page, making BDI_DIRTIED no longer monotonically increasing. bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the basis for bandwidth calculation. While unlikely, since the above patch, the newer value may be lower than the recorded past value and underflow the bandwidth calculation leading to a wild result. Fix it by subtracing min of the old and new values when calculating delta. AFAIK, there hasn't been any report of it happening but the resulting erratic behavior would be non-critical and temporary, so it's possible that the issue is happening without being reported. The risk of the fix is very low, so tagged for -stable. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Fixes: 2f800fbd777b ("writeback: fix dirtied pages accounting on redirty") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth()Tejun Heo
commit 7d70e15480c0450d2bfafaad338a32e884fc215e upstream. global_update_bandwidth() uses static variable update_time as the timestamp for the last update but forgets to initialize it to INITIALIZE_JIFFIES. This means that global_dirty_limit will be 5 mins into the future on 32bit and some large amount jiffies into the past on 64bit. This isn't critical as the only effect is that global_dirty_limit won't be updated for the first 5 mins after booting on 32bit machines, especially given the auxiliary nature of global_dirty_limit's role - protecting against global dirty threshold's sudden dips; however, it does lead to unintended suboptimal behavior. Fix it. Fixes: c42843f2f0bb ("writeback: introduce smoothed global dirty limit") Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19mm/memory hotplug: postpone the reset of obsolete pgdatGu Zheng
commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 upstream. Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under stress condition: BUG: unable to handle kernel paging request at 0000000000025f60 IP: next_online_pgdat+0x1/0x50 PGD 0 Oops: 0000 [#1] SMP ACPI: Device does not support D3cold Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1 Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015 Workqueue: events vmstat_update task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000 RIP: 0010: next_online_pgdat+0x1/0x50 RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286 RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082 RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000 RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96 R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440 R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800 FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: refresh_cpu_vm_stats+0xd0/0x140 vmstat_update+0x11/0x50 process_one_work+0x194/0x3d0 worker_thread+0x12b/0x410 kthread+0xc6/0xd0 ret_from_fork+0x7c/0xb0 The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of try_offline_node, which will reset all the content of pgdat to 0, as the pgdat is accessed lock-free, so that the users still using the pgdat will panic, such as the vmstat_update routine. process A: offline node XX: vmstat_updat() refresh_cpu_vm_stats() for_each_populated_zone() find online node XX cond_resched() offline cpu and memory, then try_offline_node() node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat)) zone = next_zone(zone) pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now next_online_pgdat(pgdat) next_online_node(pgdat->node_id); // NULL pointer access So the solution here is postponing the reset of obsolete pgdat from try_offline_node() to hotadd_new_pgdat(), and just resetting pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0 to avoid breaking pointer information in pgdat. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Xishi Qiu <qiuxishi@huawei.com> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm/memory.c: actually remap enough memoryGrazvydas Ignotas
commit 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 upstream. For whatever reason, generic_access_phys() only remaps one page, but actually allows to access arbitrary size. It's quite easy to trigger large reads, like printing out large structure with gdb, which leads to a crash. Fix it by remapping correct size. Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Signed-off-by: Grazvydas Ignotas <notasas@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm/compaction: fix wrong order check in compact_finished()Joonsoo Kim
commit 372549c2a3778fd3df445819811c944ad54609ca upstream. What we want to check here is whether there is highorder freepage in buddy list of other migratetype in order to steal it without fragmentation. But, current code just checks cc->order which means allocation request order. So, this is wrong. Without this fix, non-movable synchronous compaction below pageblock order would not stopped until compaction is complete, because migratetype of most pageblocks are movable and high order freepage made by compaction is usually on movable type buddy list. There is some report related to this bug. See below link. http://www.spinics.net/lists/linux-mm/msg81666.html Although the issued system still has load spike comes from compaction, this makes that system completely stable and responsive according to his report. stress-highalloc test in mmtests with non movable order 7 allocation doesn't show any notable difference in allocation success rate, but, it shows more compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 18.47 : 28.94 Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()Roman Gushchin
commit 8138a67a5557ffea3a21dfd6f037842d4e748513 upstream. I noticed that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case, a huge allocation can success and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall (system-wide), so system become unusable. The problem was masked out by commit c9b1d0981fcc ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce it on older kernels: 1) set overcommit_memory sysctl to 2 2) mmap() large file multiple times (with VM_SHARED flag) 3) try to malloc() large amount of memory It also can be reproduced on newer kernels, but miss-configured sysctl_user_reserve_kbytes is required. Fix this issue by switching to signed arithmetic here. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()Roman Gushchin
commit 5703b087dc8eaf47bfb399d6cf512d471beff405 upstream. I noticed, that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case, a huge allocation can success and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall (system-wide), so system become unusable. The problem was masked out by commit c9b1d0981fcc ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce it on older kernels: 1) set overcommit_memory sysctl to 2 2) mmap() large file multiple times (with VM_SHARED flag) 3) try to malloc() large amount of memory It also can be reproduced on newer kernels, but miss-configured sysctl_user_reserve_kbytes is required. Fix this issue by switching to signed arithmetic here. [akpm@linux-foundation.org: use min_t] Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm: when stealing freepages, also take pages created by splitting buddy pageVlastimil Babka
commit 99592d598eca62bdbbf62b59941c189176dfc614 upstream. When studying page stealing, I noticed some weird looking decisions in try_to_steal_freepages(). The first I assume is a bug (Patch 1), the following two patches were driven by evaluation. Testing was done with stress-highalloc of mmtests, using the mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how often page stealing occurs for individual migratetypes, and what migratetypes are used for fallbacks. Arguably, the worst case of page stealing is when UNMOVABLE allocation steals from MOVABLE pageblock. RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal, so the goal is to minimize these two cases. The evaluation of v2 wasn't always clear win and Joonsoo questioned the results. Here I used different baseline which includes RFC compaction improvements from [1]. I found that the compaction improvements reduce variability of stress-highalloc, so there's less noise in the data. First, let's look at stress-highalloc configured to do sync compaction, and how these patches reduce page stealing events during the test. First column is after fresh reboot, other two are reiterations of test without reboot. That was all accumulater over 5 re-iterations (so the benchmark was run 5x3 times with 5 fresh restarts). Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Page alloc extfrag event 10264225 8702233 10244125 Extfrag fragmenting 10263271 8701552 10243473 Extfrag fragmenting for unmovable 13595 17616 15960 Extfrag fragmenting unmovable placed with movable 7989 12193 8447 Extfrag fragmenting for reclaimable 658 1840 1817 Extfrag fragmenting reclaimable placed with movable 558 1677 1679 Extfrag fragmenting for movable 10249018 8682096 10225696 With Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Page alloc extfrag event 11834954 9877523 9774860 Extfrag fragmenting 11833993 9876880 9774245 Extfrag fragmenting for unmovable 7342 16129 11712 Extfrag fragmenting unmovable placed with movable 4191 10547 6270 Extfrag fragmenting for reclaimable 373 1130 923 Extfrag fragmenting reclaimable placed with movable 302 906 738 Extfrag fragmenting for movable 11826278 9859621 9761610 With Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Page alloc extfrag event 4725990 3668793 3807436 Extfrag fragmenting 4725104 3668252 3806898 Extfrag fragmenting for unmovable 6678 7974 7281 Extfrag fragmenting unmovable placed with movable 2051 3829 4017 Extfrag fragmenting for reclaimable 429 1208 1278 Extfrag fragmenting reclaimable placed with movable 369 976 1034 Extfrag fragmenting for movable 4717997 3659070 3798339 With Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Page alloc extfrag event 5016183 4700142 3850633 Extfrag fragmenting 5015325 4699613 3850072 Extfrag fragmenting for unmovable 1312 3154 3088 Extfrag fragmenting unmovable placed with movable 1115 2777 2714 Extfrag fragmenting for reclaimable 437 1193 1097 Extfrag fragmenting reclaimable placed with movable 330 969 879 Extfrag fragmenting for movable 5013576 4695266 3845887 In v2 we've seen apparent regression with Patch 1 for unmovable events, this is now gone, suggesting it was indeed noise. Here, each patch improves the situation for unmovable events. Reclaimable is improved by patch 1 and then either the same modulo noise, or perhaps sligtly worse - a small price for unmovable improvements, IMHO. The number of movable allocations falling back to other migratetypes is most noisy, but it's reduced to half at Patch 2 nevertheless. These are least critical as compaction can move them around. If we look at success rates, the patches don't affect them, that didn't change. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%) Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%) Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%) Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%) Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%) Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%) Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%) Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%) Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%) Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%) Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%) Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%) Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%) Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%) Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%) Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%) Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%) Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%) Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%) Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%) Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%) Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%) Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%) Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%) Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%) Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%) Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%) Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%) While there's no improvement here, I consider reduced fragmentation events to be worth on its own. Patch 2 also seems to reduce scanning for free pages, and migrations in compaction, suggesting it has somewhat less work to do: Patch 1: Compaction stalls 4153 3959 3978 Compaction success 1523 1441 1446 Compaction failures 2630 2517 2531 Page migrate success 4600827 4943120 5104348 Page migrate failure 19763 16656 17806 Compaction pages isolated 9597640 10305617 10653541 Compaction migrate scanned 77828948 86533283 87137064 Compaction free scanned 517758295 521312840 521462251 Compaction cost 5503 5932 6110 Patch 2: Compaction stalls 3800 3450 3518 Compaction success 1421 1316 1317 Compaction failures 2379 2134 2201 Page migrate success 4160421 4502708 4752148 Page migrate failure 19705 14340 14911 Compaction pages isolated 8731983 9382374 9910043 Compaction migrate scanned 98362797 96349194 98609686 Compaction free scanned 496512560 469502017 480442545 Compaction cost 5173 5526 5811 As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers of unmovable and reclaimable pageblocks. Configuring the benchmark to allocate like THP page fault (i.e. no sync compaction) gives much noisier results for iterations 2 and 3 after reboot. This is not so surprising given how [1] offers lower improvements in this scenario due to less restarts after deferred compaction which would change compaction pivot. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-thp-1 5-thp-2 5-thp-3 Page alloc extfrag event 8148965 6227815 6646741 Extfrag fragmenting 8147872 6227130 6646117 Extfrag fragmenting for unmovable 10324 12942 15975 Extfrag fragmenting unmovable placed with movable 5972 8495 10907 Extfrag fragmenting for reclaimable 601 1707 2210 Extfrag fragmenting reclaimable placed with movable 520 1570 2000 Extfrag fragmenting for movable 8136947 6212481 6627932 Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-thp-1 6-thp-2 6-thp-3 Page alloc extfrag event 8345457 7574471 7020419 Extfrag fragmenting 8343546 7573777 7019718 Extfrag fragmenting for unmovable 10256 18535 30716 Extfrag fragmenting unmovable placed with movable 6893 11726 22181 Extfrag fragmenting for reclaimable 465 1208 1023 Extfrag fragmenting reclaimable placed with movable 353 996 843 Extfrag fragmenting for movable 8332825 7554034 6987979 Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-thp-1 7-thp-2 7-thp-3 Page alloc extfrag event 3512847 3020756 2891625 Extfrag fragmenting 3511940 3020185 2891059 Extfrag fragmenting for unmovable 9017 6892 6191 Extfrag fragmenting unmovable placed with movable 1524 3053 2435 Extfrag fragmenting for reclaimable 445 1081 1160 Extfrag fragmenting reclaimable placed with movable 375 918 986 Extfrag fragmenting for movable 3502478 3012212 2883708 Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-thp-1 8-thp-2 8-thp-3 Page alloc extfrag event 3181699 3082881 2674164 Extfrag fragmenting 3180812 3082303 2673611 Extfrag fragmenting for unmovable 1201 4031 4040 Extfrag fragmenting unmovable placed with movable 974 3611 3645 Extfrag fragmenting for reclaimable 478 1165 1294 Extfrag fragmenting reclaimable placed with movable 387 985 1030 Extfrag fragmenting for movable 3179133 3077107 2668277 The improvements for first iteration are clear, the rest is much noisier and can appear like regression for Patch 1. Anyway, patch 2 rectifies it. Allocation success rates are again unaffected so there's no point in making this e-mail any longer. [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2 This patch (of 3): When __rmqueue_fallback() is called to allocate a page of order X, it will find a page of order Y >= X of a fallback migratetype, which is different from the desired migratetype. With the help of try_to_steal_freepages(), it may change the migratetype (to the desired one) also of: 1) all currently free pages in the pageblock containing the fallback page 2) the fallback pageblock itself 3) buddy pages created by splitting the fallback page (when Y > X) These decisions take the order Y into account, as well as the desired migratetype, with the goal of preventing multiple fallback allocations that could e.g. distribute UNMOVABLE allocations among multiple pageblocks. Originally, decision for 1) has implied the decision for 3). Commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that (probably unintentionally) so that the buddy pages in case 3) are always changed to the desired migratetype, except for CMA pageblocks. Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code and fix a bug") did some refactoring and added a comment that the case of 3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type") removed the comment and tried to restore the original behavior where 1) implies 3), but due to the previous refactoring, the result is instead that only 2) implies 3) - and the conditions for 2) are less frequently met than conditions for 1). This may increase fragmentation in situations where the code decides to steal all free pages from the pageblock (case 1)), but then gives back the buddy pages produced by splitting. This patch restores the original intended logic where 1) implies 3). During testing with stress-highalloc from mmtests, this has shown to decrease the number of events where UNMOVABLE and RECLAIMABLE allocations steal from MOVABLE pageblocks, which can lead to permanent fragmentation. In some cases it has increased the number of events when MOVABLE allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are fixable by sync compaction and thus less harmful. Note that evaluation has shown that the behavior introduced by 47118af076f6 for buddy pages in case 3) is actually even better than the original logic, so the following patch will introduce it properly once again. For stable backports of this patch it makes thus sense to only fix versions containing 0cbef29a7821. [iamjoonsoo.kim@lge.com: tracepoint fix] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm/hugetlb: add migration entry check in __unmap_hugepage_rangeNaoya Horiguchi
commit 9fbc1f635fd0bd28cb32550211bf095753ac637a upstream. If __unmap_hugepage_range() tries to unmap the address range over which hugepage migration is on the way, we get the wrong page because pte_page() doesn't work for migration entries. This patch simply clears the pte for migration entries as we do for hwpoison entries. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protectionNaoya Horiguchi
commit a8bda28d87c38c6aa93de28ba5d30cc18e865a11 upstream. There is a race condition between hugepage migration and change_protection(), where hugetlb_change_protection() doesn't care about migration entries and wrongly overwrites them. That causes unexpected results like kernel crash. HWPoison entries also can cause the same problem. This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this function to do proper actions. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06mm/hugetlb: pmd_huge() returns true for non-present hugepageNaoya Horiguchi
commit cbef8478bee55775ac312a574aad48af7bb9cf9f upstream. Migrating hugepages and hwpoisoned hugepages are considered as non-present hugepages, and they are referenced via migration entries and hwpoison entries in their page table slots. This behavior causes race condition because pmd_huge() doesn't tell non-huge pages from migrating/hwpoisoned hugepages. follow_page_mask() is one example where the kernel would call follow_page_pte() for such hugepage while this function is supposed to handle only normal pages. To avoid this, this patch makes pmd_huge() return true when pmd_none() is true *and* pmd_present() is false. We don't have to worry about mixing up non-present pmd entry with normal pmd (pointing to leaf level pte entry) because pmd_present() is true in normal pmd. The same race condition could happen in (x86-specific) gup_pmd_range(), where this patch simply adds pmd_present() check instead of pmd_huge(). This is because gup_pmd_range() is fast path. If we have non-present hugepage in this function, we will go into gup_huge_pmd(), then return 0 at flag mask check, and finally fall back to the slow path. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_rangeShiraz Hashim
commit 23aaed6659df9adfabe9c583e67a36b54e21df46 upstream. walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). Userspace applications get the wrong data, so the effect is like just confusing users (if the applications just display the data) or sometimes killing the processes (if the applications do something with misunderstanding virtual addresses due to the wrong data.) For example for pagemap_read, when no callbacks are called against VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual address range at wrong index. Eventually userspace may get wrong pagemap data for a task. Corresponding to a VM_PFNMAP marked vma region, kernel may report mappings from subsequent vma regions. User space in turn may account more pages (than really are) to the task. In my case I was using procmem, procrack (Android utility) which uses pagemap interface to account RSS pages of a task. Due to this bug it was giving a wrong picture for vmas (with VM_PFNMAP set). Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas") Signed-off-by: Shiraz Hashim <shashim@codeaurora.org> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: get rid of radix tree gfp mask for pagecache_get_pageMichal Hocko
commit 45f87de57f8fad59302fd263dd81ffa4843b5b24 upstream. Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") has added a separate parameter for specifying gfp mask for radix tree allocations. Not only this is less than optimal from the API point of view because it is error prone, it is also buggy currently because grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then the radix tree allocation wouldn't obey the restriction and might recurse into filesystem and cause deadlocks. This is the case for most filesystems unfortunately because only ext4 and gfs2 are using AOP_FLAG_NOFS. Let's simply remove radix_gfp_mask parameter because the allocation context is same for both page cache and for the radix tree. Just make sure that the radix tree gets only the sane subset of the mask (e.g. do not pass __GFP_WRITE). Long term it is more preferable to convert remaining users of AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this interface even further. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: reduce cost of the fair zone allocation policyMel Gorman
commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream. The fair zone allocation policy round-robins allocations between zones within a node to avoid age inversion problems during reclaim. If the first allocation fails, the batch counts are reset and a second attempt made before entering the slow path. One assumption made with this scheme is that batches expire at roughly the same time and the resets each time are justified. This assumption does not hold when zones reach their low watermark as the batches will be consumed at uneven rates. Allocation failure due to watermark depletion result in additional zonelist scans for the reset and another watermark check before hitting the slowpath. On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA machine it's variable due to the variability of measuring overhead with the vmstat changes. The system CPU overhead comparison looks like 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanilla vmstat-v5 lowercost-v5 User 746.94 774.56 802.00 System 65336.22 32847.27 40852.33 Elapsed 27553.52 27415.04 27368.46 However it is worth noting that the overall benchmark still completed faster and intuitively it makes sense to take as few passes as possible through the zonelists. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: abort fair zone allocation policy when remotes nodes are ↵Mel Gorman
encountered commit f7b5d647946aae1647bf5cd26c16b3a793c1ac49 upstream. The purpose of numa_zonelist_order=zone is to preserve lower zones for use with 32-bit devices. If locality is preferred then the numa_zonelist_order=node policy should be used. Unfortunately, the fair zone allocation policy overrides this by skipping zones on remote nodes until the lower one is found. While this makes sense from a page aging and performance perspective, it breaks the expected zonelist policy. This patch restores the expected behaviour for zone-list ordering. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: vmscan: only update per-cpu thresholds for online CPUMel Gorman
commit bb0b6dffa2ccfbd9747ad0cc87c7459622896e60 upstream. When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered to get more accurate counts to avoid breaching watermarks. This threshold update iterates over all possible CPUs which is unnecessary. Only online CPUs need to be updated. If a new CPU is onlined, refresh_zone_stat_thresholds() will set the thresholds correctly. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: move zone->pages_scanned into a vmstat counterMel Gorman
commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream. zone->pages_scanned is a write-intensive cache line during page reclaim and it's also updated during page free. Move the counter into vmstat to take advantage of the per-cpu updates and do not update it in the free paths unless necessary. On a small UMA machine running tiobench the difference is marginal. On a 4-node machine the overhead is more noticable. Note that automatic NUMA balancing was disabled for this test as otherwise the system CPU overhead is unpredictable. 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5 vmstat-v5 User 746.94 759.78 774.56 System 65336.22 58350.98 32847.27 Elapsed 27553.52 27282.02 27415.04 Note that the overhead reduction will vary depending on where exactly pages are allocated and freed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: rearrange zone fields into read-only, page alloc, statistics and page ↵Mel Gorman
reclaim lines commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream. The arrangement of struct zone has changed over time and now it has reached the point where there is some inappropriate sharing going on. On x86-64 for example o The zone->node field is shared with the zone lock and zone->node is accessed frequently from the page allocator due to the fair zone allocation policy. o span_seqlock is almost never used by shares a line with free_area o Some zone statistics share a cache line with the LRU lock so reclaim-intensive and allocator-intensive workloads can bounce the cache line on a stat update This patch rearranges struct zone to put read-only and read-mostly fields together and then splits the page allocator intensive fields, the zone statistics and the page reclaim intensive fields into their own cache lines. Note that the type of lowmem_reserve changes due to the watermark calculations being signed and avoiding a signed/unsigned conversion there. On the test configuration I used the overall size of struct zone shrunk by one cache line. On smaller machines, this is not likely to be noticable. However, on a 4-node NUMA machine running tiobench the system CPU overhead is reduced by this patch. 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5r9 User 746.94 759.78 System 65336.22 58350.98 Elapsed 27553.52 27282.02 Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: pagemap: avoid unnecessary overhead when tracepoints are deactivatedMel Gorman
commit 24b7e5819ad5cbef2b7c7376510862aa8319d240 upstream. This was formerly the series "Improve sequential read throughput" which noted some major differences in performance of tiobench since 3.0. While there are a number of factors, two that dominated were the introduction of the fair zone allocation policy and changes to CFQ. The behaviour of fair zone allocation policy makes more sense than tiobench as a benchmark and CFQ defaults were not changed due to insufficient benchmarking. This series is what's left. It's one functional fix to the fair zone allocation policy when used on NUMA machines and a reduction of overhead in general. tiobench was used for the comparison despite its flaws as an IO benchmark as in this case we are primarily interested in the overhead of page allocator and page reclaim activity. On UMA, it makes little difference to overhead 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 383.61 386.77 System 403.83 401.74 Elapsed 5411.50 5413.11 On a 4-socket NUMA machine it's a bit more noticable 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 746.94 802.00 System 65336.22 40852.33 Elapsed 27553.52 27368.46 This patch (of 6): The LRU insertion and activate tracepoints take PFN as a parameter forcing the overhead to the caller. Move the overhead to the tracepoint fast-assign method to ensure the cost is only incurred when the tracepoint is active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29memcg, vmscan: Fix forced scan of anonymous pagesJerome Marchand
commit 2ab051e11bfa3cbb7b24177f3d6aaed10a0d743e upstream. When memory cgoups are enabled, the code that decides to force to scan anonymous pages in get_scan_count() compares global values (free, high_watermark) to a value that is restricted to a memory cgroup (file). It make the code over-eager to force anon scan. For instance, it will force anon scan when scanning a memcg that is mainly populated by anonymous page, even when there is plenty of file pages to get rid of in others memcgs, even when swappiness == 0. It breaks user's expectation about swappiness and hurts performance. This patch makes sure that forced anon scan only happens when there not enough file pages for the all zone, not just in one random memcg. [hannes@cmpxchg.org: cleanups] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29vmalloc: use rcu list iterator to reduce vmap_area_lock contentionJoonsoo Kim
commit 474750aba88817c53f39424e5567b8e4acc4b39b upstream. Richard Yao reported a month ago that his system have a trouble with vmap_area_lock contention during performance analysis by /proc/meminfo. Andrew asked why his analysis checks /proc/meminfo stressfully, but he didn't answer it. https://lkml.org/lkml/2014/4/10/416 Although I'm not sure that this is right usage or not, there is a solution reducing vmap_area_lock contention with no side-effect. That is just to use rcu list iterator in get_vmalloc_info(). rcu can be used in this function because all RCU protocol is already respected by writers, since Nick Piggin commit db64fe02258f1 ("mm: rewrite vmap layer") back in linux-2.6.28 Specifically : insertions use list_add_rcu(), deletions use list_del_rcu() and kfree_rcu(). Note the rb tree is not used from rcu reader (it would not be safe), only the vmap_area_list has full RCU protection. Note that __purge_vmap_area_lazy() already uses this rcu protection. rcu_read_lock(); list_for_each_entry_rcu(va, &vmap_area_list, list) { if (va->flags & VM_LAZY_FREE) { if (va->va_start < *start) *start = va->va_start; if (va->va_end > *end) *end = va->va_end; nr += (va->va_end - va->va_start) >> PAGE_SHIFT; list_add_tail(&va->purge_list, &valist); va->flags |= VM_LAZY_FREEING; va->flags &= ~VM_LAZY_FREE; } } rcu_read_unlock(); Peter: : While rcu list traversal over the vmap_area_list is safe, this may : arrive at different results than the spinlocked version. The rcu list : traversal version will not be a 'snapshot' of a single, valid instant : of the entire vmap_area_list, but rather a potential amalgam of : different list states. Joonsoo: : Yes, you are right, but I don't think that we should be strict here. : Meminfo is already not a 'snapshot' at specific time. While we try to get : certain stats, the other stats can change. And, although we may arrive at : different results than the spinlocked version, the difference would not be : large and would not make serious side-effect. [edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: make copy_pte_range static againJerome Marchand
commit 21bda264f4243f61dfcc485174055f12ad0530b4 upstream. Commit 71e3aac0724f ("thp: transparent hugepage core") adds copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this function have been used outside of memory.c, but it currently isn't. This patch makes copy_pte_range() static again. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_modeDavid Rientjes
commit 14a4e2141e24304fff2c697be6382ffb83888185 upstream. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") improved the previous khugepaged logic which allocated a transparent hugepages from the node of the first page being collapsed. However, it is still possible to collapse pages to remote memory which may suffer from additional access latency. With the current policy, it is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely if the majority are allocated from that node. When zone_reclaim_mode is enabled, it means the VM should make every attempt to allocate locally to prevent NUMA performance degradation. In this case, we do not want to collapse hugepages to remote nodes that would suffer from increased access latency. Thus, when zone_reclaim_mode is enabled, only allow collapsing to nodes with RECLAIM_DISTANCE or less. There is no functional change for systems that disable zone_reclaim_mode. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()Hugh Dickins
commit c0d73261f5c1355a35b8b40e871d31578ce0c044 upstream. Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte upon which all subsequent decisions and pte_same() tests will be made. I have no evidence that its lack is responsible for the mm/filemap.c:202 BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity, and I am not optimistic that it will fix it. But I have found no other explanation, and ACCESS_ONCE() here will surely not hurt. If gcc does re-access the pte before passing it down, then that would be disastrous for correct page fault handling, and certainly could explain the page_mapped() BUGs seen (concurrent fault causing page to be mapped in a second time on top of itself: mapcount 2 for a single pte). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29shmem: fix init_page_accessed use to stop !PageLRU bugHugh Dickins
commit 66d2f4d28cd030220e7ea2a628993fcabcb956d1 upstream. Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU) in isolate_lru_pages() at mm/vmscan.c:1281! Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") looks like interrupted work-in-progress. mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's - shmem_write_begin() is clearly wrong to use it after shmem_getpage(), when the page is always visible in radix_tree, and often already on LRU. Revert change to shmem_write_begin(), and use init_page_accessed() or mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp(). SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed() before; but since many other filesystems use [__]page_symlink(), which did and does mark the page accessed, consider this as rectifying an oversight. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: avoid unnecessary atomic operations during end_page_writeback()Mel Gorman
commit 888cf2db475a256fb0cda042140f73d7881f81fe upstream. If a page is marked for immediate reclaim then it is moved to the tail of the LRU list. This occurs when the system is under enough memory pressure for pages under writeback to reach the end of the LRU but we test for this using atomic operations on every writeback. This patch uses an optimistic non-atomic test first. It'll miss some pages in rare cases but the consequences are not severe enough to warrant such a penalty. While the function does not dominate profiles during a simple dd test the cost of it is reduced. 73048 0.7428 vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback 23740 0.2409 vmlinux-3.15.0-rc5-lessatomic end_page_writeback Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman
possible commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream. aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: do not use unnecessary atomic operations when adding pages to the LRUMel Gorman
commit 6fb81a17d21f2a138b8f424af4cf379f2b694060 upstream. When adding pages to the LRU we clear the active bit unconditionally. As the page could be reachable from other paths we cannot use unlocked operations without risk of corruption such as a parallel mark_page_accessed. This patch tests if is necessary to clear the active flag before using an atomic operation. This potentially opens a tiny race when PageActive is checked as mark_page_accessed could be called after PageActive was checked. The race already exists but this patch changes it slightly. The consequence is that that the page may be promoted to the active list that might have been left on the inactive list before the patch. It's too tiny a race and too marginal a consequence to always use atomic operations for. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: do not use atomic operations when releasing pagesMel Gorman
commit e3741b506c5088fa8c911bb5884c430f770fb49d upstream. There should be no references to it any more and a parallel mark should not be reordered against us. Use non-locked varient to clear page active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: shmem: avoid atomic operation during shmem_getpage_gfpMel Gorman
commit 07a427884348d38a6fd56fa4d78249c407196650 upstream. shmem_getpage_gfp uses an atomic operation to set the SwapBacked field before it's even added to the LRU or visible. This is unnecessary as what could it possible race against? Use an unlocked variant. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: lookup pageblock migratetype with IRQs enabled during freeMel Gorman
commit cfc47a2803db42140167b92d991ef04018e162c7 upstream. get_pageblock_migratetype() is called during free with IRQs disabled. This is unnecessary and disables IRQs for longer than necessary. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: convert hot/cold parameter and immediate callers to boolMel Gorman
commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream. cold is a bool, make it one. Make the likely case the "if" part of the block instead of the else as according to the optimisation manual this is preferred. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: reduce number of times page_to_pfn is calledMel Gorman
commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream. In the free path we calculate page_to_pfn multiple times. Reduce that. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: use unsigned int for order in more placesMel Gorman
commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream. X86 prefers the use of unsigned types for iterators and there is a tendency to mix whether a signed or unsigned type if used for page order. This converts a number of sites in mm/page_alloc.c to use unsigned int for order where possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast pathMel Gorman
commit 5dab29113ca56335c78be3f98bf5ddf2ef8eb6a6 upstream. ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these cases are relatively rare events but the ALLOC_NO_WATERMARK check is an unlikely branch in the fast path. This patch moves the check out of the fast path and after it has been determined that the watermarks have not been met. This helps the common fast path at the cost of making the slow path slower and hitting kswapd with a performance cost. It's a reasonable tradeoff. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: only check the alloc flags and gfp_mask for dirty onceMel Gorman
commit a6e21b14f22041382e832d30deda6f26f37b1097 upstream. Currently it's calculated once per zone in the zonelist. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: only check the zone id check if pages are buddiesMel Gorman
commit d34c5fa06fade08a689fc171bf756fba2858ae73 upstream. A node/zone index is used to check if pages are compatible for merging but this happens unconditionally even if the buddy page is not free. Defer the calculation as long as possible. Ideally we would check the zone boundary but nodes can overlap. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: calculate classzone_idx once from the zonelist refMel Gorman
commit d8846374a85f4290a473a4e2a64c1ba046c4a0e1 upstream. There is no need to calculate zone_idx(preferred_zone) multiple times or use the pgdat to figure it out. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: use jump labels to avoid checking number_of_cpusetsMel Gorman
commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream. If cpusets are not in use then we still check a global variable on every page allocation. Use jump labels to avoid the overhead. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: do not treat a zone that cannot be used for dirty pages as ↵Mel Gorman
"full" commit 800a1e750c7b04c2aa2459afca77e936e01c0029 upstream. If a zone cannot be used for a dirty page then it gets marked "full" which is cached in the zlc and later potentially skipped by allocation requests that have nothing to do with dirty zones. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: do not update zlc unless the zlc is activeMel Gorman
commit 65bb371984d6a2c909244eb749e482bb40b72e36 upstream. The zlc is used on NUMA machines to quickly skip over zones that are full. However it is always updated, even for the first zone scanned when the zlc might not even be active. As it's a write to a bitmap that potentially bounces cache line it's deceptively expensive and most machines will not care. Only update the zlc if it was active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm/swap.c: clean up *lru_cache_add* functionsJianyu Zhan
commit 2329d3751b082b4fd354f334a88662d72abac52d upstream. In mm/swap.c, __lru_cache_add() is exported, but actually there are no users outside this file. This patch unexports __lru_cache_add(), and makes it static. It also exports lru_cache_add_file(), as it is use by cifs and fuse, which can loaded as modules. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>