aboutsummaryrefslogtreecommitdiffstats
path: root/arch/arm64/include/asm/pgtable.h
AgeCommit message (Collapse)Author
2019-11-12arm64: Do not mask out PTE_RDONLY in pte_same()Catalin Marinas
commit 6767df245f4736d0cf0c6fb7cf9cf94b27414245 upstream. Following commit 73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()"), the PTE_RDONLY bit is no longer managed by set_pte_at() but built into the PAGE_* attribute definitions. Consequently, pte_same() must include this bit when checking two PTEs for equality. Remove the arm64-specific pte_same() function, practically reverting commit 747a70e60b72 ("arm64: Fix copy-on-write referencing in HugeTLB") Fixes: 73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()") Cc: <stable@vger.kernel.org> # 4.14.x- Cc: Will Deacon <will@kernel.org> Cc: Steve Capper <steve.capper@arm.com> Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-25arm64/mm: fix variable 'pud' set but not usedQian Cai
[ Upstream commit 7d4e2dcf311d3b98421d1f119efe5964cafa32fc ] GCC throws a warning, arch/arm64/mm/mmu.c: In function 'pud_free_pmd_page': arch/arm64/mm/mmu.c:1033:8: warning: variable 'pud' set but not used [-Wunused-but-set-variable] pud_t pud; ^~~ because pud_table() is a macro and compiled away. Fix it by making it a static inline function and for pud_sect() as well. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variableQian Cai
[ Upstream commit 74dd022f9e6260c3b5b8d15901d27ebcc5f21eda ] When building with -Wunused-but-set-variable, the compiler shouts about a number of pte_unmap() users, since this expands to an empty macro on arm64: | mm/gup.c: In function 'gup_pte_range': | mm/gup.c:1727:16: warning: variable 'ptem' set but not used | [-Wunused-but-set-variable] | mm/gup.c: At top level: | mm/memory.c: In function 'copy_pte_range': | mm/memory.c:821:24: warning: variable 'orig_dst_pte' set but not used | [-Wunused-but-set-variable] | mm/memory.c:821:9: warning: variable 'orig_src_pte' set but not used | [-Wunused-but-set-variable] | mm/swap_state.c: In function 'swap_ra_info': | mm/swap_state.c:641:15: warning: variable 'orig_pte' set but not used | [-Wunused-but-set-variable] | mm/madvise.c: In function 'madvise_free_pte_range': | mm/madvise.c:318:9: warning: variable 'orig_pte' set but not used | [-Wunused-but-set-variable] Rewrite pte_unmap() as a static inline function, which silences the warnings. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-02-16arm64: mm: Map entry trampoline into trampoline and kernel page tablesWill Deacon
Commit 51a0048beb44 upstream. The exception entry trampoline needs to be mapped at the same virtual address in both the trampoline page table (which maps nothing else) and also the kernel page table, so that we can swizzle TTBR1_EL1 on exceptions from and return to EL0. This patch maps the trampoline at a fixed virtual address in the fixmap area of the kernel virtual address space, which allows the kernel proper to be randomized with respect to the trampoline when KASLR is enabled. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Laura Abbott <labbott@redhat.com> Tested-by: Shanker Donthineni <shankerd@codeaurora.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20arm64: mm: Fix pte_mkclean, pte_mkdirty semanticsSteve Capper
commit 8781bcbc5e69d7da69e84c7044ca0284848d5d01 upstream. On systems with hardware dirty bit management, the ltp madvise09 unit test fails due to dirty bit information being lost and pages being incorrectly freed. This was bisected to: arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect() Reverting this commit leads to a separate problem, that the unit test retains pages that should have been dropped due to the function madvise_free_pte_range(.) not cleaning pte's properly. Currently pte_mkclean only clears the software dirty bit, thus the following code sequence can appear: pte = pte_mkclean(pte); if (pte_dirty(pte)) // this condition can return true with HW DBM! This patch also adjusts pte_mkclean to set PTE_RDONLY thus effectively clearing both the SW and HW dirty information. In order for this to function on systems without HW DBM, we need to also adjust pte_mkdirty to remove the read only bit from writable pte's to avoid infinite fault loops. Fixes: 64c26841b349 ("arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect()") Reported-by: Bhupinder Thakur <bhupinder.thakur@linaro.org> Tested-by: Bhupinder Thakur <bhupinder.thakur@linaro.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30arm64: Implement arch-specific pte_access_permitted()Catalin Marinas
commit 6218f96c58dbf44a06aeaf767aab1f54fc397838 upstream. The generic pte_access_permitted() implementation only checks for pte_present() (together with the write permission where applicable). However, for both kernel ptes and PROT_NONE mappings pte_present() also returns true on arm64 even though such mappings are not user accessible. Additionally, arm64 now supports execute-only user permission (PROT_EXEC) which is implemented by clearing the PTE_USER bit. With this patch the arm64 implementation of pte_access_permitted() checks for the PTE_VALID and PTE_USER bits together with writable access if applicable. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-29arm64: mm: Use READ_ONCE when dereferencing pointer to pte tableWill Deacon
On kernels built with support for transparent huge pages, different CPUs can access the PMD concurrently due to e.g. fast GUP or page_vma_mapped_walk and they must take care to use READ_ONCE to avoid value tearing or caching of stale values by the compiler. Unfortunately, these functions call into our pgtable macros, which don't use READ_ONCE, and compiler caching has been observed to cause the following crash during ext4 writeback: PC is at check_pte+0x20/0x170 LR is at page_vma_mapped_walk+0x2e0/0x540 [...] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000) Call trace: [<ffff000008233328>] check_pte+0x20/0x170 [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540 [<ffff000008234adc>] page_mkclean_one+0xac/0x278 [<ffff000008234d98>] rmap_walk_file+0xf0/0x238 [<ffff000008236e74>] rmap_walk+0x64/0xa0 [<ffff0000082370c8>] page_mkclean+0x90/0xa8 [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8 [<ffff00000832f984>] mpage_submit_page+0x34/0x98 [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170 [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8 [<ffff00000833530c>] ext4_writepages+0x484/0xe30 [<ffff0000081f6ab4>] do_writepages+0x44/0xe8 [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110 [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8 [<ffff000008324310>] ext4_sync_file+0x80/0x4b8 [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0 [<ffff0000082332b4>] SyS_msync+0x194/0x1e8 This is because page_vma_mapped_walk loads the PMD twice before calling pte_offset_map: the first time without READ_ONCE (where it gets all zeroes due to a concurrent pmdp_invalidate) and the second time with READ_ONCE (where it sees a valid table pointer due to a concurrent pmd_populate). However, the compiler inlines everything and caches the first value in a register, which is subsequently used in pte_offset_phys which returns a junk pointer that is later dereferenced when attempting to access the relevant pte. This patch fixes the issue by using READ_ONCE in pte_offset_phys to ensure that a stale value is not used. Whilst this is a point fix for a known failure (and simple to backport), a full fix moving all of our page table accessors over to {READ,WRITE}_ONCE and consistently using READ_ONCE in page_vma_mapped_walk is in the works for a future kernel release. Cc: Jon Masters <jcm@redhat.com> Cc: Timur Tabi <timur@codeaurora.org> Cc: <stable@vger.kernel.org> Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()") Tested-by: Richard Ruigrok <rruigrok@codeaurora.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-08-21arm64: Remove the !CONFIG_ARM64_HW_AFDBM alternative code pathsCatalin Marinas
Since the pte handling for hardware AF/DBM works even when the hardware feature is not present, make the pte accessors implementation permanent and remove the corresponding #ifdefs. The Kconfig option is kept as it can still be used to disable the feature at the hardware level. Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-08-21arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect()Catalin Marinas
ptep_set_wrprotect() is only called on CoW mappings which are private (!VM_SHARED) with the pte either read-only (!PTE_WRITE && PTE_RDONLY) or writable and software-dirty (PTE_WRITE && !PTE_RDONLY && PTE_DIRTY). There is no race with the hardware update of the dirty state: clearing of PTE_RDONLY when PTE_WRITE (a.k.a. PTE_DBM) is set. This patch removes the code setting the software PTE_DIRTY bit in ptep_set_wrprotect() as superfluous. A VM_WARN_ONCE is introduced in case the above logic is wrong or the core mm code changes its use of ptep_set_wrprotect(). Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-08-21arm64: Move PTE_RDONLY bit handling out of set_pte_at()Catalin Marinas
Currently PTE_RDONLY is treated as a hardware only bit and not handled by the pte_mkwrite(), pte_wrprotect() or the user PAGE_* definitions. The set_pte_at() function is responsible for setting this bit based on the write permission or dirty state. This patch moves the PTE_RDONLY handling out of set_pte_at into the pte_mkwrite()/pte_wrprotect() functions. The PAGE_* definitions to need to be updated to explicitly include PTE_RDONLY when !PTE_WRITE. The patch also removes the redundant PAGE_COPY(_EXEC) definitions as they are identical to the corresponding PAGE_READONLY(_EXEC). Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-08-21arm64: Convert pte handling from inline asm to using (cmp)xchgCatalin Marinas
With the support for hardware updates of the access and dirty states, the following pte handling functions had to be implemented using exclusives: __ptep_test_and_clear_young(), ptep_get_and_clear(), ptep_set_wrprotect() and ptep_set_access_flags(). To take advantage of the LSE atomic instructions and also make the code cleaner, convert these pte functions to use the more generic cmpxchg()/xchg(). Reviewed-by: Will Deacon <will.deacon@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-06-12arm64: hugetlb: Fix huge_pte_offset to return poisoned page table entriesPunit Agrawal
When memory failure is enabled, a poisoned hugepage pte is marked as a swap entry. huge_pte_offset() does not return the poisoned page table entries when it encounters PUD/PMD hugepages. This behaviour of huge_pte_offset() leads to error such as below when munmap is called on poisoned hugepages. [ 344.165544] mm/pgtable-generic.c:33: bad pmd 000000083af00074. Fix huge_pte_offset() to return the poisoned pte which is then appropriately handled by the generic layer code. Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woods <dwoods@mellanox.com> Tested-by: Manoj Iyer <manoj.iyer@canonical.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-03-23arm64: mm: set the contiguous bit for kernel mappings where appropriateArd Biesheuvel
This is the third attempt at enabling the use of contiguous hints for kernel mappings. The most recent attempt 0bfc445dec9d was reverted after it turned out that updating permission attributes on live contiguous ranges may result in TLB conflicts. So this time, the contiguous hint is not set for .rodata or for the linear alias of .text/.rodata, both of which are mapped read-write initially, and remapped read-only at a later stage. (Note that the latter region could also be unmapped and remapped again with updated permission attributes, given that the region, while live, is only mapped for the convenience of the hibernation code, but that also means the TLB footprint is negligible anyway, so why bother) This enables the following contiguous range sizes for the virtual mapping of the kernel image, and for the linear mapping: granule size | cont PTE | cont PMD | -------------+------------+------------+ 4 KB | 64 KB | 32 MB | 16 KB | 2 MB | 1 GB* | 64 KB | 2 MB | 16 GB* | * Only when built for 3 or more levels of translation. This is due to the fact that a 2 level configuration only consists of PGDs and PTEs, and the added complexity of dealing with folded PMDs is not justified considering that 16 GB contiguous ranges are likely to be ignored by the hardware (and 16k/2 levels is a niche configuration) Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-01-31arm64: Improve detection of user/non-user mappings in set_pte(_at)Catalin Marinas
Commit cab15ce604e5 ("arm64: Introduce execute-only page access permissions") allowed a valid user PTE to have the PTE_USER bit clear. As a consequence, the pte_valid_not_user() macro in set_pte() was replaced with pte_valid_global() under the assumption that only user pages have the nG bit set. EFI mappings, however, also have the nG bit set and set_pte() wrongly ignores issuing the DSB+ISB. This patch reinstates the pte_valid_not_user() macro and adds the PTE_UXN bit check since all kernel mappings have this bit set. For clarity, pte_exec() is renamed to pte_user_exec() as it only checks for the absence of PTE_UXN. Consequently, the user executable check in set_pte_at() drops the pte_ng() test since pte_user_exec() is sufficient. Fixes: cab15ce604e5 ("arm64: Introduce execute-only page access permissions") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-12arm64: Use __pa_symbol for kernel symbolsLaura Abbott
__pa_symbol is technically the marcro that should be used for kernel symbols. Switch to this as a pre-requisite for DEBUG_VIRTUAL which will do bounds checking. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-08-25arm64: hibernate: Support DEBUG_PAGEALLOCJames Morse
DEBUG_PAGEALLOC removes the valid bit of page table entries to prevent any access to unallocated memory. Hibernate uses this as a hint that those pages don't need to be saved/restored. This patch adds the kernel_page_present() function it uses. hibernate.c copies the resume kernel's linear map for use during restore. Add _copy_pte() to fill-in the holes made by DEBUG_PAGEALLOC in the resume kernel, so we can restore data the original kernel had at these addresses. Finally, DEBUG_PAGEALLOC means the linear-map alias of KERNEL_START to KERNEL_END may have holes in it, so we can't lazily clean this whole area to the PoC. Only clean the new mmuoff region, and the kernel/kvm idmaps. This reverts commit da24eb1f3f9e2c7b75c5f8c40d8e48e2c4789596. Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-08-25arm64: Introduce execute-only page access permissionsCatalin Marinas
The ARMv8 architecture allows execute-only user permissions by clearing the PTE_UXN and PTE_USER bits. However, the kernel running on a CPU implementation without User Access Override (ARMv8.2 onwards) can still access such page, so execute-only page permission does not protect against read(2)/write(2) etc. accesses. Systems requiring such protection must enable features like SECCOMP. This patch changes the arm64 __P100 and __S100 protection_map[] macros to the new __PAGE_EXECONLY attributes. A side effect is that pte_user() no longer triggers for __PAGE_EXECONLY since PTE_USER isn't set. To work around this, the check is done on the PTE_NG bit via the pte_ng() macro. VM_READ is also checked now for page faults. Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-08-04arm64: Fix copy-on-write referencing in HugeTLBSteve Capper
set_pte_at(.) will set or unset the PTE_RDONLY hardware bit before writing the entry to the table. This can cause problems with the copy-on-write logic in hugetlb_cow: *) hugetlb_cow(.) called to handle a write fault on read only pte, *) Before the copy-on-write updates the new page table a call is made to pte_same(huge_ptep_get(ptep), pte)), to check for a race, *) Because set_pte_at(.) changed the pte, *ptep != pte, and the hugetlb_cow(.) code erroneously assumes that it lost the race, *) The new page is subsequently freed without being used. On arm64 this problem only becomes apparent when we apply: 67961f9 mm/hugetlb: fix huge page reserve accounting for private mappings When one runs the libhugetlbfs test suite, there are allocation errors and hugetlbfs pages become erroneously locked in memory as reserved. (There is a high HugePages_Rsvd: count). In this patch we introduce pte_same which ignores the PTE_RDONLY bit, allowing for the libhugetlbfs test suite to pass as expected and without leaking any reserved HugeTLB pages. Reported-by: Huang Shijie <shijie.huang@arm.com> Signed-off-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-05-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - fsnotify fix - poll() timeout fix - a few scripts/ tweaks - debugobjects updates - the (small) ocfs2 queue - Minor fixes to kernel/padata.c - Maybe half of the MM queue * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) mm, page_alloc: restore the original nodemask if the fast path allocation failed mm, page_alloc: uninline the bad page part of check_new_page() mm, page_alloc: don't duplicate code in free_pcp_prepare mm, page_alloc: defer debugging checks of pages allocated from the PCP mm, page_alloc: defer debugging checks of freed pages until a PCP drain cpuset: use static key better and convert to new API mm, page_alloc: inline pageblock lookup in page free fast paths mm, page_alloc: remove unnecessary variable from free_pcppages_bulk mm, page_alloc: pull out side effects from free_pages_check mm, page_alloc: un-inline the bad part of free_pages_check mm, page_alloc: check multiple page fields with a single branch mm, page_alloc: remove field from alloc_context mm, page_alloc: avoid looking up the first zone in a zonelist twice mm, page_alloc: shortcut watermark checks for order-0 pages mm, page_alloc: reduce cost of fair zone allocation policy retry mm, page_alloc: shorten the page allocator fast path mm, page_alloc: check once if a zone has isolated pageblocks mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath mm, page_alloc: simplify last cpupid reset mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask() ...
2016-05-19arch: fix has_transparent_hugepage()Hugh Dickins
I've just discovered that the useful-sounding has_transparent_hugepage() is actually an architecture-dependent minefield: on some arches it only builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when not, but on some of those (arm and arm64) it then gives the wrong answer; and on mips alone it's marked __init, which would crash if called later (but so far it has not been called later). Straighten this out: make it available to all configs, with a sensible default in asm-generic/pgtable.h, removing its definitions from those arches (arc, arm, arm64, sparc, tile) which are served by the default, adding #define has_transparent_hugepage has_transparent_hugepage to those (mips, powerpc, s390, x86) which need to override the default at runtime, and removing the __init from mips (but maybe that kind of code should be avoided after init: set a static variable the first time it's called). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc] Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390] Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "Small release overall. x86: - miscellaneous fixes - AVIC support (local APIC virtualization, AMD version) s390: - polling for interrupts after a VCPU goes to halted state is now enabled for s390 - use hardware provided information about facility bits that do not need any hypervisor activity, and other fixes for cpu models and facilities - improve perf output - floating interrupt controller improvements. MIPS: - miscellaneous fixes PPC: - bugfixes only ARM: - 16K page size support - generic firmware probing layer for timer and GIC Christoffer Dall (KVM-ARM maintainer) says: "There are a few changes in this pull request touching things outside KVM, but they should all carry the necessary acks and it made the merge process much easier to do it this way." though actually the irqchip maintainers' acks didn't make it into the patches. Marc Zyngier, who is both irqchip and KVM-ARM maintainer, later acked at http://mid.gmane.org/573351D1.4060303@arm.com ('more formally and for documentation purposes')" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (82 commits) KVM: MTRR: remove MSR 0x2f8 KVM: x86: make hwapic_isr_update and hwapic_irr_update look the same svm: Manage vcpu load/unload when enable AVIC svm: Do not intercept CR8 when enable AVIC svm: Do not expose x2APIC when enable AVIC KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore svm: Add VMEXIT handlers for AVIC svm: Add interrupt injection via AVIC KVM: x86: Detect and Initialize AVIC support svm: Introduce new AVIC VMCB registers KVM: split kvm_vcpu_wake_up from kvm_vcpu_kick KVM: x86: Introducing kvm_x86_ops VCPU blocking/unblocking hooks KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks KVM: x86: Rename kvm_apic_get_reg to kvm_lapic_get_reg KVM: x86: Misc LAPIC changes to expose helper functions KVM: shrink halt polling even more for invalid wakeups KVM: s390: set halt polling to 80 microseconds KVM: halt_polling: provide a way to qualify wakeups during poll KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts kvm: Conditionally register IRQ bypass consumer ...
2016-05-09kvm: arm64: Enable hardware updates of the Access Flag for Stage 2 page tablesCatalin Marinas
The ARMv8.1 architecture extensions introduce support for hardware updates of the access and dirty information in page table entries. With VTCR_EL2.HA enabled (bit 21), when the CPU accesses an IPA with the PTE_AF bit cleared in the stage 2 page table, instead of raising an Access Flag fault to EL2 the CPU sets the actual page table entry bit (10). To ensure that kernel modifications to the page table do not inadvertently revert a bit set by hardware updates, certain Stage 2 software pte/pmd operations must be performed atomically. The main user of the AF bit is the kvm_age_hva() mechanism. The kvm_age_hva_handler() function performs a "test and clear young" action on the pte/pmd. This needs to be atomic in respect of automatic hardware updates of the AF bit. Since the AF bit is in the same position for both Stage 1 and Stage 2, the patch reuses the existing ptep_test_and_clear_young() functionality if __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG is defined. Otherwise, the existing pte_young/pte_mkold mechanism is preserved. The kvm_set_s2pte_readonly() (and the corresponding pmd equivalent) have to perform atomic modifications in order to avoid a race with updates of the AF bit. The arm64 implementation has been re-written using exclusives. Currently, kvm_set_s2pte_writable() (and pmd equivalent) take a pointer argument and modify the pte/pmd in place. However, these functions are only used on local variables rather than actual page table entries, so it makes more sense to follow the pte_mkwrite() approach for stage 1 attributes. The change to kvm_s2pte_mkwrite() makes it clear that these functions do not modify the actual page table entries. The (pte|pmd)_mkyoung() uses on Stage 2 entries (setting the AF bit explicitly) do not need to be modified since hardware updates of the dirty status are not supported by KVM, so there is no possibility of losing such information. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2016-05-06arm64: Ensure pmd_present() returns false after pmd_mknotpresent()Catalin Marinas
Currently, pmd_present() only checks for a non-zero value, returning true even after pmd_mknotpresent() (which only clears the type bits). This patch converts pmd_present() to using pte_present(), similar to the other pmd_*() checks. As a side effect, it will return true for PROT_NONE mappings, though they are not yet used by the kernel with transparent huge pages. For consistency, also change pmd_mknotpresent() to only clear the PMD_SECT_VALID bit, even though the PMD_TABLE_BIT is already 0 for block mappings (no functional change). The unused PMD_SECT_PROT_NONE definition is removed as transparent huge pages use the pte page prot values. Fixes: 9c7e535fcc17 ("arm64: mm: Route pmd thp functions through pte equivalents") Cc: <stable@vger.kernel.org> # 3.15+ Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-05-06arm64: Replace hard-coded values in the pmd/pud_bad() macrosCatalin Marinas
This patch replaces the hard-coded value 2 with PMD_TABLE_BIT in the pmd/pud_bad() macros. Note that using these macros on pmd_trans_huge() entries is giving incorrect results (pmd_none_or_trans_huge_or_clear_bad() correctly checks for pmd_trans_huge before pmd_bad). Additionally, white-space clean-up for pmd_mkclean(). Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-05-06arm64: Implement pmdp_set_access_flags() for hardware AF/DBMCatalin Marinas
The update to the accessed or dirty states for block mappings must be done atomically on hardware with support for automatic AF/DBM. The ptep_set_access_flags() function has been fixed as part of commit 66dbd6e61a52 ("arm64: Implement ptep_set_access_flags() for hardware AF/DBM"). This patch brings pmdp_set_access_flags() in line with the pte counterpart. Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits") Cc: <stable@vger.kernel.org> # 4.4.x: 66dbd6e61a52: arm64: Implement ptep_set_access_flags() for hardware AF/DBM Cc: <stable@vger.kernel.org> # 4.3+ Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-05-06arm64: Fix typo in the pmdp_huge_get_and_clear() definitionCatalin Marinas
With hardware AF/DBM support, pmd modifications (transparent huge pages) should be performed atomically using load/store exclusive. The initial patches defined the get-and-clear function and __HAVE_ARCH_* macro without the "huge" word, leaving the pmdp_huge_get_and_clear() to the default, non-atomic implementation. Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits") Cc: <stable@vger.kernel.org> # 4.3+ Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-21arm64: Introduce pmd_thp_or_hugeSuzuki K Poulose
Add a helper to determine if a given pmd represents a huge page either by hugetlb or thp, as we have for arm. This will be used by KVM MMU code. Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
2016-04-15arm64: Implement ptep_set_access_flags() for hardware AF/DBMCatalin Marinas
When hardware updates of the access and dirty states are enabled, the default ptep_set_access_flags() implementation based on calling set_pte_at() directly is potentially racy. This triggers the "racy dirty state clearing" warning in set_pte_at() because an existing writable PTE is overridden with a clean entry. There are two main scenarios for this situation: 1. The CPU getting an access fault does not support hardware updates of the access/dirty flags. However, a different agent in the system (e.g. SMMU) can do this, therefore overriding a writable entry with a clean one could potentially lose the automatically updated dirty status 2. A more complex situation is possible when all CPUs support hardware AF/DBM: a) Initial state: shareable + writable vma and pte_none(pte) b) Read fault taken by two threads of the same process on different CPUs c) CPU0 takes the mmap_sem and proceeds to handling the fault. It eventually reaches do_set_pte() which sets a writable + clean pte. CPU0 releases the mmap_sem d) CPU1 acquires the mmap_sem and proceeds to handle_pte_fault(). The pte entry it reads is present, writable and clean and it continues to pte_mkyoung() e) CPU1 calls ptep_set_access_flags() If between (d) and (e) the hardware (another CPU) updates the dirty state (clears PTE_RDONLY), CPU1 will override the PTR_RDONLY bit marking the entry clean again. This patch implements an arm64-specific ptep_set_access_flags() function to perform an atomic update of the PTE flags. Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Julien Grall <julien.grall@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # 4.3+ [will: reworded comment] Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-15arm64, mm, numa: Add NUMA balancing support for arm64.Ganapatrao Kulkarni
Enable NUMA balancing for arm64 platforms. Add pte, pmd protnone helpers for use by automatic NUMA balancing. Reviewed-by: Steve Capper <steve.capper@arm.com> Reviewed-by: Robert Richter <rrichter@cavium.com> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com> Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-14arm64: mm: move vmemmap region right below the linear regionArd Biesheuvel
This moves the vmemmap region right below PAGE_OFFSET, aka the start of the linear region, and redefines its size to be a power of two. Due to the placement of PAGE_OFFSET in the middle of the address space, whose size is a power of two as well, this guarantees that virt to page conversions and vice versa can be implemented efficiently, by masking and shifting rather than ordinary arithmetic. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-14arm64: mm: avoid virt_to_page() translation for the zero pageArd Biesheuvel
The zero page is statically allocated, so grab its struct page pointer without using virt_to_page(), which will be restricted to the linear mapping later. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-14Revert "arm64: account for sparsemem section alignment when choosing vmemmap ↵Ard Biesheuvel
offset" This reverts commit 36e5cd6b897e17d03008f81e075625d8e43e52d0, since the section alignment is now guaranteed by construction when choosing the value of memstart_addr. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-03-17Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "Here are the main arm64 updates for 4.6. There are some relatively intrusive changes to support KASLR, the reworking of the kernel virtual memory layout and initial page table creation. Summary: - Initial page table creation reworked to avoid breaking large block mappings (huge pages) into smaller ones. The ARM architecture requires break-before-make in such cases to avoid TLB conflicts but that's not always possible on live page tables - Kernel virtual memory layout: the kernel image is no longer linked to the bottom of the linear mapping (PAGE_OFFSET) but at the bottom of the vmalloc space, allowing the kernel to be loaded (nearly) anywhere in physical RAM - Kernel ASLR: position independent kernel Image and modules being randomly mapped in the vmalloc space with the randomness is provided by UEFI (efi_get_random_bytes() patches merged via the arm64 tree, acked by Matt Fleming) - Implement relative exception tables for arm64, required by KASLR (initial code for ARCH_HAS_RELATIVE_EXTABLE added to lib/extable.c but actual x86 conversion to deferred to 4.7 because of the merge dependencies) - Support for the User Access Override feature of ARMv8.2: this allows uaccess functions (get_user etc.) to be implemented using LDTR/STTR instructions. Such instructions, when run by the kernel, perform unprivileged accesses adding an extra level of protection. The set_fs() macro is used to "upgrade" such instruction to privileged accesses via the UAO bit - Half-precision floating point support (part of ARMv8.2) - Optimisations for CPUs with or without a hardware prefetcher (using run-time code patching) - copy_page performance improvement to deal with 128 bytes at a time - Sanity checks on the CPU capabilities (via CPUID) to prevent incompatible secondary CPUs from being brought up (e.g. weird big.LITTLE configurations) - valid_user_regs() reworked for better sanity check of the sigcontext information (restored pstate information) - ACPI parking protocol implementation - CONFIG_DEBUG_RODATA enabled by default - VDSO code marked as read-only - DEBUG_PAGEALLOC support - ARCH_HAS_UBSAN_SANITIZE_ALL enabled - Erratum workaround Cavium ThunderX SoC - set_pte_at() fix for PROT_NONE mappings - Code clean-ups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (99 commits) arm64: kasan: Fix zero shadow mapping overriding kernel image shadow arm64: kasan: Use actual memory node when populating the kernel image shadow arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permission arm64: Fix misspellings in comments. arm64: efi: add missing frame pointer assignment arm64: make mrs_s prefixing implicit in read_cpuid arm64: enable CONFIG_DEBUG_RODATA by default arm64: Rework valid_user_regs arm64: mm: check at build time that PAGE_OFFSET divides the VA space evenly arm64: KVM: Move kvm_call_hyp back to its original localtion arm64: mm: treat memstart_addr as a signed quantity arm64: mm: list kernel sections in order arm64: lse: deal with clobbered IP registers after branch via PLT arm64: mm: dump: Use VA_START directly instead of private LOWEST_ADDR arm64: kconfig: add submenu for 8.2 architectural features arm64: kernel: acpi: fix ioremap in ACPI parking protocol cpu_postboot arm64: Add support for Half precision floating point arm64: Remove fixmap include fragility arm64: Add workaround for Cavium erratum 27456 arm64: mm: Mark .rodata as RO ...
2016-03-11arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permissionCatalin Marinas
The set_pte_at() function must update the hardware PTE_RDONLY bit depending on the state of the PTE_WRITE and PTE_DIRTY bits of the given entry value. However, it currently only performs this for pte_valid() entries, ignoring PTE_PROT_NONE. The side-effect is that PROT_NONE mappings would not have the PTE_RDONLY bit set. Without CONFIG_ARM64_HW_AFDBM, this is not an issue since such PROT_NONE pages are not accessible anyway. With commit 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits"), the ptep_set_wrprotect() function was re-written to cope with automatic hardware updates of the dirty state. As an optimisation, only PTE_RDONLY is checked to assess the "dirty" status. Since set_pte_at() does not set this bit for PROT_NONE mappings, such pages may be considered "dirty" as a result of ptep_set_wrprotect(). This patch updates the pte_valid() check to pte_present() in set_pte_at(). It also adds PTE_PROT_NONE to the swap entry bits comment. Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com> Tested-by: Ganapatrao Kulkarni <gkulkarni@cavium.com> Cc: <stable@vger.kernel.org>
2016-03-09arm64: account for sparsemem section alignment when choosing vmemmap offsetArd Biesheuvel
Commit dfd55ad85e4a ("arm64: vmemmap: use virtual projection of linear region") fixed an issue where the struct page array would overflow into the adjacent virtual memory region if system RAM was placed so high up in physical memory that its addresses were not representable in the build time configured virtual address size. However, the fix failed to take into account that the vmemmap region needs to be relatively aligned with respect to the sparsemem section size, so that a sequence of page structs corresponding with a sparsemem section in the linear region appears naturally aligned in the vmemmap region. So round up vmemmap to sparsemem section size. Since this essentially moves the projection of the linear region up in memory, also revert the reduction of the size of the vmemmap region. Cc: <stable@vger.kernel.org> Fixes: dfd55ad85e4a ("arm64: vmemmap: use virtual projection of linear region") Tested-by: Mark Langsdorf <mlangsdo@redhat.com> Tested-by: David Daney <david.daney@cavium.com> Tested-by: Robert Richter <rrichter@cavium.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-02-26arm64: vmemmap: use virtual projection of linear regionArd Biesheuvel
Commit dd006da21646 ("arm64: mm: increase VA range of identity map") made some changes to the memory mapping code to allow physical memory to reside at an offset that exceeds the size of the virtual mapping. However, since the size of the vmemmap area is proportional to the size of the VA area, but it is populated relative to the physical space, we may end up with the struct page array being mapped outside of the vmemmap region. For instance, on my Seattle A0 box, I can see the following output in the dmesg log. vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000 ( 8 GB maximum) 0xffffffbfc0000000 - 0xffffffbfd0000000 ( 256 MB actual) We can fix this by deciding that the vmemmap region is not a projection of the physical space, but of the virtual space above PAGE_OFFSET, i.e., the linear region. This way, we are guaranteed that the vmemmap region is of sufficient size, and we can even reduce the size by half. Cc: <stable@vger.kernel.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-02-26arm64: Remove fixmap include fragilityMark Rutland
The asm-generic fixmap.h depends on each architecture's fixmap.h to pull in the definition of PAGE_KERNEL_RO, if this exists. In the absence of this, FIXMAP_PAGE_RO will not be defined. In mm/early_ioremap.c the definition of early_memremap_ro is predicated on FIXMAP_PAGE_RO being defined. Currently, the arm64 fixmap.h doesn't include pgtable.h for the definition of PAGE_KERNEL_RO, and as a knock-on effect early_memremap_ro is not always defined, leading to link-time failures when it is used. This has been observed with defconfig on next-20160226. Unfortunately, as pgtable.h includes fixmap.h, adding the include introduces a circular dependency, which is just as fragile. Instead, this patch factors out PAGE_KERNEL_RO and other prot definitions into a new pgtable-prot header which can be included by poth pgtable.h and fixmap.h, avoiding the circular dependency, and ensuring that early_memremap_ro is alwyas defined where it is used. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-25arm64: Fix building error with 16KB pages and 36-bit VACatalin Marinas
In such configuration, Linux uses only two pages of page tables and __pud_populate() should not be used. However, the BUILD_BUG() triggers since pud_sect() is still defined and the compiler cannot eliminate such code, even though at run-time it should not be triggered. This patch extends the #ifdef ARM64_64K_PAGES condition for pud_sect to include PGTABLE_LEVELS < 3. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-18arm64: move kernel image to base of vmalloc areaArd Biesheuvel
This moves the module area to right before the vmalloc area, and moves the kernel image to the base of the vmalloc area. This is an intermediate step towards implementing KASLR, which allows the kernel image to be located anywhere in the vmalloc area. Since other subsystems such as hibernate may still need to refer to the kernel text or data segments via their linears addresses, both are mapped in the linear region as well. The linear alias of the text region is mapped read-only/non-executable to prevent inadvertent modification or execution. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-18arm64: pgtable: implement static [pte|pmd|pud]_offset variantsArd Biesheuvel
The page table accessors pte_offset(), pud_offset() and pmd_offset() rely on __va translations, so they can only be used after the linear mapping has been installed. For the early fixmap and kasan init routines, whose page tables are allocated statically in the kernel image, these functions will return bogus values. So implement pte_offset_kimg(), pmd_offset_kimg() and pud_offset_kimg(), which can be used instead before any page tables have been allocated dynamically. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-16arm64: mm: add functions to walk tables in fixmapMark Rutland
As a preparatory step to allow us to allocate early page tables from unmapped memory using memblock_alloc, add new p??_{set,clear}_fixmap* functions which can be used to walk page tables outside of the linear mapping by using fixmap slots. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-16arm64: mm: add functions to walk page tables by PAMark Rutland
To allow us to walk tables allocated into the fixmap, we need to acquire the physical address of a page, rather than the virtual address in the linear map. This patch adds new p??_page_paddr and p??_offset_phys functions to acquire the physical address of a next-level table, and changes p??_offset* into macros which simply convert this to a linear map VA. This renders p??_page_vaddr unused, and hence they are removed. At the pgd level, a new pgd_offset_raw function is added to find the relevant PGD entry given the base of a PGD and a virtual address. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-16arm64: mm: move pte_* macrosMark Rutland
For pmd, pud, and pgd levels of table, functions including p?d_index and p?d_offset are defined after the p?d_page_vaddr function for the immediately higher level of table. The pte functions however are defined much earlier, even though several rely on the later definition of pmd_page_vaddr. While this isn't currently a problem as these are macros, it prevents the logical grouping of later C functions (which cannot rely on prototypes for functions not yet defined). Move these definitions after pmd_page_vaddr, for consistency with the placement of these functions for other levels of table. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-02-16arm64: mm: place empty_zero_page in bssMark Rutland
Currently the zero page is set up in paging_init, and thus we cannot use the zero page earlier. We use the zero page as a reserved TTBR value from which no TLB entries may be allocated (e.g. when uninstalling the idmap). To enable such usage earlier (as may be required for invasive changes to the kernel page tables), and to minimise the time that the idmap is active, we need to be able to use the zero page before paging_init. This patch follows the example set by x86, by allocating the zero page at compile time, in .bss. This means that the zero page itself is available immediately upon entry to start_kernel (as we zero .bss before this), and also means that the zero page takes up no space in the raw Image binary. The associated struct page is allocated in bootmem_init, and remains unavailable until this time. Outside of arch code, the only users of empty_zero_page assume that the empty_zero_page symbol refers to the zeroed memory itself, and that ZERO_PAGE(x) must be used to acquire the associated struct page, following the example of x86. This patch also brings arm64 inline with these assumptions. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Cc: Laura Abbott <labbott@fedoraproject.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-01-25arm64: Honour !PTE_WRITE in set_pte_at() for kernel mappingsCatalin Marinas
Currently, set_pte_at() only checks the software PTE_WRITE bit for user mappings when it sets or clears the hardware PTE_RDONLY accordingly. The kernel ptes are written directly without any modification, relying solely on the protection bits in macros like PAGE_KERNEL. However, modifying kernel pte attributes via pte_wrprotect() would be ignored by set_pte_at(). Since pte_wrprotect() does not set PTE_RDONLY (it only clears PTE_WRITE), the new permission is not taken into account. This patch changes set_pte_at() to adjust the read-only permission for kernel ptes as well. As a side effect, existing PROT_* definitions used for kernel ioremap*() need to include PTE_DIRTY | PTE_WRITE. (additionally, white space fix for PTE_KERNEL_ROX) Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-01-15arch/arm64/include/asm/pgtable.h: add pmd_mkclean for THPMinchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite of the contents since MADV_FREE syscall is called for THP page. This patch adds pmd_mkclean for THP page MADV_FREE support. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@redhat.com> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15arm64, thp: remove infrastructure for handling splitting PMDsKirill A. Shutemov
With new refcounting we don't need to mark PMDs splitting. Let's drop code to handle this. pmdp_splitting_flush() is not needed too: on splitting PMD we will do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do IPI as needed for fast_gup. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-12Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "Here is the core arm64 queue for 4.5. As you might expect, the Christmas break resulted in a number of patches not making the final cut, so 4.6 is likely to be larger than usual. There's still some useful stuff here, however, and it's detailed below. The EFI changes have been Reviewed-by Matt and the memblock change got an "OK" from akpm. Summary: - Support for a separate IRQ stack, although we haven't reduced the size of our thread stack just yet since we don't have enough data to determine a safe value - Refactoring of our EFI initialisation and runtime code into drivers/firmware/efi/ so that it can be reused by arch/arm/. - Ftrace improvements when unwinding in the function graph tracer - Document our silicon errata handling process - Cache flushing optimisation when mapping executable pages - Support for hugetlb mappings using the contiguous hint in the pte" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (45 commits) arm64: head.S: use memset to clear BSS efi: stub: define DISABLE_BRANCH_PROFILING for all architectures arm64: entry: remove pointless SPSR mode check arm64: mm: move pgd_cache initialisation to pgtable_cache_init arm64: module: avoid undefined shift behavior in reloc_data() arm64: module: fix relocation of movz instruction with negative immediate arm64: traps: address fallout from printk -> pr_* conversion arm64: ftrace: fix a stack tracer's output under function graph tracer arm64: pass a task parameter to unwind_frame() arm64: ftrace: modify a stack frame in a safe way arm64: remove irq_count and do_softirq_own_stack() arm64: hugetlb: add support for PTE contiguous bit arm64: Use PoU cache instr for I/D coherency arm64: Defer dcache flush in __cpu_copy_user_page arm64: reduce stack use in irq_handler arm64: mm: ensure that the zero page is visible to the page table walker arm64: Documentation: add list of software workarounds for errata arm64: mm: place __cpu_setup in .text arm64: cmpxchg: Don't incldue linux/mmdebug.h arm64: mm: fold alternatives into .init ...
2016-01-05arm64: mm: move pgd_cache initialisation to pgtable_cache_initWill Deacon
Initialising the suppport for EFI runtime services requires us to allocate a pgd off the back of an early_initcall. On systems where the PGD_SIZE is smaller than PAGE_SIZE (e.g. 64k pages and 48-bit VA), the pgd_cache isn't initialised at this stage, and we panic with a NULL dereference during boot: Unable to handle kernel NULL pointer dereference at virtual address 00000000 __create_mapping.isra.5+0x84/0x350 create_pgd_mapping+0x20/0x28 efi_create_mapping+0x5c/0x6c arm_enable_runtime_services+0x154/0x1e4 do_one_initcall+0x8c/0x190 kernel_init_freeable+0x84/0x1ec kernel_init+0x10/0xe0 ret_from_fork+0x10/0x50 This patch fixes the problem by initialising the pgd_cache earlier, in the pgtable_cache_init callback, which sounds suspiciously like what it was intended for. Reported-by: Dennis Chen <dennis.chen@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2015-12-21arm64: hugetlb: add support for PTE contiguous bitDavid Woods
The arm64 MMU supports a Contiguous bit which is a hint that the TTE is one of a set of contiguous entries which can be cached in a single TLB entry. Supporting this bit adds new intermediate huge page sizes. The set of huge page sizes available depends on the base page size. Without using contiguous pages the huge page sizes are as follows. 4KB: 2MB 1GB 64KB: 512MB With a 4KB granule, the contiguous bit groups together sets of 16 pages and with a 64KB granule it groups sets of 32 pages. This enables two new huge page sizes in each case, so that the full set of available sizes is as follows. 4KB: 64KB 2MB 32MB 1GB 64KB: 2MB 512MB 16GB If a 16KB granule is used then the contiguous bit groups 128 pages at the PTE level and 32 pages at the PMD level. If the base page size is set to 64KB then 2MB pages are enabled by default. It is possible in the future to make 2MB the default huge page size for both 4KB and 64KB granules. Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com> Reviewed-by: Steve Capper <steve.capper@linaro.org> Signed-off-by: David Woods <dwoods@ezchip.com> Signed-off-by: Will Deacon <will.deacon@arm.com>