summaryrefslogtreecommitdiffstats
path: root/arch/powerpc
AgeCommit message (Collapse)Author
2020-02-14powerpc/pseries: Allow not having ibm, hypertas-functions::hcall-multi-tce ↵Alexey Kardashevskiy
for DDW commit 7559d3d295f3365ea7ac0c0274c05e633fe4f594 upstream. By default a pseries guest supports a H_PUT_TCE hypercall which maps a single IOMMU page in a DMA window. Additionally the hypervisor may support H_PUT_TCE_INDIRECT/H_STUFF_TCE which update multiple TCEs at once; this is advertised via the device tree /rtas/ibm,hypertas-functions property which Linux converts to FW_FEATURE_MULTITCE. FW_FEATURE_MULTITCE is checked when dma_iommu_ops is used; however the code managing the huge DMA window (DDW) ignores it and calls H_PUT_TCE_INDIRECT even if it is explicitly disabled via the "multitce=off" kernel command line parameter. This adds FW_FEATURE_MULTITCE checking to the DDW code path. This changes tce_build_pSeriesLP to take liobn and page size as the huge window does not have iommu_table descriptor which usually the place to store these numbers. Fixes: 4e8b0cf46b25 ("powerpc/pseries: Add support for dynamic dma windows") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Tested-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191216041924.42318-3-aik@ozlabs.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14powerpc/pseries/vio: Fix iommu_table use-after-free refcount warningTyrel Datwyler
commit aff8c8242bc638ba57247ae1ec5f272ac3ed3b92 upstream. Commit e5afdf9dd515 ("powerpc/vfio_spapr_tce: Add reference counting to iommu_table") missed an iommu_table allocation in the pseries vio code. The iommu_table is allocated with kzalloc and as a result the associated kref gets a value of zero. This has the side effect that during a DLPAR remove of the associated virtual IOA the iommu_tce_table_put() triggers a use-after-free underflow warning. Call Trace: [c0000002879e39f0] [c00000000071ecb4] refcount_warn_saturate+0x184/0x190 (unreliable) [c0000002879e3a50] [c0000000000500ac] iommu_tce_table_put+0x9c/0xb0 [c0000002879e3a70] [c0000000000f54e4] vio_dev_release+0x34/0x70 [c0000002879e3aa0] [c00000000087cfa4] device_release+0x54/0xf0 [c0000002879e3b10] [c000000000d64c84] kobject_cleanup+0xa4/0x240 [c0000002879e3b90] [c00000000087d358] put_device+0x28/0x40 [c0000002879e3bb0] [c0000000007a328c] dlpar_remove_slot+0x15c/0x250 [c0000002879e3c50] [c0000000007a348c] remove_slot_store+0xac/0xf0 [c0000002879e3cd0] [c000000000d64220] kobj_attr_store+0x30/0x60 [c0000002879e3cf0] [c0000000004ff13c] sysfs_kf_write+0x6c/0xa0 [c0000002879e3d10] [c0000000004fde4c] kernfs_fop_write+0x18c/0x260 [c0000002879e3d60] [c000000000410f3c] __vfs_write+0x3c/0x70 [c0000002879e3d80] [c000000000415408] vfs_write+0xc8/0x250 [c0000002879e3dd0] [c0000000004157dc] ksys_write+0x7c/0x120 [c0000002879e3e20] [c00000000000b278] system_call+0x5c/0x68 Further, since the refcount was always zero the iommu_tce_table_put() fails to call the iommu_table release function resulting in a leak. Fix this issue be initilizing the iommu_table kref immediately after allocation. Fixes: e5afdf9dd515 ("powerpc/vfio_spapr_tce: Add reference counting to iommu_table") Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1579558202-26052-1-git-send-email-tyreld@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11powerpc/44x: Adjust indentation in ibm4xx_denali_fixup_memsizeNathan Chancellor
commit c3aae14e5d468d18dbb5d7c0c8c7e2968cc14aad upstream. Clang warns: ../arch/powerpc/boot/4xx.c:231:3: warning: misleading indentation; statement is not part of the previous 'else' [-Wmisleading-indentation] val = SDRAM0_READ(DDR0_42); ^ ../arch/powerpc/boot/4xx.c:227:2: note: previous statement is here else ^ This is because there is a space at the beginning of this line; remove it so that the indentation is consistent according to the Linux kernel coding style and clang no longer warns. Fixes: d23f5099297c ("[POWERPC] 4xx: Adds decoding of 440SPE memory size to boot wrapper library") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://github.com/ClangBuiltLinux/linux/issues/780 Link: https://lore.kernel.org/r/20191209200338.12546-1-natechancellor@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11KVM: PPC: Book3S PR: Free shared page if mmu initialization failsSean Christopherson
commit cb10bf9194f4d2c5d830eddca861f7ca0fecdbb4 upstream. Explicitly free the shared page if kvmppc_mmu_init() fails during kvmppc_core_vcpu_create(), as the page is freed only in kvmppc_core_vcpu_free(), which is not reached via kvm_vcpu_uninit(). Fixes: 96bc451a15329 ("KVM: PPC: Introduce shared page") Cc: stable@vger.kernel.org Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11KVM: PPC: Book3S HV: Uninit vCPU if vcore creation failsSean Christopherson
commit 1a978d9d3e72ddfa40ac60d26301b154247ee0bc upstream. Call kvm_vcpu_uninit() if vcore creation fails to avoid leaking any resources allocated by kvm_vcpu_init(), i.e. the vcpu->run page. Fixes: 371fefd6f2dc4 ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes") Cc: stable@vger.kernel.org Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11of: Add OF_DMA_DEFAULT_COHERENT & select it on powerpcMichael Ellerman
commit dabf6b36b83a18d57e3d4b9d50544ed040d86255 upstream. There's an OF helper called of_dma_is_coherent(), which checks if a device has a "dma-coherent" property to see if the device is coherent for DMA. But on some platforms devices are coherent by default, and on some platforms it's not possible to update existing device trees to add the "dma-coherent" property. So add a Kconfig symbol to allow arch code to tell of_dma_is_coherent() that devices are coherent by default, regardless of the presence of the property. Select that symbol on powerpc when NOT_COHERENT_CACHE is not set, ie. when the system has a coherent cache. Fixes: 92ea637edea3 ("of: introduce of_dma_is_coherent() helper") Cc: stable@vger.kernel.org # v3.16+ Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11powerpc/pseries: Advance pfn if section is not present in lmb_is_removable()Pingfan Liu
commit fbee6ba2dca30d302efe6bddb3a886f5e964a257 upstream. In lmb_is_removable(), if a section is not present, it should continue to test the rest of the sections in the block. But the current code fails to do so. Fixes: 51925fb3c5c9 ("powerpc/pseries: Implement memory hotplug remove in the kernel") Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1578632042-12415-1-git-send-email-kernelfans@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11powerpc/xmon: don't access ASDR in VMsSukadev Bhattiprolu
commit c2a20711fc181e7f22ee5c16c28cb9578af84729 upstream. ASDR is HV-privileged and must only be accessed in HV-mode. Fixes a Program Check (0x700) when xmon in a VM dumps SPRs. Fixes: d1e1b351f50f ("powerpc/xmon: Add ISA v3.0 SPRs to SPR dump") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200107021633.GB29843@us.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-05powerpc/fsl/dts: add fsl,erratum-a011043Madalin Bucur
[ Upstream commit 73d527aef68f7644e59f22ce7f9ac75e7b533aea ] Add fsl,erratum-a011043 to internal MDIO buses. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-29mm/memory_hotplug: shrink zones when offlining memoryDavid Hildenbrand
commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream. -- snip -- - Missing arm64 hot(un)plug support - Missing some vmem_altmap_offset() cleanups - Missing sub-section hotadd support - Missing unification of mm/hmm.c and kernel/memremap.c -- snip -- We currently try to shrink a single zone when removing memory. We use the zone of the first page of the memory we are removing. If that memmap was never initialized (e.g., memory was never onlined), we will read garbage and can trigger kernel BUGs (due to a stale pointer): BUG: unable to handle page fault for address: 000000000000353d #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 Workqueue: kacpi_hotplug acpi_hotplug_work_fn RIP: 0010:clear_zone_contiguous+0x5/0x10 Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840 RSP: 0018:ffffad2400043c98 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000 RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40 RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680 FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __remove_pages+0x4b/0x640 arch_remove_memory+0x63/0x8d try_remove_memory+0xdb/0x130 __remove_memory+0xa/0x11 acpi_memory_device_remove+0x70/0x100 acpi_bus_trim+0x55/0x90 acpi_device_hotplug+0x227/0x3a0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x221/0x550 worker_thread+0x50/0x3b0 kthread+0x105/0x140 ret_from_fork+0x3a/0x50 Modules linked in: CR2: 000000000000353d Instead, shrink the zones when offlining memory or when onlining failed. Introduce and use remove_pfn_range_from_zone(() for that. We now properly shrink the zones, even if we have DIMMs whereby - Some memory blocks fall into no zone (never onlined) - Some memory blocks fall into multiple zones (offlined+re-onlined) - Multiple memory blocks that fall into different zones Drop the zone parameter (with a potential dubious value) from __remove_pages() and __remove_section(). Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVEDavid Hildenbrand
commit 80ec922dbd87fd38d15719c86a94457204648aeb upstream. -- snip -- Missing arm64 memory hot(un)plug support. -- snip -- We want to improve error handling while adding memory by allowing to use arch_remove_memory() and __remove_pages() even if CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like: arch_add_memory() rc = do_something(); if (rc) { arch_remove_memory(); } We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require quite some dependencies for memory offlining. Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Mark Brown <broonie@kernel.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Rob Herring <robh@kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "mike.travis@hpe.com" <mike.travis@hpe.com> Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Qian Cai <cai@lca.pw> Cc: Mathieu Malaterre <malat@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jun Yao <yaojun8558363@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never failDavid Hildenbrand
commit ac5c94264580f498e484c854031d0226b3c1038f upstream. -- snip -- Minor conflict in arch/powerpc/mm/mem.c -- snip -- All callers of arch_remove_memory() ignore errors. And we should really try to remove any errors from the memory removal path. No more errors are reported from __remove_pages(). BUG() in s390x code in case arch_remove_memory() is triggered. We may implement that properly later. WARN in case powerpc code failed to remove the section mapping, which is better than ignoring the error completely right now. Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Stefan Agner <stefan@agner.ch> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Rob Herring <robh@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Mathieu Malaterre <malat@debian.org> Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mike Travis <mike.travis@hpe.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29powerpc/mm: Fix section mismatch warningAneesh Kumar K.V
commit 26ad26718dfaa7cf49d106d212ebf2370076c253 upstream. This patch fix the below section mismatch warnings. WARNING: vmlinux.o(.text+0x2d1f44): Section mismatch in reference from the function devm_memremap_pages_release() to the function .meminit.text:arch_remove_memory() WARNING: vmlinux.o(.text+0x2d265c): Section mismatch in reference from the function devm_memremap_pages() to the function .meminit.text:arch_add_memory() Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29mm, memory_hotplug: add nid parameter to arch_remove_memoryOscar Salvador
commit 2c2a5af6fed20cf74401c9d64319c76c5ff81309 upstream. -- snip -- Missing unification of mm/hmm.c and kernel/memremap.c -- snip -- Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. [1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29mm/memory_hotplug: make remove_memory() take the device_hotplug_lockDavid Hildenbrand
commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream. Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3. Reading through the code and studying how mem_hotplug_lock is to be used, I noticed that there are two places where we can end up calling device_online()/device_offline() - online_pages()/offline_pages() without the mem_hotplug_lock. And there are other places where we call device_online()/device_offline() without the device_hotplug_lock. While e.g. echo "online" > /sys/devices/system/memory/memory9/state is fine, e.g. echo 1 > /sys/devices/system/memory/memory9/online Will not take the mem_hotplug_lock. However the device_lock() and device_hotplug_lock. E.g. via memory_probe_store(), we can end up calling add_memory()->online_pages() without the device_hotplug_lock. So we can have concurrent callers in online_pages(). We e.g. touch in online_pages() basically unprotected zone->present_pages then. Looks like there is a longer history to that (see Patch #2 for details), and fixing it to work the way it was intended is not really possible. We would e.g. have to take the mem_hotplug_lock in device/base/core.c, which sounds wrong. Summary: We had a lock inversion on mem_hotplug_lock and device_lock(). More details can be found in patch 3 and patch 6. I propose the general rules (documentation added in patch 6): 1. add_memory/add_memory_resource() must only be called with device_hotplug_lock. 2. remove_memory() must only be called with device_hotplug_lock. This is already documented and holds for all callers. 3. device_online()/device_offline() must only be called with device_hotplug_lock. This is already documented and true for now in core code. Other callers (related to memory hotplug) have to be fixed up. 4. mem_hotplug_lock is taken inside of add_memory/remove_memory/ online_pages/offline_pages. To me, this looks way cleaner than what we have right now (and easier to verify). And looking at the documentation of remove_memory, using lock_device_hotplug also for add_memory() feels natural. This patch (of 6): remove_memory() is exported right now but requires the device_hotplug_lock, which is not exported. So let's provide a variant that takes the lock and only export that one. The lock is already held in arch/powerpc/platforms/pseries/hotplug-memory.c drivers/acpi/acpi_memhotplug.c arch/powerpc/platforms/powernv/memtrace.c Apart from that, there are not other users in the tree. Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Rashmica Gupta <rashmica.g@gmail.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: John Allen <jallen@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> Cc: Mathieu Malaterre <malat@debian.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-27powerpc/mm/mce: Keep irqs disabled during lockless page table walkAneesh Kumar K.V
[ Upstream commit d9101bfa6adc831bda8836c4d774820553c14942 ] __find_linux_mm_pte() returns a page table entry pointer after walking the page table without holding locks. To make it safe against a THP split and/or collapse, we disable interrupts around the lockless page table walk. However we need to keep interrupts disabled as long as we use the page table entry pointer that is returned. Fix addr_to_pfn() to do that. Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Rearrange code slightly and tweak change log wording] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190918145328.28602-1-aneesh.kumar@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/64s/radix: Fix memory hot-unplug page table splitNicholas Piggin
[ Upstream commit 31f210cf42d4b308eacef89b6cb0b1459338b8de ] create_physical_mapping expects physical addresses, but splitting these mapping on hot unplug is supplying virtual (effective) addresses. Fixes: 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190724084638.24982-2-npiggin@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migrationNathan Lynch
[ Upstream commit e610a466d16a086e321f0bd421e2fc75cff28605 ] It's common for the platform to replace the cache device nodes after a migration. Since the cacheinfo code is never informed about this, it never drops its references to the source system's cache nodes, causing it to wind up in an inconsistent state resulting in warnings and oopses as soon as CPU online/offline occurs after the migration, e.g. cache for /cpus/l3-cache@3113(Unified) refers to cache for /cpus/l2-cache@200d(Unified) WARNING: CPU: 15 PID: 86 at arch/powerpc/kernel/cacheinfo.c:176 release_cache+0x1bc/0x1d0 [...] NIP release_cache+0x1bc/0x1d0 LR release_cache+0x1b8/0x1d0 Call Trace: release_cache+0x1b8/0x1d0 (unreliable) cacheinfo_cpu_offline+0x1c4/0x2c0 unregister_cpu_online+0x1b8/0x260 cpuhp_invoke_callback+0x114/0xf40 cpuhp_thread_fun+0x270/0x310 smpboot_thread_fn+0x2c8/0x390 kthread+0x1b8/0x1c0 ret_from_kernel_thread+0x5c/0x68 Using device tree notifiers won't work since we want to rebuild the hierarchy only after all the removals and additions have occurred and the device tree is in a consistent state. Call cacheinfo_teardown() before processing device tree updates, and rebuild the hierarchy afterward. Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/cacheinfo: add cacheinfo_teardown, cacheinfo_rebuildNathan Lynch
[ Upstream commit d4aa219a074a5abaf95a756b9f0d190b5c03a945 ] Allow external callers to force the cacheinfo code to release all its references to cache nodes, e.g. before processing device tree updates post-migration, and to rebuild the hierarchy afterward. CPU online/offline must be blocked by callers; enforce this. Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27KVM: PPC: Book3S HV: Fix lockdep warning when entering the guestAlexey Kardashevskiy
[ Upstream commit 3309bec85e60d60d6394802cb8e183a4f4a72def ] The trace_hardirqs_on() sets current->hardirqs_enabled and from here the lockdep assumes interrupts are enabled although they are remain disabled until the context switches to the guest. Consequent srcu_read_lock() checks the flags in rcu_lock_acquire(), observes disabled interrupts and prints a warning (see below). This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to prevent lockdep from being confused. DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280 [...] NIP [c000000000185b84] check_flags.part.25+0x224/0x280 LR [c000000000185b80] check_flags.part.25+0x220/0x280 Call Trace: [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable) [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260 [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv] [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv] [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm] [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930 [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110 [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80 [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70 Instruction dump: 419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5 3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d irq event stamp: 1025 hardirqs last enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv] hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv] softirqs last enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00 softirqs last disabled at (0): [<0000000000000000>] (null) ---[ end trace 31180adcc848993e ]--- possible reason: unannotated irqs-off. irq event stamp: 1025 hardirqs last enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv] hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv] softirqs last enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00 softirqs last disabled at (0): [<0000000000000000>] (null) Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc: vdso: Make vdso32 installation conditional in vdso_installBen Hutchings
[ Upstream commit ff6d27823f619892ab96f7461764840e0d786b15 ] The 32-bit vDSO is not needed and not normally built for 64-bit little-endian configurations. However, the vdso_install target still builds and installs it. Add the same config condition as is normally used for the build. Fixes: e0d005916994 ("powerpc/vdso: Disable building the 32-bit VDSO ...") Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/mm: Check secondary hash page tableRashmica Gupta
[ Upstream commit 790845e2f12709d273d08ea7a2af7c2593689519 ] We were always calling base_hpte_find() with primary = true, even when we wanted to check the secondary table. mpe: I broke this when refactoring Rashmica's original patch. Fixes: 1515ab932156 ("powerpc/mm: Dump hash table") Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/64s: Fix logic when handling unknown CPU featuresMichael Ellerman
[ Upstream commit 8cfaf106918a8c13abb24c641556172afbb9545c ] In cpufeatures_process_feature(), if a provided CPU feature is unknown and enable_unknown is false, we erroneously print that the feature is being enabled and return true, even though no feature has been enabled, and may also set feature bits based on the last entry in the match table. Fix this so that we only set feature bits from the match table if we have actually enabled a feature from that table, and when failing to enable an unknown feature, always print the "not enabling" message and return false. Coincidentally, some older gccs (<GCC 7), when invoked with -fsanitize-coverage=trace-pc, cause a spurious uninitialised variable warning in this function: arch/powerpc/kernel/dt_cpu_ftrs.c: In function ‘cpufeatures_process_feature’: arch/powerpc/kernel/dt_cpu_ftrs.c:686:7: warning: ‘m’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (m->cpu_ftr_bit_mask) An upcoming patch will enable support for kcov, which requires this option. This patch avoids the warning. Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features") Reported-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [ajd: add commit message] Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27KVM: PPC: Release all hardware TCE tables attached to a groupAlexey Kardashevskiy
[ Upstream commit a67614cc05a5052b265ea48196dab2fce11f5f2e ] The SPAPR TCE KVM device references all hardware IOMMU tables assigned to some IOMMU group to ensure that in-kernel KVM acceleration of H_PUT_TCE can work. The tables are references when an IOMMU group gets registered with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD ioctl; KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code in kvm_spapr_tce_release_iommu_group() which walks through the list of LIOBNs, finds a matching IOMMU table and calls kref_put() when found. However that code stops after the very first successful derefencing leaving other tables referenced till the SPAPR TCE KVM device is destroyed which normally happens on guest reboot or termination so if we do hotplug and unplug in a loop, we are leaking IOMMU tables here. This removes a premature return to let kvm_spapr_tce_release_iommu_group() find and dereference all attached tables. Fixes: 121f80ba68f ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/kgdb: add kgdb_arch_set/remove_breakpoint()Christophe Leroy
[ Upstream commit fb978ca207743badfe7efd9eebe68bcbb4969f79 ] Generic implementation fails to remove breakpoints after init when CONFIG_STRICT_KERNEL_RWX is selected: [ 13.251285] KGDB: BP remove failed: c001c338 [ 13.259587] kgdbts: ERROR PUT: end of test buffer on 'do_fork_test' line 8 expected OK got $E14#aa [ 13.268969] KGDB: re-enter exception: ALL breakpoints killed [ 13.275099] CPU: 0 PID: 1 Comm: init Not tainted 4.18.0-g82bbb913ffd8 #860 [ 13.282836] Call Trace: [ 13.285313] [c60e1ba0] [c0080ef0] kgdb_handle_exception+0x6f4/0x720 (unreliable) [ 13.292618] [c60e1c30] [c000e97c] kgdb_handle_breakpoint+0x3c/0x98 [ 13.298709] [c60e1c40] [c000af54] program_check_exception+0x104/0x700 [ 13.305083] [c60e1c60] [c000e45c] ret_from_except_full+0x0/0x4 [ 13.310845] [c60e1d20] [c02a22ac] run_simple_test+0x2b4/0x2d4 [ 13.316532] [c60e1d30] [c0081698] put_packet+0xb8/0x158 [ 13.321694] [c60e1d60] [c00820b4] gdb_serial_stub+0x230/0xc4c [ 13.327374] [c60e1dc0] [c0080af8] kgdb_handle_exception+0x2fc/0x720 [ 13.333573] [c60e1e50] [c000e928] kgdb_singlestep+0xb4/0xcc [ 13.339068] [c60e1e70] [c000ae1c] single_step_exception+0x90/0xac [ 13.345100] [c60e1e80] [c000e45c] ret_from_except_full+0x0/0x4 [ 13.350865] [c60e1f40] [c000e11c] ret_from_syscall+0x0/0x38 [ 13.356346] Kernel panic - not syncing: Recursive entry to debugger This patch creates powerpc specific version of kgdb_arch_set_breakpoint() and kgdb_arch_remove_breakpoint() using patch_instruction() Fixes: 1e0fc9d1eb2b ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/pseries/memory-hotplug: Fix return value type of find_aa_indexYueHaibing
[ Upstream commit b45e9d761ba2d60044b610297e3ef9f947ac157f ] The variable 'aa_index' is defined as an unsigned value in update_lmb_associativity_index(), but find_aa_index() may return -1 when dlpar_clone_property() fails. So change find_aa_index() to return a bool, which indicates whether 'aa_index' was found or not. Fixes: c05a5a40969e ("powerpc/pseries: Dynamic add entires to associativity lookup array") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Nathan Fontenot nfont@linux.vnet.ibm.com> [mpe: Tweak changelog, rename is_found to just found] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27powerpc/archrandom: fix arch_get_random_seed_int()Ard Biesheuvel
commit b6afd1234cf93aa0d71b4be4788c47534905f0be upstream. Commit 01c9348c7620ec65 powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_* updated arch_get_random_[int|long]() to be NOPs, and moved the hardware RNG backing to arch_get_random_seed_[int|long]() instead. However, it failed to take into account that arch_get_random_int() was implemented in terms of arch_get_random_long(), and so we ended up with a version of the former that is essentially a NOP as well. Fix this by calling arch_get_random_seed_long() from arch_get_random_seed_int() instead. Fixes: 01c9348c7620ec65 ("powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_*") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191204115015.18015-1-ardb@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-27powerpc/pseries: Enable support for ibm,drc-info propertyTyrel Datwyler
commit 0a87ccd3699983645f54cafd2258514a716b20b8 upstream. Advertise client support for the PAPR architected ibm,drc-info device tree property during CAS handshake. Fixes: c7a3275e0f9e ("powerpc/pseries: Revert support for ibm,drc-info devtree property") Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1573449697-5448-11-git-send-email-tyreld@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-17powerpc/powernv: Disable native PCIe port managementOliver O'Halloran
commit 9d72dcef891030545f39ad386a30cf91df517fb2 upstream. On PowerNV the PCIe topology is (currently) managed by the powernv platform code in Linux in cooperation with the platform firmware. Linux's native PCIe port service drivers operate independently of both and this can cause problems. The main issue is that the portbus driver will conflict with the platform specific hotplug driver (pnv_php) over ownership of the MSI used to notify the host when a hotplug event occurs. The portbus driver claims this MSI on behalf of the individual port services because the same interrupt is used for hotplug events, PMEs (on root ports), and link bandwidth change notifications. The portbus driver will always claim the interrupt even if the individual port service drivers, such as pciehp, are compiled out. The second, bigger, problem is that the hotplug port service driver fundamentally does not work on PowerNV. The platform assumes that all PCI devices have a corresponding arch-specific handle derived from the DT node for the device (pci_dn) and without one the platform will not allow a PCI device to be enabled. This problem is largely due to historical baggage, but it can't be resolved without significant re-factoring of the platform PCI support. We can fix these problems in the interim by setting the "pcie_ports_disabled" flag during platform initialisation. The flag indicates the platform owns the PCIe ports which stops the portbus driver from being registered. This does have the side effect of disabling all port services drivers that is: AER, PME, BW notifications, hotplug, and DPC. However, this is not a huge disadvantage on PowerNV since these services are either unused or handled through other means. Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191118065553.30362-1-oohall@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12powerpc/spinlocks: Include correct header for static keyJason A. Donenfeld
commit 6da3eced8c5f3b03340b0c395bacd552c4d52411 upstream. Recently, the spinlock implementation grew a static key optimization, but the jump_label.h header include was left out, leading to build errors: linux/arch/powerpc/include/asm/spinlock.h:44:7: error: implicit declaration of function ‘static_branch_unlikely’ 44 | if (!static_branch_unlikely(&shared_processor)) This commit adds the missing header. mpe: The build break is only seen with CONFIG_JUMP_LABEL=n. Fixes: 656c21d6af5d ("powerpc/shared: Use static key to detect shared processor") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191223133147.129983-1-Jason@zx2c4.com Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12powerpc/vcpu: Assume dedicated processors as non-preemptSrikar Dronamraju
commit 14c73bd344da60abaf7da3ea2e7733ddda35bbac upstream. With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs"), the scheduler avoids preempted vCPUs to schedule tasks on wakeup. This leads to wrong choice of CPU, which in-turn leads to larger wakeup latencies. Eventually, it leads to performance regression in latency sensitive benchmarks like soltp, schbench etc. On Powerpc, vcpu_is_preempted() only looks at yield_count. If the yield_count is odd, the vCPU is assumed to be preempted. However yield_count is increased whenever the LPAR enters CEDE state (idle). So any CPU that has entered CEDE state is assumed to be preempted. Even if vCPU of dedicated LPAR is preempted/donated, it should have right of first-use since they are supposed to own the vCPU. On a Power9 System with 32 cores: # lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 8 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 2 Model: 2.2 (pvr 004e 0202) Model name: POWER9 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 10240K NUMA node0 CPU(s): 0-63 NUMA node1 CPU(s): 64-127 # perf stat -a -r 5 ./schbench v5.4 v5.4 + patch Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 45 75.0000th: 62 75.0th: 63 90.0000th: 71 90.0th: 74 95.0000th: 77 95.0th: 78 *99.0000th: 91 *99.0th: 82 99.5000th: 707 99.5th: 83 99.9000th: 6920 99.9th: 86 min=0, max=10048 min=0, max=96 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 46 75.0000th: 61 75.0th: 64 90.0000th: 72 90.0th: 75 95.0000th: 79 95.0th: 79 *99.0000th: 691 *99.0th: 83 99.5000th: 3972 99.5th: 85 99.9000th: 8368 99.9th: 91 min=0, max=16606 min=0, max=117 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 46 75.0000th: 61 75.0th: 64 90.0000th: 71 90.0th: 75 95.0000th: 77 95.0th: 79 *99.0000th: 106 *99.0th: 83 99.5000th: 2364 99.5th: 84 99.9000th: 7480 99.9th: 90 min=0, max=10001 min=0, max=95 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 47 75.0000th: 62 75.0th: 65 90.0000th: 72 90.0th: 75 95.0000th: 78 95.0th: 79 *99.0000th: 93 *99.0th: 84 99.5000th: 108 99.5th: 85 99.9000th: 6792 99.9th: 90 min=0, max=17681 min=0, max=117 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 46 50.0th: 45 75.0000th: 62 75.0th: 64 90.0000th: 73 90.0th: 75 95.0000th: 79 95.0th: 79 *99.0000th: 113 *99.0th: 82 99.5000th: 2724 99.5th: 83 99.9000th: 6184 99.9th: 93 min=0, max=9887 min=0, max=111 Performance counter stats for 'system wide' (5 runs): context-switches 43,373 ( +- 0.40% ) 44,597 ( +- 0.55% ) cpu-migrations 1,211 ( +- 5.04% ) 220 ( +- 6.23% ) page-faults 15,983 ( +- 5.21% ) 15,360 ( +- 3.38% ) Waiman Long suggested using static_keys. Fixes: 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs") Cc: stable@vger.kernel.org # v4.18+ Reported-by: Parth Shah <parth@linux.ibm.com> Reported-by: Ihor Pasichnyk <Ihor.Pasichnyk@ibm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Phil Auld <pauld@redhat.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> Tested-by: Parth Shah <parth@linux.ibm.com> [mpe: Move the key and setting of the key to pseries/setup.c] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191213035036.6913-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12powerpc: Ensure that swiotlb buffer is allocated from low memoryMike Rapoport
[ Upstream commit 8fabc623238e68b3ac63c0dd1657bf86c1fa33af ] Some powerpc platforms (e.g. 85xx) limit DMA-able memory way below 4G. If a system has more physical memory than this limit, the swiotlb buffer is not addressable because it is allocated from memblock using top-down mode. Force memblock to bottom-up mode before calling swiotlb_init() to ensure that the swiotlb buffer is DMA-able. Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191204123524.22919-1-rppt@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flagMichael Roth
[ Upstream commit 3a83f677a6eeff65751b29e3648d7c69c3be83f3 ] On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB of memory running the following guest configs: guest A: - 224GB of memory - 56 VCPUs (sockets=1,cores=28,threads=2), where: VCPUs 0-1 are pinned to CPUs 0-3, VCPUs 2-3 are pinned to CPUs 4-7, ... VCPUs 54-55 are pinned to CPUs 108-111 guest B: - 4GB of memory - 4 VCPUs (sockets=1,cores=4,threads=1) with the following workloads (with KSM and THP enabled in all): guest A: stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M guest B: stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M host: stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M the below soft-lockup traces were observed after an hour or so and persisted until the host was reset (this was found to be reliably reproducible for this configuration, for kernels 4.15, 4.18, 5.0, and 5.3-rc5): [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU [ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941 [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709] [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913] [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331] [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338] [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525] [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU [ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243 [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU [ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714 [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU [ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403 [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds. [ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds. [ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds. [ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds. [ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds. [ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds. [ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds. [ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds. [ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 CPUs 45, 24, and 124 are stuck on spin locks, likely held by CPUs 105 and 31. CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on target CPU 42. For instance: # CPU 105 registers (via xmon) R00 = c00000000020b20c R16 = 00007d1bcd800000 R01 = c00000363eaa7970 R17 = 0000000000000001 R02 = c0000000019b3a00 R18 = 000000000000006b R03 = 000000000000002a R19 = 00007d537d7aecf0 R04 = 000000000000002a R20 = 60000000000000e0 R05 = 000000000000002a R21 = 0801000000000080 R06 = c0002073fb0caa08 R22 = 0000000000000d60 R07 = c0000000019ddd78 R23 = 0000000000000001 R08 = 000000000000002a R24 = c00000000147a700 R09 = 0000000000000001 R25 = c0002073fb0ca908 R10 = c000008ffeb4e660 R26 = 0000000000000000 R11 = c0002073fb0ca900 R27 = c0000000019e2464 R12 = c000000000050790 R28 = c0000000000812b0 R13 = c000207fff623e00 R29 = c0002073fb0ca808 R14 = 00007d1bbee00000 R30 = c0002073fb0ca800 R15 = 00007d1bcd600000 R31 = 0000000000000800 pc = c00000000020b260 smp_call_function_many+0x3d0/0x460 cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460 lr = c00000000020b20c smp_call_function_many+0x37c/0x460 msr = 900000010288b033 cr = 44024824 ctr = c000000000050790 xer = 0000000000000000 trap = 100 CPU 42 is running normally, doing VCPU work: # CPU 42 stack trace (via xmon) [link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv] [c000008ed3343820] c000008ed3343850 (unreliable) [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv] [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv] [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm] [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm] [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70 [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120 [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80 [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70 --- Exception: c00 (System Call) at 00007d545cfd7694 SP (7d53ff7edf50) is in userspace It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION] was set for CPU 42 by at least 1 of the CPUs waiting in smp_call_function_many(), but somehow the corresponding call_single_queue entries were never processed by CPU 42, causing the callers to spin in csd_lock_wait() indefinitely. Nick Piggin suggested something similar to the following sequence as a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE IPI messages seems to be most common, but any mix of CALL_FUNCTION and !CALL_FUNCTION messages could trigger it): CPU X: smp_muxed_ipi_set_message(): X: smp_mb() X: message[RESCHEDULE] = 1 X: doorbell_global_ipi(42): X: kvmppc_set_host_ipi(42, 1) X: ppc_msgsnd_sync()/smp_mb() X: ppc_msgsnd() -> 42 42: doorbell_exception(): // from CPU X 42: ppc_msgsync() 105: smp_muxed_ipi_set_message(): 105: smb_mb() // STORE DEFERRED DUE TO RE-ORDERING --105: message[CALL_FUNCTION] = 1 | 105: doorbell_global_ipi(42): | 105: kvmppc_set_host_ipi(42, 1) | 42: kvmppc_set_host_ipi(42, 0) | 42: smp_ipi_demux_relaxed() | 42: // returns to executing guest | // RE-ORDERED STORE COMPLETES ->105: message[CALL_FUNCTION] = 1 105: ppc_msgsnd_sync()/smp_mb() 105: ppc_msgsnd() -> 42 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored 105: // hangs waiting on 42 to process messages/call_single_queue This can be prevented with an smp_mb() at the beginning of kvmppc_set_host_ipi(), such that stores to message[<type>] (or other state indicated by the host_ipi flag) are ordered vs. the store to to host_ipi. However, doing so might still allow for the following scenario (not yet observed): CPU X: smp_muxed_ipi_set_message(): X: smp_mb() X: message[RESCHEDULE] = 1 X: doorbell_global_ipi(42): X: kvmppc_set_host_ipi(42, 1) X: ppc_msgsnd_sync()/smp_mb() X: ppc_msgsnd() -> 42 42: doorbell_exception(): // from CPU X 42: ppc_msgsync() // STORE DEFERRED DUE TO RE-ORDERING -- 42: kvmppc_set_host_ipi(42, 0) | 42: smp_ipi_demux_relaxed() | 105: smp_muxed_ipi_set_message(): | 105: smb_mb() | 105: message[CALL_FUNCTION] = 1 | 105: doorbell_global_ipi(42): | 105: kvmppc_set_host_ipi(42, 1) | // RE-ORDERED STORE COMPLETES -> 42: kvmppc_set_host_ipi(42, 0) 42: // returns to executing guest 105: ppc_msgsnd_sync()/smp_mb() 105: ppc_msgsnd() -> 42 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored 105: // hangs waiting on 42 to process messages/call_single_queue Fixing this scenario would require an smp_mb() *after* clearing host_ipi flag in kvmppc_set_host_ipi() to order the store vs. subsequent processing of IPI messages. To handle both cases, this patch splits kvmppc_set_host_ipi() into separate set/clear functions, where we execute smp_mb() prior to setting host_ipi flag, and after clearing host_ipi flag. These functions pair with each other to synchronize the sender and receiver sides. With that change in place the above workload ran for 20 hours without triggering any lock-ups. Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0 Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09powerpc/pseries/hvconsole: Fix stack overread via udbgDaniel Axtens
[ Upstream commit 934bda59f286d0221f1a3ebab7f5156a996cc37d ] While developing KASAN for 64-bit book3s, I hit the following stack over-read. It occurs because the hypercall to put characters onto the terminal takes 2 longs (128 bits/16 bytes) of characters at a time, and so hvc_put_chars() would unconditionally copy 16 bytes from the argument buffer, regardless of supplied length. However, udbg_hvc_putc() can call hvc_put_chars() with a single-byte buffer, leading to the error. ================================================================== BUG: KASAN: stack-out-of-bounds in hvc_put_chars+0xdc/0x110 Read of size 8 at addr c0000000023e7a90 by task swapper/0 CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-rc2-next-20190528-02824-g048a6ab4835b #113 Call Trace: dump_stack+0x104/0x154 (unreliable) print_address_description+0xa0/0x30c __kasan_report+0x20c/0x224 kasan_report+0x18/0x30 __asan_report_load8_noabort+0x24/0x40 hvc_put_chars+0xdc/0x110 hvterm_raw_put_chars+0x9c/0x110 udbg_hvc_putc+0x154/0x200 udbg_write+0xf0/0x240 console_unlock+0x868/0xd30 register_console+0x970/0xe90 register_early_udbg_console+0xf8/0x114 setup_arch+0x108/0x790 start_kernel+0x104/0x784 start_here_common+0x1c/0x534 Memory state around the buggy address: c0000000023e7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0000000023e7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 >c0000000023e7a80: f1 f1 01 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 ^ c0000000023e7b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0000000023e7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Document that a 16-byte buffer is requred, and provide it in udbg. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04Revert "powerpc/vcpu: Assume dedicated processors as non-preempt"Greg Kroah-Hartman
This reverts commit 4ba32bdbd8c66d9c7822aea8dcf4e51410df84a8 which is commit 14c73bd344da60abaf7da3ea2e7733ddda35bbac upstream. It breaks the build. Cc: Guenter Roeck <linux@roeck-us.net> Cc: Parth Shah <parth@linux.ibm.com> Cc: Ihor Pasichnyk <Ihor.Pasichnyk@ibm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> Cc: Parth Shah <parth@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.hMasahiro Yamada
[ Upstream commit a8de1304b7df30e3a14f2a8b9709bb4ff31a0385 ] The DTC v1.5.1 added references to (U)INT32_MAX. This is no problem for user-space programs since <stdint.h> defines (U)INT32_MAX along with (u)int32_t. For the kernel space, libfdt_env.h needs to be adjusted before we pull in the changes. In the kernel, we usually use s/u32 instead of (u)int32_t for the fixed-width types. Accordingly, we already have S/U32_MAX for their max values. So, we should not add (U)INT32_MAX to <linux/limits.h> any more. Instead, add them to the in-kernel libfdt_env.h to compile the latest libfdt. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc: Don't add -mabi= flags when building with ClangNathan Chancellor
[ Upstream commit 465bfd9c44dea6b55962b5788a23ac87a467c923 ] When building pseries_defconfig, building vdso32 errors out: error: unknown target ABI 'elfv1' This happens because -m32 in clang changes the target to 32-bit, which does not allow the ABI to be changed. Commit 4dc831aa8813 ("powerpc: Fix compiling a BE kernel with a powerpc64le toolchain") added these flags to fix building big endian kernels with a little endian GCC. Clang doesn't need -mabi because the target triple controls the default value. -mlittle-endian and -mbig-endian manipulate the triple into either powerpc64-* or powerpc64le-*, which properly sets the default ABI. Adding a debug print out in the PPC64TargetInfo constructor after line 383 above shows this: $ echo | ./clang -E --target=powerpc64-linux -mbig-endian -o /dev/null - Default ABI: elfv1 $ echo | ./clang -E --target=powerpc64-linux -mlittle-endian -o /dev/null - Default ABI: elfv2 $ echo | ./clang -E --target=powerpc64le-linux -mbig-endian -o /dev/null - Default ABI: elfv1 $ echo | ./clang -E --target=powerpc64le-linux -mlittle-endian -o /dev/null - Default ABI: elfv2 Don't specify -mabi when building with clang to avoid the build error with -m32 and not change any code generation. -mcall-aixdesc is not an implemented flag in clang so it can be safely excluded as well, see commit 238abecde8ad ("powerpc: Don't use gcc specific options on clang"). pseries_defconfig successfully builds after this patch and powernv_defconfig and ppc44x_defconfig don't regress. Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> [mpe: Trim clang links in change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191119045712.39633-2-natechancellor@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/security: Fix wrong message when RFI Flush is disableGustavo L. F. Walbon
[ Upstream commit 4e706af3cd8e1d0503c25332b30cad33c97ed442 ] The issue was showing "Mitigation" message via sysfs whatever the state of "RFI Flush", but it should show "Vulnerable" when it is disabled. If you have "L1D private" feature enabled and not "RFI Flush" you are vulnerable to meltdown attacks. "RFI Flush" is the key feature to mitigate the meltdown whatever the "L1D private" state. SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only. So the message should be as the truth table shows: CPU | L1D private | RFI Flush | sysfs ----|-------------|-----------|------------------------------------- P9 | False | False | Vulnerable P9 | False | True | Mitigation: RFI Flush P9 | True | False | Vulnerable: L1D private per thread P9 | True | True | Mitigation: RFI Flush, L1D private per thread P8 | False | False | Vulnerable P8 | False | True | Mitigation: RFI Flush Output before this fix: # cat /sys/devices/system/cpu/vulnerabilities/meltdown Mitigation: RFI Flush, L1D private per thread # echo 0 > /sys/kernel/debug/powerpc/rfi_flush # cat /sys/devices/system/cpu/vulnerabilities/meltdown Mitigation: L1D private per thread Output after fix: # cat /sys/devices/system/cpu/vulnerabilities/meltdown Mitigation: RFI Flush, L1D private per thread # echo 0 > /sys/kernel/debug/powerpc/rfi_flush # cat /sys/devices/system/cpu/vulnerabilities/meltdown Vulnerable: L1D private per thread Signed-off-by: Gustavo L. F. Walbon <gwalbon@linux.ibm.com> Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190502210907.42375-1-gwalbon@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/pseries/cmm: Implement release() function for sysfs deviceDavid Hildenbrand
[ Upstream commit 7d8212747435c534c8d564fbef4541a463c976ff ] When unloading the module, one gets ------------[ cut here ]------------ Device 'cmm0' does not have a release() function, it is broken and must be fixed. See Documentation/kobject.txt. WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 .device_release+0xcc/0xf0 ... We only have one static fake device. There is nothing to do when releasing the device (via cmm_exit()). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191031142933.10779-2-david@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warningAneesh Kumar K.V
[ Upstream commit 16f6b67cf03cb43db7104acb2ca877bdc2606c92 ] With large memory (8TB and more) hotplug, we can get soft lockup warnings as below. These were caused by a long loop without any explicit cond_resched which is a problem for !PREEMPT kernels. Avoid this using cond_resched() while inserting hash page table entries. We already do similar cond_resched() in __add_pages(), see commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to __add_pages"). rcu: 3-....: (24002 ticks this GP) idle=13e/1/0x4000000000000002 softirq=722/722 fqs=12001 (t=24003 jiffies g=4285 q=2002) NMI backtrace for cpu 3 CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2 Call Trace: dump_stack+0xb0/0xf4 (unreliable) nmi_cpu_backtrace+0x124/0x130 nmi_trigger_cpumask_backtrace+0x1ac/0x1f0 arch_trigger_cpumask_backtrace+0x28/0x3c rcu_dump_cpu_stacks+0xf8/0x154 rcu_sched_clock_irq+0x878/0xb40 update_process_times+0x48/0x90 tick_sched_handle.isra.16+0x4c/0x80 tick_sched_timer+0x68/0xe0 __hrtimer_run_queues+0x180/0x430 hrtimer_interrupt+0x110/0x300 timer_interrupt+0x108/0x2f0 decrementer_common+0x114/0x120 --- interrupt: 901 at arch_add_memory+0xc0/0x130 LR = arch_add_memory+0x74/0x130 memremap_pages+0x494/0x650 devm_memremap_pages+0x3c/0xa0 pmem_attach_disk+0x188/0x750 nvdimm_bus_probe+0xac/0x2c0 really_probe+0x148/0x570 driver_probe_device+0x19c/0x1d0 device_driver_attach+0xcc/0x100 bind_store+0x134/0x1c0 drv_attr_store+0x44/0x60 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1a0/0x270 __vfs_write+0x3c/0x70 vfs_write+0xd0/0x260 ksys_write+0xdc/0x130 system_call+0x5c/0x68 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191001084656.31277-1-aneesh.kumar@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/security/book3s64: Report L1TF status in sysfsAnthony Steinhauser
[ Upstream commit 8e6b6da91ac9b9ec5a925b6cb13f287a54bd547d ] Some PowerPC CPUs are vulnerable to L1TF to the same extent as to Meltdown. It is also mitigated by flushing the L1D on privilege transition. Currently the sysfs gives a false negative on L1TF on CPUs that I verified to be vulnerable, a Power9 Talos II Boston 004e 1202, PowerNV T2P9D01. Signed-off-by: Anthony Steinhauser <asteinhauser@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly] Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhauser@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/tools: Don't quote $objdump in scriptsMichael Ellerman
[ Upstream commit e44ff9ea8f4c8a90c82f7b85bd4f5e497c841960 ] Some of our scripts are passed $objdump and then call it as "$objdump". This doesn't work if it contains spaces because we're using ccache, for example you get errors such as: ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No such file or directory ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache ppc64le-objdump: No such file or directory Fix it by not quoting the string when we expand it, allowing the shell to do the right thing for us. Fixes: a71aa05e1416 ("powerpc: Convert relocs_check to a shell script using grep") Fixes: 4ea80652dc75 ("powerpc/64s: Tool to flag direct branches from unrelocated interrupt vectors") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024004730.32135-1-mpe@ellerman.id.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/pseries: Don't fail hash page table insert for bolted mappingAneesh Kumar K.V
[ Upstream commit 75838a3290cd4ebbd1f567f310ba04b6ef017ce4 ] If the hypervisor returned H_PTEG_FULL for H_ENTER hcall, retry a hash page table insert by removing a random entry from the group. After some runtime, it is very well possible to find all the 8 hash page table entry slot in the hpte group used for mapping. Don't fail a bolted entry insert in that case. With Storage class memory a user can find this error easily since a namespace enable/disable is equivalent to memory add/remove. This results in failures as reported below: $ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M libndctl: ndctl_dax_enable: dax1.3: failed to enable Error: namespace1.2: failed to enable failed to create namespace: No such device or address In kernel log we find the details as below: Unable to create mapping for hot added memory 0xc000042006000000..0xc00004200d000000: -1 dax_pmem: probe of dax1.3 failed with error -14 This indicates that we failed to create a bolted hash table entry for direct-map address backing the namespace. We also observe failures such that not all namespaces will be enabled with ndctl enable-namespace all command. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024093542.29777-2-aneesh.kumar@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04powerpc/pseries: Mark accumulate_stolen_time() as notraceMichael Ellerman
[ Upstream commit eb8e20f89093b64f48975c74ccb114e6775cee22 ] accumulate_stolen_time() is called prior to interrupt state being reconciled, which can trip the warning in arch_local_irq_restore(): WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 .arch_local_irq_restore+0x9c/0x130 ... NIP .arch_local_irq_restore+0x9c/0x130 LR .rb_start_commit+0x38/0x80 Call Trace: .ring_buffer_lock_reserve+0xe4/0x620 .trace_function+0x44/0x210 .function_trace_call+0x148/0x170 .ftrace_ops_no_ops+0x180/0x1d0 ftrace_call+0x4/0x8 .accumulate_stolen_time+0x1c/0xb0 decrementer_common+0x124/0x160 For now just mark it as notrace. We may change the ordering to call it after interrupt state has been reconciled, but that is a larger change. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191024055932.27940-1-mpe@ellerman.id.au Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31powerpc/irq: fix stack overflow verificationChristophe Leroy
commit 099bc4812f09155da77eeb960a983470249c9ce1 upstream. Before commit 0366a1c70b89 ("powerpc/irq: Run softirqs off the top of the irq stack"), check_stack_overflow() was called by do_IRQ(), before switching to the irq stack. In that commit, do_IRQ() was renamed __do_irq(), and is now executing on the irq stack, so check_stack_overflow() has just become almost useless. Move check_stack_overflow() call in do_IRQ() to do the check while still on the current stack. Fixes: 0366a1c70b89 ("powerpc/irq: Run softirqs off the top of the irq stack") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/e033aa8116ab12b7ca9a9c75189ad0741e3b9b5f.1575872340.git.christophe.leroy@c-s.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31powerpc/vcpu: Assume dedicated processors as non-preemptSrikar Dronamraju
commit 14c73bd344da60abaf7da3ea2e7733ddda35bbac upstream. With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs"), the scheduler avoids preempted vCPUs to schedule tasks on wakeup. This leads to wrong choice of CPU, which in-turn leads to larger wakeup latencies. Eventually, it leads to performance regression in latency sensitive benchmarks like soltp, schbench etc. On Powerpc, vcpu_is_preempted() only looks at yield_count. If the yield_count is odd, the vCPU is assumed to be preempted. However yield_count is increased whenever the LPAR enters CEDE state (idle). So any CPU that has entered CEDE state is assumed to be preempted. Even if vCPU of dedicated LPAR is preempted/donated, it should have right of first-use since they are supposed to own the vCPU. On a Power9 System with 32 cores: # lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 8 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 2 Model: 2.2 (pvr 004e 0202) Model name: POWER9 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 10240K NUMA node0 CPU(s): 0-63 NUMA node1 CPU(s): 64-127 # perf stat -a -r 5 ./schbench v5.4 v5.4 + patch Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 45 75.0000th: 62 75.0th: 63 90.0000th: 71 90.0th: 74 95.0000th: 77 95.0th: 78 *99.0000th: 91 *99.0th: 82 99.5000th: 707 99.5th: 83 99.9000th: 6920 99.9th: 86 min=0, max=10048 min=0, max=96 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 46 75.0000th: 61 75.0th: 64 90.0000th: 72 90.0th: 75 95.0000th: 79 95.0th: 79 *99.0000th: 691 *99.0th: 83 99.5000th: 3972 99.5th: 85 99.9000th: 8368 99.9th: 91 min=0, max=16606 min=0, max=117 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 46 75.0000th: 61 75.0th: 64 90.0000th: 71 90.0th: 75 95.0000th: 77 95.0th: 79 *99.0000th: 106 *99.0th: 83 99.5000th: 2364 99.5th: 84 99.9000th: 7480 99.9th: 90 min=0, max=10001 min=0, max=95 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 45 50.0th: 47 75.0000th: 62 75.0th: 65 90.0000th: 72 90.0th: 75 95.0000th: 78 95.0th: 79 *99.0000th: 93 *99.0th: 84 99.5000th: 108 99.5th: 85 99.9000th: 6792 99.9th: 90 min=0, max=17681 min=0, max=117 Latency percentiles (usec) Latency percentiles (usec) 50.0000th: 46 50.0th: 45 75.0000th: 62 75.0th: 64 90.0000th: 73 90.0th: 75 95.0000th: 79 95.0th: 79 *99.0000th: 113 *99.0th: 82 99.5000th: 2724 99.5th: 83 99.9000th: 6184 99.9th: 93 min=0, max=9887 min=0, max=111 Performance counter stats for 'system wide' (5 runs): context-switches 43,373 ( +- 0.40% ) 44,597 ( +- 0.55% ) cpu-migrations 1,211 ( +- 5.04% ) 220 ( +- 6.23% ) page-faults 15,983 ( +- 5.21% ) 15,360 ( +- 3.38% ) Waiman Long suggested using static_keys. Fixes: 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs") Cc: stable@vger.kernel.org # v4.18+ Reported-by: Parth Shah <parth@linux.ibm.com> Reported-by: Ihor Pasichnyk <Ihor.Pasichnyk@ibm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Phil Auld <pauld@redhat.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> Tested-by: Parth Shah <parth@linux.ibm.com> [mpe: Move the key and setting of the key to pseries/setup.c] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191213035036.6913-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17powerpc: Fix vDSO clock_getres()Vincenzo Frascino
[ Upstream commit 552263456215ada7ee8700ce022d12b0cffe4802 ] clock_getres in the vDSO library has to preserve the same behaviour of posix_get_hrtimer_res(). In particular, posix_get_hrtimer_res() does: sec = 0; ns = hrtimer_resolution; and hrtimer_resolution depends on the enablement of the high resolution timers that can happen either at compile or at run time. Fix the powerpc vdso implementation of clock_getres keeping a copy of hrtimer_resolution in vdso data and using that directly. Fixes: a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel") Cc: stable@vger.kernel.org Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Shuah Khan <skhan@linuxfoundation.org> [chleroy: changed CLOCK_REALTIME_RES to CLOCK_HRTIMER_RES] Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a55eca3a5e85233838c2349783bcb5164dae1d09.1575273217.git.christophe.leroy@c-s.fr Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17powerpc: Avoid clang warnings around setjmp and longjmpNathan Chancellor
[ Upstream commit c9029ef9c95765e7b63c4d9aa780674447db1ec0 ] Commit aea447141c7e ("powerpc: Disable -Wbuiltin-requires-header when setjmp is used") disabled -Wbuiltin-requires-header because of a warning about the setjmp and longjmp declarations. r367387 in clang added another diagnostic around this, complaining that there is no jmp_buf declaration. In file included from ../arch/powerpc/xmon/xmon.c:47: ../arch/powerpc/include/asm/setjmp.h:10:13: error: declaration of built-in function 'setjmp' requires the declaration of the 'jmp_buf' type, commonly provided in the header <setjmp.h>. [-Werror,-Wincomplete-setjmp-declaration] extern long setjmp(long *); ^ ../arch/powerpc/include/asm/setjmp.h:11:13: error: declaration of built-in function 'longjmp' requires the declaration of the 'jmp_buf' type, commonly provided in the header <setjmp.h>. [-Werror,-Wincomplete-setjmp-declaration] extern void longjmp(long *, long); ^ 2 errors generated. We are not using the standard library's longjmp/setjmp implementations for obvious reasons; make this clear to clang by using -ffreestanding on these files. Cc: stable@vger.kernel.org # 4.14+ Suggested-by: Segher Boessenkool <segher@kernel.crashing.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191119045712.39633-3-natechancellor@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17powerpc/xive: Skip ioremap() of ESB pages for LSI interruptsCédric Le Goater
commit b67a95f2abff0c34e5667c15ab8900de73d8d087 upstream. The PCI INTx interrupts and other LSI interrupts are handled differently under a sPAPR platform. When the interrupt source characteristics are queried, the hypervisor returns an H_INT_ESB flag to inform the OS that it should be using the H_INT_ESB hcall for interrupt management and not loads and stores on the interrupt ESB pages. A default -1 value is returned for the addresses of the ESB pages. The driver ignores this condition today and performs a bogus IO mapping. Recent changes and the DEBUG_VM configuration option make the bug visible with : kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-0.rc6.git0.1.fc32.ppc64le #1 NIP: c000000000f63294 LR: c000000000f62e44 CTR: 0000000000000000 REGS: c0000000fa45f0d0 TRAP: 0700 Not tainted (5.4.0-0.rc6.git0.1.fc32.ppc64le) ... NIP ioremap_page_range+0x4c4/0x6e0 LR ioremap_page_range+0x74/0x6e0 Call Trace: ioremap_page_range+0x74/0x6e0 (unreliable) do_ioremap+0x8c/0x120 __ioremap_caller+0x128/0x140 ioremap+0x30/0x50 xive_spapr_populate_irq_data+0x170/0x260 xive_irq_domain_map+0x8c/0x170 irq_domain_associate+0xb4/0x2d0 irq_create_mapping+0x1e0/0x3b0 irq_create_fwspec_mapping+0x27c/0x3e0 irq_create_of_mapping+0x98/0xb0 of_irq_parse_and_map_pci+0x168/0x230 pcibios_setup_device+0x88/0x250 pcibios_setup_bus_devices+0x54/0x100 __of_scan_bus+0x160/0x310 pcibios_scan_phb+0x330/0x390 pcibios_init+0x8c/0x128 do_one_initcall+0x60/0x2c0 kernel_init_freeable+0x290/0x378 kernel_init+0x2c/0x148 ret_from_kernel_thread+0x5c/0x80 Fixes: bed81ee181dd ("powerpc/xive: introduce H_INT_ESB hcall") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Cédric Le Goater <clg@kaod.org> Tested-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191203163642.2428-1-clg@kaod.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17powerpc: Allow flush_icache_range to work across ranges >4GBAlastair D'Silva
commit 29430fae82073d39b1b881a3cd507416a56a363f upstream. When calling flush_icache_range with a size >4GB, we were masking off the upper 32 bits, so we would incorrectly flush a range smaller than intended. This patch replaces the 32 bit shifts with 64 bit ones, so that the full size is accounted for. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191104023305.9581-2-alastair@au1.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>