path: root/virt
Age | Commit message | Author
2024-02-21Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/intel-x86Bruce Ashfield
2022-11-03KVM: arm64: vgic: Fix exit condition in scan_its_table()Eric Ren
commit c000a2607145d28b06c697f968491372ea56c23a upstream. With some PCIe topologies, restoring a guest fails while parsing the ITS device tables. Reproducer hints: 1. Create an ARM virt VM with a pxb-pcie bus, which adds extra host bridges, with a qemu command like: ``` -device pxb-pcie,bus_nr=8,id=pci.x,numa_node=0,bus=pcie.0 \ -device pcie-root-port,..,bus=pci.x \ ... -device pxb-pcie,bus_nr=37,id=pci.y,numa_node=1,bus=pcie.0 \ -device pcie-root-port,..,bus=pci.y \ ... ``` 2. Ensure the guest uses a 2-level device table 3. Perform VM migration, which calls save/restore device tables In that setup, we get a big "offset" between 2 device_ids, which makes the unsigned "len" wrap around to a big positive number, causing the scan loop to continue with a bad GPA. For example: 1. the L1 table has 2 entries; 2. we are now scanning at L2 table entry index 2075 (pointed to by the first L1 entry) 3. if the next device id is 9472, we will get a big offset: 7397; 4. with an unsigned 'len', 'len -= offset * esz' underflows to a large positive number, mistakenly continuing into the next iteration with a bad GPA; (it should break out of the current L2 table scan and jump to the next L1 table entry) 5. that bad GPA fails the guest read. Fix it by stopping the L2 table scan when the next device id is outside of the current table, allowing the scan to continue from the next L1 table entry. Thanks to Eric Auger for the fix suggestion. Fixes: 920a7a8fa92a ("KVM: arm64: vgic-its: Add infrastructure for table lookup") Suggested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Eric Ren <renzhengeek@gmail.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d9c3a564af9e2c5bf63f48a7dcbf08cd593c5c0b.1665802985.git.renzhengeek@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
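To make the arithmetic above concrete, here is a minimal standalone C sketch (illustrative values only, not the kernel's actual code) of why the unsigned subtraction keeps the loop alive instead of stopping it:
```c
#include <stdio.h>

int main(void)
{
	unsigned long len = 2048;     /* bytes left in the current L2 table */
	const unsigned long esz = 8;  /* entry size */
	unsigned long offset = 7397;  /* gap (in entries) to the next device id */

	/* Buggy pattern: the subtraction wraps around to a huge positive
	 * value instead of going negative, so the scan keeps going with
	 * a bad GPA. */
	len -= offset * esz;
	printf("len after subtraction: %lu\n", len);	/* huge, loop continues */

	/* Fixed pattern: notice the next id lies outside the current L2
	 * table and break out to the next L1 entry instead. */
	if (offset * esz >= 2048)
		printf("next id outside this table: break to next L1 entry\n");
	return 0;
}
```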
2022-09-10Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> # Conflicts: # net/core/dev.c
2022-09-09Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2022-08-25KVM: Add infrastructure and macro to mark VM as buggedSean Christopherson
commit 0b8f11737cffc1a406d1134b58687abc29d76b52 upstream Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <3a0998645c328bf0895f1290e61821b70f048549.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [SG: Adjusted context for kernel version 4.19] Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-15KVM: Prevent module exit until all VMs are freedDavid Matlack
commit 5f6de5cbebee925a612856fce6f9182bb3eee0db upstream. Tie the lifetime of the KVM module to the lifetime of each VM via kvm.users_count. This way anything that grabs a reference to the VM via kvm_get_kvm() cannot accidentally outlive the KVM module. Prior to this commit, the lifetime of the KVM module was tied to the lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU file descriptors by their respective file_operations "owner" field. This approach is insufficient because references grabbed via kvm_get_kvm() do not prevent closing any of the aforementioned file descriptors. This fixes a long-standing theoretical bug in KVM that at least affects async page faults. kvm_setup_async_pf() grabs a reference via kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing prevents the VM file descriptor from being closed and the KVM module from being unloaded before this callback runs. Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out") Fixes: 3d3aab1b973b ("KVM: set owner of cpu and vm file operations") Cc: stable@vger.kernel.org Suggested-by: Ben Gardon <bgardon@google.com> [ Based on a patch from Ben implemented for Google's kernel. ] Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220303183328.1499189-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
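A minimal sketch of the pattern this fix applies, with illustrative names (struct kvm_sketch is not the real struct, and this is not the upstream diff): each live VM pins the module, and the pin drops only when the last user reference goes away:
```c
#include <linux/errno.h>
#include <linux/module.h>
#include <linux/refcount.h>

struct kvm_sketch {
	refcount_t users_count;		/* mirrors kvm.users_count */
};

static int vm_create_sketch(struct kvm_sketch *kvm)
{
	/* Each live VM holds one module reference, so kvm_get_kvm()
	 * users can no longer outlive kvm.ko. */
	if (!try_module_get(THIS_MODULE))
		return -ENODEV;
	refcount_set(&kvm->users_count, 1);
	return 0;
}

static void vm_put_sketch(struct kvm_sketch *kvm)
{
	if (refcount_dec_and_test(&kvm->users_count)) {
		/* ...free the VM's resources here... */
		module_put(THIS_MODULE);	/* last user: unpin the module */
	}
}
```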
2022-03-23KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migratedJames Morse
commit a5905d6af492ee6a4a2205f0d550b3f931b03d03 upstream. KVM allows the guest to discover whether the ARCH_WORKAROUND SMCCC are implemented, and to preserve that state during migration through its firmware register interface. Add the necessary boilerplate for SMCCC_ARCH_WORKAROUND_3. Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [ kvm code moved to virt/kvm/arm, removed fw regs ABI. Added 32bit stub ] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-25Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2021-09-26KVM: remember position in kvm->vcpus arrayRadim Krčmář
commit 8750e72a79dda2f665ce17b62049f4d62130d991 upstream. Fetching an index for any vcpu in the kvm->vcpus array by traversing the entire array every time is costly. This patch remembers the position of each vcpu in the kvm->vcpus array by storing it in vcpus_idx under the kvm_vcpu structure. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [borntraeger@de.ibm.com]: backport to 4.19 (also fits for 5.4) Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
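A minimal C sketch of the idea, using illustrative types rather than the real kvm structures: record the index once at creation time, then return it in O(1) instead of walking the array:
```c
/* Illustrative types; the real field lives in struct kvm_vcpu. */
struct vcpu_sketch {
	int vcpus_idx;			/* position in the vcpus[] array */
};

struct vm_sketch {
	struct vcpu_sketch *vcpus[64];
	int online_vcpus;
};

static void vcpu_register(struct vm_sketch *vm, struct vcpu_sketch *v)
{
	v->vcpus_idx = vm->online_vcpus;	/* remember the slot once */
	vm->vcpus[vm->online_vcpus++] = v;
}

static int vcpu_index(const struct vcpu_sketch *v)
{
	return v->vcpus_idx;	/* O(1), no array traversal needed */
}
```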
2021-09-22KVM: arm64: Handle PSCI resets before userspace touches vCPU stateOliver Upton
[ Upstream commit 6826c6849b46aaa91300201213701eb861af4ba0 ] The CPU_ON PSCI call takes a payload that KVM uses to configure a destination vCPU to run. This payload is non-architectural state and not exposed through any existing UAPI. Effectively, we have a race between CPU_ON and userspace saving/restoring a guest: if the target vCPU isn't run again before the VMM saves its state, the requested PC and context ID are lost. When restored, the target vCPU will be runnable and start executing at its old PC. We can avoid this race by making sure the reset payload is serviced before userspace can access a vCPU's state. Fixes: 358b28f09f0a ("arm/arm64: KVM: Allow a VCPU to fully reset itself") Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210818202133.1106786-3-oupton@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-25Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2021-08-25Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2021-07-28KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()Sean Christopherson
commit a9545779ee9e9e103648f6f2552e73cfe808d0f4 upstream. Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving a so-called "remapped" hva/pfn pair. In theory, the hva could resolve to a pfn in high memory on a 32-bit kernel. This bug was inadvertently exposed by commit bd2fae8da794 ("KVM: do not assume PTE is writable after follow_pfn"), which added an error PFN value to the mix, causing gcc to complain about overflowing the unsigned long. arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’: include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’ to ‘long unsigned int’ changes value from ‘9218868437227405314’ to ‘2’ [-Werror=overflow] 89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) | ^ virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’ Cc: stable@vger.kernel.org Fixes: add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210208201940.1258328-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
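A standalone demo of the truncation gcc is complaining about: KVM_PFN_ERR_RO_FAULT is (KVM_PFN_ERR_MASK + 2) = 0x7ff0000000000002, i.e. the 9218868437227405314 in the error message; squeezed into a 32-bit unsigned long it collapses to 2:
```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t kvm_pfn_t;	/* the type the fix uses for 'pfn' */

int main(void)
{
	kvm_pfn_t ro_fault = 0x7ff0000000000002ULL;	/* KVM_PFN_ERR_MASK + 2 */
	uint32_t as_ulong32 = (uint32_t)ro_fault;	/* a 32-bit kernel's ulong */

	printf("kvm_pfn_t value: %llu\n", (unsigned long long)ro_fault);
	printf("truncated:       %u\n", as_ulong32);	/* prints 2 */
	return 0;
}
```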
2021-07-28KVM: do not allow mapping valid but non-reference-counted pagesNicholas Piggin
commit f8be156be163a052a067306417cd0ff679068c97 upstream. It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher-order allocations). These host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc. family of APIs, which take a reference to the page, taking it from 0 to 1. When the reference is dropped, this will free the page incorrectly. Fix this by only taking a reference on valid pages if their refcount was already non-zero, which indicates the page participates in normal refcounting (and can be released with put_page). This addresses CVE-2021-22543. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
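A hedged sketch of the core idea (the real patch reworks kvm_main.c's PFN helpers; names here are illustrative): use get_page_unless_zero() so a page whose refcount is 0 is never pushed to 1:
```c
#include <linux/mm.h>

static struct page *safe_get_page_sketch(struct page *page)
{
	/* get_page_unless_zero() only takes a reference when the
	 * refcount is already non-zero, i.e. when the page really
	 * participates in refcounting. Tail pages of non-compound
	 * higher-order allocations are rejected here instead of being
	 * taken from 0 to 1 and later freed by put_page(). */
	if (!get_page_unless_zero(page))
		return NULL;
	return page;
}
```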
2021-07-28KVM: do not assume PTE is writable after follow_pfnPaolo Bonzini
commit bd2fae8da794b55bf2ac02632da3a151b10e664c upstream. In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functions. This however is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn. In doing this, however, KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, however it is a bug. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can be used to fail hva_to_pfn_remapped or passed back to the caller via *writable. Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older versions have the same issue, all the way back to commit 2e2e3738af33 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable. Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page") Reported-by: David Stevens <stevensd@google.com> Cc: 3pvd@google.com Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [OP: backport to 4.19, adjust follow_pte() -> follow_pte_pmd()] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
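A simplified sketch of the approach (the 4.19 backport used follow_pte_pmd(); the signature below follows the later follow_pte() form, and names are illustrative): read the PTE under its lock and derive writability instead of assuming it:
```c
#include <linux/errno.h>
#include <linux/kvm_types.h>
#include <linux/mm.h>

static int remapped_pfn_sketch(struct vm_area_struct *vma, unsigned long addr,
			       bool write_fault, bool *writable, kvm_pfn_t *pfn)
{
	pte_t *ptep;
	spinlock_t *ptl;
	int r;

	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
	if (r)
		return r;

	if (write_fault && !pte_write(*ptep)) {
		pte_unmap_unlock(ptep, ptl);
		return -EFAULT;		/* caller can try to fix up the fault */
	}

	if (writable)
		*writable = pte_write(*ptep);	/* report the real permission */
	*pfn = pte_pfn(*ptep);

	pte_unmap_unlock(ptep, ptl);
	return 0;
}
```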
2021-06-30KVM: arm/arm64: Fix KVM_VGIC_V3_ADDR_TYPE_REDIST readEric Auger
commit 94ac0835391efc1a30feda6fc908913ec012951e upstream. When reading the base address of a REDIST region through KVM_VGIC_V3_ADDR_TYPE_REDIST, we expect the redistributor region list to be populated with a single element. However, list_first_entry() expects the list to be non-empty. Instead we should use list_first_entry_or_null(), which effectively returns NULL if the list is empty. Fixes: dbd9733ab674 ("KVM: arm/arm64: Replace the single rdist region by a list") Cc: <Stable@vger.kernel.org> # v4.18+ Signed-off-by: Eric Auger <eric.auger@redhat.com> Reported-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210412150034.29185-1-eric.auger@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
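A minimal sketch of the pattern, with illustrative struct names: list_first_entry() on an empty list hands back a bogus pointer computed from the list head itself, while list_first_entry_or_null() yields NULL that can be checked:
```c
#include <linux/list.h>
#include <linux/types.h>

struct rdist_region_sketch {
	struct list_head entry;
	u64 base;
};

static u64 first_rdist_base_sketch(struct list_head *rd_regions)
{
	struct rdist_region_sketch *r =
		list_first_entry_or_null(rd_regions,
					 struct rdist_region_sketch, entry);

	/* Empty list: report "undefined" instead of dereferencing the
	 * junk pointer that list_first_entry() would have produced. */
	return r ? r->base : ~0ULL;
}
```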
2021-06-10KVM: arm/arm64: Let the timer expire in hardirq context on RTThomas Gleixner
[ Upstream commit 719cc080c914045a6e35787bf4dc3ba91cfd3efb ] The timers are canceled from a preempt-notifier which is invoked with preemption disabled, which is not allowed on PREEMPT_RT. The timer callback is short, so it can be invoked in hard-IRQ context on -RT. Let the timer expire in hard-IRQ context even on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2021-06-10Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2021-06-10Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2021-06-10Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2021-05-22KVM: arm64: Initialize VCPU mdcr_el2 before loading itAlexandru Elisei
commit 263d6287da1433aba11c5b4046388f2cdf49675c upstream. When a VCPU is created, the kvm_vcpu struct is initialized to zero in kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is set to a sensible value in kvm_arm_setup_debug() later in the run loop. The result is that KVM executes for a short time with MDCR_EL2 set to zero. This has several unintended consequences: * Setting MDCR_EL2.HPMN to 0 is constrained unpredictable according to ARM DDI 0487G.a, page D13-3820. The behavior specified by the architecture in this case is for the PE to behave as if MDCR_EL2.HPMN is set to a value less than or equal to PMCR_EL0.N, which means that an unknown number of counters are now disabled by MDCR_EL2.HPME, which is zero. * The host configuration for the other debug features controlled by MDCR_EL2 is temporarily lost. This has been harmless so far, as Linux doesn't use the other fields, but that might change in the future. Let's avoid both issues by initializing the VCPU's mdcr_el2 field in kvm_vcpu_vcpu_first_run_init(), thus making sure that the MDCR_EL2 register has a consistent value after each vcpu_load(). Fixes: d5a21bcc2995 ("KVM: arm64: Move common VHE/non-VHE trap config in separate functions") Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210407144857.199746-3-alexandru.elisei@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17KVM: arm64: Fix exclusive limit for IPA sizeMarc Zyngier
Commit 262b003d059c6671601a19057e9fe1a5e7f23722 upstream. When registering a memslot, we check the size and location of that memslot against the IPA size to ensure that we can provide guest access to the whole of the memory. Unfortunately, this check rejects memslots that end up at the exact limit of the addressing capability for a given IPA size. For example, it refuses the creation of a 2GB memslot at 0x8000000 with a 32bit IPA space. Fix it by relaxing the check to accept a memslot reaching the limit of the IPA space. Fixes: c3058d5da222 ("arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org # 4.4, 4.9, 4.14, 4.19 Reviewed-by: Andrew Jones <drjones@redhat.com> Link: https://lore.kernel.org/r/20210311100016.3830038-3-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23kvm: check tlbs_dirty directlyLai Jiangshan
commit 88bf56d04bc3564542049ec4ec168a8b60d0b48c upstream In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as: need_tlb_flush |= kvm->tlbs_dirty; with need_tlb_flush's type being int and tlbs_dirty's type being long. It means that tlbs_dirty is always used as an int and the higher 32 bits are useless. We need to check tlbs_dirty in a correct way, and this change checks it directly without propagating it to need_tlb_flush. Note: it's _extremely_ unlikely this neglect of the higher 32 bits can cause problems in practice. It would require encountering tlbs_dirty on a 4 billion count boundary, and KVM would need to be using shadow paging or be running a nested guest. Cc: stable@vger.kernel.org Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20201217154118.16497-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sudip: adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
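A standalone demo of the truncation (it assumes a 64-bit long, as on the affected configurations):
```c
#include <stdio.h>

int main(void)
{
	long tlbs_dirty = 1L << 32;	/* non-zero: a flush is needed */
	int need_tlb_flush = 0;

	need_tlb_flush |= tlbs_dirty;	/* upper 32 bits silently dropped */
	printf("need_tlb_flush = %d\n", need_tlb_flush);	/* prints 0 */

	/* The fixed check looks at the wide value directly: */
	if (need_tlb_flush || tlbs_dirty)
		printf("flush\n");	/* now taken */
	return 0;
}
```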
2021-01-21Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2020-12-02KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspaceZenghui Yu
commit 23bde34771f1ea92fb5e6682c0d8c04304d34b3b upstream. It was recently reported that if GICR_TYPER is accessed before the RD base address is set, we'll suffer from dereferencing the unset @rdreg. Oops... gpa_t last_rdist_typer = rdreg->base + GICR_TYPER + (rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE; It's "expected" that users will access registers in the redistributor if the RD has been properly configured (e.g., the RD base address is set). But it hasn't yet been covered by the existing documentation. Per discussion on the list [1], the reporting of the GICR_TYPER.Last bit for userspace never actually worked. And it's difficult for us to emulate it correctly given that userspace has the flexibility to access it any time. Let's just drop the reporting of the Last bit for userspace for now (userspace should have full knowledge about it anyway) and it at least prevents the kernel from panicking ;-) [1] https://lore.kernel.org/kvmarm/c20865a267e44d1e2c0d52ce4e012263@kernel.org/ Fixes: ba7b3f1275fd ("KVM: arm/arm64: Revisit Redistributor TYPER last bit computation") Reported-by: Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20201117151629.1738-1-yuzenghui@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-06Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2020-10-01KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetchMarc Zyngier
commit c4ad98e4b72cb5be30ea282fce935248f2300e62 upstream. KVM currently assumes that an instruction abort can never be a write. This is in general true, except when the abort is triggered by an S1PTW on instruction fetch that tries to update the S1 page tables (to set AF, for example). This can happen if the page tables have been paged out and brought back in without seeing a direct write to them (they are thus marked read-only), and the fault handling code will make the PT executable(!) instead of writable. The guest gets stuck forever. In these conditions, the permission fault must be considered as a write so that the Stage-1 update can take place. This is essentially the I-side equivalent of the problem fixed by 60e21a0ef54c ("arm64: KVM: Take S1 walks into account when determining S2 write faults"). Update kvm_is_write_fault() to return true on IABT+S1PTW, and introduce kvm_vcpu_trap_is_exec_fault(), which only returns true when the fault is a genuine instruction fetch (i.e. not an S1PTW). Additionally, kvm_vcpu_dabt_iss1tw() is renamed to kvm_vcpu_abt_iss1tw(), as the above makes it plain that it isn't specific to data aborts. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200915104218.1284701-2-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()Zenghui Yu
[ Upstream commit 57bdb436ce869a45881d8aa4bc5dac8e072dd2b6 ] If we're going to fail out of vgic_add_lpi(), let's make sure the allocated vgic_irq memory is also freed. Though it seems that both cases are unlikely to fail. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200414030349.625-3-yuzenghui@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: fix overflow of zero page refcount with ksm runningZhuang Yanying
[ Upstream commit 7df003c85218b5f5b10a7f6418208f31e813f38f ] We were testing virtual machines with KSM on a v5.4-rc2 kernel and found a zero_page refcount overflow. The refcount is increased in try_async_pf (get_user_page) without being decreased in mmu_set_spte() while handling an EPT violation. In kvm_release_pfn_clean(), only unreserved pages call put_page; the zero page, however, is reserved. So as VMs are created and destroyed, the refcount of the zero page keeps increasing until it overflows. step1: echo 10000 > /sys/kernel/pages_to_scan/pages_to_scan echo 1 > /sys/kernel/pages_to_scan/run echo 1 > /sys/kernel/pages_to_scan/use_zero_pages step2: just create several normal qemu kvm VMs, destroy each after 10s, and repeat this all the time. After a long period of time, all domains hang because the zero page refcount overflowed. Qemu prints an error log as follows: … error: kvm run failed Bad address EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4 EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA] SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy GDT= 000f7070 00000037 IDT= 000f70ae 00000000 CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000000 Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 <8b> 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04 … Meanwhile, a kernel warning is reported: [40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30 [40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G OE 5.2.0-rc2 #5 [40914.836415] RIP: 0010:try_get_page+0x1f/0x30 [40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 <0f> 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 [40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286 [40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000 [40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440 [40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000 [40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8 [40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440 [40914.836423] FS: 00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000 [40914.836425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0 [40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [40914.836428] Call Trace: [40914.836433] follow_page_pte+0x302/0x47b [40914.836437] __get_user_pages+0xf1/0x7d0 [40914.836441] ?
irq_work_queue+0x9/0x70 [40914.836443] get_user_pages_unlocked+0x13f/0x1e0 [40914.836469] __gfn_to_pfn_memslot+0x10e/0x400 [kvm] [40914.836486] try_async_pf+0x87/0x240 [kvm] [40914.836503] tdp_page_fault+0x139/0x270 [kvm] [40914.836523] kvm_mmu_page_fault+0x76/0x5e0 [kvm] [40914.836588] vcpu_enter_guest+0xb45/0x1570 [kvm] [40914.836632] kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm] [40914.836645] kvm_vcpu_ioctl+0x26e/0x5d0 [kvm] [40914.836650] do_vfs_ioctl+0xa9/0x620 [40914.836653] ksys_ioctl+0x60/0x90 [40914.836654] __x64_sys_ioctl+0x16/0x20 [40914.836658] do_syscall_64+0x5b/0x180 [40914.836664] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [40914.836666] RIP: 0033:0x7fb61cb6bfc7 Signed-off-by: LinFeng <linfeng23@huawei.com> Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01KVM: arm/arm64: vgic: Fix potential double free dist->spis in __kvm_vgic_destroy()Miaohe Lin
[ Upstream commit 0bda9498dd45280e334bfe88b815ebf519602cc3 ] In kvm_vgic_dist_init(), called from kvm_vgic_map_resources(), if dist->vgic_model is invalid, dist->spis is freed without setting dist->spis = NULL. In the vgicv2 resource cleanup path, __kvm_vgic_destroy() is then called to free the allocated resources, and dist->spis is freed again in the cleanup chain because we forgot to set dist->spis = NULL on the kvm_vgic_dist_init() failure path. So a double free would happen. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/1574923128-19956-1-git-send-email-linmiaohe@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
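A minimal sketch of the guard this fix adds, with illustrative names (vgic_model_is_invalid() is a stand-in, not a kernel API): NULL the pointer on the failure path so the later kfree() in the cleanup chain becomes a harmless no-op:
```c
#include <linux/errno.h>
#include <linux/slab.h>

struct spi_sketch { int intid; };

struct dist_sketch {
	struct spi_sketch *spis;
	int vgic_model;
};

static bool vgic_model_is_invalid(int model)	/* illustrative stand-in */
{
	return model != 2 && model != 3;
}

static int dist_init_sketch(struct dist_sketch *dist, int nr_spis)
{
	dist->spis = kcalloc(nr_spis, sizeof(*dist->spis), GFP_KERNEL);
	if (!dist->spis)
		return -ENOMEM;

	if (vgic_model_is_invalid(dist->vgic_model)) {
		kfree(dist->spis);
		dist->spis = NULL;	/* the assignment the fix adds: a later
					 * kfree(dist->spis) is now a no-op */
		return -EINVAL;
	}
	return 0;
}
```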
2020-09-26KVM: fix memory leak in kvm_io_bus_unregister_dev()Rustam Kovhaev
[ Upstream commit f65886606c2d3b562716de030706dfe1bea4ed5e ] When kmalloc() fails in kvm_io_bus_unregister_dev(), before removing the bus we should iterate over all other devices linked to it and call kvm_iodevice_destructor() for them. Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail") Cc: stable@vger.kernel.org Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707 Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
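A hedged sketch of the shape of the fix, with heavily simplified types (the real code works on struct kvm_io_bus and its range array): if the replacement bus cannot be allocated, destroy the surviving devices instead of leaking them:
```c
#include <linux/slab.h>

struct io_dev_sketch {
	void (*destructor)(struct io_dev_sketch *dev);
};

struct io_bus_sketch {
	int dev_count;
	struct io_dev_sketch *dev[16];
};

static void bus_unregister_sketch(struct io_bus_sketch *bus,
				  struct io_dev_sketch *victim)
{
	struct io_bus_sketch *new_bus;
	int i;

	new_bus = kmalloc(sizeof(*new_bus), GFP_KERNEL);
	if (!new_bus) {
		/* The bus is being torn down regardless, so give every
		 * device still linked to it its destructor callback
		 * rather than leaking them all. */
		for (i = 0; i < bus->dev_count; i++) {
			if (bus->dev[i] == victim)
				continue;
			bus->dev[i]->destructor(bus->dev[i]);
		}
		return;
	}
	/* ...normal path: copy everything but 'victim' into new_bus
	 * and publish it (elided)... */
	kfree(new_bus);
}
```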
2020-08-26Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2020-08-26KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not setWill Deacon
commit b5331379bc62611d1026173a09c73573384201d9 upstream. When an MMU notifier call results in unmapping a range that spans multiple PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary, since this avoids running into RCU stalls during VM teardown. Unfortunately, if the VM is destroyed as a result of OOM, then blocking is not permitted and the call to the scheduler triggers the following BUG(): | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394 | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper | INFO: lockdep is turned off. | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1 | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 | Call trace: | dump_backtrace+0x0/0x284 | show_stack+0x1c/0x28 | dump_stack+0xf0/0x1a4 | ___might_sleep+0x2bc/0x2cc | unmap_stage2_range+0x160/0x1ac | kvm_unmap_hva_range+0x1a0/0x1c8 | kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8 | __mmu_notifier_invalidate_range_start+0x218/0x31c | mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0 | __oom_reap_task_mm+0x128/0x268 | oom_reap_task+0xac/0x298 | oom_reaper+0x178/0x17c | kthread+0x1e4/0x1fc | ret_from_fork+0x10/0x30 Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier flags. Cc: <stable@vger.kernel.org> Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd") Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-3-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [will: Backport to 4.19; use 'blockable' instead of non-existent MMU_NOTIFIER_RANGE_BLOCKABLE flag] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon
commit fdfe7cbd58806522e799e2a50a15aee7f2cbb7b6 upstream. The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-2-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [will: Backport to 4.19; use 'blockable' instead of non-existent range flags] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-06Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-06-22KVM: arm64: Synchronize sysreg state on injecting an AArch32 exceptionMarc Zyngier
commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream. On a VHE system, the EL1 state is left in the CPU most of the time, and only synchronized back to memory when vcpu_put() is called (most of the time on preemption). This means that when injecting an exception, we'd better have a way to either: (1) write directly to the EL1 sysregs (2) synchronize the state back to memory, and do the changes there For AArch64, we already do (1), so we are safe. Unfortunately, doing the same thing for AArch32 would be pretty invasive. Instead, we can easily implement (2) by calling the put/load architectural backends, and keep preemption disabled. We can then reload the state back into EL1. Cc: stable@vger.kernel.org Reported-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22KVM: x86: Fix APIC page invalidation raceEiichi Tsukata
[ Upstream commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 ] Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to fix inappropriate APIC page invalidation by re-introducing the arch-specific kvm_arch_mmu_notifier_invalidate_range() and calling it from kvm_mmu_notifier_invalidate_range_start. However, the patch left a possible race where the VMCS APIC address cache is updated *before* it is unmapped: (Invalidator) kvm_mmu_notifier_invalidate_range_start() (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD) (KVM VCPU) vcpu_enter_guest() (KVM VCPU) kvm_vcpu_reload_apic_access_page() (Invalidator) actually unmap page Because of the above race, there can be a mismatch between the host physical address stored in the APIC_ACCESS_PAGE VMCS field and the host physical address stored in the EPT entry for the APIC GPA (0xfee00000). When this happens, the processor will not trap APIC accesses, and will instead show the raw contents of the APIC-access page. Because Windows OS periodically checks for unexpected modifications to the LAPIC register, this will show up as a BSOD crash with BugCheck CRITICAL_STRUCTURE_CORRUPTION (109), which we are currently seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1751017. The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range() cannot guarantee that no additional references are taken to the pages in the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately, this case is supported by the MMU notifier API, as documented in include/linux/mmu_notifier.h: * If the subsystem * can't guarantee that no additional references are taken to * the pages in the range, it has to implement the * invalidate_range() notifier to remove any references taken * after invalidate_range_start(). The fix therefore is to reload the APIC-access page field in the VMCS from kvm_mmu_notifier_invalidate_range() instead of ..._range_start(). Cc: stable@vger.kernel.org Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951 Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-28Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2020-05-28Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2020-05-14KVM: arm64: Fix 32bit PC wrap-aroundMarc Zyngier
commit 0225fd5e0a6a32af7af0aefac45c8ebf19dc5183 upstream. In the unlikely event that a 32bit vcpu traps into the hypervisor on an instruction that is located right at the end of the 32bit range, the emulation of that instruction is going to increment PC past the 32bit range. This isn't great, as userspace can then observe this value and get a bit confused. Conversely, userspace can do things like (in the context of a 64bit guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0, setting PC to a 64bit value, changing PSTATE to AArch32-USR, and observing that PC hasn't been truncated. More confusion. Fix both by: - truncating PC increments for 32bit guests - sanitizing all 32bit regs every time a core reg is changed by userspace, and that PSTATE indicates a 32bit mode. Cc: stable@vger.kernel.org Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14KVM: arm: vgic: Fix limit condition when writing to GICD_I[CS]ACTIVERMarc Zyngier
commit 1c32ca5dc6d00012f0c964e5fdd7042fcc71efb1 upstream. When deciding whether a guest has to be stopped, we check whether this is a private interrupt or not. Unfortunately, there's an off-by-one bug here, and we fail to recognize a whole range of interrupts as being global (GICv2 SPIs 32-63). Fix the condition from > to >=. Cc: stable@vger.kernel.org Fixes: abd7229626b93 ("KVM: arm/arm64: Simplify active_change_prepare and plug race") Reported-by: André Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
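A standalone demo of the off-by-one: private (per-CPU) interrupts are intids 0-31, and the check runs on the first intid covered by the accessed register, so misclassifying intid 32 as private mislabels the whole GICD_I[CS]ACTIVER register covering SPIs 32-63:
```c
#include <stdio.h>

#define VGIC_NR_PRIVATE_IRQS 32	/* intids 0-31 are per-CPU (private) */

int main(void)
{
	int intid = 32;	/* base intid of the register covering SPIs 32-63 */

	printf("buggy  (>):  global=%d\n", intid >  VGIC_NR_PRIVATE_IRQS); /* 0: wrong */
	printf("fixed  (>=): global=%d\n", intid >= VGIC_NR_PRIVATE_IRQS); /* 1 */
	return 0;
}
```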
2020-04-29x86/kvm: Cache gfn to pfn translationBoris Ostrovsky
commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream. __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is:
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called
Stashing gfn-to-pfn mapping should help with both cases. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29x86/kvm: Introduce kvm_(un)map_gfn()Boris Ostrovsky
commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream. kvm_vcpu_(un)map operates on gfns from any current address space. In certain cases we want to make sure we are not mapping SMRAM and for that we can use kvm_(un)map_gfn() that we are introducing in this patch. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29KVM: Properly check if "page" is valid in kvm_vcpu_unmapKarimAllah Ahmed
commit b614c6027896ff9ad6757122e84760d938cab15e upstream. The field "page" is initialized to KVM_UNMAPPED_PAGE when it is not used (i.e. when the memory lives outside kernel control). So this check will always end up using kunmap even for memremap regions. Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
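A sketch of the corrected check (this mirrors the shape of the fix; error handling and the CONFIG_HAS_IOMEM guard of the real code are omitted): "page" only points at a real struct page when it isn't the KVM_UNMAPPED_PAGE sentinel, and memremap'd mappings must be undone with memunmap():
```c
#include <linux/highmem.h>
#include <linux/io.h>
#include <linux/kvm_host.h>

static void vcpu_unmap_sketch(struct kvm_host_map *map)
{
	if (!map->hva)
		return;

	if (map->page != KVM_UNMAPPED_PAGE)
		kunmap(map->page);	/* struct-page backed: undo kmap() */
	else
		memunmap(map->hva);	/* outside kernel control: undo memremap() */

	map->hva = NULL;
}
```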
2020-04-29kvm: fix compile on s390 part 2Christian Borntraeger
commit eb1f2f387db8c0d084581fb26e7faffde700bc8e upstream. We also need to fence the memunmap part. Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Fixes: d30b214d1d0a (kvm: fix compilation on s390) Cc: Michal Kubecek <mkubecek@suse.cz> Cc: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29kvm: fix compilation on s390Paolo Bonzini
commit d30b214d1d0addb7b2c9c78178d1501cd39a01fb upstream. s390 does not have memremap, even though in this particular case it would be useful. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29kvm: fix compilation on aarch64Paolo Bonzini
commit c011d23ba046826ccf8c4a4a6c1d01c9ccaa1403 upstream. Commit e45adf665a53 ("KVM: Introduce a new guest mapping API", 2019-01-31) introduced a build failure on aarch64 defconfig: $ make -j$(nproc) ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=out defconfig \ Image.gz ... ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_map_gfn': ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:9: error: implicit declaration of function 'memremap'; did you mean 'memset_p'? ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:46: error: 'MEMREMAP_WB' undeclared (first use in this function) ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_vcpu_unmap': ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1795:3: error: implicit declaration of function 'memunmap'; did you mean 'vm_munmap'? because these functions are declared in <linux/io.h> rather than <asm/io.h>, and the former was being pulled in already on x86 but not on aarch64. Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29KVM: Introduce a new guest mapping APIKarimAllah Ahmed
commit e45adf665a53df0db37f784ed87c6b57ddd81885 upstream. In KVM, especially for nested guests, there is a dominant pattern of: => map guest memory -> do_something -> unmap guest memory. In addition to all the unnecessary noise this boilerplate adds to the code, most of the time the mapping function does not properly handle memory that is not backed by "struct page". This new guest mapping API encapsulates most of this boilerplate code and also handles guest memory that is not backed by "struct page". The current implementation of this API uses memremap for memory that is not backed by a "struct page", which would lead to a huge slow-down if it were used for high-frequency mapping operations. The API does not have any effect on current setups where guest memory is backed by a "struct page". Further patches are going to also introduce a pfn-cache which would significantly improve the performance of the memremap case. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19 as dependency of commit 1eff70a9abd4 "x86/kvm: Introduce kvm_(un)map_gfn()"] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
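A hedged usage sketch of the map -> do_something -> unmap pattern the API encapsulates; exact signatures vary slightly across the backports in this series, so treat this as illustrative rather than the final ABI:
```c
#include <linux/kvm_host.h>
#include <linux/string.h>

static int touch_guest_page_sketch(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	struct kvm_host_map map;

	/* Map guest memory; this works whether or not the memory is
	 * backed by a struct page (memremap is used when it isn't). */
	if (kvm_vcpu_map(vcpu, gfn, &map))
		return -EFAULT;

	memset(map.hva, 0, 8);		/* do_something with map.hva */

	kvm_vcpu_unmap(vcpu, &map, true);	/* true: mark the page dirty */
	return 0;
}
```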
2020-03-21Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
2020-03-11Merge branch 'v4.19/standard/base' into v4.19/standard/preempt-rt/baseBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>