Petr Tesarik 1f049f
From: Claudio Imbrenda <imbrenda@linux.ibm.com>
Petr Tesarik 1f049f
Date: Tue, 28 Jun 2022 15:56:02 +0200
Petr Tesarik 1f049f
Subject: KVM: s390: pv: leak the topmost page table when destroy fails
Petr Tesarik 1f049f
Git-commit: faa2f72cb3569256480c5540d242c84e99965160
Petr Tesarik 1f049f
Patch-mainline: v6.0-rc1
Petr Tesarik 1f049f
References: git-fixes
Petr Tesarik 1f049f
Petr Tesarik 1f049f
Each secure guest must have a unique ASCE (address space control
Petr Tesarik 1f049f
element); we must avoid that new guests use the same page for their
Petr Tesarik 1f049f
ASCE, to avoid errors.
Petr Tesarik 1f049f
Petr Tesarik 1f049f
Since the ASCE mostly consists of the address of the topmost page table
Petr Tesarik 1f049f
(plus some flags), we must not return that memory to the pool unless
Petr Tesarik 1f049f
the ASCE is no longer in use.
Petr Tesarik 1f049f
Petr Tesarik 1f049f
Only a successful Destroy Secure Configuration UVC will make the ASCE
Petr Tesarik 1f049f
reusable again.
Petr Tesarik 1f049f
Petr Tesarik 1f049f
If the Destroy Configuration UVC fails, the ASCE cannot be reused for a
Petr Tesarik 1f049f
secure guest (either for the ASCE or for other memory areas). To avoid
Petr Tesarik 1f049f
a collision, it must not be used again. This is a permanent error and
Petr Tesarik 1f049f
the page becomes in practice unusable, so we set it aside and leak it.
Petr Tesarik 1f049f
On failure we already leak other memory that belongs to the ultravisor
Petr Tesarik 1f049f
(i.e. the variable and base storage for a guest) and not leaking the
Petr Tesarik 1f049f
topmost page table was an oversight.
Petr Tesarik 1f049f
Petr Tesarik 1f049f
This error (and thus the leakage) should not happen unless the hardware
Petr Tesarik 1f049f
is broken or KVM has some unknown serious bug.
Petr Tesarik 1f049f
Petr Tesarik 1f049f
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Petr Tesarik 1f049f
Fixes: 29b40f105ec8d55 ("KVM: s390: protvirt: Add initial vm and cpu lifecycle handling")
Petr Tesarik 1f049f
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Petr Tesarik 1f049f
Link: https://lore.kernel.org/r/20220628135619.32410-2-imbrenda@linux.ibm.com
Petr Tesarik 1f049f
Message-Id: <20220628135619.32410-2-imbrenda@linux.ibm.com>
Petr Tesarik 1f049f
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Petr Tesarik 1f049f
Acked-by: Petr Tesarik <ptesarik@suse.com>
Petr Tesarik 1f049f
---
Petr Tesarik 1f049f
 arch/s390/include/asm/gmap.h |    2 +
Petr Tesarik 1f049f
 arch/s390/kvm/pv.c           |    9 +++-
Petr Tesarik 1f049f
 arch/s390/mm/gmap.c          |   86 +++++++++++++++++++++++++++++++++++++++++++
Petr Tesarik 1f049f
 3 files changed, 94 insertions(+), 3 deletions(-)
Petr Tesarik 1f049f
Petr Tesarik 1f049f
--- a/arch/s390/include/asm/gmap.h
Petr Tesarik 1f049f
+++ b/arch/s390/include/asm/gmap.h
Petr Tesarik 1f049f
@@ -148,4 +148,6 @@ void gmap_sync_dirty_log_pmd(struct gmap
Petr Tesarik 1f049f
 			     unsigned long gaddr, unsigned long vmaddr);
Petr Tesarik 1f049f
 int gmap_mark_unmergeable(void);
Petr Tesarik 1f049f
 void s390_reset_acc(struct mm_struct *mm);
Petr Tesarik 1f049f
+void s390_unlist_old_asce(struct gmap *gmap);
Petr Tesarik 1f049f
+int s390_replace_asce(struct gmap *gmap);
Petr Tesarik 1f049f
 #endif /* _ASM_S390_GMAP_H */
Petr Tesarik 1f049f
--- a/arch/s390/kvm/pv.c
Petr Tesarik 1f049f
+++ b/arch/s390/kvm/pv.c
Petr Tesarik 1f049f
@@ -168,10 +168,13 @@ int kvm_s390_pv_deinit_vm(struct kvm *kv
Petr Tesarik 1f049f
 	atomic_set(&kvm->mm->context.is_protected, 0);
Petr Tesarik 1f049f
 	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
Petr Tesarik 1f049f
 	WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
Petr Tesarik 1f049f
-	/* Inteded memory leak on "impossible" error */
Petr Tesarik 1f049f
-	if (!cc)
Petr Tesarik 1f049f
+	/* Intended memory leak on "impossible" error */
Petr Tesarik 1f049f
+	if (!cc) {
Petr Tesarik 1f049f
 		kvm_s390_pv_dealloc_vm(kvm);
Petr Tesarik 1f049f
-	return cc ? -EIO : 0;
Petr Tesarik 1f049f
+		return 0;
Petr Tesarik 1f049f
+	}
Petr Tesarik 1f049f
+	s390_replace_asce(kvm->arch.gmap);
Petr Tesarik 1f049f
+	return -EIO;
Petr Tesarik 1f049f
 }
Petr Tesarik 1f049f
 
Petr Tesarik 1f049f
 int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
Petr Tesarik 1f049f
--- a/arch/s390/mm/gmap.c
Petr Tesarik 1f049f
+++ b/arch/s390/mm/gmap.c
Petr Tesarik 1f049f
@@ -2720,3 +2720,89 @@ void s390_reset_acc(struct mm_struct *mm
Petr Tesarik 1f049f
 	mmput(mm);
Petr Tesarik 1f049f
 }
Petr Tesarik 1f049f
 EXPORT_SYMBOL_GPL(s390_reset_acc);
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+/**
Petr Tesarik 1f049f
+ * s390_unlist_old_asce - Remove the topmost level of page tables from the
Petr Tesarik 1f049f
+ * list of page tables of the gmap.
Petr Tesarik 1f049f
+ * @gmap: the gmap whose table is to be removed
Petr Tesarik 1f049f
+ *
Petr Tesarik 1f049f
+ * On s390x, KVM keeps a list of all pages containing the page tables of the
Petr Tesarik 1f049f
+ * gmap (the CRST list). This list is used at tear down time to free all
Petr Tesarik 1f049f
+ * pages that are now not needed anymore.
Petr Tesarik 1f049f
+ *
Petr Tesarik 1f049f
+ * This function removes the topmost page of the tree (the one pointed to by
Petr Tesarik 1f049f
+ * the ASCE) from the CRST list.
Petr Tesarik 1f049f
+ *
Petr Tesarik 1f049f
+ * This means that it will not be freed when the VM is torn down, and needs
Petr Tesarik 1f049f
+ * to be handled separately by the caller, unless a leak is actually
Petr Tesarik 1f049f
+ * intended. Notice that this function will only remove the page from the
Petr Tesarik 1f049f
+ * list, the page will still be used as a top level page table (and ASCE).
Petr Tesarik 1f049f
+ */
Petr Tesarik 1f049f
+void s390_unlist_old_asce(struct gmap *gmap)
Petr Tesarik 1f049f
+{
Petr Tesarik 1f049f
+	struct page *old;
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+	old = virt_to_page(gmap->table);
Petr Tesarik 1f049f
+	spin_lock(&gmap->guest_table_lock);
Petr Tesarik 1f049f
+	list_del(&old->lru);
Petr Tesarik 1f049f
+	/*
Petr Tesarik 1f049f
+	 * Sometimes the topmost page might need to be "removed" multiple
Petr Tesarik 1f049f
+	 * times, for example if the VM is rebooted into secure mode several
Petr Tesarik 1f049f
+	 * times concurrently, or if s390_replace_asce fails after calling
Petr Tesarik 1f049f
+	 * s390_remove_old_asce and is attempted again later. In that case
Petr Tesarik 1f049f
+	 * the old asce has been removed from the list, and therefore it
Petr Tesarik 1f049f
+	 * will not be freed when the VM terminates, but the ASCE is still
Petr Tesarik 1f049f
+	 * in use and still pointed to.
Petr Tesarik 1f049f
+	 * A subsequent call to replace_asce will follow the pointer and try
Petr Tesarik 1f049f
+	 * to remove the same page from the list again.
Petr Tesarik 1f049f
+	 * Therefore it's necessary that the page of the ASCE has valid
Petr Tesarik 1f049f
+	 * pointers, so list_del can work (and do nothing) without
Petr Tesarik 1f049f
+	 * dereferencing stale or invalid pointers.
Petr Tesarik 1f049f
+	 */
Petr Tesarik 1f049f
+	INIT_LIST_HEAD(&old->lru);
Petr Tesarik 1f049f
+	spin_unlock(&gmap->guest_table_lock);
Petr Tesarik 1f049f
+}
Petr Tesarik 1f049f
+EXPORT_SYMBOL_GPL(s390_unlist_old_asce);
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+/**
Petr Tesarik 1f049f
+ * s390_replace_asce - Try to replace the current ASCE of a gmap with a copy
Petr Tesarik 1f049f
+ * @gmap: the gmap whose ASCE needs to be replaced
Petr Tesarik 1f049f
+ *
Petr Tesarik 1f049f
+ * If the allocation of the new top level page table fails, the ASCE is not
Petr Tesarik 1f049f
+ * replaced.
Petr Tesarik 1f049f
+ * In any case, the old ASCE is always removed from the gmap CRST list.
Petr Tesarik 1f049f
+ * Therefore the caller has to make sure to save a pointer to it
Petr Tesarik 1f049f
+ * beforehand, unless a leak is actually intended.
Petr Tesarik 1f049f
+ */
Petr Tesarik 1f049f
+int s390_replace_asce(struct gmap *gmap)
Petr Tesarik 1f049f
+{
Petr Tesarik 1f049f
+	unsigned long asce;
Petr Tesarik 1f049f
+	struct page *page;
Petr Tesarik 1f049f
+	void *table;
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+	s390_unlist_old_asce(gmap);
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+	page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
Petr Tesarik 1f049f
+	if (!page)
Petr Tesarik 1f049f
+		return -ENOMEM;
Petr Tesarik 1f049f
+	table = page_to_virt(page);
Petr Tesarik 1f049f
+	memcpy(table, gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+	/*
Petr Tesarik 1f049f
+	 * The caller has to deal with the old ASCE, but here we make sure
Petr Tesarik 1f049f
+	 * the new one is properly added to the CRST list, so that
Petr Tesarik 1f049f
+	 * it will be freed when the VM is torn down.
Petr Tesarik 1f049f
+	 */
Petr Tesarik 1f049f
+	spin_lock(&gmap->guest_table_lock);
Petr Tesarik 1f049f
+	list_add(&page->lru, &gmap->crst_list);
Petr Tesarik 1f049f
+	spin_unlock(&gmap->guest_table_lock);
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+	/* Set new table origin while preserving existing ASCE control bits */
Petr Tesarik 1f049f
+	asce = (gmap->asce & ~_ASCE_ORIGIN) | __pa(table);
Petr Tesarik 1f049f
+	WRITE_ONCE(gmap->asce, asce);
Petr Tesarik 1f049f
+	WRITE_ONCE(gmap->mm->context.gmap_asce, asce);
Petr Tesarik 1f049f
+	WRITE_ONCE(gmap->table, table);
Petr Tesarik 1f049f
+
Petr Tesarik 1f049f
+	return 0;
Petr Tesarik 1f049f
+}
Petr Tesarik 1f049f
+EXPORT_SYMBOL_GPL(s390_replace_asce);