Blob Blame History Raw
From: Yong Zhao <yong.zhao@amd.com>
Date: Fri, 13 Jul 2018 16:17:46 -0400
Subject: drm/amdkfd: Workaround to accommodate Raven too many PPR issue
Git-commit: 8725aecac331954e0827d6bed7be02eb7b8f1e9e
Patch-mainline: v4.19-rc1
References: FATE#326289 FATE#326079 FATE#326049 FATE#322398 FATE#326166

On Raven multiple PPRs can be queued up by the hardware. When the
first of those requests is handled by the IOMMU driver, the memory
access succeeds. After that the application may be done with the
memory and unmap it. At that point the page table entries are
invalidated, but there are still outstanding duplicate PPRs for those
addresses. When the IOMMU driver processes those duplicate requests,
it finds invalid page table entries and triggers an invalid PPR fault.

As a workaround, don't signal invalid PPR faults on Raven to avoid
segfaulting applications that haven't done anything wrong. As a side
effect, real GPU memory access faults may go unnoticed by the
application.

Signed-off-by: Yong Zhao <yong.zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Petr Tesarik <ptesarik@suse.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -932,13 +932,24 @@ void kfd_signal_iommu_event(struct kfd_d
 	up_read(&mm->mmap_sem);
 	mmput(mm);
 
-	mutex_lock(&p->event_mutex);
+	pr_debug("notpresent %d, noexecute %d, readonly %d\n",
+			memory_exception_data.failure.NotPresent,
+			memory_exception_data.failure.NoExecute,
+			memory_exception_data.failure.ReadOnly);
 
-	/* Lookup events by type and signal them */
-	lookup_events_by_type_and_signal(p, KFD_EVENT_TYPE_MEMORY,
-			&memory_exception_data);
+	/* Workaround on Raven to not kill the process when memory is freed
+	 * before IOMMU is able to finish processing all the excessive PPRs
+	 */
+	if (dev->device_info->asic_family != CHIP_RAVEN) {
+		mutex_lock(&p->event_mutex);
+
+		/* Lookup events by type and signal them */
+		lookup_events_by_type_and_signal(p, KFD_EVENT_TYPE_MEMORY,
+				&memory_exception_data);
+
+		mutex_unlock(&p->event_mutex);
+	}
 
-	mutex_unlock(&p->event_mutex);
 	kfd_unref_process(p);
 }
 #endif /* KFD_SUPPORT_IOMMU_V2 */