From: Yong Zhao <yong.zhao@amd.com>
Date: Wed, 11 Jul 2018 22:33:05 -0400
Subject: drm/amdkfd: Introduce KFD module parameter halt_if_hws_hang
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Git-commit: 0e9a860c72ec387140a0feb4b8d9a6d0004e9316
Patch-mainline: v4.19-rc1
References: FATE#326289 FATE#326079 FATE#326049 FATE#322398 FATE#326166

When halt_if_hws_hang is set and an HWS hang is detected, KFD avoids
triggering a GPU reset or otherwise changing the HW state. Instead,
the KFD driver thread halts, which allows HW debugging tools to
analyze the problem.

Signed-off-by: Yong Zhao <yong.zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Petr Tesarik <ptesarik@suse.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c |    7 +++++++
 drivers/gpu/drm/amd/amdkfd/kfd_module.c               |    4 ++++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h                 |    5 +++++
 3 files changed, 16 insertions(+)

--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1217,6 +1217,13 @@ int amdkfd_fence_wait_timeout(unsigned i
 	while (*fence_addr != fence_value) {
 		if (time_after(jiffies, end_jiffies)) {
 			pr_err("qcm fence wait loop timeout expired\n");
+			/* In HWS case, this is used to halt the driver thread
+			 * in order not to mess up CP states before doing
+			 * scandumps for FW debugging.
+			 */
+			while (halt_if_hws_hang)
+				schedule();
+
 			return -ETIME;
 		}
 		schedule();
--- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
@@ -92,6 +92,10 @@ MODULE_PARM_DESC(noretry,
 
 static int amdkfd_init_completed;
 
+int halt_if_hws_hang;
+module_param(halt_if_hws_hang, int, 0644);
+MODULE_PARM_DESC(halt_if_hws_hang, "Halt if HWS hang is detected (0 = off (default), 1 = on)");
+
 int kgd2kfd_init(unsigned int interface_version,
 		const struct kgd2kfd_calls **g2f)
 {
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -144,6 +144,11 @@ extern int ignore_crat;
  */
 extern int vega10_noretry;
 
+/*
+ * Halt if HWS hang is detected
+ */
+extern int halt_if_hws_hang;
+
 /**
  * enum kfd_sched_policy
  *