From: =?UTF-8?q?Nicolai=20H=C3=A4hnle?=
Date: Thu, 28 Sep 2017 11:57:32 +0200
Subject: drm/amd/sched: fix deadlock caused by unsignaled fences of deleted jobs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Git-commit: 79867462634836ee5c39a2cdf624719feeb189bd
Patch-mainline: v4.15-rc1
References: FATE#326289 FATE#326079 FATE#326049 FATE#322398 FATE#326166

Highly concurrent Piglit runs can trigger a race condition where a pending
SDMA job on a buffer object is never executed because the corresponding
process is killed (perhaps due to a crash). Since the job's fences were
never signaled, the buffer object was effectively leaked. Worse, the
buffer was stuck wherever it happened to be at the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits with
kernel stacks like:

  [] dma_fence_default_wait+0x112/0x250
  [] dma_fence_wait_timeout+0x39/0xf0
  [] reservation_object_wait_timeout_rcu+0x1c2/0x300
  [] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
  [] ttm_mem_evict_first+0xba/0x1a0 [ttm]
  [] ttm_bo_mem_space+0x341/0x4c0 [ttm]
  [] ttm_bo_validate+0xd4/0x150 [ttm]
  [] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
  [] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
  [] amdgpu_bo_create+0xda/0x220 [amdgpu]
  [] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
  [] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
  [] drm_ioctl+0x1fa/0x480 [drm]
  [] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
  [] do_vfs_ioctl+0xa3/0x5f0
  [] SyS_ioctl+0x79/0x90
  [] entry_SYSCALL_64_fastpath+0x1e/0xad
  [] 0xffffffffffffffff

Note: The correctness of this change depends on the earlier commit
"drm/amd/sched: move adding finish callback to amd_sched_job_begin"

v2: set an error on the finished fence

Signed-off-by: Nicolai Hähnle
Reviewed-by: Christian König
Reviewed-by: Andres Rodriguez
Signed-off-by: Alex Deucher
Acked-by: Petr Tesarik
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -227,8 +227,14 @@ void amd_sched_entity_fini(struct amd_gp
 		 */
 		kthread_park(sched->thread);
 		kthread_unpark(sched->thread);
-		while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
+		while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
+			struct amd_sched_fence *s_fence = job->s_fence;
+			amd_sched_fence_scheduled(s_fence);
+			dma_fence_set_error(&s_fence->finished, -ESRCH);
+			amd_sched_fence_finished(s_fence);
+			dma_fence_put(&s_fence->finished);
 			sched->ops->free_job(job);
+		}
 
 	}
 	kfifo_free(&entity->job_queue);