Blob Blame History Raw
From: Harald Freudenberger <freude@linux.ibm.com>
Subject: s390/ap: Fix hanging ioctl caused by wrong msg counter

Patch-mainline: v5.13-rc7
Git-commit: e73a99f3287a740a07d6618e9470f4d6cb217da8
References: bsc#1188982 LTC#193818

Description:   kernel: Fix hanging ioctl caused by wrong msg counter
Symptom:       Application hang after offline/online switch of a
               crypto queue.
Problem:       When a AP queue is switched to soft offline, all
               pending requests are purged out of the pending requests
               list and 'received' by the upper layer like zcrypt
               device drivers. This is also done for requests which
               are already enqueued into the firmware queue. A request
               in a firmware queue may eventually produce an response
               message, but there is no waiting process any
               more. However, the response was counted with the
               queue_counter and as this counter was reset to 0 with
               the offline switch, the pending response caused the
               queue_counter to get negative. The next request
               increased this counter to 0 (instead of 1) which caused
               the ap code to assume there is nothing to receive and
               so the response for this valid request was never tried
               to fetch from the firmware queue.
               This all caused a queue to not work properly after a
               switch offline/online and in the end processes to hang
               forever when trying to send a crypto request after an
               queue offline/online switch cicle.
               Fixed by a) making sure the counter does not drop below
               0 and b) on a successful enqueue of a message has at
               least a value of 1.
               Additionally a warning is emitted, when a reply can't
               get assigned to a waiting process. This may be normal
               operation (process had timeout or has been killed) but
               may give a hint that something unexpected happened
               (like this odd behavior described above).
Solution:      Fixed by fixing the code.
Reproduction:  Switch all but one queue offline. Run your application
               against this only queue and while doing so switch the
               queue offline. Application will stop with ENODEV. Now
               switch back this only queue to online and restart the
               application. Without the fix it will hang unable to
               receive an reply for crypto requests. With the fix it
               will run as expected without any hangs.

Upstream-Description:

              s390/ap: Fix hanging ioctl caused by wrong msg counter

              When a AP queue is switched to soft offline, all pending
              requests are purged out of the pending requests list and
              'received' by the upper layer like zcrypt device drivers.
              This is also done for requests which are already enqueued
              into the firmware queue. A request in a firmware queue
              may eventually produce an response message, but there is
              no waiting process any more. However, the response was
              counted with the queue_counter and as this counter was
              reset to 0 with the offline switch, the pending response
              caused the queue_counter to get negative. The next request
              increased this counter to 0 (instead of 1) which caused
              the ap code to assume there is nothing to receive and so
              the response for this valid request was never tried to
              fetch from the firmware queue.

              This all caused a queue to not work properly after a
              switch offline/online and in the end processes to hang
              forever when trying to send a crypto request after an
              queue offline/online switch cicle.

              Fixed by a) making sure the counter does not drop below 0
              and b) on a successful enqueue of a message has at least
              a value of 1.

              Additionally a warning is emitted, when a reply can't get
              assigned to a waiting process. This may be normal operation
              (process had timeout or has been killed) but may give a
              hint that something unexpected happened (like this odd
              behavior described above).

              Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
              Cc: stable@vger.kernel.org
              Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>


Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Michal Suchanek <msuchanek@suse.de>
---
 drivers/s390/crypto/ap_queue.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/drivers/s390/crypto/ap_queue.c
+++ b/drivers/s390/crypto/ap_queue.c
@@ -13,6 +13,7 @@
 #include <asm/facility.h>
 
 #include "ap_bus.h"
+#include "ap_debug.h"
 
 /**
  * ap_queue_enable_interruption(): Enable interruption on an AP queue.
@@ -131,12 +131,13 @@ static struct ap_queue_status ap_sm_recv
 {
 	struct ap_queue_status status;
 	struct ap_message *ap_msg;
+	bool found = false;
 
 	status = ap_dqap(aq->qid, &aq->reply->psmid,
 			 aq->reply->message, aq->reply->length);
 	switch (status.response_code) {
 	case AP_RESPONSE_NORMAL:
-		aq->queue_count--;
+		aq->queue_count = max_t(int, 0, aq->queue_count - 1);
 		if (aq->queue_count > 0)
 			mod_timer(&aq->timeout,
 				  jiffies + aq->request_timeout);
@@ -146,8 +147,14 @@ static struct ap_queue_status ap_sm_recv
 			list_del_init(&ap_msg->list);
 			aq->pendingq_count--;
 			ap_msg->receive(aq, ap_msg, aq->reply);
+			found = true;
 			break;
 		}
+		if (!found) {
+			AP_DBF(DBF_WARN, "%s unassociated reply psmid=0x%016llx on 0x%02x.%04x\n",
+				    __func__, aq->reply->psmid,
+				    AP_QID_CARD(aq->qid), AP_QID_QUEUE(aq->qid));
+		}
 	case AP_RESPONSE_NO_PENDING_REPLY:
 		if (!status.queue_empty || aq->queue_count <= 0)
 			break;
@@ -239,7 +246,7 @@ static enum ap_wait ap_sm_write(struct a
 			   ap_msg->message, ap_msg->length, ap_msg->special);
 	switch (status.response_code) {
 	case AP_RESPONSE_NORMAL:
-		aq->queue_count++;
+		aq->queue_count = max_t(int, 1, aq->queue_count + 1);
 		if (aq->queue_count == 1)
 			mod_timer(&aq->timeout, jiffies + aq->request_timeout);
 		list_move_tail(&ap_msg->list, &aq->pendingq);