From: Steffen Maier <maier@linux.vnet.ibm.com>
Subject: scsi: zfcp: fix queuecommand for scsi_eh commands when DIX enabled
Patch-mainline: v4.14-rc1
Git-commit: 71b8e45da51a7b64a23378221c0a5868bd79da4f
References: bnc#1066983, LTC#158493

Description:  zfcp: fix queuecommand for scsi_eh commands when DIX enabled
Symptom:      Prerequisites: zfcp.dif=1 and a T10-DIF SCSI disk for which the
              SCSI disk driver (sd) enabled DIX. The same single SCSI command
              (READ or WRITE) must run into two timeouts without any successful
              command response in between.
              This triggers SCSI error handling (scsi_eh). Each test unit ready
              (TUR) SCSI command as part of scsi_eh fails in zfcp causing a
              QDIO problem with kernel message:
              "zfcp.e78dec: <FCP_device_bus_ID>: A QDIO problem occurred"
              (zfcpdbf REC trace tag "qdires1").
              As a result, scsi_eh unnecessarily escalates successful LUN reset
              (zfcpdbf SCSI trace tag "lr_okay") to successful target reset
              (zfcpdbf SCSI trace tag "tr_okay") to successful host reset
              (zfcpdbf SCSI trace tag "schrh_1") which finally gives up by
              setting affected SCSI devices offline with kernel message:
              "sd H:0:T:L: Device offlined - not ready after error recovery"
Problem:      Scsi_eh re-uses regular SCSI commands in scsi_send_eh_cmnd().
              Such a command can have DIX protection data. Since commit
              db007fc5e20c ("[SCSI] Command protection operation"),
              scsi_eh_prep_cmnd() saves scmd->prot_op and temporarily resets it
              to SCSI_PROT_NORMAL. A re-used command can still have
              (scsi_prot_sg_count() != 0) and so zfcp sends down bogus requests
              to the FCP channel hardware making the TUR scsi_eh command fail.
              This causes scsi_eh_test_devices() to have (finish_cmds == 0)
              [that is, the SCSI device is online and scsi_eh_tur() failed]. So
              regular SCSI commands, that caused / were affected by scsi_eh,
              are moved to work_q and scsi_eh_test_devices() itself returns
              false. This escalates scsi_eh up to a final failure in
              scsi_eh_ready_devs() causing scsi_eh_offline_sdevs().
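Illustration: The decision described above can be modelled in a few lines of
              user-space C. This is only a sketch under simplifying
              assumptions, not the kernel code: model_finish_cmds() and
              model_after_lun_reset() are hypothetical names, and the model
              only covers the two inputs named in the text (device online,
              TUR failed).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of one scsi_eh stage in this model. */
enum model_eh_step {
	MODEL_EH_DONE,      /* commands can be finished, no escalation */
	MODEL_EH_ESCALATE   /* commands stay on work_q, next stage runs */
};

/*
 * Models the finish_cmds condition as described in the text:
 * commands are finished if the device is offline or the TUR succeeded.
 */
static bool model_finish_cmds(bool device_online, bool tur_failed)
{
	return !device_online || !tur_failed;
}

/* Models what happens after a successful LUN reset followed by a TUR. */
static enum model_eh_step model_after_lun_reset(bool device_online,
						bool tur_failed)
{
	if (model_finish_cmds(device_online, tur_failed))
		return MODEL_EH_DONE;
	/* finish_cmds == 0: device online, but the TUR failed */
	return MODEL_EH_ESCALATE;
}
```

              In this model, an online device whose TUR fails is the only
              combination that escalates, which is the case the bogus zfcp
              request provokes.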
Solution:     Other FCP LLDDs such as qla2xxx and lpfc shield their
              queuecommand() to only access any of scsi_prot_sg...() if
              (scsi_get_prot_op(cmd) != SCSI_PROT_NORMAL).
              Do the same thing for zfcp, which introduced DIX support with
              commit ef3eb71d8ba4 ("[SCSI] zfcp: Introduce experimental support
              for DIF/DIX").
Reproduction: With zfcp.dif=1 and a T10-DIF SCSI disk for which the SCSI disk
              driver (sd) enabled DIX, trigger two timeouts in a row for the
              same single SCSI command (READ or WRITE).
              To manually create a similar situation: Stop multipathd so we
              don't get additional path checker TURs. Enable RSCN suppression
              on the SAN switch port beyond the first link, i.e. towards the
              storage target. Disable that switch port. Send one SCSI command
              in the background (because it will block for a while) e.g. via
              "dd if=/dev/mapper/... of=/dev/null count=1 &". After
              <SCSI command timeout> seconds, the command runs into the timeout
              for the first time, gets aborted, and then a retry is submitted.
              The retry is also lost because the switch port is still disabled.
              After 1.5 * <SCSI command timeout> seconds, enable that switch
              port again. After 2 * <SCSI command timeout> seconds, the command
              runs into the timeout for the second time and triggers scsi_eh.
              As a first step, scsi_eh sends a LUN reset which should get a
              successful response from the storage target. The subsequent
              scsi_eh TUR is only successful with this fix.

Upstream-Description:

              scsi: zfcp: fix queuecommand for scsi_eh commands when DIX enabled

              Since commit db007fc5e20c ("[SCSI] Command protection operation"),
              scsi_eh_prep_cmnd() saves scmd->prot_op and temporarily resets it to
              SCSI_PROT_NORMAL.
              Other FCP LLDDs such as qla2xxx and lpfc shield their queuecommand()
              to only access any of scsi_prot_sg...() if
              (scsi_get_prot_op(cmd) != SCSI_PROT_NORMAL).

              Do the same thing for zfcp, which introduced DIX support with
              commit ef3eb71d8ba4 ("[SCSI] zfcp: Introduce experimental support for
              DIF/DIX").

              Otherwise, TUR SCSI commands as part of scsi_eh likely fail in zfcp,
              because the regular SCSI command with DIX protection data, that scsi_eh
              re-uses in scsi_send_eh_cmnd(), of course still has
              (scsi_prot_sg_count() != 0) and so zfcp sends down bogus requests to the
              FCP channel hardware.

              This causes scsi_eh_test_devices() to have (finish_cmds == 0)
              [not SCSI device is online or not scsi_eh_tur() failed]
              so regular SCSI commands, that caused / were affected by scsi_eh,
              are moved to work_q and scsi_eh_test_devices() itself returns false.
              In turn, it unnecessarily escalates in our case in scsi_eh_ready_devs()
              beyond host reset to finally scsi_eh_offline_sdevs()
              which sets affected SCSI devices offline with the following kernel message:

              "kernel: sd H:0:T:L: Device offlined - not ready after error recovery"

              Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
              Fixes: ef3eb71d8ba4 ("[SCSI] zfcp: Introduce experimental support for DIF/DIX")
              Cc: <stable@vger.kernel.org> #2.6.36+
              Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
              Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
              Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>


Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.com>
---
 drivers/s390/scsi/zfcp_fsf.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -2258,7 +2258,8 @@ int zfcp_fsf_fcp_cmnd(struct scsi_cmnd *
 	fcp_cmnd = (struct fcp_cmnd *) &req->qtcb->bottom.io.fcp_cmnd;
 	zfcp_fc_scsi_to_fcp(fcp_cmnd, scsi_cmnd, 0);
 
-	if (scsi_prot_sg_count(scsi_cmnd)) {
+	if ((scsi_get_prot_op(scsi_cmnd) != SCSI_PROT_NORMAL) &&
+	    scsi_prot_sg_count(scsi_cmnd)) {
 		zfcp_qdio_set_data_div(qdio, &req->qdio_req,
 				       scsi_prot_sg_count(scsi_cmnd));
 		retval = zfcp_qdio_sbals_from_sg(qdio, &req->qdio_req,