Blob Blame History Raw
From: James Smart <jsmart2021@gmail.com>
Date: Fri, 10 Sep 2021 16:31:49 -0700
Subject: scsi: lpfc: Fix hang on unload due to stuck fport node
Patch-mainline: v5.16-rc1
Git-commit: 88f7702984e6e562223ecc07c38ac4e61713780a
References: bsc#1190576

A test scenario encountered an unload hang while an FLOGI ELS was in flight
when a link down condition occurred.  The driver fails unload as it never
releases the fport node.

For most nodes, when the link drops, devloss tmo is started and the timeout
will cause the final node release. For the Fport, as it has not yet
registered with the SCSI transport, there is no devloss timer to be
started, so there is no final release.  Additionally, the link down
sequence causes ABORTS to be issued for pending ELS's. The completions from
the ABORTS perform the release of node references.  However, as the adapter
is being reset to be unloaded, those completions will never occur.

Fix by the following:

 - In the ELS cleanup, recognize when unloading and place the ELS's on a
   different list that immediately cleans up/completes the ELS's.  It's
   recognized that this condition primarily affects only the fport, with
   other ports having normal clean up logic that handles things.

 - Resolve the devloss issue by, when cleaning up nodes on after link down,
   recognizing when the fabric node does not have a completed state (its
   state is UNUSED) and removing a reference so the node can delete after
   the ELS reference is released.

Link: https://lore.kernel.org/r/20210910233159.115896-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/scsi/lpfc/lpfc_els.c     |   14 ++++++++++++++
 drivers/scsi/lpfc/lpfc_hbadisc.c |   14 +++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -11338,6 +11338,7 @@ lpfc_sli4_vport_delete_els_xri_aborted(s
 {
 	struct lpfc_hba *phba = vport->phba;
 	struct lpfc_sglq *sglq_entry = NULL, *sglq_next = NULL;
+	struct lpfc_nodelist *ndlp = NULL;
 	unsigned long iflag = 0;
 
 	spin_lock_irqsave(&phba->sli4_hba.sgl_list_lock, iflag);
@@ -11345,7 +11346,20 @@ lpfc_sli4_vport_delete_els_xri_aborted(s
 			&phba->sli4_hba.lpfc_abts_els_sgl_list, list) {
 		if (sglq_entry->ndlp && sglq_entry->ndlp->vport == vport) {
 			lpfc_nlp_put(sglq_entry->ndlp);
+			ndlp = sglq_entry->ndlp;
 			sglq_entry->ndlp = NULL;
+
+			/* If the xri on the abts_els_sgl list is for the Fport
+			 * node and the vport is unloading, the xri aborted wcqe
+			 * likely isn't coming back.  Just release the sgl.
+			 */
+			if ((vport->load_flag & FC_UNLOADING) &&
+			    ndlp->nlp_DID == Fabric_DID) {
+				list_del(&sglq_entry->list);
+				sglq_entry->state = SGL_FREED;
+				list_add_tail(&sglq_entry->list,
+					&phba->sli4_hba.lpfc_els_sgl_list);
+			}
 		}
 	}
 	spin_unlock_irqrestore(&phba->sli4_hba.sgl_list_lock, iflag);
--- a/drivers/scsi/lpfc/lpfc_hbadisc.c
+++ b/drivers/scsi/lpfc/lpfc_hbadisc.c
@@ -820,8 +820,20 @@ lpfc_cleanup_rpis(struct lpfc_vport *vpo
 	struct lpfc_nodelist *ndlp, *next_ndlp;
 
 	list_for_each_entry_safe(ndlp, next_ndlp, &vport->fc_nodes, nlp_listp) {
-		if (ndlp->nlp_state == NLP_STE_UNUSED_NODE)
+		if (ndlp->nlp_state == NLP_STE_UNUSED_NODE) {
+			/* It's possible the FLOGI to the fabric node never
+			 * successfully completed and never registered with the
+			 * transport.  In this case there is no way to clean up
+			 * the node.
+			 */
+			if (ndlp->nlp_DID == Fabric_DID) {
+				if (ndlp->nlp_prev_state ==
+				    NLP_STE_UNUSED_NODE &&
+				    !ndlp->fc4_xpt_flags)
+					lpfc_nlp_put(ndlp);
+			}
 			continue;
+		}
 
 		if ((phba->sli3_options & LPFC_SLI3_VPORT_TEARDOWN) ||
 		    ((vport->port_type == LPFC_NPIV_PORT) &&