aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/nvme/host/tcp.c
AgeCommit message (Collapse)Author
2023-08-16nvme-tcp: fix potential unbalanced freeze & unfreezeMing Lei
commit 99dc264014d5aed66ee37ddf136a38b5a2b1b529 upstream. Move start_freeze into nvme_tcp_configure_io_queues(), and there is at least two benefits: 1) fix unbalanced freeze and unfreeze, since re-connection work may fail or be broken by removal 2) IO during error recovery can be failfast quickly because nvme fabrics unquiesces queues after teardown. One side-effect is that !mpath request may timeout during connecting because of queue topo change, but that looks not one big deal: 1) same problem exists with current code base 2) compared with !mpath, mpath use case is dominant Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-26nvme-tcp: fix a possible UAF when failing to allocate an io queueSagi Grimberg
[ Upstream commit 88eaba80328b31ef81813a1207b4056efd7006a6 ] When we allocate a nvme-tcp queue, we set the data_ready callback before we actually need to use it. This creates the potential that if a stray controller sends us data on the socket before we connect, we can trigger the io_work and start consuming the socket. In this case reported: we failed to allocate one of the io queues, and as we start releasing the queues that we already allocated, we get a UAF [1] from the io_work which is running before it should really. Fix this by setting the socket ops callbacks only before we start the queue, so that we can't accidentally schedule the io_work in the initialization phase before the queue started. While we are at it, rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops. [1]: [16802.107284] nvme nvme4: starting error recovery [16802.109166] nvme nvme4: Reconnecting in 10 seconds... [16812.173535] nvme nvme4: failed to connect socket: -111 [16812.173745] nvme nvme4: Failed reconnect attempt 1 [16812.173747] nvme nvme4: Reconnecting in 10 seconds... [16822.413555] nvme nvme4: failed to connect socket: -111 [16822.413762] nvme nvme4: Failed reconnect attempt 2 [16822.413765] nvme nvme4: Reconnecting in 10 seconds... [16832.661274] nvme nvme4: creating 32 I/O queues. [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088 [16833.920068] nvme nvme4: Failed reconnect attempt 3 [16833.920094] #PF: supervisor write access in kernel mode [16833.920261] nvme nvme4: Reconnecting in 10 seconds... [16833.920368] #PF: error_code(0x0002) - not-present page [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp] [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30 ... [16833.923138] Call Trace: [16833.923271] <TASK> [16833.923402] lock_sock_nested+0x1e/0x50 [16833.923545] nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp] [16833.923685] nvme_tcp_io_work+0x68/0xa0 [nvme_tcp] [16833.923824] process_one_work+0x1e8/0x390 [16833.923969] worker_thread+0x53/0x3d0 [16833.924104] ? process_one_work+0x390/0x390 [16833.924240] kthread+0x124/0x150 [16833.924376] ? set_kthread_struct+0x50/0x50 [16833.924518] ret_from_fork+0x1f/0x30 [16833.924655] </TASK> Reported-by: Yanjun Zhang <zhangyanjun@cestc.cn> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Yanjun Zhang <zhangyanjun@cestc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-15nvme-tcp: fix UAF when detecting digest errorsSagi Grimberg
[ Upstream commit 160f3549a907a50e51a8518678ba2dcf2541abea ] We should also bail from the io_work loop when we set rd_enabled to true, so we don't attempt to read data from the socket when the TCP stream is already out-of-sync or corrupted. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21nvme: fix regression when disconnect a recovering ctrlRuozhu Li
[ Upstream commit f7f70f4aa09dc43d7455c060143e86a017c30548 ] We encountered a problem that the disconnect command hangs. After analyzing the log and stack, we found that the triggering process is as follows: CPU0 CPU1 nvme_rdma_error_recovery_work nvme_rdma_teardown_io_queues nvme_do_delete_ctrl nvme_stop_queues nvme_remove_namespaces --clear ctrl->namespaces nvme_start_queues --no ns in ctrl->namespaces nvme_ns_remove return(because ctrl is deleting) blk_freeze_queue blk_mq_freeze_queue_wait --wait for ns to unquiesce to clean infligt IO, hang forever This problem was not found in older kernels because we will flush err work in nvme_stop_ctrl before nvme_remove_namespaces.It does not seem to be modified for functional reasons, the patch can be revert to solve the problem. Revert commit 794a4cb3d2f7 ("nvme: remove the .stop_ctrl callout") Signed-off-by: Ruozhu Li <liruozhu@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-23nvme-tcp: fix possible use-after-free in transport error_recovery workSagi Grimberg
[ Upstream commit ff9fc7ebf5c06de1ef72a69f9b1ab40af8b07f9e ] While nvme_tcp_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Tested-by: Chris Leech <cleech@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16nvme-tcp: fix bogus request completion when failing to send AERSagi Grimberg
commit 63573807b27e0faf8065a28b1bbe1cbfb23c0130 upstream. AER is not backed by a real request, hence we should not incorrectly assume that when failing to send a nvme command, it is a normal request but rather check if this is an aer and if so complete the aer (similar to the normal completion path). Cc: stable@vger.kernel.org Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-02nvme-tcp: fix data digest pointer calculationVarun Prakash
commit d89b9f3bbb58e9e378881209756b0723694f22ff upstream. ddgst is of type __le32, &req->ddgst + req->offset increases &req->ddgst by 4 * req->offset, fix this by type casting &req->ddgst to u8 *. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu dataSagi Grimberg
[ Upstream commit 3b01a9d0caa8276d9ce314e09610f7fb70f49a00 ] We already validate it when receiving the c2hdata pdu header and this is not changing so this is a redundant check. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15nvme-tcp: don't update queue count when failing to set io queuesRuozhu Li
[ Upstream commit 664227fde63844d69e9ec9e90a8a7801e6ff072d ] We update ctrl->queue_count and schedule another reconnect when io queue count is zero.But we will never try to create any io queue in next reco- nnection, because ctrl->queue_count already set to zero.We will end up having an admin-only session in Live state, which is exactly what we try to avoid in the original patch. Update ctrl->queue_count after queue_count zero checking to fix it. Signed-off-by: Ruozhu Li <liruozhu@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-14nvme-tcp: block BH in sk state_change sk callbackSagi Grimberg
[ Upstream commit 8b73b45d54a14588f86792869bfb23098ea254cb ] The TCP stack can run from process context for a long time so we should disable BH here. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-24nvme-tcp: fix a NULL deref when receiving a 0-length r2t PDUSagi Grimberg
commit fd0823f405090f9f410fc3e3ff7efb52e7b486fa upstream. When the controller sends us a 0-length r2t PDU we should not attempt to try to set up a h2cdata PDU but rather conclude that this is a buggy controller (forward progress is not possible) and simply fail it immediately. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Belanger, Martin <Martin.Belanger@dell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24nvme-tcp: fix possible hang when failing to set io queuesSagi Grimberg
commit 72f572428b83d0bc7028e7c4326d1a5f45205e44 upstream. We only setup io queues for nvme controllers, and it makes absolutely no sense to allow a controller (re)connect without any I/O queues. If we happen to fail setting the queue count for any reason, we should not allow this to be a successful reconnect as I/O has no chance in going through. Instead just fail and schedule another reconnect. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07nvme-tcp: add clean action for failed reconnectionChao Leng
[ Upstream commit 70a99574a79f1cd4dc7ad56ea37be40844bfb97b ] If reconnect failed after start io queues, the queues will be unquiesced and new requests continue to be delivered. Reconnection error handling process directly free queues without cancel suspend requests. The suppend request will time out, and then crash due to use the queue after free. Add sync queues and cancel suppend requests for reconnection error handling. Signed-off-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19nvme-tcp: fix possible data corruption with bio mergesSagi Grimberg
commit ca1ff67d0fb14f39cf0cc5102b1fbcc3b14f6fb9 upstream. When a bio merges, we can get a request that spans multiple bios, and the overall request payload size is the sum of all bios. When we calculate how much we need to send from the existing bio (and bvec), we did not take into account the iov_iter byte count cap. Since multipage bvecs support, bvecs can split in the middle which means that when we account for the last bvec send we should also take the iov_iter byte count cap as it might be lower than the last bvec size. Reported-by: Hao Wang <pkuwangh@gmail.com> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Tested-by: Hao Wang <pkuwangh@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18nvme-tcp: avoid repeated request completionSagi Grimberg
[ Upstream commit 0a8a2c85b83589a5c10bc5564b796836bf4b4984 ] The request may be executed asynchronously, and rq->state may be changed to IDLE. To avoid repeated request completion, only MQ_RQ_COMPLETE of rq->state is checked in nvme_tcp_complete_timed_out. It is not safe, so need adding check IDLE for rq->state. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18nvme-tcp: avoid race between time out and tear downChao Leng
[ Upstream commit d6f66210f4b1aa2f5944f0e34e0f8db44f499f92 ] Now use teardown_lock to serialize for time out and tear down. This may cause abnormal: first cancel all request in tear down, then time out may complete the request again, but the request may already be freed or restarted. To avoid race between time out and tear down, in tear down process, first we quiesce the queue, and then delete the timer and cancel the time out work for the queue. At the same time we need to delete teardown_lock. Signed-off-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-14nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage()Coly Li
commit 7d4194abfc4de13a2663c7fee6891de8360f7a52 upstream. Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to send slab pages. But for pages allocated by __get_free_pages() without __GFP_COMP, which also have refcount as 0, they are still sent by kernel_sendpage() to remote end, this is problematic. The new introduced helper sendpage_ok() checks both PageSlab tag and page_count counter, and returns true if the checking page is OK to be sent by kernel_sendpage(). This patch fixes the page checking issue of nvme_tcp_try_send_data() with sendpage_ok(). If sendpage_ok() returns true, send this page by kernel_sendpage(), otherwise use sock_no_sendpage to handle this page. Signed-off-by: Coly Li <colyli@suse.de> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Vlastimil Babka <vbabka@suse.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01nvme: Fix ctrl use-after-free during sysfs deletionIsrael Rukshin
[ Upstream commit b780d7415aacec855e2f2370cbf98f918b224903 ] In case nvme_sysfs_delete() is called by the user before taking the ctrl reference count, the ctrl may be freed during the creation and cause the bug. Take the reference as soon as the controller is externally visible, which is done by cdev_device_add() in nvme_init_ctrl(). Also take the reference count at the core layer instead of taking it on each transport separately. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-23nvme-tcp: cancel async events before freeing event structDavid Milburn
[ Upstream commit ceb1e0874dba5cbfc4e0b4145796a4bfb3716e6a ] Cancel async event work in case async event has been queued up, and nvme_tcp_submit_async_event() runs after event has been freed. Signed-off-by: David Milburn <dmilburn@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-17nvme-tcp: fix reset hang if controller died in the middle of a resetSagi Grimberg
[ Upstream commit e5c01f4f7f623e768e868bcc08d8e7ceb03b75d0 ] If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands that are inflight holding the q_usage_counter, and we can't blindly fail requests that times out. So give a timeout and if we cannot wait for queue freeze before unfreezing, fail and have the error handling take care how to proceed (either schedule a reconnect of remove the controller). Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-17nvme-tcp: fix timeout handlerSagi Grimberg
[ Upstream commit 236187c4ed195161dfa4237c7beffbba0c5ae45b ] When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation, however when a request times out in a non LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However tearing down the entire set of I/O and admin queues causes freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really an overkill to what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-17nvme-tcp: serialize controller teardown sequencesSagi Grimberg
[ Upstream commit d4d61470ae48838f49e668503e840e1520b97162 ] In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19nvme-tcp: fix controller reset hang during trafficSagi Grimberg
[ Upstream commit 2875b0aecabe2f081a8432e2bc85b85df0529490 ] commit fe35ec58f0d3 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-05nvme-tcp: fix possible hang waiting for icresp responseSagi Grimberg
[ Upstream commit adc99fd378398f4c58798a1c57889872967d56a6 ] If the controller died exactly when we are receiving icresp we hang because icresp may never return. Make sure to set a high finite limit. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22nvme-tcp: use bh_lock in data_readySagi Grimberg
[ Upstream commit 386e5e6e1aa90b479fcf0467935922df8524393d ] data_ready may be invoked from send context or from softirq, so need bh locking for that. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29nvme-tcp: fix possible crash in write_zeroes processingSagi Grimberg
[ Upstream commit 25e5cb780e62bde432b401f312bb847edc78b432 ] We cannot look at blk_rq_payload_bytes without first checking that the request has a mappable physical segments first (e.g. blk_rq_nr_phys_segments(rq) != 0) and only then to take the request payload bytes. This caused us to send a wrong sgl to the target or even dereference a non-existing buffer in case we actually got to the data send sequence (if it was in-capsule). Reported-by: Tony Asleson <tasleson@redhat.com> Suggested-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-05nvme: prevent warning triggered by nvme_stop_keep_aliveNigel Kirkland
[ Upstream commit 97b2512ad000a409b4073dd1a71e4157d76675cb ] Delayed keep alive work is queued on system workqueue and may be cancelled via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq. Check_flush_dependency detects mismatched attributes between the work-queue context used to cancel the keep alive work and system-wq. Specifically system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used to cancel keep alive work have WQ_MEM_RECLAIM flag. Example warning: workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc] is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core] To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq. However this creates a secondary concern where work and a request to cancel that work may be in the same work queue - namely err_work in the rdma and tcp transports, which will want to flush/cancel the keep alive work which will now be on nvme_wq. After reviewing the transports, it looks like err_work can be moved to nvme_reset_wq. In fact that aligns them better with transition into RESETTING and performing related reset work in nvme_reset_wq. Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq. Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-05nvme/tcp: fix bug on double requeue when send failsAnton Eidelman
[ Upstream commit 2d570a7c0251c594489a2c16b82b14ae30345c03 ] When nvme_tcp_io_work() fails to send to socket due to connection close/reset, error_recovery work is triggered from nvme_tcp_state_change() socket callback. This cancels all the active requests in the tagset, which requeues them. The failed request, however, was ended and thus requeued individually as well unless send returned -EPIPE. Another return code to be treated the same way is -ECONNRESET. Double requeue caused BUG_ON(blk_queued_rq(rq)) in blk_mq_requeue_request() from either the individual requeue of the failed request or the bulk requeue from blk_mq_tagset_busy_iter(, nvme_cancel_request, ); Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-28net: use skb_queue_empty_lockless() in busy poll contextsEric Dumazet
Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15nvme-tcp: fix possible leakage during error flowMax Gurtovoy
During nvme_tcp_setup_cmd_pdu error flow, one must call nvme_cleanup_cmd since it's symmetric to nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLLSebastian Andrzej Siewior
The access to sk->sk_ll_usec should be hidden behind CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec. Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL. Fixes: 1a9460cef5711 ("nvme-tcp: support simple polling") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme: Restart request timers in resetting stateKeith Busch
A controller in the resetting state has not yet completed its recovery actions. The pci and fc transports were already handling this, so update the remaining transports to not attempt additional recovery in this state. Instead, just restart the request timer. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-09-25nvme-tcp: fix wrong stop condition in io_workWunderlich, Mark
Allow the do/while statement to continue if current time is not after the proposed time 'deadline'. Intent is to allow loop to proceed for a specific time period. Currently the loop, as coded, will exit after first pass. Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: tcp: remove redundant assignment to variable retColin Ian King
The variable ret is being initialized with a value that is never read and is being re-assigned immediately afterwards. The assignment is redundant and hence can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failedSagi Grimberg
This is a more appropriate error status for a transport error detected by us (the host). Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: make fabrics command run on a separate request queueSagi Grimberg
We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: Add TOS for tcp transportIsrael Rukshin
TOS provide clients the ability to segregate traffic flows for different type of data. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. usage examples: nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: Use struct nvme_ctrl directlyIsrael Rukshin
This patch doesn't change any functionality. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: support simple pollingSagi Grimberg
Simple polling support via socket busy_poll interface. Although we do not shutdown interrupts but simply hammer the socket poll, we can sometimes find completions faster than the normal interrupt driven RX path. We add per queue nr_cqe counter that resets every time RX path is invoked such that .poll callback can return it to stay consistent with the semantics. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: don't pass cap to nvme_disable_ctrlSagi Grimberg
All seem to call it with ctrl->cap so no need to pass it at all. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: move sqsize setting to the coreSagi Grimberg
nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: Use protocol specific operations while reading socketPotnuri Bharat Teja
Using socket specific read_sock() calls instead of directly calling tcp_read_sock() helps lld module registered handlers if any, to be called from nvme-tcp host. This patch therefore replaces the tcp_read_sock() with socket specific prot_ops. Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-tcp: cleanup nvme_tcp_recv_pduSagi Grimberg
Can return directly in the switch statement Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-04nvme: wait until all completed request's complete fn is calledMing Lei
When aborting in-flight request for recovering controller, we have to make sure that queue's complete function is called on completed request before moving on. Otherwise, for example, the warning of WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be triggered on nvme-rdma. Fix this issue by using blk_mq_tagset_wait_completed_request. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-09nvme-tcp: don't use sendpage for SLAB pagesMikhail Skorzhinskii
According to commit a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab objects") and previous discussion, tcp_sendpage should not be used for pages that is managed by SLAB, as SLAB is not taking page reference counters into consideration. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-30nvme-tcp: fix queue mapping when queue count is limitedSagi Grimberg
When the controller supports less queues than requested, we should make sure that queue mapping does the right thing and not assume that all queues are available. This fixes a crash when the controller supports less queues than requested. The rules are: 1. if no write queues are requested, we assign the available queues to the default queue map. The default and read queue maps share the existing queues. 2. if write queues are requested: - first make sure that read queue map gets the requested nr_io_queues count - then grant the default queue map the minimum between the requested nr_write_queues and the remaining queues. If there are no available queues to dedicate to the default queue map, fallback to (1) and share all the queues in the existing queue map. Also, provide a log indication on how we constructed the different queue maps. Reported-by: Harris, James R <james.r.harris@intel.com> Tested-by: Jim Harris <james.r.harris@intel.com> Cc: <stable@vger.kernel.org> # v5.0+ Suggested-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-05-01nvme-tcp: fix possible null deref on a timed out io queue connectSagi Grimberg
If I/O queue connect times out, we might have freed the queue socket already, so check for that on the error path in nvme_tcp_start_queue. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25nvme-tcp: rename function to have nvme_tcp prefixSagi Grimberg
usually nvme_ prefix is for core functions. While we're cleaning up, remove redundant empty lines Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25nvme-tcp: fix a NULL deref when an admin connect times outSagi Grimberg
If we timeout the admin startup sequence we might not yet have an I/O tagset allocated which causes the teardown sequence to crash. Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags if the tagset wasn't allocated. Fixes: 39d57757467b ("nvme-tcp: fix timeout handler") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-03-28nvme-tcp: fix an endianess miss-annotationChristoph Hellwig
nvme_tcp_end_request just takes the status value and the converts it to little endian as well as shifting for the phase bit. Fixes: 43ce38a6d823 ("nvme-tcp: support C2HData with SUCCESS flag") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>