path: root/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
Age  Commit message  (Author)
2019-01-02  xprtrdma: Don't wake pending tasks until disconnect is done  (Chuck Lever)
Transport disconnect processing does a "wake pending tasks" at various points. Suppose an RPC Reply is being processed. The RPC task that Reply goes with is waiting on the pending queue. If a disconnect wake-up happens before reply processing is done, that reply, even if it is good, is thrown away, and the RPC has to be sent again. This window apparently does not exist for socket transports because there is a lock held while a reply is being received which prevents the wake-up call until after reply processing is done. To resolve this, all RPC replies being processed on an RPC-over-RDMA transport have to complete before pending tasks are awoken due to a transport disconnect. Callers that already hold the transport write lock may invoke ->ops->close directly. Others use a generic helper that schedules a close when the write lock can be taken safely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-30  Merge tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux  (Linus Torvalds)
Pull nfsd updates from Bruce Fields: "Olga added support for the NFSv4.2 asynchronous copy protocol. We already supported COPY, by copying a limited amount of data and then returning a short result, letting the client resend. The asynchronous protocol should offer better performance at the expense of some complexity. The other highlight is Trond's work to convert the duplicate reply cache to a red-black tree, and to move it and some other server caches to RCU. (Previously these have meant taking global spinlocks on every RPC) Otherwise, some RDMA work and miscellaneous bugfixes" * tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux: (30 commits) lockd: fix access beyond unterminated strings in prints nfsd: Fix an Oops in free_session() nfsd: correctly decrement odstate refcount in error path svcrdma: Increase the default connection credit limit svcrdma: Remove try_module_get from backchannel svcrdma: Remove ->release_rqst call in bc reply handler svcrdma: Reduce max_send_sges nfsd: fix fall-through annotations knfsd: Improve lookup performance in the duplicate reply cache using an rbtree knfsd: Further simplify the cache lookup knfsd: Simplify NFS duplicate replay cache knfsd: Remove dead code from nfsd_cache_lookup SUNRPC: Simplify TCP receive code SUNRPC: Replace the cache_detail->hash_lock with a regular spinlock SUNRPC: Remove non-RCU protected lookup NFS: Fix up a typo in nfs_dns_ent_put NFS: Lockless DNS lookups knfsd: Lockless lookup of NFSv4 identities. SUNRPC: Lockless server RPCSEC_GSS context lookup knfsd: Allow lockless lookups of the exports ...
2018-10-29  svcrdma: Remove try_module_get from backchannel  (Chuck Lever)
Since commit ffe1f0df5862 ("rpcrdma: Merge svcrdma and xprtrdma modules into one"), the forward and backchannel components are part of the same kernel module. A separate try_module_get() call in the backchannel code is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29  svcrdma: Remove ->release_rqst call in bc reply handler  (Chuck Lever)
Similar to a change made in the client's forward channel reply handler: The xprt_release_rqst_cong() call is not necessary. Also, release xprt->recv_lock when taking xprt->transport_lock to avoid disabling and enabling BH's while holding another spin lock. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-18  Merge tag 'nfs-rdma-for-4.20-1' of git://git.linux-nfs.org/projects/anna/linux-nfs  (Trond Myklebust)
NFS RDMA client updates for Linux 4.20
Stable bugfixes:
- Reset credit grant properly after a disconnect
Other bugfixes and cleanups:
- xprt_release_rqst_cong is called outside of transport_lock
- Create more MRs at a time and toss out old ones during recovery
- Various improvements to the RDMA connection and disconnection code:
- Improve naming of trace events, functions, and variables
- Add documenting comments
- Fix metrics and stats reporting
- Fix a tracepoint sparse warning
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-10-02  xprtrdma: Reset credit grant properly after a disconnect  (Chuck Lever)
On a fresh connection, an RPC/RDMA client is supposed to send only one RPC Call until it gets a credit grant in the first RPC Reply from the server [RFC 8166, Section 3.3.3]. There is a bug in the Linux client's credit accounting mechanism introduced by commit e7ce710a8802 ("xprtrdma: Avoid deadlock when credit window is reset"). On connect, it simply dumps all pending RPC Calls onto the new connection. Servers have been tolerant of this bad behavior. Currently no server implementation ever changes its credit grant over reconnects, and servers always repost enough Receives before connections are fully established. To correct this issue, ensure that the client resets both the credit grant _and_ the congestion window when handling a reconnect. Fixes: e7ce710a8802 ("xprtrdma: Avoid deadlock when credit ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
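The accounting rule described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the structure and function names (`credit_model`, `credit_model_reset`, and so on) are hypothetical and are not kernel symbols, and treating a zero grant as one is an assumption made so the model cannot stall.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of client credit accounting.
 * None of these names are kernel symbols. */
struct credit_model {
	unsigned int credits;	/* most recent grant from the server */
	unsigned int cwnd;	/* Calls the client may have in flight */
	unsigned int in_flight;	/* Calls sent but not yet answered */
};

/* On (re)connect: reset BOTH the credit grant and the window, so only
 * one Call can go out until the first Reply carries a new grant. */
static void credit_model_reset(struct credit_model *m)
{
	m->credits = 1;
	m->cwnd = 1;
	m->in_flight = 0;
}

/* Returns true, and charges the window, if one more Call may be sent. */
static bool credit_model_try_send(struct credit_model *m)
{
	if (m->in_flight >= m->cwnd)
		return false;
	m->in_flight++;
	return true;
}

/* A Reply retires one Call and carries a fresh credit grant.
 * Assumption: a grant of zero is treated as one. */
static void credit_model_reply(struct credit_model *m, unsigned int grant)
{
	assert(m->in_flight > 0);
	m->in_flight--;
	m->credits = grant ? grant : 1;
	m->cwnd = m->credits;
}
```

In this model a second send attempt right after `credit_model_reset()` is refused until `credit_model_reply()` has delivered a grant, which is the behavior the commit message says the client should have on a fresh connection.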
2018-09-30  SUNRPC: Cleanup: remove the unused 'task' argument from the request_send()  (Trond Myklebust)
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30  SUNRPC: Clean up transport write space handling  (Trond Myklebust)
Treat socket write space handling in the same way we now treat transport congestion: by denying the XPRT_LOCK until the transport signals that it has free buffer space. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30  SUNRPC: Treat the task and request as separate in the xprt_ops->send_request()  (Trond Myklebust)
When we shift to using the transmit queue, then the task that holds the write lock will not necessarily be the same as the one being transmitted. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30  SUNRPC: Rename xprt->recv_lock to xprt->queue_lock  (Trond Myklebust)
We will use the same lock to protect both the transmit and receive queues. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-12  Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs  (Linus Torvalds)
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs() - Revert an incorrect change to the NFSv4.1 callback channel - Fix a bug in the NFSv4.1 sequence error handling Features and optimisations: - Support for piggybacking a LAYOUTGET operation to the OPEN compound - RDMA performance enhancements to deal with transport congestion - Add proper SPDX tags for NetApp-contributed RDMA source - Do not request delegated file attributes (size+change) from the server - Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN - Optimise away unnecessary lookups for rename targets - Misc performance improvements when freeing NFSv4 delegations Bugfixes and cleanups: - Try to fail quickly if proto=rdma - Clean up RDMA receive trace points - Fix sillyrename to return the delegation when appropriate - Misc attribute revalidation fixes - Immediately clear the pNFS layout on a file when the server returns ESTALE - Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab() - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY" * tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits) skip LAYOUTRETURN if layout is invalid NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY NFSv4: Fix a typo in nfs41_sequence_process NFSv4: Revert commit 5f83d86cf531d ("NFSv4.x: Fix wraparound issues..") NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab() NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab() NFSv4.0: Remove transport protocol name from non-UCS client ID NFSv4.0: Remove cl_ipaddr from non-UCS client ID NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined NFS: Filter cache invalidation when holding a delegation NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes() NFS: Improve caching while 
holding a delegation NFS: Fix attribute revalidation NFS: fix up nfs_setattr_update_inode NFSv4: Ensure the inode is clean when we set a delegation NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access NFSv4: Don't ask for delegated attributes when adding a hard link NFSv4: Don't ask for delegated attributes when revalidating the inode NFS: Pass the inode down to the getattr() callback NFSv4: Don't request size+change attribute if they are delegated to us ...
2018-05-11  svcrdma: Persistently allocate and DMA-map Send buffers  (Chuck Lever)
While sending each RPC Reply, svc_rdma_sendto allocates and DMA- maps a separate buffer where the RPC/RDMA transport header is constructed. The buffer is unmapped and released in the Send completion handler. This is significant per-RPC overhead, especially for small RPCs. Instead, allocate and DMA-map a buffer, and cache it in each svc_rdma_send_ctxt. This buffer and its mapping can be re-used for each RPC, saving the cost of memory allocation and DMA mapping. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
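The caching idea can be sketched in userspace C, with `malloc()` standing in for buffer allocation and DMA mapping. All names here (`send_ctxt`, `ctxt_cache`, `ctxt_get`, `ctxt_put`, `HDR_BUF_SIZE`) are hypothetical; this is not the svcrdma implementation, only a model of "allocate once, reuse for each RPC, free at teardown".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HDR_BUF_SIZE 256	/* illustrative size, not taken from the kernel */

/* Hypothetical send context: owns a header buffer for its whole life. */
struct send_ctxt {
	struct send_ctxt *next;	/* free-list linkage */
	void *hdr_buf;		/* allocated once, reused for each RPC */
};

struct ctxt_cache {
	struct send_ctxt *free;		/* contexts ready for reuse */
	unsigned int allocations;	/* buffers ever allocated */
};

/* Take a context from the free list; allocate only when the list is empty. */
static struct send_ctxt *ctxt_get(struct ctxt_cache *c)
{
	struct send_ctxt *ctxt = c->free;

	if (ctxt) {
		c->free = ctxt->next;
		ctxt->next = NULL;
		return ctxt;
	}
	ctxt = calloc(1, sizeof(*ctxt));
	if (!ctxt)
		return NULL;
	ctxt->hdr_buf = malloc(HDR_BUF_SIZE);
	if (!ctxt->hdr_buf) {
		free(ctxt);
		return NULL;
	}
	c->allocations++;
	return ctxt;
}

/* Return a context to the cache; its buffer stays allocated. */
static void ctxt_put(struct ctxt_cache *c, struct send_ctxt *ctxt)
{
	assert(ctxt != NULL);
	ctxt->next = c->free;
	c->free = ctxt;
}

/* Teardown: the only place where buffers are released. */
static void ctxt_cache_destroy(struct ctxt_cache *c)
{
	while (c->free) {
		struct send_ctxt *ctxt = c->free;

		c->free = ctxt->next;
		free(ctxt->hdr_buf);
		free(ctxt);
	}
}
```

A get/put/get sequence on this model returns the same context and the same buffer, so the per-RPC cost reduces to two pointer updates.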
2018-05-11  svcrdma: Remove post_send_wr  (Chuck Lever)
Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt, svc_rdma_post_send_wr is nearly empty. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11  svcrdma: Introduce svc_rdma_send_ctxt  (Chuck Lever)
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. Introduce a replacement to svc_rdma_op_ctxt's that is built especially for the svcrdma Send path. Subsequent patches will take advantage of this new structure by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and send_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions.
Additional clean ups:
- Handle svc_rdma_send_ctxt_get allocation failure at each call site, rather than pre-allocating and hoping we guessed correctly
- All send_ctxt_put call-sites request page freeing, so remove the @free_pages argument
- All send_ctxt_put call-sites unmap SGEs, so fold that into svc_rdma_send_ctxt_put
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11  svcrdma: Clean up Send SGE accounting  (Chuck Lever)
Clean up: Since there's already a svc_rdma_op_ctxt being passed around with the running count of mapped SGEs, drop unneeded parameters to svc_rdma_post_send_wr(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11  svcrdma: Trace key RDMA API events  (Chuck Lever)
This includes:
* Posting on the Send and Receive queues
* Send, Receive, Read, and Write completion
* Connect upcalls
* QP errors
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-07  SUNRPC: Add a ->free_slot transport callout  (Chuck Lever)
Refactor: xprtrdma needs to have better control over when RPCs are awoken from the backlog queue, so replace xprt_free_slot with a transport op callout. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-18  svcrdma: Post Receives in the Receive completion handler  (Chuck Lever)
This change improves Receive efficiency by posting Receives only on the same CPU that handles Receive completion. Improved latency and throughput has been noted with this change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-18  Merge tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux  (Linus Torvalds)
Pull nfsd updates from Bruce Fields: "Lots of good bugfixes, including: - fix a number of races in the NFSv4+ state code - fix some shutdown crashes in multiple-network-namespace cases - relax our 4.1 session limits; if you've an artificially low limit to the number of 4.1 clients that can mount simultaneously, try upgrading" * tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux: (22 commits) SUNRPC: Improve ordering of transport processing nfsd: deal with revoked delegations appropriately svcrdma: Enqueue after setting XPT_CLOSE in completion handlers nfsd: use nfs->ns.inum as net ID rpc: remove some BUG()s svcrdma: Preserve CB send buffer across retransmits nfds: avoid gettimeofday for nfssvc_boot time fs, nfsd: convert nfs4_file.fi_ref from atomic_t to refcount_t fs, nfsd: convert nfs4_cntl_odstate.co_odcount from atomic_t to refcount_t fs, nfsd: convert nfs4_stid.sc_count from atomic_t to refcount_t lockd: double unregister of inetaddr notifiers nfsd4: catch some false session retries nfsd4: fix cached replies to solo SEQUENCE compounds sunrcp: make function _svc_create_xprt static SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status nfsd: use ARRAY_SIZE nfsd: give out fewer session slots as limit approaches nfsd: increase DRC cache limit nfsd: remove unnecessary nofilehandle checks nfs_common: convert int to bool ...
2017-11-07  svcrdma: Preserve CB send buffer across retransmits  (Chuck Lever)
During each NFSv4 callback Call, an RDMA Send completion frees the page that contains the RPC Call message. If the upper layer determines that a retransmit is necessary, this is too soon. One possible symptom: after a GARBAGE_ARGS response to an NFSv4.1 callback request, the following BUG fires on the NFS server:
kernel: BUG: Bad page state in process kworker/0:2H pfn:7d3ce2
kernel: page:ffffea001f4f3880 count:-2 mapcount:0 mapping: (null) index:0x0
kernel: flags: 0x2fffff80000000()
kernel: raw: 002fffff80000000 0000000000000000 0000000000000000 fffffffeffffffff
kernel: raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
kernel: page dumped because: nonzero _refcount
kernel: Modules linked in: cts rpcsec_gss_krb5 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt iTCO_vendor_support aesni_intel crypto_simd glue_helper cryptd pcspkr lpc_ich i2c_i801 mei_me mfd_core mei raid0 sg wmi ioatdma ipmi_si ipmi_devintf ipmi_msghandler shpchp acpi_power_meter acpi_pad nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en mlx4_ib mlx5_ib ib_core sd_mod sr_mod cdrom ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crc32c_intel libahci drm mlx5_core igb libata mlx4_core dca i2c_algo_bit i2c_core nvme
kernel: ptp nvme_core pps_core dm_mirror dm_region_hash dm_log dm_mod dax
kernel: CPU: 0 PID: 11495 Comm: kworker/0:2H Not tainted 4.14.0-rc3-00001-g577ce48 #811
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
kernel: Call Trace:
kernel: dump_stack+0x62/0x80
kernel: bad_page+0xfe/0x11a
kernel: free_pages_check_bad+0x76/0x78
kernel: free_pcppages_bulk+0x364/0x441
kernel: ? ttwu_do_activate.isra.61+0x71/0x78
kernel: free_hot_cold_page+0x1c5/0x202
kernel: __put_page+0x2c/0x36
kernel: svc_rdma_put_context+0xd9/0xe4 [rpcrdma]
kernel: svc_rdma_wc_send+0x50/0x98 [rpcrdma]
This issue exists all the way back to v4.5, but refactoring and code re-organization prevent this simple patch from applying to kernels older than v4.12. The fix is the same, however, if someone needs to backport it.
Reported-by: Ben Coddington <bcodding@redhat.com> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=314 Fixes: 5d252f90a800 ('svcrdma: Add class for RDMA backwards ... ') Cc: stable@vger.kernel.org # v4.12 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-02  License cleanup: add SPDX GPL-2.0 license identifier to files with no license  (Greg Kroah-Hartman)
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne.
How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). 
- when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. 
Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-05  Merge tag 'nfs-rdma-for-4.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next  (Trond Myklebust)
NFS-over-RDMA client updates for Linux 4.14
Bugfixes and cleanups:
- Constify rpc_xprt_ops
- Harden RPC call encoding and decoding
- Clean up rpc call decoding to use xdr_streams
- Remove unused variables from various structures
- Refactor code to remove imul instructions
- Rearrange rx_stats structure for better cacheline sharing
2017-08-18  SUNRPC: Add a separate spinlock to protect the RPC request receive list  (Trond Myklebust)
This further reduces contention with the transport_lock, and allows us to convert to using a non-bh-safe spinlock, since the list is now never accessed from a bh context. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-01  sunrpc: Const-ify all instances of struct rpc_xprt_ops  (Chuck Lever)
After transport instance creation, these function pointers never change. Mark them as constant to prevent their use as an attack vector for code injections. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25  svcrdma: Clean up RPC-over-RDMA backchannel reply processing  (Chuck Lever)
Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
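As an illustration of decoding by pointer arithmetic rather than by overlaying a C structure, here is a minimal userspace sketch. It reads the four fixed 32-bit words that begin an RPC-over-RDMA header (XID, version, credit value, procedure) one at a time. The names `rdma_hdr`, `xdr_get_u32`, and `decode_rdma_hdr` are hypothetical; the kernel uses its own `__be32` cursor helpers, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-order copy of the four fixed header words (hypothetical type). */
struct rdma_hdr {
	uint32_t xid;
	uint32_t vers;
	uint32_t credits;
	uint32_t proc;
};

/* Read one big-endian 32-bit XDR word and advance the cursor.
 * Working byte by byte avoids alignment and host-endianness assumptions. */
static uint32_t xdr_get_u32(const uint8_t **p)
{
	const uint8_t *b = *p;
	uint32_t v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
		     ((uint32_t)b[2] << 8) | (uint32_t)b[3];

	*p = b + 4;
	return v;
}

/* Returns 0 on success, -1 if the buffer is too short for the header. */
static int decode_rdma_hdr(const uint8_t *buf, size_t len,
			   struct rdma_hdr *out)
{
	const uint8_t *p = buf;

	assert(buf != NULL && out != NULL);
	if (len < 4 * sizeof(uint32_t))
		return -1;
	out->xid = xdr_get_u32(&p);
	out->vers = xdr_get_u32(&p);
	out->credits = xdr_get_u32(&p);
	out->proc = xdr_get_u32(&p);
	return 0;
}
```

Because no structure is laid over the wire bytes, the decoder does not depend on compiler padding or field alignment, which is one reading of the portability argument in the commit message.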
2017-04-25  svcrdma: Use rdma_rw API in RPC reply path  (Chuck Lever)
The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message. Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits: 1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk. 2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data. 3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
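The "9 ib_post_send() calls" figure follows from simple rounding-up arithmetic, sketched below. The 4096-byte page size is an assumption (the commit message does not state it), and `div_round_up` and `write_wrs_needed` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division rounding up; d must be non-zero. */
static size_t div_round_up(size_t n, size_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Number of Write WRs needed when each WR carries at most pages_per_wr pages. */
static size_t write_wrs_needed(size_t payload_bytes, size_t page_size,
			       size_t pages_per_wr)
{
	size_t pages = div_round_up(payload_bytes, page_size);

	return div_round_up(pages, pages_per_wr);
}
```

With a 1MB payload and 4096-byte pages there are 256 pages; at 30 pages per Write WR that is 9 WRs (8 full ones carry 240 pages, the ninth carries the remaining 16), plus the one Send WR mentioned in the commit message. If a single WR could cover all 256 pages, the same helper yields 1.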
2017-04-25  svcrdma: Add svc_rdma_map_reply_hdr()  (Chuck Lever)
Introduce a helper to DMA-map a reply's transport header before sending it. This will in part replace the map vector cache. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25  svcrdma: Move send_wr to svc_rdma_op_ctxt  (Chuck Lever)
Clean up: Move the ib_send_wr off the stack, and move common code to post a Send Work Request into a helper. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-28  Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux  (Linus Torvalds)
Pull nfsd updates from Bruce Fields: "The nfsd update this round is mainly a lot of miscellaneous cleanups and bugfixes. A couple changes could theoretically break working setups on upgrade. I don't expect complaints in practice, but they seem worth calling out just in case: - NFS security labels are now off by default; a new security_label export flag reenables it per export. But, having them on by default is a disaster, as it generally only makes sense if all your clients and servers have similar enough selinux policies. Thanks to Jason Tibbitts for pointing this out. - NFSv4/UDP support is off. It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that" * tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits) nfsd: Fix display of the version string nfsd: fix configuration of supported minor versions sunrpc: don't register UDP port with rpcbind when version needs congestion control nfs/nfsd/sunrpc: enforce transport requirements for NFSv4 sunrpc: flag transports as having congestion control sunrpc: turn bitfield flags in svc_version into bools nfsd: remove superfluous KERN_INFO nfsd: special case truncates some more nfsd: minor nfsd_setattr cleanup NFSD: Reserve adequate space for LOCKT operation NFSD: Get response size before operation for all RPCs nfsd/callback: Drop a useless data copy when comparing sessionid nfsd/callback: skip the callback tag nfsd/callback: Cleanup callback cred on shutdown nfsd/idmap: return nfserr_inval for 0-length names SUNRPC/Cache: Always treat the invalid cache as unexpired SUNRPC: Drop all entries from cache_detail when cache_purge() svcrdma: Poll CQs in "workqueue" mode svcrdma: Combine list fields in struct svc_rdma_op_ctxt svcrdma: Remove unused sc_dto_q field ...
2017-02-10  net: sunrpc: fix build errors when linux/phy*.h is removed from net/dsa.h  (Russell King)
Removing linux/phy.h from net/dsa.h reveals a build error in the sunrpc code:
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_rdma_bc_put':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:277:2: error: implicit declaration of function 'module_put' [-Werror=implicit-function-declaration]
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_setup_rdma_bc':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:348:7: error: implicit declaration of function 'try_module_get' [-Werror=implicit-function-declaration]
Fix this by adding linux/module.h to svc_rdma_backchannel.c
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-08  svcrdma: Clean up backchannel send header encoding  (Chuck Lever)
Replace C structure-based XDR encoding with pointer arithmetic. Pointer arithmetic is considered more portable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30  svcrdma: Remove unused variables in xprt_rdma_bc_allocate()  (Chuck Lever)
Clean up.
/linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function ‘xprt_rdma_bc_allocate’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c:169:23: warning: variable ‘rdma’ set but not used [-Wunused-but-set-variable]
  struct svcxprt_rdma *rdma;
                       ^
Fixes: 5d252f90a800 ("svcrdma: Add class for RDMA backwards ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-30  svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm  (Chuck Lever)
Logic copied from xs_setup_bc_tcp(). Fixes: 39a9beab5acb ('rpc: share one xps between all backchannels') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01  svcrdma: backchannel cannot share a page for send and rcv buffers  (Chuck Lever)
The underlying transport releases the page pointed to by rq_buffer during xprt_rdma_bc_send_request. When the backchannel reply arrives, rq_rbuffer then points to freed memory. Fixes: 68778945e46f ('SUNRPC: Separate buffer pointers for RPC ...') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-28  sunrpc: fix some missing rq_rbuffer assignments  (Jeff Layton)
We've been seeing some crashes in testing that look like this:

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
  PGD 212ca2067 PUD 212ca3067 PMD 0
  Oops: 0002 [#1] SMP
  Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
  CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
  task: ffff88020d7ed200 task.stack: ffff880211838000
  RIP: 0010:[<ffffffff8135ce99>] [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
  RSP: 0018:ffff88021183bdd0 EFLAGS: 00010206
  RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
  RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
  RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
  R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
  FS: 0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
  Stack:
   ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
   ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
   ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
  Call Trace:
   [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
   [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
   [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
   [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
   [<ffffffff810a9418>] kthread+0xd8/0xf0
   [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
   [<ffffffff810a9340>] ? kthread_park+0x60/0x60
  Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
  RIP [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
   RSP <ffff88021183bdd0>
  CR2: 0000000000000000

Both Bruce and Eryu ran a bisect here and found that the problematic patch was 68778945e46 (SUNRPC: Separate buffer pointers for RPC Call and Reply messages). That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to set up the receive buffer, but didn't change all of the necessary codepaths to set it properly. In particular the backchannel setup was missing. We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 68778945e46 "SUNRPC: Separate buffer pointers..."
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
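The invariant stated above ("set rq_rbuffer whenever rq_buffer is set") can be illustrated with a small user-space sketch. The struct and helper below are hypothetical stand-ins, not the kernel's struct rpc_rqst or any real SUNRPC function, and the layout (Reply area directly after the Call area in one allocation) is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct rpc_rqst; only the
 * fields discussed in the commit message are modelled. */
struct mock_rqst {
	size_t rq_callsize;   /* bytes reserved for the Call message */
	size_t rq_rcvsize;    /* bytes reserved for the Reply message */
	void *rq_buffer;      /* start of the Call (send) buffer */
	void *rq_rbuffer;     /* start of the Reply (receive) buffer */
};

/* Assign both pointers in one place so that no code path can leave
 * rq_rbuffer NULL once rq_buffer has been set. Assumption: the Reply
 * area follows the Call area inside the same allocation. */
static void mock_set_buffers(struct mock_rqst *req, void *buf)
{
	req->rq_buffer = buf;
	req->rq_rbuffer = (char *)buf + req->rq_callsize;
}
```

Routing every buffer setup through one helper of this kind is one way to keep a later memcpy into the receive buffer from hitting a NULL destination, which is the failure shown in the oops.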
2016-10-13Merge tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker:
"Highlights include:

Stable bugfixes:
- sunrpc: fix write space race causing stalls
- NFS: Fix inode corruption in nfs_prime_dcache()
- NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
- NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
- NFSv4: Open state recovery must account for file permission changes
- NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

Features:
- Add support for tracking multiple layout types with an ordered list
- Add support for using multiple backchannel threads on the client
- Add support for pNFS file layout session trunking
- Delay xprtrdma use of DMA API (for device driver removal)
- Add support for xprtrdma remote invalidation
- Add support for larger xprtrdma inline thresholds
- Use a scatter/gather list for sending xprtrdma RPC calls
- Add support for the CB_NOTIFY_LOCK callback
- Improve hashing sunrpc auth_creds by using both uid and gid

Bugfixes:
- Fix xprtrdma use of DMA API
- Validate filenames before adding to the dcache
- Fix corruption of xdr->nwords in xdr_copy_to_scratch
- Fix setting buffer length in xdr_set_next_buffer()
- Don't deadlock the state manager on the SEQUENCE status flags
- Various delegation and stateid related fixes
- Retry operations if an interrupted slot receives EREMOTEIO
- Make nfs boot time y2038 safe"

* tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits)
  NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
  fs: nfs: Make nfs boot time y2038 safe
  sunrpc: replace generic auth_cred hash with auth-specific function
  sunrpc: add RPCSEC_GSS hash_cred() function
  sunrpc: add auth_unix hash_cred() function
  sunrpc: add generic_auth hash_cred() function
  sunrpc: add hash_cred() function to rpc_authops struct
  Retry operation on EREMOTEIO on an interrupted slot
  pNFS: Fix atime updates on pNFS clients
  sunrpc: queue work on system_power_efficient_wq
  NFSv4.1: Even if the stateid is OK, we may need to recover the open modes
  NFSv4: If recovery failed for a specific open stateid, then don't retry
  NFSv4: Fix retry issues with nfs41_test/free_stateid
  NFSv4: Open state recovery must account for file permission changes
  NFSv4: Mark the lock and open stateids as invalid after freeing them
  NFSv4: Don't test open_stateid unless it is set
  NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid
  NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation
  NFSv4: Fix a race when updating an open_stateid
  NFSv4: Fix a race in nfs_inode_reclaim_delegation()
  ...
2016-09-23svcrdma: Tail iovec leaves an orphaned DMA mappingChuck Lever
The ctxt's count field is overloaded to mean the number of pages in the ctxt->page array and the number of SGEs in the ctxt->sge array. Typically these two numbers are the same. However, when an inline RPC reply is constructed from an xdr_buf with a tail iovec, the head and tail often occupy the same page, but each is DMA mapped independently. In that case, ->count equals the number of pages, but it does not equal the number of SGEs. There's one more SGE, for the tail iovec. Hence there is one more DMA mapping than there are pages in the ctxt->page array. This isn't a real problem until the server's IOMMU is enabled. Then each RPC reply that has content in that iovec orphans a DMA mapping that consists of real resources. krb5i and krb5p always populate that tail iovec. After a couple million sent krb5i/p RPC replies, the NFS server starts behaving erratically. Reboot is needed to clear the problem. Fixes: 9d11b51ce7c1 ("svcrdma: Fix send_reply() scatter/gather set-up") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
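The mismatch described above (one page, two SGEs) can be sketched by tracking the number of DMA mappings separately from the page count. The commit message does not say how the fix is implemented, so the struct, field names, and helpers below are hypothetical and run in user space; "mapping" here only records an address.

```c
#include <stddef.h>

#define MOCK_MAX_SGES 4

/* Hypothetical reduced context: pages and mappings are counted
 * independently instead of sharing one overloaded counter. */
struct mock_ctxt {
	unsigned int count;        /* number of pages held */
	unsigned int mapped_sges;  /* number of mappings recorded */
	unsigned long sge_addr[MOCK_MAX_SGES];
};

/* Record one mapping; returns 0 on success, -1 if the array is full. */
static int mock_map_sge(struct mock_ctxt *ctxt, unsigned long addr)
{
	if (ctxt->mapped_sges >= MOCK_MAX_SGES)
		return -1;
	ctxt->sge_addr[ctxt->mapped_sges++] = addr;
	return 0;
}

/* Release every recorded mapping, not just one per page.
 * Returns how many mappings were released. */
static unsigned int mock_unmap_all(struct mock_ctxt *ctxt)
{
	unsigned int released = 0;

	while (ctxt->mapped_sges > 0) {
		ctxt->sge_addr[--ctxt->mapped_sges] = 0;
		released++;
	}
	return released;
}
```

With head and tail in the same page, a teardown loop bounded by the page count would release one mapping and leave the tail's mapping behind; a loop bounded by the mapping count releases both.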
2016-09-19SUNRPC: Generalize the RPC buffer release APIChuck Lever
xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Instead of passing just the rq_buffer into the buf_free method, pass the task structure and let buf_free take care of freeing both XDR buffers at once. There's a micro-optimization here. In the common case, both xprt_release and the transport's buf_free method were checking if rq_buffer was NULL. Now the check is done only once per RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
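A minimal user-space sketch of the interface change described above: the free method receives the task rather than a bare buffer pointer, so it can release both XDR buffers in one call. All types and names are hypothetical stand-ins for the kernel structures, and malloc/free stand in for whatever allocator a transport actually uses.

```c
#include <stdlib.h>

/* Hypothetical reduced request and task structures. */
struct mock_rqst {
	void *rq_buffer;    /* Call buffer */
	void *rq_rbuffer;   /* Reply buffer, allocated separately here */
};

struct mock_task {
	struct mock_rqst *tk_rqstp;
};

/* Free method that takes the task: it reaches both buffers through
 * the request, so the caller no longer needs to know how the
 * transport laid them out. free(NULL) is a no-op in C. */
static void mock_buf_free(struct mock_task *task)
{
	struct mock_rqst *req = task->tk_rqstp;

	free(req->rq_buffer);
	free(req->rq_rbuffer);
	req->rq_buffer = NULL;
	req->rq_rbuffer = NULL;
}
```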
2016-09-19SUNRPC: Generalize the RPC buffer allocation APIChuck Lever
xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Transports that want to allocate separate Call and Reply buffers will ignore the "size" argument anyway. Don't bother passing it. The buf_alloc method can't return two pointers. Instead, make the method's return value an error code, and set the rq_buffer pointer in the method itself. This gives call_allocate an opportunity to terminate an RPC instead of looping forever when a permanent problem occurs. If a request is just bogus, or the transport is in a state where it can't allocate resources for any request, there needs to be a way to kill the RPC right there and not loop. This immediately fixes a rare problem in the backchannel send path, which loops if the server happens to send a CB request whose call+reply size is larger than a page (which it shouldn't do yet). One more issue: looks like xprt_inject_disconnect was incorrectly placed in the failure path in call_allocate. It needs to be in the success path, as it is for other call-sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
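The control flow described above can be sketched as follows: the allocation method returns an error code and sets rq_buffer itself, and the caller retries only on a transient failure. This is a user-space illustration under assumptions: the names, the choice of -EINVAL for the oversized case, and the one-page limit constant are hypothetical, not taken from the kernel source.

```c
#include <errno.h>
#include <stdlib.h>

#define MOCK_PAGE_SIZE 4096

/* Hypothetical reduced request structure. */
struct mock_rqst {
	size_t rq_callsize;
	size_t rq_rcvsize;
	void *rq_buffer;
	void *rq_rbuffer;
};

/* Returns 0 on success and sets both buffer pointers.
 * Returns -EINVAL for a request that can never fit (permanent),
 * or -ENOMEM when memory is unavailable right now (transient). */
static int mock_bc_buf_alloc(struct mock_rqst *req)
{
	size_t size = req->rq_callsize + req->rq_rcvsize;

	if (size > MOCK_PAGE_SIZE)
		return -EINVAL;
	req->rq_buffer = malloc(size);
	if (!req->rq_buffer)
		return -ENOMEM;
	req->rq_rbuffer = (char *)req->rq_buffer + req->rq_callsize;
	return 0;
}

enum mock_action { MOCK_PROCEED, MOCK_RETRY, MOCK_TERMINATE };

/* Caller-side decision: only a transient failure is worth retrying;
 * any other error ends the RPC instead of looping forever. */
static enum mock_action mock_call_allocate(struct mock_rqst *req)
{
	int status = mock_bc_buf_alloc(req);

	if (status == 0)
		return MOCK_PROCEED;
	if (status == -ENOMEM)
		return MOCK_RETRY;
	return MOCK_TERMINATE;
}
```

A callback request whose call+reply size exceeds a page takes the MOCK_TERMINATE path here, which corresponds to the backchannel loop the commit message says is fixed.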
2016-03-01svcrdma: Use new CQ API for RPC-over-RDMA server send CQsChuck Lever
Calling ib_poll_cq() to sort through WCs during a completion is a common pattern amongst RDMA consumers. Since commit 14d3a3b2498e ("IB: add a proper completion queue abstraction"), WC sorting can be handled by the IB core. By converting to this new API, svcrdma is made a better neighbor to other RDMA consumers, as it allows the core to schedule the delivery of completions more fairly amongst all active consumers. This new API also aims each completion at a function that is specific to the WR's opcode. Thus the ctxt->wr_op field and the switch in process_context is replaced by a set of methods that handle each completion type. Because each ib_cqe carries a pointer to a completion method, the core can now post operations on a consumer's QP, and handle the completions itself. The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no longer updated. As a clean up, the cq_event_handler, the dto_tasklet, and all associated locking is removed, as they are no longer referenced or used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
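The dispatch pattern described above, where each completion carries a pointer to its own handler instead of an opcode that a switch statement decodes, can be sketched in user space. The types below are simplified mock-ups: the real struct ib_cqe callback also receives the completion queue, which is omitted here, and all names are hypothetical.

```c
#include <stddef.h>

struct mock_wc;

/* Mock completion-queue entry: just a handler pointer. */
struct mock_cqe {
	void (*done)(struct mock_cqe *cqe, struct mock_wc *wc);
};

/* Mock work completion: points back at the cqe that was posted. */
struct mock_wc {
	struct mock_cqe *wr_cqe;
	int status;              /* 0 means success in this sketch */
};

/* Per-operation context embedding the cqe. */
struct mock_ctxt {
	int sends_done;
	int reads_done;
	struct mock_cqe cqe;
};

#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void mock_send_done(struct mock_cqe *cqe, struct mock_wc *wc)
{
	struct mock_ctxt *ctxt = mock_container_of(cqe, struct mock_ctxt, cqe);

	if (wc->status == 0)
		ctxt->sends_done++;
}

static void mock_read_done(struct mock_cqe *cqe, struct mock_wc *wc)
{
	struct mock_ctxt *ctxt = mock_container_of(cqe, struct mock_ctxt, cqe);

	if (wc->status == 0)
		ctxt->reads_done++;
}

/* What the core does per completion: no opcode switch, just a call
 * through the handler stored in the cqe. */
static void mock_dispatch(struct mock_wc *wc)
{
	wc->wr_cqe->done(wc->wr_cqe, wc);
}
```

Because the handler is chosen when the work request is posted, the consumer needs neither a wr_op field nor a central switch, which is the simplification the commit message describes.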
2016-03-01svcrdma: svc_rdma_post_recv() should close connection on errorChuck Lever
Clean up: Most svc_rdma_post_recv() call sites close the transport connection when a receive cannot be posted. Wrap that in a common helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Tested-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
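The clean-up described above amounts to folding the error handling into the posting helper. A user-space sketch, with a hypothetical mock transport and a flag standing in for whatever mechanism the server uses to request a close:

```c
/* Hypothetical mock transport state. */
struct mock_xprt {
	int post_should_fail;   /* test knob: force the post to fail */
	int recvs_posted;
	int close_requested;
};

/* Low-level post; returns 0 on success, -1 on failure. */
static int mock_post_recv_raw(struct mock_xprt *xprt)
{
	if (xprt->post_should_fail)
		return -1;
	xprt->recvs_posted++;
	return 0;
}

/* Common helper: every call site gets the same close-on-error
 * behaviour without repeating it. */
static int mock_post_recv(struct mock_xprt *xprt)
{
	int ret = mock_post_recv_raw(xprt);

	if (ret)
		xprt->close_requested = 1;
	return ret;
}
```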
2016-03-01svcrdma: Do not send Write chunk XDR pad with inline contentChuck Lever
The NFS server's XDR encoders add an XDR pad for content in the xdr_buf page list at the beginning of the xdr_buf's tail buffer. On RDMA transports, Write chunks are sent separately and without an XDR pad. If a Write chunk is being sent, strip off the pad in the tail buffer so that inline content following the Write chunk remains XDR-aligned when it is sent to the client. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
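XDR aligns items to 4-byte boundaries, so the pad in question is between 0 and 3 bytes. The sketch below computes that pad and skips it at the front of a tail buffer when the page-list content has gone out as a Write chunk. The struct and function names are hypothetical and this is a user-space illustration, not the svcrdma code.

```c
#include <stddef.h>

/* Number of pad bytes XDR needs after len bytes of opaque data. */
static size_t mock_xdr_pad_size(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/* Hypothetical reduced iovec describing the tail buffer. */
struct mock_kvec {
	const unsigned char *iov_base;
	size_t iov_len;
};

/* If the page-list content (page_len bytes) was sent as a Write
 * chunk, skip the pad that the encoder placed at the start of the
 * tail, so the remaining inline content stays 4-byte aligned. */
static void mock_strip_pad(struct mock_kvec *tail, size_t page_len,
			   int has_write_chunk)
{
	size_t pad;

	if (!has_write_chunk)
		return;
	pad = mock_xdr_pad_size(page_len);
	if (pad > tail->iov_len)
		pad = tail->iov_len;
	tail->iov_base += pad;
	tail->iov_len -= pad;
}
```

For example, a 5-byte payload needs 3 pad bytes; when that payload is conveyed in a Write chunk, those 3 bytes at the head of the tail are skipped rather than sent inline.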
2016-01-19svc_rdma: use local_dma_lkeyChristoph Hellwig
We now always have a per-PD local_dma_lkey available. Make use of that fact in svc_rdma and stop registering our own MR. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Add class for RDMA backwards direction transportChuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA connections, add a transport class that enables backward direction messages on an existing forward channel connection. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>