summaryrefslogtreecommitdiffstats
path: root/include/net/ip_vs.h
AgeCommit message (Collapse)Author
2014-10-02ipvs: Clean up comment style in ip_vs.hSimon Horman
* Consistently use the multi-line comment style for networking code: /* This * That * The other thing */ * Use single-line comment style for comments with only one line of text. * In general follow the leading '*' of each line of a comment with a single space and then text. * Add missing line break between functions, remove double line break, align comments to previous lines whenever possible. Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-09-16ipvs: prevent mixing heterogeneous pools and synchronizationAlex Gartrell
The synchronization protocol is not compatible with heterogeneous pools, so we need to verify that we're not turning both on at the same time. Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-09-16ipvs: Supply destination address family to ip_vs_conn_newAlex Gartrell
The assumption that dest af is equal to service af is now unreliable, so we must specify it manually so as not to copy just the first 4 bytes of a v6 address or doing an illegal read of 16 butes on a v6 address. We "lie" in two places: for synchronization (which we will explicitly disallow from happening when we have heterogeneous pools) and for black hole addresses where there's no real dest. Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-09-16ipvs: Supply destination addr family to ip_vs_{lookup_dest,find_dest}Alex Gartrell
We need to remove the assumption that virtual address family is the same as real address family in order to support heterogeneous services (that is, services with v4 vips and v6 backends or the opposite). Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-09-16ipvs: Add destination address family to netlink interfaceAlex Gartrell
This is necessary to support heterogeneous pools. For example, if you have an ipv6 addressed network, you'll want to be able to forward ipv4 traffic into it. This patch enforces that destination address family is the same as service family, as none of the forwarding mechanisms support anything else. For the old setsockopt mechanism, we simply set the dest address family to AF_INET as we do with the service. Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-04-18arch: Mass conversion of smp_mb__*()Peter Zijlstra
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-11netfilter: push reasm skb through instead of original frag skbsJiri Pirko
Pushing original fragments through causes several problems. For example for matching, frags may not be matched correctly. Take following example: <example> On HOSTA do: ip6tables -I INPUT -p icmpv6 -j DROP ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT and on HOSTB you do: ping6 HOSTA -s2000 (MTU is 1500) Incoming echo requests will be filtered out on HOSTA. This issue does not occur with smaller packets than MTU (where fragmentation does not happen) </example> As was discussed previously, the only correct solution seems to be to use reassembled skb instead of separete frags. Doing this has positive side effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams dances in ipvs and conntrack can be removed. Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c entirely and use code in net/ipv6/reassembly.c instead. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-15ipvs: avoid rcu_barrier during netns cleanupJulian Anastasov
commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added rcu_barrier() on cleanup to wait dest users and schedulers like LBLC and LBLCR to put their last dest reference. Using rcu_barrier with many namespaces is problematic. Trying to fix it by freeing dest with kfree_rcu is not a solution, RCU callbacks can run in parallel and execution order is random. Fix it by creating new function ip_vs_dest_put_and_free() which is heavier than ip_vs_dest_put(). We will use it just for schedulers like LBLC, LBLCR that can delay their dest release. By default, dests reference is above 0 if they are present in service and it is 0 when deleted but still in trash list. Change the dest trash code to use ip_vs_dest_put_and_free(), so that refcnt -1 can be used for freeing. As result, such checks remain in slow path and the rcu_barrier() from netns cleanup can be removed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/usb/qmi_wwan.c drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h include/net/netfilter/nf_conntrack_synproxy.h include/net/secure_seq.h The conflicts are of two varieties: 1) Conflicts with Joe Perches's 'extern' removal from header file function declarations. Usually it's an argument signature change or a function being added/removed. The resolutions are trivial. 2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds a new value, another changes an existing value. That sort of thing. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-21ip*.h: Remove extern from function prototypesJoe Perches
There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-18ipvs: make the service replacement more robustJulian Anastasov
commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added IP_VS_DEST_STATE_REMOVING flag and RCU callback named ip_vs_dest_wait_readers() to keep dests and services after removal for at least a RCU grace period. But we have the following corner cases: - we can not reuse the same dest if its service is removed while IP_VS_DEST_STATE_REMOVING is still set because another dest removal in the first grace period can not extend this period. It can happen when ipvsadm -C && ipvsadm -R is used. - dest->svc can be replaced but ip_vs_in_stats() and ip_vs_out_stats() have no explicit read memory barriers when accessing dest->svc. It can happen that dest->svc was just freed (replaced) while we use it to update the stats. We solve the problems as follows: - IP_VS_DEST_STATE_REMOVING is removed and we ensure a fixed idle period for the dest (IP_VS_DEST_TRASH_PERIOD). idle_start will remember when for first time after deletion we noticed dest->refcnt=0. Later, the connections can grab a reference while in RCU grace period but if refcnt becomes 0 we can safely free the dest and its svc. - dest->svc becomes RCU pointer. As result, we add explicit RCU locking in ip_vs_in_stats() and ip_vs_out_stats(). - __ip_vs_unbind_svc is renamed to __ip_vs_svc_put(), it now can free the service immediately or after a RCU grace period. dest->svc is not set to NULL anymore. As result, unlinked dests and their services are freed always after IP_VS_DEST_TRASH_PERIOD period, unused services are freed after a RCU grace period. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-09-18ipvs: fix overflow on dest weight multiplySimon Kirby
Schedulers such as lblc and lblcr require the weight to be as high as the maximum number of active connections. In commit b552f7e3a9524abcbcdf ("ipvs: unify the formula to estimate the overhead of processing connections"), the consideration of inactconns and activeconns was cleaned up to always count activeconns as 256 times more important than inactconns. In cases where 3000 or more connections are expected, a weight of 3000 * 256 * 3000 connections overflows the 32-bit signed result used to determine if rescheduling is required. On amd64, this merely changes the multiply and comparison instructions to 64-bit. On x86, a 64-bit result is already present from imull, so only a few more comparison instructions are emitted. Signed-off-by: Simon Kirby <sim@hostway.ca> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: add sync_persist_mode flagJulian Anastasov
Add sync_persist_mode flag to reduce sync traffic by syncing only persistent templates. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: replace the SCTP state machineJulian Anastasov
Convert the SCTP state table, so that it is more readable. Change the states to be according to the diagram in RFC 2960 and add more states suitable for middle box. Still, such change in states adds incompatibility if systems in sync setup include this change and others do not include it. With this change we also have proper transitions in INPUT-ONLY mode (DR/TUN) where we see packets only from client. Now we should not switch to 10-second CLOSED state at a time when we should stay in ESTABLISHED state. The short names for states are because we have 16-char space in ipvsadm and 11-char limit for the connection list format. It is a sequence of the TCP implementation where the longest state name is ESTABLISHED. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: sloppy TCP and SCTPAlexander Frolkin
This adds support for sloppy TCP and SCTP modes to IPVS. When enabled (sysctls net.ipv4.vs.sloppy_tcp and net.ipv4.vs.sloppy_sctp), allows IPVS to create connection state on any packet, not just a TCP SYN (or SCTP INIT). This allows connections to fail over from one IPVS director to another mid-flight. Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: provide iph to schedulersJulian Anastasov
Before now the schedulers needed access only to IP addresses and it was easy to get them from skb by using ip_vs_fill_iph_addr_only. New changes for the SH scheduler will need the protocol and ports which is difficult to get from skb for the IPv6 case. As we have all the data in the iph structure, to avoid the same slow lookups provide the iph to schedulers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-05-26ipvs: change type of netns_ipvs->sysctl_sync_qlen_maxZhang Yanfei
This member of struct netns_ipvs is calculated from nr_free_buffer_pages so change its type to unsigned long in case of overflow. Also, type of its related proc var sync_qlen_max and the return type of function sysctl_sync_qlen_max() should be changed to unsigned long, too. Besides, the type of ipvs_master_sync_state->sync_queue_len should be changed to unsigned long accordingly. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-23ipvs: fix sparse warnings for some parametersJulian Anastasov
Some service fields are in network order: - netmask: used once in network order and also as prefix len for IPv6 - port Other parameters are in host order: - struct ip_vs_flags: flags and mask moved between user and kernel only - sync state: moved between user and kernel only - syncid: sent over network as single octet Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: convert services to rcuJulian Anastasov
This is the final step in RCU conversion. Things that are removed: - svc->usecnt: now svc is accessed under RCU read lock - svc->inc: and some unused code - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE - __ip_vs_svc_lock: replaced with RCU - IP_VS_WAIT_WHILE: now readers lookup svcs and dests under RCU and work in parallel with configuration Other changes: - before now, a RCU read-side critical section included the calling of the schedule method, now it is extended to include service lookup - ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist - svc->pe and svc->scheduler remain to the end (of grace period), the schedulers are prepared for such RCU readers even after done_service is called but they need to use synchronize_rcu because last ip_vs_scheduler_put can happen while RCU read-side critical sections use an outdated svc->scheduler pointer - as planned, update_service is removed - empty services can be freed immediately after grace period. If dests were present, the services are freed from the dest trash code Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: convert dests to rcuJulian Anastasov
In previous commits the schedulers started to access svc->destinations with _rcu list traversal primitives because the IP_VS_WAIT_WHILE macro still plays the role of grace period. Now it is time to finish the updating part, i.e. adding and deleting of dests with _rcu suffix before removing the IP_VS_WAIT_WHILE in next commit. We use the same rule for conns as for the schedulers: dests can be searched in RCU read-side critical section where ip_vs_dest_hold can be called by ip_vs_bind_dest. Some things are not perfect, for example, calling functions like ip_vs_lookup_dest from updating code under RCU, just because we use some function both from reader and from updater. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: convert sched_lock to spin lockJulian Anastasov
As all read_locks are gone spin lock is preferred. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: do not expect result from done_serviceJulian Anastasov
This method releases the scheduler state, it can not fail. Such change will help to properly replace the scheduler in following patch. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: reorganize dest trashJulian Anastasov
All dests will go to trash, no exceptions. But we have to use new list node t_list for this, due to RCU changes in following patches. Dests will wait there initial grace period and later all conns and schedulers to put their reference. The dests don't get reference for staying in dest trash as before. As result, we do not load ip_vs_dest_put with extra checks for last refcnt and the schedulers do not need to play games with atomic_inc_not_zero while selecting best destination. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: add ip_vs_dest_hold and ip_vs_dest_putJulian Anastasov
ip_vs_dest_hold will be used under RCU lock while ip_vs_dest_put can be called even after dest is removed from service, as it happens for conns and some schedulers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: preparations for using rcu in schedulersJulian Anastasov
Allow schedulers to use rcu_dereference when returning destination on lookup. The RCU read-side critical section will allow ip_vs_bind_dest to get dest refcnt as preparation for the step where destinations will be deleted without an IP_VS_WAIT_WHILE guard that holds the packet processing during update. Add new optional scheduler methods add_dest, del_dest and upd_dest. For now the methods are called together with update_service but update_service will be removed in a following change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: avoid kmem_cache_zalloc in ip_vs_conn_newJulian Anastasov
We have many fields to set and few to reset, use kmem_cache_alloc instead to save some cycles. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: reorder keys in connection structureJulian Anastasov
__ip_vs_conn_in_get and ip_vs_conn_out_get are hot places. Optimize them, so that ports are matched first. By moving net and fwmark below, on 32-bit arch we can fit caddr in 32-byte cache line and all addresses in 64-byte cache line. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: convert connection lockingJulian Anastasov
Convert __ip_vs_conntbl_lock_array as follows: - readers that do not modify conn lists will use RCU lock - updaters that modify lists will use spinlock_t Now for conn lookups we will use RCU read-side critical section. Without using __ip_vs_conn_get such places have access to connection fields and can dereference some pointers like pe and pe_data plus the ability to update timer expiration. If full access is required we contend for reference. We add barrier in __ip_vs_conn_put, so that other CPUs see the refcnt operation after other writes. With the introduction of ip_vs_conn_unlink() we try to reorganize ip_vs_conn_expire(), so that unhashing of connections that should stay more time is avoided, even if it is for very short time. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: remove rs_lock by using RCUJulian Anastasov
rs_lock was used to protect rs_table (hash table) from updaters (under global mutex) and readers (packet handlers). We can remove rs_lock by using RCU lock for readers. Reclaiming dest only with kfree_rcu is enough because the readers access only fields from the ip_vs_dest structure. Use hlist for rs_table. As we are now using hlist_del_rcu, introduce in_rs_table flag as replacement for the list_empty checks which do not work with RCU. It is needed because only NAT dests are in the rs_table. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: convert app locksJulian Anastasov
We use locks like tcp_app_lock, udp_app_lock, sctp_app_lock to protect access to the protocol hash tables from readers in packet context while the application instances (inc) are [un]registered under global mutex. As the hash tables are mostly read when conns are created and bound to app, use RCU for readers and reclaim app instance after grace period. Simplify ip_vs_app_inc_get because we use usecnt only for statistics and rely on module refcounting. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: optimize dst usage for real serverJulian Anastasov
Currently when forwarding requests to real servers we use dst_lock and atomic operations when cloning the dst_cache value. As the dst_cache value does not change most of the time it is better to use RCU and to lock dst_lock only when we need to replace the obsoleted dst. For this to work we keep dst_cache in new structure protected by RCU. For packets to remote real servers we will use noref version of dst_cache, it will be valid while we are in RCU read-side critical section because now dst_release for replaced dsts will be invoked after the grace period. Packets to local real servers that are passed to local stack with NF_ACCEPT need a dst clone. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: rename functions related to dst_cache resetJulian Anastasov
Move and give better names to two functions: - ip_vs_dst_reset to __ip_vs_dst_cache_reset - __ip_vs_dev_reset to ip_vs_forget_dev Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-02ipvs: avoid routing by TOS for real serverJulian Anastasov
Avoid replacing the cached route for real server on every packet with different TOS. I doubt that routing by TOS for real server is used at all, so we should be better with such optimization. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-03-25Merge branch 'master' of git://1984.lsi.us.es/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter/IPVS updates for your net-next tree, they are: * Better performance in nfnetlink_queue by avoiding copy from the packet to netlink message, from Eric Dumazet. * Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng. * Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa. * A couple of sparse fixes for IPVS, from Julian Anastasov. * Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from myself. * Allow to dump expectations per master conntrack via ctnetlink, from myself. * A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai Popescu. * Remove nf_conntrack module a bit faster if netns are in use, from Vladimir Davydov. * Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki. * Sparse fix for nf_conntrack, from Stephen Hemminger. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-19ipvs: add backup_only flag to avoid loopsJulian Anastasov
Dmitry Akindinov is reporting for a problem where SYNs are looping between the master and backup server when the backup server is used as real server in DR mode and has IPVS rules to function as director. Even when the backup function is enabled we continue to forward traffic and schedule new connections when the current master is using the backup server as real server. While this is not a problem for NAT, for DR and TUN method the backup server can not determine if a request comes from client or from director. To avoid such loops add new sysctl flag backup_only. It can be needed for DR/TUN setups that do not need backup and director function at the same time. When the backup function is enabled we stop any forwarding and pass the traffic to the local stack (real server mode). The flag disables the director function when the backup function is enabled. For setups that enable backup function for some virtual services and director function for other virtual services there should be another more complex solution to support DR/TUN mode, may be to assign per-virtual service syncid value, so that we can differentiate the requests. Reported-by: Dmitry Akindinov <dimak@stalker.com> Tested-by: German Myzovsky <lawyer@sipnet.ru> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-03-19ipvs: fix some sparse warningsJulian Anastasov
Add missing __percpu annotations and make ip_vs_net_id static. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-10-23ipvs: fix build errors related to config option combinationsJesper Dangaard Brouer
Fix two build error introduced by commit 63dca2c0: "ipvs: Fix faulty IPv6 extension header handling in IPVS" First build error was fairly trivial and can occur, when CONFIG_IP_VS_IPV6 is disabled. The second build error was tricky, and can occur when deselecting both all Netfilter and IPVS, but selecting CONFIG_IPV6. This is caused by "kernel/sysctl_binary.c" including "net/ip_vs.h", which includes "linux/netfilter_ipv6/ip6_tables.h" causing include of "include/linux/netfilter/x_tables.h" which then cannot find the typedef nf_hookfn. Fix this by only including "linux/netfilter_ipv6/ip6_tables.h" in case of CONFIG_IP_VS_IPV6 as its already used to guard the usage of ipv6_find_hdr(). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-09-28ipvs: API change to avoid rescan of IPv6 exthdrJesper Dangaard Brouer
Reduce the number of times we scan/skip the IPv6 exthdrs. This patch contains a lot of API changes. This is done, to avoid repeating the scan of finding the IPv6 headers, via ipv6_find_hdr(), which is called by ip_vs_fill_iph_skb(). Finding the IPv6 headers is done as early as possible, and passed on as a pointer "struct ip_vs_iphdr *" to the affected functions. This patch reduce/removes 19 calls to ip_vs_fill_iph_skb(). Notice, I have choosen, not to change the API of function pointer "(*schedule)" (in struct ip_vs_scheduler) as it can be used by external schedulers, via {un,}register_ip_vs_scheduler. Only 4 out of 10 schedulers use info from ip_vs_iphdr*, and when they do, they are only interested in iph->{s,d}addr. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-09-28ipvs: Complete IPv6 fragment handling for IPVSJesper Dangaard Brouer
IPVS now supports fragmented packets, with support from nf_conntrack_reasm.c Based on patch from: Hans Schillstrom. IPVS do like conntrack i.e. use the skb->nfct_reasm (i.e. when all fragments is collected, nf_ct_frag6_output() starts a "re-play" of all fragments into the interrupted PREROUTING chain at prio -399 (NF_IP6_PRI_CONNTRACK_DEFRAG+1) with nfct_reasm pointing to the assembled packet.) Notice, module nf_defrag_ipv6 must be loaded for this to work. Report unhandled fragments, and recommend user to load nf_defrag_ipv6. To handle fw-mark for fragments. Add a new IPVS hook into prerouting chain at prio -99 (NF_IP6_PRI_NAT_DST+1) to catch fragments, and copy fw-mark info from the first packet with an upper layer header. IPv6 fragment handling should be the last thing on the IPVS IPv6 missing support list. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-09-28ipvs: Fix faulty IPv6 extension header handling in IPVSJesper Dangaard Brouer
IPv6 packets can contain extension headers, thus its wrong to assume that the transport/upper-layer header, starts right after (struct ipv6hdr) the IPv6 header. IPVS uses this false assumption, and will write SNAT & DNAT modifications at a fixed pos which will corrupt the message. To fix this, proper header position must be found before modifying packets. Introducing ip_vs_fill_iph_skb(), which uses ipv6_find_hdr() to skip the exthdrs. It finds (1) the transport header offset, (2) the protocol, and (3) detects if the packet is a fragment. Note, that fragments in IPv6 is represented via an exthdr. Thus, this is detected while skipping through the exthdrs. This patch depends on commit 84018f55a: "netfilter: ip6_tables: add flags parameter to ipv6_find_hdr()" This also adds a dependency to ip6_tables. Originally based on patch from: Hans Schillstrom kABI notes: Changing struct ip_vs_iphdr is a potential minor kABI breaker, because external modules can be compiled with another version of this struct. This should not matter, as they would most-likely be using a compiled-in version of ip_vs_fill_iphdr(). When recompiled, they will notice ip_vs_fill_iphdr() no longer exists, and they have to used ip_vs_fill_iph_skb() instead. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-09-28ipvs: Use config macro IS_ENABLED()Jesper Dangaard Brouer
Cleanup patch. Use the IS_ENABLED macro, instead of having to check both the build and the module CONFIG_ option. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-09-28ipvs: Trivial changes, use compressed IPv6 address in outputJesper Dangaard Brouer
Have not converted the proc file output to compressed IPv6 addresses. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-08-10ipvs: add pmtu_disc option to disable IP DF for TUN packetsJulian Anastasov
Disabling PMTU discovery can increase the output packet rate but some users have enough resources and prefer to fragment than to drop traffic. By default, we copy the DF bit but if pmtu_disc is disabled we do not send FRAG_NEEDED messages anymore. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-08-10ipvs: generalize app registration in netnsJulian Anastasov
Get rid of the ftp_app pointer and allow applications to be registered without adding fields in the netns_ipvs structure. v2: fix coding style as suggested by Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-07-17ipvs: fix oops on NAT reply in br_nf contextLin Ming
IPVS should not reset skb->nf_bridge in FORWARD hook by calling nf_reset for NAT replies. It triggers oops in br_nf_forward_finish. [ 579.781508] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 [ 579.781669] IP: [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112 [ 579.781792] PGD 218f9067 PUD 0 [ 579.781865] Oops: 0000 [#1] SMP [ 579.781945] CPU 0 [ 579.781983] Modules linked in: [ 579.782047] [ 579.782080] [ 579.782114] Pid: 4644, comm: qemu Tainted: G W 3.5.0-rc5-00006-g95e69f9 #282 Hewlett-Packard /30E8 [ 579.782300] RIP: 0010:[<ffffffff817b1ca5>] [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112 [ 579.782455] RSP: 0018:ffff88007b003a98 EFLAGS: 00010287 [ 579.782541] RAX: 0000000000000008 RBX: ffff8800762ead00 RCX: 000000000001670a [ 579.782653] RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff8800762ead00 [ 579.782845] RBP: ffff88007b003ac8 R08: 0000000000016630 R09: ffff88007b003a90 [ 579.782957] R10: ffff88007b0038e8 R11: ffff88002da37540 R12: ffff88002da01a02 [ 579.783066] R13: ffff88002da01a80 R14: ffff88002d83c000 R15: ffff88002d82a000 [ 579.783177] FS: 0000000000000000(0000) GS:ffff88007b000000(0063) knlGS:00000000f62d1b70 [ 579.783306] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b [ 579.783395] CR2: 0000000000000004 CR3: 00000000218fe000 CR4: 00000000000027f0 [ 579.783505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 579.783684] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 579.783795] Process qemu (pid: 4644, threadinfo ffff880021b20000, task ffff880021aba760) [ 579.783919] Stack: [ 579.783959] ffff88007693cedc ffff8800762ead00 ffff88002da01a02 ffff8800762ead00 [ 579.784110] ffff88002da01a02 ffff88002da01a80 ffff88007b003b18 ffffffff817b26c7 [ 579.784260] ffff880080000000 ffffffff81ef59f0 ffff8800762ead00 ffffffff81ef58b0 [ 579.784477] Call Trace: [ 579.784523] <IRQ> [ 579.784562] [ 579.784603] [<ffffffff817b26c7>] br_nf_forward_ip+0x275/0x2c8 [ 579.784707] [<ffffffff81704b58>] nf_iterate+0x47/0x7d [ 579.784797] [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae [ 579.784906] [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102 [ 579.784995] [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae [ 579.785175] [<ffffffff8187fa95>] ? _raw_write_unlock_bh+0x19/0x1b [ 579.785179] [<ffffffff817ac417>] __br_forward+0x97/0xa2 [ 579.785179] [<ffffffff817ad366>] br_handle_frame_finish+0x1a6/0x257 [ 579.785179] [<ffffffff817b2386>] br_nf_pre_routing_finish+0x26d/0x2cb [ 579.785179] [<ffffffff817b2cf0>] br_nf_pre_routing+0x55d/0x5c1 [ 579.785179] [<ffffffff81704b58>] nf_iterate+0x47/0x7d [ 579.785179] [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44 [ 579.785179] [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102 [ 579.785179] [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44 [ 579.785179] [<ffffffff81551525>] ? sky2_poll+0xb35/0xb54 [ 579.785179] [<ffffffff817ad62a>] br_handle_frame+0x213/0x229 [ 579.785179] [<ffffffff817ad417>] ? br_handle_frame_finish+0x257/0x257 [ 579.785179] [<ffffffff816e3b47>] __netif_receive_skb+0x2b4/0x3f1 [ 579.785179] [<ffffffff816e69fc>] process_backlog+0x99/0x1e2 [ 579.785179] [<ffffffff816e6800>] net_rx_action+0xdf/0x242 [ 579.785179] [<ffffffff8107e8a8>] __do_softirq+0xc1/0x1e0 [ 579.785179] [<ffffffff8135a5ba>] ? trace_hardirqs_off_thunk+0x3a/0x6c [ 579.785179] [<ffffffff8188812c>] call_softirq+0x1c/0x30 The steps to reproduce as follow, 1. On Host1, setup brige br0(192.168.1.106) 2. Boot a kvm guest(192.168.1.105) on Host1 and start httpd 3. Start IPVS service on Host1 ipvsadm -A -t 192.168.1.106:80 -s rr ipvsadm -a -t 192.168.1.106:80 -r 192.168.1.105:80 -m 4. Run apache benchmark on Host2(192.168.1.101) ab -n 1000 http://192.168.1.106/ ip_vs_reply4 ip_vs_out handle_response ip_vs_notrack nf_reset() { skb->nf_bridge = NULL; } Actually, IPVS wants in this case just to replace nfct with untracked version. So replace the nf_reset(skb) call in ip_vs_notrack() with a nf_conntrack_put(skb->nfct) call. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08ipvs: add support for sync threadsPablo Neira Ayuso
Allow master and backup servers to use many threads for sync traffic. Add sysctl var "sync_ports" to define the number of threads. Every thread will use single UDP port, thread 0 will use the default port 8848 while last thread will use port 8848+sync_ports-1. The sync traffic for connections is scheduled to many master threads based on the cp address but one connection is always assigned to same thread to avoid reordering of the sync messages. Remove ip_vs_sync_switch_mode because this check for sync mode change is still risky. Instead, check for mode change under sync_buff_lock. Make sure the backup socks do not block on reading. Special thanks to Aleksey Chudov for helping in all tests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: reduce sync rate with time thresholdsJulian Anastasov
Add two new sysctl vars to control the sync rate with the main idea to reduce the rate for connection templates because currently it depends on the packet rate for controlled connections. This mechanism should be useful also for normal connections with high traffic. sync_refresh_period: in seconds, difference in reported connection timer that triggers new sync message. It can be used to avoid sync messages for the specified period (or half of the connection timeout if it is lower) if connection state is not changed from last sync. sync_retries: integer, 0..3, defines sync retries with period of sync_refresh_period/8. Useful to protect against loss of sync messages. Allow sysctl_sync_threshold to be used with sysctl_sync_period=0, so that only single sync message is sent if sync_refresh_period is also 0. Add new field "sync_endtime" in connection structure to hold the reported time when connection expires. The 2 lowest bits will represent the retry count. As the sysctl_sync_period now can be 0 use ACCESS_ONCE to avoid division by zero. Special thanks to Aleksey Chudov for being patient with me, for his extensive reports and helping in all tests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: wakeup master threadPablo Neira Ayuso
High rate of sync messages in master can lead to overflowing the socket buffer and dropping the messages. Fixed sleep of 1 second without wakeup events is not suitable for loaded masters, Use delayed_work to schedule sending for queued messages and limit the delay to IPVS_SYNC_SEND_DELAY (20ms). This will reduce the rate of wakeups but to avoid sending long bursts we wakeup the master thread after IPVS_SYNC_WAKEUP_RATE (8) messages. Add hard limit for the queued messages before sending by using "sync_qlen_max" sysctl var. It defaults to 1/32 of the memory pages but actually represents number of messages. It will protect us from allocating large parts of memory when the sending rate is lower than the queuing rate. As suggested by Pablo, add new sysctl var "sync_sock_size" to configure the SNDBUF (master) or RCVBUF (slave) socket limit. Default value is 0 (preserve system defaults). Change the master thread to detect and block on SNDBUF overflow, so that we do not drop messages when the socket limit is low but the sync_qlen_max limit is not reached. On ENOBUFS or other errors just drop the messages. Change master thread to enter TASK_INTERRUPTIBLE state early, so that we do not miss wakeups due to messages or kthread_should_stop event. Thanks to Pablo Neira Ayuso for his valuable feedback! Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/intel/e1000e/param.c drivers/net/wireless/iwlwifi/iwl-agn-rx.c drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c drivers/net/wireless/iwlwifi/iwl-trans.h Resolved the iwlwifi conflict with mainline using 3-way diff posted by John Linville and Stephen Rothwell. In 'net' we added a bug fix to make iwlwifi report a more accurate skb->truesize but this conflicted with RX path changes that happened meanwhile in net-next. In e1000e a conflict arose in the validation code for settings of adapter->itr. 'net-next' had more sophisticated logic so that logic was used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30ipvs: kernel oops - do_ip_vs_get_ctlHans Schillstrom
Change order of init so netns init is ready when register ioctl and netlink. Ver2 Whitespace fixes and __init added. Reported-by: "Ryan O'Hara" <rohara@redhat.com> Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Simon Horman <horms@verge.net.au>