aboutsummaryrefslogtreecommitdiffstats
path: root/net/netfilter
AgeCommit message (Collapse)Author
2016-08-10netfilter: nft_exthdr: Add size check on u8 nft_exthdr attributesLaura Garcia Liebana
Fix the direct assignment of offset and length attributes included in nft_exthdr structure from u32 data to u8. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-09netfilter: ctnetlink: reject new conntrack request with different l4protoLiping Zhang
Currently, user can add a conntrack with different l4proto via nfnetlink. For example, original tuple is TCP while reply tuple is SCTP. This is invalid combination, we should report EINVAL to userspace. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-09netfilter: nfnetlink_queue: reject verdict request from different portidLiping Zhang
Like NFQNL_MSG_VERDICT_BATCH do, we should also reject the verdict request when the portid is not same with the initial portid(maybe from another process). Fixes: 97d32cf9440d ("netfilter: nfnetlink_queue: batch verdict support") Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-09netfilter: nfnetlink_queue: fix memory leak when attach expectation successfullyLiping Zhang
User can use NFQA_EXP to attach expectations to conntracks, but we forget to put back nf_conntrack_expect when it is inserted successfully, i.e. in this normal case, expect's use refcnt will be 3. So even we unlink it and put it back later, the use refcnt is still 1, then the memory will be leaked forever. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-09netfilter: nf_ct_expect: remove the redundant slash when policy name is emptyLiping Zhang
The 'name' filed in struct nf_conntrack_expect_policy{} is not a pointer, so check it is NULL or not will always return true. Even if the name is empty, slash will always be displayed like follows: # cat /proc/net/nf_conntrack_expect 297 l3proto = 2 proto=6 src=1.1.1.1 dst=2.2.2.2 sport=1 dport=1025 ftp/ ^ Fixes: 3a8fc53a45c4 ("netfilter: nf_ct_helper: allocate 16 bytes for the helper and policy names") Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-08netfilter: nf_conntrack_sip: CSeq 0 is a valid CSeqChristophe Leroy
Do not drop packet when CSeq is 0 as 0 is also a valid value for CSeq. simple_strtoul() will return 0 either when all digits are 0 or if there are no digits at all. Therefore when simple_strtoul() returns 0 we check if first character is digit 0 or not. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-08netfilter: nft_rbtree: ignore inactive matching element with no descendantsPablo Neira Ayuso
If we find a matching element that is inactive with no descendants, we jump to the found label, then crash because of nul-dereference on the left branch. Fix this by checking that the element is active and not an interval end and skipping the logic that only applies to the tree iteration. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Anders K. Pedersen <akp@akp.dk>
2016-08-08netfilter: nf_ct_h323: do not re-activate already expired timerLiping Zhang
Commit 96d1327ac2e3 ("netfilter: h323: Use mod_timer instead of set_expect_timeout") just simplify the source codes if (!del_timer(&exp->timeout)) return 0; add_timer(&exp->timeout); to mod_timer(&exp->timeout, jiffies + info->timeout * HZ); This is not correct, and introduce a race codition: CPU0 CPU1 - timer expire process_rcf expectation_timed_out lock(exp_lock) - find_exp waiting exp_lock... re-activate timer!! waiting exp_lock... unlock(exp_lock) lock(exp_lock) - unlink expect - free(expect) - unlock(exp_lock) So when the timer expires again, we will access the memory that was already freed. Replace mod_timer with mod_timer_pending here to fix this problem. Fixes: 96d1327ac2e3 ("netfilter: h323: Use mod_timer instead of set_expect_timeout") Cc: Gao Feng <fgao@ikuai8.com> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Unified UDP encapsulation offload methods for drivers, from Alexander Duyck. 2) Make DSA binding more sane, from Andrew Lunn. 3) Support QCA9888 chips in ath10k, from Anilkumar Kolli. 4) Several workqueue usage cleanups, from Bhaktipriya Shridhar. 5) Add XDP (eXpress Data Path), essentially running BPF programs on RX packets as soon as the device sees them, with the option to mirror the packet on TX via the same interface. From Brenden Blanco and others. 6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet. 7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli. 8) Simplify netlink conntrack entry layout, from Florian Westphal. 9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido Schimmel, Yotam Gigi, and Jiri Pirko. 10) Add SKB array infrastructure and convert tun and macvtap over to it. From Michael S Tsirkin and Jason Wang. 11) Support qdisc packet injection in pktgen, from John Fastabend. 12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy. 13) Add NV congestion control support to TCP, from Lawrence Brakmo. 14) Add GSO support to SCTP, from Marcelo Ricardo Leitner. 15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni. 16) Support MPLS over IPV4, from Simon Horman. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) xgene: Fix build warning with ACPI disabled. be2net: perform temperature query in adapter regardless of its interface state l2tp: Correctly return -EBADF from pppol2tp_getname. net/mlx5_core/health: Remove deprecated create_singlethread_workqueue net: ipmr/ip6mr: update lastuse on entry change macsec: ensure rx_sa is set when validation is disabled tipc: dump monitor attributes tipc: add a function to get the bearer name tipc: get monitor threshold for the cluster tipc: make cluster size threshold for monitoring configurable tipc: introduce constants for tipc address validation net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update() MAINTAINERS: xgene: Add driver and documentation path Documentation: dtb: xgene: Add MDIO node dtb: xgene: Add MDIO node drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset drivers: net: xgene: Use exported functions drivers: net: xgene: Enable MDIO driver drivers: net: xgene: Add backward compatibility drivers: net: phy: xgene: Add MDIO driver ...
2016-07-25Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The locking tree was busier in this cycle than the usual pattern - a couple of major projects happened to coincide. The main changes are: - implement the atomic_fetch_{add,sub,and,or,xor}() API natively across all SMP architectures (Peter Zijlstra) - add atomic_fetch_{inc/dec}() as well, using the generic primitives (Davidlohr Bueso) - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso, Waiman Long) - optimize smp_cond_load_acquire() on arm64 and implement LSE based atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() on arm64 (Will Deacon) - introduce smp_acquire__after_ctrl_dep() and fix various barrier mis-uses and bugs (Peter Zijlstra) - after discovering ancient spin_unlock_wait() barrier bugs in its implementation and usage, strengthen its semantics and update/fix usage sites (Peter Zijlstra) - optimize mutex_trylock() fastpath (Peter Zijlstra) - ... misc fixes and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits) locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire() locking/static_keys: Fix non static symbol Sparse warning locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec() locking/atomic, arch/tile: Fix tilepro build locking/atomic, arch/m68k: Remove comment locking/atomic, arch/arc: Fix build locking/Documentation: Clarify limited control-dependency scope locking/atomic, arch/rwsem: Employ atomic_long_fetch_add() locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire() locking/atomic, arch/mips: Convert to _relaxed atomics locking/atomic, arch/alpha: Convert to _relaxed atomics locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions locking/atomic: Remove linux/atomic.h:atomic_fetch_or() locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() locking/atomic: Fix atomic64_relaxed() bits locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}() locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() ...
2016-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next, they are: 1) Count pre-established connections as active in "least connection" schedulers such that pre-established connections to avoid overloading backend servers on peak demands, from Michal Kubecek via Simon Horman. 2) Address a race condition when resizing the conntrack table by caching the bucket size when fulling iterating over the hashtable in these three possible scenarios: 1) dump via /proc/net/nf_conntrack, 2) unlinking userspace helper and 3) unlinking custom conntrack timeout. From Liping Zhang. 3) Revisit early_drop() path to perform lockless traversal on conntrack eviction under stress, use del_timer() as synchronization point to avoid two CPUs evicting the same entry, from Florian Westphal. 4) Move NAT hlist_head to nf_conn object, this simplifies the existing NAT extension and it doesn't increase size since recent patches to align nf_conn, from Florian. 5) Use rhashtable for the by-source NAT hashtable, also from Florian. 6) Don't allow --physdev-is-out from OUTPUT chain, just like --physdev-out is not either, from Hangbin Liu. 7) Automagically set on nf_conntrack counters if the user tries to match ct bytes/packets from nftables, from Liping Zhang. 8) Remove possible_net_t fields in nf_tables set objects since we just simply pass the net pointer to the backend set type implementations. 9) Fix possible off-by-one in h323, from Toby DiPasquale. 10) early_drop() may be called from ctnetlink patch, so we must hold rcu read size lock from them too, this amends Florian's patch #3 coming in this batch, from Liping Zhang. 11) Use binary search to validate jump offset in x_tables, this addresses the O(n!) validation that was introduced recently resolve security issues with unpriviledge namespaces, from Florian. 12) Fix reference leak to connlabel in error path of nft_ct, from Zhang. 13) Three updates for nft_log: Fix log prefix leak in error path. Bail out on loglevel larger than debug in nft_log and set on the new NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang. 14) Allow to filter rule dumps in nf_tables based on table and chain names. 15) Simplify connlabel to always use 128 bits to store labels and get rid of unused function in xt_connlabel, from Florian. 16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack helper, by Gao Feng. 17) Put back x_tables module reference in nft_compat on error, from Liping Zhang. 18) Add a reference count to the x_tables extensions cache in nft_compat, so we can remove them when unused and avoid a crash if the extensions are rmmod, again from Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Just several instances of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-23netfilter: nft_compat: fix crash when related match/target module is removedLiping Zhang
We "cache" the loaded match/target modules and reuse them, but when the modules are removed, we still point to them. Then we may end up with invalid memory references when using iptables-compat to add rules later. Input the following commands will reproduce the kernel crash: # iptables-compat -A INPUT -j LOG # iptables-compat -D INPUT -j LOG # rmmod xt_LOG # iptables-compat -A INPUT -j LOG BUG: unable to handle kernel paging request at ffffffffa05a9010 IP: [<ffffffff813f783e>] strcmp+0xe/0x30 Call Trace: [<ffffffffa05acc43>] nft_target_select_ops+0x83/0x1f0 [nft_compat] [<ffffffffa058a177>] nf_tables_expr_parse+0x147/0x1f0 [nf_tables] [<ffffffffa058e541>] nf_tables_newrule+0x301/0x810 [nf_tables] [<ffffffff8141ca00>] ? nla_parse+0x20/0x100 [<ffffffffa057fa8f>] nfnetlink_rcv+0x33f/0x53d [nfnetlink] [<ffffffffa057f94b>] ? nfnetlink_rcv+0x1fb/0x53d [nfnetlink] [<ffffffff817116b8>] netlink_unicast+0x178/0x220 [<ffffffff81711a5b>] netlink_sendmsg+0x2fb/0x3a0 [<ffffffff816b7fc8>] sock_sendmsg+0x38/0x50 [<ffffffff816b8a7e>] ___sys_sendmsg+0x28e/0x2a0 [<ffffffff816bcb7e>] ? release_sock+0x1e/0xb0 [<ffffffff81804ac5>] ? _raw_spin_unlock_bh+0x35/0x40 [<ffffffff816bcbe2>] ? release_sock+0x82/0xb0 [<ffffffff816b93d4>] __sys_sendmsg+0x54/0x90 [<ffffffff816b9422>] SyS_sendmsg+0x12/0x20 [<ffffffff81805172>] entry_SYSCALL_64_fastpath+0x1a/0xa9 So when nobody use the related match/target module, there's no need to "cache" it. And nft_[match|target]_release are useless anymore, remove them. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-23netfilter: nft_compat: put back match/target module if init failLiping Zhang
If the user specify the invalid NFTA_MATCH_INFO/NFTA_TARGET_INFO attr or memory alloc fail, we should call module_put to the related match or target. Otherwise, we cannot remove the module even nobody use it. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-23netfilter: h323: Use mod_timer instead of set_expect_timeoutGao Feng
Simplify the code without any side effect. The set_expect_timeout is used to modify the timer expired time. It tries to delete timer, and add it again. So we could use mod_timer directly. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-22netfilter: connlabels: move set helper to xt_connlabelFlorian Westphal
xt_connlabel is the only user so move it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-22netfilter: conntrack: support a fixed size of 128 distinct labelsFlorian Westphal
The conntrack label extension is currently variable-sized, e.g. if only 2 labels are used by iptables rules then the labels->bits[] array will only contain one element. We track size of each label storage area in the 'words' member. But in nftables and openvswitch we always have to ask for worst-case since we don't know what bit will be used at configuration time. As most arches are 64bit we need to allocate 24 bytes in this case: struct nf_conn_labels { u8 words; /* 0 1 */ /* XXX 7 bytes hole, try to pack */ long unsigned bits[2]; /* 8 24 */ Make bits a fixed size and drop the words member, it simplifies the code and only increases memory requirements on x86 when less than 64bit labels are required. We still only allocate the extension if its needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-21netfilter: nf_tables: allow to filter out rules by table and chainPablo Neira Ayuso
If the table and/or chain attributes are set in a rule dump request, we filter out the rules based on this selection. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-21netfilter: nft_log: fix snaplen does not truncate packetsLiping Zhang
There's a similar problem in xt_NFLOG, and was fixed by commit 7643507fe8b5 ("netfilter: xt_NFLOG: nflog-range does not truncate packets"). Only set copy_len here does not work, so we should enable NF_LOG_F_COPY_LEN also. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-21netfilter: nft_log: check the validity of log levelLiping Zhang
User can specify the log level larger than 7(debug level) via nfnetlink, this is invalid. So in this case, we should report EINVAL to the userspace. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-21netfilter: nft_log: fix possible memory leak if log expr init failLiping Zhang
Suppose that we specify the NFTA_LOG_PREFIX, then NFTA_LOG_LEVEL and NFTA_LOG_GROUP are specified together or nf_logger_find_get call returns fail, i.e. expr init fail, memory leak will happen. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-21netfilter: Add helper array register/unregister functionsGao Feng
Add nf_ct_helper_init(), nf_conntrack_helpers_register() and nf_conntrack_helpers_unregister() functions to avoid repetitive opencoded initialization in helpers. This patch keeps an id parameter for nf_ct_helper_init() not to break helper matching by name that has been inconsistently exposed to userspace through ports, eg. ftp-2121, and through an incremental id, eg. tftp-1. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-19netfilter: nft_ct: fix unpaired nf_connlabels_get/put callLiping Zhang
We only get nf_connlabels if the user add ct label set expr successfully, but we will also put nf_connlabels if the user delete ct lable get expr. This is mismathced, and will cause ct label expr cannot work properly. Also, if we init something fail, we should put nf_connlabels back. Otherwise, we may waste to alloc the memory that will never be used. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-18netfilter: x_tables: speed up jump target validationFlorian Westphal
The dummy ruleset I used to test the original validation change was broken, most rules were unreachable and were not tested by mark_source_chains(). In some cases rulesets that used to load in a few seconds now require several minutes. sample ruleset that shows the behaviour: echo "*filter" for i in $(seq 0 100000);do printf ":chain_%06x - [0:0]\n" $i done for i in $(seq 0 100000);do printf -- "-A INPUT -j chain_%06x\n" $i printf -- "-A INPUT -j chain_%06x\n" $i printf -- "-A INPUT -j chain_%06x\n" $i done echo COMMIT [ pipe result into iptables-restore ] This ruleset will be about 74mbyte in size, with ~500k searches though all 500k[1] rule entries. iptables-restore will take forever (gave up after 10 minutes) Instead of always searching the entire blob for a match, fill an array with the start offsets of every single ipt_entry struct, then do a binary search to check if the jump target is present or not. After this change ruleset restore times get again close to what one gets when reverting 36472341017529e (~3 seconds on my workstation). [1] every user-defined rule gets an implicit RETURN, so we get 300k jumps + 100k userchains + 100k returns -> 500k rule entries Fixes: 36472341017529e ("netfilter: x_tables: validate targets of jumps") Reported-by: Jeff Wu <wujiafu@gmail.com> Tested-by: Jeff Wu <wujiafu@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-12netfilter: conntrack: skip clash resolution if nat is in placePablo Neira Ayuso
The clash resolution is not easy to apply if the NAT table is registered. Even if no NAT rules are installed, the nul-binding ensures that a unique tuple is used, thus, the packet that loses race gets a different source port number, as described by: http://marc.info/?l=netfilter-devel&m=146818011604484&w=2 Clash resolution with NAT is also problematic if addresses/port range ports are used since the conntrack that wins race may describe a different mangling that we may have earlier applied to the packet via nf_nat_setup_info(). Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
2016-07-12netfilter: conntrack: protect early_drop by rcu read lockLiping Zhang
User can add ct entry via nfnetlink(IPCTNL_MSG_CT_NEW), and if the total number reach the nf_conntrack_max, we will try to drop some ct entries. But in this case(the main function call path is ctnetlink_create_conntrack -> nf_conntrack_alloc -> early_drop), rcu_read_lock is not held, so race with hash resize will happen. Fixes: 242922a02717 ("netfilter: conntrack: simplify early_drop") Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_conntrack_h323: fix off-by-one in DecodeQ931Toby DiPasquale
This patch corrects an off-by-one error in the DecodeQ931 function in the nf_conntrack_h323 module. This error could result in reading off the end of a Q.931 frame. Signed-off-by: Toby DiPasquale <toby@cbcg.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11Merge tag 'ipvs-for-v4.8' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.8 please consider these enhancements to the IPVS. This alters the behaviour of the "least connection" schedulers such that pre-established connections are included in the active connection count. This avoids overloading servers when a large number of new connections arrive in a short space of time - e.g. when clients reconnect after a node or network failure. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_tables: get rid of possible_net_t from set and basechainPablo Neira Ayuso
We can pass the netns pointer as parameter to the functions that need to gain access to it. From basechains, I didn't find any client for this field anymore so let's remove this too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nft_ct: make byte/packet expr more friendlyLiping Zhang
If we want to use ct packets expr, and add a rule like follows: # nft add rule filter input ct packets gt 1 counter We will find that no packets will hit it, because nf_conntrack_acct is disabled by default. So It will not work until we enable it manually via "echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct". This is not friendly, so like xt_connbytes do, if the user want to use ct byte/packet expr, enable nf_conntrack_acct automatically. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: physdev: physdev-is-out should not work with OUTPUT chainHangbin Liu
physdev_mt() will check skb->nf_bridge first, which was alloced in br_nf_pre_routing. So if we want to use --physdev-out and physdev-is-out, we need to match it in FORWARD or POSTROUTING chain. physdev_mt_check() only checked physdev-out and missed physdev-is-out. Fix it and update the debug message to make it clearer. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Marcelo R Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nat: convert nat bysrc hash to rhashtableFlorian Westphal
It did use a fixed-size bucket list plus single lock to protect add/del. Unlike the main conntrack table we only need to add and remove keys. Convert it to rhashtable to get table autosizing and per-bucket locking. The maximum number of entries is -- as before -- tied to the number of conntracks so we do not need another upperlimit. The change does not handle rhashtable_remove_fast error, only possible "error" is -ENOENT, and that is something that can happen legitimetely, e.g. because nat module was inserted at a later time and no src manip took place yet. Tested with http-client-benchmark + httpterm with DNAT and SNAT rules in place. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11Merge tag 'ipvs-fixes2-for-v4.7' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== Second Round of IPVS Fixes for v4.7 The fix from Quentin Armitage allows the backup sync daemon to be bound to a link-local mcast IPv6 address as is already the case for IPv4. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: move nat hlist_head to nf_connFlorian Westphal
The nat extension structure is 32bytes in size on x86_64: struct nf_conn_nat { struct hlist_node bysource; /* 0 16 */ struct nf_conn * ct; /* 16 8 */ union nf_conntrack_nat_help help; /* 24 4 */ int masq_index; /* 28 4 */ /* size: 32, cachelines: 1, members: 4 */ /* last cacheline: 32 bytes */ }; The hlist is needed to quickly check for possible tuple collisions when installing a new nat binding. Storing this in the extension area has two drawbacks: 1. We need ct backpointer to get the conntrack struct from the extension. 2. When reallocation of extension area occurs we need to fixup the bysource hash head via hlist_replace_rcu. We can avoid both by placing the hlist_head in nf_conn and place nf_conn in the bysource hash rather than the extenstion. We can also remove the ->move support; no other extension needs it. Moving the entire nat extension into nf_conn would be possible as well but then we have to add yet another callback for deletion from the bysource hash table rather than just using nat extension ->destroy hook for this. nf_conn size doesn't increase due to aligment, followup patch replaces hlist_node with single pointer. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: conntrack: simplify early_dropFlorian Westphal
We don't need to acquire the bucket lock during early drop, we can use lockless traveral just like ____nf_conntrack_find. The timer deletion serves as synchronization point, if another cpu attempts to evict same entry, only one will succeed with timer deletion. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_ct_helper: unlink helper again when hash resize happenLiping Zhang
From: Liping Zhang <liping.zhang@spreadtrum.com> Similar to ctnl_untimeout, when hash resize happened, we should try to do unhelp from the 0# bucket again. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: cttimeout: unlink timeout obj again when hash resize happenLiping Zhang
Imagine such situation, nf_conntrack_htable_size now is 4096, we are doing ctnl_untimeout, and iterate on 3000# bucket. Meanwhile, another user try to reduce hash size to 2048, then all nf_conn are removed to the new hashtable. When this hash resize operation finished, we still try to itreate ct begin from 3000# bucket, find nothing to do and just return. We may miss unlinking some timeout objects. And later we will end up with invalid references to timeout object that are already gone. So when we find that hash resize happened, try to unlink timeout objects from the 0# bucket again. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: conntrack: fix race between nf_conntrack proc read and hash resizeLiping Zhang
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack hash table via /sys/module/nf_conntrack/parameters/hashsize, race will happen, because reader can observe a newly allocated hash but the old size (or vice versa). So oops will happen like follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000017 IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack] Call Trace: [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack] [<ffffffff81261a1c>] seq_read+0x2cc/0x390 [<ffffffff812a8d62>] proc_reg_read+0x42/0x70 [<ffffffff8123bee7>] __vfs_read+0x37/0x130 [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0 [<ffffffff8123cf75>] vfs_read+0x95/0x140 [<ffffffff8123e475>] SyS_read+0x55/0xc0 [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4 It is very easy to reproduce this kernel crash. 1. open one shell and input the following cmds: while : ; do echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize done 2. open more shells and input the following cmds: while : ; do cat /proc/net/nf_conntrack done 3. just wait a monent, oops will happen soon. The solution in this patch is based on Florian's Commit 5e3c61f98175 ("netfilter: conntrack: fix lookup race during hash resize"). And add a wrapper function nf_conntrack_get_ht to get hash and hsize suggested by Florian Westphal. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-08netfilter: nft_ct: fix expiration getterFlorian Westphal
We need to compute timeout.expires - jiffies, not the other way around. Add a helper, another patch can then later change more places in conntrack code where we currently open-code this. Will allow us to only change one place later when we remove per-ct timer. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-07ipvs: count pre-established TCP states as activeMichal Kubecek
Some users observed that "least connection" distribution algorithm doesn't handle well bursts of TCP connections from reconnecting clients after a node or network failure. This is because the algorithm counts active connection as worth 256 inactive ones where for TCP, "active" only means TCP connections in ESTABLISHED state. In case of a connection burst, new connections are handled before previous ones have finished the three way handshaking so that all are still counted as "inactive", i.e. cheap ones. The become "active" quickly but at that time, all of them are already assigned to one real server (or few), resulting in highly unbalanced distribution. Address this by counting the "pre-established" states as "active". Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-07-07ipvs: fix bind to link-local mcast IPv6 address in backupQuentin Armitage
When using HEAD from https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/, the command: ipvsadm --start-daemon backup --mcast-interface eth0.60 \ --mcast-group ff02::1:81 fails with the error message: Argument list too long whereas both: ipvsadm --start-daemon master --mcast-interface eth0.60 \ --mcast-group ff02::1:81 and: ipvsadm --start-daemon backup --mcast-interface eth0.60 \ --mcast-group 224.0.0.81 are successful. The error message "Argument list too long" isn't helpful. The error occurs because an IPv6 address is given in backup mode. The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c, since it fails to set the interface on the address or the socket before calling inet6_bind() (via sock->ops->bind), where the test 'if (!sk->sk_bound_dev_if)' failed. Setting sock->sk->sk_bound_dev_if on the socket before calling inet6_bind() resolves the issue. Fixes: d33288172e72 ("ipvs: add more mcast parameters for the sync daemon") Signed-off-by: Quentin Armitage <quentin@armitage.org.uk> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-07-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) Don't use userspace datatypes in bridge netfilter code, from Tobin Harding. 2) Iterate only once over the expectation table when removing the helper module, instead of once per-netns, from Florian Westphal. 3) Extra sanitization in xt_hook_ops_alloc() to return error in case we ever pass zero hooks, xt_hook_ops_alloc(): 4) Handle NFPROTO_INET from the logging core infrastructure, from Liping Zhang. 5) Autoload loggers when TRACE target is used from rules, this doesn't change the behaviour in case the user already selected nfnetlink_log as preferred way to print tracing logs, also from Liping Zhang. 6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields by cache lines, increases the size of entries in 11% per entry. From Florian Westphal. 7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian. 8) Remove useless defensive check in nf_logger_find_get() from Shivani Bhardwaj. 9) Remove zone extension as place it in the conntrack object, this is always include in the hashing and we expect more intensive use of zones since containers are in place. Also from Florian Westphal. 10) Owner match now works from any namespace, from Eric Bierdeman. 11) Make sure we only reply with TCP reset to TCP traffic from nf_reject_ipv4, patch from Liping Zhang. 12) Introduce --nflog-size to indicate amount of network packet bytes that are copied to userspace via log message, from Vishwanath Pai. This obsoletes --nflog-range that has never worked, it was designed to achieve this but it has never worked. 13) Introduce generic macros for nf_tables object generation masks. 14) Use generation mask in table, chain and set objects in nf_tables. This allows fixes interferences with ongoing preparation phase of the commit protocol and object listings going on at the same time. This update is introduced in three patches, one per object. 15) Check if the object is active in the next generation for element deactivation in the rbtree implementation, given that deactivation happens from the commit phase path we have to observe the future status of the object. 16) Support for deletion of just added elements in the hash set type. 17) Allow to resize hashtable from /proc entry, not only from the obscure /sys entry that maps to the module parameter, from Florian Westphal. 18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised anymore since we tear down the ruleset whenever the netdevice goes away. 19) Support for matching inverted set lookups, from Arturo Borrero. 20) Simplify the iptables_mangle_hook() by removing a superfluous extra branch. 21) Introduce ether_addr_equal_masked() and use it from the netfilter codebase, from Joe Perches. 22) Remove references to "Use netfilter MARK value as routing key" from the Netfilter Kconfig description given that this toggle doesn't exists already for 10 years, from Moritz Sichert. 23) Introduce generic NF_INVF() and use it from the xtables codebase, from Joe Perches. 24) Setting logger to NONE via /proc was not working unless explicit nul-termination was included in the string. This fixes seems to leave the former behaviour there, so we don't break backward. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05netfilter: nf_log: fix error on write NONE to logger choice sysctlPavel Tikhomirov
It is hard to unbind nf-logger: echo NONE > /proc/sys/net/netfilter/nf_log/0 bash: echo: write error: No such file or directory sysctl -w net.netfilter.nf_log.0=NONE sysctl: setting key "net.netfilter.nf_log.0": No such file or directory net.netfilter.nf_log.0 = NONE You need explicitly send '\0', for instance like: echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0 That seem to be strange, so fix it using proc_dostring. Now it works fine: modprobe nfnetlink_log echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0 cat /proc/sys/net/netfilter/nf_log/0 nfnetlink_log echo NONE > /proc/sys/net/netfilter/nf_log/0 cat /proc/sys/net/netfilter/nf_log/0 NONE v2: add missed error check for proc_dostring Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-04net: simplify and make pkt_type_ok() available for other usersJamal Hadi Salim
Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-03netfilter: Convert FWINV<[foo]> macros and uses to NF_INVFJoe Perches
netfilter uses multiple FWINV #defines with identical form that hide a specific structure variable and dereference it with a invflags member. $ git grep "#define FWINV" include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg)) net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg)) net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg))) net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg))) net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg))) Consolidate these macros into a single NF_INVF macro. Miscellanea: o Neaten the alignment around these uses o A few lines are > 80 columns for intelligibility Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01netfilter: Remove references to obsolete CONFIG_IP_ROUTE_FWMARKMoritz Sichert
This option was removed in commit 47dcf0cb1005 ("[NET]: Rethink mark field in struct flowi"). Signed-off-by: Moritz Sichert <moritz+linux@sichert.me> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-01netfilter: conntrack: avoid integer overflow when resizingFlorian Westphal
Can overflow so we might allocate very small table when bucket count is high on a 32bit platform. Note: resize is only possible from init_netns. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of overlapping changes, except the packet scheduler conflicts which deal with the addition of the free list parameter to qdisc_enqueue(). Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-24netfilter: nf_tables: add support for inverted logic in nft_lookupArturo Borrero
Introduce a new configuration option for this expression, which allows users to invert the logic of set lookups. In _init() we will now return EINVAL if NFT_LOOKUP_F_INV is in anyway related to a map lookup. The code in the _eval() function has been untangled and updated to sopport the XOR of options, as we should consider 4 cases: * lookup false, invert false -> NFT_BREAK * lookup false, invert true -> return w/o NFT_BREAK * lookup true, invert false -> return w/o NFT_BREAK * lookup true, invert true -> NFT_BREAK Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24netfilter: nf_tables: get rid of NFT_BASECHAIN_DISABLEDPablo Neira Ayuso
This flag was introduced to restore rulesets from the new netdev family, but since 5ebe0b0eec9d6f7 ("netfilter: nf_tables: destroy basechain and rules on netdevice removal") the ruleset is released once the netdev is gone. This also removes nft_register_basechain() and nft_unregister_basechain() since they have no clients anymore after this rework. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>