aboutsummaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2010-05-06netpoll: add generic support for bridge and bonding devicesWANG Cong
This whole patchset is for adding netpoll support to bridge and bonding devices. I already tested it for bridge, bonding, bridge over bonding, and bonding over bridge. It looks fine now. To make bridge and bonding support netpoll, we need to adjust some netpoll generic code. This patch does the following things: 1) introduce two new priv_flags for struct net_device: IFF_IN_NETPOLL which identifies we are processing a netpoll; IFF_DISABLE_NETPOLL is used to disable netpoll support for a device at run-time; 2) introduce one new method for netdev_ops: ->ndo_netpoll_cleanup() is used to clean up netpoll when a device is removed. 3) introduce netpoll_poll_dev() which takes a struct net_device * parameter; export netpoll_send_skb() and netpoll_poll_dev() which will be used later; 4) hide a pointer to struct netpoll in struct netpoll_info, ditto. 5) introduce ->real_dev for struct netpoll. 6) introduce a new status NETDEV_BONDING_DESLAE, which is used to disable netconsole before releasing a slave, to avoid deadlocks. Cc: David Miller <davem@davemloft.net> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-05net: __alloc_skb() speedupEric Dumazet
With following patch I can reach maximum rate of my pktgen+udpsink simulator : - 'old' machine : dual quad core E5450 @3.00GHz - 64 UDP rx flows (only differ by destination port) - RPS enabled, NIC interrupts serviced on cpu0 - rps dispatched on 7 other cores. (~130.000 IPI per second) - SLAB allocator (faster than SLUB in this workload) - tg3 NIC - 1.080.000 pps without a single drop at NIC level. Idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch first sk_buff cache line, the second to prefetch the shinfo part. Also using one memset() to initialize all skb_shared_info fields instead of one by one to reduce number of instructions, using long word moves. All skb_shared_info fields before 'dataref' are cleared in __alloc_skb(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-03net: skb_free_datagram_locked() fixEric Dumazet
Commit 4b0b72f7dd617b ( net: speedup udp receive path ) introduced a bug in skb_free_datagram_locked(). We should not skb_orphan() skb if we dont have the guarantee we are the last skb user, this might happen with MSG_PEEK concurrent users. To keep socket locked for the smallest period of time, we split consume_skb() logic, inlined in skb_free_datagram_locked() Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02net: fix softnet_statChangli Gao
Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context without any protection. And enqueue_to_backlog should update the netdev_rx_stat of the target CPU. This patch renames softnet_data.total to softnet_data.processed: the number of packets processed in uppper levels(IP stacks). softnet_stat data is moved into softnet_data. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/linux/netdevice.h | 17 +++++++---------- net/core/dev.c | 26 ++++++++++++-------------- net/sched/sch_generic.c | 2 +- 3 files changed, 20 insertions(+), 25 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02net: Inline skb_pull() in eth_type_trans().David S. Miller
In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot") we uninlined skb_pull. But in some critical paths it makes sense to inline this thing and it helps performance significantly. Create an skb_pull_inline() so that we can do this in a way that serves also as annotation. Based upon a patch by Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-01net: sock_def_readable() and friends RCU conversionEric Dumazet
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-28net: speedup udp receive pathEric Dumazet
Since commit 95766fff ([UDP]: Add memory accounting.), each received packet needs one extra sock_lock()/sock_release() pair. This added latency because of possible backlog handling. Then later, ticket spinlocks added yet another latency source in case of DDOS. This patch introduces lock_sock_bh() and unlock_sock_bh() synchronization primitives, avoiding one atomic operation and backlog processing. skb_free_datagram_locked() uses them instead of full blown lock_sock()/release_sock(). skb is orphaned inside locked section for proper socket memory reclaim, and finally freed outside of it. UDP receive path now take the socket spinlock only once. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: disallow to use net_assign_generic externallyJiri Pirko
Now there's no need to use this fuction directly because it's handled by register_pernet_device. So to make this simple and easy to understand, make this static to do not tempt potentional users. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: sk_add_backlog() take rmem_alloc into accountEric Dumazet
Current socket backlog limit is not enough to really stop DDOS attacks, because user thread spend many time to process a full backlog each round, and user might crazy spin on socket lock. We should add backlog size and receive_queue size (aka rmem_alloc) to pace writers, and let user run without being slow down too much. Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in stress situations. Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp receiver can now process ~200.000 pps (instead of ~100 pps before the patch) on a 8 core machine. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: batch skb dequeueing from softnet input_pkt_queueChangli Gao
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock contention when RPS is enabled. Note: in the worst case, the number of packets in a softnet_data may be double of netdev_max_backlog. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: reimplement softnet_data.output_queue as a FIFO queueChangli Gao
reimplement softnet_data.output_queue as a FIFO queue to keep the fairness among the qdiscs rescheduled. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ---- include/linux/netdevice.h | 1 + net/core/dev.c | 22 ++++++++++++---------- 2 files changed, 13 insertions(+), 10 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6
2010-04-27Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/e100.c drivers/net/e1000e/netdev.c
2010-04-26net: rtnetlink: decouple rtnetlink address families from real address familiesPatrick McHardy
Decouple rtnetlink address families from real address families in socket.h to be able to add rtnetlink interfaces to code that is not a real address family without increasing AF_MAX/NPROTO. This will be used to add support for multicast route dumping from all tables as the proc interface can't be extended to support anything but the main table without breaking compatibility. This partialy undoes the patch to introduce independant families for routing rules and converts ipmr routing rules to a new rtnetlink family. Similar to that patch, values up to 127 are reserved for real address families, values above that may be used arbitrarily. Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-04-26net: fib_rules: mark arguments to fib_rules_register const and __net_initdataPatrick McHardy
fib_rules_register() duplicates the template passed to it without modification, mark the argument as const. Additionally the templates are only needed when instantiating a new namespace, so mark them as __net_initdata, which means they can be discarded when CONFIG_NET_NS=n. Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-04-25netns: rename unregister_pernet_subsys parameterJiri Pirko
Stay consistent with other functions and with comment also and name pernet_operations parameter properly. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-24rps: optimize rps_get_cpu()Changli Gao
optimize rps_get_cpu(). don't initialize ports when we can get the ports. one memory access for ports than two. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22net: Socket filter ancilliary data access for skb->dev->typePaul LeoNerd Evans
Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving access to skb->dev->type, as reported in the sll_hatype field. When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all interfaces, there doesn't appear to be a way for the filter program to actually find out the underlying hardware type the packet was captured on. This patch adds such ability. This patch also handles the case where skb->dev can be NULL, such as on netlink sockets. Signed-off-by: Paul Evans <leonerd@leonerd.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22rtnetlink: potential ERR_PTR dereferenceDan Carpenter
In the original code, if rtnl_create_link() returned an ERR_PTR then that would get passed to rtnl_configure_link() which dereferences it. Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22net: Orphan and de-dst skbs earlier in xmit path.David S. Miller
This way GSO packets don't get handled differently. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-22rps: immediate send IPI in process_backlog()Eric Dumazet
If some skb are queued to our backlog, we are delaying IPI sending at the end of net_rx_action(), increasing latencies. This defeats the queueing, since we want to quickly dispatch packets to the pool of worker cpus, then eventually deeply process our packets. It's better to send IPI before processing our packets in upper layers, from process_backlog(). Change the _and_disable_irq suffix to _and_enable_irq(), since we enable local irq in net_rps_action(), sorry for the confusion. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21net: Introduce skb_orphan_try()Eric Dumazet
At this point, skb->destructor is not the original one (stored in DEV_GSO_CB(skb)->destructor) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-6000.c net/core/dev.c
2010-04-21net: Fix an RCU warning in dev_pick_tx()David Howells
Fix the following RCU warning in dev_pick_tx(): =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- net/core/dev.c:1993 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by swapper/0: #0: (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278 #1: (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4 Call Trace: <IRQ> [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2 [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1 [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4 [<ffffffff81350c34>] ip6_output2+0x256/0x261 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d [<ffffffff81368acb>] mld_sendpack+0x273/0x39d [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d [<ffffffff81052099>] ? mark_held_locks+0x52/0x70 [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288 [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278 [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278 [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288 [<ffffffff81035531>] ? __do_softirq+0x69/0x140 [<ffffffff8103556a>] __do_softirq+0xa2/0x140 [<ffffffff81002e0c>] call_softirq+0x1c/0x28 [<ffffffff81004b54>] do_softirq+0x38/0x80 [<ffffffff81034f06>] irq_exit+0x45/0x47 [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96 [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86 [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78 [<ffffffff810096b6>] ? mwait_idle+0x65/0x78 [<ffffffff810011cb>] cpu_idle+0x4d/0x83 [<ffffffff81380b05>] rest_init+0xb9/0xc0 [<ffffffff81380a4c>] ? rest_init+0x0/0xc0 [<ffffffff8168dcf0>] start_kernel+0x392/0x39d [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7 [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb An rcu_dereference() should be an rcu_dereference_bh(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20net: Remove two unnecessary exports (skbuff).Rami Rosen
There is no need to export skb_under_panic() and skb_over_panic() in skbuff.c, since these methods are used only in skbuff.c ; this patch removes these two exports. It also marks these functions as 'static' and removeS the extern declarations of them from include/linux/skbuff.h Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20net: sk_sleep() helperEric Dumazet
Define a new function to return the waitqueue of a "struct sock". static inline wait_queue_head_t *sk_sleep(struct sock *sk) { return sk->sk_sleep; } Change all read occurrences of sk_sleep by a call to this function. Needed for a future RCU conversion. sk_sleep wont be a field directly available. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20net: emphasize rtnl lock required in call_netdevice_notifiersJiri Pirko
Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be present here to make sure that all callers of call_netdevice_notifiers does the locking properly. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20rps: consistent rxhashEric Dumazet
In case we compute a software skb->rxhash, we can generate a consistent hash : Its value will be the same in both flow directions. This helps some workloads, like conntracking, since the same state needs to be accessed in both directions. tbench + RFS + this patch gives better results than tbench with default kernel configuration (no RPS, no RFS) Also fixed some sparse warnings. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20rps: cleanupsEric Dumazet
struct softnet_data holds many queues, so consistent use "sd" name instead of "queue" is better. Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog() Adds a _and_irq_disable suffix to net_rps_action() name, as David suggested. incr_input_queue_head() becomes input_queue_head_incr() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19rps: static functionsEric Dumazet
store_rps_map() & store_rps_dev_flow_table_cnt() are static. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19rps: shortcut net_rps_action()Eric Dumazet
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if RPS is not active. Tom Herbert used two bitmasks to hold information needed to send IPI, but a single LIFO list seems more appropriate. Move all RPS logic into net_rps_action() to cleanup net_rx_action() code (remove two ifdefs) Move rps_remote_softirq_cpus into softnet_data to share its first cache line, filling an existing hole. In a future patch, we could call net_rps_action() from process_backlog() to make sure we send IPI before handling this cpu backlog. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18net: Introduce skb_orphan_try()Eric Dumazet
Transmitted skb might be attached to a socket and a destructor, for memory accounting purposes. Traditionally, this destructor is called at tx completion time, when skb is freed. When tx completion is performed by another cpu than the sender, this forces some cache lines to change ownership. XPS was an attempt to give tx completion to initial cpu. David idea is to call destructor right before giving skb to device (call to ndo_start_xmit()). Because device queues are usually small, orphaning skb before tx completion is not a big deal. Some drivers already do this, we could do it in upper level. There is one known exception to this early orphaning, called tx timestamping. It needs to keep a reference to socket until device can give a hardware or software timestamp. This patch adds a skb_orphan_try() helper, to centralize all exceptions to early orphaning in one spot, and use it in dev_hard_start_xmit(). "tbench 16" results on a Nehalem machine (2 X5570 @ 2.93GHz) before: Throughput 4428.9 MB/sec 16 procs after: Throughput 4448.14 MB/sec 16 procs UDP should get even better results, its destructor being more complex, since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18net: remove time limit in process_backlog()Eric Dumazet
- There is no point to enforce a time limit in process_backlog(), since other napi instances dont follow same rule. We can exit after only one packet processed... The normal quota of 64 packets per napi instance should be the norm, and net_rx_action() already has its own time limit. Note : /proc/net/core/dev_weight can be used to tune this 64 default value. - Use DEFINE_PER_CPU_ALIGNED for softnet_data definition. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17rps: rps_sock_flow_table is mostly readEric Dumazet
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16rfs: Receive Flow SteeringTom Herbert
This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS). The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash which is stored in the socket structure. The rxhash is passed in skb's received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table, if a valid CPU is set then the packet is steered to that CPU using the RPS mechanisms. The convolution of the simple approach is that it would potentially allow OOO packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets-- we consider this a non-starter. To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table. rps_sock_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows. rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry. Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table. And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true: - The current CPU is unset (equal to RPS_NO_CPU) - Current CPU is offline - The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table. This checks if the queue tail has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in order delivery. Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to interrupting CPU s good for locality. 2) this allows lockless access to the table-- the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb, we assume that this is only called from device napi_poll which is non-reentrant. This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols. There are two configuration parameters for RFS. The "rps_flow_entries" kernel init parameter sets the number of entries in the rps_sock_flow_table, the per rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded to power of two. The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the applications processing; this can result in increased performance (higher pps, lower latency). The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. The RPC test is an request/response test similar in structure to netperf RR test ith 100 threads on each host, but does more work in userspace that netperf. e1000e on 8 core Intel No RFS or RPS 104K tps at 30% CPU No RFS (best RPS config): 290K tps at 63% CPU RFS 303K tps at 61% CPU RPC test tps CPU% 50/90/99% usec latency Latency StdDev No RFS/RPS 103K 48% 757/900/3185 4472.35 RPS only: 174K 73% 415/993/2468 491.66 RFS 223K 73% 379/651/1382 315.61 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15net: dev_pick_tx() fixEric Dumazet
When dev_pick_tx() caches tx queue_index on a socket, we must check socket dst_entry matches skb one, or risk a crash later, as reported by Denys Fedorysychenko, if old packets are in flight during a route change, involving devices with different number of queues. Bug introduced by commit a4ee3ce3 (net: Use sk_tx_queue_mapping for connected sockets) Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15net: netif_rx() must disable preemptionEric Dumazet
Eric Paris reported netif_rx() is calling smp_processor_id() from preemptible context, in particular when caller is ip_dev_loopback_xmit(). RPS commit added this smp_processor_id() call, this patch makes sure preemption is disabled. rps_get_cpus() wants rcu_read_lock() anyway, we can dot it a bit earlier. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: fib_rules: decouple address families from real address familiesPatrick McHardy
Decouple the address family values used for fib_rules from the real address families in socket.h. This allows to use fib_rules for code that is not a real address family without increasing AF_MAX/NPROTO. Values up to 127 are reserved for real address families and map directly to the corresponding AF value, values starting from 128 are for other uses. rtnetlink is changed to invoke the AF_UNSPEC dumpit/doit handlers for these families. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: fib_rules: set family in fib_rule_hdr centrallyPatrick McHardy
All fib_rules implementations need to set the family in their ->fill() functions. Since the value is available to the generic fib_nl_fill_rule() function, set it there. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: fib_rules: consolidate IPv4 and DECnet ->default_pref() functions.Patrick McHardy
Both functions are equivalent, consolidate them since a following patch needs a third implementation for multicast routing. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13dst: don't inline dst_ifdownstephen hemminger
The function dst_ifdown is called only two places but in a non- performance critical code path, there is no reason to inline it. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: uninline skb_bond_should_drop()Eric Dumazet
skb_bond_should_drop() is too big to be inlined. This patch reduces kernel text size, and its compilation time as well (shrinking include/linux/netdevice.h) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: sk_dst_cache RCUificationEric Dumazet
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this work. sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock) This rwlock is readlocked for a very small amount of time, and dst entries are already freed after RCU grace period. This calls for RCU again :) This patch converts sk_dst_lock to a spinlock, and use RCU for readers. __sk_dst_get() is supposed to be called with rcu_read_lock() or if socket locked by user, so use appropriate rcu_dereference_check() condition (rcu_read_lock_held() || sock_owned_by_user(sk)) This patch avoids two atomic ops per tx packet on UDP connected sockets, for example, and permits sk_dst_lock to be much less dirtied. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: Dont use netdev_warn()Eric Dumazet
Dont use netdev_warn() in dev_cap_txqueue() and get_rps_cpu() so that we can catch following warnings without crash. bond0.2240 received packet on queue 6, but number of RX queues is 1 bond0.2240 received packet on queue 11, but number of RX queues is 1 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-11Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/stmmac/stmmac_main.c drivers/net/wireless/wl12xx/wl1271_cmd.c drivers/net/wireless/wl12xx/wl1271_main.c drivers/net/wireless/wl12xx/wl1271_spi.c net/core/ethtool.c net/mac80211/scan.c
2010-04-07net: fix ethtool coding style errors and warningschavey
Fix coding style errors and warnings output while running checkpatch.pl on the files net/core/ethtool.c and include/linux/ethtool.h Signed-off-by: chavey <chavey@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07net: include linux/proc_fs.h in dev_addr_lists.cEric Dumazet
As pointed by Randy Dunlap, we must include linux/proc_fs.h in net/core/dev_addr_lists.c, regardless of CONFIG_PROC_FS Reported-by: Randy Dunlap <randy.dunlap@oracle.com>, Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07flow: delayed deletion of flow cache entriesTimo Teräs
Speed up lookups by freeing flow cache entries later. After virtualizing flow cache entry operations, the flow cache may now end up calling policy or bundle destructor which can be slowish. As gc_list is more effective with double linked list, the flow cache is converted to use common hlist and list macroes where appropriate. Signed-off-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-07flow: virtualize flow cache entry methodsTimo Teräs
This allows to validate the cached object before returning it. It also allows to destruct object properly, if the last reference was held in flow cache. This is also a prepartion for caching bundles in the flow cache. In return for virtualizing the methods, we save on: - not having to regenerate the whole flow cache on policy removal: each flow matching a killed policy gets refreshed as the getter function notices it smartly. - we do not have to call flow_cache_flush from policy gc, since the flow cache now properly deletes the object if it had any references Signed-off-by: Timo Teras <timo.teras@iki.fi> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-06Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bonding/bond_main.c drivers/net/via-velocity.c drivers/net/wireless/iwlwifi/iwl-agn.c