summaryrefslogtreecommitdiff
path: root/Documentation/networking/NAPI_HOWTO.txt
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@ppc970.osdl.org>2005-04-16 15:20:36 -0700
committerLinus Torvalds <torvalds@ppc970.osdl.org>2005-04-16 15:20:36 -0700
commit1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch)
tree0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/networking/NAPI_HOWTO.txt
downloadconfigs-1da177e4c3f41524e886b7f1b8a0c1fc7321cac2.tar.gz
Linux-2.6.12-rc2v2.6.12-rc2
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
Diffstat (limited to 'Documentation/networking/NAPI_HOWTO.txt')
-rw-r--r--Documentation/networking/NAPI_HOWTO.txt766
1 files changed, 766 insertions, 0 deletions
diff --git a/Documentation/networking/NAPI_HOWTO.txt b/Documentation/networking/NAPI_HOWTO.txt
new file mode 100644
index 00000000000..54376e8249c
--- /dev/null
+++ b/Documentation/networking/NAPI_HOWTO.txt
@@ -0,0 +1,766 @@
+HISTORY:
+February 16/2002 -- revision 0.2.1:
+COR typo corrected
+February 10/2002 -- revision 0.2:
+some spell checking ;->
+January 12/2002 -- revision 0.1
+This is still work in progress so may change.
+To keep up to date please watch this space.
+
+Introduction to NAPI
+====================
+
+NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
+to improve network performance on Linux. For more details please
+read that paper.
+NAPI provides a "inherent mitigation" which is bound by system capacity
+as can be seen from the following data collected by Robert on Gigabit
+ethernet (e1000):
+
+ Psize Ipps Tput Rxint Txint Done Ndone
+ ---------------------------------------------------------------
+ 60 890000 409362 17 27622 7 6823
+ 128 758150 464364 21 9301 10 7738
+ 256 445632 774646 42 15507 21 12906
+ 512 232666 994445 241292 19147 241192 1062
+ 1024 119061 1000003 872519 19258 872511 0
+ 1440 85193 1000003 946576 19505 946569 0
+
+
+Legend:
+"Ipps" stands for input packets per second.
+"Tput" == packets out of total 1M that made it out.
+"txint" == transmit completion interrupts seen
+"Done" == The number of times that the poll() managed to pull all
+packets out of the rx ring. Note from this that the lower the
+load the more we could clean up the rxring
+"Ndone" == is the converse of "Done". Note again, that the higher
+the load the more times we couldnt clean up the rxring.
+
+Observe that:
+when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated.
+The system cant handle the processing at 1 interrupt/packet at that load level.
+At lower rates on the other hand, rx interrupts go up and therefore the
+interrupt/packet ratio goes up (as observable from that table). So there is
+possibility that under low enough input, you get one poll call for each
+input packet caused by a single interrupt each time. And if the system
+cant handle interrupt per packet ratio of 1, then it will just have to
+chug along ....
+
+
+0) Prerequisites:
+==================
+A driver MAY continue using the old 2.4 technique for interfacing
+to the network stack and not benefit from the NAPI changes.
+NAPI additions to the kernel do not break backward compatibility.
+NAPI, however, requires the following features to be available:
+
+A) DMA ring or enough RAM to store packets in software devices.
+
+B) Ability to turn off interrupts or maybe events that send packets up
+the stack.
+
+NAPI processes packet events in what is known as dev->poll() method.
+Typically, only packet receive events are processed in dev->poll().
+The rest of the events MAY be processed by the regular interrupt handler
+to reduce processing latency (justified also because there are not that
+many of them).
+Note, however, NAPI does not enforce that dev->poll() only processes
+receive events.
+Tests with the tulip driver indicated slightly increased latency if
+all of the interrupt handler is moved to dev->poll(). Also MII handling
+gets a little trickier.
+The example used in this document is to move the receive processing only
+to dev->poll(); this is shown with the patch for the tulip driver.
+For an example of code that moves all the interrupt driver to
+dev->poll() look at the ported e1000 code.
+
+There are caveats that might force you to go with moving everything to
+dev->poll(). Different NICs work differently depending on their status/event
+acknowledgement setup.
+There are two types of event register ACK mechanisms.
+ I) what is known as Clear-on-read (COR).
+ when you read the status/event register, it clears everything!
+ The natsemi and sunbmac NICs are known to do this.
+ In this case your only choice is to move all to dev->poll()
+
+ II) Clear-on-write (COW)
+ i) you clear the status by writing a 1 in the bit-location you want.
+ These are the majority of the NICs and work the best with NAPI.
+ Put only receive events in dev->poll(); leave the rest in
+ the old interrupt handler.
+ ii) whatever you write in the status register clears every thing ;->
+ Cant seem to find any supported by Linux which do this. If
+ someone knows such a chip email us please.
+ Move all to dev->poll()
+
+C) Ability to detect new work correctly.
+NAPI works by shutting down event interrupts when theres work and
+turning them on when theres none.
+New packets might show up in the small window while interrupts were being
+re-enabled (refer to appendix 2). A packet might sneak in during the period
+we are enabling interrupts. We only get to know about such a packet when the
+next new packet arrives and generates an interrupt.
+Essentially, there is a small window of opportunity for a race condition
+which for clarity we'll refer to as the "rotting packet".
+
+This is a very important topic and appendix 2 is dedicated for more
+discussion.
+
+Locking rules and environmental guarantees
+==========================================
+
+-Guarantee: Only one CPU at any time can call dev->poll(); this is because
+only one CPU can pick the initial interrupt and hence the initial
+netif_rx_schedule(dev);
+- The core layer invokes devices to send packets in a round robin format.
+This implies receive is totaly lockless because of the guarantee only that
+one CPU is executing it.
+- contention can only be the result of some other CPU accessing the rx
+ring. This happens only in close() and suspend() (when these methods
+try to clean the rx ring);
+****guarantee: driver authors need not worry about this; synchronization
+is taken care for them by the top net layer.
+-local interrupts are enabled (if you dont move all to dev->poll()). For
+example link/MII and txcomplete continue functioning just same old way.
+This improves the latency of processing these events. It is also assumed that
+the receive interrupt is the largest cause of noise. Note this might not
+always be true.
+[according to Manfred Spraul, the winbond insists on sending one
+txmitcomplete interrupt for each packet (although this can be mitigated)].
+For these broken drivers, move all to dev->poll().
+
+For the rest of this text, we'll assume that dev->poll() only
+processes receive events.
+
+new methods introduce by NAPI
+=============================
+
+a) netif_rx_schedule(dev)
+Called by an IRQ handler to schedule a poll for device
+
+b) netif_rx_schedule_prep(dev)
+puts the device in a state which allows for it to be added to the
+CPU polling list if it is up and running. You can look at this as
+the first half of netif_rx_schedule(dev) above; the second half
+being c) below.
+
+c) __netif_rx_schedule(dev)
+Add device to the poll list for this CPU; assuming that _prep above
+has already been called and returned 1.
+
+d) netif_rx_reschedule(dev, undo)
+Called to reschedule polling for device specifically for some
+deficient hardware. Read Appendix 2 for more details.
+
+e) netif_rx_complete(dev)
+
+Remove interface from the CPU poll list: it must be in the poll list
+on current cpu. This primitive is called by dev->poll(), when
+it completes its work. The device cannot be out of poll list at this
+call, if it is then clearly it is a BUG(). You'll know ;->
+
+All these above nethods are used below. So keep reading for clarity.
+
+Device driver changes to be made when porting NAPI
+==================================================
+
+Below we describe what kind of changes are required for NAPI to work.
+
+1) introduction of dev->poll() method
+=====================================
+
+This is the method that is invoked by the network core when it requests
+for new packets from the driver. A driver is allowed to send upto
+dev->quota packets by the current CPU before yielding to the network
+subsystem (so other devices can also get opportunity to send to the stack).
+
+dev->poll() prototype looks as follows:
+int my_poll(struct net_device *dev, int *budget)
+
+budget is the remaining number of packets the network subsystem on the
+current CPU can send up the stack before yielding to other system tasks.
+*Each driver is responsible for decrementing budget by the total number of
+packets sent.
+ Total number of packets cannot exceed dev->quota.
+
+dev->poll() method is invoked by the top layer, the driver just sends if it
+can to the stack the packet quantity requested.
+
+more on dev->poll() below after the interrupt changes are explained.
+
+2) registering dev->poll() method
+===================================
+
+dev->poll should be set in the dev->probe() method.
+e.g:
+dev->open = my_open;
+.
+.
+/* two new additions */
+/* first register my poll method */
+dev->poll = my_poll;
+/* next register my weight/quanta; can be overridden in /proc */
+dev->weight = 16;
+.
+.
+dev->stop = my_close;
+
+
+
+3) scheduling dev->poll()
+=============================
+This involves modifying the interrupt handler and the code
+path which takes the packet off the NIC and sends them to the
+stack.
+
+it's important at this point to introduce the classical D Becker
+interrupt processor:
+
+------------------
+static irqreturn_t
+netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
+{
+
+ struct net_device *dev = (struct net_device *)dev_instance;
+ struct my_private *tp = (struct my_private *)dev->priv;
+
+ int work_count = my_work_count;
+ status = read_interrupt_status_reg();
+ if (status == 0)
+ return IRQ_NONE; /* Shared IRQ: not us */
+ if (status == 0xffff)
+ return IRQ_HANDLED; /* Hot unplug */
+ if (status & error)
+ do_some_error_handling()
+
+ do {
+ acknowledge_ints_ASAP();
+
+ if (status & link_interrupt) {
+ spin_lock(&tp->link_lock);
+ do_some_link_stat_stuff();
+ spin_lock(&tp->link_lock);
+ }
+
+ if (status & rx_interrupt) {
+ receive_packets(dev);
+ }
+
+ if (status & rx_nobufs) {
+ make_rx_buffs_avail();
+ }
+
+ if (status & tx_related) {
+ spin_lock(&tp->lock);
+ tx_ring_free(dev);
+ if (tx_died)
+ restart_tx();
+ spin_unlock(&tp->lock);
+ }
+
+ status = read_interrupt_status_reg();
+
+ } while (!(status & error) || more_work_to_be_done);
+ return IRQ_HANDLED;
+}
+
+----------------------------------------------------------------------
+
+We now change this to what is shown below to NAPI-enable it:
+
+----------------------------------------------------------------------
+static irqreturn_t
+netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
+{
+ struct net_device *dev = (struct net_device *)dev_instance;
+ struct my_private *tp = (struct my_private *)dev->priv;
+
+ status = read_interrupt_status_reg();
+ if (status == 0)
+ return IRQ_NONE; /* Shared IRQ: not us */
+ if (status == 0xffff)
+ return IRQ_HANDLED; /* Hot unplug */
+ if (status & error)
+ do_some_error_handling();
+
+ do {
+/************************ start note *********************************/
+ acknowledge_ints_ASAP(); // dont ack rx and rxnobuff here
+/************************ end note *********************************/
+
+ if (status & link_interrupt) {
+ spin_lock(&tp->link_lock);
+ do_some_link_stat_stuff();
+ spin_unlock(&tp->link_lock);
+ }
+/************************ start note *********************************/
+ if (status & rx_interrupt || (status & rx_nobuffs)) {
+ if (netif_rx_schedule_prep(dev)) {
+
+ /* disable interrupts caused
+ * by arriving packets */
+ disable_rx_and_rxnobuff_ints();
+ /* tell system we have work to be done. */
+ __netif_rx_schedule(dev);
+ } else {
+ printk("driver bug! interrupt while in poll\n");
+ /* FIX by disabling interrupts */
+ disable_rx_and_rxnobuff_ints();
+ }
+ }
+/************************ end note note *********************************/
+
+ if (status & tx_related) {
+ spin_lock(&tp->lock);
+ tx_ring_free(dev);
+
+ if (tx_died)
+ restart_tx();
+ spin_unlock(&tp->lock);
+ }
+
+ status = read_interrupt_status_reg();
+
+/************************ start note *********************************/
+ } while (!(status & error) || more_work_to_be_done(status));
+/************************ end note note *********************************/
+ return IRQ_HANDLED;
+}
+
+---------------------------------------------------------------------
+
+
+We note several things from above:
+
+I) Any interrupt source which is caused by arriving packets is now
+turned off when it occurs. Depending on the hardware, there could be
+several reasons that arriving packets would cause interrupts; these are the
+interrupt sources we wish to avoid. The two common ones are a) a packet
+arriving (rxint) b) a packet arriving and finding no DMA buffers available
+(rxnobuff) .
+This means also acknowledge_ints_ASAP() will not clear the status
+register for those two items above; clearing is done in the place where
+proper work is done within NAPI; at the poll() and refill_rx_ring()
+discussed further below.
+netif_rx_schedule_prep() returns 1 if device is in running state and
+gets successfully added to the core poll list. If we get a zero value
+we can _almost_ assume are already added to the list (instead of not running.
+Logic based on the fact that you shouldn't get interrupt if not running)
+We rectify this by disabling rx and rxnobuf interrupts.
+
+II) that receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
+These functionalities are still around actually......
+
+infact, receive_packets(dev) is very close to my_poll() and
+make_rx_buffs_avail() is invoked from my_poll()
+
+4) converting receive_packets() to dev->poll()
+===============================================
+
+We need to convert the classical D Becker receive_packets(dev) to my_poll()
+
+First the typical receive_packets() below:
+-------------------------------------------------------------------
+
+/* this is called by interrupt handler */
+static void receive_packets (struct net_device *dev)
+{
+
+ struct my_private *tp = (struct my_private *)dev->priv;
+ rx_ring = tp->rx_ring;
+ cur_rx = tp->cur_rx;
+ int entry = cur_rx % RX_RING_SIZE;
+ int received = 0;
+ int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;
+
+ while (rx_ring_not_empty) {
+ u32 rx_status;
+ unsigned int rx_size;
+ unsigned int pkt_size;
+ struct sk_buff *skb;
+ /* read size+status of next frame from DMA ring buffer */
+ /* the number 16 and 4 are just examples */
+ rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
+ rx_size = rx_status >> 16;
+ pkt_size = rx_size - 4;
+
+ /* process errors */
+ if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
+ (!(rx_status & RxStatusOK))) {
+ netdrv_rx_err (rx_status, dev, tp, ioaddr);
+ return;
+ }
+
+ if (--rx_work_limit < 0)
+ break;
+
+ /* grab a skb */
+ skb = dev_alloc_skb (pkt_size + 2);
+ if (skb) {
+ .
+ .
+ netif_rx (skb);
+ .
+ .
+ } else { /* OOM */
+ /*seems very driver specific ... some just pass
+ whatever is on the ring already. */
+ }
+
+ /* move to the next skb on the ring */
+ entry = (++tp->cur_rx) % RX_RING_SIZE;
+ received++ ;
+
+ }
+
+ /* store current ring pointer state */
+ tp->cur_rx = cur_rx;
+
+ /* Refill the Rx ring buffers if they are needed */
+ refill_rx_ring();
+ .
+ .
+
+}
+-------------------------------------------------------------------
+We change it to a new one below; note the additional parameter in
+the call.
+
+-------------------------------------------------------------------
+
+/* this is called by the network core */
+static int my_poll (struct net_device *dev, int *budget)
+{
+
+ struct my_private *tp = (struct my_private *)dev->priv;
+ rx_ring = tp->rx_ring;
+ cur_rx = tp->cur_rx;
+ int entry = cur_rx % RX_BUF_LEN;
+ /* maximum packets to send to the stack */
+/************************ note note *********************************/
+ int rx_work_limit = dev->quota;
+
+/************************ end note note *********************************/
+ do { // outer beginning loop starts here
+
+ clear_rx_status_register_bit();
+
+ while (rx_ring_not_empty) {
+ u32 rx_status;
+ unsigned int rx_size;
+ unsigned int pkt_size;
+ struct sk_buff *skb;
+ /* read size+status of next frame from DMA ring buffer */
+ /* the number 16 and 4 are just examples */
+ rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
+ rx_size = rx_status >> 16;
+ pkt_size = rx_size - 4;
+
+ /* process errors */
+ if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
+ (!(rx_status & RxStatusOK))) {
+ netdrv_rx_err (rx_status, dev, tp, ioaddr);
+ return 1;
+ }
+
+/************************ note note *********************************/
+ if (--rx_work_limit < 0) { /* we got packets, but no quota */
+ /* store current ring pointer state */
+ tp->cur_rx = cur_rx;
+
+ /* Refill the Rx ring buffers if they are needed */
+ refill_rx_ring(dev);
+ goto not_done;
+ }
+/********************** end note **********************************/
+
+ /* grab a skb */
+ skb = dev_alloc_skb (pkt_size + 2);
+ if (skb) {
+ .
+ .
+/************************ note note *********************************/
+ netif_receive_skb (skb);
+/********************** end note **********************************/
+ .
+ .
+ } else { /* OOM */
+ /*seems very driver specific ... common is just pass
+ whatever is on the ring already. */
+ }
+
+ /* move to the next skb on the ring */
+ entry = (++tp->cur_rx) % RX_RING_SIZE;
+ received++ ;
+
+ }
+
+ /* store current ring pointer state */
+ tp->cur_rx = cur_rx;
+
+ /* Refill the Rx ring buffers if they are needed */
+ refill_rx_ring(dev);
+
+ /* no packets on ring; but new ones can arrive since we last
+ checked */
+ status = read_interrupt_status_reg();
+ if (rx status is not set) {
+ /* If something arrives in this narrow window,
+ an interrupt will be generated */
+ goto done;
+ }
+ /* done! at least thats what it looks like ;->
+ if new packets came in after our last check on status bits
+ they'll be caught by the while check and we go back and clear them
+ since we havent exceeded our quota */
+ } while (rx_status_is_set);
+
+done:
+
+/************************ note note *********************************/
+ dev->quota -= received;
+ *budget -= received;
+
+ /* If RX ring is not full we are out of memory. */
+ if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
+ goto oom;
+
+ /* we are happy/done, no more packets on ring; put us back
+ to where we can start processing interrupts again */
+ netif_rx_complete(dev);
+ enable_rx_and_rxnobuf_ints();
+
+ /* The last op happens after poll completion. Which means the following:
+ * 1. it can race with disabling irqs in irq handler (which are done to
+ * schedule polls)
+ * 2. it can race with dis/enabling irqs in other poll threads
+ * 3. if an irq raised after the begining of the outer beginning
+ * loop(marked in the code above), it will be immediately
+ * triggered here.
+ *
+ * Summarizing: the logic may results in some redundant irqs both
+ * due to races in masking and due to too late acking of already
+ * processed irqs. The good news: no events are ever lost.
+ */
+
+ return 0; /* done */
+
+not_done:
+ if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
+ tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
+ refill_rx_ring(dev);
+
+ if (!received) {
+ printk("received==0\n");
+ received = 1;
+ }
+ dev->quota -= received;
+ *budget -= received;
+ return 1; /* not_done */
+
+oom:
+ /* Start timer, stop polling, but do not enable rx interrupts. */
+ start_poll_timer(dev);
+ return 0; /* we'll take it from here so tell core "done"*/
+
+/************************ End note note *********************************/
+}
+-------------------------------------------------------------------
+
+From above we note that:
+0) rx_work_limit = dev->quota
+1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
+it does the work.
+2) We have a done and not_done state.
+3) instead of netif_rx() we call netif_receive_skb() to pass the skb.
+4) we have a new way of handling oom condition
+5) A new outer for (;;) loop has been added. This serves the purpose of
+ensuring that if a new packet has come in, after we are all set and done,
+and we have not exceeded our quota that we continue sending packets up.
+
+
+-----------------------------------------------------------
+Poll timer code will need to do the following:
+
+a)
+
+ if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
+ tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
+ refill_rx_ring(dev);
+
+ /* If RX ring is not full we are still out of memory.
+ Restart the timer again. Else we re-add ourselves
+ to the master poll list.
+ */
+
+ if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
+ restart_timer();
+
+ else netif_rx_schedule(dev); /* we are back on the poll list */
+
+5) dev->close() and dev->suspend() issues
+==========================================
+The driver writter neednt worry about this. The top net layer takes
+care of it.
+
+6) Adding new Stats to /proc
+=============================
+In order to debug some of the new features, we introduce new stats
+that need to be collected.
+TODO: Fill this later.
+
+APPENDIX 1: discussion on using ethernet HW FC
+==============================================
+Most chips with FC only send a pause packet when they run out of Rx buffers.
+Since packets are pulled off the DMA ring by a softirq in NAPI,
+if the system is slow in grabbing them and we have a high input
+rate (faster than the system's capacity to remove packets), then theoretically
+there will only be one rx interrupt for all packets during a given packetstorm.
+Under low load, we might have a single interrupt per packet.
+FC should be programmed to apply in the case when the system cant pull out
+packets fast enough i.e send a pause only when you run out of rx buffers.
+Note FC in itself is a good solution but we have found it to not be
+much of a commodity feature (both in NICs and switches) and hence falls
+under the same category as using NIC based mitigation. Also experiments
+indicate that its much harder to resolve the resource allocation
+issue (aka lazy receiving that NAPI offers) and hence quantify its usefullness
+proved harder. In any case, FC works even better with NAPI but is not
+necessary.
+
+
+APPENDIX 2: the "rotting packet" race-window avoidance scheme
+=============================================================
+
+There are two types of associations seen here
+
+1) status/int which honors level triggered IRQ
+
+If a status bit for receive or rxnobuff is set and the corresponding
+interrupt-enable bit is not on, then no interrupts will be generated. However,
+as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
+generated. [assuming the status bit was not turned off].
+Generally the concept of level triggered IRQs in association with a status and
+interrupt-enable CSR register set is used to avoid the race.
+
+If we take the example of the tulip:
+"pending work" is indicated by the status bit(CSR5 in tulip).
+the corresponding interrupt bit (CSR7 in tulip) might be turned off (but
+the CSR5 will continue to be turned on with new packet arrivals even if
+we clear it the first time)
+Very important is the fact that if we turn on the interrupt bit on when
+status is set that an immediate irq is triggered.
+
+If we cleared the rx ring and proclaimed there was "no more work
+to be done" and then went on to do a few other things; then when we enable
+interrupts, there is a possibility that a new packet might sneak in during
+this phase. It helps to look at the pseudo code for the tulip poll
+routine:
+
+--------------------------
+ do {
+ ACK;
+ while (ring_is_not_empty()) {
+ work-work-work
+ if quota is exceeded: exit, no touching irq status/mask
+ }
+ /* No packets, but new can arrive while we are doing this*/
+ CSR5 := read
+ if (CSR5 is not set) {
+ /* If something arrives in this narrow window here,
+ * where the comments are ;-> irq will be generated */
+ unmask irqs;
+ exit poll;
+ }
+ } while (rx_status_is_set);
+------------------------
+
+CSR5 bit of interest is only the rx status.
+If you look at the last if statement:
+you just finished grabbing all the packets from the rx ring .. you check if
+status bit says theres more packets just in ... it says none; you then
+enable rx interrupts again; if a new packet just came in during this check,
+we are counting that CSR5 will be set in that small window of opportunity
+and that by re-enabling interrupts, we would actually triger an interrupt
+to register the new packet for processing.
+
+[The above description nay be very verbose, if you have better wording
+that will make this more understandable, please suggest it.]
+
+2) non-capable hardware
+
+These do not generally respect level triggered IRQs. Normally,
+irqs may be lost while being masked and the only way to leave poll is to do
+a double check for new input after netif_rx_complete() is invoked
+and re-enable polling (after seeing this new input).
+
+Sample code:
+
+---------
+ .
+ .
+restart_poll:
+ while (ring_is_not_empty()) {
+ work-work-work
+ if quota is exceeded: exit, not touching irq status/mask
+ }
+ .
+ .
+ .
+ enable_rx_interrupts()
+ netif_rx_complete(dev);
+ if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
+ disable_rx_and_rxnobufs()
+ goto restart_poll
+ } while (rx_status_is_set);
+---------
+
+Basically netif_rx_complete() removes us from the poll list, but because a
+new packet which will never be caught due to the possibility of a race
+might come in, we attempt to re-add ourselves to the poll list.
+
+
+
+
+APPENDIX 3: Scheduling issues.
+==============================
+As seen NAPI moves processing to softirq level. Linux uses the ksoftirqd as the
+general solution to schedule softirq's to run before next interrupt and by putting
+them under scheduler control. Also this prevents consecutive softirq's from
+monopolize the CPU. This also have the effect that the priority of ksoftirq needs
+to be considered when running very CPU-intensive applications and networking to
+get the proper balance of softirq/user balance. Increasing ksoftirq priority to 0
+(eventually more) is reported cure problems with low network performance at high
+CPU load.
+
+Most used processes in a GIGE router:
+USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND
+root 3 0.2 0.0 0 0 ? RWN Aug 15 602:00 (ksoftirqd_CPU0)
+root 232 0.0 7.9 41400 40884 ? S Aug 15 74:12 gated
+
+--------------------------------------------------------------------
+
+relevant sites:
+==================
+ftp://robur.slu.se/pub/Linux/net-development/NAPI/
+
+
+--------------------------------------------------------------------
+TODO: Write net-skeleton.c driver.
+-------------------------------------------------------------
+
+Authors:
+========
+Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
+Jamal Hadi Salim <hadi@cyberus.ca>
+Robert Olsson <Robert.Olsson@data.slu.se>
+
+Acknowledgements:
+================
+People who made this document better:
+
+Lennert Buytenhek <buytenh@gnu.org>
+Andrew Morton <akpm@zip.com.au>
+Manfred Spraul <manfred@colorfullife.com>
+Donald Becker <becker@scyld.com>
+Jeff Garzik <jgarzik@pobox.com>