ipge tuning guide

Monday Jan 30, 2006

IPGE Ethernet Device Driver Tunable Parameters

ipge:ipge_tx_syncq
-------------------

Description
Forces outbound serialization using the Solaris STREAMS syncq. Keeping the Tx process on a fixed CPU reduces the risk of CPUs spinning while they wait for other CPUs to complete their Tx activity, so CPUs stay busy doing useful work.
The trade-off is added transmit latency in exchange for avoiding mutex spins induced by contention.

Default 0 (disabled)

Range 0 (disabled) or 1 (enabled)

When to Change
Set to 1 (enabled) when multiple outbound connections contend heavily for Tx resources.

Note
Applicable only to ipge.1.29 or later.
Refer to Roch's blog at http://blogs.sfbay/roller/page/roch#about_tx_serialization

ipge:ipge_tx_ring_size
-----------------------

Description
The number of descriptors for the transmit buffer ring. Must be a power of 2.

Default 2048

Range 16 to 8192

When to Change
Increase the value when a multithreaded transmit load may overwhelm the ring. Two signs of this are "kstat ipge:x:ipgex:tx_max_pending" approaching this setting, and transmits being postponed because no Tx descriptor is available, as reported by "kstat ipge:x:ipgex:tx_no_desc". When traffic is low, you can reduce the value to save system memory.

Note
The "x" in the kstat commands is the ipge instance number.
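
A candidate ring size can be sanity-checked before it goes into a configuration file. The Python sketch below is only an illustration: the function name is_valid_ring_size is hypothetical (it is not part of the driver or of Solaris), and the limits are taken from the Range given above.

```python
def is_valid_ring_size(size, minimum=16, maximum=8192):
    """Return True when size is a power of two within [minimum, maximum]."""
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and minimum <= size <= maximum


if __name__ == "__main__":
    for candidate in (2048, 3000, 8, 8192):
        print(candidate, is_valid_ring_size(candidate))
```

The same check could be reused for ipge_ring_size by passing maximum=65536 (64K).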

ipge:ipge_srv_fifo_depth
-------------------------

Description
Used only in the recommended helper thread mode (ipge_taskq_disable=0). ipge_srv_fifo_depth controls the number of packets that may be put in a helper thread's queue before the driver decides to drop packets. Each of the ipge_inst_taskqs (default 4) helper threads maintains its own queue.
Note that the flow of packets associated with a connection is always targeted to a specific helper thread and thus to one queue.
Packets are placed in the queue to wait for the associated kernel helper thread to handle them. A rare condition that may cause an overflow is a very high system load, under which the helper thread may be starved of CPU resources.
A more typical cause, though, is a protocol without flow control, such as a steady UDP stream from a powerful sending machine.

Default 2048

Range 1 to MAXINT

When to Change
When the queue depth is exceeded, the driver drops packets, and the drops are reported by "kstat ipge:x:ipgex:rx_pkts_dropped". If memory is available, network traffic is bursty, and rx_pkts_dropped keeps increasing, a value of 16000 is recommended.

Note
Higher values are still possible. Memory resources (one MTU per packet) are consumed only by the packets actually stored in a helper thread's queue (memory is not preallocated). If the remote machine is sending an endless stream of packets faster than the receiving application is able to process them, no amount of buffering will suffice. Increasing ipge_srv_fifo_depth just delays the inevitable drops.
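
As a rough worked example of the memory involved, the Python sketch below multiplies the queue depth by one MTU per packet and by the number of helper thread queues. The 1500-byte MTU is an assumption (standard Ethernet), the function name is made up for illustration, and the result is an upper bound that is reached only if every queue is full.

```python
def worst_case_queue_bytes(fifo_depth, mtu_bytes=1500, queues=4):
    """Upper bound on memory held by packets queued for the helper threads."""
    # One MTU-sized buffer per queued packet, one queue per helper thread.
    return fifo_depth * mtu_bytes * queues


if __name__ == "__main__":
    for depth in (2048, 16000):
        print(depth, worst_case_queue_bytes(depth))
```

Under these assumptions the recommended depth of 16000 with the default of 4 queues is bounded at 96,000,000 bytes, compared with 12,288,000 bytes for the default depth of 2048.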

ipge:ipge_reclaim_pending
--------------------------

Description
The threshold at which Tx reclaim starts. Once packets have been transmitted, their descriptors are reclaimed for reuse. Reclaim and transmit cannot execute in parallel, so the trade-off is between how often reclaim runs and how long each reclaim takes.
The best tuning keeps reclaiming in step with transmitting.

Default 4

Range 1 to (ipge_tx_ring_size - 1)

When to Change
This parameter is most likely to matter under high Tx traffic. A value of 32 is recommended.

ipge:ipge_ring_size
--------------------

Description
The number of descriptors for the receive buffer ring. Must be a power of 2. The resources associated with the arrival of a packet are in use until the interrupt service routine (ISR) has run and 'cleaned' the hardware. With the recommended setting of ipge_taskq_disable=0, the ISR is normally fast enough to handle most workloads with the default value.

Default 2048

Range 16 to 64K

When to Change
When ipge_taskq_disable=1, it makes sense to increase the ring size.

Note
The rx_no_buf and rx_no_comp_wd statistics are not implemented in ipge.
Check the TXQE (Transmit Queue Empty) bit of the Interrupt Cause Read register, ICR (000C0H; R).

ipge:ipge_bcopy_thresh
----------------------

Description
The byte count below which the driver copies the packet (bcopy) into a pre-mapped DMA buffer.

Default 256

Range 1 to MTU

When to Change
A value of 512 or higher may be better for heavy network traffic. When ipge_bcopy_thresh is set to a value equal to or higher than the network MTU, all packets go through the bcopy scheme.

Note
Refer to blog http://blogs.sfbay/roller/page/roch#why_large_bcopy_sometimes_wins

ipge:ipge_dvma_thresh
----------------------

Description
A byte count above which the ipge driver will start to use dvma calls to map and unmap dma handles. Packet sizes that fall in the range between ipge_bcopy_thresh and ipge_dvma_thresh are handled with the more common ddi_dma calls.

Default 1024

Range 1 to MTU

When to Change
As of Solaris 10, the dvma interface is always faster than the equivalent ddi_dma calls. To avoid the ddi_dma calls, set ipge_dvma_thresh to 1.
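
The interaction of ipge_bcopy_thresh and ipge_dvma_thresh can be pictured with the following Python sketch. It only illustrates the descriptions above and is not driver code: the function name, the treatment of sizes exactly equal to a threshold, and the rule that the bcopy test is applied first are all assumptions.

```python
def mapping_scheme(packet_bytes, bcopy_thresh=256, dvma_thresh=1024):
    """Guess which buffer-handling path a packet of the given size takes."""
    # Assumption: the bcopy test comes first and uses a strict "below".
    if packet_bytes < bcopy_thresh:
        return "bcopy"
    # Assumption: a strict "above" for the dvma threshold.
    if packet_bytes > dvma_thresh:
        return "dvma"
    # Everything in between uses the ordinary ddi_dma calls.
    return "ddi_dma"


if __name__ == "__main__":
    for size in (100, 512, 1500):
        print(size, mapping_scheme(size), mapping_scheme(size, dvma_thresh=1))
```

In this model, setting dvma_thresh to 1 with the default bcopy threshold leaves no packet size on the ddi_dma path: small packets are copied and everything else uses dvma.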

ipge:ipge_taskq_disable
-----------------------

Description
There are two ways ipge can send packets up to the IP layer: either from the interrupt service routine (ISR) or through helper threads (ipge_inst_taskqs). In interrupt mode (ipge_taskq_disable=1), packets are sent up from the ISR.
This is the most efficient way to do the operation. However, abuse of the interrupt context by a device driver can disrupt the whole system. If packets arrive at a high rate on a gigabit interface, an Ontario strand may end up running 100% in interrupt mode, thus pinning other important threads.

Default 0

Range 0 (taskq_enabled) or 1 (taskq_disabled)

When to Change
To push NFS server bandwidth, set ipge_taskq_disable=1 and ipge_tx_syncq=1.
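
One way to keep such settings together is to generate the /etc/system lines from a small table. The Python sketch below assumes the "set module:parameter=value" form that appears in the reader comments on this page; the helper name etc_system_lines is hypothetical, and the output should be reviewed before it is added to a real /etc/system.

```python
def etc_system_lines(settings):
    """Render a mapping of 'module:parameter' -> value as /etc/system lines."""
    return [f"set {name}={value}" for name, value in settings.items()]


if __name__ == "__main__":
    # The two settings named above for pushing NFS server bandwidth.
    nfs_server = {
        "ipge:ipge_taskq_disable": 1,
        "ipge:ipge_tx_syncq": 1,
    }
    print("\n".join(etc_system_lines(nfs_server)))
```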

ipge:ipge_inst_taskqs
---------------------

Description
The number of ipge helper threads. Each helper thread maintains its own queue. ipge_srv_fifo_depth controls the number of packets that may be put in a helper thread's queue before the driver decides to drop packets. Inbound packets are classified and placed into the different queues.

Default 4

Range 2 to 8

When to Change
We believe that ipge_inst_taskqs of 4 is the best value in common situations.

ipge:ipge_clsyspri
-------------------

Description
The priority level at which the helper (taskq) threads run.

Default 60

Range 60 to 99

When to Change
A higher priority of 99 is recommended for heavy network traffic.

ip:ip_squeue_fanout
-------------------

Description
Controls whether incoming connections from one NIC are fanned out across all CPUs. A value of 0 (disabled) means incoming connections are assigned to the squeue attached to the interrupted CPU. A value of 1 (enabled) means the connections are fanned out across all CPUs.

Default 0 (disabled)

Range 0 (disabled) or 1 (enabled)

When to Change
Set to 1 (enabled) when the NIC is faster than the CPU and multiple CPUs are needed to service the NIC.

Note
Refer to the blog at http://blogs.sun.com/roller/page/sunay?entry=the_solaris_networking_the_magic#mozTocId400304.125

sq_max_size
------------

Description
Sets the depth of the syncq (number of messages) before a destination STREAMS queue generates a QFULL message.

Default 10000

Range 0 (unlimited) to MAXINT

When to Change
sq_max_size becomes a concern when ipge reports an increasing number of nocanput events on the receive side, as shown by "kstat ipge:x:ipgex:rx_nocanput". A value of 0, which disables syncq throttling, is intended only for benchmark or test environments. A value between 20 and 100 is recommended.

tune_t_fsflushr
---------------

Description
Specifies the number of seconds between fsflush invocations.

Default 1

Range 1 to MAXINT

When to Change

autoup
------

Description
Along with tune_t_fsflushr, autoup controls the amount of memory examined for dirty pages in each invocation and the frequency of file system synchronizing operations. The value of autoup is also used to control whether a buffer is written out from the free list. Buffers marked with the B_DELWRI flag (which identifies file content pages that have changed) are written out whenever the buffer has been on the list for longer than autoup seconds.

Default 30

Range 1 to MAXINT

When to Change
Increasing the value of autoup keeps buffers in memory longer, so they are written out less often.

shmsys:shminfo_shmmax
----------------------
Obsolete in the Solaris 10 release.

rlim_fd_max
-----------

Description
Specifies the hard limit on file descriptors that a single process might have open. Overriding this limit requires superuser privilege.

Default 65536

Range 1 to MAXINT

When to Change
When the maximum number of open files for a process is not enough.

rlim_fd_cur
-----------

Description
Defines the soft limit on file descriptors that a single process can have open. A process might adjust its file descriptor limit to any value up to the hard limit defined by rlim_fd_max.

Default 256

Range 1 to MAXINT

When to Change
When the default number of open files for a process is not enough.

ndd tuning
==========

/dev/ipge instance
------------------

Description
Prior to using the ndd utility to get or set an ipge device parameter,
the ipge device instance must be specified for the ndd utility. The device remains
selected until you change the selection.

Default 0

Range 0 to (# of ipge ports -1)

/dev/ipge rx_intr_pkts
----------------------

Description
This parameter delays interrupt notification for the receive descriptor ring. Received packets are coalesced into one interrupt if the interval between packets is less than the specified value. The delay timer is measured in increments of 1.024 us for Gigabit and 10.24 us for 100 Megabit.
The feature works by starting a countdown timer each time a packet is successfully received into system memory. If a subsequent packet is received before the timer expires, the timer is reset to the programmed value and restarts its countdown. If the timer expires because no subsequent packet arrived within the programmed interval, pending receive descriptor writebacks are flushed and a receive timer interrupt is generated.
The benefit of delaying is that more packets may arrive on the wire during that time. Handling more packets per interrupt makes more efficient use of the CPU. The trade-off is that the time to handle a single packet increases slightly, by a few us.

Default 8

Range 0 (disabled) to 600

When to Change
When ipge_taskq_disable=0 and the ipge interrupt rate is high, a higher value can help receive performance.

Note
The same parameter name is used by the Cassini driver with a different meaning: there it defines the maximum number of packets one interrupt can cover.
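
To see what a given rx_intr_pkts setting means in time, the setting can be multiplied by the timer increment for the link speed. The Python sketch below uses the increments stated in the Description; the function and table names are made up for illustration.

```python
# Timer increment in microseconds, keyed by link speed in Mbit/s.
TICK_US = {1000: 1.024, 100: 10.24}


def rx_intr_delay_us(rx_intr_pkts, link_mbps=1000):
    """Convert an rx_intr_pkts setting into a delay in microseconds."""
    return rx_intr_pkts * TICK_US[link_mbps]


if __name__ == "__main__":
    for setting in (8, 600):
        print(setting, rx_intr_delay_us(setting), rx_intr_delay_us(setting, 100))
```

On Gigabit, the default of 8 works out to roughly 8.2 us and the maximum of 600 to roughly 614 us.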

/dev/ipge rx_intr_time
----------------------

Description
The packet delay timer (rx_intr_pkts) coalesces receive interrupts, but it restarts with every packet.
This parameter ensures that a receive interrupt occurs at some predefined interval after the first packet is received.

Default 3

Range 0 (disabled) to 600

When to Change
Adjust together with rx_intr_pkts to limit the number of packets handled per interrupt.
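
The interplay of the packet timer and the absolute timer can be sketched as a small simulation. The Python below is a simplified model, not driver or hardware code. Its assumptions: both timers are given directly in microseconds (the unit of rx_intr_time is not stated here), a packet arriving exactly at the deadline starts a new batch, and the function name is invented for illustration.

```python
def interrupt_times(arrivals_us, packet_timer_us, absolute_timer_us):
    """Return the times at which a receive interrupt would fire."""
    interrupts = []
    batch_start = None  # arrival time of the first packet in the current batch
    deadline = None     # time at which the current batch raises its interrupt
    for t in sorted(arrivals_us):
        # The pending interrupt fires before this packet arrives.
        if deadline is not None and t >= deadline:
            interrupts.append(deadline)
            batch_start = None
        if batch_start is None:
            batch_start = t
        # Each packet restarts the packet timer; the absolute timer,
        # counted from the first packet of the batch, caps the wait.
        deadline = min(t + packet_timer_us, batch_start + absolute_timer_us)
    if deadline is not None:
        interrupts.append(deadline)
    return interrupts


if __name__ == "__main__":
    print(interrupt_times([0, 5, 10, 100], 8, 20))
    print(interrupt_times([0, 5, 10, 15, 20, 25], 8, 20))
```

In the second call the packets keep arriving inside the packet timer, so without the absolute timer the first interrupt would be postponed until the stream paused; with it, the first batch is cut off 20 us after its first packet.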

IPGE Related TCP tuning
=======================

tcp_conn_req_max_q
-------------------

Description
Specifies the default maximum number of pending TCP connections for a TCP listener waiting to be accepted by accept(3SOCKET).

Default 128

Range 1 to 4,294,967,296

When to Change
For applications such as web servers that might receive several connection requests, the default value might be increased to match the incoming rate.
Set by ndd:
ndd -set /dev/tcp tcp_conn_req_max_q 8192

tcp_conn_req_max_q0
-------------------

Description
Specifies the default maximum number of incomplete (three-way handshake not yet finished) pending TCP connections for a TCP listener.

Default 1024

Range 0 to 4,294,967,296

When to Change
For applications such as web servers that might receive excessive connection requests, you can increase the default value to match the incoming rate.
Set by ndd:
ndd -set /dev/tcp tcp_conn_req_max_q0 8192

tcp_max_buf
-----------

Description
Defines the maximum buffer size in bytes. This parameter controls how large the send and receive buffers are set to by an application that uses setsockopt(3XNET).

Default 1,048,576

Range 8192 to 1,073,741,824

When to Change
If TCP connections are being made in a high-speed network environment, increase the value to match the network link speed.
Set by ndd:
ndd -set /dev/tcp tcp_max_buf 4194304
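
A common rule of thumb for matching TCP buffers to a link is the bandwidth-delay product: the link rate multiplied by the round-trip time. The Python sketch below is an illustration rather than a Solaris formula; the function name and the sample round-trip times are assumptions, and the round-trip time would have to be measured on the actual network (for example with ping).

```python
def bdp_bytes(link_mbps, rtt_ms):
    """Bandwidth-delay product in bytes for a link rate and round-trip time."""
    # 1 Mbit/s is 125,000 bytes/s, which is 125 bytes per millisecond.
    return round(link_mbps * 125 * rtt_ms)


if __name__ == "__main__":
    for rtt in (1, 4, 30):
        print(rtt, bdp_bytes(1000, rtt))
```

On a Gigabit link with a 4 ms round trip this gives 500000 bytes, which is below the 4194304 used in the ndd example above.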

tcp_cwnd_max
------------

Description
Defines the maximum value of the TCP congestion window (cwnd) in bytes.

Default 1,048,576

Range 128 to 1,073,741,824

When to Change
Even if an application uses setsockopt(3XNET) to change the window size to a value higher than tcp_cwnd_max, the actual window used can never grow beyond tcp_cwnd_max. Thus, tcp_max_buf should be greater than tcp_cwnd_max.
Set by ndd:
ndd -set /dev/tcp tcp_cwnd_max 2097152

tcp_recv_hiwat
--------------

Description
Defines the default receive window size in bytes.

Default 49,152

Range 2048 to 1,073,741,824

When to Change
An application can use setsockopt(3XNET) SO_RCVBUF to change the individual connection's receive buffer.
Set by ndd:
ndd -set /dev/tcp tcp_recv_hiwat 400000

tcp_xmit_hiwat
--------------

Description
Defines the default send window size in bytes.

Default 49,152

Range 4096 to 1,073,741,824

When to Change
An application can use setsockopt(3XNET) SO_SNDBUF to change the individual connection's send buffer.
Set by ndd:
ndd -set /dev/tcp tcp_xmit_hiwat 400000

Comments:

Roch: For the ipge parameters discussed here, where is the recommended place on Solaris 10 to place them? Engineering has been recommending to place tuning params into driver.conf files, but the latest EIS checklist for T2000 indicates that the ipge params should be placed in /etc/system. Can you help resolve my confusion? Thanks, -Steve Client Solutions Datacenter Practice US Northeast Area

Posted by Stephen Putre, Sun Micro US on March 02, 2006 at 12:49 PM PST #

1.) Since ipge_task_disable=0 by default, are there more appropriate non-default values for GigE for ipge rx_intr_pkts & ipge rx_intr_time?

2.) Are the values listed above for the ndd /dev/tcp tcp_max_buf & tcp_cwnd_max settings appropriate for any speed network, or specifically the max that the ipge interface can push (i.e. GigE)?

3.) How are the tcp_*_hiwat sized? I see from http://www.spec.org/web2005/results/res2005q4/web2005-20051205-00018.html that 400000 was used here as well. Most other tcp_*_hiwat publically published settings tend toward 32k or 64k.

4.) How does one identify their "link speed", as mentioned in tcp_max_buf?

By the way, this is a great bit of information on this page. Thank you.

Posted by J Surlow (strip on May 24, 2006 at 10:33 AM PDT #

good

Posted by 196.207.40.213 on May 26, 2006 at 03:15 AM PDT #

The driver.conf, /kernel/drv/ipge.conf for this driver, is used for ndd commands to survive reboot. There are also driver tunable parameters defined in global scope and can be tuned in /etc/system, such as ipge_tx_syncq.

Posted by Erhyuan Tsai on May 29, 2006 at 12:54 PM PDT #

1) rx_intr_pkts & rx_intr_time.
Depends on latency requirement and workload, raise these two settings higher can help throughput. 600 is recommended for heavy traffic.
2) tcp_max_buf & tcp_cwnd_max.
The recommended setting is good for GbE networking. However, it depends on your network environment. Try to use ping to know the delay of round trip time then set values accordingly.
3) tcp_recv_hiwat & tcp_xmit_hiwat.
For heavy traffic, 400000 is recommended for GbE networking.
4)Link speed.
Try ping to know the speed of link.

Posted by Erhyuan Tsai on May 29, 2006 at 01:52 PM PDT #

I used the recommended parameters on Solaris 10 6/06 on a T2000 (http://www.sun.com/servers/coolthreads/tnb/parameters.jsp). After applying these parameters trunking didn't work anymore. After testing it appeared that trunking doesn't work when following parameter is turned on
set ipge:ipge_tx_syncq=1
I used following parameters in /etc/system:
set ip:ip_squeue_bind = 0
set ip:ip_squeue_fanout = 1
#set ipge:ipge_tx_syncq=1
set ipge:ipge_bcopy_thresh = 512
set ipge:ipge_dvma_thresh = 1
set pcie:pcie_aer_ce_mask=0x1
set segkmem_lpsize=0x400000

Posted by Johan Kielbaey on September 04, 2006 at 01:31 AM PDT #

nice blog but no many posts :-)

Posted by Tom on December 14, 2007 at 05:19 AM PST #
