	   $Id: tools_man.txt,v 1.8 1999/05/12 07:04:41 roca Exp $


		 Networking Performance Evaluation Environment
			  - manual for version 0.92 -

				Vincent Roca

		  Universite Pierre et Marie Curie - Paris VI
		Vincent.Roca@lip6.fr, http://www-rp.lip6.fr/~roca

			        March 1999



Here is the README file of the networking performance evaluation environment.
This environment is composed of several tools:

- bench/benchd:		a client/server tool for statistics gathering,
- bencht/benchdt:	a shell script that runs bench/benchd while varying
			the message size parameter,
- bsort:		a shell/awk file that analyzes traces generated by
			bench.

It also requires the presence of:

- gnuplot:		an excellent tool to plot curves that is freely
			available:
			http://www.cs.dartmouth.edu/gnuplot_info.html


1- BENCH/BENCHD
---------------

This is the client/server tool. Three scenarios exist:

      - unidirectional BULK DATA TRANSFER:
	bench sends data continuously to benchd, trying to saturate
	the communication path (sender, network, or receiver).

      - transfers in ECHO MODE for RTT measurements:
	bench sends some data, then waits until it has received the
	same amount. benchd continuously waits for data and, each time
	it receives something, immediately sends it back.

      - unidirectional transfers for TRANSIT TIME evaluation:
	bench sends data at a regular rate to benchd, each data unit having
	a timestamp. benchd calculates the difference between reception
	time and timestamp (i.e. the transit time). Of course, both
	hosts must be synchronized (e.g. using NTP, the Network Time
	Protocol). 
	NB: if machines are not synchronized, you can also use RTT
	measurements in raw mode.

These scenarios can be used with either UDP or TCP. The possibility of
packet loss with UDP requires special handling. This is explained later.
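
To make the ECHO MODE scenario more concrete, here is a minimal Python
sketch of the same exchange. It is not the bench/benchd code: a local
socket pair and a thread stand in for the two hosts, and the names
echo_server, recv_exactly and measure_rtt are hypothetical.

```python
import socket
import threading
import time

def echo_server(sock, msg_len, count):
    """Stand-in for benchd in echo mode: send back whatever arrives."""
    remaining = msg_len * count
    while remaining > 0:
        data = sock.recv(65536)
        if not data:
            break
        sock.sendall(data)
        remaining -= len(data)

def recv_exactly(sock, n):
    """Read until n bytes have been received (a stream may split messages)."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def measure_rtt(msg_len=500, count=100):
    """Stand-in for bench: send msg_len bytes, wait for the same amount back.

    Returns one round trip time (in seconds) per message.
    """
    client, server = socket.socketpair()  # assumption: replaces a real TCP link
    worker = threading.Thread(target=echo_server, args=(server, msg_len, count))
    worker.start()
    payload = b"x" * msg_len
    rtts = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            client.sendall(payload)
            recv_exactly(client, msg_len)
            rtts.append(time.perf_counter() - start)
    finally:
        client.close()
        worker.join()
        server.close()
    return rtts
```

Calling measure_rtt(500, 1000) mirrors the "-rtt -l500 -n1000" example given
later, except that everything stays on one machine.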

Parameters:

	A strong point of this tool is the possibility to control many
	parameters:

	- various protocols
		TCP			the default
		UDP
	- various access methods
		Socket			the default
		TLI/XTI			the Transport Library Interface
					or its equivalent from X/Open
	- various scenarios
		bulk			ftp like transfers (default),
		RTT			also called echo, bench sends some
					data then waits until benchd returns
					the same amount of data.
		transit time		we measure the difference between
					the sending and receiving times
	- various versions of (t_)snd
		Standard Socket or XTI version
		direct use of write
	- various versions of (t_)rcv
		Standard Socket or XTI version
		direct use of read
	- standard features
		Message length		set the message size used by the
					application (TSDU in OSI terminology).
		Number of messages	set the number of messages exchanged.
		Test duration		limit the duration of the test.
					NB: test duration and number of
					messages are mutually exclusive.
		Connection number	set the number of connections opened
					between the client and server. Each
					new connection is handled by a newly
					forked bench and benchd process.
		Pause between each connection opening
					when opening a large number of
					connections, it is advised to pause
					after each open to avoid overwhelming
					the internal TCP not-yet-accepted
					connection list (-d<delay (millisec)>
					parameter).
		Remote host		identify the remote host name/address.
		Port number		identify the local port number on which
					to listen (case of benchd) or the
					remote port number where to open the
					connection (case of bench). By default
					the "/etc/services" file is used to
					know the various port numbers.
	- various advanced features
		Raw mode		rather than computing statistics
					over the (default) 10 s period, each
					transfer produces a timestamped record
					for post-processing analysis.
		Bench/Benchd synchronization
					an additional TCP connection is
					created between bench and benchd
					in order to synchronize both peers.
					It enables statistics gathering at the
					receive side during TCP or UDP tests.
		Rate-controlled sending	specify the maximum rate you want on
					a given flow. This may be useful to
					avoid overloading a shared LAN, to
					analyze the maximum UDP flow that can
					be handled without loss, or to generate
					padding traffic on a line...
					NB: the specified rate is only
					a (very) rough approximation of the
					generated data rate. The algorithm
					must still be improved...
		Receive buffer size	this is the maximum amount of data that
					an application can read in a single
					system call. Default is 64 kBytes
					(maximum).
		Real-time process	if you want to minimize the influence
					of other processes on your benchmarks,
					or if both bench/benchd run on the
					same machine... then make bench/benchd
					real-time processes with fixed
					(privileged?) priority.
		Pinned text/code	Make sure no paging will happen for
					these tools by locking segment pages
					in physical memory. Used along with
					real-time scheduling.
		Silent mode		avoid several trace messages while
					opening or closing a connection.
		CPU load statistics	collect CPU load and other system
					statistics automatically during data
					transfer (the opening and closing of
					connections are not taken into
					account). 
					This can be done either using the
					"sar" command, or the "vmstat"
					command.
		Nodelay mode		set the nodelay option of TCP. Must be
					used on both sites while doing RTT
					tests, and at the sender with TT tests.
		Dtonly mode		(Data Transfer Only) exclude the
					connection release stage from the
					statistics.
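
As an illustration of the rate-controlled sending feature listed above, the
following Python sketch paces messages so that a target rate is not exceeded.
It is only one possible approach and not the algorithm used by bench; the
names send_schedule and paced_send are hypothetical.

```python
import time

def send_schedule(msg_len, rate_bps, n_msgs):
    """Ideal departure offsets (seconds) for n_msgs messages of msg_len
    bytes so that the flow does not exceed rate_bps bits per second."""
    interval = msg_len * 8 / rate_bps
    return [i * interval for i in range(n_msgs)]

def paced_send(send, payload, rate_bps, n_msgs,
               clock=time.monotonic, sleep=time.sleep):
    """Call send(payload) n_msgs times, sleeping to follow the schedule.

    clock and sleep can be replaced, which makes the pacing testable
    without waiting for real time to elapse.
    """
    start = clock()
    for offset in send_schedule(len(payload), rate_bps, n_msgs):
        delay = start + offset - clock()
        if delay > 0:
            sleep(delay)
        send(payload)
```

With 500-byte messages and a 1 Mbit/s target, one message leaves every 4 ms.
Because sleep is not exact, the achieved rate is only approximate, which is
consistent with the NB above.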

Examples:

1) bulk data transfer with TCP.

      - Exchange 1000 500-byte messages between host1 and host2. Collect
	data only during data transfer.
	On host1:
		benchd -dtonly
		(-tcp -sock are optional (defaults))
	On host2:
		bench  -dtonly -h host1 -l500 -n1000
		(-tcp -sock are optional (defaults))

      - Same test for an infinite number of messages:
	On host1, same command, on host2:
		bench  -dtonly -h host1 -l500

      - Same test for a 30 second duration:
	On host1, same command, on host2:
		bench  -dtonly -h host1 -l500 -dur30

      - Same test with 4 simultaneous connections:
	On host1, same command, on host2:
		bench  -dtonly -h host1 -l500 -n1000 -c4 -d1

      - Same test in loopback mode (client and server on the same machine):
		benchd -dtonly &
		bench  -dtonly -l500 -n1000
		(-h localhost is optional (default))

      - Same test with system statistics:
	On host1:
		benchd -dtonly -cpustat3
	On host2:
		bench  -dtonly -h host1 -l500 -n1000 -cpustat3

      - Same test with bench/benchd synchronization:
	On host1:
		benchd -dtonly -sync
	On host2:
		bench  -dtonly -h host1 -l500 -n1000 -sync

2) bulk data transfer with UDP.

      - Exchange 1000 500-byte messages between host1 and host2.
	On host1:
		benchd -udp
	On host2:
		bench  -udp -dtonly -h host1 -l500 -n1000
	Due to the possibility of packet loss, the throughput on the
	receiving side (benchd) can be lower than that of the sending
	side (bench).

3) RTT data transfer with TCP.

      - Exchange 1000 500-byte messages between host1 and host2.
	On host1:
		benchd -rtt -nodelay -dtonly
	On host2:
		bench  -rtt -nodelay -dtonly -h host1 -l500 -n1000

4) RTT data transfer with UDP.

      - Exchange 1000 500-byte messages between host1 and host2.
	On host1:
		benchd -udp -rtt -BSD
	On host2:
		bench  -udp -rtt -BSD -dtonly -h host1 -l500 -n1000

System parameters that affect performances on high speed links:

	FOR AIX systems:

	On high speed links (FDDI and above) and/or long delay links, the
	parameters of interest are sb_max, tcp_sendspace and tcp_recvspace,
	udp_sendspace and udp_recvspace, and RFC1323:

	      - Due to the mbuf buffering strategy, when possible an
		application using TCP should write multiples of 4096 bytes
		at a time for maximum throughput.
	      - Use "no -o sb_max=2*NewSize" to raise the ceiling on socket
		buffer space.
	      - Use "no -o *_*space=NewSize" to set the TCP and UDP socket send
		and receive space defaults to NewSize bytes. NewSize should
		be at least 57344 bytes (56KB).
	      - On high performance systems and when the
			<bandwidth>*<round trip time>
		product is high, use "no -o rfc1323=1" to allow socket buffer
		sizes to be higher than 64KB. Then use the previous
		procedure with a large NewSize.
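
The sizing rules above can be worked through numerically. The Python sketch
below computes the <bandwidth>*<round trip time> product and derives a
NewSize from it. The 57344-byte floor and the sb_max=2*NewSize rule come from
the list above; using the product itself as NewSize is an assumption, and
the function names are hypothetical.

```python
import math

MIN_SPACE = 57344         # 56KB floor recommended above
MAX_PLAIN_WINDOW = 65535  # largest window a 16-bit TCP window field can carry

def bdp_bytes(rate_bps, rtt_ms):
    """Bandwidth * round trip time product, in bytes."""
    return math.ceil(rate_bps * rtt_ms / 8000)

def suggest_settings(rate_bps, rtt_ms):
    """Suggest NewSize and sb_max for a link (assumption: NewSize = product)."""
    new_size = max(MIN_SPACE, bdp_bytes(rate_bps, rtt_ms))
    return {
        "NewSize": new_size,
        "sb_max": 2 * new_size,
        "needs_rfc1323": new_size > MAX_PLAIN_WINDOW,
    }

if __name__ == "__main__":
    # 100 Mbit/s link (FDDI) with a 10 ms round trip time.
    print(suggest_settings(100_000_000, 10))
```

For a 100 Mbit/s link with a 10 ms round trip time the product is 125000
bytes, which is above 64KB, so rfc1323 would have to be enabled first.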

Hints, limitations:

      - In order to avoid specifying the port on each command line, add the
	following three lines to the /etc/services file:
		benchd		<port nb>/tcp		# bench test pgm
		benchd		<port nb>/udp		# bench test pgm
		benchd_sync	<port nb+1>/udp		# bench test pgm
	Choose a port number and port number + 1 that are not already used
	on your system. Of course, both ends should use the same value.

      - For a rapid description of parameters, type: "bench(d) -help".

      - For better precision, it is recommended to always set
	-dtonly and to have few intermediate performance reports. To that
	end, set -m120 to have a report every two minutes instead of
	the default 10 second interval.

      - Good precision also requires that the message exchange lasts at
	least 10 seconds. Below that, unstable figures may be obtained.
	Use the -durD argument to set the test duration rather than the
	number of messages exchanged.

      - Very low throughput can occur in RTT mode if -nodelay is
	not set on both sides. This is due to an internal mechanism (the
	Nagle algorithm) that keeps TCP from sending small segments under
	certain conditions.

      - With UDP make sure that the application receive buffer and UDP
	receive socket buffer are large enough to hold the whole TSDU.
	Otherwise the socket layer may truncate any excess data.
	Use the -rbuf<max TSDU size> flag with benchd and if required
	"no -o udp_recvspace=<max TSDU size>".

      - On AIX systems and with UDP, make sure that the udp_sendspace
	is always larger than the TSDU size. Otherwise transmit errors
	occur.
	Use "no -o udp_sendspace=<max TSDU size>".

      - In synchronization mode with UDP, ignore the
		"benchd udp: recv: Interrupted system call"
	messages. They are normal.

      - CPU statistics can also be collected with the "sar" command (-cpustat1
	flag) on either AIX/3.2.5 or AIX/4.1 systems. In that case, sar must
	be available and user must have appropriate rights. Here, statistics
	are only collected during the first 10 seconds of the transfer.
	NB: the "bsort.cpustat" tool works for both types of traces.

A quick look at bench/benchd architectures:

		+--------+  +--------+           +--------+  +--------+
		| bench  |  | child1 |<--conn1-->| child1 |  | benchd |
		|        |  +--------+           +--------+  |        |
		|        |  | child2 |<--conn2-->| child2 |  |        |
		|   ^    |  +--------+           +--------+  |        |
		+---|----+  | ...    |           | ...    |  +--------+
		    |       +--------+           +--------+
		    +------sync-connection------>| sync   |
		                                 +--------+

	BENCH:
		father process: forks a child for each TCP connection.
				forks each child and sync process, then
				waits until everything is over.
				opens/closes a sync connection with the
				remote benchd_sync port to tell it when
				a test session starts and stops.
		child processes:one per TCP connection. Initiates the
				connection with the peer (active open).
				uses signals to synchronize with the father.

	BENCHD:
		father process: listening endpoint.
				forks a child for each incoming TCP
				connection.
				no fork in case of UDP.
		child processes:one per accepted TCP connection.
		sync process:	listening sync endpoint.
				uses signals to trigger statistics printing
				by the father at the end of a UDP test
				session.


2- BENCHT/BENCHDT
-----------------

bencht is a shell script that runs bench repeatedly while varying the message
size parameter. This is used to collect performance traces that will
thereafter be processed by bsort.

benchdt is a similar shell script that runs benchd.


Usage: bencht [tcp|udp|rtt|rttudp] [sock|xti] <connexion nb> <peer host>
        (in that order !)
        additional optional arguments:
                with tcp: [nodelay]
                in all cases: [dtonly] [cpustat1] [cpustat3] [sync] [port]

Usage: benchdt [tcp|udp|rtt|rttudp] [sock|xti]
        (in that order !)
        additional optional arguments:
                with the xti API: [V0|V1] [push]
                with tcp: [nodelay]
                in all cases: [rt] [cpustat1] [cpustat3] [sync] [port]

Additional parameters:

	Some parameters must be set directly within the bencht file:

	This is the case for the list of message sizes of interest (LMSG
	variable). The number of samples depends on the precision desired.
	If high precision curves are desired, the content of this list
	depends on the MTU (Maximum Transmission Unit) of the network:
	1500 bytes with Ethernet, 4352 with FDDI, 9180 with ATM... There
	will be large performance variations at multiples of MTU - 40
	(40 is the TCP/IP header size). This is due to the necessity for TCP
	to create and send another packet when the message size crosses these
	boundaries.
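
	The following Python sketch shows where these boundaries fall and
	how a list of message sizes bracketing them could be built for the
	LMSG variable. The function names are hypothetical, and the 40-byte
	figure is the header size quoted above (no TCP or IP options assumed).

```python
TCP_IP_HEADERS = 40  # TCP/IP header size quoted above, options excluded

def packets_needed(msg_len, mtu):
    """Number of packets TCP needs to carry one message of msg_len bytes."""
    payload = mtu - TCP_IP_HEADERS
    return -(-msg_len // payload)  # integer ceiling division

def sizes_around_boundaries(mtu, n_boundaries, delta=1):
    """Message sizes just below, at, and just above each multiple of MTU - 40."""
    payload = mtu - TCP_IP_HEADERS
    sizes = []
    for k in range(1, n_boundaries + 1):
        boundary = k * payload
        sizes += [boundary - delta, boundary, boundary + delta]
    return sizes

if __name__ == "__main__":
    # Ethernet: MTU 1500, so the boundaries are multiples of 1460 bytes.
    print(sizes_around_boundaries(1500, 2))
    print(packets_needed(1460, 1500), packets_needed(1461, 1500))
```

	With Ethernet, a 1460-byte message fits in one packet while a
	1461-byte message needs two, hence the performance step.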

	Bencht uses the duration option of bench to limit test duration.
	Each test should last 15 to 30 seconds for good precision.

Trace file:

	The output of bencht is a trace file whose name is:
		<bulk|rtt>.<tcp|udp>.<conn number>.<year>.<week>
		.<day in week>_<hour>H<minute>
	For instance:
		bulk.tcp.1.98.11.4_18H21	for bulk data transfers
						on the 4th day of the 11th
						week of 1998, or
		rtt.tcp.1.98.11.4_18H21		for echo data transfers.

	The output of benchdt is a trace file whose name is:
		<bulk_d|rtt_d>.<tcp|udp>...
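
	When many trace files accumulate, their names can be decoded by a
	small script. The Python sketch below is hypothetical: the pattern
	is inferred from the example names above (reading "18H21" as hour
	and minute is an assumption), and it assumes that benchdt names
	continue exactly like bencht names after the protocol field.

```python
import re

# Layout inferred from names such as "bulk.tcp.1.98.11.4_18H21".
TRACE_NAME = re.compile(
    r"^(?P<scenario>bulk|rtt)(?P<server>_d)?\."
    r"(?P<proto>tcp|udp)\.(?P<conns>\d+)\."
    r"(?P<year>\d+)\.(?P<week>\d+)\.(?P<day>\d+)"
    r"_(?P<hour>\d+)H(?P<minute>\d+)$"
)

def parse_trace_name(name):
    """Split a trace file name into its fields, or raise ValueError."""
    match = TRACE_NAME.match(name)
    if match is None:
        raise ValueError("unrecognized trace file name: %r" % name)
    fields = match.groupdict()
    return {
        "scenario": fields["scenario"],
        "from_benchdt": fields["server"] is not None,
        "protocol": fields["proto"],
        "connections": int(fields["conns"]),
        "year": int(fields["year"]),
        "week": int(fields["week"]),
        "day_in_week": int(fields["day"]),
        "hour": int(fields["hour"]),
        "minute": int(fields["minute"]),
    }

if __name__ == "__main__":
    print(parse_trace_name("bulk.tcp.1.98.11.4_18H21"))
```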

Examples:

1) bulk data transfer with TCP. Look only at bench statistics.

      - Exchange messages between host1 and host2.
	On host1:
		benchd -silent -m600
	On host2:
		bencht tcp sock 1 host1 dtonly

2) bulk data transfer with UDP. Look only at bench statistics.

      - Exchange messages between host1 and host2.
	On host1:
		benchd -udp -m600
	On host2:
		bencht udp sock 1 host1 dtonly
	Due to the possibility of packet loss, the throughput on the
	receiving side may be smaller than that of the sending side.
	To record this phenomenon, use benchdt and synchronization
	(see below).

3) bulk data transfer with TCP. Look at bench and benchd statistics
   and CPU usage.

      - Exchange messages between host1 and host2.
	On host1:
		benchdt tcp sock cpustat3 sync
	On host2:
		bencht tcp sock 1 host1 dtonly cpustat3 sync
	Statistics from both ends are now logged.

4) bulk data transfer with UDP. Look at bench and benchd statistics
   and CPU usage.

      - Exchange messages between host1 and host2.
	On host1:
		benchdt udp sock cpustat3 sync
	On host2:
		bencht udp sock 1 host1 dtonly cpustat3 sync
	Statistics from both ends are now logged.

5) RTT data transfer with TCP. Look only at bench statistics.

      - Exchange messages between host1 and host2.
	On host1:
		benchd -rtt -nodelay -silent -m600
	On host2:
		bencht rtt sock 1 host1 dtonly nodelay

6) RTT data transfer with UDP. Look only at bench statistics.

      - Exchange messages between host1 and host2.
	On host1:
		benchd -udp -rtt -nodelay -silent -m600
	On host2:
		bencht rttudp sock 1 host1 dtonly

Hints, limitations:

      - Use of nodelay is still required with TCP in RTT mode.
      - Traces are time consuming, so set -silent and -m600 for benchd.
      - Typical use of bencht/benchdt consists of gathering statistics for
	a given set of parameters (number of connections, ...), modifying
	these parameters and running bencht/benchdt again, and so on.
	The various trace files can then be gathered in a single file
	(use cat for that purpose). The various curves will be distinguished
	later by the bsort tool thanks to the Test header.


3- BSORT/BSORT_D
----------------

bsort is a mixed shell/awk file that analyzes the trace files generated by
bench, creates data files and calls gnuplot to plot the corresponding
curves. bsort_d is similar but works with trace files generated by benchd.

An additional tool, "bsort.speedup" has been added to calculate the
speedup between two data sets. Its use is different from that of other
bsort tools (see below).

Finally, rawsort.udp/rawsort.tcp are used to plot the transit time and number
of lost messages (UDP) versus time. This is completely different from other
bsort* tools that use the user data unit size as the x-axis.
Similarly, rawsort.udp_rtt/rawsort.tcp_rtt are used to plot the round trip
time versus time. This is particularly interesting when using machines that
are not synchronized (by NTP).

Versions:

	Several versions of bsort exist:

		bsort			plots an average throughput curve
					and a delay curve.
		bsort.cpustat		plots an average CPU load curve,
					and a context switch/s curve.
					Works with both sar and initstat
					traces.
		bsort.cumul		plots a cumulative throughput curve
					and a delay curve. Useful to identify
					when saturation occurs (CPU or link)
					(the cumulative curve no longer
					increases with the number of
					connections).
		bsort.lockstat		plots a lock usage curve and a lock
					contention curve.
		bsort.lockfstat		plots a per family lock usage file
					and lock contention file.
		bsort.msg		plots a msg/s throughput curve.
		bsort.per_cpu		plots an average throughput curve
					divided by the number of CPUs, and
					a delay curve.

	Calculating the cumulative rather than the average throughput
	makes it possible to visualize the possible medium or machine
	saturation when the number of connections increases.

	Several versions of bsort_d exist:

		bsort_d
		bsort_d.cpustat
		bsort_d.cumul
		bsort_d.msg
		bsort_d.per_cpu

Output:

	In addition to calling gnuplot automatically, bsort creates many
	temporary files. Here are these files:
	
		/tmp/sortfile53824		intermediate file,
		/tmp/sortfile53824.C.dem	Lock Contention file,
		/tmp/sortfile53824.D.dem	Delay file,
		/tmp/sortfile53824.F.dem	Lock usage per family file,
		/tmp/sortfile53824.G.dem	Lock contention per family file,
		/tmp/sortfile53824.L.dem	CPU Load file,
		/tmp/sortfile53824.M.dem	Msg/s throughput file,
		/tmp/sortfile53824.O.dem	LOck usage file,
		/tmp/sortfile53824.P.dem	Speedup file,
		/tmp/sortfile53824.S.dem	context Switch/s file,
		/tmp/sortfile53824.T.dem	Throughput file,
		/tmp/sortfile53824.dat.C.1	Lock Contention data file,
		/tmp/sortfile53824.dat.D.1	Delay data file,
		/tmp/sortfile53824.dat.F.1	Lock usage/family data file,
		/tmp/sortfile53824.dat.G.1	Lock contention/fam. data file,
		/tmp/sortfile53824.dat.L.1	CPU Load data file,
		/tmp/sortfile53824.dat.M.1	Msg/s data file,
		/tmp/sortfile53824.dat.O.1	LOck usage data file,
		/tmp/sortfile53824.dat.P.1	Speedup data file,
		/tmp/sortfile53824.dat.S.1	context Switch/s data file,
		/tmp/sortfile53824.dat.T.1	Throughput data file.

	The 53824 suffix is the current process number automatically added
	to the sortfile prefix. Depending on the version of bsort used,
	not all files are generated.

	NB: The errors encountered during trace file analysis are logged in
	    the /tmp/sortfile53824 file. When the bsort command returns
	    without plotting anything, look at the last line of this file.

bsort examples:

	To plot a curve for data contained in a single trace file:

		cat bulk.CCA.tcp.1.513.2_21H34 | bsort

	If several trace files are to be plotted on the same graph:

		cat bulk.CCA.tcp.1.513.2_21H34 bulk.CCA.tcp.2.513.2_22H08 | bsort

	Once the curves are no longer needed, remove the temporary files with:
		rm /tmp/sortfile*

bsort.speedup example:

	To have speedup curves, first plot the curves you are interested in
	by using the above technique. Then identify the two *.dat.* files that
	contain data for the two curves. Use "ls -t /tmp/sortfile*dat*" for
	that. Then, run the bsort.speedup tool with:

		bsort.speedup <first .dat file> <second .dat file>

	NB: Speedup is only calculated for values corresponding to the same
	TSDU length (X-axis). No linear interpolation is performed when TSDU
	lengths don't match!
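
	The matching rule in the NB can be sketched in Python as follows.
	This is not the bsort.speedup code. It assumes that each .dat file
	holds one "TSDU length, value" pair per line and that the speedup
	is the ratio of the second value to the first; read_dat and speedup
	are hypothetical names.

```python
def read_dat(lines):
    """Parse 'x y' lines into {x: y}, skipping blank and '#' lines."""
    points = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        points[float(fields[0])] = float(fields[1])
    return points

def speedup(first, second):
    """Ratio second/first for TSDU lengths present in both data sets.

    Lengths found in only one data set are dropped: no interpolation.
    """
    common = sorted(first.keys() & second.keys())
    return {x: second[x] / first[x] for x in common if first[x] != 0}

if __name__ == "__main__":
    a = read_dat(["100 10.0", "200 20.0", "300 30.0"])
    b = read_dat(["100 20.0", "300 45.0", "400 50.0"])
    for length, ratio in speedup(a, b).items():
        print(length, ratio)
```

	In this example only lengths 100 and 300 appear in both data sets,
	so lengths 200 and 400 produce no speedup value.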

rawsort.udp example:

	To plot the transit time, loss, and sequence number versus
	time curves, use the following tool:

		cat udp_tt.trc | rawsort.udp

	Once the curves are no longer needed, remove the temporary files with:
		rm /tmp/sortfile*

	NB: rawsort.[udp|tcp] use a trace file generated by benchd (server),
	while rawsort.[udp|tcp]_rtt use a trace file generated by bench
	(client)!


4- DESCRIPTIVE STATISTICS
-------------------------

The descr_stats tool is a postprocessing tool (written in C), coming after
the bsort/* tools, that produces various statistics:
	- mean
	- median
	- variance
	- standard deviation
	- confidence intervals
	- histograms (.dat and .dem files for gnuplot)

usage: ./descr_stats row if
	row	column to consider in the input file (first column is 1)
	if	input file

Example:

	Suppose that rawsort.udp_rtt has produced the
	"/tmp/sortfile2285.dat.TT.1" file. We are interested in a statistical
	analysis of RTT measurements (2nd column). In that case we use the
	following command:

		$ ~/bench_v0.92.3/bsort/descr_stats 2 /tmp/sortfile2285.dat.TT.1
		------------------------------------------------------
		nb of samples = 1172
		mean = 3.205720
		median = 3.033000
		variance = 0.162550
		standard deviation = 0.403175
		range = 3.769000
		confidence interval around mean 3.205720:
			90: +/- 0.526280
			95: +/- 0.746280
			99: +/- 1.395280
		------------------------------------------------------
		Continue with histogram (y/n)[n] ? y   
		use step 0.010000
		histogram data file is:         /tmp/histo1032.dat
		histogram gnuplot file is:      /tmp/histo1032.dem

	The histogram can then be plotted using:
		$ gnuplot /tmp/histo1032.dem
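
	For readers who want to cross-check such figures, here is a rough
	Python equivalent of the first part of the output. It is not the
	descr_stats source: whether descr_stats uses the sample or the
	population variance is not stated here, so the sample variance is
	an assumption, and confidence intervals are left out. The names
	read_column and describe are hypothetical.

```python
import statistics

def read_column(lines, col):
    """Extract column col (first column is 1) from whitespace-separated lines."""
    values = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        values.append(float(fields[col - 1]))
    return values

def describe(values):
    """Basic descriptive statistics (sample variance is an assumption)."""
    return {
        "nb of samples": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        "standard deviation": statistics.stdev(values),
        "range": max(values) - min(values),
    }

if __name__ == "__main__":
    sample = ["0.0 3.1", "0.1 3.0", "0.2 3.4", "0.3 2.9"]
    for name, value in describe(read_column(sample, 2)).items():
        print(name, "=", value)
```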


5- GNUPLOT
----------

Gnuplot is a freely available tool for plotting curves (despite its name, it
is not part of the GNU project).
It is usually run transparently by the bsort tools, but for special purposes
it can be run separately. Online help is available: type help when the
gnuplot> prompt appears.

Example:

	Type: gnuplot
	The gnuplot> prompt appears. Online help is available (type help).
	It is possible to manually ask gnuplot to plot the curves.
	For instance:
		load "/tmp/sortfile53824.T.dem"
	will plot the Throughput curve.

Advanced features:

      - It is possible to generate PostScript files instead of X11 figures.
	Encapsulated PostScript, LaTeX, FrameMaker (MIF) and many other
	output formats are possible.

	PostScript:

		set terminal postscript
		set output "mycurve.ps"
		load "/tmp/sortfile53824.T.dem"

	There are other possibilities. Type:
		help terminal
	to have the list of supported output formats.

      - It is possible to change the X or Y scale. To do so, edit
	the desired *.dem file and add, right after the plot keyword:
		[xmin:xmax]
	to modify the X axis, or
		[][ymin:ymax]
	to modify the Y axis.


6- WHAT'S NEW IN RECENT VERSIONS?
---------------------------------

 * Version 0.92.4, May 1999:
      - a few bug corrections
      - clarified README...

 * Version 0.92.3, March 1999:
      - IRIX support
      - descriptive statistics tool for postprocessing (mean, median,
	variance, standard deviation, confidence intervals, histograms)

 * Version 0.92.1, December 1998:
      - various bug fixes from version 0.92.0 (especially on Solaris)
      - management of send/recv socket buffers
      - improved units handling
      - real-time support on both systems

 * Version 0.92.0, December 1998:
      - new Transit Time mode (RTT without the round !) for NTP synchronized
	machines
      - raw mode where statistics are replaced by the trace of packets
	received by benchd (time, size, lost msgs and transit time with UDP)
      - various units (Mbps/kBps)

 * Version 0.91.9, August 1998:
      - improved start/stop synchronization between bench/benchd when
	using UDP for more accuracy
      - names of the sort scripts changed to bsort(_d)
      - modified the bench/benchd measures output, additional statistics

 * ...

 * Version 0.91.1, December 1997:
	first publicly available version. This is the result of a
	complete redesign of a tool I extensively used and improved during
	my PhD.


7- KNOWN BUGS/LIMITATIONS
-------------------------

      - UDP loss statistics at the receiver can be erroneous if too many
	packets are consecutively lost since the per packet counter is
	coded in a u_short.
	Besides, no loss statistics are available when using a packet size
	smaller than sizeof(u_short) since the counter must be written in
	each packet.

      - Performance measurements may be slightly different at the receiving
	side. Use options -silent -dtonly and a test duration > 10 seconds
	to improve the situation.
	An intrinsic limitation remains, however: the sender/receiver
	synchronization.

      - Certain features may not be available on all OS installations. 
	It can be the case of the real-time process scheduling.
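
The first limitation above (the per packet counter coded in a u_short) can
be illustrated with a short Python sketch. It is not the benchd code: it
assumes a 16-bit u_short and in-order delivery without duplicates, and
lost_between and count_losses are hypothetical names.

```python
SEQ_MODULO = 1 << 16  # assumption: u_short is 16 bits wide

def lost_between(prev_seq, seq):
    """Messages missing between two consecutively received counters."""
    return (seq - prev_seq - 1) % SEQ_MODULO

def count_losses(received):
    """Total losses seen by a receiver, given the counters it received."""
    return sum(lost_between(p, c) for p, c in zip(received, received[1:]))

if __name__ == "__main__":
    # The counter wraps from 65535 to 0: messages 65534 and 0 were lost.
    print(count_losses([65533, 65535, 1]))
    # After 65536 consecutive losses the counter is back where a
    # loss-free flow would be, so the receiver sees no gap at all.
    next_seq = (5 + 65536 + 1) % SEQ_MODULO
    print(lost_between(5, next_seq))
```

The second call shows why the statistics become erroneous: a burst of 65536
consecutive losses is indistinguishable from no loss.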


8- FINAL NOTE AND CREDITS
-------------------------

If you have problems using one of these tools, feel free to contact
me. Also, these tools are continuously enhanced, so be sure you have
the latest version!
And thanks to Jean-Dominique Sorace (BULL S.A.) who wrote the first
version of some parts of this tool in 92/93.


----------------------------------------------------------------
Universite Pierre et Marie Curie     Vincent ROCA
Laboratoire LIP6-CNRS, Bureau C-660  mailto:vincent.roca@lip6.fr
8, rue du capitaine Scott            phone :   +33 1.44.27.75.14
75015 PARIS - FRANCE                 fax :     +33 1.44.27.87.32
------- http://www-rp.lip6.fr/~roca ----------------------------
