RPC Performance
Geetanjali Sampemane
Performance of Firefly RPC
Michael D. Schroeder & Michael Burrows
Objective
To analyze performance of RPC on the Firefly machine.
To study RPC execution and form a model of the latencies associated with the different components.
Outline
Description of Firefly system
RPC on the Firefly
Performance Analysis
Improvements
Discussions
Firefly: A Multiprocessor Workstation
Multiple MicroVAX II processors (~1 MIPS each) -- a homogeneous system
Shared memory (16 MB)
Coherent caches
Only one processor is connected to the Qbus
A DEQNA device controller attaches the Qbus to a 10 Mbit/s Ethernet
The kernel (the Nub) contains the scheduler, the virtual memory manager, and the device drivers
The scheduler supports multiple threads per address space, for concurrent execution
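Several threads sharing one address space is what later lets a single RPC server handle many calls concurrently. A minimal sketch of the idea in C with POSIX threads (not Firefly code; the thread count and names are invented for illustration):

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4                      /* invented count, for illustration  */

    static int calls_handled = 0;              /* shared state in one address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *server_thread(void *arg)
    {
        long id = (long)arg;
        /* Each thread could be waiting for, and then serving, one incoming call. */
        pthread_mutex_lock(&lock);
        calls_handled++;
        printf("thread %ld handled a call (total %d)\n", id, calls_handled);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, server_thread, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }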
RPC on Firefly
RPC is the primary communication paradigm
Stub procedures are automatically generated from Modula-2+ interface definitions
The transport protocol is chosen at bind time (UDP/IP, DECnet, and shared memory are supported; sketched below)
Server is multi-threaded
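One way to picture "transport chosen at bind time" is a per-binding table of transport operations selected when the binding is created; the stubs then call through that table on every RPC. A hedged C sketch -- transport_ops, rpc_bind, and the rest are invented names, not the Firefly interface:

    #include <stdio.h>
    #include <stddef.h>

    /* One table of operations per transport; the binding picks a table once. */
    struct transport_ops {
        const char *name;
        int (*send)(const void *pkt, size_t len);
    };

    static int udp_send(const void *pkt, size_t len)
    { (void)pkt; printf("UDP/IP: sending %zu bytes\n", len); return 0; }

    static int shm_send(const void *pkt, size_t len)
    { (void)pkt; printf("shared memory: passing %zu bytes\n", len); return 0; }

    static const struct transport_ops udp_ops = { "UDP/IP",        udp_send };
    static const struct transport_ops shm_ops = { "shared memory", shm_send };

    /* Hypothetical bind step: choose the transport once; every later call
       made through this binding uses the chosen operations.                */
    static const struct transport_ops *rpc_bind(int same_machine)
    {
        return same_machine ? &shm_ops : &udp_ops;
    }

    int main(void)
    {
        const struct transport_ops *t = rpc_bind(0);   /* remote server: UDP/IP */
        char call_packet[64] = "Null()";
        printf("bound using %s transport\n", t->name);
        return t->send(call_packet, sizeof call_packet);
    }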
Performance Analysis: Method
Two test procedures were used:
Null(), with no arguments and no results, to measure the base latency
MaxResult(), returning the largest result that fits in a single packet, to measure the per-byte cost
System model: t_L = t_B + x * T_N, where t_B is the base latency, x is the number of bytes transferred, and T_N is the incremental time per byte (worked example below)
10,000 RPCs were timed between two machines connected by a private Ethernet
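With these two procedures the model's parameters can be estimated directly: Null() measures t_B, and the difference between the two calls gives T_N. Using the measured latencies reported in the results below, and assuming MaxResult() returns roughly 1440 bytes of data (about the most that fits in one Ethernet packet), a back-of-the-envelope estimate is:

    t_B  =  2.66 ms                                          (Null() latency)
    T_N  ≈  (6.35 ms - 2.66 ms) / 1440 bytes  ≈  2.6 microseconds per byte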
Fast Path
Performance was measured along the fast path -- the code path taken in the normal case, when no packets are lost and no exceptional handling is needed
Results
Elapsed time:
Null(): 2.66 ms latency, 740 calls/sec
MaxResult(): 6.35 ms latency, 4.65 Mbit/s data transfer rate
Steps involved in an RPC call
Caller stub
Call the Starter procedure to obtain a packet buffer
Marshal the arguments into the call packet
Call the Transporter to send the packet and wait for the result
Unmarshal the result packet -- copy data from the packet into the result variables
Call the Ender procedure to free the packet
Server stub
Unmarshal the arguments from the received packet
Call the server procedure
Marshal the results into the saved call packet and send it back to the caller
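Put together, the caller-stub and server-stub steps above look roughly like the following C sketch. Starter, Transporter, and Ender mirror the runtime procedure names used in these slides; the Add() procedure, the packet layout, and the loopback Transporter (which simply calls the server stub instead of sending a packet and waiting) are invented purely for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct packet { uint8_t data[1514]; size_t len; };        /* Ethernet-sized buffer */

    static struct packet pool;                                 /* trivial one-buffer "pool" */
    static struct packet *Starter(void)           { return &pool; }
    static void           Ender(struct packet *p) { (void)p; }

    static int32_t Add(int32_t a, int32_t b)      { return a + b; }  /* server procedure */

    /* Server stub: unmarshal arguments, call the procedure, marshal the
       results back into the saved call packet.                          */
    static void Add_server_stub(struct packet *pkt)
    {
        int32_t a, b, sum;
        memcpy(&a, pkt->data,     sizeof a);
        memcpy(&b, pkt->data + 4, sizeof b);
        sum = Add(a, b);
        memcpy(pkt->data, &sum, sizeof sum);
        pkt->len = sizeof sum;
    }

    /* Loopback "Transporter": stands in for "send the call packet and wait
       for the result packet"; here it just invokes the server stub.        */
    static struct packet *Transporter(struct packet *call)
    {
        Add_server_stub(call);
        return call;
    }

    /* Caller stub: obtain a buffer, marshal, send, unmarshal, free. */
    static int32_t Add_stub(int32_t a, int32_t b)
    {
        struct packet *call = Starter();
        memcpy(call->data,     &a, sizeof a);
        memcpy(call->data + 4, &b, sizeof b);
        call->len = 8;
        struct packet *result = Transporter(call);
        int32_t sum;
        memcpy(&sum, result->data, sizeof sum);
        Ender(result);
        return sum;
    }

    int main(void) { printf("Add(2, 3) = %d\n", Add_stub(2, 3)); return 0; }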
Transporter
Fill in the RPC header in the call packet
Call the Sender to fill in the UDP, IP, and Ethernet headers (sketched below)
Call the Ethernet driver to queue the packet for transmission (because of the Firefly's asymmetric I/O structure, this must happen on CPU 0, the only processor attached to the Qbus)
The Ethernet controller reads the packet from memory and transmits it
On the receiving side, the Ethernet controller writes the packet to memory and issues a packet-arrival interrupt
The Ethernet interrupt routine validates the headers and tries to wake up a waiting server thread
The server thread wakes up in the server's Receiver procedure, calls the server stub, and sends the results back
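The first two Transporter steps are just nested headers being written into the outgoing buffer: the Transporter fills the RPC header, and the Sender then wraps it in UDP, IP, and Ethernet headers before the driver queues the frame. A simplified C sketch -- the field names, sizes, and frame layout are invented and incomplete (real headers carry more fields and need byte-order and checksum handling):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Simplified, invented header layouts, for illustration only. */
    struct eth_hdr { uint8_t dst[6], src[6]; uint16_t type; };
    struct ip_hdr  { uint8_t ver_ihl, tos; uint16_t total_len, id, frag;
                     uint8_t ttl, proto;   uint16_t cksum; uint32_t src, dst; };
    struct udp_hdr { uint16_t src_port, dst_port, len, cksum; };
    struct rpc_hdr { uint32_t call_id, proc_id; };        /* identifies this call */

    struct frame {                                        /* headers are nested   */
        struct eth_hdr eth;
        struct ip_hdr  ip;
        struct udp_hdr udp;
        struct rpc_hdr rpc;
        uint8_t        args[1400];                        /* marshalled arguments */
    };

    /* The Transporter fills the RPC header; the Sender then fills the UDP,
       IP, and Ethernet headers before the driver queues the frame.          */
    static void fill_headers(struct frame *f, uint32_t call_id,
                             uint32_t proc_id, size_t arg_len)
    {
        memset(f, 0, sizeof *f);
        f->rpc.call_id  = call_id;                        /* Transporter    */
        f->rpc.proc_id  = proc_id;
        f->udp.len      = (uint16_t)(sizeof f->udp + sizeof f->rpc + arg_len);
        f->ip.total_len = (uint16_t)(sizeof f->ip + f->udp.len);
        f->eth.type     = 0x0800;                         /* IPv4 ethertype */
    }

    int main(void)
    {
        struct frame f;
        fill_headers(&f, 1, 7, 8);
        printf("call %u, proc %u, udp len %u, ip len %u\n",
               (unsigned)f.rpc.call_id, (unsigned)f.rpc.proc_id,
               (unsigned)f.udp.len, (unsigned)f.ip.total_len);
        return 0;
    }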
Latency reduction
Marshalling is done with inline assignment statements (illustrated below)
The Ethernet interrupt routine directly wakes up the appropriate server thread
An efficient packet-buffer management scheme
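"Assignment statements for marshalling" means the generated stub copies each argument into the call packet at a fixed, known offset, rather than calling a general marshalling routine per field. A hedged C illustration of the difference; the argument record, offsets, and helper names are invented:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct args   { int32_t a; int32_t b; int16_t flags; };   /* invented argument record */
    struct packet { uint8_t data[1514]; size_t len; };

    /* Generic approach: a marshalling call per field, paid on every RPC. */
    static size_t put_field(struct packet *p, size_t off, const void *src, size_t n)
    {
        memcpy(p->data + off, src, n);
        return off + n;
    }

    static void marshal_generic(struct packet *p, const struct args *a)
    {
        size_t off = 0;
        off = put_field(p, off, &a->a,     sizeof a->a);
        off = put_field(p, off, &a->b,     sizeof a->b);
        off = put_field(p, off, &a->flags, sizeof a->flags);
        p->len = off;
    }

    /* Stub-generated approach: every offset is known when the stub is
       generated, so each field is copied directly with no per-field call
       (the Firefly stubs do this with inline Modula-2+ assignments).      */
    static void marshal_inline(struct packet *p, const struct args *a)
    {
        memcpy(p->data + 0, &a->a,     4);
        memcpy(p->data + 4, &a->b,     4);
        memcpy(p->data + 8, &a->flags, 2);
        p->len = 10;
    }

    int main(void)
    {
        struct args a = { 2, 3, 1 };
        struct packet p1, p2;
        marshal_generic(&p1, &a);
        marshal_inline(&p2, &a);
        printf("marshalled %zu and %zu bytes\n", p1.len, p2.len);
        return 0;
    }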
Measured latency (microseconds)

                      Null arguments    Large packet (1514 bytes)
Sending machine              288                 683
Network                      210                2880
Receiving machine            456                 851
Total                        954                4414
In detail: see Table VI in the paper
Improvements
Some code segments were rewritten in assembly language
Buffer management schemes were improved
A different network controller would improve performance by 11-28%
A faster network would give a 4-18% improvement
Faster CPUs would give a 52-36% improvement (presumably the larger gain is for the CPU-bound Null() call)
Conclusion
Factors involved in RPC delays are:
Marshalling delays
I/O costs
Network delays
Context-switching delays
For small packets, software costs (such as wakeup times) dominate
For large packets, network delays and I/O overheads dominate
Discussions
Today's systems have faster networks and faster CPUs -- where is the bottleneck now?
Multiprocessor systems -- effect on RPC
Different measurement/experiment approaches
Different models of RPC latency/performance
Heterogeneous systems
Geetanjali Sampemane
geta@cs.uiuc.edu