Microsoft Real-Time Communications: Protocols and Technologies

Updated: July 3, 2003

Abstract

This paper is written for IT professionals and developers interested in understanding the concepts, protocols, and technologies of real-time communications. It describes protocols such as the Internet Engineering Task Force (IETF) Session Initiation Protocol (SIP), SIP Instant Messaging and Presence Language Extensions (SIMPLE), and Real-time Transport Protocol (RTP). Microsoft uses these protocols and related technologies to provide a real-time communications (RTC) platform for corporate multi-modal communication, which includes voice and video communication, instant messaging, application sharing, and collaboration. Throughout this paper, voice communication and the way the Microsoft® Windows® XP operating system supports it are used to illustrate how the underlying technologies work.

Introduction

The Microsoft® real-time communications platform is based on Microsoft’s commitment to supporting industry communication standards. Windows XP supports Internet Engineering Task Force (IETF) Session Initiation Protocol (SIP), SIP Instant Messaging and Presence Language Extensions (SIMPLE), and Real-time Transport Protocol (RTP). These protocols and associated technologies are designed to address the specific needs of real-time communication over a packet-switched network, whether the communication takes the form of voice, video, or instant messaging. This paper briefly describes the voice communication process used on a circuit-switched voice network and then focuses on RTC voice communications to illustrate how the underlying technologies are used to enable real-time communications over a packet-switched network.

RTC Call Processing

Transmitting real-time communications from one point to another involves multiple steps and various protocols.
First, some type of signaling and call control is needed to establish, modify, and terminate a call. Within the public switched telephone network (PSTN), a circuit-switched network, Signaling System 7 (SS7) is used for call setup and termination. For packet-based networks, both the SIP and H.323 protocols provide call control. For information about SIP, see “Session Initiation Protocol (SIP)” later in this paper. For information about H.323, see “Telephony Integration and Conferencing” in the Windows 2000 Internetworking Guide of the Microsoft® Windows® 2000 Server Resource Kit.

After the calling session is established, the audio or video input needs to be sampled and converted to a digital format. Next, the sampled data is encapsulated into Real-time Transport Protocol (RTP) packets. RTP is specifically designed for the needs of real-time communication over a packet-based network.

Then, the RTP packet is encapsulated into a network transport protocol, which is most often the User Datagram Protocol (UDP). Alternatively, the Transmission Control Protocol (TCP) can be used for encapsulation; however, because TCP is a reliable transport-level protocol that retransmits lost packets, the additional time needed for occasional retransmissions can add enough latency (the time between sending and receiving packets) to make the received audio unintelligible. Throughout the transmission of the RTP packets, the RTP Control Protocol (RTCP) is used to monitor the quality of an RTP session. For information about RTP and RTCP, see “RTP and RTCP” later in this paper.

Next, the network transport protocol packet, UDP or TCP, is encapsulated into an IP packet, which is then encapsulated into the link layer protocol (Ethernet, for example). The link layer packet is then transmitted to the destination computer(s). Figure 1 shows the encapsulation process, from the encapsulation of the RTP packet to encapsulation of the link layer packet.
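As a rough illustration of the nesting described above, the following Python sketch adds up the size of a voice packet at each layer. The header sizes are assumptions for the common case (a 12-byte RTP fixed header with no contributing sources, UDP, IPv4 without options, and Ethernet with its frame check sequence); the function name and the 160-byte payload are chosen only for this example.

```python
# Header sizes in bytes. Assumptions: RTP fixed header with no CSRC entries,
# UDP transport, IPv4 without options, Ethernet header plus frame check sequence.
RTP_HEADER = 12
UDP_HEADER = 8
IPV4_HEADER = 20
ETHERNET_OVERHEAD = 18  # 14-byte header + 4-byte frame check sequence


def encapsulated_sizes(payload_bytes):
    """Return the packet size at each layer, innermost layer first."""
    rtp = payload_bytes + RTP_HEADER
    udp = rtp + UDP_HEADER
    ip = udp + IPV4_HEADER
    ethernet = ip + ETHERNET_OVERHEAD
    return {"rtp": rtp, "udp": udp, "ip": ip, "ethernet": ethernet}


# 20 ms of 64 Kbps audio is 160 bytes of payload.
sizes = encapsulated_sizes(160)
for layer, size in sizes.items():
    print(f"{layer}: {size} bytes")
```

With small voice payloads, the combined headers are a noticeable fraction of every packet, which is one reason the choice of codec and packet interval matters on a packet-switched network.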
Figure 1: Real-Time Communication Protocol Encapsulation

Session Initiation Protocol

Session Initiation Protocol (SIP), which is similar to the HyperText Transfer Protocol (HTTP), is a text-based application-layer signaling and call control protocol. SIP is used to create, modify, and terminate SIP sessions. It supports both unicast and multicast communication. Because SIP is text-based, implementation, development, and debugging are easier than with H.323.

Note: Windows Messenger is a SIP-based application. Windows XP does not support SIP through Telephony Application Programming Interface (TAPI). For more information about Windows Messenger, see “Using Windows Messenger” in Help and Support Center for Windows XP Professional and “Configuring Telephony and Conferencing” in the Microsoft® Windows® XP Professional Resource Kit Documentation.

SIP Components

The main components of a SIP environment fall into two primary categories, SIP servers and SIP user agents.

SIP Servers

There are three types of SIP servers: proxy, registrar, and redirect. Each type of server performs a different function, as noted in Table 1. The specific function the server performs determines which SIP requests it processes.

Table 1 SIP Servers
SIP servers (proxy, registrar, and redirect) can be developed as separate applications or as a single application combining the functionality of all the servers. The combination of a registrar and proxy server is sometimes referred to as a rendezvous server.

SIP User Agents

Table 2 lists the two types of SIP user agents and what they do.

Table 2 SIP User Agents
Each user agent is associated with a SIP address.

SIP Call Flow

The call flow for SIP sessions depends upon whether the SIP session is established directly between SIP user agents or whether a SIP server (proxy, registrar, or redirect) is located between SIP user agents.

Figure 2 shows the typical call flow between two user agents, with each step noted in parentheses. First, user agent A sends out an INVITE request to initiate a call. User agent B then replies with the Trying response code (100), indicating that the call request is being processed. User agent B then replies with the OK response code (200), indicating that it has accepted the call. User agent A then replies to user agent B with an acknowledgement (ACK) request, indicating that user agent A received the final response code from user agent B. The real-time data is then encapsulated in RTP packets (as described in “RTP and RTCP” later in this paper) and sent between user agent A and user agent B. Either user agent A or user agent B can then send a BYE request, indicating that the user agent wants to terminate the session. The user agent that receives the BYE request (user agent B in Figure 2) then sends an OK response code (200) to indicate that the request has succeeded.

Figure 2: User Agent SIP Call Flow

Figure 3 shows the typical call flow when a proxy server is in the path between two user agents. The proxy server essentially acts as a communication midpoint, functioning as both a user server and as a user agent. When acting as a user server, the proxy receives the SIP requests and forwards them on to the destination user agent. When acting as a user agent, the proxy receives the SIP responses and forwards them on to the destination user agent.

Figure 3: Proxy Server SIP Call Flow

Figure 4 illustrates the typical call flow between a user agent and a registrar server. The registrar server accepts REGISTER requests from the user agent, indicating the addresses at which the user agent can be reached.
A registrar server is typically located with a proxy or redirect server.

Figure 4: Registrar Server SIP Call Flow

Figure 5 shows the typical call flow when a redirect server is between two user agents. User agent A sends out an INVITE request to initiate a call. The redirect server then replies with the Moved response code (302), indicating that user agent B has temporarily moved. User agent A replies with an ACK request, indicating that user agent A received the response code from the redirect server. User agent A then sends another INVITE request directly to the newly acquired address for user agent B.

Figure 5: Redirect Server SIP Call Flow

Sample SIP Architecture

To illustrate how communication is handled among SIP components and how SIP components can fit into a network environment, Figure 6 shows a sample SIP architecture.

Figure 6: Sample SIP Architecture

A. Datum Corporation has two SIP proxy servers that direct SIP requests between domains within the company. The SIP proxy server connected to the firewall handles all SIP messages sent to recipients outside the company and all messages sent to recipients within the company from outside. For example, a SIP INVITE message sent from a SIP client in A. Datum Corporation to a SIP client in Fabrikam, Inc. would be sent to the SIP proxy server in Fabrikam, Inc. The SIP proxy server then forwards the SIP INVITE request to the destination SIP client computer, or the SIP IP phone, in the domain of the SIP proxy server in Fabrikam, Inc.

For example, the SIP server in Fabrikam, Inc. might receive a SIP INVITE request sent with a SIP URL in the format of a global phone number. If the global phone number has the destination of a SIP IP phone in Fabrikam, Inc., then the SIP INVITE request will be forwarded directly to the SIP IP phone.
On the other hand, if the global phone number has the destination of a non-SIP phone, such as an analog phone, then the SIP INVITE request will be forwarded to the SIP/PSTN gateway, which formats the SIP INVITE request for the PSTN. Using the global phone number, the organization’s private branch exchange (PBX) determines whether to route the call to an analog phone within the company, or to route it to the PSTN for an analog phone outside the company.

SIP Protocol

SIP messages are based on the standard Internet message format, as described in RFC 822, “Standard for the Format of ARPA Internet Text Messages,” which you can find on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources/. SIP messages are either requests from a client to a server or responses from a server to a client. Each SIP message has three parts, as shown in Table 3.

Table 3 SIP Message Parts
SIP defines the values for the start line and headers. The Session Description Protocol (SDP) defines the values for the message body.

SIP Message Start Line

The syntax for the start line, as shown in Table 4, depends on whether the message is a request or a response.

Table 4 Start Line Syntax
Request Method

The first item in a request start line is the SIP method, a signaling command. The SIP methods, listed in Table 5, are defined in RFC 3261 and in the Internet drafts “SIP Extensions for Presence” and “SIP Extensions for Instant Messaging.”

Table 5 SIP Methods and Their Functions
Request URI

The second item in a request start line is a Request-Uniform Resource Identifier (URI), which contains the URL of the called party. Generally, the URL is a SIP URL. A SIP URL can have one of several formats. Some of the supported formats are listed in Table 6. For a complete list of available SIP URL formats and syntax, see RFC 3261, “SIP: Session Initiation Protocol” on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources/.

Table 6 Partial List of SIP Request URL Formats
Request or Response Version

The final item in the request start line and the first item in the response start line is the SIP version, which is currently version 2.0. The following sample SIP request message, taken from a Windows Messenger session, shows a typical SIP request line.

Response Status Code

There are six categories of status code: informational, success, redirection, client error, server error, and global failure. The left-most digit of the status code, as shown in Table 7, indicates the code’s category.

Table 7 SIP Response Status Codes
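The start-line rules described in this section can be sketched in a few lines of Python. This is a minimal illustration, not how Windows Messenger parses messages: it assumes the three start-line items are separated by single spaces, and the function name, dictionary keys, and the address sip:user@example.com are made up for the example.

```python
# Category of a response, keyed by the left-most digit of the status code.
STATUS_CATEGORIES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
    "6": "global failure",
}


def parse_start_line(line):
    """Split a SIP start line into its three items.

    A response starts with the SIP version; a request ends with it.
    """
    first, rest = line.split(" ", 1)
    if first.startswith("SIP/"):
        # Response: version, status code, response phrase
        code, phrase = rest.split(" ", 1)
        return {
            "kind": "response",
            "version": first,
            "code": int(code),
            "phrase": phrase,
            "category": STATUS_CATEGORIES[code[0]],
        }
    # Request: method, Request-URI, version
    uri, version = rest.rsplit(" ", 1)
    return {"kind": "request", "method": first, "uri": uri, "version": version}


print(parse_start_line("INVITE sip:user@example.com SIP/2.0"))
print(parse_start_line("SIP/2.0 302 Moved Temporarily"))
```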
Response Phrase

All SIP response codes defined in SIP version 2.0, and their corresponding categories and response phrases, are listed in Table 8.

Table 8 SIP Response Status Codes and Phrases
SIP Message Headers

The start line of a SIP message is followed by one or more headers. The included headers depend upon whether the message is a response or a request. Headers are defined in RFC 3261, “SIP: Session Initiation Protocol,” which you can find on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources/. As shown in Table 9, headers fall into four categories: general, request, response, and entity. Headers in the general category can be used for both request and response messages.

Table 9 SIP Headers
The following sample SIP request, taken from a Windows Messenger session, highlights the SIP headers:

The body of a SIP message is defined by the Session Description Protocol (SDP).

Session Description Protocol

The Session Description Protocol (SDP) is an IETF standard for announcing and describing multimedia conferences. The SIP message body contains a session description, as defined by the SDP. A session description consists of three parts: a single session description, zero or more time descriptions, and zero or more media descriptions. The session description contains global attributes that apply to the whole conference or all media streams. Time descriptions contain conference start, stop, and repeat time information. Media descriptions contain details about a particular media stream. Table 10 lists the SDP types and associated description values that can be used in each of the three parts of an SDP message.

Table 10 SDP Descriptions
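To make the three-part structure concrete, the following Python sketch splits an SDP body into its session, time, and media parts. It relies on the SDP convention that each line has the form type=value, that a time description starts with a t= line, and that a media description starts with an m= line. The sample body, its addresses, and the function name are invented for illustration and are not taken from a Windows Messenger session.

```python
SAMPLE_SDP = """\
v=0
o=- 0 0 IN IP4 192.0.2.10
s=session
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""


def split_sdp(body):
    """Return (session, times, media) as lists of (type, value) pairs."""
    session, times, media = [], [], []
    for raw in body.strip().splitlines():
        if not raw.strip():
            continue
        kind, _, value = raw.strip().partition("=")
        if kind == "m":
            media.append([(kind, value)])       # start a new media description
        elif media:
            media[-1].append((kind, value))     # belongs to the current media
        elif kind == "t":
            times.append([(kind, value)])       # start a new time description
        elif kind == "r" and times:
            times[-1].append((kind, value))     # repeat times extend a t= line
        else:
            session.append((kind, value))       # session-level line
    return session, times, media


session, times, media = split_sdp(SAMPLE_SDP)
print(len(session), "session lines,", len(times), "time description(s),",
      len(media), "media description(s)")
```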
The following sample SIP request message, taken from a Windows Messenger session, highlights the SIP message body:

Audio and Video Digitization and Compression

After the call has been set up with SIP, the data must be digitized and compressed. Audio and video data are inherently analog, so to transmit them across the wire on a packet-based network, the analog waveforms must be converted into digital values. Once the data is in digital format, a software-based codec (coder-decoder) is used to compress the data, which allows for better network utilization and improved voice quality.

Audio Digitization

Converting audio signals to digital format involves several steps. First, the waveform, which represents the audio input, is sampled at regular intervals, as shown in Figure 7.

Figure 7: Periodic Waveform Sampling

The sampling rate (the frequency with which the samples are taken) depends upon the type of audio media being sampled and on the codec and associated coding algorithm used. For example, the PSTN, which uses the companded pulse code modulation (PCM) coding algorithm, has a voice sampling rate of 8 kHz, where 1 Hz equals one cycle per second. The sampling rate is derived from the Nyquist criterion:

Fs > 2 × BW

where Fs is the sampling frequency and BW is the bandwidth of the input analog voice signal. The Nyquist criterion states that the sampling frequency must be at least twice the highest frequency in the signal being sampled. Because most analog voice signals fit approximately within a bandwidth of 4 kHz, the sampling rate of 8 kHz is deemed sufficient for most voice communications.

After sampling the data, the next step is to identify the interval into which each sample of the waveform falls. This process, shown in Figure 8, is called quantization.

Figure 8: Quantization

After the data has been sampled and quantized, an 8-bit code word is assigned to each sample for transmission. Each 8-bit code word is subsequently transmitted through the network.
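The sampling and quantization steps can be sketched in Python as follows. This is a simplified illustration: it uses evenly spaced quantization intervals, whereas the companded PCM used on the PSTN uses non-uniform intervals, and all function names are chosen for this example.

```python
import math


def nyquist_rate_hz(bandwidth_hz):
    """Boundary sampling rate implied by the Nyquist criterion (2 x BW)."""
    return 2 * bandwidth_hz


def quantize(sample, bits=8):
    """Map a sample in [-1.0, 1.0] to a code word, using uniform intervals.

    Assumption: uniform intervals for simplicity; companded PCM does not
    space its intervals evenly.
    """
    levels = 2 ** bits
    index = int((sample + 1.0) / 2.0 * levels)
    return min(index, levels - 1)  # keep +1.0 inside the top interval


def bit_rate_bps(sampling_rate_hz, bits_per_sample):
    """Bits per second produced by sampling and quantization."""
    return sampling_rate_hz * bits_per_sample


rate = nyquist_rate_hz(4000)  # 4 kHz voice bandwidth
# Sample one millisecond of a 1 kHz tone and assign a code word to each sample.
code_words = [quantize(math.sin(2 * math.pi * 1000 * n / rate)) for n in range(8)]
print(rate, code_words, bit_rate_bps(rate, 8))
```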
Figure 9 depicts the transmission of the first three samples of the quantization shown in Figure 8.

Figure 9: Digital Signal Transmission

This yields the 64 Kbps bandwidth (8,000 samples per second × 8 bits per sample) required for each analog transmission (voice or data) over the circuit-switched PSTN.

Audio and Video Compression

Audio and video codecs use algorithms to compress the digitized audio and video signals before the sender transmits them, and then to decompress them on the receiving computer before they are played for the user. Using a codec for compression and decompression reduces network bandwidth utilization and minimizes network traffic load. The conversion from analog to digital form and from digital back to analog form is performed by hardware. For example, by the time the source filter receives the data, it is already digitized, although in a less compressed format. Figure 10 shows how codecs are used for video compression and decompression.

Figure 10: Video Compression and Decompression

Windows XP supports audio codecs for both SIP and H.323 IP telephony applications, as shown in Table 11.

Table 11 Audio Codecs Supported by Windows XP
Windows XP supports video codecs for both SIP and H.323 IP telephony applications, as shown in Table 12. Table 12 Video Codecs Supported by Windows XP
Audio Bandwidth Capacity

The codec used and its supporting quantization and compression algorithms determine the bandwidth needed to transmit voice and video data. For example, each analog voice call using the PSTN requires 64 Kbps of bandwidth. This is derived from the encoding and compression algorithms used with companded PCM, which provides high quality for both voice and data.

One of the advantages of using IP telephony is the ability to utilize the latest improvements in codec technology. As noted above, one voice call over the PSTN uses a bit rate of 64 Kbps. Approximately 10 voice calls can be placed at the same bit rate on a packet-switched network when using the G.723.1 codec, which employs the Code Excited Linear Predictive (CELP) encoding algorithm. IP telephony also offers codecs, such as 16-kHz codecs, that provide better quality than the 8-kHz PSTN codec and that require less bandwidth than a PSTN call.

Note: Using a 16-kHz sampling rate increases the network requirement to 128 Kbps (16 kHz × 8 bits per sample) before compression. With recent advances in audio codec and network technology, however, a 16-kHz sampling rate is not that expensive on an IP network.

RTP and RTCP

After the data has been optimized for transmission over the packet-based network through digitization and compression, it is encapsulated within RTP. RTP is a real-time transport protocol, and RTCP is a control protocol used for monitoring RTP sessions. RTP and RTCP, defined in RFC 1889, “RTP: A Transport Protocol for Real-Time Applications,” were designed by the IETF specifically to address the needs of real-time communication over a packet-based network. For more information about RTP and RTCP, see RFC 1889 on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources/default.asp.

Both SIP and H.323 make use of RTP for transferring digitized audio and video data between the various parties participating in a call.
Each RTP packet contains one or more media payloads and other relevant information, such as time stamps and sequence numbers. Typically, RTP and RTCP are used with UDP as the underlying transport layer and with IP as the underlying network layer. RTP uses dynamic UDP ports negotiated between the sender and receiver of specific media streams. However, RTP and RTCP are independent of the underlying transport and network layers and need not be used with UDP and IP as the transport and network protocols.

RTP

RTP provides end-to-end network transport for real-time applications, such as Windows Messenger and Phone Dialer. RTP contains information about the real-time session so applications can easily adjust for jitter, improper packet sequencing, and dropped packets. Much of this information is included in the RTP header. Figure 11 shows the structure of an RTP packet.

Figure 11: RTP Packet Structure
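As an illustration of the header information mentioned above, the following Python sketch unpacks the 12-byte fixed RTP header as laid out in RFC 1889 (version, padding, extension, CSRC count, marker, payload type, sequence number, time stamp, and SSRC). It ignores any CSRC list or header extension that may follow the fixed header; the function name and the example packet values are made up.

```python
import struct


def parse_rtp_header(packet):
    """Decode the 12-byte fixed RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least a 12-byte fixed header")
    # Network byte order: two single bytes, a 16-bit and two 32-bit fields.
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, payload type 0, followed by 160 payload bytes.
example = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x12345678) + b"\x00" * 160
print(parse_rtp_header(example))
```

A receiver can use the sequence number to detect reordered or dropped packets and the time stamp to schedule playback, which is the information a jitter buffer depends on.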
RTCP

RTCP packets contain information regarding the quality of the RTP session and the individuals participating in the session. Both sender(s) and receiver(s) periodically transmit RTCP packets to each participant in an RTP session. A real-time application can use this information to monitor the quality of the RTP session; for example, to monitor jitter and packet loss. There are five RTCP packet types, as shown in Table 13:

Table 13 RTCP Packet Types
Participants in an RTP session send RR packet types, and, if they are active senders, send SR packet types. The RR packet has two sections, the header and report blocks, as shown in Figure 12. There is one report block for each source.
Figure 12: RR Packet Structure

The SR packet structure, shown in Figure 13, differs in format from the RR packet only in that it includes a 20-byte section of sender information.

Figure 13: SR Packet Structure

Receiver Report and Sender Report header structure

The RR and SR header structure is shown in Figure 14. The only difference between the two headers is the value for the packet type.

Figure 14: RTCP RR and SR Header Structure

The additional 20-byte sender information included in an SR packet is shown in Figure 15.

Figure 15: RTCP SR Information

Report block structure

SR and RR packets can contain zero or more report blocks. In an RR packet, the report blocks are appended directly after the RTCP header; in an SR packet, they follow the sender information. One report block is included for each synchronization source (SSRC) from which the participant has received RTP data packets since its last report. The structure of report blocks is the same for both SR and RR packets, as shown in Figure 16.

Figure 16: RTCP Report Block Structure
Although RTP and RTCP are specifically designed for the needs of real-time communication over a packet-based network, they do not provide quality of service mechanisms. Instead, they leave quality of service issues to the underlying network and data-link layers.

Voice Quality Technologies

A circuit-switched network, such as the PSTN, provides a dedicated communication path between two end stations. Datagram-based packet-switched networks segment the original data into multiple packets, which are then separately routed through the network. By default, there is no dedicated path or bandwidth for datagram-based packet-switched networks. Because of these differences and the low tolerance for latency in real-time communications, toll-quality voice transmission can be obtained on a packet-switched network only after the following problems have been resolved:
Windows XP provides several quality of service mechanisms, including jitter control, acoustic echo cancellation, and Quality of Service (QoS) protocols.

Jitter Control

RTP and RTCP provide information, such as time stamps and interarrival jitter values, that real-time communications applications can use to compensate for jitter during a session. An application’s jitter buffers use the time stamps and interarrival jitter values to make adjustments so that a smooth, even flow of packets is received. Applications use the information received from the RTP and RTCP packets to calculate the difference in transit time for two packets. The calculation they use is:

D(n,n-1) = [R(n) - S(n)] - [R(n-1) - S(n-1)]

where D(n,n-1) is the difference in transit time for packets n and n-1, S(i) is the time at which packet i was sent, and R(i) is the time at which packet i was received. The difference in transit time, D(n,n-1), is then used in the following formula, as described in RFC 1889, “RTP: A Transport Protocol for Real-Time Applications,” to determine interarrival packet jitter, J(n), as a smoothed running value of an RTP session:

J(n) = J(n-1) + (|D(n,n-1)| - J(n-1))/16

For more information, see RFC 1889 on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources/default.asp.

Note: Both Windows Messenger and Phone Dialer have built-in jitter buffers.

Acoustic Echo Canceller

When a computer is used for real-time communications, such as voice calls, call participants can experience acoustic echo. Using a headset, which has an integrated microphone and speakers, as opposed to using a separate microphone and speakers, can eliminate some acoustic echo. To better control acoustic echo, an acoustic echo canceller (AEC) is needed.

Note: Windows XP includes AEC support in both the Windows Messenger client and in Windows TAPI version 3.1.
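The two jitter formulas given in the Jitter Control discussion can be applied directly, as in the following Python sketch. It assumes that send and receive times are expressed in the same units (milliseconds in the example) and that the running jitter starts at zero; the function name and the sample times are invented for illustration.

```python
def interarrival_jitter(send_times, receive_times):
    """Return the running jitter J(n) after each packet, starting with packet 1.

    D(n,n-1) = [R(n) - S(n)] - [R(n-1) - S(n-1)]
    J(n)     = J(n-1) + (|D(n,n-1)| - J(n-1)) / 16
    """
    jitter = 0.0  # assumption: J(0) = 0 before any packet pair is seen
    history = []
    for n in range(1, len(send_times)):
        transit_now = receive_times[n] - send_times[n]
        transit_prev = receive_times[n - 1] - send_times[n - 1]
        d = transit_now - transit_prev
        jitter += (abs(d) - jitter) / 16
        history.append(jitter)
    return history


# Packets sent every 20 ms; the third packet arrives 4 ms later than expected.
sent = [0, 20, 40, 60]
received = [5, 25, 49, 65]
print(interarrival_jitter(sent, received))
```

Dividing by 16 smooths the estimate, so a single late packet raises the reported jitter only slightly, while sustained variation raises it steadily.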
Quality of Service

The RTP and RTCP protocols, jitter control mechanisms, and acoustic echo canceller provide applications with information and tools to monitor and improve the quality of real-time communications; however, none of these protocols or technologies has control over the underlying networking environment. QoS, a combination of standards-based mechanisms, such as IETF Differentiated Services (Diff-Serv) and IEEE 802.1p, is used to provide different levels of control over the underlying networking environment and to provide varying degrees of quality of service.

Note: Windows XP supports QoS for applications that are written specifically to make calls to the Windows XP QoS APIs.

Measuring Voice Quality

The Mean Opinion Score (MOS) scale provides a tool for subjectively measuring and rating voice quality. The MOS scale ranges from 1 to 5, where 1 indicates poor quality and 5 indicates excellent quality. Voice quality on the PSTN, also referred to as toll quality, generally ranges between 4 and 5 on the MOS scale. The MOS scores for audio codecs with a 16-kHz sampling rate, such as SIREN and G.722.1, are approximately 4; however, because codecs use different sampling rates, the user experience differs, and MOS values are not directly comparable across codecs with different sampling rates. Because the 16-kHz codecs capture a wider range of frequencies, they actually offer a more enjoyable user experience by rendering more natural sound. Voice transmission over a packet-based IP network can now provide better sound quality than voice transmission over the PSTN.

Note: In voice transmissions over a packet-switched network, toll quality can be obtained only when latency is less than 200 milliseconds (ms). A transmission with a delay between 200 and 400 ms is still acceptable, but when the delay is greater than 400 ms, the audio connection is no longer acceptable.
The MOS scores for the audio codecs supported by Windows XP that use an 8-kHz sampling rate are shown in Figure 17.

Figure 17: Windows XP 8-kHz Sampling Rate Audio Codec MOS Scores

SIP Instant Messaging and Presence Language Extensions

SIP Instant Messaging and Presence Language Extensions (SIMPLE) allow users to send and receive instant real-time messages (generally text messages) and to know the current availability or status of other users. A general model for SIMPLE is described in RFC 2778, “A Model for Presence and Instant Messaging,” which is available on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources/default.asp.

The Instant Messaging model described in RFC 2778 defines communication between a server, defined as the Instant Message Service, and the clients, defined as either Senders or Instant Inboxes. When a message is sent from the Sender client to the Instant Message Service, the Instant Message Service forwards the message to the Instant Inbox client, as illustrated in Figure 18.

Figure 18: Instant Message Communication Flow

RFC 2778 defines the objects involved in the exchange and the communication among them; however, it does not specify the protocol to use for communicating presence and instant messaging information.

The Presence model described in RFC 2778 defines communication between a server, defined as the Presence Service, and the clients, defined as either Presentities or Watchers. The Presentity provides presence information to the Presence Service, and the Watcher receives presence information from the Presence Service. There are two types of Watcher clients: Fetchers and Subscribers. A Fetcher requests only the current value of the presence information for a Presentity from the Presence Service. A Subscriber requests updates whenever the presence information for a Presentity changes. Figure 19 illustrates the relationship between presence clients.
Figure 19: Presence Clients

SIP provides some presence information. For example, when a SIP user agent registers with a SIP registrar server, the presence or location of the SIP user agent is available from the SIP registrar server. This level of presence awareness allows the establishment of SIP-based calls; however, it does not allow SIP user agents to subscribe to other SIP user agents to obtain their presence information.

To provide SIP with the capabilities of Presence and Instant Messaging, two additional Internet drafts have been written: “SIP Extensions for Presence” and “SIP Extensions for Instant Messaging.” Two new SIP methods, SUBSCRIBE and NOTIFY, which provide presence capabilities in the SIP protocol, are defined in the “SIP Extensions for Presence” draft. One new SIP method, MESSAGE, which allows instant messaging capabilities in the SIP protocol, is defined in the “SIP Extensions for Instant Messaging” draft. For more information about SIP methods, see “SIP Protocol” earlier in this paper.

Summary

The Microsoft real-time communications platform is based on industry standards and is designed for corporate multi-modal communication, such as voice and video communication, instant messaging, application sharing, and collaboration. Windows XP supports SIP, which is used for creating, modifying, and terminating call sessions; various codecs, which compress and decompress digitized voice and video signals for efficient transport; SDP, which describes multimedia sessions; RTP, which transports the real-time data; and RTCP, which monitors the quality of RTP sessions. Additionally, Windows XP includes a number of voice quality technologies, which improve the quality of voice communication over packet-switched networks. By supporting SIMPLE, Windows XP also provides presence and instant messaging capabilities.

Related Links

See the following resources for more information: