Research Article | Open Access
Hongrui Guo, Wenli Zhang, Zishu Yu, Mingyu Chen, "Queueing-Theoretic Performance Analysis of a Low-Entropy Labeled Network Stack", Intelligent Computing, vol. 2022, Article ID 9863054, 16 pages, 2022. https://doi.org/10.34133/2022/9863054
Queueing-Theoretic Performance Analysis of a Low-Entropy Labeled Network Stack
Theoretical modeling is a popular method for quantitative analysis and performance prediction of computer systems, including cloud systems. Low entropy cloud (i.e., low interference among workloads and low system jitter) is becoming a new trend, where the Labeled Network Stack (LNS) based server is a good case to gain orders of magnitude performance improvement compared to servers based on traditional network stacks. However, it is desirable to figure out 1) where the low tail latency and the low entropy of LNS mainly come from, compared with mTCP, a typical user-space network stack in academia, and Linux network stack, the mainstream network stack in industry, and 2) how much LNS can be further optimized. Therefore, we propose a queueing theory-based analytical method defining a bottleneck stage to simplify the quantitative analysis of tail latency. Facilitated by the analytical method, we establish models characterizing the change of processing speed in different stages for an LNS-based server, an mTCP-based server, and a Linux-based server, with bursty traffic as an example. Under such traffic, each network service stage's processing speed is obtained by non-intrusive basic tests to identify the slowest stage as the bottleneck according to traffic and system characteristics. Our models reveal that the full-datapath prioritized processing and the full-path zero-copy are primary sources of the low tail latency and the low entropy of the LNS-based server, with 0.8%-24.4% error for the percentile latency. In addition, the model of the LNS-based server can give the best number of worker threads querying a database, improving 2.1-3.5 in concurrency.
Currently, the growing number of clients poses new challenges for the design of network stack for interactive services such as cloud gaming , web searching , and autonomous driving . One major challenge is meeting user experience demand while maintaining high resource utilization. However, this objective is hard to fulfill because of the interference between delay-sensitve requests and delay-insensitve requests . The delay-sensitve and delay-insensitve requests are usually processed by a shared network stack to improve resource utilization. Nevertheless, this also leads to the two classes of requests contending for shared resources such as CPU cores, caches, and memory bandwidth, causing high tail latency. As a result, the high tail latency gives rise to the problem of the user experience of interactive services. What is worse, the large number of clients often causes bursty traffic on the server side [5–11], sharpening the problem of user experience. To handle the problem, researchers mainly focus on optimizing the Linux network stack [12–14] and designing user-space network stacks [15–18]. What stands out is a Labeled Network Stack (LNS)  which prioritizes delay-sensitve requests in all layers of the network stack. The full-datapath prioritization of LNS comes from the concept of low entropy cloud computing systems  where the interference among workloads and the system jitter should be controlled to improve user experience. Based on the low entropy property, LNS is promising to achieve low tail latency and high concurrency for interactive services. Thus, a quantitative study for LNS is of great importance to reveal its benefits and potential improvement.
Although prior experiments  have demonstrated that LNS can support millions of clients with low tail latency, compared with mTCP , a typical user-space network stack in academia, and Linux network stack, the mainstream network stack in industry, a thorough quantitative study is lacking in answering the following two questions: (i)Where do the low tail latency and the low entropy of LNS mainly come from, compared with mTCP and Linux network stack? LNS is designed with multiple technologies such as full-datapath prioritization, full-path zero-copy, and user-space TCP/IP. However, the benefits of each technology in LNS have not been quantitatively studied yet. For instance, questions like “How much tail latency can be reduced by the full-datapath prioritized processing of LNS compared with mTCP?” and “How much entropy can be reduced by the full-path zero-copy of LNS compared with Linux?” are still needed to be answered. By answering these questions, we can recognize the main contributing factors to the low tail latency and the low entropy of LNS(ii)How much LNS can be further optimized? When deploying LNS, users are often faced with some tradeoffs. For instance, users can ask such questions as “How much tail latency can be reduced if one more CPU core is provided?” or “How many CPU cores are optimal to achieve high performance and avoid wasting resources?” Therefore, we need to quantify the influence of various factors, including the number of CPU cores, to optimize LNS further
In order to answer the above questions, we need to quantify the improvement provided by various factors in LNS over traditional network stacks.
An analytical method is desirable to study the benefits and potential optimization of LNS quantitatively. A quantitative performance study can be achieved by experiments or analytical modeling. However, a large number of system and environment configurations make it time-consuming for researchers to conduct experiments since different applications often exhibit different traffic characteristics. For instance, the smart home scenario evaluated in  is featured with short packet lengths (200 Byte) and low ratios of delay-sensitve requests (5%), while longer packets or higher ratios of delay-sensitve requests can also appear in other scenarios, such as cloud gaming , smart metering , and vehicular applications . As a result, a large number of experiments are needed to quantify the performance of LNS under various configurations. On the contrary, analytical modeling only needs a few experiments for parameter measurement. Therefore, an analytical method is more cost-effective and more commonly used to conduct a quantitative study of LNS.
In this paper, we model and compare the tail latency of a server based on LNS, a server based on mTCP , and a server based on the Linux network stack. We call the three servers “LNS-based server”, “mTCP-based server”, and “Linux-based server” for short. There are four expectations for the models of the three servers: (i) The models could accurately predict the tail latency of the three servers under various configurations. (ii) The models could be able to improve the performance of the LNS-based server. (iii) The models’ formulas should be as simple as possible, preferably closed-form. (iv) The user-uncontrollable parameters, e.g., service time, of the models, could be measured by a non-intrusive method, i.e., without modifying the source code. To fulfill these expectations, we leverage queueing theory  to build a link between tail latency and various contributing factors. If the expectations are fulfilled, the resultant models will be convenient for users to predict and optimize the tail latency of the LNS-based server.
As tail latency is a critical metric for the quality of interactive services, the analytical modeling of tail latency has drawn many concerns. Most prior works [24–27] focus on the tail latency of Fork-Join queueing network (FJQN) models  for cloud applications. However, the FJQN models cannot apply to the three modeled servers because their queueing network structures do not satisfy the conditions of the FJQN, where a job will spawn multiple tasks processed by different servers. Aside from these works about FJQN models,  models the tail latency of a Memcached database for CPU cores with an model. However, the TCP/IP layer rather than the database can also be the bottleneck of the modeled servers. Thus, only considering the database is inadequate to characterize the performance of a network stack. Moreover, the above prior works all assume a Poisson arrival process. However, bursty traffic is more realistic in modern cloud applications [5–11]. For instance, in a building, a fire can trigger multiple smoke and temperature sensors to send emergent packets to the server in a short period .
In this paper, we model the above three servers under bursty traffic via a tandem stage queueing network . Besides the database, the total network stack is considered in the queueing network. The tail latency of the three servers is decomposed to the tail queueing time of the bottlneck stage and the summation of the service time of each stage of the queueing network to simplify the analysis. The changes in the service time of each stage of the queueing network are also considered based on non-intrusive basic tests. Specifically, our contributions are as follows. (i)We propose an analytical method which can be leveraged to establish models that perceive the change of processing speed in different stages for the tail latency of the above three servers with closed-form formulas, with 0.8%-24.4% error for the LNS-based server(ii)The resultant models reveal that the full-datapath prioritized processing and the full-path zero-copy are the primary sources of the low tail latency and the low entropy of the LNS-based server(iii)The model of the LNS-based server can give the best number of worker threads querying a database according to the model parameters, improving 2.1-3.5 in concurrency
The rest of the paper is organized as follows. Firstly, Section 2 shows the analytical method and the experimental design. Then, Section 3 shows the measured and modeled results of the three servers under bursty traffic. Besides, this section guides optimizing the LNS-based server. Finally, Section 4 summarizes the article.
2. Materials and Methods
In this section, we propose our models of the three servers. Firstly, we present an abstracted scenario and the architecture of the modeled three servers. Then we describe the structure of the queueing network. Afterward, we derive the formulas for the tail latency of the three servers. Finally, the setting of the experiments is given. For the convenience of readers, a summary of the main notations used in our models is listed in Table 1.
2.1. System Description
This section describes an abstracted scenario of massive clients and the architecture of the three modeled servers.
2.1.1. Massive-Client Scenario
Current online applications provide interactive services for massive clients. These clients mainly send different classes of requests to a large-scale cloud data center. In such a data center, a request usually spawns multiple sub-operations parallelized on different servers to reduce latency. However, the tail latency of one sub-operation significantly influences the tail latency of the request. For instance, suppose a request spawns 100 sub-operations whose percentile latency (i.e., the latency that exceeds the latency of 99% sub-operations) is 10 ms, the percentile latency of all sub-operations is 140 ms . Therefore, studying the tail latency of one server is of great importance to quantify the performance of a cloud data center.
In this paper, we focus on an abstracted massive-client scenario where a single server provides services for millions of clients, as shown in Figure 1. Due to the massive number of clients, bursty traffic forms on the server side. The traffic is assumed to consist of two classes of requests, namely delay-sensitve and delay-insensitve requests. These requests are assumed to have the same packet length since many interactive services tend to use a constant packet length [21, 32]. After these requests arrive at the server, they are processed by several stages in the network stack of the server: Firstly, a request is processed by the Network Interface Card (NIC) and the TCP/IP layer of the server. Afterward, a thread of the server queries a database to serve the request. Finally, a response packet is sent to the sender of the request.
From the abstracted scenario, several key factors contributing to the server-side tail latency can be identified. First, the number of requests in one burst, reflecting the number of clients, is a critical factor for the tail latency of the server because more bursty requests tend to form a longer queue, corresponding to a longer queueing time in the server. Second, because only delay-sensitve requests are of interest, the ratio of delay-sensitve requests is also an essential factor for the tail latency. Third, the service time of each processing stage in the server also affects the tail latency. It should be noted that besides the time executing fixed operations, the service time is also related to packet length and the number of CPU cores, for a longer packet means more memory accesses, and more CPU cores mean more intensive contention for shared resources. Moreover, the number of CPU cores affects the tail latency by affecting system utilization. Finally, the interarrival time of requests also influences the server-side tail latency, for longer interarrival time leads to lower tail latency.
2.1.2. Network Stack Architecture
Figure 2(a) shows the architecture of the LNS-based server. The LNS-based server prioritizes delay-sensitve requests at each layer of its network stack. When a request arrives, a user-space driver will put the request into different queues according to its label. Then the TCP/IP layer of the server will process requests from a high-priority queue until the queue is empty when the requests from a low-priority queue are processed. After a request is processed by the TCP/IP layer, the request is transferred to the queues of an event dispatcher, which dispatches requests to worker threads in a priority-based round-robin manner. Finally, the worker threads will query the Redis database of the server and send response packets processed by the TCP/IP layer with a prioritized scheduling policy again. Note that zero-copy is applied to each layer of the network stack of the LNS-based server.
Compared with LNS, conventional network stacks do not prioritize delay-sensitve requests. For example, the Linux-based server in Figure 2(c) is the default choice for modern multi-core processors but suffers from large packet copy and scheduling overhead [33, 34]. mTCP (see Figure 2(b)) is proposed to reduce the scheduling overhead of the Linux-based server through a user-space TCP/IP and batch processing . However, the mTCP-based server still does not handle the interference between delay-sensitve and delay-insensitve requests. As a result, the LNS-based server can achieve the best performance among the three servers for the tail latency of delay-sensitve requests.
2.2. Queueing Network
According to the mechanisms described in Figure 2, the modeled servers can be represented by the queueing network in Figure 3(a). The queueing network consists of components corresponding to the NIC, the TCP/IP, the event dispatcher, and the worker threads of the modeled servers. The worker threads run in parallel and have the same service time. For any component in the queueing network, it does similar operations while processing requests. Therefore, each component can be represented by a in the queueing network.
The complexity of the queueing network makes it difficult for analysis and parameter measurement. First, the loop from the worker threads to the TCP/IP and the loop from the TCP/IP to the NIC complicate the analysis, preventing us from deriving closed-form formulas. Second, users need to modify the source code of the modeled network stacks to measure the service time of each stage.
We propose a simplified tandem stage queueing network  to simplify the analysis and parameter measurement, as shown in Figure 3(b). Firstly, the NIC in the original queueing network is assumed to be not the bottleneck of the modeled servers. Therefore, the queueing delay of the NIC is modeled as 0, making the latency of the NIC stage a constant. As a result, the loop from the TCP/IP to the NIC is eliminated. Moreover, to break the loop from the worker threads to the TCP/IP, we let the service time of the TCP/IP an equivalent value of both receiving and sending packets, and we again model the delay of the TCP/IP sending a response packet with a constant latency. Finally, we assume that the variance of the packet length of the modeled applications is in a small range so that the size relationship of the service time of the TCP/IP and the event dispatcher remains unchanged for different packet lengths. Then the component with a smaller service time is modeled as a constant delay. From the above assumptions, the simplified queueing network consists of three stages. The stage (called “TCP/IP stage”) is the component with a larger service time between the TCP/IP and the event dispatcher. The stage (called “Redis stage”) represents the worker threads querying the Redis database. The stage (called “Constant stage”) is a representing all the constant delays mentioned above. The bottleneck stage of the simplified queueing network can be changed between the TCP/IP stage and the Redis stage by adjusting the number of worker threads.
2.3. Bottleneck Decomposition for Tail Latency
The bottleneck decomposition formula (1) decomposes the tail latency of the whole system to , the tail queueing time of the bottleneck stage of the queueing network, and , the summation of the service time of all stages of the queueing network.
The bottleneck stage is defined as follows.
Definition 2.1 (Bottleneck Stage) In the simplified queueing network, suppose the service time and the number of nodes (i.e., a core and its queues) in the stage are and , respectively. The Bottleneck Stage of the queueing network is the stage with the largest . In other words, suppose the service time of the bottleneck stage is , if , then ; otherwise, , where and are the service time of the TCP/IP stage and the Redis stage, respectively.
The basic idea of the bottleneck decomposition formula comes from the following intuitions. (i)The order of stages of a tandem -based queueing network with deterministic service time is interchangeable without affecting the leaving time of each request(ii)When the bottleneck stage is changed to the first stage, queueing is less likely to happen in the subsequent non-bottleneck stages
With the above intuitions, the following theorem can be proved.
Proof. We assume there is a request exactly experiencing the percentile latency, , and we approximate the latency of this request in the following process. Suppose the service time of the Redis stage is and the service time of the TCP/IP stage is . There are two cases. (i)If , according to the Result 4 in , the Redis stage is equivalent to a constant delay , which means that .(ii)If , by the Result 1 in , the Redis stage can be changed to the first stage without affecting the leaving time of each request. Then, in any time interval , at most requests arrive at the second stage. Because and the requests are served in a First-Come-First-Serve (FCFS) manner, these requests can be finished before . Therefore, at most requests are queued in the second stage, meaning that .Therefore, Formula (2) holds for both cases.
2.4. Models under Bursty Traffic
In the abstracted massive-client scenario presented in Section 2.1.1, bursty traffic is assumed to form on the server side. Under such traffic, the tail latency of a server can be modeled by the methods in this section.
We make the following assumptions under bursty traffic. (i)We assume each message contains exactly one request for simplicity(ii)The interarrival time of requests is constant and smaller than , making the bottleneck stage keep running during the processing of one burst of requests(iii)Every request in one burst is independently labeled as an delay-sensitve request with a probability of .(iv)There is no packet dropping occurring. Therefore, the influence of packet dropping can be ignored
2.4.2. The LNS-Based Server
The LNS-based server can be modeled with a pre-emptive prioritized scheduling policy. Because of the full-datapath prioritized processing of the LNS-based server, from the user-space driver to the worker threads, all stages in the LNS-based server firstly try to extract a high-priority request (i.e., delay-sensitve request). If successful, the request is processed; otherwise, a low-priority request (i.e., delay-insensitve request) is processed. For simplicity, the interference between low-priority and high-priority requests is modeled as a constant delay included in , which can be determined by measurement. As a result, the processing of high-priority requests can be modeled with a pre-emptive prioritized scheduling policy.
However, a problem arises when the assumption does not hold for only high-priority requests. The average interarrival time of high-priority requests is , which can be larger than the interarrival time of all requests. Moreover, the first assumption only states that the interarrival time of all requests is smaller than . As a result, it is possible that . In the case of , the bottleneck stage does not necessarily keep processing high-priority requests in one burst of requests.
Therefore, we discuss two cases according to the size relationship of and .
Case 1: .
First, when , we can derive the arrival time of the request experiencing the percentile latency. In one burst, the later a request arrives, the longer latency it tends to experience. Therefore, suppose there are high-priority requests in one burst, the percentile latency is approximately the latency of the high-priority request. Moreover, because of the second assumption, follows a Binomial Distribution with . The standard deviation of this distribution is . For not too small and , i.e. , the standard deviation is , which is only 13.9% of the mean value . Therefore, we use as an approximation of . In one burst of requests, the interarrival time of the high-priority requests is where follows a Geometric Distribution with mean and variance . Then, by the Central Limit Theorem, the arrival time of the high-priority request can be approximated by a random variable with Normal Distribution.
The standard deviation of is , which is relatively small compared to the mean value when and are not too small. Therefore, we approximate using its mean value to derive a closed-form expression for the tail latency of the LNS-based server.
Having obtained the arrival time of the high-priority request in one burst, we can derive the latency of this request. By the bottleneck decomposition formula, we only need to calculate the queueing time of this request as if the bottleneck stage saw the bursty arrival process directly. Thus the percentile queueing time can be given by Formula (4). Note that in Formula (4), the is the leaving time of the high-priority request in one burst.
Finally, the percentile latency of the LNS-based server can be given by Formula (5).
Case 2: .
When , a Poisson arrival process can be used to model the traffic of high-priority requests. In one burst, it is not necessarily the latter arrived requests will experience longer latency since the bottleneck stage may not keep processing the high-priority requests. To handle this problem, we assume that the arrival process of the high-priority requests is a Poisson process, i.e., the interarrival time between two consecutive high-priority requests follows an Exponential Distribution. The assumption is reasonable because a constant multiplied by following Geometric Distribution can be approximated by an Exponential Distribution when is small enough .
Based on the approximation of the Poisson arrival process, the tail queueing time as if the bottleneck stage were fed the traffic directly can be derived. The bottleneck stage in the simplified queueing network can be the TCP/IP stage or the Redis stage. If the TCP/IP stage is the bottleneck stage, it can be modeled as an (or equivalently, where means the Erlang-1 Distribution). Otherwise, the Redis stage is the bottleneck stage, which can be modeled as an . Because of the round-robin policy of the event dispatcher, the traffic to one worker thread of the Redis stage follows an Erlang-cB distribution , which is the summation of identical and independent random variables with Exponential Distribution. Therefore, one worker thread of the Redis stage can be further seen as a . Because the worker threads of the Redis stage are identical and each worker thread has a queue, the tail queueing time of the Redis stage can be approximated by that of one worker thread. Combining the above two cases, we can model the bottleneck stage with a , whose percentile queueing time can be accurately approximated by Formula (6) according to . where and , are the squared coefficient of variation (SCV) of interarrival time and service time, respectively. Due to the assumption of deterministic service time, . Moreover, the SCV of the Erlang-cB Distribution is , thus . According to the above discussion, can be calculated by Formula (7). where is the utilization of the bottleneck stage.
Finally, the tail latency of the LNS-based server can be calculated by Formula (8).
2.4.3. The mTCP-Based Server
The mTCP-based server does not distinguish delay-sensitve requests from delay-insensitve requests. Therefore all requests are processed in an FCFS manner. By the first assumption, the percentile latency equals the latency of the request in one burst.
The tail queueing time of the request in one burst is as follows.
2.4.4. The Linux-Based Server
The formula for the tail latency of the Linux-based server is the same as that of the mTCP-based server. The differences lie in the value of the service time of each stage. Therefore, the percentile latency of the Linux-based server can also be given by Formula (10).
2.4.5. Formulas for Service Time
As discussed in Section 2.1.1, the service time in our models can also be influenced by some factors such as packet length and the number of worker threads. Therefore, a set of formulas are needed to model how these factors influence the service time of our models. These formulas enable users to conveniently use our models without measuring the service time for every system and environment configuration.
The service time of the Redis stage consists of the delay of querying the Redis database and calling API to send response packets. First, the delay of querying the Redis database is influenced by the number of worker threads, for more worker threads will contend for memory bandwidth when they simultaneously access the Redis database. Second, the delay of calling API is also assumed to be affected by the number of worker threads, for more worker threads would increase the rate of calling API, whose latency may be influenced by contending for shared resources. For simplicity, we assume the delay of the above two operations increases linearly with the number of worker threads. As a result, besides a constant , the formula for (Formula (11)) includes a term for the additional delay of querying the Redis database and calling API caused by contending shared resources.
Formula (12) shows the service time of the TCP/IP stage. The service time increases linearly with payload length to model the impact of the packet copy in this stage. Note that because of the full-path zero-copy attribute of LNS, the of LNS is set to 0.
2.5. Latency Entropy
In information theory, the concept of information entropy is used to represent the degree of uncertainty of a discrete random variable . Suppose is a random variable, which takes discrete values from a set , the information entropy of can be defined by Formula (13). For example, suppose , if , the information entropy of is 1, which means the value of is quite uncertain; if and , the information entropy of is 0, which means the value of is deterministic.
By converting latency to a discrete random variable, we can apply information entropy to latency. Suppose latency is a continous random variable taking values in . Let to discretize by the user-defined distinction , which has the same dimension as . Suppose , the latency entropy can be calculated by Formula (14). Note that when , is defined to be 0.
2.6. Experimental Design
In , a testbed consisting of a load generator and a monitor is used to evaluate the performance of the three servers above. We build a similar testbed to measure the tail latency of the three servers. The testbed takes MCC  as the load generator and HCMonitor  as the monitor, as shown in Figure 4. The tested network stacks, the MCC, and the HCMonitor run on three mainstream X86 servers connected by a switch. Detailed configurations are listed in Table 2.
MCC can periodically send requests with specifications, including the number of requests in one burst, the length of each packet, and the ratio of delay-sensitve requests. These parameters were studied in our experiments. In our testbed, four logical cores were used to generate bursty traffic every second to ensure that the two cases of the LNS-based server about were covered.
HCMonitor is a full traffic monitor recording the latency of every packet. In our testbed, thanks to the port mirroring feature of the switch and Intel DPDK (data plane development kit), the mirrored traffic to/from the server is sent to HCMonitor. HCMonitor thus can calculate the request-response latency of every packet. Because of this property, the tail latency of the tested servers can be easily obtained. Moreover, we configured HCMonitor to record the latency of delay-sensitve requests only, if not otherwise stated. Besides tail latency, the interarrival time of requests in one burst was also measured by HCMonitor.
To make a fair comparison, we set the structure of the three servers similarly, as described in Figure 2. More specifically, in the LNS-based and the mTCP-based servers, the TCP/IP, the event dispatcher, a printing thread, and each worker thread were located on different logical cores. Note that although mTCP recommends users to run one TCP/IP and one application thread in the same core to reduce cache miss overhead, we observed that packet dropping was particularly serious when we did this. Therefore, the TCP/IP of the mTCP-based server was located on a different core from the event dispatcher. Due to the limited number of logical cores (32 cores), the number of cores left to run worker threads is maximally 29 (one core for the TCP/IP, one core for the event dispatcher, and one core for the printing thread). To minimize the delay of querying the Redis database, we started 29 Redis backend processes, each of which resided in the same core as a worker thread. When handling a request, the worker threads were set to query the Redis database two times, a typical configuration in some IoT scenarios.
2.7. Statistical Analysis
We obtained the user-uncontrollable parameters, except , of our models according to the measured percentile latency of several basic tests listed in Table 3. First, the number of worker threads in each basic test was selected according to the requirements for the bottleneck stage. When determining and , we need to start 29 worker threads to ensure the TCP/IP stage is the bottleneck. When determining and , we need two settings for the number of worker threads when the Redis stage is the bottleneck. To verify that the bottleneck stage fulfills the requirements, users can conduct the experiments in Figure 5(d). However, the selection of the basic tests can satisfy the requirements of the bottleneck stage in most cases since the service time of the Redis stage is usually an order of magnitude of the TCP/IP stage. Second, the default payload length was set to 512 Byte. Third, the default ratio of delay-sensitve requests was set to 1.0 to enlarge the tail latency since small tail latency can introduce significant measurement error. Note that when determining for the LNS-based server, was set to 0.05 to reduce measurement error since large would lead to a large tail latency in which is only a minor part. Fourth, the default number of requests was set to 20,000 to enlarge the tail latency and limit measurement error. Moreover, these basic tests have been excluded when evaluating the accuracy of our models.
1For the LNS-based server, is set to 0, so .
After determining these parameters, the relative error is computed for all tested configurations according to Formula (15). where and are measured and predicted percentile latency.
This method enables users to predict the tail latency of the three servers mentioned above without modifying the source code of the three network stacks, easing the difficulty of using our models. Users need to conduct only four basic tests for each server among infinite configurations. Moreover, because the user-uncontrollable parameters are obtained in a non-intrusive way rather than instrumentation, the error for the basic tests used to obtain the parameters is close to zero. This way, the error can also be reduced for configurations similar to those of the basic tests. As a result, the error of our models can be compensated by the non-intrusive measurement.
For optimizing LNS, users may still need to determine which stage is the bottleneck stage. Thus instrumentation may be needed to measure the service time of each stage of the LNS-based server. However, it is reasonable because the optimization itself can modify the source code.
3. Results and Discussion
In this section, we present our results and discuss the benefits and the potential improvement of the LNS-based server.
3.1. Benefits of the LNS-Based Server
Some factors play a crucial role in the tail latency of the modeled servers. First, the number of requests in one burst, , can influence the tail latency, for more requests can cause longer queues in the modeled servers. Second, because only the tail latency of the delay-sensitve requests is of interest, the ratio of delay-sensitve requests, , is also an essential factor for the tail latency. Moreover, after requests arrive at the server, they may be read/written from/to memory many times. Therefore, the payload length of these requests has a non-negligible impact on the tail latency. Lastly, the number of worker threads, representing the parallelism of querying the Redis database, also affects the tail latency.
3.1.1. Comparison of Tail Latency
We now present the influence of various factors on the tail latency of the three modeled servers. (i) Figure 5(a) shows the percentile latency of the three servers mentioned above at different request numbers , which represents the load received by the servers. These request numbers cover a considerable range of the percentile latency from hundreds of microseconds to hundreds of milliseconds. Moreover, the ratio of delay-sensitve requests is 0.05, which is the same as the setting in , the payload length is 512 Byte, a moderate value within 1 Maximum Transmission Unit (MTU), and the number of worker threads is 29 to utilize all CPU cores of the tested server. First, from Figure 5(a), we can find that the performance of the LNS-based server increases 11.2-555.2 over that of the mTCP-based server, which in turn outperforms the Linux-based server. Second, a linear increase in the percentile latency of the mTCP-based and the Linux-based servers for larger is observed. The linearity is reasonable as more requests in one burst will cause longer queues in the server. Third, what stands out is that the percentile latency of the LNS-based server when there are 5% delay-sensitve requests is insensitive to the number of bursty requests . Under different , the measured standard variance of the percentile latency of the LNS-based server is only , which means that the variation of the tail latency is slight. Our model for the LNS-based server can predict this insensitivity because when , the LNS-based server reaches a steady state. In the steady state, the tail latency of the LNS-based server can be estimated by Formula (8), which shows that no longer influences the tail latency. Fourth, the percentile latency of the LNS-based server for all requests is still lower than those of the other two servers, indicating that the full-datapath prioritized processing does not affect the latency of the low-priority requests too much due to the lower service delay of the TCP/IP stage of the LNS-based server. Last but not least, our models accurately predict the percentile latency of the three servers. The relative errors are 1.1%-18.5%, 10.0%-44.2%, and 0.1%-28.3% for the LNS-based server, the mTCP-based server, and the Linux-based server, respectively.(ii) The percentile latency of the three servers under different payload lengths is shown in Figure 5(b). We constrain packet length to be in an MTU to avoid packet fragmentation. The number of requests in a burst is 20,000, representing a moderate load. The ratio of delay-sensitve requests is set to 1.00 to exclude the influence of the prioritized processing of the LNS-based server. Again, the number of worker threads is set to 29. We can also see that our models coincide with the experiments. Our models have 3.3%-6.3% error for the LNS-based server, 5.0%-25.5% error for the mTCP-based server, and 0.3%-3.7% error for the Linux-based server. The large error for the mTCP-based server is that the service time of its TCP/IP stage is nonlinear with the payload length. We observe that the LNS-based server still outperforms the mTCP-based and Linux-based servers even without prioritized processing. Moreover, for the LNS-based and the Linux-based servers, a decrease in the percentile latency is observed when the payload length increases. The reason for this trend is that as payload length increases, the increase in packet interarrival time outweighs the increase in the service time of the TCP/IP stage, according to our models. In addition, for longer packets, the improvement of the LNS-based server to the other two servers becomes larger. When the payload length is minimal (16 Byte), the speedup of the LNS-based server over the other two servers is 1.6 and 3.2, respectively. When the payload length is maximal (1434 Byte), the speedup grows to 5.9 and 5.2, respectively. The above improvement can be attributed to the full-path zero-copy of the LNS-based server. The full-path zero-copy property makes the service time of the TCP/IP stage lower than those of the other two servers. Therefore, the tail latency of the LNS-based server is lower than those of the other two servers, even when all bursty requests are delay-sensitve requests.(iii) The influence of , the ratio of delay-sensitve requests, is shown in Figure 5(c) in the setting of 20,000 delay-sensitve requests with 512 Byte payload and 29 worker threads. As expected, the increase of causes a longer percentile latency of the LNS-based server. Our models can explain this trend: As increases, the tail latency of the LNS-based server also becomes larger because more delay-sensitve requests are queued. At the same time, the tail latency of the mTCP-based server and the Linux-based server remains unchanged because they do not distinguish between delay-sensitve and delay-insensitve requests. As a result, the improvement of the LNS-based server becomes less significant for a larger ratio of delay-sensitve requests. However, due to the smaller service time of the TCP/IP stage of the LNS-based server, the tail latency is still lower than those of the other two servers, even though all requests are delay-sensitve requests. Moreover, our models show good scalability for the ratio of delay-sensitve requests. Our models have 0.8%-24.4% error for the LNS-based server, 24.2%-25.5% error for the mTCP-based server, and 1.9%-5.4% error for the Linux-based server.(iv) The influence of the number of worker threads on the percentile latency is shown in Figure 5(d). We set , Byte, and to evaluate the performance of the three servers without the influence of prioritization. Firstly, the percentile latency of the three servers gradually decreases as the number of worker threads increases when is small. When is beyond some threshold, the percentile latency becomes steady, although continues to increase. The trend in Figure 5(d) is the result of the transfer of the bottleneck stage. According to our models, when is small, the bottleneck is the Redis Stage. As becomes larger, decreases. When is less than , the bottleneck will transfer to the TCP/IP stage, whose service time is not influenced by . Moreover, for the LNS-based server, the larger when bottleneck-transferring happens is due to the smaller service time of the TCP/IP stage. The smaller service time can also be explained by the full-path zero-copy attribute of the LNS-based server. We also find a good agreement between our models and experiments for different numbers of worker threads. The errors of our models for servers based on LNS, mTCP, and Linux network stack are 3.7%-18.9%, 28.3%-42.7%, and 1.7%-11.1%, respectively.
3.1.2. Comparison of Latency Entropy
From Figure 6, we observe that the influence of the above factors on the latency entropy is similar to that on the percentile latency. First, in Figure 6(a), the insensitivity of the percentile latency of the LNS-based server when the ratio of delay-sensitve requests is 0.05 is translated to the insensitivity of the latency entropy, which is only 18.2%-26.3%, and 17.8%-25.3% of those of the mTCP-based and the Linux-based servers. Second, Figure 6(b) shows that as the payload length increases, the latency entropy of the LNS-based server decreases to 85.0% and 85.9% of those of the mTCP-based and the Linux-based servers. This trend of the latency entropy is also consistent with the result in Figure 5(b). Third, a higher ratio of delay-sensitve requests leads to a higher latency entropy of the LNS-based server, as Figure 6(c) shows. This increase in the latency entropy results from longer queues of delay-sensitve requests in the server. Fourth, Figure 6(d) shows that the latency entropy no longer reduces after the number of worker threads exceeds some threshold in the three modeled servers. Moreover, the similarity between Figures 5 and 6 indicates that a lower-entropy system can achieve better performance.
3.2. Potential Improvement
According to our models, we observe two ways of improvement for the LNS-based server.
3.2.1. Improvement 1
Our model for the LNS-based server gives the best number of worker threads querying the Redis database according to the parameters to achieve minimal tail latency and avoid wasting CPU cores. As Figure 5(d) shows, the percentile latency no longer reduces after the number of worker threads exceeds some threshold. From the model of the LNS-based server, when , the bottleneck has been transferred from the Redis Stage to the TCP/IP Stage, meaning that more worker threads cannot help reduce the tail latency anymore. Based on the observation, a representing the transition point of the bottleneck stage exists, as computed by Formula (16).
The indicates that the best number of worker threads , which represents the parallelism of querying the Redis Database, should be adaptive according to the parameters of the model of the LNS-based server. The model suggests that 12 worker threads are optimal in our testbed. Compared with the default 4 worker threads (https://github.com/acs-network/qstack.git), the optimized version supports 2.1-3.5 concurrency, as shown in Figure 7.
3.2.2. Improvement 2
Optimization for the TCP/IP will be more effective than the other stages. For instance, for 20,000 bursty delay-sensitve requests, our model indicates that the percentile latency can be maximally reduced by 64.1% when the TCP/IP is optimized, whereas the reductions are maximally 0.0017% and 0.093% when the event dispatcher and the worker threads are optimized, respectively. The significant reduction when optimizing the TCP/IP comes from the fact that the TCP/IP is the bottleneck of the LNS-based server, on the condition that there are 29 worker threads. Therefore, optimizing the TCP/IP can reduce , the service time of the bottleneck stage. As Formula (5) suggests, the reduction of will have a linear reduction of , which dominates the tail latency. On the contrary, the reduction of the non-bottleneck stages only impacts , thus affecting little to the tail latency. Also, eliminating redundant operations of the TCP/IP or providing more cores to run the TCP/IP are two ways to reduce to reduce the tail latency.
An analytical method based on queueing theory is proposed to simplify the quantitative study of the tail latency of cloud servers by defining a bottleneck stage. Leveraging the analytical method, we model an LNS-based server, an mTCP-based server, and a Linux-based server, taking bursty traffic as an example, with basic non-intrusive tests. Our models 1) reveal that two technologies in LNS, including the full-datapath prioritized processing and the full-path zero-copy, are primary factors for high performance, with orders of magnitude improvement of tail latency as the latency entropy reduces maximally 5.5 over the mTCP-based server, and 2) suggest the optimal number of worker threads querying a database, improving the concurrency of the LNS-based server 2.1-3.5. The analytical method can also apply to the modeling of other servers characterized as tandem stage queueing networks.
The data used to support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.
We thank Professor Zhiwei Xu for his academic guidance to do this work. This work is supported in part by the National Key Research and Development Program of China (2016YFB1000200), and the Key Program of the National Natural Science Foundation of China (61532016).
- A. A. Laghari, H. He, K. A. Memon, R. A. Laghari, I. A. Halepoto, and A. Khan, “Quality of experience (qoe) in cloud gaming models: A review,” Multiagent and Grid Systems, vol. 15, no. 3, pp. 289–304, 2019.
- I. Arapakis, X. Bai, and B. B. Cambazoglu, “Impact of response latency on user behavior in web search,” in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ser. SIGIR ‘14, pp. 103–112, Gold Coast, Queensland, Australia, 2014.
- I. Yaqoob, L. U. Khan, S. M. A. Kazmi, M. Imran, N. Guizani, and C. S. Hong, “Autonomous driving cars in smart cities: recent advances, requirements, and challenges,” IEEE Network, vol. 34, no. 1, pp. 174–181, 2020.
- H. Kasture and D. Sanchez, “Ubik: Efficient cache sharing with strict qos for latency-critical workloads,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ‘14, pp. 729–742, Salt Lake City, Utah, USA, 2014.
- D. Shan, F. Ren, P. Cheng, R. Shu, and C. Guo, “Observing and mitigating micro-burst traffic in data center networks,” IEEE/ACM Transactions on Networking, vol. 28, no. 1, pp. 98–111, 2020.
- M. Wang, T. Madhyastha, N. H. Chan, S. Papadimitriou, and C. Faloutsos, “Data mining meets performance evaluation: fast algorithms for modeling bursty traffic,” in Proceedings 18th International Conference on Data Engineering, pp. 507–516, San Jose, CA, USA, 2002.
- M. Andersson, J. Cao, M. Kihl, and C. Nyberg, “Performance modeling of an apache web server with bursty arrival traffic,” in International Conference on Internet Computing, pp. 508–514, Las Vegas, Nevada, USA, 2003.
- A. Sivanathan, D. Sherratt, H. H. Gharakheili et al., “Characterizing and classifying iot traffic in smart cities and campuses,” in 2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 559–564, Atlanta, GA, USA, May 2017.
- Y. Wu, G. Min, K. Li, and B. Javadi, “Modeling and analysis of communication networks in multicluster systems under spatio-temporal bursty traffic,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 5, pp. 902–912, 2012.
- V. Gupta, S. K. Devar, N. H. Kumar, and K. P. Bagadi, “Modelling of iot traffic and its impact on lorawan,” in GLOBECOM 2017 - 2017 IEEE Global Communications Conference, pp. 1–6, Singapore, December 2017.
- X. Jian, X. Zeng, J. Huang, Y. Jia, and Y. Zhou, “Statistical description and analysis of the concurrent data transmission from massive mtc devices,” International Journal of Smart Home, vol. 8, no. 4, pp. 139–150, 2014.
- S. Han, S. Marshall, B.-G. Chun, and S. Ratnasamy, “Megapipe: A new programming interface for scalable network i/o,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’12, pp. 135–148, Hollywood, CA, USA, 2012.
- X. Lin, Y. Chen, X. Li et al., “Scalable kernel tcp design and implementation for short-lived connections,” in Proceedings of the Twenty-First Inter national Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ‘16, pp. 339–352, Atlanta, Georgia, USA, 2016.
- K. Yasukata, M. Honda, D. Santry, and L. Eggert, “Stackmap: Low-Latency Networking with the Os Stack and Dedicated Nics,” in Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference, ser. USENIX ATC ‘16, pp. 43–56, Denver, CO, USA, 2016.
- A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion, “Ix: A Protected Dataplane Operating System for High Throughput and Low Latency,” in in Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’14, pp. 49–65, Broomfield, CO, 2014.
- A. Kaufmann, T. Stamler, S. Peter, N. K. Sharma, A. Krishnamurthy, and T. Anderson, “Tas: Tcp Acceleration as an Os Service,” in Proceedings of the Fourteenth EuroSys Conference 2019, ser. EuroSys ‘19, Dresden, Germany, 2019.
- E. Y. Jeong, S. Woo, M. Jamshed et al., “Mtcp: A Highly Scalable User-Level Tcp Stack for Multicore Systems,” pp. 489–502, USENIX Association.
- W.-L. Zhang, K. Liu, Y.-F. Shen et al., “Labeled network stack: a high-concurrency and low-tail latency cloud server framework for massive iot devices,” Journal of Computer Science and Technology, vol. 35, no. 1, pp. 179–193, 2020.
- Z. Xu and C. Li, “Low-entropy cloud computing systems,” SCIENTIA SINICA Informationis, vol. 47, no. 9, pp. 1149–1163, 2017.
- M. Carrascosa and B. Bellalta, “Cloud-gaming: Analysis of google stadia traffic,” Computer Communications, vol. 188, pp. 99–116, 2022.
- R. Ratasuk, A. Prasad, Z. Li, A. Ghosh, and M. A. Uusitalo, “Recent advancements in m2m communications in 4g networks and evolution towards 5g,” in 2015 18th International Conference on Intelligence in Next Generation Networks, pp. 52–57, Paris, France, 2015.
- W. Miao, G. Min, X. Zhang, Z. Zhao, and J. Hu, “Performance modelling and quantitative analysis of vehicular edge computing with bursty task arrivals,” IEEE Transactions on Mobile Computing, 2021, https://www.computer.org/csdl/journal/tm/5555/01/09447948/1ueYVjpWn4c.
- M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action, Cambridge University Press, USA, 1st edition, 2013.
- M. Nguyen, Z. Li, F. Duan, H. Che, Y. Lei, and H. Jiang, “The tail at scale: How to predict it?” pp. 120–125, USENIX Association.
- M. Nguyen, S. Alesawi, N. Li, H. Che, and H. Jiang, “Forktail: A black-box fork-join tail latency prediction model for user-facing datacenter workloads,” in HPDC '18: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 206–217, Tempe, Arizona, June 2018.
- S. Alesawi, M. Nguyen, H. Che, and A. Singhal, “Tail latency prediction for datacenter applications in consolidated environments,” in 2019 International Conference on Computing, Networking and Communications (ICNC), pp. 265–269, Honolulu, HI, USA, February 2019.
- Z. Qiu, J. F. Pérez, and P. G. Harrison, “Beyond the mean in fork-join queues: Efficient approximation for response-time tails,” Performance Evaluation, vol. 91, pp. 99–116, 2015.
- A. Thomasian, “Analysis of fork/join and related queueing systems,” ACM Computing Surveys, vol. 47, no. 2, pp. 1–71, 2015.
- C. Delimitrou and C. Kozyrakis, “Amdahl’s law for tail latency,” Communications of the ACM, vol. 61, no. 8, pp. 65–72, 2018.
- H. D. Friedman, “Reduction methods for tandem queuing systems,” Operations Research, vol. 13, no. 1, pp. 121–131, 1965.
- J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
- T. Hosfeld, F. Metzger, and P. E. Heegaard, “Traffic modeling for aggregated periodic iot data,” in 2018 21st Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN), pp. 1–8, Paris, France, August 2018.
- S. Larsen, P. Sarangam, and R. Huggahalli, “Architectural breakdown of end-to-end latency in a tcp/ip network,” in 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’07), pp. 195–202, Gramado, Brazil, October 2007.
- Q. Cai, S. Chaudhary, M. Vuppalapati, J. Hwang, and R. Agarwal, “Understanding host network stack overheads,” in SIGCOMM '21: Proceedings of the 2021 ACM SIGCOMM 2021 Conference, pp. 65–77, USA, August 2021.
- O. C. Ibe, Ed., Markov Processes for Stochastic Modeling, Elsevier, Oxford, 2nd edition, 2013.
- J. Abate, G. L. Choudhury, and W. Whitt, “Exponential approximations for tail probabilities in queues, i: Waiting times,” Operations Research, vol. 43, no. 5, pp. 885–901, 1995.
- C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
- W. Wu, X. Feng, W. Zhang, and M. Chen, “Mcc: A Predictable and Scalable Massive Client Load Generator,” in Benchmarking, Measuring, and Optimizing, W. Gao, J. Zhan, G. Fox, X. Lu, and D. Stanzione, Eds., pp. 319–331, Springer International Publishing, Cham, 2020.
- H. Song, W. Zhang, K. Liu, Y. Shen, and M. Chen, “Hcmonitor: an accurate measurement system for high concurrent network services,” in 2019 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 1–8, EnShi, China, August 2019.
Copyright © 2022 Hongrui Guo et al. Exclusive Licensee Zhejiang Lab, China. Distributed under a Creative Commons Attribution License (CC BY 4.0).