Research Article | Open Access
Yanan Yang, Xiangyu Kong, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Heng Qi, Keqiu Li, "SDCBench: A Benchmark Suite for Workload Colocation and Evaluation in Datacenters", Intelligent Computing, vol. 2022, Article ID 9810691, 18 pages, 2022. https://doi.org/10.34133/2022/9810691
SDCBench: A Benchmark Suite for Workload Colocation and Evaluation in Datacenters
Workload colocation is commonly used in datacenters to improve server utilization. However, the unpredictable application performance degradation caused by contention for shared resources makes the problem difficult and limits the efficiency of this approach. This problem has sparked research in hardware and software techniques that focus on enhancing the datacenters’ isolation abilities. There is still a lack of a comprehensive benchmark suite to evaluate such techniques. To address this problem, we present SDCBench, a new benchmark suite that is specifically designed for workload colocation and characterization in datacenters. SDCBench includes 16 applications that span a wide range of cloud scenarios, which are carefully selected from the existing benchmarks using the clustering analysis method. SDCBench implements a robust statistical methodology to support workload colocation and proposes a concept of latency entropy for measuring the isolation ability of cloud systems. It enables cloud tenants to understand the performance isolation ability in datacenters and choose their best-fitted cloud services. It also helps cloud providers improve their quality of service and thereby increase their revenues. Experimental results show that SDCBench can simulate different workload colocation scenarios by generating pressures on multidimensional resources with simple configurations. We also use SDCBench to compare the latency entropies in public cloud platforms such as Huawei Cloud and AWS Cloud and a local prototype system FlameCluster-II; the evaluation results show that FlameCluster-II has the best performance isolation ability among these three cloud systems, with an experience availability of 0.99 and a latency entropy of 0.29.
Cloud computing has grown rapidly over the past few decades and is widely used in many application areas, such as web service, databases, big data processing, and machine learning [1, 2]. Virtualization technologies allow end users to share cloud resources in the form of virtual machines (VMs) with an on-demand provision model. By abstracting the underlying hardware resources, the server utilization can be improved through workload consolidation. However, the contention for shared resources such as CPU, Last-Level Cache (LLC), and memory bandwidth between VMs causes performance interference, especially in multitenant cloud scenarios [5–7].
Interference may result in unpredictable performance degradation for cloud services, which not only degrades the user experience but also hurts resource efficiency in datacenters [8, 9]. For example, the pressures on multidimensional resource consumption at the instruction cycle level become more intractable, causing long tail latencies for interactive applications. Additionally, servers running latency-critical services can only operate at low utilization due to their unpredictable tail latency. The performance interference makes it difficult to utilize the spare server capacity by colocating batch applications, since uncontrolled sharing of CPU cores, caches, and power causes high latency degradation. As a result, the average server utilization in most datacenters is only 10%-50% [10, 11]. This leads to billions of dollars of wastage in infrastructure and energy every year.
To mitigate the performance interference caused by shared resource contention, many researchers seek to enhance the isolation ability of cloud systems through hardware and software approaches. Hardware methods such as Intel RDT and PARD provide control interfaces for partitioning hardware resources such as LLC ways and memory bandwidth between colocated applications, thus reducing the contention on these microarchitecture-level resources. Software methods commonly adopt mechanisms like resource overprovisioning, CPU core binding, and dynamic power management to protect latency-critical services from the interference of colocated workloads [4, 15]. However, utilizing these techniques in datacenters requires new hardware support or system upgrades. Not all cloud providers are willing to implement such optimizations in their platforms, which also leads to service performance differences between cloud providers. The need for predictable service performance in datacenters brings new challenges and opportunities for cloud system designs that seek to improve server-level resource utilization without hurting application-level performance.
Unfortunately, the lack of a comprehensive suite of workload colocation benchmarks makes studying this emerging problem challenging. First, it hampers research that seeks to analyze the causes of application interference. Latency-critical services (LCs) have a variety of latency requirements and microarchitectural characteristics. The performance degradation may come from contention on different shared resources or their combinations. However, most benchmarks in prior works are not designed for evaluating the performance interference of workloads in multitenant shared cloud scenarios [1, 16–18], and they are unable to exert a wide range of pressures on the underlying hardware resources. Additionally, they only observe the service-level performance and do not support the measurement of system isolation ability. Second, most algorithm or architecture innovations in cloud systems are focused on throughput-oriented designs to provide better resource pooling or provisioning abilities [10, 19]. This constitutes a blind spot in the interference measurement and explorations of new isolation techniques. For example, many scheduling frameworks adopt optimized algorithms to improve the efficiency of allocating virtual machines (VMs) or jobs [20–22], while insufficient hardware isolation mechanisms may dramatically worsen the application performance of VMs (e.g., increasing job completion time). Nevertheless, none of the existing benchmark suites support the measurement of performance isolation ability in diverse cloud scenarios.
A workload colocation benchmark can help cloud providers understand and improve the infrastructures’ isolation capabilities, thereby increasing their adoption by cloud users. Designing such a benchmark is challenging for several reasons. First, the applications must be carefully selected and cover a comprehensive range of domains and multidimensional resource usage behaviors. Similar applications will make the benchmark redundant and hard to use. Second, the service performance degradation caused by interference may appear in many system-level and microarchitecture metrics (e.g., tail latency and IPC). How to characterize the system uncertainty by consolidating these observed changes is a challenge. Third, it is not enough for these workloads to run individually on systems. Instead, the benchmark should support flexible mixing of workload types and intensities to adapt to different application colocation requirements in datacenters.
To address these challenges, we present SDCBench, a benchmark suite for workload colocation. SDCBench includes a diverse set of LC services and BE applications, as well as a robust validated experimental methodology that makes it easy to colocate these benchmarks on cloud systems and measure the datacenters’ resource isolation abilities. Our key contributions can be summarized as follows:
(i) We present SDCBench, a new benchmark suite for the isolation ability measurement in multitenant shared cloud systems that covers a wide spectrum of workload diversity and characterization.
(ii) We propose a concept of latency entropy to describe the application performance degradation arising due to resource contention in cloud systems. This enables one to quantify the system isolation ability and efficiency of hardware and software partition technologies in workload colocation scenarios.
(iii) We implement a comprehensive evaluation framework based on SDCBench that can automatically configure, deploy, and evaluate applications on cloud platforms. The framework is open source and can be easily extended to new cloud systems.
(iv) We demonstrate SDCBench’s usability and show that it can simulate different cloud scenarios by colocating workloads with simple configurations. We also evaluate today’s major cloud providers using SDCBench and present the comparison of their performance entropies.
2. Materials and Methods
2.1. Background and Motivation
2.1.1. Interference in Shared Cloud Scenarios
Large-scale online services such as e-commerce, search engines, online maps, social media, and advertising are widespread in today’s datacenters. These interactive, latency-critical services are usually scaled across thousands of servers with fanout or multitiered architecture. The intermediate state and accessible data are stored in a distributed manner in memory or flash to ensure fast response time. A large number of microservices across multiple leaf nodes may collaborate to serve a user request. As the overall latency presented to the user is determined by the slowest nodes, even small interference, queue delays, or other sources of performance variations in these nodes may cause dramatic service time increases (a.k.a. tail latency). For example, the tail latency of the web search service in Google ranges from 0 to 500 ms, and the highest variation difference can exceed 600.
The requirements for low and predictable tail latency of latency-critical applications limit server resource utilization in datacenters [4, 35]. On the one hand, the workload of interactive services varies significantly due to the diurnal patterns and unpredictable bursts in user accesses. Cloud providers have to allocate resources to these services for their peak loads. This leads to much wastage due to resource overprovisioning. On the other hand, it is difficult to utilize the spare capacity by colocating batch applications with them, as the interference from sharing CPU cores, cache, and power causes high tail latency degradation and even violates the latency SLO.
Achieving high performance isolation in cloud systems has been a key challenge in improving the resource efficiency of datacenters. Researchers have proposed many approaches to reduce the performance interference between colocated workloads in cloud systems in order to improve server utilization. However, the lack of a comprehensive benchmark suite makes it hard for researchers in this area to evaluate their newly proposed methods and for cloud users to understand their application performance on different cloud platforms. In the following, we explain why the existing benchmarks do not address this problem.
2.1.2. Limitation of Existing Benchmark Suites
Prior work has proposed a variety of benchmarks, including both latency-critical services and background applications that can be deployed in cloud systems to help researchers in this area. These benchmarks fall short of our needs from the standpoints of workload diversity and performance metrics in interference measurement and studies. In the following, we compare SDCBench with some of the representative benchmarks from these aspects.
Table 1 lists the existing benchmark suites for evaluating computing system performance. LINPACK, SPEC CPU, HPCC, and PARSEC are benchmarks designed for evaluation in high-performance computing (HPC), which include several well-written programs running on computing hardware or simulators. The main focus of these applications is on the peak speed of CPU processors, measured in units such as GFLOPS, which impacts the job completion time (JCT) of tested applications. Unlike SPEC CPU, SPEC Cloud_IaaS is a specially designed version for evaluating applications in cloud systems. It includes both latency-critical services and background applications in the benchmark and supports the measurement of service latency. However, SPEC Cloud_IaaS only provides two applications and is not representative of most of today’s cloud scenarios.
Several benchmarks, such as YCSB, CloudSuite, TailBench, and μSuite, focus on performance measurements in cloud services. Among them, YCSB is a cloud benchmark specifically for data storage systems, which provides key-value pairs of queries from the NoSQL database. Similar to YCSB, μSuite includes four data-intensive interactive applications that are designed for the measurement of microarchitecture metrics such as system calls, context switches, and other OS overheads. CloudSuite and BigDataBench provide both latency-critical and throughput-oriented applications to evaluate the microarchitectural traits that impact the performance of these services. However, their load testers adopt a “closed-loop” system and lack a rigorous latency measurement methodology. TailBench aggregates a set of interactive benchmarks and proposes a more accurate methodology to measure the tail latency. However, all of these benchmarks are designed to measure the performance of monolithic applications, which cannot reflect the differences in performance isolation ability between cloud systems.
Other benchmarks, such as DeathStarBench, MLPerf, and ServerlessBench, target specific application domains in cloud datacenters. For example, DeathStarBench is a recent open-source benchmark suite for cloud and IoT microservices, which includes representative services such as social networks, video streaming, e-commerce, and swarm control services. MLPerf is an industry-academic benchmark suite for machine learning that facilitates system-level performance measurements and comparisons on diverse software platforms, such as TensorFlow and PyTorch, as well as hardware architectures. ServerlessBench is a benchmark designed for serverless platforms. It contains a number of multifunction applications and focuses on function composition patterns.
The limitations of current benchmarks motivate us to design a new benchmark suite for widespread colocation scenarios in datacenters to help researchers understand the interference in cloud systems and evaluate their new proposed software and hardware techniques to improve the system efficiency. The benchmark suite should be designed with the following principles. (i) Workload diversity: the benchmark should include both latency-critical services and background applications from a wide range of domains in cloud systems. The applications in the benchmark should be sensitive to the pressures on multidimensional resources at the system and microarchitecture levels. (ii) Usability and robust evaluation methodology: the workloads in the benchmark suite should be easy to use for simulating different workload colocation scenarios in a local cluster or public cloud systems. The benchmark should also be able to provide automatic evaluation and metric collection mechanisms that cover a wide range of service latency or completion time from microseconds to tens of minutes. (iii) Characterization and measurement for interference: the benchmark should be able to characterize the workload performance degradation caused by shared resource interference in cloud scenarios, and it should also provide metrics for measuring the performance isolation ability in cloud systems.
2.2. Overview of SDCBench
In this section, we present the design of SDCBench. We first give an overview of the evaluation framework of this benchmark to explain how it works in cloud systems. Then, we describe the main components in the benchmark suite, including the application selection methodology, definition of latency entropy, and implementation of the key modules of SDCBench.
Figure 1 shows the overview of SDCBench. SDCBench includes 16 latency-critical services and background applications with representative workload characterizations in multitenant shared cloud scenarios. It supports workload colocation in cloud systems based on these applications and metric collection from service performance to microarchitectural behaviors. Different from the existing benchmarks, SDCBench implements a robust evaluation methodology and latency entropy metric that enables it to measure the interference isolation ability between different tenants in a cloud system.
SDCBench can be deployed in a local server cluster running on a container engine or on public cloud platforms built on top of VMs. We also develop an automatic software toolkit to help users easily build, deploy, and evaluate SDCBench in cloud systems. The toolkit consists of three key components: the colocation controller, the load generator, and the metric collector. For each evaluation case in the cloud scenario, the user can choose the needed benchmarks from SDCBench and configure their resources and runtime parameters, such as queries per second, request arrival pattern, and peak load. The colocation controller reads the user’s configurations, prepares the sandbox environments, and runs applications by calling the load generator to send requests. During this process, the metric collector monitors the service-level and system-level states and collects the relevant metrics, which are analyzed by the colocation controller; the evaluation results are then presented to the user.
2.3. Benchmark Selection
2.3.1. Candidate Applications
To design SDCBench, we first select candidate applications used in cloud scenarios. The existing benchmarks have proposed a large number of applications, from simple websites to resource-intensive background tasks. However, some of these benchmarks are similar in their workload behaviors and microarchitecture characterization. To select representative applications, we conduct a characterization of these benchmarks using metrics describing CPU, memory behaviors, and external resource requirements. The evaluation allows us to classify applications and choose a benchmark set that is representative of the required resource consumption.
We collect more than 50 applications from existing benchmarks [1, 17, 18, 31, 32, 38]. As shown in Figure 2, these applications come from diverse cloud scenarios such as web search, video processing, machine learning, and serverless computing. The performance measurement of these applications also covers a wide range of response times, from microsecond latency in interactive services to tens of minutes of completion time in background tasks. However, many of these applications exert similar pressures on the underlying hardware resources. For example, Image-classify, Scimark, and Alu are compute-intensive applications that require much computing capacity for task or data processing. A similar phenomenon can be found in I/O-intensive applications such as Redis, Memcached, and Media-streaming.
To reduce the number of redundant applications and select a representative set of them in SDCBench, we characterize the applications in these benchmarks by collecting their microarchitecture resource consumption. The measured metrics include CPU, memory, LLC, memory bandwidth, network, disk I/O, and IPC. These applications are deployed individually in isolated sandboxes, with no other workloads running on the server. Since latency-critical services always have varied workloads, we measure these services under different loads (10%, 50%, and 100%) and aggregate the collected data as their overall metrics. For background applications, we characterize them by collecting the performance metrics under different input data sizes. The input datasets used for evaluating these applications are listed in Table 2. To reduce the measurement error from system noise, we prewarm every application for a period of time (e.g., 5 minutes) under each load and collect the average metrics using a statistical method.
We use a vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ to represent the profile of application $i$, where $n$ is the dimension of metrics in the resource consumption characterization. For each resource, we normalize the metric value to the maximum resource capacity in the server, which is mapped in the range of 0 to 1. In this paper, we construct a 7-dimension profiling vector for each application; the profile metadata includes the consumption of CPU cores, main memory, memory bandwidth, LLC, disk I/O, network resources, and a microarchitecture metric IPC. After that, we adopt the $k$-means algorithm to cluster these applications based on their similarity to select the minimum set of representative benchmarks. We use cosine distance to define the similarity of these applications, which is commonly used for classification. Given the profile vectors of two applications, their cosine distance can be calculated as follows:

$$d(X_i, X_j) = 1 - \frac{\sum_{m=1}^{n} x_{im}\, x_{jm}}{\sqrt{\sum_{m=1}^{n} x_{im}^{2}}\;\sqrt{\sum_{m=1}^{n} x_{jm}^{2}}},$$

where $x_{im}$ ($x_{jm}$) represents the $m$-th metric of application $i$ ($j$). If the characterization of application $i$ is close to that of application $j$, a small value of the cosine distance is derived.
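The profile normalization and cosine distance described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the example capacities, and the three profiles are hypothetical, and the distance is taken as one minus the cosine similarity so that similar profiles yield a small value, as the text states.

```python
import math


def normalize_profile(raw, capacity):
    """Scale each raw metric by the server's maximum capacity, clamped to [0, 1]."""
    if len(raw) != len(capacity):
        raise ValueError("raw metrics and capacities must have the same length")
    return [min(max(r / c, 0.0), 1.0) for r, c in zip(raw, capacity)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length profile vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero profile")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical 7-dimension profiles, ordered as:
# CPU, memory, memory bandwidth, LLC, disk I/O, network, IPC.
compute_a = [0.9, 0.1, 0.2, 0.2, 0.0, 0.0, 0.8]
compute_b = [0.8, 0.2, 0.2, 0.3, 0.0, 0.1, 0.7]
io_heavy = [0.1, 0.2, 0.1, 0.1, 0.9, 0.6, 0.2]

print(cosine_distance(compute_a, compute_b))  # small: similar behavior
print(cosine_distance(compute_a, io_heavy))   # larger: dissimilar behavior
```

A clustering routine can then group applications whose pairwise distances are small, so that one representative per group is kept.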
Based on the application characterization, these candidate benchmarks are clustered into 6 classes of latency-critical services and 10 classes of throughput-oriented applications (see Tables 3 and 4). As the benchmarks in the same class have similar workload behaviors, we select one of them from each class to build the application set of SDCBench. The selected 16 applications in SDCBench are listed in Table 5. For clarity, each number in the table is color-coded as follows: red is ≥0.8, yellow is between 0.2 and 0.8, and green is ≤0.2. We can see that these applications exhibit a variety of resource sensitivity characteristics, and many of them can generate considerable pressure on several dimensions of resources.
2.3.2. Application Descriptions
We now briefly describe the applications included in SDCBench.
Image-classify is a deep learning serving application implemented in Python. This service takes images through HTTP requests and runs a ResNet model for image classification. Since ResNet is computationally intensive, the serving latency typically ranges from 10s to 1000s of milliseconds.
Redis is an open-source key-value database that is widely used as a distributed in-memory cache and message broker. Redis is written in C and is highly efficient, providing sub-millisecond response latency.
Solr is an open-source enterprise search engine written in Java, which supports full-text search, hit highlighting, and real-time indexing. Solr is highly scalable and fault-tolerant and is widely used for enterprise search and analytics use cases. The search latency of Solr is typically in the 10s of milliseconds.
Speech-recog is a speech classification inference service, which consists of a speech recognition model that takes in the frequency spectrum of the input speech sequence and produces the classified labels. This lightweight model is implemented in Python and takes only 10s of milliseconds for inference.
TPC-W is a web server and database performance benchmark proposed by the Transaction Processing Performance Council. It defines a complete web-based shop for searching, browsing, and ordering books. The response time of such web interactions typically ranges from 10s to 100s of milliseconds.
Social-network is a microservice application from the DeathStarBench benchmark. It is an end-to-end service that implements a broadcast-style social network with unidirectional follow relationships. Since requests may be forwarded and processed by different components, the service latency typically ranges from 10s to 100s of milliseconds.
DecisionTree is an application in the Spark benchmark suite, which is written in Scala. It is implemented with the Spark MLlib APIs, which support decision trees for binary and multiclass classification and for regression. This application is highly IPC efficient.
Alu is an arithmetic computation application in the serverless benchmark suite ServerlessBench, which repeatedly performs arithmetic operations with multiple threads. The Alu application is CPU intensive and requires much less memory and network resources.
PageRank is a graph processing application implemented with Spark. Since web pages can be numerous, the PageRank computation consumes relatively intensive resources, and jobs take at least minutes to complete.
DiskIO is an application from the serverless benchmark suite FunctionBench, which runs the dd system command to create a file in the /tmp/ directory of the function runtime. The DiskIO application consumes little CPU, memory, and network resources while imposing high pressure on the disk I/O bandwidth.
Dwarf-sort is a big data sort application in the BigDataBench benchmark suite, which is implemented in Scala and sorts Wikipedia entries by key. This application is typically memory and cache intensive.
AlexNet, LeNet, and ResNet20 are deep learning training applications implemented with TensorFlow. These training processes impose heavy and lasting pressure on CPUs, memory, cache, and network bandwidth. Typically, AlexNet has the highest resource demands, while LeNet consumes relatively fewer resources to train the network.
Matmul is a matrix multiplication application in the HPCC benchmark. Typically, the matrix multiplication operation consumes a large amount of CPU, memory, and cache resources, while generating less network pressure.
2.4. Metric Collection
SDCBench supports the measurement of both service-level and system-level metrics in workload colocation scenarios. These service-level metrics mainly focus on application performance and are presented to the user as intuitive results. System-level metrics are collected to analyze user application runtime behavior and the impact of system isolation ability on application performance. We now discuss the detailed metrics that are measured at the service level and system level.
Service-level metrics: these metrics provide an accurate profile of application performance for users.
(i) Response time: for interactive services, SDCBench records the response time for each request to analyze their performance changes in cloud systems. The collected metrics include the latency for a single request, the average latency, and the tail latency (e.g., the 90th, 95th, and 99th percentile latencies).
(ii) CPU utilization: SDCBench supports the measurement of the CPU time in the user and the kernel space spent by the application. This metric helps to determine which applications are sensitive to computational resources.
(iii) Memory: the available memory size directly affects the data swapping speed for applications that need to exchange a large amount of data between the memory and disk, and insufficient memory can cause an order-of-magnitude increase in latency. SDCBench supports the measurement of peak memory usage to help users understand the memory requirements of their applications.
(iv) Disk and network I/O: the contention on the disk and network affects the performance of I/O-intensive applications. Previous work has shown that the average throughput of file system I/O and network operations decreases with the number of colocated applications in cloud systems that share bandwidth. SDCBench supports the measurement of disk I/O and network usage to help users track performance variations caused by contention on these two resources.
(v) Cache and memory bandwidth: latency-critical services, such as in-memory databases, are more sensitive to contention on cache and memory bandwidth. Even when adopting hardware isolation mechanisms such as CPU core binding and the numactl technique, contention from the shared LLC and memory bandwidth is inevitable between virtual machines in old-generation servers. SDCBench provides the measurement of LLC and memory bandwidth usage to analyze the applications’ performance degradation caused by microarchitecture resource contention.
(vi) Cache miss: an insufficient cache capacity or contention from the threads running on the same CPU socket may cause frequent cache misses for the application. SDCBench uses hardware performance counters to count cache miss events and collects LLC MPKI as a metric.
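As an illustration of the response-time metrics listed above, the following sketch computes the average and tail latencies from a list of per-request latencies. It is not taken from SDCBench: the function names are hypothetical, and the nearest-rank percentile definition is an assumption, since the text does not state which percentile method is used.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_latency(latencies_ms):
    """Average latency plus the 90th, 95th, and 99th percentile latencies."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical per-request latencies in milliseconds: 1, 2, ..., 100.
summary = summarize_latency(list(range(1, 101)))
print(summary)
```

Other percentile conventions (for example, linear interpolation between ranks) give slightly different values on small samples, so the method should be fixed before comparing platforms.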
System-level metrics: these metrics are used to describe the performance uncertainty of an application running on a cloud system and the impact of performance degradation on users caused by system uncertainty.
(i) Experience availability: SDCBench supports the user-oriented service experience availability (EA) measurement (Figure 3). Prior work always focuses on system failures that cause user experience degradation; that is, when the system fails, the service becomes unavailable. However, this ignores the low latency requirement of cloud users. High tail latency also degrades user experience and even violates the users’ latency service level objectives (SLOs). SDCBench introduces a new EA metric [51, 52] by combining the system availability with service tail latency, which is defined as follows:

$$EA = \frac{1}{n}\sum_{i=1}^{n} a_i,$$

where the collected latency statistics are divided into $n$ uniform time intervals (i.e., $t_1, t_2, \ldots, t_n$). For the $i$-th time interval, if the value $l_i$ of tail latency meets $l_i \le \tau$, where $\tau$ is the latency SLO, then $a_i$ is set to 1. Otherwise, $a_i$ is set to 0.
(ii) Latency entropy: the concept of entropy was first introduced by the German physicist Clausius in 1865 and is used to describe the degree of disorder within a system. Inspired by this, SDCBench first proposes the latency entropy (LE) metric for measuring the uncertainty of cloud systems. Inside a computing system, the sequence of system calls and hardware accessing events (e.g., memory access, instruction fetching, and thread execution) occurring per unit time can be considered as a microstate of the system. The colocation of multiple applications makes the microstates of the system more complex, especially when shared resource contention occurs between different applications. In these scenarios, system behaviors, such as instruction fetching and execution, become disorderly and unpredictable.
Unfortunately, it is difficult to measure the internal microstates of computer systems under the architecture of modern high-speed processors. To help users understand their application performance changes with observable metrics, SDCBench defines the latency entropy that describes the variations of tail latency for the measurement of system isolation ability. The latency entropy is calculated as follows:

$$LE = -\sum_{i=1}^{n} p_i \ln p_i,$$

where $n$ is the number of latency distribution states and $p_i$ represents the $i$-th state’s probability. In practice, we divide the collected latencies into multiple fixed-length time intervals, each of which can be seen as an individual state; the probability of one state can then be approximately derived by calculating the percentage of latency samples falling into the corresponding interval. For each cloud service, latency entropy can be used to describe its performance uncertainty in a cloud system, which implies the following. (i) The smaller the number of latency distribution states, the smaller the latency entropy of the cloud system. (ii) The more uneven the probability of the latency distribution states, the smaller the latency entropy of the cloud system. For example, if service A has a latency state distribution with “[14, 17, 20], [24, 25, 29], [32, 37]” and service B has a latency state distribution with “, [23, 24, 26, 28, 29], ,” we can see that service B is more stable than service A, and indeed, service B has a smaller LE score than service A (0.81 vs. 1.08), which is consistent with our observation.
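The two system-level metrics can be sketched in a few lines. The code below is an illustration under stated assumptions rather than SDCBench's own implementation: function names and the EA example values are hypothetical, EA is taken as the fraction of time intervals whose tail latency meets the SLO, and LE is computed with the natural logarithm over pre-grouped latency states, which reproduces the 1.08 score quoted for service A.

```python
import math


def experience_availability(interval_tail_latencies, slo):
    """Fraction of time intervals whose tail latency is within the latency SLO."""
    if not interval_tail_latencies:
        raise ValueError("at least one time interval is required")
    met = sum(1 for latency in interval_tail_latencies if latency <= slo)
    return met / len(interval_tail_latencies)


def latency_entropy(states):
    """Entropy (natural log) of the distribution of latency samples over states.

    `states` is a list of lists; each inner list holds the latency samples
    that fall into one latency state. Empty states are ignored.
    """
    counts = [len(samples) for samples in states if samples]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one latency sample is required")
    probabilities = [count / total for count in counts]
    return sum(-p * math.log(p) for p in probabilities)


# Service A from the text: three states holding 3, 3, and 2 samples.
service_a = [[14, 17, 20], [24, 25, 29], [32, 37]]
print(round(latency_entropy(service_a), 2))  # 1.08

# Hypothetical per-interval tail latencies (ms) checked against a 100 ms SLO.
print(experience_availability([80, 95, 120, 90], slo=100))  # 0.75
```

A service whose samples all fall into a single state has an entropy of zero, which matches the intuition that fewer and more unevenly populated states mean lower uncertainty.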
3. Results and Discussion
3.1. System Implementation
We implement an evaluation framework based on SDCBench, which is designed to help users easily understand the isolation ability of different cloud providers and thus evaluate their application performance on these cloud platforms. As mentioned in Section 2.2, the colocation controller, load generator, and metric collector are the core modules in the framework design, which support automatic application configuration, deployment, and measurement of performance and cloud system isolation ability.
Colocation controller: the controller automatically manages all necessary steps of the workload evaluation in a cloud system. It provides application selection and parameter configuration interfaces with a visual interface for users to execute these operations. The latency-critical services and background applications in SDCBench are registered in the database, and the user can select the evaluated applications by marking their flags as executable. For latency-critical services, SDCBench supports the evaluation parameter settings, such as request arrival pattern, peak load, warmup invocations, and the total request invocations. For background applications, SDCBench supports the settings of job execution times, task types, and input data sizes.
SDCBench supports component-level (i.e., container-level) application colocation based on docker APIs and Linux system tools. It uses the docker update command and the numactl tool to bind CPU cores and memory blocks to containers. Applications running on the same CPU socket may contend for the shared cache and memory bandwidth, which causes performance interference for the colocated workloads. SDCBench therefore also provides a fine-grained resource partition mechanism for containers running within the same CPU socket; it adopts the Intel RDT tool to set the cache ways and memory bandwidth for each container. Additionally, SDCBench uses the qdisc network tool to allocate network bandwidth to the evaluated applications, so that their performance variations can be measured in both colocated and isolated workload scenarios.
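The kind of commands involved can be sketched as follows. The source names the tools but does not list the exact invocations, so the function names, container name, core range, cache mask, and rates below are hypothetical; the sketch also assumes that per-container cache and bandwidth limits are applied through the cores the container is pinned to. The functions only build argument lists and do not execute anything.

```python
from typing import List


def docker_pin(container: str, cpus: str, mem_node: int) -> List[str]:
    """Pin a running container to CPU cores and a NUMA memory node."""
    return ["docker", "update", "--cpuset-cpus", cpus,
            "--cpuset-mems", str(mem_node), container]


def numactl_run(node: int, command: List[str]) -> List[str]:
    """Start a non-containerized process bound to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + command


def rdt_partition(cos: int, cores: str, llc_mask: int, mba_percent: int) -> List[List[str]]:
    """Define an Intel RDT class of service (cache ways, memory bandwidth)
    with the pqos utility, then associate the given cores with it."""
    return [
        ["pqos", "-e", f"llc:{cos}={llc_mask:#x}"],   # cache-way bitmask
        ["pqos", "-e", f"mba:{cos}={mba_percent}"],   # memory bandwidth cap in percent
        ["pqos", "-a", f"llc:{cos}={cores}"],         # attach cores to the class
    ]


def limit_bandwidth(dev: str, rate: str) -> List[str]:
    """Attach a token-bucket qdisc that caps egress bandwidth on a device."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "tbf",
            "rate", rate, "burst", "32kbit", "latency", "400ms"]


if __name__ == "__main__":
    # Hypothetical setup: a Redis container on cores 8-11 of NUMA node 1.
    plan = [docker_pin("redis-lc", "8-11", 1)]
    plan += rdt_partition(cos=1, cores="8-11", llc_mask=0xF, mba_percent=50)
    plan.append(limit_bandwidth("eth0", "1gbit"))
    for argv in plan:
        print(" ".join(argv))
```

Keeping the commands as argument lists makes it easy to log the full isolation plan before handing each entry to a process runner with root privileges.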
Load generator: SDCBench implements a load generator that issues requests to the latency-critical services and can be deployed on one or more client machines. The load generator integrates a traffic shaper, a client pool, and a recorder. It creates multiple clients from the thread pool to continuously generate requests. The traffic shaper handles these requests and sends them to the backend service following the desired workload pattern (e.g., from production traces) by inserting delays between requests before sending them out over the network. The simulated clients operate in an “open-loop” mode, in which each request is sent according to its desired timing without waiting for responses to previous requests. The open-loop [17, 57] setup generates sufficient workload pressure on the evaluated services and accurately captures queuing delays, which are an important factor in tail latency.
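The open-loop idea can be illustrated with a small simulation. This is a sketch rather than the SDCBench load generator: it assumes Poisson arrivals (the source only says the pattern may come from production traces) and a hypothetical single-server backend with a fixed service time. The key point is that send times are fixed in advance and latency is measured from the scheduled send time, so time spent queuing is included.

```python
import random
from typing import List


def open_loop_send_times(rate_qps: float, duration_s: float, seed: int = 0) -> List[float]:
    """Schedule send times with exponential gaps (Poisson arrivals, an assumption).
    The schedule never depends on when responses come back."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_qps)
        if t >= duration_s:
            return times
        times.append(t)


def queueing_latencies(send_times: List[float], service_time_s: float) -> List[float]:
    """Latency seen by each request at a single FIFO server (hypothetical backend).
    Latency is measured from the scheduled send time, so waiting is counted."""
    latencies, server_free_at = [], 0.0
    for sent in send_times:
        start = max(sent, server_free_at)      # wait if the server is still busy
        server_free_at = start + service_time_s
        latencies.append(server_free_at - sent)
    return latencies


if __name__ == "__main__":
    times = open_loop_send_times(rate_qps=90.0, duration_s=10.0, seed=1)
    lat = queueing_latencies(times, service_time_s=0.01)
    print(f"{len(times)} requests, worst latency {max(lat) * 1000:.1f} ms")
```

A closed-loop client, which waits for each response before sending the next request, would throttle itself when the server slows down and would hide most of this queuing delay.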
The recorder maintains a queue to store the processed requests, which is shared among simulated clients. It records the response time of all the requests sent by the clients, aggregates them and calculates statistical metrics, such as the single response time, average latency, and tail latency. The measured data can be stored in a database or exported as files for users.
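The aggregation step can be sketched as below. The source does not say how the recorder computes percentiles, so the nearest-rank method used here is an assumption, and the function name `summarize` is hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List


def summarize(latencies_ms: List[float], percentile: float = 99.0) -> Dict[str, float]:
    """Return mean and tail latency for a batch of recorded response times.
    The tail uses the nearest-rank percentile (an assumed convention)."""
    if not latencies_ms:
        raise ValueError("no latency samples recorded")
    ordered = sorted(latencies_ms)
    # Nearest rank: smallest value with at least `percentile` % of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return {"mean": mean(ordered), "tail": ordered[rank - 1], "max": ordered[-1]}


if __name__ == "__main__":
    stats = summarize([float(x) for x in range(1, 101)])
    print(stats)  # mean 50.5, tail (p99) 99.0, max 100.0
```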
Metric collector: the performance degradation caused by interference can manifest in the activity of multiple hardware resources. To accurately measure these system-layer and microarchitecture-level behaviors without introducing external overhead, SDCBench adopts a nonintrusive method to implement the metric collector. In the system layer, we collect the actual resource usage ratio of the measured application, which includes the number of CPU cores, memory, network, and disk I/O. In the microarchitecture layer, we collect branch prediction errors, cache switching, context switching, memory-level parallelism, and misses per thousand instructions (MPKI). These metrics focus on the operating efficiency of the application code on the current physical hardware, such as locality and parallelism, which helps us understand where the interference comes from and how it affects application performance.
The collector runs alongside the measured applications in an isolated CPU socket and uses multithreading to invoke a series of monitoring tools, such as Intel RDT, Perf, and Docker Stats, for the metric measurements. When all of the monitors have completed, the collector formats the data and returns the results to the user. SDCBench is open source and available at https://github.com/TankLabTJU/sdcbench/tree/sdcbench-v2.0/.
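Two of the derived microarchitecture metrics are simple ratios of raw hardware counters. The sketch below shows the standard definitions; the function names are hypothetical and the counter values are made up for illustration, not measured.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand retired instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return misses * 1000.0 / instructions


def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles


if __name__ == "__main__":
    # Illustrative counter readings only.
    print(mpki(misses=5_000, instructions=1_000_000))   # 5.0
    print(ipc(instructions=2_000_000, cycles=1_000_000))  # 2.0
```

Normalizing by instruction count lets runs that execute different amounts of work, such as solo-run and colocated runs, be compared on the same scale.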
3.2. Evaluation and Methodology
SDCBench is designed to help cloud users understand the performance isolation ability of cloud systems by deploying colocated applications whose workloads share hardware resources and measuring their performance changes. In the evaluation of SDCBench, we answer several key questions. Are the benchmarks in SDCBench representative of multitenant cloud scenarios by covering a wide range of latencies that can be measured? Is SDCBench able to observe the service performance degradation due to interference from colocated workloads? Can hardware isolation mechanisms eliminate the performance variations caused by interference? How do the major cloud service providers perform in latency entropy measurement? To answer these questions, we use SDCBench to thoroughly evaluate cloud systems under different workloads. We begin with a local benchmark evaluation to verify that we cover various workload behaviors (Section 3.2.1). We then analyze the performance degradation of latency-critical services under different workload colocation scenarios (Section 3.2.2) and compare the latency entropy of several existing cloud platforms (Section 3.2.3).
We evaluate SDCBench on both a local cluster and public cloud platforms. The benchmarks are implemented in C, Python 3.7, and Java. All real-system measurements reported in the evaluation were performed on servers with two Intel Xeon Silver-4215 CPUs. Table 6 shows the detailed server configurations. We run the load generator on a separate server to avoid interference with the deployed applications. In the testbed, we bind the applications to the second CPU socket and prevent the operating system from scheduling other tasks on the CPU cores in this socket, thereby avoiding interference from the system. We also disable Turbo Boost and use the cpufreq tool to fix the CPU frequency at 2.0 GHz, which helps to avoid unpredictable performance fluctuations. The server and client nodes are connected via 10 Gbps, full-bisection bandwidth Ethernet.
3.2.1. Benchmark Characteristics
We now study the latency characteristics of each application, which include the average request service time and tail latency. The service time of a request measures the time the application takes to process that request, which reflects the execution speed of the application code running on dedicated hardware. The tail latency represents the few slowest requests (e.g., the slowest 1% of requests when measuring the 99th percentile latency); it is much more sensitive to small perturbations and can be used to observe service performance fluctuations. We also study how the request arrival rate affects tail latency in these applications. All measurements in this experiment were obtained by the recorder module in the load generator. To mitigate the measurement error caused by system noise, we collect the evaluation data after the application is running stably, and each experiment is repeated three times.
Q1: Are the benchmarks in SDCBench representative of multitenant cloud scenarios by covering a wide range of latencies that can be measured? Figure 4 shows the cumulative distribution function (CDF) of request service times for each SDCBench application. The service times vary widely across applications. Almost all Redis requests finish in less than 1.1 ms, and the difference between the lowest and highest service times is only 0.3 ms. In contrast, Social-network requests can take more than 110 ms each. Applications also vary widely in how tightly their request service times are distributed. For some applications, request service times are distributed in a fairly narrow range or have a long tail. 90% of Social-network request times are distributed between 110 ms and 125 ms, and another 10% are distributed between 125 ms and 175 ms, accounting for 77% of the total time distribution. Solr has 1% of its requests spread over 100 ms to 150 ms, accounting for one-third of the total time distribution. Image-classify requests show a similar trend. Other applications, such as Speech-recog and TPC-W, have their request service times fairly evenly distributed in two specific ranges.
Figure 5 shows the mean and 99th percentile latencies for each application at various request loads. In these experiments, 100% of the request load represents the queries per second (QPS) limit of the application under the current configuration. At very low request load, the difference between mean and tail latencies mostly depends on the distribution of request service times. As the request load increases, both mean and tail latencies increase because of competition for resources and queuing delays. However, the tail latencies of all applications except Image-classify increase much faster than the mean; for example, the tail latency of Solr requests is about 500 ms higher than the mean latency at 100% of the request load and 50 ms higher at 80% of the request load. Redis shows a similar trend, but the difference between its tail latency and mean latency grows slowly compared with other applications, and its tail latency is only about 1 ms higher than the mean at 100% of the request load. The difference between the tail latency and the mean latency of Image-classify is also small, but its tail latency growth rate is gradually exceeded by the mean latency growth rate.
3.2.2. Interference Measurement
We next analyze the performance changes of these benchmarks in multitenant shared cloud scenarios. Users may deploy different workloads in their cloud virtual machines, which run on a large number of physical servers. Recent studies have shown that interactive services such as websites make up a large part of these cloud applications. At the same time, many users also use cloud VMs to process data-intensive applications such as batch tasks. When deploying SDCBench to simulate these shared cloud scenarios, we focus on the performance degradation of the colocated workloads, which arises when applications running inside the cloud system contend for the underlying hardware resources. This helps us understand how sensitive the SDCBench applications are to interference and the range of performance changes they can measure.
Based on the benchmark characterization of SDCBench in Section 2.3.1, we build four workload colocation suites with different levels of competition for shared resources (Figure 6). These application combinations are as follows. (i) CPU-intensive suite: this suite includes Speech-recog, Alu, and PageRank, which are computation-intensive applications. (ii) Memory-intensive suite: this suite consists of the latency-critical service Solr and the background applications PageRank and DNN model training for ResNet20, which require large amounts of memory. (iii) Hybrid contentions: this suite contains Redis, DiskIO, and DNN model training for AlexNet, which together generate pressure on multidimensional resources. (iv) Symbiotic workloads: this suite includes Social-network, DecisionTree, and DNN training on ResNet20. Unlike the other combinations, these three applications can be colocated without significant performance interference on shared resources. We run these benchmark suites on the local cluster and evaluate their performance changes together with the measurement of system isolation ability.
Q2: Is SDCBench able to observe the service performance degradation due to interference from colocated workloads? Figure 7 shows the latency distributions of the latency-critical services in these colocation workload suites. For the CPU, memory, and hybrid contention suites, the latency under colocation is significantly higher than that of solo-run. The latency distributions of these three suites also become wider at every request load, which means that the performance of these services is less stable. Solr is the most sensitive to the interference caused by colocation with other background workloads. At 50% of the request load, the mean latency of the colocated Solr service is about 70× that of the solo-run, and the latency distribution is also significantly wider. At 100% of the request load, the latency of Solr increases from about 600 ms to 9000 ms. Compared with the other two types of workload, Redis is less sensitive to colocation, but its latency still increases by 8 times at 10% of the request load.
The performance degradation caused by interference is especially obvious when the online service is under high load. As the load of an online service increases, its demand for memory and CPU resources increases, and the competition with background applications becomes more intense, making it more sensitive to interference. For example, at 10% and 50% of the request load, the latency of colocated Speech-recog increases by 2× and 10× compared with solo-run, respectively, and at 100% of the request load it increases by about 100×. For the symbiotic suite, by contrast, the latency distribution widens only slightly. The mean latency of colocated Social-network increases by about 5 ms compared with solo-run at 10% of the request load. However, the mean latency hardly changes at 100% of the request load and even decreases slightly at 50% of the request load. Therefore, when the symbiotic application is combined with other background applications, its performance does not change much compared with the solo-run, which is in line with our expectations. This result shows that the performance degradation in the first three suites is indeed caused by interference between applications, and it also supports the application selection in SDCBench.
Figures 8(a) and 8(b) show an example comparison of system-level and microarchitecture-level metrics for the hybrid-contention workload suite. As Redis already generates high utilization of network and LLC resources, there is little difference in these two metrics between the solo-run and colocation groups. By colocating Redis with DiskIO and AlexNet, the CPU and memory utilizations in the system are improved by 2.7× and 13%, respectively. Meanwhile, the usage of DiskIO in the colocation group increases by 4.28×. However, the IPC and Context-switches metrics decrease by 42% and 87% in the colocation group, since the workload interference significantly reduces the QPS of Redis. Additionally, the workload interference in the colocation group results in higher tail latency for Redis. Compared with the solo-run group, the L1 and L2 cache miss rates increase by 1.7× and 1.3× in the colocation group, respectively. More cache misses also result in a higher memory access rate, and the memory bandwidth usage increases by about 30% in the colocation group. This explains why contention for shared resources leads to performance degradation for latency-critical services.
Q3: Can hardware isolation mechanisms eliminate the performance variations caused by interference? In the evaluation of the first three suites, application performance improves greatly after resource isolation. It can be seen from the figure that the performance of these three services under isolation is very close to that of solo-run. In the CPU-intensive suite, after isolation, the mean latency of Speech-recog at the three request loads decreases by about 20 ms, 90 ms, and 4000 ms, respectively, which brings it almost to the solo-run level, and the performance stability is also very close to that of solo-run. This indicates that workload colocation with resource isolation introduces only limited interference. For the symbiotic application combination, there is no significant difference between resource isolation and colocation. The latency distribution after resource isolation is more concentrated, which means that performance is more stable. However, under 100% request load, the mean latency of Social-network even increases by about 10 ms. This shows that resource isolation can improve application performance to a certain extent in the multitenant cloud scenario, but its effect should be analyzed according to the specific application characteristics.
We also measure the service experience availability and latency entropy of these applications in each group of experiments. Table 7 compares the service experience availability of the colocated latency-critical services, which is calculated with Equations (2) and (3). We define the solo-run group as the best performance of the latency-critical services and set the latency SLO to the latency that the service meets at 100% load. The service experience availability decreases dramatically in the first three experiment groups because of the performance interference between colocated workloads. For example, the experience availability of Speech-recog drops by about 60% at the 10% load level in the colocation group. Its performance becomes even worse when we increase the request arrival rate: only 29% and 16% of the time intervals meet the latency SLO at the 50% and 100% load levels, respectively. Additionally, isolation of hardware resources significantly improves the experience availability of these latency-critical services. For Speech-recog, the experience availability recovers to 0.97, 0.98, and 0.99 at the 10%, 50%, and 100% load levels, respectively. The experience availability changes of Solr and Redis are similar to those of Speech-recog. Social-network shows the smallest performance fluctuations among the evaluated applications.
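The text describes experience availability as the share of time intervals that meet the latency SLO. A minimal sketch of that idea follows. Equations (2) and (3) are not reproduced in this section, so the details here are assumptions: each interval is judged by its 99th percentile latency using the nearest-rank method, and the names `tail_latency` and `experience_availability` are hypothetical.

```python
import math
from typing import Sequence


def tail_latency(samples: Sequence[float], percentile: float = 99.0) -> float:
    """Nearest-rank percentile of one interval's latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


def experience_availability(intervals: Sequence[Sequence[float]], slo_ms: float) -> float:
    """Fraction of time intervals whose tail latency meets the SLO.
    Empty intervals are counted as violations (an assumption)."""
    if not intervals:
        raise ValueError("no intervals to evaluate")
    met = sum(1 for samples in intervals if samples and tail_latency(samples) <= slo_ms)
    return met / len(intervals)


if __name__ == "__main__":
    # Four hypothetical intervals of latency samples (ms) and a 20 ms SLO.
    intervals = [[10, 12, 11], [10, 50, 12], [9, 10, 11], [30, 31, 29]]
    print(experience_availability(intervals, slo_ms=20.0))  # 0.5
```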
We further compare the latency entropy of these experiments, as listed in Table 8. For each workload colocation suite, we record the best and worst latencies in the solo-run group, divide this range into multiple latency intervals, and calculate the probability that a latency sample falls in each interval for the colocation and isolation groups, from which we derive their latency entropy. Social-network has the highest latency entropy in the solo-run group among the four workload colocation suites, because its latency spans multiple microservice invocations and is therefore more sensitive to performance fluctuation. Except for Social-network, the interference caused by shared resource contention increases LE by an average of 13×, 4×, and 2.8×. By isolating the shared resources between these colocated applications, the average LE decreases by about 9.7×, 4.7×, and 2.8× in the Speech-recog, Solr, and Redis groups, respectively. This indicates that the interference between colocated workloads can greatly magnify the system uncertainty, leading to unpredictable performance degradation for applications running in the system. Introducing isolation mechanisms such as hardware partitions can effectively reduce the application performance fluctuation caused by system uncertainty, thus improving the user experience.
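The binning procedure described above can be sketched as follows. Several details are assumptions, since the source does not specify them: the natural logarithm is used, the number of intervals is a free parameter, and samples that fall outside the solo-run range are placed into further intervals of the same width. The function name `latency_entropy` and the chosen range are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def latency_entropy(latencies: Sequence[float], lo: float, hi: float, bins: int) -> float:
    """Shannon entropy (natural log) of latencies binned into fixed-width intervals.
    The interval width is derived from the solo-run range [lo, hi]; samples
    outside that range land in extra intervals of the same width (assumed)."""
    if hi <= lo or bins <= 0 or not latencies:
        raise ValueError("need a non-empty sample, hi > lo, and bins > 0")
    width = (hi - lo) / bins
    counts = Counter(math.floor((x - lo) / width) for x in latencies)
    total = len(latencies)
    entropy = -sum(c / total * math.log(c / total) for c in counts.values())
    return max(entropy, 0.0)  # avoid returning -0.0 for a single state


if __name__ == "__main__":
    # Service A sample from the latency entropy definition; the range 11-41 with
    # 3 intervals is a hypothetical choice that reproduces its three states.
    service_a = [14, 17, 20, 24, 25, 29, 32, 37]
    print(round(latency_entropy(service_a, lo=11, hi=41, bins=3), 2))  # 1.08
```

With this binning, a colocated run whose latencies spill far beyond the solo-run range occupies many more intervals, which is what drives the entropy up.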
3.2.3. Case Study in Public Cloud
Q4: How do the major cloud service providers perform in the latency entropy measurement? One of the key benefits of SDCBench is that it helps users understand their application performance in different public cloud systems. We look for answers in some of the existing public cloud platforms through a simple case study, in which we deploy SDCBench on these platforms to measure their latency entropy. Specifically, our testbeds include the public cloud platforms Huawei Cloud and AWS Cloud and a local prototype system, FlameCluster-II, which is built on the new Labeled von Neumann Architecture (LvNA) CPU architecture that supports better isolation of shared resources than traditional x86-based CPU architectures.
Since the current version of FlameCluster-II only supports the C and Java languages, we choose TPC-W and Redis as the evaluated latency-critical services to measure the EA and LE metrics of the three platforms. For the public cloud platforms, we deploy TPC-W and Redis in individual cloud VMs and collect their latencies at different times of the day. As the VMs of different cloud tenants may be scheduled onto the same server, the user’s application performance can be affected by interference from other tenants’ workloads. For FlameCluster-II, we build an 8-node FPGA cluster and deploy the benchmarks across these nodes. The load generator in these experiments is deployed in an isolated server environment and communicates with the evaluated applications over the network. To reduce the measurement error caused by system noise, we collect the measured data in each experiment three times and take their statistical metrics as the evaluated results.
We collect more than 1,000,000 request latencies in each testbed and measure their EA and LE metrics. Figure 9 compares the average EA and LE of the three platforms. Huawei Cloud, AWS Cloud, and FlameCluster-II achieve 0.94, 0.86, and 0.99 of EA, respectively. This indicates that users may obtain a better performance experience by deploying applications in Huawei Cloud than in AWS Cloud. For LE, the latency entropy of the three platforms is 1.34, 3.4, and 0.29, respectively. Applications in FlameCluster-II have minimal performance fluctuations owing to the strong hardware isolation of the LvNA design, which validates that hardware isolation is a good way to eliminate performance uncertainty in cloud datacenters. The evaluation results also show that applications in Huawei Cloud achieve better performance isolation than those in AWS Cloud. This may be because AWS has adopted more aggressive resource overselling policies across cloud tenants.
4. Conclusion
We have presented SDCBench, a benchmark suite and evaluation methodology for latency entropy measurement in datacenters. SDCBench seeks to help cloud tenants and providers understand the application isolation ability of a datacenter by colocating workloads and observing their performance variation in cloud systems. SDCBench includes 16 representative applications selected from today’s well-known benchmarks across a wide range of cloud scenarios. It introduces the concept of latency entropy and implements a robust methodology to measure the performance isolation ability of datacenters. Our validation results show that SDCBench can simulate different multitenant shared cloud systems with simple configurations. We also compare the latency entropy of today’s major cloud providers by deploying SDCBench in Huawei Cloud, AWS Cloud, and a local prototype system, FlameCluster-II. The evaluation results show that FlameCluster-II achieves the lowest latency entropy, 0.29, while the scores of Huawei Cloud and AWS Cloud are 1.34 and 3.4, respectively.
Data Availability
(1) SDCBench is open source and available at https://github.com/TankLabTJU/sdcbench/tree/sdcbench-v2.0/. (2) The experimental data used to characterize the latency distribution of benchmarks under different workloads have been deposited in the git repository (/materials/sdcbench-data/Characteristics). (3) The experimental data used to characterize the latency degradation caused by interference have been deposited in the git repository (/materials/sdcbench-data/Interference). (4) The experimental data used to describe the system-level and microarchitecture metrics under interference have been deposited in the git repository (/materials/sdcbench-data/metrics.xlsx).
Conflicts of Interest
The authors declare no competing interests.
Acknowledgments
This work is supported by the National Key Research and Development Program of China (No. 2016YFB1000205), the National Natural Science Foundation of China under grants 61872265 and 62141218, and the CCF-Huawei Populus euphratica Innovation Research Funding CCF2021-admin-270-202104.
References
- M. Ferdman, A. Adileh, O. Kocberber et al., “Clearing the clouds: a study of emerging scale-out workloads on modern hardware,” in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, pp. 37–48, London, UK, March 3-7, 2012.
- E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini, “Resource central: understanding and predicting workloads for improved resource management in large cloud platforms,” in Proceedings of the 26th Symposium on Operating Systems Principles, pp. 153–167, Shanghai, China, October 28-31, 2017.
- J. Zhang, X. Wang, H. Huang, and S. Chen, “Clustering based virtual machines placement in distributed cloud computing,” Future Generation Computer Systems, vol. 66, pp. 1–10, 2017.
- D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, “Improving resource efficiency at scale with heracles,” ACM Transactions on Computer Systems (TOCS), vol. 34, no. 2, pp. 1–33, 2016.
- S. Govindan, J. Liu, A. Kansal, and A. Sivasubramaniam, “Cuanta: quantifying effects of shared on-chip resource interference for consolidated virtual machines,” in ACM Symposium on Cloud Computing in conjunction with SOSP 2011, SOCC ‘11, Cascais, Portugal, October 26-28, 2011.
- C. Delimitrou and C. Kozyrakis, “Hcloud: resource-efficient provisioning in shared cloud systems,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2016, pp. 473–488, Atlanta, GA, USA, April 2-6, 2016.
- H. Yang, A. D. Breslow, J. Mars, and L. Tang, “Bubble-flux: precise online qos management for increased utilization in warehouse scale computers,” in The 40th Annual International Symposium on Computer Architecture, ISCA’13, pp. 607–618, Tel-Aviv, Israel, 2013.
- J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
- Z. Xu and C. Li, “Low-entropy cloud computing systems,” SCIENTIA SINICA Informationis, vol. 47, no. 9, pp. 1149–1163, 2017.
- M. Tirmazi, A. Barker, N. Deng et al., “Borg: the next generation,” in EuroSys ‘20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 27-30, 2020.
- Q. Liu and Z. Yu, “The elasticity and plasticity in semi-containerized co-locating cloud workload: a view from alibaba trace,” in Proceedings of the ACM Symposium on Cloud Computing, SoCC 2018, pp. 347–360, Carlsbad, CA, USA, October 11-13, 2018.
- L. A. Barroso and U. Hölzle, “The case for energy-proportional computing,” Computer, vol. 40, no. 12, pp. 33–37, 2007.
- I. A. Papadakis, K. Nikas, V. Karakostas, G. I. Goumas, and N. Koziris, “Improving qos and utilisation in modern multi-core servers with dynamic cache partitioning,” in Proceedings of the Joined Workshops COSH 2017 and VisorHPC 2017, COSH/VisorHPC@HiPEAC 2017, Stockholm, Sweden, January 24, 2017.
- J. Ma, X. Sui, N. Sun et al., “Supporting differentiated services in computers via programmable architecture for resourcing-on-demand (PARD),” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2015, pp. 131–143, Istanbul, Turkey, March 14-18, 2015.
- C. Iorgulescu, R. Azimi, Y. Kwon et al., “Perfiso: performance isolation for commercial latency-sensitive services,” in 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018.
- S. Baset, M. Silva, and N. Wakou, “SPEC cloud™ iaas 2016 benchmark,” in Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ICPE 2017, L’Aquila, Italy, April 22-26, 2017.
- H. Kasture and D. Sanchez, “Tailbench: a benchmark suite and evaluation methodology for latency-critical applications,” in 2016 IEEE International Symposium on Workload Characterization, IISWC 2016, pp. 3–12, Providence, RI, USA, September 25-27, 2016.
- W. Gao, J. Zhan, L. Wang et al., “Bigdatabench: a dwarf-based big data and AI benchmark suite,” 2018, http://arxiv.org/abs/1802.08254.
- F. P. Tso, K. Oikonomou, E. Kavvadia, and D. P. Pezaros, “Scalable traffic-aware virtual machine management for cloud data centers,” in IEEE 34th International Conference on Distributed Computing Systems, ICDCS 2014, pp. 238–247, Madrid, Spain, July 3, 2014.
- X. Li, J. Wu, S. Tang, and S. Lu, “Let’s stay together: towards traffic aware virtual machine placement in data centers,” in 2014 IEEE Conference on Computer Communications, INFOCOM 2014, pp. 1842–1850, Toronto, Canada, April 27 - May 2, 2014.
- J. Tordsson, R. S. Montero, R. Moreno-Vozmediano, and I. M. Llorente, “Cloud brokering mechanisms for optimized placement of virtual machines across multiple providers,” Future generation computer systems, vol. 28, no. 2, pp. 358–367, 2012.
- Q. Chen, J. Yao, and Z. Xiao, “LIBRA: lightweight data skew mitigation in mapreduce,” IEEE Transactions on parallel and distributed systems, vol. 26, no. 9, pp. 2520–2533, 2015.
- J. J. Dongarra and P. Luszczek, “LINPACK benchmark,” in Encyclopedia of Parallel Computing, D. A. Padua, Ed., pp. 1033–1036, Springer, 2011.
- C. D. Spradling, “SPEC CPU2006 benchmark tools,” ACM SIGARCH Computer Architecture News, vol. 35, no. 1, pp. 130–134, 2007.
- P. R. Luszczek, D. H. Bailey, J. J. Dongarra et al., “S12 - the HPC challenge (HPCC) benchmark suite,” in Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, p. 213, Tampa, FL, USA, 2006.
- D. A. Padua, “PARSEC benchmarks,” in Encyclopedia of Parallel Computing, D. A. Padua, Ed., p. 1464, Springer, 2011.
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” in Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010, pp. 143–154, Indianapolis, Indiana, USA, June 10-11, 2010.
- G. Cloud, “Perfkit,” 2017. [Online]. Available: https://github.com/GoogleCloudPlatform/PerfKitBenchmarker.
- A. Sriraman and T. F. Wenisch, “μSuite: a benchmark suite for microservices,” in 2018 IEEE International Symposium on Workload Characterization, IISWC 2018, pp. 1–12, Raleigh, NC, USA, September 30 - October 2, 2018.
- Y. Gan, Y. Zhang, D. Cheng et al., “An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, pp. 3–18, Providence, RI, USA, April 13-17, 2019.
- P. Mattson, V. J. Reddi, C. Cheng et al., “MLPerf: an industry standard benchmark suite for machine learning performance,” IEEE Micro, vol. 40, no. 2, pp. 8–16, 2020.
- T. Yu, Q. Liu, D. Du et al., “Characterizing serverless platforms with serverlessbench,” in SoCC ‘20: ACM Symposium on Cloud Computing, Virtual Event, pp. 30–44, USA, October 19-21, 2020.
- L. A. Barroso, U. Hölzle, and P. Ranganathan, “The Datacenter as a Computer: Designing Warehouse-Scale Machines,” in Third Edition, ser. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2018.
- D. Krushevskaja and M. Sandler, “Understanding latency variations of black box services,” in 22nd International World Wide Web Conference, WWW ‘13, pp. 703–714, Rio de Janeiro, Brazil, May 13-17, 2013.
- C. Delimitrou and C. Kozyrakis, “Quasar: resource-efficient and QoS-aware cluster management,” in Architectural Support for Programming Languages and Operating Systems, ASPLOS 2014, pp. 127–144, Salt Lake City, UT, USA, March 1-5, 2014.
- M. Abadi, P. Barham, J. Chen et al., “TensorFlow: a system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016.
- A. Paszke, S. Gross, F. Massa et al., “PyTorch: an imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, December 8-14, 2019.
- M. Li, J. Tan, Y. Wang, L. Zhang, and V. Salapura, “SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics,” Cluster Computing, vol. 20, no. 3, pp. 2575–2589, 2017.
- H. Yuan and C. Wang, “A human action recognition algorithm based on semi-supervised kmeans clustering,” Trans. Edutainment, vol. 6758, pp. 227–236, 2011.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, United States, December 3-6, 2012.
- “Redis: an open source, in-memory data structure store,” 2019. [Online]. Available: https://redis.io/.
- “Solr is the popular, blazing-fast, open source enterprise search platform built on apache lucene,” 2019. [Online]. Available: https://www.elastic.co.
- L. Velikovich, I. Williams, J. Scheiner, P. S. Aleksic, P. J. Moreno, and M. Riley, “Semantic lattice processing in contextual automatic speech recognition for google assistant,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, pp. 2222–2226, Hyderabad, India, 2-6 September 2018.
- D. A. Menascé, “TPC-W: a benchmark for e-commerce,” IEEE Internet Computing, vol. 6, no. 3, pp. 83–87, 2002.
- J. R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 2014.
- Y. Ding, E. Yan, A. R. Frazho, and J. Caverlee, “PageRank for ranking authors in co-citation networks,” Journal of the Association for Information Science and Technology, vol. 60, no. 11, pp. 2229–2243, 2009.
- M. Zaharia, M. Chowdhury, T. Das et al., “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25-27, 2012.
- J. Kim and K. Lee, “FunctionBench: a suite of workloads for serverless cloud function service,” in 12th IEEE International Conference on Cloud Computing, CLOUD 2019, pp. 502–504, Milan, Italy, July 8-13, 2019.
- Y. Wang, G. Wei, and D. Brooks, “Benchmarking TPU, GPU, and CPU platforms for deep learning,” 2019, http://arxiv.org/abs/1907.10701.
- M. Christandl, P. Vrana, and J. Zuiddam, “Barriers for fast matrix multiplication from irreversibility,” Journal of Chemical Theory and Computation, vol. 17, no. 1, pp. 1–32, 2021.
- Y. Cao, L. Zhao, R. Zhang, Y. Yang, X. Zhou, and K. Li, “Experience-availability analysis of online cloud services using stochastic models,” in 17th International IFIP TC6 Networking Conference, Networking 2018, pp. 478–486, Zurich, Switzerland, May 14-16, 2018.
- B. Cai, R. Zhang, X. Zhou, L. Zhao, and K. Li, “Experience availability: tail-latency oriented availability in software-defined cloud computing,” Journal of Computer Science and Technology, vol. 32, no. 2, pp. 250–257, 2017.
- H. Fuchs, M. D’Anna, and F. Corni, “Entropy and the experience of heat,” Entropy, vol. 24, no. 5, p. 646, 2022.
- Docker Inc., “Docker homepage,” 2019. [Online]. Available: https://www.docker.com/.
- “Numactl,” 2019. [Online]. Available: https://github.com/numactl/numactl.
- M. A. Brown, “Traffic control HOWTO,” 2015. [Online]. Available: http://linux-ip.net/articles/Traffic-Control-HOWTO/.
- Y. Zhang, D. Meisner, J. Mars, and L. Tang, “Treadmill: attributing the source of tail latency through precise load testing and statistical inference,” in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, pp. 456–468, Seoul, South Korea, June 18-22, 2016.
- “Perf tool,” 2014. [Online]. Available: https://perf.wiki.kernel.org/.
- S. Kanev, K. M. Hazelwood, G. Wei, and D. M. Brooks, “Tradeoffs between power management and tail latency in warehouse-scale applications,” in 2014 IEEE International Symposium on Workload Characterization, IISWC 2014, pp. 31–40, Raleigh, NC, USA, October 26-28, 2014.
- X. Jin, Y. Zhou, B. Huang et al., “QoSMT: supporting precise performance control for simultaneous multithreading architecture,” in Proceedings of the ACM International Conference on Supercomputing, ICS 2019, pp. 206–216, Phoenix, AZ, USA, June 26-28, 2019.
Copyright © 2022 Yanan Yang et al. Exclusive Licensee Zhejiang Lab, China. Distributed under a Creative Commons Attribution License (CC BY 4.0).