
Research Article | Open Access

Volume 2022 | Article ID 9795476

Chuanqi Zhang, Sa Wang, Zihao Yu, Huizhe Wang, Yinan Xu, Luoshan Cai, Dan Tang, Ninghui Sun, Yungang Bao, "A Labeled Architecture for Low-Entropy Clouds: Theory, Practice, and Lessons", Intelligent Computing, vol. 2022, Article ID 9795476, 14 pages, 2022.

A Labeled Architecture for Low-Entropy Clouds: Theory, Practice, and Lessons

Received 06 Jul 2022
Accepted 09 Aug 2022
Published 01 Sep 2022


Resource efficiency and quality of service (QoS) have both been long-pursued goals for cloud providers over the last decade. However, hardly any cloud platform achieves both even today. Improving resource efficiency or resource utilization often causes complicated contention between colocated cloud applications on different resources, spanning from the underlying hardware to the software stack, and leads to unexpected performance degradation. The low-entropy cloud proposes a new software-hardware codesigned technology stack to holistically curb performance interference from the bottom up and obtain both high resource efficiency and high quality of application performance. In this paper, we introduce a new computer architecture for the low-entropy cloud stack, called labeled von Neumann architecture (LvNA), which incorporates a set of label-powered control mechanisms that enable shared on-chip components and resources to differentiate, isolate, and prioritize user-defined application requests when they compete for hardware resources. With these mechanisms, LvNA can protect the performance of selected applications, such as latency-critical applications, from disorderly resource contention while improving resource utilization. We further build and tape out Beihai, a 1.2 GHz 8-core RISC-V processor based on LvNA. The evaluation results show that Beihai reduces the performance degradation caused by memory bandwidth contention from 82.8% to 0.4%. While improving CPU utilization by over 70%, Beihai reduces the 99th-percentile tail latency of Redis from 115 ms to 18.1 ms. Furthermore, Beihai can realize hardware virtualization, booting two unmodified virtual machines concurrently without the intervention of any software hypervisor.

1. Introduction

For decades, the resource efficiency of data centers has been a long-pursued goal for providers. However, a major and representative category of data center workloads, latency-critical (LC) applications, keeps providers from reaching this goal. LC applications often consume moderate resources but must still reserve substantial resources for intermittent request spikes, which accumulates into a large amount of stranded resources in data centers. Meanwhile, LC applications are sensitive to resource utilization. Researchers and practitioners have attempted to colocate LC applications with best-effort (BE) applications to improve the overall resource utilization of data centers, but colocation has been shown to cause severe performance interference, resulting in long tail latency for LC applications. Therefore, obtaining both high resource efficiency and application QoS has become a key challenge for cloud providers.

State-of-the-art methods devote significant effort to striking a better trade-off between resource efficiency and application QoS. Since LC applications are usually first-class citizens in data centers, cloud providers would rather sacrifice resource efficiency for the QoS of LC applications. To break this bias and realize a better trade-off, cloud systems should be able to identify, diagnose, and resolve resource contention that leads to performance interference at every level of the cloud stack, from hardware to software. Once they can curb all the sources of performance interference and guarantee the QoS of LC applications, they can steadily improve resource utilization by colocating more applications. However, the sources of performance interference are complex. Prior work has attempted to eliminate resource contention at various points, from the interrupt handler, OS scheduler [1], and hypervisor scheduler [2, 3], to the network stack [4–7], global file system [8], and application locks [9]. Despite these efforts, to the best of our knowledge, no holistic solution that addresses this problem thoroughly from hardware to software has yet been developed.

Low-entropy cloud (LEC) [10], proposed by Xu and Li, is the first hardware-software codesigned full-stack solution designed to improve the overall resource utilization of data centers while guaranteeing the QoS of applications. Entropy is a term from the second law of thermodynamics and is a measure of the disorder of a system [11]. LEC uses entropy to describe the extent to which the various resources within a cloud system are shared by different applications in a disorderly way. LEC systems are designed to address disorderly resource sharing, curb performance interference, and realize high resource efficiency. The LEC stack contains a set of new systems, including labeled architecture, labeled storage containers, labeled network stacks, and labeled operating systems.

In this study, we present labeled architecture as a new computer architecture for LEC. The key idea of labeled architecture is inspired by software-defined networking (SDN), in which the control plane is decoupled from the data plane so that the control plane can be made programmable. We observe that a computer can be viewed as a network and that the principle of SDN can be applied to computer architecture. Therefore, we propose labeled von Neumann architecture (LvNA), a programmable architecture for LEC that provides a new programming interface to convey user-defined high-level information, such as QoS requirements, to various hardware components. LvNA attaches a high-level semantic label to each memory access and enables fine-grained control mechanisms on disorderly shared hardware resources such as the LLC and memory bandwidth, which significantly reduces the uncertainty and inefficiency of resource contention between colocated applications.

We propose three key concepts that LvNA should follow. (1) Label-powered control mechanisms realize differentiation, isolation, and prioritization (DIP) in hardware; LvNA thereby validates the DIP Thesis. (2) LvNA enables hardware resources to be scheduled and managed with bus-cycle accuracy: resource allocation between applications can be adjusted at bus-cycle resolution. (3) LvNA should introduce minimum pollution to the original design.

We tape out Beihai, a 1.2 GHz 8-core RISC-V processor based on LvNA. Beihai is the first labeled RISC-V prototype chip; it is fabricated in a TSMC 28 nm process and has a 16 KB L1I cache, a 16 KB L1D cache, and a 2 MB L2 cache. Beihai can boot a standard Linux operating system and run applications such as Redis. Beihai demonstrates the effectiveness of label-powered control mechanisms.

We conduct a series of experiments, and the results show that Beihai reduces the performance degradation caused by memory bandwidth contention with streaming applications from 82.8% to 0.4%. The label-powered control mechanisms reduce the 99th-percentile tail latency of the LC application Redis from 115 ms to 18.1 ms while also improving the CPU utilization of the entire server by over 70%. We also present a case study called NoHype virtualization, realized by the label mechanism: a type of hardware virtualization that can boot unmodified virtual machines directly on bare-metal hardware without the intervention of a software hypervisor.

Overall, this work makes the following contributions:
(i) We propose the concept of LvNA as a new computer architecture for LEC that borrows the wisdom of software-defined networking to enhance the control ability of computer architecture.
(ii) We construct labeled RISC-V, the first open-source labeled architecture, which demonstrates the effectiveness of the label mechanism: it can differentiate, isolate, and prioritize application requests when they compete for hardware resources.
(iii) We tape out a test chip, Beihai, based on labeled RISC-V. The experimental results show that Beihai realizes fully hardware-supported virtualization and greatly reduces performance degradation under memory bandwidth contention.

2. Materials and Methods

2.1. Background
2.1.1. From Shared Cloud to Low-Entropy Cloud

(1) Shared Cloud. Cloud computing has emerged and evolved over the last decade. From the very beginning, cloud computing was proposed to reduce the total cost of ownership (TCO) of small companies, which could purchase a certain amount of computing and storage resources to deploy their services without operating the real servers themselves. From the perspective of cloud providers, their TCO heavily relies on resource efficiency and QoS of applications. Improving resource efficiency could save more resources for profit and lower the cost per resource unit. However, squeezing the resource share allocated to each application would definitely affect the performance of sensitive applications among them.

(2) Resource Stranding. Overprovisioning is a common resource management practice among shared cloud providers, and it leads to the resource stranding problem described by Vahdat [12]. Overprovisioning is a decision to trade resource efficiency for application QoS. Cloud providers cannot tolerate complaints about the performance of LC applications. Therefore, they reserve far more resources than needed to guarantee the performance of LC applications, which leaves a large amount of resources stranded and idle in data centers. Resource stranding decreases resource utilization and worsens as the scale of data centers continues to increase.

(3) Low-Entropy Cloud (LEC). Xu and Li [10] proposed LEC as the next generation of cloud computing, which aims to both realize high resource efficiency and guarantee the QoS of applications. The term entropy is borrowed from the second law of thermodynamics, which is a measure of the disorder of a system. The higher the entropy, the greater the disorder of a system. They aim to leverage the concept of entropy to describe the disorder of cloud computing systems. High entropy indicates that a cloud computing system suffers from disordered resource sharing. Disordered resource sharing causes performance interference between different applications, which negatively affects user experiences. LEC aims to decrease the entropy inside cloud computing systems. LEC includes a software-hardware codesigned full-stack solution, which was designed to curb the disorder in cloud computing systems and guarantee the performance of cloud applications.

LvNA, presented in this paper, is an alternative computer architecture proposed for LEC. LvNA attempts to identify and eliminate the disordered sharing on chips and I/O devices and guarantee the performance of applications.

2.1.2. Disorderly Resource Sharing in Hardware

In the traditional multicore architecture, multiple applications can colocate together and share the hardware resources, including CPU cycles, last-level cache (LLC), memory bandwidth, and I/O bandwidth. When an application cannot fully utilize the hardware resources, running another application simultaneously can make good use of the remaining resources.

However, when the resources are under severe contention, colocation may introduce performance interference between applications. Many studies have been conducted on this “disorderly sharing” phenomenon. When two cache-sensitive applications run separately, they both run well. However, when they run together on the same machine, they both suffer unpredictable performance degradation of up to a factor of 3.3 [13]. Research on Google data centers shows that this interference exists across shared resources, including hyperthreading, LLC, DRAM, and network [14]. To mitigate interference, the mainstream solutions adopted by industry use software techniques to schedule heavily contending applications onto different physical machines.

The root cause of disorderly sharing is that the low-level hardware lacks software semantic information. The hardware controllers of shared resources cannot determine which source a request originated from. This results in the controllers being unable to determine how critical a request is, which is important for the controllers to guarantee the QoS of critical applications. Therefore, we consider that a method should be developed to convey the software semantic information down to hardware, in order to mitigate resource contentions at the hardware level.

2.1.3. DIP Thesis

Together with LEC, Xu and Li [10] proposed an observation called the DIP Thesis. To curb disorderly resource sharing and realize a low-entropy cloud, a cloud computing system must support three properties:
(i) (D) Differentiation: different tasks must have different task labels.
(ii) (I) Isolation: resources used by one task must be isolated from other tasks.
(iii) (P) Prioritization: tasks can have different levels of priority, which are enforced.

The design of LvNA follows the observation of the DIP Thesis, which designs hardware labels to differentiate different tasks, processes, or virtual machines when they access shared hardware resources in a disorderly manner and integrates a set of isolation mechanisms and scheduling policies to enforce the order of hardware resource sharing between different tasks.

2.2. Labeled Architecture
2.2.1. Definition of LvNA

Before introducing LvNA, we first review the classic general model of computers, the stored program architecture, also known as the von Neumann model or von Neumann architecture, which has the following five characteristics [15]:
(i) Binary: data and instructions are encoded with binary representations.
(ii) P-M-I/O: the computer hardware comprises three interconnected components: processor, memory, and I/O devices.
(iii) Stored programs: both programs and data are stored in the memory and accessed by the processor.
(iv) Instruction driven: the computer changes its state (the contents of its memory and registers) only when an instruction is executed.
(v) Serial execution: a computational process is a serial-execution process. Any program is executed by automatically executing one instruction after another.

We define LvNA as a new extended von Neumann architecture that inherits the five basic characteristics above and adds one new characteristic:
(vi) Label-guided resource access control: the access permission of computer resources, including processor, memory, I/O devices, and the data paths between them, can be changed instantly by user-defined labels at any time. User-defined labels can guide the computer to adjust the spatial and temporal resource access of all labeled processes.

LvNA adds a set of label-powered control mechanisms to the classic von Neumann architecture to strengthen its ability to control access to all types of computer resources. As illustrated in Figure 1, data access requests from processors to memory or I/O devices in the classic von Neumann architecture flow arbitrarily through the data paths and memory hierarchy (e.g., L1 cache, L2 cache, and memory bandwidth) under best-effort delivery. In contrast, LvNA attaches a semantic label to every data access request, and each request carries its label all the way to its destination. Labels determine when a request may access a resource and how much of it the request may occupy, instead of letting all requests compete for all kinds of resources in a disorderly fashion. With these labels, the resources and components inside the computer can recognize, isolate, and prioritize high-priority applications, such as LC applications, when they suffer from disorderly resource sharing, and thus ensure their QoS. Therefore, we propose LvNA as a new architecture for shared clouds.

2.2.2. Label Design

Labels are the core concept of LvNA, and LvNA mainly leverages labels to realize the control mechanisms and the DIP Thesis. At a high level, a label should at least contain two kinds of information (Figure 2).

(1) Identification. The first and foremost information a label should convey is identification (a.k.a. ID), which enables the resources or components in the computer architecture to determine which application, process, or thread issued a given access request. The ID is similar to the PID in Linux, through which users can look up, trace, and kill a particular process.

Naturally, the identification or ID is designed as a string of numbers. Due to limited hardware resources, the ID needs to be set to a maximum length. The longer the ID is, the more applications (or application categories) can be distinguished. This enables more precise performance control. However, it also increases hardware resource overhead and renders maintaining control policies more challenging.

Generally, it is more practical to allocate a unique ID to each application category than to each individual application. For example, in SDN, the DiffServ field of the IP header plays a role in the network domain similar to that of the label. The 6-bit DiffServ field can theoretically represent 64 application categories, but only 14 commonly used categories are defined in RFC 2475 [16]. The scheduler in Alibaba’s data center classifies applications into only two categories, high and low priority; in that case, the ID requires only a single bit.

Given these considerations, the labeled RISC-V implementation sets the length of the ID to 3–4 bits, which is sufficient to classify applications.

(2) Policy. Once the component knows which application a request belongs to, the label should supply additional information that tells resources or components how to handle the request. For example, a component may handle the request immediately or stall it until other requests have finished. The policy can be a trigger-action rule, a sophisticated algorithm, or a simple priority hierarchy, according to users’ needs.

One option is to carry a QoS regulation policy in the label itself, so that the control logic regulates each request directly according to the policy the request carries. This method saves the cost of looking up the control policy by label. However, the more information the control policy must represent, the higher the hardware resource cost. Moreover, if each control logic uses a different control policy, the label must carry a policy for every control logic, which further increases the hardware cost. Therefore, this method is only applicable where there is a single regulation strategy or where the label cost is unimportant (e.g., a pure-software label implementation).

2.2.3. Label Control Procedure

Figure 3 provides an overview of the label control procedure of LvNA, which can be divided into three steps.

(1) Label Registration. Each application should register and attach a group label through the interface of the operating system. As discussed previously, allocating every application a unique ID for differentiation is impractical. Administrators should categorize the applications into groups (e.g., an LC applications group and a BE applications group) and attach a group label to each application for underlying label management.

(2) Label Transmission. After the application registers a label and begins to run, the operating system and LvNA cooperate to convey the label information from software to the underlying hardware components and I/O devices. The two parts of a label, namely, ID and policy, are transmitted separately. The ID is attached to the process and passed through the scheduler and the cores to the requests within the hardware, where it is used to differentiate requests. The policy, by contrast, is written directly into the control logic on hardware through the software interface.

(3) Label Control. Each data access request in LvNA carries an ID all the way to the underlying memory hierarchy and I/O devices. On each hardware component, the request triggers the user-defined QoS policy retrieved by its ID, which realizes label control. Taking the last-level cache (LLC) as an example, once a data access request arrives at the LLC, the control logic of the LLC retrieves the QoS policy of the request by its ID. With this QoS policy, the control logic determines how much of the LLC the request can use.
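The three steps can be sketched in software as a label-indexed policy table. The following Python model is only an illustration: the class and field names (`QosPolicy`, `ControlLogic`, `priority`) are hypothetical, and the waymask value for the LC group is an assumed example, not a value taken from the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosPolicy:
    """Hypothetical per-label policy; the fields are illustrative only."""
    l2_waymask: int  # LLC ways the label may allocate into
    priority: int    # larger value is served first (assumed convention)


class ControlLogic:
    """Toy model of one hardware component's label-indexed policy table."""

    def __init__(self):
        self._policies = {}  # maps label ID -> QosPolicy

    def write_policy(self, label_id, policy):
        # Policy path of step 2: software writes the policy into the control logic.
        self._policies[label_id] = policy

    def handle(self, label_id):
        # Step 3: the request carries only its ID; the policy is looked up here.
        return self._policies[label_id]


# Step 1: the administrator groups applications and assigns group labels.
LC_GROUP, BE_GROUP = 0, 1

llc = ControlLogic()
llc.write_policy(LC_GROUP, QosPolicy(l2_waymask=0x0FFF, priority=1))
llc.write_policy(BE_GROUP, QosPolicy(l2_waymask=0xF000, priority=0))

# A request labeled BE_GROUP arriving at the LLC is limited to its own ways.
print(hex(llc.handle(BE_GROUP).l2_waymask))
```

Keeping only the ID in the request and the policy in the component mirrors the split described above: the per-request cost stays at a few bits, while the policy can be as rich as each component needs.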

2.2.4. Three Key Concepts of LvNA

We propose LvNA as a general computer architecture for LEC. LvNA can be implemented in many different ways. Further, we make no assumption about the ISA, microarchitecture, etc. Below, we describe three key concepts of LvNA.

(1) Labels Enable DIP Thesis. In LvNA, labels are attached to all data access requests, used to identify the application (or category of application) from which the request comes, and are propagated throughout the computer system along with the data access request. Hence, bus and shared hardware components can distinguish requests from different applications (or application classes) by examining the label of the data access request, thereby supporting the differentiation property (D property).

Furthermore, the bus and shared hardware components can isolate the space resources (e.g., cache and memory address space) accessed by request based on differentiating the sources of data access requests to slow down or eliminate the interference caused by the sharing conflicts of space resources, thus supporting the isolation property (I property).

In addition, bus and shared hardware components can prioritize performance resources (e.g., queues and bandwidth) used by requests based on source differentiation of data access requests to mitigate or eliminate interference caused by sharing conflicts of performance resources, thus supporting the prioritization property (P property).

(2) Bus-Cycle Performance Adjustments. The control logic is the performance control part of the labeled architecture. It implements different performance control strategies for corresponding data access requests according to preset rules based on labels. These performance control strategies are software programmable and can even be adjusted according to the actual situation of system operation.

Because the control logic can operate at the hardware bus frequency, LvNA allows fine-grained performance regulation at the bus-cycle level and can apply a different regulation strategy to each incoming data access request in every bus cycle. LvNA therefore offers better performance control for LC applications than traditional time-slice-based control strategies at the operating system level.

(3) Minimum Pollution Principle. Labeled architecture enhancements to hardware do not alter the semantics of existing instructions and are therefore non-intrusive to the software stack, without modifying the operating system and applications. In addition, the labeled architecture does not depend on or alter the architecture of the processor pipeline. Hence, it can be applied to any processor.

The hardware changes made by LvNA are the least intrusive way to satisfy the DIP properties. Without labels, the bus and shared hardware components cannot identify the source of data access requests, and the system has no differentiation property. Without the control logic, the bus and shared hardware components cannot treat data access requests differently, and the system has neither the isolation nor the prioritization property.

2.3. Labeled RISC-V Implementation

Based on LvNA, we construct labeled RISC-V, the first open-source labeled architecture, which uses the Rocket Chip [17] open-source System-on-Chip design as its base system. Following the principle of minimum pollution, we modify the Rocket Chip design as little as possible to realize the label mechanism. Compared with the native RISC-V system, the modifications in labeled RISC-V fall into the hardware stack and the software stack. In hardware, we modify the processor core to add label registers, use a remapping logic to support memory partition, and turn the bus into a labeled bus that transmits labeled requests. Where resource contention occurs (e.g., at the LLC and memory bandwidth), we add the corresponding control logic, together with a control plane that manages the control logic and collects statistics. In software, mainly in the operating system, we add the label structure to the process metadata and modify the scheduler, and we add a label management interface and a control policy interface for system administrators.

2.3.1. Label Implementation

The labels are divided into VM labels and process-level labels. The VM label is the identifier of the VM and can be configured through an external interface. The process-level label is configured through the Cgroups [18] mechanism of the Linux system. It is recorded in the process control block (PCB, task_struct in the Linux system) and written to a new, specially defined control and status register in the core during a context switch, so that the hardware can identify which process the current instruction stream and its requests belong to. The concatenation of the VM label and the process label is the actual label in the hardware system.

We extended Rocket Chip’s internal bus protocol, TileLink [19], which interconnects the bus modules, by adding label fields that carry label information on channels A (request), B (coherence probing), and C (cache release). Thus, when a request from the core reaches the L2 cache, it can still be identified as coming from a particular process in a particular VM. Because VM labels and process labels have limited widths, the system supports only a bounded number of labels. Since the number of labels grows exponentially with the label width, the proposed approach currently supports 5-bit labels on octo-core systems, for a total of 32 labels.
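As a minimal sketch of the concatenation, the following Python snippet builds a 5-bit hardware label from a VM label and a process label. The text gives only the total width; the 3-bit/2-bit split and the choice of placing the VM label in the high bits are assumptions made for illustration.

```python
VM_LABEL_BITS = 3    # assumed: enough for one VM per core on 8 cores
PROC_LABEL_BITS = 2  # assumed: the remaining bits of the 5-bit label
LABEL_BITS = VM_LABEL_BITS + PROC_LABEL_BITS
NUM_LABELS = 1 << LABEL_BITS  # label count doubles with every extra bit


def hardware_label(vm_label, proc_label):
    """Concatenate the VM label (high bits) and process label (low bits)."""
    if not 0 <= vm_label < (1 << VM_LABEL_BITS):
        raise ValueError("VM label out of range")
    if not 0 <= proc_label < (1 << PROC_LABEL_BITS):
        raise ValueError("process label out of range")
    return (vm_label << PROC_LABEL_BITS) | proc_label


print(NUM_LABELS)            # total number of distinct hardware labels
print(hardware_label(0, 1))  # VM 0, process label 1
```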

2.3.2. Control Plane Implementation

The control plane is an independent module under the system bus. The implementation of the control plane is closely related to each control and access function, so we only introduce the implementation of the control plane read-write interface framework in the present work.

To access the registers in the control plane through bus access, a TileLink node is generated in the control plane to interact with the system bus, and address space 0x10000 to 0x1FFFF is allocated and accessible by OS through MMIO.

The registers in the address space are of the following types:
(1) A label selector, used to specify the label whose registers are to be operated on next.
(2) Parameters corresponding to each control logic.
(3) Statistics registers, which hold the statistical data of the corresponding label.
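The selector-based layout can be modeled as a small register file in which one write picks the label and subsequent accesses hit that label's parameter and statistics registers. In the Python sketch below, only the 0x10000 to 0x1FFFF window comes from the text; the register offsets, the single waymask parameter, and the traffic counter are hypothetical.

```python
class ControlPlane:
    """Toy model of a selector-indexed register file (offsets are assumed)."""

    BASE, LIMIT = 0x10000, 0x1FFFF           # MMIO window given in the text
    SEL, WAYMASK, TRAFFIC = 0x0, 0x8, 0x10   # hypothetical register offsets

    def __init__(self, num_labels=32):
        self.selector = 0
        self.waymask = [0xFFFF] * num_labels  # per-label parameter
        self.traffic = [0] * num_labels       # per-label statistic

    def _offset(self, addr):
        if not self.BASE <= addr <= self.LIMIT:
            raise ValueError("address outside the control-plane window")
        return addr - self.BASE

    def write(self, addr, value):
        off = self._offset(addr)
        if off == self.SEL:
            if not 0 <= value < len(self.waymask):
                raise ValueError("no such label")
            self.selector = value
        elif off == self.WAYMASK:
            self.waymask[self.selector] = value & 0xFFFF
        else:
            raise ValueError("read-only or unmapped register")

    def read(self, addr):
        off = self._offset(addr)
        if off == self.SEL:
            return self.selector
        if off == self.WAYMASK:
            return self.waymask[self.selector]
        if off == self.TRAFFIC:
            return self.traffic[self.selector]
        raise ValueError("unmapped register")


cp = ControlPlane()
cp.write(0x10000, 1)       # select label 1
cp.write(0x10008, 0xF000)  # set its waymask parameter
```

A selector keeps the address footprint independent of the number of labels, at the cost of making each parameter update a two-step sequence.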

2.3.3. Memory Partition and Remapping

The memory partition in the Beihai system transparently maps the address space of the physical core in a VM to different continuous intervals of physical memory. Transparency implies that the system running on the VM is not aware of the mapping. For example, operating systems running in different memory partitions assume that the starting address of the physical memory is the same.

Compared with a segmentation mechanism, the memory partition scheme is coarse and inflexible. However, it is simple to implement and has little impact on the original hardware system. In addition, for MMIO access to peripherals, we still differentiate VMs by specifying different mapping addresses for each of them. For example, for UART devices, VM 0 accesses address 0x60000000 and VM 1 accesses address 0x60001000. This is done by reading hard-coded numbers from the cores (the mhartid register in RISC-V) or by specifying the addresses to the system in a device tree file.

The memory partition is configured by redirecting the core to a non-memory code space (e.g., boot ROM or Debug ROM) before the system starts and then setting the allocated memory space through the control plane: MemBase determines the base address of the partition, and MemMask limits its size. The system image is then written to this area, and the core is redirected to the memory starting address to begin execution. The mapping is implemented in both Rocket Chip’s ICache and DCache: after a request from the core pipeline passes through the TLB to obtain a physical address, the response of the TLB is modified to (paddr & memMask) | memBase.
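The remapping expression can be checked with a few lines of Python, reading it as a bitwise AND with the mask followed by a bitwise OR with the base. The concrete numbers below (a guest-visible memory start of 0x80000000, 256 MiB partitions, and the two base addresses) are assumptions chosen for illustration, not values taken from Beihai.

```python
def remap(paddr, mem_base, mem_mask):
    """Map a guest-visible physical address into the VM's memory partition."""
    return (paddr & mem_mask) | mem_base


MEM_MASK = 0x0FFFFFFF  # assumed: keeps the low 28 bits, i.e. a 256 MiB partition
VM0_BASE = 0x80000000  # assumed base of VM 0's partition
VM1_BASE = 0x90000000  # assumed base of VM 1's partition

# Both guests issue the same address, yet land in disjoint memory ranges.
guest_addr = 0x80001234
print(hex(remap(guest_addr, VM0_BASE, MEM_MASK)))
print(hex(remap(guest_addr, VM1_BASE, MEM_MASK)))
```

For the OR to act as an addition, the base must have no bits in common with the mask, which is why the scheme only supports contiguous, size-aligned partitions.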

2.3.4. Memory Bandwidth Control via Token Buckets

The token bucket has three parameters: size (SIZE), frequency (FREQ), and increment (INC). The frequency indicates the token recovery period of the bucket, the increment indicates the number of tokens recovered at each period, and the size of the bucket determines the maximum number of tokens in the bucket. In an abstract sense, SIZE and FREQ determine the maximum burst bandwidth allowed for token buckets (which lasts only one recovery cycle), while INC and FREQ determine the maximum stable bandwidth allowed for token buckets.

We insert the token bucket as a monitoring module on the system bus. The module has a second port connected to the control plane so that parameters can be updated and statistics returned. The token bucket logic is as follows. Each label has its own token bucket state (token count, recovery period counter, and flow counter) and its own token bucket parameters (SIZE, FREQ, and INC). Request addresses are filtered first, so that only requests sent to main memory are monitored and peripheral MMIO requests are ignored. For a given token bucket, a request counts as valid if the bus request itself is valid and its label matches the label of the bucket. In each cycle, the sizes of all valid requests for the label are accumulated (invalid requests count as 0), and the module checks whether the recovery period has been reached. The token count for the next cycle is then computed from the current token stock, the tokens consumed, and the tokens recovered, and it is clamped at 0. If the token stock drops to zero, requests are stalled in the waiting queue. Stalled requests are probed again as valid requests in the next cycle.
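A cycle-level software model makes the interaction of SIZE, FREQ, and INC easier to follow. The Python sketch below is an approximation under stated assumptions: the bucket starts full, recovery is applied before consumption within a cycle, one token pays for one unit of request size, and a request is admitted whenever the token stock is nonzero. The class and method names are hypothetical.

```python
from collections import deque


class TokenBucket:
    """Per-label token bucket, advanced one bus cycle at a time (toy model)."""

    def __init__(self, size, freq, inc):
        self.size, self.freq, self.inc = size, freq, inc
        self.tokens = size      # token count; starting full is an assumption
        self.period = 0         # recovery period counter
        self.traffic = 0        # flow counter reported as a statistic
        self.waiting = deque()  # sizes of stalled requests

    def tick(self, new_requests=()):
        """Advance one cycle and return the request sizes admitted in it."""
        self.period += 1
        if self.period == self.freq:  # recovery period reached
            self.period = 0
            self.tokens = min(self.size, self.tokens + self.inc)
        self.waiting.extend(new_requests)
        admitted = []
        while self.waiting and self.tokens > 0:
            req = self.waiting.popleft()
            self.tokens = max(0, self.tokens - req)  # clamp at zero
            self.traffic += req
            admitted.append(req)
        return admitted  # anything left in self.waiting stays stalled


bucket = TokenBucket(size=4, freq=10, inc=2)
burst = bucket.tick([1, 1, 1, 1, 1])  # five unit-size requests in one cycle
```

With these parameters, four of the five requests pass immediately (the burst allowed by SIZE), and the fifth waits until the next recovery; in the long run the label is held to INC tokens every FREQ cycles.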

2.3.5. Cache Partition

The L2 cache is currently configured as 16-way set associative by default. Each label in the control plane has a corresponding waymask register. Because the L2 cache implementation is multicycle, the waymask lookup fits in naturally: when the meta-array is first requested, the label of the request is used to read the waymask configuration from the control plane into a register. In the next cycle, when the replacement victim way is chosen, the waymask filters the candidates, so that only ways whose waymask bit is set can be selected as the victim.
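The victim filtering step can be illustrated as follows. This Python sketch assumes that bit i of the waymask corresponds to way i and that the replacement policy supplies a preference order over the ways; both are assumptions, and the function names are hypothetical.

```python
NUM_WAYS = 16  # default L2 associativity stated in the text


def candidate_victims(waymask, num_ways=NUM_WAYS):
    """Ways that the label is allowed to evict from (bit i <-> way i, assumed)."""
    return [way for way in range(num_ways) if (waymask >> way) & 1]


def pick_victim(waymask, preferred_order):
    """Return the first way in replacement order that the waymask permits."""
    for way in preferred_order:
        if (waymask >> way) & 1:
            return way
    raise ValueError("waymask selects no way")


# The replacement policy would prefer way 15, but a label restricted to
# ways 0-11 falls through to the first permitted candidate.
print(candidate_victims(0xF000))
print(pick_victim(0x0FFF, [15, 14, 3, 0]))
```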

2.3.6. An Illustrative Example of the Labeled RISC-V Workflow

We use the illustrative example in Figure 4, LC and BE applications colocated on hardware, to present the workflow of how labeled RISC-V protects the QoS of LC applications when they are suffering performance interference from BE applications.

(1) Label Registration. The administrator can register two label task groups with labels 0 and 1 through the Cgroups label subsystem, which functions as the label management interface (e.g., “mkdir -p /sys/fs/cgroup/dsid/BE-1”). The admin can add LC app processes to group 0 and BE app processes to group 1. When a process is attached to a group, the label subsystem stores the label of that group in the task_struct of the process, and the label then accompanies the process for its entire life cycle.

(2) Label Transmission. When a process is dispatched to a core, the scheduler writes the label of that process into the core and concatenates it together with the VM label to form a full hardware label. As shown at the top of the hardware section in Figure 4, the core where the LC process is running belongs to VM 0, and the process label is 0, which are concatenated into a hardware label 0; the core where BE is located generates label 1. When there are new memory requests generated by the labeled core, the hardware label is attached to those requests. These requests carry the label through the entire memory hierarchy until the last level that does not support label architecture. Therefore, each shared hierarchy level can distinguish different access requests according to their labels, and the admin can also obtain statistical data of different labels in each shared level.

(3) Label with Control Logic. However, only supporting (D) differentiation is incomplete for LvNA. A label lacking policy cannot achieve the full DIP properties and guarantee the QoS of LC applications. Resource contention still occurs in each shared resource. The labeled requests would still compete for the shared LLC space and memory bandwidth. After observing that the QoS of the LC application is disturbed, combined with the statistics of the label, the administrator can set the control policy of the corresponding label by writing commands to the control files of group 0 and 1 (e.g., “echo l2_waymask 0xf000 > /sys/fs/cgroup/dsid/BE-1/dsid.dsid-cp”). The control policy interface then converts the commands into control parameters and issues commands to operate the control plane through MMIO. These parameters are eventually updated to each control logic.

(4) Label-Guided LLC Control. When a labeled request arrives at the LLC and refers to a new cache block, the block must be allocated in the LLC according to the replacement algorithm. At this point, the LLC limits the ways that can be used according to the waymask parameter corresponding to the label. In the LLC part of Figure 4, requests of label 0 are allocated in the right 12 ways of the LLC and requests of label 1 are allocated in the left 4 ways, according to the waymasks. Hence, the LLC space occupied by different requests is separated, there is no contention for cache space, and the (I) isolation property is realized.

(5) Label-Guided Memory Bandwidth Control. After passing through the LLC and before accessing memory, a labeled request is first processed by the bandwidth control logic. The bandwidth control limits the bandwidth of each labeled request according to the parameters corresponding to its label. Requests of label 1 enter the memory bus at a controlled rate, whereas requests of label 0 are not affected, thus ensuring priority service of memory bandwidth for label 0 requests and implementing the (P) prioritization property.

(6) Label Reclamation. Labels in hardware and labels in software have different life cycles. Labels in hardware are destroyed after use, whereas those in software are recycled. A hardware label is generated along with its memory access request. Thus, when the request reaches the last level of the memory hierarchy that can satisfy it, the request is completed and answered, and the hardware label attached to it is destroyed at the same time. A software label is assigned to a process by the label management subsystem and accompanies the entire life cycle of the process; when the process ends and its task_struct structure is released, the label disappears. However, the group in the Cgroups label subsystem that manages the label is not automatically destroyed; it can still wait for other processes to attach and assign the label to them.
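Steps (2) to (4) of the workflow above can be sketched as follows. This is an illustration under stated assumptions rather than the actual driver or hardware: the 5-bit field widths, the placement of the VM label in the high bits, the bit-to-way mapping, and the complementary mask 0x0fff for label 0 are all hypothetical. Only the command string "l2_waymask 0xf000", the 16 ways, and the 12/4 split come from the text.

```python
VM_BITS = 5     # hypothetical width of the VM label field
PROC_BITS = 5   # hypothetical width of the process label field
NUM_WAYS = 16   # L2 associativity stated in the text


def make_hw_label(vm_label: int, proc_label: int) -> int:
    """Step 2: concatenate VM label (high bits) and process label (low bits)."""
    if not 0 <= vm_label < (1 << VM_BITS):
        raise ValueError("VM label out of range")
    if not 0 <= proc_label < (1 << PROC_BITS):
        raise ValueError("process label out of range")
    return (vm_label << PROC_BITS) | proc_label


def parse_command(text: str) -> tuple:
    """Step 3: convert a control-file command into a (name, value) pair."""
    name, raw = text.split()          # raises ValueError if malformed
    if name != "l2_waymask":          # the only command shown in the text
        raise ValueError("unknown control parameter: " + name)
    value = int(raw, 0)               # base 0 accepts the 0x prefix
    if not 0 <= value < (1 << NUM_WAYS):
        raise ValueError("waymask does not fit in 16 ways")
    return name, value


def ways_of(mask: int) -> list:
    """Step 4: list the ways a waymask permits (bit i = way i, assumed)."""
    return [w for w in range(NUM_WAYS) if (mask >> w) & 1]


# A dictionary stands in for the MMIO-programmed control plane.
lc_label = make_hw_label(0, 0)   # LC process, VM 0, process label 0
be_label = make_hw_label(0, 1)   # BE process, VM 0, process label 1
control_plane = {
    be_label: dict([parse_command("l2_waymask 0xf000")]),
    lc_label: dict([parse_command("l2_waymask 0x0fff")]),  # assumed mask
}
lc_ways = ways_of(control_plane[lc_label]["l2_waymask"])
be_ways = ways_of(control_plane[be_label]["l2_waymask"])
```

Under these assumptions the two labels receive 12 and 4 disjoint ways, which matches the split shown in Figure 4; the BE share of 4 out of 16 ways corresponds to 25% of the L2 capacity.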

3. Results and Discussion

3.1. Platform

We designed and taped out a test chip named Beihai, using the TSMC 28 nm process, to establish the effectiveness of LvNA in low-entropy cloud scenarios. Detailed specifications of Beihai are shown in Table 1.

Table 1 (Beihai specifications)

Core: 8 RISC-V RV64GC cores, 1.2 GHz
L1 I cache: 16 KB, per-core, 8-way set-associative
L1 D cache: 16 KB, per-core, 8-way set-associative
L2 cache: 2 MB, shared, 16-way set-associative, inclusive
LvNA support: Software programmable control plane, memory segmentation, memory bandwidth adjustment, and cache partition

Figure 5 shows the GDS layout of our Beihai chip; the parts in the upper green box are 8 core tiles derived from the Rocket Chip [17], each including a Rocket core and an L1 I/D cache. We added a control and status register (CSR) to each core to store the label corresponding to the software running on that core. The parts in the lower blue box are the L2 cache, and the area between the two boxes holds most of the remaining chip logic. To allow the label to be transmitted through the memory hierarchy, we modified the TileLink [19] bus protocol in Rocket Chip [17] and added 10 bits to it, accounting for 2% of the total bit width (10/500). The part marked in red in the middle area is the control logic related to the label architecture, accounting for only 1.22% of the total chip area. Overall, the labeled architecture modification involves almost no changes to the core, and the overhead of introducing labeled control logic is relatively small.

We constructed a 16-node cluster prototype system called “FlameCluster” based on the Beihai chip; each node can run either a standard Linux system or a labeled-Linux system. All nodes are connected to one another through a network switch; other specifications are shown in Table 2.

Table 2 (FlameCluster specifications)

Each node, CPU: Beihai chip
Each node, Memory: 2 GB DDR4-2133
Each node, OS: Linux 5.14
Node number: 16

3.2. Functionality
3.2.1. Fully Hardware-Supported Virtualization

This experiment demonstrates the effectiveness of fully hardware-supported virtualization when a label is used as the identifier of each VM. We partitioned the eight-core Beihai system into eight VMs that share hardware resources equally. We launched two of them without any hypervisor, and each VM ran Linux as if on a bare-metal machine (Figure 6).

3.2.2. Process Memory Bandwidth Management

The labeled architecture could also be used to control and serve different memory bandwidths to entities in one system. In this experiment, we ran one STREAM [20] and three Dwarfs [21] in an eight-core Linux SMP system on Beihai. We assigned label 0 to the STREAM process and label 1 to the Dwarfs processes with no constraint on processor affinity.

Figure 7 illustrates how STREAM’s memory bandwidth changed under different loads and control strategies. When STREAM ran in solo mode, its copy bandwidth was 60.5 MB/s. However, STREAM’s performance degraded easily when other processes competed with it for memory bandwidth. After we started three Dwarfs (md5, wordcount, and sort) alongside STREAM, its copy bandwidth dropped to 10.4 MB/s (an 82.8% reduction compared with solo mode). Next, we applied a memory bandwidth control strategy to label 1 (which represents the three Dwarfs applications) and reran the STREAM test. The copy bandwidth of STREAM recovered to 60.2 MB/s. Compared with solo mode, the performance degradation of STREAM due to memory bandwidth contention was reduced to only 0.4%.

These experimental results demonstrate that the labeled architecture is able to isolate and reallocate memory bandwidth among different applications and can significantly reduce the performance degradation caused by memory bandwidth contention.

3.2.3. QoS Guarantee for LC Applications

We chose Redis [22] as a representative workload of LC applications in data centers and Dwarfs as BE applications. We colocated a Redis server with four Dwarfs applications on both a single Beihai node and the 16-node Beihai cluster. The LC and BE applications were assigned different labels, as in the memory bandwidth experiment above.

Figure 8(a) shows how Redis’s 99th percentile tail latency and hardware resource usage changed in the single-node test. We ran one Redis server instance on a node and used the Redis benchmark on another machine to send requests and measure the tail latency periodically. At the same time, we read our hardware performance counters to obtain the statistics associated with each label. Owing to a limitation of the hardware implementation, we could only obtain the per-label bandwidth between the L1 cache and the L2 cache, which to some extent represents an upper bound on the memory bandwidth. When the Redis server ran alone, its 99th percentile tail latency remained stably below 20 ms. However, only one process was fully busy, and the CPU utilization was only 22.2%. After we started Dwarfs on that node, the CPU utilization increased to 75.9%, but the 99th percentile tail latency of Redis increased drastically, by about 2× to 10×, and fluctuated wildly, severely hurting the QoS of the LC Redis. We first applied cache partitioning to restrict the L2 capacity of Dwarfs to 25%. However, the tail latency of Redis still fluctuated because Redis still suffered severe contention with Dwarfs for memory bandwidth. Hence, we also applied a memory bandwidth restriction to Dwarfs. The 99th percentile tail latency of Redis finally returned to the level observed when it ran alone, and the CPU utilization remained at 71.3%.
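For readers unfamiliar with the metric, the sketch below shows one common way to compute a 99th percentile tail latency from raw per-request latencies. It uses the nearest-rank definition, which is an assumption made for illustration; the Redis benchmark reports percentiles itself, and its exact method is not described in the text. The function name is hypothetical.

```python
def percentile_nearest_rank(samples: list, p: int) -> float:
    """Return the p-th percentile (integer p in 1..100) by nearest rank."""
    if not samples:
        raise ValueError("no latency samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding issues.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Because the 99th percentile is determined by the slowest 1% of requests, a small number of requests delayed by contention is enough to raise it sharply even when the median latency barely moves.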

In the 16-node cluster test, we ran a Redis server on every node and set 8 of them as master nodes and the others as replica nodes. We used the Redis benchmark in cluster mode to test the tail latency of FlameCluster, later started Dwarfs on each node, and then applied the control strategy to Dwarfs, as in the single-node test. The results are shown in Figures 9(a)–9(c).

In this experiment, we used the labeled architecture to distinguish different requests from LC applications and BE applications in shared hardware resources. Then, we set a programmable control strategy to isolate the interference between them and prioritize the requests from LC applications. The results of the experiment show that the labeled architecture can effectively prevent the long tail delay of LC applications, provide better hardware QoS for LC applications, and keep the CPU utilization at a high level at the same time.

3.3. Related Work and Discussion
3.3.1. Software-Based Partitioning

Researchers have proposed leveraging the page-coloring technique to partition cache and DRAM. The key idea of page coloring is to use address mapping information to manage hardware resources. Lin et al. [23] and Tam et al. [24] proposed page-coloring-based cache partitioning to address the cache contention problem. Liu et al. [25] presented a practical page-coloring-based DRAM bank partitioning approach to eliminate memory bank-level interference and further proposed an approach to cooperatively partition both cache and DRAM [26]. Although these software techniques are able to effectively partition shared cache and DRAM banks, they raise two major concerns. First, they require reorganizing free-page lists and migrating pages when workload mixtures change, and this overhead is not negligible. Although some works have designed different kernel buddy systems for different workload mixtures, this approach requires significant kernel hacking effort and has limited usage scenarios in production data centers. Second, contemporary processors adopt hashing algorithms to map physical addresses to cache indexes [27]. Hence, applying a page-coloring approach to such systems is relatively difficult. In contrast to page coloring, LvNA can expose more underlying information to the system, and operators can thus apply better resource management policies based on this information.
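As background, the basic page-coloring computation can be sketched as below. This is a generic illustration, not code from the cited works: it assumes 4 KiB pages and a cache whose set index is a plain slice of the physical address, and it plugs in the Beihai L2 geometry from Table 1 only as example numbers. The function names are hypothetical.

```python
CACHE_BYTES = 2 * 1024 * 1024   # example: 2 MB shared L2
WAYS = 16                       # example: 16-way set associative
PAGE_BYTES = 4096               # assumed page size


def num_colors(cache_bytes: int = CACHE_BYTES, ways: int = WAYS,
               page_bytes: int = PAGE_BYTES) -> int:
    """Number of page colors: bytes per way divided by the page size."""
    way_bytes = cache_bytes // ways
    return max(1, way_bytes // page_bytes)


def page_color(phys_addr: int, colors: int = 0,
               page_bytes: int = PAGE_BYTES) -> int:
    """Color of the page containing phys_addr (frame number modulo colors)."""
    if colors <= 0:
        colors = num_colors(page_bytes=page_bytes)
    return (phys_addr // page_bytes) % colors
```

Pages of different colors map to disjoint cache sets, so an OS can partition the cache by giving each workload pages of distinct colors. Once the hardware hashes physical addresses into cache indexes, this simple modulo relation no longer holds, which is the second concern raised above.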

3.3.2. Hardware-Based Technique

Kasture and Sanchez proposed Ubik [28], a cache partitioning policy designed to characterize and leverage the transient behavior of latency-critical applications to maintain their target tail latency. Vantage [29] implemented fine-grained cache partitioning using the statistical properties of Zcaches [30]. Utility-based cache partitioning (UCP) [31] strictly partitions the shared cache depending on the benefit of allocating a different number of ways to each application. Muralidhara et al. proposed application-aware memory channel partitioning (MCP) [32] to reduce memory system interference. However, each of these works usually focuses on only one type of resource, whereas LvNA provides a solution to manage all shared hardware resources simultaneously within a server.

3.3.3. Architectural Support for QoS

Rafique et al. [33] proposed an OS-driven hardware cache partitioning mechanism that labels cache requests and allows the OS to adjust the cache quota according to the label. Sharifi et al. [34] further proposed a feedback-based control architecture for end-to-end on-chip resource management. Iyer et al. made substantial contributions in terms of architectural support for QoS [35–39]. The closest work to LvNA is the Intel Resource Director Technology [40], which assigns a label called class of service (CLOS) to group each on-chip request and allows cache/DRAM to serve the requests according to the CLOS. dCAT [41] improves throughput by dynamically partitioning a single shared resource (LLC). CoPart [42] partitions LLC as well as memory bandwidth to improve fairness. PARTIES [43] uses a gradient descent algorithm to find a locally optimal resource partition strategy for multiple LC workloads. CLITE [44] uses Bayesian optimization to find a better strategy for multiple LC workloads.

However, LvNA differs from these prior proposals in the following aspects. (1) Prior work has primarily focused on on-chip resources, such as cache, NoC, and memory bandwidth (managed by an on-chip memory controller), while LvNA is able to manage not only on-chip resources but also I/O resources. (2) We design programmable control logic for LvNA and a uniform programming interface, while prior work does not support programmability. (3) LvNA includes a centralized platform resource manager and a Linux-based firmware to facilitate operators’ management. (4) LvNA supports not only QoS but also NoHype-like virtualization that can partition a single server into multiple VMs. (5) In future work, LvNA could be extended to carry more information across software and hardware by labels.

4. Conclusion

In this paper, we presented LvNA, a new computer architecture for low-entropy clouds that realizes a set of label-powered control mechanisms in various hardware components to enhance the ability of the system to control shared resources. LvNA significantly improves the control functionality of hardware over shared resources, such as last-level caches and memory bandwidth, and curbs the uncertainty inside hardware. We have also taped out a 1.2 GHz 8-core RISC-V processor called Beihai, based on LvNA, to demonstrate the effectiveness of the proposed label-powered mechanisms. The experimental results show that Beihai was able to drastically reduce the tail latency of the LC application Redis when colocated with other CPU-consuming applications.

Data Availability

The latency and bandwidth data used to support the findings of this study have been deposited in the beihai-data repository (

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Authors’ Contributions

Chuanqi Zhang, Sa Wang, and Zihao Yu contributed equally to this work.


Acknowledgments

This work was supported in part by the National Key R&D Program of China (2016YFB1000201), the National Natural Science Foundation of China (Grant Nos. 62090020 and 62172388), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant Nos. 2013073 and 2020105), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC05030200).

References


  1. J. Leverich and C. Kozyrakis, “Reconciling high server utilization and sub-millisecond quality–of-service,” in Proceedings of the Ninth European Conference on Computer Systems, pp. 1–14, Amsterdam, The Netherlands, 2014.
  2. Y. Xu, M. Bailey, B. Noble, and F. Jahanian, “Small is better: avoiding latency traps in virtualized data centers,” in Proceedings of the 4th annual Symposium on Cloud Computing, pp. 1–16, Santa Clara, California, 2013.
  3. Y. Xu, Z. Musgrave, B. Noble, and M. Bailey, “Bobtail: avoiding long tails in the cloud,” in 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pp. 329–341, Lombard, IL, 2013.
  4. M. Yu, A. Greenberg, D. Maltz et al., “Profiling network performance for multi-tier data center applications,” in 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11), Boston, MA, 2011.
  5. B. Vamanan, J. Hasan, and T. Vijaykumar, “Deadline-aware datacenter tcp (d2tcp),” ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 115–126, 2012.
  6. C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron, “Better never than late,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 50–61, 2011.
  7. D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz, “Detail: reducing the flow completion time tail in datacenter networks,” ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 139–150, 2012.
  8. J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
  9. R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat, “Chronos: predictable low latency for data center applications,” in Proceedings of the Third ACM Symposium on Cloud Computing, pp. 1–14, San Jose, California, 2012.
  10. Z. Xu and C. Li, “Low-entropy cloud computing systems,” Scientia Sinica Informationis, vol. 47, no. 9, pp. 1149–1163, 2017.
  11. C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  12. A. Vahdat, “Networking challenges for the next decade,” 2017, 2022,
  13. A. Herdrich, E. Verplanke, P. Autee et al., “Cache qos: from concept to reality in the intel®xeon®processor e5-2600 v3 product family,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 657–668, Barcelona, Spain, 2016.
  14. D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, “Heracles: improving resource efficiency at scale,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 450–462, Portland, Oregon, 2015.
  15. Z. Xu and J. Zhang, Computational Thinking: A Perspective on Computer Science, Springer, 2021.
  16. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, Rfc2475: an Architecture for Differentiated Service, Network Working Group, USA, 1998.
  17. K. Asanovic, R. Avizienis, J. Bachrach et al., “The rocket chip generator, EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17,” 2016,
  18. P. Menage and R. Seth, “Cgroups, [EB/OL],”
  19. “Tilelink specification, [EB/OL],”
  20. J. D. McCalpin, “Stream: sustainable memory bandwidth in high performance computers, [EB/OL],”
  21. W. Gao, J. Zhan, L. Wang et al., “Data motifs: a lens towards fully understanding big data and ai workloads,” in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1–14, Chicago, IL, USA, 2018.
  22. “Redis, [EB/OL],”
  23. J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan, “Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems,” in 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pp. 367–378, Salt Lake City, UT, 2008.
  24. D. Tam, R. Azimi, L. Soares, and M. Stumm, “Managing shared l2 caches on multicore systems in software,” in Workshop on the Interaction between Operating Systems and Computer Architecture, pp. 26–33, Toronto, 2007.
  25. L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu, “A software memory partition approach for eliminating bank-level interference in multicore systems,” in 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 367–375, Minneapolis, MN, USA, 2012.
  26. L. Liu, Y. Li, Z. Cui, Y. Bao, M. Chen, and C. Wu, “Going vertical in memory management: Handling multiplicity by multi-policy,” in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp. 169–180, Minneapolis, MN, USA, 2014.
  27. C. Maurice, N. L. Scouarnec, C. Neumann, O. Heen, and A. Francillon, “Reverse engineering intel last-level cache complex addressing using performance counters,” in International Symposium on Recent Advances in Intrusion Detection, pp. 48–65, Cham, 2015.
  28. H. Kasture and D. Sanchez, “Ubik,” ACM SIGPLAN Notices, vol. 49, no. 4, pp. 729–742, 2014.
  29. D. Sanchez and C. Kozyrakis, “Vantage: scalable and efficient fine-grain cache partitioning,” ACM SIGARCH Computer Architecture News, vol. 39, no. 3, pp. 57–68, 2011.
  30. D. Sanchez and C. Kozyrakis, “The zcache: decoupling ways and associativity,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 187–198, Atlanta, GA, USA, 2010.
  31. M. K. Qureshi and Y. N. Patt, “Utility-based cache partitioning: a low-overhead, high performance, runtime mechanism to partition shared caches,” in 2006 39th annual IEEE/ACM international symposium on microarchitecture (MICRO’06), pp. 423–432, Orlando, FL, USA, 2006.
  32. S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda, “Reducing memory interference in multicore systems via application-aware memory channel partitioning,” in 2011 44th annual IEEE/ACM international symposium on microarchitecture (MICRO), pp. 374–385, Porto Alegre, Brazil, 2011.
  33. N. Rafique, W.-T. Lim, and M. Thottethodi, “Architectural support for operating system driven cmp cache management,” in 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 2–12, Seattle, WA, USA, 2006.
  34. A. Sharifi, S. Srikantaiah, A. K. Mishra, M. Kandemir, and C. R. Das, “Mete: meeting end–to-end qos in multicores through system-wide resource management,” in Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, pp. 13–24, San Jose, California, USA, 2011.
  35. A. Herdrich, R. Illikkal, R. Iyer, D. Newell, V. Chadha, and J. Moses, “Rate-based qos techniques for cache/memory in cmp platforms,” in Proceedings of the 23rd international conference on Supercomputing, pp. 479–488, Yorktown Heights, NY, USA, 2009.
  36. R. Iyer, “Cqos: a framework for enabling qos in shared caches of cmp platforms,” in Proceedings of the 18th annual international conference on Supercomputing, pp. 257–266, Malo, France, 2004.
  37. R. Iyer, L. Zhao, F. Guo et al., “QoS policies and architecture for cache/memory in cmp platforms,” ACM SIGMETRICS Performance Evaluation Review, vol. 35, no. 1, pp. 25–36, 2007.
  38. B. Li, L.-S. Peh, L. Zhao, and R. Iyer, “Dynamic qos management for chip multiprocessors,” ACM Transactions on Architecture and Code Optimization, vol. 9, no. 3, pp. 1–29, 2012.
  39. B. Li, L. Zhao, R. Iyer et al., “CoQoS: coordinating QoS-aware shared resources in NoC-based SoCs,” Journal of Parallel and Distributed Computing, vol. 71, no. 5, pp. 700–713, 2011.
  40. “Intel rdt, [EB/OL],”–technology/resource-director-technology.html.
  41. C. Xu, K. Rajamani, A. Ferreira, W. Felter, J. Rubio, and Y. Li, “Dcat: dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service,” in Proceedings of the Thirteenth EuroSys Conference, pp. 1–13, Porto, Portugal, 2018.
  42. J. Park, S. Park, and W. Baek, “CoPart: coordinated partitioning of last-level cache and memory bandwidth for fairness-aware workload consolidation on commodity servers,” in Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1–16, Dresden, Germany, 2019.
  43. S. Chen, C. Delimitrou, and J. F. Martinez, “PARTIES: QoS-aware resource partitioning for multiple interactive services,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 107–120, Providence, RI, USA, 2019.
  44. T. Patel and D. Tiwari, “CLITE: efficient and QoS-aware co-location of multiple latency-critical jobs for warehouse scale computers,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 193–206, San Diego, CA, USA, 2020.

Copyright © 2022 Chuanqi Zhang et al. Exclusive Licensee Zhejiang Lab, China. Distributed under a Creative Commons Attribution License (CC BY 4.0).
