Get Our e-AlertsSubmit Manuscript
Intelligent Computing / 2022 / Article

Research Article | Open Access

Volume 2022 |Article ID 9797623 |

Yongwei Zhao, Yunji Chen, Zhiwei Xu, "Fractal Parallel Computing", Intelligent Computing, vol. 2022, Article ID 9797623, 10 pages, 2022.

Fractal Parallel Computing

Received04 Jul 2022
Accepted09 Aug 2022
Published05 Sep 2022


As machine learning (ML) becomes the prominent technology for many emerging problems, dedicated ML computers are being developed at a variety of scales, from clouds to edge devices. However, the heterogeneous, parallel, and multilayer characteristics of conventional ML computers concentrate the cost of development on the software stack, namely, ML frameworks, compute libraries, and compilers, which limits the productivity of new ML computers. Fractal von Neumann architecture (FvNA) is proposed to address the programming productivity issue for ML computers. FvNA is scale-invariant to program, thus making the development of a family of scaled ML computers as easy as a single node. In this study, we generalize FvNA to the field of general-purpose parallel computing. We model FvNA as an abstract parallel computer, referred to as the fractal parallel machine (FPM), to demonstrate several representative general-purpose tasks that are efficiently programmable. FPM limits the entropy of programming by applying constraints on the control pattern of the parallel computing systems. However, FPM is still general-purpose and cost-optimal. We settle some preliminary results showing that FPM is as powerful as many fundamental parallel computing models such as BSP and alternating Turing machine. Therefore, FvNA is also generally applicable to various fields other than ML.

1. Introduction

Machine learning (ML) is fast becoming the prominent technology to be used for many emerging applications, including image processing, voice recognizing, gaming, and intelligent healthcare. Furthermore, the applications of ML scale from mobile and edge to supercomputing: For example, in the mobile phone scenarios, the “Snap’n’Shop” (product image search) features [1] are pervasive in today’s mobile shopping Apps, and this feature relies on ML for the semantic understanding of images. In the field of supercomputing, meanwhile, the 2020 ACM Gorden Bell Prize was awarded to Jia et al. [2] who pushed the limit of ab initio molecular dynamics to 100 million atoms by integrating ML into physical modeling. Consequently, ML computers are being developed at a variety of scales. For example, Ascend, Huawei’s ML computer, is as small as the IP core in the mobile SoC (Ascend Tiny in HiSillicon Kirin 9000) and as large as the AI cluster (e.g., Atlas 900 including thousands of Ascend 910 chips, up to 1 ExaFlops @ FP16).

Due to the heterogeneous, parallel, multilayer, and anisostratal (i.e., different ISA adopted in each layer of the system) characteristics of the conventional ML computer architecture, the software stack on every scaled model must be developed on an ad hoc basis. As a result, the R&D costs of ML computers are concentrated on software development, namely, the programming productivity issue.

We take the programming of a GPU cluster as an example. To leverage a single GPU, software developers write CUDA programs to implement common ML operators, such as Conv2D, AvgPool, and SoftMax. The CUDA program must allocate and use the global, shared, and local memory explicitly to utilize the power of the GPU. Moreover, the developer may also need to implement fused operators [3] for optimal efficiency, contributing to a significant part of the development workload. However, when the computer extends to multiple GPUs and nodes, the software stack must be inadvertently reworked to include the control of multi-GPU/multinode communication and synchronization. The fused operators on the cluster are unlikely reusable from the single-GPU codebase. Empirically, NVIDIA provides single-GPU operators and multi-GPU/multinode operators via different libraries (the CUB/NCCL library). The fused operators are supported under the TensorRT framework, but no support for multi-GPU/multinode is provided. This makes the programming experience for end users dependent on the machine scale. The end users must divide and distribute payloads as the software stack supports accordingly, resulting in scale-dependent programs.

Addressing the productivity issue, we proposed ML computers with fractal von Neumann architecture (FvNA) [4, 5]. FvNA borrows the geometric concept fractal to describe the self-similar patterns applied to any scale. With FvNA, we define the programs and the architecture in a scale-invariant way. Both are free to zoom until an appropriate match, the actual execution. FvNA is multilayer but isostratal, which literally means “same across layered structures.” That is, every layer in the system adopts the same ISA. The lower layer is fully controlled by the higher layer, thus, only the top layer is exposed to the programmer as a monolithic processor. Therefore, ML computers built with FvNA are programmable under a scale-invariant, homogeneous, and sequential view. Prior works have shown that FvNA is applicable to the ML domain and alleviates the programming productivity issue while keeping the comparable efficiency as its ad hoc counterparts.

However, there still remain unanswered questions about FvNA. In this study, we address the most important ones, including the following. (1) How could FvNA remain quite efficient with such a strict architectural constraint? (2) Is FvNA also applicable to payloads from other domains? (3) If so, what are the exact prerequisites? To answer these questions, we start by modeling the fractal parallel machine (FPM), from where the cost, power, and limitation of fractal machines can be reasonably discussed. We settle some preliminary results showing that the FPM is as powerful as many fundamental parallel computing models, BSP, and alternating Turing machine. These results support our extension of fractal parallel computing to the domain of general-purpose parallel computing.

1.1. Background

We provide first the definition of fractal regarding computer architectures. Fractalness is the primitive property of computing problems stating that there exists a finite program, when executed on arbitrarily scaled parallel machines that solve the problem uniformly in finite time. Therefore, when saying a parallel computing system is fractal, we imply that the system always adopts the same program regardless of the scale.

Being fractal is to adopt a scale-invariant description of the computing system. Fractal parallel computing systems are self-similar at different scales, adopting the same set of descriptions (e.g., ISA and program) for hardware resources, payloads, and execution behaviors. Therefore, the systems are freely scalable according to the description of a single scale. In this study, we refer to the principle as isostratal, which literally means “same across layered structures.”

We define the scheme of payloads on fractal parallel computing systems as fracop.

Definition 1 (Fracop). An operator is a, if there is another operator so that where

We refer to as the reduction operator and as subfracops.

Fracop is defined following the isostratal principle, defined in a scale-invariant description. The parallel execution behavior of a fracop is generated by iteratively decomposing it into sub- until matching the parallelism and the capacity of the underlying hardware. This process resembles the generating rules for fractal geometrics (e.g., the Sierpiski carpet).

FvNA is a multilayer, isostratal parallelized von Neumann architecture. It inherits the processing unit, control unit, memory, and input/output mechanisms to the massive external storage from the von Neumann architecture, while the processing units are extended into two categories: multiple fractal functional units (FFU) and a local functional unit (LFU). FFUs are homogeneous components including a new layer of the same FvNA inside them. Therefore, the layered structure of FvNA is scale-invariant, constituting the fractalness of the system, freely scalable by recursively building new layers into the FFUs. FvNA when viewed from a global perspective forms a tree structure. The memory in a layer acts like the external storage for its child layer. The LFU can perform light-weight data processing in the memory, such as basic partition and reduction of tasks. To leverage leaf processors, data must be copied from layer to layer, under the control of the same program (i.e., codes of fracop), as required by the isostratal principle.

We start our analysis by modeling FvNA as our first fractal parallel computing model. Valiant [6] proposed and analyzed a homogeneous multilayer parallel model, multi-BSP. Despite the anisostratal characteristic of multi-BSP, we can build our model of FPM based on it with only minor extensions. In the next section, we start by defining FPM in a way that resembles that of multi-BSP. Note that the defined model is not intended to restrict the concept of fractal parallel computing, but to provide an intuitive but complete specification as an example for further discussion.

2. Materials and Methods

2.1. Modeling Fractal Parallel Machines

An instance of FPM is a tree structure of nested components. Each component contains a memory, a processor (which could be modeled as a random accessing machine), and (except for the leaf components in layer 0) child components. Components can execute fracops, i.e., read some input data from the external storage, perform computation on the processor, and then write output data to the external storage. The processor may divide the fracop into smaller-scaled subfracops and distribute the subfracops to the child components (if there are any). Therefore, to program FPM, we only program a single processor, and all processors in the instance must run the same program.

To define an instance of FPM, only two parameters are required. specifies the depth of the tree, and specifies the number of child components in a component. We define the number of processors to count all processors in the FPM, . Trivially, , and is always only a constant factor larger than the count of leaf processors.

Compared with Valiant’s multi-BSP, FPM minimized the parameters for simpler abstraction. What is more important, FPM puts explicit restrictions on the programming by only exposing a single processor to the programming interface. The processor is only aware of its parent component and child components, but not the global system specification. The program never knows where it resides in the tree structure. Therefore, FPM cannot be programmed to be scale-dependent by definition.

Generally, the payload on FPM is specified as a fracop. A fracop is divided and distributed to the child processors, from where it is further divided and distributed to the child processors of child processors, and so on. The processor executes a fracop in the following ordered steps: (1)Retrieve input data from the parent memory. Data transmission costs unit time per data(2)Carry out the computation. If there are child processors, perform the following steps in order:

(a) Divide the fracop into multiple (commonly multiples of ) smaller-scaled subfracops

(b) Distribute the subfracops to child processors. Wait for the child processors to work until all the results of the subfracops are submitted. The waiting time depends on the maximal time cost of the child processors

Optionally, it is also allowed to distribute the last divided fracop to the current processor instead of a child processor if the current executing fracop is halting after this step, namely, the tail recursion.

(c) Reduce the results of the sub-fracops to obtain the final result of the original

Otherwise, the computation of the original fracop is carried out completely on the processor. The processor costs unit time per operation it performs. (3)Submit the result to the parent memory. Data transmission costs unit time per data

Note that one can trivially implement multiple different fracops into a single fracop through self-extraction by tail recursion and case-by-case routine selection. Provided that the code is still bounded under a constant length (i.e., invariant to problem sizes and model scales), it does not violate fractalness. Therefore, despite defining the payload as a single fracop here, it is not meant to limit the total number of (and types of) operations an FPM supports.

The above-described execution of a fracop transmits the data layer by layer down to layer 0, carries out the computation, and then submits and reduces the results layer by layer up to layer (the root). Therefore, the latency of each fracop depends on the scale: larger-scale FPM often has more layers, and hence, longer latencies. However, with a sequence of independent workloads, the execution could be pipelined to maintain a scale-invariant throughput.

2.2. Fractal Programming Style

We illustrate the fractal programming style of FPM by sample programs. The chosen samples include naive matrix multiplication, mergesort, and edit distance, covering the different parallel algorithm categories including embarrassingly parallel, divide-and-conquer, and dynamic programming algorithms, respectively. Finally, we show how to efficiently simulate a bridging parallel model BSP [7], settling a sound lower bound on the power and efficiency of FPM.

Require: Input matrices
Ensure: Output matrix saved in
1. function MatMul
2.  if is leaf then(1) End of recursion
3.   fordo
4.    fordo
6.    end for
7.   end for
8.  else if (2) Binary split along
9.    fracop MatMul
10.    fracop MatMul
11.  else ifthen (3) Binary split along
12.    fracop MatMul
13.    fracop MatMul
14.  else(4) Binary split along
15.    fracop MatMul
16.    fracop MatMul
18.  end if
19. end function

The first sample is a naive matrix multiplication (Algorithm 1), which is one of the fractal instructions implemented on Cambricon-F. The fractal MatMul program is divided into four cases: (1) on leaf component, MatMul is implemented as conventional for-loops; otherwise, the matrices are binary-split along the (2) -dimension, (3) -dimension, or (4) -dimension.

Any routine of the (2, 3, 4) generates two independent fracops, regardless of the actual FPM parameter as one processor can efficiently simulate several layers of the FPM before actually scheduling subfracops to the child processors.

Lemma 2 (Free zooming). The decomposition of a fracop does not need to follow the model parameter .

Proof. We show an efficient simulation of a different number of child processors (as in Algorithm 2). Virtualize simulates the given program fracop and saves the generated subfracops in a dependency graph . Once contains sufficient independent fracops, fracops are distributed to the child processors.

After each iteration of subfracops distribution, virtualize recurs itself on the tail, and hence, a series of virtualize fracops are self-extracted and executed on the current processor. Therefore, every single iteration meets the definition of fracop from Definition 1; that is, all divided subfracops are distributed to the child processors at simultaneously (in step 2b). Once the reduction step (step 2c) of a fracop starts, no new subfracops are generated from this fracop anymore.

Require: The programmed with a different number of children
Ensure: The same behavior as , but utilizes all child processors even when
1: function Virtualize
2:  while is not empty do
3:   if there are multiple of leaf vertices in then  Schedule fracops to child
4:    for leaf vertices in do
5:     remove from
6:     fracop Virtualize
7:    end for
8:    return fracop Virtualize    Tail recursion to start a new fracop
9:   else  Simulate on the current processor to generate more sub fracops to schedule
10:     select a leaf vertice from
11:    ifthen                    New fracop
12:     repeat
13:      single-stepping simulate Fracop in memory
14:     until all generated in the simulation
15:     for alldo
16:      add into
17:      add edge from to
18:      add edge from to for all , where depends on
19:      end for
20:     else              Simulated fracop, waiting for reduction
21:      simulate Fracop in memory until halting
22:      remove from
23:    end if
24:   end if
25:  end while
26: end function

Comparing with the original program fracop executed on an iso-processor-count FPM parameterized with , the simulation costs are with in a constant multiplicative factor for all finite , as only the partition and reduction steps are increased by finitely many times. Note that imbalanced workload from layers can accumulate into a significant overhead, leading to suboptimality. For example, when and , one child processor idle from each layer leads to suboptimality since . For not divisible by , find their least common multiplier as the target number of fracops to decompose into.

Free zooming is one of the desirable consequences of applying FPM. The actual machine can zoom in to obtain more detailed parallel behaviors (as shown in the proof of Lemma 2), or zoom out to save the transmission of common input data by merging many fracops into one. Hence, the programming experience on FPM is independent even of the model parameter . From here upon, we write fractal programs concentrating on the semantics, disregarding .

The second sample program is mergesort (Algorithm 3). This is a typical divide-and-conquer algorithm that inherently fits the fractal programming style.

Require: Input sequence
Ensure: Sorted sequence
1: function MergeSort
2:  ifthen                         Binary split
3:   fracop MergeSort
4:   fracop MergeSort
5:                  Merge the partial results into
6:   repeat
7:    ifthen
10:    else
13:    end if
14:   until
15:  end if

Evidently, Algorithm 3 is almost identical to a sequential code on a single core system, despite the recursive calls being modeled as subfracops. The execution is automatically parallelized on FPM (optionally via virtualize if is not exactly 2).

The last sample program is the Levenshtein edit distance (EditDist, Algorithm 4), which is solved through a typical dynamic programming algorithm. EditDist is nonintuitive to implement on most parallel programming models; here, we illustrate the versatility of the fractal programming style to cover implementation details. Overall, EditDist computes the dynamic programming matrix by diagonals, as the values on the same diagonal are independent of each other. We exploit the parallelism by decomposing the diagonals into subfracops. EditDist can be divided into five routines, on a case-by-case basis: (1)At the end of recursion, compute a segment of the diagonal directly using the Bellman equation(2)During recursion, binary split the diagonal segment to generate two subfracops(3)On initialization, swap the input strings on demand so that A is never longer than B; deal the special cases where either string is empty

At the end of the computation, return the final result . (4)The main iteration is divided into three stages: the left upper triangle, the middle part, and the right lower triangle of the dynamic programming matrix. In each iteration, a subfracop that corresponds to the computation of a diagonal and a tail-recurring fracop that corresponds to the next iteration are generated

Require: Input string ; Let
Ensure: Edit distance
1: function EditDist
2:  if is leaf and then                   End of recursion
3:   fordo
5:   end for
6:  else ifthen       Recursively compute a diagonal of the DP matrix
7:   fracop EditDist
8:   fracop EditDist
9:  else ifthen                  Entry point, initialization
11:   if then           , start diagonal computations
12:    fracop EditDist
13:   else ifthen            Swap so that
14:    fracop EditDist
15:   else                     Special case of empty strings
17:   end if
18:  else ifthen                End of computation
20:  else      Main iteration implemented as tail recursion, divided into three cases:
21:   ifthen                  1. Left upper triangle area
22:    fracop EditDist
23:    fracop EditDist
24:   else ifthen        2. Middle parallelogram area
25:    fracop EditDist
26:    fracop EditDist
27:   else               3. Right lower triangle area
28:    fracop EditDist
29:    fracop EditDist
30:   end if
31:  end if
32: end function

The example shows that FPM is applicable even for nonintuitive algorithms. Compared with conventional parallel programs, the illustrated fractal program is even cleaner and less error-prone, as there is no explicit communication, synchronization, or resource contention. Although one may argue that the fractal programming style is strictly constrained compared with the conventional wisdom of parallel programming, we empirically find that it is surprisingly hard to name a counterexample that is not efficiently tractable in the fractal programming style but in other parallel models. We cannot help but wonder if this is reasonable: we address this question in the next section.

As the last sample program, we show how to simulate BSP supersteps [7] on FPM (Algorithm 5). The simulation can be cost-optimal under certain preconditions. As BSP is a bridging model, the efficient simulation of BSP shows a sound lower bound on the power of FPM. The fracop Superstep divides BSP processors into groups, each containing BSP processors. The memory (represented as a hash map) is then divided accordingly. Groups are distributed to the child processors. Cross-group messages are saved in another hash map , which is merged into in the reduction step.

Require: BSP memory as a hash map, set of all the BSP processors
Ensure: A superstep performed on
1: function Superstep
2:  if is leaf then
3:   simulate the BSP processors on until synchronization
4:   write the cross-group messages in
5:  else
6:   divide evenly into groups
7:   divide accordingly into
8:   for alldo
9:    fracop Superstep
10:   end for
11:   for alldo
12:    merge into
13:    for all message in do
14:     if the destination of lies in then
15:      write into
16:     else
17:      write into
18:     end if
19:    end for
20:   end for
21:  end if
22: end function

The main obstacle to a cost-optimal simulation is that we have a centralized data transmission bottleneck in FPM. All input/output data are transmitted from or to the root, which only has a constant data rate, resulting in a lower bound on the simulation cost. However, following the result from [8], if we consider the size of memory and the computation time of the processors as constants, adding a slackness of is sufficient for optimality. That is, let the BSP model have more processors than FPM, say BSP processors. The partition, data transmission, and reduction in a layer all cost constant time. Although the cost is accumulated per layer, the total overhead is , and , which is aligned with the slacked computation time.

In this study, we extend the result from [8] by allowing a sublinear size of memory (rather than only a constant) with regard to the number of processors, if more slackness is acceptable. The result shows that locality is required for payloads to be efficient on FPM. The better the locality, the less slackness is required. We formalize the result as the following theorem.

Theorem 3. Let denote the parallel slackness. FPM with processors can simulate BSP with processors in time , if the total memory size .

Proof. As previously described, the data transmission costs on the root (layer ). When adding another layer, the data transmission cost is reduced by times (as the group size is divided by ), excluding the cost of possibly increased cross-group messages . However, the cost of transmitting is hidden under the big-O notation due to the constrained -relation of BSP. The total data transmission is summed up as , which is still . From the same argument, the control and reduction are also hidden under the data transmission cost. Therefore, the overall cost is , where the theorem follows from.

2.3. Boundary

Architects often use BSP as a bridging model. We have shown that FPM is equivalent to BSP in terms of computing power and cost (under the locality and the slackness preconditions). However, we have not irrefutably set the boundary of fractal programming in terms of computing power. The above reduction requires the payload programmed on the BSP instead. Therefore, yet we cannot conclude that anything programmable on BSP has a corresponding fractal program as the BSP model does not imply the scale-invariance of its programs.

In this section, we settle the boundary of fractalness in the field of general-purpose parallel computing. The first statement addresses a common fallacy that only the problems from a minority class are programmable in the fractal style. The fractalness of a problem, when not considering efficiency, is only an alternative expression of the recursiveness or uniformity. This is trivial due to the Turing completeness of the FPM.

Theorem 4. Recursive language class is exactly the set of languages decidable on FPM.

Proof. To show that languages in are decidable on FPM, use a processor in the FPM to simulate TMs. To show that languages decidable on FPMs are in , use TMs to simulate FPMs.

However, although FPM is Turing-complete, it may be still less powerful than its ad hoc counterparts: ad hoc parallel machines may even decide nonrecursive languages. Note that excessive programs may serve as advice strings. We take the MPMD model as an example.

Theorem 5. MPMD decides halting.

Proof. As MPMD allows parallel nodes to be programmed separately, we can hardcode one bit of extra information into each program without violating rules (each program has a constant length). At the beginning of the computation, the machine can gather the bits, form an -bit natural number denoted as , and broadcast . Here, introduces nonuniformity into the model. Let be the number of TMs that have descriptive size and halt on empty tapes. Then, we have Algorithm 6 that decides the Halting problem in finite time.

Require: An -node MPMD model hardcoding . Input a TM description ( bits).
Ensure: Decide if halts on empty tapes.
1: Gather and broadcast .
2: .
3: parallel fordo
4:   Simulate on empty tape until halting.
5:   If , abort and accept.
6:   Atomically increase .
7:   If , abort and reject.
8: end parallel for

Excessive programming is harmful. As we have shown in Theorem 5, ad hoc parallel machines with multiple programmable components may abstract to a nonuniform model , as it requires a nonconstant Kolmogorov complexity to define a scalable ad hoc parallel model [9]. The extra power is far beyond both the mathematical meaning of computation and the practical interests, as the programs of high Kolmogorov complexity are unlikely producible by either human or Turing machines. However, the nonuniformity is the root cause of scale-dependent programming experiences, a major contributor to the programming productivity issue.

Thus far, we show that FPM is as powerful as any uniform computing model. However, this is not all about fractal parallel computing. Not only the scale-invariant programming experience makes fractal parallel computing attractive (considering that it is already invariant in most theoretical computing models, e.g., nondeterministic TM, alternating TM, and uniform circuits) but also the competitive computing efficiency achieved in practical machines. Regarding parallel computing, we are especially concerned about how FPMs perform on efficiently parallelizable problems (i.e., the problems in [10]). By definition, is the set of problems that is decidable by -uniform circuits in , and is the union of for all finite .

However, the parallel time complexity of problems in is sublinear, whereas the time on FPM is lower bounded by linear time due to the centralized input data and the constant data rate at the root. We have shown how the centralized input data can get reasonable workarounds in Theorem 3. Here, we extend the model by adding a dedicated memory to provide input data directly to any processors in the model, resembling that we commonly equip an input tape in addition to a working tape when discussing logarithmic space Turing machines. Only then the discussion on parallel time complexity makes theoretical sense.

Definition 6 (FPM). is the extended FPM equipped with a read-only random access memory , saving input data that is directly accessible from all processors. FPM time is the overall cost, and FPM space is the maximum memory space used in any of the processors.

After the extension, FPM becomes quite similar to alternating TM [11], as ATM itself also conforms to the fractalness criteria. The most notable difference is that FPM processors cost time to transmit data to their child processors, whereas the configuration of ATM is forked at no cost. FPM shows that it can be efficient although the computing state is forced to copy per layer.

We first show that log-space uniform circuits are efficiently constructible on FPM.

Lemma 7. Problems in are decidable in FPM time and space.

Proof. Resembles the argument of Savitch’s theorem [12]. STCON is decidable by multiplying the adjacency matrix up to its -th power and then checking the value in cell . A fracop on FPM with parameters computes the value in cell of the -th power of the input matrix by . The fracop first binary splits the range of , generating layers of binary disjunctions. Then, an additional layer decides the conjunction between and . The program then decides both by calling and recursively. Finally, values in are read from the input memory. The fracop executes in layers, each performing a basic logical operation and returning a single Boolean value. The calling requires bits encoding parameters. However, as the control from each layer only puts one more bit on each parameter, the parameters may be infiltrated to save time; that is, the processors may prematurely place whatever bits they receive into their child processors’ memories and notify the child processors to relay the parameter infiltration. Thus, data transmission costs for passing parameters are overlapped within globally. Therefore, STCON is decidable in time. The space is used at most to save the parameters, which is . The lemma follows from the fact that STCON is -complete under many-one reductions [13]. The reduction costs additional layers and space. To simulate any NTM of space on FPM, replace the read operation to the input adjacency matrix with an additional computation tree checking if the NTM configuration is a direct successor of .

We now present the efficient execution on FPM for .

Theorem 8. is decidable in FPM time and space for .

Proof. Let the FPM simulate gates on processors. A fracop on FPM with parameters maps the following subroutines to the children: (1)Decide the name of the current gate, by simulating the circuit generator on state . As is defined under -uniformity, the circuit generator can be simulated in layers of FPM, following Lemma 7(2)Recursively call the fracop with parameters to evaluate the left input gate(3)Recursively call the fracop with parameters to evaluate the right input gate

After these fracops are finished, a reduction step computes the value of this gate, which is then reported to the parent. On the leaf processors, the input values to the circuit are read from the input memory.

The fracop executes in layers (the depth of the circuit plus the depth of the circuit generator), with each layer performing a constant number of basic operations and returning a single Boolean value. State is also decided one bit per layer. Thus, we can apply the same infiltration trick as in the proof of Lemma 7. Therefore, the theorem follows from the constant time cost per layer, and the space cost is again at most used to save parameters.

Thus far, we show the equivalent power of FPM and ATM, as the latter is shown to be efficient for only either [14]. Actually, ATM and FPM are equivalent to each other within a constant factor. The most notable difference of the configuration copying cost is effectively suppressed by the infiltration trick, as the configuration of ATM also only changes by a constant number of bits during each alternating step.

3. Results

We extended the concept of fractal parallel computing to the general purpose in this study. The main results are three-fold. Theoretically, (1) recursive languages are fractal (Theorem 4), and (2) a fractal parallel machine is efficient if the language it recognizes is efficient on any uniform parallel machine (Theorem 8). In practice, (3) FPM is efficient for any parallel algorithms, if given sufficient slackness and locality to overcome the sequential data input/output from the external storage (Theorem 3). These preliminary results prove that fractalness is never a special criterion for parallel computing, but is fairly general.

4. Discussion

The concept of fractal parallel computing and the FPM abstraction model was developed from FvNA. FvNA was originally proposed to provide a scale-invariant programming experience for the domain of machine learning. Currently, we proposed two different FvNA architectures targeting the ML field. (i)Cambricon-F [4] is the specialized fractal machine that has a predefined fractal instruction set, including convolutions, matrix multiplications, pooling, element-wise transforming, sorting, and counting. It is motivated by the observation that the most time-consuming parts of most ML algorithms are constituted by these operations. Therefore, with minimal help from the CPU, almost all ML algorithms should be supported with optimal efficiency. Programs are scale-invariant so that they are universally applicable to a series of Cambricon-F instances(ii)Cambricon-FR [5] in contrast is the universal fractal machine that we proposed later. We enabled program-defined fracops in the instruction set rather than only predefined. Therefore, for the alternative and “corner case” operations that are not covered in the Cambricon-F instruction set, Cambricon-FR can stay efficient. The program is also scale-invariant

Either the specialized or the universal FPM instances could be favored when the machines are developed on various scales.

However, even when working with conventional hardware (e.g., GPU clusters) targeting general-purpose parallel computing, programming may also benefit from the concept of fractal parallel computing. We clarified that, although originally developed from the domain of ML, fractal parallel computing is fairly generally applicable. Therefore, certain fractal programming frameworks could be developed ubiquitously (e.g., as in [15] has been implemented for BSP). We believe that full implementation of FPM could be handy in various scenarios: due to the scale-invariance of FPM, cloud services deployed via FPM automatically become elastic, adding or reducing cloud resources on demand. The FPM program is the same for both cloud and edge, making computation offloading easy to achieve [16]. Instances of FPM can be made from systems either as large as the entire worldwide web or as small as micrometer-scale in vivo devices.

One of the most interesting discoveries from this study is that most problems that we compute are fractal, while most parallel architectures are controlled and programmed in an ad hoc style. The concept of fractal parallel computing is placing strict but reasonable constraints on the programming style and relying on the parallel execution model to automatically generate the details of the behavior. The control at each scale must be the same. By forcing this, uncertainties of program behaviors, especially on massively parallel machines, are drastically reduced. This can be considered entropy reduction in the programs [17]. Currently, fractal machines, such as Cambricon-F/FR, only leverage such entropy reduction to simplify software development. Whether energy reduction can be achieved by introducing fractal controlling into conventional parallel machines is an interesting open question.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


This work is partially supported by the NSF of China National Key R&D Program of China (under Grant 2018AAA0103300), the (under Grants 61925208 and 62102398), Strategic Priority Research Program of Chinese Academy of Science (XDB32050200), CAS Project for Young Scientists in Basic Research (YSBR-029), and Xplore Prize.


  1. F. Xu, W. Zhang, Y. Cheng, and W. Chu, “Metric Learning with Equidistant and Equidistributed Triplet-Based Loss for Product Image Search,” in Proceedings of The Web Conference 2020, Association for Computing Machinery, New York, NY, USA, 2020. View at: Publisher Site | Google Scholar
  2. W. Jia, H. Wang, M. Chen et al., “Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Press, Atlanta, Georgia, 2020. View at: Publisher Site | Google Scholar
  3. M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 2016. View at: Publisher Site | Google Scholar
  4. Y. Zhao, Z. Du, Q. Guo et al., “Cambricon-F: machine learning computers with fractal von Neumann architecture,” in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA),, Association for Computing Machinery, Phoenix, Arizona, 2019. View at: Publisher Site | Google Scholar
  5. Y. Zhao, Z. Fan, Z. Du et al., “Machine learning computers with fractal von Neumann architecture,” IEEE Transactions on Computers, vol. 69, no. 7, pp. 998–1014, 2020. View at: Publisher Site | Google Scholar
  6. L. G. Valiant, “A bridging model for multi-core computing,” Journal of Computer and System Sciences, vol. 77, no. 1, pp. 154–166, 2011. View at: Publisher Site | Google Scholar
  7. L. G. Valiant, “A bridging model for parallel computation,” Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990. View at: Publisher Site | Google Scholar
  8. Y. Zhao, Fractal computing systems, [Ph.D. thesis], University of Chinese Academy of Sciences, China, 2020.
  9. P. Orponen, K. I. Ko, U. Schöning, and O. Watanabe, “Instance complexity,” Journal of the ACM (JACM), vol. 41, no. 1, pp. 96–121, 1994. View at: Publisher Site | Google Scholar
  10. S. A. Cook, “Deterministic CFL's are accepted simultaneously in polynomial time and log squared space,” in Proceedings of the eleventh annual ACM Symposium on Theory of Computing, pp. 338–345, Association for Computing Machinery, Atlanta, Georgia, USA, 1979. View at: Publisher Site | Google Scholar
  11. A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer, “Alternation,” ACM, vol. 28, no. 1, pp. 114–133, 1981. View at: Publisher Site | Google Scholar
  12. W. J. Savitch, “Relationships between nondeterministic and deterministic tape complexities,” Journal of Computer and System Sciences, vol. 4, no. 2, pp. 177–192, 1970. View at: Publisher Site | Google Scholar
  13. N. Immerman, “Languages that capture complexity classes,” SIAM Journal on Computing, vol. 16, no. 4, pp. 760–778, 1987. View at: Publisher Site | Google Scholar
  14. W. L. Ruzzo, “On uniform circuit complexity,” Journal of Computer and System Sciences, vol. 22, no. 3, pp. 365–383, 1981. View at: Publisher Site | Google Scholar
  15. M. Goudreau, K. Lang, S. Rao, T. Suel, and T. Tsantilas, “Towards Efficiency and portability: programming with the BSP model,” in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, Ser. SPAA '96, Association for Computing Machinery, Padua, Italy, 1996. View at: Publisher Site | Google Scholar
  16. W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016. View at: Publisher Site | Google Scholar
  17. Z. Xu and C. Li, “Low-entropy cloud computing systems,” Scientia Sinica Informationis, vol. 47, no. 9, pp. 1149–1163, 2017. View at: Publisher Site | Google Scholar

Copyright © 2022 Yongwei Zhao et al. Exclusive Licensee Zhejiang Lab, China. Distributed under a Creative Commons Attribution License (CC BY 4.0).

 PDF Download Citation Citation
Altmetric Score