Research Article  Open Access
Yixin Luo, Jiaming Han, Zhou Liu, Mi Wang, Gui-Song Xia, "An Elliptic Centerness for Object Instance Segmentation in Aerial Images", Journal of Remote Sensing, vol. 2022, Article ID 9809505, 14 pages, 2022. https://doi.org/10.34133/2022/9809505
An Elliptic Centerness for Object Instance Segmentation in Aerial Images
Abstract
Instance segmentation in aerial images is an important and challenging task. Most of the existing methods have adapted instance segmentation algorithms developed for natural images to aerial images. However, these methods easily suffer from performance degradation in aerial images, due to the scale variations, large aspect ratios, and arbitrary orientations of instances caused by the bird's-eye view of aerial images. To address this issue, we propose an elliptic centerness (EC) for instance segmentation in aerial images, which can assign proper centerness values to the intricate aerial instances and thus mitigate the performance degradation. Specifically, we introduce ellipses to fit the various contours of aerial instances and measure these fitted ellipses by a two-dimensional anisotropic Gaussian distribution. Armed with EC, we develop a one-stage aerial instance segmentation network. Extensive experiments on a commonly used dataset, the instance segmentation in aerial images dataset (iSAID), demonstrate that our proposed method can achieve a remarkable instance segmentation performance while introducing negligible computational cost.
1. Introduction
Instance segmentation is aimed at predicting both the location and the semantic mask of each object instance in an image. Therefore, an intuitive approach to instance segmentation is to detect the bounding boxes of object instances and then perform semantic segmentation in the area of each box. The conventional two-stage instance segmentation methods adopt this detect-then-segment pipeline [1–14]. Mask R-CNN [3] is a state-of-the-art method, which extends Faster R-CNN [15] by adding a mask prediction branch. Based on Mask R-CNN, Mask Scoring R-CNN [6] predicts the IoU between mask and ground truth and further rescores the confidence of the mask through an added mask-IoU branch, acquiring better segmentation results. PANet [4] pays more attention to the process of feature propagation. It proposes bottom-up path aggregation and adaptive feature pooling to merge the features from all levels and finally boost the performance of instance segmentation. HTC [5] is a cascade architecture for instance segmentation. Different from Cascade R-CNN [16], HTC integrates features from each stage to add complementary information and finally obtain better mask predictions. PointRend [7] refines the segmentation of objects in a rendering way that is similar to classical computer graphics methods. The methods above all depend on anchor-based detectors. There are also some two-stage methods built upon anchor-free detectors. Extending the detector FCOS [17] by adding a mask branch, CenterMask [8] improves the backbone network [18] and uses a spatial attention-guided module to focus on important features during the segmentation process. Deep Snake [10] introduces circular convolution, which finishes segmentation by transforming the box detected by CenterNet [19] into a polygon. As for two-stage methods in the aerial scene, most existing works [20–26] are based on Mask R-CNN [3]. Su et al. 
[20] propose Precise RoI Pooling to replace the RoIAlign in Mask R-CNN, further avoiding the precision degradation that results from coordinate quantization. To learn multi-scale context information, Feng et al. [21] embed a local context module into the mask branch of [3]. There are also a few works that depend on rotated object detectors, which address one of the characteristics of aerial images. Recently, the rotated object detection task has gained much attention in the aerial scene [27–34], as oriented bounding boxes enclose the objects in aerial images better. For instance, to reduce the interference caused by extracting features of neighboring objects, ISOP [22, 23] follows [29] and predicts the mask on the oriented proposal rather than the horizontal one. Even though these two-stage methods can achieve competitive performance, they are usually time-consuming due to their complicated network architectures.
To accelerate instance segmentation models, one-stage instance segmentation methods [35–49] simplify the pipeline of two-stage methods and reduce the computational cost. Some one-stage approaches achieve instance segmentation from semantic segmentation results by grouping the pixels that belong to the same object. Using instance-sensitive score maps for generating proposals, InstanceFCN [35] first produces the score maps and then generates object instances with an assembling module. YOLACT [41] linearly combines the proposed prototype masks according to the predicted coefficients and then crops with a predicted bounding box. Some one-stage methods learn to generate the feature points or contour of instances. ExtremeNet [42], with the heavy Hourglass backbone [50], detects the extreme points of an object to generate an octagon, which is a relatively rough mask result. For faster inference, ESE-Seg [43] directly regresses the coordinates of contour points through Chebyshev polynomial fitting. Curve-GCN [38] and Point-Set Anchors [46] learn to generate the contour of an instance by regression. Poly-YOLO [47] extends YOLOv3 [51] to instance segmentation by regressing distances and angles. Most recently, PolarMask [49] utilizes polar masks to represent the contour of instances. It predicts multi-orientation distances from the central area to the contour of the instance and finally decodes the set of distances into the polar mask. One-stage aerial instance segmentation, in contrast, is not as well explored as its two-stage counterpart. Most of the existing one-stage aerial instance segmentation methods [52–54] group the pixels that belong to the same object on the semantic segmentation result. Audebert et al. [52] utilize a classification network to achieve object-wise classification on the semantic segmentation result. Mou and Zhu [53] learn both semantic segmentation and semantic boundaries simultaneously to perform instance segmentation of vehicles. 
Different from the one-stage aerial methods above, Huang et al. [54] follow PolarMask [49] and propose the Polar Template Mask to better fit ship instances in aerial images.
Although these one-stage methods are usually faster than two-stage methods, there is still a performance gap between one-stage and two-stage methods [41, 49]. It is observed that low-quality locations (i.e., locations far away from the centroid of an object instance) often produce low-quality predictions [17], and thus these low-quality locations result in performance degradation. In order to suppress low-quality locations, centerness [17, 49] is introduced to estimate the quality of locations, realized as a branch in parallel with the classification branch. The value of centerness predicted by the network is in the range of [0, 1], and a higher centerness value is assigned to a location that is closer to the centroid of the instance. Then the predicted centerness value is multiplied by the classification score to form the final score, thus reducing the effect of low-quality locations and improving the performance remarkably [17, 49].
However, because the scale variations and aspect ratios of object instances in aerial images are often larger than those in natural images, most existing centerness methods [17, 49] developed for natural images are not effective in aerial images. Examples are shown in Figure 1. Polar centerness (PC) is a polar representation-based centerness proposed by the most recent method PolarMask [49], which can achieve state-of-the-art performance. If a location is at the centroid of an object instance, the centerness value of this location should be 1 or approach 1. But the centerness predicted by PC is close to 0 even when the location is at the centroid of the instance, as shown in Figure 1.
To address this issue, we propose a novel centerness for object instance segmentation in aerial images, termed elliptic centerness (EC). To be concrete, we introduce ellipses to fit the complex contours of aerial instances. Then we estimate the quality of locations inside the ellipses by using a two-dimensional anisotropic Gaussian distribution. Considering the information of the whole contour, EC can assign more proper centerness values to aerial objects. As shown in Figure 1, the centerness values at the centroids of objects are strictly 1 and are not affected by large variations of scale or aspect ratio. Based on EC, we develop a one-stage instance segmentation network for aerial images. Experimental results show that our proposed EC achieves a remarkable instance segmentation performance on a large-scale aerial image dataset, iSAID [55]. Generally, our contributions are summarized as follows: (i) We propose an elliptic centerness for aerial instance segmentation, which is a single-layer branch to estimate the quality of locations. Our proposed EC can estimate appropriate centerness values for intricate aerial object instances and thus improve the instance segmentation performance. (ii) Experimental results show that our EC outperforms PC by a large margin (8% mAP over the 15 categories in total) on iSAID. Furthermore, extensive experiments on the trade-off between accuracy and speed demonstrate that our method has competitive accuracy and a faster inference speed than the state of the art.
The remainder of this paper is organized as follows. In Section 2, we describe our proposed method in detail. Then we present our experimental results in Section 3. We briefly discuss the limitations in Section 4. Finally, a conclusion is given in Section 5.
2. Materials and Methods
2.1. Dataset
iSAID [55] is a large-scale aerial image dataset for instance segmentation which shares the same raw images as the DOTA [27] dataset. It contains 2,806 images of varying sizes and 655,451 instances of 15 common object categories, including the following: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC).
The training and validation sets are used for training and testing, respectively, in the ablation study. When comparing with other methods, the testing set is used for testing. Following [55], we crop a series of patches from the original images with an overlap of 200 pixels between patches.
2.2. Review of Existing Centernesses
Centerness was first proposed for object detection [17], where it is designed to estimate the location quality of an anchor point by measuring the distance between the sample point and the box center. As it is defined on the bounding box, we term it BC (box centerness) for short. The BC of one location can be calculated by

BC = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) ), (1)

where l, t, r, and b represent the distances from the location to the four sides of the bounding box.
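As a small illustration, Equation (1) can be sketched in a few lines; the names `l`, `t`, `r`, `b` follow the FCOS convention for the four side distances (a sketch for clarity, not the authors' code):

```python
import math

def box_centerness(l, t, r, b):
    """FCOS-style box centerness from the distances (left, top, right,
    bottom) between a location and the four sides of its bounding box."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# A location at the exact box center gets centerness 1.
print(box_centerness(10, 5, 10, 5))  # 1.0
# An off-center location is down-weighted.
print(round(box_centerness(2, 5, 18, 5), 3))  # 0.333
```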
Polar centerness (PC) is a variant of centerness proposed in [49] for instance segmentation. Since a polar mask M of an anchor point c can be viewed as n contour points {p_1, …, p_n}, the PC of c can be calculated by

PC(c) = sqrt( min_i d(c, p_i) / max_i d(c, p_i) ), (2)

where d(x, y) means the distance between point x and point y. In other words, the polar centerness value of an anchor point c is the square root of the ratio of the minimum and maximum distances from c to the contour of the polar mask M.
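A minimal numeric sketch of polar centerness (assuming Euclidean distances from the anchor to evenly sampled contour points) makes the problem with elongated objects concrete:

```python
import numpy as np

def polar_centerness(center, contour_points):
    """Polar centerness: square root of the ratio of the minimum to the
    maximum distance from the anchor point to the contour points."""
    d = np.linalg.norm(np.asarray(contour_points, float) - np.asarray(center, float), axis=1)
    return float(np.sqrt(d.min() / d.max()))

theta = np.linspace(0, 2 * np.pi, 36, endpoint=False)
# A circle: all contour distances are equal, so PC at the centroid is 1.
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(polar_centerness((0, 0), circle))  # 1.0
# A 10:1 ellipse: PC at the centroid is only sqrt(1/10), the kind of low
# value that elongated aerial objects such as bridges receive.
ellipse = np.stack([10 * np.cos(theta), np.sin(theta)], axis=1)
print(round(polar_centerness((0, 0), ellipse), 3))  # 0.316
```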
From their definitions, both BC and PC lie in [0, 1]. They estimate the quality of localization and consequently suppress non-central areas. Specifically, they affect models in two aspects. (1) In the training phase, centerness weights the loss of each positive anchor point on the regression branch according to its value. (2) In the testing phase, centerness is multiplied by the corresponding classification score, and the refined score is subsequently used to sort proposals in post-processing such as non-maximum suppression (NMS).
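The testing-phase refinement in point (2) amounts to one elementwise multiplication before ranking; a small sketch with made-up scores:

```python
import numpy as np

cls_score = np.array([0.90, 0.80, 0.85])   # per-location classification scores
centerness = np.array([1.00, 0.20, 0.90])  # predicted location quality
refined = cls_score * centerness           # the score actually ranked before NMS
order = np.argsort(-refined)

print(refined)  # roughly [0.9, 0.16, 0.765]
print(order)    # [0 2 1] – the low-quality location is demoted
```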
However, the two centernesses above each have shortcomings. By its definition in Equation (1), BC is designed for object detection tasks and considers only the bounding box, not the mask annotation. Moreover, it neglects the arbitrary orientations of aerial instances.
In Equation (2), PC is defined on the contour of the mask, and it takes the centroid into account. But PC is incompatible with aerial objects and is not able to assign proper values to aerial instances. As shown in Figure 1, we take the PC value at the centroid as an example. There are mainly two kinds of circumstances: (1) By Equation (2), instance (a) with an extremely large aspect ratio (e.g., bridge, large vehicle) and instance (b) with a complex contour (e.g., plane) tend to have lower centerness than round or squared instances (e.g., roundabout, storage tank). (2) Appearing as only a few pixels in aerial images, small instance (c) (e.g., small vehicle) has too few contour points to generate a polar mask target at all orientations, resulting in a pretty low centerness target, because its minimum distance target is close to 0. This condition also occurs when the anchor point is near the contour. In general, PC provides aerial instances with lower centerness targets, which is harmful to the performance of instance segmentation.
2.3. Elliptic Centerness
To estimate the proper centerness value for aerial instances, we propose a new centerness, named elliptic centerness (EC).
Given a polar mask M with its n points on the contour, the elliptic centerness of a pixel p is measured by a two-dimensional anisotropic Gaussian score:

EC(p) = exp( −(σ/2) · (p − c)^T N^(−1) (p − c) ), (3)

where σ is a hyperparameter that affects the value of EC, c is the centroid, and N is the normalized inertial matrix of the polar mask M. More precisely, the matrix N is calculated as

N = [ mu_20, mu_11 ; mu_11, mu_02 ], (4)

where mu_jk (j + k = 2) is the normalized second-order central moment of the polar mask M:

mu_jk = (1 / |M|) Σ_{(x, y) ∈ M} (x − c_x)^j (y − c_y)^k. (5)
It is worth noticing that representing the polar mask with its normalized inertial matrix N is actually equivalent to fitting the polar mask by an ellipse with a semi-major axis 2√λ_1 and a semi-minor axis 2√λ_2 oriented along the principal axis of N, where λ_1 ≥ λ_2 are the two eigenvalues of N (see Figure 2 for an example).
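The EC computation can be sketched directly from pixel coordinates. This is our reading of the formulation above (the exact placement of σ in the exponent is an assumption), not the authors' implementation:

```python
import numpy as np

def elliptic_centerness(points, mask_pixels, sigma=1.0):
    """Anisotropic-Gaussian centerness: fit the mask with its normalized
    second-order central moments and score locations relative to the
    centroid. `mask_pixels` is an (N, 2) array of (x, y) inside the mask."""
    mask = np.asarray(mask_pixels, dtype=float)
    c = mask.mean(axis=0)                  # centroid
    d = mask - c
    N = d.T @ d / len(mask)                # normalized inertia matrix [[mu20, mu11], [mu11, mu02]]
    Ninv = np.linalg.inv(N)
    diff = np.asarray(points, dtype=float) - c
    q = np.einsum('...i,ij,...j->...', diff, Ninv, diff)  # Mahalanobis-like distance
    return np.exp(-sigma * q / 2)          # exactly 1 at the centroid

# A 40 x 4 rectangle of pixels: EC is 1 at the centroid and, matching the
# ellipse fit, falls off faster across the short axis than along the long one.
xs, ys = np.meshgrid(np.arange(40), np.arange(4))
pixels = np.stack([xs.ravel(), ys.ravel()], axis=1)
ec = elliptic_centerness([(19.5, 1.5), (25.5, 1.5), (19.5, 2.5)], pixels)
print(ec)
```

Note how a 6-pixel offset along the long axis still scores higher than a 1-pixel offset across the short axis, which is exactly the anisotropy that BC and PC lack.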
With the definition of EC (Equation (3)), the centerness of the centroid is strictly equal to 1, and the centerness of other neighboring anchor points depends both on the distance to the center and on the shape (aspect ratio and rotation) of the ellipse. Figure 3 visualizes the BC, PC, and EC of aerial instances. The visualization shows that BC is more appropriate than PC, while being surpassed by EC.
2.4. Network Architecture
The architecture with the proposed elliptic centerness is illustrated in Figure 4. Our overall pipeline is as simple as PolarMask [49] and the other one-stage method [17]. We use a fully convolutional network consisting of a backbone and two task-specific head networks. The backbone network with the following feature pyramid network (FPN) [56] extracts deep features of images and generates multi-scale feature maps. Two branches of the head networks, each with four stacked convolution layers, produce the final predictions. One branch predicts both instance categories and centerness, while the other predicts the polar distances of the mask.
In the training phase, the forward propagation finishes here, and the three parts of the predictions are subsequently used to calculate the loss for back propagation. In the testing phase, the classification score map (with one channel per category) is first reduced to a 2D map by choosing the highest score over the categories. Then, since the classification score and the centerness have the same size, they are multiplied by each other to get the refined score, which is later ranked for post-processing (e.g., NMS). At last, the remaining sets of distances are assembled to generate the final polar mask result. The polar-based mask follows the formulation in [49]. It is modeled in polar coordinates (see Figure 4(d)), involving an anchor point and contour points at even angle intervals. Such polar coordinates can be further decoded to Cartesian coordinates. The number of rays is set to 20 in the figure.
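Decoding a polar mask back to Cartesian contour points, as described above, is a direct coordinate transform; a sketch assuming the rays start at angle 0 with even spacing:

```python
import numpy as np

def decode_polar_mask(center, distances):
    """Convert n ray lengths at even angle intervals around an anchor
    point into Cartesian contour points (cx + d cos t, cy + d sin t)."""
    d = np.asarray(distances, dtype=float)
    theta = np.linspace(0, 2 * np.pi, len(d), endpoint=False)
    cx, cy = center
    return np.stack([cx + d * np.cos(theta), cy + d * np.sin(theta)], axis=1)

# 36 equal rays of length 5 decode to a circle of radius 5 around (10, 10).
contour = decode_polar_mask((10, 10), np.full(36, 5.0))
print(np.allclose(np.linalg.norm(contour - (10, 10), axis=1), 5.0))  # True
```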
2.5. Loss Function for Back-Propagation
The loss function consists of three parts according to the three tasks, i.e., the losses of classification, mask regression, and centerness, and is defined as

L = (1 / N_pos) Σ_i [ L_cls(p_i, p_i*) + 1{p_i* > 0} ( L_reg(d_i, d_i*) + L_ec(e_i, e_i*) ) ], (6)

where 1{·} is an indicator function, N_pos is the number of positive samples, and i is the index of a sample in a mini-batch. p_i, d_i, and e_i are the predicted category, distance set, and centerness of anchor i, and p_i*, d_i*, and e_i* are the ground-truth category, distance set, and centerness of anchor i. The focal loss [57], polar IoU loss [49], and cross-entropy loss are adopted as the classification loss L_cls, the mask regression loss L_reg, and the EC loss L_ec, respectively.
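As an illustration of the mask regression term, the polar IoU loss of [49] approximates the mask IoU by the ratio of summed minimum and maximum ray lengths and takes its negative log; a sketch, not the training code:

```python
import numpy as np

def polar_iou_loss(d_pred, d_gt):
    """Polar IoU loss from PolarMask: IoU is approximated as
    sum(min(d, d*)) / sum(max(d, d*)) over the rays, loss = -log(IoU)."""
    d_pred = np.asarray(d_pred, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    iou = np.minimum(d_pred, d_gt).sum() / np.maximum(d_pred, d_gt).sum()
    return -np.log(iou)

print(abs(polar_iou_loss([5.0, 5.0, 5.0], [5.0, 5.0, 5.0])))  # 0.0 – perfect match
print(round(polar_iou_loss([5.0, 5.0, 5.0], [10.0, 10.0, 10.0]), 3))  # 0.693 (= ln 2)
```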
3. Results
3.1. Implementation Details
Given the superiority of ResNet [58] over the conventional VGG [59] backbone, we adopt ResNet-50 pretrained on ImageNet [60] with FPN [56] as the backbone network by default. In the mask regression branch, 36 rays are adopted to represent a polar mask if not specified. The hyperparameters of the focal loss are set as in [49]. We adopt the same training schedules as [49], implemented with MMDetection [61]. We train all models for 12 epochs on iSAID in the ablation study. The SGD optimizer is adopted with an initial learning rate of 0.0025, and the learning rate is divided by 10 at each decay step. The momentum [62] and weight decay are 0.9 and 0.0001, respectively. We adopt a learning rate warmup for 500 iterations. In the inference phase, we choose at most 2000 predictions in each feature level before NMS and finally retain no more than 2000 predictions per image. We use a single RTX 2080 Ti GPU with a batch size of 2 for training and 1 for testing.
3.2. Evaluation Indicators
In this section, we briefly introduce some evaluation indicators in the instance segmentation task, i.e., average precision (AP) and frames per second (FPS).
AP is used to measure the accuracy of the prediction results: the higher the AP, the better the predictions. AP considers both the precision and the recall:

AP = ∫_0^1 P(R) dR, (7)

where P and R denote the precision and the recall, respectively. More specifically, the precision and the recall can be, respectively, calculated by

P = TP / (TP + FP), R = TP / (TP + FN). (8)
TP, FP, and FN denote true positives, false positives, and false negatives, respectively. It is worth noticing that the standard for distinguishing positives from negatives is whether the IoU (mask IoU in the instance segmentation task) between a prediction and the ground truth is higher than a threshold. In the experiments below, the AP of each category is computed under an IoU threshold of 0.5.
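A tiny worked example of the precision and recall definitions above (the counts are invented for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# 8 predicted masks match a ground truth above the IoU threshold,
# 2 do not, and 4 ground-truth instances are missed entirely:
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p)            # 0.8
print(round(r, 3))  # 0.667
```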
mAP denotes the mean AP of the 15 categories averaged over a series of IoU thresholds (0.5 to 0.95, with an interval of 0.05), and mAP_x denotes the mean AP of the 15 categories under one specific IoU threshold (e.g., mAP_50 for a threshold of 0.5).
Besides, following the settings in pycocotools, mAP_S, mAP_M, and mAP_L, respectively, denote the mAP of small objects (area < 32²), medium objects (32² ≤ area ≤ 96²), and large objects (area > 96²).
FPS means the number of images that the model can infer in one second. The whole process contains the forward propagation and the post-processing.
3.3. Ablation Studies
In this section, we conduct a series of experiments with different settings to validate the effectiveness of our proposed method. ResNet-50 and FPN are used in all experiments, and the AP is tested on the validation set of iSAID.
3.3.1. Verification of Upper Bound
An essential concern about the polar-based mask is that it might not depict the mask precisely. In fact, even the pixel-based mask is likely to lose some details of the mask, as it usually goes through the RoIPooling operator. As shown in Figure 5, we calculate the pixel-wise IoU between the target mask and the ground truth on the whole validation set and thereby verify the upper bounds of the polar-based mask in [49] and the pixel-based mask in [3]. It can be seen that the IoU between target and ground truth grows as the number of rays increases. Most categories retain a negligible gap between the two kinds of masks, except for a few, which shows that the concern about the upper bound of the polar-based mask is unnecessary.
3.3.2. Hyperparameter σ
As the formulation of EC (Equation (3)) involves one hyperparameter σ, which scales the covariance of the Gaussian function and therefore affects the values of EC, we first conduct experiments on different σ. As shown in Table 1, different values of σ are used to train the model. We observe that the proposed EC is quite insensitive to the variation of σ from 1.41 to 0.35. Too small a σ (i.e., 0.18) gives high scores (i.e., close to 1.0) to nearly all positive points and makes them weigh as much as the centroid point, which violates the original design intention of centerness and causes a slight drop in accuracy. On the contrary, too large a σ (i.e., 3.54) results in a large proportion of low-quality samples, which decreases the performance noticeably. Overall, the only hyperparameter σ is quite robust, and the proposed EC can nearly be regarded as hyperparameter-free. We adopt the best-performing σ in the remaining experiments.

3.3.3. Effectiveness of Elliptic Centerness
After confirming the setting of the hyperparameter, we compare our EC with the box centerness (BC) in [17] and the polar centerness (PC) in [49]. Table 2 (a) and (c) are the two baselines of BC and PC, respectively. Following the original designs, BC estimates the centerness with respect to the box center, and PC with respect to the centroid. Although PC is specifically designed for the instance segmentation task while BC serves detection, we find that BC outperforms PC in the aerial scene, especially in mAP. This result indicates that the more proper the centerness provided, the better the model learned, as shown in Figure 3.

We first adopt a new sampling strategy, which regards points inside the mask contour as positive (as in Figure 3), whereas in [49] only points in a small range around the centroid are set as positive. As shown in Table 2 (b) and (d), the new sampling strategy improves mAP by about 1%, which shows its effectiveness. Then, in Table 2 (e), we replace the centerness with the proposed EC. We first use the box center in place of the centroid c in Equation (3) and observe that EC outperforms BC by nearly 2% in mAP, which demonstrates the effectiveness of considering the mask annotation. In the meantime, EC surpasses PC by a large margin (47.9% to 54.9% in mAP). This is mainly because our EC is more robust for evaluating the sample quality of aerial instances. Moreover, since PC in [49] is defined according to the centroid, we also generate the EC target with the centroid of the instance, as in Equation (3). This (Table 2 (f)) improves mAP by about 0.5% over Table 2 (e), showing that the centroid is better than the box center for defining EC. Finally, using both the centroid as the instance center and the new sampling strategy (Table 2 (g)), we achieve 30.0% in mAP, which improves slightly over (e) while clearly outperforming the baselines (a) and (c).
We also report class-wise accuracy for further analysis. Instead of listing all 15 categories, we group them according to their characteristics for brevity. Specifically, the 15 categories are divided into 4 parts, i.e., objects with large aspect ratios (lar), complex contours (cc), small scales (ss), and normal aspect ratios (nar). The lar objects include large vehicle (LV), bridge (BR), and ship (SH). The cc objects include plane (PL), harbor (HA), and helicopter (HC). The ss objects include only small vehicle (SV). The remaining eight categories, i.e., baseball diamond (BD), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), swimming pool (SP), and ground track field (GTF), compose the last group of categories with normal aspect ratios (nar). As shown in Table 2, compared with PC, our EC outperforms it by a large margin in the mAP of the lar, cc, and ss groups and also outperforms PC by about 2% in the mAP of the nar group, which verifies that our EC clearly alleviates the irrationality of PC in aerial images. Compared with BC, our method is also better by about 2% mAP in each group.
3.3.4. Number of Rays
We also conduct experiments with different numbers of rays for representing the mask. Experiments in [49] verify that the performance of PolarMask, which is embedded with PC, keeps improving when utilizing more rays in the polar mask and finally becomes saturated. This is intuitive, because more rays can produce a more elaborate contour (shown in Figure 5). However, this result is not consistent with the performance of PolarMask on the iSAID dataset. As shown in the top half of Table 3, the performance of PC keeps dropping as the number of rays rises. We argue that the irrationality of polar centerness discussed before is the main cause, especially for small instances (shown in Figure 1(c)), as the mAP of small objects (mAP_S) drops notably when the number of rays grows from 20 to 72. As shown in the bottom half of Table 3, for our EC, the irrationality of the former centerness is clearly alleviated, since the mAP keeps growing with the rising number of rays, which indicates the effectiveness of the proposed EC.

3.4. Comparisons with the State-of-the-Art
In this section, we compare our proposed EC with other state-of-the-art (SOTA) methods for instance segmentation. The settings have been stated in Sections 2.1 and 3.1.
Before the comparison experiments with SOTA methods, we first discuss the advantages and disadvantages of those methods in the aerial instance segmentation task. As for two-stage methods, we compare our method with Mask R-CNN [3], Mask Scoring R-CNN [6], and CenterMask [8], whose results tend to be more accurate. However, as they utilize horizontal bounding boxes to extract the features of the regional area, their mask results tend to contain patches when objects of the same category gather compactly. Besides, their computation complexity and parameter counts are larger. As for one-stage methods, we introduce PolarMask [49], YOLACT [41], and ExtremeNet [42] for comparison. Their architectures are more concise and therefore have less computation complexity and fewer parameters. However, the segmentation results of these methods are not satisfactory for different reasons. PolarMask and ExtremeNet use the polar mask and the octagon to represent the mask, respectively, which have a lower upper bound of accuracy. The results of YOLACT come from the prototype masks, which are generated directly from the feature map in FPN and lack regional information.
3.4.1. Class-Wise Performances
Tables 3 and 4 show the class-wise AP of each method. We only compare methods whose training schedules are the same, for fairness (i.e., the comparisons in the top half and the bottom half of Tables 3 and 4 are independent). In the top half of Tables 3 and 4, we use ResNet-50 as the backbone, and the training epochs are set to 12, except for YOLACT, which follows the different setting of its original paper. Our EC surpasses PolarMask in every class, especially for complex instances, e.g., plane (+11.1% AP) and harbor (+14.9% AP), large aspect ratio instances, e.g., large vehicle (+13.2% AP) and bridge (+10.4% AP), and small instances, e.g., ship (+12.8% AP) and small vehicle (+16.2% AP), which indicates the effectiveness of EC. The same outperformance in almost all categories also occurs when compared with the other one-stage methods, YOLACT [41] and ExtremeNet [42]. When compared with the state-of-the-art method Mask R-CNN [3], our method can still achieve better results in some categories (i.e., vehicle, ship, storage tank, and soccer-ball field). However, as discussed with Figure 5, due to the limitation of the polar-based mask, our method cannot deliver the desired results compared with Mask R-CNN in categories whose contours are complicated (i.e., plane, harbor, and helicopter). Therefore, we use the mAP of the remaining 12 categories and find a gap of only nearly 1% mAP between EC and Mask R-CNN. In the bottom half of Tables 3 and 4, we conduct extended experiments to validate the potential of our EC. With longer training, a larger training dataset, and the ResNet-101 backbone, our EC has a smaller gap to Mask R-CNN in mAP (60.3% vs. 61.1%) and even outperforms Mask R-CNN in mAP (62.1% vs. 61.0%). Our EC achieves the best results in 10/15 categories.

Qualitative segmentation results of PolarMask, Mask R-CNN, and our EC are visualized in Figure 6. The series of figures explicitly shows how the proposed EC tackles the different kinds of challenging objects in aerial instance segmentation. The first three columns show the results on objects with large aspect ratios. EC detects and segments the typical objects (e.g., large vehicle, ship, and bridge) better than PolarMask and Mask R-CNN, both of which fail to detect many objects with large aspect ratios. Besides, Mask R-CNN produces patches between crowded objects. The fourth column shows the results on objects with complex contours (e.g., plane). It can be seen that the mask results of PolarMask have lower quality than those of Mask R-CNN and EC, and Mask R-CNN still struggles with crowded objects. Finally, the last two columns show the segmentation results on small-scale objects (e.g., small vehicle, storage tank). PolarMask is incapable of segmenting these small objects, and Mask R-CNN is even worse, whereas our proposed EC detects these small objects stably. In conclusion, compared with PolarMask, our method segments large aspect ratio instances and small-scale instances better. Compared with Mask R-CNN, our method performs better on crowded objects without producing patches and is also more effective on large aspect ratio and small-scale instances.
3.4.2. Trade-Off between Accuracy and Speed
Mask R-CNN has slightly higher accuracy, but its inference speed is unsatisfactory due to its two-stage network design. The FPS of our one-stage EC (7.5) is similar to that of Mask R-CNN (7.8). This phenomenon results mainly from the large number of proposals produced, as their scores are rated higher, which costs more time in post-processing. We therefore conducted a trade-off experiment between accuracy and speed by increasing the score threshold of the proposals in the testing stage. As shown in Figure 7, our EC obtains a 2× acceleration in FPS with a drop of nearly 1 point in mAP, while Mask R-CNN gains little in FPS, which demonstrates that our method has competitive accuracy and a faster inference speed than the state of the art. PolarMask has a trade-off curve similar to our EC's, but it lies entirely to the left of ours, which means its accuracy is considerably lower.
3.4.3. Computation Complexity Analysis
Table 5 compares the computation complexity and parameters of different models. Having similar architectures, EC and PolarMask have the same computation complexity and parameters. Note that EC requires less computation and fewer parameters than Mask R-CNN. Even with ResNet-101, our method has a lighter computation load than Mask R-CNN.

4. Discussion
There are still some limitations of the EC. The center plays an important role in the definition of EC. However, as shown in Figure 8, when the central area of an object (e.g., harbor) does not lie within the mask, the high-quality area defined by EC, generated from the mask annotation, will not exist on this object. This harms both the training and testing phases for objects of this kind. The lower AP of EC for the harbor category in Tables 3 and 4 validates this issue to some extent. This issue can also be found in other centerness-based methods.
5. Conclusion
In this paper, we propose a novel location-quality estimator, termed elliptic centerness, to alleviate the issues caused by inaccurate centerness for instance segmentation in aerial images. Extensive experiments demonstrate that our EC achieves accuracy competitive with other state-of-the-art methods while having a faster inference speed on aerial images.
Some limitations remain in centerness-based methods: EC cannot perform well on objects whose central areas do not lie within their masks, and the extra computation of EC slightly affects training efficiency. Future work can therefore focus on these limitations, for example by designing a more robust and more simply defined EC. Moreover, since EC is an estimator of location quality, it can potentially be integrated into other instance segmentation methods with common CNN architectures, provided the definition of EC is adapted to those architectures and a module is added to predict it. Future work can thus also study how to embed EC in other state-of-the-art CNN architectures.
Data Availability
The iSAID data used to support this study are publicly available and can be downloaded from https://captainwhu.github.io/iSAID/index.html. The code is available upon request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.
Authors’ Contributions
Y. Luo proposed the method and implemented the experiments. Y. Luo, J. Han, and Z. Liu participated in the analysis of experimental results. Y. Luo and J. Han wrote the manuscript. Z. Liu, M. Wang, and G. Xia provided supervisory support. All authors read and approved the final manuscript.
Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities (grant number: 2042021kf0040).
Copyright
Copyright © 2022 Yixin Luo et al. Exclusive Licensee Aerospace Information Research Institute, Chinese Academy of Sciences. Distributed under a Creative Commons Attribution License (CC BY 4.0).