
Research Article | Open Access

Volume 2022 |Article ID 9809505 | https://doi.org/10.34133/2022/9809505

Yixin Luo, Jiaming Han, Zhou Liu, Mi Wang, Gui-Song Xia, "An Elliptic Centerness for Object Instance Segmentation in Aerial Images", Journal of Remote Sensing, vol. 2022, Article ID 9809505, 14 pages, 2022. https://doi.org/10.34133/2022/9809505

An Elliptic Centerness for Object Instance Segmentation in Aerial Images

Received: 15 Jan 2022
Accepted: 15 May 2022
Published: 02 Jun 2022

Abstract

Instance segmentation in aerial images is an important and challenging task. Most of the existing methods have adapted instance segmentation algorithms developed for natural images to aerial images. However, these methods easily suffer from performance degradation in aerial images, due to the scale variations, large aspect ratios, and arbitrary orientations of instances caused by the bird’s-eye view of aerial images. To address this issue, we propose an elliptic centerness (EC) for instance segmentation in aerial images, which can assign the proper centerness values to the intricate aerial instances and thus mitigate the performance degradation. Specifically, we introduce ellipses to fit the various contours of aerial instances and measure these fitted ellipses by two-dimensional anisotropic Gaussian distribution. Armed with EC, we develop a one-stage aerial instance segmentation network. Extensive experiments on a commonly used dataset, the instance segmentation in aerial images dataset (iSAID), demonstrate that our proposed method can achieve a remarkable performance of instance segmentation while introducing negligible computational cost.

1. Introduction

Instance segmentation is aimed at predicting both the location and the semantic mask of each object instance in an image. Therefore, an intuitive approach to instance segmentation is to detect the bounding boxes of object instances and then perform semantic segmentation in the area of each box. The conventional two-stage instance segmentation methods adopt this detect-then-segment pipeline [1–14]. Mask R-CNN [3] is a state-of-the-art method, which extends Faster R-CNN [15] by adding a mask prediction branch. Based on Mask R-CNN, Mask Scoring R-CNN [6] predicts the IoU between the mask and the ground truth and further rescores the confidence of the mask through the added mask-IoU branch, acquiring better segmentation results. PANet [4] pays more attention to the process of feature propagation. It proposes bottom-up path aggregation and adaptive feature pooling to merge the features from all levels and finally boost the performance of instance segmentation. HTC [5] is a cascade architecture for instance segmentation. Different from Cascade R-CNN [16], HTC integrates features from each stage to add complementary information and finally obtains better mask predictions. PointRend [7] refines the segmentation of objects in a rendering manner similar to classical computer graphics methods. The methods above all depend on anchor-based detectors. There are also some two-stage methods built upon anchor-free detectors. Extending the detector FCOS [17] by adding a mask branch, CenterMask [8] improves the backbone network [18] and uses a spatial attention-guided module to focus on important features during the segmentation process. DeepSnake [10] introduces circular convolution, which finishes segmentation by transforming the box detected by CenterNet [19] into a polygon. As for two-stage methods in the aerial scene, most existing works [20–26] are based on Mask R-CNN [3]. Su et al. [20] propose Precise RoI-Pooling to replace the RoI-Align in Mask R-CNN, which further avoids the precision degradation resulting from the quantization of coordinates. To learn multiscale context information, Feng et al. [21] embed a local context module into the mask branch of [3]. There are also a few works that depend on rotated object detectors, exploiting the multiorientation characteristic of aerial images. Recently, the rotated object detection task has gained much attention in the aerial scene [27–34], as oriented bounding boxes enclose the objects of aerial images better. For instance, to avoid the loss caused by extracting features of neighboring objects, ISOP [22, 23] follows [29] and predicts the mask on the oriented proposal rather than the horizontal one. Even though these two-stage methods can achieve competitive performance, they are usually time-consuming due to their complicated network architectures.

To accelerate instance segmentation models, one-stage instance segmentation methods [35–49] simplify the pipeline of two-stage methods and reduce the cost of computation. Some one-stage approaches achieve instance segmentation from semantic segmentation results by grouping the pixels that belong to the same object. Using instance-sensitive score maps for generating proposals, InstanceFCN [35] first produces the score maps and then generates object instances with an assembling module. YOLACT [41] linearly combines the proposed prototype masks according to the predicted coefficients and then crops them with a predicted bounding box. Some one-stage methods learn to generate the feature points or contour of instances. ExtremeNet [42], with the heavy HourGlass backbone [50], detects extreme points to generate an octagon, which is a relatively rough mask result. For a faster speed, ESESeg [43] directly regresses the coordinates of contour points through Chebyshev polynomial fitting. Curve-GCN [38] and Point-Set Anchors [46] learn to generate the contour of an instance in a regression manner. Poly-YOLO [47] extends YOLOv3 [51] to perform instance segmentation by regressing distances and angles. Most recently, PolarMask [49] utilizes polar masks to represent the contour of instances. It predicts multiorientation distances from the central area to the contour of the instance and finally decodes the set of distances into the polar mask. One-stage aerial instance segmentation is not as richly explored as its two-stage counterpart. Most of the existing one-stage aerial instance segmentation methods [52–54] group the pixels that belong to the same object on the semantic segmentation result. Audebert et al. [52] utilize a classification network to achieve object-wise classification on the semantic segmentation result. Mou and Zhu [53] learn both semantic segmentation and semantic boundaries simultaneously to accomplish instance segmentation of vehicles. Different from the one-stage aerial methods above, Huang et al. [54] follow PolarMask [49] and propose the Polar Template Mask to better fit ship instances in aerial images.

Although these one-stage methods are usually faster than two-stage methods, there is still a performance gap between one-stage and two-stage methods [41, 49]. It is observed that low-quality locations (i.e., locations far away from the centroid of an object instance) often produce low-quality predictions [17], and thus these low-quality locations result in performance degradation. In order to suppress low-quality locations, centerness [17, 49] is introduced to estimate the quality of locations through a branch in parallel with the classification branch. The centerness value predicted by the network lies in the range of [0, 1], and a higher centerness value is assigned to a location that is closer to the centroid of the instance. The predicted centerness value is then multiplied by the classification score to form the final score, thus reducing the effect of low-quality locations and improving the performance remarkably [17, 49].

However, due to the fact that scale variations and aspect ratios of object instances in aerial images are often larger than those in natural images, most existing centerness methods [17, 49] developed for natural images are not effective in aerial images. Examples are shown in Figure 1. Polar centerness (PC) is a polar representation-based centerness proposed by the most recent method PolarMask [49], which achieves state-of-the-art performance. If a location is at the centroid of an object instance, the centerness value of this location should be 1 or approach 1. However, the centerness value given by PC is close to 0 even when the location is at the centroid of the instance, as shown in Figure 1.

To address this issue, we propose a novel centerness for object instance segmentation in aerial images, termed elliptic centerness (EC). To be concrete, we introduce ellipses to fit the complex contours of aerial instances. Then we estimate the quality of locations inside the ellipses by using a two-dimensional anisotropic Gaussian distribution. Considering the information of the whole contour, EC can assign more proper centerness values to aerial objects. As shown in Figure 1, the centerness values at the centroids of objects are strictly 1 and are not affected by large variations of scale or aspect ratio. Based on EC, we develop a one-stage instance segmentation network for aerial images. Experimental results show that our proposed EC achieves a remarkable instance segmentation performance on a large-scale aerial image dataset, iSAID [55]. Generally, our contributions are summarized as follows:
(i) We propose an elliptic centerness for aerial instance segmentation, which is a single-layer branch to estimate the quality of locations. Our proposed EC can estimate appropriate centerness values for intricate aerial object instances and thus improve the instance segmentation performance.
(ii) Experimental results show that our EC outperforms PC by a large margin (about 8% in mAP50) over all 15 categories of iSAID. Furthermore, extensive experiments on the trade-off between accuracy and speed demonstrate that our method achieves competitive accuracy with a faster inference speed than state-of-the-art methods.

The remainder of this paper is organized as follows. In Section 2, we describe in detail our proposed method. Then we show our experimental results in Section 3. We briefly discuss the limitation in Section 4. Finally, a conclusion is given in Section 5.

2. Materials and Methods

2.1. Dataset

iSAID [55] is a large-scale aerial image dataset for instance segmentation which has the same raw images as the DOTA [27] dataset. It contains 2806 images of widely varying sizes and 655,451 instances of 15 common object categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC).

The training and validation sets are used for training and testing, respectively, in the ablation study. When comparing with other methods, the testing set is used for testing. Following [55], we crop a series of patches from the original images with an overlap of 200 pixels between adjacent patches.
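
As a rough illustration of this preprocessing, the sliding-window cropping can be sketched as follows; note that only the 200-pixel overlap is stated above, so the 800-pixel patch size used here is a placeholder assumption.

```python
import numpy as np

def crop_patches(image, patch_size=800, overlap=200):
    """Crop a large aerial image into overlapping patches.

    Only the 200-pixel overlap is stated in the text; the 800-pixel
    patch size is an assumed placeholder.
    """
    stride = patch_size - overlap
    h, w = image.shape[:2]
    patches = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            # Clamp the window so the last patch stays inside the image.
            y0 = min(y, max(h - patch_size, 0))
            x0 = min(x, max(w - patch_size, 0))
            patches.append(((x0, y0), image[y0:y0 + patch_size, x0:x0 + patch_size]))
    return patches

# Example: a 4000 x 4000 dummy image yields a grid of overlapping patches.
patches = crop_patches(np.zeros((4000, 4000), dtype=np.uint8))
```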

2.2. Review of Existing Centernesses

Centerness was first proposed for object detection [17]; it estimates the location quality of an anchor point by measuring the distance between the sample point and the box center. As it is defined on the bounding box, we term it BC (box centerness) for short. The BC of one location can be calculated by

$$\mathrm{BC} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}, \qquad (1)$$

where $l$, $r$, $t$, and $b$ represent the distances from the location to the four sides of the bounding box.

Polar centerness (PC) is a variant of centerness proposed in [49] for instance segmentation. Since the polar mask of an anchor point $c$ can be viewed as a set of contour points $\{p_1, p_2, \dots, p_n\}$, the PC of $c$ can be calculated by

$$\mathrm{PC} = \sqrt{\frac{\min(d_1, d_2, \dots, d_n)}{\max(d_1, d_2, \dots, d_n)}}, \qquad (2)$$

where $d_i$ denotes the distance between the anchor point $c$ and the contour point $p_i$. In other words, the polar centerness of an anchor point is determined by the ratio of the minimum to the maximum distance from $c$ to the contour of the polar mask.
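
For concreteness, the two definitions can be sketched in a few lines of NumPy; the function and variable names below are ours, not from [17] or [49].

```python
import numpy as np

def box_centerness(l, r, t, b):
    """BC of Equation (1): l, r, t, b are distances from the location to the four box sides."""
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def polar_centerness(distances):
    """PC of Equation (2): ratio of minimum to maximum ray length of the polar mask."""
    d = np.asarray(distances, dtype=float)
    return np.sqrt(d.min() / d.max())

print(box_centerness(5, 5, 5, 5))         # location at the box center -> 1.0
print(polar_centerness([2, 10, 40, 38]))  # elongated instance -> low PC
```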

From their definitions, both BC and PC lie in [0, 1]. They estimate the quality of localization and consequently suppress noncentral areas. Specifically, they affect models in two aspects. (1) In the training phase, centerness weights the loss of each positive anchor point on the regression branch according to its value. (2) In the testing phase, centerness is multiplied by the corresponding classification score, and the refined score is subsequently used to sort proposals in postprocessing such as nonmaximum suppression (NMS).

However, each of the two centernesses above has its own shortcomings. As defined in Equation (1), BC is designed for object detection tasks and only considers the bounding box, without using the mask label. Moreover, it neglects the multiorientation of aerial instances.

In Equation (2), PC is defined on the contour of the mask and takes the centroid into account. But PC is incompatible with aerial objects and cannot assign proper values to aerial instances. As shown in Figure 1, we take the PC value at the centroid as an example. There are mainly two kinds of circumstances: (1) instance (a) with an extremely large aspect ratio (e.g., bridge, large vehicle) and instance (b) with a complex contour (e.g., plane) tend to have lower centerness than round or squared instances (e.g., roundabout, storage tank); (2) appearing as only a few pixels in aerial images, small instance (c) (e.g., small vehicle) has too few contour points to generate a polar mask target at all orientations, so its minimum distance target is close to 0 and the resulting centerness target is very low. The same problem occurs when the anchor point is near the contour. In general, PC provides aerial instances with lower centerness targets, which is harmful to the performance of instance segmentation.

2.3. Elliptic Centerness

To estimate the proper centerness value for aerial instances, we propose a new centerness, named elliptic centerness (EC).

Given a polar mask $P$ with its points on the contour, the elliptic centerness of a pixel $x$ is measured by a two-dimensional anisotropic Gaussian score:

$$\mathrm{EC}(x) = \exp\!\left(-\frac{\sigma^{2}}{2}\,(x - c)^{\top} M^{-1} (x - c)\right), \qquad (3)$$

where $\sigma$ is a hyperparameter that affects the value of EC, $c$ is the centroid, and $M$ is the normalized inertial matrix of the polar mask $P$. More precisely, the matrix $M$ is calculated as

$$M = \begin{bmatrix} m_{2,0} & m_{1,1} \\ m_{1,1} & m_{0,2} \end{bmatrix}, \qquad (4)$$

where $m_{p,q}$ is the normalized $(p+q)$-order central moment of the polar mask $P$:

$$m_{p,q} = \frac{1}{|P|} \sum_{(u, v) \in P} (u - c_u)^{p}\,(v - c_v)^{q}. \qquad (5)$$

It is worth noticing that representing the polar mask with its normalized inertial matrix $M$ is actually equivalent to fitting the polar mask by an ellipse whose semimajor and semiminor axes are proportional to $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$ and whose orientation follows the principal eigenvector of $M$, where $\lambda_1 \ge \lambda_2$ are the two eigenvalues of $M$ (see Figure 2 for an example).

With the definition of EC (Equation (3)), the centerness of the centroid is strictly equal to 1, and the centerness of the other neighboring anchor points depends both on the distance to the center and on the shape (aspect ratio and rotation) of the ellipse. Figure 3 visualizes the BC, PC, and EC of aerial instances. The visualization shows that BC is more appropriate than PC, while both are surpassed by EC.
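
The following sketch illustrates how EC can be computed for a pixel from the mask pixels, following Equations (3)-(5); the exact placement of the hyperparameter in the exponent is our reading of the definition, so treat it as illustrative rather than a verbatim implementation.

```python
import numpy as np

def elliptic_centerness(pixel, mask_pixels, sigma=1.41):
    """Sketch of EC (Equations (3)-(5)).

    The placement of sigma in the exponent is an assumption consistent with the
    ablation behavior described in the text, not a verbatim copy of the paper.
    """
    pts = np.asarray(mask_pixels, dtype=float)   # (N, 2) pixels of the polar mask
    c = pts.mean(axis=0)                         # centroid
    centered = pts - c
    M = centered.T @ centered / len(pts)         # normalized inertial matrix, Eqs. (4)-(5)
    diff = np.asarray(pixel, dtype=float) - c
    d2 = diff @ np.linalg.inv(M) @ diff          # anisotropic (Mahalanobis-like) distance
    return float(np.exp(-0.5 * sigma ** 2 * d2))

# Example: an elongated instance; EC is exactly 1 at the centroid.
ys, xs = np.mgrid[0:4, 0:40]
mask = np.stack([xs.ravel(), ys.ravel()], axis=1)
print(elliptic_centerness(mask.mean(axis=0), mask))  # 1.0
print(elliptic_centerness((0, 0), mask))             # low value far from the centroid
```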

2.4. Network Architecture

The architecture with the proposed elliptic centerness is illustrated in Figure 4. Our overall pipeline is as simple as that of PolarMask [49] and other one-stage methods [17]. We use a fully convolutional network consisting of a backbone and two task-specific head networks. The backbone network, followed by a feature pyramid network (FPN) [56], extracts deep features of images and generates multiscale feature maps. Two branches of the head network, each with four stacked convolution layers, produce the final predictions. One branch predicts both instance categories and centerness, while the other predicts the polar distances of the mask.

In the training phase, the forward propagation finishes here, and the three parts of predictions are subsequently used to calculate the loss for back-propagation. In the testing phase, the classification score map ($H \times W \times K$, where $K$ denotes the number of categories) is first reduced to a 2D map ($H \times W$) by choosing the highest score over categories. Then, since the classification score map and the centerness map have the same size, they are multiplied element-wise to obtain the refined score, which is later ranked for postprocessing (e.g., NMS). Finally, the remaining sets of distances are assembled to generate the final polar mask results. The polar-based mask follows the formulation in [49]. It is modeled in polar coordinates (see Figure 4(d)), involving an anchor point and contour points at evenly spaced angle intervals. Such polar coordinates can be further decoded to Cartesian coordinates. The number of rays is set to 20 in the figure.
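
A minimal sketch of the two inference-time steps described above, namely score refinement and decoding a set of polar distances into a Cartesian contour; the array shapes and function names are illustrative assumptions.

```python
import numpy as np

def refine_scores(cls_scores, centerness):
    """Collapse per-class scores to one map and multiply by the predicted centerness,
    as done before NMS in the testing phase."""
    best_cls = cls_scores.max(axis=-1)   # (H, W): highest score over categories
    return best_cls * centerness         # (H, W): refined score used to rank proposals

def decode_polar_mask(center, distances):
    """Decode ray lengths at evenly spaced angles into Cartesian contour points."""
    d = np.asarray(distances, dtype=float)
    angles = np.arange(len(d)) * 2 * np.pi / len(d)
    cx, cy = center
    return np.stack([cx + d * np.cos(angles), cy + d * np.sin(angles)], axis=1)

scores = refine_scores(np.random.rand(64, 64, 15), np.random.rand(64, 64))
contour = decode_polar_mask((32.0, 32.0), np.full(36, 10.0))
```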

2.5. Loss Function for Back-Propagation

The loss function consists of three parts according to the three tasks, i.e., the losses of classification, mask regression, and centerness, and is defined as

$$L = \frac{1}{N_{\mathrm{pos}}}\sum_{i} L_{\mathrm{cls}}\!\left(p_i, p_i^{*}\right) + \frac{1}{N_{\mathrm{pos}}}\sum_{i} \mathbb{1}_{\{p_i^{*} > 0\}}\, L_{\mathrm{mask}}\!\left(d_i, d_i^{*}\right) + \frac{1}{N_{\mathrm{pos}}}\sum_{i} \mathbb{1}_{\{p_i^{*} > 0\}}\, L_{\mathrm{EC}}\!\left(e_i, e_i^{*}\right), \qquad (6)$$

where $\mathbb{1}_{\{\cdot\}}$ is an indicator function, $N_{\mathrm{pos}}$ is the number of positive samples, and $i$ is the index of a sample in a minibatch. $p_i$, $d_i$, and $e_i$ are the predicted category, distance set, and centerness of anchor $i$, and $p_i^{*}$, $d_i^{*}$, and $e_i^{*}$ are the corresponding ground-truth category, distance set, and centerness. The focal loss [57], polar IoU loss [49], and cross-entropy loss are adopted as the classification loss $L_{\mathrm{cls}}$, the mask regression loss $L_{\mathrm{mask}}$, and the EC loss $L_{\mathrm{EC}}$, respectively.
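
Assuming the three per-anchor loss terms have already been computed by their respective loss functions, Equation (6) can be assembled as in the following PyTorch-style sketch (names are ours).

```python
import torch

def total_loss(cls_loss, mask_loss, ec_loss, positive):
    """Combine the three terms of Equation (6).

    Each *_loss is a per-anchor tensor, and `positive` is the 0/1 indicator of
    positive anchors; the focal, polar IoU, and cross-entropy losses themselves
    are assumed to be computed elsewhere.
    """
    n_pos = positive.sum().clamp(min=1)
    loss = cls_loss.sum() / n_pos
    loss = loss + (mask_loss * positive).sum() / n_pos
    loss = loss + (ec_loss * positive).sum() / n_pos
    return loss

anchors = 1000
loss = total_loss(torch.rand(anchors), torch.rand(anchors), torch.rand(anchors),
                  (torch.rand(anchors) > 0.7).float())
```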

3. Results

3.1. Implementation Details

Given the superiority of ResNet [58] over the conventional VGG backbone [59], we adopt ResNet-50 pretrained on ImageNet [60] with FPN [56] as the backbone network by default. In the mask regression branch, 36 rays are adopted to represent a polar mask if not specified otherwise. The hyperparameters of the focal loss are set to the default values $\alpha = 0.25$ and $\gamma = 2$ [57]. We adopt the same training schedule as [49], implemented with MMDetection [61]. We train all models for 12 epochs on iSAID in the ablation study. The SGD optimizer is adopted with an initial learning rate of 0.0025, and the learning rate is divided by 10 at each decay step. The momentum [62] and weight decay are 0.9 and 0.0001, respectively. We adopt a learning rate warmup for 500 iterations. In the inference phase, we choose at most 2000 predictions in each feature level before NMS and finally retain no more than 2000 predictions per image. We use a single RTX 2080 Ti GPU with a batch size of 2 for training and 1 for testing.
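
The optimizer-related settings above can be summarized by the following PyTorch sketch; the decay milestones are an assumption based on a standard 12-epoch schedule and are not stated explicitly in the text.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder module standing in for the network

# SGD with the stated learning rate, momentum, and weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=0.0001)

# Learning rate divided by 10 at each decay step; the epochs [8, 11] are assumed.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

def warmup_lr(iteration, base_lr=0.0025, warmup_iters=500):
    """Warmup over the first 500 iterations (a linear ramp is assumed here)."""
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    return base_lr
```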

3.2. Evaluation Indicators

In this section, we briefly introduce some evaluation indicators in the instance segmentation task, i.e., average precision (AP) and frames per second (FPS).

AP is used to measure the accuracy of the prediction results; the higher the AP, the better the prediction. AP considers both the precision and the recall:

$$\mathrm{AP} = \int_{0}^{1} P(R)\,\mathrm{d}R,$$

where $P$ and $R$ denote the precision and the recall, respectively. More specifically, the precision and the recall can be calculated by

$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$

TP, FP, and FN denote true positives, false positives, and false negatives, respectively. It is worth noticing that the criterion for distinguishing positives from negatives is whether the IoU (mask-IoU in the instance segmentation task) between the prediction and the ground truth exceeds a threshold. In the experiments below, the AP of each category is computed under an IoU threshold of 0.5.
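
As a small worked example of these definitions (the counts are arbitrary):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A prediction counts as a true positive only if its mask-IoU with the
# ground truth exceeds the chosen threshold (0.5 for the per-category APs below).
print(precision_recall(tp=80, fp=20, fn=10))  # (0.8, 0.888...)
```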

The mAP denotes the mean AP of the 15 categories under a series of IoU thresholds (0.5-0.95, with an interval of 0.05), and mAPx (e.g., mAP50, mAP75) denotes the mean AP of the 15 categories under one specific IoU threshold x.

Besides, following the settings of pycocotools, mAP_S, mAP_M, and mAP_L denote the mAP of small objects (area < 32^2 pixels), medium objects (32^2 <= area <= 96^2 pixels), and large objects (area > 96^2 pixels), respectively.

FPS denotes the number of images that the model can infer in one second. The whole process includes the forward propagation and the postprocessing.
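
A simple way to measure FPS as defined here is to time a single inference callable that contains both the forward pass and the postprocessing (the toy model below is only a stand-in):

```python
import time

def measure_fps(infer, images):
    """FPS: images processed per second; `infer` wraps forward pass and postprocessing."""
    start = time.perf_counter()
    for img in images:
        infer(img)
    return len(images) / (time.perf_counter() - start)

fps = measure_fps(lambda img: sum(img), [[1, 2, 3]] * 100)  # toy stand-in for a model
```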

3.3. Ablation Studies

In this section, we conduct a series of experiments with different settings to validate the effectiveness of our proposed method. ResNet-50 and FPN are used in all experiments, and the AP is tested on the validation set of iSAID.

3.3.1. Verification of Upper Bound

An essential concern about the polar-based mask is that it might not depict the mask precisely. In fact, even the pixel-based mask is likely to lose some details, since it usually passes through the RoI pooling operator. As shown in Figure 5, we calculate the pixel-wise IoU between the target mask and the ground truth on the whole validation set and thereby verify the upper bound of the polar-based mask in [49] and the pixel-based mask in [3]. It can be seen that the IoU between target and ground truth grows as the number of rays increases. For most categories, the gap between the two kinds of masks is negligible, with only a few exceptions, which shows that the concern about the upper bound of the polar-based mask is unnecessary.
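
The upper-bound computation can be approximated as follows: rasterize the best n-ray polar approximation of a ground-truth mask and measure its pixel-wise IoU with that mask. The ray-length extraction below (farthest mask pixel per angular bin) is a simplification of the target construction in [49], so the numbers it produces are indicative only.

```python
import numpy as np
from skimage.draw import polygon

def polar_upper_bound_iou(gt_mask, n_rays=36):
    """IoU between a binary mask and its rasterized n-ray polar approximation."""
    ys, xs = np.nonzero(gt_mask)
    cy, cx = ys.mean(), xs.mean()                         # centroid as the polar center
    ang = np.arctan2(ys - cy, xs - cx)
    dist = np.hypot(ys - cy, xs - cx)
    bins = ((ang + np.pi) / (2 * np.pi) * n_rays).astype(int) % n_rays
    radii = np.zeros(n_rays)
    for b, d in zip(bins, dist):                          # farthest pixel per angular bin
        radii[b] = max(radii[b], d)
    theta = -np.pi + (np.arange(n_rays) + 0.5) * 2 * np.pi / n_rays
    rr, cc = polygon(cy + radii * np.sin(theta), cx + radii * np.cos(theta), gt_mask.shape)
    approx = np.zeros_like(gt_mask, dtype=bool)
    approx[rr, cc] = True
    inter = np.logical_and(approx, gt_mask).sum()
    union = np.logical_or(approx, gt_mask).sum()
    return inter / union

mask = np.zeros((60, 60), dtype=bool)
mask[20:40, 10:50] = True                                 # a simple rectangular instance
print(polar_upper_bound_iou(mask))
```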

3.3.2. Hyperparameter

As the formulation of EC (Equation (3)) involves one hyperparameter $\sigma$, which is part of the covariance of the Gaussian function and therefore affects the values of EC, we first conduct experiments on different $\sigma$. As shown in Table 1, different values of $\sigma$ are used to train the model. We observe that the proposed EC is quite insensitive to the variation of $\sigma$ from 1.41 to 0.35. A too small $\sigma$ (i.e., 0.18) gives high scores (i.e., close to 1.0) to nearly all positive points and makes them weigh equally with the centroid point, which violates the original design intention of centerness and causes a slight drop in accuracy. On the contrary, a too large $\sigma$ (i.e., 3.54) results in a large proportion of low-quality samples, which decreases the performance noticeably. Overall, the only hyperparameter is quite robust, and the proposed EC can nearly be regarded as hyperparameter-free. We set $\sigma = 1.41$ (the best-performing value in Table 1) in the remaining experiments.


Table 1: Results of the proposed EC with different values of the hyperparameter σ on the iSAID validation set.

σ    | mAP  | mAP50 | mAP75 | mAP_S | mAP_M | mAP_L
3.54 | 27.7 | 52.0  | 26.4  | 13.9  | 36.5  | 40.4
1.41 | 29.4 | 54.5  | 27.8  | 15.5  | 37.8  | 42.0
0.71 | 29.1 | 55.0  | 27.3  | 15.8  | 36.8  | 41.3
0.35 | 28.8 | 55.0  | 26.7  | 15.7  | 36.7  | 41.4
0.18 | 28.6 | 54.2  | 26.7  | 15.4  | 36.1  | 40.9

3.3.3. Effectiveness of Elliptic Centerness

After confirming the setting of the hyperparameter, we compare our EC with the box centerness (BC) in [17] and the polar centerness (PC) in [49]. Table 2 (a) and (c) are the two baselines of BC and PC, respectively. Following the original designs, BC estimates the centerness with respect to the box center, and PC with respect to the centroid. Although PC is specifically designed for the instance segmentation task while BC serves detection, we find that BC outperforms PC in the aerial scene, especially for small-scale objects (mAP_ss). Such a result indicates that the more proper the centerness provided, the better the model learned, as illustrated in Figure 3.


Table 2: Ablation study of the proposed EC on the iSAID validation set. "Score" denotes the centerness used, CI denotes using the centroid of the instance (rather than the box center) as the center, and MS denotes the new sample strategy that regards all points inside the mask contour as positive. The last four columns group the 15 categories by large aspect ratios (lar), complex contours (cc), small scales (ss), and normal aspect ratios (nar).

    | Score | CI | MS | mAP  | mAP50 | mAP_lar | mAP_cc | mAP_ss | mAP_nar
(a) | BC    |    |    | 28.0 | 53.1  | 58.7    | 42.1   | 31.0   | 58.0
(b) | BC    |    | ✓  | 28.5 | 53.6  | 59.3    | 41.5   | 31.8   | 58.8
(c) | PC    |    |    | 26.5 | 47.9  | 48.1    | 35.7   | 15.7   | 56.3
(d) | PC    |    | ✓  | 27.4 | 49.0  | 48.9    | 35.7   | 14.6   | 58.3
(e) | EC    |    |    | 29.3 | 54.9  | 61.1    | 44.2   | 34.3   | 59.2
(f) | EC    | ✓  |    | 29.4 | 55.3  | 62.1    | 44.4   | 34.4   | 59.7
(g) | EC    | ✓  | ✓  | 30.0 | 55.2  | 59.9    | 43.5   | 33.5   | 60.5

We first adopt a new sample strategy, which regards points inside the mask contour as positive points (as in Figure 3), whereas in [49] only points in a small range around the centroid are set as positive. As shown in Table 2 (b) and (d), the new sample strategy improves the results by about 1% mAP, which shows its effectiveness. Then, in Table 2 (e), we replace the centerness with the proposed EC. We first try to use the box center as the center in Equation (3) and observe that EC outperforms BC by nearly 2% in mAP50, which demonstrates the effectiveness of considering the mask label. In the meantime, EC surpasses PC by a large margin (47.9% to 54.9% in mAP50). This is mainly because our EC is more robust for evaluating the sample quality of aerial instances. Moreover, since PC in [49] is defined according to the centroid, we also generate the EC target with the centroid of the instance in Equation (3), and this variant (Table 2 (f)) improves mAP50 by about 0.5% over Table 2 (e). The result shows that the centroid is better than the box center for defining EC. Finally, we use both the centroid as the instance center and the new sample strategy (Table 2 (g)) and achieve 30.0% in mAP, which improves slightly over (e) while obviously outperforming the baselines (a) and (c).

We also report the results of class-wise accuracy for further analysis. Instead of listing all 15 categories, we group them according to their characteristics for brevity. Specifically, the 15 categories are divided into 4 parts, i.e., objects with large aspect ratios (lar), complex contours (cc), small scales (ss), and normal aspect ratios (nar). The lar objects include large vehicle (LV), bridge (BR), and ship (SH). The cc objects include plane (PL), harbor (HA), and helicopter (HC). The ss objects only include small vehicle (SV). The remaining eight categories, i.e., baseball diamond (BD), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), swimming pool (SP), and ground track field (GTF), compose the last group, whose members are categories with normal aspect ratios (nar). As shown in Table 2, compared with PC, our EC outperforms it by a large margin in mAP_lar, mAP_cc, and mAP_ss and also outperforms PC by about 2% in mAP_nar, which verifies that our EC clearly alleviates the irrationality of PC in aerial images. Compared with BC, our method is also better by about 2% mAP in each group.

3.3.4. Number of Rays

We also conduct experiments with different numbers of rays for representing the mask. Experiments in [49] verify that the performance of PolarMask, which is embedded with PC, keeps improving when more rays are used for the polar mask and finally saturates. This is intuitive, because more rays produce a more elaborate contour (shown in Figure 5). However, this result is not consistent with the performance of PolarMask on the iSAID dataset. As shown in the top half of Table 3, the performance of PC keeps dropping as the number of rays rises. We argue that the irrationality of polar centerness discussed before is the main cause of this phenomenon, especially for small instances (shown in Figure 1(c)), as the mAP of small objects (mAP_S) drops notably when the number of rays increases from 20 to 72. As shown in the bottom half of Table 3, for our EC this irrationality is clearly alleviated, since the mAP keeps growing with the rising number of rays, which indicates the effectiveness of the proposed EC.


Results with different numbers of rays for PC and the proposed EC on the iSAID validation set.

Centerness | Rays | mAP  | mAP50 | mAP75 | mAP_S | mAP_M | mAP_L
PC         | 20   | 26.9 | 49.9  | 25.8  | 13.0  | 35.7  | 41.0
PC         | 36   | 26.5 | 47.9  | 25.6  | 11.7  | 35.9  | 41.5
PC         | 72   | 25.5 | 45.3  | 25.2  | 10.3  | 35.3  | 41.5
EC         | 20   | 28.7 | 54.0  | 27.5  | 15.2  | 36.2  | 42.2
EC         | 36   | 30.0 | 55.2  | 29.0  | 15.9  | 38.8  | 44.5
EC         | 72   | 30.5 | 55.5  | 29.4  | 16.2  | 38.9  | 43.7

3.4. Comparisons with the State-of-the-Art

In this section, we compare our proposed EC with other state-of-the-art (SOTA) methods for instance segmentation. The settings have been stated in Sections 2.1 and 3.1.

Before the comparison experiments with SOTA methods, we first discuss the advantages and disadvantages of those methods for the aerial instance segmentation task. As for two-stage methods, we compare our method with Mask R-CNN [3], Mask Scoring R-CNN [6], and CenterMask [8], whose results tend to be more accurate. However, as they utilize horizontal bounding boxes to extract the features of regional areas, their mask results tend to contain patches when objects of the same category gather compactly. Besides, their computation complexity and parameter counts are larger. As for one-stage methods, we introduce PolarMask [49], Yolact [41], and ExtremeNet [42] for comparison. Their architectures are more concise and therefore have lower computation complexity and fewer parameters. However, the segmentation results of these methods are not satisfactory for different reasons. PolarMask and ExtremeNet use the polar mask and the octagon to represent the mask, respectively, which have a lower upper bound of accuracy. The results of Yolact come from prototype masks, which are generated directly from the FPN feature maps and lack regional information.

3.4.1. Class-Wise Performances

Tables 3 and 4 show the class-wise AP of each method. For fairness, we only compare methods whose training schedules are the same (i.e., the comparisons of the top half and the bottom half of the tables are independent). In the top half, we use ResNet-50 as the backbone and set the training epochs to 12, except for Yolact, which follows the setting of its original paper. Our EC surpasses PolarMask in every class, especially for instances with complex contours, e.g., plane (+11.1% AP) and harbor (+14.9% AP), instances with large aspect ratios, e.g., large vehicle (+13.2% AP) and bridge (+10.4% AP), and small instances, e.g., ship (+12.8% AP) and small vehicle (+16.2% AP), which indicates the effectiveness of EC. The same outperformance in almost all categories also holds when compared with other one-stage methods such as Yolact [41] and ExtremeNet [42]. Compared with the state-of-the-art method Mask R-CNN [3], our method can still achieve better results in some categories (i.e., vehicle, ship, storage tank, and soccer-ball field). However, as discussed for Figure 5, due to the limitation of the polar-based mask, our method cannot achieve the desired results relative to Mask R-CNN in categories whose contours are complicated (i.e., plane, harbor, and helicopter). Therefore, we use mAP_12 to denote the mAP of the remaining 12 categories and find only a gap of nearly 1% between EC and Mask R-CNN. In the bottom half, we conduct experiments as an extension to validate the potential of our EC. With longer training, a larger training set, and the ResNet-101 backbone, our EC has a smaller gap with Mask R-CNN in mAP50 (60.3% vs. 61.1%) and even outperforms Mask R-CNN in mAP_12 (62.1% vs. 61.0%). Our EC achieves the best results in 10 of 15 categories.


Class-wise AP (%) of each method on the iSAID test set. mAP50 is the mean AP over all 15 categories at an IoU threshold of 0.5, mAP_12 is the corresponding mean over the 12 categories excluding plane, harbor, and helicopter, and "-r101" denotes a ResNet-101 backbone.

Methods | Epochs | mAP50 | mAP_12 | mAP | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | FPS

Two-stage
Mask R-CNN [3] | 12 | 60.1 | 60.4 | 36.3 | 82.6 | 74.0 | 41.6 | 46.2 | 31.9 | 52.3 | 81.0 | 92.3 | 64.8 | 57.0 | 48.5 | 66.0 | 63.4 | 69.2 | 30.1 | 7.8
MS R-CNN [6] | 12 | 59.3 | 60.0 | 36.9 | 81.0 | 73.6 | 40.7 | 45.8 | 31.8 | 52.5 | 80.4 | 91.3 | 65.5 | 56.2 | 46.8 | 66.7 | 62.9 | 68.6 | 25.7 | 7.9
CenterMask [8] | 12 | 57.0 | 57.4 | 33.5 | 79.7 | 68.9 | 39.5 | 40.8 | 31.1 | 51.5 | 72.3 | 91.8 | 61.7 | 57.1 | 41.4 | 63.5 | 60.7 | 69.1 | 26.2 | 3.0
One-stage
PolarMask [49] | 12 | 48.9 | 52.3 | 27.5 | 62.5 | 67.0 | 28.2 | 35.0 | 19.6 | 40.1 | 68.7 | 90.6 | 61.8 | 52.6 | 43.5 | 60.9 | 27.8 | 60.1 | 15.6 | 8.2
Yolact [41] | 55 | 46.3 | 46.9 | 23.8 | 62.6 | 61.7 | 31.5 | 27.0 | 21.8 | 26.7 | 57.9 | 84.6 | 49.1 | 50.0 | 29.3 | 61.7 | 50.1 | 62.6 | 18.0 | 4.0
ExtremeNet [42] | 12 | 31.3 | 39.0 | 11.7 | 0.1 | 60.0 | 15.6 | 31.5 | 8.1 | 6.7 | 15.2 | 85.9 | 62.4 | 51.6 | 39.3 | 66.1 | 0.8 | 25.3 | 1.1 | 6.2
EC | 12 | 56.7 | 59.2 | 31.0 | 73.6 | 71.5 | 38.6 | 37.2 | 35.8 | 53.3 | 81.5 | 91.4 | 63.8 | 58.2 | 49.0 | 61.4 | 42.7 | 69.2 | 23.6 | 7.5
Extension
EC | 24 | 57.6 | 59.5 | 32.3 | 76.1 | 70.2 | 38.2 | 37.7 | 37.2 | 53.7 | 82.3 | 92.3 | 63.3 | 56.8 | 49.9 | 63.0 | 46.5 | 69.0 | 27.3 | -
EC-r101 | 12 | 57.2 | 59.7 | 31.6 | 75.0 | 74.0 | 39.2 | 37.9 | 35.8 | 53.5 | 81.7 | 91.7 | 64.4 | 56.9 | 49.0 | 62.8 | 44.2 | 69.2 | 22.8 | 6.9
EC-r101 | 24 | 60.3 | 62.1 | 34.4 | 78.1 | 76.5 | 41.3 | 45.6 | 38.3 | 55.0 | 83.4 | 92.4 | 66.9 | 58.4 | 51.5 | 64.6 | 49.6 | 71.5 | 30.8 | -
PolarMask-r101 | 24 | 51.4 | 53.3 | 30.1 | 66.5 | 72.2 | 29.9 | 38.4 | 19.0 | 41.9 | 63.0 | 90.5 | 66.1 | 50.3 | 49.3 | 61.9 | 33.9 | 65.6 | 22.1 | 7.9
Mask R-CNN-r101 | 24 | 61.1 | 61.0 | 37.8 | 83.2 | 76.4 | 41.3 | 51.8 | 31.6 | 52.8 | 80.5 | 91.3 | 66.1 | 55.5 | 51.1 | 66.0 | 66.9 | 68.2 | 34.2 | 7.0
MS R-CNN-r101 | 24 | 58.8 | 58.6 | 38.1 | 81.6 | 75.7 | 40.6 | 50.4 | 26.1 | 49.6 | 70.7 | 90.6 | 65.4 | 52.2 | 48.6 | 67.4 | 64.8 | 66.4 | 32.0 | 8.0

Qualitative segmentation results of PolarMask, Mask R-CNN, and our EC are visualized in Figure 6. The series of figures explicitly shows how the proposed EC tackles the different kinds of challenging objects in aerial instance segmentation. The first three columns show the results for objects with large aspect ratios. EC detects and segments these typical objects (e.g., large vehicle, ship, and bridge) better than PolarMask and Mask R-CNN, both of which fail to detect many objects with large aspect ratios. Besides, Mask R-CNN produces patches between crowded objects. The fourth column shows the results for objects with complex contours (e.g., plane). It can be seen that the mask results of PolarMask have lower quality than those of Mask R-CNN and EC, and Mask R-CNN still struggles with crowded objects. Finally, the last two columns show the segmentation results of small-scale objects (e.g., small vehicle, storage tank). PolarMask is incapable of segmenting these small objects, and Mask R-CNN is even worse, whereas our proposed EC detects them stably. In conclusion, compared with PolarMask, our method segments instances with large aspect ratios and small scales better. Compared with Mask R-CNN, our method performs better on crowded objects without producing patches and is also more effective on instances with large aspect ratios and small scales.

3.4.2. Trade-Off between Accuracy and Speed

Mask R-CNN has slightly higher accuracy, but its inference speed is unsatisfactory due to its two-stage network design. However, the FPS of our one-stage EC (7.5) is only similar to that of Mask R-CNN (7.8). This phenomenon results mainly from the large number of proposals whose refined scores are rated high, which costs more time in postprocessing. We therefore conducted a trade-off experiment between accuracy and speed by increasing the score threshold of proposals in the testing stage. As shown in Figure 7, our EC obtains a 2x acceleration in FPS with a drop of nearly 1% in mAP, while Mask R-CNN gains little in FPS, which demonstrates that our method has competitive accuracy and a faster inference speed than the state-of-the-art. PolarMask has a trade-off curve similar to that of our EC, but it lies entirely to the left of ours, which means its accuracy is considerably lower.
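
The trade-off experiment simply raises the score threshold applied to proposals before NMS, as in the following sketch (the threshold value is illustrative):

```python
def filter_proposals(scores, masks, score_thr=0.05):
    """Keep only proposals whose refined score exceeds the threshold.

    Raising score_thr leaves fewer proposals for NMS, trading a small drop
    in mAP for a faster postprocessing stage.
    """
    keep = [i for i, s in enumerate(scores) if s > score_thr]
    return [scores[i] for i in keep], [masks[i] for i in keep]

kept_scores, kept_masks = filter_proposals([0.9, 0.02, 0.4], ["m0", "m1", "m2"], score_thr=0.1)
```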

3.4.3. Computation Complexity Analysis

Table 5 compares the computation complexity and parameters of different models. Having similar architectures, EC and PolarMask share the same computation complexity and number of parameters. Note that EC has less computation and fewer parameters than Mask R-CNN. Even with ResNet-101, our method has a lighter computation load than Mask R-CNN.


Table 5: Computation complexity and parameters of different models.

Model      | Backbone   | GFLOPs | Params (M)
Mask R-CNN | ResNet-50  | 201.65 | 43.83
PolarMask  | ResNet-50  | 124.34 | 31.94
EC         | ResNet-50  | 124.34 | 31.94
EC         | ResNet-101 | 171.88 | 50.88

4. Discussion

There are still some limitations of EC. The center plays an important role in its definition. However, as shown in Figure 8, when the central area of an object (e.g., a harbor) does not lie inside the mask, the high-quality area defined by EC and generated from the mask annotation does not exist for this object. This harms both the training and testing phases for objects of this kind. The lower AP of EC for the harbor category in Tables 3 and 4 validates this issue to some extent. The same issue can also be found in other centerness-based methods.

5. Conclusion

In this paper, we propose a novel location-quality estimator, termed elliptic centerness, to alleviate the issues caused by inaccurate centerness for instance segmentation in aerial images. Extensive experiments demonstrate that our EC achieves accuracy competitive with other state-of-the-art methods while offering a faster inference speed on aerial images.

There are still some limitations in centerness-based methods: EC cannot perform well on objects whose central areas do not lie inside their masks, and the extra computation of EC slightly affects the training efficiency. Future work can therefore focus on these limitations, such as designing a more robust and simply defined EC. Moreover, since our proposed EC is an estimator of location quality, it has the potential to be integrated into other instance segmentation methods with common CNN architectures, as long as the definition of EC is adapted to their architectures and a module is added to predict it. Therefore, future work can also study how to embed EC in other SOTA CNN architectures.

Data Availability

The data of iSAID used to support this study are publicly available. The iSAID data can be downloaded from the website https://captain-whu.github.io/iSAID/index.html. The code is available upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Authors’ Contributions

Y. Luo proposed the method and implemented the experiments. Y. Luo, J. Han, and Z. Liu participated in the analysis of experimental results. Y. Luo and J. Han wrote the manuscript. Z. Liu, M. Wang, and G. Xia provided supervisory support. All authors read and approved the final manuscript.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (grant number: 2042021kf0040).

References

  1. J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3150–3158, Las Vegas, USA, 2016. View at: Google Scholar
  2. Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2359–2367, Hawaii, USA, 2017. View at: Google Scholar
  3. K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969, Venice, Italy, 2017. View at: Google Scholar
  4. S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768, Salt Lake City, USA, 2018. View at: Google Scholar
  5. K. Chen, J. Pang, J. Wang et al., “Hybrid task cascade for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4974–4983, Long Beach, USA, 2019. View at: Google Scholar
  6. Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6409–6418, Long Beach, USA, 2019. View at: Google Scholar
  7. A. Kirillov, Y. Wu, K. He, and R. Girshick, “Pointrend: image segmentation as rendering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9799–9808, Seattle, USA, 2020. View at: Google Scholar
  8. Y. Lee and J. Park, “Centermask: real-time anchor-free instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13906–13915, Seattle, USA, 2020. View at: Google Scholar
  9. Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” in Proceedings of European Conference on Computer Vision (ECCV), pp. 282–298, 2020. View at: Publisher Site | Google Scholar
  10. S. Peng, W. Jiang, H. Pi, X. Li, H. Bao, and X. Zhou, “Deep snake for real-time instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8533–8542, Seattle, USA, 2020. View at: Google Scholar
  11. Z. Dong, G. Li, Y. Liao, F. Wang, P. Ren, and C. Qian, “Centripetalnet: pursuing high-quality keypoint pairs for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10519–10528, Seattle, USA, 2020. View at: Google Scholar
  12. J. Cao, H. Cholakkal, R. Anwer, F. Khan, Y. Pang, and L. Shao, “D2det: towards high quality object detection and instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11485–11494, Seattle, USA, 2020. View at: Google Scholar
  13. S. Wang, Y. Gong, J. Xing, L. Huang, C. Huang, and W. Hu, “RDSNet: a new deep architecture forreciprocal object detection and instance segmentation,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 12208–12215, 2020. View at: Publisher Site | Google Scholar
  14. Z. Tian, C. Shen, X. Wang, and H. Chen, “Boxinst: high-performance instance segmentation with box annotations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5443–5452, 2021. View at: Google Scholar
  15. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” Proceedings of Advances in Neural Information Processing Systems, vol. 28, pp. 91–99, 2015. View at: Google Scholar
  16. Z. Cai and N. Vasconcelos, “Cascade r-cnn: delving into high quality object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6154–6162, Salt Lake City, USA, 2018. View at: Google Scholar
  17. Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: fully convolutional one-stage object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9627–9636, Seoul, Korea (south), 2019. View at: Google Scholar
  18. Y. Lee, J. Hwang, S. Lee, Y. Bae, and J. Park, “An energy and gpu-computation efficient backbone network for real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, USA, 2019. View at: Google Scholar
  19. X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019, https://arxiv.org/abs/1904.07850. View at: Google Scholar
  20. H. Su, S. Wei, M. Yan, C. Wang, J. Shi, and X. Zhang, “Object detection and instance segmentation in remote sensing imagery based on precise mask r-cnn,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1454–1457, Yokohama, Japan, 2019. View at: Google Scholar
  21. Y. Feng, W. Diao, Y. Zhang et al., “Ship instance segmentation from remote sensing images using sequence local context module,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1025–1028, Yokohama, Japan, 2019. View at: Publisher Site | Google Scholar
  22. T. Pan, J. Ding, J. Wang, W. Yang, and G. Xia, “Instance segmentation with oriented proposals for aerial images,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 988–991, 2020. View at: Google Scholar
  23. Z. Zhang and J. Du, “Accurate oriented instance segmentation in aerial images,” in International Conference on Image and Graphics, pp. 160–170, Haikou, China, 2021. View at: Publisher Site | Google Scholar
  24. X. Zeng, S. Wei, J. Wei et al., “Cpisnet: delving into consistent proposals of instance segmentation network for high-resolution aerial images,” Remote Sensing, vol. 13, no. 14, p. 2788, 2021. View at: Publisher Site | Google Scholar
  25. Y. Liu, H. Li, C. Hu, S. Luo, H. Shen, and C. Chen, “Catnet: context aggregation network for instance segmentation in remote sensing images,” 2021, https://arxiv.org/abs/2111.11057. View at: Google Scholar
  26. T. Zhang, X. Zhang, P. Zhu et al., “Semantic attention and scale complementary network for instance segmentation in remote sensing images,” in IEEE Transactions on Cybernetics, 2021. View at: Google Scholar
  27. G. Xia, X. Bai, J. Ding et al., “DOTA: a large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3974–3983, Salt Lake City, USA, 2018. View at: Google Scholar
  28. J. Ding, N. Xue, G. Xia et al., “Object detection in aerial images: a large-scale benchmark and challenges,” 2021, https://arxiv.org/abs/2102.12219. View at: Google Scholar
  29. J. Ding, N. Xue, Y. Long, G. Xia, and Q. Lu, “Learning roi transformer for oriented object detection in aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2849–2858, Long Beach, USA, 2019. View at: Google Scholar
  30. J. Han, J. Ding, J. Li, and G. Xia, “Align deep features for oriented object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022. View at: Publisher Site | Google Scholar
  31. J. Han, J. Ding, N. Xue, and G. Xia, “Redet: a rotation-equivariant detector for aerial object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2786–2795, 2021. View at: Google Scholar
  32. Z. Huang, W. Li, X. Xia, X. Wu, Z. Cai, and R. Tao, “A novel nonlocal-aware pyramid and multiscale multitask refinement detector for object detection in remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–20, 2022. View at: Google Scholar
  33. Z. Huang, W. Li, X. G. Xia, H. Wang, F. Jie, and R. Tao, “LO-Det: lightweight oriented object detection in remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022. View at: Publisher Site | Google Scholar
  34. Z. Huang, W. Li, X. Xia, and R. Tao, “A general gaussian heatmap label assignment for arbitrary-oriented object detection,” IEEE Transactions on Image Processing, vol. 31, pp. 1895–1910, 2022. View at: Publisher Site | Google Scholar
  35. J. Dai, K. He, Y. Li, S. Ren, and J. Sun, “Instance-sensitive fully convolutional networks,” in Proceedings of European Conference on Computer Vision (ECCV), pp. 534–549, Amsterdam, Netherlands, 2016. View at: Publisher Site | Google Scholar
  36. L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler, “Annotating object instances with a polygon-rnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5230–5238, Hawaii, USA, 2017. View at: Google Scholar
  37. D. Acuna, H. Ling, A. Kar, and S. Fidler, “Efficient interactive annotation of segmentation datasets with polygon-rnn++,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 859–868, Salt Lake City, USA, 2018. View at: Google Scholar
  38. H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler, “Fast interactive object annotation with curve-gcn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5257–5266, Long Beach, USA, 2019. View at: Google Scholar
  39. N. Gao, Y. Shan, Y. Wang et al., “Ssap: single-shot instance segmentation with affinity pyramid,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 642–651, Seoul, Korea (south), 2019. View at: Google Scholar
  40. Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: point set representation for object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9657–9666, Seoul, Korea (south), 2019. View at: Google Scholar
  41. D. Bolya, C. Zhou, F. Xiao, and Y. Lee, “Yolact: real-time instance segmentation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9157–9166, Seoul, Korea (south), 2019. View at: Google Scholar
  42. X. Zhou, J. Zhuo, and P. Krahenbuhl, “Bottom-up object detection by grouping extreme and center points,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 850–859, Long Beach, USA, 2019. View at: Google Scholar
  43. W. Xu, H. Wang, F. Qi, and C. Lu, “Explicit shape encoding for real-time instance segmentation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5168–5177, Seoul, Korea (south), 2019. View at: Google Scholar
  44. J. Cao, R. Anwer, H. Cholakkal, F. Khan, Y. Pang, and L. Shao, “Sipmask: spatial information preservation for fast image and video instance segmentation,” in Computer Vision – ECCV 2020, pp. 740–755, Springer, Cham, 2020. View at: Publisher Site | Google Scholar
  45. H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, “Blendmask: top-down meets bottom-up for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8573–8581, Seattle, USA, 2020. View at: Google Scholar
  46. F. Wei, X. Sun, H. Li, J. Wang, and S. Lin, “Point-set anchors for object detection, instance segmentation and pose estimation,” in Proceedings of European Conference on Computer Vision (ECCV), pp. 527–544, 2020. View at: Publisher Site | Google Scholar
  47. P. Hurtik, V. Molek, J. Hula, M. Vajgl, P. Vlasanek, and T. Nejezchleba, “Poly-YOLO: higher speed, more precise detection and instance segmentation for yolov3,” Neural Computing and Applications, vol. 34, no. 10, pp. 8275–8290, 2022. View at: Publisher Site | Google Scholar
  48. F. Wang, Y. Chen, F. Wu, and X. Li, “Textray: contour-based geometric modeling for arbitrary-shaped scene text detection,” in Proceedings of the 28th ACM International Conference on Multimedia, pp. 111–119, New York, 2020. View at: Google Scholar
  49. E. Xie, P. Sun, X. Song et al., “Polarmask: single shot instance segmentation with polar representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12193–12202, Seattle, USA, 2020. View at: Google Scholar
  50. A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in Proceedings of European Conference on Computer Vision (ECCV), pp. 483–499, Amsterdam, Netherlands, 2016. View at: Publisher Site | Google Scholar
  51. J. Redmon and A. Farhadi, “Yolov3: an incremental improvement,” 2018, https://arxiv.org/abs/1804.02767. View at: Google Scholar
  52. N. Audebert, B. Le Saux, and S. Lefèvre, “Segment-before-detect: vehicle detection and classification through semantic segmentation of aerial images,” Remote Sensing, vol. 9, no. 4, p. 368, 2017. View at: Google Scholar
  53. L. Mou and X. Zhu, “Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6699–6711, 2018. View at: Publisher Site | Google Scholar
  54. Z. Huang, S. Sun, and R. Li, “Fast single-shot ship instance segmentation based on polar template mask in remote sensing images,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1236–1239, IEEE, Waikoloa, HI, USA, 2020. View at: Publisher Site | Google Scholar
  55. S. Waqas Zamir, A. Arora, A. Gupta et al., “Isaid: a large-scale dataset for instance segmentation in aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 28–37, Long Beach, USA, 2019. View at: Google Scholar
  56. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125, Hawaii, USA, 2017. View at: Google Scholar
  57. T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, Venice, Italy, 2017. View at: Google Scholar
  58. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, Las Vegas, USA, 2016. View at: Google Scholar
  59. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, https://arxiv.org/abs/1409.1556. View at: Google Scholar
  60. O. Russakovsky, J. Deng, H. Su et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015. View at: Publisher Site | Google Scholar
  61. K. Chen, J. Wang, J. Pang et al., “MMDetection: open mmlab detection toolbox and benchmark,” 2019, https://arxiv.org/abs/1906.07155. View at: Google Scholar
  62. N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, 1999. View at: Publisher Site | Google Scholar

Copyright © 2022 Yixin Luo et al. Exclusive Licensee Aerospace Information Research Institute, Chinese Academy of Sciences. Distributed under a Creative Commons Attribution License (CC BY 4.0).
