Space: Science & Technology / 2022 / Article

Research Article | Open Access

Volume 2022 | Article ID 9852185

Yurong Shi, Jingjing Wang, Yanhong Chen, Siqing Liu, Yanmei Cui, Xianzhi Ao, "Impacts of CMEs on Earth Based on Logistic Regression and Recommendation Algorithm", Space: Science & Technology, vol. 2022, Article ID 9852185, 12 pages, 2022.

Impacts of CMEs on Earth Based on Logistic Regression and Recommendation Algorithm

Received 25 Jan 2022
Accepted 29 Mar 2022
Published 22 Apr 2022


Coronal mass ejections (CMEs) are one of the major disturbance sources of space weather. Therefore, it is of great significance to determine whether CMEs will reach the earth. Utilizing the method of logistic regression, we first calculate and analyze the correlation coefficients of the characteristic parameters of CMEs. These parameters include central position angle, angular width, and linear velocity, which are derived from the Large Angle and Spectrometric Coronagraph (LASCO) images. We have developed a logistic regression model to predict whether a CME will reach the earth, and the model yields an F1 score of 30% and a recall of 53%. In addition, for each CME, we use the recommendation algorithm to single out the most similar historical event, which can serve as a reference for CME geoeffectiveness forecasting and for comparative analysis.

1. Introduction

Coronal mass ejections (CMEs) are eruptive solar events. They are often associated with solar flares and filaments. CMEs can cause space weather events such as geomagnetic storms, high-energy electron storms, hot plasma injection, ionospheric storms, and increased density in the upper atmosphere [1–5]. Large CME events can impact communications, navigation systems, aviation activities, and even power grids [6]. To avoid potential damage and asset loss, there is a need to accurately predict the arrival of the CMEs that may sweep across the earth. Over the last decades, much of the literature has tried to reveal the eruption and evolution of CMEs [7, 8]. Only recently has the concept of CME geoeffectiveness become popular, as a natural applied extension of that theoretical research and alongside the rapid growth in the number of geospace satellites. Therefore, we need to predict:
(1) Will the CME “hit” or “miss” the earth?
(2) If the prediction is “hit,” what is the expected arrival time of the CME?

Currently, CME models fall roughly into three categories: empirical models [9–15], physics-based models, and black-box models, the last of which consists mainly of machine learning models. Benefiting from advances in machine learning theory and algorithms in recent years, machine learning models [16–19] can achieve results comparable to those of physics-based models, whether the analytical drag-based models [20–24] or the numerical MHD models [25–39].

Before predicting the arrival time, forecasters need to assess whether or not the CME seen in the coronagraph images can reach the earth. Much work remains to be done to improve the accuracy of this prediction [40–42]. Indeed, even with the aid of sophisticated models, it is still a great challenge to identify a geoeffective CME. Nevertheless, a combination of machine learning methods and experienced forecasters may point the way to better forecasts of CME geoeffectiveness.

Predicting whether or not a CME will reach the earth is a binary classification problem, requiring only a “yes” or “no.” Logistic regression is often used to deal with this kind of classification problem in machine learning. Besliu-Ionescu et al. [43] used a modified version of the logistic regression method proposed by Srivastava [44] to predict the geoeffectiveness of CMEs. Besliu-Ionescu and Mierla [45] presented an update of a logistic regression model to predict whether a CME will reach the earth and be associated with geomagnetic storms. In practice, forecasters also rely on similar historical solar activity events to assess an event that is in progress or still developing. The recommendation algorithm can be used to recommend such similar historical CME events. Therefore, the recommendation algorithm and logistic regression can act together to give forecasters an option for improving the prediction results. Shi et al. [46] have applied the recommendation algorithm to predict the arrival time of CMEs.

This paper is organized as follows: Section 2 describes the process of data preparation, normalization, and the selection of characteristic parameters. Section 3 discusses the methods and the evaluation criteria. Section 4 introduces the experimental model development and the implementation process. Section 5 presents the experimental results and discussion, and we conclude the paper in Section 6.

2. Data

2.1. Data Preparation

In this study, a total of 30,321 CME events from 1996 to 2020 are collected from the SOHO/LASCO CME catalog [47–54]. This CME catalog is generated and maintained at the CDAW Data Center by NASA and The Catholic University of America in cooperation with the Naval Research Laboratory. Furthermore, to collect positive samples, 227 near-Earth interplanetary coronal mass ejection (ICME) events between 1996 and 2020 are taken from the near-Earth ICME list [55–57]. A CME in the SOHO/LASCO CME catalog that is associated with an event in the near-Earth ICME list is treated as a positive sample (it hit the earth), and the rest of the catalog are negative samples. This results in a highly unbalanced sample set of 227 positive samples and 30,094 negative samples.

In order to balance the sample distribution, many cases are removed from our data set:
(1) Generally, the angular width (AW) of most CME events with possible geoeffectiveness is greater than 90 degrees. Therefore, CME events with angular width less than 90 degrees are deleted.
(2) CME events (too faint or without sufficient observations) with missing characteristic parameters are also deleted.

The remaining data set contains 3,667 CME events, of which 181 are positive samples and 3,486 are negative samples. From the SOHO/LASCO CME catalog, 10 characteristic parameters are gathered for further analysis. These parameters are angular width, central position angle (CPA), measurement position angle (MPA), linear velocity (Vlinear), initial velocity (Vinitial), final velocity (Vfinal), the velocity at 20 solar radii (V20Rs), mass, acceleration (Accel), and kinetic energy (KE). The details of their calculations are given in the SOHO/LASCO CME catalog. We analyze these parameters and use them as the input to two machine learning methods, logistic regression and the recommendation algorithm, with the aim of developing a model that helps forecasters assess the geoeffectiveness of CMEs.
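The two screening rules above can be sketched as a simple filter. This is only an illustration: the field names in `REQUIRED` and the function `screen_events` are hypothetical, and the real catalog columns may be named and encoded differently.

```python
# Hypothetical field names for the 10 catalog parameters (assumption, not the catalog's own).
REQUIRED = ("cpa", "aw", "v_linear", "v_initial", "v_final",
            "v_20rs", "accel", "mass", "ke", "mpa")

def screen_events(events, min_width_deg=90.0):
    """Keep events that have all parameters and are at least `min_width_deg` wide."""
    kept = []
    for ev in events:
        # Rule (2): drop events with any missing characteristic parameter.
        if any(ev.get(name) is None for name in REQUIRED):
            continue
        # Rule (1): drop events with angular width less than 90 degrees.
        if ev["aw"] < min_width_deg:
            continue
        kept.append(ev)
    return kept
```

Applied to the full catalog, a filter of this kind is what reduces the 30,321 events to the 3,667 used in the rest of the study.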

2.2. Normalization

Table 1 lists the statistical analysis results for the 10 CME parameters, including the mean value (MV), standard deviation (Std), the minimum (Min), the maximum (Max), and the percentile values at 25%, 50%, and 75%. The values of kinetic energy and mass span several orders of magnitude between the minimum and the maximum, so the base-10 logarithm of these two parameters is taken before normalization.

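Table 1: Statistics of the 10 CME parameters.

| | CPA (deg) | AW (deg) | Vlinear (km/s) | Vinitial (km/s) | Vfinal (km/s) | V20Rs (km/s) | Accel (m/s^2) | Mass (gram) | KE (erg) | MPA (deg) |
|---|---|---|---|---|---|---|---|---|---|---|
| MV | 206.16 | 172.26 | 581.54 | 560.91 | 603.98 | 624.30 | 0.53 | 5.02 × 10^15 | 2.20 × 10^31 | 178.58 |
| Std | 114.12 | 90.12 | 383.48 | 445.73 | 361.86 | 376.05 | 23.74 | 8.29 × 10^15 | 1.01 × 10^32 | 102.50 |
| Min | 0.00 | 91.00 | 35.00 | 0.00 | 69.00 | 34.00 | -240.10 | 6.20 × 10^11 | 9.60 × 10^25 | 0.00 |
| Percentile 25% | 95.50 | 108.00 | 323.00 | 256.50 | 369.00 | 378.50 | -5.30 | 1.30 × 10^15 | 8.20 × 10^29 | 86.00 |
| Percentile 50% | 223.00 | 134.00 | 488.00 | 462.00 | 512.00 | 529.00 | 1.60 | 3.10 × 10^15 | 3.50 × 10^30 | 179.00 |
| Percentile 75% | 305.00 | 197.00 | 736.00 | 755.00 | 735.00 | 766.50 | 7.00 | 6.30 × 10^15 | 1.30 × 10^31 | 272.00 |
| Max | 360.00 | 360.00 | 3387.00 | 3703.00 | 3284.00 | 3731.00 | 434.80 | 2.00 × 10^17 | 4.20 × 10^33 | 360.00 |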

The 10 investigated CME parameters should be normalized to dimensionless scalars before subsequent calculations. We follow a procedure similar to that of Shi et al. [46], mapping each quantity to the range of [-1, 1] before feeding it into the model.

2.2.1. Deviation Standardization

Eight physical quantities (angular width, linear velocity, initial velocity, final velocity, V20Rs, kinetic energy, mass, and acceleration) are all continuous and are normalized by deviation standardization:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$

where $X$ is the original data, $X_{\max}$ and $X_{\min}$ are the maximum and the minimum of each CME parameter, and $X'$ is the normalized value.

2.2.2. Sine Function

According to forecaster expertise, the direction of the CME eruption is crucial in determining whether or not a CME can reach the earth. CPA and MPA are therefore normalized by the sine function to reflect the directions of CMEs. Here, the position angle (PA) is measured counter-clockwise from solar north over a full 360°. Based on this measurement, the sine values of CPA and MPA lie exactly in the range of [-1, 1]. Position angles in the eastern hemisphere span [0°, 180°], which corresponds to sine values in [0, 1]. In contrast, the sine values in the western hemisphere fall in [-1, 0]. Figure 1 shows that, in terms of the sine value, eruptions on the east side are clearly distinguished from those on the west side.
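The two normalization steps of this section can be sketched as follows. This is a minimal illustration, assuming the common min-max form of deviation standardization; the function names are hypothetical and not taken from the authors' code.

```python
import math

def min_max(values):
    """Deviation (min-max) standardization of one parameter column."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def log10_then_min_max(values):
    """For mass and kinetic energy: take log10 first, then standardize."""
    return min_max([math.log10(v) for v in values])

def sine_of_angle(deg):
    """For CPA and MPA: sine of the position angle given in degrees."""
    return math.sin(math.radians(deg))
```

For example, a position angle of 90° (east limb) maps to a value close to 1, while 270° (west limb) maps to a value close to -1.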

2.3. Characteristic Parameters Selection

After normalization, we need to select meaningful characteristic parameters as input to the machine learning models for training. The purpose of selecting characteristic parameters is to improve the training speed and the accuracy of the models by reducing noise features. Some noise features cause the model to generalize wrongly, so that it performs poorly on the test set. In addition, the more features there are, the more complex the model becomes and the more easily it overfits.

Characteristic parameters with very small variance are eliminated to obtain the final set of characteristic parameters. If the variance of a parameter is close to 0, the parameter barely differs between samples and is not useful for distinguishing them. The variance method is chosen because it considers only the internal properties of the data and is independent of any learning algorithm, which makes it widely applicable.

Figure 2 shows the variance distribution of the characteristic parameters of all CMEs. MPA and CPA have high variances, suggesting that they play an important role in judging whether CMEs can reach the earth. Next come the several CME speed parameters. The last three are acceleration, mass, and kinetic energy. The variances of these three parameters are less than 0.01, indicating that they vary very little. At the same time, several parameters are correlated with one another, so some features overlap. Acceleration is directly related to velocity [54]. Kinetic energy is obtained from the mass and linear velocity [58]. Taking all of this into account, we removed the acceleration and kinetic energy and retained the mass and the other seven characteristic parameters. The mass is retained because none of the top seven characteristic parameters is related to it. Finally, we choose 8 characteristic parameters as the input of the models: the 7 characteristic parameters with variance over 0.01, plus mass.
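The variance-based selection can be sketched as below. The threshold of 0.01 comes from the text; the function name, the column names, and the `always_keep` argument (used here to mimic the manual decision to retain mass) are illustrative assumptions.

```python
from statistics import pvariance

def select_by_variance(columns, threshold=0.01, always_keep=()):
    """Return the names of columns whose variance exceeds `threshold`.

    columns: dict mapping a parameter name to its list of normalized values.
    always_keep: names retained regardless of variance (a manual override).
    """
    variances = {name: pvariance(vals) for name, vals in columns.items()}
    kept = [name for name, var in variances.items()
            if var > threshold or name in always_keep]
    return kept, variances
```

Because the variance is computed on normalized values, the same threshold can be compared across parameters with very different physical units.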

At this stage, a complete and unified dimensionless data set of the 8 characteristic parameters is set up and ready to facilitate the development of the prediction model.

3. Methodology

We adopt the recommendation algorithm and logistic regression to develop our experimental model, which assesses a CME’s geoeffectiveness and recommends similar CME events that occurred in the past. The following analyses are therefore required.

3.1. Distance Similarity

We search for the most similar historical event for a specified CME event by checking the similarity of the two events. The similarity is evaluated by a pre-defined distance between objects. Each CME event can be represented by a vector with the associated 8 parameters as its elements. Thus, each event can be regarded as either a point or a vector in an 8-dimensional space. The similarity measure is then the distance between the two points or two vectors within the 8-D space. A shorter distance means higher similarity, and vice versa. Here, we adopt two distances commonly used in machine learning and artificial intelligence: cosine distance and Euclidean distance.

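Cosine distance compares two events for their similarity by the cosine of the angle between the two vectors in the 8-D space. It is expressed by:

$$\cos(\theta) = \frac{\sum_{i=1}^{8} (w_i x_i)(w_i y_i)}{\sqrt{\sum_{i=1}^{8} (w_i x_i)^2}\,\sqrt{\sum_{i=1}^{8} (w_i y_i)^2}} \qquad (2)$$

$A = (w_1 x_1, \ldots, w_8 x_8)$ and $B = (w_1 y_1, \ldots, w_8 y_8)$ are the spatial vectors representing the two CME events, respectively. $x_i$ and $y_i$ are the normalized CME parameters, respectively. Each parameter is multiplied by a specific weight ($w_i$), which measures the impact of the physical nature of each parameter on the prediction results, and is adjustable. The cosine distance falls within the range of [-1, 1]. Because this measure is the cosine of an angle, a value closer to 1 indicates more similar events.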

As mentioned before, each event can also be regarded as a point in the 8-D feature space. The definition of the Euclidean distance of two CME events is given by:

$$d(A, B) = \sqrt{\sum_{i=1}^{8} (w_i x_i - w_i y_i)^2} \qquad (3)$$

in which $A$ and $B$ represent the two points (two CME events). $x_i$, $y_i$, and $w_i$ are defined similarly to those in formula (2).
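A minimal sketch of how the two weighted similarity measures could be used to recommend the most similar historical event. It assumes that each weight multiplies its parameter before the distance is computed; the function names and the `most_similar` helper are hypothetical.

```python
import math

def cosine_similarity(x, y, w):
    """Cosine of the angle between the two weighted event vectors."""
    ax = [wi * xi for wi, xi in zip(w, x)]
    ay = [wi * yi for wi, yi in zip(w, y)]
    dot = sum(a * b for a, b in zip(ax, ay))
    norm = math.sqrt(sum(a * a for a in ax)) * math.sqrt(sum(b * b for b in ay))
    return dot / norm if norm else 0.0

def euclidean_distance(x, y, w):
    """Euclidean distance between the two weighted event points."""
    return math.sqrt(sum((wi * (xi - yi)) ** 2 for wi, xi, yi in zip(w, x, y)))

def most_similar(query, history, w, metric="euclidean"):
    """Index of the historical event most similar to `query`."""
    indices = range(len(history))
    if metric == "cosine":
        # Larger cosine means more similar.
        return max(indices, key=lambda i: cosine_similarity(query, history[i], w))
    # Smaller Euclidean distance means more similar.
    return min(indices, key=lambda i: euclidean_distance(query, history[i], w))
```

Note the opposite search directions: the Euclidean measure is minimized, whereas the cosine measure is maximized.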

3.2. Logistic Regression

Logistic regression is often used for binary classification. The binary logistic regression model we adopt in this work is the following:

$$P(z) = \frac{1}{1 + e^{-z}}, \qquad z = w_0 + \sum_{i=1}^{n} w_i x_i \qquad (4)$$

where $z$, in the range of $(-\infty, +\infty)$, refers to the linear regression fitting function. $x_i$ are the input characteristic parameters, $w_0$ is a constant, and $w_i$ is the coefficient that reflects the degree of contribution of parameter $x_i$ to the value of $z$. The Sigmoid function $P(z)$ maps a set of input parameters to a classification probability. As the value of $z$ approaches positive infinity, the probability given by $P(z)$ approaches 1, in which case the event is considered to sweep across the earth.

The purpose of the optimization is to obtain a set of appropriate coefficients ($w$) by adjusting them so as to minimize the loss function. The loss function of logistic regression is:

$$J(w) = -\frac{1}{m} \sum_{j=1}^{m} \left[ y_j \log P(z_j) + (1 - y_j) \log\left(1 - P(z_j)\right) \right] \qquad (5)$$

Here, $y_j$ is the real value of sample $j$. It is customary to call $x$ the feature and $y$ the label. The factor $1/m$ averages the loss over the $m$ samples.

With the loss function, the gradient descent method can be used to solve for the optimal coefficients ($w$), and the optimal model is established when the optimal coefficients are obtained. Gradient descent finds the descent direction through the first derivative of $J(w)$ with respect to $w$ and iteratively updates the coefficients. The update method is:

$$w_i^{(k+1)} = w_i^{(k)} - \alpha \frac{\partial J(w)}{\partial w_i} \qquad (6)$$

where $\alpha$ is the learning rate, usually less than 1, which controls the step size of each move in the gradient descent process. $k$ is the number of iterations, and the iteration stops once the convergence measure falls below a preset threshold or the maximum number of iterations is reached. The learning rate, threshold, and the maximum number of iterations can be set manually.
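The training procedure described above can be sketched in plain Python. This is an illustrative batch gradient descent, not the solver used by statsmodels or scikit-learn; the stopping rule (change in loss below `tol`) and all names are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w0, w, x):
    return sigmoid(w0 + sum(wj * xj for wj, xj in zip(w, x)))

def loss(w0, w, X, y, eps=1e-12):
    """Average cross-entropy loss over all samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w0, w, xi)
        total += yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return -total / len(X)

def fit(X, y, lr=0.5, tol=1e-8, max_iter=5000):
    """Batch gradient descent on the cross-entropy loss."""
    m, n = len(X), len(X[0])
    w0, w = 0.0, [0.0] * n
    prev = loss(w0, w, X, y)
    for _ in range(max_iter):
        g0, g = 0.0, [0.0] * n
        for xi, yi in zip(X, y):
            err = predict_proba(w0, w, xi) - yi  # dJ/dz for one sample
            g0 += err
            for j in range(n):
                g[j] += err * xi[j]
        w0 -= lr * g0 / m
        w = [wj - lr * gj / m for wj, gj in zip(w, g)]
        cur = loss(w0, w, X, y)
        if abs(prev - cur) < tol:  # assumed convergence test
            break
        prev = cur
    return w0, w
```

A fitted model returns a probability for each event; turning that probability into a “yes” or “no” requires a threshold, as discussed for sm.logit in Section 5.1.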

3.3. Evaluation Index

As shown in Table 2, the prediction model is reviewed by the confusion matrix [59]. The model gives a binary value of “yes” or “no” in terms of whether a CME will arrive at the earth. Generally, we divide the prediction results of the model into the following categories:
(1) If the CME reaches the earth and the forecast is consistent with the truth, the sample is a hit (H) event.
(2) If the CME reaches the earth while the forecast is “no,” the sample is a miss (M) event.
(3) If the CME does not arrive at the earth while the forecast is “yes,” the sample is a false alarm (FA) event.
(4) If the CME does not arrive at the earth and the forecast is consistent with the truth, the sample is a correct rejection (CR) event.

Table 2: Confusion matrix.

| Predicted condition \ True condition | Observed arrival | No observed arrival |
|---|---|---|
| Predicted arrival | Hit (H) | False alarm (FA) |
| No predicted arrival | Miss (M) | Correct rejection (CR) |

In Table 2, the orange background represents the correct predictions and the purple background represents the wrong predictions. Table 3 illustrates how to calculate the skill scores from the confusion matrix to appraise the model performance [60]. Higher recall, F1 score, accuracy, and precision values correspond to a more accurate model. The lower the POFA (probability of false alarm), the fewer false alarms the model produces.

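Table 3: Skill scores.

| Skill score | Equation | Perfect score |
|---|---|---|
| Recall | H / (H + M) | 100% |
| F1 score | 2 × Precision × Recall / (Precision + Recall) | 100% |
| POFA | FA / (FA + CR) | 0% |
| Precision | H / (H + FA) | 100% |
| Accuracy | (H + CR) / (H + M + FA + CR) | 100% |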
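The skill scores can be computed from the four confusion-matrix counts as sketched below. The formulas are the standard definitions; taking POFA as FA / (FA + CR) is an assumption, chosen because it is consistent with the scores reported later in the paper. The function name is hypothetical.

```python
def skill_scores(h, m, fa, cr):
    """Skill scores from hits, misses, false alarms, and correct rejections."""
    recall = h / (h + m) if h + m else 0.0
    precision = h / (h + fa) if h + fa else 0.0
    pofa = fa / (fa + cr) if fa + cr else 0.0  # assumed definition of POFA
    accuracy = (h + cr) / (h + m + fa + cr)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"f1": f1, "recall": recall, "pofa": pofa,
            "precision": precision, "accuracy": accuracy}
```

With heavily unbalanced data such as this sample set, accuracy alone is misleading: a model that always answers “no” scores a high accuracy but a recall and F1 score of 0, which is why the F1 score and recall are given priority.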

4. Experimental Design

Figure 3 is the schematic of the experimental process for developing the CME geoeffectiveness prediction model. The experiment is a controlled trial and divided into three stages which are boxed in Figure 3 with rectangles outlined by dashed lines of a different color. The three stages are the following:

4.1. Data Sampling

The first stage is data sampling, framed by the orange dashed line. A total of 3,667 samples, each with the 8 characteristic parameters from Section 2, are randomly divided into two nearly equal subgroups. One (1,833 samples) is for weight training and the other (1,834 samples) is for the subsequent recommendation test. 80 percent of the samples in the weight training subgroup (1,466 samples) serve as the training set and the rest (367 samples) as the validation set.
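The sampling stage can be sketched as a seeded random split. The seed and function name are illustrative assumptions; the paper does not state how the random division was implemented.

```python
import random

def split_samples(samples, train_fraction=0.8, seed=0):
    """Split samples into (training set, validation set, test set)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2      # weight-training subgroup gets the smaller half
    weight_group, test_set = shuffled[:half], shuffled[half:]
    n_train = int(train_fraction * len(weight_group))
    return weight_group[:n_train], weight_group[n_train:], test_set
```

For 3,667 samples this yields subsets of 1,466, 367, and 1,834 samples, matching the sizes given above.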

4.2. Weight Training

The second stage is weight training, framed by the blue dashed line. We use the 1,466 training samples to train weights with both the logistic regression procedure and the recommendation algorithm. Each method outputs its own set of weights. Two logistic regression frameworks are adopted for comparison. One is the logit function provided in the Python-based statsmodels module, referred to as “sm.logit.” The other, also Python-based, is the LogisticRegression classifier provided in the scikit-learn (sklearn) library, referred to as “sk.LR.” Since the distribution of positive and negative samples in the data set is extremely imbalanced, we oversample the examples in the minority positive class. The most widely used approach to synthesizing new examples is the Synthetic Minority Oversampling Technique (SMOTE) described by Chawla et al. [61]. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line. Therefore, each framework is tested against both the original data set and the oversampled data set.
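The interpolation idea behind SMOTE can be illustrated with a simplified sketch. This is not the reference implementation of Chawla et al. [61] nor the one the authors used; the function name, the default `k`, and the seed are assumptions, and the sketch needs at least two minority samples.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic samples by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the line segment from base to other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Every synthetic sample lies on a segment joining two real positive samples, so oversampling adds no information from outside the region already occupied by the minority class.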

In Figure 3, the experiments using the original data are denoted by ① and ③, and the experiments using the data oversampled with SMOTE are denoted by ② and ④, respectively. As for the recommendation algorithm, both distance calculation methods introduced in Section 3.1 are implemented for comparison, marked by ⑤ and ⑥ in Figure 3, respectively. The original data is used in the distance algorithm. In brief, a total of 6 experiments are conducted to train weights, and hence, 6 sets of weight coefficients are obtained, 4 from the logistic regression algorithm and 2 from the recommendation algorithm.

The validation process is necessary to verify the quality of the trained models. The validation set containing 367 events is used to verify the above 6 trained models. The verification is done according to the evaluation index in Section 3.3. To determine the most appropriate weights for the prediction model before entering the recommendation stage, the skill scores are applied in the following priority order: F1 score, recall, POFA, precision, and accuracy. The procedure that compares the performance of the different sets of weights and chooses the best set is called optimal weight combination.
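One way to apply such a priority order is a lexicographic comparison, sketched below. Treating the priority order as a strict tie-breaking sequence is an assumption about how the scores were combined, and the names are hypothetical.

```python
def rank_key(scores):
    """Sort key following the priority F1 > recall > POFA > precision > accuracy."""
    # Higher is better for every score except POFA, so POFA is negated.
    return (scores["f1"], scores["recall"], -scores["pofa"],
            scores["precision"], scores["accuracy"])

def pick_optimal(results):
    """results: dict mapping an experiment label to its skill scores."""
    return max(results, key=lambda name: rank_key(results[name]))
```

Under this reading, a lower-priority score matters only when all higher-priority scores are tied.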

4.3. Recommendation Test

The last stage is the recommendation test, outlined by the purple dashed lines in Figure 3. The purpose of this stage is to combine logistic regression and recommendation algorithms for a better forecast service. We demonstrate this stage from the three aspects below and will explain the benefit of combining both logistic regression and recommendation algorithms later.
(1) Experiments (a) and (b) test the recommendation algorithm against the test set of 1,834 events. However, instead of using its own set of weights through the recommendation algorithm, experiments (a) and (b) utilize the weights from the optimal weight combination obtained in Section 4.2, indicated by the black dashed arrow in Figure 3.
(2) For comparison, the test group is processed against both the sm.logit scheme, referred to as experiment (c), and the sk.LR scheme, referred to as experiment (d). Similar to (a) and (b), experiment (c) uses the weights of experiment ②, while experiment (d) uses the weights computed in experiment ④ during the training stage (see the yellow dashed arrow and the orange dashed arrow, respectively).
(3) Since the optimal weight is singled out from experiments ① to ⑥, there is a chance that it is from experiment ⑤ or ⑥. If this is the case, experiments (a) and (b) are using the weight calculated from ⑤ or ⑥. Otherwise, we can do two more experiments for comparison using the weight calculated from ⑤ or ⑥. In our actual practice, the optimal weight is from experiment ②. Therefore, we continue to test the recommendation algorithm module against the samples using the weights from ⑤ and ⑥. In Figure 3, cosine distance and the Euclidean distance are indicated by the dark blue dashed line and light blue dashed line, corresponding to experiments (e) and (f), respectively.

5. Results and Discussion

5.1. Results in the Weight Training Stage

Six training processes are executed, four based on logistic regression and two based on the recommendation algorithm. After training, sm.logit predicts a probability for each CME event, so we need to define a threshold manually. When the probability is greater than this threshold, the event is predicted to reach the earth; otherwise, it is predicted not to. In experiment ①, the result is the best when the threshold is 0.2. When the threshold is 0.3, the F1 score and recall of the model are 0, which indicates that the model assigns a probability below 0.3 to every event that actually reached the earth. That is, the model believes that almost all samples do not reach the earth. The reason is that there are too many negative samples and too little positive sample information, so the model does not learn the useful characteristics. In experiment ②, we added the SMOTE method and listed the skill scores under different thresholds in Table 4. It can be seen from the table that the result is the best when the threshold is 0.7. We choose this group of results as the final result of the weight training stage of this model. Therefore, the results of sm.logit are improved by oversampling.
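The threshold selection described above can be sketched as a scan over candidate thresholds, keeping the one with the highest F1 score on the validation set. The function names and the candidate list are illustrative assumptions.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when events with probability above `threshold` are predicted 'yes'."""
    h = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fa = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    m = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    return 2 * h / (2 * h + fa + m) if h else 0.0

def best_threshold(probs, labels, candidates):
    """Candidate threshold with the highest F1 score (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))
```

If no positive event receives a probability above the threshold, the hit count is 0 and both recall and the F1 score drop to 0, which is the behavior reported for experiment ① at a threshold of 0.3.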

Table 4: Threshold | F1 score | Recall | POFA | Precision | Accuracy


For the other models, there is no need to set the threshold manually. The trained model directly gives a “yes” or “no” conclusion. The validation results of all models are shown in Table 5. The first column corresponds to the experiment number in Figure 3. The third column indicates whether or not oversampling is used in training. According to the evaluation index mentioned above, the best result comes from experiment ② with a threshold of 0.7, in which the F1 score is 38%, recall is 68%, POFA is 14%, precision is 26%, and accuracy is 85%.

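Table 5: Validation results.

| No. | Model | Oversampling or not | Threshold | F1 score | Recall | POFA | Precision | Accuracy |
|---|---|---|---|---|---|---|---|---|
| ⑤ | Cosine distance | No | ... | 12% | 12% | 6% | 13% | 89% |
| ⑥ | Euclidean distance | No | ... | 10% | 8% | 4% | 13% | 90% |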

As mentioned in Section 3.2, the weights of sm.logit and sk.LR are the coefficients that minimize the loss function. When computing the training weights by the recommendation algorithm, in order to find the most appropriate weight for each parameter, we vary the weight of one specified quantity from 0 to 50 with a step size of 1 while fixing the weights of the rest to 1. Subsequently, we select the set of weight values corresponding to the best result.
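The one-parameter-at-a-time search can be sketched as follows. How the per-parameter results are assembled into the final weight set is not spelled out in the text, so collecting the best value found for each parameter is an assumption, as are the function name and the generic `score_fn` callback (for example, the validation F1 score of the recommendation algorithm under a given weight vector).

```python
def search_weights(n_params, score_fn, max_weight=50):
    """Scan each weight over 0..max_weight while the others stay fixed at 1."""
    best = []
    for j in range(n_params):
        best_w, best_score = 0, float("-inf")
        for candidate in range(0, max_weight + 1):
            weights = [1] * n_params
            weights[j] = candidate
            score = score_fn(weights)
            if score > best_score:
                best_w, best_score = candidate, score
        best.append(best_w)
    return best
```

For 8 parameters this requires 8 × 51 = 408 full evaluations of the recommendation algorithm on the training data, which illustrates why obtaining the weights this way is time-consuming.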

The weights of all the training models are listed by rows in Table 6. For each characteristic parameter, the greater the absolute value of the weight, the more important the parameter is. If the value is positive, the parameter is positively correlated with the probability that the target value is “yes.” If it is negative, the parameter is positively correlated with the probability that the target value is “no.” In terms of weight distribution, the angular width and Vfinal account for a large proportion in several models. Moreover, these two parameters also rank near the top in Section 2.3. In practice, CMEs with faster speed and larger angular width are more likely to reach the earth. This is consistent with the current understanding of the nature of CMEs.

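Table 6: Weights of the training models.

| No. | Model | Oversampling or not | Threshold | The weight combination |
|---|---|---|---|---|
| ⑤ | Cosine distance | No | ... | 011311121 |
| ⑥ | Euclidean distance | No | ... | 0399024132814 |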

Upon the completion of the training and the validation for all the configurations in Table 5 (see columns 2 to 4), we assess the performance from columns 5 to 9 in Table 5 and choose the set of weights with the best result as the final optimal weight set. The optimal weight set, also called the optimal weight combination coming from the training stage, can be used in the next recommendation stage.

5.2. Results in Recommendation Test Stage

Using the recommendation algorithms to train the weights of characteristic parameters is very time-consuming, whereas the weights are obtained much more easily by logistic regression. A new attempt is to apply the weights obtained by logistic regression to the recommendation algorithm. The purpose of the recommendation test stage is therefore to test the feasibility of such an operation. The corresponding weights for each test run are shown in Table 7. In experiments (a) and (b), the two recommendation algorithms are tested, but the weights used in the recommendation process are the optimal weight combination of the previous stage. Their results are compared to those of the subsequent experiments (c) to (f). Experiment (c), i.e., the sm.logit test, adopts the same weights as in experiment ② with the threshold value of 0.7, which is the optimal weight combination. Similarly, experiment (d), i.e., the sk.LR test, adopts the weights from the previous training stage using the sk.LR algorithm. Experiments (e) and (f) adopt the weights directly from experiments ⑤ and ⑥, respectively.

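Table 7: Weights used in the recommendation test.

| No. | Model | Whether to use the optimal weight combination | Threshold | The weight combination |
|---|---|---|---|---|
| (a) | Cosine distance | Yes | ... | -0.05, -0.17, 5.38, -8.72, -0.38, 8.55, -1.78, -3.49 |
| (b) | Euclidean distance | Yes | ... | -0.05, -0.17, 5.38, -8.72, -0.38, 8.55, -1.78, -3.49 |
| (e) | Cosine distance | No | ... | 011311121 |
| (f) | Euclidean distance | No | ... | 0399024132814 |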

The results on the test set are summarized in Table 8. The results of experiments (c) to (f) are for comparison with the results of experiments (a) and (b). At the same time, they also serve as the test runs of experiments ②, ④, ⑤, and ⑥, which is equivalent to 1-fold cross-validation. The results of experiments (a) and (b), which combine logistic regression with the recommendation algorithm, are better than the results of experiments (e) and (f), which use the recommendation algorithm only. Therefore, it is feasible to combine the two methods, which saves a tremendous amount of time and computer resources.

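Table 8: Results of the recommendation test.

| No. | Model | Whether to use the optimal weight combination | Threshold | F1 score | Recall | POFA | Precision | Accuracy |
|---|---|---|---|---|---|---|---|---|
| (a) | Cosine distance | Yes | ... | 13% | 13% | 4% | 13% | 91% |
| (b) | Euclidean distance | Yes | ... | 12% | 12% | 4% | 13% | 92% |
| (e) | Cosine distance | No | ... | 10% | 9% | 4% | 11% | 92% |
| (f) | Euclidean distance | No | ... | 10% | 10% | 4% | 11% | 91% |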

As can be seen from the test results, experiments (c) and (d) outperform the other models. The sm.logit model gives the best result: the F1 score is 30%, recall is 53%, POFA is 10%, precision is 20%, and accuracy is 88%. This further confirms the stability of the model and the reliability of the weight selection. However, these two models use logistic regression only, which can give just a “yes” or “no” result and cannot recommend similar historical events. Therefore, the combination of logistic regression and the recommendation algorithm works best. Once the logistic regression model confirms the geoeffectiveness of a CME, the recommendation algorithm is used to recommend similar historical events.

5.3. Discussion

To better understand the model, Figure 4 shows its working process. For instance, a halo CME erupted on 2006 August 16 at 16:30 UT, as shown in Figures 5(a)–5(c). We can input its 8 parameters (as shown in Figure 4) into the model. The model then recommends the most similar historical CME, which erupted on 1997 April 7 at 14:27 UT and is shown in Figures 5(d)–5(f). Comparing the data in Table 9 shows that the two CMEs are very similar. First of all, both events are halo CMEs, and the CPA and MPA are very similar. In addition, the four CME velocities of the two events are around 900 km/s. A direct comparison of the images of the recommended events further supports the reliability of the model.

Table 9: Date and time | Parameter values

2006/08/16 16:30 | 360, 161, 360, 888, 873, 905, 896, 1.00 × 10^16
1997/04/07 14:27 | 360, 123, 360, 878, 878, 905, 896, 1.00 × 10^16

We compare our results with other models that use logistic regression to predict the geoeffectiveness of CMEs and list all results in Table 10. Srivastava [44] used 55 geoeffective events that were defined as full chains CME-ICME-geomagnetic storms (intense and super intense). The data set was divided into a training set (46 events) and a test set (9 events). The accuracy of the model was 85% in the training set and 78% in the test set. Besliu-Ionescu et al. [43] divided their data set (264 events) in four different ways, so their results are averaged in Table 10. Their accuracy is very high and can reach 100% in the test set. Besliu-Ionescu and Mierla [45] used 2097 events to train the model and 699 events to test it. The accuracy is 99% in both the training set and the test set. Unfortunately, the model did not successfully predict any positive samples, and recall is 0 in both the training set and the test set. Compared with the others, the advantage of our model is its greater ability to predict positive samples, resulting in high recall, F1 score, and precision.

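Table 10: Comparison with other logistic regression models.

| Model | Recall (training set) | Recall (test set) | Accuracy (training set) | Accuracy (test set) | Number of samples (training set) | Number of samples (test set) |
|---|---|---|---|---|---|---|
| Srivastava [44] | ... | ... | 85% | 78% | 46 | 9 |
| Besliu-Ionescu et al. [43] | 0 | 100% | 97% | 100% | ... | ... |
| Besliu-Ionescu and Mierla [45] | 0 | 0 | 99% | 99% | 2097 | 699 |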

6. Conclusion

In this study, 30,321 CME events from 1996 to 2020 are obtained from the SOHO/LASCO CME catalog [47–54]. Of these, 227 near-Earth interplanetary coronal mass ejections (ICMEs) are identified as positive samples using the Richardson and Cane list [55–57]. After screening, a sample set of 181 positive samples and 3,486 negative samples is obtained. From the SOHO/LASCO CME catalog, 8 characteristic parameters are selected: angular width, CPA, MPA, Vlinear, Vinitial, Vfinal, V20Rs, and mass. We first calculate the weights of the characteristic parameters of CMEs based on logistic regression and then feed them into the recommendation algorithm, which provides the most similar historical events as a reference for forecasting CME geoeffectiveness. Space weather forecasters can use this method to carry out a comparative analysis. From the experiments presented in this article, we draw the following conclusions:
(1) The number of positive samples (CMEs that reached the earth) is 181 and the number of negative samples (CMEs that did not reach the earth) is 3,486. We use oversampling to deal with the unbalanced data and obtain good results. Therefore, this method is worth trying on other unbalanced data sets.
(2) Among all models compared, the sm.logit model performs best in both the validation set and the test set. In the test (validation) set, the F1 score is 30% (38%), recall is 53% (68%), POFA is 10% (14%), and precision is 20% (26%). It is therefore appropriate in this work to choose the weights of sm.logit as the optimal weights in the recommendation test stage.
(3) We make a new attempt to combine logistic regression and recommendation algorithms. Comparing the test results of experiments (a) and (e) (or (b) and (f)), we find that, for every skill score, the model that applies the logistic regression weights to the recommendation algorithm performs better than the recommendation algorithm alone, so this hybrid model is feasible. This treatment also avoids training the recommendation weights, which saves time and computing resources.
(4) Both the cosine distance and the Euclidean distance are applied in the experiments. Their skill scores are very similar, so it is difficult to say which is more suitable for this experiment. Both perform well.
(5) At present, applying a recommendation algorithm to the prediction of CMEs is very rare in the literature. Recommending similar historical events as a concrete reference for forecasters improves the forecast service compared with the binary “yes” or “no” forecast provided by the logistic regression model alone. In the future, we will focus on improving the model results by accumulating and analyzing more CME data.
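
The idea of feeding feature weights into a similarity search can be sketched as follows. This is only an illustration under stated assumptions: the three-feature vectors, the event names, the weight values, and the way the weights enter each distance are all hypothetical, and the exact weighting scheme used in the experiments is not reproduced here.

```python
import math


def weighted_euclidean(a, b, w):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w)))


def weighted_cosine_distance(a, b, w):
    """One minus the cosine similarity computed with per-feature weights."""
    dot = sum(wi * ai * bi for ai, bi, wi in zip(a, b, w))
    norm_a = math.sqrt(sum(wi * ai * ai for ai, wi in zip(a, w)))
    norm_b = math.sqrt(sum(wi * bi * bi for bi, wi in zip(b, w)))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def most_similar(query, history, weights, distance):
    """Return the name of the historical event closest to the query."""
    return min(history, key=lambda name: distance(query, history[name], weights))


# Hypothetical, already normalized features: (angular width, linear speed, mass).
history = {
    "event_A": [0.9, 0.8, 0.7],
    "event_B": [0.2, 0.3, 0.1],
    "event_C": [0.5, 0.9, 0.4],
}
# Made-up weights standing in for values derived from logistic regression.
weights = [0.5, 0.3, 0.2]
query = [0.85, 0.75, 0.6]

best_euclid = most_similar(query, history, weights, weighted_euclidean)
best_cosine = most_similar(query, history, weights, weighted_cosine_distance)
print(best_euclid, best_cosine)
```

In this toy example both distances select the same historical event, which mirrors the observation in conclusion (4) that the two measures give very similar results.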

Data Availability

The SOHO/LASCO CME catalog is available at The near-Earth ICME list is available at

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Authors’ Contributions

Yurong Shi, Jingjing Wang, and Yanhong Chen participated in the research design. Yurong Shi performed the data analysis and developed the code for the experiments. Yurong Shi, Jingjing Wang, Siqing Liu, Yanmei Cui, and Xianzhi Ao contributed to the writing of the manuscript. Professor Siqing Liu provided the equipment needed for the experiments and administrative support.


Acknowledgments

We appreciate the CME list generated and maintained at the CDAW Data Center by NASA and The Catholic University of America in cooperation with the Naval Research Laboratory. We also thank the team that organizes and maintains the near-Earth ICME list. This work was supported by the National Natural Science Foundation of China (Grant Nos. 42074224 and 12071166), the Key Research Program of the Chinese Academy of Sciences (Grant No. ZDRE-KT-2021-3), and the Pandeng Program of National Space Science Center, Chinese Academy of Sciences.


References

  1. G. L. Siscoe, “Geomagnetic storms and substorms,” Reviews of Geophysics, vol. 13, no. 3, p. 990, 1975.
  2. N. I. Izhovkina, N. M. Shutte, and S. A. Pulinets, “Electrostatic disturbances and electron fluxes in inhomogeneous plasma during electron-beam impulse injection in the ionosphere,” Geomagnetism and Aeronomy, vol. 40, no. 4, pp. 523–525, 2000.
  3. M. J. Buonsanto, “Ionospheric storms - a review,” Space Science Reviews, vol. 88, no. 3, pp. 563–601, 1999.
  4. A. D. Danilov, “Long-term trends in the upper atmosphere and ionosphere (a review),” Geomagnetism and Aeronomy, vol. 52, no. 3, pp. 271–291, 2012.
  5. J. Lastovicka, S. C. Solomon, and L. Y. Qian, “Trends in the neutral and ionized upper atmosphere,” Space Science Reviews, vol. 168, no. 1-4, pp. 113–145, 2012.
  6. D. G. Cole, “Space weather: its effects and predictability,” in Advances in Space Environment Research, vol. 107, no. 1/2, pp. 295–302, Springer, 2003.
  7. E. Huttunen, “Geoeffectiveness of CMEs in the solar wind,” Proceedings of the International Astronomical Union, vol. 2004, no. IAUS226, pp. 455-456, 2004.
  8. R. S. Kim, K. S. Cho, Y. J. Moon et al., “Forecast evaluation of the coronal mass ejection (CME) geoeffectiveness using halo CMEs from 1997 to 2003,” Journal of Geophysical Research-Space Physics, vol. 110, no. A11, 2005.
  9. M. Vandas, S. Fischer, M. Dryer, Z. Smith, and T. Detman, “Parametric study of loop-like magnetic cloud propagation,” Journal of Geophysical Research, Space Physics, vol. 101, no. A7, pp. 15645–15652, 1996.
  10. Y. M. Wang, P. Z. Ye, S. Wang, G. P. Zhou, and J. X. Wang, “A statistical study on the geoeffectiveness of earth-directed coronal mass ejections from March 1997 to December 2000,” Journal of Geophysical Research-Space Physics, vol. 107, no. A11, 2002.
  11. H. Xie, L. Ofman, and G. Lawrence, “Cone model for halo CMEs: application to space weather forecasting,” Journal of Geophysical Research-Space Physics, vol. 109, no. A3, 2004.
  12. R. Schwenn, A. Dal Lago, E. Huttunen, and W. D. Gonzalez, “The association of coronal mass ejections with their effects near the earth,” Annales Geophysicae, vol. 23, no. 3, pp. 1033–1059, 2005.
  13. P. K. Manoharan, “Evolution of coronal mass ejections in the inner heliosphere: a study using white-light and scintillation images,” Solar Physics, vol. 235, no. 1-2, pp. 345–368, 2006.
  14. M. Nunez, T. Nieves-Chinchilla, and A. Pulkkinen, “Prediction of shock arrival times from CME and flare data,” Space Weather, vol. 14, no. 8, pp. 544–562, 2016.
  15. E. Paouris and H. Mavromichalaki, “Effective acceleration model for the arrival time of interplanetary shocks driven by coronal mass ejections,” Solar Physics, vol. 292, no. 12, 2017.
  16. D. Sudar, B. Vrsnak, and M. Dumbovic, “Predicting coronal mass ejections transit times to earth with neural network,” Monthly Notices of the Royal Astronomical Society, vol. 456, no. 2, pp. 1542–1548, 2016.
  17. J. Liu, Y. Ye, C. Shen, Y. Wang, and R. Erdélyi, “A new tool for CME arrival time prediction using machine learning algorithms: CAT-PUMA,” The Astrophysical Journal, vol. 855, no. 2, p. 109, 2018.
  18. Y. Wang, J. Liu, Y. Jiang, and R. Erdélyi, “CME arrival time prediction using convolutional neural network,” The Astrophysical Journal, vol. 881, no. 1, p. 15, 2019.
  19. P. Wang, Y. Zhang, L. Feng et al., “A new automatic tool for CME detection and tracking with machine-learning techniques,” The Astrophysical Journal Supplement Series, vol. 244, no. 1, p. 9, 2019.
  20. P. Hess and J. Zhang, “Predicting CME ejecta and sheath front arrival at L1 with a data-constrained physical model,” The Astrophysical Journal, vol. 812, no. 2, p. 144, 2015.
  21. C. Möstl, T. Rollett, R. A. Frahm et al., “Strong coronal channelling and interplanetary evolution of a solar storm up to Earth and Mars,” Nature Communications, vol. 6, no. 1, p. 7135, 2015.
  22. C. Kay, M. L. Mays, and C. Verbeke, “Identifying critical input parameters for improving drag-based CME arrival time predictions,” Space Weather, vol. 18, no. 1, 2020.
  23. P. Subramanian, A. Lara, and A. Borgazzi, “Can solar wind viscous drag account for coronal mass ejection deceleration?” Geophysical Research Letters, vol. 39, no. 19, 2012.
  24. B. Vršnak, T. Žic, D. Vrbanec et al., “Propagation of interplanetary coronal mass ejections: the drag-based model,” Solar Physics, vol. 285, no. 1-2, pp. 295–315, 2013.
  25. Z. Smith and M. Dryer, “MHD study of temporal and spatial evolution of simulated interplanetary shocks in the ecliptic-plane within 1 AU,” Solar Physics, vol. 129, no. 2, pp. 387–405, 1990.
  26. M. Dryer, C. D. Fry, W. Sun et al., “Prediction in real time of the 2000 July 14 heliospheric shock wave and its companions during the 'Bastille' epoch,” Solar Physics, vol. 204, no. 1-2, pp. 267–284, 2001.
  27. Y. J. Moon, M. Dryer, Z. Smith, Y. D. Park, and K. S. Cho, “A revised shock time of arrival (STOA) model for interplanetary shock propagation: STOA-2,” Geophysical Research Letters, vol. 29, no. 10, pp. 28-1–28-4, 2002.
  28. D. Odstrcil, V. J. Pizzo, J. A. Linker, P. Riley, R. Lionello, and Z. Mikic, “Initial coupling of coronal and heliospheric numerical magnetohydrodynamic codes,” Journal of Atmospheric and Solar-Terrestrial Physics, vol. 66, no. 15-16, pp. 1311–1320, 2004.
  29. G. Tóth, I. V. Sokolov, T. I. Gombosi et al., “Space weather modeling framework: a new tool for the space science community,” Journal of Geophysical Research-Space Physics, vol. 110, no. A12, 2005.
  30. T. Detman, Z. Smith, M. Dryer, C. D. Fry, C. N. Arge, and V. Pizzo, “A hybrid heliospheric modeling system: background solar wind,” Journal of Geophysical Research-Space Physics, vol. 111, no. A7, 2006.
  31. X. S. Feng and X. H. Zhao, “A new prediction method for the arrival time of interplanetary shocks,” Solar Physics, vol. 238, no. 1, pp. 167–186, 2006.
  32. X. S. Feng, Y. F. Zhou, and S. T. Wu, “A novel numerical implementation for solar wind modeling by the modified conservation element/solution element method,” The Astrophysical Journal, vol. 655, no. 2, pp. 1110–1126, 2007.
  33. P. Riley, J. A. Linker, R. Lionello, and Z. Mikic, “Corotating interaction regions during the recent solar minimum: the power and limitations of global MHD modeling,” Journal of Atmospheric and Solar-Terrestrial Physics, vol. 83, pp. 1–10, 2012.
  34. P. Riley, J. A. Linker, and Z. Mikic, “On the application of ensemble modeling techniques to improve ambient solar wind models,” Journal of Geophysical Research, Space Physics, vol. 118, no. 2, pp. 600–607, 2013.
  35. I. V. Sokolov, B. van der Holst, R. Oran et al., “Magnetohydrodynamic waves and coronal heating: unifying empirical and MHD turbulence models,” The Astrophysical Journal, vol. 764, no. 1, p. 23, 2013.
  36. B. van der Holst, I. V. Sokolov, X. Meng et al., “Alfvén wave solar model (AWSoM): coronal heating,” The Astrophysical Journal, vol. 782, no. 2, p. 81, 2014.
  37. M. Jin, W. B. Manchester, B. van der Holst et al., “Data-constrained coronal mass ejections in a global magnetohydrodynamics model,” The Astrophysical Journal, vol. 834, no. 2, p. 173, 2017.
  38. J. Wang, X. Ao, Y. Wang et al., “An operational solar wind prediction system transitioning fundamental science to operations,” Journal of Space Weather and Space Climate, vol. 8, p. A39, 2018.
  39. S. Poedts, A. Lani, C. Scolini et al., “European heliospheric forecasting information asset 2.0,” Journal of Space Weather and Space Climate, vol. 10, p. 57, 2020.
  40. C. N. Arge and V. J. Pizzo, “Improvement in the prediction of solar wind conditions using near-real time solar magnetic field updates,” Journal of Geophysical Research, Space Physics, vol. 105, no. A5, pp. 10465–10479, 2000.
  41. M. L. Mays, A. Taktakishvili, A. Pulkkinen et al., “Ensemble modeling of CMEs using the WSA–ENLIL+Cone model,” Solar Physics, vol. 290, no. 6, pp. 1775–1814, 2015.
  42. C. Möstl, A. Isavnin, P. D. Boakes et al., “Modeling observations of solar coronal mass ejections with heliospheric imagers verified with the heliophysics system observatory,” Space Weather, vol. 15, no. 7, pp. 955–970, 2017.
  43. D. Besliu-Ionescu, D. C. Talpeanu, M. Mierla, and G. M. Muntean, “On the prediction of geoeffectiveness of CMEs during the ascending phase of SC24 using a logistic regression method,” Journal of Atmospheric and Solar-Terrestrial Physics, vol. 193, article 105036, 2019.
  44. N. Srivastava, “A logistic regression model for predicting the occurrence of intense geomagnetic storms,” Annales Geophysicae, vol. 23, no. 9, pp. 2969–2974, 2005.
  45. D. Besliu-Ionescu and M. Mierla, “Geoeffectiveness prediction of CMEs,” Frontiers in Astronomy and Space Sciences, vol. 8, 2021.
  46. Y.-R. Shi, Y.-H. Chen, S.-Q. Liu et al., “Predicting the CME arrival time based on the recommendation algorithm,” Research in Astronomy and Astrophysics, vol. 21, no. 8, p. 190, 2021.
  47. S. Yashiro, G. Michalek, and N. Gopalswamy, “A comparison of coronal mass ejections identified by manual and automatic methods,” Annales Geophysicae, vol. 26, no. 10, pp. 3103–3112, 2008.
  48. N. Gopalswamy, A. Lara, S. Yashiro, S. Nunes, and R. A. Howard, “Coronal mass ejection activity during solar cycle 23,” Solar Variability as an Input to the Earth's Environment, vol. 535, pp. 403–414, 2003.
  49. N. Gopalswamy, “A global picture of CMEs in the inner heliosphere,” The Sun and the Heliosphere as an Integrated System, vol. 317, pp. 201–251, 2004.
  50. N. Gopalswamy, S. Nunes, S. Yashiro, and R. A. Howard, “Variability of solar eruptions during cycle 23,” in Solar Variability and Climate Change, J. M. Pap, J. Kuhn, K. Labitzke, and M. A. Shea, Eds., Advances in Space Research, pp. 391–396, Pergamon-Elsevier Science Ltd, Kidlington, 2004.
  51. N. Gopalswamy, S. Yashiro, G. Michalek et al., “The SOHO/LASCO CME catalog,” Earth, Moon, and Planets, vol. 104, no. 1-4, pp. 295–313, 2009.
  52. O. C. St. Cyr, R. A. Howard, N. R. Sheeley et al., “Properties of coronal mass ejections: SOHO LASCO observations from January 1996 to June 1998,” Journal of Geophysical Research, Space Physics, vol. 105, no. A8, pp. 18169–18185, 2000.
  53. A. Vourlidas, D. Buzasi, R. A. Howard, and E. Esfandiari, “Mass and energy properties of LASCO CMEs,” The 10th European Solar Physics Meeting, vol. 1, 2002.
  54. S. Yashiro, N. Gopalswamy, G. Michalek et al., “A catalog of white light coronal mass ejections observed by the SOHO spacecraft,” Journal of Geophysical Research-Space Physics, vol. 109, no. A7, p. 11, 2004.
  55. H. V. Cane and I. G. Richardson, “Interplanetary coronal mass ejections in the near-earth solar wind during 1996-2002,” Journal of Geophysical Research-Space Physics, vol. 108, no. A4, 2003.
  56. D. Ameri and E. Valtonen, “Investigation of the geoeffectiveness of disk-centre full-halo coronal mass ejections,” Solar Physics, vol. 292, no. 6, 2017.
  57. I. G. Richardson and H. V. Cane, “Near-earth interplanetary coronal mass ejections during solar cycle 23 (1996–2009): catalog and summary of properties,” Solar Physics, vol. 264, no. 1, pp. 189–237, 2010.
  58. A. Vourlidas, P. Subramanian, K. P. Dere, and R. A. Howard, “Large-angle spectrometric coronagraph measurements of the energetics of coronal mass ejections,” Astrophysical Journal, vol. 534, no. 1, pp. 456–467, 2000.
  59. C. Verbeke, M. L. Mays, M. Temmer et al., “Benchmarking CME arrival time and impact: progress on metadata, metrics, and events,” Space Weather, vol. 17, no. 1, pp. 6–26, 2019.
  60. A. Vourlidas, S. Patsourakos, and N. P. Savani, “Predicting the geoeffective properties of coronal mass ejections: current status, open issues and path forward,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 377, no. 2148, 2019.
  61. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

Copyright © 2022 Yurong Shi et al. Exclusive Licensee Beijing Institute of Technology Press. Distributed under a Creative Commons Attribution License (CC BY 4.0).
