Step-by-step guide on the Regressors (app).
For numerical variables, a darker spatial point indicates a higher value of the selected variable, while a lighter point indicates a lower value.
In this example, the regions that are closer to attraction places are clustered around central Jakarta, while the regions near the border are further away from attraction places.
A histogram shows the distribution of the data. The number of bins controls how finely the values are grouped. A histogram is also an efficient way to check the skewness of the distribution.
In this example, PROX_ATTRACTION is slightly right-skewed. However, no action is required for the application to work.
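The binning and skewness check described above can be sketched in a few lines of Python. This is only an illustration and not the app's own code: the function names and the sample values are hypothetical, and the skewness shown is the simple population formula.

```python
from statistics import mean, pstdev

def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def skewness(values):
    """Population skewness; a positive result means a longer right tail."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical proximity values, not taken from the Jakarta dataset.
sample = [0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 0.8, 1.4, 2.5]
counts = histogram_counts(sample, 5)
skew = skewness(sample)
print(counts, round(skew, 2))
```

Most of the sample falls into the first bin with a few large values on the right, so the skewness is positive, which is what "right-skewed" means.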
Note that all parameters for “Correlation Plot” and “Model” have default values, as shown above. If the selections are left unchanged, the correlation and model analysis will run with these default parameters.
If there are no changes to the model parameters, click “Run” (Step 8).
The formula tab shows the Multiple Linear Regression Formula based on the dependent and independent variables selected.
In this example, the formula is: POSITIF ~ PROX_ATTRACTION + PROX_WORSHIP + PROX_RESTAURANT
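As a rough sketch of how such a formula string could be assembled from the selected variables, consider the following. The helper name build_formula is hypothetical and is not part of the app.

```python
def build_formula(dependent, independents):
    """Join the selected variables into a 'y ~ x1 + x2' style formula string."""
    return f"{dependent} ~ " + " + ".join(independents)

formula = build_formula(
    "POSITIF",
    ["PROX_ATTRACTION", "PROX_WORSHIP", "PROX_RESTAURANT"],
)
print(formula)
```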
In this tab, the correlations between the selected independent variables are shown. The diagonal always equals 1, as a variable is always perfectly correlated with itself. In the graph, the darker the color, the more strongly the variables are correlated, regardless of whether the correlation is positive or negative.
In this example, the correlation plot shows the correlation coefficients between the 3 variables PROX_ATTRACTION, PROX_WORSHIP and PROX_RESTAURANT. PROX_ATTRACTION can be seen to be more strongly correlated with PROX_RESTAURANT than with PROX_WORSHIP.
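The numbers behind a correlation plot can be sketched as a matrix of Pearson correlation coefficients. The code below is a minimal illustration that assumes Pearson correlation is the measure used; the data values are hypothetical and were chosen only so that the pattern matches the description above.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values, not taken from the Jakarta dataset.
data = {
    "PROX_ATTRACTION": [1.0, 2.0, 3.0, 4.0, 5.0],
    "PROX_WORSHIP": [2.0, 1.0, 4.0, 3.0, 5.0],
    "PROX_RESTAURANT": [1.1, 2.3, 2.9, 4.2, 4.8],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, [round(value, 2) for value in row.values()])
```

The diagonal entries equal 1, and the absolute value of each off-diagonal entry corresponds to how dark the cell would be drawn.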
In the Summary tab, the statistical summary of the base model is shown. This includes the Min, Median and Max, the statistical significance of the variables in the model, and the R-squared value of the model. Based on the calculated p-value of each selected independent variable, users can decide which variables to keep in the model.
In this example, the fitted regression equation can be read as: POSITIF = 1822.98 + 185.89 * PROX_ATTRACTION + 69.32 * PROX_WORSHIP + 362.23 * PROX_RESTAURANT
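To see how this equation turns proximity values into a fitted number of cases, the coefficients can be applied to a single observation. The coefficients below are the ones quoted above; the observation values and the function name predict are hypothetical.

```python
COEFFICIENTS = {
    "(Intercept)": 1822.98,
    "PROX_ATTRACTION": 185.89,
    "PROX_WORSHIP": 69.32,
    "PROX_RESTAURANT": 362.23,
}

def predict(coefficients, observation):
    """Intercept plus the sum of coefficient * value for each variable."""
    total = coefficients["(Intercept)"]
    for name, value in observation.items():
        total += coefficients[name] * value
    return total

# Hypothetical proximity values for one region.
region = {"PROX_ATTRACTION": 1.0, "PROX_WORSHIP": 0.5, "PROX_RESTAURANT": 2.0}
fitted = predict(COEFFICIENTS, region)
print(round(fitted, 2))  # 2767.99
```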
The R-squared is the proportion of the variation in the dependent variable that the regression model is able to explain. In this case, the R-squared is 0.1944, which means that the multiple regression model explains about 19% of the variation in the number of POSITIF COVID-19 cases.
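The R-squared can be computed from the observed and fitted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses small hypothetical numbers, not the model above.

```python
from statistics import mean

def r_squared(observed, fitted):
    """Share of the variation in the observed values explained by the fit."""
    m = mean(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical observed and fitted values.
observed = [10, 20, 30, 40]
fitted_values = [12, 18, 33, 37]
r2 = r_squared(observed, fitted_values)
print(round(r2, 3))  # 0.948
```

A model that always predicts the mean of the observed values gives an R-squared of 0, which is why the mean serves as the baseline in the p-value discussion.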
The model p-value tells us whether the multiple linear regression model is a better estimator of the dependent variable than the mean alone. A p-value lower than the significance level means that the multiple linear regression model is a good estimator of the dependent variable. A p-value higher than the significance level means that the mean is a good estimator of the dependent variable.
The p-value of the above model is much smaller than 0.05, hence we reject the null hypothesis that the mean is a good estimator of POSITIF. We can infer that the multiple linear regression model is a good estimator of POSITIF cases.
The Coefficients section shows the p-values of the estimates of the Intercept and the independent variables. Variables with a p-value below the significance level (see the asterisks, depending on the significance level chosen by the user) are good parameter estimates and should be kept. On the other hand, variables with a p-value above the significance level are not statistically significant and should be removed from the model.
The above shows that PROX_ATTRACTION and PROX_RESTAURANT are statistically significant and should be kept.
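The asterisk check can be sketched as follows. The cut-offs used here (0.001, 0.01, 0.05, 0.1) follow the convention of R's regression summary output, and it is an assumption that the app uses the same one. The p-values are hypothetical and were chosen only to reproduce the outcome described above.

```python
def significance_stars(p_value):
    """Map a p-value to a significance marker (assumed R-style cut-offs)."""
    for cutoff, symbol in [(0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")]:
        if p_value <= cutoff:
            return symbol
    return ""

def variables_to_keep(p_values, alpha=0.05):
    """Names of independent variables whose p-value is below alpha."""
    return [
        name for name, p in p_values.items()
        if name != "(Intercept)" and p < alpha
    ]

# Hypothetical p-values, not read from the app's summary.
p_values = {
    "(Intercept)": 0.0001,
    "PROX_ATTRACTION": 0.004,
    "PROX_WORSHIP": 0.31,
    "PROX_RESTAURANT": 0.0002,
}
keep = variables_to_keep(p_values)
for name, p in p_values.items():
    print(name, p, significance_stars(p))
print(keep)
```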
The multicollinearity tab shows the VIF (variance inflation factor) value of each independent variable.
Independent variables with a VIF of more than 10 should be removed, as this is a sign of multicollinearity. In the example above, there is no sign of multicollinearity among the independent variables.
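The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. The sketch below is a plain-Python illustration of this definition and not the app's implementation; the helper names and data values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared_of_fit(target, predictors):
    """R-squared from regressing target on the predictors plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_t = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1 - ss_res / ss_tot

def vif(columns):
    """VIF of each column, regressing it on all the other columns."""
    result = {}
    for name, values in columns.items():
        others = [v for other, v in columns.items() if other != name]
        result[name] = 1 / (1 - r_squared_of_fit(values, others))
    return result

# Hypothetical values, not taken from the Jakarta dataset.
columns = {
    "PROX_ATTRACTION": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "PROX_WORSHIP": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "PROX_RESTAURANT": [1.5, 1.0, 3.5, 2.0, 4.0, 6.5],
}
vif_values = vif(columns)
flagged = [name for name, value in vif_values.items() if value > 10]
print(vif_values, flagged)
```

A VIF of 1 means the variable is unrelated to the others, and the value grows without bound as the variable becomes more predictable from them.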
The Linearity tab shows a plot of the model’s scaled residuals centered around a zero line. A scatter plot that clusters near the zero line indicates a close approximation to a linear relationship.
In this example, most of the points are close to the zero line, with a minority scattered across the lower and upper extremes of the graph. The model can therefore be considered approximately linear.
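One common way to scale residuals is to divide them by their standard deviation, so that most points fall within about 2 units of the zero line. Whether the app scales residuals exactly this way is an assumption; the function name and the values below are hypothetical.

```python
from statistics import stdev

def scaled_residuals(observed, fitted):
    """Residuals (observed - fitted) divided by their standard deviation."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    s = stdev(residuals)
    return [r / s for r in residuals]

# Hypothetical observed and fitted values.
observed_cases = [10, 20, 30, 40, 50]
fitted_cases = [12, 18, 33, 37, 50]
scaled = scaled_residuals(observed_cases, fitted_cases)

# Share of points within 2 units of the zero line.
share_near_zero = sum(abs(r) <= 2 for r in scaled) / len(scaled)
print([round(r, 2) for r in scaled], share_near_zero)
```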
The Normality tab shows the distribution of the model’s residuals. Ideally, the curve closely resembles a normal curve, which indicates that the residuals are approximately normally distributed.
In this example, we can see some resemblance to a normal curve, even though the curve is rather flat. This suggests that the model approximately fulfils the normality assumption.
In the Base Model’s Performance tab, the statistical summaries of the global regression model and the GWR model are provided, as shown below.
Based on the summary given in this example, we can see that the global regression model’s R-squared value is extremely low compared to that of the GWR model. As the R-squared value indicates how much of the variation in the data is explained by the model, it may be worth including more variables to improve the explanatory power of the model.
In the Visualization tab, the distribution of the local_R2 score is shown geographically. Darker shades indicate a higher local_R2 and thus better explainability of the model, while lighter shades indicate poorer explainability.
In this example, there are signs of clustering in the local R-squared values. This may indicate that in certain areas the dependent variable tends to have a stronger relationship with the selected independent variables.
The data table tab shows the data of the selected dependent and independent variables.
The formula tab shows the Multiple Linear Regression Formula based on the dependent and independent variables input.
In this example, the formula is: POSITIF ~ PROX_ATTRACTION + PROX_WORSHIP + PROX_RESTAURANT + PROX_MALL + PROX_SUPERMARKET + PROX_CONVENIENCE + PROX_KINDERGARTENS + PROX_SCHOOL + PROX_TERMINAL + PROX_HEALTHCARE + PROX_RAILWAYS
The summary tab shows the summary of the linear regression model of the formula.
With the estimate of each coefficient we are able to formulate the linear regression formula. The summary shows that the number of POSITIF COVID-19 cases can be explained by using the formula:
POSITIF = 1861.21 + 242.39(PROX_ATTRACTION) + 36.22(PROX_WORSHIP) + 405.32(PROX_RESTAURANT) + 132.49(PROX_MALL) - 11.06(PROX_SUPERMARKET) - 30.83(PROX_CONVENIENCE) - 183.96(PROX_KINDERGARTENS) + 45.30(PROX_SCHOOL) - 13.63(PROX_TERMINAL) - 236.01(PROX_HEALTHCARE) + 57.90(PROX_RAILWAYS)
The R-squared is the proportion of the variation in the dependent variable that the regression model is able to explain. In this case, the R-squared is 0.2296, which means that the multiple regression model explains about 23% of the variation in the number of POSITIF COVID-19 cases.
The model p-value tells us whether the multiple linear regression model is a better estimator of the dependent variable than the mean alone. A p-value lower than the significance level means that the multiple linear regression model is a good estimator of the dependent variable. A p-value higher than the significance level means that the mean is a good estimator of the dependent variable.
The p-value of the above model is much smaller than 0.05, hence we reject the null hypothesis that the mean is a good estimator of POSITIF. We can infer that the multiple linear regression model is a good estimator of POSITIF cases.
The Coefficients section shows the p-values of the estimates of the Intercept and the independent variables. Variables with a p-value below the significance level (see the asterisks, depending on the significance level chosen by the user) are good parameter estimates and should be kept. On the other hand, variables with a p-value above the significance level are not statistically significant and should be removed from the model.
The above shows that PROX_ATTRACTION, PROX_RESTAURANT, PROX_MALL, PROX_HEALTHCARE and PROX_RAILWAYS are statistically significant and should be kept.
The prediction tab shows the result of GW prediction with the model. It shows the min, median and max of the predicted values.
In this example, the minimum predicted number of POSITIF COVID-19 cases is 1017.2, the median is 2602.7, and the maximum is 6208.4.
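The min, median and max shown in the prediction tab are simple summaries of the list of predicted values. A minimal sketch is shown below; the function name and the predicted values are hypothetical and are not the app's output.

```python
from statistics import median

def prediction_summary(predictions):
    """Minimum, median and maximum of a list of predicted values."""
    return {
        "min": min(predictions),
        "median": median(predictions),
        "max": max(predictions),
    }

# Hypothetical predicted case counts for five regions.
predicted = [1200.0, 2500.0, 1800.0, 4100.0, 3300.0]
summary = prediction_summary(predicted)
print(summary)
```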
The visualization tab shows an interactive map of the predicted values. A darker dot indicates a high predicted value, while a lighter dot indicates a low predicted value.
The above example shows the predicted POSITIF COVID-19 cases geographically in Jakarta. Regions with darker dots have a high number of predicted POSITIF COVID-19 cases, while regions with lighter dots have a low number.
The data table tab shows the data of the selected dependent and independent variables.