Title: | Genetic Algorithms and C-Steps Based LTS (Least Trimmed Squares) Estimation |
---|---|
Description: | Includes the ga.lts() function that estimates LTS (Least Trimmed Squares) parameters using genetic algorithms and C-steps. ga.lts() constructs a genetic algorithm to form a basic subset and iterates C-steps as defined in Rousseeuw and van-Driessen (2006) to calculate the cost value of the LTS criterion. OLS (Ordinary Least Squares) regression is known to be sensitive to outliers. A single outlying observation can change the values of estimated parameters. LTS is a resistant estimator even the number of outliers is up to half of the data. This package is for estimating the LTS parameters with lower bias and variance in a reasonable time. Version >=1.3 includes the function medmad for fast outlier detection in linear regression. |
Authors: | Mehmet Hakan Satman |
Maintainer: | Mehmet Hakan Satman <[email protected]> |
License: | GPL |
Version: | 1.3.2 |
Built: | 2024-11-09 05:22:56 UTC |
Source: | https://github.com/jbytecode/galts |
This package includes the ga.lts function that estimates LTS (Least Trimmed Squares) parameters using genetic algorithms and C-steps. ga.lts() constructs a genetic algorithm to form a basic subset and iterates C-steps as defined in Rousseeuw and van-Driessen (2006) to calculate the cost value of the LTS criterion. OLS (Ordinary Least Squares) regression is known to be sensitive to outliers. A single outlying observation can change the values of estimated parameters. LTS is a resistant estimator even the number of outliers is up to half of the data. This package is for estimating the LTS parameters with lower bias and variance in a reasonable time.
Mehmet Hakan Satman
Maintainer: Mehmet Hakan Satman <[email protected]>
Rousseeuw, P. J., van Driessen, K. (2006). Computing LTS Regression for Large Data Sets ,Data Mining and Knowledge Discovery, 12, 29-45.
Satman, M.,H. (2012). A Genetic Algorithm Based Modification on the LTS Algorithm for Large Data Sets, Communications in Statistics - Simulation and Computation, Vol 41, Issue 5, pp. 644-652.
# Data generating process x1 <- rnorm(100) x2 <- rnorm(100) e <- rnorm(100) # Setting betas to 5 y <- 5 + 5 * x1 + 5 * x2 + e # Contaminate the data on the dimension of X's randomly # This is the maximum contamination rate that the LTS can cope with. outlyings <- sample(1:100, 48) x1[outlyings] <- 10 x2[outlyings] <- 10 # Estimating LTS with ga (Default optimization method) lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20) print(lts) #Estimating LTS with differential evolution lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20, method = "de") print(lts)
# Data generating process x1 <- rnorm(100) x2 <- rnorm(100) e <- rnorm(100) # Setting betas to 5 y <- 5 + 5 * x1 + 5 * x2 + e # Contaminate the data on the dimension of X's randomly # This is the maximum contamination rate that the LTS can cope with. outlyings <- sample(1:100, 48) x1[outlyings] <- 10 x2[outlyings] <- 10 # Estimating LTS with ga (Default optimization method) lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20) print(lts) #Estimating LTS with differential evolution lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20, method = "de") print(lts)
This function estimates the LTS (Least Trimmed Squares) regression parameters using genetic algorithms. LTS is a robust regression estimator with high breakdown property. LTS is a resistant estimator even the number of outliers is up to half of the data. However, calculating LTS estimator is computaionally expensive. ga.lts() uses evolutionary algorithms (genetic algorithms by default, optionally differential evolution) to construct a basic subset and iterates C-steps as defined in Rousseeuw and van-Driessen (2006). Despite lower time efficiency of the ga.lts(), estimations have lower mean square errors, as a result of lower biases and lower variances.
ga.lts(formula, h = NULL, iters = 2, popsize = 50, lower, upper, csteps = 2, method = "ga", verbose = FALSE)
ga.lts(formula, h = NULL, iters = 2, popsize = 50, lower, upper, csteps = 2, method = "ga", verbose = FALSE)
formula |
Dependent ~ Independents style formula as same in lm() and glm(). |
h |
User defined variable to define the majority of the data. Default is floor(n/2)+floor((p+1)/2) where n is the number of observations and p is the number of parameters to estimate |
iters |
Number of generations of the evolutionary algorithm. This variable can be kept larger if the precision is more important than the computation time. Default is 2. |
popsize |
Number of candidates (chromosomes) in the population of evolutionary algorithm. Default is 50. |
lower |
Lower bound for the initial estimates of the parameters. |
upper |
Upper bound for the initial estimates of the parameters. |
csteps |
Number of C-steps to be performed for each candidate solution. Default is 2. |
method |
A string variable for the evolutionary algorithm. 'ga' for the genetic algorithms and 'de' for the differential evolution. Default is 'ga' |
verbose |
A boolean variable for printing the current status of algorithms to screen. Default is FALSE. |
coefficients |
A vector of estimated parameters |
crit |
LTS criterion value for the reported coefficients |
method |
Name of the method used in the optimization process |
Mehmet Hakan Satman
Rousseeuw, P. J., van Driessen, K. (2006). Computing LTS Regression for Large Data Sets ,Data Mining and Knowledge Discovery, 12, 29-45.
Satman, M.,H. (2012). A Genetic Algorithm Based Modification on the LTS Algorithm for Large Data Sets, Communications in Statistics - Simulation and Computation, Vol 41, Issue 5, pp. 644-652.
# Data generating process x1 <- rnorm(100) x2 <- rnorm(100) e <- rnorm(100) # Setting betas to 5 y <- 5 + 5 * x1 + 5 * x2 + e # Contaminate the data on the dimension of X's randomly # This is the maximum contamination rate that the LTS can cope with. outlyings <- sample(1:100, 48) x1[outlyings] <- 10 x2[outlyings] <- 10 # Estimating LTS with ga (Default optimization method) lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20) print(lts) # Estimating LTS with differential evolution lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20, method = "de") print(lts)
# Data generating process x1 <- rnorm(100) x2 <- rnorm(100) e <- rnorm(100) # Setting betas to 5 y <- 5 + 5 * x1 + 5 * x2 + e # Contaminate the data on the dimension of X's randomly # This is the maximum contamination rate that the LTS can cope with. outlyings <- sample(1:100, 48) x1[outlyings] <- 10 x2[outlyings] <- 10 # Estimating LTS with ga (Default optimization method) lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20) print(lts) # Estimating LTS with differential evolution lts <- ga.lts(y ~ x1 + x2, popsize = 40, iters = 2, lower = -20, upper = 20, method = "de") print(lts)
A method for detecting regression outliers.
medmad(formula, h=NULL, csteps=20)
medmad(formula, h=NULL, csteps=20)
formula |
Dependent ~ Independents style formula as same in lm() and glm(). |
h |
User defined variable to define the majority of the data. Default is floor(n/2)+floor((p+1)/2) where n is the number of observations and p is the number of parameters to estimate |
csteps |
Number of C-steps to be performed for each candidate solution. Default is 2. |
coefficients |
A vector of estimated parameters |
crit |
LTS criterion value for the reported coefficients |
residuals |
Calculated residuals from the final estimate of model |
Mehmet Hakan Satman
n <- 100 x1 <- rnorm(n,0,10) x2 <- rnorm(n,0,10) x3 <- rnorm(n,0,10) x4 <- rnorm(n,0,10) e <- rnorm(n) x <- cbind(1, x1, x2, x3, x4) p <- 5 betas <- rep(5,p) c <- 0.20 h <- n - n*c y <- 5 + 5*x1 + 5*x2 + 5*x3 + 5*x4 + e x1[(h + 1):n] <- rnorm(n-h, 100, 10) x2[(h + 1):n] <- rnorm(n-h, 100, 10) x3[(h + 1):n] <- rnorm(n-h, 100, 10) x4[(h + 1):n] <- rnorm(n-h, 100, 10) mm <- medmad(formula = y ~ x1 + x2 + x3 + x4, csteps = 10)
n <- 100 x1 <- rnorm(n,0,10) x2 <- rnorm(n,0,10) x3 <- rnorm(n,0,10) x4 <- rnorm(n,0,10) e <- rnorm(n) x <- cbind(1, x1, x2, x3, x4) p <- 5 betas <- rep(5,p) c <- 0.20 h <- n - n*c y <- 5 + 5*x1 + 5*x2 + 5*x3 + 5*x4 + e x1[(h + 1):n] <- rnorm(n-h, 100, 10) x2[(h + 1):n] <- rnorm(n-h, 100, 10) x3[(h + 1):n] <- rnorm(n-h, 100, 10) x4[(h + 1):n] <- rnorm(n-h, 100, 10) mm <- medmad(formula = y ~ x1 + x2 + x3 + x4, csteps = 10)
Function for robust covariance matrix estimation.
medmad.cov(data)
medmad.cov(data)
data |
Row matrix of data |
varcov |
Covariance matrix |
Mehmet Hakan Satman
n <- 100 c <- 0.20 h <- n - n*c x1 <- rnorm(n,0,10) x2 <- rnorm(n,0,10) x3 <- rnorm(n,0,10) x4 <- rnorm(n,0,10) x1[(h + 1):n]<-rnorm(n-h, 100, 10) x2[(h + 1):n]<-rnorm(n-h, 100, 10) x3[(h + 1):n]<-rnorm(n-h, 100, 10) x4[(h + 1):n]<-rnorm(n-h, 100, 10) mat <- medmad.cov(cbind(x1, x2, x3, x4)) print (mat)
n <- 100 c <- 0.20 h <- n - n*c x1 <- rnorm(n,0,10) x2 <- rnorm(n,0,10) x3 <- rnorm(n,0,10) x4 <- rnorm(n,0,10) x1[(h + 1):n]<-rnorm(n-h, 100, 10) x2[(h + 1):n]<-rnorm(n-h, 100, 10) x3[(h + 1):n]<-rnorm(n-h, 100, 10) x4[(h + 1):n]<-rnorm(n-h, 100, 10) mat <- medmad.cov(cbind(x1, x2, x3, x4)) print (mat)