R - Seed object for reproducible results in parallel operation in caret
I am trying to use the code below to get reproducible parallel models in caret, but I do not understand how to set the size of the vectors in the seeds object. For gbm I have 4 tuning parameters with a total of 11 different levels, and I have 54 rows in the tuning grid. If I specify a value < 18 as the last value in the "for(i in 1:10)" line below, I get the error: "Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer." Why 18? It also runs without errors for values > 18 (e.g., 54) - why? Many thanks for the help. The following is based on http://topepo.github.io/caret/training.html, with a few things added.
    library(mlbench)
    data(Sonar)
    str(Sonar[, 1:10])
    library(caret)
    library(doParallel)

    set.seed(998)
    inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
    training <- Sonar[ inTraining,]
    testing  <- Sonar[-inTraining,]

    grid <- expand.grid(n.trees = seq(50,150,by=50),
                        interaction.depth = seq(1,3,by=1),
                        shrinkage = seq(.09,.11,by=.01),
                        n.minobsinnode = seq(8,10,by=2))

    # set seeds to run a reproducible model in parallel mode using caret
    set.seed(825)
    seeds <- vector(mode = "list", length = 11) # length = (n_repeats*nresampling)+1
    for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 11) # ...the number of tuning parameter...
    seeds[[11]] <- sample.int(1000, 1) # for the last model

    fitControl <- trainControl(method = "cv", number = 10, seeds = seeds)

    # run the model in parallel
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)

    gbmFit1 <- train(Class ~ ., data = training, method = "gbm",
                     trControl = fitControl, tuneGrid = grid, verbose = FALSE)
    gbmFit1
I will address your question in two parts:
1 - Setting the seeds:
The code as you stated it:
    set.seed(825)
    seeds <- vector(mode = "list", length = 11)
    for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 54)
    # for the last model
    seeds[[11]] <- sample.int(1000, 1)

The 11 in seeds <- vector(mode = "list", length = 11) is (n_repeats*nresampling)+1. In your case you are using 10-fold CV, so 10+1 = 11. If you were using repeatedcv with number=10 and repeats = 5, you would replace the 11 with (5*10)+1 = 51.
The 10 in for(i in 1:10) is (n_repeats*nresampling). In your case it is 10 because you're using 10-fold CV. Similarly, if you were using repeatedcv with number=10 and repeats = 5, it would be for(i in 1:50).
The 54 in sample.int(n=1000, 54) is the number of tuning parameter combinations. In your case, you have 4 parameters with 3, 3, 3 and 2 values, so 3*3*3*2 = 54. However, I remember reading somewhere that for gbm the model is fit for max(n.trees) in the grid and the models with fewer trees are derived from it. That explains why caret calculates the number of seeds based on interaction.depth * shrinkage * n.minobsinnode, which in your case is 3 * 3 * 2 = 18 and not 3*3*3*2 = 54, as we'll see later.
But if you were using an SVM model with the grid svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5)), your value would be 6 * 6 = 36.
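The counts quoted above can be checked without fitting anything. The sketch below uses only base R and rebuilds the two grids from the text; the variable names (gbm_grid, svm_grid, n_gbm_fits and so on) are mine, and dropping n.trees with unique() is just an approximation of what caret does for gbm, not its actual code.

```r
# Tuning grids as given in the text (base R only, caret is not needed here)
gbm_grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                        interaction.depth = seq(1, 3, by = 1),
                        shrinkage = seq(.09, .11, by = .01),
                        n.minobsinnode = seq(8, 10, by = 2))
svm_grid <- expand.grid(sigma = 2^c(-25, -20, -15, -10, -5, 0),
                        C = 2^c(0:5))

# Every row of the gbm grid
n_gbm_all <- nrow(gbm_grid)

# Distinct combinations once n.trees is ignored (assumption: this mirrors
# the idea that only the max(n.trees) model is fit per combination)
other_params <- c("shrinkage", "interaction.depth", "n.minobsinnode")
n_gbm_fits <- nrow(unique(gbm_grid[, other_params]))

# Every row of the SVM grid
n_svm_all <- nrow(svm_grid)

print(c(gbm_all = n_gbm_all, gbm_fits = n_gbm_fits, svm_all = n_svm_all))
```

This should report 54, 18 and 36, matching the arithmetic in the paragraphs above.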
Remember, the goal of using seeds is to allow reproducible research by setting the seeds for the models fit at each resampling iteration.
The seeds[[11]] <- sample.int(1000, 1) is used to set the seed for the last (optimum) model, which is fit to the complete dataset.
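To make the sizing rules above less error-prone, they can be wrapped in a small function. This is only a sketch: make_seeds and its arguments are hypothetical names of mine, not part of caret, and it assumes you already know the number of resamples and how many seeds each resample needs.

```r
# Hypothetical helper (not a caret function): builds a seeds list with
# n_resamples integer vectors of length n_models plus one final integer.
make_seeds <- function(n_resamples, n_models, seed = 825, max_seed = 1000) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = max_seed, n_models)
  }
  # single seed for the last model fit on the complete dataset
  seeds[[n_resamples + 1]] <- sample.int(max_seed, 1)
  seeds
}

# 10-fold CV: 10 vectors plus 1 final seed
cv_seeds <- make_seeds(n_resamples = 10, n_models = 54)

# repeatedcv with number = 10 and repeats = 5: 50 vectors plus 1 final seed
rcv_seeds <- make_seeds(n_resamples = 10 * 5, n_models = 54)

print(c(length(cv_seeds), length(rcv_seeds)))
```

The resulting list would then be passed as the seeds argument of trainControl, as in the question.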
2 - Why you get an error if you specify a value < 18, but no error for a value >= 18
I was able to reproduce the same error on my machine:
    Error in train.default(x, y, weights = w, ...) :
      Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer

So, by inspecting train.default I was able to find its source. The error message is triggered by the stop in lines 7 to 10, based on the test for badSeed in lines 4 and 5.
        else {
          if (!(length(trControl$seeds) == 1 && is.na(trControl$seeds))) {
            numSeeds <- unlist(lapply(trControl$seeds, length))
    4       badSeed <- (length(trControl$seeds) < length(trControl$index) +
    5         1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
            if (badSeed)
    7         stop(paste("Bad seeds: the seed object should be a list of length",
    8           length(trControl$index) + 1, "with", length(trControl$index),
    9           "integer vectors of size", nrow(trainInfo$loop),
    10          "and the last list element having a", "single integer"))
          }
        }

The number 18 is coming from nrow(trainInfo$loop), so we need to find the value of trainInfo$loop. The object trainInfo is assigned the value trainInfo <- models$loop(tuneGrid) in line 3:
        if (trControl$method != "none") {
          if (is.function(models$loop) && nrow(tuneGrid) > 1) {
    3       trainInfo <- models$loop(tuneGrid)
            if (!all(c("loop", "submodels") %in% names(trainInfo)))
              stop("The 'loop' function should produce a list with elements 'loop' and 'submodels'")
          }

Now, we need to find the object models. It is assigned the value models <- getModelInfo(method, regex = FALSE)[[1]] in line 2:
        else {
    2     models <- getModelInfo(method, regex = FALSE)[[1]]
          if (length(models) == 0)
            stop(paste("Model", method, "is not in caret's built-in library"))
        }

Since we are using method = "gbm", we can look at the value of getModelInfo("gbm", regex = FALSE)[[1]]$loop and inspect the result below:
        > getModelInfo("gbm", regex = FALSE)[[1]]$loop
        function(grid) {
    3     loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
                        function(x) c(n.trees = max(x$n.trees)))
          submodels <- vector(mode = "list", length = nrow(loop))
          for(i in seq(along = loop$n.trees)) {
            index <- which(grid$interaction.depth == loop$interaction.depth[i] &
                           grid$shrinkage == loop$shrinkage[i] &
                           grid$n.minobsinnode == loop$n.minobsinnode[i])
            trees <- grid[index, "n.trees"]
            submodels[[i]] <- data.frame(n.trees = trees[trees != loop$n.trees[i]])
          }
          list(loop = loop, submodels = submodels)
        }
        >

The loop (in line 3 above) is assigned the value:
    loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
                  function(x) c(n.trees = max(x$n.trees)))

Now, let's pass the grid with 54 rows to the line above and inspect the result:
    > nrow(grid)
    [1] 54
    >
    > loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
    +               function(x) c(n.trees = max(x$n.trees)))
    > loop
       shrinkage interaction.depth n.minobsinnode n.trees
    1       0.09                 1              8     150
    2       0.09                 1             10     150
    3       0.09                 2              8     150
    4       0.09                 2             10     150
    5       0.09                 3              8     150
    6       0.09                 3             10     150
    7       0.10                 1              8     150
    8       0.10                 1             10     150
    9       0.10                 2              8     150
    10      0.10                 2             10     150
    11      0.10                 3              8     150
    12      0.10                 3             10     150
    13      0.11                 1              8     150
    14      0.11                 1             10     150
    15      0.11                 2              8     150
    16      0.11                 2             10     150
    17      0.11                 3              8     150
    18      0.11                 3             10     150
    >

Ahh! We found it. The value 18 comes from nrow(trainInfo$loop), which in turn comes from getModelInfo("gbm", regex = FALSE)[[1]]$loop, shown above with 18 rows.
Now, going back to the test that triggered the error:
    badSeed <- (length(trControl$seeds) < length(trControl$index) + 1) ||
      (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))

The first part of the test, (length(trControl$seeds) < length(trControl$index) + 1), is always FALSE here, but the second part, (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop))), is TRUE for values less than 18 [coming from nrow(trainInfo$loop)] and FALSE for values of 18 or more. That's why the error is triggered for a value < 18 and not for a value >= 18. As I said above, caret calculates the number of seeds based on interaction.depth * shrinkage * n.minobsinnode, which in your case is 3 * 3 * 2 = 18 (a model is fit for max(n.trees) and the others are derived from it, so there is no need for 54 integers).
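If you want to see the test in action without running train, here is a stand-alone re-creation in base R. It is a sketch for illustration only: is_bad_seed and build_seeds are hypothetical names of mine, n_index stands in for length(trControl$index) and n_loop for nrow(trainInfo$loop).

```r
# Stand-alone copy of the logic of the test quoted above (illustration only).
# n_index: number of resamples, n_loop: number of models fit per resample.
is_bad_seed <- function(seeds, n_index, n_loop) {
  num_seeds <- lengths(seeds)
  (length(seeds) < n_index + 1) ||
    any(num_seeds[-length(num_seeds)] < n_loop)
}

# Hypothetical helper: 10 vectors of the given size plus one final integer
build_seeds <- function(size) {
  c(lapply(1:10, function(i) sample.int(1000, size)),
    list(sample.int(1000, 1)))
}

set.seed(825)
bad_11 <- is_bad_seed(build_seeds(11), n_index = 10, n_loop = 18)
bad_18 <- is_bad_seed(build_seeds(18), n_index = 10, n_loop = 18)
bad_54 <- is_bad_seed(build_seeds(54), n_index = 10, n_loop = 18)

print(c(size_11 = bad_11, size_18 = bad_18, size_54 = bad_54))
```

Only the size-11 list is flagged as bad. Sizes 18 and 54 both pass because the comparison only rejects vectors that are too short, so extra seeds are simply tolerated.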