R - seed object for reproducible results in parallel operation in caret
I am trying to use the code for reproducible parallel models in caret, but I do not understand how to set the size of the vectors in the seed object. For gbm I have 4 tuning parameters with a total of 11 different levels, and I have 54 rows in my tuning grid. If I specify any value < 18 as the last value in the "for(i in 1:10)" line below, I get an error: "Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer." Why 18? It also runs without errors for values > 18 (e.g., 54). Why? Many thanks for any help. The following is based on http://topepo.github.io/caret/training.html, with a few things added.
library(mlbench)
data(Sonar)
str(Sonar[, 1:10])
library(caret)
library(doParallel)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]

grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))

# set seeds to run a reproducible model in parallel mode using caret
set.seed(825)
seeds <- vector(mode = "list", length = 11)  # length = (n_repeats*nresampling)+1
for(i in 1:10) seeds[[i]] <- sample.int(n = 1000, 11)  # ...the number of tuning parameter...
seeds[[11]] <- sample.int(1000, 1)  # for the last model

fitControl <- trainControl(method = "cv", number = 10, seeds = seeds)

# run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)

gbmFit1 <- train(Class ~ ., data = training, method = "gbm",
                 trControl = fitControl, tuneGrid = grid, verbose = FALSE)
gbmFit1
I will address your question in 2 parts:

1 - Setting the seeds:

The code, with the size of each vector set to 54:
set.seed(825)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(n = 1000, 54)
# for the last model
seeds[[11]] <- sample.int(1000, 1)
The 11 in seeds <- vector(mode = "list", length = 11) is (n_repeats*nresampling)+1. In your case, you're using 10-fold CV, so 10+1 = 11. If you were using repeatedcv with number = 10 and repeats = 5, you would replace the 11 with (5*10)+1 = 51.
The 10 in for(i in 1:10) is (n_repeats*nresampling). In your case it is 10 because you're using 10-fold CV. Similarly, if you were using repeatedcv with number = 10 and repeats = 5, it would be for(i in 1:50).
The 54 in sample.int(n = 1000, 54) is the number of tuning parameter combinations. In your case, you have 4 parameters with 3, 3, 3 and 2 values, so 3*3*3*2 = 54. But I remember reading somewhere that for gbm, a model is fit with max(n.trees) in the grid, and the models with fewer trees are derived from it. That explains why caret calculates the seeds based on interaction.depth * shrinkage * n.minobsinnode, which in your case is 3 * 3 * 2 = 18 and not 3*3*3*2 = 54, as we will see later.
But if you were using an SVM model with the grid svmGrid <- expand.grid(sigma = 2^c(-25, -20, -15, -10, -5, 0), C = 2^c(0:5)), your value would be 6 * 6 = 36.
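The counts above can be checked directly from the grids. This is a minimal sketch in base R (no caret needed); the name n_gbm_fits is only illustrative, and counting distinct non-n.trees combinations is an assumption that mirrors the explanation above rather than a call into caret itself.

```r
# gbm grid from the question: 3 * 3 * 3 * 2 = 54 rows
grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))
nrow(grid)

# Distinct combinations once n.trees is ignored: 3 * 3 * 2 = 18
n_gbm_fits <- nrow(unique(grid[, c("interaction.depth", "shrinkage",
                                   "n.minobsinnode")]))
n_gbm_fits

# SVM grid: every row counts, so 6 * 6 = 36
svmGrid <- expand.grid(sigma = 2^c(-25, -20, -15, -10, -5, 0),
                       C = 2^c(0:5))
nrow(svmGrid)
```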
Remember, the goal of using seeds is to allow reproducible research by setting the seeds for the models fit at each resampling iteration.
The seeds[[11]] <- sample.int(1000, 1) is used to set the seed for the last (optimum) model, which is fit to the complete dataset.
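The three numbers discussed above (list length, loop bound, vector size) can be wrapped in a small helper. The sketch below is hypothetical: make_seeds and its arguments are not part of caret, they just parameterise the same pattern so that switching from 10-fold CV to repeatedcv only changes one argument.

```r
# Hypothetical helper (not a caret function): builds a list suitable for the
# `seeds` argument of trainControl().
#   n_resamples = n_repeats * nresampling (10 for 10-fold CV)
#   n_models    = size of each integer vector (e.g. rows in the tuning grid)
make_seeds <- function(n_resamples, n_models, seed = 825) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, size = n_models)
  }
  # single seed for the final model fit on the complete dataset
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

seeds_cv  <- make_seeds(n_resamples = 10,     n_models = 54)  # method = "cv", number = 10
seeds_rcv <- make_seeds(n_resamples = 10 * 5, n_models = 54)  # repeatedcv, number = 10, repeats = 5
lengths(seeds_cv)
```

Either list could then be passed as trainControl(..., seeds = seeds_cv), exactly as in the code above.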
2 - Why you get an error if you specify a value < 18, but no error with a value >= 18

I was able to reproduce the same error on my machine:
Error in train.default(x, y, weights = w, ...) : Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer
So, by inspecting train.default, I was able to find its source. The error message is triggered by the stop in lines 7 to 10, based on the test badSeed in lines 4 and 5.
   else {
     if (!(length(trControl$seeds) == 1 && is.na(trControl$seeds))) {
       numSeeds <- unlist(lapply(trControl$seeds, length))
 4     badSeed <- (length(trControl$seeds) < length(trControl$index) +
 5       1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
       if (badSeed)
 7       stop(paste("Bad seeds: the seed object should be a list of length",
 8         length(trControl$index) + 1, "with", length(trControl$index),
 9         "integer vectors of size", nrow(trainInfo$loop),
10         "and the last list element having a", "single integer"))
     }
   }
The number 18 is coming from nrow(trainInfo$loop), so we need to find the value of trainInfo$loop. The object trainInfo is assigned the value trainInfo <- models$loop(tuneGrid) in line 3:
   if (trControl$method != "none") {
     if (is.function(models$loop) && nrow(tuneGrid) > 1) {
 3     trainInfo <- models$loop(tuneGrid)
       if (!all(c("loop", "submodels") %in% names(trainInfo)))
         stop("The 'loop' function should produce a list with elements 'loop' and 'submodels'")
     }
Now, we need to find the object models. It is assigned the value models <- getModelInfo(method, regex = FALSE)[[1]] in line 2:
   else {
 2   models <- getModelInfo(method, regex = FALSE)[[1]]
     if (length(models) == 0)
       stop(paste("Model", method, "is not in caret's built-in library"))
   }
Since we are using method = "gbm", we can look at the value of getModelInfo("gbm", regex = FALSE)[[1]]$loop and inspect the result below:
> getModelInfo("gbm", regex = FALSE)[[1]]$loop
   function(grid) {
 3   loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
                   function(x) c(n.trees = max(x$n.trees)))
     submodels <- vector(mode = "list", length = nrow(loop))
     for(i in seq(along = loop$n.trees)) {
       index <- which(grid$interaction.depth == loop$interaction.depth[i] &
                      grid$shrinkage == loop$shrinkage[i] &
                      grid$n.minobsinnode == loop$n.minobsinnode[i])
       trees <- grid[index, "n.trees"]
       submodels[[i]] <- data.frame(n.trees = trees[trees != loop$n.trees[i]])
     }
     list(loop = loop, submodels = submodels)
   }
>
The loop (in line 3 above) is assigned the value:

loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
              function(x) c(n.trees = max(x$n.trees)))
Now, let's pass your grid with 54 rows to the line above and inspect the result:
> nrow(grid)
[1] 54
>
> loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
+               function(x) c(n.trees = max(x$n.trees)))
> loop
   shrinkage interaction.depth n.minobsinnode n.trees
1       0.09                 1              8     150
2       0.09                 1             10     150
3       0.09                 2              8     150
4       0.09                 2             10     150
5       0.09                 3              8     150
6       0.09                 3             10     150
7       0.10                 1              8     150
8       0.10                 1             10     150
9       0.10                 2              8     150
10      0.10                 2             10     150
11      0.10                 3              8     150
12      0.10                 3             10     150
13      0.11                 1              8     150
14      0.11                 1             10     150
15      0.11                 2              8     150
16      0.11                 2             10     150
17      0.11                 3              8     150
18      0.11                 3             10     150
>
Ahh! We found it. The value 18 is coming from nrow(trainInfo$loop), which in turn comes from getModelInfo("gbm", regex = FALSE)[[1]]$loop, shown above with 18 rows.
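For readers without plyr installed, the same collapse to one row per (shrinkage, interaction.depth, n.minobsinnode) combination can be approximated in base R. This is only a sketch: it swaps ddply for aggregate, so the row order may differ from caret's own loop table, and the submodels element is built as an illustration of the idea rather than copied from caret.

```r
grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))

# Keep only the largest n.trees for each combination of the other parameters
loop <- aggregate(n.trees ~ shrinkage + interaction.depth + n.minobsinnode,
                  data = grid, FUN = max)

# For each retained row, list the smaller n.trees values derived from that fit
submodels <- lapply(seq_len(nrow(loop)), function(i) {
  index <- which(grid$interaction.depth == loop$interaction.depth[i] &
                 grid$shrinkage == loop$shrinkage[i] &
                 grid$n.minobsinnode == loop$n.minobsinnode[i])
  trees <- grid[index, "n.trees"]
  data.frame(n.trees = trees[trees != loop$n.trees[i]])
})

nrow(loop)           # number of models actually fit per resample
nrow(submodels[[1]]) # smaller tree counts derived from the first fit
```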
Now, going back to the test that triggered the error:

badSeed <- (length(trControl$seeds) < length(trControl$index) + 1) ||
  (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
The first part of the test, (length(trControl$seeds) < length(trControl$index) + 1), is FALSE. The second part, (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop))), is TRUE for values less than 18 [coming from nrow(trainInfo$loop)] and FALSE for values of 18 or more. That's why the error is triggered for a value < 18, but not for a value >= 18. As I said above, caret calculates the seeds based on interaction.depth * shrinkage * n.minobsinnode, which in your case is 3 * 3 * 2 = 18 (a model is fit with max(n.trees) and the others are derived from it, so there is no need for 54 integers).
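To see the test's behaviour without running train, here is a small base R sketch. is_bad_seed and make_seed_list are hypothetical stand-ins: the first mirrors the logic of the badSeed expression quoted above, with n_resamples playing the role of length(trControl$index) and n_loop the role of nrow(trainInfo$loop), both supplied by hand as assumed values.

```r
# Hypothetical re-implementation of the badSeed test, for illustration only
is_bad_seed <- function(seeds, n_resamples, n_loop) {
  num_seeds <- lengths(seeds)
  (length(seeds) < n_resamples + 1) ||
    any(num_seeds[-length(num_seeds)] < n_loop)
}

# Hypothetical helper: 10 integer vectors of a given size plus one final seed
make_seed_list <- function(size) {
  c(replicate(10, sample.int(1000, size), simplify = FALSE),
    list(sample.int(1000, 1)))
}

set.seed(825)
is_bad_seed(make_seed_list(11), n_resamples = 10, n_loop = 18)  # vectors too short
is_bad_seed(make_seed_list(18), n_resamples = 10, n_loop = 18)  # exactly enough
is_bad_seed(make_seed_list(54), n_resamples = 10, n_loop = 18)  # more than enough
```

Only the first call reports a bad seed object, which matches the behaviour described in the question: sizes below 18 fail, while 18 and anything larger pass.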