R - Seed object for reproducible results in parallel operation in caret

I am trying to use the code below for reproducible parallel models in caret, but I do not understand how to set the size of the vectors in the seeds object. For gbm I have 4 tuning parameters with a total of 11 different levels, and I have 54 rows in the tuning grid. If I specify any value < 18 as the last value in the "for(i in 1:10)" line below, I get the error: "Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer." Why 18? It also runs without errors for values > 18 (e.g., 54). Why is that? Many thanks for any help. The following is based on http://topepo.github.io/caret/training.html, with a few things added.

    library(mlbench)
    data(Sonar)
    str(Sonar[, 1:10])
    library(caret)
    library(doParallel)

    set.seed(998)
    inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
    training <- Sonar[ inTraining,]
    testing  <- Sonar[-inTraining,]

    grid <- expand.grid(n.trees = seq(50,150,by=50), interaction.depth = seq(1,3,by=1),
      shrinkage = seq(.09,.11,by=.01),n.minobsinnode=seq(8,10,by=2))

    # set seed to run a reproducible model in parallel mode using caret
    set.seed(825)
    seeds <- vector(mode = "list", length = 11) # length = (n_repeats*nresampling)+1
    for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 11) # ...the number of tuning parameter...
    seeds[[11]]<-sample.int(1000, 1) # for the last model

    fitControl <- trainControl(method = "cv",number = 10,seeds=seeds)

    # run model in parallel
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)

    gbmFit1 <- train(Class ~ ., data = training,method = "gbm",
      trControl = fitControl,tuneGrid=grid,verbose = FALSE)
    gbmFit1

I will address your question in 2 parts:

1 - Setting seeds:

The code should be as stated below:

    set.seed(825)
    seeds <- vector(mode = "list", length = 11)
    for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
    # for the last model
    seeds[[11]]<-sample.int(1000, 1)

The 11 in seeds <- vector(mode = "list", length = 11) is (n_repeats*nresampling)+1. In your case you're using 10-fold CV, so 10+1 = 11. If you were using repeatedcv with number=10 and repeats = 5, you would replace the 11 with (5*10)+1 = 51.

The 10 in for(i in 1:10) is (n_repeats*nresampling). In your case it is 10 because you're using 10-fold CV. Similarly, if you were using repeatedcv with number=10 and repeats = 5, it would be for(i in 1:50).

The 54 in sample.int(n=1000, 54) is the number of tuning parameter combinations. In your case, you have 4 parameters with 3, 3, 3 and 2 values, so 3*3*3*2 = 54. However, I remember reading somewhere that for gbm, the model is fit for max(n.trees) in the grid, and the models with fewer trees are derived from it. That explains why caret calculates the seeds based on interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18, and not 3*3*3*2 = 54, as we will see later.

But if you were using an SVM model with the grid svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5)), your value would be 6 * 6 = 36.
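The counts quoted above can be checked directly in base R. This is only a sketch: the names `gbm_grid`, `gbm_fits` and `svm_grid` are made up for illustration, and dropping `n.trees` before counting assumes it is the only parameter whose smaller values are derived from a single fit.

```r
# Grid from the question: 3 * 3 * 3 * 2 = 54 rows
gbm_grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                        interaction.depth = seq(1, 3, by = 1),
                        shrinkage = seq(.09, .11, by = .01),
                        n.minobsinnode = seq(8, 10, by = 2))

# Assumption: only n.trees can be derived from a larger fit, so count the
# distinct combinations of the remaining three parameters.
gbm_fits <- nrow(unique(gbm_grid[, setdiff(names(gbm_grid), "n.trees")]))

# SVM grid from the answer: 6 * 6 combinations
svm_grid <- expand.grid(sigma = 2^c(-25, -20, -15, -10, -5, 0), C = 2^c(0:5))

nrow(gbm_grid)  # 54
gbm_fits        # 18
nrow(svm_grid)  # 36
```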

Remember, the goal of using seeds is to allow reproducible research by setting the seeds for the models fit at each resampling iteration.

The seeds[[11]]<-sample.int(1000, 1) is used to set the seed for the last (optimum) model, which is fit to the complete dataset.
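Putting the pieces together, the seeds list could be built by a small helper like the one below. `make_seeds` and its arguments are hypothetical names, not part of caret; the sketch only reproduces the list shape described above (one integer vector per resample, plus a single integer for the final model).

```r
# Hypothetical helper (not part of caret): build a seeds list for trainControl().
# n_resamples = number * repeats; n_models = seeds needed per resample.
make_seeds <- function(n_resamples, n_models, seed = 825) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, size = n_models)
  }
  # Single seed for the final model fit to the complete dataset
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

cv_seeds  <- make_seeds(n_resamples = 10, n_models = 54)      # 10-fold CV
rcv_seeds <- make_seeds(n_resamples = 10 * 5, n_models = 54)  # repeatedcv, 5 repeats

length(cv_seeds)   # 11
length(rcv_seeds)  # 51
```

The result would then be passed as the seeds argument, e.g. trainControl(method = "cv", number = 10, seeds = cv_seeds).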

2 - Why you get an error if you specify a value < 18, but no error for a value >= 18

I was able to reproduce the same error on my machine:

    Error in train.default(x, y, weights = w, ...) :
      Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer

So, by inspecting the code of train.default I was able to find the source. The error message is triggered by the stop in lines 7 to 10, based on the test badSeed in lines 4 and 5.

    else {
        if (!(length(trControl$seeds) == 1 && is.na(trControl$seeds))) {
            numSeeds <- unlist(lapply(trControl$seeds, length))
            badSeed <- (length(trControl$seeds) < length(trControl$index) +          # line 4
                1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))      # line 5
            if (badSeed)
                stop(paste("Bad seeds: the seed object should be a list of length",  # line 7
                    length(trControl$index) + 1, "with", length(trControl$index),    # line 8
                    "integer vectors of size", nrow(trainInfo$loop),                 # line 9
                    "and the last list element having a", "single integer"))         # line 10
        }
    }

The number 18 is coming from nrow(trainInfo$loop), so we need to find the value of trainInfo$loop. The object trainInfo is assigned the value trainInfo <- models$loop(tuneGrid) in line 3:

    if (trControl$method != "none") {
        if (is.function(models$loop) && nrow(tuneGrid) > 1) {
            trainInfo <- models$loop(tuneGrid)                                       # line 3
            if (!all(c("loop", "submodels") %in% names(trainInfo)))
                stop("The 'loop' function should produce a list with elements 'loop' and 'submodels'")
        }

Now, we need to find the object models. It is assigned the value models <- getModelInfo(method, regex = FALSE)[[1]] in line 2:

    else {
        models <- getModelInfo(method, regex = FALSE)[[1]]                           # line 2
        if (length(models) == 0)
            stop(paste("Model", method, "is not in caret's built-in library"))
    }

Since we are using method = "gbm", we can look at the value of getModelInfo("gbm", regex = FALSE)[[1]]$loop and inspect the result below:

    > getModelInfo("gbm", regex = FALSE)[[1]]$loop
    function(grid) {
        loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),   # line 3
                      function(x) c(n.trees = max(x$n.trees)))
        submodels <- vector(mode = "list", length = nrow(loop))
        for(i in seq(along = loop$n.trees)) {
            index <- which(grid$interaction.depth == loop$interaction.depth[i] &
                           grid$shrinkage == loop$shrinkage[i] &
                           grid$n.minobsinnode == loop$n.minobsinnode[i])
            trees <- grid[index, "n.trees"]
            submodels[[i]] <- data.frame(n.trees = trees[trees != loop$n.trees[i]])
        }
        list(loop = loop, submodels = submodels)
    }
    >

The loop (in line 3 above) is assigned the value:

    loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
                  function(x) c(n.trees = max(x$n.trees)))

Now, let's pass your grid with 54 rows to the line above and inspect the result:

    > nrow(grid)
    [1] 54
    >
    > loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
    +               function(x) c(n.trees = max(x$n.trees)))
    > loop
       shrinkage interaction.depth n.minobsinnode n.trees
    1       0.09                 1              8     150
    2       0.09                 1             10     150
    3       0.09                 2              8     150
    4       0.09                 2             10     150
    5       0.09                 3              8     150
    6       0.09                 3             10     150
    7       0.10                 1              8     150
    8       0.10                 1             10     150
    9       0.10                 2              8     150
    10      0.10                 2             10     150
    11      0.10                 3              8     150
    12      0.10                 3             10     150
    13      0.11                 1              8     150
    14      0.11                 1             10     150
    15      0.11                 2              8     150
    16      0.11                 2             10     150
    17      0.11                 3              8     150
    18      0.11                 3             10     150
    >

Aha, found it. The value 18 is coming from nrow(trainInfo$loop), which in turn comes from getModelInfo("gbm", regex = FALSE)[[1]]$loop, shown above with 18 rows.
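To see how the 18 fitted models still cover all 54 grid rows, the same grouping can be reproduced without plyr. This sketch swaps ddply() for base R's aggregate(), so it is an approximation of what the gbm loop function does rather than caret's own code.

```r
grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))

# One row per combination of the other three parameters, keeping max(n.trees)
loop <- aggregate(n.trees ~ shrinkage + interaction.depth + n.minobsinnode,
                  data = grid, FUN = max)

# For each fitted model, the smaller n.trees values that are derived from it
submodels <- lapply(seq_len(nrow(loop)), function(i) {
  same <- grid$interaction.depth == loop$interaction.depth[i] &
    grid$shrinkage == loop$shrinkage[i] &
    grid$n.minobsinnode == loop$n.minobsinnode[i]
  trees <- grid$n.trees[same]
  trees[trees != loop$n.trees[i]]
})

n_fitted  <- nrow(loop)              # models actually fit per resample
n_derived <- sum(lengths(submodels)) # models derived from those fits
n_fitted              # 18
n_fitted + n_derived  # 54
```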

Now, going back to the test that triggered the error:

    badSeed <- (length(trControl$seeds) < length(trControl$index) +
        1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))

The first part of the test, (length(trControl$seeds) < length(trControl$index) + 1), is FALSE, but the second part, (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop))), is TRUE for all values less than 18 [coming from nrow(trainInfo$loop)], and FALSE for all values of 18 or more. That's why the error is triggered for values < 18 and not for values >= 18. As I said above, caret calculates the seeds based on interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 (a model is fit for max(n.trees) and the others are derived from it, so there is no need for 54 integers).
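The behaviour of that test can be tried out in isolation. The function below is a simplified stand-in for the check, not caret's code: `is_bad_seed`, `seeds_of_size`, `n_resamples` and `n_loop` are hypothetical names, with `n_resamples` playing the role of length(trControl$index) and `n_loop` the role of nrow(trainInfo$loop).

```r
# Simplified stand-in for the badSeed test in train.default
is_bad_seed <- function(seeds, n_resamples, n_loop) {
  num_seeds <- lengths(seeds)
  length(seeds) < n_resamples + 1 ||
    any(num_seeds[-length(num_seeds)] < n_loop)
}

# Build a seeds list for 10-fold CV with vectors of the given size
seeds_of_size <- function(size) {
  c(lapply(1:10, function(i) sample.int(1000, size)),
    list(sample.int(1000, 1)))
}

set.seed(825)
bad11 <- is_bad_seed(seeds_of_size(11), n_resamples = 10, n_loop = 18)
bad18 <- is_bad_seed(seeds_of_size(18), n_resamples = 10, n_loop = 18)
bad54 <- is_bad_seed(seeds_of_size(54), n_resamples = 10, n_loop = 18)
bad11  # TRUE:  11 < 18, so the error would be raised
bad18  # FALSE: exactly 18 seeds per resample is enough
bad54  # FALSE: extra seeds are accepted
```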


