Project author: Blunde1

Project description:
Adaptive and automatic gradient boosting computations.
Primary language: C++
Repository: git://github.com/Blunde1/agtboost.git
Created: 2019-09-03T20:42:51Z
Project community: https://github.com/Blunde1/agtboost

License: MIT License


aGTBoost

Adaptive and automatic gradient tree boosting computations

aGTBoost is a lightning-fast gradient boosting library designed to avoid manual tuning and cross-validation by taking an information-theoretic approach.
This makes the algorithm adaptive to the dataset at hand; it is completely automatic, with minimal risk of overfitting.
Consequently, speed-ups relative to state-of-the-art implementations can be in the thousands, while the mathematical and technical knowledge required of the user is minimized.

Note: Currently for academic purposes: implementing and testing new innovations w.r.t. information-theoretic choices of GTB complexity. See the to-do research list below.

Installation

R: Finally on CRAN! Install the stable version with

```r
install.packages("agtboost")
```

or install the development version from GitHub

```r
devtools::install_github("Blunde1/agtboost/R-package")
```

Users experiencing errors after warnings during installation may be helped by running the following command prior to installation:

```r
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")
```

Example code and documentation

agtboost essentially has two functions: a training function, gbt.train, and a predict function, predict.
The code below shows how to train an aGTBoost model using a design matrix x and a response vector y; write ?gbt.train in the console for detailed documentation.

```r
library(agtboost)
# -- Load data --
data(caravan.train, package = "agtboost")
data(caravan.test, package = "agtboost")
train <- caravan.train
test <- caravan.test
# -- Model building --
mod <- gbt.train(train$y, train$x, loss_function = "logloss", verbose=10)
# -- Predictions --
prob <- predict(mod, test$x) # Score after logistic transformation: Probabilities
```
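As a quick follow-up (not part of the original example), the predicted probabilities can be scored against the held-out response with the binomial log-loss. The snippet below is a minimal sketch and assumes test$y is a 0/1 vector, as in the bundled caravan data.

```r
# Sketch: held-out log-loss of the predicted probabilities.
# Assumes test$y is a 0/1 response vector (bundled caravan data).
eps <- 1e-15                                # guard against log(0)
p <- pmin(pmax(prob, eps), 1 - eps)
test_logloss <- -mean(test$y * log(p) + (1 - test$y) * log(1 - p))
test_logloss
```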

agtboost also contains functions for model inspection and validation.

  • Feature importance: gbt.importance generates a typical feature-importance plot.
    Techniques like inserting noise features are redundant due to computations w.r.t. approximate generalization (test) loss.
  • Convergence: gbt.convergence computes the loss over the path of boosting iterations. Check visually for convergence of the test loss (see the sketch after the code block below).
  • Model validation: gbt.ksval transforms observations to standard uniformly distributed random variables, if the model is specified
    correctly. It performs a formal Kolmogorov-Smirnov test and plots the transformed observations for visual inspection.
```r
# -- Feature importance --
gbt.importance(feature_names=colnames(caravan.train$x), object=mod)

# -- Model validation --
gbt.ksval(object=mod, y=caravan.test$y, x=caravan.test$x)
```

The functions `gbt.ksval` and `gbt.importance` create the following plots:
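Convergence can be inspected in the same style. The sketch below is not from the original example; it assumes gbt.convergence accepts the fitted model together with a response vector and design matrix (mirroring the gbt.ksval call above) and returns one loss value per boosting iteration.

```r
# -- Convergence (sketch) --
# Assumption: gbt.convergence(object, y, x) returns the loss per boosting iteration.
conv <- gbt.convergence(object=mod, y=caravan.test$y, x=caravan.test$x)
plot(conv, type="l", xlab="Boosting iteration", ylab="Test loss")
```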

Furthermore, see the example code above for further ways of working with an aGTBoost model.

Dependencies

Scheduled updates

  • Adaptive and automatic deterministic frequentist gradient tree boosting.
  • Information criterion for fast histogram algorithm (non-exact search) (Fall 2020, planned)
  • Adaptive L2-penalized gradient tree boosting. (Fall 2020, planned)
  • Automatic stochastic gradient tree boosting. (Fall 2020/Spring 2021, planned)

Hopeful updates

  • Optimal stochastic gradient tree boosting.


Contribute

Any help on the following subjects is especially welcome:

  • Utilizing sparsity (possibly Eigen sparsity).
  • Parallelization (CPU and/or GPU).
  • Distribution (Python, Java, Scala, …).
  • Good ideas and coding best practices in general.

Please note that the priority is to work on and push the above-mentioned scheduled updates. Patience is a virtue. :)