Applications

This module stores the applications (mainly different machine learning models) we support. Our APIs follow LightGBM's. We currently support the subset considered most important; supporting the rest is primarily an engineering task and is on our roadmap.

class joinboost.app.App

Bases: ABC

An abstract base class for applications.

class joinboost.app.DecisionTree(num_leaves: int = 31, learning_rate: float = 1, max_depth: int = 6, subsample: float = 1, growth: str = 'bestfirst', debug: bool = False, partition_early: bool = True, enable_batch_optimization: bool = False)

Bases: DummyModel

DecisionTree extends the functionality of DummyModel to provide a decision tree-based model for classification or regression tasks.

Parameters:
  • num_leaves (int, optional) – Maximum number of leaves the tree can have. Defaults to 31.

  • learning_rate (float, optional) – Rate at which the model adjusts based on errors. Defaults to 1.

  • max_depth (int, optional) – Maximum depth of the tree. Defaults to 6.

  • subsample (float, optional) – Fraction of training data to be used for learning. Defaults to 1.

  • growth (str, optional) – Strategy for growing the tree. Defaults to “bestfirst”.

  • debug (bool, optional) – If set to True, enables debugging mode. Defaults to False.

  • partition_early (bool, optional) – If set to True, each tree split will materialize the partitioned fact table (as opposed to only creating a view). Defaults to True.

  • enable_batch_optimization (bool, optional) – If set to True, for each tree node, the set of queries that find the best splits for all features will be batched together and executed in one query. This is only applicable for pandas currently. Defaults to False.
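
The growth="bestfirst" strategy always splits the leaf whose best split yields the largest gain, stopping once num_leaves is reached (in contrast to depth-wise growth bounded by max_depth). The following is a toy, self-contained sketch of that control flow, not JoinBoost's actual implementation; grow_best_first and best_split are hypothetical helpers, and the split criterion here is sum-of-squared-errors reduction on a single list of values.

```python
import heapq

def best_split(values):
    """Return (gain, left, right) for the best threshold split by SSE reduction."""
    best = (0.0, None, None)
    n = len(values)
    if n < 2:
        return best
    mean = sum(values) / n
    sse = sum((v - mean) ** 2 for v in values)
    svals = sorted(values)
    for i in range(1, n):
        left, right = svals[:i], svals[i:]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        child_sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        gain = sse - child_sse
        if gain > best[0]:
            best = (gain, left, right)
    return best

def grow_best_first(values, num_leaves):
    """Best-first growth: repeatedly split the leaf with the highest gain."""
    leaves = [values]
    gain, l, r = best_split(values)
    # Max-heap on gain (negated, since heapq is a min-heap); every leaf
    # index has exactly one pending entry in the heap at all times.
    heap = [(-gain, 0)]
    candidates = {0: (gain, l, r)}
    while len(leaves) < num_leaves and heap:
        _, idx = heapq.heappop(heap)
        gain, l, r = candidates.pop(idx)
        if gain <= 0:
            break  # no remaining leaf improves the objective
        leaves[idx] = l
        leaves.append(r)
        for j, part in ((idx, l), (len(leaves) - 1, r)):
            g, a, b = best_split(part)
            candidates[j] = (g, a, b)
            heapq.heappush(heap, (-g, j))
    return leaves
```

With num_leaves acting as the budget, the highest-gain regions of the data are partitioned first, which is why best-first growth can reach deep, informative splits before exhausting the leaf budget.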

class joinboost.app.DummyModel

Bases: App

A dummy model that always uses the gradient/hessian as the prediction. For RMSE, this model uses the mean of the target variable as the prediction.
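
To make that concrete, here is a small numeric sketch (not JoinBoost internals): for squared loss, the constant prediction that minimizes the error is the target mean, the gradient of 0.5 * (pred - y)^2 with respect to pred is (pred - y), and the hessian is constant 1.

```python
targets = [2.0, 4.0, 6.0]

# The mean minimizes squared error among constant predictions.
prediction = sum(targets) / len(targets)

# First derivative of 0.5 * (prediction - y)^2 w.r.t. the prediction.
gradients = [prediction - y for y in targets]

# Second derivative is 1 for every example under squared loss.
hessians = [1.0 for _ in targets]
```

These gradients/hessians are exactly what a boosting round would hand to the next tree as its fitting targets.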

class joinboost.app.GradientBoosting(num_leaves: int = 31, learning_rate: float = 1, max_depth: int = 6, n_estimators: int = 1, debug: bool = False, partition_early: bool = False, enable_batch_optimization: bool = False)

Bases: DecisionTree

GradientBoosting extends the functionality of DecisionTree to implement the gradient boosting algorithm for classification or regression tasks. It builds an additive model in a forward stage-wise fashion by iteratively adding decision trees to minimize the loss function.

Parameters:
  • num_leaves (int, optional) – Maximum number of leaves each decision tree can have. Defaults to 31.

  • learning_rate (float, optional) – Rate at which the model adjusts based on errors. This influences the contribution of each tree to the final prediction. Defaults to 1.

  • max_depth (int, optional) – Maximum depth of each decision tree. Defaults to 6.

  • n_estimators (int, optional) – Number of boosting stages or decision trees to be run. Essentially, how many times the boosting procedure should be executed. Defaults to 1.

  • debug (bool, optional) – If set to True, enables debugging mode. Defaults to False.

  • partition_early (bool, optional) – If set to True, each decision tree split will materialize the partitioned fact table (as opposed to only creating a view). Defaults to False.

  • enable_batch_optimization (bool, optional) – If set to True, for each tree node, the set of queries that find the best splits for all features will be batched together and executed in one query. This is only applicable for pandas currently. Defaults to False.
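
The forward stage-wise procedure described above can be sketched in a few lines of plain Python. This is a conceptual illustration, not JoinBoost's implementation: fit_stump and gradient_boost are hypothetical helpers, the weak learner is a one-split stump on a single feature, and the loss is squared error (so the negative gradients are the residuals).

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold split on a single feature, minimizing SSE."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x, t=t, ml=ml, mr=mr: ml if x <= t else mr

def gradient_boost(xs, ys, n_estimators=3, learning_rate=0.5):
    base = sum(ys) / len(ys)          # stage 0: the mean (the DummyModel prediction)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_estimators):
        # For squared loss the negative gradients are just the residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Each tree's contribution is damped by learning_rate.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + sum(learning_rate * t(x) for t in trees)
```

Each stage fits the current residuals and shrinks its contribution by learning_rate, which is why more stages (n_estimators) with a smaller learning_rate typically generalize better than a single aggressive fit.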

class joinboost.app.RandomForest(num_leaves: int = 31, learning_rate: float = 1, max_depth: int = 6, subsample: float = 1, n_estimators: int = 1, debug: bool = False, partition_early: bool = False, growth: str = 'bestfirst', enable_batch_optimization: bool = False)

Bases: DecisionTree

RandomForest builds upon the DecisionTree to create an ensemble method that fits multiple decision trees to subsets of the dataset and uses averaging to improve the predictive accuracy and control overfitting.

Parameters:
  • num_leaves (int, optional) – Maximum number of leaves each tree can have. Defaults to 31.

  • learning_rate (float, optional) – Rate at which the model adjusts based on errors. Defaults to 1.

  • max_depth (int, optional) – Maximum depth of each tree. Defaults to 6.

  • subsample (float, optional) – Fraction of training data to be used for learning by each tree. Defaults to 1.

  • n_estimators (int, optional) – Number of trees in the random forest. Defaults to 1.

  • debug (bool, optional) – If set to True, enables debugging mode. Defaults to False.

  • partition_early (bool, optional) – If set to True, each tree split will materialize the partitioned fact table (as opposed to only creating a view). Defaults to False.

  • growth (str, optional) – Strategy for growing each tree in the forest. Defaults to “bestfirst”.

  • enable_batch_optimization (bool, optional) – If set to True, for each node in the trees, the set of queries that find the best splits for all features will be batched together and executed in one query. This is only applicable for pandas currently. Defaults to False.
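
The subsample-and-average structure described above can be shown with a deliberately trivial "tree". This is a toy sketch, not JoinBoost internals: random_forest_mean is a hypothetical helper in which each tree simply predicts the mean of its row subsample, so only the ensemble mechanics (row subsampling via subsample, averaging over n_estimators trees) remain visible.

```python
import random

def random_forest_mean(ys, n_estimators=5, subsample=0.5, seed=0):
    """Average the predictions of n_estimators 'trees', each fit on a subsample."""
    rng = random.Random(seed)
    k = max(1, int(subsample * len(ys)))   # rows drawn for each tree
    tree_preds = []
    for _ in range(n_estimators):
        sample = rng.sample(ys, k)             # row subsampling for this tree
        tree_preds.append(sum(sample) / k)     # each 'tree' predicts its sample mean
    return sum(tree_preds) / n_estimators      # average across trees
```

Because each tree sees a different subsample, their individual errors partly cancel in the average; that variance reduction is what controls overfitting relative to a single deep tree.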
