--- title: "The buzz on using bHIVE" author: name: Nick Borcherding email: ncborch@gmail.com affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true package: bHIVE vignette: > %\VignetteIndexEntry{The buzz on using bHIVE} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results="hide", message=FALSE} knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE, tidy = FALSE) library(BiocStyle) library(bHIVE) library(ggplot2) library(viridis) library(caret) ``` # Introduction The **bHIVE** package implements an Artificial Immune Network (AI-Net) algorithm for clustering and classification tasks. Inspired by biological immune systems, bHIVE uses principles like clonal selection, mutation, and suppression to analyze and model data. This vignette demonstrates how to: 1. Perform clustering and classification using `bHIVE`. 2. Tune hyperparameters using `swarmbHIVE`. 3. Use the `caret` wrapper for easy integration with machine learning workflows. 4. Use multilayered immune networks with `honeycombHIVE` 5. Visualize results with `ggplot2`. ## Parameters for `bHIVE()` The behavior of the `bHIVE` function can be fine-tuned using a range of hyperparameters. Below is a description of the key parameters: | **Parameter** | **Description** | |------------------------|-------------------------------------------------------------------------------------------------------------| | `X` | A numeric matrix or data frame of input features (rows are observations, columns are features). | | `y` | (Optional) Factor target vector for classification. If `NULL`, clustering is performed. | | `task` | Specifies the task: `"clustering"` or `"classification"`. | | `nAntibodies` | Number of initial antibodies in the population. Larger values increase diversity but add computational cost. | | `beta` | Clone multiplier. Determines how many clones are generated for top-matching antibodies. | | `epsilon` | Suppression threshold. Antibodies closer than `epsilon` are considered redundant and removed. | | `maxIter` | Maximum number of iterations for the algorithm. | | `affinityFunc` | Affinity (similarity) function. Options include `"gaussian"`, `"laplace"`, `"polynomial"`, `"cosine"`, `"hamming"`. | | `distFunc` | Distance function for clustering and suppression. Options include `"euclidean"`, `"manhattan"`, `"cosine"`, `"minkowski"`, `"hamming"`. | | `affinityParams` | A list of optional parameters for the affinity/distance functions. | | `mutationDecay` | Factor controlling how the mutation rate decays over iterations. Default is `1.0` (no decay). | | `mutationMin` | Minimum mutation rate to avoid zero mutation. | | `maxClones` | Maximum number of clones per antibody. Default is unlimited (`Inf`). | | `stopTolerance` | Tolerance for stopping the algorithm if the antibody population size stabilizes. | | `noImprovementLimit` | Number of iterations without improvement before early stopping. | | `initMethod` | Method for initializing antibodies. Options: `"sample"` (randomly selects rows from `X`), `"random"` (Gaussian noise), `"random_uniform"` (samples uniformly in [min, max] of each column), or `"kmeans++"` (kmeans++-like initialization for coverage). | | `k` | Number of top-matching antibodies to consider during cloning. | | `seed` | Random seed for reproducibility. | | `verbose` | Logical. If `TRUE`, prints progress messages for each iteration. | ### How `bHIVE` Works 1. **Affinity Calculation**: Measures the similarity between each data point and antibodies using a specified `affinityFunc`. 2. **Clonal Expansion and Mutation**: Generates clones of top-matching antibodies, introducing diversity via mutation. 3. **Network Suppression**: Removes antibodies that are too similar to maintain diversity in the population. 4. **Assignment**: Assigns data points to antibodies (clusters, labels, or numeric predictions). ### Example of Parameter Selection The following example demonstrates how to configure the `bHIVE` function for clustering, emphasizing the impact of key parameters: ```{r tidy = FALSE, eval=FALSE} # Load the Iris dataset data(iris) X <- as.matrix(iris[, 1:4]) # Configure bHIVE parameters for clustering set.seed(42) res <- bHIVE( X = X, # Input data task = "clustering", # Task type nAntibodies = 20, # Number of antibodies beta = 5, # Clone multiplier epsilon = 0.01, # Suppression threshold maxIter = 20, # Maximum iterations affinityFunc = "gaussian", # Affinity function distFunc = "euclidean", # Distance function verbose = TRUE # Print progress ) ``` # Applications ## Clustering **Clustering** is an unsupervised learning task where we group similar data points based on their features. In this example, we cluster the numeric features of the Iris dataset using `bHIVE`. ```{r tidy = FALSE} # Load Iris dataset data(iris) X <- as.matrix(iris[, 1:4]) # Perform clustering set.seed(42) res <- bHIVE(X = X, task = "clustering", nAntibodies = 10, beta = 5, epsilon = 0.05, maxIter = 20, k = 3, verbose = FALSE) # Add cluster assignments to the data iris$Cluster <- as.factor(res$assignments) # Visualize clusters ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) + geom_point(size = 3) + labs(title = "Clustering Results with bHIVE", x = "Sepal Length", y = "Sepal Width") + scale_color_viridis(discrete = TRUE) + theme_minimal() ``` ## Classification **Classification** is a supervised learning task where data points are assigned to predefined categories based on their features. Here, we classify the species of Iris flowers using `bHIVE`. ```{r tidy = FALSE} # Classification setup y <- iris$Species # Perform classification set.seed(42) res <- bHIVE(X = X, y = y, task = "classification", nAntibodies = 100, beta = 5, epsilon = 0.05, initMethod = "random", k = 4, verbose = FALSE) # Visualize classification results iris$Predicted <- res$assignments ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Predicted, shape = Species)) + geom_point(size = 3) + labs(title = "Classification Results with bHIVE", x = "Sepal Length", y = "Sepal Width") + scale_color_viridis(discrete = TRUE) + theme_minimal() ``` Comparing predicted vs actual ```{r tidy = FALSE} table(Predicted = res$assignments, Actual = y) ``` ## Hyperparameter Tuning Tuning hyperparameters is crucial for optimizing the performance of machine learning algorithms. In this example, we perform hyperparameter tuning for clustering using `swarmbHIVE`. ```{r tidy = FALSE} grid <- expand.grid( nAntibodies = c(10, 20), beta = c(3, 5), epsilon = c(0.01, 0.05) ) # Perform tuning set.seed(42) tuning_results <- swarmbHIVE(X = X, task = "clustering", grid = grid, metric = "silhouette", verbose = FALSE) # Visualize tuning results ggplot(tuning_results$results, aes(x = nAntibodies, y = metric_value, color = factor(beta))) + geom_line() + geom_point(aes(shape = as.factor(epsilon)), size = 3) + labs(title = "Hyperparameter Tuning Results", x = "Number of Antibodies", y = "Silhouette Score", color = "Beta", shape = "Epsilon") + scale_color_viridis(discrete = TRUE) + theme_minimal() ``` Best parameters ```{r tidy = FALSE} tuning_results$best_params ``` ## Using the `caret` wrapper The `bHIVE` package provides a `caret` wrapper for seamless integration with the `caret` framework, allowing for easy cross-validation and hyperparameter tuning. Here, we demonstrate classification on the iris dataset. ```{r message=FALSE, warning=FALSE, tidy=FALSE} data(iris) X <- as.matrix(iris[, 1:4]) y <- iris$Species # Splitting Training and Validation Data Sets set.seed(42) sample.idx <- sample(nrow(X), nrow(X)*0.7) x_test <- X[sample.idx,] x_val <- X[-sample.idx,] y_test <- y[sample.idx] y_val <- y[-sample.idx] train_control <- trainControl(method = "cv", number = 2) set.seed(42) model <- train(x = x_test, y = y_test, method = bHIVEmodel, trControl = train_control, tuneGrid = expand.grid( nAntibodies = c(10, 20, 30), beta = c(3, 5, 10), epsilon = c(0.01, 0.05, 0.1)), verbose = FALSE) # Visualize caret results ggplot(model) + labs(title = "Caret Tuning Results for bHIVE", x = "Hyperparameter Combination", y = "Performance Metric") + scale_color_viridis(discrete = TRUE) + theme_minimal() ``` ### Applying the caret-based model To use the best performing model from the above, we just using the ```predict()``` function with the separated validation data set (**X_val**). ```{r tidy = FALSE} preds <- predict(model, newdata = x_val) table(Predicted = preds, Actual = y_val) ``` ## honeycombHIVE: Multilayered bHIVES In `honeycombHIVE`, clustering proceeds hierarchically across multiple layers: `honeycombHIVE` iteratively builds and refines a network of "antibodies" (prototypes) to capture complex patterns in data. Initially, the `bHIVE` algorithm creates a set of prototypes, and then each layer can be fine-tuned using gradient-based updates (via the `refineB()` function) with flexible optimizers like SGD, Adam, and RMSProp. Fo each layer, a collapse step aggregates the refined prototypes - by computing a centroid, medoid, or another statistic - compressing the data into a new, lower-dimensional representation where each prototype becomes a new observation. This new representation serves as the input for the next layer, similar to neural network architecture. ```{r tidy=FALSE} # Load the Iris dataset data(iris) X <- as.matrix(iris[, 1:4]) # Run honeycombHIVE for clustering res <- honeycombHIVE(X = X, task = "clustering", epsilon = 0.05, layers = 3, nAntibodies = 30, beta = 5, maxIter = 10, verbose = FALSE) # Visualize results from each layer for (i in seq_along(res)) { # Create a data frame for plotting; add original Sepal.Length and Sepal.Width plot_df <- data.frame( Sepal.Length = iris$Sepal.Length, Sepal.Width = iris$Sepal.Width, Cluster = factor(res[[i]]$membership) # cluster labels from layer i ) # Basic ggplot scatter plot plot <- ggplot(plot_df, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) + geom_point(size = 3) + labs( title = paste("honeycombHIVE Clustering - Layer", i), x = "Sepal Length", y = "Sepal Width" ) + theme_minimal() + scale_color_viridis(discrete = TRUE) print(plot) } ``` Note: If you use `task = "classification"`, honeycombHIVE will generate multi-layer predictions. You can compare each layer's predictions against the true labels to see if performance improves or if the data become too collapsed. ## Gradient-based Refinement The ```refineB()``` function takes the prototypes produced by the ```bHIVE()``` algorithm and fine-tunes them using gradient‐based updates. In addition to the basic parameters: * **loss:** The loss function used to guide refinement. For a classification task, "categorical_crossentropy" is a typical choice. * **steps:** The number of gradient update iterations. * **lr:** The learning rate for each update. * **optimizers** * *sgd:* Standard stochastic gradient descent. * *momentum:* SGD with momentum. (Control the momentum with **momentum_coef**). * *adagrad:* Adaptive gradient descent. * *adam:* Adaptive moment estimation. (Control the decay rates with **beta1** and **beta2**). * *rmsprop:* Root Mean Square Propagation. (Control the decay of the squared gradient average with **rmsprop_decay**) ```{r tidy=FALSE} # Prepare the Iris dataset X <- as.matrix(iris[, 1:4]) y <- iris$Species # Run bHIVE to obtain initial antibody prototypes. set.seed(42) res <- bHIVE(X = X, y = y, task = "classification", nAntibodies = 10, beta = 5, epsilon = 0.05, initMethod = "random", k = 4, verbose = FALSE) Ab <- res$antibodies colnames(Ab) <- colnames(X) assignments <- res$assignments # Ensure assignments are numeric indices. assignments <- as.integer(factor(assignments, levels = unique(assignments))) # PCA of the Iris data for visualization in 2D pca <- prcomp(X, scale. = TRUE) X_pca <- pca$x[, 1:2] colnames(X_pca) <- c("PC1", "PC2") # Transform initial prototypes into PCA space. A_bhive_pca <- predict(pca, Ab) # Run refinement using several optimizers. optimizers <- c("sgd", "adam", "rmsprop") refined_list <- lapply(optimizers, function(opt) { Ab_refined <- refineB(A = Ab, X = X, y = y, assignments = assignments, loss = "categorical_crossentropy", task = "classification", steps = 5, lr = 0.01, verbose = FALSE, optimizer = opt, beta1 = 0.9, beta2 = 0.999, rmsprop_decay = 0.9) colnames(Ab_refined) <- colnames(X) A_refined_pca <- predict(pca, Ab_refined) data.frame(optimizer = opt, PC1_after = A_refined_pca[,1], PC2_after = A_refined_pca[,2]) }) refined_df <- do.call(rbind, refined_list) refined_df$optimizer <- factor(refined_df$optimizer, levels = optimizers) ``` ```{r echo=FALSE} # Build a data frame for the data points (colored by their bHIVE assignment) plot_data <- data.frame(PC1 = X_pca[, 1], PC2 = X_pca[, 2], Cluster = factor(assignments)) # Build a data frame for original prototypes (same for every optimizer) proto_df <- data.frame(ID = 1:nrow(A_bhive_pca), PC1_before = A_bhive_pca[,1], PC2_before = A_bhive_pca[,2]) proto_df$ID <- factor(proto_df$ID) # Add an ID to the refined data for merging. refined_df$ID <- factor(rep(1:nrow(A_bhive_pca), times = length(optimizers))) # Merge the original and refined prototypes by ID. merged_df <- merge(proto_df, refined_df, by = "ID") # Visualizing data and refined prototypes ggplot() + geom_point(data = plot_data, aes(x = PC1, y = PC2, color = Cluster), alpha = 0.6) + geom_point(data = proto_df, aes(x = PC1_before, y = PC2_before), color = "#440154", size = 4) + geom_point(data = merged_df, aes(x = PC1_after, y = PC2_after), color = "#FDE725", size = 4) + geom_segment(data = merged_df, aes(x = PC1_before, y = PC2_before, xend = PC1_after, yend = PC2_after), arrow = arrow(length = unit(0.2, "cm"), type = "closed"), color = "black", lwd = 0.5) + scale_color_viridis(discrete = TRUE) + labs(title = "Prototype Refinement in PCA Space", x = "PC1", y = "PC2") + facet_wrap(~ optimizer, nrow = 1, labeller = as_labeller(c(sgd = "SGD", adam = "Adam", rmsprop = "RMSProp"))) + theme_minimal() + theme(plot.title = element_text(face = "bold", size = 14), strip.text = element_text(face = "bold")) ``` **Interpretation** * Data points are shown in a light transparency, colored by their cluster assignments (as determined by the original `bHIVE` algorithm). * Initial prototypes are shown in dark purple. These are the prototypes (antibodies) before refinement. * Refined prototypes are shown in bright yellow. The arrows indicate the direction and magnitude of the update from the original positions to the refined positions. ## Applying Refinement to honeycombHIVE ```{r tidy=FALSE} data(iris) X <- as.matrix(iris[, 1:4]) y <- iris$Species res_class <- honeycombHIVE(X = X, y = y, layers = 3, task = "classification", nAntibodies = 30, beta = 5, epsilon = 0.01, verbose = FALSE) res_class_refine <- honeycombHIVE(X = X, y = y, task = "classification", layers = 3, nAntibodies = 30, beta = 5, epsilon = 0.01, refine = TRUE, refineOptimizer = "adam", refineLoss = "categorical_crossentropy", refineSteps = 3, refineLR = 0.01, verbose = FALSE) table(Refined = res_class_refine[[3]]$predictions, Actual = y) ``` # Visualizing Results with visualizeHIVE Below are several examples that demonstrate how to use the visualizeHIVE() function for different tasks and plot types. ## Example 1: Scatterplot for Clustering Results In this example we run `honeycombHIVE()` on the Iris dataset for a clustering task. The scatterplot is generated using a PCA transformation to reduce the feature space to two dimensions. Data points are colored by cluster membership (treated as discrete) and prototypes are overlaid as black points. ```{r} data(iris) X <- as.matrix(iris[, 1:4]) set.seed(42) res <- honeycombHIVE(X = X, task = "clustering", epsilon = 0.05, layers = 3, nAntibodies = 30, beta = 5, maxIter = 10, verbose = FALSE) visualizeHIVE(result = res, X = iris[, 1:4], plot_type = "scatter", title = "Layer 2: Scatterplot (Clustering)", layer = 2, task = "clustering", transform = TRUE, transformation_method = "PCA") ``` This scatterplot shows the data projected onto the first two principal components. Each data point is colored according to its cluster (discrete grouping), and the prototypes computed by honeycombHIVE are displayed as large black points. Faceting by layer is applied if multiple layers are selected. ## Example 2: Violin Plot for Classification Results For a classification task, assume that the algorithm stores class predictions. The following example generates a violin plot of the "Sepal.Width" feature from layer 1. The discrete grouping (class labels) is used to color the violins, and prototype values are overlaid as black points. ```{r} set.seed(42) res_class <- honeycombHIVE(X = X, y = iris$Species, task = "classification", layers = 2, nAntibodies = 15, beta = 5, maxIter = 10, verbose = FALSE) visualizeHIVE(result = res_class, X = iris[, 1:4], plot_type = "violin", feature = "Sepal.Width", title = "Violin Plot: Sepal.Width by Group", layer = 1, task = "classification") ``` The violin plot shows the distribution of **Sepal.Width** for each predicted class. Discrete color scales ensure that class labels are clearly distinguished, and the black markers indicate the prototype (group summary) for each group. # Conclusions The `bHIVE` package is a versatile tool for clustering and classification tasks. Its integration with caret simplifies hyperparameter tuning and cross-validation, making it suitable for a variety of datasets and use cases. If you have any questions, comments, or suggestions, please visit the [GitHub repository](https://github.com/BorchLab/bHIVE). ```{r} sessionInfo() ```