v2.3.1 is released on CRAN

@asalavaty released this 10 Jun 23:36
· 6 commits to master since this release

Major updates

  • Substantially updated and optimized the exir function for improved scalability, flexibility, and usability across bulk and single-cell omics datasets.

  • Updated exir to accept experimental data in multiple formats, including data frames, tibbles, matrices, sparse matrices, and Seurat objects.

  • Updated the expected non-Seurat experimental data format for exir to support the common omics layout, with features/genes in rows and samples/cells in columns. Internally, exir automatically converts the input to the required analysis format.

  • Replaced the previous Condition_colname workflow with the more flexible condition argument. The condition argument can now be either a condition row/column name or a character/factor vector with the same order as the samples/cells in the input data. For Seurat objects, condition can be the name of a metadata column.

  • Added Seurat object support to exir via the new assay and layer arguments.

  • Added the Exptl_data_type argument to specify whether the experimental data are "bulk" or "sc", enabling data-type-aware preprocessing, normalization, pseudo-sampling, and warnings.

  • Restored and redesigned the normalize argument. For bulk count-like data, normalize = TRUE now applies TMM normalization followed by logCPM transformation using edgeR. For single-cell data, normalization is applied after pseudo-bulk aggregation when pseudo-sampling is enabled.

  • Added pseudo-sampling/pseudo-bulking support to exir through the new pseudo_sample and pseudo_samples_per_group arguments. This is particularly useful for large datasets and single-cell RNA-seq data.

  • Implemented condition-stratified, non-overlapping pseudo-sampling. For bulk data, pseudo-samples are generated by averaging normalized expression values within condition-specific groups. For single-cell data, pseudo-bulk samples are generated by summing raw counts within condition-specific groups followed by TMM/logCPM normalization using edgeR.

  • Added the Exptl_data_size_check argument to optionally prompt users to consider pseudo-sampling when the number of samples/cells is large.

  • Added conservative feature filtering to exir through the new feature_filter, min_feature_prevalence, min_feature_total, min_feature_variance, and always_keep_diff_features arguments. This filter removes uninformative features with insufficient prevalence, total signal, or variance without performing highly variable gene selection. Features in Diff_data and Desired_list can be forced to remain in the analysis.
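
The condition-stratified, non-overlapping pseudo-bulking described above can be sketched in base R. This is only an illustration of the idea, not the code used inside exir: the helper name `pseudo_bulk_sketch`, its arguments, and the toy data are all hypothetical, and it assumes a dense features-by-cells count matrix.

```r
# Hypothetical helper (not part of the influential API): sums raw counts
# within disjoint, condition-specific groups of cells.
pseudo_bulk_sketch <- function(counts, condition, per_group = 3, seed = 1) {
  stopifnot(ncol(counts) == length(condition))
  condition <- as.character(condition)
  set.seed(seed)  # note: this resets the global RNG state
  blocks <- list()
  for (cond in unique(condition)) {
    idx <- which(condition == cond)
    # Shuffle the cells of this condition, then deal them into disjoint bins
    idx <- idx[sample.int(length(idx))]
    k <- min(per_group, length(idx))
    bin <- rep_len(seq_len(k), length(idx))
    for (b in seq_len(k)) {
      cols <- idx[bin == b]
      # Each cell contributes to exactly one pseudo-sample
      blocks[[paste0(cond, "_ps", b)]] <- rowSums(counts[, cols, drop = FALSE])
    }
  }
  do.call(cbind, blocks)
}

# Toy data: 3 genes x 8 cells, 4 cells per condition (illustrative only)
counts <- matrix(1:24, nrow = 3,
                 dimnames = list(paste0("g", 1:3), paste0("c", 1:8)))
condition <- rep(c("ctrl", "treated"), each = 4)
pb <- pseudo_bulk_sketch(counts, condition, per_group = 2)
```

Because the bins never overlap and never mix conditions, the per-condition count totals are preserved. According to the notes above, exir then applies TMM/logCPM normalization to such single-cell pseudo-bulk sums.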

Performance improvements

  • Optimized exir data preparation to delay dense conversion of sparse input data until after optional pseudo-sampling and feature filtering, reducing memory pressure for large omics datasets.

  • Optimized PCA in exir by replacing full PCA with truncated PCA using irlba::prcomp_irlba for the first principal component.

  • Optimized the correlation table handling in exir to avoid unnecessary full-table duplication while preserving the original association analysis logic and output.

  • Optimized graph reconstruction in exir by removing unintended self-loops while preserving multiple edges.

  • Optimized neighbourhood score calculation in exir by replacing row-by-row igraph::neighbors() calls with sparse adjacency matrix multiplication.

  • Vectorized row-wise scoring and classification operations in exir, including primitive driver score calculation and driver/biomarker type assignment.

  • Optimized the extraction of first- and second-order associated drivers for mediator tables by replacing repeated regex-based grep() searches with batched neighbourhood retrieval and set-based matching.

  • Optimized several IVI-related routines while preserving output consistency with the original implementation.

  • Optimized clusterRank by avoiding repeated degree calculations and reducing redundant graph traversal.

  • Optimized lh_index and h_index by precomputing repeated neighbourhood-size and H-index components where possible.

  • Optimized neighborhood.connectivity by precomputing first-order neighbourhood sizes and reducing repeated calls to igraph::neighborhood.size.

  • Optimized collective.influence while preserving identical output.
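
As an illustration of the sparse-adjacency approach mentioned for the neighbourhood score, the sketch below sums a per-node score over first-order neighbours with a single matrix-vector product and checks it against a row-by-row loop. The toy graph, the score values, and the definition of the score as a plain neighbour sum are assumptions made for the example, not details taken from exir.

```r
library(Matrix)

# Toy undirected graph as an edge list over nodes 1..5 (illustrative only)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
n <- 5
node_score <- c(0.5, 1.0, 2.0, 0.25, 4.0)

# Symmetric sparse adjacency matrix; duplicated (i, j) pairs would be summed,
# so multiple edges are counted rather than collapsed
A <- sparseMatrix(i = c(edges[, 1], edges[, 2]),
                  j = c(edges[, 2], edges[, 1]),
                  x = rep(1, 2 * nrow(edges)),
                  dims = c(n, n))

# One multiplication yields the neighbour-score sum for every node at once
nbr_sum_fast <- as.numeric(A %*% node_score)

# Reference: per-node lookup, analogous to calling a neighbours function per row
nbr_sum_loop <- vapply(seq_len(n), function(v) {
  nbrs <- c(edges[edges[, 1] == v, 2], edges[edges[, 2] == v, 1])
  sum(node_score[nbrs])
}, numeric(1))

stopifnot(isTRUE(all.equal(nbr_sum_fast, nbr_sum_loop)))
```

The matrix form avoids one function call per node, which is where the saving on large graphs would be expected to come from.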

Usability and documentation

  • Replaced older verbose output in exir with cleaner cli-based progress and stage reporting.

  • Updated error and warning messages in the exir workflow using cli for clearer user-facing feedback.

  • Updated the ExIR vignette to document the new input formats, data orientation, condition handling, normalization options, pseudo-sampling workflow, Seurat support, and conservative feature filtering.

  • Updated exir documentation to clarify recommended input requirements for bulk and single-cell data.

  • Updated exir documentation to clarify that TMM/logCPM normalization is appropriate for many bulk RNA-seq count datasets but may not be appropriate for all omics data modalities.

  • Updated examples for the new condition, Exptl_data_type, Exptl_data_orientation, normalize, pseudo_sample, and feature_filter workflows.

Dependency updates

  • Added edgeR for TMM/logCPM normalization of count-like bulk data and pseudo-bulked single-cell data.

  • Added use of cli for improved messages and progress reporting.

  • Added use of Matrix utilities for sparse matrix handling and efficient graph/neighbourhood calculations.

  • Added optional Seurat object support through SeuratObject.
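
For readers unfamiliar with the edgeR step, a minimal sketch of TMM normalization followed by a logCPM transformation is shown below. It uses simulated counts and edgeR defaults; the exact settings exir passes to edgeR are not stated in these notes, so treat the parameters here as assumptions.

```r
# Runs only if edgeR is installed; the counts are simulated for illustration
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  counts <- matrix(rnbinom(200 * 6, mu = 50, size = 2), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

  dge <- edgeR::DGEList(counts = counts)               # wrap counts + library sizes
  dge <- edgeR::calcNormFactors(dge, method = "TMM")   # TMM scaling factors
  log_cpm <- edgeR::cpm(dge, log = TRUE)               # log2 counts per million
}
```

The result keeps the features-in-rows, samples-in-columns layout of the input.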

Bug fixes and minor improvements

  • Improved handling of missing values in experimental data during preprocessing.

  • Improved detection of count-like, sparse, bulk-like, and single-cell-like input data characteristics.

  • Improved validation of Seurat assays, layers, and metadata-derived condition labels.

  • Improved validation of pseudo-sampling settings, including condition-specific sample/cell counts.

  • Improved memory cleanup after full correlation table reduction in exir.

  • Improved consistency of output table preparation after vectorized score/type calculations.