Research Glossary

Here are key terms related to research methodology and computational biology.

ARACNe

Algorithm for the Reconstruction of Accurate Cellular Networks; a mutual information-based algorithm used to reconstruct transcriptional networks by removing indirect interactions between transcription factors and their targets.

Bias

A systematic error that leads to incorrect estimates of effect or association. Can emerge from flawed design, measurement, or analysis.

Blinding

The process of withholding knowledge of group assignment from participants, investigators, or analysts to reduce bias. Can be single- or double-blind.

Bootstrap

A resampling method used to estimate statistics (like confidence intervals) from data by repeatedly sampling with replacement. Useful for evaluating uncertainty in high-dimensional biological data.
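A minimal Python sketch of the percentile bootstrap. The data values and function names here are hypothetical, chosen only to illustrate resampling with replacement:

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    # Draw n_boot resamples (with replacement), compute the statistic on each
    stats = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Example: 95% CI for the mean of a small (made-up) expression-level sample
sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
low, high = bootstrap_ci(sample, lambda xs: sum(xs) / len(xs))
```

Because the statistic is passed in as a function, the same routine can estimate uncertainty for medians, correlations, or any other summary without new derivations.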

CADD

Combined Annotation Dependent Depletion; a score that integrates multiple annotations to predict the deleteriousness of genetic variants in a genome-wide context.

Case-Control Study

An observational design comparing individuals with a specific outcome (cases) to those without it (controls) to identify prior exposures.

Causal Inference

The process of drawing conclusions about a causal relationship between variables, often through experimental or quasi-experimental designs.

Censoring

A condition in survival analysis where the full event time is not observed for all subjects. Common in clinical and longitudinal genomic studies.

Centimorgan

A unit of genetic linkage representing a 1% chance that a marker at one genetic locus will be separated from a marker at another locus due to crossing over in a single generation.

Cohort Study

An observational design where a group is followed over time to assess how exposures affect outcomes.

Confounding

A third variable that influences both the exposure and the outcome, potentially biasing the observed relationship and threatening causal inference.

Construct Validity

The extent to which a test or tool actually measures the concept it intends to measure.

Control Group

A group in a study that does not receive the experimental treatment and serves as a benchmark for comparison.

Cross-Sectional Study

A snapshot study that observes a population at a single point in time, often used to assess prevalence.

Design

The structured logic and architecture of how hypotheses are tested. Good design minimizes bias, controls variation, and clarifies causal relationships.

Differential Expression

A computational method used to detect genes with statistically significant changes in expression between conditions (e.g., disease vs. control).

Double-Blind

A study design where both participants and investigators are unaware of treatment assignments to reduce expectancy effects and bias.

Effect Size

A quantitative measure of the magnitude of a phenomenon. Important for interpreting the practical significance of results.
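One common effect-size measure is Cohen's d, the standardized mean difference between two groups. A short Python sketch (group values are invented for illustration):

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

treated = [5.1, 5.4, 4.9, 5.6, 5.2]
control = [4.2, 4.5, 4.1, 4.4, 4.3]
d = cohens_d(treated, control)
```

Unlike a p-value, d stays on an interpretable scale (standard deviations), which is why it is preferred for judging practical significance.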

Experimental

A study design where participants or units are randomly assigned to different conditions or treatments to test causal effects.

External Validity

The degree to which study findings generalize beyond the specific sample, setting, or time in which the study was conducted.

FAIR Metadata

Metadata adhering to the principles of Findability, Accessibility, Interoperability, and Reusability, supporting data sharing and reproducibility.

False Discovery Rate (FDR)

The expected proportion of false positives among all significant results. Frequently used in genomics to adjust for multiple hypothesis testing.
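The standard way to control FDR is the Benjamini-Hochberg step-up procedure. A self-contained Python sketch (the p-values are made up):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg: indices of tests rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold rank*alpha/m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    # Reject the k smallest p-values
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
hits = benjamini_hochberg(pvals)  # indices of significant tests
```

Note that the thresholds scale with rank, so BH is less conservative than a Bonferroni correction while still bounding the expected false-positive proportion.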

Feature Selection

The process of selecting a subset of relevant variables (genes, proteins, etc.) for model building in high-dimensional biological data.

Generalizability

The extent to which results from a sample or model apply to other populations, settings, or times.

Gene Regulatory Network

A system of transcription factors, genes, and regulatory interactions that governs gene expression patterns. Often modeled as directed graphs to understand cellular responses and disease mechanisms.

GPU

Graphics Processing Unit; a parallel processing hardware component originally developed for rendering images, now widely used to accelerate computation in machine learning, graph analytics, and bioinformatics workflows.

Graph Database

A database structured as a graph where data entities are nodes and relationships are edges. Supports efficient querying of connected data. Examples include Neo4j and Amazon Neptune.

Graph Theory

The mathematical study of graphs as representations of pairwise relationships between objects. Key concepts include centrality, clustering, and motifs.
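As a concrete example of one of these concepts, degree centrality scores each node by the fraction of other nodes it touches. A minimal Python sketch over a toy, hypothetical interaction network:

```python
def degree_centrality(edges):
    """Degree centrality: a node's edge count divided by (n - 1)."""
    nodes = {v for e in edges for v in e}
    deg = {v: 0 for v in nodes}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    n = len(nodes)
    return {v: d / (n - 1) for v, d in deg.items()}

# Hypothetical protein-interaction edges; "hub" touches every other node
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
cent = degree_centrality(edges)
```

In biological networks, high-centrality "hub" genes or proteins are often prioritized as candidates for follow-up.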

Heilmeier Catechism

A set of vetting questions developed by George Heilmeier to evaluate the feasibility, impact, and clarity of research proposals. Commonly used by funding agencies (e.g., DARPA) to assess whether a project is well-posed, high-impact, and realistic.

Hetionet

A heterogeneous biomedical knowledge graph integrating entities such as genes, diseases, drugs, and pathways, used to infer mechanistic links and support hypothesis generation.

Hidden Confounder

An unobserved variable that correlates with both predictor and outcome, leading to biased associations. Addressed using surrogate variable analysis or latent factor models.

Hypothesis

A specific, testable statement about the relationship between variables. For example: "Modifier variants cluster in immune regulatory genes."

Internal Validity

The degree to which observed effects can be attributed to the experimental variable rather than confounding factors or methodological artifacts.

KGQA

Knowledge Graph Question Answering; an approach that enables querying structured biological knowledge by mapping natural language questions to graph traversals or SPARQL queries.

Knockout Model

A genetic experimental design where a gene is intentionally disrupted or deleted to study its function or role in disease.

LangChain

A software framework that connects large language models to external tools, databases, and APIs for composable reasoning or document-grounded tasks.

Latent Variable

A variable that is not directly observed but is inferred from other variables (e.g., unobserved batch effects in RNA-seq data).

Linkage Analysis

A statistical method used to map genetic loci by examining the co-segregation of markers with traits in families. Often expressed in centimorgans and useful for identifying candidate regions in rare disease studies.

Matching

A method used in observational or quasi-experimental studies to pair subjects in treatment and control groups based on shared characteristics to reduce bias.
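A simple illustration is greedy 1:1 nearest-neighbor matching on a single covariate. The ages below are invented, and real studies typically match on propensity scores over many covariates:

```python
def match_nearest(treated, controls):
    """Greedy 1:1 nearest-neighbor matching on one covariate (e.g. age)."""
    available = list(controls)
    pairs = []
    for t in treated:
        c = min(available, key=lambda x: abs(x - t))
        available.remove(c)  # each control is used at most once
        pairs.append((t, c))
    return pairs

treated_ages = [34, 50, 61]
control_ages = [30, 36, 49, 58, 70]
pairs = match_nearest(treated_ages, control_ages)
```

Matching makes the compared groups similar on the chosen characteristics, but only on those that were measured; unmeasured confounders remain a threat.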

Mermaid.js

A JavaScript library that allows users to generate diagrams and flowcharts from plain text, commonly used to represent workflows, graphs, and networks visually.
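As an illustrative fragment, a small regulatory loop (node names hypothetical) can be written as plain text that Mermaid renders into a diagram:

```mermaid
graph LR
    TF[Transcription factor] -->|activates| G[Target gene]
    G --> P[Protein]
    P -.->|feedback| TF
```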

Meta-Analysis

A statistical method for combining results from multiple studies to derive a more precise estimate of effect size or significance.
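The simplest pooling rule is the inverse-variance (fixed-effect) estimate, where precise studies get more weight. A Python sketch with three hypothetical studies:

```python
def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted (fixed-effect) pooled estimate and variance."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_var = 1 / sum(weights)
    return pooled, pooled_var

# Three hypothetical studies: log odds ratios and their variances
effects = [0.30, 0.10, 0.20]
variances = [0.05, 0.05, 0.10]
pooled, pooled_var = fixed_effect_meta(effects, variances)
```

The pooled variance is smaller than any single study's variance, which is exactly the "more precise estimate" the definition refers to; random-effects models extend this when studies are heterogeneous.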

Natural Experiment

A naturally occurring situation that mimics the structure of an experiment, such as a policy change or environmental event, allowing for comparative analysis.

Network Inference

The computational process of reconstructing biological networks (e.g., gene-gene, protein-protein, regulatory) from high-throughput data using statistical or machine learning algorithms.

Observational

A non-interventional study design that documents relationships among variables without manipulating them. Common in epidemiology, social science, and public health.

Overfitting

A modeling error where a statistical model captures noise instead of the underlying signal, reducing its generalizability to new data.

PANDA

Passing Attributes between Networks for Data Assimilation; a computational method for integrating gene expression, TF binding, and protein interaction data to infer gene regulatory networks.

Power

The probability that a study will detect a true effect when it exists. Depends on effect size, sample size, and significance threshold.

Project Timeline

A structured outline of the phases and estimated durations of the research process, including proposal, data collection, analysis, and writing.

Proposal Abstract

A short summary of the planned research that includes background, aims, and significance. Required in most graduate and funding proposals.

Protégé

An open-source ontology editor and knowledge modeling tool used to build and visualize OWL and RDF graphs. Frequently used for biomedical ontologies.

Pseudo-Autosomal

Refers to regions on the sex chromosomes (e.g., PAR1 and PAR2) where homologous recombination occurs between X and Y during meiosis. These regions escape X-inactivation and require special handling in genomic analyses.

Pseudoreplication

The error of treating non-independent observations as independent, leading to inflated sample sizes and false positives.

Quasi-Experimental

A study design that compares groups with an intervention but lacks random assignment. May include pre-post comparisons or matched groups.

Random Assignment

A method used in experiments to allocate units to conditions purely by chance, helping to control for bias and equate groups at baseline.
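A minimal Python sketch of randomization: shuffle the units with a fixed seed, then deal them into groups so every unit has an equal chance of each assignment (subject labels are made up):

```python
import random

def randomize(units, n_groups=2, seed=42):
    """Shuffle units and deal them round-robin into equal-chance groups."""
    rng = random.Random(seed)
    shuffled = units[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

subjects = [f"S{i}" for i in range(1, 9)]
treatment, control = randomize(subjects)
```

Recording the seed keeps the allocation reproducible and auditable, which matters for pre-registered designs.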

Random Forest

A machine learning method useful for classification and regression in biological datasets, based on ensembles of decision trees.

Random Sampling

A method of selecting a subset of a population by chance, giving each member a known probability of inclusion so that findings can be generalized to the broader group.

RENET2

A deep learning-based tool for extracting gene-disease associations from biomedical literature using named entity recognition and relation classification.

RDF

Resource Description Framework; a data model for representing information in graphs using triples (subject–predicate–object), forming the basis of the semantic web.
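The triple model can be sketched in a few lines of Python: store (subject, predicate, object) tuples and match them against patterns with wildcards, which is the essence of what SPARQL basic graph patterns do. The identifiers below are made up (real RDF uses full IRIs):

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching an (s, p, o) pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Toy graph with hypothetical identifiers
triples = [
    ("ex:BRCA1", "ex:associatedWith", "ex:BreastCancer"),
    ("ex:BRCA1", "ex:locatedOn", "ex:Chr17"),
    ("ex:TP53", "ex:associatedWith", "ex:BreastCancer"),
]
# "Which genes are associated with breast cancer?"
genes = [s for s, _, _ in match(triples, p="ex:associatedWith", o="ex:BreastCancer")]
```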

Regression Discontinuity Design

A quasi-experimental approach where participants are assigned to groups based on a cutoff score on a pre-intervention variable; effects are estimated by comparing observations just above and below the cutoff.

Reliability

The degree to which a measure or study produces consistent and repeatable results.

Research Aims

Concise statements of specific goals the project intends to achieve, often broken into Aim 1, Aim 2, etc., especially in NIH-style grants.

Research Design

The overall strategy for integrating different components of the study—experimental logic, sampling, analysis—to test hypotheses effectively.

RNA-seq

A sequencing-based method to quantify gene expression, frequently used in transcriptome analysis, disease mechanism exploration, and biomarker discovery.

Selection Bias

A distortion in the estimation of effect due to systematic differences in the characteristics of those selected for study groups.

Sensitivity Analysis

An approach to test the robustness of results to changes in model assumptions or data input. Often used to assess how parameter choices affect results.

SHACL

Shapes Constraint Language; a W3C standard used to define validation rules and constraints over RDF graphs, ensuring consistency in structured data.

Single-Blind

A design in which participants are unaware of their group assignment, but investigators are not blinded.

Snakemake

A workflow management system used in bioinformatics to define reproducible data processing pipelines using a rule-based format and dependency tracking.

SPARQL

SPARQL Protocol and RDF Query Language; a query language for retrieving and manipulating data stored in RDF format. Used in semantic web and knowledge graph querying.

Statistical Significance

A result judged unlikely to have occurred by chance alone under the null hypothesis, according to a predefined threshold (typically p < 0.05). Significance alone does not imply a large or practically important effect.
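One assumption-light way to assess significance is a permutation test: repeatedly relabel the data and ask how often a difference as large as the observed one arises by chance. A Python sketch with invented measurements:

```python
import random

def permutation_p(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    na = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of group membership
        diff = abs(sum(pooled[:na]) / na -
                   sum(pooled[na:]) / (len(pooled) - na))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

p = permutation_p([8.1, 8.4, 8.0, 8.6], [5.0, 5.3, 4.9, 5.2])
```

The returned p is the fraction of relabelings at least as extreme as the observed split, so no normality assumption is needed.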

Survival Analysis

A class of statistical methods for analyzing time-to-event data. Widely used in genomics, especially for cancer prognosis.
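The workhorse here is the Kaplan-Meier estimator, which steps the survival probability down at each observed event while censored subjects simply leave the risk set. A compact Python sketch with made-up follow-up times:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve; events[i] is 1 for an event, 0 if censored.
    At tied times, events are processed before censorings (stable sort)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []  # (time, survival probability) at each event time
    for i in order:
        if events[i]:
            surv *= (at_risk - 1) / at_risk
            curve.append((times[i], surv))
        at_risk -= 1
    return curve

# Follow-up times in months; 0 marks a censored subject
times = [3, 5, 5, 8, 12]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored subjects still contribute to the denominator while they are under observation, which is how the estimator uses incomplete event times rather than discarding them.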

Systems Biology

An integrative field that models complex interactions between biological entities (genes, proteins, metabolites) across scales.

Time Series Design

A study design involving repeated observations over time—before, during, and after an intervention or event. It is often used to detect trends, interruptions, or delayed effects. Time series designs can help distinguish causal effects from background noise by analyzing change patterns across multiple time points, especially when randomization is not feasible.

Translational Research

Research aimed at moving discoveries from bench to bedside, such as identifying candidate biomarkers or repurposing drugs using computational approaches.

Trust Graph

A graph representation used to document the provenance, reliability, and decision traceability of computational outputs, especially in explainable AI systems.

Validation Set

A separate subset of data used to test model generalization after training. Important in computational modeling to avoid overfitting.

Validity

The degree to which a method or result accurately reflects what it is intended to measure or infer.

WGCNA

Weighted Gene Co-expression Network Analysis; a method for identifying modules of highly correlated genes and relating them to sample traits or phenotypes.