# Anticipated Questions
Use this document to prepare for high-stakes, committee-level questions. These prompts assume a sophisticated audience: focus on rigor, defensibility, and awareness of limitations and alternatives.
## Study Design
- Can you defend how this design supports causal inference given your use of randomization, matching, or longitudinal tracking?
- What latent confounders, such as unmeasured covariates or differential loss to follow-up, might still bias your comparisons?
- If your intervention assignment were subject to clinical discretion or batch effects, how would that alter interpretation?
- Does your design account for informative censoring or data missing not at random (MNAR) in your patient- or sample-level metadata? (See the missingness sketch after this list.)
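MNAR itself cannot be verified from the observed data alone, but you can show whether missingness depends on observed covariates, which rules out MCAR. A minimal sketch on simulated data, assuming hypothetical `age`, `stage`, and `followup` columns:

```python
# A sketch of a missingness diagnostic; all columns are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "stage": rng.integers(1, 5, n),
})
# Simulate follow-up values that drop out more often at higher stage.
p_miss = 1 / (1 + np.exp(-(df["stage"].to_numpy() - 3)))
followup = rng.normal(0, 1, n)
followup[rng.random(n) < p_miss] = np.nan
df["followup"] = followup

# Model P(missing | observed covariates). A significant coefficient is
# evidence against MCAR; a null result still cannot rule out MNAR.
y = df["followup"].isna().astype(int)
X = sm.add_constant(df[["age", "stage"]])
print(sm.Logit(y, X).fit(disp=0).summary())
```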
## Hypothesis and Scope
- Is your central hypothesis grounded in a defined regulatory mechanism, pathway perturbation, or transcriptional program?
- Could your hypothesis be interpreted as post hoc pattern mining rather than pre-specified causal modeling?
- Can you pose a falsifiable alternative hypothesis consistent with current omics literature or known network topology?
- Are there biologically meaningful null models (e.g., randomized gene sets, permuted genotype-phenotype links) that you have not tested? (See the sketch after this list.)
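One concrete null model for the last question: score size-matched random gene sets drawn from the same universe and compare against the observed set. A minimal sketch on simulated scores; the gene names and candidate set are placeholders:

```python
# A size-matched random-gene-set null: compare the observed mean score
# of a candidate set against scores of random sets of the same size.
import numpy as np

rng = np.random.default_rng(0)
universe = np.array([f"g{i}" for i in range(5000)])
scores = dict(zip(universe, rng.normal(0, 1, universe.size)))
candidate = universe[:50]  # hypothetical pathway members

def set_score(genes):
    return np.mean([scores[g] for g in genes])

observed = set_score(candidate)
null = np.array([
    set_score(rng.choice(universe, size=candidate.size, replace=False))
    for _ in range(1000)
])
# One-sided empirical p-value with the +1 correction.
p = (1 + np.sum(null >= observed)) / (1 + null.size)
print(f"observed={observed:.3f}, empirical p={p:.4f}")
```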
## Sampling and Generalizability
- How might genetic ancestry, tissue source, or environmental exposure influence generalizability to external datasets?
- Could survival bias, inclusion criteria tied to diagnosis or treatment, or data-completeness thresholds bias your effect estimates?
- What cross-validation schemes or out-of-sample predictions are you using to assess portability across biospecimens or cohorts? (A leave-one-cohort-out sketch follows.)
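One defensible portability check is leave-one-cohort-out cross-validation, so no cohort contributes to both training and evaluation. A minimal sketch with scikit-learn's `GroupKFold` on simulated features, labels, and cohort IDs:

```python
# Leave-one-cohort-out validation; all data here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 2, 300)
cohort = rng.integers(0, 5, 300)  # e.g., 5 contributing studies

model = LogisticRegression(max_iter=1000)
cv = GroupKFold(n_splits=5)  # each fold holds out whole cohorts
scores = cross_val_score(model, X, y, groups=cohort, cv=cv,
                         scoring="roc_auc")
print("per-held-out-cohort AUC:", np.round(scores, 3))
```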
## Tool and Pipeline Justification
- What empirical benchmarks (e.g., DREAM challenges, gold-standard gene lists, synthetic spike-ins) validate the accuracy of your tool? (A minimal benchmark sketch follows this list.)
- Have you tested pipeline robustness under adversarial inputs, class imbalance, or corrupted labels?
- What distributional, sparsity, or independence assumptions are required by your algorithm, and are they satisfied in your omics matrix?
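For the benchmark question above, one simple defensible summary is precision and recall of your tool's calls against a gold-standard list. A minimal sketch; the gene symbols are placeholders, not a real truth set:

```python
# Precision/recall of called genes against a gold-standard list.
truth = {"TP53", "BRCA1", "MYC", "EGFR", "KRAS"}   # placeholder truth set
called = {"TP53", "MYC", "KRAS", "GAPDH", "ACTB"}  # placeholder tool output

tp = len(truth & called)
precision = tp / len(called)
recall = tp / len(truth)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```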
## Model Assumptions and Sensitivity
- How are you assessing identifiability in the presence of multicollinearity, dropout, or latent variable structure?
- Have you run diagnostic checks (e.g., residual plots, permutation tests, bootstrapped estimates) to evaluate model calibration and variance? (A permutation-test sketch follows this list.)
- If model convergence fails or the optimization landscape is degenerate, how will you adapt your inference strategy?
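For the permutation-test question above, scikit-learn's `permutation_test_score` directly asks whether the model beats a label-permuted null. A minimal sketch on simulated data:

```python
# Label-permutation test of cross-validated AUC; data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
w = rng.normal(size=30)
y = (X @ w + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
score, perm_scores, pval = permutation_test_score(
    model, X, y, cv=cv, n_permutations=200,
    scoring="roc_auc", random_state=0,
)
print(f"AUC={score:.3f}, permutation p={pval:.4f}")
print(f"null mean AUC={perm_scores.mean():.3f}")
```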
## Ethical and Practical Considerations
- How does your sampling strategy risk underrepresenting ancestrally diverse, low-resource, or clinically complex populations?
- Have you stratified performance metrics (e.g., AUC, F1, calibration error) by subgroup to assess algorithmic fairness? (A stratified-metrics sketch follows this list.)
- What ethical review is required if your findings suggest prognostic or actionable results in patient subgroups?
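One way to operationalize the stratification question: compute AUC and the Brier score (a simple calibration-adjacent summary) per subgroup. A minimal sketch; the predictions and group labels are simulated placeholders:

```python
# Subgroup-stratified AUC and Brier score; all values are simulated.
import numpy as np
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": rng.integers(0, 2, 600),
    "p": rng.uniform(0, 1, 600),
    "group": rng.choice(["A", "B", "C"], 600),
})
for g, sub in df.groupby("group"):
    auc = roc_auc_score(sub["y"], sub["p"])
    brier = brier_score_loss(sub["y"], sub["p"])
    print(f"group={g}: n={len(sub)} AUC={auc:.3f} Brier={brier:.3f}")
```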
## Edge Case Scenarios
- How will you interpret a gene module or signature that achieves high predictive accuracy but has no known biological annotation?
- What if your key differential expression result replicates in only a subset of validation tissues or datasets?
- Can you resolve conflicting evidence when one modality (e.g., expression) supports your hypothesis but another (e.g., methylation or proteomics) does not?
- If your transcription factor inference identifies multiple regulators with overlapping binding sites, how will you prioritize them? (An enrichment-based ranking sketch follows.)
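One standard starting point for the prioritization question is to rank candidate regulators by hypergeometric enrichment of their target sets within the differentially expressed genes. A minimal sketch; every TF and gene identifier here is hypothetical:

```python
# Rank TFs by hypergeometric enrichment of target overlap with DE genes.
from scipy.stats import hypergeom

universe_size = 20000
de_genes = set(range(300))  # indices of differentially expressed genes
tf_targets = {
    "TF_A": set(range(0, 200)),      # heavy overlap with DE set
    "TF_B": set(range(150, 400)),    # partial overlap
    "TF_C": set(range(5000, 5200)),  # no overlap
}

ranked = []
for tf, targets in tf_targets.items():
    k = len(de_genes & targets)
    # P(X >= k) under a hypergeometric null over the gene universe.
    p = hypergeom.sf(k - 1, universe_size, len(targets), len(de_genes))
    ranked.append((p, tf, k))
for p, tf, k in sorted(ranked):
    print(f"{tf}: overlap={k}, p={p:.2e}")
```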
## Broader Impact and Rigor
- Which simplifying assumptions in variant interpretation, pathway enrichment, or transcription factor inference does your work relax or test?
- What theoretical advance (e.g., improved causal graph structure, multi-omic fusion, dimensionality reduction) does your method provide?
- How does your work contribute to reproducibility, scalability, or interpretability in computational biology pipelines?
- Can you articulate how your findings advance mechanistic understanding, clinical translation, or therapeutic discovery in your field?
## Backup and Contingency Plans
- If a required omics dataset is deprecated, missing metadata, or withdrawn from dbGaP/GEO, what secondary datasets can you use?
- Which preprocessing decisions (e.g., normalization, gene filtering, batch correction) most strongly influence downstream results? (A sensitivity sweep is sketched after this list.)
- How will you interpret your results if your primary hypothesis is only weakly supported but exploratory findings are strong?
- What alternative analyses (e.g., unsupervised clustering, pathway enrichment) can you pivot to if your primary model fails to converge?
- If your primary tool fails to run on a critical dataset, what are your fallback options for analysis?
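For the preprocessing-sensitivity question, one concrete approach is to re-run a cheap downstream summary under each preprocessing variant and quantify agreement. A minimal sketch on simulated counts, comparing top-variable-gene sets by Jaccard similarity:

```python
# Preprocessing sensitivity sweep on simulated count data: compare the
# top-k most variable genes under three normalization variants.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(100, 2000)).astype(float)  # samples x genes

def topk_variable(mat, k=100):
    return set(np.argsort(mat.var(axis=0))[-k:])

variants = {
    "raw": counts,
    "log1p": np.log1p(counts),
    "cpm_log": np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e6),
}
sets = {name: topk_variable(m) for name, m in variants.items()}
names = list(sets)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = sets[names[i]], sets[names[j]]
        jac = len(a & b) / len(a | b)
        print(f"{names[i]} vs {names[j]}: Jaccard={jac:.2f}")
```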