Anticipated Questions

Use this document to prepare for high-stakes, committee-level questions. These prompts assume a sophisticated audience.

Focus on rigor, defensibility, and awareness of limitations or alternatives.

Study Design

  • Can you defend how this design supports causal inference given your use of randomization, matching, or longitudinal tracking?
  • What latent confounders, such as unmeasured covariates or differential loss to follow-up, might still bias your comparisons?
  • If your intervention assignment were subject to clinical discretion or batch effects, how would that alter interpretation?
  • Does your design account for informative censoring or data missing not at random (MNAR) in your patient- or sample-level metadata?

Hypothesis and Scope

  • Is your central hypothesis grounded in a defined regulatory mechanism, pathway perturbation, or transcriptional program?
  • Could your hypothesis be interpreted as post hoc pattern mining rather than pre-specified causal modeling?
  • Can you pose a falsifiable alternative hypothesis consistent with current omics literature or known network topology?
  • Are there biologically meaningful null models (e.g., randomized gene sets, permuted genotype-phenotype links) that you have not tested?
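
One way to answer the null-model question concretely is a randomized gene-set test. The sketch below is a minimal illustration, not a prescribed method: the function name `gene_set_permutation_pvalue`, the per-gene `scores` dictionary, and the choice of the mean as the set-level statistic are all assumptions made for the example.

```python
import random
import statistics

def gene_set_permutation_pvalue(scores, gene_set, n_perm=2000, seed=0):
    """Compare the mean score of a gene set against random sets of equal size.

    scores   : dict mapping gene name -> per-gene statistic (hypothetical input)
    gene_set : iterable of gene names forming the set under test
    Returns (observed_mean, empirical one-sided p-value).
    """
    rng = random.Random(seed)
    genes = list(scores)
    # Keep only set members that were actually measured.
    members = [g for g in gene_set if g in scores]
    observed = statistics.fmean(scores[g] for g in members)
    k = len(members)

    hits = 0
    for _ in range(n_perm):
        # Null model: a random gene set of the same size, drawn without replacement.
        null_set = rng.sample(genes, k)
        if statistics.fmean(scores[g] for g in null_set) >= observed:
            hits += 1

    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 4 genes carry signal, 16 do not.
toy_scores = {f"G{i}": (3.0 if i < 4 else 0.0) for i in range(20)}
observed, p_value = gene_set_permutation_pvalue(toy_scores, ["G0", "G1", "G2", "G3"])
print(observed, p_value)
```

Drawing random sets uniformly ignores gene length, expression level, and correlation structure; a committee may reasonably ask for a null that matches those properties.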

Sampling and Generalizability

  • How might genetic ancestry, tissue source, or environmental exposure influence generalizability to external datasets?
  • Could survivorship bias, inclusion based on diagnostic or treatment criteria, or data completeness thresholds affect effect estimates?
  • What cross-validation schemes or out-of-sample predictions are you using to assess portability across biospecimens or cohorts?
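
For the portability question, one common scheme is to hold out an entire cohort at a time rather than random samples. The following is a rough sketch under that assumption; `leave_one_cohort_out` and the cohort labels are hypothetical names introduced here, and model fitting is left out.

```python
from collections import defaultdict

def leave_one_cohort_out(cohort_labels):
    """Yield (held_out_cohort, train_indices, test_indices), one split per cohort.

    cohort_labels : list with one cohort identifier per sample (hypothetical input)
    """
    by_cohort = defaultdict(list)
    for index, cohort in enumerate(cohort_labels):
        by_cohort[cohort].append(index)

    for held_out in sorted(by_cohort):
        test_idx = by_cohort[held_out]
        # Train on every sample that does not belong to the held-out cohort.
        train_idx = sorted(
            i for cohort, indices in by_cohort.items()
            if cohort != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


labels = ["A", "A", "B", "C", "B"]
for held_out, train_idx, test_idx in leave_one_cohort_out(labels):
    print(held_out, train_idx, test_idx)
```

Because no sample from the held-out cohort is seen during training, performance on each split is a closer proxy for transfer to an external dataset than a random k-fold estimate.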

Tool and Pipeline Justification

  • What empirical benchmarks (e.g., DREAM challenges, gold-standard gene lists, synthetic spike-ins) validate the accuracy of your tool?
  • Have you tested pipeline robustness under adversarial inputs, class imbalance, or corrupted labels?
  • What distributional, sparsity, or independence assumptions are required by your algorithm, and are they satisfied in your omics matrix?

Model Assumptions and Sensitivity

  • How are you assessing identifiability in the presence of multicollinearity, dropout, or latent variable structure?
  • Have you run diagnostic checks (e.g., residual plots, permutation tests, bootstrapped estimates) to evaluate model calibration and variance?
  • If model convergence fails or the optimization landscape is degenerate, how will you adapt your inference strategy?
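
The bootstrapped estimates mentioned above can be illustrated with a percentile bootstrap. This is a minimal sketch, assuming independent observations and a simple statistic; `bootstrap_ci` and its parameters are illustrative names, and the example values are made up.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a statistic of a 1-D sample.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot))
    return stat(values), estimates[lo_idx], estimates[hi_idx]


# Made-up effect sizes, for illustration only.
effects = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]
point, lower, upper = bootstrap_ci(effects)
print(point, lower, upper)
```

Resampling individual values is only appropriate when samples are exchangeable; with repeated measures or batch structure, resampling whole patients or batches is the safer default.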

Ethical and Practical Considerations

  • How does your sampling strategy risk underrepresenting ancestrally diverse, low-resource, or clinically complex populations?
  • Have you stratified performance metrics (e.g., AUC, F1, calibration error) by subgroup to assess algorithmic fairness?
  • What ethical review is required if your findings suggest prognostic or actionable results in patient subgroups?
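
Stratifying a metric by subgroup can be sketched as follows. The example assumes binary labels coded 0/1 and computes AUC as the fraction of positive-negative pairs ranked correctly; `auc`, `auc_by_subgroup`, and the toy data are hypothetical and chosen only to show how a pooled metric can hide a weak subgroup.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_subgroup(labels, scores, groups):
    """Compute AUC separately within each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in buckets.items()}


labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.5]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(auc(labels, scores))                      # pooled: 0.8125
print(auc_by_subgroup(labels, scores, groups))  # {'a': 1.0, 'b': 0.25}
```

The pairwise computation is quadratic in subgroup size, which is fine for a sketch but would be replaced by a rank-based calculation on large cohorts.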

Edge Case Scenarios

  • How will you interpret a gene module or signature that achieves high predictive accuracy but has no known biological annotation?
  • What if your key differential expression result replicates in only a subset of validation tissues or datasets?
  • Can you resolve conflicting evidence when one modality (e.g., expression) supports your hypothesis but another (e.g., methylation or proteomics) does not?
  • If your transcription factor inference identifies multiple regulators with overlapping binding sites, how will you prioritize them?

Broader Impact and Rigor

  • Which simplifying assumptions in variant interpretation, pathway enrichment, or transcription factor inference does your work relax or test?
  • What theoretical advance (e.g., improved causal graph structure, multi-omic fusion, dimensionality reduction) does your method provide?
  • How does your work contribute to reproducibility, scalability, or interpretability in computational biology pipelines?
  • Can you articulate how your findings advance mechanistic understanding, clinical translation, or therapeutic discovery in your field?

Backup and Contingency Plans

  • If a required omics dataset is deprecated, lacks metadata, or is withdrawn from dbGaP/GEO, what secondary datasets can you use?
  • To which preprocessing decisions (e.g., normalization, gene filtering, batch correction) are your downstream outputs most sensitive?
  • How will you interpret your results if your primary hypothesis is only weakly supported but exploratory findings are strong?
  • What alternative analyses (e.g., unsupervised clustering, pathway enrichment) can you pivot to if your primary model fails to converge?
  • If your primary tool fails to run on a critical dataset, what are your fallback options for analysis?