Ultra-rare diseases are conditions with a prevalence of fewer than 1 per 50,000 individuals. This threshold is commonly used in European policy discussions and health technology assessment contexts to distinguish ultra-rare conditions from broader rare disease categories (e.g., fewer than five per 10,000 in the European Union, or fewer than 200,000 affected individuals in the United States). While definitions vary slightly across jurisdictions, the defining feature of ultra-rare diseases is the extremely small patient population, often numbering in the dozens or low hundreds nationally, and sometimes fewer than 1,000 patients worldwide. Examples of such low-prevalence pediatric ultra-rare diseases include neuronal ceroid lipofuscinosis type 2 disease, a rapidly progressive neurodegenerative disorder of childhood; aromatic L-amino acid decarboxylase deficiency, a severe neurotransmitter synthesis disorder; and fibrodysplasia ossificans progressiva, a condition characterized by progressive heterotopic ossification. Each of these conditions has a prevalence well below the 1 per 50,000 threshold and exemplifies the clinical severity, early onset, and profound unmet need that typify ultra-rare pediatric disorders.
Pediatric ultra-rare diseases pose a persistent challenge to clinical research, exposing inherent limitations in conventional trial design and data analysis strategies. Traditional randomized controlled trials, which rely on large sample sizes and statistical power derived from group comparisons, are ill-suited to generating actionable evidence in these contexts. Investigators frequently resort to open-label case series, historical controls, or other approaches that compromise interpretability and introduce bias. The result is an evidence gap that delays therapeutic advances and leaves families and clinicians without robust guidance.
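To make the mismatch between the prevalence threshold and conventional sample size requirements concrete, the following minimal sketch compares the largest patient pool implied by a prevalence of 1 per 50,000 with the enrollment needed for a two-arm comparison of means. The population of 10 million, the standardized effect size of 0.5, the 5% two-sided significance level, and the 80% power are illustrative assumptions, not values taken from any specific trial, and the sample size formula is the usual normal approximation rather than a full design calculation.

```python
import math
from statistics import NormalDist


def expected_patients(population, cases, per):
    """Upper bound on prevalent patients for a prevalence of `cases` per `per` people."""
    return population * cases // per


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-arm comparison of means.

    `effect_size` is a standardized mean difference (difference / SD).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


population = 10_000_000                              # hypothetical country (assumption)
pool = expected_patients(population, 1, 50_000)      # ultra-rare threshold
needed = 2 * n_per_arm(0.5)                          # both arms, assumed effect size 0.5

print(f"Prevalent patients (upper bound): {pool}")
print(f"Participants needed for the two-arm trial: {needed}")
print(f"Share of the national pool required: {needed / pool:.0%}")
```

Under these assumptions the ceiling is 200 prevalent patients, while the trial would need 126 participants, that is, well over half of every affected individual in the country before eligibility criteria, consent, or geography are considered.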
Adoption of adaptive methodologies and individualized inference frameworks that can extract maximal value from limited data remains limited. Standard statistical paradigms emphasize population averages, obscuring meaningful within-subject responses that could signal treatment efficacy. Moreover, global regulatory uncertainty and inconsistent analytic standards for single-patient or small-cohort studies further hinder progress. To address these gaps, a conceptual case is made here for N-of-1 trial designs as a complement to standard randomized controlled trial designs.
Pivotal trials in rare diseases face a structural tension between methodological rigor and feasibility. Extremely small, geographically dispersed patient populations; phenotypic heterogeneity; ethical concerns about prolonged placebo exposure; and urgent unmet need all challenge conventional evidentiary standards. Within this context, the comparative value of standard randomized controlled trials (RCTs) and N-of-1 designs for conclusively detecting efficacy signals warrants careful examination. While both approaches seek to minimize bias and support causal inference, they differ fundamentally in their unit of analysis, inferential scope, and operational demands. The standard parallel-group RCT remains the regulatory gold standard for establishing efficacy because of its capacity to balance known and unknown confounders through randomization, preserve internal validity through blinding, and estimate population-average treatment effects. In rare diseases, however, the assumptions underpinning the RCT model are strained. Recruitment may be protracted, reducing timeliness and statistical power. Small sample sizes limit the precision of effect estimates and increase susceptibility to baseline imbalances despite randomization.
Furthermore, when disease trajectories are severe or progressive, assigning patients to placebo or subtherapeutic control arms can raise ethical concerns, particularly in pediatric populations. Nevertheless, when feasible, an adequately powered RCT, potentially incorporating adaptive features, Bayesian borrowing, or external controls, provides the most straightforward pathway to regulatory acceptance. Agencies such as the U.S. Food and Drug Administration and the European Medicines Agency continue to regard randomized comparative evidence as the most robust basis for causal claims, particularly when effect sizes are modest or outcomes are subjective. By contrast, N-of-1 designs invert the analytic frame by treating the individual patient as the unit of experimentation. Typically structured as multiple crossover comparisons within a single patient, these trials can establish whether a treatment produces reproducible benefit under controlled conditions for that individual. In ultra-rare disorders, especially those with stable or fluctuating symptoms and short treatment washout periods, aggregated N-of-1 trials offer an appealing strategy to extract signal from limited populations. Methodologically, repeated randomization within a patient can control for time-varying confounders and enhance sensitivity to within-person treatment effects. When treatment effects are rapid, large, and reversible, N-of-1 approaches may detect efficacy signals with fewer participants than parallel-group RCTs. However, the inferential trade-offs are substantial. Classic N-of-1 trials are optimized for individualized treatment decisions rather than population-level inference. Generalizability depends on the ability to replicate effects across multiple individuals and to pool data using hierarchical or Bayesian models. 
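The within-patient logic described above can be illustrated with a minimal sketch of a single N-of-1 trial: treatment order is randomized within each block of two periods, and the within-block differences are analyzed with an exact sign-flip randomization test. The function names, the block structure, and the outcome values are hypothetical, and the test assumes no carryover between periods; it is one simple analysis among several that could be prespecified, not a recommended standard.

```python
import itertools
import random


def randomize_schedule(n_blocks, seed):
    """Randomize the order of A (active) and B (control) within each two-period block."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        block = ["A", "B"]
        rng.shuffle(block)
        schedule.append(tuple(block))
    return schedule


def paired_randomization_test(diffs):
    """Exact two-sided sign-flip test on within-block differences (active minus control).

    Under the null hypothesis of no treatment effect (and no carryover), each
    difference is equally likely to have had the opposite sign.
    """
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    n_extreme = 0
    n_total = 0
    for signs in itertools.product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:   # tolerance for floating-point ties
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


schedule = randomize_schedule(n_blocks=5, seed=2024)   # seed is arbitrary
# Hypothetical symptom-score differences (active minus control), one per block.
diffs = [-3.0, -2.5, -4.0, -1.5, -3.5]
print(schedule)
print(paired_randomization_test(diffs))
```

With five blocks, the smallest attainable two-sided p-value is 2/32 = 0.0625, which the example data reach because every block favors the active treatment. This illustrates why the number of crossover blocks per patient, and not only the number of patients, determines what a single N-of-1 trial can demonstrate.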
In progressive or curative contexts, such as gene replacement therapies where effects are irreversible, crossover and washout are impossible, rendering traditional N-of-1 designs inapplicable. Moreover, when outcomes are long-latency or survival-based, within-person alternation between treatment and control is neither practical nor ethical. Even when aggregation is feasible, heterogeneity of treatment response can complicate interpretation: a strong benefit in a subset of patients may be diluted or obscured if effects are inconsistent across individuals. From a statistical perspective, the distinction can be framed in terms of estimands. RCTs typically target the average treatment effect across a defined population, providing direct support for labeling and reimbursement decisions. N-of-1 trials estimate the individual treatment effect; aggregation across individuals can approximate a population-level effect but requires explicit modeling assumptions regarding exchangeability and variance components. In small samples, Bayesian frameworks may reduce uncertainty by incorporating prior information, but they introduce sensitivity to prior specification and potential regulatory scrutiny. Operational considerations also diverge. RCTs demand centralized infrastructure, broad eligibility criteria to achieve enrollment targets, and standardized outcome measures. N-of-1 designs require intensive patient engagement, frequent outcome assessment, and rigorous adherence to crossover schedules. Digital health technologies and remote monitoring have increased the feasibility of repeated within-patient measurement, potentially enhancing the practicality of N-of-1 aggregation in geographically dispersed rare disease populations. Nonetheless, the analytic complexity of combining multiple individualized experiments into a coherent evidentiary package can challenge both sponsors and regulators. 
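As an illustration of how individual treatment effects might be aggregated toward a population-level estimate, the sketch below pools per-patient effect estimates with a DerSimonian-Laird random-effects model. This is a simple frequentist stand-in for the hierarchical or Bayesian models mentioned above, chosen only because it makes the between-patient variance component explicit; the effect estimates and variances are hypothetical, and the function name is an assumption of this example.

```python
import math


def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling of per-patient effect estimates.

    Returns (pooled effect, standard error, between-patient variance tau^2).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two patients with matching variances")

    # Fixed-effect (inverse-variance) weights and mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the method-of-moments estimate of tau^2.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each within-patient variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return pooled, se, tau2


# Hypothetical within-patient effects (active minus control) and their variances.
effects = [-3.0, -0.5, -2.5, 0.2]
variances = [0.4, 0.5, 0.3, 0.6]
pooled, se, tau2 = pool_random_effects(effects, variances)
print(f"pooled effect = {pooled:.2f}, SE = {se:.2f}, tau^2 = {tau2:.2f}")
```

When the estimated between-patient variance is greater than zero, the pooled standard error widens relative to a fixed-effect analysis, which is the numerical counterpart of the point that heterogeneous responses can dilute or obscure a strong benefit in a subset of patients.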
Ethically, N-of-1 designs may offer advantages when placebo exposure is minimized or when each participant receives active therapy during portions of the study. Conversely, in rapidly progressive or life-threatening conditions, even temporary withdrawal of effective therapy may be unacceptable. In such contexts, a single-arm trial with natural history comparison may be more appropriate than either a traditional RCT or a crossover-based N-of-1 design. Ultimately, the capacity of either approach to conclusively detect efficacy signals depends on disease biology, expected effect size, reversibility of outcomes, and regulatory context. RCTs remain superior for establishing definitive population-level efficacy when feasible, particularly for therapies with modest or heterogeneous effects. Aggregated N-of-1 trials may be especially valuable when treatment effects are rapid, substantial, and reversible, and when patient numbers are too small to sustain adequately powered parallel-group comparisons. Rather than viewing the two designs as mutually exclusive, a hybrid evidentiary strategy may be optimal: early-phase N-of-1 experiments to characterize individual response patterns and refine endpoints, followed by streamlined randomized trials, or Bayesian meta-analytic aggregation of multiple N-of-1 studies to approximate population-level inference. In pivotal rare disease development, methodological pluralism anchored in rigorous causal reasoning is likely more productive than strict adherence to a single paradigm. The decisive question is not which design is categorically superior, but under what biological, ethical, and statistical conditions each design can most credibly and efficiently generate persuasive evidence of therapeutic benefit. Table 1 contrasts conventional RCTs with N-of-1 trial designs in the specific context of pivotal rare disease development. 
To bridge these gaps, the industry must pivot toward the standardization of N-of-1 trial designs integrated with digital biomarkers [1, 2]. By treating the individual patient as the entire study unit through multi-period crossover sequences, one can generate high-density, statistically rigorous evidence of efficacy without the need for large cohorts. The integration of digital biomarkers shifts assessment from a snapshot to a wider-angle view of the patient's health [2]. These tools provide continuous, objective measurements of many endpoints in the child's natural environment, substantially reducing the variability and biases induced by infrequent hospital visits. Standardization includes several components. First, protocol architecture must be standardized. This includes prespecified rules for treatment sequencing (e.g., number and duration of treatment and control periods), minimum washout intervals justified by pharmacokinetic and pharmacodynamic data, criteria for early stopping, and uniform crossover structures to permit cross-patient aggregation. Without harmonized design templates, pooled inference across multiple N-of-1 trials becomes analytically fragile. Second, endpoint definition and measurement frameworks require standardization. For digital biomarkers in particular, this includes clear operational definitions of primary and secondary endpoints; validation of device-derived measures against clinically meaningful outcomes; standardized sampling frequency; predefinition of clinically meaningful within-patient effect thresholds; and uniform data preprocessing pipelines (e.g., handling of missing data, smoothing algorithms, and artifact rejection). Analytical reproducibility depends heavily on these specifications being consistent across sites and patients. Third, statistical analysis plans must be prospectively standardized.
This includes prespecified models for within-patient treatment effect estimation; rules governing aggregation across patients (e.g., hierarchical Bayesian frameworks or mixed-effects models); predefined priors where Bayesian methods are used; multiplicity control when multiple endpoints are evaluated; and sensitivity analyses to assess carryover, period, and time trends. Without common analytic conventions, interpretability and comparability are compromised. Fourth, data infrastructure and interoperability standards are essential. Digital biomarker integration requires harmonized data formats, secure time-stamping, traceable version control of algorithms, and audit-ready data provenance documentation. Interoperability across devices and platforms should be ensured to avoid device-specific bias or proprietary algorithm opacity that would undermine reproducibility. Fifth, blinding and bias-mitigation procedures require explicit standardization. This includes standardized placebo or sham conditions where feasible, automated outcome capture to reduce observer bias, and predefined rules for participant and caregiver masking in pediatric contexts. Sixth, regulatory alignment should be formalized through early scientific advice procedures and template statistical analysis frameworks agreed in advance with agencies. Standardized reporting templates, analogous to CONSORT extensions but tailored to aggregated N-of-1 designs, would further enhance transparency and reproducibility. Finally, governance and quality assurance mechanisms must be defined, including independent data monitoring procedures adapted to small-sample repeated-measures designs and predefined criteria for evidentiary sufficiency when multiple N-of-1 trials are pooled. Standardizing this framework along the lines above would allow regulators to move beyond anecdotal evidence toward a harmonized, data-driven methodology for ultra-rare conditions. 
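One way to picture a harmonized design template is as a small, machine-checkable protocol record. The sketch below is purely illustrative: the class name, its fields, the endpoint label, and the rule that washout should span at least five elimination half-lives are assumptions made for this example (the last is a common pharmacokinetic rule of thumb, not a requirement stated by any agency), and a real template would also need to cover stopping rules, analysis plans, and masking procedures.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NOf1Protocol:
    """Hypothetical prespecified design template for one N-of-1 trial."""
    n_blocks: int             # number of treatment/control period pairs
    period_days: int          # duration of each treatment or control period
    washout_days: int         # washout between periods
    half_life_days: float     # elimination half-life from pharmacokinetic data
    primary_endpoint: str     # operational name of the primary endpoint
    samples_per_day: int      # digital biomarker sampling frequency

    def validate(self, min_half_lives: float = 5.0) -> List[str]:
        """Return a list of design problems; an empty list means none were found."""
        problems = []
        if self.n_blocks < 1:
            problems.append("at least one treatment/control block is required")
        if self.period_days <= 0:
            problems.append("period duration must be positive")
        # Assumed rule of thumb: washout covers `min_half_lives` half-lives.
        if self.washout_days < min_half_lives * self.half_life_days:
            problems.append("washout is shorter than the assumed half-life rule")
        return problems


def poolable(a: NOf1Protocol, b: NOf1Protocol) -> bool:
    """Two trials are treated as poolable only if their design structure matches."""
    key = lambda p: (p.n_blocks, p.period_days, p.washout_days,
                     p.primary_endpoint, p.samples_per_day)
    return key(a) == key(b)


template = NOf1Protocol(n_blocks=4, period_days=14, washout_days=7,
                        half_life_days=1.0,
                        primary_endpoint="daily_step_count",  # hypothetical endpoint
                        samples_per_day=24)
print(template.validate())           # no problems for this template
print(poolable(template, template))
```

Encoding the template this way means that deviations from the agreed crossover structure are detected before data collection rather than discovered during pooled analysis.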
This strategy not only respects the biological uniqueness of each child but also provides a scalable, ethically sound pathway for the rapid approval of life-altering therapies. We can no longer afford to wait for populations that do not exist; we must instead innovate with the patients we have.
The author declares no conflicts of interest. No funding was received for this work. Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Rajesh Krishna (Wed,) studied this question.