January 4, 2019

Guidelines for Reporting of Statistics for Clinical Research in Urology

Abstract

Journal of Urology, Outcomes/Epidemiology/Socioeconomics, 1 Mar 2019

Melissa Assel (Memorial Sloan Kettering Cancer Center), Daniel Sjoberg (Memorial Sloan Kettering Cancer Center), Andrew Elders (Glasgow Caledonian University), Xuemei Wang (The University of Texas, MD Anderson Cancer Center), Dezheng Huo (The University of Chicago), Albert Botchway (Southern Illinois University School of Medicine), Kristin Delfino (Southern Illinois University School of Medicine), Yunhua Fan (University of Minnesota), Zhiguo Zhao (Cleveland Clinic), Tatsuki Koyama (Vanderbilt University Medical Center), Brent Hollenbeck (University of Michigan), Rui Qin (Janssen Research & Development), Whitney Zahnd (University of South Carolina), Emily C. Zabor (Memorial Sloan Kettering Cancer Center), Michael W. Kattan (Cleveland Clinic) and Andrew J. Vickers (Memorial Sloan Kettering Cancer Center)

Correspondence: Andrew J. Vickers, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 485 Lexington Ave., 2nd Floor, New York, New York 10017.
DOI: https://doi.org/10.1097/JU.0000000000000001

In an effort to improve the quality of statistics in the clinical urology literature, statisticians at European Urology, The Journal of Urology®, Urology and BJUI developed a set of guidelines to address common errors of statistical analysis, reporting and interpretation. Authors should break any of the guidelines if it makes scientific sense to do so but would need to provide clear justification. Adoption of the guidelines will in our view not only increase the quality of published articles in our journals, but improve statistical knowledge in our field in general.

It is widely acknowledged that the quality of statistics in the clinical research literature is poor. This is true for urology just as it is for other medical specialties. In 2005 Scales et al. published a systematic evaluation of the statistics in papers appearing in a single month in European Urology, The Journal of Urology®, Urology and BJUI [1]. They reported widespread errors, including 71% of papers with comparative statistics having at least 1 statistical flaw. These findings mirror many others in the literature, as indicated in the review by Lang and Altman [2]. The quality of statistical reporting in urology journals has no doubt improved since 2005 but remains unsatisfactory.

The editors of the 4 urology journals in the review by Scales et al. have come together to publish a shared set of statistical guidelines [3]. Statistical reviewers at the 4 journals will systematically assess submitted manuscripts using the guidelines to improve statistical analysis, reporting and interpretation by the authors. Adoption of the guidelines will, in our view, not only increase the quality of published papers in our journals but improve statistical knowledge in our field in general.
For example, asking an author to follow a guideline about the fallacy of accepting the null hypothesis would no doubt result in a better paper, but we hope it will also enhance the author's understanding of hypothesis tests. The guidelines are didactic, based on the consensus of the statistical consultants to the journals. We avoided making specific analytic recommendations when possible, and focused instead on analyses or methods of reporting statistics that should be avoided. We intend to update the guidelines over time and hence encourage readers who question the value or rationale of a guideline to write to us.

1. The Golden Rule: Break any of the Guidelines if it Makes Scientific Sense to Do So

Science varies too much to allow methodologic or reporting guidelines to apply universally.

2. Reporting of Design and Statistical Analysis

2.1. Follow existing reporting guidelines for the type of study you are reporting, such as CONSORT for randomized trials, REMARK for marker studies, TRIPOD for prediction models, STROBE for observational studies or AMSTAR for systematic reviews.

Statisticians and methodologists have contributed extensively to a large number of reporting guidelines. The first is widely recognized to be the CONSORT (Consolidated Standards of Reporting Trials) statement on reporting of randomized trials but there are now many other guidelines covering a wide range of different types of study. Reporting guidelines can be downloaded from the Equator Web site (http://www.equator-network.org).

2.2. Describe cohort selection fully.

It is insufficient to state, for instance, "the study cohort consisted of 1,144 patients treated for benign prostatic hyperplasia at our institution."
The cohort needs to be defined in terms of dates (for example, "presenting March 2013 to December 2017"), inclusion criteria ("International Prostate Symptom Score greater than 12") and whether patients were selected to be included (for a research study) vs being a consecutive series. Exclusions should be described one by one, with the number of patients omitted for each exclusion criterion to give the final cohort size (for example, "patients with prior surgery n=43, allergies to 5-ARIs n=12 and missing data on baseline prostate volume n=86 were excluded to give a final cohort for analysis of 1,003 patients"). Note that inclusion criteria can be omitted if obvious from context (for example, no need to state "undergoing radical prostatectomy for histologically proven prostate cancer"). However, dates may need to be explained if their rationale could be questioned (for example, "March 2013 when our specialist voiding clinic was established to December 2017").

2.3. Describe the practical steps of randomization in randomized trials.

Although this reporting guideline is part of the CONSORT statement, it is so critical and so widely misunderstood that it bears repeating. The purpose of randomization is to prevent selection bias, which can be achieved only if those consenting patients cannot guess a patient's treatment allocation before registration in the trial or change it afterward. This safeguard is known as allocation concealment. Stating merely that "a randomization list was created by a statistician" or that "envelope randomization was used" does not ensure allocation concealment as a list could have been posted at the nurses' station for all to see, and envelopes can be opened and resealed. Investigators need to specify the exact logistic steps taken to ensure allocation concealment. The best method is to use a password protected computer database.

2.4. Statistical methods should describe the study questions and statistical approaches used to address each question.

Many statistical methods sections state only something like "Mann-Whitney was used for comparisons of continuous variables and Fisher's exact for comparisons of binary variables." These statements say little more than "the inference tests used were not grossly erroneous for the type of data." Instead, statistical methods sections should lay out each primary study question separately by carefully detailing the analysis associated with each question and describing the rationale for the analytic approach when this is not obvious or if there are reasonable alternatives. Special attention and description should be provided for rarely used statistical techniques.

2.5. Statistical methods should be described in sufficient detail to allow replication by an independent statistician given the same data set.

Vague reference to "adjusting for confounders" or "non-linear approaches" is insufficiently specific to allow replication, a cornerstone of the scientific method. All statistical analyses should be specified in the Methods section, including details such as the covariates in a multivariable model. All variables should be clearly defined when there is room for ambiguity. For instance, avoid saying that "Gleason grade was included in the model" and state instead "Gleason grade group was included in four categories: 1, 2, 3, and 4 or 5."

3. Inference and P-Values

3.1. Don't accept the null hypothesis.

In a court case defendants are declared guilty or not guilty, and there is no verdict of "innocent." Similarly, in a statistical test the null hypothesis is rejected or not rejected. If the p-value is 0.05 or greater, investigators should avoid conclusions such as "the drug was ineffective," "there was no difference between groups" or "response rates were unaffected."
Instead, authors should use phrases such as "we did not see evidence of a drug effect," "we were unable to demonstrate a difference between groups" or simply "there was no statistically significant difference in response rates."

3.2. P-values just above 5% are not a trend, and are not moving.

Avoid saying that a p-value such as 0.07 shows a "trend" (which is meaningless) or "approaches statistical significance" (because the p-value isn't moving). Alternative language might be, "although we saw some evidence of improved response rates in patients receiving the novel procedure, differences between groups did not meet conventional levels of statistical significance."

3.3. P-values and 95% confidence intervals do not quantify the probability of a hypothesis.

A p-value of 0.03 does not mean there is a 3% probability that the findings are due to chance. Additionally, a 95% confidence interval should not be interpreted as a 95% certainty the true parameter value is in the range of the 95% confidence interval. The correct interpretation of a p-value is the probability of finding the observed or more extreme results when the null hypothesis is true, and the 95% confidence interval will contain the true parameter value 95% of the time were a study to be repeated many times using different samples.

3.4. Do not use confidence intervals to test hypotheses.

Investigators often interpret confidence intervals in terms of hypotheses. For instance, investigators might claim that there is a statistically significant difference between groups because the 95% confidence interval for the odds ratio excludes 1. Such claims are problematic because confidence intervals are concerned with estimation, not inference. Moreover, the mathematical method to calculate confidence intervals may be different from those used to calculate p-values. It is perfectly possible to have a 95% confidence interval that includes no difference between groups although the p-value is less than 0.05 or vice versa.
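As a rough illustration of how a p-value and a confidence interval can disagree, the sketch below computes a two-sided Fisher's exact p-value and a Wald interval for the odds ratio for a 2x2 table with 35 of 50 events in one group and 25 of 50 in the other. The function names and the choice of a Wald interval are assumptions made for illustration only; other interval methods give slightly different limits.

```python
from math import comb, exp, log, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1 with margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )


def wald_odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald (log scale) 95% confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)


# 35 of 50 events (70%) in one group vs 25 of 50 events (50%) in the other
p_value = fisher_exact_two_sided(35, 15, 25, 25)
odds_ratio, lower, upper = wald_odds_ratio_ci(35, 15, 25, 25)
print(f"Fisher p = {p_value:.3f}; OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

In this sketch the p-value is above 0.05 while the lower limit of the interval is above 1, which is the kind of disagreement the guideline warns about.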
For instance, in a study of 100 patients in 2 equal groups with event rates of 70% and 50%, respectively, the p-value from Fisher's exact test is 0.066 but the 95% confidence interval for the odds ratio is 1.03 to 5.26. The 95% confidence intervals for the risk difference and risk ratio also exclude no difference between groups.

3.5. Take care interpreting results when reporting multiple p-values.

The more questions you ask, the more likely you are to get a spurious answer to at least one of them. For example, if you report p-values for 5 independent true null hypotheses, the probability that you will falsely reject at least one is not 5% but greater than 20% (1 − 0.95^5 ≈ 23%). Although formal adjustment of p-values is appropriate in some specific cases, such as genomic studies, a more common approach is simply to interpret p-values in the context of multiple testing. For instance, if an investigator examines the association of 10 variables with 3 different end points, thereby testing 30 separate hypotheses, a p-value of 0.04 should not be interpreted the same way as if the study tested only a single hypothesis with a p-value of 0.04.

3.6. Do not report separate p-values for each of 2 different groups in order to address the question of whether there is a difference between groups.

One scientific question means 1 statistical hypothesis tested by 1 p-value. To illustrate the error of using 2 p-values to address 1 question, take the case of a randomized trial of drug versus placebo to reduce voiding symptoms with 30 patients in each group. The authors might report that symptom scores improved by 6 (standard deviation 14) points in the drug group (p=0.03 by 1-sample t-test) and 5 (standard deviation 15) points in the placebo group (p=0.08). However, the study hypothesis concerns the difference between drug and placebo. To test a single hypothesis, a single p-value is needed.
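The p-values in this trial example can be checked from the summary statistics alone. The following is a minimal sketch, assuming a pooled-variance 2-sample t-test and using a hand-rolled numerical integration of the t density so that no third-party library is needed. All function names are illustrative.

```python
from math import exp, lgamma, log, pi, sqrt


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the density integrated from -|t| to |t|."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1 - 2 * area * h / 3


def one_sample_p(mean_change, sd, n):
    """1-sample t-test of mean change against zero, from summary statistics."""
    return two_sided_p(mean_change / (sd / sqrt(n)), n - 1)


def two_sample_p(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance 2-sample t-test from summary statistics."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return two_sided_p((mean1 - mean2) / se, n1 + n2 - 2)


# Within-group tests (the error): one p-value per group
p_drug = one_sample_p(6, 14, 30)
p_placebo = one_sample_p(5, 15, 30)
# Between-group test (the study hypothesis): a single p-value
p_between = two_sample_p(6, 14, 30, 5, 15, 30)
print(f"drug: {p_drug:.2f}, placebo: {p_placebo:.2f}, drug vs placebo: {p_between:.2f}")
```

Under these assumptions the within-group p-values round to the 0.03 and 0.08 quoted in the example, while the single between-group p-value is nowhere near conventional significance.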
A 2-sample t-test for these data gives a p-value of 0.8, which is unsurprising given the scores in each group were virtually the same, confirming that it would be unsound to conclude that the drug was effective based on the finding that change was significant in the drug group but not in placebo controls.

3.7. Use interaction terms in place of subgroup analyses.

A similar error to the use of separate tests for a single hypothesis is when an intervention is shown to have a statistically significant effect in 1 group of patients but not another. A more appropriate approach is to use what is known as an interaction term in a statistical model. For instance, to determine whether a drug reduced pain scores more in women than in men, the model might be as follows:

pain score = β0 + β1 × drug + β2 × sex + β3 × (drug × sex)

It is sometimes appropriate to report estimates and confidence intervals within subgroups of interest but p-values should be avoided.

3.8. Tests for change over time are generally uninteresting.

A common analysis is to conduct a paired t-test comparing, for example, erectile function in older men at baseline with erectile function after 5 years of followup. The null hypothesis is that "erectile function does not change over time," which is known to be false. Investigators are encouraged to focus on estimation rather than inference, reporting, for example, the mean change over time along with a 95% confidence interval.

3.9. Avoid using statistical tests to determine the type of analysis to be conducted.

Numerous statistical tests are available that can be used to determine how a hypothesis test should be conducted. For instance, investigators might conduct a Shapiro-Wilk test for normality to determine whether to use a t-test or Mann-Whitney, Cochran's Q to decide whether to use a fixed or random effects approach in a meta-analysis, or use a t-test for between group differences in a covariate to determine whether that covariate should be included in a multivariable model.
The problem with these sorts of approaches is that they are often testing a null hypothesis that is known to be false. For instance, no data set perfectly follows a normal distribution. Moreover, it is often questionable that changing the statistical approach in light of the test is actually of benefit. Statisticians are far from unanimous as to whether Mann-Whitney is always superior to t-test when data are non-normal or that fixed effects are invalid under study heterogeneity or that the criterion of adjusting for a variable should be whether it is significantly different between groups. Investigators should generally follow a pre-specified analytic plan, only altering the analysis if the data unambiguously point to a better alternative.

3.10. When reporting p-values be clear about the hypothesis tested and ensure that the hypothesis is sensible.

P-values test specific hypotheses. When reporting a p-value in the Results section, state the hypothesis being tested unless this is completely clear. For example, in the statement "Pain scores were higher in group 1 and similar in groups 2 and 3 (p=0.02)" it is ambiguous whether the p-value of 0.02 is testing group 1 vs groups 2 and 3 combined or the hypothesis that pain score is the same in all 3 groups. Clarity about the hypotheses being tested can help avoid the testing of inappropriate hypotheses. For instance, p-values for differences between groups at baseline in a randomized trial are testing a null hypothesis that is known to be true (informally, that any observed differences between groups are due to chance).

4. Reporting of Study Estimates

4.1. Use appropriate levels of precision.

Reporting a p-value of 0.7345 suggests that there is an appreciable difference between p-values of 0.7344 and 0.7346. Reporting that 16.9% of 83 patients responded entails a precision (to the nearest 0.1%) that is nearly 200 times greater than the width of the confidence interval (10% to 27%).
Reporting in a clinical study that the mean calorie consumption was 2,069.9 suggests that calorie consumption can be measured extremely precisely by a food questionnaire. Some might argue that being overly precise is irrelevant because the extra numbers can always be ignored. The counter argument is that investigators should think hard about every number they report rather than just cutting and pasting numbers from the statistical software printout. The specific guidelines for precision are listed below.

Report p-values to a single significant figure unless the p-value is close to 0.05 (for example, 0.01 to 0.2), in which case report 2 significant figures. Do not report "NS" for p-values of 0.05 or above. Low p-values can be reported as p <0.001 and high p-values as p >0.9. For instance, p <0.001, 0.004, 0.045, 0.13, 0.3 and 1 are reported to appropriate precision.

Report percentages, rates and probabilities to 2 significant figures, for example 75%, 3.4% and 0.13%.

Do not report p-values of zero as any experimental result has a non-zero probability.

Do not give decimal places if a probability or proportion is 1 (for example, a p-value of 1.00 or a percentage of 100.00%). The decimal places suggest it is possible to have a p-value of 1.05 for instance. There is a similar consideration for data that can only take integer values. It makes sense to state that the mean number of pregnancies was 2.4 but not that 29% of women reported 1.0 pregnancies.

There is generally no need to report estimates to more than 3 significant figures.

Hazard and odds ratios are normally reported to 2 decimal places, although this can be avoided for high odds ratios (for example, 18.2 rather than 18.17).

4.2. Avoid redundant statistics in cohort descriptions.

Authors should be selective about the descriptive statistics reported and ensure that each number provides unique information. They should avoid reporting descriptive statistics that can be readily derived from data that have already been provided.
For instance, there is no need to state 40% of a cohort were men and 60% were women when you can choose one or the other. Another common error is to include a column of descriptive statistics for 2 groups separately and then the whole cohort combined. For example, if the median age is 60 in group 1 and [...] in group 2, we do not need to be told that the median age in the cohort as a whole is close to [...].

4.3. For descriptive statistics, median and quartiles are preferred over means and standard deviations, and range should be avoided.

The median and quartiles provide all sorts of useful information, such as the proportion of patients above the median or between the quartiles. The range gives the values of just 2 patients and so is generally uninformative of the data distribution.

4.4. Report estimates for the main study questions.

A clinical study focuses on a number of scientific questions and authors should provide an estimate for each of these. In a study comparing 2 groups authors should give an estimate of the difference between groups and avoid giving only data on each group separately or simply saying that the difference was or was not significant. In a study of a prognostic factor authors should give an estimate of the strength of the association, such as an odds ratio or hazard ratio, as well as reporting a p-value testing the null hypothesis of no association between the factor and outcome.

4.5. Report confidence intervals for the main estimates of interest.

Authors should generally report a 95% confidence interval around the estimates relating to the key research questions but not other estimates given in an article. For example, in a study comparing 2 groups the authors might report event rates of [...] and [...]. However, the key estimate in this case is the difference between groups and so this difference of 5% should be reported along with a 95% confidence interval (for example, [...] to [...]). Confidence intervals should not be reported for the estimates within each group (for example, an event rate in group A of [...], 95% CI [...] to [...]). Similarly, confidence intervals should not be given for statistics such as mean age or [...].

4.6. Do not treat categorical variables as continuous.

A variable such as grade group is scored 1 to 5 but it is not true that the difference between groups 3 and 4 is half as great as the difference between groups 2 and 4.
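Variables such as grade group should be reported as categories (for example, 40% grade group 1, [...] grade group 2, [...] grade group 3 and [...] grade groups 4 and 5) rather than as a continuous variable (for example, a mean score of [...]). Similarly, variables such as grade group should be entered into models not as a single variable (for example, a hazard ratio of [...] per 1-point increase in grade group) but as multiple categories (for example, a hazard ratio of [...] comparing grade group 2 to group 1 and a hazard ratio of [...] comparing grade group 3 to group 1).

4.7. Avoid categorization of continuous variables unless there is a convincing rationale.

A common approach to a variable such as age is to define patients as 60 years or older or younger than 60 and then enter age into analyses as a categorical variable reporting, for example, that "patients 60 and over had [...] the risk of an [...] than patients less than 60." In epidemiologic and marker studies a common approach is to divide a variable into quartiles and report a statistic such as a hazard ratio for each quartile compared to the lowest. This is problematic because it assumes that all values of a variable within a category are the same. For instance, it is likely not the case that a patient aged [...] years has the same risk as a patient aged [...] years but a different risk to a patient aged [...] years. It is generally preferable to leave variables in a continuous form reporting, for instance, how risk changes with a [...] increase in age. Non-linear terms can also be used to avoid the assumption that the association between age and risk follows a straight line.

4.8. Do not use statistical methods to obtain cut points for clinical practice.

Various statistical methods are available to dichotomize a continuous variable. For instance, outcomes can be compared on either side of different cut points, and the optimal cut point chosen as the one associated with the smallest p-value. Alternatively, investigators might choose a cut point that leads to the highest value of sensitivity plus specificity, that is, the point closest to the top left corner of a receiver operating characteristic curve. Such methods are inappropriate for determining clinical cut points because they do not consider clinical consequences. For instance, the receiver operating characteristic approach assumes that sensitivity and specificity are of equal value, whereas it is generally worse to miss disease than to treat unnecessarily. The smallest p-value approach tests strength of evidence against the null hypothesis, which has little to do with the benefits and harms of a treatment or [...].

4.9. The association between a continuous predictor and outcome can be demonstrated graphically, particularly by using non-linear modeling.

In high school we often thought about the relationship between y and x by plotting a line on a graph, with a scatterplot added in some cases. This also holds true for many scientific studies. For example, for a study on age and complication rates an investigator could plot age on the x axis against the risk of a complication on the y axis and show a regression line, perhaps with a 95% confidence interval.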
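Non-linear modeling is often useful because it avoids assuming a linear relationship and allows the investigator to address questions such as whether risk starts to increase beyond a given age.

4.10. Do not ignore significant heterogeneity in meta-analyses.

Informally, heterogeneity statistics test as to whether variations between the results of different studies in a meta-analysis are consistent with chance or with true differences between studies. If heterogeneity is present authors need to do more than merely report the p-value and focus on the random effects estimate. Authors should investigate the sources of heterogeneity and try to determine the factors that lead to differences in study results, for example, by identifying common features of studies with similar findings or [...] of studies with [...].

4.11. For time to event variables report the number of events but not the proportion.

Take the case of a study that reported that of 60 patients [...], 10 [...]. Although it is important to report the number of events, patients entered the study at different times and were followed for different periods, and so the reported proportion of [...] is meaningless. The standard statistical approach to time to event variables is to calculate probabilities, such as the risk of [...] being 60% by 5 years or the median [...], the time at which the probability of [...] first [...], being [...].

4.12. For time to event analyses report median followup for patients without the event or the number followed without an event at a given followup time.

It is often useful to describe how long a cohort has been followed. To illustrate the appropriate methods of doing so, take a cohort of [...] patients with [...] treated in [...] and followed to [...]. If the [...] was only [...], median followup for all patients might only be a [...], whereas the median followup for patients who [...] was [...]. This gives a much better impression of how long the cohort has been followed. Now assume that in [...] a [...] cohort of patients was added to the study. The median followup for [...] will now be around a [...], which is misleading. It would be better to report a statistic such as "[...] patients have been followed without an event for at least [...]."

4.13. For time to event analyses describe when followup starts and when and how patients are censored.

A common error is that investigators use a censoring date that leads to an overestimate of [...]. For example, when assessing [...], a patient without a record of [...] should be censored on the date of the last time the patient was known to be free of [...] (for example, the date of the last prostate specific antigen test) and not on the date of last patient contact (which may not have involved a test of [...]). For [...], date of last patient contact would be an acceptable censoring date because the patient was known to be [...] at that time.

4.14. When presenting [...] specific end points, consideration should be given to the [...] of [...].

The end points [...] specific [...] and [...] have specific [...] and require attention to [...]. When analyzing cause specific [...], authors need to consider carefully how to handle [...] due to other [...]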
