Metabolomic studies involve the measurement of spectra from biological samples collected by researchers from their study cohort and the analysis of the resulting data. Often the goal of such studies is to identify metabolites which exhibit different behaviour in samples belonging to different groups in the study cohort, so that they may act as an identifier of the particular feature being studied. Studies must utilise an adequate sample size such that the study achieves an appropriate statistical power to validate its conclusions, and this sample size must be determined before commencing the study. In many biological fields an optimal sample size is determined by performing a pilot study; however, this is not always possible for metabolomic studies due to factors such as experiment cost, availability of resources, and availability of subjects. MetSizeR (Nyamundanda et al. 2013) offers a tool to estimate the sample size needed for an experiment to achieve a desired statistical power, without the use of pilot data or historical data.
MetSizeR operates on the idea of an analysis-informed approach, meaning that the sample size should be estimated for an experiment based on the method of analysis the researcher plans to use. There are currently two analysis methods supported by MetSizeR: probabilistic principal components analysis (PPCA), originally developed by Tipping and Bishop (1999), and probabilistic principal components and covariates analysis (PPCCA), developed by Nyamundanda et al. (2010). The application performs its estimation using simulated data in place of pilot data. The data are simulated based on the analysis method the researcher plans to use, and the sample size which yields the desired false discovery rate (FDR) for the study is returned to the user.
Full details of the MetSizeR algorithm can be found in Nyamundanda et al. (2013).
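To make the idea of simulating data from the planned analysis model more concrete, the following is a minimal sketch of drawing spectra from the PPCA generative model of Tipping and Bishop (1999), in which each observation is a linear function of a low-dimensional latent score plus a mean and isotropic Gaussian noise. It is not the MetSizeR implementation: the function name `simulate_ppca`, its arguments, and the default values are assumptions chosen purely for illustration.

```r
# Illustrative sketch only: simulate n spectra with p bins from a PPCA model.
# simulate_ppca and all default values are hypothetical, not MetSizeR code.
simulate_ppca <- function(n, p, q = 2, sigma2 = 0.5) {
  W  <- matrix(rnorm(p * q), nrow = p, ncol = q)  # loadings (p x q)
  mu <- rnorm(p)                                  # mean spectrum (length p)
  U  <- matrix(rnorm(n * q), nrow = n, ncol = q)  # latent scores (n x q)
  E  <- matrix(rnorm(n * p, sd = sqrt(sigma2)),   # isotropic noise (n x p)
               nrow = n, ncol = p)
  # Row i of the result is x_i = W u_i + mu + e_i
  U %*% t(W) + matrix(mu, nrow = n, ncol = p, byrow = TRUE) + E
}

set.seed(1)
X <- simulate_ppca(n = 20, p = 50)
dim(X)  # 20 rows (samples) by 50 columns (bins)
```

In a sketch like this, the number of rows plays the role of the candidate sample size, so data sets of different sizes can be generated without any pilot measurements.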
A Shiny application is used as a graphical user interface (GUI) for the application, to encourage use in the field without the need for experience using R.
The MetSizeR package provides an R Shiny application which allows users to estimate the sample size required for their study to achieve a desired statistical power. The tool is built to estimate the sample size required for both targeted and untargeted metabolomic experiments. Sample size estimation is performed without the need for pilot data; however, if pilot data are available then these can be uploaded to the application to aid in the estimations. Several inputs are required from the user; it is expected that these inputs should be informed by the user’s own knowledge of their field. Additionally, where pilot data are present it is expected that these data have been sourced and treated appropriately in the context of the user’s research.
To install the MetSizeR package, the user should have the R software environment installed on their machine. To download R, visit the website of The R Project for Statistical Computing and follow the instructions given there to download and install it.
When in R, the following command will allow the user to install the MetSizeR package:
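Assuming the package is installed from CRAN under the name `MetSizeR`, as the reference to its CRAN entry suggests, the standard installation call would look like this:

```r
# Install MetSizeR from CRAN (requires an internet connection).
install.packages("MetSizeR")
```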
Package dependencies will also be installed if they are not already installed on the user’s computer. These consist of the following:
Alternatively, MetSizeR can be downloaded and installed manually by the user if they so prefer by visiting its entry on The Comprehensive R Archive Network (CRAN).
Once installed, the MetSizeR package can be loaded into the user’s current R session by the following command:
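Loading an installed package uses the standard `library()` call; the sketch below assumes the package has been installed under the name `MetSizeR`:

```r
# Attach the MetSizeR package to the current R session.
library(MetSizeR)
```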
The MetSizeR R package is designed to perform all functionality through the associated R Shiny application. The Shiny application can be launched by running the following command:
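As a sketch, and assuming that the launcher function exported by the package shares the package's name (this should be checked against the package's help pages, for example with `help(package = "MetSizeR")`), the call would be:

```r
# Launch the Shiny application (assumed launcher name: MetSizeR()).
library(MetSizeR)
MetSizeR()
```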
The application will then launch in the user’s external browser, in a pop-up window, or in the viewer pane of the user’s R IDE, depending on how the user has configured their settings.
Once the application has launched, the initial landing page is an “About” page which contains information about the package, its functionality, and references for the methods used within. Navigation through the application occurs through the navigation bar at the top of the page, with options “About”, “Sample Size Estimation”, and “Vary Proportion of Significant Spectral Bins” leading to the about page, the page on which to perform sample size estimation for a single experiment, and the page on which to estimate the sample size for varying proportions of significant bins or metabolites respectively.
The core functionality of MetSizeR is the ability to estimate the sample size required for a metabolomic experiment to achieve a desired statistical power. MetSizeR can perform this estimation for targeted or untargeted experiments, whether or not pilot data are available. Sample size estimation takes place based on the intended method of analysis in the experiment; at present the tool is available for PPCA or PPCCA. To navigate to the sample size estimation tool, select “Sample Size Estimation” on the navigation bar at the top of the page.
MetSizeR can estimate the sample size required for a study to achieve a desired statistical power even when pilot data are unavailable. In this case, the tool can be used for both targeted and untargeted analysis. The user must specify several parameters for the algorithm to use, given as follows. These are specified by selecting or inputting options on the sidebar located on the left side of the page.
“Are experimental pilot data available?” checkbox.
Once the input values have been entered to the user’s specifications, the “Estimate Optimal Sample Size” button at the bottom of the sidebar must be clicked to start the MetSizeR process. Note that the process may take several minutes for larger numbers of bins. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
When results are ready they will appear on the right side of the screen. A plot of FDR versus sample size will appear, showing the 90th, 50th, and 10th percentiles of the FDR values calculated for each sample size tested. The estimated optimal sample size will be indicated by a blue vertical line, and also in text in the legend of the plot. The target FDR is given by a black dotted line on the plot.
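The percentile bands shown on the plot can be illustrated with a short sketch. The FDR values below are random placeholders rather than MetSizeR output, and the rule used to pick a sample size (the smallest size whose median FDR does not exceed the target) is an assumption made for illustration; the application's own selection rule is described in Nyamundanda et al. (2013).

```r
# Placeholder input: one row per simulation run, one column per sample size.
# These values are randomly generated stand-ins, not real MetSizeR results.
set.seed(42)
sample_sizes <- c(4, 8, 12, 16, 20)
fdr_runs <- sapply(sample_sizes, function(n) rbeta(100, 2, n))
colnames(fdr_runs) <- sample_sizes

# 10th, 50th and 90th percentiles of the FDR at each sample size.
fdr_bands <- apply(fdr_runs, 2, quantile, probs = c(0.1, 0.5, 0.9))

# Assumed rule for illustration: smallest sample size whose median FDR
# is at or below the target FDR (NA if no tested size meets the target).
target_fdr <- 0.1
meets_target <- fdr_bands["50%", ] <= target_fdr
chosen <- if (any(meets_target)) min(sample_sizes[meets_target]) else NA

fdr_bands
chosen
```

Each column of `fdr_bands` corresponds to one point on each of the three curves in a plot of this kind.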
Below the plot, results will be displayed in text. The estimated optimal sample size will be printed, along with the per-group breakdown of the sample sizes, in the same ratio as input by the researcher. There is an option to download the plot, by clicking the “Download Plot” button. The plot will then download to the location of choice on the user’s computer. There is also an option to view the exact values from the points on the plot by clicking the “Show values from plot?” checkbox. This opens a table showing the values of the 90th, 50th, and 10th percentiles of the FDR values calculated alongside the relevant sample size. The user can download these data as a CSV file by clicking the “Download Plot Data as .csv” button, which will download the data to the location of the user’s choosing.
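The per-group breakdown amounts to splitting a total sample size according to a ratio. A minimal sketch of such a split follows; the function name `split_by_ratio` and the rounding scheme (round down, then hand any leftover samples to the groups with the largest remainders) are assumptions for illustration and may differ from how the application rounds.

```r
# Hypothetical helper: split a total sample size across groups by ratio.
split_by_ratio <- function(total, ratio) {
  exact <- total * ratio / sum(ratio)   # ideal, possibly fractional, sizes
  sizes <- floor(exact)                 # round down first
  leftover <- total - sum(sizes)        # samples still to be allocated
  if (leftover > 0) {
    # Give the remaining samples to the groups with the largest remainders.
    idx <- order(exact - sizes, decreasing = TRUE)[seq_len(leftover)]
    sizes[idx] <- sizes[idx] + 1
  }
  sizes
}

split_by_ratio(30, c(1, 2))  # 10 20
```

Whatever the rounding scheme, the group sizes should always sum to the estimated total.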
If the user wishes to change any of the inputs and re-run the algorithm, they can simply change the relevant inputs and click “Estimate Optimal Sample Size” once more, and the results will update. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
If pilot data are available then MetSizeR can use these to aid its estimation of the optimal sample size required for the study to achieve a desired level of power. To estimate the sample size in this manner, navigate to the “Sample Size Estimation” page using the navigation bar at the top of the page. This page contains a sidebar on the left side which allows the user to input specifications of their planned study as well as upload pilot data. The following specifications are required from the user:
“Are experimental pilot data available?” checkbox. This will open several options to the user. Firstly, the user must specify whether their data file contains a header as the first row; if so, click the “Does data contain a header?” checkbox. The option to upload data as a CSV file is then given. To select the data file, click “Browse”, navigate to the file’s location and select the file. A blue confirmation bar should appear under the file upload section, saying “Upload complete”, and the name of the file should appear in the box beside the “Browse” button.
Note that the number of spectral bins (untargeted analysis) or
metabolites (targeted analysis) does not need to be specified by the
user when pilot data are present as this is read from the uploaded file.
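Before uploading, it can be useful to check in R what a pilot file contains. The sketch below writes a small stand-in CSV file and reads it back; the layout (one row per sample, one column per spectral bin or metabolite) and all values are assumptions for illustration, and the layout actually expected by the application should be confirmed before uploading real data.

```r
# Write a tiny stand-in pilot file: 3 samples (rows) by 4 bins (columns).
# The values and the rows-as-samples layout are illustrative assumptions.
pilot_path <- tempfile(fileext = ".csv")
pilot <- data.frame(bin1 = c(0.2, 0.1, 0.4), bin2 = c(1.3, 1.1, 0.9),
                    bin3 = c(0.0, 0.2, 0.1), bin4 = c(2.1, 1.8, 2.4))
write.csv(pilot, pilot_path, row.names = FALSE)

# header = TRUE corresponds to a file whose first row holds column names.
pilot_data <- read.csv(pilot_path, header = TRUE)
n_samples <- nrow(pilot_data)  # number of pilot samples
n_bins <- ncol(pilot_data)     # number of bins or metabolites in the file
```

Reading the same file with `header = FALSE` would treat the first row as data, which is why the header option needs to match the file.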
When the input values are correctly specified, the “Estimate Optimal Sample Size” button at the bottom of the sidebar must be clicked to start the MetSizeR process. Note that the process may take several minutes for larger data sets. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
When results are ready they will appear on the right side of the screen. A plot of FDR versus sample size will appear, showing the 90th, 50th, and 10th percentiles of the FDR values calculated for each sample size tested. The estimated optimal sample size will be indicated by a blue vertical line, and also in text in the legend of the plot. The target FDR is given by a black dotted line on the plot.
Below the plot, results will be displayed in text. The estimated optimal sample size will be printed, along with the per-group breakdown of the sample sizes, in the same ratio as input by the researcher. There is an option to download the plot, by clicking the “Download Plot” button. The plot will then download to the location of choice on the user’s computer. There is also an option to view the exact values from the points on the plot by clicking the “Show values from plot?” checkbox. This opens a table showing the values of the 90th, 50th, and 10th percentiles of the FDR values calculated alongside the relevant sample size. The user can download these data as a CSV file by clicking the “Download Plot Data as .csv” button, which will download the data to the location of the user’s choosing.
If the user wishes to change any of the inputs and re-run the algorithm, they can simply change the relevant inputs and click “Estimate Optimal Sample Size” once more, and the results will update. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
If the user wishes to test different proportions of significant bins to find the optimal sample size, they can navigate to the “Vary Proportion of Significant Spectral Bins” tab on the navigation bar at the top of the page. Here, up to four different significance proportions can be tested for the same experimental design.
First the user should specify whether they intend to perform targeted or untargeted analysis by selecting the applicable option at the top of the sidebar on the left of the page. There are then boxes on the sidebar where the user can enter up to four proportions. One proportion should be entered in each box. If fewer than four proportions are required, zero should be entered in the remaining boxes. The values can be input either by typing the proportion as a decimal or by using the arrows on the right of the boxes to change the values.
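As a rough illustration of what these inputs mean, the sketch below drops the zero entries and converts each remaining proportion into a number of significant bins for a given total. The variable names and the use of `round()` are assumptions for illustration; how the application converts proportions to counts is not stated here.

```r
# Four input boxes, with the unused fourth box left at zero.
proportions <- c(0.1, 0.2, 0.3, 0)
n_bins <- 200  # hypothetical number of spectral bins

# Keep only the proportions that were actually specified.
active <- proportions[proportions > 0]

# Approximate number of significant bins implied by each proportion
# (rounding to the nearest whole bin is an assumption).
n_significant <- round(active * n_bins)
n_significant  # 20 40 60
```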
The user must then also choose the number of spectral bins (untargeted analysis) or metabolites (targeted analysis) to test by typing the desired number into the relevant box or using the arrows. They must then select the desired model to use, either PPCA or PPCCA, by selecting the button for their chosen model. If PPCCA is selected, the number of numeric and categorical covariates must be specified, along with the number of levels of each categorical covariate. The target FDR must then be specified, by typing in the desired value or using the arrows to select. Once the inputs are entered to the user’s specifications, the “Run MetSizeR for Varied Proportions” button should be clicked to start the process. Note that this process can take some time, especially for larger numbers of bins. A notification will appear on the bottom right of the screen, indicating that the algorithm is running. This process will usually take longer than the single sample size calculation performed on the “Sample Size Estimation” tab, as there are more calculations to be performed.
When results are ready, one plot for each of the specified proportions will appear on the right side of the screen, along with a statement of their respective proportions. A download button is available for each plot which, when clicked, will allow the user to download the plot as a PNG file to the location of their choosing.
Nyamundanda, G., Brennan, L., & Gormley, I. C. (2010). Probabilistic principal component analysis for metabolomic data. BMC Bioinformatics, 11(1), 571. https://doi.org/10.1186/1471-2105-11-571
Nyamundanda, G., Gormley, I. C., Fan, Y., Gallagher, W. M., & Brennan, L. (2013). MetSizeR: Selecting the optimal sample size for metabolomic studies using an analysis based approach. BMC Bioinformatics, 14(1), 338. https://doi.org/10.1186/1471-2105-14-338
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622. https://doi.org/10.1111/1467-9868.00196