diff --git a/readme.md b/readme.md index 3f7697f..69783ae 100644 --- a/readme.md +++ b/readme.md @@ -128,7 +128,8 @@ By default, the Options menu looks like this: 3) Turn on graph/data file caching 4) Turn off serialized binary graph output 5) Turn on GraphML graph output -6) Maximum weight matching algorithm options +6) Turn on calculation of p-values +7) Maximum weight matching algorithm options 0) Return to main menu ``` @@ -182,7 +183,7 @@ Structure: | Alpha CDR3 | Beta CDR3 | Alpha CDR1 | Beta CDR1 | |---|---|---|---| |unique number|unique number|number|number| - +| ... | ... |... | ... | --- #### Sample Plate Files @@ -191,7 +192,8 @@ described above). The wells are filled randomly from a Cell Sample file, accordi frequency distribution. Additionally, every individual sequence within each cell may, with some given dropout probability, be omitted from the file; this simulates the effect of amplification errors prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a -different concentration of T cells per well. +different concentration of T cells per well. Alternatively, the number of T cells in each well can be randomized between +given minimum and maximum population values. Options when making a Sample Plate file: * Cell Sample file to use @@ -201,7 +203,7 @@ Options when making a Sample Plate file: * Standard deviation size * Exponential * Lambda value - * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was approximately exponential with a lambda ~0.6. (Howie, et al. 2015))* + * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was very roughly exponential with a lambda ~0.6. (Howie, et al. 2015) The actual distribution was certainly quite different.)* * Total number of wells on the plate * Well populations random or fixed * If random, minimum and maximum population sizes @@ -248,12 +250,12 @@ then use it for multiple different BiGpairSEQ simulations. Options for creating a Graph/Data file: * The Cell Sample file to use -* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.) -* Whether to simulate sequence read depth. If simulated: - * The read depth (number of times each sequence is read) - * The read error rate (probability a sequence is misread) - * The error collision rate (probability two misreads produce the same spurious sequence) - * The real sequence collision rate (probability that a misread will produce a different, real sequence from the sample plate. Only applies to new misreads; once an error of this type has occurred, it's likelihood of ocurring again is dominated by the error collision probability.) +* The Sample Plate file to use (This must have been generated from the selected Cell Sample file.) +* Whether to simulate sequencing read depth. If simulated: + * The read depth (The number of times each sequence is read) + * The read error rate (The probability a sequence is misread) + * The error collision rate (The probability two misreads produce the same spurious sequence) + * The real sequence collision rate (The probability that a misread will produce a different, real sequence from the sample plate. Only applies to new misreads; once an error of this type has occurred, it's likelihood of occurring again is dominated by the error collision probability.) These files do not have a human-readable structure, and are not portable to other programs. @@ -270,7 +272,7 @@ compare the matching results to the original Cell Sample .csv file. #### Matching Results Files Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a serialized binary Graph/Data file (.ser). (Because .graphML files are larger than .ser files, BiGpairSEQ_Sim supports .graphML -output only. Graph/data input must use a serialized binary.) +output only. Graph input must use a serialized binary.) Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details. Metadata about the matching simulation is included as comments. Comments are preceded by `#`. @@ -288,56 +290,83 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: Example output: ``` -# Source Sample Plate file: 4MilCellsPlate.csv -# Source Graph and Data file: 4MilCellsPlateGraph.ser -# T cell counts in sample plate wells: 30000 -# Total alphas found: 11813 -# Total betas found: 11808 -# High overlap threshold: 94 -# Low overlap threshold: 3 -# Minimum overlap percent: 0 -# Maximum occupancy difference: 96 -# Pairing attempt rate: 0.438 -# Correct pairings: 5151 -# Incorrect pairings: 18 -# Pairing error rate: 0.00348 -# Simulation time: 862 seconds +# sample plate filename: 8MilCells_Plate.csv +# sequence dropout rate: 0.1 +# graph filename: 8MilGraph_rd2 +# MWM algorithm type: LEDA book with heap: FIBONACCI +# matching weight: 218017.0 +# well populations: 4000 +# sequence read depth: 100 +# sequence read error rate: 0.01 +# read error collision rate: 0.1 +# real sequence collision rate: 0.05 +# total alphas read from plate: 323711 +# total betas read from plate: 323981 +# pre-filter sequences present in all wells: true +# pre-filter sequences based on occupancy/read count discrepancy: true +# alphas in graph (after pre-filtering): 11707 +# betas in graph (after pre-filtering): 11705 +# high overlap threshold for pairing: 95 +# low overlap threshold for pairing: 3 +# minimum overlap percent for pairing: 0 +# maximum occupancy difference for pairing: 2147483647 +# pairing attempt rate: 0.716 +# correct pairing count: 8373 +# incorrect pairing count: 7 +# pairing error rate: 0.000835 +# time to generate graph (seconds): 293 +# time to pair sequences (seconds): 1,416 +# total simulation time (seconds): 1,709 ``` | Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value | |---|---|---|---|---|---|---| -|5242972|17|1571520|18|17|true|1.41E-18| -|5161027|18|2072219|18|18|true|7.31E-20| -|4145198|33|1064455|30|29|true|2.65E-21| -|7700582|18|112748|18|18|true|7.31E-20| +|10258642|73|1172093|72|70.0|true|4.19E-21| +|6186865|34|4290363|37|34.0|true|4.56E-26| +|10222686|70|11044018|72|68.0|true|9.55E-25| +|5338100|75|2422988|76|74.0|true|4.57E-22| +|12363907|33|6569852|35|33.0|true|5.70E-26| |...|...|...|...|...|...|...| --- -**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all. -P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison only, -using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015) +**NOTE: The p-values in the sample output abpve are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all. +P-values (if enabled in the interactive menu options or by using the -pv flag in the command line) are calculated *after* +BiGpairSEQ matching is completed, for purposes of comparison with pairSEQ only, using the (corrected) formula from the +original pairSEQ paper. (Howie, et al. 2015) Calculation of p-values is off by default to reduce processing time. -## PERFORMANCE (old results; need updating to reflect current, improved simulator performance) +## PERFORMANCE -On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM (half of which was allocated to the Java Virtual Machine), and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel), -the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas, -taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution (lambda 0.6). +Several BiGpairSEQ simulations were performed on a home computer with the following specs: -With min/max occupancy threshold of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151 -correct pairings and 18 incorrect pairings, for an accuracy of 99.652%. +* Ryzen 5600X CPU +* 128GB of 3200MHz DDR4 RAM +* 2TB PCIe 3.0 SSD +* Linux Mint 21 (5.15 kernel) -The total simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time. +### Simulation 1 +This simulation was an attempt to replicate the conditions of experiment 1 from the 2015 pairSEQ paper: a matching was found for a +96-well sample plate with 4,000 T cells/well comprising ~11,900 TCRAs and TCRBs, taken from a sample of 8,400,000 +distinct cells with an exponential frequency distribution (lambda 0.6). The sequence dropout rate was 10%, as the analysis +from the original paper concluded that most TCR sequences "have less than a 10% chance of going unobserved." (Howie, et al. 2015) -Since this implementation of BiGpairSEQ writes intermediate results to disk (to improve the efficiency of *repeated* simulations -with different filtering options), the actual elapsed time was greater. File I/O time was not measured, but took -slightly less time than the simulation itself. Real elapsed time from start to finish was under 30 minutes. +The original paper does not contain (or the author of this document failed to identify) information on sequencing depth, +read error probability, or the probabilities of different kinds of read error collisions. As the pre-filtering of BiGpairSEQ +has successfully filtered out all such errors for any reasonable error rates the author has yet tested, this simulation was +done without any sequencing errors, to reduce the processing time. -As mentioned in the theory section, performance could be improved by implementing a more efficient algorithm for finding -the maximum weight matching. +With min/max occupancy thresholds of 3 and 95 wells respectively for matching, BiGpairSEQ identified: +* 8,495 correct pairings +* 5 incorrect pairings -## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS +for an overall pairing accuracy of 99.9992%. + +The total simulation time (excluding file I/O) was 28m52. The total elapsed time with file I/O was 41m23s. +Calculation of p-values was enabled for this simulation, increasing the overall processing time. + + +## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS (old results, need updating for new version of the simulator (though resilience to varying well populations is unchanged)) A series of BiGpairSEQ simulations were conducted using a cell sample file of 3.5 million unique T cells. From these cells, 10 sample plate files were created. All of these sample plates had 96 wells, used an exponential distribution with a lambda of 0.6, and