Update readme with newer results, new menu options

2022-10-01 13:00:33 -05:00
parent 457d643477
commit b19a4d37c2
1 changed files with 74 additions and 45 deletions
--- a/readme.md
+++ b/readme.md
@@ -128,7 +128,8 @@ By default, the Options menu looks like this:
 3) Turn on graph/data file caching
 4) Turn off serialized binary graph output
 5) Turn on GraphML graph output
-6) Maximum weight matching algorithm options
+6) Turn on calculation of p-values
+7) Maximum weight matching algorithm options
 0) Return to main menu
 ```

@@ -182,7 +183,7 @@ Structure:
 | Alpha CDR3 | Beta CDR3 | Alpha CDR1 | Beta CDR1 |
 |---|---|---|---|
 |unique number|unique number|number|number|
-
+| ... | ...  |... | ... |
 ---

 #### Sample Plate Files
@@ -191,7 +192,8 @@ described above). The wells are filled randomly from a Cell Sample file, accordi
 frequency distribution. Additionally, every individual sequence within each cell may, with some
 given dropout probability, be omitted from the file; this simulates the effect of amplification errors
 prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a 
-different concentration of T cells per well.
+different concentration of T cells per well. Alternatively, the number of T cells in each well can be randomized between 
+given minimum and maximum population values.

 Options when making a Sample Plate file:
 * Cell Sample file to use
@@ -201,7 +203,7 @@ Options when making a Sample Plate file:
    * Standard deviation size 
  * Exponential
    * Lambda value
-      * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was approximately exponential with a lambda ~0.6. (Howie, et al. 2015))*
+      * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was very roughly exponential with a lambda ~0.6. (Howie, et al. 2015) The actual distribution was certainly quite different.)*
 * Total number of wells on the plate
 * Well populations random or fixed
  * If random, minimum and maximum population sizes
@@ -248,12 +250,12 @@ then use it for multiple different BiGpairSEQ simulations.

 Options for creating a Graph/Data file:
 * The Cell Sample file to use
-* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
-* Whether to simulate sequence read depth. If simulated:
-  * The read depth (number of times each sequence is read)
-  * The read error rate (probability a sequence is misread)
-  * The error collision rate (probability two misreads produce the same spurious sequence)
-  * The real sequence collision rate (probability that a misread will produce a different, real sequence from the sample plate. Only applies to new misreads; once an error of this type has occurred, it's likelihood of ocurring again is dominated by the error collision probability.)
+* The Sample Plate file to use (This must have been generated from the selected Cell Sample file.)
+* Whether to simulate sequencing read depth. If simulated:
+  * The read depth (The number of times each sequence is read)
+  * The read error rate (The probability a sequence is misread)
+  * The error collision rate (The probability two misreads produce the same spurious sequence)
+  * The real sequence collision rate (The probability that a misread will produce a different, real sequence from the sample plate. Only applies to new misreads; once an error of this type has occurred, it's likelihood of occurring again is dominated by the error collision probability.)

 These files do not have a human-readable structure, and are not portable to other programs.

@@ -270,7 +272,7 @@ compare the matching results to the original Cell Sample .csv file.
 #### Matching Results Files 
 Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a serialized
 binary Graph/Data file (.ser). (Because .graphML files are larger than .ser files, BiGpairSEQ_Sim supports .graphML
-output only. Graph/data input must use a serialized binary.)
+output only. Graph input must use a serialized binary.)

 Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
 Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
@@ -288,56 +290,83 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
 Example output:

 ```
-# Source Sample Plate file: 4MilCellsPlate.csv
-# Source Graph and Data file: 4MilCellsPlateGraph.ser
-# T cell counts in sample plate wells: 30000
-# Total alphas found: 11813
-# Total betas found: 11808
-# High overlap threshold: 94
-# Low overlap threshold: 3
-# Minimum overlap percent: 0
-# Maximum occupancy difference: 96
-# Pairing attempt rate: 0.438
-# Correct pairings: 5151
-# Incorrect pairings: 18
-# Pairing error rate: 0.00348
-# Simulation time: 862 seconds
+# sample plate filename: 8MilCells_Plate.csv
+# sequence dropout rate: 0.1
+# graph filename: 8MilGraph_rd2
+# MWM algorithm type: LEDA book with heap: FIBONACCI
+# matching weight: 218017.0
+# well populations: 4000
+# sequence read depth: 100
+# sequence read error rate: 0.01
+# read error collision rate: 0.1
+# real sequence collision rate: 0.05
+# total alphas read from plate: 323711
+# total betas read from plate: 323981
+# pre-filter sequences present in all wells: true
+# pre-filter sequences based on occupancy/read count discrepancy: true
+# alphas in graph (after pre-filtering): 11707
+# betas in graph (after pre-filtering): 11705
+# high overlap threshold for pairing: 95
+# low overlap threshold for pairing: 3
+# minimum overlap percent for pairing: 0
+# maximum occupancy difference for pairing: 2147483647
+# pairing attempt rate: 0.716
+# correct pairing count: 8373
+# incorrect pairing count: 7
+# pairing error rate: 0.000835
+# time to generate graph (seconds): 293
+# time to pair sequences (seconds): 1,416
+# total simulation time (seconds): 1,709
 ```

 | Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
 |---|---|---|---|---|---|---|
-|5242972|17|1571520|18|17|true|1.41E-18|
-|5161027|18|2072219|18|18|true|7.31E-20|
-|4145198|33|1064455|30|29|true|2.65E-21|
-|7700582|18|112748|18|18|true|7.31E-20|
+|10258642|73|1172093|72|70.0|true|4.19E-21|
+|6186865|34|4290363|37|34.0|true|4.56E-26|
+|10222686|70|11044018|72|68.0|true|9.55E-25|
+|5338100|75|2422988|76|74.0|true|4.57E-22|
+|12363907|33|6569852|35|33.0|true|5.70E-26|
 |...|...|...|...|...|...|...|

 ---

-**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all. 
-P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison only, 
-using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)
+**NOTE: The p-values in the sample output abpve are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all. 
+P-values (if enabled in the interactive menu options or by using the -pv flag in the command line) are calculated *after* 
+BiGpairSEQ matching is completed, for purposes of comparison with pairSEQ only, using the (corrected) formula from the 
+original pairSEQ paper. (Howie, et al. 2015) Calculation of p-values is off by default to reduce processing time.


-## PERFORMANCE (old results; need updating to reflect current, improved simulator performance)
+## PERFORMANCE

-On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM (half of which was allocated to the Java Virtual Machine), and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel), 
-the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas,
-taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution (lambda 0.6).
+Several BiGpairSEQ simulations were performed on a home computer with the following specs:

-With min/max occupancy threshold of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151 
-correct pairings and 18 incorrect pairings, for an accuracy of 99.652%.
+* Ryzen 5600X CPU
+* 128GB of 3200MHz DDR4 RAM
+* 2TB PCIe 3.0 SSD
+* Linux Mint 21 (5.15 kernel)

-The total simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.
+### Simulation 1
+This simulation was an attempt to replicate the conditions of experiment 1 from the 2015 pairSEQ paper: a matching was found for a 
+96-well sample plate with 4,000 T cells/well comprising ~11,900 TCRAs and TCRBs, taken from a sample of 8,400,000 
+distinct cells with an exponential frequency distribution (lambda 0.6). The sequence dropout rate was 10%, as the analysis
+from the original paper concluded that most TCR sequences "have less than a 10% chance of going unobserved." (Howie, et al. 2015)

-Since this implementation of BiGpairSEQ writes intermediate results to disk (to improve the efficiency of *repeated* simulations
-with different filtering options), the actual elapsed time was greater. File I/O time was not measured, but took 
-slightly less time than the simulation itself. Real elapsed time from start to finish was under 30 minutes.
+The original paper does not contain (or the author of this document failed to identify) information on sequencing depth, 
+read error probability, or the probabilities of different kinds of read error collisions. As the pre-filtering of BiGpairSEQ
+has successfully filtered out all such errors for any reasonable error rates the author has yet tested, this simulation was
+done without any sequencing errors, to reduce the processing time.

-As mentioned in the theory section, performance could be improved by implementing a more efficient algorithm for finding
-the maximum weight matching.
+With min/max occupancy thresholds of 3 and 95 wells respectively for matching, BiGpairSEQ identified:
+* 8,495 correct pairings 
+* 5 incorrect pairings 

-## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS
+for an overall pairing accuracy of 99.9992%.
+
+The total simulation time (excluding file I/O) was 28m52. The total elapsed time with file I/O was 41m23s. 
+Calculation of p-values was enabled for this simulation, increasing the overall processing time.
+
+
+## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS (old results, need updating for new version of the simulator (though resilience to varying well populations is unchanged))

 A series of BiGpairSEQ simulations were conducted using a cell sample file of 3.5 million unique T cells. From these cells,
 10 sample plate files were created. All of these sample plates had 96 wells, used an exponential distribution with a lambda of 0.6, and