Update performance section and TODO

2022-10-01 18:12:33 -05:00
parent bde6da3076
commit cda25a2c62
1 changed files with 56 additions and 15 deletions
--- a/readme.md
+++ b/readme.md
@@ -473,7 +473,9 @@ Several BiGpairSEQ simulations were performed on a home computer with the follow
 * 2TB PCIe 3.0 SSD
 * Linux Mint 21 (5.15 kernel)

-### SAMPLE PLATES WITH VARYING NUMBERS OF CELLS PER WELL (old results, need updating for new version of the simulator (though resilience to varying well populations is unchanged))
+### SAMPLE PLATES WITH VARYING NUMBERS OF CELLS PER WELL
+NOTE: These results were obtained with an earlier version of BiGpairSEQ_Sim and should be re-run with the current version.
+The observed behavior is not expected to change, however.

 A series of BiGpairSEQ simulations were conducted using a cell sample file of 3.5 million unique T cells. From these cells,
 10 sample plate files were created. All of these sample plates had 96 wells, used an exponential distribution with a lambda of 0.6, and
@@ -518,32 +520,69 @@ The average results for the randomized plates are closest to the constant plate
 This and several other tests indicate that BiGpairSEQ treats a sample plate with a highly variable number of T cells/well
 roughly as though it had a constant well population equal to the plate's average well population.

-### EXPERIMENTS FROM THE 2015 pairSEQ PAPER
-#### Experiment 1 (Need to be re-tried with different lambda values.)
+### SIMULATING EXPERIMENTS FROM THE 2015 pairSEQ PAPER
+#### Experiment 1
 This simulation was an attempt to replicate the conditions of experiment 1 from the 2015 pairSEQ paper: a matching was found for a 
-96-well sample plate with 4,000 T cells/well comprising ~11,900 TCRAs and TCRBs, taken from a sample of 8,400,000 
-distinct cells with an exponential frequency distribution (lambda 0.6). The sequence dropout rate was 10%, as the analysis
-from the original paper concluded that most TCR sequences "have less than a 10% chance of going unobserved." (Howie, et al. 2015)
+96-well sample plate with 4,000 T cells/well, taken from a sample of 8,400,000 
+distinct cells sampled with an exponential frequency distribution. Examination of Figure 4C from the paper seems to show the points
+(-5, 4) and (-4.5, 3.3) on the line at the boundary of the shaded region, so a lambda value of 1.4 was used for the 
+exponential distribution.
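
One plausible reading of where the value 1.4 comes from is the magnitude of the slope of the line through the two points read off Figure 4C. The readme does not spell this step out, so the sketch below is an assumption about the derivation, and the helper name `slope_through` is hypothetical (it is not part of BiGpairSEQ_Sim).

```python
# Sketch only: assumes lambda was taken as |slope| of the line through the
# two points quoted from Figure 4C. This is an inference, not a documented step.

def slope_through(p1, p2):
    """Return the slope of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("points must have distinct x values")
    return (y2 - y1) / (x2 - x1)

# Points as quoted in the readme text.
point_a = (-5.0, 4.0)
point_b = (-4.5, 3.3)

slope = slope_through(point_a, point_b)
lambda_estimate = abs(slope)

print(f"slope = {slope:.2f}, lambda estimate = {lambda_estimate:.2f}")
```

Because both points are read by eye from a figure, the estimate is only as good as that reading; small shifts in either point change the result noticeably.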
+
+The sequence dropout rate was 10%, as the analysis in the paper concluded that most TCR 
+sequences "have less than a 10% chance of going unobserved." (Howie, et al. 2015) Given this choice of 10%, the simulated
+sample plate is likely to have more sequence dropout, and thus greater error, than the real experiment.

 The original paper does not contain (or the author of this document failed to identify) information on sequencing depth, 
 read error probability, or the probabilities of different kinds of read error collisions. As the pre-filtering of BiGpairSEQ
 has successfully filtered out all such errors for any reasonable error rates the author has yet tested, this simulation was
-done without any sequencing errors, to reduce the processing time.
+done without simulating any sequencing errors, to reduce the processing time.

-With min/max occupancy thresholds of 3 and 95 wells respectively for matching, BiGpairSEQ identified:
-* 8,495 correct pairings 
-* 5 incorrect pairings 
+This simulation was performed 5 times with min/max occupancy thresholds of 3 and 95 wells respectively for matching.

-for an overall pairing accuracy of 99.9992%.
+| |Run 1|Run 2|Run 3|Run 4|Run 5| Average|
+|---|---|---|---|---|---|---|
+|Total pairs|4398|4420|4404|4409|4414|4409|
+|Correct pairs|4322|4313|4337|4336|4339|4329.4|
+|Incorrect pairs|76|107|67|73|75|79.6|
+|Error rate|0.0173|0.0242|0.0152|0.0166|0.0170|0.018|
+|Simulation time (seconds)|697|484|466|473|463|516.6|

-The total simulation time (excluding file I/O) was 28m52. The total elapsed time with file I/O was 41m23s. 
-Calculation of p-values was enabled for this simulation, increasing the overall processing time.
+The experiment in the original paper called 4143 pairs with a false discovery rate of 0.01.

-Note that the frequency distribution of T cell clones in this simulation is only roughly that of 
+Given the roughness of the estimation for the cell frequency distribution of the original experiment and the likely 
+higher rate of sequence dropout in the simulation, these simulated results match the real experiment fairly well.

-#### Experiment 2
+#### Experiment 3
+To simulate experiment 3 from the original paper, a matching was made for a 96-well sample plate with 160,000 T cells/well,
+taken from a sample of 4.5 million distinct T cells sampled with an exponential frequency distribution (lambda 1.4). The 
+sequence dropout rate was again 10%, and no sequencing errors were simulated. Once again, deviation from the original 
+experiment is expected due to the roughness of the estimated frequency distribution, and due to the high sequence dropout 
+rate.

+Results metadata:
+```
+# total alphas read from plate: 6929
+# total betas read from plate: 6939
+# alphas in graph (after pre-filtering): 4452
+# betas in graph (after pre-filtering): 4461
+# high overlap threshold for pairing: 95
+# low overlap threshold for pairing: 3
+# minimum overlap percent for pairing: 0
+# maximum occupancy difference for pairing: 100
+# pairing attempt rate: 0.767
+# correct pairing count: 3233
+# incorrect pairing count: 182
+# pairing error rate: 0.0533
+# time to generate graph (seconds): 40
+# time to pair sequences (seconds): 230
+# total simulation time (seconds): 270
+```
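
The reported pairing error rate can be checked directly from the correct and incorrect pairing counts in the metadata. The sketch below does that arithmetic; the function names are hypothetical and are not taken from the simulator's code. The attempt-rate calculation additionally assumes that the denominator is the number of alphas in the graph, which is consistent with the reported 0.767 but is not confirmed by the readme.

```python
# Sketch only: recomputes two metadata values from the counts shown above.

def pairing_error_rate(correct, incorrect):
    """Fraction of called pairs that are incorrect."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no pairs were called")
    return incorrect / total

def pairing_attempt_rate(correct, incorrect, sequences_in_graph):
    """Called pairs divided by sequences available in the graph.

    Assumption: the denominator is the alpha count after pre-filtering.
    """
    if sequences_in_graph <= 0:
        raise ValueError("graph must contain at least one sequence")
    return (correct + incorrect) / sequences_in_graph

# Values copied from the Experiment 3 results metadata.
correct_pairs = 3233
incorrect_pairs = 182
alphas_in_graph = 4452

error_rate = pairing_error_rate(correct_pairs, incorrect_pairs)
attempt_rate = pairing_attempt_rate(correct_pairs, incorrect_pairs, alphas_in_graph)

print(f"pairing error rate: {error_rate:.4f}")
print(f"pairing attempt rate: {attempt_rate:.3f}")
```

The same `pairing_error_rate` calculation reproduces the per-run error rates in the Experiment 1 table, for example 76 incorrect pairs out of 4398 total for Run 1.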

+The simulation only found 6929 distinct TCRAs and 6939 TCRBs on the sample plate, orders of magnitude fewer than the number of
+pairs called in the pairSEQ experiment. These results show that at very high sampling depths, the differences in the 
+underlying frequency distribution drastically affect the results. The real distribution clearly has a much longer "tail" 
+than the simulated exponential distribution. Implementing a way to exert finer control over the sampling distribution from 
+the file of distinct cells may enable better simulated replication of this experiment.

 ## TODO

@@ -576,6 +615,8 @@ Note that the frequency distribution of T cell clones in this simulation is only
  * Individual well data from the SequenceRecords could be included, if there's ever a reason for it
 * ~~Implement simulation of sequences being misread as other real sequence~~ DONE
 * Update matching metadata output options in CLI
+* Add frequency distribution details to metadata output
+  * Need to make an enum for the different distribution types, refactor the Plate class and user interfaces, add the necessary fields to GraphWithMapData, and then call it from Simulator
 * Update performance data in this readme
 * Add section to ReadMe describing data filtering methods.
 * Re-implement CDR1 matching method