Add data on randomized well population behavior

Update readme
2022-03-02 18:55:19 -06:00 · 2022-03-02 12:39:40 -06:00
1 changed files with 49 additions and 6 deletions
--- a/readme.md
+++ b/readme.md
@@ -264,29 +264,71 @@ Example output:
 P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison only, 
 using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)

-### PERFORMANCE
-Performance details of the example excerpted above:
+## PERFORMANCE

 On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM (half of which was allocated to the Java Virtual Machine), and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel), 
 the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas,
-taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution.
+taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution (lambda 0.6).

 With minimum and maximum occupancy thresholds of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151 
 correct pairings and 18 incorrect pairings, for an accuracy of 99.652%.

-The simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.
+The total simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.

 Since this implementation of BiGpairSEQ writes intermediate results to disk (to improve the efficiency of *repeated* simulations
 with different filtering options), the actual elapsed time was greater. File I/O time was not measured directly, but appears to have taken 
 slightly less time than the simulation itself. Real elapsed time from start to finish was under 30 minutes.

+## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS
+
+A series of BiGpairSEQ simulations was conducted using a cell sample file of 3.5 million unique T cells. From these cells,
+10 sample plate files were created. Each sample plate had 96 wells, used an exponential distribution with a lambda of 0.6, and
+had a sequence dropout rate of 10%.
+
+The well populations of the plates were:
+* One sample plate with 1000 T cells/well
+* One sample plate with 2000 T cells/well
+* One sample plate with 3000 T cells/well
+* One sample plate with 4000 T cells/well
+* One sample plate with 5000 T cells/well
+* Five sample plates with each individual well's population randomized, from 1000 to 5000 T cells. (Average population ~3000 T cells/well.)
+
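A minimal sketch of how the per-well populations of a randomized plate might be drawn. The uniform integer draw, the function name `random_well_populations`, and the seed are assumptions for illustration only; the readme states just the range (1000 to 5000 T cells) and the approximate average (~3000 T cells/well).

```python
import random
from statistics import mean


def random_well_populations(wells=96, low=1000, high=5000, seed=None):
    """Draw one population per well, uniformly from [low, high].

    The uniform distribution is an assumption; the readme only gives the range.
    """
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(wells)]


populations = random_well_populations(seed=42)
# Report well count, extremes, and the mean population across the plate.
print(len(populations), min(populations), max(populations), round(mean(populations)))
```

Under this assumption the mean of 96 draws should land near 3000, which matches the average population quoted for the randomized plates.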
+All BiGpairSEQ simulations were run with a low overlap threshold of 3 and a high overlap threshold of 94.
+
+Constant well population plate results:
+
+| |1000 Cell/Well Plate|2000 Cell/Well Plate|3000 Cell/Well Plate|4000 Cell/Well Plate|5000 Cell/Well Plate|
+|---|---|---|---|---|---|
+|Total Alphas Found|6407|7330|7936|8278|8553|
+|Total Betas Found|6405|7333|7968|8269|8582|
+|Pairing Attempt Rate|0.661|0.653|0.600|0.579|0.559|
+|Correct Pairing Count|4231|4749|4723|4761|4750|
+|Incorrect Pairing Count|3|34|40|26|29|
+|Pairing Error Rate|0.000709|0.00711|0.00840|0.00543|0.00607|
+|Simulation Time (Seconds)|500|643|700|589|598|
+
+Randomized well population plate results:
+
+| |Random Plate 1|Random Plate 2|Random Plate 3|Random Plate 4|Random Plate 5|Average|
+|---|---|---|---|---|---|---|
+|Total Alphas Found|7853|7904|7964|7898|7917|7907|
+|Total Betas Found|7851|7891|7920|7910|7894|7893|
+|Pairing Attempt Rate|0.607|0.610|0.601|0.605|0.603|0.605|
+|Correct Pairing Count|4718|4782|4721|4755|4731|4741|
+|Incorrect Pairing Count|51|35|42|27|29|37|
+|Pairing Error Rate|0.0107|0.00727|0.00882|0.00565|0.00609|0.00771|
+|Simulation Time (Seconds)|590|677|730|618|615|646|
+
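+These results show that BiGpairSEQ treats a sample plate with a highly variable number of T cells/well
+roughly as though it had a constant well population equal to the average well population: the randomized plates averaged
+4741 correct pairings and a pairing error rate of 0.00771, close to the 4723 and 0.00840 of the constant 3000 cell/well plate.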
+
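The rate rows in both tables are consistent with the arithmetic sketched below. The formulas are inferred from the tabulated numbers, not taken from the BiGpairSEQ source, and the function names are hypothetical: the error rate is taken as incorrect pairings over all pairings made, and the attempt rate as all pairings made over the smaller of the alpha and beta counts.

```python
def pairing_error_rate(correct, incorrect):
    """Fraction of all pairings made that were incorrect (inferred formula)."""
    return incorrect / (correct + incorrect)


def pairing_attempt_rate(correct, incorrect, alphas, betas):
    """Pairings made, relative to the smaller of the two chain counts (inferred formula)."""
    return (correct + incorrect) / min(alphas, betas)


# Values copied from the tables: the constant 3000 cell/well plate and Random Plate 1.
plates = {
    "3000 Cell/Well Plate": dict(correct=4723, incorrect=40, alphas=7936, betas=7968),
    "Random Plate 1": dict(correct=4718, incorrect=51, alphas=7853, betas=7851),
}

for name, p in plates.items():
    error = pairing_error_rate(p["correct"], p["incorrect"])
    attempt = pairing_attempt_rate(p["correct"], p["incorrect"], p["alphas"], p["betas"])
    print(f"{name}: error rate {error:.3g}, attempt rate {attempt:.3f}")
```

Recomputing the rates this way makes it easy to compare a randomized plate against the constant-population plate with the same average well population.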
 ## TODO

 * ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE
 * ~~Hold graph data in memory until another graph is read-in? ABANDONED UNABANDONED~~ DONE
  * ~~*No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*~~
  * Might have figured out a way to do it, by taking edges out and then putting them back into the graph. This may actually be possible.
-  * It is possible, though the modifications to the graph incur their own performance penalties. Need testing to see which option is best.
+  * It is possible, though the modifications to the graph incur their own performance penalties. Needs testing to see which option is best. It may be computer-specific.
 * ~~Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm~~ DONE
  * ~~in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage~~
  * ~~Add controllable heap-type parameter?~~
@@ -300,6 +342,7 @@ slightly less time than the simulation itself. Real elapsed time from start to f
    * _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Tidyverse transposes CSV matrices in R, might not even be worth fixing._
 * ~~Enable GraphML output in addition to serialized object binaries, for data portability~~ DONE
  * ~~Custom vertex type with attribute for sequence occupancy?~~ ABANDONED
+    * Advantage: would eliminate the need to use maps to associate vertices with sequences, which would make the code easier to understand.
    * Have a branch where this is implemented, but there's a bug that broke matching. Don't currently have time to fix.
 * ~~Re-implement command line arguments, to enable scripting and statistical simulation studies~~ DONE
 * Re-implement CDR1 matching method
@@ -319,7 +362,7 @@ slightly less time than the simulation itself. Real elapsed time from start to f
 * [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms
 * [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm
 * [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output
-* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting re-implementation**.)
+* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting.

 ## ACKNOWLEDGEMENTS
 BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original
Author	SHA1	Message	Date
efischer	03e8d31210	Add data on randomized well population behavior	2022-03-02 18:55:19 -06:00
efischer	582dc3ef40	Update readme	2022-03-02 12:39:40 -06:00