improve documentation

2022-02-20 13:15:32 -06:00
parent 2c01a0211c
commit 5d24dc6f70
1 changed files with 66 additions and 20 deletions
--- a/readme.md
+++ b/readme.md
@@ -1,32 +1,78 @@
-BiGpairSEQ SIMULATOR
+# BiGpairSEQ SIMULATOR
 ### ABOUT
 This program simulates BiGpairSEQ, a graph-theory-based adaptation
 of the pairSEQ algorithm for pairing T cell receptor sequences.
-Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares 
+### USAGE
-against a null distribution, BiGpairSEQ does not do any statistical calculations 
+
-directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate. 
+Released as an executable .jar file with an interactive command-line UI.
-The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well 
+Requires Java 11 or higher (OpenJDK 17 recommended).
 Run with the command:
 `java -jar BiGpairSEQ_Sim.jar`
 Processing sample plates with tens of thousands of sequences may require large amounts 
 of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag.
 For example, to run the program with 32 gigabytes of memory, use the command:
 `java -Xmx32G -jar BiGpairSEQ_Sim.jar`
 Note that the maximum heap size should not exceed the RAM physically present on the system.
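To confirm that an `-Xmx` setting took effect, the JVM can be asked for its own maximum heap size. The following is a minimal sketch using the standard `Runtime.maxMemory()` call; the class name `HeapCheck` and the helper `toGiB` are hypothetical and are not part of BiGpairSEQ_Sim.

```java
public class HeapCheck {
    // Converts a byte count to whole gibibytes, rounding down.
    static long toGiB(long bytes) {
        return bytes / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will attempt to use,
        // which reflects the -Xmx flag when one is given.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toGiB(maxBytes) + " GiB (" + maxBytes + " bytes)");
    }
}
```

On Java 11 or higher this can be run directly from source with `java -Xmx32G HeapCheck.java`. Depending on the garbage collector, the reported value may be slightly lower than the amount requested.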
 ### OUTPUT
 ### THEORY
 Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
 against a null distribution, BiGpairSEQ does not do any statistical calculations
 directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
 The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well
 is connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
 (Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy
-The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight 
+The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
 matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
-USAGE
+This is a very well-studied combinatorial optimization problem, with many known solutions.
-Released as an executable .jar file with interactive, command line UI
+The best currently known algorithm for bipartite graphs with integer weights is from [Duan and Su][2].
-Usage: java -jar BiGpairSEQ_Sim.jar
+For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
 in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
 algorithm is ideal for BiGpairSEQ.
-Large cell sample or sample plate files may require large amounts of RAM.
+Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful,
-It is often desirable to increase the JVM memory allocation with the -Xmx flag
+and it is not implemented by the graph theory library used in this simulator. So this program
-For example, to run the program with 32 gigabytes of memory, use command:
+instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
-java -Xmx32G -jar BiGpairSEQ_Sim.jar
+runtime of O(n (n log(n) + m)).
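The graph construction and matching described above can be illustrated on a toy plate. This is only a sketch under stated assumptions: it uses the Java standard library alone, and it finds the maximum weight matching by exhaustive search, which is exponential and suitable only for tiny inputs. It is not the Fredman and Tarjan algorithm, not Duan and Su's algorithm, and not the simulator's actual code. The class name `PairingSketch`, its methods, and the sequence labels (`A1`, `B1`, `AX`, ...) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PairingSketch {

    // Returns the sequences that occur in every well; their occupancy carries no signal.
    static Set<String> inAllWells(List<Set<String>> wells) {
        if (wells.isEmpty()) return new HashSet<>();
        Set<String> common = new HashSet<>(wells.get(0));
        for (Set<String> well : wells) {
            common.retainAll(well);
        }
        return common;
    }

    // Edge weight for (alpha, beta) = number of wells in which both sequences appear.
    // alphaWells.get(w) and betaWells.get(w) hold the sequences seen in well w
    // (assumption: both lists have the same length).
    static Map<String, Map<String, Integer>> edgeWeights(List<Set<String>> alphaWells,
                                                         List<Set<String>> betaWells) {
        Set<String> skipAlpha = inAllWells(alphaWells);
        Set<String> skipBeta = inAllWells(betaWells);
        Map<String, Map<String, Integer>> weights = new TreeMap<>();
        for (int w = 0; w < alphaWells.size(); w++) {
            for (String a : alphaWells.get(w)) {
                if (skipAlpha.contains(a)) continue;
                for (String b : betaWells.get(w)) {
                    if (skipBeta.contains(b)) continue;
                    weights.computeIfAbsent(a, k -> new TreeMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    // Exhaustive search: alpha i is either left unmatched or paired with an unused beta.
    static int bestWeight(List<String> alphas, int i, Set<String> usedBetas,
                          Map<String, Map<String, Integer>> weights) {
        if (i == alphas.size()) return 0;
        int best = bestWeight(alphas, i + 1, usedBetas, weights);
        for (Map.Entry<String, Integer> e : weights.get(alphas.get(i)).entrySet()) {
            if (usedBetas.add(e.getKey())) { // true only if this beta was still free
                int total = e.getValue() + bestWeight(alphas, i + 1, usedBetas, weights);
                best = Math.max(best, total);
                usedBetas.remove(e.getKey());
            }
        }
        return best;
    }

    // Total weight of a maximum weight matching (vertex-disjoint edges).
    static int maxMatchingWeight(Map<String, Map<String, Integer>> weights) {
        return bestWeight(new ArrayList<>(weights.keySet()), 0, new HashSet<>(), weights);
    }

    public static void main(String[] args) {
        // Toy plate with four wells; AX and BX appear in every well and are filtered out.
        List<Set<String>> alphaWells = List.of(
                Set.of("A1", "A2", "AX"), Set.of("A1", "AX"),
                Set.of("A2", "AX"), Set.of("A1", "A2", "AX"));
        List<Set<String>> betaWells = List.of(
                Set.of("B1", "B2", "BX"), Set.of("B1", "BX"),
                Set.of("B2", "BX"), Set.of("B1", "B2", "BX"));
        Map<String, Map<String, Integer>> weights = edgeWeights(alphaWells, betaWells);
        System.out.println("Edge weights: " + weights);
        System.out.println("Maximum matching weight: " + maxMatchingWeight(weights));
    }
}
```

On this toy plate the A1/B1 and A2/B2 edges each have weight 3 and the crossed edges have weight 2, so the maximum weight matching pairs A1 with B1 and A2 with B2 for a total weight of 6.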
-Requires Java11 or higher (Openjdk-17 recommended)
+### TODO
-pairSEQ citation:
+* Try invoking GC at end of workloads
-Howie, B., Sherwood, A. M., et. al.
+* Hold graph data in memory until another graph is read in
-"High-throughput pairing of T cell receptor alpha and beta sequences."
+* Enable GraphML output in addition to serialized object binaries, for data portability
-Sci. Transl. Med. 7, 301ra131 (2015)
+  * Custom vertex type with attribute for sequence occupancy?
 * Re-implement CDR1 matching method
 * Re-implement command line arguments, to enable statistical simulation studies
 * Implement Duan and Su's maximum weight matching algorithms
   * Add controllable algorithm-type parameter?
 * Test whether a pairing heap (currently used) or a Fibonacci heap is more efficient for the current matching algorithm
   * In theory a Fibonacci heap should be more efficient, but its complexity overhead may eliminate the theoretical advantage
   * Add controllable heap-type parameter?
 * Implement sample plates with random numbers of T cells per well
   * BiGpairSEQ is resilient to variations in well populations; pairSEQ is not
-Simulation by Eugene Fischer, 2021-2022
+### EXTERNAL LIBRARIES USED
 ### CITATIONS
 * Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)
 * [2] Duan and Su: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf
 ### ACKNOWLEDGEMENTS
 Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
 and explained all the biology terms the author didn't know.
 ### AUTHOR
 Eugene Fischer, 2021. UI improvements and documentation in 2022.