improve documentation

2022-02-20 13:15:32 -06:00
parent 2c01a0211c
commit 5d24dc6f70
1 changed files with 66 additions and 20 deletions
--- a/readme.md
+++ b/readme.md
@@ -1,32 +1,78 @@
-BiGpairSEQ SIMULATOR
+# BiGpairSEQ SIMULATOR
+
+### ABOUT

-ABOUT:
 This program simulates BiGpairSEQ, a graph theory based adaptation
 of the pairSEQ algorithm for pairing T cell receptor sequences.

-Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares 
-against a null distribution, BiGpairSEQ does not do any statistical calculations 
-directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate. 
-The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well 
+### USAGE
+
+Released as an executable .jar file with an interactive command-line UI.
+Requires Java 11 or higher (openjdk-17 recommended).
+
+Run with the command:
+
+`java -jar BiGpairSEQ_Sim.jar`
+
+Processing sample plates with tens of thousands of sequences may require large amounts 
+of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag.
+For example, to run the program with 32 gigabytes of memory, use the command:
+
+`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
+
+Note that the heap size should not exceed the RAM physically present on the system.
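To confirm that an -Xmx setting took effect, a small standalone check can report the heap ceiling the JVM sees. This is only a sketch and is not part of BiGpairSEQ_Sim; the class name `HeapCheck` and its helper methods are hypothetical.

```java
import java.util.Locale;

class HeapCheck {

    /** Maximum amount of memory the JVM will attempt to use for the heap, in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Formats a byte count as gibibytes with one decimal place. */
    static String toGiB(long bytes) {
        return String.format(Locale.ROOT, "%.1f GiB", bytes / (1024.0 * 1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + toGiB(maxHeapBytes()));
    }
}
```

With Java 11 or higher the file can be launched directly from source, for example `java -Xmx32G HeapCheck.java`. The reported figure may differ slightly from the requested value, depending on the JVM and garbage collector.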
+
+### OUTPUT
+
+### THEORY
+
+Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
+against a null distribution, BiGpairSEQ does not do any statistical calculations
+directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
+The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well
 is connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
 (Sequences present in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy.)
-The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight 
+The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
 matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
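The graph construction and matching described above can be sketched in a few dozen lines of Java. This is an illustration under stated assumptions, not the simulator's code: the class `PlateGraphSketch`, its method names, and the representation of a plate as two parallel lists of per-well sequence sets are all hypothetical. The matching step is a brute-force search that is exponential in the number of sequences and is swapped in only to make the objective concrete; it is not the algorithm the simulator uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PlateGraphSketch {

    /** Sequences that occur in every well; these are excluded from the graph. */
    static Set<String> inAllWells(List<Set<String>> wells) {
        if (wells.isEmpty()) {
            return new HashSet<>();
        }
        Set<String> common = new HashSet<>(wells.get(0));
        for (Set<String> well : wells) {
            common.retainAll(well);
        }
        return common;
    }

    /**
     * Edge weights of the bipartite graph: maps each alpha to the betas it
     * shares a well with, and each such beta to the number of shared wells.
     * alphaWells.get(i) and betaWells.get(i) must describe the same well.
     */
    static Map<String, Map<String, Integer>> edgeWeights(
            List<Set<String>> alphaWells, List<Set<String>> betaWells) {
        Set<String> skipAlphas = inAllWells(alphaWells);
        Set<String> skipBetas = inAllWells(betaWells);
        Map<String, Map<String, Integer>> weights = new HashMap<>();
        for (int i = 0; i < alphaWells.size(); i++) {
            for (String a : alphaWells.get(i)) {
                if (skipAlphas.contains(a)) continue;
                for (String b : betaWells.get(i)) {
                    if (skipBetas.contains(b)) continue;
                    weights.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    /** Exhaustive maximum weight matching. Exponential time: tiny inputs only. */
    static Map<String, String> maxWeightMatching(Map<String, Map<String, Integer>> weights) {
        List<String> alphas = new ArrayList<>(new TreeSet<>(weights.keySet()));
        Map<String, String> best = new HashMap<>();
        search(alphas, 0, weights, new HashMap<>(), new HashSet<>(), 0, best, new int[] {-1});
        return best;
    }

    private static void search(List<String> alphas, int i,
            Map<String, Map<String, Integer>> weights, Map<String, String> chosen,
            Set<String> usedBetas, int total, Map<String, String> best, int[] bestTotal) {
        if (i == alphas.size()) {
            // Every alpha has been decided; keep this matching if it is the heaviest so far.
            if (total > bestTotal[0]) {
                bestTotal[0] = total;
                best.clear();
                best.putAll(chosen);
            }
            return;
        }
        String a = alphas.get(i);
        // Option 1: leave this alpha unmatched.
        search(alphas, i + 1, weights, chosen, usedBetas, total, best, bestTotal);
        // Option 2: match it to each neighbouring beta that is still free.
        for (Map.Entry<String, Integer> e : weights.get(a).entrySet()) {
            String b = e.getKey();
            if (usedBetas.add(b)) {
                chosen.put(a, b);
                search(alphas, i + 1, weights, chosen, usedBetas, total + e.getValue(), best, bestTotal);
                chosen.remove(a);
                usedBetas.remove(b);
            }
        }
    }
}
```

As a small made-up example, take three wells holding alphas {A1, A2, AX}, {A1, AX}, {A2, AX} and betas {B1, B2, BX}, {B1, BX}, {B2, BX}. AX and BX occur in every well and are filtered out. The remaining edges are A1/B1 and A2/B2 with weight 2, and A1/B2 and A2/B1 with weight 1, so the maximum weight matching pairs A1 with B1 and A2 with B2.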

-USAGE
-Released as an executable .jar file with interactive, command line UI
-Usage: java -jar BiGpairSEQ_Sim.jar
+This is a very well-studied combinatorial optimization problem, with many known solutions.
+The best currently-known algorithm for bipartite graphs with integer weights is from [Duan and Su][2].
+For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
+in O(m sqrt(n) log(N)) time. Because BiGpairSEQ edge weights are integer well counts, this asymptotically
+fastest known algorithm is ideal for BiGpairSEQ.

-Large cell sample or sample plate files may require large amounts of RAM.
-It is often desirable to increase the JVM memory allocation with the -Xmx flag
-For example, to run the program with 32 gigabytes of memory, use command:
-java -Xmx32G -jar BiGpairSEQ_Sim.jar
+Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful,
+and it is not implemented by the graph theory library used in this simulator. So this program
+instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
+runtime of O(n (n log(n) + m)).
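As a rough illustration of the gap between the two bounds, consider the dense case in which every alpha shares a well with every beta, so that m = n^2. The dense case is an assumption chosen for easy arithmetic, not a claim about typical sample plates:

```latex
\begin{aligned}
\text{Duan and Su:} \quad & O\!\left(m\sqrt{n}\,\log N\right) = O\!\left(n^{2.5}\log N\right) \\
\text{Fredman and Tarjan:} \quad & O\!\left(n\,(n\log n + m)\right) = O\!\left(n^{3}\right)
\end{aligned}
```

Since an edge weight counts shared wells, N is at most the number of wells on the plate, so the log N factor stays small.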

-Requires Java11 or higher (Openjdk-17 recommended)
+### TODO

-pairSEQ citation:
-Howie, B., Sherwood, A. M., et. al.
-"High-throughput pairing of T cell receptor alpha and beta sequences."
-Sci. Transl. Med. 7, 301ra131 (2015)
+* Try invoking GC at end of workloads
+* Hold graph data in memory until another graph is read in
+* Enable GraphML output in addition to serialized object binaries, for data portability
+  * Custom vertex type with attribute for sequence occupancy?
+* Re-implement CDR1 matching method
+* Re-implement command line arguments, to enable statistical simulation studies
+* Implement Duan and Su's maximum weight matching algorithms
+  * Add controllable algorithm-type parameter?
+* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
+  * In theory a Fibonacci heap should be more efficient, but its implementation overhead may eliminate the theoretical advantage
+  * Add controllable heap-type parameter?
+* Implement sample plates with random numbers of T cells per well
+  * BiGpairSEQ is resilient to variations in well populations; pairSEQ is not

-Simulation by Eugene Fischer, 2021-2022
+### EXTERNAL LIBRARIES USED
+
+
+
+### CITATIONS
+* Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)
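+* Duan, R., Su, H.-H. "A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs." [PDF][2]
+
+[2]: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf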
+
+### ACKNOWLEDGEMENTS
+Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
+and explained all the biology terms the author didn't know.
+
+### AUTHOR
+Eugene Fischer, 2021. UI improvements and documentation in 2022.