improve documentation

2022-02-20 13:15:32 -06:00
parent 2c01a0211c
commit 5d24dc6f70


@@ -1,32 +1,78 @@
BiGpairSEQ SIMULATOR # BiGpairSEQ SIMULATOR
### ABOUT
ABOUT:
This program simulates BiGpairSEQ, a graph theory based adaptation This program simulates BiGpairSEQ, a graph theory based adaptation
of the pairSEQ algorithm for pairing T cell receptor sequences. of the pairSEQ algorithm for pairing T cell receptor sequences.
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares ### USAGE
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate. Released as an executable .jar file with an interactive command-line UI.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well Requires Java 11 or higher (OpenJDK 17 recommended).
Run with the command:
`java -jar BiGpairSEQ_Sim.jar`
Processing sample plates with tens of thousands of sequences may require large amounts
of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag.
For example, to run the program with 32 gigabytes of memory, use the command:
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
Note that the maximum heap size should not exceed the RAM physically present on the system.
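To confirm that an `-Xmx` setting took effect, the JVM can be asked for its heap ceiling. The following is a minimal sketch; the class name `HeapCheck` and the helper `toMiB` are hypothetical and are not part of BiGpairSEQ_Sim.

```java
class HeapCheck {
    /** Converts a byte count to whole mebibytes (illustrative helper). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will attempt to use,
        // which reflects the -Xmx value when that flag is given.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxBytes) + " MiB");
    }
}
```

Running it with the same flag as the simulator, for example `java -Xmx32G HeapCheck.java`, should report a value close to the requested heap size.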
### OUTPUT
### THEORY
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear. are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy (Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value. matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
USAGE This is a very well-studied combinatorial optimization problem, with many known solutions.
Released as an executable .jar file with interactive, command line UI The best currently-known algorithm for bipartite graphs with integer weights is from [Duan and Su][2].
Usage: java -jar BiGpairSEQ_Sim.jar For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
algorithm is ideal for BiGpairSEQ.
Large cell sample or sample plate files may require large amounts of RAM. Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful,
It is often desirable to increase the JVM memory allocation with the -Xmx flag and it is not implemented by the graph theory library used in this simulator. So this program
For example, to run the program with 32 gigabytes of memory, use command: instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
java -Xmx32G -jar BiGpairSEQ_Sim.jar runtime of O(n (n log(n) + m)).
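The graph construction described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the simulator's code: the class `WellGraphSketch`, its method names, and the representation of a plate as one set of TCRA and one set of TCRB sequences per well are all hypothetical. The matching step is a brute-force search that only suits tiny inputs; it stands in for, and is not, the Fredman and Tarjan algorithm or the Duan and Su algorithm.

```java
import java.util.*;

class WellGraphSketch {
    /** Counts how many wells each sequence appears in. */
    static Map<String, Integer> occupancy(List<Set<String>> wells) {
        Map<String, Integer> occ = new HashMap<>();
        for (Set<String> well : wells)
            for (String seq : well) occ.merge(seq, 1, Integer::sum);
        return occ;
    }

    /** Edge weights keyed by "alpha|beta": the number of wells both share. */
    static Map<String, Integer> edgeWeights(List<Set<String>> alphaWells,
                                            List<Set<String>> betaWells) {
        int wells = alphaWells.size();
        Map<String, Integer> aOcc = occupancy(alphaWells);
        Map<String, Integer> bOcc = occupancy(betaWells);
        Map<String, Integer> weights = new TreeMap<>();
        for (int w = 0; w < wells; w++) {
            for (String a : alphaWells.get(w)) {
                if (aOcc.get(a) == wells) continue; // present in every well: filtered out
                for (String b : betaWells.get(w)) {
                    if (bOcc.get(b) == wells) continue;
                    weights.merge(a + "|" + b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    /** Brute-force maximum weight matching on a small weight matrix
     *  (rows = TCRA, columns = TCRB, 0 = no edge). Exponential time. */
    static int maxMatchingWeight(int[][] w, int row, boolean[] usedCol) {
        if (row == w.length) return 0;
        int best = maxMatchingWeight(w, row + 1, usedCol); // leave this row unmatched
        for (int c = 0; c < usedCol.length; c++) {
            if (!usedCol[c] && w[row][c] > 0) {
                usedCol[c] = true;
                best = Math.max(best, w[row][c] + maxMatchingWeight(w, row + 1, usedCol));
                usedCol[c] = false;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three made-up wells; "AX" occurs in all of them and is filtered out.
        List<Set<String>> alphaWells = List.of(
            Set.of("A1", "A2", "AX"), Set.of("A1", "AX"), Set.of("A2", "AX"));
        List<Set<String>> betaWells = List.of(
            Set.of("B1", "B2"), Set.of("B1"), Set.of("B2"));
        System.out.println(edgeWeights(alphaWells, betaWells));
        // Weight matrix for rows A1, A2 and columns B1, B2 from the edges above.
        int[][] w = {{2, 1}, {1, 2}};
        System.out.println(maxMatchingWeight(w, 0, new boolean[2]));
    }
}
```

In this toy plate, the heaviest vertex-disjoint edge set pairs A1 with B1 and A2 with B2, the pairs that share the most wells.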
Requires Java11 or higher (Openjdk-17 recommended) ### TODO
pairSEQ citation: * Try invoking GC at end of workloads
Howie, B., Sherwood, A. M., et al. * Hold graph data in memory until another graph is read in
"High-throughput pairing of T cell receptor alpha and beta sequences." * Enable GraphML output in addition to serialized object binaries, for data portability
Sci. Transl. Med. 7, 301ra131 (2015) * Custom vertex type with attribute for sequence occupancy?
* Re-implement CDR1 matching method
* Re-implement command line arguments, to enable statistical simulation studies
* Implement Duan and Su's maximum weight matching algorithms
* Add controllable algorithm-type parameter?
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
* Add controllable heap-type parameter?
* Implement sample plates with random numbers of T cells per well
* BiGpairSEQ is resilient to variations in well populations; pairSEQ is not
Simulation by Eugene Fischer, 2021-2022 ### EXTERNAL LIBRARIES USED
### CITATIONS
* Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)

[2]: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf
### ACKNOWLEDGEMENTS
Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
and explained all the biology terms the author didn't know.
### AUTHOR
Eugene Fischer, 2021. UI improvements and documentation in 2022.