improve documentation

2022-02-20 13:15:32 -06:00
parent 2c01a0211c
commit 5d24dc6f70

@@ -1,32 +1,78 @@
BiGpairSEQ SIMULATOR
# BiGpairSEQ SIMULATOR
### ABOUT
ABOUT:
This program simulates BiGpairSEQ, a graph-theory-based adaptation
of the pairSEQ algorithm for pairing T cell receptor sequences.
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
### USAGE
Released as an executable .jar file with an interactive command-line UI.
Requires Java 11 or higher (OpenJDK 17 recommended).
Run with the command:
`java -jar BiGpairSEQ_Sim.jar`
Processing sample plates with tens of thousands of sequences may require large amounts
of RAM. It is often desirable to increase the JVM maximum heap size with the `-Xmx` flag.
For example, to run the program with 32 gigabytes of memory, use the command:
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
Note that the heap size you request should not exceed the RAM physically present on the system.
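To confirm that a `-Xmx` setting took effect, the JVM can report its own heap ceiling. The following is a minimal sketch using the standard `Runtime.maxMemory()` call; the class name `HeapCheckSketch` and the method `maxHeapMiB` are hypothetical and are not part of BiGpairSEQ_Sim.

```java
// Illustrative sketch only: HeapCheckSketch is a hypothetical helper,
// not a class shipped with the simulator.
class HeapCheckSketch {

    // Maximum heap the JVM will attempt to use, in mebibytes.
    // Runtime.maxMemory() returns Long.MAX_VALUE when no limit is set.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to the JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Saved as `HeapCheckSketch.java`, it can be launched with the same flag as the simulator, for example `java -Xmx32G HeapCheckSketch.java`, and the reported value should be close to the requested size.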
### OUTPUT
### THEORY
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well
is connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
(Sequences present in every well are filtered out prior to creating the graph, as there is no signal in their occupancy
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
matching on a bipartite graph, that is, the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
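The graph construction and the matching objective described above can be sketched on a toy plate. This is an illustration only: the class `WellOverlapSketch`, its method names, and the three-well example data are hypothetical, and the exhaustive search stands in for the real matching algorithm purely to show what is being maximized. It takes exponential time and is only usable on tiny inputs. Filtering of sequences present in every well is omitted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: names and data are hypothetical, not the simulator's code.
class WellOverlapSketch {

    // weights.get(a).get(b) = number of wells containing both alpha a and beta b.
    // alphaWells.get(w) and betaWells.get(w) describe the same well w.
    static Map<String, Map<String, Integer>> edgeWeights(List<Set<String>> alphaWells,
                                                          List<Set<String>> betaWells) {
        Map<String, Map<String, Integer>> weights = new HashMap<>();
        for (int w = 0; w < alphaWells.size(); w++) {
            for (String a : alphaWells.get(w)) {
                for (String b : betaWells.get(w)) {
                    weights.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    // Sum of the edge weights of a matching given as alpha -> beta pairs.
    static int totalWeight(Map<String, String> matching,
                           Map<String, Map<String, Integer>> weights) {
        int sum = 0;
        for (Map.Entry<String, String> pair : matching.entrySet()) {
            sum += weights.get(pair.getKey()).get(pair.getValue());
        }
        return sum;
    }

    // Exhaustive maximum weight matching: for each alpha, try leaving it
    // unmatched or pairing it with any beta not already used.
    static Map<String, String> bestMatching(List<String> alphas, int i, Set<String> usedBetas,
                                            Map<String, Map<String, Integer>> weights) {
        if (i == alphas.size()) {
            return new HashMap<>();
        }
        Map<String, String> best = bestMatching(alphas, i + 1, usedBetas, weights);
        int bestWeight = totalWeight(best, weights);
        for (String beta : weights.getOrDefault(alphas.get(i), Map.of()).keySet()) {
            if (usedBetas.add(beta)) { // true only if beta was not already taken
                Map<String, String> candidate = bestMatching(alphas, i + 1, usedBetas, weights);
                candidate.put(alphas.get(i), beta);
                int candidateWeight = totalWeight(candidate, weights);
                if (candidateWeight > bestWeight) {
                    best = candidate;
                    bestWeight = candidateWeight;
                }
                usedBetas.remove(beta);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy plate with three wells.
        List<Set<String>> alphaWells = List.of(Set.of("A1", "A2"), Set.of("A1"), Set.of("A2"));
        List<Set<String>> betaWells = List.of(Set.of("B1", "B2"), Set.of("B1"), Set.of("B2"));
        Map<String, Map<String, Integer>> weights = edgeWeights(alphaWells, betaWells);
        Map<String, String> matching =
                bestMatching(List.copyOf(weights.keySet()), 0, new HashSet<>(), weights);
        System.out.println(matching + ", total weight " + totalWeight(matching, weights));
    }
}
```

In this toy plate, A1 and B1 share two wells, as do A2 and B2, while the cross pairs share only one. The maximum weight matching therefore pairs A1 with B1 and A2 with B2, for a total weight of 4.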
USAGE
Released as an executable .jar file with interactive, command line UI
Usage: java -jar BiGpairSEQ_Sim.jar
This is a very well-studied combinatorial optimization problem, with many known solutions.
The best currently-known algorithm for bipartite graphs with integer weights is from [Duan and Su][2].
For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
algorithm is ideal for BiGpairSEQ.
Large cell sample or sample plate files may require large amounts of RAM.
It is often desirable to increase the JVM memory allocation with the -Xmx flag
For example, to run the program with 32 gigabytes of memory, use command:
java -Xmx32G -jar BiGpairSEQ_Sim.jar
Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful,
and it is not implemented by the graph theory library used in this simulator. So this program
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
runtime of O(n (n log(n) + m)).
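To make the gap between the two bounds concrete, consider the illustrative worst case of a dense graph in which every TCRA vertex shares a well with every TCRB vertex, so that m = n^2. This density is an assumption chosen for the comparison, not a property of real sample plates:

```latex
\begin{align*}
\text{Duan and Su:} \quad
  & O\!\left(m \sqrt{n} \log N\right)
    = O\!\left(n^{2} \sqrt{n} \log N\right)
    = O\!\left(n^{2.5} \log N\right) \\
\text{Fredman and Tarjan:} \quad
  & O\!\left(n \, (n \log n + m)\right)
    = O\!\left(n \, (n \log n + n^{2})\right)
    = O\!\left(n^{3}\right)
\end{align*}
```

Under that assumption, the scaling algorithm is asymptotically faster whenever log(N) grows more slowly than sqrt(n). Since an edge weight counts shared wells, N can never exceed the number of wells on the plate.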
Requires Java11 or higher (Openjdk-17 recommended)
### TODO
pairSEQ citation:
Howie, B., Sherwood, A. M., et al.
"High-throughput pairing of T cell receptor alpha and beta sequences."
Sci. Transl. Med. 7, 301ra131 (2015)
* Try invoking GC at end of workloads
* Hold graph data in memory until another graph is read-in
* Enable GraphML output in addition to serialized object binaries, for data portability
* Custom vertex type with attribute for sequence occupancy?
* Re-implement CDR1 matching method
* Re-implement command line arguments, to enable statistical simulation studies
* Implement Duan and Su's maximum weight matching algorithms
    * Add controllable algorithm-type parameter?
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
    * In theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
    * Add controllable heap-type parameter?
* Implement sample plates with random numbers of T cells per well
    * BiGpairSEQ is resilient to variations in well populations; pairSEQ is not
Simulation by Eugene Fischer, 2021-2022
### EXTERNAL LIBRARIES USED
### CITATIONS
* Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)
* [Duan and Su, maximum weight matching algorithm for bipartite graphs with integer weights][2]

[2]: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf
### ACKNOWLEDGEMENTS
Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
and explained all the biology terms the author didn't know.
### AUTHOR
Eugene Fischer, 2021. UI improvements and documentation in 2022.