improve documentation
86
readme.md
@@ -1,32 +1,78 @@
|
|||||||
BiGpairSEQ SIMULATOR
|
# BiGpairSEQ SIMULATOR
|
||||||
|
|
||||||
|
### ABOUT
|
||||||
|
|
||||||
ABOUT:
|
|
||||||
This program simulates BiGpairSEQ, a graph theory based adaptation
|
This program simulates BiGpairSEQ, a graph theory based adaptation
|
||||||
of the pairSEQ algorithm for pairing T cell receptor sequences.
|
of the pairSEQ algorithm for pairing T cell receptor sequences.
|
||||||
|
|
||||||
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
### USAGE
|
||||||
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
|
||||||
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
|
Released as an executable .jar file with an interactive command-line UI.
|
||||||
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
|
Requires Java 11 or higher (OpenJDK 17 recommended).
|
||||||
|
|
||||||
|
Run with the command:
|
||||||
|
|
||||||
|
`java -jar BiGpairSEQ_Sim.jar`
|
||||||
|
|
||||||
|
Processing sample plates with tens of thousands of sequences may require large amounts
|
||||||
|
of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag.
|
||||||
|
For example, to run the program with 32 gigabytes of memory, use the command:
|
||||||
|
|
||||||
|
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
|
||||||
|
|
||||||
|
Note that the maximum heap size should not exceed the RAM physically present on the system.
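
As a quick way to confirm that the `-Xmx` flag took effect, the maximum heap the JVM will use can be printed with the standard `Runtime.maxMemory()` call. This is a minimal sketch, not part of BiGpairSEQ_Sim; the class name `HeapCheck` and the method `maxHeapMiB` are hypothetical names chosen for illustration.

```java
// Hypothetical helper, not part of the simulator: reports the JVM's maximum heap.
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in mebibytes (rounded down).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

On Java 11 or higher this can be run directly from source with `java -Xmx32G HeapCheck.java`. Depending on the garbage collector in use, the reported value may be slightly lower than the value passed to `-Xmx`.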
|
||||||
|
|
||||||
|
### OUTPUT
|
||||||
|
|
||||||
|
### THEORY
|
||||||
|
|
||||||
|
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
||||||
|
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
||||||
|
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
|
||||||
|
The distinct TCRA and TCRB sequences form the two sets of vertices. All TCRA/TCRB pairs that share a well
|
||||||
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
||||||
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy
|
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy
|
||||||
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
||||||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||||||
|
|
||||||
USAGE
|
This is a very well-studied combinatorial optimization problem, with many known solutions.
|
||||||
Released as an executable .jar file with interactive, command line UI
|
The best currently-known algorithm for bipartite graphs with integer weights is from [Duan and Su][2].
|
||||||
Usage: java -jar BiGpairSEQ_Sim.jar
|
For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
|
||||||
|
in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
|
||||||
|
algorithm is ideal for BiGpairSEQ.
|
||||||
|
|
||||||
Large cell sample or sample plate files may require large amounts of RAM.
|
Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generally useful,
|
||||||
It is often desirable to increase the JVM memory allocation with the -Xmx flag
|
and it is not implemented by the graph theory library used in this simulator. So this program
|
||||||
For example, to run the program with 32 gigabytes of memory, use command:
|
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
|
||||||
java -Xmx32G -jar BiGpairSEQ_Sim.jar
|
runtime of O(n (n log(n) + m)).
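
The graph construction and matching described in this section can be sketched in plain Java for a tiny plate. This is an illustrative sketch under stated assumptions, not the simulator's code: the class `WellGraphSketch`, its method names, and the `"alpha|beta"` edge-key format are all hypothetical, and the matching step is an exhaustive search that is only practical for a handful of sequences. It does not use the graph theory library or the Fredman and Tarjan algorithm mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; names and data layout are assumptions.
public class WellGraphSketch {

    // Counts, for each sequence, the number of wells in which it appears.
    static Map<String, Integer> occupancy(List<Set<String>> wells) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> well : wells) {
            for (String seq : well) {
                counts.merge(seq, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Edge weights keyed by "alpha|beta": the number of wells both sequences share.
    // Sequences present in every well are skipped, since their occupancy carries no signal.
    static Map<String, Integer> edgeWeights(List<Set<String>> alphaWells,
                                            List<Set<String>> betaWells) {
        int wellCount = alphaWells.size();
        Map<String, Integer> alphaOcc = occupancy(alphaWells);
        Map<String, Integer> betaOcc = occupancy(betaWells);
        Map<String, Integer> weights = new TreeMap<>();
        for (int w = 0; w < wellCount; w++) {
            for (String a : alphaWells.get(w)) {
                if (alphaOcc.get(a) == wellCount) continue;
                for (String b : betaWells.get(w)) {
                    if (betaOcc.get(b) == wellCount) continue;
                    weights.merge(a + "|" + b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    // Exhaustive search: try leaving alphas[i] unmatched, or matching it to each free beta.
    static int bestFrom(List<String> alphas, int i, List<String> betas,
                        Set<String> usedBetas, Map<String, Integer> weights) {
        if (i == alphas.size()) return 0;
        int best = bestFrom(alphas, i + 1, betas, usedBetas, weights);
        for (String b : betas) {
            Integer w = weights.get(alphas.get(i) + "|" + b);
            if (w == null || usedBetas.contains(b)) continue;
            usedBetas.add(b);
            best = Math.max(best, w + bestFrom(alphas, i + 1, betas, usedBetas, weights));
            usedBetas.remove(b);
        }
        return best;
    }

    // Total weight of a maximum weight matching (brute force, tiny inputs only).
    static int maxMatchingWeight(Map<String, Integer> weights) {
        Set<String> alphas = new TreeSet<>();
        Set<String> betas = new TreeSet<>();
        for (String key : weights.keySet()) {
            String[] parts = key.split("\\|");
            alphas.add(parts[0]);
            betas.add(parts[1]);
        }
        return bestFrom(new ArrayList<>(alphas), 0, new ArrayList<>(betas),
                        new HashSet<>(), weights);
    }

    public static void main(String[] args) {
        // Three wells; A9 appears in all of them and is filtered out.
        List<Set<String>> alphaWells = List.of(
                Set.of("A1", "A2", "A9"), Set.of("A1", "A9"), Set.of("A2", "A9"));
        List<Set<String>> betaWells = List.of(
                Set.of("B1", "B2"), Set.of("B1"), Set.of("B2"));
        Map<String, Integer> weights = edgeWeights(alphaWells, betaWells);
        System.out.println(weights);                    // {A1|B1=2, A1|B2=1, A2|B1=1, A2|B2=2}
        System.out.println(maxMatchingWeight(weights)); // 4
    }
}
```

In this toy plate the maximum weight matching pairs A1 with B1 and A2 with B2, the pairs that co-occur in the most wells.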
|
||||||
|
|
||||||
Requires Java11 or higher (Openjdk-17 recommended)
|
### TODO
|
||||||
|
|
||||||
pairSEQ citation:
|
* Try invoking GC at end of workloads
|
||||||
Howie, B., Sherwood, A. M., et. al.
|
* Hold graph data in memory until another graph is read-in
|
||||||
"High-throughput pairing of T cell receptor alpha and beta sequences."
|
* Enable GraphML output in addition to serialized object binaries, for data portability
|
||||||
Sci. Transl. Med. 7, 301ra131 (2015)
|
* Custom vertex type with attribute for sequence occupancy?
|
||||||
|
* Re-implement CDR1 matching method
|
||||||
|
* Re-implement command line arguments, to enable statistical simulation studies
|
||||||
|
* Implement Duan and Su's maximum weight matching algorithms
|
||||||
|
* Add controllable algorithm-type parameter?
|
||||||
|
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
|
||||||
|
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
|
||||||
|
* Add controllable heap-type parameter?
|
||||||
|
* Implement sample plates with random numbers of T cells per well
|
||||||
|
* BiGpairSEQ is resilient to variations in well populations; pairSEQ is not
|
||||||
|
|
||||||
Simulation by Eugene Fischer, 2021-2022
|
### EXTERNAL LIBRARIES USED
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
### CITATIONS
|
||||||
|
* Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)
|
||||||
|
[2]: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf
|
||||||
|
|
||||||
|
### ACKNOWLEDGEMENTS
|
||||||
|
Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
|
||||||
|
and explained all the biology terms the author didn't know.
|
||||||
|
|
||||||
|
### AUTHOR
|
||||||
|
Eugene Fischer, 2021. UI improvements and documentation in 2022.
|
||||||