improve documentation
86
readme.md
@@ -1,32 +1,78 @@
BiGpairSEQ SIMULATOR
# BiGpairSEQ SIMULATOR

### ABOUT

ABOUT:
This program simulates BiGpairSEQ, a graph-theory-based adaptation
of the pairSEQ algorithm for pairing T cell receptor sequences.

Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
### USAGE

Released as an executable .jar file with an interactive command-line UI.
Requires Java 11 or higher (OpenJDK 17 recommended).

Run with the command:

`java -jar BiGpairSEQ_Sim.jar`

Processing sample plates with tens of thousands of sequences may require large amounts
of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag.
For example, to run the program with 32 gigabytes of memory, use the command:

`java -Xmx32G -jar BiGpairSEQ_Sim.jar`

Note that the heap size should not exceed the RAM physically present on the system.
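
To confirm that an -Xmx setting took effect, one option is to print the heap limit the JVM reports. The sketch below is not part of BiGpairSEQ_Sim: the class name `HeapCheck` and the method `maxHeapMiB` are made up for illustration, and the only library call used is the standard `Runtime.maxMemory()`.

```java
public class HeapCheck {
    /** Returns the maximum heap size the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Hypothetical helper, not a feature of the simulator itself.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

With Java 11 or later the file can be launched directly, for example `java -Xmx32G HeapCheck.java`. The reported value should be at or somewhat below 32768 MiB, depending on the garbage collector in use.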

### OUTPUT

### THEORY

Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well
is connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
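
The graph construction described above can be sketched in plain Java. This is an illustration under simplifying assumptions, not the simulator's actual code: the class and method names are hypothetical, each well is represented as a set of sequence labels, and the graph is stored as a nested map (TCRA -> TCRB -> weight) instead of a graph-library object.

```java
import java.util.*;

public class OverlapGraphSketch {
    /** Counts, for each sequence, the number of wells it appears in. */
    static Map<String, Integer> occupancy(List<Set<String>> wells) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> well : wells) {
            for (String seq : well) {
                counts.merge(seq, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Builds the weighted bipartite graph as a nested map.
     * alphaWells.get(i) and betaWells.get(i) hold the distinct TCRA and TCRB
     * sequences seen in well i. The weight of edge (a, b) is the number of
     * wells containing both a and b.
     */
    static Map<String, Map<String, Integer>> buildGraph(
            List<Set<String>> alphaWells, List<Set<String>> betaWells) {
        int wellCount = alphaWells.size();
        Map<String, Integer> alphaOcc = occupancy(alphaWells);
        Map<String, Integer> betaOcc = occupancy(betaWells);
        Map<String, Map<String, Integer>> edges = new TreeMap<>();
        for (int w = 0; w < wellCount; w++) {
            for (String a : alphaWells.get(w)) {
                if (alphaOcc.get(a) == wellCount) continue; // in every well: no signal
                for (String b : betaWells.get(w)) {
                    if (betaOcc.get(b) == wellCount) continue; // in every well: no signal
                    edges.computeIfAbsent(a, k -> new TreeMap<>())
                         .merge(b, 1, Integer::sum);
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Toy plate with three wells; "aX" occurs in all of them and is filtered out.
        List<Set<String>> alphas = List.of(
                Set.of("a1", "a2", "aX"), Set.of("a1", "aX"), Set.of("a2", "aX"));
        List<Set<String>> betas = List.of(
                Set.of("b1", "b2"), Set.of("b1"), Set.of("b2"));
        System.out.println(buildGraph(alphas, betas));
        // prints {a1={b1=2, b2=1}, a2={b1=1, b2=2}}
    }
}
```

In this toy plate the heaviest edges, a1-b1 and a2-b2, are the pairs that share the most wells, which is the signal the matching step then exploits.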

USAGE
Released as an executable .jar file with interactive, command line UI
Usage: java -jar BiGpairSEQ_Sim.jar
This is a very well-studied combinatorial optimization problem, with many known solutions.
The best currently known algorithm for bipartite graphs with integer weights is from [Duan and Su][2].
For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
algorithm is ideal for BiGpairSEQ.

Large cell sample or sample plate files may require large amounts of RAM.
It is often desirable to increase the JVM memory allocation with the -Xmx flag
For example, to run the program with 32 gigabytes of memory, use command:
java -Xmx32G -jar BiGpairSEQ_Sim.jar
Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generally useful,
and it is not implemented by the graph theory library used in this simulator. So this program
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
runtime of O(n (n log(n) + m)).
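
To make the objective concrete, the sketch below finds the maximum matching weight of a tiny graph by exhaustive search. It is deliberately not the Fredman and Tarjan algorithm, nor Duan and Su's: it runs in exponential time and only illustrates what those algorithms optimize. The class name, method name, and weight-matrix representation (rows are TCRA vertices, columns are TCRB vertices, 0 means no edge) are assumptions made for the example.

```java
public class MatchingBruteForce {
    /**
     * Returns the largest total weight of any matching that uses rows
     * row..end of the weight matrix and only columns not yet marked used.
     */
    static int maxWeight(int[][] w, int row, boolean[] usedCols) {
        if (row == w.length) {
            return 0; // no rows left to match
        }
        // Option 1: leave this row's vertex unmatched.
        int best = maxWeight(w, row + 1, usedCols);
        // Option 2: match it to any free column it shares an edge with.
        for (int col = 0; col < w[row].length; col++) {
            if (w[row][col] > 0 && !usedCols[col]) {
                usedCols[col] = true;
                best = Math.max(best, w[row][col] + maxWeight(w, row + 1, usedCols));
                usedCols[col] = false; // backtrack
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Picking the heaviest edge first (weight 3) blocks the second row,
        // for a total of 3; the optimal matching pairs row 0 with column 1
        // and row 1 with column 0.
        int[][] weights = {
            {3, 2},
            {2, 0}
        };
        System.out.println(maxWeight(weights, 0, new boolean[2])); // prints 4
    }
}
```

The example also shows why a greedy choice of the heaviest edge is not enough: the maximum total weight can require giving up the single heaviest edge.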

Requires Java11 or higher (Openjdk-17 recommended)
### TODO

pairSEQ citation:
Howie, B., Sherwood, A. M., et. al.
"High-throughput pairing of T cell receptor alpha and beta sequences."
Sci. Transl. Med. 7, 301ra131 (2015)
* Try invoking GC at end of workloads
* Hold graph data in memory until another graph is read in
* Enable GraphML output in addition to serialized object binaries, for data portability
* Custom vertex type with attribute for sequence occupancy?
* Re-implement CDR1 matching method
* Re-implement command line arguments, to enable statistical simulation studies
* Implement Duan and Su's maximum weight matching algorithm
    * Add controllable algorithm-type parameter?
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
    * In theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
    * Add controllable heap-type parameter?
* Implement sample plates with random numbers of T cells per well
    * BiGpairSEQ is resilient to variations in well populations; pairSEQ is not

Simulation by Eugene Fischer, 2021-2022
### EXTERNAL LIBRARIES USED

### CITATIONS

* Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)

[2]: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf

### ACKNOWLEDGEMENTS

Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
and explained all the biology terms the author didn't know.

### AUTHOR

Eugene Fischer, 2021. UI improvements and documentation in 2022.