improve documentation
193
readme.md
@@ -1,11 +1,11 @@
|
||||
# BiGpairSEQ SIMULATOR
|
||||
|
||||
### ABOUT
|
||||
## ABOUT
|
||||
|
||||
This program simulates BiGpairSEQ, a graph theory based adaptation
|
||||
of the pairSEQ algorithm for pairing T cell receptor sequences.
|
||||
of the pairSEQ algorithm (Howie et al. 2015) for pairing T cell receptor sequences.
|
||||
|
||||
### USAGE
|
||||
## USAGE
|
||||
|
||||
Released as an executable .jar file with interactive, command line UI.
|
||||
Requires Java 11 or higher (openjdk-17 recommended).
|
||||
@@ -22,9 +22,149 @@ For example, to run the program with 32 gigabytes of memory, use the command:
|
||||
|
||||
Note that you cannot allocate more RAM than is physically present on the system.
|
||||
|
||||
### OUTPUT
|
||||
## OUTPUT
|
||||
|
||||
### THEORY
|
||||
To run the simulation, the program reads and writes 4 kinds of files:
|
||||
* Cell Sample files in CSV format
|
||||
* Sample Plate files in CSV format
|
||||
* Graph and Data files in binary object serialization format
|
||||
* Matching Results files in CSV format
|
||||
|
||||
### -- Cell Sample Files --
|
||||
Cell Sample files consist of any number of distinct "T cells." Every cell contains
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
random integers. Alpha and Beta CDR3 sequences are all unique. Alpha and Beta CDR1 sequences
are not necessarily unique; the relative diversity can be set when making a Cell Sample file.
|
||||
|
||||
Options when making a Cell Sample file:
|
||||
* Number of T cells to generate
|
||||
* Factor by which CDR3s are more diverse than CDR1s
|
||||
|
||||
Files are in CSV format. Rows are distinct T cells, columns are sequences within the cells.
|
||||
Comments are preceded by `#`.
|
||||
|
||||
Structure:
|
||||
|
||||
---
|
||||
|
||||
# Sample contains 1 unique CDR1 for every 4 unique CDR3s.
|
||||
|
||||
| Alpha CDR3 | Beta CDR3 | Alpha CDR1 | Beta CDR1 |
|---|---|---|---|
| unique number | unique number | number | number |
|
||||
|
||||
---
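As a rough illustration of the layout described above, here is a minimal Python sketch that generates a Cell Sample and writes it as CSV. It is not the simulator's own code: the function names, the range of integer IDs, and the presence of a header row in the file are assumptions made for the example.

```python
import csv
import io
import random


def make_cell_sample(num_cells, cdr1_divisor, seed=None):
    """Return num_cells rows of [alpha CDR3, beta CDR3, alpha CDR1, beta CDR1]."""
    rng = random.Random(seed)
    id_space = range(1, 10 * num_cells + 1)  # hypothetical range of sequence IDs
    # Sampling without replacement keeps every CDR3 (alpha and beta) unique.
    cdr3s = rng.sample(id_space, 2 * num_cells)
    # CDR1s come from pools cdr1_divisor times smaller, so repeats are expected.
    pool_size = max(1, num_cells // cdr1_divisor)
    alpha_cdr1_pool = rng.sample(id_space, pool_size)
    beta_cdr1_pool = rng.sample(id_space, pool_size)
    return [
        [cdr3s[i], cdr3s[num_cells + i],
         rng.choice(alpha_cdr1_pool), rng.choice(beta_cdr1_pool)]
        for i in range(num_cells)
    ]


def write_cell_sample(cells, stream, cdr1_divisor):
    """Write the rows as CSV, with a leading '#' comment and a header row."""
    stream.write(
        f"# Sample contains 1 unique CDR1 for every {cdr1_divisor} unique CDR3s.\n"
    )
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Alpha CDR3", "Beta CDR3", "Alpha CDR1", "Beta CDR1"])
    writer.writerows(cells)


cells = make_cell_sample(num_cells=8, cdr1_divisor=4, seed=1)
buffer = io.StringIO()
write_cell_sample(cells, buffer, cdr1_divisor=4)
print(buffer.getvalue())
```

With `cdr1_divisor=4`, the CDR1 pools hold one value for every four cells, which mirrors the "1 unique CDR1 for every 4 unique CDR3s" comment in the structure shown above.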
|
||||
|
||||
**NOTE:** Matching of CDR1s is currently awaiting re-implementation.
|
||||
|
||||
### -- Sample Plate Files --
|
||||
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
||||
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
||||
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
||||
given dropout probability, be omitted from the file. This simulates the effect of amplification errors
|
||||
prior to sequencing. Plates can also be partitioned into any number of (approximately) evenly-sized
|
||||
sections, each of which can have a different number of T cells per well.
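The filling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fill_plate` and its parameters are hypothetical names, the frequency distribution is passed in as a list of per-cell weights, and the exponential weighting by rank shown at the bottom is one possible reading of the distribution option, not necessarily what the simulator does.

```python
import math
import random

DROPOUT = -1  # value that replaces an omitted sequence


def fill_plate(cells, weights, num_wells, cells_per_well_by_section,
               dropout_rate, seed=None):
    """Return a plate: a list of wells, each a list of 4-sequence cells."""
    rng = random.Random(seed)
    num_sections = len(cells_per_well_by_section)
    plate = []
    for well_index in range(num_wells):
        # Split the wells into approximately equal consecutive sections.
        section = well_index * num_sections // num_wells
        count = cells_per_well_by_section[section]
        # Draw cells with replacement according to the frequency weights.
        drawn = rng.choices(cells, weights=weights, k=count)
        # Each sequence is dropped independently with probability dropout_rate.
        well = [
            [DROPOUT if rng.random() < dropout_rate else seq for seq in cell]
            for cell in drawn
        ]
        plate.append(well)
    return plate


cells = [[i, 100 + i, 200 + i, 300 + i] for i in range(1, 51)]
lam = 0.6
# Assumption: weight decays exponentially with the cell's rank in the sample.
weights = [math.exp(-lam * rank) for rank in range(len(cells))]
plate = fill_plate(cells, weights, num_wells=96,
                   cells_per_well_by_section=[10, 50],
                   dropout_rate=0.1, seed=42)
print(len(plate), len(plate[0]), len(plate[-1]))
```

Here the 96 wells are split into two sections of 48, the first holding 10 cells per well and the second 50.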
|
||||
|
||||
Options when making a Sample Plate file:
|
||||
* Cell Sample file to use
|
||||
* Statistical distribution to apply to Cell Sample file
|
||||
* Poisson
|
||||
* Gaussian
|
||||
* Standard deviation size
|
||||
* Exponential
|
||||
* Lambda value
|
||||
* Based on the slope of the graph in Figure 4C of the pairSEQ paper (Howie et al. 2015), the distribution of the original experiment was exponential with a lambda of approximately 0.6.
|
||||
* Total number of wells on the plate
|
||||
* Number of sections on plate
|
||||
* Number of T cells per well
|
||||
* per section, if more than one section
|
||||
* Dropout rate
|
||||
|
||||
Files are in CSV format. There are no header labels. Every row represents a well.
|
||||
Every column represents an individual cell, containing four sequences, represented by an array string:
|
||||
`[CDR3A, CDR3B, CDR1A, CDR1B]`. So a representative cell might look like this:
|
||||
|
||||
`[525902, 791533, -1, 866282]`
|
||||
|
||||
Notice that the Alpha CDR1 is missing in the cell above, due to sequence dropout.
|
||||
Dropouts are represented by replacing sequences with the value `-1`. Comments are preceded by `#`.
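For readers who want to inspect a Sample Plate file outside the simulator, the following Python sketch parses the cell strings and reports dropouts. It assumes that each cell is stored as one quoted CSV field (cells contain commas), which is how a CSV writer would normally emit such a value; the function names are hypothetical.

```python
import csv
import io

DROPOUT = -1
LABELS = ("CDR3A", "CDR3B", "CDR1A", "CDR1B")


def parse_cell(text):
    """Turn '[525902, 791533, -1, 866282]' into a list of four ints."""
    values = [int(part) for part in text.strip().strip("[]").split(",")]
    if len(values) != len(LABELS):
        raise ValueError(f"expected 4 sequences, got {text!r}")
    return values


def missing_sequences(cell):
    """Return the labels of the sequences lost to dropout."""
    return [label for label, value in zip(LABELS, cell) if value == DROPOUT]


def read_plate(stream):
    """Return a list of wells; each well is a list of parsed cells."""
    rows = csv.reader(line for line in stream if not line.startswith("#"))
    return [[parse_cell(field) for field in row if field.strip()] for row in rows]


example = io.StringIO(
    "# hypothetical two-well plate\n"
    '"[525902, 791533, -1, 866282]","[17, 42, 99, 7]"\n'
    '"[3, 8, 5, -1]"\n'
)
plate = read_plate(example)
print(missing_sequences(plate[0][0]))  # the Alpha CDR1 dropped out of this cell
```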
|
||||
|
||||
Structure:
|
||||
|
||||
---
|
||||
|
||||
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3 | ... |
|---|---|---|---|
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3** | ... |
| **Well 3, cell 1** | **Well 3, cell 2** | **Well 3, cell 3** | ... |
| ... | ... | ... | ... |
|
||||
|
||||
---
|
||||
|
||||
### -- Graph and Data Files --
|
||||
Graph and Data files are serialized binaries of a Java object containing the graph representation of a
|
||||
Sample Plate and necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking
|
||||
the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated
|
||||
occupancy graph). These files can be several gigabytes in size. Writing them to a file lets us generate a graph and
|
||||
its metadata once, then use it for multiple different BiGpairSEQ simulations.
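To make the idea of an occupancy graph concrete, here is a small Python sketch that counts, for a plate held in memory, how many wells each CDR3 occupies and how many wells each alpha/beta pair shares. Treating the shared-well count as the edge weight is an assumption based on the "Overlap count" column of the results files; the data layout and names are hypothetical, and the simulator's serialized Java object is not reproduced here.

```python
from collections import Counter

DROPOUT = -1


def build_overlap_graph(plate):
    """plate: list of wells, each a list of [CDR3A, CDR3B, CDR1A, CDR1B] cells."""
    alpha_wells = Counter()  # alpha CDR3 -> number of wells it occupies
    beta_wells = Counter()   # beta CDR3 -> number of wells it occupies
    overlap = Counter()      # (alpha, beta) -> number of wells they share
    for well in plate:
        alphas = {cell[0] for cell in well if cell[0] != DROPOUT}
        betas = {cell[1] for cell in well if cell[1] != DROPOUT}
        alpha_wells.update(alphas)
        beta_wells.update(betas)
        overlap.update((a, b) for a in alphas for b in betas)
    return alpha_wells, beta_wells, overlap


plate = [
    [[1, 10, -1, -1], [2, 20, -1, -1]],  # well 1
    [[1, 10, -1, -1]],                   # well 2
    [[2, -1, -1, -1]],                   # well 3: beta CDR3 dropped out
]
alpha_wells, beta_wells, overlap = build_overlap_graph(plate)
print(overlap[(1, 10)], overlap[(2, 20)])
```

The true pair (1, 10) shares two wells, while (2, 20) shares only one because the beta sequence dropped out of well 3.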
|
||||
|
||||
These files do not have a human-readable structure, and are not portable to other programs.
|
||||
|
||||
(Export of graphs in a portable data format may be implemented in the future.
Exporting the graph itself is easy; the tricky part is packaging it with the necessary metadata.)
|
||||
|
||||
Options for creating a Graph and Data file:
|
||||
* The Cell Sample file to use
|
||||
* The Sample Plate file (generated from the given Cell Sample file) to use
|
||||
|
||||
### -- Matching Results Files --
|
||||
Matching Results files contain the results of a BiGpairSEQ matching simulation.
Files are in CSV format. Each row is a sequence pairing; the columns hold pairing-specific details.
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
|
||||
|
||||
|
||||
|
||||
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
||||
* The minimum number of alpha/beta overlap wells to attempt to match
|
||||
* (must be >= 1)
|
||||
* The maximum number of alpha/beta overlap wells to attempt to match
|
||||
* (must be <= the number of wells on the plate - 1)
|
||||
* The maximum difference in alpha/beta occupancy to attempt to match
|
||||
* (To skip using this filter, enter a value >= the number of wells on the plate)
|
||||
* The minimum percentage of a sequence's occupied wells shared by another sequence to attempt to match
|
||||
* (must be a value from 0 to 100)
|
||||
* (To skip using this filter, enter 0)
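One way to read the four options above is as a filter applied to each candidate alpha/beta pair before matching. The Python sketch below is an interpretation, not the simulator's code: the function and parameter names are hypothetical, and it assumes the percentage filter must hold for both sequences, which means checking it against the sequence that occupies more wells.

```python
def passes_filters(alpha_wells, beta_wells, overlap,
                   min_overlap, max_overlap,
                   max_occupancy_diff, min_overlap_percent):
    """Decide whether a candidate alpha/beta pair is kept for matching."""
    # Overlap well count must lie between the low and high thresholds.
    if not (min_overlap <= overlap <= max_overlap):
        return False
    # The two sequences must occupy a similar number of wells.
    if abs(alpha_wells - beta_wells) > max_occupancy_diff:
        return False
    # Shared wells as a percentage of occupied wells, checked for the sequence
    # with the larger occupancy (the stricter of the two, by assumption).
    return 100 * overlap >= min_overlap_percent * max(alpha_wells, beta_wells)


# Thresholds taken from a sample results file: low 3, high 94, difference 50, percent 0.
print(passes_filters(31, 34, 31, min_overlap=3, max_overlap=94,
                     max_occupancy_diff=50, min_overlap_percent=0))
```

Entering a maximum occupancy difference at least as large as the number of wells, or a minimum percentage of 0, makes the corresponding check always pass, which matches the "to skip using this filter" notes.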
|
||||
|
||||
Sample File Structure:
|
||||
|
||||
---
|
||||
|
||||
# T cell counts in sample plate wells: 5000
|
||||
# Total alphas found: 3387
|
||||
# Total betas found: 3396
|
||||
# High overlap threshold: 94
|
||||
# Low overlap threshold: 3
|
||||
# Minimum overlap percent: 0
|
||||
# Maximum occupancy difference: 50
|
||||
# Pairing attempt rate: 0.488
|
||||
# Correct pairings: 1650
|
||||
# Incorrect pairings: 4
|
||||
# Pairing error rate: 0.00242
|
||||
# Simulation time: 19 seconds
|
||||
|
||||
| Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
|---|---|---|---|---|---|---|
|716809|31|20739|34|31.0|TRUE|4.99E-25|
|753685|28|733213|27|27.0|TRUE|5.26E-23|
|...|...|...|...|...|...|...|
|
||||
|
||||
---
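A Matching Results file with this structure can be read back with a few lines of Python. The sketch below assumes the comments use a `# key: value` form and that the data rows start with a header row carrying the column names shown above; both are assumptions about the on-disk layout, and `read_results` is a hypothetical helper.

```python
import csv
import io


def read_results(stream):
    """Split a results file into comment metadata and a list of pairing rows."""
    metadata, data_lines = {}, []
    for line in stream:
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    return metadata, list(csv.DictReader(data_lines))


example = io.StringIO(
    "# Correct pairings: 1650\n"
    "# Incorrect pairings: 4\n"
    "Alpha,Alpha well count,Beta,Beta well count,Overlap count,"
    "Matched Correctly?,P-value\n"
    "716809,31,20739,34,31.0,TRUE,4.99E-25\n"
    "753685,28,733213,27,27.0,TRUE,5.26E-23\n"
)
metadata, pairings = read_results(example)
correct = int(metadata["Correct pairings"])
incorrect = int(metadata["Incorrect pairings"])
error_rate = incorrect / (correct + incorrect)
print(f"{error_rate:.5f}")  # prints 0.00242
```

The computed value, 4 / (1650 + 4), agrees with the `Pairing error rate: 0.00242` comment in the sample, which suggests the error rate is the share of incorrect pairings among all pairings made.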
|
||||
|
||||
**NOTE: The p-values in the output are not used for matching.** They aren't part of the BiGpairSEQ algorithm at all.
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison,
using the (2021 corrected) formula from the original pairSEQ paper (Howie et al. 2015).
|
||||
|
||||
## THEORY
|
||||
|
||||
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
||||
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
||||
@@ -36,17 +176,21 @@ The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment probl
|
||||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
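To illustrate what a maximum weight matching is, the Python sketch below finds one by exhaustive search on a tiny graph. This is deliberately not the algorithm used by the simulator or either of the algorithms cited in this section: it takes exponential time and is only meant to show that the best set of vertex-disjoint edges can differ from a greedy choice. The function name and the dictionary-of-edges input are assumptions for the example.

```python
def max_weight_matching(weights):
    """weights: {(alpha, beta): weight}. Returns (total weight, {alpha: beta})."""
    alphas = sorted({alpha for alpha, _ in weights})

    def best(index, used_betas):
        if index == len(alphas):
            return 0, {}
        # Option 1: leave this alpha unmatched.
        best_total, best_match = best(index + 1, used_betas)
        # Option 2: match it to any beta that is still free.
        alpha = alphas[index]
        for (a, beta), weight in weights.items():
            if a == alpha and beta not in used_betas:
                total, match = best(index + 1, used_betas | {beta})
                if total + weight > best_total:
                    best_total = total + weight
                    best_match = {alpha: beta, **match}
        return best_total, best_match

    return best(0, frozenset())


# Edge weights stand in for overlap well counts.
weights = {("a1", "b1"): 5, ("a1", "b2"): 4, ("a2", "b1"): 4}
total, matching = max_weight_matching(weights)
print(total, matching)
```

Picking the heaviest edge first (a1 with b1, weight 5) leaves a2 with no partner, while the maximum weight matching pairs a1 with b2 and a2 with b1 for a total of 8.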
|
||||
|
||||
This is a very well-studied combinatorial optimization problem, with many known solutions.
|
||||
The best currently-known algorithm for bipartite graphs with integer weights is from [Duan and Su][2]
|
||||
The best currently-known algorithm for bipartite graphs with integer weights is from Duan and Su (2012).
|
||||
For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
|
||||
in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
|
||||
algorithm is ideal for BiGpairSEQ.
|
||||
|
||||
Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful,
|
||||
and it is not implemented by the graph theory library used in this simulator. So this program
|
||||
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case
|
||||
runtime of O(n (n log(n) + m)).
|
||||
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
runtime of O(n (n log(n) + m)). The algorithm is implemented as described in Mehlhorn and Näher (1999).
|
||||
|
||||
### TODO
|
||||
The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue.
A pairing heap has weaker theoretical bounds but is simpler and carries less overhead,
and it often performs just as well in practice.
|
||||
|
||||
## TODO
|
||||
|
||||
* Try invoking GC at end of workloads
|
||||
* Hold graph data in memory until another graph is read-in
|
||||
@@ -61,20 +205,25 @@ runtime of O(n (n log(n) + m)).
|
||||
* Add controllable heap-type parameter?
|
||||
* Implement sample plates with random numbers of T cells per well
|
||||
* BiGpairSEQ is resilient to variations in well populations; pairSEQ is not
|
||||
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows
|
||||
* Problem is variable number of cells in a well
|
||||
* Apache Commons CSV library writes entries a row at a time
|
||||
|
||||
### EXTERNAL LIBRARIES USED
|
||||
## EXTERNAL LIBRARIES USED
|
||||
* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms
|
||||
* [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm
|
||||
* [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output
|
||||
* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting reimplementation**.)
|
||||
|
||||
## CITATIONS
|
||||
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
* Duan, R., Su, H. ["A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs."](https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf) Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1413-1424 (2012)
* Mehlhorn, K., Näher, S. [LEDA: A Platform for Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999)
* Fredman, M., Tarjan, R. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987)
|
||||
|
||||
## ACKNOWLEDGEMENTS
|
||||
BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
|
||||
and explained all the biology terms he didn't know.
|
||||
|
||||
### CITATIONS
|
||||
[1]: Howie, B., Sherwood, A. M., et. al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)
|
||||
[2]: https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf
|
||||
|
||||
Duan, R. and Su H. "A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs." (2012)
|
||||
|
||||
### ACKNOWLEDGEMENTS
|
||||
Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
|
||||
and explained all the biology terms the author didn't know.
|
||||
|
||||
### AUTHOR
|
||||
Eugene Fischer, 2021. UI improvements and documentation in 2022.
|
||||
## AUTHOR
|
||||
Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
||||