improve documentation
This commit is contained in:
151
readme.md
151
readme.md
@@ -1,14 +1,45 @@
|
|||||||
# BiGpairSEQ SIMULATOR
|
# BiGpairSEQ SIMULATOR
|
||||||
|
|
||||||
|
|
||||||
## ABOUT
|
## ABOUT
|
||||||
|
|
||||||
This program simulates BiGpairSEQ, a graph theory based adaptation
|
This program simulates BiGpairSEQ (Bipartite Graph pairSEQ), a graph theory-based adaptation
|
||||||
of the pairSEQ algorithm (Howie et al. 2015) for pairing T cell receptor sequences.
|
of the pairSEQ algorithm (Howie et al. 2015) for pairing T cell receptor sequences.
|
||||||
|
|
||||||
|
## THEORY
|
||||||
|
|
||||||
|
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
||||||
|
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
||||||
|
directly.
|
||||||
|
|
||||||
|
BiGpairSEQ creates a [simple bipartite weighted graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
|
||||||
|
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
|
||||||
|
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
||||||
|
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
|
||||||
|
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
||||||
|
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||||||
|
|
||||||
|
This is a well-studied combinatorial optimization problem, with many known solutions.
|
||||||
|
The best currently-known algorithm for bipartite graphs with integer weights--which is what BiGpairSEQ uses--
|
||||||
|
is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
|
||||||
|
their algorithm runs in **O(m sqrt(n) log(N))** time. This is the best known efficiency for finding a maximum weight
|
||||||
|
matching on a bipartite graph, and the integer edge weight requirement makes it ideal for BiGpairSEQ.
|
||||||
|
|
||||||
|
Unfortunately, it's a fairly new algorithm, and the integer edge weight requirement makes it less generically useful.
|
||||||
|
It is not implemented by the graph theory library used in this simulator. So this program
|
||||||
|
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
||||||
|
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
|
||||||
|
|
||||||
|
The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue,
|
||||||
|
which has lower theoretical efficiency but also lower complexity overhead, and is often equivalently performant
|
||||||
|
in practice.
|
||||||
|
|
||||||
## USAGE
|
## USAGE
|
||||||
|
|
||||||
Released as an executable .jar file with interactive, command line UI.
|
### RUNNING THE PROGRAM
|
||||||
Requires Java11 or higher (openjdk-17 recommended).
|
|
||||||
|
BiGpairSEQ_Sim is an executable .jar file. Requires Java 11 or higher. [OpenJDK 17](https://jdk.java.net/17/)
|
||||||
|
recommended.
|
||||||
|
|
||||||
Run with the command:
|
Run with the command:
|
||||||
|
|
||||||
@@ -20,7 +51,24 @@ For example, to run the program with 32 gigabytes of memory, use the command:
|
|||||||
|
|
||||||
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
|
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
|
||||||
|
|
||||||
## OUTPUT
|
Once running, BiGpairSEQ_Sim has an interactive, menu-driven CLI for generating files and simulating TCR pairing. The
|
||||||
|
main menu looks like this:
|
||||||
|
|
||||||
|
```
|
||||||
|
--------BiGPairSEQ SIMULATOR--------
|
||||||
|
ALPHA/BETA T-CELL RECEPTOR MATCHING
|
||||||
|
USING WEIGHTED BIPARTITE GRAPHS
|
||||||
|
------------------------------------
|
||||||
|
Please select an option:
|
||||||
|
1) Generate a population of distinct cells
|
||||||
|
2) Generate a sample plate of T cells
|
||||||
|
3) Generate CDR3 alpha/beta occupancy data and overlap graph
|
||||||
|
4) Simulate bipartite graph CDR3 alpha/beta matching (BiGpairSEQ)
|
||||||
|
9) About/Acknowledgments
|
||||||
|
0) Exit
|
||||||
|
```
|
||||||
|
|
||||||
|
### OUTPUT
|
||||||
|
|
||||||
To run the simulation, the program reads and writes 4 kinds of files:
|
To run the simulation, the program reads and writes 4 kinds of files:
|
||||||
* Cell Sample files in CSV format
|
* Cell Sample files in CSV format
|
||||||
@@ -28,7 +76,10 @@ To run the simulation, the program reads and writes 4 kinds of files:
|
|||||||
* Graph and Data files in binary object serialization format
|
* Graph and Data files in binary object serialization format
|
||||||
* Matching Results files in CSV format
|
* Matching Results files in CSV format
|
||||||
|
|
||||||
### -- Cell Sample Files --
|
When entering filenames, it is not necessary to include the file extension (.csv or .ser). When reading or
|
||||||
|
writing files, the program will automatically add the correct extension to any filename without one.
|
||||||
|
|
||||||
|
#### Cell Sample Files
|
||||||
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
||||||
four sequences: Alpha CDR3, Beta CDR, Alpha CDR1, Beta CDR1. The sequences are represented by
|
four sequences: Alpha CDR3, Beta CDR, Alpha CDR1, Beta CDR1. The sequences are represented by
|
||||||
random integers. CDR3 Alpha and Beta sequences are all unique. CDR1 Alpha and Beta sequences
|
random integers. CDR3 Alpha and Beta sequences are all unique. CDR1 Alpha and Beta sequences
|
||||||
@@ -55,7 +106,7 @@ Structure:
|
|||||||
|
|
||||||
**NOTE:** Matching of CDR1s is currently awaiting re-implementation.
|
**NOTE:** Matching of CDR1s is currently awaiting re-implementation.
|
||||||
|
|
||||||
### -- Sample Plate Files --
|
#### Sample Plate Files
|
||||||
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
||||||
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
||||||
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
||||||
@@ -90,7 +141,13 @@ Dropouts are represented by replacing sequences with the value `-1`. Comments ar
|
|||||||
Structure:
|
Structure:
|
||||||
|
|
||||||
---
|
---
|
||||||
|
```
|
||||||
|
# Cell source file name:
|
||||||
|
# Each row represents one well on the plate
|
||||||
|
# Plate size:
|
||||||
|
# Concentrations:
|
||||||
|
# Lambda:
|
||||||
|
```
|
||||||
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... |
|
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| ... |
|
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| ... |
|
||||||
@@ -99,7 +156,7 @@ Structure:
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### -- Graph and Data Files --
|
#### Graph and Data Files
|
||||||
Graph and Data files are serialized binaries of a Java object containing the graph representation of a
|
Graph and Data files are serialized binaries of a Java object containing the graph representation of a
|
||||||
Sample Plate and necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking
|
Sample Plate and necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking
|
||||||
the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated
|
the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated
|
||||||
@@ -115,13 +172,11 @@ Options for creating a Graph and Data file:
|
|||||||
* The Cell Sample file to use
|
* The Cell Sample file to use
|
||||||
* The Sample Plate file (generated from the given Cell Sample file) to use.
|
* The Sample Plate file (generated from the given Cell Sample file) to use.
|
||||||
|
|
||||||
### -- Matching Results Files --
|
#### Matching Results Files
|
||||||
Matching results files consist of the results of a BiGpairSEQ matching simulation.
|
Matching results files consist of the results of a BiGpairSEQ matching simulation.
|
||||||
Files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
|
Files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
|
||||||
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
|
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
||||||
* The minimum number of alpha/beta overlap wells to attempt to match
|
* The minimum number of alpha/beta overlap wells to attempt to match
|
||||||
* (must be >= 1)
|
* (must be >= 1)
|
||||||
@@ -136,19 +191,20 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
|||||||
Sample File Structure:
|
Sample File Structure:
|
||||||
|
|
||||||
---
|
---
|
||||||
|
```
|
||||||
# T cell counts in sample plate wells: 5000
|
# T cell counts in sample plate wells: 5000
|
||||||
# Total alphas found: 3387
|
# Total alphas found: 3387
|
||||||
# Total betas found: 3396
|
# Total betas found: 3396
|
||||||
# High overlap threshold: 94
|
# High overlap threshold: 94
|
||||||
# Low overlap threshold: 3
|
# Low overlap threshold: 3
|
||||||
# Minimum overlap percent: 0
|
# Minimum overlap percent: 0
|
||||||
# Maximum occupancy difference: 50
|
# Maximum occupancy difference: 50
|
||||||
# Pairing attempt rate: 0.488
|
# Pairing attempt rate: 0.488
|
||||||
# Correct pairings: 1650
|
# Correct pairings: 1650
|
||||||
# Incorrect pairings: 4
|
# Incorrect pairings: 4
|
||||||
# Pairing error rate: 0.00242
|
# Pairing error rate: 0.00242
|
||||||
# Simulation time: 19 seconds
|
# Simulation time: 19 seconds
|
||||||
|
```
|
||||||
|
|
||||||
| Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
|
| Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
|
||||||
|---|---|---|---|---|---|---|
|
|---|---|---|---|---|---|---|
|
||||||
@@ -159,35 +215,9 @@ Sample File Structure:
|
|||||||
---
|
---
|
||||||
|
|
||||||
**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all.
|
**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all.
|
||||||
P-values are calculated *after* BiGpairSEQ matching is completed for purposes of comparison,
|
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison,
|
||||||
using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)
|
using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)
|
||||||
|
|
||||||
## THEORY
|
|
||||||
|
|
||||||
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
|
||||||
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
|
||||||
directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate.
|
|
||||||
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
|
|
||||||
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
|
||||||
(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy
|
|
||||||
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
|
||||||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
|
||||||
|
|
||||||
This is a very well-studied combinatorial optimization problem, with many known solutions.
|
|
||||||
The best currently-known algorithm for bipartite graphs with integer weights is from Duan and Su (2012).
|
|
||||||
For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs
|
|
||||||
in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this
|
|
||||||
algorithm is ideal for BiGpairSEQ.
|
|
||||||
|
|
||||||
Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful,
|
|
||||||
and it is not implemented by the graph theory library used in this simulator. So this program
|
|
||||||
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1984), which has a worst-case
|
|
||||||
runtime of O(n (n log(n) + m)). The algorithm is implemented as described in Melhorn and Näher (1999).
|
|
||||||
|
|
||||||
The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue,
|
|
||||||
which has lower theoretical efficiency but also lower complexity overhead, and is often equivalently performant
|
|
||||||
in practice.
|
|
||||||
|
|
||||||
## TODO
|
## TODO
|
||||||
|
|
||||||
* Try invoking GC at end of workloads to reduce paging to disk
|
* Try invoking GC at end of workloads to reduce paging to disk
|
||||||
@@ -203,17 +233,12 @@ in practice.
|
|||||||
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
|
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
|
||||||
* Add controllable heap-type parameter?
|
* Add controllable heap-type parameter?
|
||||||
* Implement sample plates with random numbers of T cells per well
|
* Implement sample plates with random numbers of T cells per well
|
||||||
* Possible BiGpairSEQ advantage: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not.
|
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not.
|
||||||
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
||||||
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows
|
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows
|
||||||
* Problem is variable number of cells in a well
|
* Problem is variable number of cells in a well
|
||||||
* Apache Commons CSV library writes entries a row at a time
|
* Apache Commons CSV library writes entries a row at a time
|
||||||
|
* Can possibly sort the wells by length first, then construct entries
|
||||||
## EXTERNAL LIBRARIES USED
|
|
||||||
* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms
|
|
||||||
* [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm
|
|
||||||
* [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output
|
|
||||||
* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting reimplementation**.)
|
|
||||||
|
|
||||||
## CITATIONS
|
## CITATIONS
|
||||||
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
|
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
|
||||||
@@ -221,9 +246,15 @@ in practice.
|
|||||||
* K. Melhorn, St. Näher. [The LEDA Platform of Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999)
|
* K. Melhorn, St. Näher. [The LEDA Platform of Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999)
|
||||||
* M. Fredman, R. Tarjan. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987))
|
* M. Fredman, R. Tarjan. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987))
|
||||||
|
|
||||||
|
## EXTERNAL LIBRARIES USED
|
||||||
|
* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms
|
||||||
|
* [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm
|
||||||
|
* [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output
|
||||||
|
* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting re-implementation**.)
|
||||||
|
|
||||||
## ACKNOWLEDGEMENTS
|
## ACKNOWLEDGEMENTS
|
||||||
BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention
|
BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original
|
||||||
and explained all the biology terms he didn't know.
|
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
|
||||||
|
|
||||||
## AUTHOR
|
## AUTHOR
|
||||||
Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
||||||
Reference in New Issue
Block a user