295 lines
16 KiB
Markdown
295 lines
16 KiB
Markdown
# BiGpairSEQ SIMULATOR
|
||
|
||
|
||
## ABOUT
|
||
|
||
This program simulates BiGpairSEQ (Bipartite Graph pairSEQ), a graph theory-based adaptation
|
||
of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequences.
|
||
|
||
## THEORY
|
||
|
||
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
||
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
||
directly.
|
||
|
||
BiGpairSEQ creates a [weightd bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
|
||
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
|
||
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
||
(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
|
||
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||
|
||
This is a well-studied combinatorial optimization problem, with many known solutions.
|
||
The most efficient algorithm known to the author for maximum weight matching of a bipartite graph with strictly integral weights
|
||
is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
|
||
their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is
|
||
bipartite with integer weights, this algorithm is ideal for BiGpairSEQ.
|
||
|
||
Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator.
|
||
So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
||
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
|
||
|
||
## USAGE
|
||
|
||
### RUNNING THE PROGRAM
|
||
|
||
[Download the current version of BiGpairSEQ_Sim.](https://gitea.ejsf.synology.me/efischer/BiGpairSEQ/releases)
|
||
|
||
BiGpairSEQ_Sim is an executable .jar file. Requires Java 14 or higher. [OpenJDK 17](https://jdk.java.net/17/)
|
||
recommended.
|
||
|
||
Run with the command:
|
||
|
||
`java -jar BiGpairSEQ_Sim.jar`
|
||
|
||
Processing sample plates with tens of thousands of sequences may require large amounts
|
||
of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag.
|
||
For example, to run the program with 32 gigabytes of memory, use the command:
|
||
|
||
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
|
||
|
||
Once running, BiGpairSEQ_Sim has an interactive, menu-driven CLI for generating files and simulating TCR pairing. The
|
||
main menu looks like this:
|
||
|
||
```
|
||
--------BiGPairSEQ SIMULATOR--------
|
||
ALPHA/BETA T CELL RECEPTOR MATCHING
|
||
USING WEIGHTED BIPARTITE GRAPHS
|
||
------------------------------------
|
||
Please select an option:
|
||
1) Generate a population of distinct cells
|
||
2) Generate a sample plate of T cells
|
||
3) Generate CDR3 alpha/beta occupancy data and overlap graph
|
||
4) Simulate bipartite graph CDR3 alpha/beta matching (BiGpairSEQ)
|
||
8) Options
|
||
9) About/Acknowledgments
|
||
0) Exit
|
||
```
|
||
|
||
### INPUT/OUTPUT
|
||
|
||
To run the simulation, the program reads and writes 4 kinds of files:
|
||
* Cell Sample files in CSV format
|
||
* Sample Plate files in CSV format
|
||
* Graph/Data files in binary object serialization format
|
||
* Matching Results files in CSV format
|
||
|
||
These files are often generated in sequence. When entering filenames, it is not necessary to include the file extension
|
||
(.csv or .ser). When reading or writing files, the program will automatically add the correct extension to any filename without one.
|
||
|
||
To save file I/O time, the most recent instance of each of these four
|
||
files either generated or read from disk can be cached in program memory. This is could be important for Graph/Data files,
|
||
which can be several gigabytes in size. Since some simulations may require running multiple,
|
||
differently-configured BiGpairSEQ matchings on the same graph, keeping the most recent graph cached may reduce execution time.
|
||
(The manipulation necessary to re-use a graph incurs its own performance overhead, though, which may scale with graph
|
||
size faster than file I/O does. If so, caching is best for smaller graphs.)
|
||
|
||
When caching is active, subsequent uses of the same data file won't need to be read in again until another file of that type is used or generated,
|
||
or caching is turned off for that file type. The program checks whether it needs to update its cached data by comparing
|
||
filenames as entered by the user. On encountering a new filename, the program flushes its cache and reads in the new file.
|
||
|
||
The program's caching behavior can be controlled in the Options menu. By default, all caching is OFF.
|
||
|
||
#### Cell Sample Files
|
||
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
||
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
|
||
random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences
|
||
are not necessarily unique; the relative diversity can be set when making the file.
|
||
|
||
(Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)
|
||
|
||
Options when making a Cell Sample file:
|
||
* Number of T cells to generate
|
||
* Factor by which CDR3s are more diverse than CDR1s
|
||
|
||
Files are in CSV format. Rows are distinct T cells, columns are sequences within the cells.
|
||
Comments are preceded by `#`
|
||
|
||
Structure:
|
||
|
||
---
|
||
# Sample contains 1 unique CDR1 for every 4 unique CDR3s.
|
||
| Alpha CDR3 | Beta CDR3 | Alpha CDR1 | Beta CDR1 |
|
||
|---|---|---|---|
|
||
|unique number|unique number|number|number|
|
||
|
||
---
|
||
|
||
#### Sample Plate Files
|
||
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
||
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
||
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
||
given dropout probability, be omitted from the file; this simulates the effect of amplification errors
|
||
prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a
|
||
different concentration of T cells per well.
|
||
|
||
Options when making a Sample Plate file:
|
||
* Cell Sample file to use
|
||
* Statistical distribution to apply to Cell Sample file
|
||
* Poisson
|
||
* Gaussian
|
||
* Standard deviation size
|
||
* Exponential
|
||
* Lambda value
|
||
* *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))*
|
||
* Total number of wells on the plate
|
||
* Number of sections on plate
|
||
* Number of T cells per well
|
||
* per section, if more than one section
|
||
* Dropout rate
|
||
|
||
Files are in CSV format. There are no header labels. Every row represents a well.
|
||
Every value represents an individual cell, containing four sequences, depicted as an array string:
|
||
`[CDR3A, CDR3B, CDR1A, CDR1B]`. So a representative cell might look like this:
|
||
|
||
`[525902, 791533, -1, 866282]`
|
||
|
||
Notice that the CDR1 Alpha is missing in the cell above--sequence dropout from simulated amplification error.
|
||
Dropout sequences are replaced with the value `-1`. Comments are preceded by `#`
|
||
|
||
Structure:
|
||
|
||
---
|
||
```
|
||
# Cell source file name:
|
||
# Each row represents one well on the plate
|
||
# Plate size:
|
||
# Concentrations:
|
||
# Lambda -or- StdDev:
|
||
```
|
||
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... |
|
||
|---|---|---|---|
|
||
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| **...** |
|
||
| **Well 3, cell 1** | **Well 3, cell 2** | **Well 3, cell 3**| **...** |
|
||
| **...** | **...** | **...** | **...** |
|
||
|
||
---
|
||
|
||
#### Graph/Data Files
|
||
Graph/Data files are serialized binaries of a Java object containing the weigthed bipartite graph representation of a
|
||
Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file
|
||
(to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a
|
||
Sample Plate file (to construct the associated occupancy graph).
|
||
|
||
These files can be several gigabytes in size. Writing them to a file lets us generate a graph and its metadata once,
|
||
then use it for multiple different BiGpairSEQ simulations.
|
||
|
||
Options for creating a Graph/Data file:
|
||
* The Cell Sample file to use
|
||
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
|
||
|
||
These files do not have a human-readable structure, and are not portable to other programs. (Export of graphs in a
|
||
portable data format may be implemented in the future. The tricky part is encoding the necessary metadata.)
|
||
|
||
---
|
||
|
||
#### Matching Results Files
|
||
Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and
|
||
Data file. Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
|
||
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
|
||
|
||
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
||
* The minimum number of alpha/beta overlap wells to attempt to match
|
||
* (must be >= 1)
|
||
* The maximum number of alpha/beta overlap wells to attempt to match
|
||
* (must be <= the number of wells on the plate - 1)
|
||
* The maximum difference in alpha/beta occupancy to attempt to match
|
||
* (Optional. To skip using this filter, enter a value >= the number of wells on the plate)
|
||
* The minimum overlap percentage--the percentage of a sequence's occupied wells shared by another sequence--to attempt to match. Given as value in range 0 - 100.
|
||
* (Optional. To skip using this filter, enter 0)
|
||
|
||
Example output:
|
||
|
||
---
|
||
```
|
||
# Source Sample Plate file: 4MilCellsPlate.csv
|
||
# Source Graph and Data file: 4MilCellsPlateGraph.ser
|
||
# T cell counts in sample plate wells: 30000
|
||
# Total alphas found: 11813
|
||
# Total betas found: 11808
|
||
# High overlap threshold: 94
|
||
# Low overlap threshold: 3
|
||
# Minimum overlap percent: 0
|
||
# Maximum occupancy difference: 96
|
||
# Pairing attempt rate: 0.438
|
||
# Correct pairings: 5151
|
||
# Incorrect pairings: 18
|
||
# Pairing error rate: 0.00348
|
||
# Simulation time: 862 seconds
|
||
```
|
||
|
||
| Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
|
||
|---|---|---|---|---|---|---|
|
||
|5242972|17|1571520|18|17|true|1.41E-18|
|
||
|5161027|18|2072219|18|18|true|7.31E-20|
|
||
|4145198|33|1064455|30|29|true|2.65E-21|
|
||
|7700582|18|112748|18|18|true|7.31E-20|
|
||
|...|...|...|...|...|...|...|
|
||
|
||
---
|
||
|
||
**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all.
|
||
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison only,
|
||
using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)
|
||
|
||
### PERFORMANCE
|
||
Performance details of the example excerpted above:
|
||
|
||
On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM (half of which was allocated to the Java Virtual Machine), and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel),
|
||
the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas,
|
||
taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution.
|
||
|
||
With min/max occupancy threshold of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151
|
||
correct pairings and 18 incorrect pairings, for an accuracy of 99.652%.
|
||
|
||
The simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.
|
||
|
||
Since this implementation of BiGpairSEQ writes intermediate results to disk (to improve the efficiency of *repeated* simulations
|
||
with different filtering options), the actual elapsed time was greater. File I/O time was not measured, but took
|
||
slightly less time than the simulation itself. Real elapsed time from start to finish was under 30 minutes.
|
||
|
||
## TODO
|
||
|
||
* ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE
|
||
* Hold graph data in memory until another graph is read-in? ~~ABANDONED~~ ~~UNABANDONED~~ DONE
|
||
* ~~*No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*~~
|
||
* Might have figured out a way to do it, by taking edges out and then putting them back into the graph. This may actually be possible.
|
||
* It is possible, though the modifications to the graph incur their own performance penalties. Need testing to see which option is best.
|
||
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
|
||
* ~~Problem is variable number of cells in a well~~
|
||
* ~~Apache Commons CSV library writes entries a row at a time~~
|
||
* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Thingiverse transposes CSV matrices in R, might not even be worth fixing._
|
||
* Re-implement command line arguments, to enable scripting and statistical simulation studies
|
||
* Implement sample plates with random numbers of T cells per well.
|
||
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not.
|
||
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
||
* Enable GraphML output in addition to serialized object binaries, for data portability
|
||
* Custom vertex type with attribute for sequence occupancy?
|
||
* Re-implement CDR1 matching method
|
||
* Implement Duan and Su's maximum weight matching algorithm
|
||
* Add controllable algorithm-type parameter?
|
||
* ~~Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm~~ DONE
|
||
* ~~in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage~~
|
||
* ~~Add controllable heap-type parameter?~~
|
||
* Parameter implemented. For large graphs, Fibonacci heap wins. Now the new default.
|
||
|
||
|
||
|
||
## CITATIONS
|
||
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
|
||
* Duan, R., Su H. ["A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs."](https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf) Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1413-1424. (2012)
|
||
* Melhorn, K., Näher, St. [The LEDA Platform of Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999)
|
||
* Fredman, M., Tarjan, R. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987))
|
||
|
||
## EXTERNAL LIBRARIES USED
|
||
* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms
|
||
* [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm
|
||
* [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output
|
||
* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting re-implementation**.)
|
||
|
||
## ACKNOWLEDGEMENTS
|
||
BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original
|
||
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
|
||
|
||
## AUTHOR
|
||
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022. |