diff --git a/readme.md b/readme.md index ff06da2..629b75e 100644 --- a/readme.md +++ b/readme.md @@ -1,14 +1,45 @@ # BiGpairSEQ SIMULATOR + ## ABOUT -This program simulates BiGpairSEQ, a graph theory based adaptation +This program simulates BiGpairSEQ (Bipartite Graph pairSEQ), a graph theory-based adaptation of the pairSEQ algorithm (Howie et al. 2015) for pairing T cell receptor sequences. +## THEORY + +Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares +against a null distribution, BiGpairSEQ does not do any statistical calculations +directly. + +BiGpairSEQ creates a [simple bipartite weighted graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate. +The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well +are connected by an edge, with the edge weight set to the number of wells in which both sequences appear. +(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.) +The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight +matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value. + +This is a well-studied combinatorial optimization problem, with many known solutions. +The best currently-known algorithm for bipartite graphs with integer weights--which is what BiGpairSEQ uses-- +is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N, +their algorithm runs in **O(m sqrt(n) log(N))** time. This is the best known efficiency for finding a maximum weight +matching on a bipartite graph, and the integer edge weight requirement makes it ideal for BiGpairSEQ. + +Unfortunately, it's a fairly new algorithm, and the integer edge weight requirement makes it less generically useful. +It is not implemented by the graph theory library used in this simulator. So this program +instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case +runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999). + +The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue, +which has lower theoretical efficiency but also lower complexity overhead, and is often equivalently performant +in practice. + ## USAGE -Released as an executable .jar file with interactive, command line UI. -Requires Java11 or higher (openjdk-17 recommended). +### RUNNING THE PROGRAM + +BiGpairSEQ_Sim is an executable .jar file. Requires Java 11 or higher. [OpenJDK 17](https://jdk.java.net/17/) +recommended. Run with the command: @@ -20,7 +51,24 @@ For example, to run the program with 32 gigabytes of memory, use the command: `java -Xmx32G -jar BiGpairSEQ_Sim.jar` -## OUTPUT +Once running, BiGpairSEQ_Sim has an interactive, menu-driven CLI for generating files and simulating TCR pairing. The +main menu looks like this: + +``` +--------BiGPairSEQ SIMULATOR-------- +ALPHA/BETA T-CELL RECEPTOR MATCHING + USING WEIGHTED BIPARTITE GRAPHS +------------------------------------ +Please select an option: +1) Generate a population of distinct cells +2) Generate a sample plate of T cells +3) Generate CDR3 alpha/beta occupancy data and overlap graph +4) Simulate bipartite graph CDR3 alpha/beta matching (BiGpairSEQ) +9) About/Acknowledgments +0) Exit +``` + +### OUTPUT To run the simulation, the program reads and writes 4 kinds of files: * Cell Sample files in CSV format @@ -28,7 +76,10 @@ To run the simulation, the program reads and writes 4 kinds of files: * Graph and Data files in binary object serialization format * Matching Results files in CSV format -### -- Cell Sample Files -- +When entering filenames, it is not necessary to include the file extension (.csv or .ser). When reading or +writing files, the program will automatically add the correct extension to any filename without one. + +#### Cell Sample Files Cell Sample files consist of any number of distinct "T cells." Every cell contains four sequences: Alpha CDR3, Beta CDR, Alpha CDR1, Beta CDR1. The sequences are represented by random integers. CDR3 Alpha and Beta sequences are all unique. CDR1 Alpha and Beta sequences @@ -55,7 +106,7 @@ Structure: **NOTE:** Matching of CDR1s is currently awaiting re-implementation. -### -- Sample Plate Files -- +#### Sample Plate Files Sample Plate files consist of any number of "wells" containing any number of T cells (as described above). The wells are filled randomly from a Cell Sample file, according to a selected frequency distribution. Additionally, every individual sequence within each cell may, with some @@ -90,7 +141,13 @@ Dropouts are represented by replacing sequences with the value `-1`. Comments ar Structure: --- - +``` +# Cell source file name: +# Each row represents one well on the plate +# Plate size: +# Concentrations: +# Lambda: +``` | Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... | |---|---|---|---| | **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| ... | @@ -99,7 +156,7 @@ Structure: --- -### -- Graph and Data Files -- +#### Graph and Data Files Graph and Data files are serialized binaries of a Java object containing the graph representation of a Sample Plate and necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated @@ -115,13 +172,11 @@ Options for creating a Graph and Data file: * The Cell Sample file to use * The Sample Plate file (generated from the given Cell Sample file) to use. -### -- Matching Results Files -- +#### Matching Results Files Matching results files consist of the results of a BiGpairSEQ matching simulation. Files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details. Metadata about the matching simulation is included as comments. Comments are preceded by `#`. - - Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: * The minimum number of alpha/beta overlap wells to attempt to match * (must be >= 1) @@ -136,19 +191,20 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: Sample File Structure: --- - - # T cell counts in sample plate wells: 5000 - # Total alphas found: 3387 - # Total betas found: 3396 - # High overlap threshold: 94 - # Low overlap threshold: 3 - # Minimum overlap percent: 0 - # Maximum occupancy difference: 50 - # Pairing attempt rate: 0.488 - # Correct pairings: 1650 - # Incorrect pairings: 4 - # Pairing error rate: 0.00242 - # Simulation time: 19 seconds +``` +# T cell counts in sample plate wells: 5000 +# Total alphas found: 3387 +# Total betas found: 3396 +# High overlap threshold: 94 +# Low overlap threshold: 3 +# Minimum overlap percent: 0 +# Maximum occupancy difference: 50 +# Pairing attempt rate: 0.488 +# Correct pairings: 1650 +# Incorrect pairings: 4 +# Pairing error rate: 0.00242 +# Simulation time: 19 seconds +``` | Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value | |---|---|---|---|---|---|---| @@ -159,35 +215,9 @@ Sample File Structure: --- **NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all. -P-values are calculated *after* BiGpairSEQ matching is completed for purposes of comparison, +P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison, using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015) -## THEORY - -Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares -against a null distribution, BiGpairSEQ does not do any statistical calculations -directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate. -The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well -are connected by an edge, with the edge weight set to the number of wells in which both sequences appear. -(Sequences in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy -The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight -matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value. - -This is a very well-studied combinatorial optimization problem, with many known solutions. -The best currently-known algorithm for bipartite graphs with integer weights is from Duan and Su (2012). -For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs -in O(m sqrt(n) log(N)) time. With its best-known efficiency and requirement of integer weights, this -algorithm is ideal for BiGpairSEQ. - -Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generically useful, -and it is not implemented by the graph theory library used in this simulator. So this program -instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1984), which has a worst-case -runtime of O(n (n log(n) + m)). The algorithm is implemented as described in Melhorn and Näher (1999). - -The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue, -which has lower theoretical efficiency but also lower complexity overhead, and is often equivalently performant -in practice. - ## TODO * Try invoking GC at end of workloads to reduce paging to disk @@ -203,27 +233,28 @@ in practice. * in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage * Add controllable heap-type parameter? * Implement sample plates with random numbers of T cells per well - * Possible BiGpairSEQ advantage: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not. + * Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not. * preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative. * See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows * Problem is variable number of cells in a well * Apache Commons CSV library writes entries a row at a time - -## EXTERNAL LIBRARIES USED -* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms -* [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm -* [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output -* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting reimplementation**.) - + * Can possibly sort the wells by length first, then construct entries + ## CITATIONS * Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015) * Duan, R., Su H. ["A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs."](https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf) Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1413-1424. (2012) * K. Melhorn, St. Näher. [The LEDA Platform of Combinatorial and Geometric Computing.](https://people.mpi-inf.mpg.de/~mehlhorn/LEDAbook.html) Cambridge University Press. Chapter 7, Graph Algorithms; p. 132-162 (1999) * M. Fredman, R. Tarjan. ["Fibonacci heaps and their uses in improved network optimization algorithms."](https://www.cl.cam.ac.uk/teaching/1011/AlgorithII/1987-FredmanTar-fibonacci.pdf) J. ACM, 34(3):596–615 (1987)) +## EXTERNAL LIBRARIES USED +* [JGraphT](https://jgrapht.org) -- Graph theory data structures and algorithms +* [JHeaps](https://www.jheaps.org) -- For pairing heap priority queue used in maximum weight matching algorithm +* [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) -- For CSV file output +* [Apache Commons CLI](https://commons.apache.org/proper/commons-cli/) -- To enable command line arguments for scripting. (**Awaiting re-implementation**.) + ## ACKNOWLEDGEMENTS -BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention -and explained all the biology terms he didn't know. +BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought the original +pairSEQ paper to the author's attention and explained all the biology terms he didn't know. ## AUTHOR Eugene Fischer, 2021. UI improvements and documentation, 2022. \ No newline at end of file