diff --git a/readme.md b/readme.md index 142b829..644d782 100644 --- a/readme.md +++ b/readme.md @@ -1,5 +1,15 @@ # BiGpairSEQ SIMULATOR +## CONTENTS +1. ABOUT +2. THEORY +3. THE BiGpairSEQ ALGORITHM +4. USAGE +5. PERFORMANCE +6. TODO +7. CITATIONS +8. ACKNOWLEDGEMENTS +9. AUTHOR ## ABOUT @@ -8,26 +18,65 @@ of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequen ## THEORY -Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares -against a null distribution, BiGpairSEQ does not do any statistical calculations -directly. +T cell receptors (TCRs) are encoded by pairs of sequences, alpha sequences (TCRAs) and beta sequences (TCRBs). These sequences +are extremely diverse; to the first approximation, this pair of sequences uniquely identifies a line of T cell (of which there may be many clones). + +As described in the original 2015 paper, pairSEQ pairs TCRAs and TCRBs by distributing a +sample of T cells across a 96-well sample plate, then sequencing the contents of each well. It then calculates p-values for +every TCRA/TCRB sequence overlap and compares that against a null distribution, to find the most statistically probable pairings. + +BiGpairSEQ uses the same fundamental idea of using occupancy overlap to pair TCR sequences, but unlike pairSEQ it +does not require performing any statistical calculations at all. Instead, BiGpairSEQ uses graph theory methods which +produce provably optimal solutions. BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate. -The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well -are connected by an edge, with the edge weight set to the number of wells in which both sequences appear. -(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.) -The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight -matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value. +The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well on the sample plate +are connected by an edge in the graph, with the edge weight set to the number of wells in which both sequences appear. The vertices +themselves are labeled with the occupancy data for the individual sequences they represent, which is useful for pre-filtering +before finding a maximum weight matching. Such a graph fully encodes the distribution data from the sample plate. -This is a well-studied combinatorial optimization problem, with many known solutions. -The most efficient algorithm known to the author for maximum weight matching of a bipartite graph with strictly integral -weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N, -their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is -bipartite with integer weights, this algorithm is ideal for BiGpairSEQ. +The problem of pairing TCRA/TCRB sequences thus reduces to the [assignment problem](https://en.wikipedia.org/wiki/Assignment_problem) of finding a maximum weight +matching (MWM) on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value. -Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator. -So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case -runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999). +This is a well-studied combinatorial optimization problem, with many known algorithms that produce +provably-optimal solutions. The most theoretically efficient algorithm known to the author for maximum weight matching of a bipartite +graph with strictly integral weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, +and maximum integer edge weight N, their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of +a pairSEQ experiment is bipartite with integer weights, this algorithm is ideal for BiGpairSEQ. Unfortunately, it's a +fairly new algorithm, and not yet implemented by the graph theory library used in this simulator (JGraphT), nor has the author had +time to implement it himself. + +There have been some studies which show that [auction algorithms](https://en.wikipedia.org/wiki/Auction_algorithm) for the assignment problem can have superior performance in +real-world implementations, due to their simplicity, than more complex algorithms with better theoretical asymptotic +performance. But, again, there is no such algorithms implemented by JGraphT, nor has the author yet had time to implement one. + +So this program instead uses the [Fibonacci heap](https://en.wikipedia.org/wiki/Fibonacci_heap) based algorithm of Fredman and Tarjan (1987) (essentially +[the Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) augmented with a more efficeint priority queue) which has a worst-case +runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999). (The simulator +allows the substitution of a [pairing heap](https://en.wikipedia.org/wiki/Pairing_heap) for a Fibonacci heap, though the relative performance difference of the two +has not yet been thoroughly tested.) + +One possible advantage of this less efficient algorithm is that Hungarian algorithm and its variations work with both the balanced and the unbalanced assignment problem +(that is, cases where both sides of the bipartite graph have the same number of vertices and those in which they don't.) +Many other MWM algorithms only work for the balanced assignment problem. While pairSEQ-style experiments should theoretically +be balanced assignment problems, in practice sequence dropout can cause them to be unbalanced. The unbalanced case +*can* be reduced to the balanced case, but doing so involves doubling the graph. Since the current implementation uses only +the Hungarian algorithm, graph doubling--which could be challenging with the computational resources available to the +author-- has not yet been necessary. + +The relative time/space efficiencies of BiGpairSEQ when backed by different MWM algorithms remains an open problem + +## THE BiGpairSEQ ALGORITHM + +* Sequence a sample plate of T cells as in pairSEQ. +* Pre-filter the sequence data to minimize the size of the necessary graph. + * *Saturating sequence filter*: remove any sequences present in all of the wells on the sample plate, as there is no signal in the occupancy data of saturating sequences. + * *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. If a sequence is read correctly at least half the time, then the total read count of a sequence (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which C < (O * D) / 2. + * *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it causes a sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove any sequences for which r < R / (2 * O). +* Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represent the distinct TCRAs and the other set represents distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences. +* Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result. + * If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 3 or more wells--that is, edges with weight >= 3.0.) +* The resultant matching represents the likeliest TCRA/TCRB sequence pairs based on the occupancy pattern of the sample plate. ## USAGE @@ -116,7 +165,7 @@ turned on in the Options menu. By default, GraphML output is OFF. Cell Sample files consist of any number of distinct "T cells." Every cell contains four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences -are not necessarily unique; the relative diversity can be set when making the file. +are not necessarily unique; the relative diversity of CRD1s with respect to CDR3s can be set when making the file. (Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.) @@ -400,4 +449,4 @@ BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought t pairSEQ paper to the author's attention and explained all the biology terms he didn't know. ## AUTHOR -BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022. \ No newline at end of file +BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. Improvements and documentation, 2022. \ No newline at end of file