Update Theory section, add Contents and Algorithm section.

2022-09-29 18:30:07 -05:00
parent e308e47578
commit 633334a1b8
1 changed files with 67 additions and 18 deletions
--- a/readme.md
+++ b/readme.md
@@ -1,5 +1,15 @@
 # BiGpairSEQ SIMULATOR

+## CONTENTS
+1. ABOUT
+2. THEORY
+3. THE BiGpairSEQ ALGORITHM
+4. USAGE
+5. PERFORMANCE
+6. TODO
+7. CITATIONS
+8. ACKNOWLEDGEMENTS
+9. AUTHOR

 ## ABOUT

@@ -8,26 +18,65 @@ of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequen

 ## THEORY

-Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
-against a null distribution, BiGpairSEQ does not do any statistical calculations
-directly.
+T cell receptors (TCRs) are encoded by pairs of sequences, alpha sequences (TCRAs) and beta sequences (TCRBs). These sequences
+are extremely diverse; to the first approximation, this pair of sequences uniquely identifies a line of T cell (of which there may be many clones). 
+
+As described in the original 2015 paper, pairSEQ pairs TCRAs and TCRBs by distributing a 
+sample of T cells across a 96-well sample plate, then sequencing the contents of each well. It then calculates p-values for 
+every TCRA/TCRB sequence overlap and compares that against a null distribution, to find the most statistically probable pairings.
+
+BiGpairSEQ uses the same fundamental idea of using occupancy overlap to pair TCR sequences, but unlike pairSEQ it
+does not require performing any statistical calculations at all. Instead, BiGpairSEQ uses graph theory methods which
+produce provably optimal solutions.

 BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
-The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
-are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
-(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
-The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
-matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
+The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well on the sample plate
+are connected by an edge in the graph, with the edge weight set to the number of wells in which both sequences appear. The vertices
+themselves are labeled with the occupancy data for the individual sequences they represent, which is useful for pre-filtering
+before finding a maximum weight matching. Such a graph fully encodes the distribution data from the sample plate.

-This is a well-studied combinatorial optimization problem, with many known solutions.
-The most efficient algorithm known to the author for maximum weight matching of a bipartite graph with strictly integral
-weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N, 
-their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is 
-bipartite with integer weights, this algorithm is ideal for BiGpairSEQ.
+The problem of pairing TCRA/TCRB sequences thus reduces to the [assignment problem](https://en.wikipedia.org/wiki/Assignment_problem) of finding a maximum weight
+matching (MWM) on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.

-Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator.
-So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
-runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
+This is a well-studied combinatorial optimization problem, with many known algorithms that produce 
+provably-optimal solutions. The most theoretically efficient algorithm known to the author for maximum weight matching of a bipartite
+graph with strictly integral weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, 
+and maximum integer edge weight N, their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of 
+a pairSEQ experiment is bipartite with integer weights, this algorithm is ideal for BiGpairSEQ. Unfortunately, it's a 
+fairly new algorithm, and not yet implemented by the graph theory library used in this simulator (JGraphT), nor has the author had 
+time to implement it himself.
+
+There have been some studies which show that [auction algorithms](https://en.wikipedia.org/wiki/Auction_algorithm) for the assignment problem can have superior performance in 
+real-world implementations, due to their simplicity, than more complex algorithms with better theoretical asymptotic 
+performance. But, again, there is no such algorithms implemented by JGraphT, nor has the author yet had time to implement one.
+
+So this program instead uses the [Fibonacci heap](https://en.wikipedia.org/wiki/Fibonacci_heap) based algorithm of Fredman and Tarjan (1987) (essentially
+[the Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) augmented with a more efficeint priority queue) which has a worst-case
+runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999). (The simulator
+allows the substitution of a [pairing heap](https://en.wikipedia.org/wiki/Pairing_heap) for a Fibonacci heap, though the relative performance difference of the two
+has not yet been thoroughly tested.)
+
+One possible advantage of this less efficient algorithm is that Hungarian algorithm and its variations work with both the balanced and the unbalanced assignment problem
+(that is, cases where both sides of the bipartite graph have the same number of vertices and those in which they don't.)
+Many other MWM algorithms only work for the balanced assignment problem. While pairSEQ-style experiments should theoretically
+be balanced assignment problems, in practice sequence dropout can cause them to be unbalanced. The unbalanced case 
+*can* be reduced to the balanced case, but doing so involves doubling the graph. Since the current implementation uses only
+the Hungarian algorithm, graph doubling--which could be challenging with the computational resources available to the 
+author-- has not yet been necessary.
+
+The relative time/space efficiencies of BiGpairSEQ when backed by different MWM algorithms remains an open problem
+
+## THE BiGpairSEQ ALGORITHM
+
+* Sequence a sample plate of T cells as in pairSEQ.
+* Pre-filter the sequence data to minimize the size of the necessary graph.
+  * *Saturating sequence filter*: remove any sequences present in all of the wells on the sample plate, as there is no signal in the occupancy data of saturating sequences.
+  * *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. If a sequence is read correctly at least half the time, then the total read count of a sequence (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which C < (O * D) / 2.
+  * *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it causes a sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove any sequences for which r < R / (2 * O).
+* Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represent the distinct TCRAs and the other set represents distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
+* Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
+  * If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 3 or more wells--that is, edges with weight >= 3.0.)
+* The resultant matching represents the likeliest TCRA/TCRB sequence pairs based on the occupancy pattern of the sample plate.

 ## USAGE

@@ -116,7 +165,7 @@ turned on in the Options menu. By default, GraphML output is OFF.
 Cell Sample files consist of any number of distinct "T cells." Every cell contains 
 four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
 random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences
-are not necessarily unique; the relative diversity can be set when making the file.
+are not necessarily unique; the relative diversity of CRD1s with respect to CDR3s can be set when making the file.

 (Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)

@@ -400,4 +449,4 @@ BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought t
 pairSEQ paper to the author's attention and explained all the biology terms he didn't know.

 ## AUTHOR
-BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022.
+BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. Improvements and documentation, 2022.