Clarify steps and reasoning behind the algorithm

2022-10-01 13:43:14 -05:00
parent 54896bc47f
commit 9f0ac227e2
1 changed files with 26 additions and 14 deletions
--- a/readme.md
+++ b/readme.md
@@ -42,7 +42,7 @@ This is a well-studied combinatorial optimization problem, with many known algor
 provably-optimal solutions. The most theoretically efficient algorithm known to the author for maximum weight matching of a bipartite
 graph with strictly integral weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, 
 and maximum integer edge weight N, their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of 
-a pairSEQ experiment is bipartite with integer weights, this algorithm is ideal for BiGpairSEQ. Unfortunately, it's a 
+a pairSEQ experiment is bipartite with integer weights, this algorithm seems ideal for BiGpairSEQ. Unfortunately, it's a 
 fairly new algorithm, and not yet implemented by the graph theory library used in this simulator (JGraphT), nor has the author had 
 time to implement it himself.

@@ -56,27 +56,39 @@ runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in
 allows the substitution of a [pairing heap](https://en.wikipedia.org/wiki/Pairing_heap) for a Fibonacci heap, though the relative performance difference of the two
 has not yet been thoroughly tested.)

-One possible advantage of this less efficient algorithm is that Hungarian algorithm and its variations work with both the balanced and the unbalanced assignment problem
+One possible advantage of this less efficient algorithm is that the Hungarian algorithm and its variations work with both the balanced and the unbalanced assignment problem
 (that is, cases where both sides of the bipartite graph have the same number of vertices and those in which they don't.)
 Many other MWM algorithms only work for the balanced assignment problem. While pairSEQ-style experiments should theoretically
 be balanced assignment problems, in practice sequence dropout can cause them to be unbalanced. The unbalanced case 
-*can* be reduced to the balanced case, but doing so involves doubling the graph. Since the current implementation uses only
+*can* be reduced to the balanced case, but doing so involves doubling the graph size. Since the current implementation uses only
 the Hungarian algorithm, graph doubling--which could be challenging with the computational resources available to the 
 author--has not yet been necessary.

-The relative time/space efficiencies of BiGpairSEQ when backed by different MWM algorithms remains an open problem
+The relative time/space efficiencies of BiGpairSEQ when backed by different MWM algorithms remains an open problem.

 ## THE BiGpairSEQ ALGORITHM

-* Sequence a sample plate of T cells as in pairSEQ.
-* Pre-filter the sequence data to minimize the size of the necessary graph.
-  * *Saturating sequence filter*: remove any sequences present in all of the wells on the sample plate, as there is no signal in the occupancy data of saturating sequences.
-  * *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. If a sequence is read correctly at least half the time, then the total read count of a sequence (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which C < (O * D) / 2.
-  * *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it causes a sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove from the list of wells occupied by a sequence any wells for which r < R / (2 * O).
-* Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represent the distinct TCRAs and the other set represents distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
-* Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
-  * If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 3 or more wells--that is, edges with weight >= 3.0.)
-* The resultant matching represents the likeliest TCRA/TCRB sequence pairs based on the occupancy pattern of the sample plate.
+1. Sequence a sample plate of T cells as in pairSEQ.
+2. Pre-filter the sequence data to reduce error and minimize the size of the necessary graph.
+   1. *Saturating sequence filter*: remove any sequences present in all wells on the sample plate, as there is no signal in the occupancy data of saturating sequences (and each saturating sequence will have an edge to every vertex on the opposite side of the graph, vastly increasing the total graph size).
+   2. *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. Assuming sequences are read correctly at least half the time, then a sequence's  total read count (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which R < (O * D) / 2.
+   3. *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it skews a sequence's overall occupancy pattering by causing the sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove from the list of wells occupied by a sequence any wells for which r < R / (2 * O).
+3. Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represent the distinct TCRAs and the other set represents distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
+4. Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
+    * If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 4 or more wells--that is, edges with weight >= 4.0.) See below for discussion of why this might be desirable.
+5. The resultant matching represents the likeliest TCRA/TCRB sequence pairs based on the occupancy pattern of the sample plate.
+
+It is important to note that a maximum weight matching is not necessarily unique. If two different sets of vertex-disjoint edges
+sum to the same maximal weight, then a MWM algorithms might find either one of them.
+
+For example, consider a well that contains four rare sequences found only in that well, two TCRAs and two TCRBs. 
+In the graph, both of those TCRAs would have edges to both TCRBs (and to others of course, but since those edges will have a weight of 1.0, 
+they are unlikely be paired in a MWM to sequences with total occupancy of more than one well). If these four sequences 
+represent two unique T cells, then only one of the two possible pairings between these sequences is correct. But both 
+the correct and incorrect pairing will add 2.0 to the total graph weight, so either one could be part of a maximum weight matching.
+
+It is to minimize the number of possible equivalent-weight matchings that one might restrict the algorithm to examining
+only a subset of the graph, as described in step 4 above.

 ## USAGE