Update Theory section, add Contents and Algorithm section.

This commit is contained in:
eugenefischer
2022-09-29 18:30:07 -05:00
parent e308e47578
commit 633334a1b8

View File

@@ -1,5 +1,15 @@
# BiGpairSEQ SIMULATOR
## CONTENTS
1. ABOUT
2. THEORY
3. THE BiGpairSEQ ALGORITHM
4. USAGE
5. PERFORMANCE
6. TODO
7. CITATIONS
8. ACKNOWLEDGEMENTS
9. AUTHOR
## ABOUT
@@ -8,26 +18,65 @@ of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequen
## THEORY
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
against a null distribution, BiGpairSEQ does not do any statistical calculations
directly.
T cell receptors (TCRs) are encoded by pairs of sequences, alpha sequences (TCRAs) and beta sequences (TCRBs). These sequences
are extremely diverse; to the first approximation, this pair of sequences uniquely identifies a line of T cell (of which there may be many clones).
As described in the original 2015 paper, pairSEQ pairs TCRAs and TCRBs by distributing a
sample of T cells across a 96-well sample plate, then sequencing the contents of each well. It then calculates p-values for
every TCRA/TCRB sequence overlap and compares that against a null distribution, to find the most statistically probable pairings.
BiGpairSEQ uses the same fundamental idea of using occupancy overlap to pair TCR sequences, but unlike pairSEQ it
does not require performing any statistical calculations at all. Instead, BiGpairSEQ uses graph theory methods which
produce provably optimal solutions.
BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well on the sample plate
are connected by an edge in the graph, with the edge weight set to the number of wells in which both sequences appear. The vertices
themselves are labeled with the occupancy data for the individual sequences they represent, which is useful for pre-filtering
before finding a maximum weight matching. Such a graph fully encodes the distribution data from the sample plate.
This is a well-studied combinatorial optimization problem, with many known solutions.
The most efficient algorithm known to the author for maximum weight matching of a bipartite graph with strictly integral
weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is
bipartite with integer weights, this algorithm is ideal for BiGpairSEQ.
The problem of pairing TCRA/TCRB sequences thus reduces to the [assignment problem](https://en.wikipedia.org/wiki/Assignment_problem) of finding a maximum weight
matching (MWM) on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator.
So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
This is a well-studied combinatorial optimization problem, with many known algorithms that produce
provably-optimal solutions. The most theoretically efficient algorithm known to the author for maximum weight matching of a bipartite
graph with strictly integral weights is from Duan and Su (2012). For a graph with m edges, n vertices per side,
and maximum integer edge weight N, their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of
a pairSEQ experiment is bipartite with integer weights, this algorithm is ideal for BiGpairSEQ. Unfortunately, it's a
fairly new algorithm, and not yet implemented by the graph theory library used in this simulator (JGraphT), nor has the author had
time to implement it himself.
There have been some studies which show that [auction algorithms](https://en.wikipedia.org/wiki/Auction_algorithm) for the assignment problem can have superior performance in
real-world implementations, due to their simplicity, than more complex algorithms with better theoretical asymptotic
performance. But, again, there is no such algorithms implemented by JGraphT, nor has the author yet had time to implement one.
So this program instead uses the [Fibonacci heap](https://en.wikipedia.org/wiki/Fibonacci_heap) based algorithm of Fredman and Tarjan (1987) (essentially
[the Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) augmented with a more efficeint priority queue) which has a worst-case
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999). (The simulator
allows the substitution of a [pairing heap](https://en.wikipedia.org/wiki/Pairing_heap) for a Fibonacci heap, though the relative performance difference of the two
has not yet been thoroughly tested.)
One possible advantage of this less efficient algorithm is that Hungarian algorithm and its variations work with both the balanced and the unbalanced assignment problem
(that is, cases where both sides of the bipartite graph have the same number of vertices and those in which they don't.)
Many other MWM algorithms only work for the balanced assignment problem. While pairSEQ-style experiments should theoretically
be balanced assignment problems, in practice sequence dropout can cause them to be unbalanced. The unbalanced case
*can* be reduced to the balanced case, but doing so involves doubling the graph. Since the current implementation uses only
the Hungarian algorithm, graph doubling--which could be challenging with the computational resources available to the
author-- has not yet been necessary.
The relative time/space efficiencies of BiGpairSEQ when backed by different MWM algorithms remains an open problem
## THE BiGpairSEQ ALGORITHM
* Sequence a sample plate of T cells as in pairSEQ.
* Pre-filter the sequence data to minimize the size of the necessary graph.
* *Saturating sequence filter*: remove any sequences present in all of the wells on the sample plate, as there is no signal in the occupancy data of saturating sequences.
* *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. If a sequence is read correctly at least half the time, then the total read count of a sequence (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which C < (O * D) / 2.
* *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it causes a sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove any sequences for which r < R / (2 * O).
* Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represent the distinct TCRAs and the other set represents distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
* Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
* If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 3 or more wells--that is, edges with weight >= 3.0.)
* The resultant matching represents the likeliest TCRA/TCRB sequence pairs based on the occupancy pattern of the sample plate.
## USAGE
@@ -116,7 +165,7 @@ turned on in the Options menu. By default, GraphML output is OFF.
Cell Sample files consist of any number of distinct "T cells." Every cell contains
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences
are not necessarily unique; the relative diversity can be set when making the file.
are not necessarily unique; the relative diversity of CRD1s with respect to CDR3s can be set when making the file.
(Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)
@@ -400,4 +449,4 @@ BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought t
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
## AUTHOR
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022.
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. Improvements and documentation, 2022.