Update Theory section, add Contents and Algorithm section.
This commit is contained in:
85
readme.md
85
readme.md
@@ -1,5 +1,15 @@
|
||||
# BiGpairSEQ SIMULATOR
|
||||
|
||||
## CONTENTS
|
||||
1. ABOUT
|
||||
2. THEORY
|
||||
3. THE BiGpairSEQ ALGORITHM
|
||||
4. USAGE
|
||||
5. PERFORMANCE
|
||||
6. TODO
|
||||
7. CITATIONS
|
||||
8. ACKNOWLEDGEMENTS
|
||||
9. AUTHOR
|
||||
|
||||
## ABOUT
|
||||
|
||||
@@ -8,26 +18,65 @@ of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequen
|
||||
|
||||
## THEORY
|
||||
|
||||
Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares
|
||||
against a null distribution, BiGpairSEQ does not do any statistical calculations
|
||||
directly.
|
||||
T cell receptors (TCRs) are encoded by pairs of sequences, alpha sequences (TCRAs) and beta sequences (TCRBs). These sequences
|
||||
are extremely diverse; to the first approximation, this pair of sequences uniquely identifies a line of T cell (of which there may be many clones).
|
||||
|
||||
As described in the original 2015 paper, pairSEQ pairs TCRAs and TCRBs by distributing a
|
||||
sample of T cells across a 96-well sample plate, then sequencing the contents of each well. It then calculates p-values for
|
||||
every TCRA/TCRB sequence overlap and compares that against a null distribution, to find the most statistically probable pairings.
|
||||
|
||||
BiGpairSEQ uses the same fundamental idea of using occupancy overlap to pair TCR sequences, but unlike pairSEQ it
|
||||
does not require performing any statistical calculations at all. Instead, BiGpairSEQ uses graph theory methods which
|
||||
produce provably optimal solutions.
|
||||
|
||||
BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
|
||||
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well
|
||||
are connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
|
||||
(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
|
||||
The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight
|
||||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||||
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well on the sample plate
|
||||
are connected by an edge in the graph, with the edge weight set to the number of wells in which both sequences appear. The vertices
|
||||
themselves are labeled with the occupancy data for the individual sequences they represent, which is useful for pre-filtering
|
||||
before finding a maximum weight matching. Such a graph fully encodes the distribution data from the sample plate.
|
||||
|
||||
This is a well-studied combinatorial optimization problem, with many known solutions.
|
||||
The most efficient algorithm known to the author for maximum weight matching of a bipartite graph with strictly integral
|
||||
weights is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
|
||||
their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is
|
||||
bipartite with integer weights, this algorithm is ideal for BiGpairSEQ.
|
||||
The problem of pairing TCRA/TCRB sequences thus reduces to the [assignment problem](https://en.wikipedia.org/wiki/Assignment_problem) of finding a maximum weight
|
||||
matching (MWM) on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||||
|
||||
Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator.
|
||||
So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
||||
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
|
||||
This is a well-studied combinatorial optimization problem, with many known algorithms that produce
|
||||
provably-optimal solutions. The most theoretically efficient algorithm known to the author for maximum weight matching of a bipartite
|
||||
graph with strictly integral weights is from Duan and Su (2012). For a graph with m edges, n vertices per side,
|
||||
and maximum integer edge weight N, their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of
|
||||
a pairSEQ experiment is bipartite with integer weights, this algorithm is ideal for BiGpairSEQ. Unfortunately, it's a
|
||||
fairly new algorithm, and not yet implemented by the graph theory library used in this simulator (JGraphT), nor has the author had
|
||||
time to implement it himself.
|
||||
|
||||
There have been some studies which show that [auction algorithms](https://en.wikipedia.org/wiki/Auction_algorithm) for the assignment problem can have superior performance in
|
||||
real-world implementations, due to their simplicity, than more complex algorithms with better theoretical asymptotic
|
||||
performance. But, again, there is no such algorithms implemented by JGraphT, nor has the author yet had time to implement one.
|
||||
|
||||
So this program instead uses the [Fibonacci heap](https://en.wikipedia.org/wiki/Fibonacci_heap) based algorithm of Fredman and Tarjan (1987) (essentially
|
||||
[the Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) augmented with a more efficeint priority queue) which has a worst-case
|
||||
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999). (The simulator
|
||||
allows the substitution of a [pairing heap](https://en.wikipedia.org/wiki/Pairing_heap) for a Fibonacci heap, though the relative performance difference of the two
|
||||
has not yet been thoroughly tested.)
|
||||
|
||||
One possible advantage of this less efficient algorithm is that Hungarian algorithm and its variations work with both the balanced and the unbalanced assignment problem
|
||||
(that is, cases where both sides of the bipartite graph have the same number of vertices and those in which they don't.)
|
||||
Many other MWM algorithms only work for the balanced assignment problem. While pairSEQ-style experiments should theoretically
|
||||
be balanced assignment problems, in practice sequence dropout can cause them to be unbalanced. The unbalanced case
|
||||
*can* be reduced to the balanced case, but doing so involves doubling the graph. Since the current implementation uses only
|
||||
the Hungarian algorithm, graph doubling--which could be challenging with the computational resources available to the
|
||||
author-- has not yet been necessary.
|
||||
|
||||
The relative time/space efficiencies of BiGpairSEQ when backed by different MWM algorithms remains an open problem
|
||||
|
||||
## THE BiGpairSEQ ALGORITHM
|
||||
|
||||
* Sequence a sample plate of T cells as in pairSEQ.
|
||||
* Pre-filter the sequence data to minimize the size of the necessary graph.
|
||||
* *Saturating sequence filter*: remove any sequences present in all of the wells on the sample plate, as there is no signal in the occupancy data of saturating sequences.
|
||||
* *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. If a sequence is read correctly at least half the time, then the total read count of a sequence (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which C < (O * D) / 2.
|
||||
* *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it causes a sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove any sequences for which r < R / (2 * O).
|
||||
* Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represent the distinct TCRAs and the other set represents distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
|
||||
* Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
|
||||
* If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 3 or more wells--that is, edges with weight >= 3.0.)
|
||||
* The resultant matching represents the likeliest TCRA/TCRB sequence pairs based on the occupancy pattern of the sample plate.
|
||||
|
||||
## USAGE
|
||||
|
||||
@@ -116,7 +165,7 @@ turned on in the Options menu. By default, GraphML output is OFF.
|
||||
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
||||
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
|
||||
random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences
|
||||
are not necessarily unique; the relative diversity can be set when making the file.
|
||||
are not necessarily unique; the relative diversity of CRD1s with respect to CDR3s can be set when making the file.
|
||||
|
||||
(Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)
|
||||
|
||||
@@ -400,4 +449,4 @@ BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought t
|
||||
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
|
||||
|
||||
## AUTHOR
|
||||
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
||||
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. Improvements and documentation, 2022.
|
||||
Reference in New Issue
Block a user