Update readme
This commit is contained in:
86
readme.md
86
readme.md
@@ -4,7 +4,7 @@
|
|||||||
## ABOUT
|
## ABOUT
|
||||||
|
|
||||||
This program simulates BiGpairSEQ (Bipartite Graph pairSEQ), a graph theory-based adaptation
|
This program simulates BiGpairSEQ (Bipartite Graph pairSEQ), a graph theory-based adaptation
|
||||||
of the pairSEQ algorithm (Howie et al. 2015) for pairing T cell receptor sequences.
|
of the pairSEQ algorithm (Howie, et al. 2015) for pairing T cell receptor sequences.
|
||||||
|
|
||||||
## THEORY
|
## THEORY
|
||||||
|
|
||||||
@@ -20,14 +20,13 @@ The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment probl
|
|||||||
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
|
||||||
|
|
||||||
This is a well-studied combinatorial optimization problem, with many known solutions.
|
This is a well-studied combinatorial optimization problem, with many known solutions.
|
||||||
The best currently-known algorithm for bipartite graphs with integer weights--which is what BiGpairSEQ uses--
|
The best currently-known algorithm for bipartite graphs with integer weights--which is what BiGpairSEQ uses--is
|
||||||
is from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
|
from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
|
||||||
their algorithm runs in **O(m sqrt(n) log(N))** time. This is the best known efficiency for finding a maximum weight
|
their algorithm runs in **O(m sqrt(n) log(N))** time. This is the best known efficiency for finding a maximum weight
|
||||||
matching on a bipartite graph, and the integer edge weight requirement makes it ideal for BiGpairSEQ.
|
matching on a bipartite graph, and the integer edge weight requirement makes it ideal for BiGpairSEQ.
|
||||||
|
|
||||||
Unfortunately, it's a fairly new algorithm, and the integer edge weight requirement makes it less generically useful.
|
Unfortunately, it's a fairly new algorithm. It is not implemented by the graph theory library used in this simulator.
|
||||||
It is not implemented by the graph theory library used in this simulator. So this program
|
So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
||||||
instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
|
|
||||||
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
|
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Melhorn and Näher (1999).
|
||||||
|
|
||||||
The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue,
|
The current version of the program uses a pairing heap instead of a Fibonacci heap for its priority queue,
|
||||||
@@ -81,9 +80,9 @@ writing files, the program will automatically add the correct extension to any f
|
|||||||
|
|
||||||
#### Cell Sample Files
|
#### Cell Sample Files
|
||||||
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
Cell Sample files consist of any number of distinct "T cells." Every cell contains
|
||||||
four sequences: Alpha CDR3, Beta CDR, Alpha CDR1, Beta CDR1. The sequences are represented by
|
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
|
||||||
random integers. CDR3 Alpha and Beta sequences are all unique. CDR1 Alpha and Beta sequences
|
random integers. CDR3 Alpha and Beta sequences are all unique within a given Cell Sample file. CDR1 Alpha and Beta sequences
|
||||||
are not necessarily unique; the relative diversity can be set when making a Cell Sample file.
|
are not necessarily unique; the relative diversity can be set when making the file.
|
||||||
|
|
||||||
(Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)
|
(Note: though cells still have CDR1 sequences, matching of CDR1s is currently awaiting re-implementation.)
|
||||||
|
|
||||||
@@ -108,9 +107,9 @@ Structure:
|
|||||||
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
Sample Plate files consist of any number of "wells" containing any number of T cells (as
|
||||||
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
described above). The wells are filled randomly from a Cell Sample file, according to a selected
|
||||||
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
||||||
given dropout probability, be omitted from the file. This simulates the effect of amplification errors
|
given dropout probability, be omitted from the file; this simulates the effect of amplification errors
|
||||||
prior to sequencing. Plates can also be partitioned into any number of (approximately) evenly-sized
|
prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a
|
||||||
sections, each of which can have a different number of T cells per well.
|
different concentration of T cells per well.
|
||||||
|
|
||||||
Options when making a Sample Plate file:
|
Options when making a Sample Plate file:
|
||||||
* Cell Sample file to use
|
* Cell Sample file to use
|
||||||
@@ -120,7 +119,7 @@ Options when making a Sample Plate file:
|
|||||||
* Standard deviation size
|
* Standard deviation size
|
||||||
* Exponential
|
* Exponential
|
||||||
* Lambda value
|
* Lambda value
|
||||||
* Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie et al. 2015)
|
* (Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))
|
||||||
* Total number of wells on the plate
|
* Total number of wells on the plate
|
||||||
* Number of sections on plate
|
* Number of sections on plate
|
||||||
* Number of T cells per well
|
* Number of T cells per well
|
||||||
@@ -128,13 +127,13 @@ Options when making a Sample Plate file:
|
|||||||
* Dropout rate
|
* Dropout rate
|
||||||
|
|
||||||
Files are in CSV format. There are no header labels. Every row represents a well.
|
Files are in CSV format. There are no header labels. Every row represents a well.
|
||||||
Every column represents an individual cell, containing four sequences, represented by an array string:
|
Every column represents an individual cell, containing four sequences, depicted as an array string:
|
||||||
`[CDR3A, CDR3B, CDR1A, CDR1B]`. So a representative cell might look like this:
|
`[CDR3A, CDR3B, CDR1A, CDR1B]`. So a representative cell might look like this:
|
||||||
|
|
||||||
`[525902, 791533, -1, 866282]`
|
`[525902, 791533, -1, 866282]`
|
||||||
|
|
||||||
Notice that the Alpha CDR1 is missing in the cell above, due to sequence dropout.
|
Notice that the CDR1 Alpha is missing in the cell above--sequence dropout from simulated amplification error.
|
||||||
Dropouts are represented by replacing sequences with the value `-1`. Comments are preceded by `#`
|
Dropout sequences are replaced with the value `-1`. Comments are preceded by `#`
|
||||||
|
|
||||||
Structure:
|
Structure:
|
||||||
|
|
||||||
@@ -144,26 +143,26 @@ Structure:
|
|||||||
# Each row represents one well on the plate
|
# Each row represents one well on the plate
|
||||||
# Plate size:
|
# Plate size:
|
||||||
# Concentrations:
|
# Concentrations:
|
||||||
# Lambda:
|
# Lambda -or- StdDev:
|
||||||
```
|
```
|
||||||
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... |
|
| Well 1, cell 1 | Well 1, cell 2 | Well 1, cell 3| ... |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| ... |
|
| **Well 2, cell 1** | **Well 2, cell 2** | **Well 2, cell 3**| **...** |
|
||||||
| **Well 3, cell 1** | **Well 3, cell 2** | **Well 3, cell 3**| ... |
|
| **Well 3, cell 1** | **Well 3, cell 2** | **Well 3, cell 3**| **...** |
|
||||||
| ... | ... | ... | ... |
|
| **...** | **...** | **...** | **...** |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
#### Graph and Data Files
|
#### Graph and Data Files
|
||||||
Graph and Data files are serialized binaries of a Java object containing the graph representation of a
|
Graph and Data files are serialized binaries of a Java object containing the weigthed bipartite graph representation of a
|
||||||
Sample Plate and necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking
|
Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file
|
||||||
the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated
|
(to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a
|
||||||
occupancy graph). These files can be several gigabytes in size. Writing them to a file lets us generate a graph and
|
Sample Plate file (to construct the associated occupancy graph). These files can be several gigabytes in size.
|
||||||
its metadata once, then use it for multiple different BiGpairSEQ simulations.
|
Writing them to a file lets us generate a graph and its metadata once, then use it for multiple different BiGpairSEQ simulations.
|
||||||
|
|
||||||
Options for creating a Graph and Data file:
|
Options for creating a Graph and Data file:
|
||||||
* The Cell Sample file to use
|
* The Cell Sample file to use
|
||||||
* The Sample Plate file (generated from the given Cell Sample file) to use.
|
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
|
||||||
|
|
||||||
These files do not have a human-readable structure, and are not portable to other programs. (Export of graphs in a
|
These files do not have a human-readable structure, and are not portable to other programs. (Export of graphs in a
|
||||||
portable data format may be implemented in the future. The tricky part is encoding the necessary metadata.)
|
portable data format may be implemented in the future. The tricky part is encoding the necessary metadata.)
|
||||||
@@ -181,10 +180,9 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
|||||||
* The maximum number of alpha/beta overlap wells to attempt to match
|
* The maximum number of alpha/beta overlap wells to attempt to match
|
||||||
* (must be <= the number of wells on the plate - 1)
|
* (must be <= the number of wells on the plate - 1)
|
||||||
* The maximum difference in alpha/beta occupancy to attempt to match
|
* The maximum difference in alpha/beta occupancy to attempt to match
|
||||||
* (To skip using this filter, enter a value >= the number of wells on the plate)
|
* (Optional. To skip using this filter, enter a value >= the number of wells on the plate)
|
||||||
* The minimum percentage of a sequence's occupied wells shared by another sequence to attempt to match
|
* The minimum overlap percentage--the percentage of a sequence's occupied wells shared by another sequence--to attempt to match. Given as value in range 0 - 100.
|
||||||
* given value from 0 to 100
|
* (Optional. To skip using this filter, enter 0)
|
||||||
* (To skip using this filter, enter 0)
|
|
||||||
|
|
||||||
Example output:
|
Example output:
|
||||||
|
|
||||||
@@ -223,24 +221,26 @@ using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et a
|
|||||||
## TODO
|
## TODO
|
||||||
|
|
||||||
* ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE
|
* ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE
|
||||||
* ~~Hold graph data in memory until another graph is read-in?~~
|
* ~~Hold graph data in memory until another graph is read-in?~~ ABANDONED
|
||||||
* No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.
|
* *No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*
|
||||||
|
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
|
||||||
|
* ~~Problem is variable number of cells in a well~~
|
||||||
|
* ~~Apache Commons CSV library writes entries a row at a time~~
|
||||||
|
* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Thingiverse transposes CSV matrices in R, might not even be worth fixing._
|
||||||
|
* Re-implement command line arguments, to enable scripting and statistical simulation studies
|
||||||
|
* Implement sample plates with random numbers of T cells per well.
|
||||||
|
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not.
|
||||||
|
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
||||||
* Enable GraphML output in addition to serialized object binaries, for data portability
|
* Enable GraphML output in addition to serialized object binaries, for data portability
|
||||||
* Custom vertex type with attribute for sequence occupancy?
|
* Custom vertex type with attribute for sequence occupancy?
|
||||||
* Re-implement CDR1 matching method
|
* Re-implement CDR1 matching method
|
||||||
* Re-implement command line arguments, to enable scripting and statistical simulation studies
|
* Implement Duan and Su's maximum weight matching algorithm
|
||||||
* Implement Duan and Su's maximum weight matching algorithms
|
|
||||||
* Add controllable algorithm-type parameter?
|
* Add controllable algorithm-type parameter?
|
||||||
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
|
* Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm
|
||||||
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
|
* in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage
|
||||||
* Add controllable heap-type parameter?
|
* Add controllable heap-type parameter?
|
||||||
* Implement sample plates with random numbers of T cells per well
|
|
||||||
* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well populations; pairSEQ is not.
|
|
||||||
* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
|
|
||||||
* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows
|
|
||||||
* Problem is variable number of cells in a well
|
|
||||||
* Apache Commons CSV library writes entries a row at a time
|
|
||||||
* Can possibly sort the wells by length first, then construct entries
|
|
||||||
|
|
||||||
## CITATIONS
|
## CITATIONS
|
||||||
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
|
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
|
||||||
@@ -259,4 +259,4 @@ BiGpairSEQ was conceived in collaboration with Dr. Alice MacQueen, who brought t
|
|||||||
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
|
pairSEQ paper to the author's attention and explained all the biology terms he didn't know.
|
||||||
|
|
||||||
## AUTHOR
|
## AUTHOR
|
||||||
Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. UI improvements and documentation, 2022.
|
||||||
Reference in New Issue
Block a user