Update readme with newer results, new menu options
This commit is contained in:
119
readme.md
119
readme.md
@@ -128,7 +128,8 @@ By default, the Options menu looks like this:
|
|||||||
3) Turn on graph/data file caching
|
3) Turn on graph/data file caching
|
||||||
4) Turn off serialized binary graph output
|
4) Turn off serialized binary graph output
|
||||||
5) Turn on GraphML graph output
|
5) Turn on GraphML graph output
|
||||||
6) Maximum weight matching algorithm options
|
6) Turn on calculation of p-values
|
||||||
|
7) Maximum weight matching algorithm options
|
||||||
0) Return to main menu
|
0) Return to main menu
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -182,7 +183,7 @@ Structure:
|
|||||||
| Alpha CDR3 | Beta CDR3 | Alpha CDR1 | Beta CDR1 |
|
| Alpha CDR3 | Beta CDR3 | Alpha CDR1 | Beta CDR1 |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
|unique number|unique number|number|number|
|
|unique number|unique number|number|number|
|
||||||
|
| ... | ... |... | ... |
|
||||||
---
|
---
|
||||||
|
|
||||||
#### Sample Plate Files
|
#### Sample Plate Files
|
||||||
@@ -191,7 +192,8 @@ described above). The wells are filled randomly from a Cell Sample file, accordi
|
|||||||
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
frequency distribution. Additionally, every individual sequence within each cell may, with some
|
||||||
given dropout probability, be omitted from the file; this simulates the effect of amplification errors
|
given dropout probability, be omitted from the file; this simulates the effect of amplification errors
|
||||||
prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a
|
prior to sequencing. Plates can also be partitioned into any number of sections, each of which can have a
|
||||||
different concentration of T cells per well.
|
different concentration of T cells per well. Alternatively, the number of T cells in each well can be randomized between
|
||||||
|
given minimum and maximum population values.
|
||||||
|
|
||||||
Options when making a Sample Plate file:
|
Options when making a Sample Plate file:
|
||||||
* Cell Sample file to use
|
* Cell Sample file to use
|
||||||
@@ -201,7 +203,7 @@ Options when making a Sample Plate file:
|
|||||||
* Standard deviation size
|
* Standard deviation size
|
||||||
* Exponential
|
* Exponential
|
||||||
* Lambda value
|
* Lambda value
|
||||||
* *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was approximately exponential with a lambda ~0.6. (Howie, et al. 2015))*
|
* *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was very roughly exponential with a lambda ~0.6. (Howie, et al. 2015) The actual distribution was certainly quite different.)*
|
||||||
* Total number of wells on the plate
|
* Total number of wells on the plate
|
||||||
* Well populations random or fixed
|
* Well populations random or fixed
|
||||||
* If random, minimum and maximum population sizes
|
* If random, minimum and maximum population sizes
|
||||||
@@ -248,12 +250,12 @@ then use it for multiple different BiGpairSEQ simulations.
|
|||||||
|
|
||||||
Options for creating a Graph/Data file:
|
Options for creating a Graph/Data file:
|
||||||
* The Cell Sample file to use
|
* The Cell Sample file to use
|
||||||
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
|
* The Sample Plate file to use (This must have been generated from the selected Cell Sample file.)
|
||||||
* Whether to simulate sequence read depth. If simulated:
|
* Whether to simulate sequencing read depth. If simulated:
|
||||||
* The read depth (number of times each sequence is read)
|
* The read depth (The number of times each sequence is read)
|
||||||
* The read error rate (probability a sequence is misread)
|
* The read error rate (The probability a sequence is misread)
|
||||||
* The error collision rate (probability two misreads produce the same spurious sequence)
|
* The error collision rate (The probability two misreads produce the same spurious sequence)
|
||||||
* The real sequence collision rate (probability that a misread will produce a different, real sequence from the sample plate. Only applies to new misreads; once an error of this type has occurred, it's likelihood of ocurring again is dominated by the error collision probability.)
|
* The real sequence collision rate (The probability that a misread will produce a different, real sequence from the sample plate. Only applies to new misreads; once an error of this type has occurred, it's likelihood of occurring again is dominated by the error collision probability.)
|
||||||
|
|
||||||
These files do not have a human-readable structure, and are not portable to other programs.
|
These files do not have a human-readable structure, and are not portable to other programs.
|
||||||
|
|
||||||
@@ -270,7 +272,7 @@ compare the matching results to the original Cell Sample .csv file.
|
|||||||
#### Matching Results Files
|
#### Matching Results Files
|
||||||
Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a serialized
|
Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a serialized
|
||||||
binary Graph/Data file (.ser). (Because .graphML files are larger than .ser files, BiGpairSEQ_Sim supports .graphML
|
binary Graph/Data file (.ser). (Because .graphML files are larger than .ser files, BiGpairSEQ_Sim supports .graphML
|
||||||
output only. Graph/data input must use a serialized binary.)
|
output only. Graph input must use a serialized binary.)
|
||||||
|
|
||||||
Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
|
Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
|
||||||
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
|
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
|
||||||
@@ -288,56 +290,83 @@ Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
|
|||||||
Example output:
|
Example output:
|
||||||
|
|
||||||
```
|
```
|
||||||
# Source Sample Plate file: 4MilCellsPlate.csv
|
# sample plate filename: 8MilCells_Plate.csv
|
||||||
# Source Graph and Data file: 4MilCellsPlateGraph.ser
|
# sequence dropout rate: 0.1
|
||||||
# T cell counts in sample plate wells: 30000
|
# graph filename: 8MilGraph_rd2
|
||||||
# Total alphas found: 11813
|
# MWM algorithm type: LEDA book with heap: FIBONACCI
|
||||||
# Total betas found: 11808
|
# matching weight: 218017.0
|
||||||
# High overlap threshold: 94
|
# well populations: 4000
|
||||||
# Low overlap threshold: 3
|
# sequence read depth: 100
|
||||||
# Minimum overlap percent: 0
|
# sequence read error rate: 0.01
|
||||||
# Maximum occupancy difference: 96
|
# read error collision rate: 0.1
|
||||||
# Pairing attempt rate: 0.438
|
# real sequence collision rate: 0.05
|
||||||
# Correct pairings: 5151
|
# total alphas read from plate: 323711
|
||||||
# Incorrect pairings: 18
|
# total betas read from plate: 323981
|
||||||
# Pairing error rate: 0.00348
|
# pre-filter sequences present in all wells: true
|
||||||
# Simulation time: 862 seconds
|
# pre-filter sequences based on occupancy/read count discrepancy: true
|
||||||
|
# alphas in graph (after pre-filtering): 11707
|
||||||
|
# betas in graph (after pre-filtering): 11705
|
||||||
|
# high overlap threshold for pairing: 95
|
||||||
|
# low overlap threshold for pairing: 3
|
||||||
|
# minimum overlap percent for pairing: 0
|
||||||
|
# maximum occupancy difference for pairing: 2147483647
|
||||||
|
# pairing attempt rate: 0.716
|
||||||
|
# correct pairing count: 8373
|
||||||
|
# incorrect pairing count: 7
|
||||||
|
# pairing error rate: 0.000835
|
||||||
|
# time to generate graph (seconds): 293
|
||||||
|
# time to pair sequences (seconds): 1,416
|
||||||
|
# total simulation time (seconds): 1,709
|
||||||
```
|
```
|
||||||
|
|
||||||
| Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
|
| Alpha | Alpha well count | Beta | Beta well count | Overlap count | Matched Correctly? | P-value |
|
||||||
|---|---|---|---|---|---|---|
|
|---|---|---|---|---|---|---|
|
||||||
|5242972|17|1571520|18|17|true|1.41E-18|
|
|10258642|73|1172093|72|70.0|true|4.19E-21|
|
||||||
|5161027|18|2072219|18|18|true|7.31E-20|
|
|6186865|34|4290363|37|34.0|true|4.56E-26|
|
||||||
|4145198|33|1064455|30|29|true|2.65E-21|
|
|10222686|70|11044018|72|68.0|true|9.55E-25|
|
||||||
|7700582|18|112748|18|18|true|7.31E-20|
|
|5338100|75|2422988|76|74.0|true|4.57E-22|
|
||||||
|
|12363907|33|6569852|35|33.0|true|5.70E-26|
|
||||||
|...|...|...|...|...|...|...|
|
|...|...|...|...|...|...|...|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**NOTE: The p-values in the output are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all.
|
**NOTE: The p-values in the sample output abpve are not used for matching**—they aren't part of the BiGpairSEQ algorithm at all.
|
||||||
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison only,
|
P-values (if enabled in the interactive menu options or by using the -pv flag in the command line) are calculated *after*
|
||||||
using the (2021 corrected) formula from the original pairSEQ paper. (Howie, et al. 2015)
|
BiGpairSEQ matching is completed, for purposes of comparison with pairSEQ only, using the (corrected) formula from the
|
||||||
|
original pairSEQ paper. (Howie, et al. 2015) Calculation of p-values is off by default to reduce processing time.
|
||||||
|
|
||||||
|
|
||||||
## PERFORMANCE (old results; need updating to reflect current, improved simulator performance)
|
## PERFORMANCE
|
||||||
|
|
||||||
On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM (half of which was allocated to the Java Virtual Machine), and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel),
|
Several BiGpairSEQ simulations were performed on a home computer with the following specs:
|
||||||
the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas,
|
|
||||||
taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution (lambda 0.6).
|
|
||||||
|
|
||||||
With min/max occupancy threshold of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151
|
* Ryzen 5600X CPU
|
||||||
correct pairings and 18 incorrect pairings, for an accuracy of 99.652%.
|
* 128GB of 3200MHz DDR4 RAM
|
||||||
|
* 2TB PCIe 3.0 SSD
|
||||||
|
* Linux Mint 21 (5.15 kernel)
|
||||||
|
|
||||||
The total simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.
|
### Simulation 1
|
||||||
|
This simulation was an attempt to replicate the conditions of experiment 1 from the 2015 pairSEQ paper: a matching was found for a
|
||||||
|
96-well sample plate with 4,000 T cells/well comprising ~11,900 TCRAs and TCRBs, taken from a sample of 8,400,000
|
||||||
|
distinct cells with an exponential frequency distribution (lambda 0.6). The sequence dropout rate was 10%, as the analysis
|
||||||
|
from the original paper concluded that most TCR sequences "have less than a 10% chance of going unobserved." (Howie, et al. 2015)
|
||||||
|
|
||||||
Since this implementation of BiGpairSEQ writes intermediate results to disk (to improve the efficiency of *repeated* simulations
|
The original paper does not contain (or the author of this document failed to identify) information on sequencing depth,
|
||||||
with different filtering options), the actual elapsed time was greater. File I/O time was not measured, but took
|
read error probability, or the probabilities of different kinds of read error collisions. As the pre-filtering of BiGpairSEQ
|
||||||
slightly less time than the simulation itself. Real elapsed time from start to finish was under 30 minutes.
|
has successfully filtered out all such errors for any reasonable error rates the author has yet tested, this simulation was
|
||||||
|
done without any sequencing errors, to reduce the processing time.
|
||||||
|
|
||||||
As mentioned in the theory section, performance could be improved by implementing a more efficient algorithm for finding
|
With min/max occupancy thresholds of 3 and 95 wells respectively for matching, BiGpairSEQ identified:
|
||||||
the maximum weight matching.
|
* 8,495 correct pairings
|
||||||
|
* 5 incorrect pairings
|
||||||
|
|
||||||
## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS
|
for an overall pairing accuracy of 99.9992%.
|
||||||
|
|
||||||
|
The total simulation time (excluding file I/O) was 28m52. The total elapsed time with file I/O was 41m23s.
|
||||||
|
Calculation of p-values was enabled for this simulation, increasing the overall processing time.
|
||||||
|
|
||||||
|
|
||||||
|
## BEHAVIOR WITH RANDOMIZED WELL POPULATIONS (old results, need updating for new version of the simulator (though resilience to varying well populations is unchanged))
|
||||||
|
|
||||||
A series of BiGpairSEQ simulations were conducted using a cell sample file of 3.5 million unique T cells. From these cells,
|
A series of BiGpairSEQ simulations were conducted using a cell sample file of 3.5 million unique T cells. From these cells,
|
||||||
10 sample plate files were created. All of these sample plates had 96 wells, used an exponential distribution with a lambda of 0.6, and
|
10 sample plate files were created. All of these sample plates had 96 wells, used an exponential distribution with a lambda of 0.6, and
|
||||||
|
|||||||
Reference in New Issue
Block a user