Update TOC, command line options

eugenefischer
2022-10-01 13:59:03 -05:00
parent 0657db5653
commit b82176517c

124
readme.md

@@ -5,7 +5,17 @@
2. THEORY
3. THE BiGpairSEQ ALGORITHM
4. USAGE
5. PERFORMANCE
1. RUNNING THE PROGRAM
2. COMMAND LINE OPTIONS
3. INTERACTIVE INTERFACE
4. INPUT/OUTPUT
1. Cell Sample Files
2. Sample Plate Files
3. Graph/Data Files
4. Matching Results Files
5. PERFORMANCE (needs revision!)
1. SIMULATING EXPERIMENTS FROM pairSEQ PAPER
2. BEHAVIOR WITH RANDOMIZED WELL POPULATIONS
6. TODO
7. CITATIONS
8. ACKNOWLEDGEMENTS
@@ -72,7 +82,7 @@ The relative time/space efficiencies of BiGpairSEQ when backed by different MWM
2. Pre-filter the sequence data to reduce error and minimize the size of the necessary graph.
1. *Saturating sequence filter*: remove any sequences present in all wells on the sample plate, as there is no signal in the occupancy data of saturating sequences (and each saturating sequence will have an edge to every vertex on the opposite side of the graph, vastly increasing the total graph size).
2. *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. Assuming sequences are read correctly at least half the time, then a sequence's total read count (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which R < (O * D) / 2.
3. *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it skews a sequence's overall occupancy pattern by causing the sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove from the list of wells occupied by a sequence any wells for which r < R / (2 * O).
3. *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it skews a sequence's overall occupancy pattern by causing the sequence to seem to be in a well where it's not. This can be detected by looking for discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove from the list of wells occupied by a sequence any wells for which r < R / (2 * O).
3. Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represents the distinct TCRAs and the other set represents the distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
4. Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
* If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 4 or more wells--that is, edges with weight >= 4.0.) See below for discussion of why this might be desirable.
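
The pre-filtering and graph-encoding steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual implementation: the data layout (`{sequence: {well: read_count}}`), the function names `filter_sequences` and `build_edges`, and the toy read counts are all hypothetical. The thresholds `R < (O * D) / 2` and `r < R / (2 * O)` are taken directly from the filter descriptions.

```python
from itertools import product


def filter_sequences(reads, num_wells, read_depth):
    """Apply the three pre-filters to {sequence: {well: read_count}}.

    Returns {sequence: set_of_wells} for the sequences that survive.
    The input layout is an assumption made for this sketch.
    """
    kept = {}
    for seq, per_well in reads.items():
        if not per_well:
            continue
        occupancy = len(per_well)             # O: number of wells occupied
        total_reads = sum(per_well.values())  # R: total read count
        # Saturating sequence filter: present in every well, so no signal.
        if occupancy == num_wells:
            continue
        # Non-existent sequence filter: remove if R < (O * D) / 2.
        if total_reads < (occupancy * read_depth) / 2:
            continue
        # Misidentified sequence filter: drop wells where r < R / (2 * O).
        threshold = total_reads / (2 * occupancy)
        kept[seq] = {w for w, r in per_well.items() if r >= threshold}
    return kept


def build_edges(alpha_wells, beta_wells, min_weight=1):
    """Edge weight = number of wells shared by a TCRA and a TCRB."""
    edges = {}
    for (a, wa), (b, wb) in product(alpha_wells.items(), beta_wells.items()):
        weight = len(wa & wb)
        if weight >= min_weight:
            edges[(a, b)] = weight
    return edges


# Toy data: 4 wells, read depth 10 (all values hypothetical).
alpha_reads = {
    "A1": {0: 10, 1: 9, 2: 11},
    "A2": {1: 10, 2: 12, 3: 2},             # well 3 looks like a misread
    "A_sat": {0: 10, 1: 10, 2: 10, 3: 10},  # saturating
    "A_ghost": {0: 1, 1: 1},                # too few reads to be real
}
beta_reads = {
    "B1": {0: 10, 1: 10, 2: 10},
    "B2": {1: 9, 2: 11},
}

alphas = filter_sequences(alpha_reads, num_wells=4, read_depth=10)
betas = filter_sequences(beta_reads, num_wells=4, read_depth=10)
edges = build_edges(alphas, betas)
strong_edges = build_edges(alphas, betas, min_weight=3)
```

The `min_weight` parameter mirrors the optional restriction of the matching to edges at or above a chosen overlap. The resulting edge dictionary could then be handed to any maximum weight matching routine.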
@@ -109,11 +119,117 @@ For example, to run the program with 32 gigabytes of memory, use the command:
`java -Xmx32G -jar BiGpairSEQ_Sim.jar`
There are a number of command line options, to allow the program to be used in shell scripts. For a full list,
use the `-help` flag:
### COMMAND LINE OPTIONS
There are a number of command line options, to allow the program to be used in shell scripts. These can be viewed with
the `-help` flag:
`java -jar BiGpairSEQ_Sim.jar -help`
```
usage: BiGpairSEQ_Sim.jar
-cells,--make-cells Makes a cell sample file of distinct T cells
-graph,--make-graph Makes a graph/data file. Requires a cell sample
file and a sample plate file
-help Displays this help menu
-match,--match-cdr3 Matches CDR3s. Requires a graph/data file.
-plate,--make-plate Makes a sample plate file. Requires a cell sample
file.
-version Prints the program version number to stdout
usage: BiGpairSEQ_Sim.jar -cells
-d,--diversity-factor <factor> The factor by which unique CDR3s
outnumber unique CDR1s
-n,--num-cells <number> The number of distinct cells to generate
-o,--output-file <filename> Name of output file
usage: BiGpairSEQ_Sim.jar -plate
-c,--cell-file <filename> The cell sample file to use
-d,--dropout-rate <rate> The sequence dropout rate due to
amplification error. (0.0 - 1.0)
-exponential Use an exponential distribution for cell
sample
-gaussian Use a Gaussian distribution for cell sample
-lambda <value> If using -exponential flag, lambda value
for distribution
-o,--output-file <filename> Name of output file
-poisson Use a Poisson distribution for cell sample
-pop <number [number]...> The well populations for each section of
the sample plate. There will be as many
sections as there are populations given.
-random <min> <max> Randomize well populations on sample plate.
Takes two arguments: the minimum possible
population and the maximum possible
population.
-stddev <value> If using -gaussian flag, standard deviation
for distribution
-w,--wells <number> The number of wells on the sample plate
usage: BiGpairSEQ_Sim.jar -graph
-c,--cell-file <filename> Cell sample file to use for
checking pairing accuracy
-err,--read-error-prob <prob> (Optional) The probability that
a sequence will be misread. (0.0
- 1.0)
-errcoll,--error-collision-prob <prob> (Optional) The probability that
two misreads will produce the
same spurious sequence. (0.0 -
1.0)
-graphml (Optional) Output GraphML file
-nb,--no-binary (Optional) Don't output
serialized binary file
-o,--output-file <filename> Name of output file
-p,--plate-filename <filename> Sample plate file from which to
construct graph
-rd,--read-depth <depth> (Optional) The number of times
to read each sequence.
-realcoll,--real-collision-prob <prob> (Optional) The probability that
a sequence will be misread as
another real sequence. (Only
applies to unique misreads;
after this has happened once,
future error collisions could
produce the real sequence again)
(0.0 - 1.0)
usage: BiGpairSEQ_Sim.jar -match
-g,--graph-file <filename> The graph/data file to use
-max <number> The maximum number of shared wells to
attempt to match a sequence pair
-maxdiff <number> (Optional) The maximum difference in total
occupancy between two sequences to attempt
matching.
-min <number> The minimum number of shared wells to
attempt to match a sequence pair
-minpct <percent> (Optional) The minimum percentage of a
sequence's total occupancy shared by
another sequence to attempt matching. (0 -
100)
-o,--output-file <filename> (Optional) Name of the output file.
If not present, no file will be written.
--print-alphas (Optional) Print the number of distinct
alpha sequences to stdout.
--print-attempt (Optional) Print the pairing attempt rate
to stdout
--print-betas (Optional) Print the number of distinct
beta sequences to stdout.
--print-correct (Optional) Print the number of correct
pairs to stdout
--print-error (Optional) Print the pairing error rate to
stdout
--print-incorrect (Optional) Print the number of incorrect
pairs to stdout
--print-metadata (Optional) Print a full summary of the
matching results to stdout.
--print-time (Optional) Print the total simulation time
to stdout.
-pv,--p-value (Optional) Calculate p-values for sequence
pairs.
```
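
As a rough illustration of how these options might be chained in a script, the Python sketch below assembles one command line per stage (cells, plate, graph, match) using only flags listed in the help output above. The file names, their extensions, and every numeric value are hypothetical placeholders, and the helper name `pipeline_commands` is invented for this example. The commands are only printed here, not executed.

```python
import shlex

# Base invocation, as shown earlier in this readme.
JAR = ["java", "-jar", "BiGpairSEQ_Sim.jar"]


def pipeline_commands(cells="cells.csv", plate="plate.csv",
                      graph="graph.ser", results="results.csv"):
    """Return one argument list per stage. All values are placeholders."""
    return [
        # Stage 1: cell sample file (-d is the diversity factor here).
        JAR + ["-cells", "-n", "10000", "-d", "10", "-o", cells],
        # Stage 2: sample plate file (-d is the dropout rate here).
        JAR + ["-plate", "-c", cells, "-w", "96", "-pop", "1000",
               "-poisson", "-d", "0.1", "-o", plate],
        # Stage 3: graph/data file from the cell and plate files.
        JAR + ["-graph", "-c", cells, "-p", plate, "-o", graph],
        # Stage 4: match CDR3s and print the pairing error rate.
        JAR + ["-match", "-g", graph, "-min", "2", "-max", "95",
               "-o", results, "--print-error"],
    ]


commands = pipeline_commands()
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run a stage, each argument list could be passed to `subprocess.run(cmd, check=True)`. Whether a given combination of options is accepted should be checked against the program's own `-help` output.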
### INTERACTIVE INTERFACE
If no command line arguments are given, BiGpairSEQ_Sim will launch with an interactive, menu-driven CLI for
generating files and simulating TCR pairing. The main menu looks like this: