Update TOC, command line options

2022-10-01 13:59:03 -05:00
parent 0657db5653
commit b82176517c
1 changed files with 120 additions and 4 deletions
--- a/readme.md
+++ b/readme.md
@@ -5,7 +5,17 @@
 2. THEORY
 3. THE BiGpairSEQ ALGORITHM
 4. USAGE
-5. PERFORMANCE
+   1. RUNNING THE PROGRAM
+   2. COMMAND LINE OPTIONS
+   3. INTERACTIVE INTERFACE
+   4. INPUT/OUTPUT
+      1. Cell Sample Files
+      2. Sample Plate Files
+      3. Graph/Data Files
+      4. Matching Results Files
+5. PERFORMANCE (needs revision!)
+   1. SIMULATING EXPERIMENTS FROM pairSEQ PAPER
+   2. BEHAVIOR WITH RANDOMIZED WELL POPULATIONS
 6. TODO
 7. CITATIONS
 8. ACKNOWLEDGEMENTS
@@ -72,7 +82,7 @@ The relative time/space efficiencies of BiGpairSEQ when backed by different MWM
 2. Pre-filter the sequence data to reduce error and minimize the size of the necessary graph.
   1. *Saturating sequence filter*: remove any sequences present in all wells on the sample plate, as there is no signal in the occupancy data of saturating sequences (and each saturating sequence will have an edge to every vertex on the opposite side of the graph, vastly increasing the total graph size).
   2. *Non-existent sequence filter*: sequencing misreads can pollute the data from the sample plate with non-existent sequences. These can be identified by the discrepancy between their occupancy and their total read count. Assuming sequences are read correctly at least half the time, then a sequence's total read count (R) should be at least half the well occupancy of that sequence (O) times the read depth of the sequencing run (D). Remove any sequences for which R < (O * D) / 2.
-   3. *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it skews a sequence's overall occupancy pattern by causing the sequence to seem to be in a well where it is not, in fact, present. This can be detected by looking at discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove from the list of wells occupied by a sequence any wells for which r < R / (2 * O).
+   3. *Misidentified sequence filter*: sequencing misreads can cause one real sequence to be misidentified as a different real sequence. This should be fairly infrequent, but is a problem if it skews a sequence's overall occupancy pattern by causing the sequence to seem to be in a well where it's not. This can be detected by looking for discrepancies in a sequence's per-well read count. On average, the read count for a sequence in an individual well (r) should be equal to its total read count (R) divided by its total well occupancy (O). Remove from the list of wells occupied by a sequence any wells for which r < R / (2 * O).
 3. Encode the occupancy data from the sample plate as a weighted bipartite graph, where one set of vertices represents the distinct TCRAs and the other set represents the distinct TCRBs. Between any TCRA and TCRB that share a well, draw an edge. Assign that edge a weight equal to the total number of wells shared by both sequences.
 4. Find a maximum weight matching of the bipartite graph, using any [MWM algorithm](https://en.wikipedia.org/wiki/Assignment_problem#Algorithms) that produces a provably optimal result.
    * If desired, restrict the matching to a subset of the graph. (Example: restricting matching attempts to cases where the occupancy overlap is 4 or more wells--that is, edges with weight >= 4.0.) See below for discussion of why this might be desirable.
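
The pre-filter rules in step 2 and the edge weight in step 3 can be illustrated with a minimal Java sketch. Everything here is an assumption made for illustration, not the project's actual code: the class and method names are hypothetical, and each sequence is assumed to be represented as a map from well number to the read count observed in that well (only wells where the sequence was seen are present).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of steps 2.1-2.3 and the step 3 edge weight.
// Names and data layout are illustrative assumptions, not BiGpairSEQ_Sim's code.
final class PreFilterSketch {

    // Total read count R: sum of per-well read counts.
    static long totalReads(Map<Integer, Integer> readsByWell) {
        long total = 0;
        for (int r : readsByWell.values()) {
            total += r;
        }
        return total;
    }

    // Saturating sequence filter: the sequence occupies every well on the plate.
    static boolean isSaturating(Map<Integer, Integer> readsByWell, int totalWells) {
        return readsByWell.size() == totalWells;
    }

    // Non-existent sequence filter: R < (O * D) / 2, written as 2R < O * D
    // to stay in integer arithmetic.
    static boolean looksNonExistent(Map<Integer, Integer> readsByWell, int readDepth) {
        long occupancy = readsByWell.size();
        return 2 * totalReads(readsByWell) < occupancy * readDepth;
    }

    // Misidentified sequence filter: drop wells where r < R / (2 * O),
    // written as 2 * O * r < R. Returns a filtered copy.
    static Map<Integer, Integer> dropSuspectWells(Map<Integer, Integer> readsByWell) {
        long total = totalReads(readsByWell);
        long occupancy = readsByWell.size();
        Map<Integer, Integer> kept = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : readsByWell.entrySet()) {
            if (2L * occupancy * e.getValue() >= total) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Edge weight between a TCRA and a TCRB: number of wells they share.
    static int sharedWells(Set<Integer> wellsA, Set<Integer> wellsB) {
        Set<Integer> shared = new HashSet<>(wellsA);
        shared.retainAll(wellsB);
        return shared.size();
    }

    public static void main(String[] args) {
        // Wells 1 and 2 have 10 reads each; well 3 has a single, suspicious read.
        Map<Integer, Integer> reads = Map.of(1, 10, 2, 10, 3, 1);
        System.out.println("saturating on a 96-well plate: " + isSaturating(reads, 96));
        System.out.println("non-existent at read depth 10: " + looksNonExistent(reads, 10));
        System.out.println("wells kept: " + dropSuspectWells(reads).keySet());
        System.out.println("shared wells: " + sharedWells(Set.of(1, 2, 3), Set.of(2, 3, 4)));
    }
}
```

In this example R = 21 and O = 3, so the per-well threshold R / (2 * O) is 3.5 and only the single-read well falls below it.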
@@ -109,11 +119,117 @@ For example, to run the program with 32 gigabytes of memory, use the command:

 `java -Xmx32G -jar BiGpairSEQ_Sim.jar`

-There are a number of command line options, to allow the program to be used in shell scripts. For a full list,
-use the `-help` flag:
+### COMMAND LINE OPTIONS
+
+The program accepts a number of command line options so that it can be used in shell scripts. To list them, use
+the `-help` flag:

 `java -jar BiGpairSEQ_Sim.jar -help`

+```
+usage: BiGpairSEQ_Sim.jar
+ -cells,--make-cells   Makes a cell sample file of distinct T cells
+ -graph,--make-graph   Makes a graph/data file. Requires a cell sample
+                       file and a sample plate file
+ -help                 Displays this help menu
+ -match,--match-cdr3   Matches CDR3s. Requires a graph/data file.
+ -plate,--make-plate   Makes a sample plate file. Requires a cell sample
+                       file.
+ -version              Prints the program version number to stdout
+
+usage: BiGpairSEQ_Sim.jar -cells
+ -d,--diversity-factor <factor>   The factor by which unique CDR3s
+                                  outnumber unique CDR1s
+ -n,--num-cells <number>          The number of distinct cells to generate
+ -o,--output-file <filename>      Name of output file
+
+usage: BiGpairSEQ_Sim.jar -plate
+ -c,--cell-file <filename>     The cell sample file to use
+ -d,--dropout-rate <rate>      The sequence dropout rate due to
+                               amplification error. (0.0 - 1.0)
+ -exponential                  Use an exponential distribution for cell
+                               sample
+ -gaussian                     Use a Gaussian distribution for cell sample
+ -lambda <value>               If using -exponential flag, lambda value
+                               for distribution
+ -o,--output-file <filename>   Name of output file
+ -poisson                      Use a Poisson distribution for cell sample
+ -pop <number [number]...>     The well populations for each section of
+                               the sample plate. There will be as many
+                               sections as there are populations given.
+ -random <min> <max>           Randomize well populations on sample plate.
+                               Takes two arguments: the minimum possible
+                               population and the maximum possible
+                               population.
+ -stddev <value>               If using -gaussian flag, standard deviation
+                               for distribution
+ -w,--wells <number>           The number of wells on the sample plate
+
+usage: BiGpairSEQ_Sim.jar -graph
+ -c,--cell-file <filename>                Cell sample file to use for
+                                          checking pairing accuracy
+ -err,--read-error-prob <prob>            (Optional) The probability that
+                                          a sequence will be misread. (0.0
+                                          - 1.0)
+ -errcoll,--error-collision-prob <prob>   (Optional) The probability that
+                                          two misreads will produce the
+                                          same spurious sequence. (0.0 -
+                                          1.0)
+ -graphml                                 (Optional) Output GraphML file
+ -nb,--no-binary                          (Optional) Don't output
+                                          serialized binary file
+ -o,--output-file <filename>              Name of output file
+ -p,--plate-filename <filename>           Sample plate file from which to
+                                          construct graph
+ -rd,--read-depth <depth>                 (Optional) The number of times
+                                          to read each sequence.
+ -realcoll,--real-collision-prob <prob>   (Optional) The probability that
+                                          a sequence will be misread as
+                                          another real sequence. (Only
+                                          applies to unique misreads;
+                                          after this has happened once,
+                                          future error collisions could
+                                          produce the real sequence again)
+                                          (0.0 - 1.0)
+
+usage: BiGpairSEQ_Sim.jar -match
+ -g,--graph-file <filename>    The graph/data file to use
+ -max <number>                 The maximum number of shared wells to
+                               attempt to match a sequence pair
+ -maxdiff <number>             (Optional) The maximum difference in total
+                               occupancy between two sequences to attempt
+                               matching.
+ -min <number>                 The minimum number of shared wells to
+                               attempt to match a sequence pair
+ -minpct <percent>             (Optional) The minimum percentage of a
+                               sequence's total occupancy shared by
+                               another sequence to attempt matching. (0 -
+                               100)
+ -o,--output-file <filename>   (Optional) Name of the output file.
+                               If not present, no file will be written.
+    --print-alphas             (Optional) Print the number of distinct
+                               alpha sequences to stdout.
+    --print-attempt            (Optional) Print the pairing attempt rate
+                               to stdout
+    --print-betas              (Optional) Print the number of distinct
+                               beta sequences to stdout.
+    --print-correct            (Optional) Print the number of correct
+                               pairs to stdout
+    --print-error              (Optional) Print the pairing error rate to
+                               stdout
+    --print-incorrect          (Optional) Print the number of incorrect
+                               pairs to stdout
+    --print-metadata           (Optional) Print a full summary of the
+                               matching results to stdout.
+    --print-time               (Optional) Print the total simulation time
+                               to stdout.
+ -pv,--p-value                 (Optional) Calculate p-values for sequence
+                               pairs.
+
+```
+
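
The four modes in the help text form a chain: `-cells` produces a cell sample file, `-plate` consumes it, `-graph` consumes both, and `-match` consumes the graph/data file. The Java sketch below assembles one such chain using only flags listed above. The class name, file names, file extensions, and numeric values are all hypothetical, and it is an assumption that `-plate` needs exactly one distribution flag. By default the sketch only prints the commands; passing `--run` executes them, assuming `BiGpairSEQ_Sim.jar` is in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver: builds the cells -> plate -> graph -> match chain.
// File names and parameter values are placeholders, not recommended settings.
final class PipelineSketch {

    static final String JAR = "BiGpairSEQ_Sim.jar";

    // Prefix the given arguments with "java -jar BiGpairSEQ_Sim.jar".
    static List<String> command(String... args) {
        List<String> cmd = new ArrayList<>(List.of("java", "-jar", JAR));
        cmd.addAll(List.of(args));
        return cmd;
    }

    // One command per stage, each reading the previous stage's output file.
    static List<List<String>> pipeline() {
        return List.of(
            command("-cells", "-n", "1000", "-d", "10", "-o", "cells.csv"),
            command("-plate", "-c", "cells.csv", "-w", "96", "-pop", "500",
                    "-poisson", "-d", "0.1", "-o", "plate.csv"),
            command("-graph", "-c", "cells.csv", "-p", "plate.csv", "-o", "graph.ser"),
            command("-match", "-g", "graph.ser", "-min", "4", "-max", "95",
                    "-o", "matches.csv"));
    }

    public static void main(String[] args) throws Exception {
        boolean execute = args.length > 0 && args[0].equals("--run");
        for (List<String> cmd : pipeline()) {
            System.out.println(String.join(" ", cmd));
            if (execute) {
                // Run the stage and stop the chain if it reports failure.
                int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
                if (exit != 0) {
                    throw new IllegalStateException("Stage failed: " + cmd);
                }
            }
        }
    }
}
```

The printed lines can also be pasted into a shell script unchanged, which is the use case the options were added for.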
+### INTERACTIVE INTERFACE
+
 If no command line arguments are given, BiGpairSEQ_Sim will launch with an interactive, menu-driven CLI for 
 generating files and simulating TCR pairing. The main menu looks like this: