update readme

2022-02-23 13:22:04 -06:00
parent 17ae763c6c
commit 4bcda9b66c
1 changed files with 19 additions and 13 deletions
--- a/readme.md
+++ b/readme.md
@@ -12,7 +12,7 @@ Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and c
 against a null distribution, BiGpairSEQ does not do any statistical calculations
 directly.

-BiGpairSEQ creates a [simple bipartite weighted graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
+BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
 The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well
 is connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
 (Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
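
The graph construction described in this hunk can be sketched roughly as follows. This is only an illustration: the class `OccupancyGraphSketch`, its `Well` type, and the nested-map representation are assumptions made for the example, not BiGpairSEQ's actual code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and data layout are assumptions, not the project's API.
class OccupancyGraphSketch {

    /** One well: the distinct TCRA and TCRB sequences observed in it. */
    static final class Well {
        final Set<String> alphas;
        final Set<String> betas;

        Well(Set<String> alphas, Set<String> betas) {
            this.alphas = alphas;
            this.betas = betas;
        }
    }

    /** Counts, for each sequence of one chain, the number of wells that contain it. */
    static Map<String, Integer> occupancy(List<Well> wells, boolean alpha) {
        Map<String, Integer> counts = new HashMap<>();
        for (Well w : wells) {
            for (String s : alpha ? w.alphas : w.betas) {
                counts.merge(s, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Edge weights keyed by alpha, then beta: the number of wells the pair shares. */
    static Map<String, Map<String, Integer>> buildEdgeWeights(List<Well> wells) {
        int totalWells = wells.size();
        Map<String, Integer> alphaCounts = occupancy(wells, true);
        Map<String, Integer> betaCounts = occupancy(wells, false);
        Map<String, Map<String, Integer>> weights = new HashMap<>();
        for (Well w : wells) {
            for (String a : w.alphas) {
                // A sequence present in all wells carries no signal, so skip it.
                if (alphaCounts.get(a) == totalWells) continue;
                for (String b : w.betas) {
                    if (betaCounts.get(b) == totalWells) continue;
                    weights.computeIfAbsent(a, k -> new HashMap<>())
                           .merge(b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }
}
```

Each well contributes 1 to the weight of every alpha/beta pair it contains, so the final weight equals the number of shared wells.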
@@ -69,14 +69,24 @@ Please select an option:
 0) Exit
 ```

-### OUTPUT
+### INPUT/OUTPUT

 To run the simulation, the program reads and writes 4 kinds of files:
 * Cell Sample files in CSV format
 * Sample Plate files in CSV format
-* Graph and Data files in binary object serialization format
+* Graph/Data files in binary object serialization format
 * Matching Results files in CSV format

+These files are often generated in sequence. To save file I/O time, the most recent instance of each of these four
+files either generated or read from disk is cached in program memory. This is especially important for Graph/Data files,
+which can be several gigabytes in size. Since some simulations may require running multiple,
+differently configured BiGpairSEQ matchings on the same graph, keeping the most recent graph cached drastically reduces
+execution time. 
+
+A data file that is already cached does not need to be read in again until another file of that type is used or generated.
+The program checks whether it needs to update its cached data by comparing filenames as entered by the user. On 
+encountering a new filename, the program flushes its cache and reads in the new file.
+
 When entering filenames, it is not necessary to include the file extension (.csv or .ser). When reading or
 writing files, the program will automatically add the correct extension to any filename without one. 
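
The caching behavior added in this hunk could look something like the sketch below: one single-slot cache per file type, keyed by filename. The class `FileCacheSketch` and its methods are hypothetical. Comparing names after the extension is appended (so that `plate` and `plate.csv` hit the same entry) is an assumption of this sketch; the README only says filenames are compared as entered by the user.

```java
import java.util.function.Function;

// Hypothetical sketch of a one-entry cache for a single file type.
class FileCacheSketch<T> {
    private final String extension;           // e.g. ".csv" or ".ser"
    private final Function<String, T> loader; // reads and parses a file from disk
    private String cachedName;                // filename of the cached instance, or null
    private T cachedData;

    FileCacheSketch(String extension, Function<String, T> loader) {
        this.extension = extension;
        this.loader = loader;
    }

    /** Appends the expected extension unless the filename already ends with it. */
    String withExtension(String filename) {
        return filename.endsWith(extension) ? filename : filename + extension;
    }

    /** Returns the cached data on a filename match; otherwise flushes and reloads. */
    T get(String filename) {
        String name = withExtension(filename);
        if (!name.equals(cachedName)) {
            // Flush first so the old (possibly multi-gigabyte) object can be collected.
            cachedName = null;
            cachedData = null;
            cachedData = loader.apply(name);
            cachedName = name;
        }
        return cachedData;
    }

    /** Caches freshly generated data so it need not be read back from disk. */
    void put(String filename, T data) {
        cachedName = withExtension(filename);
        cachedData = data;
    }
}
```

With this shape, running several matchings against the same Graph/Data filename triggers only one disk read; asking for a different filename replaces the cached entry.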

@@ -121,7 +131,7 @@ Options when making a Sample Plate file:
    * Standard deviation size 
  * Exponential
    * Lambda value
-      * (Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))
+      * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))*
 * Total number of wells on the plate
 * Number of sections on plate
 * Number of T cells per well
@@ -155,8 +165,8 @@ Structure:

 ---

-#### Graph and Data Files
-Graph and Data files are serialized binaries of a Java object containing the weigthed bipartite graph representation of a
+#### Graph/Data Files
+Graph/Data files are serialized binaries of a Java object containing the weighted bipartite graph representation of a
 Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file 
 (to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a 
 Sample Plate file (to construct the associated occupancy graph).
@@ -164,7 +174,7 @@ Sample Plate file (to construct the associated occupancy graph).
 These files can be several gigabytes in size. Writing them to a file lets us generate a graph and its metadata once,
 then use it for multiple different BiGpairSEQ simulations.

-Options for creating a Graph and Data file:
+Options for creating a Graph/Data file:
 * The Cell Sample file to use
 * The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
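
For readers unfamiliar with Java object serialization, a minimal round trip for a Graph/Data-style object might look like this. The class `GraphDataSketch` and all of its fields are invented for illustration; the real object's layout is not shown in this diff.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

// Hypothetical stand-in for the serialized Graph/Data object.
class GraphDataSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Assumed fields: edge weights (alpha -> beta -> shared wells), the correct
    // pairs taken from the Cell Sample file, and one piece of plate metadata.
    final HashMap<String, HashMap<String, Integer>> edgeWeights;
    final HashMap<String, String> correctPairs;
    final int numWells;

    GraphDataSketch(HashMap<String, HashMap<String, Integer>> edgeWeights,
                    HashMap<String, String> correctPairs, int numWells) {
        this.edgeWeights = edgeWeights;
        this.correctPairs = correctPairs;
        this.numWells = numWells;
    }

    /** Writes the whole object graph to a .ser file. */
    static void write(GraphDataSketch data, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeObject(data);
        }
    }

    /** Reads the object back; fails if the class definition has changed incompatibly. */
    static GraphDataSketch read(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return (GraphDataSketch) in.readObject();
        }
    }
}
```

Buffered streams matter here because files of this kind can reach several gigabytes.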

@@ -175,11 +185,7 @@ portable data format may be implemented in the future. The tricky part is encodi

 #### Matching Results Files 
 Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and 
-Data file. To save file I/O time, the data from the most recent Graph and Data file read or generated is cached
-by the simulator. Subsequent BiGpairSEQ simulations run with the same input filename will use the cached version
-rather than reading in again from disk.
-
-Files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
+Data file. Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
 Metadata about the matching simulation is included as comments. Comments are preceded by `#`.

 Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: