update readme

2022-02-23 13:22:04 -06:00
parent 17ae763c6c
commit 4bcda9b66c

@@ -12,7 +12,7 @@ Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and c
against a null distribution, BiGpairSEQ does not do any statistical calculations against a null distribution, BiGpairSEQ does not do any statistical calculations
directly. directly.
BiGpairSEQ creates a [simple bipartite weighted graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate. BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate.
The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well
is connected by an edge, with the edge weight set to the number of wells in which both sequences appear. is connected by an edge, with the edge weight set to the number of wells in which both sequences appear.
(Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.) (Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.)
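
The graph construction described above can be sketched in a few lines of Python. This is only an illustration, not the program's own code (BiGpairSEQ itself is a Java program); the function name `build_occupancy_graph` and the input layout (one `(alphas, betas)` pair of sets per well) are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def build_occupancy_graph(wells):
    """Return edge weights {(alpha, beta): shared_well_count}.

    `wells` is a list of (alphas, betas) tuples, where each element is the
    set of TCRA or TCRB sequences observed in one well (hypothetical layout).
    """
    n_wells = len(wells)

    # Count how many wells each sequence occupies.
    alpha_occupancy = Counter(a for alphas, _ in wells for a in alphas)
    beta_occupancy = Counter(b for _, betas in wells for b in betas)

    # Sequences present in every well carry no signal, so drop them.
    saturated_alphas = {a for a, n in alpha_occupancy.items() if n == n_wells}
    saturated_betas = {b for b, n in beta_occupancy.items() if n == n_wells}

    # Each well adds 1 to the weight of every alpha/beta pair it contains.
    edge_weights = Counter()
    for alphas, betas in wells:
        kept_alphas = alphas - saturated_alphas
        kept_betas = betas - saturated_betas
        for a, b in product(kept_alphas, kept_betas):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)
```

An edge exists only for pairs that share at least one well, so the dictionary keys are exactly the edges of the bipartite graph.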
@@ -69,16 +69,26 @@ Please select an option:
0) Exit 0) Exit
``` ```
### OUTPUT ### INPUT/OUTPUT
To run the simulation, the program reads and writes 4 kinds of files: To run the simulation, the program reads and writes 4 kinds of files:
* Cell Sample files in CSV format * Cell Sample files in CSV format
* Sample Plate files in CSV format * Sample Plate files in CSV format
* Graph and Data files in binary object serialization format * Graph/Data files in binary object serialization format
* Matching Results files in CSV format * Matching Results files in CSV format
These files are often generated in sequence. To save file I/O time, the most recent instance of each of these four
files either generated or read from disk is cached in program memory. This is especially important for Graph/Data files,
which can be several gigabytes in size. Since some simulations may require running multiple,
differently configured BiGpairSEQ matchings on the same graph, keeping the most recent graph cached drastically reduces
execution time.
Once a data file is cached, it won't need to be read in again until another file of that type is used or generated.
The program checks whether it needs to update its cached data by comparing filenames as entered by the user. On
encountering a new filename, the program flushes its cache and reads in the new file.
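
The filename-keyed caching described above might look roughly like the following Python sketch. It is not the program's implementation; the class name `FileCache`, the `loader` callable, and the `reads` counter are hypothetical names introduced for illustration. One such cache would be kept per file type.

```python
class FileCache:
    """Keep only the most recently used file of one type in memory."""

    def __init__(self, loader):
        self._loader = loader    # function that reads a file from disk
        self._filename = None    # name of the currently cached file
        self._data = None
        self.reads = 0           # number of actual disk reads (for illustration)

    def get(self, filename):
        # A new filename flushes the cache and triggers a read.
        if filename != self._filename:
            self._data = self._loader(filename)
            self._filename = filename
            self.reads += 1
        return self._data

    def put(self, filename, data):
        # A freshly generated file replaces whatever was cached.
        self._filename = filename
        self._data = data
```

Because the check compares filenames only, a file that changes on disk under the same name would not be re-read in this sketch.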
When entering filenames, it is not necessary to include the file extension (.csv or .ser). When reading or When entering filenames, it is not necessary to include the file extension (.csv or .ser). When reading or
writing files, the program will automatically add the correct extension to any filename without one. writing files, the program will automatically add the correct extension to any filename without one.
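
As a rough illustration of the extension handling, here is a small Python sketch. The helper name `with_extension` is hypothetical, and the sketch assumes that "without one" means the name does not already end in the expected extension; the program's actual rule may differ.

```python
def with_extension(filename, extension):
    """Append `extension` (e.g. ".csv" or ".ser") unless it is already there."""
    if filename.lower().endswith(extension.lower()):
        return filename
    return filename + extension
```

For example, `with_extension("plate1", ".csv")` yields `plate1.csv`, while `with_extension("plate1.csv", ".csv")` returns the name unchanged.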
#### Cell Sample Files #### Cell Sample Files
Cell Sample files consist of any number of distinct "T cells." Every cell contains Cell Sample files consist of any number of distinct "T cells." Every cell contains
@@ -121,7 +131,7 @@ Options when making a Sample Plate file:
* Standard deviation size * Standard deviation size
* Exponential * Exponential
* Lambda value * Lambda value
* (Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015)) * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))*
* Total number of wells on the plate * Total number of wells on the plate
* Number of sections on plate * Number of sections on plate
* Number of T cells per well * Number of T cells per well
@@ -155,8 +165,8 @@ Structure:
--- ---
#### Graph and Data Files #### Graph/Data Files
Graph and Data files are serialized binaries of a Java object containing the weighted bipartite graph representation of a Graph/Data files are serialized binaries of a Java object containing the weighted bipartite graph representation of a
Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file
(to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a (to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a
Sample Plate file (to construct the associated occupancy graph). Sample Plate file (to construct the associated occupancy graph).
@@ -164,7 +174,7 @@ Sample Plate file (to construct the associated occupancy graph).
These files can be several gigabytes in size. Writing them to a file lets us generate a graph and its metadata once, These files can be several gigabytes in size. Writing them to a file lets us generate a graph and its metadata once,
then use it for multiple different BiGpairSEQ simulations. then use it for multiple different BiGpairSEQ simulations.
Options for creating a Graph and Data file: Options for creating a Graph/Data file:
* The Cell Sample file to use * The Cell Sample file to use
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.) * The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
@@ -175,11 +185,7 @@ portable data format may be implemented in the future. The tricky part is encodi
#### Matching Results Files #### Matching Results Files
Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and
Data file. To save file I/O time, the data from the most recent Graph and Data file read or generated is cached Data file. Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
by the simulator. Subsequent BiGpairSEQ simulations run with the same input filename will use the cached version
rather than reading in again from disk.
Files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
Metadata about the matching simulation is included as comments. Comments are preceded by `#`. Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
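
Since the metadata lines start with `#`, a downstream script has to separate them from the data rows before parsing. A minimal Python sketch of one way to do that follows; the function name `read_matching_results` is hypothetical, and no particular column names are assumed.

```python
import csv


def read_matching_results(lines):
    """Split a Matching Results file into metadata comments and CSV rows.

    `lines` is any iterable of text lines, such as an open file object.
    """
    metadata = []
    data_lines = []
    for line in lines:
        if line.startswith("#"):
            # Comment line: keep the text after the '#' as metadata.
            metadata.append(line[1:].strip())
        elif line.strip():
            # Non-empty, non-comment line: treat as CSV data.
            data_lines.append(line)
    rows = list(csv.reader(data_lines))
    return metadata, rows
```

With a real file this could be called as `read_matching_results(open(path, newline=""))`, where `path` points to a Matching Results file.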
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: