diff --git a/readme.md b/readme.md index 907bd6b..d6e5db4 100644 --- a/readme.md +++ b/readme.md @@ -12,7 +12,7 @@ Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and c against a null distribution, BiGpairSEQ does not do any statistical calculations directly. -BiGpairSEQ creates a [simple bipartite weighted graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate. +BiGpairSEQ creates a [weighted bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph) representing the sample plate. The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that share a well are connected by an edge, with the edge weight set to the number of wells in which both sequences appear. (Sequences present in *all* wells are filtered out prior to creating the graph, as there is no signal in their occupancy pattern.) @@ -69,16 +69,26 @@ Please select an option: 0) Exit ``` -### OUTPUT +### INPUT/OUTPUT To run the simulation, the program reads and writes 4 kinds of files: * Cell Sample files in CSV format * Sample Plate files in CSV format -* Graph and Data files in binary object serialization format +* Graph/Data files in binary object serialization format * Matching Results files in CSV format +These files are often generated in sequence. To save file I/O time, the most recent instance of each of these four +files either generated or read from disk is cached in program memory. This is especially important for Graph/Data files, +which can be several gigabytes in size. Since some simulations may require running multiple, +differently-configured BiGpairSEQ matchings on the same graph, keeping the most recent graph cached drastically reduces +execution time. + +On subsequent uses, the same data file won't need to be read in again until another file of that type is used or generated. +The program checks whether it needs to update its cached data by comparing filenames as entered by the user. 
On +encountering a new filename, the program flushes its cache and reads in the new file. + When entering filenames, it is not necessary to include the file extension (.csv or .ser). When reading or -writing files, the program will automatically add the correct extension to any filename without one. +writing files, the program will automatically add the correct extension to any filename without one. #### Cell Sample Files Cell Sample files consist of any number of distinct "T cells." Every cell contains @@ -121,7 +131,7 @@ Options when making a Sample Plate file: * Standard deviation size * Exponential * Lambda value - * (Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015)) + * *(Based on the slope of the graph in Figure 4C of the pairSEQ paper, the distribution of the original experiment was exponential with a lambda of approximately 0.6. (Howie, et al. 2015))* * Total number of wells on the plate * Number of sections on plate * Number of T cells per well @@ -155,8 +165,8 @@ Structure: --- -#### Graph and Data Files -Graph and Data files are serialized binaries of a Java object containing the weigthed bipartite graph representation of a +#### Graph/Data Files +Graph/Data files are serialized binaries of a Java object containing the weighted bipartite graph representation of a Sample Plate, along with the necessary metadata for matching and results output. Making them requires a Cell Sample file (to construct a list of correct sequence pairs for checking the accuracy of BiGpairSEQ simulations) and a Sample Plate file (to construct the associated occupancy graph). @@ -164,7 +174,7 @@ Sample Plate file (to construct the associated occupancy graph). These files can be several gigabytes in size. Writing them to a file lets us generate a graph and its metadata once, then use it for multiple different BiGpairSEQ simulations. 
-Options for creating a Graph and Data file: +Options for creating a Graph/Data file: * The Cell Sample file to use * The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.) @@ -175,11 +185,7 @@ portable data format may be implemented in the future. The tricky part is encodi #### Matching Results Files Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and -Data file. To save file I/O time, the data from the most recent Graph and Data file read or generated is cached -by the simulator. Subsequent BiGpairSEQ simulations run with the same input filename will use the cached version -rather than reading in again from disk. - -Files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details. +Data file. Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details. Metadata about the matching simulation is included as comments. Comments are preceded by `#`. Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching: