Update Readme

2022-02-26 11:00:18 -06:00
parent 8935407ade
commit c2db4f87c1


@@ -66,6 +66,18 @@ Please select an option:
0) Exit 0) Exit
``` ```
+By default, the Options menu looks like this:
+```
+--------------OPTIONS---------------
+1) Turn on cell sample file caching
+2) Turn on plate file caching
+3) Turn on graph/data file caching
+4) Turn off serialized binary graph output
+5) Turn on GraphML graph output
+6) Maximum weight matching algorithm options
+0) Return to main menu
+```
### INPUT/OUTPUT
To run the simulation, the program reads and writes 4 kinds of files:
@@ -75,21 +87,24 @@ To run the simulation, the program reads and writes 4 kinds of files:
* Matching Results files in CSV format
These files are often generated in sequence. When entering filenames, it is not necessary to include the file extension
-(.csv or .ser). When reading or writing files, the program will automatically add the correct extension to any filename without one.
+(.csv or .ser). When reading or writing files, the program will automatically add the correct extension to any filename
+without one.
To save file I/O time, the most recent instance of each of these four
-files either generated or read from disk can be cached in program memory. This could be important for Graph/Data files,
-which can be several gigabytes in size. Since some simulations may require running multiple,
-differently-configured BiGpairSEQ matchings on the same graph, keeping the most recent graph cached may reduce execution time.
-(The manipulation necessary to re-use a graph incurs its own performance overhead, though, which may scale with graph
-size faster than file I/O does. If so, caching is best for smaller graphs.)
-When caching is active, subsequent uses of the same data file won't need to be read in again until another file of that type is used or generated,
+files either generated or read from disk can be cached in program memory. When caching is active, subsequent uses of the
+same data file won't need to be read in again until another file of that type is used or generated,
or caching is turned off for that file type. The program checks whether it needs to update its cached data by comparing
filenames as entered by the user. On encountering a new filename, the program flushes its cache and reads in the new file.
+(Note that cached Graph/Data files must be transformed back into their original state after a matching experiment, which
+may take some time. Whether file I/O or graph transformation takes longer for graph/data files is likely to be
+device-specific.)
The program's caching behavior can be controlled in the Options menu. By default, all caching is OFF.
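The filename-based cache check described above can be sketched as follows. This is a minimal illustration in Python, not BiGpairSEQ_Sim's actual code: the class name `SingleFileCache`, the `enabled` flag, and the `loader` callback are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Optional


class SingleFileCache:
    """Holds at most one file of a given type, keyed by the filename as entered."""

    def __init__(self, loader: Callable[[str], Any], enabled: bool = False):
        self.loader = loader      # hypothetical function that reads a file from disk
        self.enabled = enabled    # caching is OFF by default, as in the Options menu
        self._filename: Optional[str] = None
        self._data: Any = None

    def get(self, filename: str) -> Any:
        if not self.enabled:
            # Caching is off: always read from disk.
            return self.loader(filename)
        if filename != self._filename:
            # New filename: flush the old entry and read the new file.
            self._data = self.loader(filename)
            self._filename = filename
        return self._data

    def set_enabled(self, enabled: bool) -> None:
        # Turning caching off also drops whatever was cached.
        self.enabled = enabled
        if not enabled:
            self._filename = None
            self._data = None
```

Because the comparison uses the name exactly as entered, this sketch would treat `plate` and `plate.csv` as different keys. Whether the program normalizes extensions before comparing is not stated here.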
+The program can optionally output Graph/Data files in GraphML format (.graphml) for data portability. This can be
+turned on in the Options menu. By default, GraphML output is OFF.
#### Cell Sample Files
Cell Sample files consist of any number of distinct "T cells." Every cell contains
four sequences: Alpha CDR3, Beta CDR3, Alpha CDR1, Beta CDR1. The sequences are represented by
@@ -181,14 +196,19 @@ Options for creating a Graph/Data file:
* The Cell Sample file to use
* The Sample Plate file to use. (This must have been generated from the selected Cell Sample file.)
-These files do not have a human-readable structure, and are not portable to other programs. (Export of graphs in a
-portable data format may be implemented in the future. The tricky part is encoding the necessary metadata.)
+These files do not have a human-readable structure, and are not portable to other programs.
+(For portability to other software, turn on GraphML output in the Options menu. This will produce a .graphml file
+for the weighted graph, with vertex attributes for sequence, type, and occupancy data.)
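As a rough illustration of what GraphML portability allows, the sketch below reads per-vertex attributes from a GraphML document using only the Python standard library. It assumes the file follows the standard GraphML layout (`<key>` declarations plus `<data>` children on each `<node>`). The function name `node_attributes` is hypothetical, and the exact attribute names BiGpairSEQ_Sim writes are not confirmed here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Standard GraphML XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def node_attributes(graphml_text: str) -> Dict[str, Dict[str, str]]:
    """Return {node id: {attribute name: value}} for every vertex."""
    root = ET.fromstring(graphml_text)
    # Map each node-level <key id=...> to its human-readable attr.name.
    names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", NS)
        if key.get("for") == "node"
    }
    result: Dict[str, Dict[str, str]] = {}
    for node in root.findall("g:graph/g:node", NS):
        result[node.get("id")] = {
            # Fall back to the raw key id if no matching <key> was declared.
            names.get(data.get("key"), data.get("key")): (data.text or "")
            for data in node.findall("g:data", NS)
        }
    return result
```

For a file on disk, the text could be passed in with something like `node_attributes(open("graph.graphml").read())`, where `graph.graphml` is a placeholder filename.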
--- ---
#### Matching Results Files
-Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a Graph and
-Data file. Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
+Matching results files consist of the results of a BiGpairSEQ matching simulation. Making them requires a serialized
+binary Graph/Data file (.ser). (Because .graphml files are larger than .ser files, BiGpairSEQ_Sim supports .graphml
+output only. Graph/data input must use a serialized binary.)
+Matching results files are in CSV format. Rows are sequence pairings with extra relevant data. Columns are pairing-specific details.
Metadata about the matching simulation is included as comments. Comments are preceded by `#`.
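Since the results files mix CSV rows with `#` comment lines, a downstream script needs to separate the two before parsing. The sketch below shows one way to do that in Python. The function name `read_results` is hypothetical, and it assumes that no CSV field contains an embedded newline followed by `#`.

```python
import csv
import io
from typing import List, Tuple


def read_results(text: str) -> Tuple[List[str], List[List[str]]]:
    """Split a results file into '#' comment lines and parsed CSV rows."""
    comments: List[str] = []
    data_lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Metadata comment: keep the text after the '#'.
            comments.append(line[1:].strip())
        elif line.strip():
            # Non-empty, non-comment line: treat as CSV data.
            data_lines.append(line)
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    return comments, rows
```

The first element of `rows` would be the header row if the file has one. The comment strings are returned unparsed, since the metadata layout is not described in this section.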
Options when running a BiGpairSEQ simulation of CDR3 alpha/beta matching:
@@ -258,25 +278,26 @@ slightly less time than the simulation itself. Real elapsed time from start to f
* ~~*No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*~~
* Might have figured out a way to do it, by taking edges out and then putting them back into the graph. This may actually be possible.
* It is possible, though the modifications to the graph incur their own performance penalties. Need testing to see which option is best.
-* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
-* ~~Problem is variable number of cells in a well~~
-* ~~Apache Commons CSV library writes entries a row at a time~~
-* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Tidyverse transposes CSV matrices in R, might not even be worth fixing._
-* Re-implement command line arguments, to enable scripting and statistical simulation studies
-* ~~Implement sample plates with random numbers of T cells per well.~~ DONE
-* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not.
-* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
-* Enable GraphML output in addition to serialized object binaries, for data portability
-* Custom vertex type with attribute for sequence occupancy?
-* Re-implement CDR1 matching method
-* Implement Duan and Su's maximum weight matching algorithm
-* Add controllable algorithm-type parameter?
* ~~Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm~~ DONE
* ~~in theory Fibonacci heap should be more efficient, but complexity overhead may eliminate theoretical advantage~~
* ~~Add controllable heap-type parameter?~~
-* Parameter implemented. For large graphs, Fibonacci heap wins. Now the new default.
+* Parameter implemented. Fibonacci heap is the current default.
+* ~~Implement sample plates with random numbers of T cells per well.~~ DONE
+* Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not.
+* preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
+* See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.
+* ~~Problem is variable number of cells in a well~~
+* ~~Apache Commons CSV library writes entries a row at a time~~
+* _Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily Tidyverse transposes CSV matrices in R, might not even be worth fixing._
+* ~~Enable GraphML output in addition to serialized object binaries, for data portability~~ DONE
+* ~~Custom vertex type with attribute for sequence occupancy?~~ ABANDONED
+* Have a branch where this is implemented, but there's a bug that broke matching. Don't currently have time to fix.
+* Re-implement command line arguments, to enable scripting and statistical simulation studies
+* Re-implement CDR1 matching method
+* Implement Duan and Su's maximum weight matching algorithm
+* Add controllable algorithm-type parameter?
+* This would be fun and valuable, but probably take more time than I have for a hobby project.
## CITATIONS
* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)