Work+092712

Continued working on the compareAllSequencesInFile1WithFile2 method.

Made 2 artificial comparison files: "S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-26-12\test_comparison1.txt" and "S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-26-12\test_comparison2.txt"

Added this line from the 1st one to the 2nd one: gb|EGH42662.1| RNA methyltransferase TrmH, group 2 [Pseudomonas ... 18.0 19006

Alright the compareAllSequencesInFile1WithFile2 method seems to be working now.

Now I just need to make sure that the blast result from each motif group gets compared with the others. Then I can see how many matches each input has.

I would also like to get my HEE sequence to match something from the database

Called this command: blastp -db S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-26-12_database\nr -query x -word_size 2 -seg no -evalue 200000000000000 -comp_based_stats no -matrix pam30 -threshold 4 -num_descriptions 20000 -num_alignments 0 -out test_for_hee.txt

x contains the sequence HEEX

Searching this way did yield a list of matches and some of them did contain SMC related matches.

Searching this way for PMRE also yielded a list of matches, some of them SMC related.

These test search results can be found here: C:\kurt\storage\CIM Research Folder\DR\2012\9-27-12\test_blast

When I search for matches lower than the evalue, I don't think numbers written in scientific notation, like 3e-2, are getting counted, so I'll need to fix this in the need_to_determine_number_of_matches section.
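A minimal sketch of counting e-values below a cutoff, assuming the problem is that values like 3e-2 are matched as text or with a digits-only pattern (the actual counting code is not shown in these notes). The class and method names are hypothetical. Double.parseDouble already accepts scientific notation, so parsing first and comparing as numbers avoids the issue. The leading-digit fix is only defensive, in case a value is ever printed as "e-102".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the author's need_to_determine_number_of_matches code.
class EValueCounter {

    // Parses an e-value written as "3e-2", "0.03" or "19006".
    static double parseEValue(String text) {
        String t = text.trim();
        // Defensive: tolerate a missing leading digit ("e-102" becomes "1e-102").
        if (t.startsWith("e") || t.startsWith("E")) {
            t = "1" + t;
        }
        return Double.parseDouble(t);
    }

    // Counts how many e-values are strictly below the cutoff.
    static int countBelow(List<String> eValues, double cutoff) {
        int count = 0;
        for (String s : eValues) {
            if (parseEValue(s) < cutoff) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample values for illustration only.
        List<String> sample = Arrays.asList("3e-2", "0.5", "19006", "2e-10");
        System.out.println(countBelow(sample, 1.0));
    }
}
```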

The match counting code appears to work. Now I can just add this match information to the table as well.

Now I would like to rank the items in a manner so that the items with the highest matches and lowest e-values are ranked the highest. Actually, I can basically do this simply by sorting the numbers in excel.
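If the ranking ever needs to happen inside the program instead of in Excel, it could be sketched as below. This assumes "highest matches and lowest e-values" means sorting by match count descending and breaking ties by e-value ascending, which is only one reading. The Row class and its fields are hypothetical stand-ins for a table row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the ranking step; names are illustrative.
class HitRanking {

    // One row of the results table (assumed shape).
    static class Row {
        final String name;
        final int matches;
        final double eValue;

        Row(String name, int matches, double eValue) {
            this.name = name;
            this.matches = matches;
            this.eValue = eValue;
        }
    }

    // Returns a sorted copy: most matches first, then lowest e-value first.
    static List<Row> rank(List<Row> rows) {
        List<Row> sorted = new ArrayList<Row>(rows);
        Collections.sort(sorted, new Comparator<Row>() {
            public int compare(Row a, Row b) {
                int byMatches = Integer.compare(b.matches, a.matches);
                if (byMatches != 0) {
                    return byMatches;
                }
                return Double.compare(a.eValue, b.eValue);
            }
        });
        return sorted;
    }
}
```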

The only other features that I wanted to add to my program involve better logging, and bepipred functionality. I don't think either of these things will be terribly difficult to implement.

Finished writing logging functions.

Started writing bepipred_handler class.

When I tried to use java to ssh I got the following message: Pseudo-terminal will not be allocated because stdin is not a terminal.

ssh for java: http://stackoverflow.com/questions/2514439/how-to-run-ssh-commands-on-remote-system-through-java-program

I'm getting the following error: cannot make a static reference to the non-static method exec(String) from the type Runtime
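The compiler error comes from calling Runtime.exec(...) on the class; exec is an instance method, so it has to be called as Runtime.getRuntime().exec(...). A minimal sketch of an alternative using ProcessBuilder is below. The class name, the host string and the remote command are hypothetical, and it assumes ssh is on the PATH and that login works without a password prompt (for example with keys). Passing a remote command together with ssh's -T option (disable pseudo-terminal allocation) should also avoid the pseudo-terminal message, though I have not confirmed that against this setup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the bepipred_handler class itself.
class RemoteCommand {

    // Builds the argument list for a non-interactive ssh call.
    // -T tells ssh not to allocate a pseudo-terminal.
    static List<String> sshCommand(String userAtHost, String remoteCommand) {
        return Arrays.asList("ssh", "-T", userAtHost, remoteCommand);
    }

    // Runs a command and returns its combined stdout and stderr as text.
    // Equivalent start with Runtime: Runtime.getRuntime().exec(String[]).
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        StringBuilder output = new StringBuilder();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        process.waitFor();
        return output.toString();
    }
}
```

Usage would look like RemoteCommand.run(RemoteCommand.sshCommand("user@host", "ls")), with a real account and command substituted.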

I'll forget the bepipred stuff for now.

Now I'll try to run the program from scratch from here: S:\Research\Cancer_Eradication\Users\kwhittem\DR\2012\9-28-12\mpa

I should modify the log file for the comparison of the blast results so that it states which two are being compared out of how many. I added this feature.

For some reason the FSA files are not being created properly for some sequences. I suspect this is a sequence and regex problem.

What is the command to see how many files are in a directory? ls -1 | wc -l

Now I'll look into the fsa file issue. Here's a file that was not found: blast_res_motif_group_0_3i_b_blast_res_motif_group_1_15980i.txt

blast_res_motif_group_0_3i.fsa was created

The file for blast_res_motif_group_1_15980i was not made. Instead, this file was made: console_blast_res_motif_group_1_15980i.fsa.txt

The text in this file shows that the blast did not work (the blast help commands are listed and everything). There is also this message: Error: Too many positional arguments (1), the offending value: Chain

Now I need to find out which line this was in the original blast result document. I'm not sure which line in the blast result document 15980 refers to. I would think it would refer to either line 15980 or 15980 + 31 (header part of document) = 16011, plus or minus 1 for each of these possibilities. None of these entries contain the word "Chain" though. Line 16021 does contain the word "Chain". Why would it be 10 off? I see why it is off. The regex I used expects whitespace right after the "|...|", but the lines with the word "Chain" don't have a space after the second |, so they were not added. I'll need to fix this.

I think changing the regex from this: (.+?)\|(.+?)\|\s\s+?(.+?)\s\s+(.+?)\s\s+(.+) to this: (.+?)\|(.+?)\|.*?\s\s+?(.+?)\s\s+(.+?)\s\s+(.+) should work
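A quick way to check the two patterns side by side is sketched below. The two sample lines are made up for illustration (they are not lines from the real blast result file), and the class and method names are hypothetical. The old pattern needs whitespace directly after the second |, so the "Chain" style line fails; the new pattern lets .*? absorb the extra chain letter first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness for the two regexes from the notes.
class BlastLineRegex {

    // Original pattern: requires whitespace right after the second '|'.
    static final Pattern OLD = Pattern.compile(
            "(.+?)\\|(.+?)\\|\\s\\s+?(.+?)\\s\\s+(.+?)\\s\\s+(.+)");

    // Revised pattern: allows extra text (e.g. a chain id) before the whitespace.
    static final Pattern NEW = Pattern.compile(
            "(.+?)\\|(.+?)\\|.*?\\s\\s+?(.+?)\\s\\s+(.+?)\\s\\s+(.+)");

    // Returns the five captured groups, or null if the whole line does not match.
    static String[] parse(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)
        };
    }

    public static void main(String[] args) {
        // Made-up sample lines with two spaces between columns.
        String plain = "gb|ABC123.1|  hypothetical protein  18.0  19006";
        String chain = "pdb|1ABC|A  Chain A, hypothetical protein  18.0  19006";
        System.out.println("old/plain: " + (parse(OLD, plain) != null));
        System.out.println("old/chain: " + (parse(OLD, chain) != null));
        System.out.println("new/plain: " + (parse(NEW, plain) != null));
        System.out.println("new/chain: " + (parse(NEW, chain) != null));
    }
}
```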

How long approximately will my program take to compare two 20,000 line blast result files? On Saturday at about 11:39 am there were 78484 comparisons performed. The program started on Friday at 11:44 am, so let's say that 24 hours passed. 20000*20000 = 400,000,000 comparisons need to be made. How many hours will this take? 24/78484 = y/400000000

This will take about 122,318 hours. This will take 5097 days or 14 years. A little long haha
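The same proportion can be written as a small helper, which makes it easy to redo the estimate with a new comparison count. This is only a sketch: the class and method names are hypothetical, and it assumes the comparison rate stays constant over the whole run.

```java
// Hypothetical helper for the back-of-the-envelope runtime estimate.
class RuntimeEstimate {

    // Solves hoursElapsed / done = y / total for y.
    static double estimateHours(long done, double hoursElapsed, long total) {
        return hoursElapsed * total / done;
    }

    public static void main(String[] args) {
        long total = 20000L * 20000L;          // 400,000,000 comparisons
        double hours = estimateHours(78484L, 24.0, total);
        double days = hours / 24.0;
        double years = days / 365.0;
        System.out.println(Math.round(hours) + " hours, "
                + Math.round(days) + " days, about "
                + Math.round(years) + " years");
    }
}
```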

Okay now I can take a look at getting ssh and bepipred working

ssh with java

Actually spent time trying to get the blast to work. Wanted to blast an accession against 20,000 accessions, but this doesn't seem to be working. It looks like I may need to retrieve the sequences.

When I blasted one retrieved sequence against the other approximately 20,000 retrieved sequences, all in one FASTA file, this gave me the result I wanted. It looks like the program will take approximately 5 days to compare all of the sequences one at a time in file 1 with all of the sequences at once in file 2. I think this is fairly reasonable. Much better than the 11-16 year time period it was going to take before.

I would like to start cleaning up my code a little bit. I have 3 versions of the comparison file method, but I think I can get rid of all but 1 and store the others somewhere else. I would also like to make it so that certain files are not created unless they need to be. I think I'll start on this another time.