Note from Mark 7-6-11

This is an e-mail conversation about the GEMODA program.

Me

Hi Mark Styczynski,

I am currently experimenting with the Gemoda program which I believe you worked on. I am using it to see how similar different peptide sequences are to each other. Occasionally, with this program I will obtain a best score of exactly 1.0. I obtain this score when I don't think I should. For example, when I compare just two peptides with extremely similar sequences I would think the significance would be a very small number, and it often is. However, sometimes it is 1.0.

Here's a concrete example:

QRQHSP and QRQHSPV have a best significance for their best motif of 1.0 when using the following command:

gemoda -s -l 4 -g 2 -m BLOSUM62 -i motif_file_for_gemoda.txt

I don't completely understand in what situations I obtain 1.0, and I also don't understand whether or not this is the answer I am supposed to get. What do you think about this? I'm basically just looking for a tool to give me a score indicating how similar two short sequences are to each other. I know this is kind of a detailed technical question about something you probably have not looked at for a very long time so I completely understand if you cannot help me much. However, any response would be greatly appreciated!

Best regards, Kurt Whittemore

Graduate Student, Arizona State University, Biodesign Institute, 727 E. Tyler Street, Tempe, AZ 85287

Mark

Kurt,

You are right, it has been easily four years since I have swum through Gemoda code, and probably substantially more than that.

My guess is that you are seeing these bad significance values in degenerate, over-simplified cases.

I'll first refer you to our supplementary information:

http://web.mit.edu/bamel/gemoda/jensen2004supp.pdf

There are details on the significance calculations in there that you should read. Once you read that, you'll see that the significance is strictly based on your dataset, and whether the similarity "signal" you are detecting is substantially different from the background noise. It is *not* telling you the likelihood of two proteins having some level of similarity given what is known about nature. This was done in order to continue with the "data agnostic" approach --- the significance is only analyzed on a problem-specific level. This means that the same run of similarity, in different backgrounds, will have different significance.

What you have, then, is likely just "this is the only long similarity, there isn't much to compare it to". If you were able to put in longer runs of non-similarity on either side, or additional non-similar sequences, your significance would likely become more like what you are expecting.

Does that help?

Me

Yes! That actually helps a lot. Thanks for the information.
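
Follow-up note: Mark's point that the same run of similarity gets a different significance in a different background can be illustrated with a toy calculation. The sketch below is not Gemoda's significance calculation (that is described in the supplementary information linked above). It assumes a much simpler stand-in: score every pair of length-4 windows taken from different sequences by counting identical positions, then report the fraction of window pairs that score at least as high as the best pair. The function names, the identity scoring (instead of BLOSUM62), and the two background sequences are all made up for illustration.

```python
from itertools import combinations


def windows(seq, length):
    """Return every contiguous window of the given length in seq."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def identity_score(a, b):
    """Count positions where two equal-length windows have the same residue."""
    return sum(x == y for x, y in zip(a, b))


def tail_fraction(seqs, length):
    """Fraction of cross-sequence window pairs scoring at least the best score.

    This is a toy stand-in for "how unusual is the best match relative to
    the background in this dataset"; it is NOT Gemoda's formula.
    """
    scores = []
    for s, t in combinations(seqs, 2):
        for a in windows(s, length):
            for b in windows(t, length):
                scores.append(identity_score(a, b))
    best = max(scores)
    return sum(score >= best for score in scores) / len(scores)


# The two peptides from the e-mail.
pair = ["QRQHSP", "QRQHSPV"]

# Hypothetical, unrelated background sequences (invented for this example).
background = ["GLWMKTDEYFNCIAV", "TFWDGYKLMENACIV"]

alone = tail_fraction(pair, 4)                 # 3 of 12 window pairs match
padded = tail_fraction(pair + background, 4)   # same 3 matches, many more pairs

print(f"two peptides only:      {alone:.4f}")
print(f"with added background:  {padded:.4f}")
```

With only the two peptides, a quarter of all window pairs are perfect matches, so the best match does not stand out from the rest of the dataset. Adding unrelated sequences leaves the matches unchanged but makes them a much smaller fraction of all comparisons, which is the effect Mark describes when he suggests adding non-similar sequence to the input.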