FRDB Archives

Freethought & Rationalism Archive

Old 08-03-2003, 01:05 AM   #1
Iasion
Guest
 
Posts: n/a
Some genome questions

Greetings all,

I was so impressed with some of the biology knowledge in here (e.g. Peez, thanks :-)) that I thought I'd ask some questions (sorry if they seem basic; I have been reading lots, but some details elude me).


The Genome

The human genome was recently fully sequenced - but WHOSE genome was it?

I.e. was ONE person chosen and their cells passed to each lab?
If so, who?

Or was a small number of people sequenced? I.e. lab A did chromosome 1 from person X, lab B did chromosome 2 from person Y, etc.
If so, does the genome we sequenced actually represent one person?

Or does the genome represent an "average human" in some way? (but how could it?)

Or did each lab sequence whoever they chose?

This leads to ....


Variations

How is/was the variation between humans handled?
I note dbSNP seems to keep this data - was each chromosome sequenced by many labs and the variations fed into dbSNP?

How much does the genome vary across human cultures and individuals?

How well do we even KNOW the amount of variation?


The Data

How does GenBank differ from refSeq ?
How does a "contig" differ from a "sequence" ?


Genes

The real work now is in finding genes it seems.

But,
if we know roughly where the genes are and where the "junk" is (how do we know this?) -

why is it so hard to find genes?
Isn't it a simple task of finding the "start" codon AUG and reading until the "stop" codon TRP? (or is it because they both have OTHER meanings?)

If there are only 30,000 or so, why haven't we found them already with all the processing power available?


BLAST

Finding genes is apparently done with BLAST - which seems to match a given sequence with the genome database.

Why?
How or why does a lab come up with a sequence, knowing its exact bases, but NOT knowing its location?



Any help much appreciated, and a pointer to a detailed tutorial would be helpful - I ripped through the basic stuff in no time, but I find the real sites too tough to chew yet (LocusLink, GOLD, OMIM, phew)


Iasion
 
Old 08-03-2003, 01:31 AM   #2
Veteran Member
 
Join Date: Mar 2003
Location: Edinburgh
Posts: 1,211

IIRC the work by Celera was actually done on Craig Venter's genome, oh the hubris.
Wounded King is offline  
Old 08-03-2003, 02:22 AM   #3
Veteran Member
 
Join Date: Mar 2003
Location: Edinburgh
Posts: 1,211

Genbank is the sequences that scientists submit directly.

Refseq is a subset of selected non-redundant sequence data.
Wounded King is offline  
Old 08-03-2003, 04:02 AM   #4
Regular Member
 
Join Date: Jul 2002
Location: Australia
Posts: 214

Quote:
Genes

The real work now is in finding genes it seems.

But,
if we know roughly where the genes are and where the "junk" is (how do we know this?) -

why is it so hard to find genes?
Isn't it a simple task of finding the "start" codon AUG and reading until the "stop" codon TRP? (or is it because they both have OTHER meanings?)

If there are only 30,000 or so, why haven't we found them already with all the processing power available?
I'll have a crack..
A gene is much more than simply a start codon/stop codon - in order for it to be transcribed, it requires a transcription initiation sequence 5' upstream of the ATG. For instance, consider this example;
ATGCCTAGATGCCCGATA

Where the start codon actually is depends on which reading frame you're in:

i.e.
ATG CCT AGA TGC CCG ATA

or

ATG CCC GAT

That's why genes require transcription initiation sequences 5' upstream of where transcription starts, to "tell" RNA polymerase where to bind to start transcribing the DNA into RNA. IIRC the first ATG after the initiation sequence is the start codon. So you can see that it's not simply a case of looking for start codons in the genome, because they're not all going to be the start of a gene.
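
To make the reading-frame point concrete, here is a minimal Python sketch that lists every ATG in the example sequence above together with the frame it falls in. The function names are made up for illustration, and it only looks at the forward strand; it is not how real gene finders work.

```python
def find_atg_positions(dna):
    """Return (index, frame) for every ATG on the forward strand; frame is index % 3."""
    dna = dna.upper()
    hits = []
    start = dna.find("ATG")
    while start != -1:
        hits.append((start, start % 3))
        start = dna.find("ATG", start + 1)
    return hits


def codons_from(dna, start):
    """Split dna into complete codons beginning at index start."""
    return [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]


seq = "ATGCCTAGATGCCCGATA"
for pos, frame in find_atg_positions(seq):
    print(pos, frame, " ".join(codons_from(seq, pos)))
# 0 0 ATG CCT AGA TGC CCG ATA
# 8 2 ATG CCC GAT
```

The two ATGs sit in different frames, so a scan that only looks for "ATG" cannot tell which one (if either) begins a real gene.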


Quote:
BLAST

Finding genes is apparently done with BLAST - which seems to match a given sequence with the genome database.

Why?
How or why does a lab come up with a sequence, knowing its exact bases, but NOT knowing its location?
there are several reasons:

say you sequence a gene of interest in a fruitfly - you could then do a BLAST to find similar genes in the same organism (this way you could find functionally related genes) or you could BLAST to find similar genes in a different organism. A BLAST result will present you with a list of candidates of sequences homologous to your sequence, and give you a score based on how similar those sequences are.

Another useful application is when you have the sequence of an mRNA (i.e. the result of the expression of a gene) but you have no idea where in the genome that mRNA was transcribed from. You could do a BLAST to find out where it is, and then look upstream for regulatory elements controlling its expression.

I'm sure there are other reasons to use it; I'll let the people who really know what they're talking about tell you those.
monkenstick is offline  
Old 08-03-2003, 10:40 AM   #5
Veteran Member
 
Join Date: Jul 2001
Location: Seattle
Posts: 4,261
Re: Some genome questions

Hi Iasion,

Here are a couple of helpful websites:

Frequently Asked Questions about the Human Genome Project.

Facts About Genome Sequencing

Quote:
Whose genome was sequenced in the public (HGP) and private projects?

The human reference sequence does not represent an exact match for any one person's genome.

In the Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few samples were processed as DNA resources, and the source names were protected so neither donors nor scientists knew whose DNA was being sequenced.
I think they tried to get people from different races. But since it is a small sample, I'm sure there is more variability worldwide than the genome project is going to suggest.

scigirl
scigirl is offline  
Old 08-03-2003, 12:15 PM   #6
Veteran Member
 
Join Date: Jun 2001
Location: Denver, CO, USA
Posts: 9,747
Re: Some genome questions

Quote:
Originally posted by Iasion

Variations

How is/was the variation between humans handled?
I note dbSNP seems to keep this data - was each chromosome sequenced by many labs and the variations fed into dbSNP?

How much does the genome vary across human cultures and individuals?

How well do we even KNOW the amount of variation?
My understanding is that humans differ on average by about 0.1% in the sequence of DNA. In other words, about 1 out of every 1000 nucleotides will be different between you and any randomly picked human. This difference, I assume, is calculated from observed SNPs in parts of the genome for which we have multiple sequences from hundreds (or thousands) of people, and extrapolated to the entire genome. Don't quote me on that though.
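
As a back-of-the-envelope check on what 0.1% means in absolute terms, here is a tiny Python sketch. The genome size of roughly 3 billion base pairs is an assumption brought in for the arithmetic, not a figure from this post.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed haploid human genome size, roughly 3 billion bp
DIFFERENCE_RATE = 0.001         # 0.1%, i.e. about 1 in every 1000 nucleotides


def expected_differences(genome_size=GENOME_SIZE_BP, rate=DIFFERENCE_RATE):
    """Approximate number of differing positions between two genomes."""
    return round(genome_size * rate)


print(expected_differences())  # 3000000
```

So under these assumptions two people would differ at something like three million positions.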

From what I've heard, about 80% of genetic differences can be found within people of your same ethnic group. Within the remaining 20%, some amount is different between ethnic groups, and a smaller amount is different between broader racial categories. In other words, most of our genetic differences are within groups, and not between groups.

Quote:
The Data

How does GenBank differ from refSeq ?
How does a "contig" differ from a "sequence" ?
As best as I can tell, refSeq is a more fully annotated and non-redundant database built using the GenBank database. In other words, GenBank is kind of like the raw data, and refSeq is a refined subset of that data. Both are searchable.

A contig is simply a set of contiguous sequences that overlap to form a larger sequence. Keep in mind, the way genomes are sequenced is by chopping them up into small bits, doing PCR, and then sequencing the bits. Since they use different restriction enzymes when chopping up a genome, the pieces are overlapping. They can therefore assemble the whole genome sequence by matching up the places where there's overlap. These overlapping bits are called a contig. There's a bit of confusion, because the word has been used in different ways. This page explains the confusion somewhat.
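
The idea of matching up overlapping pieces can be sketched in a few lines of Python. This is a toy greedy merge with made-up reads and made-up function names, assuming error-free reads and no repeats; real assemblers are far more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (at least min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap until none overlap."""
    reads = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return reads  # whatever is left are the contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]


print(greedy_assemble(["TTAACCGT", "GATTACAGG", "CAGGCTTAA"]))
# ['GATTACAGGCTTAACCGT']
```

If some reads share no overlap with the rest, the function returns more than one string, which is roughly the situation where a genome is left as several separate contigs with gaps between them.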

Quote:
BLAST

Finding genes is apparently done with BLAST - which seems to match a given sequence with the genome database.

Why?
How or why does a lab come up with a sequence, knowing its exact bases, but NOT knowing its location?
I'm not sure what you're asking, but when someone sequences DNA or a protein in a given laboratory setting, there's no way of knowing where on a chromosome it came from. When isolating DNA, you usually start out with a primer that will bind to part of the sequence in the genome, and then use PCR to make many copies of the gene(s) you're interested in. Only then can you sequence it.

The purpose of a BLAST search is to find any sequences in a given database that match (or nearly match) your inquiry. This lets you identify a bit of DNA or protein you recently sequenced. It also lets you find homologues and other sequences of interest.

Nowadays, if a lab sequences some genomic DNA, they should be able to find its location pretty easily with a BLAST search, assuming it's from an organism that's been fully sequenced. However, if it's from an organism that hasn't been fully sequenced, the sequence in question might not be in a database at all. And furthermore, if they've sequenced a protein, the database is unlikely to be complete for any given organism.

theyeti
theyeti is offline  
Old 08-03-2003, 06:14 PM   #7
Contributor
 
Join Date: Jul 2000
Location: Lebanon, OR, USA
Posts: 16,829
Re: Some genome questions

Iasion:
The human genome was recently fully sequenced - but WHOSE genome was it?

Several people's; the only known name is Craig Venter.

Or does the genome represent an "average human" in some way? (but how could it?)

It's necessarily a sort of average, but, of course, weighted toward whoever they got samples of.

How much does the genome vary across human cultures and individuals?

About 3 million base pairs -- 0.1% of the roughly 3 billion base pairs in the genome.

How well do we even KNOW the amount of variation?

There are some projects to map this variation, but there's already a lot of research done on various selected genes.

Furthermore, these variations tend to be inherited in groups, as if chromosomes have some preferred crossing-over points; the HapMap Project is for mapping these groups.

How does a "contig" differ from a "sequence" ?

A contig is simply a contiguous sequence (no gaps).

The real work now is in finding genes it seems.

And regulatory regions also -- regions involved in controlling gene expression.

But,
if we know roughly where the genes are and where the "junk" is (how do we know this?) -

why is it so hard to find genes?
Isn't it a simple task of finding the "start" codon AUG and reading until the "stop" codon TRP? (or is it because they both have OTHER meanings?)


The stop codons are UAA, UAG, and UGA in the "standard" code -- and as mentioned earlier, the big problem is where one starts reading.
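
For illustration, here is what the naive "read from ATG to a stop codon" scan looks like in Python, using the DNA spellings TAA, TAG and TGA. (Trp is tryptophan, coded by UGG, not a stop signal.) The function name is hypothetical, and the sketch assumes a forward strand with no introns, which is exactly why it falls short on real eukaryotic genomes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs for ATG...stop stretches on the forward strand.

    end is exclusive and includes the stop codon. Per frame, only the first ATG
    before each stop is reported; the reverse strand and introns are ignored.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGAAATGGTGACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
# 2 14 ATGAAATGGTGA
```

The hit reads ATG AAA TGG TGA: the TGG in the middle is an ordinary tryptophan codon and the scan only ends at TGA. On the 18-base example from earlier in the thread the same function reports nothing, because neither ATG is followed by an in-frame stop.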
lpetrich is offline  
Old 08-03-2003, 09:13 PM   #8
Contributor
 
Join Date: Jul 2000
Location: Lebanon, OR, USA
Posts: 16,829
Re: Some genome questions

Iasion:
... why is it so hard to find genes? ...

If there are only 30,000 or so, why haven't we found them already with all the processing power available?


There is actually a simple trick that gets around the previously-mentioned difficulties in finding genes. It makes use of experience in molecular-evolution studies. Select some genes that do the same thing in several different species, and sequence them and compare them. The result is a family tree of those genes, and such family trees generally match very closely the family trees inferred by traditional means. Furthermore, the more functionally-constrained parts of a molecule are more slowly-evolving than the less functionally-constrained ones. All this leads to the conclusion that much molecular evolution is genetic drift between selectively-similar alternatives, Kimura's "neutral theory" of molecular evolution.

So one simply compares genomes from different species, especially species closely-related enough to have many genes in common and not divergent enough to be unrecognizable. Regions of their genomes that closely resemble each other are very likely to be some shared gene.

And that's how many of the genes in the human genome have been recognized -- by comparing it with the mouse genome and other sequenced genomes.
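
A rough Python sketch of the comparison idea: slide a window along two sequences and flag the windows where they agree closely. It assumes the two sequences have already been aligned to the same length, which is the hard part in practice; the function names, window size and threshold are arbitrary choices for illustration.

```python
def window_identity(seq_a, seq_b, window=10):
    """Fraction of identical positions in each window of two equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for i in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[i:i + window], seq_b[i:i + window])
        scores.append(sum(a == b for a, b in pairs) / window)
    return scores


def conserved_windows(seq_a, seq_b, window=10, threshold=0.9):
    """Start indices of windows whose identity meets the threshold."""
    scores = window_identity(seq_a, seq_b, window)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy data: the first ten bases are shared, the rest have diverged completely.
species_a = "ACGTACGTAC" + "TTTTTTTTTT"
species_b = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(species_a, species_b, threshold=1.0))  # [0]
```

Windows that stay highly similar between two species are candidates for something functional, though as noted next, similarity alone does not say what kind of element it is.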

However, this method tells us nothing about what a gene does, and it does not distinguish protein-coding genes from functional-RNA genes (transfer RNA, ribosomal RNA, etc.) or from regulatory regions.

Finding genes is apparently done with BLAST - which seems to match a given sequence with the genome database.

Why?
How or why does a lab come up with a sequence, knowing its exact bases, but NOT knowing its location?


First, one does not have to know the exact sequence of the gene one is looking for; the BLAST software imposes penalties for mismatches and gaps, and will find the alignments with the lowest penalty values.
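
To show what scoring with mismatch and gap penalties means, here is a small Python sketch of Smith-Waterman local alignment, the exact dynamic-programming method that BLAST approximates with faster heuristics. This is not BLAST's own algorithm, and the scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions. It follows the usual convention of maximizing a score rather than minimizing a penalty.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("ACGT", "GGACGTGG"))          # 8: exact match found inside the target
print(local_alignment_score("ACGTACGT", "ACGAACGT"))      # 13: seven matches, one mismatch
```

A query with one wrong base still scores nearly as well as a perfect one, which is why an inexact sequence can still find its gene.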

Also, if you are doing a BLAST search against a genome, the search will tell you where in the genome it found the match.
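
In miniature, "telling you where it found the match" looks like the following Python sketch, which slides the query along a toy genome and reports the position with the fewest mismatches. It is a brute-force, ungapped stand-in with a hypothetical function name; BLAST gets its answer far more quickly and handles gaps.

```python
def best_ungapped_hit(query, genome):
    """Return (position, mismatches) of the window in genome closest to query."""
    if len(query) > len(genome):
        raise ValueError("query longer than genome")
    best_pos, best_mm = 0, len(query) + 1
    for pos in range(len(genome) - len(query) + 1):
        window = genome[pos:pos + len(query)]
        mm = sum(q != g for q, g in zip(query, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


genome = "TTTTGATTACAGGCTTTT"
print(best_ungapped_hit("GATTACA", genome))  # (4, 0)
print(best_ungapped_hit("GATCACA", genome))  # (4, 1)
```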

And if you wish to find the function of some gene you've just sequenced, a cheap shortcut is to use BLAST to find out which genes are closest in sequence to it, and to extrapolate from those genes' functions to your gene. That can be risky, but it is a good starting hypothesis.

Thus, if you've just sequenced Salmonella typhimurium's genome, you can guess what many of its genes do by comparing those genes to the already-sequenced Escherichia coli genome. I chose these examples because they are relatively closely-related enteric bacteria (those that can cause diarrhea and the like).
lpetrich is offline  
Old 08-03-2003, 10:38 PM   #9
Regular Member
 
Join Date: Apr 2003
Location: Nashville, Tennessee
Posts: 114

Finding genes is apparently done with BLAST - which seems to match a given sequence with the genome database.

How or why does a lab come up with a sequence, knowing its exact bases, but NOT knowing its location?

Using BLAST you can search amino acid sequence against genomic DNA looking for conserved domains- i.e. Immunoglobulin domains, Zinc binding domains, etc...
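
Searching a protein sequence against DNA means translating the DNA in all six reading frames first (three per strand), which is what the tblastn flavour of BLAST does before scoring. Below is a minimal Python sketch of that translation step using the standard genetic code. The helper names are made up, and the exact-substring lookup is a simplifying assumption; a real search scores inexact matches.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unrecognized codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_peptide(dna, peptide):
    """Frames whose translation contains the peptide exactly."""
    return [label for label, prot in six_frame_translations(dna).items() if peptide in prot]


print(find_peptide("CCATGAAATGGTGACC", "MKW"))  # ['+3']
```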

From there you can get a predicted sequence- there are software packages that are pretty decent at detecting intron/exon boundaries, splice sites, upstream promoter sites, etc...

BLAST is nice, but has lots of limitations, and I have had it "not find" sequences several times despite 100% homology.

In the end though you have to clone it. You could RT-PCR it from the putative sequence identified by the informatics work.
acidphos is offline  
Old 08-04-2003, 02:56 AM   #10
Iasion
Guest
 
Posts: n/a
thanks all

Greetings all,

thanks heaps for that help, some things are clearer now...

(The penny has dropped regarding BLAST).

Regarding WHOSE genome was sequenced - if we don't know exactly who, i.e. we don't know which genes are "normal" and which genes are variant - then doesn't that weaken the value of the database?

i.e. gene X in the genome as sequenced may have a variation from "normal", but we wouldn't know for sure.

So, if we BLAST a gene of interest, and find it matches gene X in the genome database, we would NOT know if that gene is a "normal" one or a variant one.

Isn't that a problem?

Iasion