Welcome to the Euglena genome project.

Euglena is a genus of protists capable of both heterotrophy and photosynthesis (autotrophy). Euglena are also capable of phagocytosis, possess two flagella (only one of which is involved in locomotion), have a flagellar pocket-like organelle (the reservoir), are phototactic, and exhibit a distinctive ‘euglenoid’ movement when encountering a solid substrate. They are distantly related to the trypanosomatids within the Excavata supergroup. The plastid genome has been sequenced, and there are ~20k EST sequences in the database, but there has been no genome sequencing effort. Given the potential importance of euglenids, in terms of both taxonomic position and unique biology, for understanding many aspects of protist and evolutionary cell biology, we initiated a sequencing project, primarily for gene discovery and comparative genomics. We are using a combination of Illumina and 454 sequencing, together with mapping of multiple transcriptome datasets to the assembly to train gene prediction. We anticipate a limited release of data for annotation purposes in spring 2016.

Who’s involved: Mark C. Field, Steve Kelly, ThankGod Ebenezer, Mark Carrington, Michael Lebert, Michael Ginger, Julius Lukes, Andrew Jackson, Joel Dacks, Bill Wickstead, Ellis O’Neill, Ludek Koreny, Vladimir Hampl, and Harry De-Koning.

Strain being sequenced: Euglena gracilis Z, kindly provided by William Martin (Düsseldorf). DNA was isolated using the method of Medina-Acosta and Cross (1993). Access to the data is restricted: they are available only by invitation or on specific request. If you use the data, we ask that you acknowledge the source as follows: “E. gracilis genome and transcriptome data obtained from the sequence project at http://euglenadb.org/”.

Genome assembly statistics (draft)


Number of sequences:        2066288
Median sequence length:     457
Mean sequence length:       694
Max sequence length:        166587
Min sequence length:        106
No. sequences > 1 kbp:      373610
No. sequences > 10 kbp:     1459
No. sequences > 100 kbp:    2
No. gaps:                   0
Bases in gaps:              0
N50:                        955
Combined sequence length:   1435499417
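
The page does not say how these summary statistics were computed. As a minimal sketch, assuming the conventional definition of N50 (the length L such that sequences of length L or longer together contain at least half of all assembled bases), the figures in these tables could be reproduced from a list of sequence lengths roughly as follows. The function names `n50` and `summarize` are illustrative and not part of the project's own tooling.

```python
from statistics import mean, median


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the conventional definition: walk the lengths from longest to
    shortest and return the first length at which the running total reaches
    at least half of the combined length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def summarize(lengths):
    """Compute summary statistics of the kind listed in the tables."""
    return {
        "number_of_sequences": len(lengths),
        "median_length": median(lengths),
        "mean_length": mean(lengths),
        "max_length": max(lengths),
        "min_length": min(lengths),
        "n_longer_than_1kbp": sum(1 for n in lengths if n > 1000),
        "n50": n50(lengths),
        "combined_length": sum(lengths),
    }


if __name__ == "__main__":
    # Toy lengths for illustration only; these are not project data.
    toy = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(summarize(toy))
```

For the toy input the combined length is 54, and the three longest sequences (10 + 9 + 8 = 27) reach half of it, so the N50 is 8. For a draft assembly like this one, an N50 close to the mean length suggests that most of the sequence lies in short contigs.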

Transcriptome assembly statistics (final)


Number of sequences:        72509
Median sequence length:     540
Mean sequence length:       869
Max sequence length:        25763
Min sequence length:        202
No. sequences > 1 kbp:      19765
No. sequences > 10 kbp:     25
No. sequences > 100 kbp:    0
No. gaps:                   0
Bases in gaps:              0
N50:                        1242
Combined sequence length:   63050794

Coding Sequence (CDS)

Number of sequences:        36526
Median sequence length:     765
Mean sequence length:       1041
Max sequence length:        25218
Min sequence length:        297
No. sequences > 1 kbp:      13991
No. sequences > 10 kbp:     24
N50:                        1413
Combined sequence length:   38030668


Number of proteins:         36526
Median protein length:      254
Mean protein length:        346
Max protein length:         8406
Min protein length:         98
No. proteins > 1000 aa:     1290
N50:                        471
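
The CDS and protein tables are linked by codon arithmetic: a CDS of n nucleotides spans n/3 codons, and the translated protein has one residue fewer if the final codon is a stop. The sketch below illustrates that relationship. Whether a given CDS in this dataset includes its stop codon is an assumption, since the page does not say, and `protein_length` is a hypothetical helper rather than part of the project pipeline.

```python
def protein_length(cds_nt, has_stop=True):
    """Return the protein length (aa) implied by a CDS length (nt).

    has_stop is an assumption about the input: True if the CDS ends in a
    stop codon (which is not translated), False for a CDS without one.
    """
    if cds_nt <= 0 or cds_nt % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    codons = cds_nt // 3
    return codons - 1 if has_stop else codons


if __name__ == "__main__":
    # Minimum CDS length from the table, assuming a terminal stop codon.
    print(protein_length(297))                    # 98
    # Maximum CDS length from the table, assuming no stop codon is included.
    print(protein_length(25218, has_stop=False))  # 8406
```

Under these assumptions the minimum values agree (297 nt gives 98 aa), as do the medians (765 nt gives 254 aa). The maximum values (25218 nt and 8406 aa) agree only if that CDS carries no stop codon, which would be consistent with some predicted coding sequences being incomplete at the 3' end.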