DNA Recommendations

Insertion Variant


Definitions

Insertion
a sequence change where, compared to the reference sequence, one or more nucleotides are inserted and where the insertion is not a copy of a sequence immediately 5'

Description

Format: “prefix”“positions_flanking”“ins”“inserted_sequence”, e.g. g.123_124insAGC

“prefix” = reference sequence used = g.
“positions_flanking” = position two nucleotides flanking insertion site = 123_124
“ins” = type of change is an insertion = ins
“inserted_sequence” = inserted sequence = AGC 1


Note

  • prefix reference sequences accepted are g., m., c. and n. (genomic, mitochondrial, coding DNA and non-coding DNA).
  • the “positions flanking” should contain two flanking nucleotides, e.g. 123 and 124 but not 123 and 125.
    • 1 = see Uncertain; when the postion and/or the sequence of an inserted sequence has not been defined, a description may have a format like g.(100_150)ins(25)
    • the “positions_flanking” should be listed from 5’ to 3’, e.g. 123_124 not 124_123
  • tandem duplications are described as a duplication (g.123_456dup), not an insertion (g.456_457ins123_456, see Prioritization)
    • inverted duplications are described as insertion (g.234_235ins123_234inv), not as a duplication (see Inversion)
  • two variants separated by one or more nucleotides should be described individually and not as a “delins”
    • exception: two variants separated by one nucleotide, together affecting one amino acid, should be described as a “delins”
      NOTE: the SVD-WG has prepared a proposal to modify this recommendation (see SVD-WG010). The new proposal is: two variants that are separated by two or fewer intervening nucleotides (that is, not including the variants themselves) should be described as a single “delins” variant
  • for all descriptions the most 3’ position possible of the reference sequence is arbitrarily assigned to have been changed (3’rule)
  • the “inserted_sequence” can be given as the nucleotides inserted (e.g. insAGC) or, for larger insert sequences, by referring to the sequence in the reference sequence (e.g. c.849_850ins858_895) or another reference (e.g. NC_000002.11:g.47643464_47643465ins[NC_000022.10:g.35788169_35788352]). When the inserted sequence is not present in the reference genome it should be submitted to a database (e.g. GenBank) and the accession.version number obtained to refer to it.
    • 1 = see Uncertain; when the postion and/or the sequence of an inserted sequence has not been defined, a description may have a format like g.(100_150)insN[25]

Examples

  • simple insertions
    • NC_000023.10:g.32867861_32867862insT (NM_004006.2:c.169_170insA)
      the insertion of an T nucleotide between nucleotides g.32867861 and g.32867862
    • NC_000023.10:g.32862923_32862924insCCT (LRG_199t1:c.240_241insAGG)
      the insertion of nucleotides CCT between nucleotides g.32862923 and g.32862924
    • NM_004006.2:c.849_850ins858_895
      the insertion of a copy of nucleotids c.858 to c.895 between nuclotides c.849 and c.850
    • NC_000002.11:g.47643464_47643465ins[NC_000022.10:g.35788169_35788352]
      the insertion of nucleotides g.35788169 and g.35788352 as found in NC_000022.10 between nucloetides g.47643464 and g.47643465
  • complex insertions
    • NM_004006.2:c.419_420ins[T;401_419]
      the insertion of T followed by a copy of the sequence from c.401 to c.419 (a duplication not directly flanking the original sequence)
    • LRG_199t1:c.419_420ins[T;450_470;AGGG]
      the insertion of T followed by a copy of the sequence from c.450 to c.470, followed by AGGG
    • NC_000006.11:g.10791926_10791927ins[NC_000004.11:g.106370094_106370420;A[26]]
      the insertion of a copy of an Alu-repeat sequence (from chromosome 4 nucleotides g.106370094 to g.106370420), and a stretch of 26 A nucleotides, between nucleotides g.10791926 and g.10791927 on chromosome 6.
  • insertion of inverted duplicated copies
    • NM_004006.2:c.849_850ins850_900inv
      a copy of nucleotides c.850 to c.900 is inserted, in inverted orientation, 5’ of the original sequence, between nucleotide c.849 and c.850
    • NM_004006.2:c.900_901ins850_900inv
      a copy of nucleotides c.850 to c.900 is inserted, in inverted orientation, 3’ of the original sequence, between nucleotide c.900 and c.901
    • LRG_199t1:c.940_941ins[885_940inv;A;851_883inv]
      an inverted copy of nucleotides c.851 to c.940, with a G>A substitution of nucleotide c.884, is inserted directly 3’ of the original sequence
    • NM_004006.2:c.940_941ins[903_940inv;851_885inv]
      an inverted copy of nucleotides c.851 to c.940, with a deletion from nucleotides c.886 to c.902, is inserted directly 3’ of the original sequence
  • incomplete descriptions, preferably use exact descriptions only
    • NM_004006.2:c.(222_226)insG
      the insertion of a G at an unknown position in the sequence encoding amino acid 75
    • NC_000004.11:g.(3076562_3076732)insN[12]
      the insertion of 12 nucleotides (not specified) at an unknown position between nucleotides g.3076562 and g.3076732 (exon 1 of the HTT gene containing the Gln/Pro repeat region)
    • NC_000023.10:g.32717298_32717299insN (NM_004006.2:c.761_762insN)
      the insertion of one not specified nucleotide (N) between position g.32717298 and g.32717299
    • NM_004006.2:c.761_762insNNNNN (alternatively NM_004006.1:c.761_762insN[5])
      the insertion of 5 not specified nucleotides (NNNNN) between position c.761 and c.762
    • NC_000023.10:g.32717298_32717299insN[100]
      the insertion of 100 nucleotides (not specified) between position g.32717298 and g.32717299
    • NC_000023.10:g.32717298_32717299insN[(80_120)]
      the insertion of 80 to 120 nucleotides between position g.32717298 and g.32717299
    • NC_000023.10:g.32717298_32717299insN[?]
      the insertion of an unknown number of nucleotides between position g.32717298 and g.32717299
    • NC_000006.11:g.8897754_8897755ins[N[543];8897743_8897754]
      the insertion of an undefined sequence of 543 nucleotides (N[543]), and a 12 nucleotide target site duplication (g.8897743 to g.8897754), between nucleotides g.8897754 and g.8897755 on chromosome 6.
  • g.?_?ins[NC_000023.10:(12345_23456)_(34567_45678)]
    the insertion of a sequence from the X-chromosome (NC_000023.10), maximally involving nucleotides 12345_45678 but certainly nucleotides 23456_34567, at an unknown position (g.?_?) in the genome (see Uncertain)

Q&A

Can I describe a variant as g.123insG?

No, since the description is not unequivocal it is not allowed. What does the description mean, the insertion of a G at position 123 or the insertion of a G after position 123?
The situation becomes even more complex when using a coding DNA reference sequence a "-" character is used, e.g. c.-14insG or c.456-13insG. In the description c.456-13insG, when the insertion is after intronic nucleotide c.456-13, is this position c.456-12 or c.456-14?

Can I use the "^" character to describe an insertion?

No, insertions can not be described using the format g.123ˆ124insG or g.123ˆ124G. The recommendations try to restrict the number of different characters used to a minimum. Since a character was already used to indicate a range (the underscore) a new character was not required.

How should I describe the change ATCGATCGATCGATCGAGGGTCCC to ATCGATCGATCGATCGAATCGATCGATCGGGTCCC? The fact that the inserted sequence (ATCGATCGATCG) is present in the original sequence suggests it derives from a duplicative event.

The variant should be described as an insertion; g.17_18ins5_16. A description using "dup" is not correct since, by definition, a duplication should be directly 3'-flanking of the original copy (in tandem). Note that the description given still makes it clear that the sequence inserted between g.17 and g.18 is probably derived from nearby, i.e. position g.5 to g.16, and thus likely derived from a duplicative event.

A variant in the CDKN2A gene, duplicating the first 24 nucleotides of the coding DNA reference sequence, has been described as c.23ins24. My interpretation is it should be described as c.1_24dup, is this correct?

Since the sequence in that region si cagcATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCG.. the correct decription is c.9_32dup (p.(Ala4_Pro11dup)). c.1_24dup seems correct but neglects the 3'rule (3' shift possible for the underlined region). c.23ins24 is not correct since the position of the insertion is not described properly and because ins"24" does not define the sequence inserted.