Aidan Slingsby, City University London, a.slingsby@soi.city.ac.uk [PRIMARY contact]
Jo Wood, City University London, jwo@soi.city.ac.uk
Jason Dykes, City University London, jad7@soi.city.ac.uk
SequenceView was built in a couple of days using Processing, a set of Java libraries for rapidly designing and producing graphical sketches. The giCentre's long experience of using Processing in this way makes such development rapid.
The screen is split into three parts (vertically):
Bases are identified by hue (optionally labelled).
Interactions
The only automated function is the ability to find the longest 'common sequences' in any DNA selection. This takes around 15 seconds. Matches of selected common sequences are identified in red. Common sequences can be hidden from view, leaving only those columns where at least one mutation has taken place in the originally-selected set of DNA. With this dataset, those columns fit onto a single screen.
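The idea of hiding common sequences can be sketched as follows. This is a minimal illustration in Python, not the tool's Processing code, and it assumes the selected sequences are already aligned and of equal length (consistent with the mutations being substitutions). The function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def mutation_columns(seqs: List[str]) -> List[int]:
    """Return 0-based column indices where the sequences do not all agree."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in seqs}) > 1]


def common_runs(seqs: List[str]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs of columns shared by all sequences,
    longest run first."""
    length = len(seqs[0])
    mutated = set(mutation_columns(seqs))
    runs = []
    start = None
    for i in range(length + 1):
        shared = i < length and i not in mutated
        if shared and start is None:
            start = i            # a new shared run begins here
        elif not shared and start is not None:
            runs.append((start, i))
            start = None
    # Stable sort: runs of equal length keep their left-to-right order.
    return sorted(runs, key=lambda r: r[0] - r[1])


# Toy selection (made-up data, not from the challenge dataset).
selection = ["ACGTACGT", "ACGAACGT", "ACGTACCT"]
print(mutation_columns(selection))
print(common_runs(selection))
```

Hiding every column that falls inside a common run leaves only the columns returned by `mutation_columns`, which is what allows the mutations to fit on one screen.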
Interpretation is left to the human. SequenceView supports the user in doing this by making effective use of alignment, sorting and interaction. SequenceView is very responsive and information to support the user is provided quickly.
Link to video (18Mb)
Answer: Nigeria_B, because more of its DNA is in common with the current outbreaks than that of any other native strain. Figure 2 shows an example in which a relatively long common sequence found in the outbreaks (coloured in red) is found in Nigeria_B but not in the other native strains. Zooming out (Figure 1) or scrolling (video) through the whole sequence is quick and reveals that this pattern occurs throughout.
Steps to screenshots: The screenshots were arrived at by (a) selecting all current outbreaks; (b) requesting the longest common sequences within these; and (c) selecting all the resulting common sequences (matches are then coloured red). This and scrolling through the sequences took a couple of minutes. Central Africa and Cameroon also share significant DNA with the outbreaks.
Identifying DNA sequences common to the outbreaks in the native strains and keeping DNA sequences aligned were key to answering this question.
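The comparison was made visually, but the underlying question can be approximated numerically. The sketch below assumes aligned, equal-length sequences and counts, for each native strain, the columns where all outbreaks agree and the native strain has the same base. It works column by column rather than on contiguous common sequences, so it is a simplification; the strain names and data are hypothetical.

```python
from typing import Dict, List


def shared_outbreak_bases(native: str, outbreaks: List[str]) -> int:
    """Count columns where all outbreaks agree and the native strain matches."""
    count = 0
    for i, base in enumerate(native):
        column = {s[i] for s in outbreaks}
        if len(column) == 1 and base in column:
            count += 1
    return count


def rank_natives(natives: Dict[str, str], outbreaks: List[str]) -> List[str]:
    """Return native strain names, most similar to the outbreaks first."""
    return sorted(
        natives,
        key=lambda name: shared_outbreak_bases(natives[name], outbreaks),
        reverse=True,
    )


# Toy data (made up for illustration only).
outbreaks = ["ACGTAC", "ACGTAA"]
natives = {"native_1": "ACGTAC", "native_2": "TCGAAC"}
print(rank_natives(natives, outbreaks))
```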

Figure 1: Zoomed-out view showing native sequences (top), outbreaks (middle) and longest common sequences found in the outbreaks (bottom). Common sequences are identified in red and those over which the mouse is positioned are coloured yellow.

Figure 2: Section of the DNA sequences showing similarity with Nigeria_B.
Answer: The patient with strain 123, because it is more similar to Nicoli's strain (583). Only one base is different (at position 269), as opposed to three bases for the other strain. Figure 4 shows the three outbreak DNA sequences with the sequences common to all three hidden, leaving just the four bases where mutations have occurred.
Steps to screenshots: (a) the three strains (583, 51 and 123) were selected; (b) common sequences were computed; (c) these were selected (causing matching sequences to be highlighted in red); (d) non-selected sequences were hidden (Figure 3); and (e) common sequences were hidden, leaving just the columns in which at least one mutation had taken place (Figure 4). Outbreak sequences can be ordered numerically to assist in finding specific outbreak strains.
SequenceView allowed only those bases in which a mutation had occurred in the three strains to be visually isolated, and this was key to answering the question.
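Counting the bases that differ between two aligned strains is the numerical counterpart of this visual comparison. The sketch below is illustrative only: it assumes equal-length aligned sequences and reports 1-based positions (whether the tool's column index is 0- or 1-based is not stated, so that is an assumption). The names and data are hypothetical.

```python
from typing import Dict, List


def differing_positions(a: str, b: str) -> List[int]:
    """Return the 1-based positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


def closest_strain(reference: str, candidates: Dict[str, str]) -> str:
    """Return the candidate with the fewest differing bases from the reference."""
    return min(
        candidates,
        key=lambda name: len(differing_positions(reference, candidates[name])),
    )


# Toy data (made up; not the challenge sequences).
reference = "ACGTACGT"
candidates = {"candidate_1": "ACGTACGA", "candidate_2": "TCGAACGA"}
for name, seq in candidates.items():
    print(name, differing_positions(reference, seq))
print(closest_strain(reference, candidates))
```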

Figure 3: (Zoomed-in) view showing all the native sequences (top), just 51, 123 and 583 of the outbreaks and the common sequences (bottom).

Figure 4: As figure 3, but with common sequences hidden, and with the base labels showing.
Answer:
Outbreaks were sorted in decreasing order of severity (blue column) with whitespace between categories. Within this sorting any mouseovered column can be sorted by its base (compare Figures 5 and 6) spatially consolidating bases of the same type.
Assumptions: (a) one base mutation (e.g. 313/707) is too small a sample from which to draw many conclusions; (b) mutations should apply to different DNA sequences where possible; (c) 'moderate' may also represent an increase in severity; (d) all severe DNA sequences should be covered; (e) mutation correlation does not necessarily imply that both mutations are needed.
Steps to screenshots: (a) select all outbreaks; (b) find common sequences; (c) select these; (d) sort on severity; (e) try additionally sorting by the bases in various columns.
Three features were key to answering this question: the ability to sort strains on severity and additionally by the bases in any column; the ability to exclude the common sequences (isolating the mutations); and the column index tooltip.
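The two-level sort can be expressed as a single composite sort key. The sketch below assumes severity is encoded as an integer (higher means more severe), which is an assumption made for illustration; the `Strain` record and its data are hypothetical.

```python
from typing import List, NamedTuple


class Strain(NamedTuple):
    name: str
    severity: int   # hypothetical encoding: higher means more severe
    sequence: str


def sort_strains(strains: List[Strain], column: int) -> List[Strain]:
    """Sort by decreasing severity, then by the base in the given column."""
    return sorted(strains, key=lambda s: (-s.severity, s.sequence[column]))


# Toy data (made up for illustration only).
strains = [
    Strain("s1", 2, "ACG"),
    Strain("s2", 3, "ATG"),
    Strain("s3", 3, "ACG"),
    Strain("s4", 2, "AAG"),
]
print([s.name for s in sort_strains(strains, column=1)])
```

Within each severity category, strains sharing a base in the chosen column end up adjacent, which is the spatial consolidation that makes a mutation stand out against severity.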

Figure 5: Sorted by symptom severity and the bases on column 269.

Figure 6: As Figure 5, but sorted by column 223.
Answer:
Each mutation applies to a different DNA sequence, and together they cover most of the top 50% in terms of seriousness.
We used the same technique and assumption as above, but we sorted the strains on seriousness (sum of the disease characteristics; the grey column). This gave equal weight to these characteristics.
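As a small illustration of the equal-weight sum, the sketch below adds up per-strain characteristic scores and sorts on the total. The characteristic names, their integer encoding and the data are assumptions made for the example.

```python
from typing import Dict, List


def seriousness(characteristics: Dict[str, int]) -> int:
    """Equal-weight sum of the disease characteristic scores."""
    return sum(characteristics.values())


def sort_by_seriousness(strains: Dict[str, Dict[str, int]]) -> List[str]:
    """Return strain names, most serious first."""
    return sorted(strains, key=lambda name: seriousness(strains[name]), reverse=True)


# Toy data; characteristic names and scores are hypothetical.
strains = {
    "s1": {"severity": 1, "complications": 0, "drug_resistance": 1},
    "s2": {"severity": 3, "complications": 2, "drug_resistance": 1},
    "s3": {"severity": 2, "complications": 1, "drug_resistance": 0},
}
print(sort_by_seriousness(strains))
```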
Seriousness was highly correlated with severity. We could have chosen those mutations that affected just the top two or three seriousness categories, but some of these were answers to the previous question (i.e. strongly associated with severity). Also, since only three strains were in the most serious group, the sample sizes were quite small (see assumptions in previous answer). So instead we opted for complementary mutations (affecting different strains) which together covered most of the top half of serious cases.
We did not find evidence that more than one mutation on the same DNA sequence was necessary. Correlated mutations are candidates, but we decided that the evidence was circumstantial (see assumptions in previous answer).
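The notion of complementary mutations covering the most serious cases can be checked with simple set operations. This was judged visually in our analysis; the sketch below is only one way it might be formalised. The column numbers, strain names and the mapping from columns to affected strains are hypothetical.

```python
from typing import Dict, List, Set


def covered(mutated_by_column: Dict[int, Set[str]], chosen: List[int]) -> Set[str]:
    """Strains carrying at least one of the chosen mutations."""
    result: Set[str] = set()
    for column in chosen:
        result |= mutated_by_column[column]
    return result


def is_complementary(mutated_by_column: Dict[int, Set[str]], chosen: List[int]) -> bool:
    """True if no strain carries more than one of the chosen mutations."""
    total = sum(len(mutated_by_column[c]) for c in chosen)
    return total == len(covered(mutated_by_column, chosen))


def coverage(mutated_by_column: Dict[int, Set[str]], chosen: List[int],
             target: Set[str]) -> float:
    """Fraction of the target strains carrying a chosen mutation."""
    return len(covered(mutated_by_column, chosen) & target) / len(target)


# Toy data: hypothetical columns mapped to the strains mutated there.
mutated_by_column = {10: {"s1", "s2"}, 20: {"s3"}, 30: {"s2", "s4"}}
top_half = {"s1", "s2", "s3", "s4"}
print(is_complementary(mutated_by_column, [10, 20]))
print(coverage(mutated_by_column, [10, 20], top_half))
```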
We can also explore other important characteristics of the disease, visually.
Mutations that change different characteristics of strains of the outbreak will require different planning responses by health authorities. Identifying key mutations as the pandemic continues is therefore essential for managing the response.
Metrics
Metrics could be computed for each mutation (e.g. number of mutations weighted by seriousness), and we would advocate implementing some to assist in data interpretation in future. They could, for example, be used to identify likely DNA candidates for a particular criterion or as a basis for sorting. We did find, however, that interpreting the data visually was essential and any further metrics should support rather than replace this. For example, we easily confirmed that mutations were substitutions rather than insertions, by looking for offsets and not finding any. Sorting and alignment using the gaps between disease characteristic categories and horizontal line placement (right click; Figure 8) were particularly important. For example, the mutation at 955 does not look significant in isolation, but it is when considered in the context of the other mutations (they are complementary).
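One possible form of such a metric is sketched below. It is not implemented in SequenceView; it is an assumption about how "number of mutations weighted by seriousness" might be defined. Here a strain counts as mutated at a column if its base differs from the most common base in that column, and each mutated strain contributes its seriousness score. All names and data are hypothetical.

```python
from collections import Counter
from typing import List


def weighted_mutation_score(sequences: List[str], seriousness: List[int],
                            column: int) -> int:
    """Sum the seriousness of strains whose base differs from the column consensus."""
    bases = [s[column] for s in sequences]
    consensus, _ = Counter(bases).most_common(1)[0]
    return sum(w for b, w in zip(bases, seriousness) if b != consensus)


def rank_columns(sequences: List[str], seriousness: List[int]) -> List[int]:
    """Return 0-based column indices, highest weighted score first."""
    scores = [weighted_mutation_score(sequences, seriousness, c)
              for c in range(len(sequences[0]))]
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)


# Toy data (made up for illustration only).
sequences = ["ACG", "ATG", "ACG", "ACT"]
seriousness = [1, 5, 2, 4]
print([weighted_mutation_score(sequences, seriousness, c) for c in range(3)])
print(rank_columns(sequences, seriousness))
```

A score like this could drive a sort order, but it would not by itself reveal complementary mutations, which is one reason visual inspection remains necessary.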
Flexibility
Our approach of rapidly building a tool as part of the data exploration process using Processing enables us to incrementally add functionality where needed, including implementing new metrics or loading additional datasets. This allows the tool to grow in line with the depth of analysis required. SequenceView was built with this particular dataset in mind, but the functionality was designed to be as generic as possible and it should work with similar data in the same format.
Scalability
SequenceView was designed to accommodate this size of dataset in terms of computation time, graphical display and memory, and it is expected to work for datasets of similar size. In the current design all mutations fit on the same screen once the common sequences have been hidden. If there were many more mutations and/or longer DNA sequences, it might not be possible to fit them all on the screen at once. The on-screen size of bases could be reduced and scrolling could be used, but only if the requirement for scrolling was minimal, as scrolling would make visual analysis much more challenging. For datasets that are much larger, some redesign might be necessary: for example, allowing the user to hide any columns not currently under consideration so that those that are can fit on one screen, or summarising mutations by disease characteristic category instead of showing all DNA sequences.

Figure 7: Sorted by overall seriousness.

Figure 8: Sorted by overall seriousness, with a horizontal line showing that the T base (mouse cursor) is complementary to 842 and 790 for this strain.

Figure 9: Sorted by complications, note likely key mutation.

Figure 10: Sorted by drug resistance, note likely key mutation.