Studying DeepVariant’s hidden powers

Inspecting DeepVariant

To better understand what DeepVariant is learning from its training data, we used a set of simple clustering and visualization techniques to summarize the information captured in the model's high-dimensional data. In partnership with collaborators on the Google Genomics team, we first loaded examples into the Integrated Genomics Viewer (IGV), a widely used tool for inspecting genomes and sequencing data. Then, we applied Uniform Manifold Approximation and Projection (UMAP) to the embeddings of the mixed5 max-pooling layer of the model, which is roughly in the middle of the network and contains a mix of low- and high-level features. Projecting the embeddings this way makes it possible to visually inspect any emerging structures. We used different colors to represent known sequencing attributes in the input data (e.g., low-quality sequence reads and regions that are hard to map uniquely in the genome), as well as a combined attribute built from the different value combinations of the basic attributes.
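The coloring scheme described above, including the combined attribute, can be sketched in a few lines. This is only an illustration under stated assumptions: the attribute names, the example records, and the palette are hypothetical and do not come from DeepVariant or IGV. The resulting list of colors could then be passed to a scatter plot of the two-dimensional projection.

```python
from itertools import product

# Hypothetical names for two basic, boolean sequencing attributes.
BASIC_ATTRIBUTES = ("low_quality_reads", "hard_to_map_region")
# Arbitrary colors, one per possible combination of attribute values.
PALETTE = ("#999999", "#1f77b4", "#ff7f0e", "#d62728")


def combined_label(example):
    """Join the names of all basic attributes that are true for this example."""
    active = [name for name in BASIC_ATTRIBUTES if example[name]]
    return "+".join(active) if active else "none"


def color_map():
    """Assign one color to every combination of basic attribute values."""
    mapping = {}
    combos = product((False, True), repeat=len(BASIC_ATTRIBUTES))
    for color, values in zip(PALETTE, combos):
        example = dict(zip(BASIC_ATTRIBUTES, values))
        mapping[combined_label(example)] = color
    return mapping


# Made-up example records standing in for annotated model inputs.
examples = [
    {"low_quality_reads": False, "hard_to_map_region": False},
    {"low_quality_reads": True, "hard_to_map_region": False},
    {"low_quality_reads": True, "hard_to_map_region": True},
]

colors = color_map()
point_colors = [colors[combined_label(e)] for e in examples]
```

With two boolean attributes there are four combinations, so the combined attribute distinguishes points that a single-attribute coloring would merge.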

The structures that emerged reveal that some of the attributes' values are mapped close to each other, naturally forming clusters. We observed that these "natural clusters" form at different levels across model layers, and at times get "forgotten" as the network further processes the input. This suggests that different types of information about the input DNA reads are important at different depths of the network.

Based on this first look, we then used additional clustering techniques with the hope of "discovering" previously unknown attributes (clusters). We began by applying k-means clustering to find 10 clusters. K-means is a simple clustering algorithm that groups data points by proximity in vector space, without using labels that might indicate similarity. This results in visual separation between major clusters, some of which are much more populous than others. To control the size of the resulting clusters, we then applied hierarchical clustering by running k-means multiple times: first we run 3-cluster k-means, then for each of the three clusters we apply a second round of k-means to divide it further, where the number of sub-clusters is based on the shape and size of the first-round cluster.
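The two-round procedure can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it relies on a plain-Python Lloyd's k-means where a library implementation would normally be used, it runs on synthetic 2-D points rather than model embeddings, and the rule for choosing the number of sub-clusters (proportional to cluster size, with a hypothetical `target_total`) is an assumption, since the text only says the number depends on the shape and size of the first-round clusters.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


def two_round_kmeans(points, first_k=3, target_total=9):
    """Run k-means once, then split each first-round cluster again.

    Assumption: the sub-cluster count is proportional to cluster size.
    """
    first_labels, _ = kmeans(points, first_k)
    final = [None] * len(points)
    for c in range(first_k):
        idx = [i for i, lab in enumerate(first_labels) if lab == c]
        if not idx:
            continue  # an empty first-round cluster has nothing to split
        sub_k = round(target_total * len(idx) / len(points))
        sub_k = max(1, min(len(idx), sub_k))
        sub_labels, _ = kmeans([points[i] for i in idx], sub_k, seed=c + 1)
        for i, s in zip(idx, sub_labels):
            final[i] = (c, s)  # (first-round cluster, second-round cluster)
    return final


# Synthetic data: three Gaussian blobs standing in for embedding vectors.
rng = random.Random(42)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in centers
    for _ in range(60)
]
labels = two_round_kmeans(points)
```

Each point ends up with a pair of labels, so large first-round clusters are broken into more pieces than small ones, which evens out the sizes compared with a single 10-cluster run.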