FastMLC


Summary

Detect clusters in for example large collections of DNA or protein sequences, and visualize the results in a web browser.

Information

Title:

FastMLC

Official URL:

https://www.research-software.nl/software/fastmlc

Uncontrolled Keywords:

Visualization; Machine learning; Big data

Code Meta

Code Repository:

https://github.com/FastMLC/fMLC

Programming Languages:

C++	3722891
C	66239
Makefile	57736
CMake	6810
Python	5473
R	4549
Shell	2880
PHP	163

Date Created:

13 September 2017 09:11:18 UTC

Date Modified:

12 December 2018 16:00:22 UTC

Date Published:

20 March 2019 21:14:12 UTC

License:

GNU General Public License v3.0

Identifier:

103378630

Name:

FastMLC/fMLC

Embargo Date:

20 March 2019 21:14:12 UTC

Issue Tracker:

https://api.github.com/repos/FastMLC/fMLC/issues{/number}

Readme

fMLC

fMLC is the official implementation of the MultiLevel Clustering (MLC) algorithm described in Vu D. et al. 2014, used to cluster large collections of DNA sequences. fMLC was initially implemented by Szaniszlo Szoke and further developed by Duong Vu. It is written in C++ and supports multi-threaded parallelism. fMLC is also integrated with an interactive web-based tool called DiVE, which visualizes the resulting DNA sequence embeddings in 2D or 3D. The work is financially supported by the Westerdijk Fungal Biodiversity Institute and the Netherlands eScience Center.

Citation

Please cite the following paper if you are using fMLC:

D Vu, S Georgievska, S Szoke, A Kuzniar, V Robert. fMLC: Fast Multi-Level Clustering and Visualization of Large Molecular Datasets, Bioinformatics, btx810, https://doi.org/10.1093/bioinformatics/btx810

PDF version

Install

Windows

Linux

Data

There are two datasets available as inputs for fMLC. The "small" dataset contains ~4000 ITS yeast sequences, checked and validated by the specialists at the Westerdijk Fungal Biodiversity Institute. This dataset was analyzed and released in Vu D. et al. 2016. The "large" dataset contains ~350K ITS fungal sequences downloaded from GenBank (https://www.ncbi.nlm.nih.gov/), which were used in Vu D. et al. 2014 to evaluate the speed of MLC.

Download the small demo dataset.

Download the large demo dataset.

Results

After fMLC has clustered the DNA sequences, the resulting groupings can be saved as output. A sparse (or complete) similarity matrix (in .sim format) can be saved in the folder containing the dataset, to capture the similarity structure of the sequences. Based on this similarity matrix, the coordinates of the sequences can be computed using LargeVis and saved (in .outLargeVis format). Finally, a JSON file containing the coordinates and metadata of the sequences is placed in the DiVE/data folder as input for DiVE to visualize the data. This JSON file can also be used for visualization by external applications. The clustering and visualization results of the two datasets can be found at https://github.com/FastMLC/fMLC/tree/master/data.
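As a rough illustration of the last step, the sketch below pairs a set of 2D coordinates with per-sequence metadata and serializes the result as JSON. It is not fMLC's or DiVE's actual code. The coordinate layout (a header line "N DIM" followed by one line of DIM numbers per sequence), the field names "id", "cluster" and "coordinates", and the function names are all assumptions made for this example; check the real .outLargeVis and DiVE input files before relying on any of them.

```python
import json


def parse_coordinates(text):
    """Parse coordinate text in an ASSUMED layout:
    a header line 'N DIM', then N lines with DIM numbers each."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    n_points, dim = int(lines[0][0]), int(lines[0][1])
    rows = lines[1:]
    if len(rows) != n_points:
        raise ValueError(f"expected {n_points} rows, found {len(rows)}")
    coords = []
    for row in rows:
        if len(row) != dim:
            raise ValueError(f"expected {dim} values per row, found {len(row)}")
        coords.append([float(value) for value in row])
    return coords


def build_records(coords, metadata):
    """Pair each coordinate row with the metadata entry at the same index.
    The output field names are hypothetical, not DiVE's schema."""
    if len(coords) != len(metadata):
        raise ValueError("coordinates and metadata differ in length")
    return [
        {"id": meta["id"], "cluster": meta["cluster"], "coordinates": point}
        for point, meta in zip(coords, metadata)
    ]


# Made-up example data: three sequences embedded in 2D.
example_coords = "3 2\n0.10 0.25\n-1.50 2.00\n3.00 -0.75\n"
example_metadata = [
    {"id": "seq1", "cluster": 0},
    {"id": "seq2", "cluster": 0},
    {"id": "seq3", "cluster": 1},
]

records = build_records(parse_coordinates(example_coords), example_metadata)
json_text = json.dumps(records, indent=2)
print(json_text)
```

Keeping coordinates and metadata together in one record per sequence means an external viewer only needs to read a single file, which is the kind of reuse the paragraph above describes.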

Contact person

Duong Vu (d.vu@westerdijkinstitute.nl)

References

Bolten, E., Schliep, A., Schneckener, S., Schomburg, D. & Schrader, R. (2001). Clustering protein sequences- structure prediction by transitive homology. Bioinformatics 17, 935-941.

Edgar, R.C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460-2461.

Paccanaro, P., Casbon, J.A. & Saqi, M.A. (2006). Spectral clustering of protein sequences. Nucleic Acids Res 34, 1571.

Vu D. et al. (2014). Massive fungal biodiversity data re-annotation with multi-level clustering. Scientific Reports 4: 6837.

Library

Depositing User:

Justin Bradley

Date Deposited:

20 Mar 2019 21:55

Revision:

Last Modified:

20 Mar 2019 21:55

30:93


FastMLC_fMLC.json

Download (6kB)