English abstract of the article |
Objective and background: MicroRNAs are increasingly recognized as key players in human diseases. Given the rapid rate at which new microRNAs are identified, each with its own complex influence on specific diseases or disease groups, there is a pressing need for databases that record this information. Several manually curated databases of microRNA-disease associations exist; however, keeping them up to date with the latest literature is a demanding and time-consuming task (Wang and Cai 2017). There is therefore a strong need for more capable automated literature mining tools that capture the characteristics of microRNAs and their reported associations with diseases, giving researchers rapid access to the latest developments and findings in this area (Gupta, Ross et al. 2016). Methodology: The purpose of this work was to devise a method that automatically detects microRNAs, diseases, and the relationships between them in paper abstracts. Most existing methods rely on either hand-curated semi-automatic procedures or rule-based detection of the relationship between these entities, which is incomplete and misses a great deal of information. To overcome this issue, we first created a database of diseases and microRNAs. miRDisease, a text mining based database, aims to provide a comprehensive resource of microRNA deregulation in various human diseases. A dump of all microRNA-related papers indexed in PubMed was obtained from the PubMed website. The abstracts were split into sentences and tokenized. Next, microRNA-disease pairs were identified in each sentence; a single sentence can contain multiple pairs. For each pair, a graph was created linking the two entities through the shortest path in the sentence. In most cases, the root of this graph contains the phrase describing the relationship between the two entities. 
Next, sentiment analysis was applied to the selected graph to determine the relationship between the specific microRNA-disease pair. After all of the extracted sentences are analyzed, the consensus is determined by voting among the reports across different papers. Results: The current version of miRDisease indexes more than 150000 relationships between around 3500 human microRNAs and 1970 human diseases, drawn from more than 59000 published papers (we kept only the sentences that contained at least one microRNA from our dataset of known microRNAs). The relationships are extracted using state-of-the-art NLP techniques aimed at capturing accurate dependencies between the microRNAs and diseases reported in paper abstracts. The main advantage of our approach is that data extraction and processing are automatic and algorithmic rather than manually curated. Compared with datasets such as miR2Disease (Jiang, Wang et al. 2009), our approach detected all the relationships provided in miR2Disease as well as many more microRNA-disease pairs. miR2Disease currently lists 3273 relationships between around 349 human microRNAs and 163 human diseases, which is less than one tenth of our figures. miRCancer indexes 5562 relationships involving around 184 human cancers; our relationship count is more than 30 times that of miRCancer (Xie, Ding et al. 2013). A benefit of this approach is the ability to update the dataset at short intervals (currently every 15 days). Because it searches for all known human diseases (extracted from the PubMed MeSH dataset), it has a clear advantage over single-purpose datasets like miRCancer. In addition, our database assigns each microRNA-disease relationship to a category and lists all the papers supporting that association in the bottom section. Finally, the novel approach used in our work ensures that almost all of the microRNA-disease pairs reported in scientific papers are discovered and indexed. 
To put this in perspective, the cancer-related microRNA-disease pairs in our dataset include all of the pairs detected in miRCancer and 8 times more. Conclusions: miRDisease, a text mining based database, aims to provide a comprehensive resource of microRNA deregulation in various human diseases. The relationships are extracted using state-of-the-art NLP techniques that capture accurate dependencies between the microRNAs and diseases reported in paper abstracts, using a graph structure built on each tokenized sentence. Each entry in the miRDisease database contains detailed information on a microRNA-disease association, including the microRNA ID, the disease name, a brief description of the microRNA-disease relationship, the expression pattern of the microRNA, and comprehensive information about the papers reporting the association. In addition, a graphical network of relations is provided for each microRNA, each disease, or a set of microRNAs and diseases. miRDisease is freely available at http://www.miRDisease.org and will be updated regularly using recently published papers. |
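The extraction steps described in the abstract (pairing microRNAs with diseases per sentence, taking the shortest path between the two entities, labelling the relationship, and voting across reports) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The regular expression, the tiny disease and keyword lexicons, the hand-written edge list standing in for a parsed sentence graph, and the function names `find_pairs`, `shortest_path`, `label_path` and `vote` are all assumptions made for illustration, and a simple keyword lookup is swapped in for the sentiment analysis step.

```python
import re
from collections import Counter, deque
from itertools import product

# Hypothetical toy lexicons (assumptions); the abstract describes a list of
# known microRNAs and diseases taken from the PubMed MeSH dataset instead.
MIRNA_RE = re.compile(r"\b(?:hsa-)?miR-\d+[a-z]?\b", re.IGNORECASE)
DISEASES = {"breast cancer", "glioma"}
UP_WORDS = {"upregulated", "overexpressed", "elevated"}
DOWN_WORDS = {"downregulated", "reduced", "decreased"}


def find_pairs(sentence):
    """Return every (microRNA, disease) pair that co-occurs in one sentence."""
    mirnas = sorted({m.group(0) for m in MIRNA_RE.finditer(sentence)})
    lowered = sentence.lower()
    diseases = sorted(d for d in DISEASES if d in lowered)
    return list(product(mirnas, diseases))


def shortest_path(edges, start, goal):
    """Breadth-first search for the shortest path in an undirected token graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # the two entities are not connected


def label_path(path):
    """Keyword lookup on the path tokens (a stand-in for sentiment analysis)."""
    tokens = {t.lower() for t in path}
    if tokens & UP_WORDS:
        return "up"
    if tokens & DOWN_WORDS:
        return "down"
    return "unknown"


def vote(labels):
    """Majority label across all reports of one microRNA-disease pair."""
    return Counter(labels).most_common(1)[0][0]


# Hand-written edges for "miR-21 is upregulated in breast cancer" (assumed,
# in place of the output of a real dependency parser).
EDGES = [("upregulated", "miR-21"), ("upregulated", "is"),
         ("upregulated", "cancer"), ("cancer", "in"), ("cancer", "breast")]

pairs = find_pairs("miR-21 is upregulated in breast cancer and glioma.")
path = shortest_path(EDGES, "miR-21", "cancer")
consensus = vote([label_path(path), "down", "up"])
```

In this toy case the path between the two entities passes through the word that describes the relationship, which is the property the abstract relies on. A real pipeline would take the graph edges from a parser and would need a rule for tied votes, which this sketch does not define.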
Authors of the article |
سید حمید آقایی بختیاری | Seyed Hamid Aghaee Bakhtiari, Department of Medical Biotechnology, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran. Verified main organization: Mashhad University of Medical Sciences
صبا امیری | Saba Amiri, Department of Computer Sciences, Amirkabir University of Technology. Verified main organization: Amirkabir University of Technology
|