SQL与Graql:生物医学数据的建模和查询


使用SQL查询关系数据库很容易。作为一种声明性语言,编写查询和构建强大的应用程序非常简单。经过几十年的发展,SQL已经发展成为一种非常健壮、可靠、快速并且对各种规模的应用程序有用的语言。

然而,在处理复杂数据时,SQL很难做到。我所说的复杂数据是指包含许多实体类型并且高度互联的数据。当在SQL中查询这样的数据时,在数据的建模和查询中会出现挑战。例如,由于大量必要的连接,它迫使我们编写冗长的查询。这样的查询不仅运行时间很长,而且很难写,容易出错。当在SQL中处理复杂数据时,我们可以列出如下挑战:

  1. 编写查询很复杂—分析具有许多关系的分层数据会导致包含许多连接的SQL语句,这些语句很容易变得难以理解。
  2. 查询速度慢—因此,具有大量连接的查询会增加计算响应的复杂性,并导致查询响应时间过长。
  3. 复杂的数据模型—具有复杂数据的域需要带有(连接)表的非直观数据模型,这会造成不必要的复杂性并降低数据质量。

当处理生物医学数据时,上述挑战尤其相关(见此处更多内容)。这正是在SQL中很难建模和查询的数据类型。它包含许多实体类型(如各种蛋白质、基因、组织)以及这些实体之间的许多联系(相互作用、关联等)。这就是为什么使用像Graql这样的查询语言可以将建模和查询的复杂性降低几个数量级的原因(下面将详细介绍)。这就是为什么在这篇文章中,我想看看在处理生物医学数据时,Graql与SQL相比如何。

在本文的最后,我将向您展示如何在Graql中只用4行代码编写一个151行的查询。

Graql简介

如果您不熟悉Graql,我想首先简单介绍一下建模构造(see here进行深入介绍)。

正如SQL是关系数据库中的标准查询语言一样,Graql也是Grakn的查询语言。它是一种声明性语言,允许我们对数据进行模型查询和推理。至关重要的是,Graql的可表达性使它成为世界上最可读的查询语言。这非常重要,因为这意味着即使是非程序员(比如生物学家而不是生物信息学家)也能够与数据库对话并理解它。

Grakn的数据模型由三种核心类型组成:

  • 实体是你领域中的主要参与者。这些通常是你想知道的事情类型(例如蛋白质、基因、组织、疾病)
  • 关系描述了两个或更多的事物是如何以某种方式相互联系的。每一种关系都必须与域中其他事物扮演的角色相关联。换句话说,关系依赖于至少两种其他事物的存在(例如,蛋白质相互作用、过程定位、转录)。
  • 属性是附加到其他概念(例如uniprot-id、entrez-id、疾病名称)的小数据片段。
  • 下面是如何使用这些概念来创建一个简单的模型(或模式)。下面的例子具体向我们展示了基因是如何编码蛋白质并与疾病相关联的。

最后,规则是我们在模型中编码的逻辑模式,允许我们对现有数据进行推理,以创建实体、关系和属性的新实例。实际上,这意味着通过在现有数据上强加某种逻辑,我们可以从相同的数据中以指数形式回答更多的问题。

例如,如果我们的原始数据不包括蛋白质与疾病的关系,我们可以通过观察蛋白质的编码基因并将其与相关疾病联系起来来推断这些蛋白质与疾病的关系。换句话说:

如果:

  • 基因编码蛋白质;
  • 基因与疾病有关。

然后:

  • 创建蛋白质疾病关联。

这条规则将是我们的模式的一部分,然后我们可以将其可视化如下,其中虚线表示推断的关系:

附在菱形上的虚线表示推断的关系。

SQL和Graql:数据建模

有了对Graql的基本理解(假设您理解了SQL),让我们比较一下如何在Graql和SQL中表示一个更复杂的模型,让我们找出哪一个更直观。

为了进行比较,我想特别关注疾病网络。我要建模的实体包括:

以及相应的关系:

  • 基因编码蛋白质
  • 基因与药物有关
  • 基因与疾病有关

最后,我希望包含来自BLAST搜索(更多信息here)这是我们之前做过的。对于我的模型,我想整合BLAST在匹配和目标序列之间返回的积极性和同一性分数。我们将其建模为序列之间的关系:

  • 目标蛋白质序列与匹配的蛋白质序列具有阳性和同一性分数

下图是我们如何在关系模型中表示这个域。

我们疾病网络的关系模型。黄色表是连接表。

有几件事需要注意:

  • 有三个连接表将基因与疾病、药物和蛋白质联系起来。
  • 序列比对也被模拟成一个连接表,我用蛋白质序列作为外源键将它们与蛋白质连接起来。在这个表格中,我还添加了积极和同一性分数。

现在我们已经在一个关系模型中对这个疾病网络进行了建模,我们将如何在Grakn中进行这项工作?见下文。

重要的是要注意到,当在Graql中建模任何领域时,Graql比SQL更具表现力,如果我们习惯于在关系世界中建模,我们需要有不同的想法。例如,我们应该考虑到Graql's type hierarchy,并思考我们以后想通过使用规则来推断的概念。记住这一点,对于我的模型:

  • 我对四种蛋白质类型(激酶、离子通道、核受体和GPCR)进行了分型,并创造了一种亲本型蛋白质
  • 我在两个序列属性之间建立了一种叫做序列比对的关系,在这种关系上,我附加了同一性和积极性属性
  • 我还创建了蛋白质-蛋白质-排列关系,这样我就可以创建推断这种关系的规则
  • 最后,我创建了一个药物-疾病-关联关系,所以我也可以推断这种关系

SQL和Graql:查询示例

既然我们已经创建了这些模型,我们将如何在SQL中查询它们呢?在Graql中,这种情况会如何比较?

首先,让我们从一个简单的问题开始。

  1. 哪些药物与基因Ensembl-id ' ensg 000010404 '相互作用?

SQL查询:

SELECT Drug.ChemblId, Drug.DrugName 
FROM Gene 
   INNER JOIN Drug_Gene 
      ON Gene.EntrezId = Drug_Gene.EntrezId 
   LEFT JOIN Drug 
      ON Drug.ChemblId = Drug_Gene.ChemblId 
WHERE Gene.EnsemblId = 'ENSG00000010404'

Graql查询:

match 
$ge isa gene, has ensembl-id "ENSG00000010404"; $dr isa drug; 
(inhibitor: $dr, target-gene: $ge) isa drug-gene-interaction; get; 

好吧,看起来Graql读起来更直观。让我们看一个稍微复杂一点的查询。

2.哪些激酶与糖尿病相关?

SQL查询:

SELECT Kinase.UniprotId, Kinase.UniprotName 
FROM Protein_Gene 
   INNER JOIN Disease_Gene 
      ON Disease_Gene.EntrezId = Protein_Gene.EntrezId 
   RIGHT JOIN Disease 
      ON Disease_Gene.DiseaseName = Disease.DiseaseName 
   LEFT JOIN Kinase 
      ON Protein_Gene.UniprotId = Kinase.UniprotId 
WHERE Disease.DiseaseName = 'Asthma'

Graql查询:

match 
$k isa kinase; 
$di isa disease, has disease-name "Diabetes"; 
(associated-protein: $k, associated-disease: $di) isa protein-disease-association; get;

Graql查询比它的等效SQL短两倍!

现在,请记住,对于这个Graql查询,我们利用我之前创建的规则进行推断蛋白质-疾病-关联如果两个实体都连接到同一个基因。这允许我们问一个更高层次的问题,并让系统知道如何进行导航。在SQL中,由于它没有推理引擎,我们需要通过执行三个连接来编写显式查询。

最后,让我们看一个极端的例子。

3.哪些药物与哮喘有关?

SQL查询:

select Drug_Gene.ChemblId, DrugName from Drug
    inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
    inner join Gene_Protein on  Drug_Gene.EntrezId=Gene_Protein.EntrezId
    inner join Kinase on Kinase.UniProtId=Gene_Protein.UniProtId
    where
        Kinase.sequence in
            (select Sequence_Alignment.TargetSequence from Kinase
                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
                inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                where
                    Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                    UNION ALL
                        select Sequence_Alignment.TargetSequence from Ion_channel
                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
                            inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                            where Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                    UNION ALL
                        select Sequence_Alignment.TargetSequence from Nuc_receptor
                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
                            inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                            where
                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                    UNION ALL
                        select Sequence_Alignment.TargetSequence from GPCR
                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
                            inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                            where
                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)
                    UNION ALL
                        select Drug_Gene.ChemblId, DrugName from Drug
                            inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
                            inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Ion_channel on Ion_channel.UniProtId=Gene_Protein.UniProtId
                            where
                                Ion_channel.sequence in (
                                    select Sequence_Alignment.TargetSequence from Kinase
                                        inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
                                        inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
                                        inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                        inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                        where
                                            Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                    UNION ALL
                        select Sequence_Alignment.TargetSequence from Ion_channel
                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
                            inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                            where
                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                    UNION ALL
                        select Sequence_Alignment.TargetSequence from Nuc_receptor
                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
                            inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                            where
                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                    UNION ALL
                        select Sequence_Alignment.TargetSequence from GPCR
                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
                            inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                            where
                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)
                    UNION ALL
                        select Drug_Gene.ChemblId, DrugName from Drug
                            inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
                            inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
                            inner join Nuc_receptor on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
                            where
                                Nuc_receptor.sequence in
                                    (select Sequence_Alignment.TargetSequence from Kinase
                                        inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
                                        inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
                                        inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                        inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                        where
                                            Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                                        UNION ALL
                                            select Sequence_Alignment.TargetSequence from Ion_channel
                                                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
                                                inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
                                                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                where
                                                    Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                                        UNION ALL
                                            select Sequence_Alignment.TargetSequence from Nuc_receptor
                                                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
                                                inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
                                                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                where
                                                    Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                                        UNION ALL
                                            select Sequence_Alignment.TargetSequence from GPCR
                                                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
                                                inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
                                                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                where
                                                    Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)

                                        UNION ALL
                                            select Drug_Gene.ChemblId, DrugName from Drug
                                                inner join Drug_Gene on Drug.ChemblId=Drug_Gene.ChemblId
                                                inner join Gene_Protein on Drug_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join GPCR on GPCR.UniProtId=Gene_Protein.UniProtId
                                                where
                                                    GPCR.sequence in
                                                        (select Sequence_Alignment.TargetSequence from Kinase
                                                            inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Kinase.sequence or Sequence_Alignment.MatchedSequence=Kinase.sequence
                                                            inner join Gene_Protein on Kinase.UniProtId=Gene_Protein.UniProtId
                                                            inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                            inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                            where
                                                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                                        UNION ALL
                                            select Sequence_Alignment.TargetSequence from Ion_channel
                                                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Ion_channel.sequence or Sequence_Alignment.MatchedSequence=Ion_channel.sequence
                                                inner join Gene_Protein on Ion_channel.UniProtId=Gene_Protein.UniProtId
                                                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                where
                                                    Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                                        UNION ALL
                                            select Sequence_Alignment.TargetSequence from Nuc_receptor
                                                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=Nuc_receptor.sequence or Sequence_Alignment.MatchedSequence=Nuc_receptor.sequence
                                                inner join Gene_Protein on Nuc_receptor.UniProtId=Gene_Protein.UniProtId
                                                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                where
                                                Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98
                                        UNION ALL
                                            select Sequence_Alignment.TargetSequence from GPCR
                                                inner join Sequence_Alignment on Sequence_Alignment.TargetSequence=GPCR.sequence or Sequence_Alignment.MatchedSequence=GPCR.sequence
                                                inner join Gene_Protein on GPCR.UniProtId=Gene_Protein.UniProtId
                                                inner join Disease_Gene on Disease_Gene.EntrezId=Gene_Protein.EntrezId
                                                inner join Disease on Disease.DiseaseName=Disease_Gene.DiseaseName
                                                where
                                                    Disease.DiseaseName='Asthma' and Sequence_Alignment.Positivity>=0.98 and Sequence_Alignment.Identicality>=0.98)

Graql查询:

match 
$di isa disease, has disease-name 'Asthma'; 
$dr isa drug; 
$r (therapeutic: $dr, affected-disease: $di) isa drug-disease-association; get;

哇!这是一个很大的区别Graql查询比它的SQL等价查询短37倍。现在,请注意,我并不是真的想拿SQL撒尿!我确信它仍然可以由在这方面经验丰富的人来优化。也就是说,我希望在这里展示一个普通人(比如我)是如何写这篇文章的。

让我来扩展一下我们是如何将一个151行的SQL查询简化为一个4行的Graql查询的。由于底层数据结构和我们想要应用的逻辑,查询本身由于查询排列的数量而导致组合爆炸。Graql基于类型和基于规则的推理将这些抽象出来,允许我们询问更高级别的查询。

在摄取的原始数据中,不存在药物-疾病关系。因此,为了回答这个问题,我们想应用以下逻辑:

  • 如果疾病与基因有关(在我们上面的例子中,我们特别要求疾病哮喘);
  • 基因编码蛋白质(激酶、离子通道、核受体、GCPRs);
  • 如果这些蛋白质中的任何一种与另一种含有高于0.98的阳性率和高于0.98的同一性的蛋白质具有序列比对,
  • 这种蛋白质是由基因编码的,
  • 这与药物有关

我将这个逻辑以规则的形式(见下文)存储在我的Grakn模式中。这意味着随后我可以只查询药物-疾病-关系Grakn会为我找到导航的方法。

when {
    $di isa disease; 
    $g isa gene; $g2 isa gene; $g != $g2; 
    $pr isa protein, has sequence $sequence-1; 
    $pr2 isa protein, has sequence $sequence-2; 
    $pr != $pr2;
    $dr isa drug;
    (associated-disease: $di, associated-gene: $g) isa gene-disease-association;
    (encoding-gene: $g, encoded-protein: $pr) isa gene-protein-encoding; 
    $alignment ($sequence-1, $sequence-2) isa sequence-sequence-alignment, has sequence-identicality $ident, has sequence-positivity $pos;
    $ident >= 0.98; $pos >= 0.98;
    (encoding-gene: $g2, encoded-protein: $pr2) isa gene-protein-encoding;
    ($dr, target-gene: $g2) isa drug-gene-interaction;
} then {
    (affected-disease: $di, therapeutic: $dr) isa drug-disease-association;
};

但是,由于SQL不支持推理,编写SQL查询会导致上面看到的组合爆炸。具体来说,这是因为有四种类型指蛋白质中包含的序列可以是目标序列,也可以是与其他四个序列中的任何一个相连的序列比对关系中的匹配序列蛋白质类型。在SQL中,我们必须显式地声明查询可以采用的所有路径,而在Grakn中,我们只是使用基于规则和基于类型的推理来推理我们的数据。

结束语

毫无疑问,SQL是世界上最久经考验的查询语言。迄今为止,它为世界上大多数信息技术基础设施提供动力。然而,在处理复杂数据时,SQL很难做到:

  1. 编写查询很复杂;
  2. 这会导致查询速度变慢;
  3. 迫使我们创建复杂的数据模型。

从根本上说,这些挑战是不可避免的,因为关系数据模型不是为处理这类数据而创建的。使用一种更具表现力的查询语言来推理我们的数据,将这种复杂性抽象化,这样我们就可以专注于更高层次的问题。

由于Soroush Saffari感谢您对本文的帮助,尤其是创建SQL查询!