DRGs: Diagnosis Related Groups (住院診斷關聯群)

Most clinical disease-classification systems were developed decades ago simply to ease communication between different parts of the health-care system. As medicine became commercialized and national systems grew public-health organizations, these classification schemes gradually turned into tools for managing hospitals, and even into instruments for commercial insurance and medical reimbursement. Diagnosis Related Groups (DRGs) are the best example. Developed at Yale University by Robert B. Fetter and colleagues, the scheme was originally just a convenient way to classify patients; after many refinements, the concept was applied to insurance payment. From the 1980s, U.S. medical spending rose sharply and a new payment system was sought to deal with it, so the agency responsible for Medicare, the Centers for Medicare & Medicaid Services, built its new payment system on DRGs, the so-called CMS-DRGs. The early CMS-DRGs could not really reflect how severity differed among diseases within the same group, and in the United States they were phased in gradually across different diseases. Soon after, in 1986, France modified the CMS-DRGs into GHM1 (Groupes Homogènes de Malades 1); Australia introduced the AN-DRGs, implemented from 1991; and Hungary (1992), Italy (1995), Spain (1997), Denmark (2002), and the United Kingdom (2003) followed, which made DRGs a popular approach to medical reimbursement.

After launching National Health Insurance, Taiwan also found expenses outrunning revenue, so containing medical expenditure became critical, and judging from implementations around the world, moving from fee-for-service to DRGs is a decent way to address this; in recent years Taiwan has therefore been introducing DRGs into NHI reimbursement. The timeline is clear: expert meetings on introducing DRGs began in 1999 (ROC year 88); the first version of the TW-DRGs was promulgated in 2002 (ROC year 91) and has been revised through a third edition; implementation was originally planned for 2008 (ROC year 97) but was postponed to early 2016 (ROC year 105).
screenshot.png

The DRG framework is outlined in the figure below:
screenshot.png
Classification begins from the patient's principal diagnosis, which assigns the case to an MDC (Major Diagnostic Category); next comes whether surgery or other procedures were performed, and the fee is adjusted accordingly. The figure below lists the current MDC categories:
screenshot.png
screenshot.png

Essentially, the model draws on the patient's principal diagnosis, surgical procedures, presence of complications/comorbidities, age, sex, and discharge status. The groups are designed to be exhaustive and mutually exclusive, and under a fixed payment per group the hospital bears its own profit or loss.
The current TW-DRGs version 4.0 comprises 1,716 groups sitting under the various MDCs; the material from the National Health Insurance Administration, Ministry of Health and Welfare, is very rich and clearly written.

Once DRGs use a patient's disease, procedures, and basic data to classify each patient for payment, hospitals typically start using clinical pathways: the diagnoses and procedures under one DRG, together with the patient's in-hospital flow, are turned into a standard process, and by optimizing that standard process the hospital lowers spending per patient within the DRG while raising the quality of care.

So how is a DRG actually used to calculate the payment?
Have a look at this figure published in Bull World Health Organ 2013;91:746-756A:
screenshot.png
The figure shows that the payment for a DRG is the product of three factors: the cost weight, the base rate, and an adjustment factor. A DRG's cost weight (called the relative weight in Taiwan) reflects how difficult and costly its treatment is relative to other DRGs; the base rate is identical across all DRGs; and the adjustment factor raises the payment level for institutions, such as teaching hospitals, whose baseline costs are inherently higher.
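As a minimal sketch of that arithmetic (the numbers below are made up for illustration, not actual NHI figures):

# Hypothetical values; real cost weights and base rates come from the
# payer's published tables.
def drg_payment(cost_weight, base_rate, adjustment_factor=1.0):
    """Payment for one case = cost weight x base rate x adjustment factor."""
    return cost_weight * base_rate * adjustment_factor

base_rate = 50000                            # identical across all DRGs
print(drg_payment(1.2, base_rate))           # 60000.0, an average-weight DRG
print(drg_payment(1.2, base_rate, 1.1))      # 66000.0, same DRG at a teaching hospital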

How is each parameter calculated in Taiwan?
screenshot.png
screenshot.png

Real-World Evidence: A Shift in Medical Evidence, No Longer Treating the Randomized Double-Blind Trial as Everything

This article, 迈向精确医学时代,真实世界证据不容忽视 ("Toward the era of precision medicine, real-world evidence must not be ignored") by Jia-Rui Wu, a researcher at the Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, is quite well written. Some related links are left below!

Real-World Evidence — What Is It and What Can It Tell Us?
http://www.nejm.org/doi/full/10.1056/NEJMsb1609216

Using Design Thinking to Differentiate Useful From Misleading Evidence in Observational Research
http://jamanetwork.com/journals/jama/fullarticle/2603908

Whether to Intubate During Cardiopulmonary Resuscitation: Conventional Wisdom vs Big Data
http://jamanetwork.com/journals/jama/fullarticle/2598715

New “21st Century Cures” Legislation: Speed and Ease vs Science
http://jamanetwork.com/journals/jama/fullarticle/2597296

The Network For Excellence In Health Innovation
http://www.nehi.net/

Use of Real World Evidence to Support Regulatory Decision Making for Medical Devices
https://www.federalregister.gov/documents/2016/07/27/2016-17750/use-of-real-world-evidence-to-support-regulatory-decisionmaking-for-medical-devices-draft-guidance

PMI Working Group
https://www.nih.gov/allofus-research-program/pmi-working-group

Personalized medicine: Time for one-person trials
http://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411

Related news
U.S. Congress passes the final version of the 21st Century Cures Act

Interoperability of Medical Data: From Death Certificates to ICD-10

Interoperability of medical data is something health-care systems worldwide have long worked to improve; HL7, for instance, is one part of that ongoing effort. The information in a health-care system can be divided into three classes, each of which has given rise to its own standards:

1. Data
2. Message
3. Document

These three classes layer on top of one another. "Data" here refers to the unification of medical terminology: how people describe diseases, drugs, and so on. Only once terminology is unified does mutual understanding become possible. The figure below clearly shows why integrating medical terminology matters, and from it come two levels that must be unified, syntax and semantics: syntactic agreement means the exchanged messages share a common structure, while semantic agreement means the codes inside them carry the same meaning for sender and receiver.
螢幕快照 2017-04-19 下午2.05.23.png

Efforts of this kind actually trace back to the death certificate. In earlier eras there was usually no data at all for looking up the causes of death in a region; only around 1400 did northern Italy begin issuing death certificates.
螢幕快照 2017-04-19 下午2.12.43.png
Later, during the plague outbreak of 1629-1631 in northern Italy, these death-certificate records began to be used as analysis data, letting the government compile statistics on such major events.

By 1661, the English statistician John Graunt (the first researcher to practice demography), wanting to record the causes of infant deaths in London at the time, tried to describe death in a more fine-grained way.
220px-Graunt2.gif
Graunt_Observations.jpeg

In 1893 the French statistician Jacques Bertillon proposed the Bertillon Classification of causes of death, organized around body systems and close to the clinically observed causes.

Then in 1899, in Denmark, the first version of the ICD (International Classification of Diseases) was introduced; today work has already reached ICD-11, even though Taiwan and the United States have only recently begun rolling out ICD-10. ICD-11's coding uses a so-called Content Model, aimed at speeding up automatic classification and coding from medical-record data.

Below is a comparison of how the coding structure differs across ICD versions:
螢幕快照 2017-04-19 下午2.27.58.png

You can also go straight to the WHO's ICD-10 Browser to explore the details of ICD-10 coding:

螢幕快照 2017-04-19 下午2.30.14.png

Advanced awk Notes: String Handling

awk can solve many problems that would take far more code in a high-level language, and it is fast at processing large text data (plain txt files). Starting out, you rarely touch awk's built-in functions; only recently, after using bioawk, did I realize how many string-handling functions awk has, and how extremely convenient they are.

This note collects awk's string-handling functions.

Usage of awk's substr:

substr(string, start [, length])

# substr() returns the piece of string that starts at position start
# (1-indexed) and runs for length characters:
string = "abcdef"
substr(string, 3, 3)        # yields "cde"

# substr() cannot appear on the left of an assignment; this attempt to
# get "abCDEf" won't work (it is a syntax error in awk):
#   substr(string, 3, 3) = "CDE"

Further reading:
The GNU Awk User’s Guide String Function

Home-Brew Job Scattering Across Multi-Core Servers: ssh + screen + shell-script subprocesses

#!/bin/bash
# Run netMHC for one chosen HLA type against fasta files of every peptide length.
# Usage: netMHC_all_peptide_length.sh <index into HLA_type, 0-11>
# Note: declare -A below is a bashism; run this with bash (plain POSIX sh will reject it).

HLA_type_choose=$1
netMHC="/home/weitinglin66/tools/netMHC-4.0/netMHC"
declare -A HLA_type
HLA_type[0]="HLA-A0101"
HLA_type[1]="HLA-A0201"
HLA_type[2]="HLA-A0301"
HLA_type[3]="HLA-A2402"
HLA_type[4]="HLA-A2601"
HLA_type[5]="HLA-B0702"
HLA_type[6]="HLA-B0801"
HLA_type[7]="HLA-B2705"
HLA_type[8]="HLA-B3901"
HLA_type[9]="HLA-B4001"
HLA_type[10]="HLA-B5801"
HLA_type[11]="HLA-B1501"
#pbpaste| awk 'BEGIN{RS=",";ORS="\n";i=-1}{i ++}{print "HLA_type["i"]="$0}'

declare -A Peptides
Peptides[0]=14
Peptides[1]=13
Peptides[2]=12
Peptides[3]=11
Peptides[4]=10
Peptides[5]=9
Peptides[6]=8

declare -A PeptidesFasta
PeptidesFasta[0]="cosmic_neoantigen_27.fasta"
PeptidesFasta[1]="cosmic_neoantigen_25_length_from_27.fasta"
PeptidesFasta[2]="cosmic_neoantigen_23_length_from_27.fasta"
PeptidesFasta[3]="cosmic_neoantigen_21_length_from_27.fasta"
PeptidesFasta[4]="cosmic_neoantigen_19_length_from_27.fasta"
PeptidesFasta[5]="cosmic_neoantigen_17_length_from_27.fasta"
PeptidesFasta[6]="cosmic_neoantigen_15_length_from_27.fasta"

for i in {0..6}
do

# Background each run in a subshell so all seven peptide lengths are predicted in parallel.
("$netMHC" -a ${HLA_type[$HLA_type_choose]} -l ${Peptides[$i]} -xls -xlsfile ${HLA_type[$HLA_type_choose]}_${Peptides[$i]} ${PeptidesFasta[$i]} &)
done


#!/bin/bash
# Dispatch one HLA type to each of twelve servers over ssh. Each remote
# script backgrounds its own netMHC jobs, so this loop finishes quickly.

declare -A HLA_type
HLA_type[0]="HLA_A0101"
HLA_type[1]="HLA_A0201"
HLA_type[2]="HLA_A0301"
HLA_type[3]="HLA_A2402"
HLA_type[4]="HLA_A2601"
HLA_type[5]="HLA_B0702"
HLA_type[6]="HLA_B0801"
HLA_type[7]="HLA_B2705"
HLA_type[8]="HLA_B3901"
HLA_type[9]="HLA_B4001"
HLA_type[10]="HLA_B5801"
HLA_type[11]="HLA_B1501"

declare -A Server
Server[0]=01
Server[1]=02
Server[2]=03
Server[3]=04
Server[4]=05
Server[5]=06
Server[6]=07
Server[7]=08
Server[8]=09
Server[9]=10
Server[10]=11
Server[11]=12


for i in 0 1 2 3 4 5 6 7 8 9 10 11
do
 MESSAGE=$i
 SCRIPT="cd /home/weitinglin66/Project/MHCprediction/${HLA_type[$MESSAGE]} ; sh netMHC_all_peptide_length.sh ${MESSAGE}"
 ssh -l weitinglin66 MIBngs-${Server[$i]} "${SCRIPT}"
done
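About the screen in the title: the loop above returns quickly because each remote script backgrounds its jobs, but if a remote command had to outlive the ssh session, one variant (a sketch, untested here) is to wrap it in a detached screen session, e.g. ssh -l weitinglin66 MIBngs-01 "screen -dm bash -c 'sh netMHC_all_peptide_length.sh 0'".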



Merging Multiple sqlite3 Databases with Python 3

There must be many ways to merge several small SQLite databases; here Python 3 is used, which has the advantage of minimizing the amount of raw SQL involved.

The main SQL statements used are:

ATTACH DATABASE "Data_sub.db" AS DB;
INSERT INTO mergedintable SELECT * FROM DB.bemergedtable;
DETACH DATABASE DB;

A loop then merges all of the databases together.
The Python 3 code only needs the basics of the sqlite3 DB-API: sqlite3.connect, cursor, execute, commit, and close. (All of them the most basic Python 3 sqlite3 DB-API calls.)

import re
import sqlite3

def createdatabase(choose_HLA_type_super):
    """
    create the database and table with the HLA type
    """
    database_name = "{database}_netMHC.db".format(database=re.sub('-', '_', choose_HLA_type_super))

    # create the table in the database
    conn = sqlite3.connect(database_name)
    c = conn.cursor()

    # Create one table per peptide length (8-14); the columns mirror the
    # netMHC output fields.
    for length in range(8, 15):
        c.execute('''CREATE TABLE Neoantigen_{}
                 (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
                  D_pos integer, D_len integer, iCore text, Identity text, log50k real, Affinity real, Rank real, BindLevel integer)
                  '''.format(length))

    # Save (commit) the changes
    conn.commit()

    # We can also close the connection if we are done with it.
    conn.close()

    #print(''.join(['Database ', database_name, ' created!']))

HLA_type_super=["HLA-A0101","HLA-A0201","HLA-A0301","HLA-A2402","HLA-A2601","HLA-B0702","HLA-B0801","HLA-B2705","HLA-B3901","HLA-B4001","HLA-B5801","HLA-B1501"]

HLA_type_all_ABC=["HLA-A0101","HLA-A0201","HLA-A0202","HLA-A0203","HLA-A0205","HLA-A0206","HLA-A0207","HLA-A0211","HLA-A0212","HLA-A0216","HLA-A0217","HLA-A0219","HLA-A0250","HLA-A0301","HLA-A1101","HLA-A2301","HLA-A2402","HLA-A2403","HLA-A2501","HLA-A2601","HLA-A2602","HLA-A2603","HLA-A2902","HLA-A3001","HLA-A3002","HLA-A3101","HLA-A3201","HLA-A3207","HLA-A3215","HLA-A3301","HLA-A6601","HLA-A6801","HLA-A6802","HLA-A6823","HLA-A6901","HLA-A8001","HLA-B4403","HLA-B4501","HLA-B4601","HLA-B4801","HLA-B5101","HLA-B5301","HLA-B5401","HLA-B5701","HLA-B5801","HLA-B8301","HLA-B0702","HLA-B0801","HLA-B0802","HLA-B0803","HLA-B1402","HLA-B1501","HLA-B1502","HLA-B1503","HLA-B1509","HLA-B1517","HLA-B1801","HLA-B2705","HLA-B2720","HLA-B3501","HLA-B3503","HLA-B3801","HLA-B3901","HLA-B4001","HLA-B4002","HLA-B4013","HLA-B4201","HLA-B4402","HLA-B5802","HLA-B7301","HLA-C0303","HLA-C0401","HLA-C0501","HLA-C0602","HLA-C0701","HLA-C0702","HLA-C0802","HLA-C1203","HLA-C1402","HLA-C1502","HLA-E0101"]

result_title = ['pos', 'HLA', 'peptide', 'core', 'offset','I_pos','I_len', 'D_pos','D_len', 'iCore', 'Identity', 'log50k', 'Affinity','Rank','BindLevel']

createdatabase(''.join([HLA_type_super[0], "_merged"]))

conn = sqlite3.connect("HLA_A0101_merged_netMHC.db")
c = conn.cursor()
for i in range(50):
    string = ''.join(["attach ", "database","'HLA_A0101_",str(i),"_netMHC.db'", "AS 'toMerge'"])
    c.execute(string)
    c.execute("insert into Neoantigen_14 select * from toMerge.Neoantigen_14")
    c.execute("insert into Neoantigen_13 select * from toMerge.Neoantigen_13")
    c.execute("insert into Neoantigen_12 select * from toMerge.Neoantigen_12")
    c.execute("insert into Neoantigen_11 select * from toMerge.Neoantigen_11")
    c.execute("insert into Neoantigen_10 select * from toMerge.Neoantigen_10")
    c.execute("insert into Neoantigen_9 select * from toMerge.Neoantigen_9")
    c.execute("insert into Neoantigen_8 select * from toMerge.Neoantigen_8")
    c.execute("detach database toMerge")
    conn.commit()
conn.close()

Designing a Multi-Process Python Framework: Antigen-Presentation Prediction for Millions of Antigen Variants

The idea is to write a program that orchestrates some parallel computation (of course, if the full computation and data were known in advance, frameworks built for distributed data processing such as Hadoop or Spark might be worth considering). This is a home-brew exercise in using Python to speed up a data-crunching framework and in getting familiar with Python's underlying thread and process machinery. In the end a multi-process design was used, with sqlite3 as intermediate storage, calling netMHC to run the prediction for one batch of fasta records at a time.

Some hard-won lessons and key points from the implementation:
1. Understand the characteristics of sqlite3
SQLite's light weight makes it a great fit for storing data in processing prototypes, but its limited transaction rate has to be designed around in the code, or it becomes the program's dominant bottleneck. SQLite can actually support multi-threaded access, but that requires changing options when compiling and installing it, which is very hard for anyone unfamiliar with its internals.

2. Characteristics of the Python language
Python's interpreter has a global interpreter lock (GIL), so when all of your code is Python, even multiprocessing or multithreading inside Python only helps so much. If, however, Python serves as the controlling program and uses multiple threads or processes to drive the main computations, which are carried out by small external programs (shell scripts, Java, and the like), the results are good; see the sketch after this list.

3. Understand how Python 3 implements temp files
When calling Python 3's tempfile module, keep its underlying mechanics in mind; otherwise, once you spin up thousands of workers using temp files, name collisions can still occur, and when they do the consequences are fairly serious.

4. Understand the differences between processes and threads, and how to handle them
The usage and syntax of multiprocessing and threading in Python are very similar; the real differences lie underneath, in how resources are isolated or shared. Neither is simply better: it depends on your computational bottleneck, whether it is waiting on the network or heavy resource use. Many low-level details tie back to how well you understand your own computation, and it truly takes experience. Falling into these holes is also what made me appreciate how valuable the people who built frameworks like Hadoop and Spark are.

5. Understand the difference between multiprocessing.Queue and queue.Queue
During testing, a few processes would always end up sitting idle for no obvious reason, without moving on to the next task. It turned out to be a detail of the queue modules; consider a statement like the one below.

import queue            # queue.Empty is raised by multiprocessing.Queue.get on timeout
import multiprocessing

def doJob(*args):
    que = args[0]
    conn = args[1]
    process_name = multiprocessing.current_process().name
    while que.qsize() > 0:
        try:
            job = que.get(timeout=5)   # timeout frees a worker stuck on a drained queue
        except queue.Empty:
            print("Queue is empty")
            break
        job_num = que.qsize()
        job.do(conn, process_name, job_num)
    print("".join(["****", process_name, " did a good job, let it take a break", "***"]))

In the code above, que.qsize() serves as the signal for a thread/process to stop, so that it will not keep asking the queue for objects. It turns out, though, that with multiple processes or threads, qsize() can be greater than 0 at the while check, and an instant later the queue's objects may all have been taken by other workers; when our worker then calls queue.get(), the default is to block until an object shows up. Setting the timeout parameter (and catching queue.Empty) brings these accidentally orphaned processes or threads back home!
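To make point 2 concrete, here is a minimal, self-contained sketch of the controller pattern (not the real pipeline; the echo command is a stand-in for a netMHC invocation): Python processes pull jobs from a shared queue and hand the heavy work to external programs.

import multiprocessing
import queue          # for queue.Empty
import subprocess

def worker(que):
    # Each worker process drains the shared queue and runs every job as an
    # external command, so the real computation happens outside the GIL.
    while True:
        try:
            cmd = que.get(timeout=5)
        except queue.Empty:
            break
        subprocess.check_output(cmd)

if __name__ == '__main__':
    que = multiprocessing.Queue()
    for i in range(100):
        que.put(['echo', 'job {}'.format(i)])   # stand-in for a netMHC call
    workers = [multiprocessing.Process(target=worker, args=(que,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

The full framework below applies the same pattern at scale.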

import subprocess
import tempfile
import re
import sqlite3
import os
import time
import queue
import threading
import multiprocessing

def createdatabase(choose_HLA_type_super):
    """
    create the database and table with the HLA type
    """
    database_name = "{database}_netMHC.db".format(database=re.sub('-', '_', choose_HLA_type_super))

    # create the table in the database
    conn = sqlite3.connect(database_name)
    c = conn.cursor()

    # Create one table per peptide length (8-14); the columns mirror the
    # netMHC output fields.
    for length in range(8, 15):
        c.execute('''CREATE TABLE Neoantigen_{}
                 (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
                  D_pos integer, D_len integer, iCore text, Identity text, log50k real, Affinity real, Rank real, BindLevel integer)
                  '''.format(length))

    # Save (commit) the changes
    conn.commit()

    # We can also close the connection if we are done with it.
    conn.close()

    #print(''.join(['Database ', database_name, ' created!']))

# peptide windows covering prediction lengths 14 - 8
def peptideshorten(peptide):
    """Prepare the peptides for 14 - 8 lengths affinity predict
    Args:
        seq
    Returns:
        A Generator gave the trim peptides from 14 - 8
    Raises:
    """
    i = 0
    yield peptide
    while i < 6:
        i += 1
        yield peptide[i:-i]

def stringtrim(peptide):
    """Trim the string to remove spaces
    Args:
        seq
    Returns:
        A string without spaces
    Raises:
    """
    return re.sub('\s', '', peptide)

# Input with a single peptide per call
def netMHCcalculate(single_peptide_name, single_comment, single_peptide, predict_list, choose_HLA_type_super):
    #print("Begin to predict with netMHC!")
    while True:
        try:
            single_peptide_window = next(single_peptide)
            peptide_seq_len = len(single_peptide_window)
            peptide_len = str(int((peptide_seq_len + 1)/2))
            one_record = ''.join(['>', single_peptide_name, ' ', single_comment, '\n', single_peptide_window])

            #print(''.join(['The whole peptides length: ', str(peptide_seq_len)]))

            #print(''.join(['The Peptide Length for prediction: ', peptide_len]))

            #print(one_record)

            f = tempfile.NamedTemporaryFile(mode='w+', delete=True, suffix='.fasta')
            f.write(one_record)
            f.flush()   # flush so netMHC sees the record when it opens the file by name

            predict_list.append(subprocess.check_output(['sh', 'netMHC_one.sh', choose_HLA_type_super, peptide_len, f.name]))

            f.close()

        except StopIteration:
            break
def multplerecords_netMHCcalculate(multiple_records_iter, choose_HLA_type_super):
    i = 0
    predict_list = []
    while True:
        try:
            single_record_fasta = ''.join(next(multiple_records_iter))
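            # Note: NamedTemporaryFile already generates a unique file name by
            # itself; the time-based prefix below (rounded to whole seconds) adds
            # no extra safety and can repeat across processes started together.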
            f = tempfile.NamedTemporaryFile(mode='w+', delete=True, suffix='.fasta', prefix=''.join(['tmp', str(round(time.time()))]))
            f.write(single_record_fasta)
            f.flush()   # flush so netMHC sees the record when it opens the file by name
            predict_list.append(subprocess.check_output(['sh', 'netMHC_one.sh', choose_HLA_type_super, str(14-i), f.name]))
            f.close()
            i += 1
        except StopIteration:
            break
    return(predict_list)

def get_single_row_predict(predict_iter):
    single_row_predict = []
    while True:
        try:
            single_result_list = iter(next(predict_iter).decode().split('\n'))
            while True:
                try:
                    single_result = next(single_result_list)
                    if len(single_result) == 0:
                        continue
                    single_row_predict.append(single_result)
                except StopIteration:
                    break
        except StopIteration:
            break
    return(single_row_predict)
# Keep a single connection per worker, to reduce how often a new connection is opened
def update_insert_database(single_row_predict_iter, conn):
    print("Begin to insert result into database")
    c = conn.cursor()
    # Batch the rows per peptide length so that each table gets one large
    # executemany() transaction instead of thousands of tiny ones.
    transactions = {length: [] for length in range(8, 15)}
    while True:
        try:
            single_one_row_predict = next(single_row_predict_iter).split(',')
            #print(single_one_row_predict)
            D_POS, D_HLA, D_Peptide, D_Core, D_Offset, D_I_pos, D_I_len, D_D_pos, D_D_len, D_iCore, D_Identity, D_log50k, D_Affinity, D_Rank, D_BindLevel = single_one_row_predict
            row = (int(D_POS), stringtrim(D_HLA), stringtrim(D_Peptide), stringtrim(D_Core),
                   int(D_Offset), int(D_I_pos), int(D_I_len), int(D_D_pos), int(D_D_len),
                   stringtrim(D_iCore), stringtrim(D_Identity), float(D_log50k),
                   float(D_Affinity), float(D_Rank), float(D_BindLevel))
            length = len(stringtrim(D_Peptide))
            if length in transactions:
                transactions[length].append(row)
        except StopIteration:
            # all rows consumed
            break

    # One bulk INSERT per peptide-length table
    for length, rows in transactions.items():
        c.executemany("INSERT INTO Neoantigen_{} VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)".format(length), rows)
        conn.commit()

#conn.close()
class Job:
    def __init__(self, sub_name_list, sub_comment_list, sub_seq_list, multiple_fasta_num, HLA_type):
        self.sub_seq_list = sub_seq_list
        self.sub_name_list = sub_name_list
        self.sub_comment_list = sub_comment_list
        self.HLA_type = HLA_type
        self.multiple_fasta_num = multiple_fasta_num
    def do(self, conn, process_name, job_num):
        print(''.join([process_name,": Begin ",str(job_num),"_th Job!"]))
        multiple_records = create_multiple_records(self.sub_name_list, self.sub_comment_list, self.sub_seq_list, self.multiple_fasta_num)

        print(''.join(['_',process_name,"_Finish create multiple recors!"]))
        multiple_records_iter = iter(multiple_records)

        print(''.join(['_',process_name,"_create multiple record iterator"]))
        # run netMHC on a single peptide across peptide lengths 8 - 14 for a single HLA type

        predict_list = multplerecords_netMHCcalculate(multiple_records_iter, self.HLA_type)
        print(''.join(['_',process_name,"_Finish netMHCcalculate"]))

        predict_iter = iter(predict_list)
        print(''.join(['_',process_name,"_Create precit_list iterator"]))
        # turn calculate result into single string for per sequence affinity prediction

        single_row_predict = get_single_row_predict(predict_iter)
        print(''.join(['_',process_name,"_Create single row preidcit data"]))

        single_row_predict_iter = iter(single_row_predict)
        # insert the data into database
        print(''.join(['_',process_name,"Prepare to insert database"]))

        update_insert_database(single_row_predict_iter, conn)
        print(''.join(["_",process_name," Job", "calculate finished!"]))
def doJob(*args):
    que = args[0]
    conn = args[1]
    process_name = multiprocessing.current_process().name
    while que.qsize() > 0:
        try:
            job = que.get(timeout=5)   # timeout frees a worker stuck on a drained queue
        except queue.Empty:
            print("Queue is empty")
            break
        job_num = que.qsize()
        job.do(conn, process_name, job_num)
    print("".join(["****", process_name, " did a good job, let it take a break", "***"]))

def seperateIterate(iter_item, multiple_fasta_num):
    i = 0
    sub_item_list = []
    while i < multiple_fasta_num:
        sub_item_list.append(next(iter_item))
        i += 1
    return sub_item_list

HLA_type_super=["HLA-A0101","HLA-A0201","HLA-A0301","HLA-A2402","HLA-A2601","HLA-B0702","HLA-B0801","HLA-B2705","HLA-B3901","HLA-B4001","HLA-B5801","HLA-B1501"]

HLA_type_all_ABC=["HLA-A0101","HLA-A0201","HLA-A0202","HLA-A0203","HLA-A0205","HLA-A0206","HLA-A0207","HLA-A0211","HLA-A0212","HLA-A0216","HLA-A0217","HLA-A0219","HLA-A0250","HLA-A0301","HLA-A1101","HLA-A2301","HLA-A2402","HLA-A2403","HLA-A2501","HLA-A2601","HLA-A2602","HLA-A2603","HLA-A2902","HLA-A3001","HLA-A3002","HLA-A3101","HLA-A3201","HLA-A3207","HLA-A3215","HLA-A3301","HLA-A6601","HLA-A6801","HLA-A6802","HLA-A6823","HLA-A6901","HLA-A8001","HLA-B4403","HLA-B4501","HLA-B4601","HLA-B4801","HLA-B5101","HLA-B5301","HLA-B5401","HLA-B5701","HLA-B5801","HLA-B8301","HLA-B0702","HLA-B0801","HLA-B0802","HLA-B0803","HLA-B1402","HLA-B1501","HLA-B1502","HLA-B1503","HLA-B1509","HLA-B1517","HLA-B1801","HLA-B2705","HLA-B2720","HLA-B3501","HLA-B3503","HLA-B3801","HLA-B3901","HLA-B4001","HLA-B4002","HLA-B4013","HLA-B4201","HLA-B4402","HLA-B5802","HLA-B7301","HLA-C0303","HLA-C0401","HLA-C0501","HLA-C0602","HLA-C0701","HLA-C0702","HLA-C0802","HLA-C1203","HLA-C1402","HLA-C1502","HLA-E0101"]

result_title = ['pos', 'HLA', 'peptide', 'core', 'offset','I_pos','I_len', 'D_pos','D_len', 'iCore', 'Identity', 'log50k', 'Affinity','Rank','BindLevel']

# Parse the fasta into program
tabule_fasta = subprocess.check_output(['sh', 'TurnTabule.sh', 'cosmic_neoantigen_tmp.fasta'])
tabule_fasta_list = tabule_fasta.decode().split('\n')

# Split each single record into name, comment, seq
name_list = []
comment_list = []
seq_list = []
for fasta in tabule_fasta_list:
    try:
        name, comment, seq = fasta.split('\t')
        name_list.append(name)
        comment_list.append(comment)
        seq_list.append(seq)
    except ValueError:
        print(fasta.split('\t'))
choose_HLA_type_super = HLA_type_super[0]

# create one database for each worker process
database_create_list = []
process_num = 100

for i in range(process_num):
    database_create_list.append(''.join([choose_HLA_type_super,"_",str(i)]))
# create the database
for database_name in database_create_list:
    createdatabase(database_name)

database_name_list = []
for i in range(process_num):
    database_name_list.append(''.join([re.sub('-','_',choose_HLA_type_super),'_',str(i),"_netMHC.db"]))

database_conn_list = []
for i in range(process_num):
    conn = sqlite3.connect(database_name_list[i])
    database_conn_list.append(conn)

# use iterator to decrease memory use
iter_name = iter(name_list)
iter_comment = iter(comment_list)
iter_seq = iter(seq_list)
# Build the job queue
# (changed to a single multiprocessing.Queue shared by all processes,
#  instead of one queue per worker)
que_list = []
work_num = 3573
multiple_fasta_num = 1000
i = 0
que = multiprocessing.Queue()
while i < work_num:
    try:
        sub_name_list = seperateIterate(iter_name, multiple_fasta_num)
        sub_comment_list = seperateIterate(iter_comment, multiple_fasta_num)
        sub_seq_list = seperateIterate(iter_seq, multiple_fasta_num)
        que.put(Job(sub_name_list, sub_comment_list, sub_seq_list, multiple_fasta_num, choose_HLA_type_super))
        i += 1
    except StopIteration:
        break

start_time  = time.time()
process_list = []
for i in range(process_num):
    p = multiprocessing.Process(target=doJob, name=''.join(['Process_',str(i)]), args=(que, database_conn_list[i]))
    process_list.append(p)
    print(p)
    p.start()

for p in process_list:
    p.join()
print(''.join(["done!, take",str(time.time()-start_time), " sec"]))

A Little Helper for the Cloud Era: SQLite3

screenshot.png

SQLite is a lightweight, open-source relational database, currently in its third major version, sqlite3. Many operating systems ship it built in, with no installation needed, and it is the standard embedded database on phones and other mobile devices. It also has good support in R (DBI/dplyr) and Python (sqlite3). Compared with other open-source relational databases such as MySQL and PostgreSQL, it is far lighter.

So where exactly does SQLite's "lightness" come from, and what are its features?

screenshot.png

1. Serverless
Unlike MySQL and similar databases, it does not need a running server process to provide storage; by giving up the client/server architecture it earns its light weight.
2. Zero configuration
Since no server has to run the database, there is correspondingly little to configure.
3. Cross-platform
A database file can be moved across platforms and used as-is.
4. Self-contained / small runtime footprint
There are few external library dependencies to manage when building or tuning it; the core library is at most about 700 KB of code and can run in under 4 MB of memory.
5. Dynamic type system / fully in-memory databases
Both the memory required and the time to create a database are small (see the sketch after this list).
6. Transactional
Light and small as it is, storage operations are still transactional (which matters for data safety).
7. Supports most of the SQL92 standard
8. A dependable development team
The changelog podcast has an interview with SQLite's creator Richard Hipp, in which he says they will maintain it until at least 2050. That is a remarkable commitment, and the developers come across as grounded and focused, so they can probably deliver.
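A minimal sketch of points 5 and 6, using only Python's standard-library sqlite3 module (the table and values are made up for illustration):

import sqlite3

# ":memory:" opens a fully in-memory database: nothing ever touches disk.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE demo (name TEXT, value REAL)")

# Dynamic typing: declared column types are affinities, not strict checks,
# so SQLite stores whatever you hand it.
c.execute("INSERT INTO demo VALUES (?, ?)", ("alpha", 1.5))

# Transactional: until commit(), the insert could be rolled back as a unit.
conn.commit()
print(c.execute("SELECT * FROM demo").fetchall())
conn.close()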

Scenarios where SQLite is not a good fit:

These poor fits are understandable once you know that SQLite chose a single-file data store:

  1. High transaction rates (the database must be modified frequently)
  2. Data volumes of hundreds of GB
  3. Per-user permission management
  4. Use cases that genuinely need a client/server architecture

SQLite also ships with a convenient command line (the so-called dot commands), which newcomers should find quick to pick up:

.help        list all dot commands
.exit        leave the sqlite3 shell
.databases   show the attached database files
.tables      list the tables
.schema      print the CREATE statements
.dump        dump the whole database as SQL text

Bioawk: awk Specialized for Sequencing Formats (fasta, fastq, bed, sam, vcf, gff)

When handling large volumes of fasta/fastq, you sometimes need to find fast programs. Python and Perl offer many tools designed specifically to parse fasta/fastq, but their speed is rarely satisfying. What you want is a tool that handles strings as close to the metal as awk does, and bioawk was born for exactly that: a small tool by Heng Li (the giant who wrote samtools), roughly ten times the speed of comparable existing tools, which makes it well suited as the front line of your own pipeline.

Where to download and install it?

Download it straight from Heng Li's GitHub and compile it with make; on older Linux systems you may need to install some of the tools used for compiling first.

So what are its basic commands?
In practice it is just awk, except that the familiar built-in variables are replaced by named fields for each format (fasta/fastq/bed/...), as in the table below:

BED          SAM     VCF     GFF        FASTX (FASTA/FASTQ)
chrom        qname   chrom   seqname    name
start        flag    pos     source     seq
end          rname   id      feature    qual
name         pos     ref     start      comment
score        mapq    alt     end
strand       cigar   qual    score
thickstart   rnext   filter  filter
thickend     pnext   info    strand
rgb          tlen            group
blockcount   seq             attribute
blocksizes   qual
blockstarts

Here are some usage examples.
Fasta

# fasta
# compute each sequence's length
bioawk -c fastx '{ print $name, length($seq) }' input.fasta
# compute each sequence's GC fraction
bioawk -c fastx '{ print $name, gc($seq) }' input.fasta
# reverse-complement each sequence
bioawk -c fastx '{ print ">"$name;print revcomp($seq) }' input.fasta
# keep only sequences longer than 100
bioawk -c fastx 'length($seq) > 100{ print ">"$name; print $seq }'  input.fasta
# convert fasta to a tabular format
bioawk -t -c fastx '{ print $name, $seq }' input.fasta
# extract the sequences whose ids are listed in IDs.txt
bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' input.fasta

Fastq

# count the reads
bioawk -t -c fastx 'END {print NR}' input.fastq
# convert fastq to fasta
bioawk -c fastx '{print ">"$name; print $seq}' input.fastq
# compute each read's mean Phred quality
bioawk -c fastx '{print ">"$name; print meanqual($qual)}' input.fastq
# keep only reads longer than 10
bioawk -cfastx 'length($seq) > 10 {print "@"$name"\n"$seq"\n+\n"$qual}' input.fastq

Further reading
Bioawk basics

Paper Reading: Elements of cancer immunity and the cancer-immune set point (Highlight Slides)

This review was published in January 2017, written by Daniel S. Chen, global head of cancer immunotherapy development at Genentech, and Ira Mellman, a senior cancer biologist at Genentech. It asks how the knowledge gained so far from the many clinical trials of immunotherapy drugs can be organized into a better framework for predicting patients' response to immunotherapy. Most important is how biopsies of the tissue around a patient's tumor can, through their features, be used to sort patients into a few immune phenotypes.