awk 進階筆記：字串處理

2017 年 04 月 20 日2017 年 06 月 08 日1 則迴響

用awk可以解決很多用高階語言需要很多代碼量才能解決的問題，並且在大型的文字資料的處理很快速（純txt檔），一開始使用awk並不會使用到裡面很多的函數，直到最近在使用bioawk後，才發現其實awk很多可以用來處理字串的函數，且極度方便。

會在這邊用來負責統整跟awk字串處理的函數

awk substr用法：

substr(string, start [, length ])

string = "abcdef"
# try to get "abCDEf", won't work
substr(string, 3, 3) = "CDE"

參考閱讀：
The GNU Awk User’s Guide String Function

土炮多核心服務器撒任務：ssh + screen + shell script sobprocess

2017 年 04 月 19 日2017 年 04 月 16 日發表留言

#!/bin/sh

HLA_type_choose=$1
netMHC="/home/weitinglin66/tools/netMHC-4.0/netMHC"
declare -A HLA_type
HLA_type[0]="HLA-A0101"
HLA_type[1]="HLA-A0201"
HLA_type[2]="HLA-A0301"
HLA_type[3]="HLA-A2402"
HLA_type[4]="HLA-A2601"
HLA_type[5]="HLA-B0702"
HLA_type[6]="HLA-B0801"
HLA_type[7]="HLA-B2705"
HLA_type[8]="HLA-B3901"
HLA_type[9]="HLA-B4001"
HLA_type[10]="HLA-B5801"
HLA_type[11]="HLA-B1501"
#pbpaste| awk 'BEGIN{RS=",";ORS="\n";i=-1}{i ++}{print "HLA_type["i"]="$0}'

declare -A Peptides
Peptides[0]=14
Peptides[1]=13
Peptides[2]=12
Peptides[3]=11
Peptides[4]=10
Peptides[5]=9
Peptides[6]=8

declare -A PeptidesFasta
PeptidesFasta[0]="cosmic_neoantigen_27.fasta"
PeptidesFasta[1]="cosmic_neoantigen_25_length_from_27.fasta"
PeptidesFasta[2]="cosmic_neoantigen_23_length_from_27.fasta"
PeptidesFasta[3]="cosmic_neoantigen_21_length_from_27.fasta"
PeptidesFasta[4]="cosmic_neoantigen_19_length_from_27.fasta"
PeptidesFasta[5]="cosmic_neoantigen_17_length_from_27.fasta"
PeptidesFasta[6]="cosmic_neoantigen_15_length_from_27.fasta"

for i in {0,1,2,3,4,5,6}
do

(/home/weitinglin66/tools/netMHC-4.0/netMHC -a ${HLA_type[$HLA_type_choose]} -l ${Peptides[$i]}  -xls -xlsfile ${HLA_type[$HLA_type_choose]}_${Peptides[$i]}  ${PeptidesFasta[$i]}  & )
done


#!/bin/sh

declare -A HLA_type
HLA_type[0]="HLA_A0101"
HLA_type[1]="HLA_A0201"
HLA_type[2]="HLA_A0301"
HLA_type[3]="HLA_A2402"
HLA_type[4]="HLA_A2601"
HLA_type[5]="HLA_B0702"
HLA_type[6]="HLA_B0801"
HLA_type[7]="HLA_B2705"
HLA_type[8]="HLA_B3901"
HLA_type[9]="HLA_B4001"
HLA_type[10]="HLA_B5801"
HLA_type[11]="HLA_B1501"

declare -A Server
Server[0]=01
Server[1]=02
Server[2]=03
Server[3]=04
Server[4]=05
Server[5]=06
Server[6]=07
Server[7]=08
Server[8]=09
Server[9]=10
Server[10]=11
Server[11]=12


echo $SCRIPT
for i in 0 1 2 3 4 5 6 7 8 9 10 11
do
 MESSAGE=$i
 SCRIPT="cd /home/weitinglin66/Project/MHCprediction/${HLA_type[$MESSAGE]} ; sh netMHC_all_peptide_length.sh ${MESSAGE}"
 ssh -l weitinglin66 MIBngs-${Server[$i]} "${SCRIPT}"
done

合併多個sqlite3資料庫使用python3

2017 年 04 月 18 日2017 年 04 月 19 日發表留言

合併多個小sqlite資料庫方法應該有很多種，這邊使用python3來處理，好處是可以減少所使用的sql語法。

這邊主要使用的SQL語法為：

ATTACH DATABASE "Data_sub.db" AS DB;
INSERT INTO mergedintable select * from db.bemergedtable
DETACH DATABASE DB;

藉由迴圈來把所有資料庫合併在一起。
在python3中的代碼則使用到基本的sqlite3.connect, cursor, execute, commit, close來完成。（都是..最基本的pyrhon3 sqlite3 db api）

def createdatabase(choose_HLA_type_super):
    """
    create the database and table with the HLA type
    """
    database_name = "{database}_netMHC.db".format(database=re.sub('-','_',choose_HLA_type_super))

# create the table in the database
    conn = sqlite3.connect(database_name)
    c = conn.cursor()

# Create different table according to peptides length

# Create table
    c.execute('''CREATE TABLE Neoantigen_14
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_13
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_12
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_11
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_10
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_9
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_8
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    # Save (commit) the changes
    conn.commit()

    # We can also close the connection if we are done with it.
    conn.close()

    #print(''.join(['Database ', database_name,' created!']))

HLA_type_super=["HLA-A0101","HLA-A0201","HLA-A0301","HLA-A2402","HLA-A2601","HLA-B0702","HLA-B0801","HLA-B2705","HLA-B3901","HLA-B4001","HLA-B5801","HLA-B1501"]

HLA_type_all_ABC=["HLA-A0101","HLA-A0201","HLA-A0202","HLA-A0203","HLA-A0205","HLA-A0206","HLA-A0207","HLA-A0211","HLA-A0212","HLA-A0216","HLA-A0217","HLA-A0219","HLA-A0250","HLA-A0301","HLA-A1101","HLA-A2301","HLA-A2402","HLA-A2403","HLA-A2501","HLA-A2601","HLA-A2602","HLA-A2603","HLA-A2902","HLA-A3001","HLA-A3002","HLA-A3101","HLA-A3201","HLA-A3207","HLA-A3215","HLA-A3301","HLA-A6601","HLA-A6801","HLA-A6802","HLA-A6823","HLA-A6901","HLA-A8001","HLA-B4403","HLA-B4501","HLA-B4601","HLA-B4801","HLA-B5101","HLA-B5301","HLA-B5401","HLA-B5701","HLA-B5801","HLA-B8301","HLA-B0702","HLA-B0801","HLA-B0802","HLA-B0803","HLA-B1402","HLA-B1501","HLA-B1502","HLA-B1503","HLA-B1509","HLA-B1517","HLA-B1801","HLA-B2705","HLA-B2720","HLA-B3501","HLA-B3503","HLA-B3801","HLA-B3901","HLA-B4001","HLA-B4002","HLA-B4013","HLA-B4201","HLA-B4402","HLA-B5802","HLA-B7301","HLA-C0303","HLA-C0401","HLA-C0501","HLA-C0602","HLA-C0701","HLA-C0702","HLA-C0802","HLA-C1203","HLA-C1402","HLA-C1502","HLA-E0101"]

result_title = ['pos', 'HLA', 'peptide', 'core', 'offset','I_pos','I_len', 'D_pos','D_len', 'iCore', 'Identity', 'log50k', 'Affinity','Rank','BindLevel']

createdatabase(''.join([HLA_type_super[0], "_merged"]))

conn = sqlite3.connect("HLA_A0101_merged_netMHC.db")
c = conn.cursor()
for i in range(50):
    string = ''.join(["attach ", "database","'HLA_A0101_",str(i),"_netMHC.db'", "AS 'toMerge'"])
    c.execute(string)
    c.execute("insert into Neoantigen_14 select * from toMerge.Neoantigen_14")
    c.execute("insert into Neoantigen_13 select * from toMerge.Neoantigen_13")
    c.execute("insert into Neoantigen_12 select * from toMerge.Neoantigen_12")
    c.execute("insert into Neoantigen_11 select * from toMerge.Neoantigen_11")
    c.execute("insert into Neoantigen_10 select * from toMerge.Neoantigen_10")
    c.execute("insert into Neoantigen_9 select * from toMerge.Neoantigen_9")
    c.execute("insert into Neoantigen_8 select * from toMerge.Neoantigen_8")
    c.execute("detach database toMerge")
    conn.commit()
conn.close()

多進程Python框架設計:針對百萬級抗原變異做抗原呈現預測

2017 年 04 月 17 日2017 年 04 月 17 日發表留言

主要是想寫一個程式來當作媒介，來做一些並行運算的設計（當然，假如已經知道整體計算細節和運算的資料，也許可以考慮使用hadoop或是spark這些專門針對分散式資料的數據運算設計的框架），這邊算是土炮來練習如何用python來加速數據運算的框架，藉此熟悉python底層在thread和process的設計，最後這邊使用多進程(process)設計，搭配sqlite3在中間儲存資料，過程呼叫netMHC來進行單組fasta資料的運算。

實作過程發現的一些小心得和重點，把掉坑心得分享一下：
1. 理解sqlite3資料庫的特性
sqlite資料庫的輕特性很適合用來做數據處理原型的資料儲存用，但他在transaction rate限制下要在代碼中考慮進去，否則會成為程式速度很大的瓶頸，雖然sqlite資料庫其實是能支援多thread來讀取的，可是要在compile sqlite安裝時去修改原始代碼，對於不太熟悉其底層的人來說會很困難。

2. python語言的特性
python底層設計上有一個global interpreter lock的設計，所以當你所有程式都是python代碼的時候，即使使用python內的multiprocess或是multithreading，效果依舊有限，可是把python當作是控制主程式，用多線程或是進程來控制一些主要運算，這些運算是用其他語言小程式執行(shellscript, java 等外部程式)，效果就會不錯。

3. 理解python3在tmp檔上之實作細節
在調用python3的tempfile模組時，要記得背後的原理，否則會發現當你開啟幾千個線程去使用temp黨時，名字強蹦的機會還是有的，這時候會頗嚴重的。

4. 理解processing和threading的差異及處理
processing和threading在python中的使用方式和語法其實非常相似，主要是底層資源的隔離或是溝通的差異，這兩種不一定是哪一種就比較好，要看自己計算上的瓶頸是什麼，是網路等待貧頸，或是資源調用大量等，頗多底層細節和對自己計算過程的掌握相關，其實真的需要經驗，也是這次採坑才知道，也才理解搭建出hadoop/spark框架的人真的頗有價值。

5. 理解在使用Multiprocessing.queue和Queue.queue的差異
在測試過程會發現總會有幾個進程會無緣無故地停著不動，但也沒有進入下一個任務，後來才發現queue模組中的細節，比如像下面這個語句

def doJob(*args):
    que = args[0]
    conn = args[1]
    process_name = multiprocessing.current_process().name
    while que.qsize() > 0:
        try:
            job = que.get(timeout=5)
            job_num = que.qsize()
            job.do(conn, process_name, job_num)
        except:
            print("".join(["Queue is empty"]))
    print("".join(["****",process_name,"do a good job, let it take a break","***"]))

在上面的語法中，用que.qsize來作為thread/process結束時候的信號，讓他不會進行去跟queue要物件來執行，但後來發現在多進程或多線程下，有可能在做while中判斷時qsize是大於0，下一剎那當queue內物件已經被其他線程或是進程取完後，當他進入到queue.get時，預設是會一直等到有物件，除非設置timeout的參數，設置之後就可以讓這些不小心掉如無主狀態的進程或是線程歸位了！

import subprocess
import tempfile
import re
import sqlite3
import os
import time
import queue
import threading
import multiprocessing

def createdatabase(choose_HLA_type_super):
    """
    create the database and table with the HLA type
    """
    database_name = "{database}_netMHC.db".format(database=re.sub('-','_',choose_HLA_type_super))

# create the table in the database
    conn = sqlite3.connect(database_name)
    c = conn.cursor()

# Create different table according to peptides length

# Create table
    c.execute('''CREATE TABLE Neoantigen_14
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_13
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_12
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_11
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_10
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_9
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    c.execute('''CREATE TABLE Neoantigen_8
             (pos integer, HLA text, peptide text, Core text, Offset integer, I_pos integer, I_len integer,
              D_pose integer, D_len integer, iCore text, Identity text, llog50k real, Affinity real, Rank REAL, BindLevel integer)
              ''')
    # Save (commit) the changes
    conn.commit()

    # We can also close the connection if we are done with it.
    conn.close()

    #print(''.join(['Database ', database_name,' created!']))

# string 14 - 9
def peptideshorten(peptide):
    """Prepare the peptides for 14 - 8 lengths affinity predict
    Args:
        seq
    Returns:
        A Generator gave the trim peptides from 14 - 8
    Raises:
    """
    i = 0
    yield peptide
    while i < 6:         i += 1         yield peptide[i:-i] def stringtrim(peptide):     """Trim the string to remove spaces     Args:         seq     Returns:         A string without spaces     Raises:        """     return re.sub('\s','',peptide) # Input with single peptide per time def netMHCcalculate(single_peptide_name, single_comment, single_peptide, predict_list, choose_HLA_type_super):     #print("Begin to predict with netMHC!")     while True:         try:             single_peptide_window = next(single_peptide)             peptide_seq_len = len(single_peptide_window)             peptide_len = str(int((peptide_seq_len + 1)/2))             one_record = ''.join(['>',single_peptide_name,' ',single_comment,'\n',single_peptide_window])

            #print(''.join(['The whole peptides length: ', str(peptide_seq_len)]))

            #print(''.join(['The Peptide Length for prediction: ', peptide_len]))

            #print(one_record)

            f = tempfile.NamedTemporaryFile(mode='w+',delete=True, suffix='.fasta')
            f.write(one_record)
            f.seek(1)

            predict_list.append(subprocess.check_output(['sh', 'netMHC_one.sh', choose_HLA_type_super, peptide_len, f.name]))

            f.close()

        except StopIteration:
            break
def multplerecords_netMHCcalculate(multiple_records_iter, choose_HLA_type_super):
    i = 0
    predict_list = []
    while True:
        try:
            single_record_fasta = ''.join(next(multiple_records_iter))
            f = tempfile.NamedTemporaryFile(mode='w+',delete=True, suffix='.fasta', prefix=''.join(['tmp',str(round(time.time()))]))
            f.write(single_record_fasta)
            f.seek(1)
            predict_list.append(subprocess.check_output(['sh', 'netMHC_one.sh', choose_HLA_type_super, str(14-i), f.name]))
            f.close()
            i += 1
        except StopIteration:
            break
    return(predict_list)

def get_single_row_predict(predict_iter):
    single_row_predict = []
    while True:
        try:
            single_result_list = iter(next(predict_iter).decode().split('\n'))
            while True:
                try:
                    single_result = next(single_result_list)
                    if len(single_result) == 0:
                        continue
                    single_row_predict.append(single_result)
                except StopIteration:
                    break
        except StopIteration:
            break
    return(single_row_predict)
# Seperate the conn, reduce the frequence to open a new connection
def update_insert_database(single_row_predict_iter, conn):
    print("Begin to insert result into database")
    c = conn.cursor()
    one_larger_trasaction_14 = []
    one_larger_trasaction_13 = []
    one_larger_trasaction_12 = []
    one_larger_trasaction_11 = []
    one_larger_trasaction_10 = []
    one_larger_trasaction_9 = []
    one_larger_trasaction_8 = []

    while True:
        try:
            single_one_row_predict = next(single_row_predict_iter).split(',')
            #print(single_one_row_predict)
            D_POS, D_HLA, D_Peptide, D_Core, D_Offset, D_I_pos, D_I_len, D_D_pos, D_D_len, D_iCore, D_Identity, D_log50k, D_Affinity, D_Rank, D_BindLevel = single_one_row_predict
            if len(stringtrim(D_Peptide)) == 14:
                    one_larger_trasaction_14.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
            if len(stringtrim(D_Peptide)) == 13:
                    one_larger_trasaction_13.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
            if len(stringtrim(D_Peptide)) == 12:
                    one_larger_trasaction_12.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
            if len(stringtrim(D_Peptide)) == 11:
                    one_larger_trasaction_11.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
            if len(stringtrim(D_Peptide)) == 10:
                    one_larger_trasaction_10.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
            if len(stringtrim(D_Peptide)) == 9:
                    one_larger_trasaction_9.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
            if len(stringtrim(D_Peptide)) == 8:
                    one_larger_trasaction_8.append((int(D_POS), re.sub('\s','',D_HLA), re.sub('\s','',D_Peptide), re.sub('\s','',D_Core), int(D_Offset), int(D_I_pos),int(D_I_len), int(D_D_pos), int(D_D_len), re.sub('\s','',D_iCore), re.sub('\s','',D_Identity), float(D_log50k), float(D_Affinity), float(D_Rank), float(D_BindLevel)))
        except StopIteration:
            # close the database
            #print(''.join(['Insert data into database, finished!']))
            break

    c.executemany("INSERT INTO Neoantigen_14 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_14)
    conn.commit()

    c.executemany("INSERT INTO Neoantigen_13 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_13)
    conn.commit()

    c.executemany("INSERT INTO Neoantigen_12 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_12)
    conn.commit()

    c.executemany("INSERT INTO Neoantigen_11 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_11)
    conn.commit()

    c.executemany("INSERT INTO Neoantigen_10 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_10)
    conn.commit()

    c.executemany("INSERT INTO Neoantigen_9 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_9)
    conn.commit()

    c.executemany("INSERT INTO Neoantigen_8 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", one_larger_trasaction_8)
    conn.commit()

#conn.close()
class Job:
    def __init__(self, sub_name_list, sub_comment_list, sub_seq_list, multiple_fasta_num, HLA_type):
        self.sub_seq_list = sub_seq_list
        self.sub_name_list = sub_name_list
        self.sub_comment_list = sub_comment_list
        self.HLA_type = HLA_type
        self.multiple_fasta_num = multiple_fasta_num
    def do(self, conn, process_name, job_num):
        print(''.join([process_name,": Begin ",str(job_num),"_th Job!"]))
        multiple_records = create_multiple_records(self.sub_name_list, self.sub_comment_list, self.sub_seq_list, self.multiple_fasta_num)

        print(''.join(['_',process_name,"_Finish create multiple recors!"]))
        multiple_records_iter = iter(multiple_records)

        print(''.join(['_',process_name,"_create multiple record iterator"]))
        # calculate the netMHC on single peptite with 8 - 14 peptide and single HLA type

        predict_list = multplerecords_netMHCcalculate(multiple_records_iter, self.HLA_type)
        print(''.join(['_',process_name,"_Finish netMHCcalculate"]))

        predict_iter = iter(predict_list)
        print(''.join(['_',process_name,"_Create precit_list iterator"]))
        # turn calculate result into single string for per sequence affinity prediction

        single_row_predict = get_single_row_predict(predict_iter)
        print(''.join(['_',process_name,"_Create single row preidcit data"]))

        single_row_predict_iter = iter(single_row_predict)
        # insert the data into database
        print(''.join(['_',process_name,"Prepare to insert database"]))

        update_insert_database(single_row_predict_iter, conn)
        print(''.join(["_",process_name," Job", "calculate finished!"]))
def doJob(*args):
    que = args[0]
    conn = args[1]
    process_name = multiprocessing.current_process().name
    while que.qsize() > 0:
        try:
            job = que.get(timeout=5)
            job_num = que.qsize()
            job.do(conn, process_name, job_num)
        except:
            print("".join(["Queue is empty"]))
    print("".join(["****",process_name,"do a good job, let it take a break","***"]))

def seperateIterate(iter_item, multiple_fasta_num):
    i = 0
    sub_item_list = []
    while i < multiple_fasta_num:
        sub_item_list.append(next(iter_item))
        i += 1
    return sub_item_list

HLA_type_super=["HLA-A0101","HLA-A0201","HLA-A0301","HLA-A2402","HLA-A2601","HLA-B0702","HLA-B0801","HLA-B2705","HLA-B3901","HLA-B4001","HLA-B5801","HLA-B1501"]

HLA_type_all_ABC=["HLA-A0101","HLA-A0201","HLA-A0202","HLA-A0203","HLA-A0205","HLA-A0206","HLA-A0207","HLA-A0211","HLA-A0212","HLA-A0216","HLA-A0217","HLA-A0219","HLA-A0250","HLA-A0301","HLA-A1101","HLA-A2301","HLA-A2402","HLA-A2403","HLA-A2501","HLA-A2601","HLA-A2602","HLA-A2603","HLA-A2902","HLA-A3001","HLA-A3002","HLA-A3101","HLA-A3201","HLA-A3207","HLA-A3215","HLA-A3301","HLA-A6601","HLA-A6801","HLA-A6802","HLA-A6823","HLA-A6901","HLA-A8001","HLA-B4403","HLA-B4501","HLA-B4601","HLA-B4801","HLA-B5101","HLA-B5301","HLA-B5401","HLA-B5701","HLA-B5801","HLA-B8301","HLA-B0702","HLA-B0801","HLA-B0802","HLA-B0803","HLA-B1402","HLA-B1501","HLA-B1502","HLA-B1503","HLA-B1509","HLA-B1517","HLA-B1801","HLA-B2705","HLA-B2720","HLA-B3501","HLA-B3503","HLA-B3801","HLA-B3901","HLA-B4001","HLA-B4002","HLA-B4013","HLA-B4201","HLA-B4402","HLA-B5802","HLA-B7301","HLA-C0303","HLA-C0401","HLA-C0501","HLA-C0602","HLA-C0701","HLA-C0702","HLA-C0802","HLA-C1203","HLA-C1402","HLA-C1502","HLA-E0101"]

result_title = ['pos', 'HLA', 'peptide', 'core', 'offset','I_pos','I_len', 'D_pos','D_len', 'iCore', 'Identity', 'log50k', 'Affinity','Rank','BindLevel']

# Parse the fasta into program
tabule_fasta = subprocess.check_output(['sh', 'TurnTabule.sh', 'cosmic_neoantigen_tmp.fasta'])
tabule_fasta_list = tabule_fasta.decode().split('\n')

# Splite the singel record into name, comment, seq
name_list = []
comment_list = []
seq_list = []
for fasta in tabule_fasta_list:
    try:
        name, comment, seq = fasta.split('\t')
        name_list.append(name)
        comment_list.append(comment)
        seq_list.append(seq)
    except ValueError:
        print(fasta.split('\t'))
choose_HLA_type_super = HLA_type_super[0]

# create multiple database for each threads
database_create_list = []
process_num = 100

for i in range(process_num):
    database_create_list.append(''.join([choose_HLA_type_super,"_",str(i)]))
# create the database
for database_name in database_create_list:
    createdatabase(database_name)

database_name_list = []
for i in range(process_num):
    database_name_list.append(''.join([re.sub('-','_',choose_HLA_type_super),'_',str(i),"_netMHC.db"]))

database_conn_list = []
for i in range(process_num):
    conn = sqlite3.connect(database_name_list[i])
    database_conn_list.append(conn)

# use iterator to decrease memory use
iter_name = iter(name_list)
iter_comment = iter(comment_list)
iter_seq = iter(seq_list)
# que_list
# change to use the multiprocessing.queue
# emerge the que to one
que_list = []
work_num = 3573
multiple_fasta_num = 1000
i = 0
que = multiprocessing.Queue()
while i < work_num:
    try:
        sub_name_list = seperateIterate(iter_name, multiple_fasta_num)
        sub_comment_list = seperateIterate(iter_comment, multiple_fasta_num)
        sub_seq_list = seperateIterate(iter_seq, multiple_fasta_num)
        que.put(Job(sub_name_list, sub_comment_list, sub_seq_list, multiple_fasta_num, choose_HLA_type_super))
        i += 1
    except StopIteration:
        break

start_time  = time.time()
process_list = []
for i in range(process_num):
    p = multiprocessing.Process(target=doJob, name=''.join(['Process_',str(i)]), args=(que, database_conn_list[i]))
    process_list.append(p)
    print(p)
    p.start()

for p in process_list:
    p.join()
print(''.join(["done!, take",str(time.time()-start_time), " sec"]))

雲時代的小幫手：SQLite3

2017 年 04 月 15 日2017 年 04 月 15 日發表留言

sqlite是輕型開源的關聯式資料庫，目前最新版為sqlite3，在很多作業系統是內建就有了，不需要特別安裝，也是目前作為手機等行動裝置的輕型資料庫。在R(DBI/dplyr)和Python(sqlite3)中也有很好的支持。相對於其他開源關聯式資料庫如MySQL, PostgreSQL等

sqlite到底是輕在哪呢？有什麼特色？

1. 無服務器化(serverless)
他不需要像MySQL等要執行一隻進程來提供資料庫的存儲功能，藉由放棄client/server的架構，獲得輕型的特性。
2. 低設置(zero configuration)
因為不需要服務器來運作資料庫，所以相關設置也減少
3. 跨平台(cross-platform)
資料庫可以跨平台的搬動和使用
4. self-contained/small runtime footprint
在設置或是細調時，不太需要處理太多外部的library依賴問題，基本用來創建sqlite代碼大小最多700kb，且只需不到4MB的記憶體需求來運作。
5. 支援dynamic-type system/fully in-memory database
在記憶體需求以及創建的時間都小
6. transactional
雖然輕而小，但依舊具有把存儲操作transational的功能（與資料安全性相關）
7. 支持多數SQL92標準的要求
8. 開發團隊可靠
這邊有changelog採訪SQLite的創造人Richard Hipp的音頻，裡面說他們至少會維護到2050年，非常厲害的承諾，且感覺開發者們頗樸實和專注的，應該是能做到。

SQLite不適合的使用情景：

可以理解這些使用場景的不適合，是因為SQLite選擇使用單資料儲存的關係

高transaction rates的需求（需要頻繁的修改資料庫）
上百Gb的資料量
個別使用者的權限管理
需要client/server的使用場景

另外，SQLite還提供方便的command line（又稱dot command），對於新手應該很快可以上手。

.help
.exit
.databases
.tables
.schema
.dump

Bioawk：專門處理定序相關格式的awk（fasta,fastq,bed,sam,vcf,gff）

2017 年 04 月 14 日2017 年 07 月 26 日1 則迴響

在處理大量fasta/fastq時，有時候會需要找一些速度快的程式，其中會發現在python, perl中有許多設計專門用來prase fasta/fastq相關的工具，但速度上總是不甚理想，此時，要是有一個跟awk一樣底層處理字串的工具，那絕對會很強大，這個bioawk便是為此而生的，他是Heng Li(寫samtools的大神)開發的一個小工具，速度會是現行其他工具的十倍速，所以很適合用來整合成自己pipeline的前線。

到哪下載和安裝呢？

可以直接到Heng Li的github下載，然後直接用make指令來編譯，比較舊版的linux系統，可能會需要安裝一些相關用來compile的工具。

那他有什麼基本指令呢？
其實使用起來就是awk，只是他在語法上把本來NR等常見的variable，制換成這些format（fasta/fastq/bed…）中的值，如下面的表格：

BED	SAM	VCF	GFF	FASTX(FASTA/FASTQ)
chrom	qname	chrom	seqname	name
start	flaq	pos	source	seq
end	rname	id	feature	qual
name	pos	ref	start	comment
score	mapq	alt	end
strand	cigar	qual	score
thickstart	rnext	filter	filter
thickend	pnext	info	strand
rgb	tlen		group
blockcount	seq		attribute
blocksizes	qual
blockstarts

這邊有一些使用範例
Fasta

#fasta
算序列長度
bioawk -c fastx '{ print $name, length($seq) }' input.fasta
算每個序列的gc百分比
bioawk -c fastx '{ print $name, gc($seq) }' input.fasta
轉換成互補序列
bioawk -c fastx '{ print ">"$name;print revcomp($seq) }' input.fasta
只留下序列長度大於100的
bioawk -c fastx 'length($seq) > 100{ print ">"$name; print $seq }'  input.fasta
轉換fasta成tabular的格式
bioawk -t -c fastx '{ print $name, $seq }' input.fasta
選取特定id的序列
bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' input.fasta

Fastq

計算有幾個reads
bioawk -t -c fastx 'END {print NR}' input.fastq
轉換fastq成為fasta
bioawk -c fastx '{print ">"$name; print $seq}' input.fastq
計算phred平均值
bioawk -c fastx '{print ">"$name; print meanqual($qual)}' input.fastq
選取小於10的reads片段
bioawk -cfastx 'length($seq) > 10 {print "@"$name"\n"$seq"\n+\n"$qual}' input.fastq

閱讀參考
Bioawk basics

論文閱讀：Elements of cancer immunity and the cancer-immune set point 精華簡報

2017 年 04 月 13 日2017 年 04 月 01 日發表留言

這篇論文是今年2007年一月所發表的文獻回顧，由Genetech 癌症免疫療法部門的全球藥物開發負責人Damiel S. Chen和Genetech資深癌症生物學專家Ira Mellman兩位所撰寫，主要在提出如何由到目前為止許多由免疫療藥物的臨床試驗上取得的知識，來更好的建立一套觀點來做病人對免疫療法的預測。最重要的便是如何將病人腫瘤附近的組織切片，利用其特徵來分成幾類免疫表型。

在docker容器中，安裝jupyter於ubuntu:14.04，由mac連線到running jupyter

2017 年 04 月 12 日2017 年 04 月 15 日發表留言

想使用ubuntu:14.04基礎容器來做python程式prototype的工具，一方面，一些要使用的生資軟體只能在ubuntu環境下使用，另一方面，ubuntu:14.04算是很輕的小容器，希望讓最後的鏡像小一點。

這邊製作的容器主要是要使用bioawk和netMHC程式，所以dockerfile中會有安裝他的相關訊息。

這邊有幾個重點：

1. 安裝完所有jupyter需要的python library和ubuntu相關library
2. 在virtual environment中安裝jupyter
3. docker網路設置（port forwarding）
4. jupyter notebook網路設定（讓localhost以外的ip登入）

首先奉上dockerfile，裡面有所需的library

################################################################
# Dockerfile
# Version:          1
# Software:         BioAwk/Jupyter
# Software Version: 201703
# Description:      Basic image for Bioawk/Jupyter/Python3/plotly/numpy
# Provides:         autotools-dev|automake|cmake|curl|fuse|git|wget|zip|build-essential|pkg-config|python2.7|python-dev|python-pip|zlib1g-dev
#################################################################
# Source Image
FROM ubuntu:14.04

################## BEGIN INSTALLATION ######################

# apt update and install global requirements
RUN apt-get update     && \
    apt-get install -y software-properties-common && \
    apt-get install -y    \
        autotools-dev     \
        automake          \
        cmake             \
        curl              \
        fuse              \
        git               \
        wget              \
        zip               \
        build-essential   \
        gcc               \
        pkg-config        \
        csh               \
        tcsh              \
        byacc             \
        flex              \
        liblzma-dev       \
        lib1g-dev         \
        libblas-dev       \
        liblapack-dev     \
        gfortran          \
        libfreetype6-dev  \
        libpng-dev        \
        libjpeg8-dev
RUN apt-get install -y  python3\
                        python3-pip
                        python3-dev\
                        python3.4-venv
Copy /bioapps /bioapps
WORKDIR /bioapps/
ENV PATH /bioapps/netMHC-4.0:$PATH

WORKDIR /bioapps/bioawk-master
RUN make
ENV PATH /bioapps/bioawk-master:$PATH

CMD ["/bin/bash"]

# change workdir
WORKDIR /var/data
RUN chmod 777 /var/data
RUN python3 -m venv test
RUN . test/bin/activate && pip3 install --upgrade pip && pip3 install -vU setuptools && pip3 install jupyter
RUN . test/bin/activate && pip3 install numpy && pip3 install plotly
##################### INSTALLATION END #####################

# File Author / Maintainer
MAINTAINER weitinglin <weitinglin66@gmail.com>

裡面安裝的東西主要由python pip3相關的依賴、bioawk在make時的依賴。過程會創建一個
虛擬環境test，並且在裡面安裝jupyter,和會使用到的package。

接下來，在運行這個容器的時候，要把port口掛在出來。

docker run -p 8888:8888 -it netmhc:latest

接者，在執行jupyter的時候，要用下面的指定

jupyter notebook --allow-root --no-browser --ip="*"

進入安裝有jupyter的python virtual environment
source test/bin/activate

執行jupyter，因為在docker container中，會是root模式，所以使用allow-root，另外，因為不需要直接跳開視窗，所以使用no-browser，最後，一個關鍵動作 –ip="*"，讓所有外來ip都可以進入。過程中，有一些小技巧，比如pip3設定出現問題了，該怎麼刪掉再重新裝一次，可以使用apt-get purget python3-pip，然後再重新安裝一次。然後，會需要瞭解一下docker port forwarding的原理。

參考文章：
How to install pip with Python 3?
make bioawk錯誤，需安裝的檔案
 安裝jupyter在ubuntu上面
 ValueError “Expected version spec” when installing local wheel via pip
How to install pyzmq for iPython Notebook in a Python 3 virtual environment?
Networking features in Docker for Mac
From inside of a Docker container, how do I connect to the localhost of the machine?
Using localhost for to access running container
Jupyer notebook 安全性設定

關於W3C(World Wide Web Consordium)的二三事

2017 年 04 月 10 日2017 年 04 月 10 日發表留言

在處理許多項目時，不管是視覺化或是網頁資料的爬梳，甚至是文字編碼等，都會找到W3C相關網站，就蠻好奇這到底是什麼組織，後來發現這組織跟網路的發展大有關係，是由網路之父Tim Berners-Lee(建立起URL,HTML,HTTP粗稿)，在1994所創立的。

W3C主要是由會員群體所維護的，會員超過8747人來自超過400個公司、大學，組織下面有大概56個針對不同技術問題的組別，負責發展和各種網路標準的討論，發表超過4954份技術報告，包含357個網頁技術標準。

裡頭有很多很棒的文檔（非常詳盡的針對許多網頁的議題，所發展的，幾乎每一個都可以是一本書的資訊量），較廣泛的議題或是適用性高的幾乎都會另外有一個次分支組織在處理這樣的討論。

W3C Logo Web Accessibility initiative

關於網頁可進性(Web Accessibility)該如何提升

1. Tips for Getting Started with Web Accessibility Tips for Getting Started with Web Accessibility
2. Easy Checks – A First Review of Web Accessibility

如何提升網頁的國際性(Internationalization)

1. Localization vs. Internationalization
2. Internationalization Quick Tips for the Web

Network Analysis中評估網絡節點重要性的指標

2017 年 04 月 07 日2017 年 04 月 03 日發表留言

用network來整合生物醫學資料並且使用它來探勘知識是一個非常好的方法，且network的建構具有高度的自由度和知識領域特異性，實際上，有一些拓墣學上的指標可以用來描述特定節點的重要性，這邊介紹11種可以用來描繪節點之於整個網絡關係的算法指標：

Local-based methoed

Local-based的方法一次只考慮單一個節點和他周遭的關係

Degree

最基礎的指標，單個節點的連接數
$Deg(v)=|N(v)|$

Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411: 41-42. 10.1038/35075138.

Maximum Neighborhood Component

$MNC(v)=|V(MC(v)|, where MC(v) \ is \ a \ maximum \ connected \ component\ of\ the\ G[N(v)]$

Density of Maximum Neighborhood Component

$DMNC(v)=|E(MC(v))|/|V(MC(v))|^{ \varepsilon },\varepsilon =1.7$

Lin CY, Chin CH, Wu HH, Chen SH, Ho CW, Ko MT: Hubba: hub objects analyzer–a framework of interactome hubs identification for network biology. Nucleic Acids Res. 2008, 36: W438-443. 10.1093/nar/gkn257.

Maximal Clique Centrality (proposed in this paper)

$MCC(v)= \sum\nolimits_{C \in S(v)} (|C|-1)!$
Przulj N, Wigle DA, Jurisica I: Functional topology in a network of protein interactions. Bioinformatics. 2004, 20: 340-348. 10.1093/bioinformatics/btg415.

Global-based methods

這邊主要是six node ranking method的實現，細節可以閱讀shortest path algorithm和它的generalized form k shortest path algorithm。

Closeness

$Clo(v)= \sum_{w \in V}\frac{1}{dist(v,w)}$
Sabidussi G: The centrality index of a graph. Psychometrika. 1966, 31: 581-603. 10.1007/BF02289527.

EcCentricity

$EC(v)=\frac{|V(C(v)|}{|V|}\times \frac{1}{max\begin{Bmatrix}dist(v,w);w \in C(v)\end{Bmatrix}}$

Hage P, Harary F: Eccentricity and centrality in networks. Social Networks. 1995, 17: 57-63. 10.1016/0378-8733(94)00248-9.

Radiality

$Rad(v)=\frac{|V(C(v))|}{|V(C(v))|}\times \frac{\sum_{w \in C(v)}( \Delta_{C(v)}+1 - dist(v,w) )}{max\begin{Bmatrix}dist(v,w);w\in C(v)\end{Bmatrix}}$
Valente TW, Foreman RK: Integration and radiality: Measuring the extent of an individual’s connectedness and reachability in a network. Social Networks. 1998, 20: 89-105. 10.1016/S0378-8733(97)00007-5

BottleNeck(BN)

$BN(v)=\sum_{s \in V}p_{s}(v),where\ p_{s}(v) = 1 \ if\ more\ than\ |V(T_{s})|/4\ paths\ from\ node\ s\ to\ other\ nodes\ in\ T_{s}\ meet\ at\ the\ vertex\ v;otherwise\ p_{s}(v)=0$

Stress

$Str(v)=\sum\nolimits_{s \neq t \neq v \in C(v)}\delta_{st}(v),where\ \delta_{st}(v)\ is\ the\ number\ of\ shortness\ paths\ from\ node\ s\ to\ node\ t\ which\ use\ the\ node\ v$
Shimbel A: Structural parameters of communication networks. The bulletin of mathematical biophysics. 1953, 15: 501-507. 10.1007/BF02476438.

Betweenness

$BC(v)=\sum\nolimits_{s \neq t \neq v \in C(v)}\frac{\delta_{st}(v)}{\delta_{st}},where\ \delta_{st}(v)\ is\ the\ number\ of\ shortness\ paths\ from\ node\ s\ to\ node\ t\ which\ use\ the\ node\ v$
Freeman L: A Set of Measures of Centrality Based on Betweenness. Sociometry. 1977, 40: 35-41. 10.2307/3033543.

Edge Percolated Component

$EPC(v)=\frac{1}{|V|}\sum\nolimits_{k=1}^{1000}\sum\nolimits_{t \in V} \delta_{vt}{k}$

Chin CS, Samanta MP: Global snapshot of a protein interaction network-a percolation based approach. Bioinformatics. 2003, 19: 2413-2419. 10.1093/bioinformatics/btg339.

	臨床次世代定序NGS Workshop… 在使用國網中心超級電腦台灣衫建置定序分析流程(二）
	臨床次世代定序NGS Workshop… 在使用國網中心超級電腦台灣衫建置定序分析流程(一）
	Kotlin 在可擴展標記語言XML是什麼？
	Cupid.K 在 2021 臨床次世代定序實務與應用概論
	kuoYH 在 2023 生物資訊學習資源彙整

	臨床次世代定序NGS Workshop… 在使用國網中心超級電腦台灣衫建置定序分析流程(二）
	臨床次世代定序NGS Workshop… 在使用國網中心超級電腦台灣衫建置定序分析流程(一）
	Kotlin 在可擴展標記語言XML是什麼？
	Cupid.K 在 2021 臨床次世代定序實務與應用概論
	kuoYH 在 2023 生物資訊學習資源彙整