Pathology

Introduction

Python is special. It’s used by the big tech companies, of course, but it’s also used by those you would rarely classify as developers.

Python is not just for web developers and data science or machine learning specialists. It’s used by a huge spectrum of people doing all sorts of interesting work and solving real problems with it.

A molecular pathologist is a type of physician who looks at the genome of either a patient or a patient’s tissue. They look at all of this in order to help manage a patient’s treatment.

Molecular pathologist

Molecular pathologists don’t work with patients directly. It’s a subspecialty in medicine where the labs work with samples that have been collected from the patient, either from a procedure or from the radiology suite. They work on the tissue, blood sample, or bone marrow sample that comes to them.

All the testing they perform is done on that specimen. Once they generate the clinical reports, those go back to the patient’s chart and to the clinicians who are treating and managing the patient. That is how they help reach a diagnosis and then give the patient the appropriate treatment and management.

Data

When you’re studying genomics, how much data is in one strand of DNA? How much of that do you actually care about?

It depends on what is being done. When we talk about genomics, it really depends on how the experiment is designed.

So, for example, if we simply look at the entire human genome, we are talking about three billion letters. Essentially, it’s a combination of four letters: A, T, G, and C. These are the four nucleotides of the DNA sequence; RNA uses one different letter, U, which replaces T.

But the idea is that it’s a mix and match of these letters. So if you think of the entire human genome as a single thread of A, T, G, and C in various combinations, you’re looking at three billion letters.

When we do these sequencing experiments, we take the DNA molecules from a bunch of cells within a tissue, and in the broadest case we read all three billion base pairs. Typically, sequencing reads these sequences from many molecules, so you end up with multiple copies when you translate from a chemical molecular structure to a DNA sequence in a flat file on a file system. At that scale, the entire genome, we are talking about hundreds of gigabytes, maybe even a terabyte, of data.
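
As a rough back-of-the-envelope check on those sizes, here is a minimal sketch. The 30x depth and the figure of about two bytes per sequenced base (one letter plus one quality character in an uncompressed text file) are assumptions made for illustration, and the function names are hypothetical.

```python
GENOME_BASES = 3_000_000_000  # three billion bases, as stated in the text


def raw_sequence_bytes(bases: int, depth: int, bytes_per_base: float = 2.0) -> float:
    """Rough uncompressed size of the raw reads.

    Assumes each sequenced base stores one letter plus one quality
    character; read headers and compression are ignored.
    """
    return bases * depth * bytes_per_base


def to_gigabytes(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


size_30x = to_gigabytes(raw_sequence_bytes(GENOME_BASES, depth=30))
print(f"Whole genome at 30x: roughly {size_30x:.0f} GB uncompressed")
```

Under these assumptions a single whole genome already lands in the hundreds of gigabytes, which is in line with the range quoted above.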

Targeted sequencing

Then there are other, more practical approaches to looking at the genome, especially for day-to-day patient care. This is referred to as targeted sequencing.

What that means is that instead of the three billion base pairs, we focus on those regions of the human genome that are most pertinent, or at least that the field of genomics currently understands what to do with.

There are certain genes, at least in the space of cancer genomics, maybe close to 2,000 genes, which are known to be cancer-associated. Of those, roughly 500 to 700 genes have been studied enough to demonstrate that certain types of abnormalities in them, in terms of sequence changes, have a specific meaning in the context of a tumor: they help make a diagnosis, indicate whether the tumor is aggressive or benign, or show whether certain treatments could be applied to it. That meaning is specifically linked to the kind of sequence change you see in that region of the genome.

The targeted testing we do covers a very small fraction of the whole genome. A related term is exome sequencing.

Exome sequencing

Exome sequencing refers to sequencing all those regions of the human genome that encode one or another annotated gene. That is typically about 1 to 2% of the entire genome.

If we further narrow that down to, say, the 500 or 600 genes that one would typically sequence for practical cancer molecular testing, I would say that’s probably about a tenth of that, maybe slightly less, but it has a very high yield from a clinical standpoint.

Redundancy

That is because the likelihood of finding alterations that would help with clinical treatment is high. That data set is complex in a different way: the raw sequence data might be 1 to 20 gigabytes for a single sequence file, but it depends entirely on how deep we go.

As I mentioned before, when we sequence a molecule, we can sequence it at a certain depth, meaning the level of redundancy with which you read that molecule.

Sometimes we read the molecules 20 to 30 times, which is referred to as 20x or 30x. Sometimes we read them 500 times, which would be 500x.

Do you do that because you want to make sure you don’t misread the gene?

Yes. For the large panels that we sequence in a clinical setting, we usually target about 1,500x to 2,000x, meaning we’re reading each region up to 2,000 times. The greater the depth, the better the chance of identifying a variation or genomic alteration that is present at a very low level.

For example, you have a tumor sample, and within it only 2% of the cells have this mutation. The others don’t. When you’re hunting for these needles in a haystack, you really want to maximize the depth you have in order to pick those things up. So it really depends on how deep we go. The deeper we go, the more data there is, and it can scale up to several hundred gigabytes.
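
The link between depth and finding a low-level variant can be sketched with a simple binomial model. This is only an illustration: it assumes each read independently carries the variant with a fixed probability (here 2%, treating the 2% of cells as 2% of reads), and the threshold of five supporting reads is a hypothetical choice, not a rule from the source.

```python
from math import comb


def prob_at_least(k_min: int, depth: int, fraction: float) -> float:
    """Probability of seeing at least k_min variant reads at a given depth.

    Binomial model: each of `depth` reads independently shows the variant
    with probability `fraction`.
    """
    p_fewer = sum(
        comb(depth, k) * fraction**k * (1 - fraction) ** (depth - k)
        for k in range(k_min)
    )
    return 1.0 - p_fewer


# Compare a shallow, a moderate, and a deep read depth.
for depth in (30, 500, 2000):
    p = prob_at_least(5, depth, 0.02)
    print(f"{depth}x: P(at least 5 variant reads) = {p:.4f}")
```

In this model the chance of collecting enough supporting reads is very small at 30x and approaches certainty at around 2,000x, which is the intuition behind sequencing clinical panels so deeply.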

Yeah, I’ve always wondered how you can read somebody’s genetics, using chemicals to read it, and not make a mistake. It’s really ridiculous how much data is there. Off by one, a C for a G or whatever, is a bad thing, right?

Right. As the technology has matured, there’s still nothing that is 100% in terms of the error profile of the enzyme that is used, or of the technology that reads the actual fluorescence and converts it to a signal. There are always statistical values and probabilities associated with whether a call is correct or incorrect. But within that frame, the current technology is pretty accurate for many, if not all, regions of the genome. It’s mind-boggling how it works.

Yeah, it really is quite amazing. It’s one of the modern marvels of science for sure.
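
The source does not name a specific convention, but per-base error probabilities like the ones described above are commonly stored as Phred quality scores in FASTQ files. A minimal sketch, assuming the widely used Phred+33 ASCII encoding (the function names are hypothetical):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse of phred_to_error_prob."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality]


# Each character encodes how likely the corresponding base call is wrong.
for q in decode_quality("II5+"):
    print(q, phred_to_error_prob(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base is wrong, and a score of 30 to 1 in 1,000.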

Computing

When you start in a research lab, R and Python are the most common tools used for any kind of data analysis and data set visualization.

In bioinformatics, at least in genomics, the ecosystem of tools available is a mishmash of everything. Anything that is very computation intensive, like aligning sequences to the human genome, is a demanding task, and typically it’s a lot of C, C++, and Java that’s involved in the mainstream tools for it. More recently, I think we are seeing Rust coming into the picture as well.

There are some applications there. And then of course there are Python and R, the predominant programming languages used to solve all of these problems. When I started my molecular pathology fellowship, I had to do a project that involved manipulating the sequence data to the point where we could develop a web-based application that would help other pathologists and faculty read that sequencing data and digest it in a way that’s easy for them to look at.

That is, rather than going to the Linux terminal and opening up raw files and things like that. My first project was to use C# in that context. But I quickly realized that a lot of these algorithms were natively written in either R or Python, and incorporating that functionality was not easy. So I had to rewrite a lot of those things, primarily in C#. It was a good learning curve, but from a maintenance perspective it was getting really difficult.

That’s when the realization came that I had to move towards the combination of Linux and Python.

Yeah. C#, in the timeframe you’re thinking about, probably didn’t have a great package manager story, not to the same degree that Python does. Although they do pretty well now over in .NET land.

Right.

So a good question from Chris in the audience: is there a reason to use Python specifically? Are there some special sauce packages that make it attractive? It sounds like that’s what you were getting at. You found more solutions to these algorithms available in Python than in C#, or whatever other languages?

Right. I think the simple answer is yes. It’s the community and the amount of work that has been done in this particular space with genomics. When you are really searching for applications, they fall into these three categories. Anything that is a high performance component is usually in C or C++, a lot of it in those languages, with a little bit of Rust and Java. And then the other bin is essentially split between Python and R.

For me, and I’m sure others have felt the same way, Python was almost like, wow, this is amazing. Coming from C#, it was a bit of a change because there are no more curly braces.

Did you miss your semicolons?

Kind of. Even now, when I write a little bit of JavaScript, I feel like, oh, yeah, okay. But not too bad. What I got hooked on was the simplicity of the language and how powerful it was. For example, there was an algorithm where I had to parse certain strings as part of known workflows that we use for variant annotation, when we are cross-referencing databases and putting them together. When you look for C# packages, there’s really nothing there that you can use natively, so you have to write a lot of those things yourself.

In Python, the time spent developing those things is much shorter. Development is quick because you either get an idea from somebody who’s already done the work or there’s a more formal package that you can use.

Biopython

When I started off, Biopython was a very interesting collection of packages. It was essentially a tool suite written to make functions available for very common day-to-day tasks. I want to query a certain region of a BAM file, or parse certain things out of a FASTQ file to look at some of the sequences, or count the number of sequences in a given file and get read counts. Things like that are all available out of the box.

That was the first thing that made me go, this is amazing. Somebody’s already done the work, and you just build on top of it. So instead of creating these myself, I’ll just use those.

Perfect.

Right. So that was one.
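
To make the read-counting task concrete, here is a minimal sketch that uses only the standard library so it runs without extra installs. With Biopython itself, the usual entry point for this is `Bio.SeqIO.parse(handle, "fastq")`. The parser below is a simplified stand-in, not Biopython’s implementation: it assumes the common four-lines-per-record layout, and its function names are hypothetical.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a FASTQ stream.

    Assumes four lines per record (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], sequence, quality


def count_reads(handle: TextIO) -> int:
    """Count the records in a FASTQ stream."""
    return sum(1 for _ in read_fastq(handle))


# Two made-up reads, held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5+\n")
print(count_reads(example))
```

For a real file, the same functions would be called on `open(path)` instead of the in-memory example.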

The other question about the motivation to use Python was, why not R? R has a very rich ecosystem, at least in genomics and visualization. So the second thing was that the idea I was working on meant developing a web application, with all of these bioinformatics tools and algorithms running in the back end.

At that time it was like, okay, I’ve not heard much about Python in terms of web applications. Mostly it was, again, the big C# and .NET world, which is why I started off with that. But at that time there was Django, and Flask was coming in as a very minimalistic framework. So I started focusing on that. It was very easy with Flask to get up and running with very simple applications. I didn’t get much into Django, just because it was too bloated for me. But Flask was great.

And then what I realized was that you can create a simple web application and, at the same time, use Biopython and all the wonderful bioinformatics packages in the back end. So it’s a single language that lets you do both. I was like, this is great. I don’t have to go anywhere to learn a third or a fourth or a fifth programming language. This just gets the job done.
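
The point about one language covering both the web layer and the analysis can be sketched as follows. The source used Flask; to keep this example free of third-party dependencies it uses a plain WSGI callable instead, which is the same interface a Flask application exposes. The endpoint, the `seq` query parameter, and the GC-content calculation are all hypothetical illustrations.

```python
import json
from urllib.parse import parse_qs


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def app(environ, start_response):
    """Plain WSGI callable: ?seq=ACGT returns the GC content as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    seq = query.get("seq", [""])[0]
    body = json.dumps({"length": len(seq), "gc": gc_content(seq)}).encode()
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


# Call the app directly with a minimal fake request instead of starting a server.
captured = {}


def record_status(status, headers):
    captured["status"] = status


response = app({"QUERY_STRING": "seq=GGCA"}, record_status)
print(captured["status"], response[0].decode())
```

The same callable could be served locally with `wsgiref.simple_server.make_server` from the standard library; in a Flask application the analysis function would simply be called from a route handler.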

Keeping in mind that your main job is medicine, not programming, right? It’s not like you’re a CS person who’s out to learn all the languages.

Right. As you said, I’m in an unusual position where I’m a physician but I also do a lot of application development. So how much time I have to develop these prototypes is certainly an important point. The way it works right now, where I am currently working, is that I have an excellent team of developers and bioinformaticians who do a lot of the development work on the front and back end. The time I can take out of clinical and patient care work is limited. So if I can get whatever prototype I’m thinking of developed fast, that’s what I’m going for.

And then you can hand it off to the team and let them polish it up and make it production ready, basically.

I was wondering how much of your time at work you get to spend on these kinds of things, finding new packages, optimizing or improving the ways that you’re working, versus just handing it off to the folks you work with and keeping your focus more on the medicine side.

Yeah, I think it’s a good question. It has evolved over time, from when I was in training, to being faculty, to being faculty in this new position. One of the things I did as part of my certification was to get board certified in clinical informatics.

That’s a discipline by itself. It’s a very broad field in terms of informatics and healthcare, and one of the buckets there is software development. I was quite interested in that field. So most of the time I could devote to finding new packages, writing an application that could solve a problem, or coming up with prototypes was arranged in a way that aligned with the work I was doing. There would be days when I’m on clinical service, where I’m mostly working on patient care related matters.

Those weeks would obviously be very busy. I would wake up extremely early in the morning and spend the first two hours, four to six AM, just working on this. Then I’d get back to the clinical world. And then there would be weeks when I’m off clinical service, so I’m not responsible for any patient care related work. Those are the weeks where I would spend time investigating some of these packages, coming up with new ideas, and exploring what is available for the problems I was solving. That time, my quote-unquote protected time professionally, was spent on that. That might be a week spent on something like: we are trying to look into this variant annotation tool, and we want to write wrappers around it so it becomes easy for our lab operations to use. So that’s how it works: either early mornings or the weeks I’m off clinical service.

So on one hand, in the Python web world, you mentioned Flask and Django. Flask and Django, while they are evolving, are pretty much the way they have been, and they’re pretty stable. If you learned Flask five years ago, you’re still good to use Flask today. Or is it more like FastAPI, Pydantic, msgspec, where there’s something new all the time that you’ve got to keep learning to bring in? Are there a ton of new packages just coming online, or is there a set of really solid ones?

I think it’s both yes and no, and it depends on what area we’re working on. In the clinical lab that I’m directing here, when I came in 2020, we started off from scratch. The idea was to bring up a pediatric cancer sequencing infrastructure that was not available before. It was ground up, from the lab to personnel, to space, to computation, and everything else.

We have two big bubbles in that operation from an informatics perspective. One is our custom lab information system, which we are in the process of developing; that’s essentially a web app. The other is bioinformatics. In bioinformatics, a lot of the custom scripting and the applications we develop are Python based. Some of them we do in Go, when we need a bit more performance.

From a web app perspective, when I started here, we used FastAPI. The idea was that, since we were starting from scratch and I had come to know about FastAPI at that point, the whole thing was about async/await, and I was pretty much sold on that aspect. The whole tool made a lot of sense, and it was the perfect time to adopt it. I think when I started, FastAPI was at 0.5 or 0.6, and now you can see a lot of change happening there. So yes, that is definitely fast paced, and we do our catching up in a careful way.
The reason is that in more traditional research lab testing, there’s a lot of discovery and a lot of excitement, and in the end the data is presented at a conference or published as a manuscript. That’s the end point. If you move from version one to version three of an algorithm, you obviously have to make sure that your research is reproducible, but beyond that it’s not a problem.

When we’re talking about the same thing in the context of clinical care for a patient, the room for error is very, very small. You can’t make mistakes. The entire space of clinical testing is very regulated in that sense, because you are required to validate any change that happens in your pipeline. Say you’re using some version of an application and you upgrade to a newer version. You have to demonstrate that the analytical performance of the pipeline, in terms of sensitivity and specificity, didn’t change. So a lot of work is needed for a version upgrade. We keep those things very controlled and careful, whereas for things that are more in the R&D space there’s a little more room to play around with tools.

Right. Chris in the audience was asking a great question: basically, is it more exploratory, where you just move really fast and don’t really worry about tests and things like that? It sounds like this is more of a production type of thing. If you’re going to run it over and over, and at some point it gives a different answer when testing for a disease, that’s really bad. You need it to be right all the time.

Yes.
So the rule is this. When we do new test development or bring in a new algorithm, that part has a formal term in lab medicine: the optimization and familiarization, or O&F, phase. That’s where there’s a lot of flexibility: new tools, new versions, trying out different things. Once it moves from that into the validation phase, and once we deploy the application, it’s a production application. We don’t touch it unless something really has to be tinkered with or there’s a bug we have to fix.

Who’s in charge of running those apps? Is that people on your team and in your lab, or is that the hospital?

The way it is set up here, when I started off, I was the n of one. I started off with the FastAPI application, and I had initially authored our bioinformatics pipeline. But when we went through the validation phase, I luckily had two people on staff who were handling the bioinformatics on the front end, and eventually a third person joined the team. They were helping me with a lot of the actual groundwork: writing the code, getting tests done, going through the validation data, and summarizing that for me. As the lab director, it is ultimately my responsibility to sign off on all those things. So they say, okay, this is the validation, and this is what has been demonstrated: your package or pipeline shows this level of sensitivity. Then I, as lab director, say that yes, this is working. Once that happens, we deploy those applications in production. We use GitHub and the usual dev, test, prod cycle. That’s how it works.

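
The validation requirement described above, showing that sensitivity and specificity did not drop after a version upgrade, can be sketched as a comparison against a truth set. This is a simplified illustration and not the lab’s actual procedure; the site identifiers, function names, and zero-tolerance default are hypothetical, and the truth set is assumed to contain both positive and negative sites.

```python
def sensitivity_specificity(
    truth: dict[str, bool], calls: dict[str, bool]
) -> tuple[float, float]:
    """Compare pipeline calls to a truth set.

    Both dicts map a site ID to True (variant present) or False (absent).
    """
    tp = sum(1 for site, present in truth.items() if present and calls[site])
    fn = sum(1 for site, present in truth.items() if present and not calls[site])
    tn = sum(1 for site, present in truth.items() if not present and not calls[site])
    fp = sum(1 for site, present in truth.items() if not present and calls[site])
    return tp / (tp + fn), tn / (tn + fp)


def upgrade_is_acceptable(truth, old_calls, new_calls, tolerance: float = 0.0) -> bool:
    """True if the new version is no worse than the old one on both metrics."""
    old_sens, old_spec = sensitivity_specificity(truth, old_calls)
    new_sens, new_spec = sensitivity_specificity(truth, new_calls)
    return new_sens >= old_sens - tolerance and new_spec >= old_spec - tolerance


# Made-up truth set: two sites with a variant, two without.
truth = {"site1": True, "site2": True, "site3": False, "site4": False}
old_calls = dict(truth)                # old version gets every site right
new_calls = dict(truth, site2=False)   # new version misses one real variant
print(upgrade_is_acceptable(truth, old_calls, new_calls))
```

A check like this could run as an automated test on every proposed upgrade, with the formal sign-off remaining a human decision as described in the text.
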
Well, do you have your own hardware, or do you host stuff on DigitalOcean or AWS?

With healthcare data, there is generally a little bit of angst about data sitting in the cloud outside of the institution. The institution that I work at is quite forward thinking about using modern technology. Since everything was being built from scratch, we took the decision to keep things on-prem for the beginning. But we also kept in mind that at some point, if the institution decides to switch its infrastructure to AWS or Azure or whatever the platform is going to be, we want to be ready.

The way we have it set up is due to our amazing IS team here at our institution. We got our own hardware in terms of the actual servers, and we collaborated with the IS team to help us build our Kubernetes infrastructure. So we have a test and a prod Kubernetes cluster for all our apps and the bioinformatics pipeline. Well, the apps for now; the bioinformatics pipeline we’re looking forward to deploying there in the near future. As a matter of fact, our dev team does a lot of the development on Kubernetes as well, and we keep moving all these things over as containerized applications.

That’s excellent. So you’re really embracing containers and Docker and Kubernetes, and that should make it super easy to move to wherever you want to go, right? Anything that can run Kubernetes, you just push to that and you’re good to go.

Right. It’s a little bit difficult to start with, but once we are in that stream, it is much less effort to move things around.

Last year I rewrote all of our servers and APIs and condensed six to eight servers into one Docker cluster. It was a great decision, but to me it was also a little intimidating: here’s one more layer I have to manage and understand, and if something goes wrong there, then everything else breaks. But having it set up is really nice once you get used to it.

Yes. Right.

I want to talk about some of these packages that you’ve been saying are a lot of the reason you chose Python and use it so much. But before we get there, I’ve got the biopython.org website pulled up, and the very first line says Biopython is a set of freely available tools. Open source, freely available. So how much does that matter to you? On one hand, you have a ton of money, being in the medical space, and it’s really high stakes, so paying for commercial software or commercial libraries is probably not the biggest worry. On the other hand, open source is really nice. Being able to look inside is really nice, and you don’t have to deal with getting permission. How does that fit into your world? I know how it fits into small startups and things like that. But for a hospital, for example, what does free and open source mean to you?

I think it does have a lot of impact on how we end up working and setting these things up. And obviously, whatever I’m saying represents what it means from an operational standpoint. When we talk about molecular pathology, bringing up a clinical service like that is a huge investment. This is generally applicable to any institution where something like this has been set up for patient care or clinical use.
The investment is primarily in instrumentation and reagents, which are generally quite expensive. When we talk about the cost of a test when it is offered, that cost factors in a lot of these operational costs: buying expensive sequencing instruments, and the reagents that are used as consumables as we run the tests over and over again, every week, every month. Traditionally, say 10 years back, when we would work with our finance team to say that the cost of the test is going to be so and so based on all of these different inputs, computation and bioinformatics were not factored in at all.

But now we are in an era where using GPUs on a regular basis for routine work, I would not say simple work, to get from the raw sequence data to identified genomic variants is getting common, as is using FPGAs and large clusters to perform these tasks. So we are starting to see those costs becoming part of the ultimate cost that goes to the patient for a test, and we try to minimize them. One of the ways to do that is to choose between free open source software and a commercial product. It’s always a balance. With a commercial product, you get reliability and service: if something breaks down, there’s an SLA, a certain assurance that help will be there. With free open source, we need to feel very confident in the code base.
When we use some of these open source tools, we almost invariably end up having some wrapper around them to change things, or having some insight into the source code. So what we choose depends on that balance.

How often do you fork a package and use your self-maintained version, versus just running what is publicly on PyPI and maybe wrapping it to orchestrate it a bit?

For the web application part, we don’t really do a lot of forking. We go with what is there. Since we have the luxury of using a combination of GitHub and containers, and since the regulatory requirements demand that you tightly version control all these things, with history, we try to stick to a fairly stable version when we are developing and validating. For example, we try to stay away from beta versions or release candidates, even if they have some desirable features. Unless we see a full production version, we don’t tend to switch to it. So we keep things like that without forking or making modifications.

When we get into more of the bioinformatics side, where we are using an algorithm to handle a particular part of the pipeline that does some data transformation, it depends on how much we want to change or modify. That’s when we sometimes fork. Sometimes we fork because, and this is the unfortunate reality in many scenarios, you have a great open source tool, but after some time, for financial or business or other reasons, its authors stop maintaining it. Essentially, it goes into freeze mode. We tend to fork it so that at least we have it available.

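
The policy of avoiding beta and release-candidate versions could be enforced with a small check over pinned dependencies. The sketch below is deliberately strict and is only an illustration: it accepts nothing but plain dotted-number versions, so it also rejects post-releases and local versions, and the package names are hypothetical.

```python
import re

# A plain final release: digits separated by dots, e.g. "1.4.0".
_FINAL_RELEASE = re.compile(r"\d+(?:\.\d+)*")


def is_final_release(version: str) -> bool:
    """True only for versions made of dotted numbers (no rc, beta, dev, ...)."""
    return _FINAL_RELEASE.fullmatch(version) is not None


def unstable_pins(pins: dict[str, str]) -> list[str]:
    """Return the names of pinned packages that are not plain final releases."""
    return sorted(name for name, version in pins.items() if not is_final_release(version))


# Hypothetical pinned versions for three made-up packages.
pins = {
    "web-framework": "0.95.1",
    "variant-tool": "2.0.0rc1",
    "aligner-wrapper": "1.4.0b2",
}
print(unstable_pins(pins))
```

A check like this could run in the dev and test stages so that a pre-release never reaches the validated production pipeline unnoticed.
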
If we make any changes, we keep them in that fork. But generally I would say it’s probably 80-20, where 20% of the time we fork and make some change. Most of the time we try not to do that.

But yes, free open source tools have a big impact. A lot of the tools that we use in our bioinformatics pipelines, and that the molecular pathology community uses to build such pipelines, are open source. For example, it’s not written in Python, but we have an algorithm called BWA, written by Professor Heng Li. It is almost the de facto standard, I would say, when it comes to sequence alignment, which is one part of the pipeline. It has been a tried and tested application for more than a decade now, it is fairly stable, and it’s well maintained from an open source perspective. So we rely heavily on tools like that. But there is also a whole ecosystem of software that comes under the umbrella term of variant calling, where we’re trying to identify these different variants.
There’s a whole bunch of those, and some are fairly well maintained. They are open source, but sometimes, depending on the context, you need a license: you need one if the tool is used in a commercial setting, and you don’t if it is used in an academic setting. For example, when we do clinical testing at an institution like the one I’m at right now, that’s an academic institution, so it’s typically not for profit and we don’t need licenses for that use. But once it goes into a purely commercial space, where the lab is doing all of this testing for profit, there’s a license requirement. So we see a combination of these things, and it’s becoming more common with open source tools, at least in the genomics and bioinformatics space.

Another benefit, for you and for a lot of organizations, is that if you use open source tools and you need to hire somebody new, there’s a good chance they already have experience with those tools. Whereas if you use something private and expensive, you might have to teach them from scratch, right?

Yes, that is correct. Fortunately, a lot of people have done good work and contributed to the genomics and bioinformatics space. Whenever we are setting up pipelines for DNA sequencing, RNA sequencing, or, more from a research perspective, methylation sequencing, single-cell RNA-seq, or UMI-based error-corrected sequencing, there’s a very thriving open source space. That really helps people who come in: even if they’re not familiar with these tools, it’s easy to get familiar because there’s a lot of community backing. Or, as you said, when we hire people who are coming from a different lab or have had some experience, they come and
say, “Oh yeah, we know how to do alignment,” or “I’m aware of the applications that use that.” It is much, much easier from a learning curve perspective than having to open up a manual for a proprietary thing that only works here.

We coordinated on a list of packages that you’ve used in your lab or find really helpful for your work, and we could touch on those a little bit.

CNVkit

CNVkit: genome-wide copy number from high-throughput sequencing. I don’t know what that means, but tell us about it.

CNV, or copy number variation, is a type of genomic alteration. In a simplistic way, at least when we talk about cancer, a cancer cell tries different ways of surviving biologically. One of them involves genes that help a cell grow in the absence of nutrients, or with very little nutrition: if the cell has more copies of those genes than normal, it has an advantage. It’s like having more money than expected; you can do a lot of things.

In a normal human genome, any cell that we pick will have only two copies of a gene: one coming from your mom and one coming from your dad. In cancer, in certain scenarios, if a gene helps the cell grow, or helps it survive even without signal or nutrition, the cell will make 6, 8, 20, or 50 copies of it so that it can survive. In other scenarios, there is a gene that is supposed to regulate the cell so that it doesn’t go haywire. If the cancer is able to delete one copy of that gene, only one copy is left; knock that out too, and the protective mechanism is gone, so the cancer cell can easily survive. So with CNV, or copy number variation, the idea
is that we use the high-throughput sequencing data to infer how many copies of these genes we have: is it more than two, or less than two? This particular package is very well established and well maintained in the community, and that is essentially what it does. You give it the sequencing data and define the regions of the genome that you’re interested in. You can also provide names for the regions, such as the gene BRAF or EGFR or whatever you’re interested in. Then it does all the analysis to tell you, for example, that when this particular tumor is compared against a reference set of 20 normal samples, where we know there should be only two copies of the gene, this tumor shows 50 copies.

The output is numerical. It applies a log2-based transformation, so that after all the computation it can tell you, compared to normal, that this region has 50 times or 20 times the expected copy number, or half the expected copy number in the case of a deletion. It’s written in Python, with dependencies that are written in C or as Python C bindings.

It has an internal visualization tool, but I was not very happy with how it was written, so I ended up writing a wrapper called CNA plotter. It’s open source. It takes the output data from CNVkit and gives you a nice visualization of the copy number. If you scroll down, there are example images.

You have this on GitHub, so if people want to use it, it’s right there, right?

Yep, it’s right there. I think the screenshots are at the very bottom of the page.
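The log2 scale mentioned here can be made concrete with a little arithmetic. The sketch below is not CNVkit’s actual algorithm, and a real tool does considerably more work before it reports a number. It only shows, assuming a pure tumor sample and a normal copy number of two, how a tumor-versus-normal depth ratio maps to a log2 value and back to an estimated copy count. The function names and the depth values are hypothetical.

```python
import math

NORMAL_COPIES = 2  # assumption: diploid baseline, one copy from each parent


def log2_ratio(tumor_depth: float, normal_depth: float) -> float:
    """Log2 of tumor read depth relative to the normal reference depth."""
    if tumor_depth <= 0 or normal_depth <= 0:
        raise ValueError("depths must be positive")
    return math.log2(tumor_depth / normal_depth)


def estimated_copies(log2_value: float, normal_copies: int = NORMAL_COPIES) -> float:
    """Invert a log2 ratio into an estimated copy number."""
    return normal_copies * 2 ** log2_value


if __name__ == "__main__":
    # Hypothetical mean depths per region: (tumor, normal)
    regions = {
        "balanced": (100, 100),
        "single_copy": (50, 100),
        "amplified": (2500, 100),
    }
    for name, (tumor, normal) in regions.items():
        ratio = log2_ratio(tumor, normal)
        print(f"{name}: log2={ratio:+.2f}, copies~{estimated_copies(ratio):.1f}")
```

On this scale a log2 value of 0 corresponds to the expected two copies, and a value of -1 corresponds to a single copy.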
Oh yeah, right here.

So, for example, in the first image you can see a thin band of multicolored segments, and each one of them is a single human chromosome: chromosome 1, 2, 3, 4, and so on. The Y scale is log2, which starts at 0 and goes up to 1, 2, 3, with a negative scale on the lower side. Anything above 0 means you have more than two copies; anything below means fewer than two copies. In this example plot, at the very end, the band for chromosome X is lower, at negative 1. That means this is a male patient with a single X chromosome, as compared to females, who have two X chromosomes.

The plot below it is from a tumor cell line that is abnormal. Here we see two genes that are amplified: one is a gene known as TERT and the other is MDM2. These are again examples of genes that give the tumor a survival advantage, and you can see there are multiple copies of them compared to the baseline.

I see. So that might predict something like how survivable the cancer is? Whether it is going to stay localized where it started, or spread to other parts of the body, or be difficult to treat, or be resistant to treatment?

Yes.

So if this is you, do you want higher numbers, not lower numbers?

It all depends. Certain genes are good genes, for example checkpoint genes. You don’t want lower numbers of those; you want two copies, because if that protective mechanism is gone, the tumor becomes very aggressive, again depending on the tumor. So it all depends on context. If you’re looking at the good genes, you want to have two
copies of the good gene. If you’re looking at some of the bad genes, you don’t want more than two copies; one or zero is better.

HGVS

Okay, HGVS.

This is again a wonderful package. I think it was started by Reece Hart, and I think he still maintains it, but it’s a very well maintained open source package with a lot of community involvement as well. HGVS is a nomenclature system for naming all these variations. I’m not sure, Michael, if you’ve heard the term mutation. Mutation is a very commonly used term that refers, in this case, to some kind of abnormality in the genome.

There are standards that most clinical labs follow when they put this information in patient reports, saying that this particular tumor has a mutation in BRAF, a mutation in EGFR, or in some other gene. There is a certain way those mutations are described, in terms of what sequence alteration is happening at the mRNA level and what is happening at the protein level: in your protein you’re missing these amino acids, or you have an excess of these amino acids, or something got switched from here to there. There’s a formal way of defining that, and the group that defines the guidelines is the HGVS, the Human Genome Variation Society.

It’s a very complicated process, because you have to do all these translations from the genomic scale, where the numbering starts from one and runs to whatever the length of your chromosome is in terms of A, T, G, and Cs, and each chromosome has a different length. If you have a certain alteration at a particular position in, say, chromosome 7, you have to translate that to the mRNA of that gene and then the protein of
that gene. So there’s a lot of math and a lot of string handling involved in that process. The HGVS Python package provides all of that functionality in one place. You can create your translations, project a variant from the genomic level to the mRNA level to the protein level or vice versa, and validate things. I actually wrote a paper about this when we validated how well this particular package works, and now, in the lab that I’m currently in, we use it for generating that nomenclature.

When we put a report in the patient’s chart and the oncologist who is treating the patient wants to know what we identified in the tumor genome, they read that nomenclature and say, “Okay, this particular change in the BRAF gene is significant; I know there are therapies out there that we can use to treat this patient’s tumor.” That’s what this nomenclature system is about.

So it’s a very automated system, and it normalizes a variant if there are multiple ways to represent it. Very nice.

openpyxl

All right, this one I’m familiar with: openpyxl. I guess you have a lot of data that either comes from Excel or gets shared out into Excel, right?

Yes. Right now our lab is in a kind of interim phase where we sometimes use Excel to look at some data. Traditionally, for any lab that goes from zero to the point where a web application automates everything, the intermediate phase involves a lot of Excel. It’s very common in many labs to use Excel for QC, for charts, for tracking. We use openpyxl for a few things. One of them is when we have a lot of sequencing data that we have to summarize into a QC report: we create an Excel document on the fly from the backend to provide
whatever data people want to look at, whether statistics, a list of variants, or some calculation they want to take further. That’s where we use this package, typically as part of our bioinformatics pipeline when we have to generate those files. It’s a very handy tool. We use something similar, and I’m forgetting the name of the package, to generate Word documents for reports: we start with a Word template and then use Python to summarize a lot of these data points and fill in the document.

Right: here’s where the graph goes, here’s where the summary goes, here’s where whatever was detected goes. Cool. Here are two things that overlap. Are you familiar with this story where scientists renamed human genes to stop Excel from misreading them?

Oh yes, absolutely. It happens when we import a lot of data coming from somewhere: we’ll see entries like September 14th or March 19th.

This is a big problem going in and out of Excel, so do as much as you can in Python or any proper programming language rather than in Excel. There was one gene that was MARCH1, read as March 1, and another that was SEPT1.

It’s very funny. Some of the gene names are funny already, but Excel takes it to the next level when it changes the names. It doesn’t make any sense.

Hera

All right, on to the next one: Hera.

Yes, Hera. This is very interesting. This is where we are moving away from standard web applications and standard bioinformatics pipelines to really touching DevOps using Python. Typically, when we scale up our bioinformatics pipeline, we get to the point where we have multiple samples and multiple
runs, and everything needs to be orchestrated so that, while your pipeline is running, you have a lot of visibility into how it works. This is one of the projects we are working on: moving our current bioinformatics pipeline from running on a single server to deploying the long-running pipelines onto a Kubernetes cluster.

There are many options. There are the more standard WDL-based protocols that you can run in either cloud or HPC environments. There is a very popular tool called Nextflow, which lets you define your data analysis workflow and then use any backend to deploy it. When I was exploring this space, one of the things I came across was the ecosystem that Argo maintains, with Argo Workflows, Argo’s CI/CD tooling, and so on. Argo Workflows was interesting because you can write your pipelines in YAML and have them deployed on the Kubernetes cluster. It really is native to Kubernetes.

Interesting. It sounds a little bit like Ansible, but specifically for bio-type projects, right?

The interesting thing is that Argo Workflows was set up with a lot of CI/CD automation in mind. Yes, you can run data pipelines in general, but at least in its description it never mentions bioinformatics or biology pipeline analysis as a use case. It’s a generic tool; you can use it for whatever you want. So I tried it out with a YAML file and a simple four-step pipeline. It was wonderful. It was magical.
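A four-step pipeline like this is, in effect, a small dependency graph. As a rough illustration only, the sketch below uses Python’s standard library `graphlib` rather than Argo’s YAML schema or any workflow SDK, and the step names (`align`, `call_variants`, `annotate`, `report`) are hypothetical stand-ins for real pipeline stages. It shows how declaring which step depends on which lets a scheduler derive a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical four-step pipeline: each step maps to the steps it depends on.
PIPELINE = {
    "align": set(),
    "call_variants": {"align"},
    "annotate": {"call_variants"},
    "report": {"annotate"},
}


def execution_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def run(pipeline: dict[str, set[str]]) -> list[str]:
    """Pretend to run each step; a real orchestrator would launch containers."""
    log = []
    for step in execution_order(pipeline):
        log.append(f"ran {step}")
    return log


if __name__ == "__main__":
    print(run(PIPELINE))
```

A workflow engine adds what this sketch leaves out: running each step in its own container on the cluster, retrying failures, and showing progress for every step.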
And the good thing is that when you install Argo Workflows on your Kubernetes cluster, it comes with a native web interface. I’m sure you’ve heard of Airflow. Airflow also has a nice Python SDK; you deploy it on a Kubernetes cluster and you get all these amazing visualizations that show which step has run or whether there’s an error somewhere. Argo does the same thing. It obviously has the built-in capability to be driven through an API, but it also deploys a web interface that gives you visibility into every step of the process; you can summarize and see the entire tree. That was very interesting for us, because we could get all of that done in a single go.

Our desire, though, was to integrate that with the LIMS that we were building with FastAPI, so the question was whether there was a Python SDK. That’s where Hera comes in. Hera is essentially an SDK, a wrapper that talks to Argo, but you define your pipeline steps as DAGs in Python. That makes the process super simple, because we can integrate it natively as a backend to our web application. It’s Python again from start to finish; you’re not stepping out of it. And it’s a very well maintained project. We are currently doing a validation to demonstrate that it is equally performant compared to a more native shell-based execution of the pipeline.

Okay, this is new to me. Of course I know Airflow, but not Hera. Cool.

insiM

All right, insiM. Did I grab the right package here?

No, I think that one is just similar. Let me see if I can send you the link.
Yeah, throw it in the private chat and I’ll pull it up. Okay, there we go: insiM. Got it.

This is a very interesting space in next-generation, or high-throughput, sequencing assays. As I mentioned, one of the requirements for a clinical lab is to perform a validation on multiple tumor samples that have certain mutations. You demonstrate that the assay works because you have tested, say, 100 samples with 300 different mutations or genetic alterations, and your pipeline or assay was able to pick them up. Then you can say that your assay is X percent sensitive and X percent specific, state your recall rate, and so on.

The problem is that getting samples with variants that are very difficult or challenging to detect, because of how complex they are biologically, is very hard. Some of these are very rare; there may be only two samples in the entire world. It’s just not practical to get those samples unless we wait 10 years to validate. So the idea is to use algorithms that can manipulate existing sequence data. For example, we have sequence data from an existing real tumor sample, and we manipulate it to introduce mutations in silico. We can introduce SNVs or insertion and deletion mutations, then feed that same file into our bioinformatics pipeline, run it through the entire pipeline, and at the end see whether we are able to identify the variants that we inserted. That’s where in silico mutagenesis comes in. It’s a very hot topic, and very relevant for filling this large gap in terms
of availability of rare variants and rare samples, and for improving how we handle these rare but clinically significant edge cases, where we don’t want to miss those variants when we actually see them in real tumor samples. This Python package was developed at the University of Chicago as part of their clinical lab. It takes in a list of different mutations. In this figure I think they give examples: insertions, where you have extra sequence; deletions, where certain segments are missing; and SNVs, or single nucleotide variants, where one nucleotide got switched with another. These are what we typically see in real tumor samples, and this is a way to mimic them in a sample that does not have them. You give the package a list, say 20 important mutations that you know have been reported in the public databases, and ask for them to be inserted into a data set that was created from three or four real tumors. Then you use that to challenge the pipeline: can you still pick them up?

I see. You simulate these rare changes and then test or exercise your setup.

Yes.

AI and LLMs

We’ve got a few more packages to cover, but I think we’re getting a little short on time, so let me close this out with a final question, because I know this is the topic du jour. What do AI and LLMs look like for you? Do they matter? Are they really powerful, super important? Genetics is kind of text data in a sense, so I can see how they could apply.

Right, it is text data. When you talk about the search space, a lot of it is very text based. There is some that is numerical, but there is a lot of
text-based search as well, across the entire spectrum, from where we start with very raw sequencing data to the point where we ask, “I found this rare or novel mutation in this particular gene; what does it mean? Where has it been described? What disease does it relate to?” One of the things we do as molecular pathologists, and this is where a lot of the medical work comes in, is go through the medical literature: what we have learned before, and new publications with data from studies on this particular gene that describe, for example, that certain alterations activate the gene, or are bad for the tumor, or make it treatment resistant. So naturally a lot of text search happens.

In that space, I would say in the past three to four years, a lot of AI tools have come out, particularly for variant calling, where we have genomic sequence data and are trying to identify variants. One example that’s been talked about a lot is DeepVariant, from a team at Google, which uses AI techniques to pick those variants up. There are also genomic databases that we use for in silico prediction when we have a variant we know nothing about. For example, there’s a database called dbscSNV that uses random forest techniques, and I think one other algorithm, to predict whether a mutation at a certain site can enhance an abnormal mechanism called splicing. Similarly, a lot of tools are coming in. LLMs, I would say, are not mainstream yet, but there’s a lot of interesting research that is
coming out, where people are trying to use LLMs for broader text search: “I have, I don’t know, a thousand articles, and I want to find a particular combination of a disease and a mutation. What do I get back?” I have personally tried ChatGPT with different phrases and questions. What I’ve seen so far, and this is purely my personal experience, is that a lot of it reads as very real, but when you start to look into the references it cites, you quickly figure out that it is just making them up. This is not the real deal. I’m not a very pessimistic person who would say this is all garbage; I think there is opportunity there. It’s just a question of how you train it.

Maybe there’s an opportunity, and people are probably already pursuing it, in training a smaller model that is deeply grounded in genetics, rather than a model that tries to understand everything.

Right, more in the medical or genomic literature, so that the meaning is enhanced there. I think there’s active work going on, with a lot of interesting research and a lot of potential impact on how we do things. And we might expect the tool sets that we currently use to change in the next 10 years.

For sure.

Getting started

All right, final thought. People are listening, and maybe they’re doing similar work to you. How do they get started? What would you tell them to get going with Python and some of these packages?

Reflecting on my own experience and the very winding way that I ended up here, I feel that programming in general, and Python
in particular, has a very low entry point in terms of learning it easily and quickly getting things done. For anybody who’s trying to pursue something in biology, computational biology, or bioinformatics, I think this is the first thing to learn. It’s relatively easy for anybody with analytical thinking and a desire to learn. Investing in Python is probably the best bet, because you can do pretty much anything you want with it. That’s what I say when I train people in my lab or talk to students: you have very little time because you’re busy with other things, so the one thing that can get the job done and still be aligned with what you’re doing is Python. After that, it’s a lot of self-driven learning. But the good thing is that the Python community is wonderful. When I sit down and think, “I have to solve this problem,” there are probably 50 other people who have thought about it, and maybe two who have already worked on it.

Right, they’ve already published it to PyPI and you’re good to go. I totally agree. People should take a couple of weeks and get good at it; it’ll definitely save you time in the long run.

Oh, absolutely. The only thing I would add is that if somebody is learning Python with the intention of getting involved in more serious application development, or maintaining an open source package, or contributing in some other way, I think learning a
little bit more, learning Python in its real sense in terms of how to do it right, is worthwhile. There may be five ways of doing something correctly, but there’s one way that is consistent, so that the code is easily shared, easily maintained, and easily understood by others. That would be my second piece of advice. It takes a little time, but I think it’s well worth the effort to write idiomatic Python code, so that it’s portable.

Absolutely. All right, Mac, thank you for being on the show. It’s been great to get this look inside what you all are doing with Python.

Thank you for having me on the show. I appreciate it.

DNA - Pathology