Creating a DNA Superkit from Multiple DNA Kits

Video Transcription

Your DNA has hundreds of thousands of markers, but each company uses a different set of those markers. So is there any way that you can combine them all to get a better match set?

Howdy, I’m and Lee with Family History Fanatics and this is a segment of DNA. Be sure to subscribe to our channel and click on the belt to be notified about upcoming episodes. If you’ve ever wondered if there’s a way that you can get a better match set, then today is a day that I’m gonna show you how. If we look at the different DNA kids, they each use a different set of snips from your DNA. Now there’s some overlap between each one of the companies, but there’s also a lot of original snips that no other company is using. So I had the idea of actually combining all of these into one kit that I then was gonna upload to GEDmatch and start using that as my major matching kit because it would be the one with the most information. Let me show you how I went about doing this.

Today I’m gonna be using Google Docs and these sheets, functioning Google Docs to do this, but I would actually caution and suggest that you use Excel or Open Office or Apple Pages when you’re doing the full thing because today all I’m gonna be doing is one chromosome and I’m doing the smallest chromosome. As you get to the larger chromosomes, there is so much information in there that it will overwhelm Google Docs. And the very first thing you’re going to be doing is you’re gonna separate things out by chromosome because you can’t have all of it for all of these tests in one tab. So that’s the very first thing that you want to do, is you want to import your data into your spreadsheet and you want to divide out the chromosomes so that they’re all on separate tabs. Now in this case, as we’re looking at this file, you can see that hey, this first one is my 23andme kit, and then this next one is my Ancestry kit, and then we have the Living DNA kit, and finally we have My Heritage Kit.

Now, some people may ask, why didn’t I use Family Tree DNA? And the reason is because Family Tree DNA and my heritage use the exact same micro array chip, and so the information on them is identical. So if I included Family Tree DNA, then almost all of that information is gonna get deleted out anyway because it’s gonna be a duplicate of something I already have on there. So the first thing that I wanna do is I want to go through and clean this up a little bit and what I’m gonna do first is for each one of these, I’m going to delete this intro part that tells a little bit about what this all is. And so I’m gonna just delete these cells, shift this up, and then I’m going to add a little company moniker just so I can keep track of where the different data is coming from. And this one is 23 and me. And so what I’m gonna do is just copy 23 down throughout the entire thing. I’ll make the columns just a little bit narrower because they don’t need to be so big.

And now my 23andme data is almost done. I’m gonna go through and I’m gonna color code everything so that it’s also visually easy to see where this information came from. For 23andme, I’m going to use this nice pink color on Ancestry. I just want to delete this opening part, shift the cells up. Now ancestors is interesting cuz they have their information divided out into two separate columns. So I actually need to do a function to combine these two together. And so what I’m gonna do is I’m going to insert a column and then I’m going to type in my formula that’s gonna combine these two and that is Concat. And then I type in these two cells or I just highlight them and you can see that that is the same thing. And then I copy that all the way down.

Now one thing I want to do here is right now there’s just a formula in there. I want to just have the actual value in there. So I’m going to copy this and then I’m going to paste the values of this. So now you can see when I go and look at that, it is actually showing just the values. It’s not showing a formula that allows me to delete these two columns and I can add my company and that’s going to be Ancestry. I’ll just do an for ancestry and then I can go through and I can hopefully find the bottom and I can color all of these and these are going to be Ancestry’s Green.

Now that I’ve colored all of them, I’m gonna actually copy those and I’m gonna come down here to the bottom of my 23 and me and I’m going to paste all of that information down there at the bottom. So now I have all of my ancestry and my 23 and me in one column and I can delete out the old ancestry. Let’s do the same thing for living DNA. And you can see I’m already up to over 20,000 rows total. So this is gonna get up probably into 30,000 maybe for my heritage. Same thing. Now again, if you’re doing this in Google Docs for some of the larger chromosomes, then it’s going to overwhelm Google Docs. And so you might want to do this in Excel or pages or open office, but the functions are basically going to be the same. The steps are gonna be the same that you have to do.

So there I can delete all these extra columns and now I have all of my information in one single column, but so far they’re not in order. So let’s actually get this in order. What I’m gonna do is I’m going to put a couple of headers here for some formulas that I’m gonna be using to sort out the stuff that are duplicates that I don’t need. And I’m just gonna bold this and then I’m going to freeze that first row so that way as I scroll through, it’s not going to go. So right now I have each company stacked on top of each other. What I want to do is I actually want to go and I want to sort everything by the position, cuz again, we’re talking about one chromosome. So when I sort by position, it’s gonna put everything in order, it’s there.

So, now it’s sort of by position. You can actually see that Living DNA has the first bunch of them before you start getting into some of these multicolored ones. And if you just scroll through, you can see that different companies actually focused sometimes on different areas of the chromosome, but for the most part they have a pretty good spread all throughout the chromosome. So our next step is we want to remove the duplicates that are all the same call. And by call what I mean is, I mean these two letters that indicate what that part of the DNA is. One thing you might notice if you’re scrolling down through is that there’s some patterns to how this is called. Now the computer really doesn’t care which letter comes first because it doesn’t know which chromosome that letter is from. So we actually need to standardize some of these things.

For instance, when I go and I take a look here, I can see that there is a two dashes or two zeros, which means it’s a no call, it couldn’t figure out what it was, an AA and AC and AG and AT. Then there’s a, well that ca is actually the same thing as an ac. And as I continue to go down, I can see that there’s a dd, which means a deletion or an I, which is an insertion, but then there’s ga, which is the same as a G gc, which is the same as CG and on down through it. So I actually wanna take some of these ones that are really the same as something else and I wanna swap them. So what I’m going to do is I’m gonna just highlight this column and I’m going to do a replace. So the first thing that I want to replace is I wanna replace this zero zero with a dash dash, and that means it’s going to be a no call.

I click on this and it says, Hey, it found 10 instances that it did this, and then I’m gonna do the last one, which is TG and replace it with gt and then I’m done. So now I’ve actually simplified what this is and I’ve made it easier for the computer to sort out whether or not it is identical. So we want to first remove all of the duplicates that are the exact same call. So we need to find them. And what that means is that means that this column matches this column. Now since these are in order, we’re really just looking for the one that’s right above it. So we’re gonna write a formula here which is going to be equal and it’s gonna be an if statement. So if this C2 is equal to c1, then we’re gonna do another if statement. And if D two is equal to D one, then we’re gonna call that a match and I’m just gonna use one as my match and then I can put zero as a no match. And so if C2 doesn’t match c1, then that is a no match as well.

I’m gonna copy this formula all the way down the spreadsheet and now I’ll easily be able to find every single one of my matches, all of the duplicates. So if I go and I remove just the zeros so that I’m only looking at the ones, these should all be my matches and I can go and I can just count up real quickly how many total matches there were. And there was almost 19,000 of them, which if you’re looking down here, there’s 41,000, 42,000 lines total. So 19,000 is almost half of all the lines are matches, which is really what we expect because again, each one of these companies, they use a set of snips and there’s some overlap between each of the companies. So we can go through now and we can delete all of these duplicates and I’m just gonna highlight all those rows and I’m gonna delete those rows and almost half of my file will be gone, but the half it’s gone is just the duplicates.

And there you go. It has deleted all that information. And now what we want to do is we want to show what is left. Now we’re gonna go through and we’re going to actually delete this formula because we’re gonna redo it, but all of the duplicates that had the exact same call have now been deleted. So the next thing we want to do is we want to go through and we want to look for any duplicates that are a no call from one company so that we keep the other company’s information. So for instance, let’s go and just take a look at our different no calls here. So we have a couple of no calls here from Living DNA and as we just scroll down, we have a place right here where there is a no call for My Heritage, but 23 and Me has a call. So in this case, we want to keep the 23 and Me information and get rid of my heritage information. Now again, we can do this by using a formula and it’s gonna be pretty similar. First thing we want to do is we’re gonna be looking at if, and we’re gonna look at same thing, the position and see if that position is equal to the position above it. And then we’re going to look at the no call. We’re going to see if this value is equal to that no call.

Then we want to put a one and then we can say zero for the others, and this will be zero as well. And so there’s our formula and we copy that all the way down through. Now remember the first time we ended up with 19,000 of these matches, for this one, it’s going to be a whole lot less because there’s not gonna be near as many no calls. So in this case there’s 537, which is not near as much as 19,000. We want to delete all of these. So I highlight all of them and then I’m just going to delete those rows and it’s all deleted now. So I can go back through and I can see this and what I’m gonna do is I’m going to delete all of these zeros. You can see that there’s still some places where there are some no calls, but that’s not going to hurt anything.

And so I’m just gonna leave those no calls as is right now. So we get onto our next step and that is we need to compare those same positions that actually have different calls. So let’s say for instance, one company calls it an AA and another company calls it an A. Well, in this case we have to decide which one we’re going to select. And you can do that in a couple of different ways. If you have multiple kits, three or four kits, you can actually see, do two companies call it one thing and one company calls it something else? Or if you wanna just choose, I’m gonna always choose Ancestry to be my primary one or 23 and me to be my primary one. That’s another way that you can do that. And so in this case, I’m just gonna go through and I’m going to choose 23andme to be my primary one. And so I’m gonna actually use a similar formula again to the last two. It’s going to be equals if, and I want to compare the position first, so C2 equal to c1, and then I just want to call it one because if those positions are the same, then it is a match that has a different call. And since I know that the 23 and Me one is gonna come before any of the other companies, I’m just gonna keep the 23 and me.

So there’s a one right there and there is a zero right there. And what this will do is now all of these ones will be ones that the companies had a different call, but I’m going to keep the 23 and me one. Let’s take a look at how many of these we have. So remember we started off with 19,000 duplicates, and then we had another 500 from the no calls. And now what we’re gonna see is how many of these ones we have and there’s about 225 of these ones. So I’m gonna go through and I’m going to delete all of these,

Deleting ’em the same way that I’ve deleted everything else, just highlighting all those rows and then deleting them there. And now it’s all deleted. I can go back and now what I should have is I should have everything that doesn’t have any kind of duplicate. And so each one of these is going to represent a unique snip all the way throughout the file and you can see that it’s combined with everything. Some companies are going to have more information in there, and in this case, 23andme is probably going to have the most because they were with the primary company that I chose. And as you’re looking through, you can actually see there’s very little of that orange, which was my heritage, and that’s because most of that, my heritage stuff ended up being duplicates on one of the other three companies kits. But at this point, I am ready now to delete my formula columns and if I want to, I can actually delete my company column as well. Because on this next step, what you’re going to do is remember, you’re doing this for each one of the chromosomes all on a different tab. We now need to get these all together. And so the first thing to do is to export it.

And in this case, what I want to do is I want to export it probably as a comma separated file or as a tab separated file. So what I wanna do is I just want to call this we’ll call it Crumb zone 21, and I’m gonna end up having 22 of these files in the end. So the next thing I wanna do is I want to go in, I want to open that file in Notepad. I don’t wanna open it in Word, I want it in Notepad or some sort of text editor. And you can see this is what it’s going to look like. It’s going to have the format that’s very similar to what that original file that you downloaded from each one of the company’s websites has. Then all I’m going to do is for every one of these, I’m going to highlight all of the chromosome 21.

I’m not gonna highlight the headers, and I’m gonna copy that and put that into a master file on Notepad so that way I can combine all 22 of those chromosomes into one file. And at that point you have a master file, save it as a txt file, and then you can upload that to GEDmatch. Now, why would you want to upload that into GEDmatch? Well, let me show you what the difference is. So right here what I did is I went into GEDmatch and I looked at my ancestry kit and on my ancestry kit I went through and I pulled out some of the matches that it had. Now, one of the important things that GEDMatch can tell you is the amount of overlap, and this is the number of snips that it is using to compare with some other kit. Now really, the more snips that it can compare, the more accurate it’s going to be as far as the match and as far as how much centimorgans and how close that match is to you.

So we can actually compare here between these two matches. I have my super kit and I have my ancestry kit, and you’ll notice that, hey, when we’re just looking at overlap, some of these, they don’t increase that much. Like for instance, this first one, it only increased by about 20,000 on the overlap. However, some of these increased a lot. This one down here went from 178,000 to 302,000. So it increased by 125,000 snips, almost twice as many s snips as what the other one had. One thing that you’ll notice is there was actually one match that my super kit was able to find that my ancestry kit did not find. So by having this super kit, you are able to find more matches because it has more information to be able to compare to. Another interesting thing is there was this one kit and I don’t know what company that it had been uploaded from, but it went from 134,000 on the overlap to 234,000 on the overlap.

But most importantly, it went from total of 12.40 centimorgans when comparing it to the ancestry kit to almost 21.60 cent of Morgans. And that’s really because that much overlap allowed more information to be shown. Now here’s one that the overlap didn’t actually have that much of a difference from borderline genetics. This kit had only 194,000 snips of overlap on the ancestry kit and it went double. It went up to almost 400,000 snips on the super kit. Yet the centimorgans for both of ’em is the exact same. So it’s not always going to increase the amount of shared DNA that you have with somebody, but in some cases it will just because of that snip set that the different kits use. So if you have a little bit of savvy with spreadsheets, I would encourage you to make yourself a super kit and upload it to GEDmatch and you’ll find that one, you’ll be able to actually get all of those matches without having to look at each one of the kits to find those one-offs that some of the other kits weren’t doing.

But two, your matches are going to be more accurate because you have more information. Just as a reference point, my super kit ended up with 1.5 million s snips total in the kit, whereas the most that I had before was my 23andme kit, which the version three I believe has just over 900,000 snips. So I had more than 50% more snips by combining all of my kits into one super kit. Now if you have any questions about how to make a super kit, put it in the comments below and I’ll try to answer it. And if you like this video, be sure to give it a thumbs up and share it with your friends.