Shared cM Project 2020 Analysis, Comparison & Handy Reference Charts

DNAeXplained – Genetic Genealogy


Recently, Blaine Bettinger published V4 of the Shared cM Project, and along with that, Jonny Perl at DNAPainter updated the associated interactive tool as well, including histograms. I wrote about that, here.

The goal of the Shared cM Project was and remains to document how much DNA can be expected to be shared by various individuals at specific relationship levels. This information allows matches to at least minimally “position” themselves in a general location in their trees or, conversely, to eliminate specific potential relationships.

Shared cM Project match data is gathered by testers submitting their match information through the submission portal, here.

When the Shared cM Project V3 was released in September 2017, I combined information from various sources and provided an analysis of that data, including the changes from the V2 release in 2016.

I’ve done the same thing this year, adding the new data to the previous release’s table.

Compiled Comparison Table

I initially compiled this table for myself, then decided to update it and share with my readers. This chart allows me to view various perspectives on shared data and relationships and in essence has all the data I might need, including multiple versions, in one place. Feel free to copy and save the table.

In the comparison table below, the relationship rows with data from various sources are shown as follows:

I don’t know if DNA Detectives still uses the “green chart” or if they have moved to the interactive DNAPainter tool. I’ve retained the numbers for historical reference regardless.

Additionally, in some places, you’ll see references to the “degree of relationship,” as in “third degree relatives always match each other.” I’ve included a “Degree of Relationship” column to the far right, but I don’t come across those “relationship degree” references often anymore either. However, it’s here for reference if you need it.

23andMe still gives relationships in percentages, so I’ve included the expected shared percent of DNA for each relationship and the actual shared range from the DNA Detectives Green Chart.

One column shows the expected shared cM amount, assuming that 50% of the DNA from each ancestor is passed on in each generation. Clearly, we know that inheritance doesn’t happen that cleanly because recombination is a random event and children do NOT inherit exactly half of each ancestor’s DNA carried by their parents, but the average should be someplace close to this number.
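The halving arithmetic behind that expected column can be sketched in a few lines. This is a minimal illustration, not the method used to build the table: the 6800 cM total is an assumption inferred from the 3400 cM parent/child figure used in this article (vendors report slightly different totals), and the function names are hypothetical.

```python
# Assumption: total comparable autosomal DNA of about 6800 cM, inferred from
# the 3400 cM (50%) parent/child figure in this article. Vendors differ.
TOTAL_CM = 6800.0


def expected_fraction(degree: int) -> float:
    """Expected shared fraction when DNA halves with each degree of relationship."""
    if degree < 1:
        raise ValueError("degree of relationship must be 1 or greater")
    return 0.5 ** degree


def expected_cm(degree: int, total_cm: float = TOTAL_CM) -> float:
    """Expected shared cM for a given degree of relationship."""
    return total_cm * expected_fraction(degree)


if __name__ == "__main__":
    # Degree 1 = parent/child, 2 = grandparent, 3 = first cousin, 5 = second cousin
    for degree in (1, 2, 3, 5):
        print(degree, f"{expected_fraction(degree):.3%}", expected_cm(degree))
```

These are averages only. Because recombination is random, real matches scatter around these values, which is exactly why the project's reported ranges matter.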

shared cm table 2020


The first thing I noticed about V4 is that there is a LOT more data which means that the results are likely more accurate. V4 increased by 32K data points, or 147%. Bravo to everyone who participated, to Blaine for the analysis and to Jonny for automating the results at DNAPainter.

Methods

Blaine provided his white paper, here, which includes “everything you need to know” about the project, and I strongly encourage you to read it. Not only does this document explain the process and methods, it’s educational in its own right.

On the first page, Blaine discusses issues. Any time you are crowdsourcing information, you’re going to encounter challenges and errors. Blaine did remove any entries that were clearly problematic, plus an additional 1% of all entries for each category – 0.5% from each end, meaning the largest and smallest entries. This was done in an attempt to remove the results most likely to be erroneous.

Known issues include:

Challenges aside, the Shared cM Project provides genealogists with a wonderful opportunity to use the combined data of tens of thousands of relationships to estimate and better understand the relationship range of our matches.

The Shared cM Project in combination with DNAPainter provides us with a wonderful tool.

Histograms

When analyzing the data, one of the first things I noticed was a very unusual entry for parent/child relationships.

We all know that children inherit exactly half of each parent’s DNA. We expect to find an amount in the ballpark of 3400 cM, give or take a bit for normal variances like read errors or reporting differences.

Shared cM parent child.png


I did not expect to see a minimum shared cM amount for a child/parent relationship at 2376, fully 1024 cM below the expected value of 3400 cM. Put bluntly, that’s simply not possible. You cannot live without nearly one third of one parent’s DNA. If this data is actually accurate from someone’s account, please contact me because I want to see this phenomenon.

I reached out to Blaine, knowing this result is not actually possible, wondering how this would ever get through the quality control cycle at any vendor.

After some discussion, here’s Blaine’s reply:

If you look at the histogram, you’ll see that those are most likely outliers. One of my lessons for the ScP (Shared cM Project) lately is that people shouldn’t be using the data without the histograms.

People get frustrated with this, but I can’t edit data without a basis even if I think it doesn’t make sense. I have to let the data itself decide what data to remove. So I removed 1% from each relationship, the lowest 0.5% and the highest 0.5%. I could have removed more, but based on the histograms, [removing] more appeared to be removing too much valid data. As people submit more parent/child relationships these outliers/incorrect submissions will be removed. But thankfully using the histograms makes it clear.

Indeed, if you look on page 23 on Blaine’s white paper, you’ll see the following histogram of parent/child relationships submitted.

shared cm histogram.png


Keep in mind that Blaine already removed any obvious errors, plus 1% of the total, 0.5% from each end of the spectrum. In this case, he utilized 2412 submissions, so he would have removed about 24 entries that were even further out on the data spectrum.
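To make the trimming step concrete, here is a minimal sketch of removing the lowest and highest 0.5% of a set of submissions. The function name is hypothetical, and rounding the per-end count down is my assumption; the white paper is the authority on exactly how the cut was made.

```python
def trim_extremes(values, fraction_per_end=0.005):
    """Drop the smallest and largest `fraction_per_end` of the values.

    Assumption: the number removed from each end is rounded down.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction_per_end)  # entries removed from each end
    return ordered[k:len(ordered) - k]


if __name__ == "__main__":
    # Placeholder data standing in for 2412 submissions, not real project data.
    submissions = list(range(2412))
    kept = trim_extremes(submissions)
    print(len(submissions) - len(kept))  # number of entries removed
```

With 2412 entries this removes 12 from each end, 24 in total, in line with the estimate above. Any mistyped value that is not among the most extreme 0.5% on its side survives the cut, which is why the histogram still shows stragglers.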

On the chart above, we can see that a total of about 14 are still really questionable. It’s not until we get to 3300 that these entries seem feasible. My speculation is that these people meant to type 3400 instead of 2400, and so forth.

shared cm parent grid.png


The great news is that Jonny Perl at DNAPainter included the histograms so you can judge for yourself if you are in the weeds on the outlier scale by clicking on the relationship.

shared cm parent submissions.png


Other relationships, like this niece/nephew relationship, fit the expected bell-shaped curve very nicely.

shared cm niece.png

Of course, this means that if you match your niece or nephew at 900 cM instead of the range shown above, that person is probably not your full niece or nephew – a revelation that may be difficult because of the implications for you, your parent and sibling. This would suggest that your sibling is a half sibling, not a full sibling.

Entering specific amounts of shared DNA and outputting probabilities of specific relationships is where the power of DNAPainter enters the picture. Let’s enter 900 cM and see what happens.

shared cm half niece.png

That 900 cM match is likely your half niece or nephew. Of course, this example illustrates perfectly why some relationships are entered incorrectly – especially if you don’t know that your niece or nephew is a half niece or nephew because your sibling is a half-sibling instead of a full sibling. Some people, even after receiving results, don’t realize there is a discrepancy, either because their data is on the boundary, with various relationships being possible, or because they don’t understand or internalize the genetic message.

shared cm full siblings.png


This phenomenon probably explains the low minimum value for full siblings, because many of those full siblings aren’t. Let’s enter 1613 and see what DNAPainter says.

shared cm half sibling.png

You’ll notice that DNAPainter shows the 1613 cM relationship as a half-sibling.

shared cm sibling.png

And the histogram indeed shows that 1613 would be the outlier. Being larger than 1600, it would appear in the 1700 category.
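The binning logic implied here can be sketched as follows. I am assuming the histogram uses 100 cM wide bins labeled by their upper edge, which is how the 1613 example reads; the function name is hypothetical.

```python
import math


def histogram_bin_label(shared_cm, bin_width=100):
    """Return the upper-edge label of the bin a shared cM value falls into.

    Assumption: bins are `bin_width` cM wide and labeled by their upper edge,
    so anything above 1600 and up to 1700 lands in the "1700" category.
    """
    return math.ceil(shared_cm / bin_width) * bin_width


if __name__ == "__main__":
    print(histogram_bin_label(1613))
```

Under that assumption, 1613 cM is counted in the 1700 category.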

shared cm half vs full.png


Accurately discerning close relationships is often incredibly important to testers. In the histogram chart above, you can see that the blue and orange histograms plotted on the same chart show that there is only a very small amount of overlap between the two histograms. This suggests that some people, those in the overlap range, who believe they are full siblings are in reality half-siblings, and possibly, a few in the reverse situation as well.

What Else is Noteworthy?

First, some relationships cannot be differentiated or sorted out by using the cM data or histogram charts alone.

shared cm half vs aunt.png


For example, you cannot tell the difference between half-siblings and an aunt/uncle relationship. In order to make that determination, you would need to either test or compare to additional people or use other clues such as genealogical research or geographic proximity.

Second, the ranges of many relationships are wider than they were before. Often, we see the lows being lower and the highs being higher as a result of more data.

shared cm low high.png


For example, take a look at grandparents. The expected relationship is 1700 cM, the average is 1754 which is very close to the previous average numbers of 1765 and 1766. However, the minimum is now 984 and the new maximum is 2462.
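As a quick illustration of how those grandparent figures relate to one another, the sketch below stores the four numbers quoted above and computes how far the average sits from the expected value. The class and method names are hypothetical, and falling inside a range never confirms a relationship on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipStats:
    """Summary figures for one relationship, in cM."""
    expected_cm: float
    average_cm: float
    min_cm: float
    max_cm: float

    def average_offset_pct(self) -> float:
        """How far the reported average sits from the expected value, in percent."""
        return (self.average_cm - self.expected_cm) / self.expected_cm * 100

    def contains(self, shared_cm: float) -> bool:
        """True if a match amount falls inside the reported min-max range."""
        return self.min_cm <= shared_cm <= self.max_cm


# Grandparent figures quoted in this article for V4.
grandparent = RelationshipStats(expected_cm=1700, average_cm=1754,
                                min_cm=984, max_cm=2462)

if __name__ == "__main__":
    print(round(grandparent.average_offset_pct(), 1))
    print(grandparent.contains(900))
```

The average is within a few percent of the expected value even though the reported minimum and maximum are far apart, which is why the average and the histogram tell you more than the extremes do.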

Why might this be? Are ranges actually wider?

Blaine removed 1% each time, which means that in V3, 6 results would have been removed, 3 from each end, while 11 would be removed in V4. More data means that we are likely to see more outliers as entries increase, with the relationship ranges increasingly likely to overlap on the minimum and maximum ends.

Third, it’s worth noting that several relationships share an equal expected amount of DNA: 12.5%, which equals 850 cM, in this example.

shared cm 4 relationships.png


These four relationships appear to be exactly the same, genetically. The only way to tell which one of these relationships is accurate for a given match pair, aside from age (sometimes) and opportunity, is to look at another known relationship. For example, how closely might the tester be related to a parent, sibling, aunt, uncle or first cousin, or one of their other matches. Occasionally, an X chromosome match will be enlightening as well, given the unique inheritance path of the X chromosome.

Additional known relationships help narrow unknown relationships, as might Y DNA or mitochondrial DNA testing, if appropriate. You can read about who can test for the various kinds of tests, here.

Fourth, it’s been believed for several years that all 5th degree relatives, and closer, match, and the V4 data confirms that.

shared cm 5th degree.png


There are no zeroes in the column for minimum DNA shared, 4th column from right.

5th degree relatives include:

Fifth, some of your more distant cousins won’t match you, beginning with 6th degree relationships.

shared cm disagree.png


At the 6th degree level, the following relationships may share no DNA above the vendor matching threshold:

You’ll notice that the various reporting models and versions don’t always agree, with earlier versions of the Shared cM Project showing zeroes in the minimum amount of DNA shared.

Sixth, at the 7th degree level, some number of people in every relationship class don’t share DNA, as indicated by the zeros in the Shared cM Minimum column.

shared cm 7th degree.png


The more generations back in time that you move, the fewer cousins can be expected to match.

shared cm isogg cousin match.png

This chart from the ISOGG Wiki Cousin statistics page shows the probability of matching a cousin at a specific level based on information provided by testing companies.

Quick Reference Chart Summary

In summary, V4 of the Shared cM Project confirms that all 2nd cousins can expect to match, but beyond that in your tree, cousins may or may not match. I suspect, without evidence, that the further back in time that people are related, the less likely it is that the proper “cousinship level” is reported. For example, it would be easier to confuse 7th and 8th cousins than 1st and 2nd cousins. Some people also confuse 8th cousins with 8 generations back in your tree. The two are not equivalent.

shared cm eighth cousin.png


It’s interesting to note that Degree 17 relatives, 8th cousins, 9 generations removed from each other (counting your parents as generation 1), still match in some cases. Note that some companies and people count you as generation 1, while others count your parents as generation 1.
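The relationship between cousin level, generations and degree can be sketched with two one-line formulas. This assumes full cousins with no "removed" generations and counts your parents as generation 1, as above; the function names are hypothetical.

```python
def generations_to_common_ancestors(cousin_level: int) -> int:
    """Generations back to the shared ancestors, counting parents as generation 1."""
    return cousin_level + 1


def degree_of_relationship(cousin_level: int) -> int:
    """Degree of relationship for full cousins with no removal."""
    return 2 * cousin_level + 1


if __name__ == "__main__":
    for level in (1, 2, 5, 8):
        print(level,
              generations_to_common_ancestors(level),
              degree_of_relationship(level))
```

For 8th cousins this gives 9 generations and degree 17, matching the figures above, while 2nd cousins come out at degree 5. It also shows why 8th cousins and "8 generations back" are not the same thing.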

The estimates of autosomal matching reaching 5 or 6 generations back in time, meaning descendants of common 4 times great-grandparents will sometimes match, is accurate as far as it goes, although 5-6 generations is certainly not a line in the sand.

It would be more accurate to state that:

I created this summary chart, combining information from the ISOGG chart and the Shared cM Project as a handy quick reference. Enjoy!

shared cm quick reference.png


Disclosure

I receive a small contribution when you click on some of the links to vendors in my articles. This does NOT increase the price you pay but helps me to keep the lights on and this informational blog free for everyone. Please click on the links in the articles or to the vendors below if you are purchasing products or DNA testing.

Thank you so much.
