I’m a big fan of using MTGO data to analyze deck performance. But one of the biggest issues with data available on the Wizards website is it’s incomplete. We only see 4-0/3-1 decks, we only see one event per day, and we have no idea what matchups happened on the way to those finishes. Although you can still use this public data to perform some interesting analysis at varying depths (see my past articles on Mono U Tron and Infect), the dataset limitations always leave us wanting more.
Working with a few friends of mine from the MTGSalvation Modern forums, we decided to take an in-depth snapshot of the current Modern MTGO metagame. To do this, we looked past the public data into the MTGO client itself, observing and recording dailies over the past few weeks. The result is almost a dozen events to dig through without any of the limitations we typically see in the public data. This dataset lets us not just look at winning decks but all decks, seeing which decks are performing well relative to every strategy in any given event. The statistical possibilities are almost endless, but this week we are going to use this data to get a sense of the top performing decks on MTGO. I’m sure you will see one of them coming, but another was a total surprise even to me.
Using some more statistics (didn’t think you’d seen the end of that, did you?), I’m going to look into the match win percentages of these decks to try to identify a few of Modern MTGO’s top decks. In doing so, I’m looking both for decks performing significantly above average and for some possible explanations of why this might be happening. I’ll also give a general rundown of the current Tier 1 and Tier 2 decks on MTGO, showing how their match win percentages stack up relative to the rest. So let’s dive into this rich dataset and see how we can use these numbers to improve our understanding of the format.
If you’ve ever read research articles for work or school (all those 30+ page PDFs you got assigned in college and then only read the abstract), you’ll know the importance of describing a dataset before we start. This is particularly important when we aren’t all looking over the dataset together. This is actually quite common in both the “hard” sciences and the “soft” ones, although it is relatively rare in Magic where most data is just a click away on a website (e.g. our Top Decks page). So to get us on the same page, here’s a bit of background on what we are looking at.
The data covers 11 MTGO events from 3/24 to 4/12. The average event size was about 70 players and there are roughly 750 decks captured, covering thousands of individual matches between those decks. The big difference between this dataset and those we have looked at in the past is the inclusion of 2-2 or worse decks, not just 4-0/3-1 ones.
This might be the most exciting Modern data I have gotten to work with in a while, but it still has its limitations. For one, this is by no means a complete population of all MTGO events in the time frame. It’s just 11 dailies over a 2-3 week period, which is actually less than an event per day. Although our deck and match Ns are very large, the event N is quite small, so we need to understand that in analyzing the numbers. Perhaps more importantly, it’s not even a random sample. Dailies were just observed whenever my partners and I had time, which is only semi-random at best. These limitations (and others I am sure you can think of; data entry errors are always at play in big, manually recorded datasets) do not undercut the value of what we are doing, but they do force us to consider their effect on our conclusions.
Also, as with all other articles of this kind, all statistics and data analysis disclaimers apply!
Tier 1 Deck Performance
One of the enduring problems of metagame statistics like the ones on our Top Decks page is they are almost entirely based on prevalence. They do not explicitly account for performance of those decks; a deck getting played a lot might be more likely to just win events eventually! Of course, that’s not saying any of those decks are bad. We wouldn’t keep seeing them time and time again if decks like Abzan, Twin, and Burn were “bad” decks. But prevalence is only one piece of the metagame story. To get to a deck’s performance, particularly for decks played less than the top ones, we need to dig deeper.
To try and get at this issue, I analyzed the match win percentage (MWP) of decks in the dataset. MWP is just the sum of a deck’s winning matches divided by the total number of matches the deck played. So a deck that went 4-0 on MTGO would have a 100% MWP, a 3-1 deck would have a 75% MWP, etc. We can aggregate this across events to get an overall MWP for any deck; a build with 50 match wins out of 100 matches played would have a 50% overall MWP. We can even augment this statistic with game win percentage (GWP) to see how many games the deck is winning as well. After all, a deck that goes 4-0 off of 2-1, 2-1, 2-1, and 2-1 records has a lower GWP than a deck that does the same off of four 2-0 matches. But they would both have that 100% MWP, which is why it is helpful to have both.
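For anyone who wants to run the same numbers at home, here is a minimal Python sketch of the two definitions above. The `MatchResult` class and both function names are made up for illustration; they are not the format of the actual dataset.

```python
from dataclasses import dataclass


@dataclass
class MatchResult:
    """One best-of-three match from a single deck's point of view (hypothetical format)."""
    games_won: int
    games_lost: int

    @property
    def is_win(self) -> bool:
        return self.games_won > self.games_lost


def match_win_pct(matches):
    """MWP: matches won divided by matches played, as a percentage."""
    return 100.0 * sum(m.is_win for m in matches) / len(matches)


def game_win_pct(matches):
    """GWP: games won divided by games played, as a percentage."""
    won = sum(m.games_won for m in matches)
    played = sum(m.games_won + m.games_lost for m in matches)
    return 100.0 * won / played


# Two different 4-0 finishes: four 2-1 matches versus four 2-0 matches.
close_sweep = [MatchResult(2, 1) for _ in range(4)]
clean_sweep = [MatchResult(2, 0) for _ in range(4)]

for name, matches in (("close sweep", close_sweep), ("clean sweep", clean_sweep)):
    print(f"{name}: MWP {match_win_pct(matches):.1f}%, GWP {game_win_pct(matches):.1f}%")
```

Both sweeps come out at a 100% MWP, but the close sweep only wins 8 of its 12 games, which is why the two statistics are worth reading together.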
If you were worried there weren’t enough statistics here, fear not my loyal readers! I’m also going to give the corresponding statistical significance (p value) for both the GWP and the MWP, just to make sure the number isn’t within expected variance. After all, even a fair coin may flip 3 heads and 7 tails in just 10 trials, and Modern MTGO is much more complicated than that. To check this, I compare the GWPs and MWPs to the weighted average of GWPs/MWPs for all decks in the dataset. This gives us a sense as to whether a deck is actually under/over-performing relative to the average. Remember we are looking for small P values. Anything less than .10 is interesting and anything less than .05 should really catch our attention.
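One simple way to run this kind of check is a two-sided exact binomial test against a baseline win rate. The sketch below is an illustration under that assumption, not necessarily the exact test behind the tables here; the function names and the sample records are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k wins in n matches at win rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_p(wins, matches, baseline):
    """Exact two-sided P value: total probability of every outcome
    that is no more likely than the observed one under the baseline."""
    observed = binom_pmf(wins, matches, baseline)
    total = 0.0
    for k in range(matches + 1):
        prob = binom_pmf(k, matches, baseline)
        if prob <= observed * (1 + 1e-9):  # small tolerance for float ties
            total += prob
    return min(1.0, total)


def weighted_baseline(records):
    """Average win rate weighted by matches played: total wins / total matches.
    `records` maps a deck name to a (wins, matches) pair."""
    wins = sum(w for w, _ in records.values())
    matches = sum(n for _, n in records.values())
    return wins / matches


# The fair coin from the text: 3 heads in 10 flips.
print(f"P = {two_sided_binom_p(3, 10, 0.5):.3f}")
```

For the coin, the P value lands around 0.34, nowhere near the .10 or .05 thresholds, which is exactly the point: 3 heads in 10 flips is ordinary variance.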
Finally, note GWPs/MWPs are both adjusted to account for byes, mirror matches, drops, splits, and other MTGO quirks skewing our dataset.
With the stat-speak out of the way, it’s time to dive right in. Here are the results of this analysis for the Tier 1 decks on our Top Decks page, as of the 3/1 – 4/1 metagame period. I’m also putting in an entry for the MTGO-wide averages, just as a point of comparison. The table’s default sort is on MWP.
One of the biggest challenges in looking at average is in understanding variance. Those P values help us in that respect. We shouldn’t look at UR Twin and think “below average?? WORST TIER 1 DECK ALERT” or something similar. Instead, we should look at all of those P values and see all of the tier 1 decks are well within the expected performance range for any deck in the format. Infect is a bit higher (which I expected, given my analysis in a previous article), but it’s not that much higher than the rest.
When I see the tier 1 decks, I don’t see decks above or below an MTGO-wide average. I see decks that get a lot of play, which means they attract a lot of different players. Masters will pilot these decks to 4-0 finishes and MTGO newcomers will pick them up and flop out at 0-2. With tier 1 status comes a lot of profile, which means a lot of people are going to be trying these decks and bringing the GWP/MWP both up and down.
This all suggests the tier 1 decks are actually pretty similar in terms of their performance. Infect is a bit higher, but overall they are in the same performance band. This is more or less what we should expect of these decks. You can win with them, you can lose with them, and you can get everywhere in between with them too. Because Infect does not have a significant P value (it doesn’t even have that large of an N), I also am unwilling to conclude Infect is the “best” of that bunch. They are all strong choices in their own right, depending on playstyle, metagame, preference, etc.
Tier 2 Deck Performance
In moving to the tier 2 decks, our biggest concern starts to be N. With the exception of Grixis Delver, which is more popular on MTGO right now than Ring Pops were in my youth, these decks do not have a lot of representatives. Lower representation can polarize GWPs/MWPs and makes every individual showing count more. We are more likely to see extreme win percentages with fewer decks in the sample, so a striking result could be a fluke, but that doesn’t mean we should discount these decks entirely. Sample size is just one of many factors to consider, and if our quantitative conclusions match our qualitative knowledge and suspicions, a small N might matter much less.
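A quick way to see why small samples polarize win percentages is the standard error of a proportion, sqrt(p(1-p)/n). The sketch below assumes match results are independent with a true 50% win rate, which is a simplification, and the match counts are illustrative rather than taken from the dataset.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of an observed win rate, given true rate p over n matches."""
    return sqrt(p * (1 - p) / n)


# Illustrative match counts only (not from the dataset).
for n in (25, 100, 400):
    se = standard_error(0.5, n)
    print(f"{n:>3} matches: MWP typically swings about +/- {100 * se:.1f} points")
```

At 25 matches a perfectly average deck can easily show up 10 points above or below 50%, while at 400 matches that swing shrinks to about 2.5 points.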
Here are the tier 2 decks laid out in the same way as the tier 1 decks were. MWP is still the default sorting variable, but pay attention to those P values, because we have some winners in this batch.
|Deck|N|GWP|GWP P value|MWP|MWP P value|
|---|---|---|---|---|---|
|Temur Twin|7|58.98%|0.485|66.67%|0.085 *|
|Abzan Liege|18|54.34%|0.436|58.46%|0.079 *|
|Amulet Bloom|25|55.78%|0.289|56.82%|0.075 *|
Nothing excites statisticians quite like low P values (except maybe beta values…mmm). It’s a rare thrill to come across significant P values in social science analyses like these. Abzan Liege, Amulet Bloom, and Temur Twin really deliver.
Before we jump into these three decks, let’s take a look at the overall picture. Although a few N’s are so small we almost have to throw them out (poor Blue Moon on MTGO), most aren’t so tiny that we can ignore their GWPs/MWPs entirely. A lot of these decks have a respectable number of finishes, certainly enough to analyze with different statistical sampling distributions. Grixis Delver, for example, appears very average in this dataset, despite having a ton of playtime online. Indeed, it has very similar values to those in the tier 1 table, which might suggest it’s just as viable on MTGO as any of the big guns like Abzan, Twin, and Burn. A number of other decks are also performing in that expected performance band, except maybe Bogles, which is a little lower, and Scapeshift, which is a lot lower (but still not significantly so). Overall, although I would like to see more representatives of many of these decks, I still think we can look at this and conclude most of the tier 2 decks are pretty viable. They aren’t too different from the heavy hitters in tier 1, and they aren’t too different from each other.
But three decks actually are pushing past the average performance of MTGO decks. Yeah, they don’t quite clear the magical (i.e. semi-arbitrary) 95% significance cutoff we want to see with P < .05, but they are close enough to catch our eye. I want to talk about these decks one-by-one, both to unpack their stats and to give some thoughts on why they are doing so well.
Temur Twin (MWP = 66.67%, P = .085)
I guess good things do come to those who just add Goyf to their deck. Let’s start with the bad news for Temur Twin fans out there; of the three decks I’m highlighting in tier 2, this is the one with by far the most alarming N. We’re only looking at 7 decks spanning just over 25 matches. It’s not really enough data to normalize the effects of luck and chance, even in a dataset we know has a small N overall. So we need to remember this when figuring out why RUG Twin might be good.
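To put a rough number on that uncertainty, here is a sketch of a 95% Wilson score interval for a win rate. The 18-9 record is hypothetical: it is simply one record consistent with a 66.67% MWP over just over 25 matches, not a figure from the dataset.

```python
from math import sqrt


def wilson_interval(wins, matches, z=1.96):
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95% coverage)."""
    p_hat = wins / matches
    denom = 1 + z**2 / matches
    center = (p_hat + z**2 / (2 * matches)) / denom
    half_width = z * sqrt(
        p_hat * (1 - p_hat) / matches + z**2 / (4 * matches**2)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical 18-9 record (assumption, see above).
low, high = wilson_interval(18, 27)
print(f"Plausible MWP range: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the interval runs from a little below 50% to a little above 80%, which is wide enough to include a perfectly average deck.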
If Temur Twin is in fact a good deck, and not just a false positive because of a small N, I think a big part of this is actually the prominence of UR Twin itself. You can’t leave home without anti-Twin tech in your sideboard and an anti-Twin game 1 plan in your maindeck. Most players also know how to respect the combo and play with it in mind. Tarmogoyf messes with that calculus. A lot of the stuff that works well in fighting stalled Twin board states doesn’t do much when facing down the 4/5 Goyf on turn 3. Choke is a great card to land on turn 2-3 against Twin, but it isn’t helpful at shutting down Magic’s most efficient beatstick. It’s easy to misstep against this deck, which can win from so many angles, which might make it a great choice for MTGO where player experience can vary widely. Another possible explanation is in the sideboard, with cards like Ancient Grudge and/or Nature’s Claim, which are very strong in a metagame that is increasingly Affinity. I’m not totally sold RUG Twin would keep its status if we added more matches to the dataset, but if it did, it would likely be due to these reasons.
Abzan Liege (MWP = 58.46%, P = .079)
If a deck designed to target the current metagame didn’t make this list, all of us would feel very misled by the innovators over at Team Face to Face. But Abzan Liege remains a metagame predator today as much as it was at the PT, reflected in its 58.46% MWP and a P value that almost reaches the coveted .05 cutoff. The deck’s matchups also reflect its suitability to our current metagame. Among its approximately 65 matches, it went 4-2 vs. Affinity, 8-5 vs. Burn, 4-2 against UR Twin, 3-0 vs. Junk, and 3-3 against Grixis Delver, which are exactly the kind of records you want to see against top MTGO decks. The deck appears to do less well against weirder, off-beat decks (e.g. Norin the Wary, Smallpox Loam, 4C Gifts, etc.), and is only average against the faster, linear decks like Amulet Bloom, Infect, Bogles, etc. But because MTGO is so dominated by format mainstays like Affinity, Burn, Twin, and Grixis Delver (on MTGO at least), it looks like a great deck to hedge your bets on.
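If you want to roll those matchup records into a single number, the arithmetic is easy to script. The records below are the ones quoted above; the dictionary layout and function names are just one illustrative way to organize them.

```python
# Abzan Liege records against the top MTGO decks, as (wins, losses).
liege_records = {
    "Affinity": (4, 2),
    "Burn": (8, 5),
    "UR Twin": (4, 2),
    "Junk": (3, 0),
    "Grixis Delver": (3, 3),
}


def win_pct(wins, losses):
    """Match win percentage for a single win-loss record."""
    return 100.0 * wins / (wins + losses)


def combined_record(records):
    """Sum wins and losses across every matchup."""
    wins = sum(w for w, _ in records.values())
    losses = sum(l for _, l in records.values())
    return wins, losses


for opponent, (w, l) in liege_records.items():
    print(f"vs. {opponent}: {w}-{l} ({win_pct(w, l):.1f}%)")

total_wins, total_losses = combined_record(liege_records)
print(f"Combined: {total_wins}-{total_losses} ({win_pct(total_wins, total_losses):.1f}%)")
```

Taken together that is 22-12, or about 64.7%, across 34 of the deck’s roughly 65 matches.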
A big reason for the deck’s success is the MTGO metagame’s predictability. All those decks Abzan Liege beats or breaks even with are everywhere online, so between your maindeck and your sideboard, you are ready for a big chunk of the format. The deck’s core also has a surprising degree of customization to fill in holes against more erratic matchups like Burn (which can be die-roll dependent) and Twin (which can hinge on drawing/not drawing a critical removal or disruption spell). But unlike with other customizable decks, like Abzan itself, this flexibility doesn’t come at the cost of faster starts and aggression, which means you are balancing a proactive, focused gameplan with some reactive elements. So long as the MTGO metagame looks like it does now, this deck is going to remain a safe choice.
Amulet Bloom (MWP = 56.82%, P = .075)
On the one hand, this deck did not turn out to be some unstoppable MTGO force as I feared it might when I started the analysis. On the other hand, Amulet Bloom is still just a really good deck. Overly reactive decks just get crushed in this matchup: 3-0 vs. Scapeshift, 3-0 vs. UWR Control, 3-0 vs. Esper Control, etc. Decks that try to play too fairly also have big trouble here. Grixis Delver looks like a massacre, with Amulet Bloom going 5-1 against that deck. But Jund and Junk remain formidable policing presences against Amulet Bloom. The deck was 2-4 against Jund and 0-4 against Junk. Merfolk also gave Amulet Bloom a lot of trouble, combining hyper-efficient disruption with a fast clock to cause Bloom to go 1-5 against the fish-squad.
Despite its troubles against decks that are supposed to be good against combo anyway, Amulet Bloom remains a great deck. With the exception of perhaps Infect, no other top-tier deck punishes opponents harder for lacking interaction or not playing with disruption ready. And Bloom is actually harder than Infect to interact with in the first place. Even when you are trying to interact with Amulet Bloom, a few factors make this difficult. For one, stuff that is good against Amulet is not always relevant in the broader metagame. As an example, early Blood Moon can be great against this deck, but the kind of deck that packs Moon along with acceleration to power it out and/or disruption to back it up is not very good on MTGO right now. Some Twin builds are doing this, but with Grixis Delver, Burn, and Affinity having such high metagame shares, Moon isn’t always the card you want to rely on.
A second factor at play in beating Amulet is experience. We saw countless examples of this at Pro Tour Fate Reforged, where the Twitch stream exploded as a pro player made the wrong removal, countermagic, or discard decision against the deck. Thoughtseize misplays were the most heart-wrenching. Sure, fatigue and expected gameplay errors played a role in this, but a big piece of it was also deck knowledge. This is magnified on MTGO, where player experience is much more variable than on the PT. Amulet really takes advantage of this, both by punishing your deck choice (which itself might be ill-suited for Amulet interaction), and then again by punishing any inexperience you have going into the match.
A final factor here is relative metagame share. The more Amulet you see, the more likely you are to face anti-Amulet cards or players who have experience beating the deck. But when Amulet declines, the cards to beat it might decline alongside, and the collective MTGO experience declines as well. In many respects, this makes Amulet a kind of Dredge of Modern (although that’s not entirely fair because Modern seems to have a LOT of “Dredges of Modern” these days). Of course, this leads us to some interesting questions and some even more interesting answers. Why is Amulet declining if it’s so good? Are people preparing better for it? Are worse players trying and failing at the deck, leaving it in the hands of masters? Is it just luck? If Amulet were some 8%+ format monster, like Storm in the Seething Song days, then this MWP and its significance would be more straightforward. But because Amulet is seeing a declining share in both paper and MTGO, we need to think of reasons why the deck is both losing its share and still maintaining a strong performance.
My initial draft of this article had another section on some other high (and low!) performing decks that were not tier 1 or tier 2. But because the dataset is so rich and this kind of information so interesting, it makes more sense to split that off into a followup article. That way, we can spend as much time unpacking those decks as we did unpacking these, and we can even revisit some of our conclusions here to see how they fit in the broader metagame context.
Open question until then: what are some other dynamics that can affect MWP and its significance? How should these factors affect our analysis, conclusions, and actions based on those conclusions? As a conversation starter, we already know the most-played decks in the format are not those with the highest MWPs. But we also know that many of those high MWP decks aren’t overperforming at higher levels. This is especially true in paper, where a deck like Amulet Bloom has all but disappeared from the paper scene in the past few months. Having just updated the Top Decks page with events from the past two weeks, I can tell you this deck is basically nowhere. How can we reconcile these potentially contradictory numbers? How would we apply that to other decks? I’ll touch on some other answers to this in next week’s article.
Until next time, keep thinking about MTGO data and how we can use different analytic tools to improve your Modern game.