Settle in for a long read. Over the coming weeks you will be bombarded by shorter, snappier pieces about a controversy inflaming the front where evolutionary and social psychology meet. I’ve touched on this controversy already, and promised you more. Here’s that more, in 2,300 words of detail … rather too long for a column, I know.
Still with me? Thanks.
In June I wrote a column about the idea that women’s preferences for attractive, masculine and dominant men peak around the time conception is most likely. What particularly intrigued me was the fact that two recent meta-analyses had weighed the published and unpublished evidence for shifts in preferences across the ovulatory cycle, and reached dramatically different conclusions.
One meta-analysis, published in Emotion Review by Wendy Wood, Laura Kressel, Priyanka Joshi and Brian Louie at the University of Southern California, found no overall support for shifting preferences across the cycle. The other, published in Psychological Bulletin by Kelly Gildersleeve, Martie Haselton and Melissa Fales at UCLA, found “robust cycle shifts” when women were asked to assess men as short-term mates, but not when assessing long-term partners.
At the time I promised a column about why the two teams reached such different conclusions, once each had had the chance to criticise the other in press. That round of criticism is now complete, and today I will try to make sense of the differences.
In addition to the two meta-analyses, which I will call the “Wood meta-analysis” and the “Gildersleeve meta-analysis”, three commentaries have been published in the latest edition of Psychological Bulletin:
- Christine R. Harris, Harold Pashler and Laura Mickes write under the title “Elastic analysis procedures: An Incurable (but Preventable) Problem in the Fertility Effect Literature”
- Wood and Lucas Carden comment on the Gildersleeve meta-analysis under the title “Elusiveness of Menstrual Cycle Effects on Mate Preferences”
- A reply from Gildersleeve, Haselton & Fales under the title “Meta-analyses and P-curves Support Robust Cycle Shifts in Women’s Mate Preferences”.
I will refer to these as the Harris Commentary, Wood Commentary and Gildersleeve Reply.
These five papers alone constitute tens of thousands of words, and there are other commentaries and contributions from various corners of academia. I’m going to try to extract a handful of the key points here, but be aware this is a complex scientific disagreement that is only going to get more intriguing, rich and controversial.
The size of the cake and how you slice it
Almost two decades of research have seen the publication of a substantial number of studies showing evidence that women’s mating preferences shift when women are at their most fertile. But others have failed to find the same effects. And there is a tacit sense among researchers that the negative results are harder to publish than positive ones. If this is true then researchers’ file-drawers should be bulging with unpublished negative results.
Meta-analysis, as an approach, recognises that publication bias happens in all fields of research. One way around this issue is to seek out unpublished studies from a wide range of research groups. There are also statistical methods to explore the extent of likely publication bias.
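To give a feel for what such a statistical check looks like, here is a minimal sketch of Egger’s regression test for funnel-plot asymmetry, one common symptom of publication bias. The effect sizes and standard errors below are invented, and deliberately constructed so that smaller, noisier studies report bigger effects:

```python
import numpy as np
from scipy import stats

# Made-up effect sizes (d) and standard errors from six hypothetical studies,
# constructed so that smaller (higher-SE) studies report larger effects.
se = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30])
d = 0.2 + 1.0 * se  # a clean bias pattern, purely for illustration

# Egger's test: regress the standardised effect (d / se) on precision (1 / se).
# The slope estimates the underlying effect; an intercept far from zero is a
# signature of funnel-plot asymmetry, i.e. possible publication bias.
result = stats.linregress(1 / se, d / se)
print(result.slope, result.intercept)  # here: slope = 0.2, intercept = 1.0
```

With unbiased literatures the intercept hovers near zero; here the non-zero intercept flags the small-study effect we built into the fake data.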
The Wood and Gildersleeve meta-analyses share 32 published and 5 unpublished reports. Wood included a further 10 published and 8 unpublished reports not included in the Gildersleeve meta-analysis, and Gildersleeve included 4 published papers and 6 unpublished reports unique to their meta-analysis. So, already, the body of evidence is overlapping but not identical.
The different conclusions of the two meta-analyses appear to turn on both which studies were included and how the data were analysed. Most dramatically, the Gildersleeve meta-analysis reports shifting preferences that depend on context: when women are assessing long-term relationship prospects there is no ovulatory shift effect. And there is justification for this – the evolutionary thinking behind adaptive shifts is that what women find attractive for a once-off mating that might lead to conception may well differ from what they find attractive in a long-term partner.
The Wood meta-analysis largely ignored this important distinction, bundling studies of possible long-term partner attractiveness in with those concerning men’s attractiveness as short-term mates. Wood’s commentary criticises Gildersleeve’s meta-analysis for including in the short-term attractiveness subset some studies that do not specify relationship length. But these appear to be studies geared to assessing attractiveness of mates rather than long-term partners.
Analysing effects together or separately
The Wood meta-analysis split its analysis by the type of trait being assessed (i.e. separate analyses of preferences for symmetry, masculinity and so on), whereas the Gildersleeve meta-analysis began with an overall analysis across all preference types predicted to shift with the ovulatory cycle. The second approach has a larger sample of effects, providing a more powerful test of the hypothesis that preferences shift.
Gildersleeve’s reply reanalyses the data presented in the Wood meta-analysis and shows that had Wood analysed the data in the same, more inclusive way, they would have found similar evidence of cycle shifts in short-term mating preferences. That said, Gildersleeve also found evidence of cycle shifts in preferences for some traits in the smaller single-trait analyses.
Analysing the data in a hierarchical fashion, beginning with the most inclusive form of analysis, is more in line with the intended purpose of meta-analysis.
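The power argument behind pooling can be made concrete with a toy inverse-variance-weighted combination. The trait-level effects and standard errors below are invented for illustration, not values from either meta-analysis:

```python
import numpy as np

# Invented effect sizes and standard errors for five trait-level analyses
# (think symmetry, masculinity, dominance and so on) - illustration only.
effects = np.array([0.15, 0.22, 0.10, 0.18, 0.25])
ses = np.array([0.10, 0.12, 0.09, 0.11, 0.13])

# Fixed-effect pooling by inverse-variance weighting: combining all trait
# types into one overall analysis shrinks the standard error, so the pooled
# test is more powerful than any single-trait test.
w = 1.0 / ses**2
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
print(pooled, pooled_se)  # the pooled SE is smaller than every per-trait SE
```

The flip side, of course, is that pooling assumes the trait-level effects estimate something in common, which is part of what the two teams disagree about.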
Elastic analyses and p-hacking
The Harris commentary turns on the rather dramatic variation in the methods used by different studies to categorise fertility. Methods certainly vary within the field and have changed over time, and individual women differ in the regularity and length of their cycles.
So how should researchers proceed? Should they count from the start of a subject’s last period to the date on which preferences are measured? Or should they follow up after measuring the preference to find out when the next period commenced? And how should researchers account for variation in the timing of fertility?
The only reliable way to infer fertility is to track the hormone profiles of each individual woman. But this is expensive, and consumes the time of both subjects and researchers. As a result studies using this method cannot obtain the large samples needed to detect what are sometimes quite subtle shifts in preference.
Harris argues that cycle-shift studies are “particularly prone to hidden researcher degrees of freedom”. That is to say that researchers try different methods for defining the timing and duration of the fertile window, rather than applying strict a priori criteria. This increases the chance of researchers finding a significant cycle-shift effect.
This is a form of p-hacking – conducting various subtly different forms of an analysis until the test exceeds the threshold for significance (in this field usually a < 5% probability of observing such a result by chance alone). P-hacking doesn’t have to be deliberate. Merely being more likely to stop analysing data that confirms what you expected to find and continue analysing data that does not will tend to do the trick.
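As a toy illustration of how flexible fertile-window definitions can do this, here is a small simulation. The window choices, sample size and testing procedure are all invented; the point is that the preference scores contain no true cycle effect, yet letting the analyst pick among three window codings pushes the false-positive rate above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Three candidate "fertile window" codings (cycle-day ranges) the analyst
# might try in turn - purely hypothetical choices.
windows = [(10, 15), (9, 17), (8, 18)]

n_sims, n_women, alpha = 2000, 60, 0.05
hits = 0
for _ in range(n_sims):
    day = rng.integers(1, 29, size=n_women)   # cycle day at testing
    score = rng.normal(size=n_women)          # null data: no true cycle effect
    for lo, hi in windows:
        fertile = (day >= lo) & (day <= hi)
        if fertile.sum() < 2 or (~fertile).sum() < 2:
            continue
        p = stats.ttest_ind(score[fertile], score[~fertile]).pvalue
        if p < alpha:
            hits += 1
            break  # stop at the first coding that "works"
print(hits / n_sims)  # noticeably above the nominal 5% false-positive rate
```

No single step here is fraudulent; the inflation comes entirely from stopping at the first definition that clears the significance threshold.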
Needless to say, p-hacking is a serious allegation to level. It implies either fraud or serious unacknowledged bias. To suggest that many or even all results in a given research area are the results of p-hacking is a deeply damning allegation.
Gildersleeve’s reply addresses this allegation empirically by applying the p-curve, a statistical technique used to detect systematic evidence of p-hacking. If p-hacking is rife, there should be a disproportionate number of significance tests with p-values just below the 5% significance threshold.
The actual distribution of significant effects follows no such pattern, however, with few values in the 3-5% range and the great majority below 1%. This analysis isn’t an out-and-out winner against allegations of p-hacking, but it does represent a devastating return-of-serve.
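The intuition behind the p-curve can be sketched in a few lines. The p-values below are invented, and a real p-curve analysis adds a formal test of right skew rather than eyeballing a histogram:

```python
import numpy as np

# Invented significant p-values from a hypothetical set of studies.
p_values = np.array([0.003, 0.008, 0.001, 0.012, 0.004, 0.030,
                     0.002, 0.019, 0.006, 0.044, 0.009, 0.005])

# A crude p-curve: bin the significant p-values and inspect the shape.
# Genuine effects pile up at very small p's (right skew); widespread
# p-hacking predicts a bump just under .05 (left skew).
bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
counts, _ = np.histogram(p_values, bins=bins)
print(dict(zip(["<.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"],
               counts.tolist())))
```

In this made-up example most p-values sit below .01 and almost none crowd the .04-.05 bin, which is the right-skewed shape Gildersleeve’s reply reports for the real literature.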
As my PhD student Amany Gouda-Vossos said in an informal discussion a few colleagues were having about this controversy at a recent conference:
In my previous field of molecular biology, we publish the methods paper first. What this field needs is a methods paper.
That would certainly go some way to avoiding both the practice and the allegation of p-hacking. And the Commentaries and Reply contain plenty of insights into what that methods paper should set out to achieve.
While I’m on the subject of serious allegations, Gildersleeve’s reply raises problems with the coding of effect sizes in the Wood meta-analysis. Seventeen effects (14% of the total sample) are coded as exactly 0.00. Gildersleeve et al report being unable to get “sufficient information from the study authors to compute a precise effect size” for seven of these effects, and computing effects different from zero in other cases.
The inability of the two teams to get the same effect sizes from the same data represents a serious problem, and I look forward to seeing how this is resolved in press.
Wood’s commentary picks up where Harris’ commentary and its allegations of p-hacking leave off, with further arguments that the studies supporting cycle shifts are mostly or entirely research artifacts. They make several related points:
- Unpublished studies tend to show more non-significant effects than published papers.
- Cycle-shift effects were strongest in early studies and have declined over time as methodologies have improved.
- Cycle effects tend to be found in studies that specify longer fertile windows.
The first of these points is hardly surprising. One of the most persistent problems with scientific publishing is that significant effects are more likely to get published. Referees are more likely to see a significant result as a true positive than they are to consider a non-significant result a true negative. This happens across all fields of science, although social psychology has lately proven particularly prone to allegations of this nature.
A decline in effect sizes over time is also quite normal in a body of scientific literature; it is one way in which publication bias gets corrected. I believe that 2014, including the five papers I discuss here, represents a watershed, and I would like to see how studies that take on board the many issues and concerns expressed in the meta-analyses and commentaries fare.
The third allegation provides an interesting insight that seems to resonate with Harris’ commentary. A woman’s fertile window lasts no longer than six days in each cycle. But the strongest support for cycle-shifts comes from studies that code women as potentially fertile over a 9-day period. Both Wood and Harris view this as evidence of p-hacking or at least that the positive results are artifacts.
Gildersleeve’s reply argues that women’s 6-day fertile periods are not all timed the same, even in women with clockwork-regular 28-day cycles. A larger window captures a better representation of women who are potentially fertile. They argue, with evidence from actuarial studies of conception risk, that a 6-day window is not necessarily better than a 9-day window for dividing a cycle into high and low conception-risk phases.
I am certainly not convinced that the use of a 9-day fertile period is unjustified or evidence of p-hacking. Agreement on the best way to score conception risk would appear to be an essential step toward progress in this area. Both sides in this debate acknowledge the problem and provide helpful and somewhat similar suggestions. There seems to be broad agreement that fertility should be confirmed with hormonal tests, despite the inconvenience. Gildersleeve (both pieces) also advocates fitting conception probability as a continuous variable rather than partitioning the cycle into “fertile” and “infertile” phases, thus eliminating any scope for allegations of analytical elasticity.
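Here is a sketch of what that continuous approach could look like. The day-by-day risk curve and the preference ratings are both simulated assumptions (a real analysis would substitute published actuarial estimates of conception risk), but the key point survives: with a continuous predictor there is one regression and no arbitrary window to cut:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up conception probabilities for cycle days 1..28, peaking mid-cycle;
# purely illustrative, not published estimates.
day = np.arange(1, 29)
risk = np.exp(-0.5 * ((day - 14) / 2.5) ** 2) * 0.35

# Simulated preference ratings that rise with conception risk, plus noise.
n = 200
obs_day = rng.integers(1, 29, size=n)
obs_risk = risk[obs_day - 1]
preference = 2.0 * obs_risk + rng.normal(scale=0.5, size=n)

# Fit preference on conception risk as a continuous predictor: one model,
# no "fertile" vs "infertile" dichotomy, no window to tune.
slope, intercept = np.polyfit(obs_risk, preference, 1)
print(slope, intercept)
```

Because there is no window to choose, there is nothing for "elastic" analysis to stretch, which is exactly the appeal of the approach.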
Beware scientific tribalism
I could exhaust you with more dissection of the minutiae of these studies. I haven’t even reached the minutiae yet. But I have already doubled my word limit, and I am deeply in your debt if you’ve made it this far.
So what is going on? As I argued in my first column, much is at stake here. There is a long history of women’s fertility cycles being turned against them, to intimate that their competence, temperament and even their sanity are as variable as their hormone levels. How does one separate the biological reality of hormonal variation – of which women’s cycling fertility is but one example – from pseudo-scientifically buttressed sexism?
Wood leads her commentary by reminding us of recent history:
During the 1970s and ’80s, social stereotypes and popular media portrayed women as victims of hormonal fluctuations across the cycle that led to maladjustment and premenstrual syndromes… However, research in this area became more sophisticated following demonstrations that women’s self-reported menstrual syndromes were influenced in part by artifacts tied to cultural beliefs about cycles.
I find no evidence that those researchers who find cyclical shifts in mate preferences, including Gildersleeve, Haselton and Fales, are seeking – even inadvertently – to confirm popular media or social stereotypes. The media are certainly fascinated by the topic, reflecting popular interest in the links between hormones and behaviour. But most accounts seem enthralled by the surprise of finding that women’s preferences might be more subtle, more varied and more tied to individuality than social stereotypes and media tropes might have had us believe.
At the centre of this controversy sits the unfortunately polarised issue of biology’s place in our understanding of human behaviour. As Haselton pointed out to me when we discussed the meta-analyses in June, ovulatory shifts are intriguing because they represent an original prediction of evolutionary theory, but one not made by non-evolutionary theoretical frameworks. Frameworks that emphasise social construction, cultural beliefs, social roles and gendered structures, and that relegate biology to minor or zero relevance, can often explain the same observations as evolutionary accounts – but they do not predict ovulatory shifts.
An absence of cycle shifts would not refute the evolved nature of behaviour; but if such effects are real, they would represent a potent illustration of how evolution shapes subtle, interesting variation in behaviour. Variation that might interact in interesting ways with the social and cultural dimensions of behaviour.
Nothing to fear
Cycle shifts, interpreted properly, should not undermine claims to equity. And they needn’t threaten your marriage either. The shifts in preferences are mostly pretty small.
It’s not as though you’re going to be truly, madly, deeply in love with your husband one day and furtively shagging the pool guy the next.
Although that’s not beyond all possibility.
My former PhD student Eddie Aloise King interviewed me on the subject of cycle shifts for her new podcast: Dissecting Love. She put together a very informative and entertaining program.