among the 17 women diagnosed with severe postnatal depression, 13 had had male babies.
Thirteen. Now there's a valid statistical sampling for a nation of about 65 million.
Can you say "advocacy research"?
As I've commented here before, statistics is a difficult thing to understand without training. 13 out of 17 is significant, but not conclusive. If it were, say, 16 out of 17, I think that would be enough to draw a solid conclusion.
Here's an example of an unintuitive result. When Ross Perot was making his run for President in 1992, he wanted to get on the ballot in Texas. Under Texas law, to be a third-party candidate on the ballot, he needed 54,275 signatures. He brought in 200,000 signatures. However, it was inevitable that not all of them would be valid. The elections commission obviously didn't have the manpower to contact all 200,000 of those people. It just needed to check some of them until it could be sure that there really were at least 54,275 good ones in the pile. So the question they asked statisticians was: how many do we need to check to be sure? Can anyone guess what the answer was?
Eight. That's right: after investigating only eight signatures, they would have been able to decide by a comfortable margin whether he really did have enough signatures in that pile to get on the ballot. The reason for this improbable result was the large number of signatures he brought in compared to the number needed. If, in truth, only about one in four of the signatures were any good, then getting the first eight to check out by sheer luck would be a (1/4)^8 = 1/(2^16), or about a 0.0015% chance. And if one of them didn't check out but the other seven did, it would still be exceedingly unlikely that there were fewer than 54,275 good ones in the pile.

But the election commission couldn't go outside and tell the press that they were only going to check eight. That would make it look to the public like something was amiss. The Perot supporters would think the commission was trying to disqualify him with an unrepresentative sample and risk all their hard work, and Perot's opponents would take umbrage at the commission's laziness and unwillingness to simply verify that he should be on the ballot. It was a no-win, so they had to invent some number of signatures to verify that was far beyond the number needed but sounded good.
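The signature arithmetic above is easy to check with a few lines of Python. This is just my illustration of the reasoning (the helper name `prob_k_good` and the sample size of 8 per the story are assumptions of the sketch, not anything the commission published):

```python
from math import comb

def prob_k_good(k, n=8, p_good=0.25):
    """Binomial probability that exactly k of n sampled signatures are
    valid, assuming each is independently valid with probability p_good."""
    return comb(n, k) * p_good**k * (1 - p_good) ** (n - k)

# Suppose only 1 in 4 signatures were actually valid -- i.e., about
# 50,000 good ones out of 200,000, fewer than the 54,275 needed.
# Chance that all 8 sampled signatures check out anyway, by sheer luck:
p_all_eight = prob_k_good(8)          # (1/4)^8 = 1/65536, about 0.0015%

# Chance that at least 7 of the 8 check out under the same assumption:
p_seven_or_more = prob_k_good(7) + prob_k_good(8)
```

Either outcome would be so improbable under the "only 1 in 4 are good" hypothesis that seeing it lets the commission reject that hypothesis with confidence, which is why such a tiny sample suffices.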
Out of 17, 13 had male babies. Look at it this way: if the gender of the baby had nothing to do with it, how odd is this result? Suppose you had 17 people lined up taking marbles out of a bag and then putting them back, each bag holding ten blue marbles and ten pink marbles, and you had them do this again and again. In how many of these demonstrations (Bernoulli trials) would 8 get a blue and 9 get a pink, versus, say, all of them pulling out a blue? If each of them is just as likely to pull out a blue as a pink (probability = 0.50 for a blue), then everybody pulling a blue would be pretty unusual: in this case, 1 in 131,072 unusual (17 consecutive successful coin flips, or 0.5^17). If 13 of them pulled out a blue, the frequency of a result like that comes out to about 0.018. The overall frequency of a trial in which 13 or more of the marbles are the same color (blue or pink doesn't matter, as long as there are at least 13 of them) comes out to about 4.9%.
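The marble numbers above come straight out of the binomial formula; here is a short sketch that reproduces them (`binom_pmf` is my own helper name, not anything from the study):

```python
from math import comb

def binom_pmf(k, n=17, p=0.5):
    """P(exactly k blue marbles out of n draws) under a fair 50/50 bag."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_all_blue = binom_pmf(17)      # 1/131072: everyone draws blue
p_exactly_13 = binom_pmf(13)    # about 0.018: the observed 13-of-17 split

# Two-sided: 13 or more of EITHER color (13+ blue or 13+ pink)
p_two_sided = 2 * sum(binom_pmf(k) for k in range(13, 18))  # about 0.049
```

The two-sided figure is the roughly 4.9% quoted above, which is what puts the result just inside the conventional 5% significance threshold.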
In other words, if we assume that post-natal depression is completely unrelated to the infant's gender, then the fact that 13 out of 17 of the depressed mothers had boys would have to be a roughly 1-in-20 fluke. But we don't know that the underlying probability is truly 50/50. That bag of blue and pink marbles might simply contain more blue ones than pink, which would make this a rather typical result. Because the result falls under the 5% threshold, it is what is called significant. That means that researchers pay attention to it and gather more data in that area to more definitively draw conclusions.

Professor Claude de Tychey was quoted in the articles as saying "It's an interesting talking point, but I'm not entirely convinced by this, and would like to see it replicated in larger trials. It's probably a statistical quirk." So he is saying, basically, that one such quirk would be expected whenever you do 20 studies, and he doesn't think it means anything. My opinion is that whenever a significant outlier like this is found, it warrants further study. After all, you only waste your time once in 20 times, and in the other 19 cases, you actually found something.
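The "one quirk per 20 studies" reading can be checked with quick arithmetic; this is my own back-of-envelope sketch, not anything from the article:

```python
# With a 5% false-positive threshold, across 20 independent studies of a
# true null effect you expect about one spurious "significant" result,
# and the chance of seeing at least one such quirk is fairly high.
alpha = 0.05
expected_quirks = 20 * alpha              # = 1.0 expected false positive
p_at_least_one = 1 - (1 - alpha) ** 20    # about 0.64
```

Note that this cuts both ways: it supports de Tychey's caution about any single small study, while leaving room for the view that significant outliers are still worth following up.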