Sunday, August 26, 2007

Citation Plagiarism

Well, I was about to preface this with a disclaimer that it isn't current news, but why should I do that anyway? The whole problem with the media (and this is only exacerbated on most blogs) is that it's too focused on what's new and ignores what was said last month or last year. Of course, last month's or last year's "news" often does turn out, in retrospect, to be ephemeral or ultimately unimportant. But if that's the case, one shouldn't have been wasting time on those stories then either.

But I digress. What I really wanted to post about was this story:
Tomorrow's IT advances -- which usually start out in today's academic journals -- may be the product of cheating, say UCLA researchers who claim that scientists routinely lie about the amount of research they perform before publishing their innovations.
* * *

"We discovered that the majority of scientific citations are copied from the lists of references used in other papers," Simkin and Roychowdhury write in a paper whose title admonishes, "Read Before You Cite!"

An ingenious study of the statistics of scientific misprints led the two researchers to conclude that major innovations may, in part, be the products of lazy fudge factoring.

* * *

"The probability of repeating someone else's misprint accidentally is 1 in 10,000," Roychowdhury and Simkin claim. "There should be almost no repeat misprints by coincidence."

Yet, repeat misprints appear in nearly 80 percent of the papers the two authors studied, leading them to conclude that "only about 20 percent of citers read the original. Repeat misprints are due to copying someone else's reference, without reading the paper in question."
So far, so good. Citing without reading the work in question = bad scholarship.
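
The misprint argument itself is easy to check with a toy simulation. Below is a rough sketch in Python; the model and all of the numbers in it (a 20 percent read rate, a one-in-three chance of mistranscribing a reference) are my own made-up assumptions, not the authors' actual method. Each new citer either reads the original paper and writes the reference out fresh, or copies the reference string, typos and all, from a randomly chosen earlier citer.

import random

random.seed(0)

READ_RATE = 0.2        # assumed fraction of citers who actually read the original
MISPRINT_RATE = 1 / 3  # assumed chance that a fresh transcription introduces a typo
N_CITERS = 1000

references = []        # one entry per citer: None = correct reference, otherwise a misprint id
next_misprint_id = 0

for _ in range(N_CITERS):
    if not references or random.random() < READ_RATE:
        # Cite from the original: any misprint here is a brand-new one.
        if random.random() < MISPRINT_RATE:
            references.append(next_misprint_id)
            next_misprint_id += 1
        else:
            references.append(None)
    else:
        # Copy a randomly chosen earlier citer's reference verbatim, typo included.
        references.append(random.choice(references))

misprints = [r for r in references if r is not None]
fresh = len(set(misprints))          # distinct typos, each introduced by a "reader"
repeats = len(misprints) - fresh     # occurrences that can only have come from copying
print("misprinted references:", len(misprints))
print("of which copies of an earlier misprint:", repeats,
      f"({repeats / len(misprints):.0%})")

In a run of this sketch, somewhere around 80 percent of the misprinted references come out as copies of an earlier citer's typo, which is essentially the authors' arithmetic run in reverse: from the observed fraction of repeated misprints, infer how few citers read the original.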

I object to this, however, at least as described:
Out of 24,000 papers published between 1975 and 1994 in the prestigious journal "Physical Review D," forty-four papers achieved "renowned" status with 500 or more citations.

Asking the question, "What is the mathematical probability that 44 of 24,000 papers would be cited 500 or more times in 19 years?" Roychowdhury and Simkin found the answer to be 1 in 10^500, or effectively, zero.

In other words, it is a mathematical impossibility that 44 of 24,000 papers would achieve "greatness" by these measures, unless another mechanism -- copying, for instance -- were at work.

If so, the so-called "Matthew Effect" would take over after a few copied citations, the authors say.

"This way, a paper that already was cited is likely to be cited again, and after it is cited again, it is even more likely to be cited in the future," claims Roychowdhury, a specialist in the research of high-performance and parallel computing systems. "In other words, 'unto every one that hath shall be given, and he shall have abundance,'" he quoted from the Gospel of Matthew.
It's a "mathematical impossibility" only if one were expecting citation patterns to be completely random, but why would one expect that in the first place? A better expectation is that citation patterns ought to follow more of a power-law distribution -- lots of papers are crap and should never get cited, and a few papers revolutionize a field or at least make a very substantial contribution, and hence should get cited a lot. There may still be quite a substantial "Matthew effect," of course, but the mere fact that citation patterns fail to match a normal distribution doesn't prove that "copying" is a big problem.
