CommentMizzaro

Comments about Quality Control in Scholarly Publishing: A New Proposal, Journal of the American Society for Information Science and Technology, 54(11):989-1005, 2003.

by S. Mizzaro.

http://www.dimi.uniud.it/~mizzaro/research/papers/EJ-JASIST.pdf

-

Abstract:

The Internet has fostered a faster, more interactive and effective model of scholarly publishing. However, as the quantity of information available is constantly increasing, its quality is threatened, since the traditional quality control mechanism of peer review is often not used (e.g., in online repositories of preprints, and by people publishing on their Web pages whatever they want). This paper describes a new kind of electronic scholarly journal, in which the standard submission-review-publication process is replaced by a more sophisticated approach, based on judgments expressed by the readers: in this way, each reader is, potentially, a peer reviewer. New ingredients, not found in similar approaches, are that each reader's judgment is weighted on the basis of the reader's skills as a reviewer, and that readers are encouraged to express correct judgments by a feedback mechanism that estimates their own quality. The new electronic scholarly journal is described in both intuitive and formal ways. Its effectiveness is tested by several laboratory experiments that simulate what might happen if the system were deployed and used.

---

Comments by Paolo:

First of all, I liked the paper: it is very clearly presented, and it addresses a real and, in my opinion, increasingly important problem. The math is very clear, sound, and makes sense. What follows are mostly "I didn't like this" comments; in general I prefer to comment on what I don't like: reviews saying how good your idea is are just boring and not useful, in my opinion.


 * too bad the paper didn't compare the proposed technique against the Slashdot moderation system. The Slashdot moderation system has been in use for a long time, its code is open source, and it has evolved continuously based on the feedback and contributions of many users. More importantly, it moderates a real system, used by millions of very skilled users. Keywords are karma, moderation, metamoderation, ... See the FAQ on moderation.


 * What is Karma? Karma is an internal value Slashdot uses to determine things like initial posting score, and moderation eligibility. To you, karma is a label like "Good" or "Bad". Many things on Slashdot affect karma, including moderation done to your comments, accepted story submissions, and meta moderating.


 * Other systems, such as Kuro5hin, have also evolved different techniques for moderation.


 * However, it should be noted that the goal of Slashdot is more about deciding "what should I put on the homepage now" and less about "keeping a history of what is important". Slashdot news soon gets old, while the quality of a scholarly paper (hopefully) will last over time.


 * in order to find data against which to test the proposed technique, I can suggest: citeulike.org (there are no numerical evaluations, but they could possibly be derived in some way), imdb.com, EachMovie or MovieLens (people rate movies instead of scientific papers), CiteSeer (considering citations as implicit ratings), epinions.com (where users rate items), amazon.com
 * maybe check as well the data from Citebase of eprints.org, which is open source


 * some online systems that help researchers keep (and rate) bibliographies are citeulike.org (via browser), Bibster (P2P), BibServ (via browser), LionShare (P2P) ... see also an old entry on my blog. I was able to try CiteULike (which I love) and BibServ (which has an unusable interface, even if it explicitly includes concepts such as web of trust).


 * in general, throughout the paper there is the idea that a paper has a correct value and that all readers can converge to that objective value. I'm totally against this idea: 99% of the people can think that a paper is "wonderful" and I still advocate the right to say that it is "horrible". In this case, I'm not a bad reader, I just happen to have different tastes from the majority.

Some of the not-so-good sentences are:
 * ... providing a way of measuring in an automatic and objective way the quality of researchers ... --> such an objective way does not exist.


 * if a reader expresses an inadequate judgment on a paper, her score decreases accordingly, and so on. --> here "inadequate judgment" means "away from the average"; again, I claim that judgments are just subjective and never inadequate.
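
The objection is easier to see in code. Below is a minimal sketch (my own simplification for illustration, not Mizzaro's actual formulas) of the kind of feedback loop the paper describes: a paper's quality is the weighted average of the judgments, and each reader's weight shrinks with her distance from that consensus.

```python
def consensus(judgments, weights):
    """Weighted average of the readers' judgments (values in [0, 1])."""
    total = sum(weights[r] for r in judgments)
    return sum(weights[r] * j for r, j in judgments.items()) / total

def update_weights(judgments, weights):
    """Shrink each reader's weight in proportion to her distance from consensus."""
    c = consensus(judgments, weights)
    return {r: weights[r] * (1 - abs(j - c)) for r, j in judgments.items()}

judgments = {"alice": 0.9, "bob": 0.8, "carol": 0.1}  # carol dissents
weights = {r: 1.0 for r in judgments}
weights = update_weights(judgments, weights)
# carol's weight drops the most: the mechanism treats "different from the
# majority" as "inadequate", which is exactly the objection above.
```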


 * Moreover, this idea is not a good one because it produces standardization (sheep-like behaviour), which is the opposite of innovation, of rethinking old local minima, and so on. Standardization produces cheap, Hollywood-esque items that can be liked by the majority but are not really insightful, provocative or disruptive (examples are Hollywood movies, or books such as The Da Vinci Code, ...)


 * An example that I like is the following: at a certain point in time everyone agreed with Newton's theory and was claiming "this is the top theory, wonderful". Then a guy came along (his name was Albert, his surname was Einstein) who in some sense voted Newton's theory as "bad". Was Einstein a bad reader? I don't think so. But if the proposed technique had been in place, it is possible that his theories would have been judged "inadequate" or "wrong", just because they were different from the local-minimum theory. Now Einstein's theory is acclaimed as the top, but in the future someone will come and say "Einstein's theory is bad". Cultural progress is nothing but overcoming local minima. Local minima should not be acclaimed. Diversity (in opinions and ideas) should always be considered important.


 * Technologies can now enable personalization: if I believe relativity is not true, I can receive recommendations from other researchers who think like me, and not from the "tyranny of the majority" that is not able to think different! I can "find" people who think as I do and discuss with them. ---> Of course, the risk is the "daily me": speaking only with like-minded people would produce echo chambers, extremism, out-of-reality life and, in general, the destruction of what we now mean by culture (that is, a set of beliefs and experiences common to a "vast enough" group of people). Check Sunstein's books for a clear exposition of these concepts.


 * I call the metrics that try to synthesize a unique (objective?!?) score for every entity in the game global metrics (GlobalTrustMetric). Instead, I call the metrics that try to predict a personalized score local metrics (LocalTrustMetric). A global metric would predict the score of Bush as 0.55. A local metric would predict the score of Bush as 1 for someone and as 0 for someone else, based on what they like. A local metric does not try to average everyone's opinions but takes diversity and subjectivity into account.
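
A toy contrast between the two kinds of metric on the same rating data (the names, numbers, and the similarity measure are mine, purely for illustration):

```python
ratings = {                       # reader -> {item: rating in [0, 1]}
    "alice": {"bush": 1.0, "kyoto": 0.0},
    "bob":   {"bush": 1.0, "kyoto": 0.1},
    "carol": {"bush": 0.0, "kyoto": 1.0},
    "dave":  {"bush": 0.2, "kyoto": 0.9},
}

def global_metric(item):
    """One score for everyone: the plain average over all raters."""
    vals = [r[item] for r in ratings.values() if item in r]
    return sum(vals) / len(vals)

def similarity(a, b):
    """1 - mean absolute disagreement on the items both users rated."""
    shared = set(ratings[a]) & set(ratings[b])
    return 1 - sum(abs(ratings[a][i] - ratings[b][i]) for i in shared) / len(shared)

def local_metric(user, item):
    """Personalized score: average of the other users' ratings,
    weighted by how similar their tastes are to `user`'s."""
    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            s = similarity(user, other)
            num += s * ratings[other][item]
            den += s
    return num / den

# The global metric gives everyone 0.55; the local metric gives alice a
# high prediction and carol a low one, respecting their different tastes.
```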


 * I think global metrics are not always suited, and often a local metric is better, but I might be too extreme in thinking so ;-)


 * Globally high-rated peers are fine if you want to assign the Nobel prize, but if you want to find people you resonate with, they are not! If you are not mainstream, local metrics are more suited for you. I would like to live in a world where people who think differently are considered a value and not "bad readers".


 * Actually, you can be a good reader just by being standardized: you could create a bot that always rates a paper with the average rating. Then, being a "good reader", you could influence your own papers' ratings. This problem is tackled in the paper, for example, by defining laziness and other measures.
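
The "standardized bot" is trivially implementable. Assuming a system that rewards judgments close to the average (my simplification), a bot that waits for a few ratings and then parrots their mean is near-consensus by construction:

```python
def bot_judgment(existing_ratings):
    """Lazy bot: rate every paper with the current average rating."""
    return sum(existing_ratings) / len(existing_ratings)

# The bot contributes no information of its own, yet its judgments are
# always close to the consensus, so naive distance-from-average scoring
# would rank it as an excellent reader.
print(bot_judgment([0.7, 0.9, 0.8]))  # ~0.8
```

This is why a countermeasure like the paper's laziness measure (penalizing late, conformist votes) is needed.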

Let me say that I like the ideas proposed in the system, and I think that we should compare local and global metrics on the same data and study the differences and advantages of the two approaches.

* This model can be applied not only to scholarly journals, but also to other means of scholarly communication like, for instance, e-prints repositories.

this can be applied to everything: movies, songs, paintings, political ideas, government programs (emergent democracy), ...

* cool how the formulas converge and are very good-looking!

* interesting: vote sooner ---> early raters should have more weight! They are less lazy, and the information they provide is the most useful.
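
One way to implement this (the exponential decay is my own assumption; the paper does not specify a scheme) is to discount each rating by its arrival rank:

```python
import math

def early_weight(rank, decay=0.3):
    """Weight of the rank-th rater (rank 0 = first to vote)."""
    return math.exp(-decay * rank)

def weighted_score(ratings_in_arrival_order):
    """Score of a paper with earlier ratings counting more."""
    ws = [early_weight(i) for i in range(len(ratings_in_arrival_order))]
    return sum(w * r for w, r in zip(ws, ratings_in_arrival_order)) / sum(ws)

# A late pile-on of high votes moves the score less than the early,
# better-informed votes:
print(weighted_score([0.2, 1.0, 1.0, 1.0]))  # ~0.70, vs. a plain average of 0.80
```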

* There are some papers/research that analyze rating behaviour on Slashdot. Rating behaviour is an important component when you design such "social" systems. Some of the findings (a few from the Manifesto for the Reputation Society) are:

* Median time between posting of a story and accumulation of the first 50 percent of commentary is approximately three hours, and for the first 90 percent about 18 hours, so discussions happen quickly; those comments posted later tend to receive fewer ratings since many people will only read the comments that have already been rated highly.

* Of the comments that were moderated, only 15 percent received both positive and negative moderations (indicating disagreement among moderators), and only eight percent of metamoderations disagreed with the moderations they evaluated. The authors suggest that the system could be improved by highlighting comments needing additional moderator attention, which would distribute the raters’ attention more efficiently.

* somewhere there were reports that 90% of the ratings are "positive" ratings (check the data!)

* It is easy to imagine a large variety of attacks. This will become even more important when the system is in place and can really produce fame (and funds) for researchers based on the scores it assigns. The simplest attacks are sybil attacks (just create 1000 "fake" readers all saying that you are the best writer!), but there are also malicious cliques and reciprocity. See also TrustMetricsAttacks.

during the presentation there was a proposal to tie together, in one single identity, a person's reader score and her writer score. In this way it is more difficult to create "fake" identities, because you have to really write at least one paper in order to become a possible reader (of course, you could copy someone else's paper into your fake "johhny_the_mad_reader" identity ...)

In Open Rating Systems, Guha presents a heuristic for distinguishing malicious cliques from real cliques:

''Cliques: There were a number of small groups of users (few dozen or fewer in each) many of whom trusted many others. Some of these groups corresponded to real-world social groups, i.e., a set of friends who did really trust each other. In other cases trust and ratings swapping cliques would emerge in an effort to boost the overall ratings of those involved in the cliques. In general, it is hard to distinguish between these two kinds of cliques, purely by looking at the graph structure of trust relations. However, a couple of heuristics turn out to be quite useful. 1. Rating swapping cliques are set up very fast. In contrast, real cliques tend to take time to form. 2. Rating swapping cliques are very insular. Almost no one outside the clique trusts any of the clique members. 3. Real cliques often have short paths leading from a Recognized Trusted User to one of the members of the clique.''
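
Guha's heuristic #2 (insularity) is simple to operationalize. A sketch, where the graph encoding and the threshold are my own assumptions for illustration:

```python
def insularity_ok(trust_edges, clique, min_outside_fraction=0.1):
    """trust_edges: set of (truster, trusted) pairs.
    A clique looks "real" if a non-negligible fraction of the trust its
    members receive comes from outside the clique itself."""
    incoming = [(a, b) for a, b in trust_edges if b in clique]
    outside = [(a, b) for a, b in incoming if a not in clique]
    fraction = len(outside) / len(incoming) if incoming else 0.0
    return fraction >= min_outside_fraction

edges = {("x", "a"), ("a", "b"), ("b", "c"), ("c", "a"),  # a-b-c, trusted by x
         ("p", "q"), ("q", "p")}                          # p-q trust only each other
print(insularity_ok(edges, {"a", "b", "c"}))  # True: some outside trust
print(insularity_ok(edges, {"p", "q"}))       # False: insular -> suspicious
```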

Questions / other points:

* (I am implicitly assuming that each reader can judge each paper only once)

the fact that you cannot re-rate items is a little bit unpleasant, but I understand the motivation behind it.

however, one of the problems is that a paper I considered good but that turns out to be bad (because of detected plagiarism? detected fake results?) will negatively affect the first raters, who cannot change their early ratings.

* what about spam bots? what about sybil attacks (I create 100 identities ...)? This is in general related to TrustMetricAttacks

* what about privacy and private opinions?

* can I judge my own papers? yes.

* add always-positive readers to the simulation: they always rate 1. It seems most of the ratings are 1, so consistently rating 1 can give you a high score as a reader.

* in the simulations, assuming that a peer rates 50% of the papers is too much! Real data will be much sparser!!! (on epinions.com a user rates on average ~13 items out of 150,000)
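
The gap is worth quantifying, using the epinions.com figure quoted above:

```python
# A user rating ~13 of 150,000 items gives a density of about 0.009%,
# several orders of magnitude below the 50% assumed in the simulations.
epinions_density = 13 / 150_000       # ~8.7e-05
simulated_density = 0.5
print(simulated_density / epinions_density)  # ~5769: the simulation is far denser
```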

* paper unique ID? how do you plan to ensure it? is it a requirement?

---

Interesting extracts:

* Several researchers suggest, using Nadasdy's words (1997), to substitute peer review with democracy: each submitted paper is published, possibly before or without a peer review, and readers will judge it, selecting what they deem useful.

* In other words, the mechanism presented in this paper involves real people and, like all biological and social systems, it is likely to exhibit unexpected behavior: see, e.g., (Dawkins, 1976; Ridley, 1997) for interesting discussions of these phenomena.

-

* during the discussion there was a proposal: if you have a score lower than 0.5, you cannot rate a paper that has not yet been rated by another reader with a higher score than you. This will ensure papers are rated first by (at least one) "reliable" reader. I need to think more about this.

* another idea in the paper is to not show the average rating until some ratings have been collected. In fact, imdb.com does precisely this.