At the Algemeiner today, I address the just-released Israeli-Palestinian School Book Project. Since posting, I have gained further clarity and focus on problematic features of the project and of the information about it released to the press.
About the number of books and items “analyzed”:
The official list of books included those approved by the Israeli and Palestinian Ministries of Education for 2011. The study examined school books used in the Israeli State secular and Religious tracks and from independent ultra-Orthodox schools. Palestinian books were the Ministry of Education’s textbooks used in the West Bank and Gaza Strip, and a small number of books from the few independent religious schools (Al-Shariah) when relevant to study themes. A total of 640 school books (492 Israeli books and 148 Palestinian books) were reviewed for relevancy to study themes, and content in the 74 Israeli books and 94 Palestinian books with most relevance was analyzed in detail. The researchers analyzed more than 3,100 text passages, poems, maps and illustrations from the books.
So, indeed, as I raise to question at the Algemeiner, why was there purposeful selection of textbooks from “independent ultra-Orthodox schools,” but apparently – there is no reference – no comparable selection from Hamas-controlled schools? Why was the original selection of books weighted 3 to 1 toward Israeli books? What were the specific terms, determined by whom, of “most relevance”? What constituted the “relevance” that determined the choice of the final 74 Israeli and 94 Palestinian books?
The analysis examined 2,188 literary pieces from Israeli books and 960 from Palestinian books.
Why is there a more than 2 to 1 preponderance of Israeli literary pieces? Would this not provide more than double the opportunity for the detection of passages that might be “analyzed” as “negative”? What rationale is there for not working from equal databases?
Why, apparently, were no Arabic textbooks from Israeli Arab schools included in the study? (What might it reflect on Israeli society and education if these books were notably free of “negative” depictions of the “other,” however the “other” might complexly be conceived in this circumstance?)
A total of 670 literary pieces were analyzed independently by two different research assistants. Statistical analysis demonstrated high inter-rater reliability, meaning that two different raters independently evaluated the same poem, passage, or map in highly similar ways.
How were these 670 pieces selected from the 3,148 noted above? What was the reason and basis for this further selection?
I have placed quotation marks around my own use, referencing the report’s use, of the word “analyze” or “analysis.” The report makes significant claims to scientific rigor. However, the analysis of a chemical compound is not the same as the analysis of a text, even if one attempts to subtract human subjectivity from the text by disregarding its truth value. (And was it a stipulated criterion to disregard truth value in determinations of negativity? As I argue at the Algemeiner, this is indefensible and produces unavoidable and potentially dramatic distortion of the results.) And we are told above that “two” – only two – different research assistants analyzed the 670 pieces. Two analysts of negativity unrelated to truth. Did the study provide them with a list of specific verbs, adjectives, figures of speech, and idioms whose use was automatically to be designated negative? Was there no subjective, critical allowance for judgment beyond such a list? From what environment did the research assistants come? Were they already employees, students, or teaching assistants of the lead researchers, sharing, perhaps, their predisposition toward the study’s outcome?
The press release states,
The study engaged a Scientific Advisory Panel that resulted in the worldwide collaboration of 19 experts, including textbook scholars, social scientists and educators from across the political spectrum of both Israeli and Palestinian communities. The advisory panel includes textbook researchers from Germany who led Germany’s self-examination of their textbooks in the decades after World War II, and U.S. scholars who have themselves analyzed school books in Israel, the Arab world, and the former Yugoslavia. The advisory panel reviewed every aspect of the study and agreed on the findings.
However, departing from this account, Eetta Prince-Gibson at Tablet reports,
Several Israeli members of the SAP dissented. According to a memo provided by the Education Ministry spokeswoman, Professor Elihu Richter of the Hebrew University said that “questions remain concerning definitions of the variables, how they are classified and measured and counted and what materials are included and excluded.” Richter warned that some of the comparisons may be “sliding down the slippery slope to moral equivalence.” SAP member Dr. Arnon Groiss, author of a separate study on Middle Eastern textbooks, wrote that he has severe reservations about the methodology and that some 40 significant items, which attest to incitement on the part of Palestinians, were not included.
Further, Groiss has now released this lengthy and instructive analysis and commentary on the report. He states,
Again, we, the SAP members, were not involved in the research activity.
Moreover, it was only a few days before the February 4 release of the report that I was first given the 522 Palestinian quotes for perusal. Having compared them to the quotations appearing in other research projects, I realized that some forty meaningful quotations, which other researchers in former projects, including myself[1], incorporated in the material and used them in forming their conclusions, were missing. [Emphasis in the original]
….
I have found deficiencies on both levels of definition and actual use. On the first level, categorization was restricted to very general themes, leaving out important issues such as open advocacy of peace/war with the “other,” legitimacy of the “other,” etc.
….
There is no attempt to study the quotes more deeply and draw conclusions. All items were treated equally, with no one being evaluated and given a more significant status than the other. It seems that they were simply lumped together, counted and then the numbers spoke. It might be statistically correct, but, as we all know, statistics do not always reveal the actual complex picture. This kind of analysis has produced a “flat” survey of the quotes, without any reference to their deeper significance (for example, looking at a demonizing text with no specific enemy as if it were a “neutral” literary piece). Also, all quotes were treated as separate items with no attempt to make a connection between two quotes or more in order to reveal an accumulated message (for example, concluding from the connected recurrent mentioning of the need to liberate Palestine, and the similarly recurring theme that Israel in its pre-1967 borders is “occupied Palestine”, that the liberation of Palestine actually means the liquidation of Israel).
A full reading of Groiss will be instructive for the non-specialist. Its education is two-fold and contrary. First, one recognizes how complex is the activity of attempting to bring something approaching objective scientific rigor to the non-literary analysis of texts. The kinds and range of issues to consider are impressive in variety and complexity. But a mirror principle automatically arises from that condition – that all this complexity in conceiving and formulating the field and terms of analysis bespeaks just that subjectivity of which Groiss offers so many dissenting views, a subjectivity that should give pause, on the level of a foot-pedal brake, before one reaches with too grasping hands for the label of science.
AJA
(Actually, on further thought, I will partially withdraw the comment about the overall number of passages. If passages are selected for relevance, and findings are reported as a percentage of “inflammatory passages” out of “selected passages,” there may be questions. If the report presents the frequency of inflammatory passages in a sample representative of the overall textbooks, that would be absolutely acceptable. The point about inter-rater reliability, however, stands.)
Mostly, I think you have a good set of questions about the reporting, if not the study. The points about Hamas-run schools or Israeli-Arab schools are certainly worth considering, particularly the questions about the selection of “relevant” material, especially as raised by Groiss. If his criticism is accurate – that passages were excluded because they demonized “the enemy” without indicating precisely what enemy was being addressed – that exclusion would seem indefensible. In general, Groiss provides very helpful criticism, and I’m really grateful to you for pointing me to that page.
And, of course, the study did not find a lack of differences. From what I’ve seen, the differences are likely to be statistically significant, but they are minimized (perhaps for legitimate reasons) in interpretation.
The question about there being more Israeli textbooks, however, seems odd. It is likely the answer is simply that there are more textbooks used in Israel, and that these were a representative sampling of both sets of books. I have yet to see confirmation of that, but it hardly qualifies as a problem in the study. So long as the reporting includes the overall numbers of samples and prefers percentages, this is not a problem at all.
And in the passage quoted above, your question is answered right in the section you quote. The reason was to establish inter-rater reliability, and the point of establishing inter-rater reliability is to address exactly the problem of subjectivity. The numbers chosen for multiple ratings hardly seem small. The criticism here seems simply uninformed.
Matt, your comments are a bit unclear to me. I certainly have no expertise in such research. I don’t know if you do. Any ignorance on my part may be quite literal, in which case I can be informed as well as the next bright fellow. About numbers of books, I understand your point about percentages, but what would be the need? In polling, history, demographics, and statistical models provide minimally sufficient numbers to be representative within a margin of error. There is a long history in polling. I don’t know how comparably deep the history is in establishing representative numbers with textbooks, an entirely different fish anyway. Population difference is hardly sufficient. What is the representation of different publishers and local factors? In the U.S., for instance, state and local school boards select widely different texts by culturally and politically different standards. Are there comparable influences in Israel that are accounted for in the numerical imbalance? We are not told.
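To make concrete the polling calculation I have in mind, here is a minimal sketch – my own illustration, with hypothetical sample sizes, nothing drawn from the study. The standard formula presumes random sampling from a well-defined population, which is exactly the assumption I am doubting for textbooks:

```python
# Illustrative only: the standard margin-of-error formula used in polling.
# The sample sizes below are hypothetical, not drawn from the study.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n random samples."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 640, 1000):
    print(f"n={n}: +/- {margin_of_error(n):.1%}")
# n=100: +/- 9.8%
# n=640: +/- 3.9%
# n=1000: +/- 3.1%
# The formula assumes simple random sampling from a well-defined population.
```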
As to inter-rater reliability, “the reason” for what was to establish it? My point was that inter-rater reliability between a remarkably low number of raters – two – with potentially similarly skewed perspectives in applying a range of subjective determinants of negativity, is not remotely reassuring, and seems a poor basis on which to issue a report with such profound effect.
You are right. The study does report significant differences, in the percentage of neutral depictions, for instance. And, yes, there appears to be minimizing and maximizing of interpretation going on – and that is the subject of my writing at the Algemeiner.
For the record, I have some experience doing statistical research in psychology. I wouldn’t call it extensive, but I don’t think what’s at issue here is terribly complicated.
With regard to the sample sizes, I think this is less interesting, since we both know less about what they did than would be ideal. But if a study set out to compare the heights of men and women, it wouldn’t matter if the sample had twice as many men as women. What would be compared are the averages within each group. You initially wrote, “Would this not provide more than double the opportunity for the detection of passages that might be ‘analyzed’ as ‘negative’?”
Indeed, one would have twice the opportunity to discover especially tall men, but the average is unlikely to be affected at all. When you refer to the margin of error in polls or meta-statistics, I could explain how those work if it might help, but those aren’t affected by sample size in that way.
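A toy simulation makes the point concrete – the numbers here are invented, chosen only to show the statistical behavior:

```python
# A toy simulation of the height analogy: all figures are invented.
import random

random.seed(1)
men   = [random.gauss(178, 7) for _ in range(2000)]   # twice as many men sampled
women = [random.gauss(165, 7) for _ in range(1000)]

print("very tall men (>190 cm):", sum(h > 190 for h in men))
print(f"mean height, men:   {sum(men) / len(men):.1f} cm")
print(f"mean height, women: {sum(women) / len(women):.1f} cm")
# Doubling the male sample turns up more extreme cases (very tall men),
# but the group means, which are what get compared, are unaffected.
```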
Far more worrying than the sample sizes would be selection criteria that perhaps systematically bias the sample of passages — the matters of “relevancy” and, especially as Groiss mentions, the exclusion of passages for being insufficiently explicit about relatively minor variables.
As for inter-rater reliability, passages rated by two different raters were surely not rated by the same two raters each time. Rather, each rater participating in the study would have rated a portion of these so that (1) they could be shown to have been “trained” to be reliable raters and (2) the criteria for rating could be shown to be reproducible. Although criteria might still be subjective, if a group of people all independently make sufficiently similar ratings, then the ratings are seen as more reliable. There must really be something there that raters are measuring, even if it is difficult to define precisely.
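For what it’s worth, Cohen’s kappa is one common way such reliability is quantified; whether this study used it, or some other statistic, isn’t stated in anything we’ve seen, and the ratings below are invented for illustration:

```python
# A sketch of Cohen's kappa: observed agreement corrected for chance agreement.
# The ratings below are invented; the study's actual statistic is not stated here.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Two raters scoring eight passages as negative/neutral/positive:
a = ["neg", "neg", "neu", "pos", "neg", "neu", "neu", "pos"]
b = ["neg", "neg", "neu", "pos", "neu", "neu", "neg", "pos"]
print(f"kappa = {cohen_kappa(a, b):.2f}")  # ~0.62: substantial but imperfect agreement
```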
The number of passages rated by multiple raters here, more than one in five, is probably quite acceptable. I was initially surprised it was that high. I think when I was doing inter-rater reliability — the ratings were on criteria for personality disorder from the DSM-IV, which would be fairly subjective so far as the field of psychology goes, and probably far more subjective than the ratings here — we had promised multiple ratings on 1 in 10 participants. Of course, our ratings required around six hours of interviews with each participant, so to rate one participant twice was costly. Still, I don’t think anyone saw that as a low number.
I don’t want to get too caught up in issues of establishing statistical reliability, because I have no expertise in the area and they don’t go to the core of my criticisms. I take your point about two raters, and yes, there were about ten times that many research assistants. I do have substantial experience myself participating in writing assessments, creating writing assessment rubrics, and leading the norming, as it is called, of the raters. I know how problematic inter-rater reliability can be, and, even when it is established – depending on the rubric, or criteria, the tendencies of team leaders, and how much raters conform to those tendencies – how far from scientific, with all that implies, results can be. I understand a study like this tries to refine away the kind of subjectivity that writing assessment involves, for instance, as already noted, by eliminating consideration of factuality. I have already made clear how invalidating I think that procedure is, and as Groiss states, “I have found deficiencies on both levels of definition and actual use.”
Regarding the numbers of books and the number of passages assessed, you are right: we know less than we could about the criteria for relevance by which several stages of reduction in number took place, from a large imbalance on the Israeli side, to a small imbalance on the Palestinian, to a large imbalance again on the Israeli. Presumably, these reductions took place before the assessment of negativity. Still, we have a process in which a rough parity in numbers of books, at the near end, produced a large preponderance of Israeli passages to assess. Was there review of that outcome? How was that outcome in itself assessed as contributing to treatment of the “other”? Finally, I’m on shakier ground here, but a study of comparative heights does not seem comparable. Height is a natural phenomenon. The numbers will become predictive. Demographics will become predictive. I am not seeing how the occurrence of a relevant passage on one page of a text – and with so many variables concerning the text – is predictive of any occurrence on the next page or in an additional text. A larger pool, even converting to percentages, seems likely to produce greater accuracy.
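To put a rough number on that last intuition, here is a minimal sketch of how the uncertainty of an estimated percentage narrows as the pool grows, using the reported pool sizes and an invented 20% rate of “negative” passages. The caveat in the final comment is my real point:

```python
# How the uncertainty of a percentage narrows with pool size, using the study's
# reported pool sizes and an invented 20% rate of "negative" passages.
import math

def half_width_95(p, n):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

for label, n in (("Israeli pieces", 2188), ("Palestinian pieces", 960)):
    print(f"{label} (n={n}): 20% +/- {half_width_95(0.20, n):.1%}")
# Israeli pieces (n=2188): 20% +/- 1.7%
# Palestinian pieces (n=960): 20% +/- 2.5%
# The interval assumes passages are independent draws from a stable population,
# which is precisely what is in question with curated textbook selections.
```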