It just occurred to me that I could simply have posted this as a reply in the same thread. Oops, my brain is slow today~

Part 2: How E-rater Works

When it comes to E-rater, what nags at people most is doubt about whether the system is valid. The most extreme example is probably the test taker who, during the GMAT, wrote a sentence like “I don’t want to be graded by a robot” in their essay.
In 1999, Business Week interviewed Fred McHale, then GMAC Vice-President for Assessment & Research, and this very issue came up (http://businessweek.com/bschools/originals/bs90329.htm):
Q: What has been the biggest challenge surrounding the E-Rater since you've implemented it? Have you encountered a lot of skepticism? Are folks scratching their heads wondering how this electronic assessment software actually works, wondering if the results have any validity?
A: There has been a lot of skepticism, and it was expected. People tend to think that E-Rater is just your average grammar-checker on your word processor. But that's just not the case. All we can do is show the results.
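Indeed, all they can do is show the results. The papers available on the official ETS website show that agreement between E-rater scores and human reader scores has been a central concern of the research team all along. At least three papers describe the following experiment: for a given prompt, take n essays at each of the score levels assigned by human readers (1, 2, 3, 4, 5, and 6), have E-rater score them, and then examine how well the two sets of scores agree. The results show that Exact Agreement and Adjacent Agreement between E-rater and the human readers account for the overwhelming majority of cases, while Disagreement is a small minority. According to the published results, agreement between E-rater and human readers is above 80% at every score level, and the average agreement rate is around 90%. In addition, two human readers do not always agree with each other either (there are experimental records of this as well). When that gap is compared with the gap between E-rater and human readers, the conclusion drawn in the literature is that the validity of E-rater scoring (the “Automated Essay Score Validity” that appears repeatedly in those papers) can be fully assured.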
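The agreement figures described above can be illustrated with a minimal sketch. It assumes that “exact” means the two scores are identical and “adjacent” means they differ by exactly one point; some reports fold exact agreement into the adjacent figure. The function name `agreement_rates` and the score lists are made up for illustration and are not data from the ETS studies.

```python
def agreement_rates(human_scores, machine_scores):
    """Return (exact, adjacent, disagreement) rates for paired scores on a 1-6 scale."""
    if len(human_scores) != len(machine_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(human_scores)
    pairs = list(zip(human_scores, machine_scores))
    exact = sum(1 for h, m in pairs if h == m)               # identical scores
    adjacent = sum(1 for h, m in pairs if abs(h - m) == 1)   # off by exactly one point
    return exact / n, adjacent / n, (n - exact - adjacent) / n


# Hypothetical scores for eight essays; illustration only.
human = [1, 2, 3, 4, 5, 6, 4, 3]
machine = [1, 3, 3, 4, 4, 6, 2, 3]

exact, adjacent, disagreement = agreement_rates(human, machine)
print(f"exact={exact:.3f} adjacent={adjacent:.3f} disagreement={disagreement:.3f}")
# prints: exact=0.625 adjacent=0.250 disagreement=0.125
```

Running the same function on the scores of two human readers would give the human-to-human baseline that the machine-to-human figures are compared against.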
So how exactly does E-rater manage this?

Can a computer program really produce a quantitative evaluation of an essay?
Perhaps the first time we heard of E-rater, the question that came to mind was: “What can a computer program do? Count the words in the essay? Compute the average sentence length? Tally how often certain words appear? And then hand out a score? Surely not?”

That view probably underestimates what E-rater can do.
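We have no way of knowing the specifics of how E-rater's evaluation and recognition are designed (naturally, since this is essentially a trade secret), but the literature currently available does offer some clues. Take the sentence “I also assume that shrinking high school enrollment…”. At the very least, it can be analyzed as follows: “also” signals a parallel argument, “that” signals a claim, and the content the sentence involves is assume, shrink, high school enrollment… In other words, the way E-rater works goes far beyond simply counting words or tallying word frequencies.

Here is another example: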
Q: Differentiate between triggers and stored procedures.
A: Triggers are programs embedded within a table that are automatically invoked by updates to another table. Stored procedures are programs embedded within a table that can be called from an application program.
What can be identified in this passage? The literature gives at least the following:
Syntactic Variety: …can be called from a program
…that a program can call
Synonymy: …can be invoked from a program…
Negation: …are NOT invoked by updates…
Anaphoric Reference: TRIGGERS are programs. THEY are embedded…
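So we can see that the elements E-rater identifies may go far beyond what we would ordinarily imagine, and we probably have to admit that this identification is reasonably designed and applied. Consider the following passage from the literature: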
E-rater focuses on three general classes of essay features: discourse, indicated by various rhetorical features that are expected to occur throughout an essay; syntactic, indicated by the structure of sentences; and content, indicated by prompt-specific vocabulary expected to be present in the essay. A total of 59 features are “extractable,” but in practice usually only the most predictive features, as measured by their regression weights, are retained and used for further scoring.
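To make the “most predictive features, as measured by their regression weights” idea concrete, here is a rough sketch of a linear scoring model. Everything in it is hypothetical: the feature names, the weights, the intercept, and the choice to rank features by absolute weight (which only makes sense if the features are on comparable scales). The actual E-rater model is not public, so this is an illustration of the general technique only.

```python
# Hypothetical feature names and regression weights; not the real e-rater model.
WEIGHTS = {
    "summary_cues": 0.40,               # discourse feature
    "belief_cues": 0.05,                # discourse feature
    "subordinate_clauses": 0.30,        # syntactic feature
    "modal_auxiliaries": 0.02,          # syntactic feature
    "prompt_vocabulary_overlap": 1.20,  # content feature
}
INTERCEPT = 1.0  # hypothetical


def most_predictive(weights, k):
    """Keep the k features with the largest absolute regression weight."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return ranked[:k]


def predict_score(features, weights, retained, intercept=INTERCEPT):
    """Linear prediction over the retained features, rounded and clamped to 1-6."""
    raw = intercept + sum(weights[name] * features.get(name, 0.0) for name in retained)
    return max(1, min(6, round(raw)))


retained = most_predictive(WEIGHTS, 3)
essay_features = {
    "summary_cues": 1,
    "belief_cues": 2,
    "subordinate_clauses": 4,
    "modal_auxiliaries": 3,
    "prompt_vocabulary_overlap": 1.5,
}
print(retained)
print(predict_score(essay_features, WEIGHTS, retained))  # raw value is about 4.4, so this prints 4
```

In a real system the weights would be fitted against human-scored essays for each prompt rather than written by hand.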
The 59 features mentioned in the passage quoted above cover a wide range. For syntactic variety, for example, the literature lists the following (the list is of course incomplete): the number of complement clauses, subordinate clauses, infinitive clauses, and relative clauses, and occurrences of subjunctive modal auxiliary verbs such as would, could, should, might, and may. For argument structure, E-rater concentrates on identifying parallelism, contrast, evidence, argument development, and some other coherence relations. As for discourse, the following passage is very illuminating:
Literature in the field of discourse analysis points out that rhetorical relations can often be identified by the occurrence of cue words and specific syntactic structures (Cohen 1984, Mann and Thompson 1988, Hovy, et al. 1992, Hirschberg and Litman 1993, Van der Linden and Martin 1995, Knott 1996). E-rater follows this approach by identifying and quantifying an essay’s use of cue words and other rhetorical structure features. For example, we adapted the conceptual framework of conjunctive relations from Quirk, et al. (1985) in which phrases such as “In summary” and “In conclusion,” are classified as conjuncts used for summarizing. E-rater identifies these phrases and others as cues for a Summary relation. Words such as “perhaps” and “possibly” are considered to be cues for a Belief relation, one used by the writer to express a belief while developing an argument in the essay. Words like “this” and “these” are often used within certain syntactic structures to indicate that the writer has not changed topics (Sidner 1986). In certain discourse contexts, structures such as infinitive clauses mark the beginning of a new argument.
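The cue-word approach in this passage can be sketched in a few lines. The sketch below uses only the four cue phrases the passage names and counts plain phrase matches; the real cue lexicon is larger and also relies on syntactic context, neither of which is public. The names `CUES` and `count_cues` and the sample text are assumptions made for illustration.

```python
import re

# Cue phrases taken from the quoted passage; the real lexicon is larger and not public.
CUES = {
    "Summary": ["in summary", "in conclusion"],
    "Belief": ["perhaps", "possibly"],
}


def count_cues(essay):
    """Count whole-phrase, case-insensitive matches of each relation's cue phrases."""
    text = essay.lower()
    counts = {}
    for relation, phrases in CUES.items():
        counts[relation] = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
            for phrase in phrases
        )
    return counts


# Hypothetical sample text.
sample = "Perhaps the school should expand. In conclusion, enrollment will possibly shrink."
print(count_cues(sample))  # {'Summary': 1, 'Belief': 2}
```

Counts like these would be only one input among many; on their own they say nothing about whether the cue is used appropriately.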
As the quoted passage shows, by identifying an essay's features E-rater is fully able to make the relevant judgments about it. Here is a real example. On coherence, the passage below received a score of 6, with this comment: “The following paragraph demonstrates an example of a maximally coherent text, centering the company ‘Famous name’s Baby Food’ and continuing with the same center through the entire paragraph.”
Yet another company that strives for the “big bucks” through conventional thinking is Famous name’s Baby Food. This company does not go beyond the norm in their product line, product packaging or advertising. If they opted for an extreme market place, they would be ousted. Just look who their market is. As new parents, the Famous name customer wants tradition, quality and trust in their product of choice. Famous name knows this and gives it to them by focusing on “all natural” ingredients, packaging that shows the happiest baby in the world and feel good commercials the exude great family values. Famous name has really stuck to the typical ways of doing things and in return has been awarded with a healthy bottom line.
The following comment and its accompanying sample text further illustrate how E-rater identifies coherence:
Following the same mark-up conventions, we demonstrate text incoherence with an excerpt (a paragraph again) of a student essay scored 4. In this case, repeated Rough-Shift transitions are identified. Several entities are centered, opinion, success and conventional practices, none of which is linked to the previous or following discourse. This discontinuity created by the very short lived Cbs makes it hard to identify the topic of this paragraph and at the same time it is capturing the fact that the introduced centers are poorly developed.
Here is the rambling passage in question, which was scored 4 on coherence:
I disagree with the opinion stated above. In order to achieve real and lasting success a person does not have to be a billionaire. And also because conventional practices and ways of thinking can help a person to become rich.
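The “Rough-Shift” label in the comment above comes from centering theory, where each transition between utterances is classified by two tests: whether the backward-looking center (Cb) stays the same, and whether the Cb is also the preferred center (Cp) of the current utterance. The sketch below implements that standard four-way table. The center labels assigned to the two example paragraphs are hand-picked and hypothetical; they are not the annotation used in the paper, and this is not how E-rater itself is implemented.

```python
def classify_transition(cb_prev, cb_curr, cp_curr):
    """Classify one transition from the backward-looking center (Cb) and preferred center (Cp)."""
    same_center = cb_prev is None or cb_curr == cb_prev
    center_is_preferred = cb_curr == cp_curr
    if same_center:
        return "Continue" if center_is_preferred else "Retain"
    return "Smooth-Shift" if center_is_preferred else "Rough-Shift"


def transitions(utterances):
    """utterances: list of (cb, cp) pairs in text order; returns one label per adjacent pair."""
    return [
        classify_transition(prev[0], curr[0], curr[1])
        for prev, curr in zip(utterances, utterances[1:])
    ]


# Hypothetical, hand-assigned centers for illustration only.
coherent = [("company", "company"), ("company", "company"), ("company", "company")]
incoherent = [("opinion", "writer"), ("success", "person"), ("conventional practices", "person")]

print(transitions(coherent))    # ['Continue', 'Continue']
print(transitions(incoherent))  # ['Rough-Shift', 'Rough-Shift']
```

A paragraph that keeps one center throughout produces only Continue transitions, while one that keeps introducing unlinked centers produces repeated Rough-Shifts, which matches the contrast the two comments draw.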
To sum up, E-rater is capable of identifying and judging the features of an essay. The following three paragraphs, excerpted from the literature, make the best closing words for Part 2.
Overall, while it is largely the case that the raters were not actually counting occurrences of indicator cues representing e-rater features, they were tracing qualities that incorporate such features.
Specifically, when an essay writer would make a certain type of assertion in the essay, the raters would expect to see the associated use of certain types of syntactic structures. The absence of such syntax in such an instance would render the assertion superficial. While essays with and without such syntactic variety were both seen, clearly the essays containing the syntactic variety associated with that type of discourse were viewed by the raters as superior.
Obviously, e-rater does not read an essay, so it cannot “look for” or “evaluate” writing qualities. However, e-rater can, and does in some instances, detect evidentiary traces, the proverbial “breadcrumbs in the path,” that signal these qualities, using its own version of the characteristics.