Most schools assign each student a “grade point average”, i.e., a number that averages over many teacher evaluations of that student. Many schools also assign each teacher an “average student evaluation”, i.e., a number that averages over many student evaluations of that teacher. Many workplaces similarly post evaluations which average worker performance ratings across different tasks. And sports leagues often post rankings of teams, which average over team performance across many contests.
A lot rides on such metrics, yet because they are simple aggregates over contests of varying difficulty, they create incentives for players to “game” them. For example, students seek to take, and teachers seek to teach, easy/fun classes; workers seek to do easy tasks; and sports teams seek to play easy opponents.
Yet we have long known of a better way, one I described briefly in 2001: stat-model-based summary evaluations.
For example, imagine that a college took all of its student grade transcripts as data, and from that made a best-fit statistical linear regression model. Such a model would predict the grade of each student in each class using a linear combination of features of each class, such as subject, location, and time of day and week, and also “fixed effects” for dates, professors, and especially students. That is, the regression formula would include a term in its sum for each student: that student’s coefficient, times an indicator equal to one if that datum is a grade for that student, and zero otherwise.
Such a fixed effects regression coefficient regarding a student should effectively correct for whether the student took easy or hard majors, classes, profs, times of day, year of degree, etc. Furthermore, standard stat methods would give us a “standard error” uncertainty range for this coefficient, so that we are not fooled into thinking we know this parameter more precisely than we do.
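To make this concrete, here is a minimal sketch of such a regression in Python, using statsmodels. The transcript file and its column names here are made up for illustration; a real version would need more features and more care:

```python
# A minimal sketch, assuming a transcript table with hypothetical columns:
# grade (numeric), student, professor, subject, term.
import pandas as pd
import statsmodels.formula.api as smf

transcripts = pd.read_csv("transcripts.csv")  # hypothetical data source

# Predict each grade from class features plus fixed effects for terms,
# professors, and (especially) students.
fit = smf.ols(
    "grade ~ C(subject) + C(term) + C(professor) + C(student)",
    data=transcripts,
).fit()

# Each student's fixed-effect coefficient is their G.P.C.; .bse gives
# the standard-error uncertainty range on that estimate.
gpc = fit.params.filter(like="C(student)")
gpc_se = fit.bse.filter(like="C(student)")
print(pd.DataFrame({"GPC": gpc, "std_err": gpc_se}))
```

Note that one student gets absorbed as the baseline category, so these coefficients are relative estimates; shifting them all by a constant changes nothing that matters for ranking.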
Thus a “grade point coefficient”, i.e., a G.P.C., should do better than a G.P.A. as a measure of the overall quality of each student. And the more that potential employers, grad schools, etc. focused on G.P.C.s instead of G.P.A.s, the less incentive students would have to search out easy classes, profs, etc. We could do the same for student evaluations of professors, and the more we relied on prof fixed effects to judge profs, the less incentive they would have to teach easy classes, or to give students As to bribe them into giving high evaluations.
The general idea is simple: fit performance data to a statistical model that estimates each performance outcome as a function of the various context parameters that one would expect to influence performance, plus a parameter representing the quality of each contestant. Then use those contestant parameter estimates as our best estimates of contestant quality. Such statistical models are pretty easy to construct, and most universities contain hundreds of people who are up to this task. And once such models are made and listened to, then contestants should focus more on improving their quality, and less on trying to game the evaluation metric.
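To show how generic that recipe is, here is the same sketch in general form; the function and all its argument names are placeholders of mine, not a real API, and only the outcome, the contestant label, and the context features change across domains:

```python
# General recipe: outcome ~ context features + contestant fixed effects.
import statsmodels.formula.api as smf

def contestant_quality(df, outcome, contestant, context_terms):
    """Return (quality estimates, standard errors) for each contestant."""
    formula = f"{outcome} ~ {' + '.join(context_terms)} + C({contestant})"
    fit = smf.ols(formula, data=df).fit()
    key = f"C({contestant})"
    return fit.params.filter(like=key), fit.bse.filter(like=key)

# e.g., for sports teams, something like:
# quality, se = contestant_quality(
#     games, "point_margin", "team",
#     ["C(opponent)", "home_game", "C(season)"])
```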
Yes, as new data comes in, the models would get adjusted, meaning that contestant estimates would change a little over time, even after a contestant stopped having new performances. Yes, there will be questions of how many context parameters to include in such a model, but there are standard stat tools for addressing such questions. Yes, even after using such tools, there will remain some degrees of freedom regarding the types and functional forms of the model, and how best to encode key relevant factors. And yes, authorities can and would use those remaining degrees of freedom to get evaluation results more in their preferred directions.
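For instance, among those standard tools: fit several candidate specifications and compare penalized fit scores such as AIC or BIC. A rough sketch, continuing the hypothetical transcripts example above:

```python
# Compare candidate specifications by how many context parameters to keep.
import statsmodels.formula.api as smf

candidates = [
    "grade ~ C(student)",
    "grade ~ C(student) + C(professor)",
    "grade ~ C(student) + C(professor) + C(subject) + C(term)",
]
fits = {f: smf.ols(f, data=transcripts).fit() for f in candidates}
for formula, fit in sorted(fits.items(), key=lambda kv: kv[1].aic):
    print(f"AIC={fit.aic:12.1f}  BIC={fit.bic:12.1f}  {formula}")
```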
But even so, this should be a huge improvement over the status quo. Instead of students looking for easy classes to get easier As, they’d focus instead on improving their overall abilities.
To prove this concept, all we need is one grad student (or exceptional undergrad) with stat training willing to try it, and one university willing to give that student access to their student transcripts (or student evals of profs). Once the constructed models passed some sanity tests, we’d try to get that university to let its students put their G.P.C.s onto their student transcripts. Then we’d try to get the larger world to care about G.P.C.s. So, who wants to try this?
P.S. I’ve posted previously on how broken are many of our eval systems, and how a better entry-level job eval system could allow such jobs to compete with college.
Added: This paper and this paper show in detail how to do the stats.
One could get more than one useful number per student by adding terms that interact the student fixed effect terms with other features of classes. That second paper shows that a two-number system is more informative, but rejects it because “gains realized with the two-component index are offset by the additional complexity involved in explaining the two-component index to students, employers, college administrators and faculty.”
One might allow students to experiment with classes in new subjects by including a term that encodes such cases. One might include terms for race, gender, age, etc. of students, though I’d prefer transcripts to show student GPCs with and without such terms.
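A rough sketch of these extensions, again with hypothetical feature names: an interaction for a two-number index, an indicator for a student’s first class in a new subject, and demographic terms so GPCs can be reported both with and without them:

```python
# Sketch of the extensions above; every feature name is hypothetical.
import statsmodels.formula.api as smf

base = "grade ~ C(term) + C(professor) + C(student)"

# Two numbers per student: interact student effects with a class feature,
# e.g. whether the class is quantitative.
two_number = smf.ols(base + " + C(student):is_quantitative",
                     data=transcripts).fit()

# Encourage experimentation: a term flagging a student's first class
# in a new subject, so such grades count differently.
exploring = smf.ols(base + " + first_class_in_subject",
                    data=transcripts).fit()

# Demographic terms, so GPCs can be shown with and without them.
with_demog = smf.ols(base + " + C(race) + C(gender) + age",
                     data=transcripts).fit()
```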
Added 17Oct: This book by Valen Johnson considers in detail models like those I describe above, wherein the performance of a student in a class is a linear combination of a student term, a class term, and an error. Except that sometimes, instead of estimating a grade point, they estimate discrete grades, using several terms per class to describe the underlying cutoffs between different discrete grades.
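In symbols (my notation, and only a rough sketch of Johnson’s setup), the latent performance of student $s$ in class $c$ is

$$p_{sc} = \mu_s + \alpha_c + \epsilon_{sc},$$

and the discrete-grade variant assigns grade $k$ exactly when the per-class cutoffs bracket that performance:

$$g_{sc} = k \quad \text{iff} \quad \gamma_{c,k-1} < p_{sc} \le \gamma_{c,k},$$

where $\mu_s$ is the student term, $\alpha_c$ the class term, $\epsilon_{sc}$ the error, and the $\gamma_{c,k}$ are the estimated cutoffs between discrete grades in class $c$.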
The student term sets an “adjusted GPA” and Johnson proposes to “allow students to optionally report adjusted GPAs on their transcripts.” He reports that when he attempted but failed to get Duke to do this in 1996, this was the biggest issue:
When the achievement index was considered for use as a mechanism to adjust GPAs for students at Duke, instructors who regularly assigned uniformly high grades quickly realized that the achievement index adjustment will make their grades irrelevant in the calculation of student GPAs. Worse still, many students notice the same thing. To thwart the adoption of the achievement index, these high-grading instructors and their student benefactors adopted the position that an A represented an objective assessment of student performance. An A was an A was an A. For them, it represented “excellent” performance on some well-defined but unobservable scale. Indeed, by the end of the debate, several literary theorists had finally identified an objective piece of text: a student grade. (p.222)
Apparently Johnson and others have long tried but failed to get schools to adopt GPCs and variations on them.
All trends in higher education are to make grades and assessments of students less informative than they were even 30-50 years ago. Look at the SAT: it was renormed in the early 90s so that overnight the number of double Verbal/Math 800s increased sixfold, making it harder for high scorers to distinguish themselves from those a tier lower. This change has brought NO complaints from the edu establishment, but has instead led them to double down by claiming that even the watered-down SATs are too discriminatory, so that all elite schools have now eliminated SATs for admission.
A current attempt at this, even one initiated by the promoter of GPCs, would be dismissed by universities without a second look. Especially by those high-status places that know well how to implement these stat techniques. THEY know how to distinguish good from bad students but want employers to treat all their graduates -- the useful and the useless alike -- as stellar products.
Thanks Robin. My point is that the subpopulation /is/ segmented (mainly between schools, and then at all levels up); the data is very sparse, since students only take classes at one school, so the model 'should' go hierarchical here. That segmentation makes models for proper ranking on skill a lot harder than those in the GitHub (alas, no data there). And that's not even talking about things like US 'sub'-populations, although even that would become a theme, I imagine, if you tried to run this model at scale.
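For concreteness, a minimal sketch of what "going hierarchical" might mean, using statsmodels' MixedLM and reusing the hypothetical transcripts table from the post above (a production system would need far more):

```python
# Schools as groups, with student effects nested within schools;
# all column names are hypothetical.
import statsmodels.formula.api as smf

mixed = smf.mixedlm(
    "grade ~ C(subject) + C(term)",            # shared context effects
    data=transcripts,
    groups=transcripts["school"],              # school-level grouping
    re_formula="1",                            # random intercept per school
    vc_formula={"student": "0 + C(student)"},  # student effects within school
).fit()

# Shrunken student effects (partially pooled toward their school) would
# then play the G.P.C. role across schools.
print(mixed.random_effects)
```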
I have some experience in building models from a research perspective versus building models for a production environment (both private and public). The difference in the amount of work is several orders of magnitude.
An example: say your child has a top-5% GPA at their school, but gets a top-20% skill ranking. The school is just not that good. You'd be upset, since that is the difference between the Ivy League and a no-name university (trying to think US here). So you would try to get to the bottom of the model. You then find out that the model is creating some crazy fixed or random effect on a logit that pretty much dooms all the students in the school. The researcher would say: hey, that's just metrics in action, and on the whole the model works better than GPA. The (government?) entity running the model would get sued into apoplexy.