Graded Lesson Observations: Alive and Kicking?

In Debates and Polemics, Evidence in Education, Research Evidence, by Alex Quigley

Mention the phrase ‘graded lesson observations’ in any staffroom in the country and what would be the response?

In many staffrooms they are derided as an ugly feature of a particular strain of virulent OFSTED-itis. Only three or four years ago ‘graded lesson observations’ were the norm in pretty much every school in the nation. Since then, with repeated confirmation from OFSTED, the practice is on the wane. Even so, many staffrooms still speak of being subjected to this discredited and discouraging practice.

So why are we hanging onto this zombie of supposed-school-improvement?

I would be intrigued if we could pin some exact statistics onto the ongoing use of graded lesson observations. Maybe they have been given a new name, rebranded as ‘lesson grading over time’, ‘learning tours’, or some such linguistic sophistry. My anecdotal understanding is that the practice is alive and well in too many schools to mention – though proper statistics would prove useful.

Perhaps a reminder of some statistics about lesson observations would prove helpful here…

Professor Rob Coe made what was for me a defining speech on the unreliability of graded lesson observation data at ResearchEd, back in 2013 (the detail is written up in this superb blog – here). The cold facts: if a lesson were to be judged outstanding, the probability that a second observer would attribute a different grade would be between 51% and 78%; if a lesson were to be judged inadequate, then the probability of a second observer attributing a different grading would rise to 90%. Professor Coe cited the Bill and Melinda Gates Foundation’s seminal MET Study – costing a whopping $50 million. Its findings: you are pretty much better off tossing a coin than bargaining on the accuracy of a single school leader grading a lesson observation!

Research by Ho and Kane (2013) on ‘The Reliability of Classroom Observations by School Personnel’ showed that observers regress to the mean and avoid giving the top and bottom gradings (with a scale of four grades, just like OFSTED has, or had); we rate teachers in our own school better than others; we develop positive impressions of teachers, which influences all future ratings (surely the converse is true too). Finally, having more than one observer proved a must if you were to have any validity attributed to a judgment.

Many of the findings regarding unreliability relate to our emotional reasoning – or, more accurately, our inability to separate our emotional biases from our professional judgments. We simply cannot separate the person and the practice. We judge physically confident people as more competent. To make matters more dubious, we overestimate the effects of the teacher and underestimate the effects of the context. For example, if students are calm and quiet, we will likely assume that is down to the control of the teacher, rather than the possibility of early morning sleepiness on the part of the class – see the research on Attribution Error and the Quest for Teacher Quality.

Much of this evidence has already been widely shared and I am flailing and flogging at a horse that has long since been taken to the knacker’s yard. And yet, we still have a number of schools grading lessons; we have a number of schools rebranding the practice, but with little real difference from what has come before.

Only recently, I spoke to an excellent school leader who had devised their own lesson grading system that they thought foolproof. They may be right – a brilliant outlier that defies the mass of internationally available evidence – but they probably aren’t. Thinking we are exempt from the biases and failings evident in the research evidence is commonplace. When it comes to the development, and even the well-being of teachers, we should remain highly circumspect.

In his recent blog, entitled ‘The Semmelweis Reflex: Why does Education Ignore Important Research?’, Carl Hendrick relates the definition of the Semmelweis reflex. That is to say: “the reflex-like tendency to reject new evidence or new knowledge because it contradicts established norms, beliefs or paradigms.” This dismissive response to evidence that doesn’t suit our agenda is alive and kicking sand in the face of teachers everywhere.

Why is lesson observation grading still so attractive? Well, it provides a lazy method of managerial compliance. Too many school leaders, afeared of the toe-capped boot of OFSTED, feel they cannot risk dropping lesson gradings, but some will also secretly harbor the thought that constant OFSTED-fear is no bad thing in leveraging control in their school.

You have to wonder: why do so many OFSTED whispers linger on in schools when school leaders hear every update? Is it little more than Stockholm Syndrome – or do some schools secretly want to utilise the jack-boot of high-stakes accountability for their own ends? Teachers should ask questions of this zombie practice and they should demand better.

Continuing with lesson observation grading may be explained away as just one small element of a teacher performance development judgment, but it is a sad indictment given that formative feedback on lesson observations could garner greater collegiate trust and actually help genuinely develop the quality of teaching in our schools. I don’t imagine this trust will be established overnight in schools – teachers will distrust lesson observations for a while yet – but we can make a start now by killing off the zombie that is graded lesson observations. We can surely do better.


  1. There are of course schools that ‘don’t grade’ yet still have a hidden spreadsheet where they guesstimate a grade from reading the paperwork….

  2. Do you think that frequency solves some of these issues Alex? i.e. making a one-off judgement is flawed because it’s such a small snapshot, but if a teacher’s lessons are judged more frequently then a more reliable picture emerges? Similarly, if Ofsted or others are trying to gauge the quality of teaching in a school then putting together the indicative grades from several lessons allows a more solid picture to emerge. So any single lesson judgement might not be reliable, but the aggregation brings a higher degree of accuracy. I note your point about lazy managerial compliance, but I think it’s reasonable that school leaders want to know the quality of teaching in their school, and sometimes this requires judgements to be made, albeit with caution.

    1. Author

      Hi Steve,

      A few problems. As the Ho and Kane study cited shows, having multiple observations brings other biases into play. People develop a ‘halo’ and get given the same grades over and over: X quickly gets deemed ‘outstanding’ and it becomes a self-fulfilling prophecy; of course, the opposite is true too. We judge those who are physically expressive and ‘confident’ as better, but it may not be reflected in the learning at all. So many personal biases at play – even doing it three times a year isn’t nearly enough. The Gates study (huge evidence) shows this.

      You can surely reasonably judge teaching quality by triangulating evidence? Exam results data; work scrutiny; lesson planning evidence; resources created; student feedback etc. We are swimming in evidence! If a lesson shows a teacher doesn’t have behavioral control etc. you could still identify concerns. You can apply all the caution you want, but a judgment between ‘good’ and ‘outstanding’ is little more than an arbitrary coin toss! It is counterintuitive for experienced leaders, but the evidence bears it out.

  3. Thanks Alex, appreciate your response. I absolutely accept that we need to triangulate evidence to gain a rounded view on the quality of teaching, but I’m not sure I accept that judgements on quality of teaching, as observed in a lesson, are too flawed to be of any use. As far as students are concerned the lesson experience IS their school experience so I think it’s reasonable that we seek to understand this experience, which at times will entail making tentative judgements on the quality of that experience. My bias here is that I work for a multi-academy trust and spend a lot of time doing 1-day ‘school reviews’ which involve observing lots of portions of lessons to gain a sense of the quality of teaching across the school. I would hope that by looking for the right things in lessons (do students know what they are doing? Can they explain how to do it well? Do the books indicate that students are improving over time? How much work do students typically produce? Is the learning environment calm and focused? Is work presented with pride and precision?) I can gain a reasonable picture of students’ daily diet, which of course I would then try to triangulate with everything else we know about the school. I also wonder if the other evidence that you mention, e.g. lesson planning evidence and teachers’ resources, has its own flaws and is itself susceptible to bias and personal preferences e.g. worksheets v textbooks; group work v solo study.

    Absolutely love your work by the way!

    1. Author

      Hi Steve,

      We surely all have our own context to manage. Whether it is too flawed to be of any use should be weighed against the question: does it make teachers teach any better? The evidence about adult learning, risk and trust would suggest that getting feedback without a grade would be better than with a grade (ditto students and ‘ego involving feedback’). The whole issue of teachers attempting to fit some ‘observer favoured’ model is also very problematic. Our notion of successful learning has changed a great deal in only the last few years: let’s just consider the notion of learning with ‘rapid pace’ as an example! You can garner triangulation of the school experience through the outcomes, the work and student voice questionnaires/discussions, I would argue.

      Clearly, lesson grades exist, despite their flimsy accuracy, for appraisal and compliance purposes. I understand this – but as you show with your list – we can identify all those things without putting a grading on them. If there is an issue with a lack of focus and poor behaviour, then I would have a duty to support addressing the issue. I have no qualms with that. If we observe lessons in my school (which we do – teachers can determine which lessons by request) then I would have a duty to deal with poor behaviour etc. Labeling a lesson ‘requiring improvement’ or ‘good’ isn’t far off a coin toss given a one day school review though – ditto ‘good’ or ‘outstanding’. I am not comfortable doing that – and I consider myself pretty well trained and knowledgeable about pedagogy. It is not comfortable to accept and it compromises some of our practices, but I think we should challenge ourselves with that evidence.

      Of course the other evidence is flawed and biased – making triangulation even more important.

      Thanks for the positive feedback and engaging with the post!

  4. Hi Alex. One of the problems regarding the continuance of grading seems to be that teachers being observed want to be graded: “I know you’re not grading the lesson, but if you were, what would I get?” is the sort of thing that I’ve heard more than once this term. Or the observer feeds back with: “we’re not grading, but if I was you’d get a … Blah blah.” All this instead of devoting time to a productive and vital conversation about the lesson that both can learn from. As you say, a ‘lazy method of managerial compliance’.

    1. Author

      I hear anecdotally of very similar scenarios. It is about compliance – including teachers not really breaking out of their Stockholm Syndrome in part.

  5. Pingback: The 'OFSTED Matthew Effect' - HuntingEnglish

  6. Hi Alex,

    Thanks for the post. It reconfirmed much of what I think about observations.

    I’m trying to devise a system whereby prior to an observation, I meet with the ‘observee’ the week before to discuss the sequence of lessons leading up to the observed 30 minute sequence. Within this sequence I’d be looking to judge how far the previous lessons support the students’ development within the observed portion of the lesson – if the students are to produce a piece of assessed writing during the lesson, for example, how have the previous lessons built up / supported their ability to produce a good piece of writing?

    I just wondered if you could see any immediate problems with this? I really do not want to arrive at a judgement in terms of my colleagues’ ability to teach on the basis of isolated 60 minute lesson obs.

    Any input – gratefully received. I’ve just begun the second year of my HOD role.

    Regards, Adam

    1. Author

      Are you still grading the snapshot? I have no issue with developmental observations and observing student work in the process – I just don’t see any validity or value in grading it. If you have to, because your school demands grades, then it is a case of having as much observed time and points of ‘evidence’ like students’ work as possible. It is a flawed notion to judge it, beyond the simple behavioral aspect – the observable climate in the classroom.
