Collecting data during pilot testing is the point where assessment design stops being a set of assumptions and starts becoming evidence. In assessment design and development, pilot testing and field testing are the structured processes used to observe how items, tasks, rubrics, timing, instructions, and delivery systems perform with real participants before operational launch. Pilot testing usually refers to an earlier, smaller, more diagnostic trial; field testing often refers to a larger administration designed to estimate item statistics, verify test forms, and confirm administration procedures under conditions closer to live use. Both matter because even well-written assessments can fail in practice through unclear wording, construct underrepresentation, timing problems, technical glitches, scoring inconsistency, or unexpected subgroup effects.
I have seen apparently strong item sets unravel the moment candidates interact with them. A multiple-choice question that looked clean in review became ambiguous when test takers interpreted a single verb differently. A performance task aligned beautifully to standards, yet responses clustered at one score point because the rubric language was too broad. In another project, response time data from a pilot revealed that a reading passage, not the targeted skill, was driving difficulty. These are exactly the problems that careful data collection during pilot testing is supposed to uncover. The goal is not only to find broken items. It is to gather enough qualitative and quantitative evidence to improve validity, reliability, fairness, usability, and operational readiness.
For a hub article on pilot testing and field testing, the core idea is simple: collect the right evidence, in the right format, from the right sample, early enough to make revisions. That means defining decisions in advance, building a data collection plan, capturing both performance and process data, and analyzing findings against clear criteria. It also means understanding the difference between small-sample diagnostics and large-sample calibration, because the evidence you need changes as the assessment matures. When teams treat pilot data as an afterthought, they miss the fastest path to a better instrument. When they treat it as the backbone of development, they ship assessments that behave as intended.
This article explains what data to collect during pilot testing, how field testing extends that work, which methods and tools support strong decisions, and where the most common risks appear. As the hub for Pilot Testing & Field Testing within Assessment Design & Development, it covers the full landscape and gives you a framework you can apply whether you are developing a classroom interim assessment, a certification exam, a licensure test, a language proficiency measure, or a technology-delivered simulation.
What data should you collect during pilot testing?
The best pilot testing plans start with decision-usefulness. Before collecting anything, define what decisions the pilot must support. Common decisions include whether an item is understandable, whether a distractor is functioning, whether a rubric distinguishes levels of performance, whether timing is appropriate, whether platform navigation is intuitive, and whether administration manuals are complete. Once decisions are defined, map each one to specific evidence sources. In practice, I group pilot data into five categories: response data, timing data, process data, scoring data, and feedback data.
Response data includes item responses, omitted items, changed answers, response distributions, and score patterns across forms or tasks. For selected-response items, this is where p-values, distractor selection, and item-total relationships begin to emerge. For constructed responses, response data includes score distributions, common error types, and exemplar quality. Timing data captures total test time, item-level response time, time by section, pauses, and whether certain subgroups consistently run short on time. Process data may include clickstreams, navigation paths, tool usage, keystroke logs, or observation notes from proctors. Scoring data includes inter-rater agreement, exact and adjacent agreement, severity trends, and rubric fit. Feedback data comes from cognitive interviews, post-test surveys, focus groups, and administrator debriefs.
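If your response matrix is clean, the first pass at these statistics takes only a few lines of analysis code. The sketch below is a minimal illustration in Python, assuming a hypothetical pandas DataFrame of dichotomously scored responses (1 = correct, 0 = incorrect, omits as missing) and a second table of raw answer choices; the column names and structure are placeholders, not a prescribed format.

```python
import pandas as pd

def item_statistics(scored: pd.DataFrame) -> pd.DataFrame:
    """Classical item statistics from a 0/1 scored response matrix.

    Rows are test takers, columns are item IDs; omitted items are coded as NaN.
    """
    total = scored.sum(axis=1)  # total score per test taker
    rows = []
    for item in scored.columns:
        rest_score = total - scored[item].fillna(0)  # total score excluding this item
        rows.append({
            "item": item,
            "p_value": scored[item].mean(),                       # facility (proportion correct)
            "corrected_item_total": scored[item].corr(rest_score),
            "omit_rate": scored[item].isna().mean(),
        })
    return pd.DataFrame(rows)

def distractor_counts(choices: pd.DataFrame, item: str) -> pd.Series:
    """Frequency of each raw answer choice (A/B/C/D, blank) for one item."""
    return choices[item].value_counts(dropna=False)
```

With pilot-sized samples, treat these values as signals to pair with interview and observation evidence rather than as final verdicts.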
In pilot work, qualitative evidence is not secondary. It is often the fastest way to diagnose why a result occurred. If an item has a low facility value, statistics tell you what happened, but think-aloud protocols or retrospective interviews often explain why. Test takers may be relying on a misleading cue, misreading a graph, or reacting to vocabulary outside the intended construct. Likewise, raters may disagree not because the rubric is weak overall, but because one trait descriptor uses overlapping language. Strong pilot testing combines the efficiency of numbers with the explanatory power of direct observation and participant voice.
Pilot testing versus field testing: what changes?
Pilot testing and field testing are related but not interchangeable. Pilot testing is typically smaller, more iterative, and more diagnostic. You use it when items, tasks, forms, interfaces, or procedures are still moving. Sample sizes can range from a few participants in usability sessions to a few dozen or a few hundred in early administrations, depending on stakes and format. The purpose is to expose problems quickly and cheaply. Field testing is larger and closer to operational conditions. It is used to confirm that revised materials perform consistently and to estimate the psychometric characteristics needed for final assembly, scaling, or standard setting support.
In pilot testing, a team may deliberately oversample edge cases: novice users, English learners, screen-reader users, very high performers, or low performers likely to reveal access or comprehension problems. In field testing, the emphasis shifts toward representativeness, stable item statistics, and administration fidelity across sites. For a credentialing exam, I might run an early pilot with 40 to 80 candidates and detailed interviews, then a field test with several hundred or more candidates to estimate classical item statistics, inspect dimensionality, and flag differential item functioning for review. The smaller pilot helps shape the instrument; the field test provides stronger evidence that the shaped instrument is ready.
The data collected also changes in depth and breadth. Pilot testing often collects more explanatory evidence per participant. Field testing often collects more standardized evidence across participants. A useful rule is this: pilot testing tells you what to fix, while field testing tells you whether the fixes hold up at scale.
Building a practical data collection plan
A workable pilot testing plan specifies objectives, sample, instruments, procedures, data governance, analysis rules, and revision thresholds. Start with a test blueprint and item inventory. For each item or task, note intended standard, cognitive demand, response format, accessibility supports, and expected evidence. Then define success criteria. An item may need a target facility range, a minimum corrected item-total correlation, or evidence that at least three distractors are plausible. A writing rubric may need exact agreement above a preset threshold and no systematic rater severity drift across sessions.
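Success criteria like these are easier to apply consistently when they are written down as explicit flagging rules. Here is a minimal sketch that screens the item-statistics table from the earlier example against preset thresholds; the specific cutoff values are illustrative and would come from your own blueprint, stakes, and sample size.

```python
import pandas as pd

# Illustrative thresholds only; real values depend on purpose, stakes, and sample size.
P_VALUE_RANGE = (0.30, 0.90)   # target facility range
MIN_ITEM_TOTAL = 0.20          # minimum corrected item-total correlation
MAX_OMIT_RATE = 0.05           # omission rate worth investigating

def flag_items(stats: pd.DataFrame) -> pd.DataFrame:
    """Attach human-readable flags to each item based on preset criteria."""
    def flags_for(row) -> str:
        flags = []
        if not (P_VALUE_RANGE[0] <= row["p_value"] <= P_VALUE_RANGE[1]):
            flags.append("facility outside target range")
        if row["corrected_item_total"] < MIN_ITEM_TOTAL:
            flags.append("weak item-total correlation")
        if row["omit_rate"] > MAX_OMIT_RATE:
            flags.append("high omission rate")
        return "; ".join(flags)

    out = stats.copy()
    out["flags"] = out.apply(flags_for, axis=1)
    return out
```

Flagged items then go to content review alongside the qualitative evidence; the flags prioritize attention, they do not make the revision decision.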
Next, define the sample. Include enough participants to reveal likely issues, but do not pretend that a tiny convenience sample supports stable calibration. If the assessment serves multiple populations, recruit intentionally across those groups. For accessibility, include users of assistive technology during pilot testing, not after launch. Then choose collection methods. Surveys should include both scaled items and open-text prompts. Observation protocols should instruct staff what to record, such as hesitation, requests for clarification, or visible navigation errors. Interview guides should probe interpretation of instructions, confidence, strategy, and perceived fairness.
Teams also need a data management plan. Assign item IDs consistently across content, platform, and analysis files. Predefine variable names, coding rules, missing-data conventions, and version control. In my own projects, the biggest avoidable delays during pilot testing usually come from messy file structures and inconsistent IDs rather than from hard psychometric questions. Clean data architecture is part of quality assurance.
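A small reconciliation script run before any analysis catches most of those ID problems early. This is a sketch under assumed conditions: hypothetical CSV exports from the content inventory and the delivery platform, each with an `item_id` column; adapt the file and column names to your own architecture.

```python
import pandas as pd

def reconcile_item_ids(content_path: str, platform_path: str) -> None:
    """Report item IDs that appear in one file but not the other."""
    content_ids = set(pd.read_csv(content_path)["item_id"].astype(str))
    platform_ids = set(pd.read_csv(platform_path)["item_id"].astype(str))

    missing_from_platform = sorted(content_ids - platform_ids)
    missing_from_content = sorted(platform_ids - content_ids)

    if missing_from_platform:
        print("In content inventory but not in platform export:", missing_from_platform)
    if missing_from_content:
        print("In platform export but not in content inventory:", missing_from_content)
    if not missing_from_platform and not missing_from_content:
        print("Item IDs reconcile cleanly across files.")
```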
| Data type | What it reveals | Common tools | Typical pilot action |
|---|---|---|---|
| Item response data | Difficulty, distractor performance, omissions, score spread | Excel, R, SPSS, assessment platforms | Revise or remove weak items |
| Response time data | Speededness, reading load, interface friction | Platform logs, Tableau, Power BI | Adjust timing, layout, or passage length |
| Cognitive interview notes | Interpretation errors, confusing language, unintended strategies | Dedoose, NVivo, structured note forms | Rewrite stems, options, prompts, and directions |
| Rater agreement data | Rubric clarity, scorer consistency, training gaps | Many-facet Rasch analysis (FACETS), Excel | Refine rubric and retrain scorers |
| Accessibility and usability observations | Navigation barriers, support failures, accommodation fit | UserTesting, screen recordings, WCAG checklists | Fix interface and revise administration guidance |
Methods for gathering high-quality evidence
Several methods work especially well during pilot testing and field testing. Cognitive labs are one of the most efficient for early pilots. Participants answer items while verbalizing their reasoning, or they complete the test and then walk through selected responses in a retrospective interview. This reveals whether the item is eliciting the intended construct. Usability testing is essential for digital assessments. Watch where users click, where they hesitate, and whether they understand navigation, flagging, calculators, text-to-speech controls, or drag-and-drop actions. A technically functional interface is not the same as a usable one.
Small-scale live administrations help test manuals, proctor scripts, and room procedures. For performance assessments, collect anchor responses and train raters before scaling. During scoring pilots, monitor exact agreement, adjacent agreement, and the distribution of ratings by rater. If one scorer consistently rates harshly, that is not a minor issue; it alters score meaning. For larger field tests, use item analysis and dimensionality checks appropriate to the purpose. Classical Test Theory remains useful for early diagnostics, while Item Response Theory can support calibration and form assembly when sample size and model fit are adequate. Differential item functioning review should be part of fairness evaluation, but statistical flags always require content review before action.
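Exact and adjacent agreement are simple enough to monitor live during a scoring pilot. The sketch below assumes two paired lists of integer ratings on the same responses; the mean difference is included as a rough check on rater severity, with positive values meaning the first rater scores higher on average.

```python
def agreement_rates(rater_a: list[int], rater_b: list[int]) -> dict[str, float]:
    """Exact and adjacent (within one point) agreement plus mean score difference."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Ratings must be non-empty and paired one-to-one.")
    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n
    mean_diff = sum(a - b for a, b in zip(rater_a, rater_b)) / n  # > 0: rater A scores higher on average
    return {"exact": exact, "adjacent": adjacent, "mean_difference": mean_diff}

# Two raters scoring the same ten responses on a 0-4 rubric.
print(agreement_rates([3, 2, 4, 1, 3, 2, 0, 4, 3, 2],
                      [3, 3, 4, 1, 2, 2, 1, 4, 3, 2]))
```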
Surveys are valuable when written well. Avoid vague prompts such as “Was the test fair?” Ask participants whether directions were clear, whether any item required knowledge not taught or intended, whether accessibility tools worked as expected, and whether time limits were sufficient. Ask administrators whether any candidate sought clarification, whether instructions were followed consistently, and whether technical interruptions occurred. These details make pilot findings actionable.
Analyzing pilot and field test data without overreaching
The biggest analytical mistake in pilot testing is false precision. Small samples can reveal obvious flaws, but they rarely justify strong statistical conclusions. Use early item statistics as signals, not verdicts. If an item-total correlation is weak in a pilot of 35 participants, inspect the content, distractors, and interview evidence before discarding the item. Likewise, an apparently difficult item may be acceptable if it targets advanced performance and functions cleanly. Interpretation depends on blueprint role, intended proficiency range, and available corroborating evidence.
Field testing allows stronger inference, but even then, analysis should be tied to intended use. For norm-referenced uses, inspect score distributions, reliability estimates, conditional standard errors, and form comparability. For criterion-referenced uses, examine how well items support mastery decisions at cut-adjacent score points. For performance tasks, review rater consistency, prompt comparability, and whether score scales reflect observable differences in work quality. In digital assessments, analyze response times alongside accuracy. Long response time with low accuracy often indicates confusion, while very short response times may suggest rapid guessing.
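A basic response-time screen can support that kind of review. The following sketch assumes a hypothetical table of item attempts with `item_id`, `rt_seconds`, and `correct` columns; the fixed 5-second cutoff is purely illustrative, since operational rapid-guessing thresholds are usually set per item from the observed time distributions.

```python
import pandas as pd

RAPID_GUESS_SECONDS = 5  # illustrative cutoff; set per item from pilot time distributions

def rapid_guess_summary(attempts: pd.DataFrame) -> pd.DataFrame:
    """Per-item rate of very fast responses and their accuracy versus other responses."""
    attempts = attempts.assign(rapid=attempts["rt_seconds"] < RAPID_GUESS_SECONDS)
    rows = []
    for item_id, group in attempts.groupby("item_id"):
        rapid = group[group["rapid"]]
        other = group[~group["rapid"]]
        rows.append({
            "item_id": item_id,
            "rapid_rate": group["rapid"].mean(),
            "accuracy_rapid": rapid["correct"].mean() if len(rapid) else float("nan"),
            "accuracy_other": other["correct"].mean() if len(other) else float("nan"),
        })
    return pd.DataFrame(rows)
```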
Documentation matters as much as analysis. Keep decision logs showing what evidence was reviewed, what change was made, and why. This creates an auditable chain from pilot observation to operational form. It also improves future cycles because teams can see which recurring issues are content problems, scoring problems, or delivery problems.
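A decision log does not require special tooling; even a flat file with consistent fields creates that auditable chain. A minimal sketch, with hypothetical field names:

```python
import csv
import os
from datetime import date

LOG_FIELDS = ["date", "item_id", "evidence_reviewed", "decision", "rationale", "decided_by"]

def log_decision(path: str, **entry: str) -> None:
    """Append one revision decision to a shared CSV log, writing a header on first use."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"date": str(date.today()), **entry})

# Example entry (all values are placeholders):
log_decision(
    "pilot_decision_log.csv",
    item_id="MATH-042",
    evidence_reviewed="p-value 0.18; three interviewees misread the graph axis",
    decision="revise stem and graph labels",
    rationale="confusion tied to construct-irrelevant graph formatting",
    decided_by="content lead",
)
```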
Common problems pilot data should uncover
Well-run pilot testing identifies recurring categories of failure. Content problems include ambiguous stems, implausible distractors, double-keyed items, excessive reading load, construct-irrelevant vocabulary, and prompts that allow multiple defensible interpretations. Psychometric problems include items that are too easy, too hard, non-discriminating, or misaligned with the intended trait. Scoring problems include rubric overlap, missing anchor papers, drift among raters, and inconsistent treatment of partially correct responses. Delivery problems include broken media, inaccessible interactions, browser conflicts, timer errors, and unclear navigation.
Fairness issues deserve special attention. An item can look statistically acceptable overall and still create avoidable barriers for specific groups. During one field test, a scenario-based item set performed normally in aggregate but generated disproportionate confusion among candidates unfamiliar with a culturally specific context that was irrelevant to the skill being measured. The fix was not cosmetic. Replacing the context reduced irrelevant variance and improved interpretability. Pilot testing should also surface accommodation mismatches, such as screen-reader labels that do not match visible labels or diagrams without equivalent text alternatives. These issues affect score validity, not just user experience.
Using findings to improve the assessment
The value of collecting data during pilot testing is realized only when evidence drives revision. After analysis, sort findings into immediate fixes, conditional fixes, and monitor-only issues. Immediate fixes include flawed keys, broken links, misleading instructions, and rubric language causing scorer disagreement. Conditional fixes are issues that need one more round of evidence, such as a borderline item statistic with otherwise solid qualitative support. Monitor-only issues are acceptable for now but worth watching in the next administration.
Revisions should be specific and documented. Replace “improve clarity” with “replace the verb ‘analyze’ with ‘calculate’ in stem 12, shorten option C by eight words, and remove the extraneous chart title.” For rubrics, rewrite descriptors using observable features of performance, retrain raters with fresh anchors, and rerun agreement checks. For timing issues, decide whether to reduce content, split sections, or adjust limits based on where delays occur. Then test again. Assessment quality is iterative, and the strongest programs build pilot testing and field testing into a repeatable development cycle rather than treating them as one-time hurdles.
As a hub for Pilot Testing & Field Testing, the main takeaway is clear: collect evidence that explains performance, not just scores. Use small pilots to diagnose issues, larger field tests to confirm stability, and documented decision rules to turn findings into better items, better scoring, and better delivery. When you gather response, timing, process, scoring, and feedback data in a disciplined way, you reduce avoidable flaws before launch and strengthen the validity of every result that follows. If you are building or revising an assessment, start by drafting a pilot data collection plan tied to the decisions your team must make next.
Frequently Asked Questions
What data should you collect during pilot testing?
During pilot testing, the goal is to collect enough evidence to understand how the assessment actually performs, not just whether it can be delivered. That means gathering both quantitative and qualitative data. On the quantitative side, teams typically track item-level performance, completion rates, timing, score distributions, skipped responses, rubric application patterns, and any unusual trends across participant groups. If the assessment is digitally delivered, system data such as logins, navigation paths, time stamps, interruptions, and technical errors are also extremely valuable. These metrics help reveal whether items are functioning as intended, whether the length is realistic, and whether the test experience matches design expectations.
Equally important is qualitative evidence. Observational notes, participant feedback, interviewer debriefs, proctor reports, rater comments, and cognitive interview findings can uncover issues that numbers alone will miss. For example, an item may appear statistically acceptable while still confusing participants because of unclear wording or misleading formatting. Instructions, accessibility supports, interface design, and scoring rubrics should all be reviewed through user experience data. Strong pilot testing collects evidence across content, administration, scoring, and usability so the development team can identify what is working, what is ambiguous, and what needs revision before a larger field test or operational launch.
Why is collecting data during pilot testing so important in assessment design?
Collecting data during pilot testing is essential because it turns design assumptions into evidence-based decisions. Before pilot testing, many parts of an assessment are still provisional. Developers may believe the instructions are clear, the timing is appropriate, the rubrics are usable, and the items measure the intended construct, but those are still hypotheses until real participants interact with the assessment. Pilot data shows whether the assessment behaves the way it was designed to behave under realistic conditions. Without that evidence, teams risk moving forward with hidden flaws that can affect validity, reliability, fairness, usability, or operational efficiency.
Pilot testing is also where small problems can be found before they become expensive and consequential. A confusing item stem, inconsistent scoring rule, inaccessible interface element, or unrealistic time limit may not be obvious during internal review. Once participants engage with the assessment, however, these issues often become visible through response patterns, participant comments, rater inconsistencies, and administration irregularities. This makes pilot data one of the most practical tools for quality control in assessment development. It helps teams refine the instrument early, prioritize revisions, and make stronger decisions about whether the assessment is ready for field testing, additional redesign, or limited operational use.
How do pilot testing and field testing differ when it comes to data collection?
Pilot testing and field testing are closely related, but they usually serve different purposes and therefore emphasize different kinds of data collection. Pilot testing is generally earlier, smaller, and more diagnostic. The main purpose is to identify weaknesses, unexpected behaviors, and design issues before the assessment is scaled up. Because of that, pilot testing often combines performance data with rich process data such as interviews, observations, participant feedback, and administrator notes. Teams are often willing to pause, probe, or revise based on what they see. The focus is less on making final statistical claims and more on learning how the assessment works in practice.
Field testing, by contrast, is often larger and more standardized. At that stage, the assessment is usually more stable, and the goal shifts toward evaluating psychometric performance, administration consistency, scoring quality, and readiness for operational use. Data collection in field testing still includes monitoring for usability and fairness issues, but the larger sample allows for stronger statistical analysis of item difficulty, discrimination, form performance, subgroup patterns, and scoring reliability. In simple terms, pilot testing helps teams discover what needs to be fixed, while field testing helps confirm whether those fixes hold up at scale. Both are important, but pilot testing is usually where the most exploratory and diagnostic data collection happens.
How can you tell from pilot test data whether an assessment item or task needs revision?
An item or task usually needs revision when the pilot data suggests that participants are responding to something other than the intended construct. There are several warning signs. Unusually high omission rates, extreme timing demands, widespread requests for clarification, or participant comments about confusing wording can all indicate a problem. Quantitative evidence may show that an item is far too easy, far too difficult, or not distinguishing meaningfully between stronger and weaker performers. For constructed-response tasks, inconsistent scoring across raters or unclear rubric application can signal that the task prompt or scoring criteria need work. Technical or formatting issues can also distort performance, especially in digitally delivered assessments.
That said, revision decisions should not be based on a single metric in isolation. A strong review process looks at the full evidence picture: statistical performance, participant experience, content alignment, accessibility considerations, and scoring behavior. For example, an item with weak statistical performance might still be retained if pilot testing reveals that the issue came from a temporary administration problem rather than the item itself. Conversely, an item with acceptable numbers might still require revision if participants consistently misinterpret its instructions. The best decisions come from integrating psychometric data with qualitative review and subject matter expertise so that revisions address the real source of the problem rather than just the symptom.
What are the best practices for collecting high-quality data during pilot testing?
High-quality pilot test data begins with a clear plan. Teams should decide in advance what questions the pilot is meant to answer, what evidence will be collected, who will collect it, and how it will be analyzed. That includes defining success criteria for items, tasks, timing, scoring, instructions, and delivery systems. Sampling matters as well. Even in a small pilot, participants should reflect the intended testing population closely enough to reveal realistic use patterns, accessibility needs, and potential subgroup concerns. Standardized administration procedures are also important because inconsistent delivery can create noise that makes results difficult to interpret.
It is also a best practice to collect multiple forms of evidence at the same time. Response data should be paired with observations, debriefs, technical logs, and scoring reviews so the team can understand both what happened and why it happened. Raters and proctors should be trained to document issues consistently. Participants should have structured opportunities to report confusion, navigation problems, unclear instructions, or fatigue. Data quality checks should be built in from the beginning, including monitoring missing data, verifying time stamps, reviewing anomalous patterns, and confirming that scoring processes are functioning correctly. Finally, the pilot should end with a disciplined review process in which findings are synthesized, revisions are prioritized, and decisions are documented. That final step is critical because pilot testing only adds value when the collected evidence leads to specific, defensible improvements.
