In neonatal intensive care units, there is a need for around-the-clock monitoring of the electroencephalogram (EEG), especially for recognizing seizures. An automated seizure detector with acceptable performance can partly fill this need. Developing such a detector requires an extensive dataset labeled by experts. However, accurately defining neonatal seizures on EEG is challenging, especially when seizure discharges do not meet exact definitions of repetitiveness or of evolution in amplitude and frequency. When several readers score seizures independently, disagreement can be high. Commonly used metrics such as good detection rate (GDR) and false alarm rate (FAR) have limitations when derived from data scored by multiple raters. Therefore, new metrics are needed to measure detector performance with respect to the different sets of labels. In this paper, instead of defining the labels by consensus or majority voting, widely used metrics, including GDR, FAR, positive predictive value, sensitivity, specificity, and selectivity, are modified so that they take the scores of the different raters into account. To this end, 353 hours of EEG data containing seizures from 81 neonates were visually scored by a clinical neurophysiologist and then processed by an automated seizure detector. The scored seizures were mixed with false detections of the automated seizure detector and were relabeled by three independent EEG readers. All labels were then used in the proposed performance metrics, and the results were compared with those of the majority voting technique; the proposed metrics showed higher accuracy and robustness. These results were confirmed using a bootstrapping test.
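To make the contrast between majority voting and rater-aware evaluation concrete, the following minimal sketch compares a detector's sensitivity computed against a majority-vote label with a simple rater-averaged sensitivity. The per-epoch labels, the averaging scheme, and the function names are illustrative assumptions only; they do not reproduce the modified metric definitions proposed in the paper.

```python
import numpy as np

def sensitivity(detections, labels):
    """Fraction of labeled seizure epochs that the detector flags (toy definition)."""
    detections = np.asarray(detections, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    n_seizure = labels.sum()
    if n_seizure == 0:
        return np.nan
    return np.logical_and(detections, labels).sum() / n_seizure

# Hypothetical per-epoch annotations from three independent raters and one detector.
rater_labels = np.array([
    [1, 1, 0, 1, 0, 0, 1, 0],   # rater 1
    [1, 0, 0, 1, 0, 1, 1, 0],   # rater 2
    [1, 1, 0, 0, 0, 0, 1, 1],   # rater 3
], dtype=bool)
detector = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=bool)

# (a) Majority voting: an epoch counts as seizure if at least 2 of 3 raters marked it,
#     collapsing the raters into a single reference label before scoring.
majority = rater_labels.sum(axis=0) >= 2
sens_majority = sensitivity(detector, majority)

# (b) Rater-aware evaluation: score the detector against each rater separately and
#     average, so every rater's opinion contributes to the final figure.
sens_per_rater = [sensitivity(detector, r) for r in rater_labels]
sens_averaged = np.nanmean(sens_per_rater)

print(f"Majority-vote sensitivity : {sens_majority:.2f}")
print(f"Rater-averaged sensitivity: {sens_averaged:.2f}")
```

On this toy data the two approaches disagree (0.75 vs. about 0.58), which illustrates why a metric that retains each rater's labels can behave differently from one computed against a consensus label.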