Developing a Psychological Assessment in a Multicultural Context


Psychological assessment
PYC 4807
Anisa Abbas
3618 8638
Assignment 01, Unique no: 581822
Date: 15 May 2013

List of contents

Abstract
Introduction

Abstract

The adaptation of assessment measures is essential in a multicultural and multilingual society like South Africa if test results are to be valid and reliable for all test-takers. In practice, this implies that test-takers from different cultural, language and/or socio-economic backgrounds should be given the same opportunity to respond to any psychological assessment. For example, companies regularly require prospective employees to undertake various assessment tasks to assess their suitability for a specific position; if prospective applicants speak different languages or come from different cultural backgrounds, the assessment tasks selected should accommodate those languages and cultural backgrounds. Thus, in the context of a multicultural society like South Africa, the adaptation of measures and the detection and elimination of bias from measures play a vital role in the transformation process. To this end, it is important that rigorous methods and designs are used if the information obtained from assessment measures is to be valid and reliable.

Keywords: validity; reliability; assessment.

Introduction

This paper highlights the development of a psychological measure, together with the adaptation required for a multicultural and multilingual context such as South Africa. The abstract above suggests that South Africa has a population diverse in cultural practice and language use, and adapting assessments for such a multicultural, multilingual society raises awareness that culture and language do in fact limit the validity of assessment results. The development of a psychological measure is an extended and complicated task which takes at minimum three to five years from inception to publication, after which the measure becomes available to test users. Psychological assessments are developed by specialized and experienced measurement experts. Companies such as PSYTECH and SHL have tasked themselves with adapting international psychological tests and norming them for the South African context. However, very few multidimensional assessments have been developed that can be applied to the diverse make-up of cultures in South Africa. For the purpose of this paper, the author will strive to create a better understanding of how the principles of psychometric concepts are applied to

develop a psychological measure which is suitable both to the research process and to the South African context. The ensuing discussion aims to provide an adequate explanation of how to construct a test which can be considered both valid and reliable.

1. Psychological assessment

When people talk about psychological tests, they often ask whether the test is valid or not. What exactly does this mean? Psychological assessment is an important part of both experimental research and clinical treatment. One of the greatest concerns when creating a psychological test is whether or not it actually measures what it claims to measure. A valid test ensures that the results are an accurate reflection of the dimension undergoing assessment. For example, a test might be designed to measure a stable personality trait but instead measure transitory emotions generated by situational or environmental conditions.

2. Validity

So what does it mean for a test to have validity? Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted. Validity is not determined by a single statistic, but by a body of research that demonstrates the relationship between the test and the behavior it is intended to measure. There are three types of validity:

2.1 Content validity

Content validity concerns whether the content of the measure covers a representative sample of the behavior being measured. It is non-statistical: a panel of subject experts evaluates the items during the assembly phase (Foxcroft & Roodt, 2013). This form of validity is best applied to the evaluation of achievement and occupational measures. Within content validity, face validity is an important aspect for the test-taker. Face validity has nothing to do with whether the test actually measures the construct; it concerns only whether the test appears valid.

Another way of saying this is that content validity concerns, primarily, the adequacy with which the test items representatively sample the content area to be measured. For example, a comprehensive mathematics achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of mathematics (e.g., algebra). Content validity is primarily an issue for educational tests, certain industrial tests and other tests of content knowledge. Expert judgment (not statistics) is the primary method used to determine whether a test has content validity; nevertheless, the test should have a high correlation with other tests that purport to sample the same content domain. This is different from face validity, which refers to whether a test appears valid to the examinees who take it, the personnel who administer it and other untrained observers. Face validity is not a technical sense of test validity: just because a test looks valid does not mean it is valid.

2.2 Construct validity

A test has construct validity if it demonstrates an association between the test scores and the theoretical trait it is meant to measure; in other words, it accurately measures a theoretical, non-observable construct or trait. Intelligence tests are one example of measurement instruments that should have construct validity. The construct validity of a test is worked out over a period of time on the basis of an accumulation of evidence. There are a number of ways to establish construct validity; two of them, convergent/divergent validation and factor analysis, are discussed below.

2.2a) Convergent/divergent validation

A test has convergent validity if it has a high correlation with another test that measures the same construct. By contrast, a test's divergent validity is demonstrated through a low correlation with a test that measures a different construct. Note that this is the only case in which a low correlation coefficient provides evidence of high validity.
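To make the convergent/divergent logic concrete, the short Python sketch below correlates scores on a hypothetical new anxiety scale with an established anxiety scale (same construct) and with a numerical-ability test (different construct). The variable names and data are invented purely for illustration and do not come from any published measure.

```python
import numpy as np

# Hypothetical scores for ten test-takers (illustrative data only)
new_anxiety_scale   = np.array([12, 18,  9, 22, 15, 30, 11, 25, 17, 20])
established_anxiety = np.array([14, 20, 10, 24, 16, 28, 12, 27, 18, 21])
numerical_ability   = np.array([55, 40, 62, 38, 50, 45, 58, 36, 49, 44])

# Pearson correlations: element [0, 1] of the 2 x 2 correlation matrix
convergent_r = np.corrcoef(new_anxiety_scale, established_anxiety)[0, 1]
divergent_r  = np.corrcoef(new_anxiety_scale, numerical_ability)[0, 1]

print(f"Convergent validity (same construct):     r = {convergent_r:.2f}")  # expected to be high
print(f"Divergent validity (different construct): r = {divergent_r:.2f}")   # expected to be low
```

A high convergent coefficient together with a low divergent coefficient supports the claim that the new scale measures the intended construct.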

2.2b) Factor analysis

Factor analysis is a complex statistical procedure conducted for a variety of purposes, one of which is to investigate variable relationships for complex concepts such as socio-economic status, dietary patterns or psychological scales. It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable underlying factors. The key idea is that multiple observed variables show similar patterns of responses because they are all associated with an underlying latent variable that cannot easily be measured directly. For example, people may respond similarly to questions about income, education and occupation, which are all associated with the latent variable socio-economic status. In the initial extraction there are as many factors as there are variables; each factor captures a certain amount of the overall variance in the observed variables, and the factors are listed in order of how much variation they explain. One of the purposes for which factor analysis is conducted is to assess the construct validity of a test or a number of tests.

2.2c) Internal consistency

If a test has construct validity, scores on the individual test items should correlate highly with the total test score; this is evidence that the test is measuring a single construct. Developmental change provides further evidence: tests measuring certain constructs can be shown to have construct validity if scores on them show predictable developmental changes over time. Experimental intervention can also be used: if a test has construct validity, scores should change following an experimental manipulation in the direction predicted by the theory underlying the construct.
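As a rough illustration of the factor-analytic idea described in 2.2b, the sketch below generates six observed variables from two invented latent traits and then uses scikit-learn's FactorAnalysis to recover two underlying factors. The data, the trait names and the choice of two factors are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_people = 200

# Two invented latent traits, e.g. socio-economic status and verbal ability
ses = rng.normal(size=n_people)
verbal = rng.normal(size=n_people)

# Six observed variables, each driven mainly by one latent trait plus noise
observed = np.column_stack([
    ses + 0.3 * rng.normal(size=n_people),     # income
    ses + 0.3 * rng.normal(size=n_people),     # education
    ses + 0.3 * rng.normal(size=n_people),     # occupation level
    verbal + 0.3 * rng.normal(size=n_people),  # vocabulary score
    verbal + 0.3 * rng.normal(size=n_people),  # comprehension score
    verbal + 0.3 * rng.normal(size=n_people),  # analogies score
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(observed)

# Rows are observed variables, columns are factors; large loadings show which
# variables cluster together on the same underlying factor
print(np.round(fa.components_.T, 2))
```

Variables that load strongly on the same factor (here, the three socio-economic items and the three verbal items) are treated as indicators of a single latent construct.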

2.3 Criterion-related validity

Criterion-related validity is a concern for tests that are designed to predict someone's status on an external criterion measure. A test has criterion-related validity if it is useful for predicting a person's behavior in a specified situation; otherwise stated, the test has demonstrated its effectiveness in predicting a criterion or indicators of a construct. There are two types of criterion validity:

2.3a) Concurrent validity occurs when the criterion measures are obtained at the same time as the test scores. It reflects the accuracy with which a measure can identify an individual's current behavior or skills; that is, the extent to which the test scores accurately estimate an

individual's current state with regard to the criterion. For example, a test that measures levels of depression would be said to have concurrent validity if it measured the levels of depression currently experienced by the test-taker.

2.3b) Predictive validity refers to the accuracy with which a measure can predict the future behavior of an individual. It occurs when the criterion measures are obtained at a time after the test. Examples of tests with predictive validity are career or aptitude tests, which are helpful in determining who is likely to succeed or fail in certain subjects or occupations.

Relationship between reliability and validity

If a test is unreliable, it cannot be valid. For a test to be valid, it must be reliable. However, just because a test is reliable does not mean it will be valid. Reliability is a necessary but not sufficient condition for validity. Let us now discuss reliability.

3. Reliability

A reliable test is one that consistently produces the same results when administered to the same individuals under the same conditions. When we call someone or something reliable, we mean that they are consistent and dependable. Reliability is also an important component of a good psychological test; after all, a test would not be very valuable if it was inconsistent and produced different results every time. How do psychologists define reliability, and what influence does it have on psychological testing? Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly. For example, if a test is designed to measure a trait (such as introversion), then each time the test is administered to a subject, the results should be approximately the same. Similarly, if people weigh themselves several times during the course of a day, they would expect to see similar readings; scales which measured weight differently each time would be of little use. The same analogy applies to a tape measure which measures inches differently each time it is used: it would not be considered reliable. Unfortunately, it is impossible to calculate reliability exactly, but it can be estimated in a number of different ways.

3.1 Test-retest reliability

To gauge test-retest reliability, the test is administered twice at two different points in time; that is, we estimate test-retest reliability when we administer the same test to the same sample on two

different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation, because the closer in time the two observations are, the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval. This kind of reliability is used to assess the consistency of a test across time and assumes that there will be no change in the quality or construct being measured. Test-retest reliability is best used for things that are stable over time, such as intelligence. Generally, reliability will be higher when little time has passed between tests.

3.2 Inter-scorer reliability

This type of reliability is assessed by having two or more independent judges score the test. One way to test inter-scorer reliability is to have each rater assign each test item a score; for example, each scorer might score items on a scale from 1 to 10. Next, you would calculate the correlation between the two sets of ratings to determine the level of inter-rater reliability. Inter-scorer reliability of this kind is appropriate when the measure is continuous. For example, two observers might rate the overall level of activity in a classroom on a 1-to-7 scale at regular time intervals (e.g., every 30 seconds); the correlation between these ratings would give you an estimate of the reliability or consistency between the raters. Another means of testing inter-scorer reliability is to have raters determine which category each observation falls into and then calculate the percentage of agreement between the raters. So, if the scorers agree 8 out of 10 times, the test has an 80% inter-scorer reliability rate.

3.3 Split-half method

The split-half method assesses the internal consistency of a test, such as psychometric tests and questionnaires. It measures the extent to which all parts of the test contribute equally to what is being measured. This is done by comparing the results of one half of a test with the results from the other half. A test can be split in half in several ways, e.g., first half and second half, or odd- and even-numbered items. If the two halves of the test provide similar results, this would suggest that the

test has internal reliability. Another way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both forms to the same sample of people, and the correlation between the two forms is the estimate of reliability. One major problem with this approach is that you have to be able to generate many items that reflect the same construct, which is often not an easy task.
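A minimal sketch of the split-half procedure follows, assuming a small invented matrix of dichotomously scored items (rows are test-takers, columns are items). The odd/even split and the Spearman-Brown step-up formula are standard, but the data are purely illustrative.

```python
import numpy as np

# Invented item scores: 8 test-takers x 6 items (1 = correct, 0 = incorrect)
items = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 1],
])

# Split the test by odd- and even-numbered items and total each half
odd_half  = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two half-test scores
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test
full_length_reliability = (2 * r_half) / (1 + r_half)

print(f"Half-test correlation: {r_half:.2f}")
print(f"Spearman-Brown corrected reliability: {full_length_reliability:.2f}")
```

Because each half contains only half of the items, the raw half-test correlation underestimates the reliability of the full measure; the Spearman-Brown formula corrects for this shortening.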

3.4 Internal consistency reliability

This form of reliability is used to judge the consistency of results across items on the same test. Essentially, you are comparing test items that measure the same construct to determine the test's internal consistency. When you see a question that seems very similar to another test question, it may indicate that the two questions are being used to gauge reliability: because the two questions are similar and designed to measure the same thing, the test-taker should answer both questions the same way, which would indicate that the test has internal consistency. In internal consistency reliability estimation we use a single measurement instrument administered to a group of people on one occasion to estimate reliability. In effect, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results; we are looking at how consistent the results are for different items measuring the same construct within the measure. Internal consistency is the statistic most often used in reporting the reliability of scales. It is measured by Cronbach's alpha, where values closer to 1 indicate higher internal consistency reliability. There is a wide variety of internal consistency measures that can be used. The average inter-item correlation uses all of the items on the instrument that are designed to measure the same construct: we first compute the correlation between each pair of items (for example, with six items there are 15 different item pairings, i.e., 15 correlations), and the average inter-item correlation is simply the mean of all these correlations. An average inter-item correlation of .90 might, for instance, be obtained with individual correlations ranging from .84 to .95. The item-total approach also uses the inter-item correlations but, in addition, computes a total score for the six items and uses that as a seventh variable in the analysis; in such an analysis the six item-to-total correlations might range from .82 to .88, with an average of .85.
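The sketch below computes the two internal-consistency statistics mentioned above, Cronbach's alpha and the average inter-item correlation, for a small invented matrix of Likert-type responses. The alpha formula used, alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score), is the standard one; the data are for illustration only.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def average_inter_item_correlation(scores: np.ndarray) -> float:
    """Mean of the correlations between every pair of items."""
    corr = np.corrcoef(scores, rowvar=False)        # items x items correlation matrix
    pairs = corr[np.triu_indices_from(corr, k=1)]   # unique item pairs only
    return pairs.mean()

# Invented Likert-type responses: 10 respondents x 4 items
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 4, 5],
])

print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
print(f"Average inter-item correlation: {average_inter_item_correlation(responses):.2f}")
```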

4. Factors That Can Influence Reliability

There are a number of different factors that can have an influence on the reliability of a measure. First and perhaps most obviously, it is important that the attribute being measured be fairly stable and consistent; if the measured variable is something that changes regularly, the results of the test will not be consistent. Aspects of the testing situation can also have an effect on reliability. For example, if the test is administered in a room that is extremely hot, respondents might be distracted and unable to complete the test to the best of their ability, which can influence the reliability of the measure. Other things like fatigue, stress, sickness, motivation, poor instructions and environmental distractions can also hurt reliability. Each of the reliability estimators will give a different value for reliability. In general, the test-retest and inter-rater reliability estimates will be lower in value than the parallel-forms and internal consistency estimates, because they involve measuring at different times or with different raters. Since reliability estimates are often used in statistical analyses of quasi-experimental designs (e.g., the analysis of the nonequivalent group design), the fact that different estimates can differ considerably makes the analysis even more complex.

5. Reliability vs. Validity

It is important to note that just because a test has reliability, it does not mean that it has validity. Validity refers to whether or not a test really measures what it claims to measure. Think of reliability as a measure of precision and validity as a measure of accuracy. In some cases, a test might be reliable but not valid. For example, imagine that job applicants are taking a test to determine whether they possess a particular personality trait. While the test might produce consistent results, it might not actually be measuring the trait that it purports to measure.

6. Sources of error that can influence reliability

6.1 Question construction

Researchers construct questions on psychological tests to bring about a response on some mental quality such as depression. If test questions are difficult, confusing or ambiguous, reliability is negatively affected: some people read the question to mean one thing, whereas others read the same question to mean something else. Errors in question construction are systematic errors and can be corrected only through research and redesign of the test.

6.2 Administration errors

Instructions accompanying the test may contain errors that create another type of systematic error. These errors exist either in the instructions provided to the test-taker or in those given to the psychologist conducting the test. Instructions that interfere with accurately gathering information (such as a time limit when the attribute the test measures has nothing to do with speed) reduce the reliability of a test.

6.3 Scoring errors

Reliable tests have an accurate method of scoring and interpreting the results. All tests come with a set of instructions on scoring, and errors in these instructions, such as prompting unsupported conclusions, reduce the reliability of the test. Test construction begins with research to support the conclusions drawn, but if the research has flaws, again a systematic error may result. It is important to stick to the standardized procedure outlined in the testing and scoring manuals to improve the reliability and standardization of the test.

6.4 Environmental factors

Environmental factors, such as an uncomfortable room temperature or distracting sounds, are one form of unsystematic error. The errors made by the psychologist providing the test are another type of environmental factor that can affect reliability; although psychologists are trained in psychological testing, human error is always a possibility. The administrator's attitude towards the test-taker also influences scoring or interpretation when clinical judgment is called for in the test instructions.

6.5 Test-taker factors

We often think that factors related to the test-taker, such as poor sleep or feeling ill, anxious or "stressed out", have a direct effect on reliability. However, such factors are accounted for in the test itself: solid, reliable tests do not claim to give results that are the "true score" of the test-taker, but rather a score that is a combination of the "true score" and an "error score". The true score reflects the test-taker's actual standing, and the error score is a margin of error built into the test for factors such as poor sleep and anxiety. The factoring-in of the "error score" makes it all the more important for other factors that negatively affect reliability to be kept low.

7. Factors affecting the reliability coefficient

Any factor which reduces score variability or increases measurement error will also reduce the reliability coefficient. For example, all other things being equal, short tests are less reliable than long ones, very easy and very difficult tests are less reliable than moderately difficult tests, and tests on which examinees' scores are affected by guessing (e.g., true/false tests) have lowered reliability coefficients.

• Test length. Generally, the longer a test is, the more reliable it is.

• Speed. When a test is a speed test, reliability can be problematic: it is inappropriate to estimate reliability using internal consistency, test-retest or alternate-form methods, because not every student is able to complete all of the items in a speed test. In contrast, a power test is a test in which every student is able to complete all the items.

• Group homogeneity. In general, the more heterogeneous the group of students who take the test, the more reliable the measure will be.

• Item difficulty. When there is little variability among test scores, reliability will be low. Thus, reliability will be low if a test is so easy that every student gets most or all of the items correct, or so difficult that every student gets most or all of the items wrong.

• Objectivity of scoring. Objectively scored tests show higher reliability than subjectively scored tests.

• Time interval. The shorter the time interval between two administrations of a test, the less likely it is that changes will occur and the higher the reliability will be.

• Variation in the testing situation. Errors in the testing situation (for example, students misunderstanding or misreading test directions, noise level, distractions and sickness) can cause test scores to vary.

8. Item Analysis

There are a variety of techniques for performing an item analysis, which is often used, for example, to determine which items will be kept for the final version of a test. Item analysis is used to help build reliability and validity into the test from the start. Item analysis can be both qualitative and quantitative: the former focuses on issues related to the content of the test, for example content validity, while the latter primarily includes measurement of item difficulty and item discrimination.

8a. Item difficulty. An item's difficulty level is usually measured in terms of the percentage of examinees who answer the item correctly. This percentage is referred to as the item difficulty index, or "p".

8b. Item discrimination refers to the degree to which items differentiate among examinees in terms of the characteristic being measured (e.g., between high and low scorers). This can be measured in many ways. One method is to correlate item responses with the total test score; items with the highest correlation with the total score are retained for the final version of the test. This is appropriate when a test measures only one attribute and internal consistency is important.
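A small sketch of the quantitative side of item analysis follows, computing the difficulty index p and an item-total discrimination estimate for each item of an invented dichotomously scored test. The response matrix is an assumption made only for illustration.

```python
import numpy as np

# Invented scored responses: 12 examinees x 5 items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
])

total_scores = responses.sum(axis=1)

# Item difficulty index p: the proportion of examinees answering each item correctly
p = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total test score
item_total_r = np.array([
    np.corrcoef(responses[:, i], total_scores)[0, 1]
    for i in range(responses.shape[1])
])

for i, (diff, disc) in enumerate(zip(p, item_total_r), start=1):
    print(f"Item {i}: p = {diff:.2f}, item-total r = {disc:.2f}")
```

Items with very extreme p values or with low item-total correlations would be candidates for revision or removal.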

9. Establishing norms

Foxcroft and Roodt (2013, p. 40) argue that "a norm is a measure against which the individual's raw score is evaluated so that the individual's position relative to that of the normative sample can be determined". Normal distribution is a statistical term frequently used in psychology and other social sciences to describe how traits are distributed through a population. Often referred to as a "bell curve" (because of its shape), it shows rare occurrences of a trait at both the high and low ends of the curve, with the majority of occurrences appearing in the middle section. The most commonly known example comes from IQ tests, where the majority of the population scores within the "normal" or middle range of intelligence. The raw scores obtained by test-takers on psychological measures have little or no meaning on their own; hence, these raw scores are converted to normalized (standardized) scores through statistical transformation (Foxcroft & Roodt, 2013, p. 40).

10. Types of test norms

There are several ways in which raw scores can be converted into a norm score, which then allows direct comparison of test results. The following types of norm are most commonly used.

10a. Developmental scales are used to measure human characteristics which increase progressively with increases in age and experience.

10b. Mental age scale: the highest age level at which all items of a measure are passed is calculated and called the basal age. A child's mental age combines the basal age plus any additional months of credit earned at higher age levels; chronological age is irrelevant.

10c. Grade equivalents are used for scholastic and educational achievement measures, where a student's performance is translated into a grade value. For example, a student's performance in mathematics might be equivalent to grade 3, spelling equivalent to grade 4 and reading equivalent to grade 5.

10d. Percentiles indicate the percentage of people who fall below a certain raw score on a measure: a score at the 70th percentile means that 70% of the normative population scored below this person, and the 50th percentile is the median. Percentiles differ from percentages, which are raw scores expressed as a percentage; a percentile rank equals the percentage of people falling below a particular score. The main disadvantages are the inequality of scale units and the fact that percentiles are ordinal-level measures, so they cannot be manipulated arithmetically (Foxcroft & Roodt, 2013, p. 41).

10e. Standard scores include z-scores, which indicate the person's deviation from the mean in terms of standard deviations. Positive z-scores indicate above-average performance, whilst negative z-scores indicate below-average performance. Their advantage is that they are interval-level measures and can therefore be manipulated statistically (Foxcroft & Roodt, 2013, p. 41).

10f. Normalized standard scores are standard scores that have been transformed to fit a normal distribution (Foxcroft & Roodt, 2013, p. 42). This is done if it is assumed that the particular attribute is normally distributed. The following normalized standard scores are frequently used; a short worked conversion example appears after the list:

• McCall's T-scale eliminates negative values and provides a standard scale (with a mean of 50 and a standard deviation of 10).

• Stanine scale, which ranges from 1 (low) to 9 (high). The advantages of the Stanine scale are that scale units are equal, they reflect the person's position in relation to the normative sample, performance in rank order is evident, they are comparable across groups, and they allow statistical manipulation.

• Sten scale, which is the same as the Stanine scale except that, where the Stanine scale has 9 scale units, the Sten scale has 10.

• Deviation IQ scale, which is used by most intelligence measures. It is a normalized standard score with a mean of 100 and a standard deviation of 15. This scale is simple to comprehend and interpret and is suitable for ages 18 and above; however, it is not directly comparable with other transformed standard scores.
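As a rough worked example of the score conversions described in 10e and 10f, the sketch below turns invented raw scores into z-scores and then into T-scores (mean 50, standard deviation 10), deviation IQs (mean 100, standard deviation 15) and stanines. The sample data, and the use of simple linear rather than area (normalizing) transformations, are simplifying assumptions made for illustration only.

```python
import numpy as np

# Invented raw scores for a small normative sample
raw = np.array([23, 31, 27, 35, 19, 29, 33, 25, 30, 28])

# z-scores: deviation from the mean in standard-deviation units
z = (raw - raw.mean()) / raw.std(ddof=1)

# Linear transformations of z onto familiar standard-score scales
t_scores     = 50 + 10 * z   # McCall's T: removes negative values
deviation_iq = 100 + 15 * z  # Deviation IQ scale

# Stanines: clip round(2z + 5) to the 1-9 range
stanines = np.clip(np.round(2 * z + 5), 1, 9).astype(int)

for r, zi, t, iq, s in zip(raw, z, t_scores, deviation_iq, stanines):
    print(f"raw {r:2d} -> z {zi:+.2f}, T {t:5.1f}, deviation IQ {iq:5.1f}, stanine {s}")
```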

11. The development of a psychological assessment measure

As we have seen thus far, there are a number of basic principles which need to be covered thoroughly before a psychological assessment can be developed. These basic concepts and procedures provide insight into why developing an assessment measure is a daunting and time-consuming vocation, and why it forms the foundation of how psychologists come to understand a person and their behavior. It is also a process that helps identify not just the weaknesses of a person, but also their strengths. Psychological assessment, also known as psychological testing, is done to help a psychologist better understand an individual and provide valuable insights into the individual's behavior, skills, thoughts and personality. Now that we have a better comprehension of reliability, validity and norms, we can move on to the steps involved in the development of a psychological measure.

12. The development of an assessment measure

When a psychological measure is being formulated, there are several stages through which the measure must pass before the end product is available to consumers. Just like any other product designed for public use, copious amounts of time, money and resources are poured into the process to ensure that an effective and proficient product is developed. The remainder of this discussion scrutinizes the process of developing a psychological assessment measure.

12.1 The planning phase

• Specify the aim of the measure. The test developer needs to state clearly the aim of the measure: what constructs will be measured, what the measure will be used for, what decisions will be made on the basis of the test scores, who the target group is (the intended population, such as infants or students), whether the measure will be paper-based or computer-based, and how the measure will be administered (individually or in groups). An important decision in planning is whether performance will be compared to a criterion or to a group norm (Foxcroft & Roodt, 2013, p. 71).

• Define the content of the measure. The content and purpose of a measure are related: in order to define the content of a measure, one needs a clearly defined purpose for it. The test developer needs to give an operational definition of the construct (the content domain) and must research the main theoretical viewpoints on that construct (Foxcroft & Roodt, 2001). The intention of the measure is vital, as it is the basis for constructing the measure. Items are needed that discriminate between individuals, so as to allow the assessor to distinguish the various 'risk' groups.

• The test plan. Detailed consideration must be given to the kind of item format and to the number of items; the format of the test will vary according to the construct being measured. Common item formats include:

• Open-ended items, which place no limitation on the test-taker's response. These questions are used to obtain the test-taker's point of view on a specific matter, for example, "What are your views on video games?"

• Forced-choice items, such as multiple-choice items (mother, father, legal guardian) and true/false items ("Do you smoke? Yes or no"), which need to be selected with careful consideration, as well as sentence-completion items (Age: …. years … months).

• Performance items, in which apparatus is manipulated by the test-taker or an experiment is performed.

• Essay items, which test the test-taker's logical thinking and organizational ability. Essay items are more difficult to score objectively, but modern technology has enhanced the objectivity of scoring systems.

As mentioned previously, the type of construct being measured will directly influence the item type. The number of items depends on the time available to administer the measure and also on the purpose of the measure. Practical constraints should be considered; for example, time constraints would make essay-type questions unsuitable, in which case it would be preferable for the questionnaire to consist of forced-choice questions. The most common formats are:

• Objective formats, where there is only one correct response (e.g., true/false).

• Subjective formats, where a verbal (interview) or written (essay-type) response is required.

12.2 Item writing

Items are usually written by a team of experts who are guided by the measure's purpose and specifications. The following tips should be considered during item writing:

• The wording must be clear, concise and easily understandable.
• Negative expressions such as 'not' or 'never', and double negatives, should be avoided.
• Each item should have a single central theme.
• Ambiguous items must be avoided.
• The position of the correct answer in multiple-choice measures should be varied.
• True and false statements should be kept similar with regard to quantity and length.
• The content and the purpose of the measure must be related.
• When items are written for children's measures, the material should be bright, colorful and appealing (Foxcroft & Roodt, 2013, p. 75).

Reviewing the items. Once the items have been developed, they should be submitted to a panel of experts for review and evaluation. The reviewers will judge the items on relevance, sufficiency of content, linguistic and gender appropriateness for the target group, wording, and the nature of the stimuli. Based on their findings, certain items may need to be rewritten or discarded. The items can also be administered to a small number of people from the target population to obtain information regarding the difficulty of the items and any vagueness in the test instructions (Foxcroft & Roodt, 2013, p. 75).

12.3 Assembling and pre-testing the experimental version of the measure

• Arranging the items. Items need to be arranged in a logical format, which includes grouping related items together. Once the measure is in a logical format, the length of the measure needs to be reassessed; the time required to read and understand the instructions is critical for test-takers. After consideration, the time allowed will either be increased or decreased, or some items may be discarded. The items also need to be printed in a test booklet in an organized format that promotes easy reading (Foxcroft & Roodt, 2013, p. 76).

• Finalizing the length. The length of the measure should be revisited in light of time constraints; lengthy items should be discarded or rewritten to allow ample time to complete the test.

• Answer protocols. Careful consideration should be given to whether items will be completed in the test booklet or whether a separate answer sheet (protocol) needs to be developed. The answer sheet should be formatted in an easy-flowing style which aids the scoring of the measure.

• Developing administration instructions. Instructions need to be clear and unambiguous, and test administrators need to be thoroughly trained so that the test is administered in the correct way. Poor instructions could lead to poor performance on the test (Foxcroft & Roodt, 2013, p. 76).

• Pre-testing the experimental version of the measure. The measure is now ready to be administered to a large sample of the target population (approximately 400 to 500 people). The pre-test phase should include feedback from test-takers on the level of difficulty, how easily the items are comprehended, the length of the measure, and the flow and style of the items. Test-takers provide both quantitative and qualitative information about the items: qualitative reports cover which items the test-taker found difficult or did not understand, what the test-taker thought of the test materials, the sequence of the items, and the time it took to complete the test.

12.4 Item analysis phase

• Determining item difficulty. The main aim of this phase is to validate the purpose of each item (does each item measure the construct?). The difficulty of an item is expressed as the percentage of individuals who answer it correctly (p); the higher the percentage of correct responses, the easier the item. This supplies a uniform measure of the difficulty of the test item and can often be used to select the final items for a measure. The value of p states the frequency of correct responses but nothing else about the characteristics of an item, and item difficulty pertains only to the specific sample to which the measure was administered; it cannot be generalized to other samples.

• Discriminating power. The discriminating power of an item is determined by a discrimination index (D), which compares the proportion of the top-scoring group who answer the item correctly with the proportion of the bottom-scoring group who do so. A positive D value indicates that the item discriminates between the upper and lower groups, that is, that performance on the item is related to total performance on the measure; an item with a negative D value is a weak discriminator. The degree to which an item discriminates between high and low total scores can also be calculated with an item-total correlation. The discriminatory power of an item can be restricted by its difficulty level (Foxcroft & Roodt, 2013, p. 77). A short illustration of the D index appears after this list.

• Item response theory. An investigation based on item response theory should be conducted in the early stages of developing the measure, especially when the measure is intended for a multicultural, multilingual context.
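A minimal sketch of the upper/lower-group discrimination index described above follows, using the same kind of invented dichotomous response matrix as in section 8. The use of the top and bottom thirds of scorers as the comparison groups is a common convention, assumed here purely for illustration.

```python
import numpy as np

# Invented scored responses: 12 examinees x 4 items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
])

totals = responses.sum(axis=1)
order = np.argsort(totals)

# Upper and lower groups: top and bottom thirds of total scorers
n_group = len(totals) // 3
lower = responses[order[:n_group]]
upper = responses[order[-n_group:]]

# Discrimination index D: proportion correct in the upper group minus
# proportion correct in the lower group, computed per item
D = upper.mean(axis=0) - lower.mean(axis=0)

for i, d in enumerate(D, start=1):
    print(f"Item {i}: D = {d:+.2f}")
```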

12.5 Revise and standardize the final version of the measure

• Revise the items and the test. Items which were identified as problematic during the item-analysis phase need to be resolved, either by discarding or by revising them.

• Select the items for the final version. These items will already have been reviewed by experts on issues such as item difficulty, discrimination and bias; a final check on the reliability and validity coefficients is made to ascertain the acceptability of the test.

• Administer the final version to a large, representative group of individuals for the purposes of establishing validity and reliability (Foxcroft & Roodt, 2013, p. 77).

12.6 Technical evaluation and establishing norms

• Establish validity and reliability. The relevant validity and reliability coefficients are calculated, depending on the nature and purpose of the measure, and an evaluation is made of whether or not there is construct and measurement equivalence across subgroups in the target population.

• Establish norms, performance standards or cut-off scores. If a norm-referenced measure is being developed, appropriate norms need to be established, since an individual's test score has very little meaning when considered on its own.

12.7 Publish and refine continuously

This is the final stage of the development of an assessment measure. There are four sub-stages to be applied.

Compile the test manual. The manual should:

• specify the purpose of the measure;
• specify the intended target group;
• stipulate practical information such as reading and grade level, the time required, and whether or not a practitioner requires training to administer the test;
• provide administration, scoring and interpretation instructions;
• outline the test development process followed;
• report the validity and reliability findings and how they were used;
• indicate whether test bias exists;
• elaborate on when and how the norms were established, along with the normative sample's characteristics (gender, cultural background, socio-economic status);
• state any cut-off scores, which may be applied or disputed by assessment practitioners; and
• indicate how performance on the measure should be interpreted.

Submit the measure for classification. The Psychometrics Committee will determine whether the measure can be classified as a psychological measure.

Publish and market the measure. Marketing material should be concise and accurate; it should state any additional requirements for the administrator and should be worded appropriately for the target group.

Revise and refine continuously. If item content dates quickly, more frequent revisions of the measure will be needed. All new findings should be published and should go through the same process as the original measure, so as to ensure that validity and reliability still hold. There are various methods of validating assessment measures, which can be applied to specific constructs; validity testing comprises investigations of content, criterion-related and construct validity. The end result is a completed measure and test manual, as well as the classification of that measure.

The following part of the discussion addresses the ways in which the development of an assessment measure should be adapted for a multicultural and multilingual context. South Africa is a diverse and multidimensional society, which necessitates adapting measures so that they are suitable for its people. The different challenges affecting multicultural assessment will be discussed and suitable solutions proposed.

13. Adaptation of a psychological assessment measure

A brief clarification of terminology is required:

• Test translation refers to the process of converting a measure from one language into one or more other languages (Foxcroft & Roodt, 2013, p. 84).

• Test adaptation refers to the process of making a measure more applicable to a specific context while using the same language (Foxcroft & Roodt, 2013, p. 84).

Reasons for adapting measures include the following:

• Allowing test-takers to be assessed in a language of their choice increases fairness; bias is reduced and the validity of the results is increased.

• Cost and time savings: it is usually more economical and simpler to translate and adapt an existing measure than to develop a new one. In addition, insufficient resources and technical expertise may hamper the reliability and validity of a newly developed measure.

• Adapted measures make it possible to conduct comparative studies between different language and cultural groups, both nationally and internationally, and thus meet the increased need for various groups and societies to learn from one another.

• Adaptation allows comparison between newly developed measures and the existing norms, interpretations and other information available for established measures.

The following are important considerations when adapting measures:

13.1 Administration

Good communication between the assessment practitioner and the test-taker is critical, as a misunderstanding can compromise the test results. The administrator should:

• be familiar with the language, dialect and culture of the test-takers;
• have competent administrative skills and expertise; and
• present a fair measurement of expertise.

13.2 Item format

Item format refers to the type of questions used in a measure (Foxcroft & Roodt, 2013, p. 85). It should not be assumed that all test-takers are equally familiar with specific item formats; for example, South African students may be more familiar with essay-type questions, whilst US students may be more familiar with multiple-choice questions. It is therefore suggested that a balance of item formats be used in the measure.

13.3 Time limits

Time restrictions can severely hinder test-takers' results; test-takers should therefore be assured that there is ample time for the completion of the measure.

13.4 Equivalence in cross-cultural comparisons

For measures to be equivalent, individuals with the same or similar standing on the construct should obtain similar scores on the different language versions of the items of the measure.

14. Conclusion

Psychological assessment is a powerful tool, but its effectiveness depends upon the skill and knowledge of the person administering and interpreting the test. When used wisely and cautiously, psychological assessment can help a person learn more about themselves and gain valuable insights. When used inappropriately, psychological testing can mislead a person who is making an important life decision or a decision about treatment, possibly causing harm. Psychological assessment is never focused on a single test score or number. Every person has a range of competencies that can be evaluated through a number of methods. A psychologist is there to evaluate the competencies as well as the limitations of the person, and to report on them in an objective but helpful manner. A psychological assessment report will note not only the weaknesses found in testing, but also the individual's strengths.
