CAMSIS Social Interaction and Stratification scales: Construction Overview

SCALE CONSTRUCTION

The Construction of CAMSIS Measures

On these pages we give some details of the various methods currently being used to construct CAMSIS scales. We give an overview of the logic of the construction method and a discussion of some specific problems. In addition, we provide links to information on the CAMSIS scale construction process which go into considerable practical detail over a number of different software packages. The links are intended primarily for those who may wish to replicate the process.

The overview statement, together with our notes on using the CAMSIS measures, will probably be as much detail as most readers will need. (We also recommend that consideration of the technical issues should be supplemented by a reading of works relating to the theoretical aspects, references to which can be found in the bibliographic review.)

Overview topics:

Introductory notes / Statistical Techniques / Units of analysis / Confounding influences / Aggregation of small groups

Technical details on scale construction:

1) An extended guide to scale construction using SPSS and lEM (including example command files) (first published summer 2002)

2) Scale construction using Stata: Manual and automated scale construction macros (first published Autumn 2009)

3) Examples of scale construction using other software (R) (forthcoming, Sept. 2012)

For further advice on these pages and their topics please contact Paul Lambert. Comments on either the construction techniques or the ease of replication are very welcome.

Introduction

Broadly speaking, the idea behind this method of creating a measure of the stratification order is that members of groups that are socially more similar will tend to be more likely to interact socially than are members of groups that are socially less similar. Differences between groups in the relative frequencies of social interaction can be treated as reflecting the social distances between them. These relationships can be represented by a two-dimensional table, where the rows indicate the range of one partner's jobs, the columns the range of the other partner's jobs, and the cell frequencies the number of occurrences of each combination in the population. This table can be analysed to see if the distances are consistent with location in a social space of a limited number of dimensions. In particular, one would expect to find a major dimension relating to social inequality and stratification.
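As a minimal illustration of the two-dimensional table described above, the following Python sketch counts husband-wife occupational combinations. The couple records and occupation labels are invented for the example and are not taken from any CAMSIS dataset.

```python
from collections import Counter

# Hypothetical couple records: (husband's occupation, wife's occupation).
couples = [
    ("farmer", "agricultural worker"),
    ("farmer", "agricultural worker"),
    ("teacher", "teacher"),
    ("teacher", "nurse"),
    ("doctor", "nurse"),
    ("clerk", "nurse"),
]

def cross_tabulate(pairs):
    """Return (row labels, column labels, matrix of cell frequencies)."""
    counts = Counter(pairs)
    rows = sorted({husband for husband, _ in pairs})  # husbands' occupations
    cols = sorted({wife for _, wife in pairs})        # wives' occupations
    table = [[counts[(h, w)] for w in cols] for h in rows]
    return rows, cols, table

rows, cols, table = cross_tabulate(couples)
```

Each row of `table` is one husband's occupation and each column one wife's occupation; the cell values are the frequencies that the later models attempt to predict.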

The approach used in earlier work was to create a table of 'social distance' by comparing the frequency distributions of all rows and columns, and then attempting to create a 'space' into which all of these 'distances' could be coherently fitted. However, this method has now been superseded by others that put the emphasis on creating a statistical model for the frequency of husband and wife combinations, as a function of scale values created and assigned to the different occupations. Higher scale values estimated for a particular male occupation, for example, would suggest that there is a pattern to the data whereby the cell frequencies are better predicted by estimating that the wives of men holding such an occupation will themselves have higher scale values, and vice versa. Looked at another way, cell frequencies will be higher for combinations where husbands' and wives' scores are similar, lower where they are dissimilar. Differences between scale values are thus an indicator of 'social distance'.

Model estimation can be achieved using (at least) two techniques. In the first, Correspondence Analysis (CA), scores are assigned to the male and female occupations through a series of dimensions, which successively account for all of the variation in the distribution of cell frequencies that cannot be explained by the basic 'independence' model. In the second, Goodman's class of Row-Column Association II models (RC), one or more dimensions of male and female occupational scores are added as explanatory terms in a log-linear model, which together with the row and column 'main effects' can be evaluated for the amount of model 'deviance reduction' achieved.
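For readers who prefer a formula, the RC model is conventionally written in the following general form. The notation is the usual textbook one and is our assumption here, not something defined elsewhere on these pages: F_ij is the expected frequency of couples with husband's occupation i and wife's occupation j, the lambda terms are the overall, row and column 'main effects', mu and nu are the estimated husband and wife occupational scores on dimension m, and phi is the strength of association on that dimension.

```latex
\log F_{ij} \;=\; \lambda \;+\; \lambda^{H}_{i} \;+\; \lambda^{W}_{j}
\;+\; \sum_{m=1}^{M} \phi_{m}\,\mu_{im}\,\nu_{jm}
```

With M = 0 this reduces to the 'independence' model; each additional dimension can then be judged by the reduction in deviance it achieves. The scores are usually identified by centring and scaling constraints.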

The CAMSIS approach uses both modelling techniques. CA models are more widely available in mainstream software packages, and are more familiar to much of the social sciences community. For that reason we find it a useful tool to recommend to other researchers, particularly for preliminary work. RC models, however, have the attraction that they can be much more readily adapted to incorporate a number of the model constraints and evaluations which we discuss immediately below; for that reason, with all other things being equal, we favour the construction of CAMSIS scale scores with RC model techniques.

The core claim of the CAMSIS approach is that patterns of social interaction are intrinsically related to patterns of social stratification. The statistical models generate occupational scores that represent social interaction and, therefore, social stratification. Then, if a parsimonious model can be constructed which creates such scores in a single- or low-dimensional space, we generate an attractive summary scale of the social stratification location of occupations.


Statistical Techniques

The first of the techniques now used in the construction of CAMSIS scales is Correspondence Analysis (CA). This is widely used in social science applications (Greenacre and Blasius 1994; Weller and Romney 1990), and was the method used throughout a preceding ('Family History') project deriving historical CAMSIS versions for the UK. CA has the attraction that it can now be conducted quickly, producing accessible output tables and graphs, using mainstream statistical packages such as SPSS. Essentially, CA works by accounting for any data patterns in a crosstabulation which deviate from the basic independence distribution. CA derives 'dimension scores' for the base unit categories which reflect how well those scores could explain the patterns of deviation. In the CAMSIS project, dimension scores are given to the occupational base units, and a recurrent empirical finding has been that the most influential dimension estimated is one which assigns scores in a structure which we describe as a 'stratification hierarchy'.
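The calculation behind the first CA dimension can be sketched in a few lines of Python. This is a simplified illustration using power iteration on the standardised residuals from independence, not the routine used by SPSS or by the CAMSIS project; the function name and the example table are hypothetical, and the sketch assumes no empty rows or columns and a table that is not exactly independent.

```python
import math

def first_ca_dimension(table, iterations=500):
    """Leading singular value and standard row/column scores of a CA."""
    n = float(sum(sum(row) for row in table))
    r = [sum(row) / n for row in table]        # row masses
    c = [sum(col) / n for col in zip(*table)]  # column masses
    n_rows, n_cols = len(r), len(c)
    # Standardised residuals from the independence model.
    S = [[(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    # Power iteration on S'S to find the leading right singular vector.
    v = [float(j + 1) for j in range(n_cols)]  # arbitrary starting vector
    for _ in range(iterations):
        u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(S[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))   # leading singular value
    u = [x / sigma for x in u]
    # Standard coordinates: weighted mean 0 and weighted variance 1.
    row_scores = [u[i] / math.sqrt(r[i]) for i in range(n_rows)]
    col_scores = [v[j] / math.sqrt(c[j]) for j in range(n_cols)]
    return sigma, row_scores, col_scores

# Hypothetical 3x3 husband-by-wife table with most couples near the diagonal.
example = [[20, 5, 1],
           [5, 20, 5],
           [1, 5, 20]]
sigma, row_scores, col_scores = first_ca_dimension(example)
```

In this invented example the first and third occupations receive scores of opposite sign, with the middle occupation between them, which is the kind of ordering that would be read as a hierarchy. The sign of a CA dimension is arbitrary, so only the relative ordering is meaningful.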

More recently the CAMSIS researchers at Cardiff have favoured using Goodman's class of "RCII" association models (RC) for the scale construction process (also known as log-multiplicative models, and log bi-linear models). These models have been increasingly used in sociological analyses since the production of a series of papers by their early proponents (Goodman 1979, 1985, 1987, 1991; Clogg 1982a, 1982b). Whilst being methodologically closely related to the CA approach (Gilula and Haberman 1988), the advantage of using RCII models lies with their ability to readily incorporate various parameter constraints and estimation statistics which allow for the comparison of alternative but related models; Rytina (2000) gives a demonstration of the ability to compare alternative stratification schema as nested model structures. In the current CAMSIS project we have used the computer programme lEM (Vermunt 1997) to estimate RCII models on the relevant CAMSIS datasets. Again in summary, the RC model works by fitting a log-linear model to the crosstabulation of row and column base units, but adding, to the prediction of the cell counts, information on one or more dimensions of estimated ordered scale values for each occupational unit. In fact RC models have already been widely exploited to investigate all of the fields with which the CAMSIS project is itself associated: for instance, patterns of marital endogamy (Hout 1982; Green 1989); friendship networks (Yamaguchi 1990); inter-generational mobility (Yamaguchi 1983; Hout 1984); and intra-generational mobility (Clogg et al 1990). However, almost all of these and other related approaches have been used to score (low numbers of) aggregated occupational groups (or educational level categorisations in the case of the Yamaguchi papers), whereas the CAMSIS approach maintains that fine divisions in occupational categorisations lead to a much more complete understanding of social stratification.


Units of Analysis

An important claim of the CAMSIS methodology is that its use of detailed occupational base unit information allows a better appreciation of the relative positioning of different units. This means that we have an initial requirement, for any CAMSIS scale construction, that data is provided on occupational title units to a relatively detailed degree (note that in the CAMSIS approach we do not normally use occupational industrial unit categorisations). If at all possible, we would also seek to find information on the employment status location of individuals' occupations, allowing the subsequent CAMSIS scale construction to proceed on the cross-classifications of the two units, our so-called 'title-by-status' base units.

Occupational title information

Most countries have their own scheme of occupational titles where each distinguished unit is numerically represented (although in many examples more than one scheme exists, usually due to the regular revision of title categories over time). An attraction of using these national specific versions for the CAMSIS research is that they are usually tailored to the employment structure of the country: for instance, specific occupations which are unusually common in a given country may be separated out into their own categories, whereas in most other countries they would be combined with another category. Many national specific schemes have also been revised at certain points in time, in order to better cover the changing employment distribution of a particular country (for instance, in the UK the primary scheme has been significantly revised decennially since 1950). These revisions can be presented as desirable, further improving the depth of information recorded for a particular dataset. A subsection of these webpages, on occupational unit details, contains a number of resources for coding and labelling the national specific unit schema of many of the CAMSIS countries, as well as, if available, information on recoding between successive versions over time.

A major asset for cross-nationally comparable research, on the other hand, has been the sustained development of the 'ISCO' occupational title schema, a UN supported classification of occupations which is intended to be operationalisable throughout the world (see the Warwick and ILO websites for useful reviews). Attractions of the ISCO schema (the latest widely used version being 'ISCO-88') include wide familiarity with the categories and occupational grouping schema (such as the 10 major groups), comparability with other research, and compatibility with many other secondary data resources (for instance many of the LIS and LES studies, and the ISSP collections). The CAMSIS page on occupational unit details contains a number of resources relevant to usage of the ISCO classifications. Additionally, another research project with similar intentions to the CAMSIS programme has already made publicly accessible a series of files which translate ISCO values into its favoured representations of occupational class and status positions (see the webpages of Harry Ganzeboom; during the CAMSIS project we have repeatedly used these translations ourselves).

Nevertheless, ISCO occupational schema can be presented as less sensitive to the particular nuances of the specific country when compared to a nationally derived schema. Additionally, a common drawback with the use of ISCO measures is that, unlike with national specific versions, coding to ISCO categories is not normally done interactively for each individual case, but is typically achieved by running an automated recoding procedure from the original data of the national specific schema. The practical consequences of this are that in many examples these procedures include unsatisfactory errors, whilst additionally the full range of possible ISCO categories is often not fully represented (for examples of both, see the notes in the CAMSIS project report for Switzerland).

Thus, when the appropriate data resources are available in the CAMSIS project, we have ideally estimated occupational scale scores, separately, for both the ISCO and national specific title coding schema (see for instance the German and Swiss versions). Alternatively, if our source data has been available in only one base occupational unit, we have usually tried to provide approximated CAMSIS scores for any relevant alternative occupational units, based upon whatever information we have to hand which would allow a translation between the different units. An example of this practice is the derivation of scores for UK SOC90 units, then subsequent release as scores for ISCO88 schema by using a macro which translates most SOC90 units into ISCO88 categories. Again, our own page on occupational information contains a number of resources for obtaining such recodes between different national versions when available.
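The translation of scores between unit schemes can be illustrated with a short Python sketch. It assumes, purely for illustration, that each target category receives the case-weighted mean of the scores of the source units mapped to it; the CAMSIS translation macro may work differently. The unit codes, scores, counts and crosswalk below are all hypothetical.

```python
from collections import defaultdict

def translate_scores(source_scores, source_counts, crosswalk):
    """Approximate target-unit scores as case-weighted means of source scores."""
    totals = defaultdict(float)
    weights = defaultdict(float)
    for unit, score in source_scores.items():
        target = crosswalk.get(unit)
        if target is None:
            continue  # no translation available for this source unit
        weight = source_counts.get(unit, 0)
        totals[target] += weight * score
        weights[target] += weight
    # Only report targets that received at least one case.
    return {t: totals[t] / weights[t] for t in totals if weights[t] > 0}

# Hypothetical source-scheme scores, case counts and crosswalk.
source_scores = {"A1": 60.0, "A2": 50.0, "B1": 30.0, "C9": 45.0}
source_counts = {"A1": 100, "A2": 300, "B1": 50, "C9": 10}
crosswalk = {"A1": "X", "A2": "X", "B1": "Y"}  # "C9" cannot be translated

target_scores = translate_scores(source_scores, source_counts, crosswalk)
```

Source units with no entry in the crosswalk are simply skipped, mirroring the situation in which only most, not all, units can be translated.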

As a rough guide to the desirable levels of precision in the available title unit, the CAMSIS researchers at Cardiff have worked successfully with occupational title schema which have between 100 and 600 title categories. However, they have also found that estimations using title units which have fewer than 100 categories often seem unsatisfactory, primarily because those levels of precision invariably involve combining occupations in some examples when we have good grounds for anticipating that separate scale scores would otherwise be found.

Employment status information

Whenever the data is available, we have utilised information on the differences in 'employment status' between individuals. This data is then cross-classified with the occupational title units, and that cross-classification serves as our 'title-by-status' unit of analysis.
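A 'title-by-status' unit is simply the pairing of an occupational title code with an employment status category. The Python sketch below shows one way of forming and counting such units; the title codes, status labels and field names are invented for the example.

```python
from collections import Counter

# Hypothetical individual records with a title code and an employment status.
people = [
    {"title": "6111", "status": "self-employed"},
    {"title": "6111", "status": "employee"},
    {"title": "6111", "status": "self-employed"},
    {"title": "2320", "status": "employee"},
]

def base_unit(person):
    """Cross-classify occupational title by employment status."""
    return (person["title"], person["status"])

unit_counts = Counter(base_unit(p) for p in people)
```

Here the same title code yields two distinct base units, one for the self-employed and one for employees, so the two groups can receive separate scale scores.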

Several efforts have been made to harmonise employment status records for internationally comparative research (see especially the ISCE scheme, and also the CASMIN project; a project coordinated by Erik Wright offers a different perspective on the important components for distinguishing within job titles). Most of these schema, however, offer a relatively large number of categories, which could be cumbersome if cross-classified by employment title units, and which moreover use a level of detail which may not always be available to other analysts. Thus, whilst we have based our preferred employment status categories loosely upon the ISCE scheme, we have generally worked by looking at whatever categories, for each CAMSIS version analysed, seem most empirically relevant to the country / period under study, and most readily reproducible from other data resources. In practice, we usually have a two-, three-, four- or five-category representation of employment status, and in fact we deliberately do not have examples with many more categories.

By the point of release of the CAMSIS scores, however, we have paid more attention to the relations between the chosen national specific schema, and the standardised ISCE schema. We have also tried to compute average scores which will apply to external data resources which have less information to differentiate employment status categorisations than was used in the derivation of the CAMSIS versions. Further details on this procedure for the release of CAMSIS scores can be found on a page describing the ISCE-based standardisation of CAMSIS employment status categories.


Confounding influences

In theory, as soon as a satisfactory two-dimensional table of male-female occupational interactions is created, it is possible to estimate models for social association which generate CAMSIS-style scales. However, the degree of social interaction can be affected by factors other than occupational units' incumbents' locations in a hierarchical social ordering. The CAMSIS modelling procedures are adapted in two ways to take account of two such confounding influences.

Diagonals and pseudo-diagonals

A recurrent pattern across a range of countries and datasets is that certain distinctive husband-wife occupational combinations are so unusually common that they initially dominate any derived occupational scaling. The large majority of such combinations can however be given a clear theoretical interpretation, as combinations which are the product, not of generalised patterns of social interaction per se, but of other identifiable factors that encourage the particular pattern, such as joint business ventures or institutional links. Some of these are diagonal cells in the table: a husband and wife who are both self-employed in the same occupation - both 'farmers', for example. However, others are not strictly diagonal, but do have comparable characteristics. We term such distinctive occupational combinations, including the true diagonals, 'pseudo-diagonals', because they refer to occupations which are closely related through a number of components. For instance, some of the most prominent such examples, consistent over a number of countries, are husband 'farmers' married to wife 'agricultural workers'; husband 'publicans' married to wife 'barmaids'; and husband 'shopkeepers' married to wife 'shop assistants'. Each of these can be interpreted as the result of joint business ventures. Equally, common examples of more 'institutional' combinations are husband 'doctors' married to wife 'nurses', or husband 'teachers - one specialism' married to wife 'teachers - another specialism'. In the latter cases, the combinations are occurring unusually often for the relatively trivial reason that the institutional locations of health service workers, or teachers, create more opportunities for social interaction.

The important point is that we want to prevent the relatively trivial prevalence of such diagonals and pseudo-diagonals from influencing the more general model predicting social association as a function of the generalised structure of occupational locations (interpreted as that of social stratification). We can do this in the models discussed by accounting explicitly for each such combination. The methods to achieve this differ for the two model frameworks. In the CA framework, we simply delete all the cases in each diagonal or pseudo-diagonal husband-wife combination from the analysis. In the RC framework, we add model parameters for each of those specific husband-wife combinations.
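The CA-style deletion of diagonal and pseudo-diagonal cases amounts to emptying the corresponding cells of the table before analysis. The following Python sketch illustrates this under the assumption that the combinations to exclude are supplied as an explicit list of (husband, wife) pairs; the function name, labels and counts are hypothetical.

```python
def drop_pseudo_diagonals(rows, cols, table, excluded_pairs):
    """Return a copy of the table with the listed husband-wife cells set to 0."""
    excluded = set(excluded_pairs)
    return [[0 if (husband, wife) in excluded else count
             for wife, count in zip(cols, row)]
            for husband, row in zip(rows, table)]

# Hypothetical table: rows are husbands' occupations, columns wives'.
rows = ["doctor", "farmer", "publican"]
cols = ["agricultural worker", "barmaid", "farmer", "nurse"]
table = [
    [1, 0, 0, 30],   # doctor
    [40, 1, 25, 5],  # farmer
    [0, 35, 0, 2],   # publican
]
excluded_pairs = [
    ("farmer", "farmer"),               # true diagonal
    ("farmer", "agricultural worker"),  # joint business venture
    ("publican", "barmaid"),            # joint business venture
    ("doctor", "nurse"),                # institutional link
]

cleaned = drop_pseudo_diagonals(rows, cols, table, excluded_pairs)
```

The original table is left unchanged, which makes it easy to compare, for each occupation, the proportion of cases that has been removed; as noted below, that proportion is occasionally high enough to matter substantively.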

In general, we would argue that treatment in this way of these diagonal and pseudo-diagonal cases has no negative impact on the interpretation of the CAMSIS scales as a generalised ordering of occupations. However, there are some instances where the proportion of cases representing an occupational unit that is treated in this way is unusually high, and there are then substantive implications from the strategy we use. Notes in the page describing the use of the CAMSIS versions, and occasional notes in the version-specific pages and the downloadable version-specific archives, available via the general versions page, discuss these issues in greater detail.

Additional dimensions

Aside from identifying and modelling specific husband-wife combinations, we also find that in some cases it is useful to estimate entire 'subsidiary dimensions' of occupational scores, which reflect structures of social association that are predictable for institutional or economic reasons, but are not part of the primary association that reflects generalised patterns of occupational social association / stratification. Such dimensions cover the whole range of occupational units, but within a limited categorical structure: their theoretical validity comes from imposing an a priori constraint upon them which is associated with occupational sub-structures. Typical examples of such structures are the status divisions within title-by-status units, or the specification of industrial sectors or major groups which cluster the occupational title units.

In most cases, the inclusion of such subsidiary dimensions makes virtually no difference to the primary dimension scores, but in a few instances the use of such a priori constrained subsidiary dimensions proves an important way of separating out the influence of factors which otherwise interact with and distort the primary dimensional structures.

It is important to note, however, that this use of subsidiary dimensions in RC modelling is not equivalent to the estimation of lower dimensions of occupational association which is produced in most Correspondence Analysis techniques. In CA examples, secondary and subsidiary unconstrained orderings of occupations are generated, and in the CAMSIS analyses we do not normally give these empirically minor dimensions any subsequent interpretation. By contrast, in our use of subsidiary dimensions (usually within the RC framework), we estimate an a priori dimensional structure and test for its explanatory power, retaining it in the final model if it seems an efficient way of partitioning secondary structural patterns of social association from the primary pattern which we interpret as stratification.


Aggregation of small groups

A fundamental principle of CAMSIS is to maintain the maximum degree of differentiation between occupational categories. Therefore, we avoid combining occupational units wherever possible, even if evidence suggests that some categories are highly similar in their relevant properties. However, survey samples or census sub-samples typically involve a number of occupational groups where there are few cases. When constructing a scale, there is a potential problem that these cases may be unrepresentative. This means that it is desirable to merge them with other groups to which they are thought to be similar.

The criteria used in deciding about mergers are discussed later. Although the application of these criteria is motivated by convenience, their use does have some theoretical legitimacy. Goodman (1981) suggests criteria of 'homogeneity' and 'structural similarity' when deciding on the suitability of combining categories in tabular analyses. It is difficult to apply his homogeneity criterion of testing every possible table sub-categorisation, because of the sheer number of relevant categories (and because the results of the statistical tests would be inconclusive for the sparse cells anyway). However, a variation of his structural criterion was applied, in that categories were combined only if other row and column derived scores on the separate categories were approximately equivalent.

We should note here two potential implications of our data-merging strategy. First, because separate merges are conducted on the male and female occupational unit distributions, it is no longer the case that equivalent categories between genders in the revised schema necessarily mean equivalent categories between genders in the original data. This in turn makes the idea of gender equal models (where scores for the same occupational units are constrained to be equivalent between genders) much more tenuous (although in the vast majority of cases the original occupational units contributing to a revised unit remain equivalent between genders).

Second, the subjective element involved both in identifying 'similar' occupational groups and in choosing substantive criteria for merging units, means that the revisions discussed will inevitably represent the input of the researcher(s). This might seem to bring into question the reliability of the scale construction methods and the possibility of their replication. In practice, however, these issues do not lead to serious shortcomings. First, a degree of replication is readily achievable simply by making available the information on which recodings were originally done (for instance through supplying the command file used to recode occupational units). Next, we could argue that the explicit, limited subjective impact of researchers in such a project is not necessarily a bad thing, especially when, as in the example of the CAMSIS project, we are dealing with scales which could in principle be repeatedly constructed over time and place. The alternative practice, a set of fixed rules on unit revisions imposed by the first set of scale developers, would be much less sensitive to such particular variations*. Most importantly, however, the occupational units concerned are, by definition, those which are relatively rare in the population, so the alterations made at the 'data revision' stage do not have an enormous impact upon the nature of the CAMSIS scores subsequently derived for the bulk of the population.

*(On a point of interest, it is quite possible to fully automate the recoding of occupational units, thus removing any subjective researcher input. All that is required is an algorithm which examines the number of cases in a given unit, and recodes that unit to, for instance, its minor group 'residual' category, if the number of cases is below a specified threshold. Then, provision is needed for looping back through this process to code from, say, minor group to sub-major group 'residual' categories, and so on, until no unit has a group size below the desired thresholds. The CAMSIS researchers in Cardiff produced a programme to run such an algorithm in SPSS and tested it on a British data example, but, for the reasons mentioned above, the alternative approach was preferred.)
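The automated recoding described in this note can be sketched in Python as follows. This is an illustrative reconstruction, not the SPSS programme mentioned above. It assumes four-digit hierarchical codes in which a 'residual' category is formed by replacing the last non-zero digit with a zero (unit group to minor group, to sub-major group, to major group); the codes, counts and threshold are hypothetical.

```python
from collections import Counter

def coarsen(code):
    """Move a code one level up the hierarchy, or return None at the top level."""
    digits = list(code)
    for pos in range(len(digits) - 1, 0, -1):  # never alter the first digit
        if digits[pos] != "0":
            digits[pos] = "0"
            return "".join(digits)
    return None

def aggregate_small_units(codes, threshold):
    """Repeatedly recode units with fewer than `threshold` cases to a coarser level."""
    current = list(codes)
    while True:
        counts = Counter(current)
        small = {code for code, n in counts.items()
                 if n < threshold and coarsen(code) is not None}
        if not small:
            return current  # nothing left that can be coarsened
        current = [coarsen(code) if code in small else code for code in current]

# Hypothetical individual-level occupation codes.
codes = ["6111"] * 40 + ["6112"] * 3 + ["6121"] * 2 + ["7111"] * 50
recoded = aggregate_small_units(codes, threshold=5)
recoded_counts = Counter(recoded)
```

In this example the two sparse units are first moved to their separate minor group residuals and then, still being too small, merged into a shared sub-major group residual. Note that a unit already at the major group level stays as it is even if it remains below the threshold, so the sketch stops when no further coarsening is possible rather than guaranteeing that every unit reaches the threshold.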