However, the size of the web creates an “embarras de richesses”, which poses a challenge in its own right. Principled criteria for choosing what should go into the corpus need to be developed and applied, with sensitivity to the requirements of the project at hand. There are a few candidates generally worth considering (all familiar from the compilation of non-electronic corpora). They include authorship (e.g. institutional versus individual, gender, expert or lay status), time of publication, as well as geographic, cultural, and national origin (bearing in mind, however, that on the web the latter is difficult to identify).

Because text can be downloaded so easily, it can be fed directly into text-analytic software such as concordancing packages (e.g. Wordsmith) or software for qualitative analysis such as NVivo or NUDIST. Yet, in spite of the apparent ease with which web data can be collected, the cutting, pasting and editing involved still amounts to a rather time-consuming process. To deal with this problem, software tools for automated text extraction and subsequent offline analysis have been developed (Fletcher, 2004b), but at this stage their application is likely to be beyond the majority of CDA researchers without expertise in computational linguistics.
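To give a rough sense of what automated text extraction followed by offline analysis involves, the following is a minimal sketch in Python using only the standard library. It is not the tool described by Fletcher (2004b); the names `TextExtractor`, `html_to_text` and `concordance` are hypothetical and chosen for illustration only. The sketch assumes the web page has already been saved as an HTML string, strips markup and scripts, and lists each occurrence of a keyword with a few words of context on either side, as a concordancing package would.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword.

    Tokenisation is deliberately crude (lower-cased runs of word characters);
    a real study would need a tokeniser suited to its language and data.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    page = (
        "<html><head><style>p {color: red}</style></head><body>"
        "<p>The web is large.</p><script>var x = 1;</script>"
        "<p>The web is biased.</p></body></html>"
    )
    for left, kw, right in concordance(html_to_text(page), "web", width=2):
        print(f"{left:>15} [{kw}] {right}")
```

Even a sketch of this size leaves out most of what makes the task laborious in practice, such as removing navigation menus, handling character encodings and recording the metadata needed for the selection criteria discussed earlier.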
(…)
The web is considerably less prejudiced in favour of élites than traditional print media such as newspapers and books. However, from a global standpoint we do have to bear in mind that the technological infrastructure needed still creates a very heavy bias in favour of industrialized nations. Moreover, even within rich societies, there are varying degrees of ‘e-literacy’ (Martin, 2003), creating a ‘digital divide’ that reflects patterns of disadvantage sensitive to social class, age, and gender (Kendall, 1999).