The Widmer-Franaszek 8b/10b code is described in its inventors' seminal paper 'A DC-Balanced, Partitioned-Block, 8B/10B Transmission Code'. This code is used in gigabit ethernet and other high-speed communications links.
Note that Widmer and Franaszek, being electonic engineers, use the convention that bytes are written least significant bit first, since this is the order used in transmission lines; i follow them in this, but avoid writing anything in binary when it's not actually going to be transmitted on the wire.
The input to the code is an arbitrary sequence of symbols drawn from an alphabet consisting of 256 data symbols (the bytes 00-ff) and twelve control symbols (which i will call A-H and W-Z). The output is a sequence of bits with three properties useful for transmission over an electrically constrained link: there are never any runs of 0s or 1s longer than 5 (making a signal-derived clock possible, eliminating the need for an out-of-band clock); the disparity (excess of 0s over 1s or vice versa) between any two points is never greater than 6 (making the signal DC balanced, eliminating the need for a separate return current path); and, provided that the appropriate control symbols are used in the input, the grouping of the bits into codewords can be determined (eliminating the need for separate byte synchronisation). In addition, the code provides a measure of robustness in that its coding rules will be violated by many transmission errors, allowing their detection, and restricting those that do go undetected to corrupting no more than five bits in the output symbol stream.
The code works by breaking each incoming symbol into two subsymbols, one major and one minor, encoding each subsymbol into a subcode, balancing them, and transmitting them, in major then minor order.
Balancing is a step which minimises running disparity: subcodes are either inherently balanced (having disparity 0) or unbalanced (having disparity -2); however, the code allows subcodes to be complemented, producing subcodes with the same meaning but disparity +2. The rule for balancing is that an unbalanced subcode is complemented if the last unbalanced subcode was not, and vice versa.
For data bytes, the subsymbols are derived by taking the least significant five bits as the major symbol, and the most significant three as the minor. The codes are then as follow.
For major subsymbols:
Subsymbol Subcode Balanced? 0 011000 N 1 100010 N 2 010010 Y 3 110001 Y 4 001010 N 5 101001 Y 6 011001 Y 7 000111 Y * 8 000110 N 9 100101 Y 10 010101 Y 11 110100 Y 12 001101 Y 13 101100 Y 14 011100 Y 15 101000 N 16 100100 N 17 100011 Y 18 010011 Y 19 110010 Y 20 001011 Y 21 101010 Y 22 011010 Y 23 000101 N 24 001100 N 25 100110 Y 26 010110 Y 27 001001 N 28 001110 N 29 010001 N 30 100001 N 31 010100 N
Note that subsymbol 7 is handled specially: despite its subcode being balanced, it participates in balancing, being complemented if the last unbalanced subcode was not complemented and vice versa; however, since it itself is balanced, it does not affect the complementation of the next unbalanced subcode.
For minor subsymbols:
Subsymbol Subcode Balanced? 0 0100 N 1 1001 Y 2 0101 Y 3 0011 Y * 4 0010 N 5 1010 Y 6 0110 Y 7 0001/1000 N *
Note that subsymbols 3 and 7 are handled specially. Subsymbol 3 participates in balancing, exactly like major subsymbol 7. Subsymbol 7 is more complicated: it has a primary subcode (0001) and an alternate (1000), the latter being used whenever the last two bits of the preceding major subcode are the same as the initial bits of the primary subcode (after allowing for complementation of both).
Control symbols are slightly more complicated. Firstly, each symbol is broken down into subsymbols as follows:
Symbol Major Subsymbol Minor Subsymbol A K 0 B K 1 C K 2 D K 3 E K 4 F K 5 G K 6 H K 7 W 23 7 X 27 7 Y 29 7 Z 30 7
Where the major subsymbol is a number, this is one of the normal data subsymbols described above; where it is K, it is encoded as 110000 (or its complement, 001111). The minor subsymbols are also encoded as in data, but with a quirk: all of them take part in balancing, and in the case of subsymbol 7, the alternate form is always used.
Essentially, there are two separate control symbol coding schemes at work here: one based on the use of the magic major subsymbol K, followed by a normal data minor subsymbol (albeit encoded strangely, necessary because of the long run of bits in the K subcode), and one based on the use of the alternate form of minor subsymbol 7 when it would not otherwise be expected. These may be referred to as K-type and 7-type control symbols, respectively.
Some of the control symbols have interesting properties.
B, F and H are what is known as 'commas': they have bit patterns which cannot occur as substrings of any encoded stream; this means that they can be used to infer the phase of a stream - that is, the position of the code boundaries in what is otherwise a featuresless flow of bits. In particular, their distinctive feature is a run of two or more of one bit, followed by a run of five of the other; the code boundary is found two bits before the start of the run of five. H is, however a flawed comma: if followed by another K-type control symbol, a second, out-of-phase, comma feature appears in the stream. It is possible to reject the false comma by noting that it overlaps the end of the preceding true comma, but this fails in the presence of errors corrupting the true comma. Furthermore, a sequence of multiple H's in tandem leads to alternating runs of five zeroes and five ones in the output stream; this is not very good for clock derivation; Widmer and Franaszek go so far as to forbid adjacent H's in the input stream. On the other hand, the bit pattern representing H has the unique property that it cannot be generated by any single-bit error in validly encoded data symbols. H is therefore a good comma if used in isolation, but rather weak if followed by another K-type control symbol.
(NB: i'm pretty sure the only way you can get false commas from H is by following it with another K-type control symbol, but not absolutely certain - it may be that you can get it with a data symbol of some sort, in which case H is a very weak comma)
Other interesting control symbols are W, X and Y, any of which yield an encoded bitstream with 60% transition density (60 transitions per 100 bits), an excellent input for clock derivation, and so a good idle or preamble sequence. F is also a good idle - a continuous stream will give a 50% transition density, as well as supplying a stream of commas.
Personally, i would suggest using H as a frame separator, and F as an idle/preamble symbol - although note that you shouldn't follow an H directly with an F. If you need anything else, start with W-Z, since you can mix these freely with H.
Finally, Widmer and Franaszek have a notation for symbols which reflects their decomposition into subsymbols, and which is often used when discussing this code; they do not use my single-letter notation for control symbols. In this notation, all symbols are represented in the form X.N.M, where X is 'D' for data symbols and 'K' for control symbols, N is the value of the major subsymbol, and M is the value of the minor subsymbol. The special major subsymbol K is simply called 28 (giving representations of the form K.28.M). The primary and alternate forms of minor subsymbol 7 are called P7 and A7 respectively, where needed (the authors don't seem to use them much). Major subsymbols are written X.N, and minor subsymbols X.x.M (where x is a literal 'x' - although for some reason they use 'y' with A7).
How about an example? Let's encode a frame consisting of the string 'Hello, world!' preceded by an H control, and followed by an F control. The symbols break down as:
Char Symb Subsymb WF Raw Compl? Final Maj Min Maj Min Maj Min Maj Min <H> H K 7 K.28.A7 110000 1000 N Y 110000 0111 H 48 8 2 D.8.2 000110 0101 N - 000110 0101 e 65 5 3 D.5.3 101001 0011 - y 101001 1100 l 6c 12 3 D.12.3 001101 0011 - y 001101 1100 l 6c 12 3 D.12.3 001101 0011 - y 001101 1100 o 6f 15 3 D.15.3 101000 0011 Y n 010111 0011 , 2c 12 1 D.12.1 001101 1001 - - 001101 1001 SP 20 0 1 D.0.1 011000 1001 N - 011000 1001 w 77 23 3 D.23.3 000101 0011 Y n 111010 0011 o 6f 15 3 D.15.3 101000 0011 N y 101000 1100 r 72 18 3 D.18.3 010011 0011 - y 010011 1100 l 6c 12 3 D.12.3 001101 0011 - y 001101 1100 d 64 4 3 D.4.3 001010 0011 Y n 100101 0011 ! 21 1 1 D.1.1 100010 1001 N - 100010 1001 <F> F K 5 K.28.5 110000 1010 Y N 001111 1010
The end result is the bitstring: 1100000111 0001100101 1010011100 0011011100 0011011100 0101110011 0011011001 0110001001 1110100011 1010001100 0100111100 0011011100 1001010011 1000101001 0011111010. I think. I did that by hand.
Oh, and really, you should scramble your data symbols before putting them into the code - it makes the spectral properties of the signal better. I have no idea what that means but it sounds good.