Extension:CodeTidy

From Organic Design wiki
This code is in our Git repository here.

Note: If there is no information in this page about this code and it's a MediaWiki extension, there may be something at mediawiki.org.

I've been having a lot of trouble finding a decent free formatter for PHP, especially one that matches MediaWiki's conventions. There are many online and command-line tools, but none of the ones I tried did the job properly: the results were still full of bad indenting with mixed tabs and spaces, and they never got the braces the way you want them!

So I've had a go at writing a simple one. Mine just uses regular expressions, no fancy compiler magic, so it's possible that it could break things. It's still very beta and is being tested on various messy code. There is an online interface in the form of a Special page, which is very basic at the moment but will improve over time.

It's hardwired for MediaWiki's conventions; I may add options later so that it can conform to other conventions as well.

The extension code is linked at the top of the page, but it relies on the tidy.php script in our tools repository. That script defines a class that can be used by extensions or other PHP programs. It can also be run from the command line, taking the messy code's filename as its only parameter and writing the formatted result to STDOUT, so that it can be copied to the clipboard or redirected to a new file.

Example command-line usage:

php tidy.php ugly.php > nice.php

How it works

Most of these tidying programs use the PHP parser tools, so a lot of the work is already done and there's no way they can break the syntax or change the logical meaning of the input file. But none of the programs I could find actually did the tidying job very well, so I decided to write a simple one myself that just used regular expressions, since I'm not familiar with the parsing tools.

It worked quite well, but a couple of serious things would confuse it and lead to broken syntax in the output. To resolve them I had to add some more complicated parsing (my own parsing, still not using the official parsing tools), which made it a bit less simple. The basic process involves the following three stages:

  1. Preprocess (reduces the input down to a simple common format with no indenting or trailing whitespace and no comment or string data)
  2. Tidy (tidies each PHP code section, except for sections that have their delimiters and code on the same line, as these are assumed to be part of an HTML template layout)
  3. Postprocess (puts all the string and comment data back)
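As a rough illustration of how stage 2 might pick out the sections to tidy, here is a minimal sketch. It is not the actual tidy.php code: the tidySections name and its callback are made up for the example, and it assumes that "delimiters and code on the same line" simply means there is no newline between the opening and closing delimiters.

```php
<?php
// Sketch only: tidySections() is a hypothetical helper, and the "inline section"
// test (no newline between the delimiters) is an assumption.
function tidySections( $text, callable $tidy ) {
	return preg_replace_callback(
		'/(<\?php)(.*?)(\?>|\z)/s',
		function ( $m ) use ( $tidy ) {
			// Inline sections are assumed to be template layout: leave them alone
			if ( strpos( $m[2], "\n" ) === false ) {
				return $m[0];
			}
			// Otherwise tidy the code between the delimiters
			return $m[1] . $tidy( $m[2] ) . $m[3];
		},
		$text
	);
}

// Usage: strtoupper stands in for the real tidying step
$page = "<p><?php echo \$a; ?></p>\n<?php\nfoo();\n?>";
$result = tidySections( $page, 'strtoupper' );
```

Only the second section is passed to the callback here; the inline one inside the paragraph tags is returned untouched.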

The first two require further description.

CodeTidy::preprocess

The preprocessing phase first makes all the newlines uniform UNIX style (\n), strips the indenting and trailing whitespace from each line, then preserves the escaped backslashes and quotes. Preserving means replacing each item with a unique string that won't interfere with any following processes, and storing the original data so that each unique string can be restored to its former state after everything's tidied.
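The preserve-and-restore idea can be sketched roughly as follows. This is only an illustration under assumptions, not the real implementation: the Preserver class, its method names and the token format are all invented for the example.

```php
<?php
// Hypothetical helper class, not the actual CodeTidy code.
class Preserver {
	private $store = [];
	private $uniq;

	public function __construct() {
		// A marker that is unlikely to occur in real source code (assumed format)
		$this->uniq = 'P' . uniqid();
	}

	// Swap $text for a unique token and remember the original
	public function preserve( $text ) {
		$this->store[] = $text;
		return $this->uniq . 'x' . ( count( $this->store ) - 1 ) . 'x';
	}

	// Put every stored item back in place of its token
	public function restore( $code ) {
		return preg_replace_callback(
			'/' . $this->uniq . 'x(\d+)x/',
			function ( $m ) {
				return $this->store[(int)$m[1]];
			},
			$code
		);
	}
}

// Usage: hide escaped characters such as \' before any other processing
$p = new Preserver();
$src = "echo 'It\\'s';";
$masked = preg_replace_callback( '/\\\\./s', function ( $m ) use ( $p ) {
	return $p->preserve( $m[0] );
}, $src );
$restored = $p->restore( $masked );
```

After masking, the code contains no backslashes at all, so later steps can look for quotes without worrying about escaped ones. Restoring gives back the original text.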

The comment and string content is then preserved. Originally this was just done with regular expressions, but this was soon found to be inadequate because if comments contain quotes, it would cause trouble for the quote preservation - or if you preserve comments first to try and avoid that problem, then quotes that contain comment syntax would cause problems instead. Here's an example.

// Either this single quote's a problem,
'or this url is a problem: http://since it contains a double-slash'

As you can see, a simple regular expression for comments and another for quotes isn't really practical, especially when you consider that there are two kinds of comments and four kinds of strings. So instead a parsing loop is used, which goes through the code character by character maintaining a $state variable that indicates whether the current location is within a single-line or multi-line comment, a single, double or backtick quote, or a Heredoc-style string. It then preserves each item when the state changes.
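A cut-down version of such a parsing loop might look like the sketch below. It is not the code from tidy.php: the maskLiterals name and the numbered token format are assumptions, and Heredoc strings and "#" comments are left out to keep it short.

```php
<?php
// Sketch only: maskLiterals() and its token format are hypothetical,
// and Heredoc strings and "#" comments are not handled here.
function maskLiterals( $code, array &$store ) {
	$out = '';
	$item = '';
	$state = 'code';
	$len = strlen( $code );
	for ( $i = 0; $i < $len; $i++ ) {
		$c = $code[$i];
		$next = $i + 1 < $len ? $code[$i + 1] : '';
		$done = false;
		if ( $state === 'code' ) {
			if ( $c === '/' && ( $next === '/' || $next === '*' ) ) {
				// Start of a single-line or multi-line comment
				$state = $next === '/' ? 'line' : 'block';
				$item = $c . $next;
				$i++;
			} elseif ( $c === "'" || $c === '"' || $c === '`' ) {
				// Start of a string; the state remembers which quote closes it
				$state = $c;
				$item = $c;
			} else {
				$out .= $c;
			}
		} elseif ( $state === 'line' ) {
			if ( $c === "\n" ) {
				$done = true;
				$i--; // leave the newline to be copied as normal code
			} else {
				$item .= $c;
			}
		} elseif ( $state === 'block' ) {
			$item .= $c;
			$done = strlen( $item ) >= 4 && substr( $item, -2 ) === '*/';
		} elseif ( $c === '\\' ) {
			// Escaped character inside a string: keep both characters together
			$item .= $c . $next;
			$i++;
		} else {
			$item .= $c;
			$done = $c === $state;
		}
		if ( $done || ( $state !== 'code' && $i >= $len - 1 ) ) {
			// State change (or end of input): store the item, leave a token behind
			$store[] = $item;
			$out .= "\x01" . ( count( $store ) - 1 ) . "\x01";
			$state = 'code';
		}
	}
	return $out;
}

// Usage with the two problem lines from the example above
$store = [];
$src = "// Either this single quote's a problem,\n"
	. "'or this url is a problem: http://since it contains a double-slash'";
$masked = maskLiterals( $src, $store );
```

Because the loop always knows which state it is in, the quote inside the comment and the double-slash inside the string are both treated as plain content, and $store ends up holding exactly one comment and one string.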

CodeTidy::tidySection

The method that tidies each section is broken into three stages to handle statements, indenting and operators separately.

CodeTidy::statements

This first changes all "else if" to "elseif", and then runs a method called fixNakedStatements for all the for, foreach, if and switch statements (i.e. all the ones that can have their own sub-statements). These sub-statements should be in braces even if there's only one; the ones that aren't surrounded by braces are called "naked". After all naked statements have been braced, all statements are put on their own line.

The fixNakedStatements method is another one that became rather complex and required character-by-character parsing, because the bracket and brace structures can be nested. It must also handle the "chain statements" that can be formed from naked if-elseif-else sequences. To keep the method simpler, it's called for each statement in reverse order, starting with the last one, so that the inner-most statements are processed first, which avoids an extra level of recursive complexity.

Here's an example nasty statement composed of a chain of naked statements:

for( a; b; c; ) if(d) e(w); else if (f) g(x); else if (h) i(y); else j(z);

The fixNakedStatements method is called twice here: first for the if, and then for the for. It isn't called separately for the else or elseif parts, because the "else if" have already been changed to "elseif" and these are handled as part of the if chain.

So first the if-elseif-else chain is braced properly, and then, because the chain as a whole is itself a naked statement (the body of the for), it's completely wrapped in braces. The returned fixed if statement is this:

{if (d){e(w);}elseif (f){g(x);}elseif (h){i(y);}else {j(z);}}

So when it's called again for the for statement, it finds that the statement isn't naked and no bracing needs to be done. However, it does preserve the C-style loop syntax within the brackets, because the semicolons there would confuse the next process, which puts all statements on their own line. So we end up with:

for (XXX) {if (d){e(w);}elseif (f){g(x);}elseif (h){i(y);}else {j(z);}}

where XXX would be the unique string representing the preserved C-style loop syntax.
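To give a feel for the character-by-character bracket walking involved, here is a much simplified sketch of bracing a single naked statement. It is not the real fixNakedStatements: braceNaked is a made-up helper that only handles the first occurrence of the keyword, ignores else-chains and nested naked statements, and assumes strings and comments have already been preserved.

```php
<?php
// Simplified sketch, not the real fixNakedStatements: braceNaked() is hypothetical
// and handles one simple statement whose body ends at the next semicolon.
function braceNaked( $code, $keyword ) {
	$start = strpos( $code, $keyword );
	$open = $start === false ? false : strpos( $code, '(', $start );
	if ( $open === false ) {
		return $code;
	}
	// Walk the brackets to find the ")" that closes the statement's condition
	$len = strlen( $code );
	$depth = 0;
	for ( $i = $open; $i < $len; $i++ ) {
		if ( $code[$i] === '(' ) {
			$depth++;
		} elseif ( $code[$i] === ')' && --$depth === 0 ) {
			break;
		}
	}
	// Skip any whitespace between the condition and the sub-statement
	$body = $i + 1;
	while ( $body < $len && ctype_space( $code[$body] ) ) {
		$body++;
	}
	if ( $body >= $len || $code[$body] === '{' ) {
		return $code; // already braced (or nothing follows), so nothing to do
	}
	// The naked sub-statement runs to the next semicolon outside any brackets
	$depth = 0;
	for ( $end = $body; $end < $len; $end++ ) {
		if ( $code[$end] === '(' ) {
			$depth++;
		} elseif ( $code[$end] === ')' ) {
			$depth--;
		} elseif ( $code[$end] === ';' && $depth === 0 ) {
			break;
		}
	}
	return substr( $code, 0, $body )
		. '{' . substr( $code, $body, $end - $body + 1 ) . '}'
		. substr( $code, $end + 1 );
}

// Usage
$fixed = braceNaked( 'if(d) e(w);', 'if' );
```

Counting bracket depth is what stops the semicolon search from ending too early, and it is also why the C-style for syntax has to be treated specially, since its semicolons sit inside the brackets.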

CodeTidy::indent

Next the indenting phase converts our code to what we would expect:

for ( a; b; c; ) {
	if ( d ) {
		e( w );
	}
	elseif ( f ) {
		g( x );
	}
	elseif ( h ) {
		i( y );
	} else {
		j( z );
	}
}

The indent method is quite straightforward, but has the one slight complexity of needing to account for case and break statements. This is complex because a break can apply to other loop structures within a case, so the method needs to keep track of which keyword the current break applies to. It does this by maintaining a $state variable that updates the keyword state whenever opening or closing braces are encountered.
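The straightforward part can be sketched as below. This is only an illustration: indentLines is a hypothetical name, and it tracks brace depth alone, without any of the case and break handling just described.

```php
<?php
// Sketch only: indentLines() is a hypothetical helper that indents by brace
// depth and does not implement the case/break tracking.
function indentLines( $code ) {
	$depth = 0;
	$out = [];
	foreach ( explode( "\n", $code ) as $line ) {
		$line = trim( $line );
		$closes = $line !== '' && $line[0] === '}';
		// A line that starts with a closing brace belongs one level further out
		if ( $closes ) {
			$depth = max( 0, $depth - 1 );
		}
		$out[] = $line === '' ? '' : str_repeat( "\t", $depth ) . $line;
		// Braces on the rest of the line change the depth for the lines after it
		$rest = $closes ? substr( $line, 1 ) : $line;
		$depth += substr_count( $rest, '{' ) - substr_count( $rest, '}' );
		$depth = max( 0, $depth );
	}
	return implode( "\n", $out );
}

// Usage on statements that have already been put on their own lines
$flat = "for (XXX) {\nif (d) {\ne(w);\n}\nelse {\nj(z);\n}\n}";
$indented = indentLines( $flat );
```

This works because the earlier stages have already removed all existing indenting and taken the strings and comments out, so every brace seen here is a real code brace.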

CodeTidy::operators

The main job here is to use the CodeTidy::$ops array to determine the spacing that should occur before and after operators. The array has an entry for each operator followed by a 0, 1 or false for what spacing should occur before the operator, and another for the spacing after the operator. 0 means no space, 1 means one space, and false means leave the spacing however it already is.

The operators are all preserved first, from longest to shortest, because many operators share the same symbols and would otherwise confuse each other; for example, an "=" operator would match within a "===" operator. The tidying is then done by matching the preserved items and returning the restored and tidied result.

Apart from that there are a couple of special cases, such as long statements that are split over several lines so that some lines start with an operator, and colons, which are used after case statements as well as in the (condition) ? : syntax.
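Here is a small sketch of the longest-first idea. The table below is a made-up miniature in the spirit of CodeTidy::$ops, not a copy of it, and spaceOperators and the token format are likewise assumptions for the example.

```php
<?php
// Hypothetical miniature of the $ops idea: operator => [ before, after ],
// where 0 = no space, 1 = one space, false = leave the spacing as it is.
$ops = [
	'=' => [ 1, 1 ],
	'===' => [ 1, 1 ],
	'==' => [ 1, 1 ],
	'!' => [ false, 0 ],
	',' => [ 0, 1 ],
];

function spaceOperators( $code, array $ops ) {
	// Longest operators first, so that "=" cannot match inside "==="
	$keys = array_keys( $ops );
	usort( $keys, function ( $a, $b ) {
		return strlen( $b ) - strlen( $a );
	} );
	// Pass 1: preserve every operator as a numbered token
	foreach ( $keys as $n => $op ) {
		$code = str_replace( $op, "\x01$n\x01", $code );
	}
	// Pass 2: match each token plus its surrounding whitespace and restore
	// the operator with the spacing that the table asks for
	foreach ( $keys as $n => $op ) {
		list( $before, $after ) = $ops[$op];
		$code = preg_replace_callback(
			"/(\\s*)\x01$n\x01(\\s*)/",
			function ( $m ) use ( $op, $before, $after ) {
				$l = $before === false ? $m[1] : str_repeat( ' ', $before );
				$r = $after === false ? $m[2] : str_repeat( ' ', $after );
				return $l . $op . $r;
			},
			$code
		);
	}
	return $code;
}

// Usage
$spaced = spaceOperators( '$a=$b===$c', $ops );
```

Since the second pass only ever matches tokens, an operator that has already been restored can't be matched again by a shorter one.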

Issues

Currently it has problems indenting some types of multi-line statements, such as the following, where the inner content ends up triple-indented:

$value = wfEscapeWikiText( wfUrlEncode( str_replace(
	'Foo',
	'Bar',
	$baz
) ) );

To do

  • Auto split long lines
  • Test on minified code
  • Allow wildcard in filename command-line arg

See also