Subscribe: Code Commit
http://feeds.codecommit.com/codecommit
Language: English
Tags:
bar  classes  def  generalized parsing  generalized  java  language  parser combinators  parser  parsing  recursion  scala  xml
Rate this Feed

Feed Details and Statistics
Preview: Code Commit

# Code Commit

## (permanently in beta)

Updated: 2012-03-14T14:55:08Z

2011-05-31T14:11:35Z

Unveiling the Mysteries of GLL Part 2: The Problem Space

2010-06-07T01:07:58Z

In the previous article, we skimmed the surface of automated text parsing and set the stage for our impending exploration of the GLL algorithm itself. However, before we can move ahead and do just that, we should first build up some idea of what the requirements are for truly generalized parsing and what sort [...]In the previous article, we skimmed the surface of automated text parsing and set the stage for our impending exploration of the GLL algorithm itself. However, before we can move ahead and do just that, we should first build up some idea of what the requirements are for truly generalized parsing and what sort of problems we are likely to encounter. I’m going to assume you already have a working understanding of context-free grammars and how to read them. If you don’t, then I recommend you to the Wikipedia page on CFGs. Specifically, the examples are quite instructive. Recursion S ::= '(' S ')' | '(' ')' In this grammar, the S non-terminal is recursive because one of its productions refers back to itself. Specifically, the first rule corresponding to the S non-terminal is of the form α S β, where α and β stand for some arbitrary rule fragments (in this case, '(' and ')', respectively). When a non-terminal maps to a production which is recursive in its first token, we say that rule is left-recursive. For example: E ::= E '+' N | E '-' N | N N ::= '1' | '2' | '3' | ... In this grammar, the E non-terminal is left-recursive in two of its three productions. Left-recursion is a particularly significant property of a grammar because it means that any left-to-right parse process would need to parse E by first parsing E itself, and then parsing '+' and finally N (assuming that the parser is using the first production). As you can imagine, it would be very easy for a naïve parsing algorithm to get into an infinite loop, trying to parse E by first parsing E, which requires parsing E, which requires parsing E, etc. Mathematically, left-recursive productions are always of the form α E β where α —> ε. In plain-English, this means that a production is left recursive if the part of the production preceding the recursive token represents the empty string (ε). This is a very nice way of defining left-recursion, because it allows for a specific type of left-recursion known as hidden left-recursion. For example: A ::= B A '.' | '.' B ::= ',' | Notice how the second production for B is empty? This means that B can map to ε, and thus A exhibits hidden left-recursion. The difference between hidden and direct left-recursion is that hidden left-recursion is obscured by other rules in the grammar. If we didn’t know that B had the potential to produce the empty string, then we would never have realized that A is left-recursive. LR parsing algorithms (such as tabular LALR or recursive-ascent) can handle direct left-recursion without a problem. However, not even Tomita’s GLR can handle hidden left-recursion (which technically means that the GLR algorithm isn’t fully general). Hidden left-recursion is a perfectly valid property for a context-free grammar to exhibit, and so in order to be fully general, a parsing algorithm must be able to handle it. As it turns out, this is just a little bit troublesome, and many papers on parsing algorithms spend a large majority of their time trying to explain how they handle hidden left-recursion. It’s worth noting that left-recursion cannot be handled by top-down algorithms (such as tabular LL(k) or recursive-descent) without fairly significant contortions. However, such algorithms have no trouble at all with other forms of recursion (such as our original recursive example with S). Left-recursion arises very naturally in many grammars (particularly involving binary forms such as object-oriented method dispatch or mathematical operators) and is one of the primary reasons why many people prefer algorithms in the LR family over LL algorithms. Ambiguity It is perhaps surprising that conte[...]

Unveiling the Mysteries of GLL Part 1: Welcome to the Field

2010-05-30T19:40:11Z

Generalized parsing is probably the most misunderstood topic in the entire field of automated language processing. There is a persistent perception that generalized parsing is slow and impractical. Even worse, most people seem to believe that generalized parsing is complicated and unpredictable (a perception deriving from the extremely obfuscated nature of most generalized [...]Generalized parsing is probably the most misunderstood topic in the entire field of automated language processing. There is a persistent perception that generalized parsing is slow and impractical. Even worse, most people seem to believe that generalized parsing is complicated and unpredictable (a perception deriving from the extremely obfuscated nature of most generalized parsing algorithms). This is all very unfortunate, because none of it is true anymore. Now, before I move forward and justify that rather bold statement, I should probably define what I mean when I say “generalized parsing”. Parsing algorithms generally fall into one of any number of categories. The most common ones are: LL(k) (e.g. ANTLR, JavaCC) Non-Commutative LL(*) (e.g. most hand-written parsers, parser combinators) Memoized Non-Commutative LL(*) (e.g. Packrat parsers, Scala 2.8 parser combinators) LR(k) (e.g. YACC, Bison) Generalized These are arranged roughly in order from least to most powerful. This means that the supported expressivity of grammars increases as you go down the list. Techniques like LL don’t support left-recursion or ambiguity; LR supports left-recursion, but not ambiguity; and generalized supports anything that’s context-free. Note that this is a very rough arrangement. It’s difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to more well defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k) (since left-recursion can be handled with memoization, but some LR local ambiguities do not always resolve correctly). As intuition would suggest, algorithms are generally more complex (both in terms of comprehension and asymptotic performance) the more powerful you get. LL(k) algorithms, both the table-driven and the directly-encoded, are usually quite easy to understand. Parser states correspond directly to grammatical rules, and so it’s usually pretty easy to tease out the structure of the parser. By contrast, LR(k) algorithms (most commonly, tabular LALR and recursive-ascent) are usually very difficult to conceptualize and next to impossible to read when encoded in a programming language. One look at the recursive-ascent example on Wikipedia is sufficient to confirm this property. Most listed non-generalized parsing techniques are O(n) in the length of the input. The one exception here is non-commutative LL(*), which is O(k^n) in the case where the grammar is recursively ambiguous and the input is invalid. Generalized parsing on the other hand has an absolute lower-bound of o(n^2) (a property which falls out of the equivalence with matrix multiplication), though no one has ever found an algorithm which can do better than O(n^3). Clearly, generalized parsing does impose a performance penalty beyond more “conventional” techniques. For these reasons, generalized parsing is usually confined to applications which actually need to be able to handle the full set of context-free grammars — most notably, genome analysis and natural language processing. Even worse, the reticence surrounding generalized parsing has led to its avoidance in several less esoteric situations which would benefit greatly from its power — most notably, the Scala, Haskell and C/C++ compilers should use generalized parsing, but don’t. This is really a shame, because generalized parsing benefits from two major areas of advancement i[...]

Working with Scala’s XML Support

2010-05-24T14:22:29Z

XML is probably one of Scala’s most controversial language features (right behind unrestricted operator overloading). On the one hand, it’s very nice to be able to simply embed XML fragments and XPath-like expressions within your Scala source code. At least, it’s certainly a lot nicer than the string-literal approach that is required in [...]XML is probably one of Scala’s most controversial language features (right behind unrestricted operator overloading). On the one hand, it’s very nice to be able to simply embed XML fragments and XPath-like expressions within your Scala source code. At least, it’s certainly a lot nicer than the string-literal approach that is required in many other languages. However, XML literals also complicate the syntax tremendously and pose endless difficulties for incremental syntax-aware editors such as IDEs. Irrespective of the controversy though, XML literals are part of the language and they are here to stay. Martin Odersky has mentioned on multiple occasions that he half-regrets the inclusion of XML literal support, but he can’t really do anything about it now that the language has taken hold and the syntax has solidified. So, we may as well make the best of it… Unfortunately, Scala’s XML library is very…weird. Especially in Scala 2.7. The class hierarchy is unintuitive, and there are odd pitfalls and correctness dangers just waiting to entrap the unwary. That fact, coupled with the lack of appropriate documentation in the language specification, leads to a very steep learning curve for new users. This is quite unfortunate, because a solid understanding of Scala’s XML support is vital to many applications of the language, most notably the Lift web framework. I can’t personally do anything about the strangeness in the XML library. Like the literal syntax itself, it’s too late to make many fundamental changes to the way XML works in Scala. However, I can try to make it easier for beginners to get up and running with Scala’s XML support. The Hierarchy Before we get to literals and queries, it’s important to have some idea of the shape of Scala’s XML library and how its class hierarchy works. I found (and find) this to be the most unintuitive part of the entire ordeal. There are actually more classes than just this (such as Document, which extends NodeSeq, and Unparsed, which extends Atom), but you get the general idea. The ones I have shown are the classes which you are most likely to use on a regular basis. Starting from the top, NodeSeq is probably the most significant class in the entire API. The most commonly used methods in the library are defined in the NodeSeq class, and most third-party methods which work with XML usually work at the level of NodeSeq. More specifically, NodeSeq defines the \\ and \ methods, which are used for XPath selection, as well as the text method, which is used to recursively extract all text within a particular set of nodes. If you’re familiar with libraries like Nokogiri, you should be right at home with the functionality of these methods. One particularly useful aspect of Scala’s XML library is the fact that NodeSeq extends Seq[Node]. This means that you can use standard Scala collections operations to fiddle with XML (map, flatMap, etc). Unfortunately, more often than not, these methods will return something of type Seq[_], rather than choosing the more specific NodeSeq when possible. This is something which could have been solved in Scala 2.8, but has not been as of the latest nightly. Until this design flaw is rectified, the only recourse is to use the NodeSeq.fromSeq utility method to explicitly convert anything of type Seq[Node] back into the more specific NodeSeq as necessary: val nodes: Seq[Node] = ... val ns: NodeSeq = NodeSeq fromSeq nodes Immediately deriving from NodeSeq is another landmark class in the Scala API, Node. At first [...]

Understanding and Applying Operational Transformation

2010-05-17T21:24:28Z

The Magic Behind Parser Combinators

2009-03-24T07:00:14Z

If you’re like me, one of the first things that attracted you to Scala was its parser combinators.  Well, maybe that wasn’t the first thing for me, but it was pretty far up there.  Parser combinators make it almost too easy to create a parser for a complex language without ever leaving the comfortable play-pen [...]If you’re like me, one of the first things that attracted you to Scala was its parser combinators.  Well, maybe that wasn’t the first thing for me, but it was pretty far up there.  Parser combinators make it almost too easy to create a parser for a complex language without ever leaving the comfortable play-pen afforded by Scala.  Incidentally, if you aren’t familiar with the fundamentals of text parsing, context-free grammars and/or parser generators, then you might want to do some reading before you continue with this article. Intro to Parser Combinators In most languages, the process of creating a text parser is usually an arduous and clumsy affair involving a parser generator (such as ANTLR, JavaCC, Beaver or ScalaBison) and (usually) a lexer generator such as JFlex.  These tools do a very good job of generating sources for efficient and powerful parsers, but they aren’t exactly the easiest tools to use.  They generally have a very steep learning curve, and due to their unique status as compiler-compilers, an unintuitive architecture.  Additionally, these tools can be somewhat rigid, making it very difficult to implement unique or experimental features.  For this reason alone, many real world compilers and interpreters (such as javac, ruby, jruby and scalac) actually use hand-written parsers.  These are usually easier to tweak, but they can be very difficult to develop and test.  Additionally, hand-written parsers have a tendency toward poor performance (think: the Scala compiler). When creating a compiler in Scala, it is perfectly acceptable to make use of these conventional Java-generating tools like ANTLR or Beaver, but we do have other options.  Parser combinators are a domain-specific language baked into the standard library.  Using this internal DSL, we can create an instance of a parser for a given grammar using Scala methods and fields.  What’s more, we can do this in a very declarative fashion.  Thanks to the magic of DSLs, our sources will actually look like a plain-Jane context-free grammar for our language.  This means that we get most of the benefits of a hand-written parser without losing the maintainability afforded by parser generators like bison.  For example, here is a very simple grammar for a simplified Scala-like language, expressed in terms of parser combinators: object SimpleScala extends RegexpParsers {   val ID = """[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*"""r   val NUM = """[1-9][0-9]*"""r   def program = clazz*   def classPrefix = "class" ~ ID ~ "(" ~ formals ~ ")"   def classExt = "extends" ~ ID ~ "(" ~ actuals ~ ")"   def clazz = classPrefix ~ opt(classExt) ~ "{" ~ (member*) ~ "}"   def formals = repsep(ID ~ ":" ~ ID, ",")   def actuals = expr*   def member = ( "val" ~ ID ~ ":" ~ ID ~ "=" ~ expr | "var" ~ ID ~ ":" ~ ID ~ "=" ~ expr | "def" ~ ID ~ "(" ~ formals ~ ")" ~ ":" ~ ID ~ "=" ~ expr | "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr | "type" ~ ID ~ "=" ~ ID )   def expr: Parser[Expr] = factor ~ ( "+" ~ factor | &qu[...]

Interop Between Java and Scala

2009-02-09T19:01:25Z

Sometimes, the simplest things are the most difficult to explain.  Scala’s interoperability with Java is completely unparalleled, even including languages like Groovy which tout their tight integration with the JVM’s venerable standard-bearer.  However, despite this fact, there is almost no documentation (aside from chapter 29 in Programming in Scala) which shows how this Scala/Java integration [...]Sometimes, the simplest things are the most difficult to explain.  Scala’s interoperability with Java is completely unparalleled, even including languages like Groovy which tout their tight integration with the JVM’s venerable standard-bearer.  However, despite this fact, there is almost no documentation (aside from chapter 29 in Programming in Scala) which shows how this Scala/Java integration works and where it can be used.  So while it may not be the most exciting or theoretically interesting topic, I have taken it upon myself to fill the gap. Classes are Classes The first piece of knowledge you need about Scala is that Scala classes are real JVM classes.  Consider the following snippets, the first in Java: public class Person { public String getName() { return "Daniel Spiewak"; } } …and the second in Scala: class Person { def getName() = "Daniel Spiewak" } Despite the very different syntax, both of these snippets will produce almost identical bytecode when compiled.  Both will result in a single file, Person.class, which contains a default, no-args constructor and a public method, getName(), with return type java.lang.String.  Both classes may be used from Scala: val p = new Person() p.getName() // => "Daniel Spiewak" …and from Java: Person p = new Person(); p.getName(); // => "Daniel Spiewak" In the case of either language, we can easily swap implementations of the Person class without making any changes to the call-site.  In short, you can use Scala classes from Java (as well as Java classes from Scala) without ever even knowing that they were defined within another language. This single property is the very cornerstone of Scala’s philosophy of bytecode translation.  Wherever possible — and that being more often than not — Scala elements are translated into bytecode which directly corresponds to the equivalent feature in Java.  Scala classes equate to Java classes, methods and fields within those classes become Java methods and fields. This allows some pretty amazing cross-language techniques.  For example, I can extend a Java class within Scala, overriding some methods.  I can in turn extend this Scala class from within Java once again with everything working exactly as anticipated: class MyAbstractButton extends JComponent { private var pushed = false   def setPushed(p: Boolean) { pushed = p }   def getPushed = pushed   override def paintComponent(g: Graphics) { super.paintComponent(g)   // draw a button } } public class ProKitButton extends MyAbstractButton { // do something uniquely Apple-esque } Traits are Interfaces This is probably the one interoperability note which is the least well-known.  Scala’s traits are vastly more powerful than Java’s interfaces, often leading developers to the erroneous conclusion that they are incompatible.  Specifically, traits allow method definitions, while interfaces must be purely-abstract.  Yet, despite this significant distinction, Scala is still able to compile traits into interfaces at the bytecode level…with some minor enhancements. The simplest case is when the trait only contains abstract mem[...]

Hacking Buildr: Interactive Shell Support

2009-01-12T07:57:16Z

Last week, we looked at the unfortunately-unexplored topic of Scala/Java joint compilation.  Specifically, we saw several different ways in which this functionality may be invoked covering a number of different tools.  Among these tools was Buildr, a fast Ruby-based drop-in replacement for Maven with a penchant for simple configuration.  In the article I mentioned that [...]Last week, we looked at the unfortunately-unexplored topic of Scala/Java joint compilation.  Specifically, we saw several different ways in which this functionality may be invoked covering a number of different tools.  Among these tools was Buildr, a fast Ruby-based drop-in replacement for Maven with a penchant for simple configuration.  In the article I mentioned that Buildr doesn’t actually have support for the Scala joint compiler out of the box.  In fact, this feature actually requires the use of a Buildr fork I’ve been using to experiment with different extensions.  Among these extensions is a feature I’ve been wanting from Buildr for a long time: the ability to launch a pre-configured interactive shell. For those coming from a primarily-Java background, the concept of an interactive shell may seem a bit foreign.  Basically, an interactive shell — or REPL, as it is often called — is a line-by-line language interpreter which allows you to execute snippets of code with immediate result.  This has been a common tool in the hands of dynamic language enthusiasts since the days of LISP, but has only recently found its way into the world of mainstream static languages such as Scala. One of the most useful applications of these tools is the testing of code (particularly frameworks) before the implementations are fully completed.  For example, when working on my port of Clojure’s PersistentVector, I would often spin up a Scala shell to quickly test one aspect or another of the class.  As a minor productivity plug, JavaRebel is a truly invaluable tool for development of this variety. The problem with this pattern of work is it requires the interactive shell in question to be pre-configured to include the project’s output directory on the CLASSPATH.  While this isn’t usually so bad, things can get very sticky when you’re working with a project which includes a large number of dependencies.  It isn’t too unreasonable to imagine shell invocations stretching into the tens of lines, just to spin up a “quick and dirty” test tool. Further complicating affairs is the fact that many projects do away with the notion of fixed dependency paths and simply allow tools like Maven or Buildr to manage the CLASSPATH entirely out of sight.  In order to fire up a Scala shell for a project with any external dependencies, I must first manually read my buildfile, parsing out all of the artifacts in use.  Then I have to grope about in my ~/.m2/repository directory until I find the JARs in question.  Needless to say, the productivity benefits of this technique become extremely suspect after the first or second expedition. For this reason, I strongly believe that the launch of an interactive shell should be the responsibility of the tool managing the dependencies, rather than that of the developer.  Note that Maven already has some support for shells in conjunction with certain languages (Scala among them), but it is as crude and verbose as the tool itself.  What I really want is to be able to invoke the following command and have the appropriate shell launched with a pre-configured CLASSPATH.  I shouldn’t have to worry about the language of my project, the location of my repository or even if the shell requires extra configuration on my platform.  The idea is that everything should all work auto-magically: \$ bui[...]

Gun for Hire (off topic)

2009-01-07T08:24:39Z

Just in case you thought Christmas was over, I have a late gift for the world: I’m available for hire!  Ok, so maybe this wasn’t exactly the stocking-stuffer you were expecting, but it’s the thought that counts. I’m announcing my availability for employment as a part-time developer.  Those of you who follow this blog are probably [...]

Just in case you thought Christmas was over, I have a late gift for the world: I’m available for hire!  Ok, so maybe this wasn’t exactly the stocking-stuffer you were expecting, but it’s the thought that counts.

I’m announcing my availability for employment as a part-time developer.  Those of you who follow this blog are probably already familiar with my areas of expertise, so I don’t think there is a need to bore you with a rehash.  Resume available on request!

Anyway, my preference would be a project where I get to use multiple different languages, particularly Scala and Clojure, but I’m flexible.  If you think my skills would make a positive addition to your team, shoot me an email!

(image)

Joint Compilation of Scala and Java Sources

2009-01-05T20:13:56Z

One of the features that the Groovy people like to flaunt is the joint compilation of .groovy and .java files.  This is a fantastically powerful concept which (among other things) allows for circular dependencies between Java, Groovy and back again.  Thus, you can have a Groovy class which extends a Java class which in turn [...]One of the features that the Groovy people like to flaunt is the joint compilation of .groovy and .java files.  This is a fantastically powerful concept which (among other things) allows for circular dependencies between Java, Groovy and back again.  Thus, you can have a Groovy class which extends a Java class which in turn extends another Groovy class. All this is old news, but what you may not know is the fact that Scala is capable of the same thing.  The Scala/Java joint compilation mode is new in Scala 2.7.2, but despite the fact that this release has been out for more than two months, there is still a remarkable lack of tutorials and documentation regarding its usage.  Hence, this post… Concepts For starters, you need to know a little bit about how joint compilation works, both in Groovy and in Scala.  Our motivating example will be the following stimulating snippet: // foo.scala class Foo   class Baz extends Bar …and the Java class: // Bar.java public class Bar extends Foo {} If we try to compile foo.scala before Bar.java, the Scala compiler will issue a type error complaining that class Bar does not exist.  Similarly, if we attempt the to compile Bar.java first, the Java compiler will whine about the lack of a Foo class.  Now, there is actually a way to resolve this particular case (by splitting foo.scala into two separate files), but it’s easy to imagine other examples where the circular dependency is impossible to linearize.  For the sake of example, let’s just assume that this circular dependency is a problem and cannot be handled piece-meal. In order for this to work, either the Scala compiler will need to know about class Bar before its compilation, or vice versa.  This implies that one of the compilers will need to be able to analyze sources which target the other.  Since Scala is the language in question, it only makes sense that it be the accommodating one (rather than javac). What scalac has to do is literally parse and analyze all of the Scala sources it is given in addition to any Java sources which may also be supplied.  It doesn’t need to be a full fledged Java compiler, but it does have to know enough about the Java language to be able to produce an annotated structural AST for any Java source file.  Once this AST is available, circular dependencies may be handled in exactly the same way as circular dependencies internal to Scala sources (because all Scala and all Java classes are available simultaneously to the compiler). Once the analysis phase of scalac has blessed the Scala AST, all of the Java nodes may be discarded.  At this point, circular dependencies have been resolved and all type errors have been handled.  Thus, there is no need to carry around useless class information.  Once scalac is done, both the Foo and the Baz classes will have produced resultant Foo.class and Baz.class output files. However, we’re still not quite done yet.  Compilation has successfully completed, but if we try to run the application, we will receive a NoClassDefFoundError due to the fact that the Bar class has not actually been compiled.  Remember, scalac only analyzed it for the sake of the type checker, no actual bytecode was produced.  Bar may even suffer from a compile error of some sort, as long as this error is within the method definitions, scalac isn’t going to catch it. The final st[...]