2023-05-25
2023-05-26 edit: based on the feedback from this discussion (thanks everyone!), I simplified things by leaving only the more universal multistring variant. I also moved some information to appendices and simplified the formal definition, removing tags and reintroducing them as a possible extension.
In this article I will share a recipe I developed for a syntactic feature that is very useful, but notoriously difficult to get right and still not fully evolved, and that has been gaining more and more adoption in recent years. I hope it will be especially useful for designers and implementers of full-blown programming languages as well as more specialized languages (e.g. configuration or text markup formats).
Feel free to skip the introduction and get straight to the idea or the formal definition.
I shall call this feature the “multistring”. It is a generalization of a multitude of related syntactic constructs, known variously as: raw string, raw string literal, multiline string, multiline string literal, template literal, here string, heredoc, here document, here text, hereis, here script, code block, fenced code block, inline code block, inline code, and more.
It is available in some (often flawed) form in:
I’ll forgo trying to classify these in terms of how good the feature is, but it certainly varies. There are languages that have many different types of strings that seem to try to achieve approximations of what I’m about to describe, with none of them fully succeeding.
I believe that the general idea behind all manifestations of this feature is the same: to be able to embed an arbitrary sequence of characters in a syntax, without needing to modify that sequence to fit the limitations of the syntax. The purpose of this may be to create and populate a text file inside a script, to embed one language into another, to embed a fragment of source code of a language in itself as a string (suppressing normal interpretation), etc.
Multistrings aim to unify and simplify all these features into one extensible construct that could become (perhaps already is becoming) standard in programming languages, portable configuration or data-interchange formats, or other text-syntax-driven languages, new or already existing.
I think that if you were forced to choose only one type of strings for your language, multistrings would be an excellent choice.
I propose one simple syntax construct, called the “multistring”, which shall look very similar to Markdown syntax for embedding code. I choose Markdown as the basis, because it offers syntax which is perhaps the simplest and most familiar, and among the least flawed.
The syntax I propose is very similar to the Markdown fenced code block syntax. The difference is that it uses ' (apostrophes)[1] instead of linebreaks to separate the multistring content from the delimiters. With that adjustment, almost all (except empty) valid Markdown code blocks would also be valid multistrings, e.g.
```'a multiline
multistring'```
is a valid multistring.
Compare that to a Markdown code block:
```
a multiline
multistring
```
The only difference is the use of apostrophes instead of linebreaks.
Multistrings may also admit blocks delimited with double and single backticks[2], e.g.:
``'also a multiline
multistring'``
`'another multiline
multistring'`
The syntax for multistrings cannot, in principle, be fully defined as a rule in a standard metasyntax such as EBNF or ABNF. We need a hyperrule[3] instead: a parametrizable kind of rule that can accept arguments to produce concrete rules.
This is a lot less scary than it sounds and actually easy to implement in any programming language. A “hyperrule” is simply a function with parameters.
For this reason, I will first introduce the proposed multistring syntax rule (or hyperrule if you will) in the form of Python-like pseudocode:
# the `multistring` hyperrule accepts one parameter `n` which specifies
# the number of repetitions of the multistring delimiter "`"
def multistring(
    # `n` is an integer in range 1 to infinity
    # in practice uint >= 1 or even uint8 >= 1 shall be sufficient
    n
):
    return sequence(
        "`".times(n),
        "'",
        # `anychar` is any unicode character
        zeroOrMore(anychar).terminatedBy(
            sequence("'", "`".times(n))
        ),
    )
Now here is the above more concisely expressed in customized ABNF:
multistring(n) = {n}"`" "'" *any⇥("'" {n}"`")
where the customizations are:
- rule(a) = introduces a rule parametrized with the argument a (a hyperrule)
- {n}rule matches rule repeated n times
- rule_a⇥rule_b matches rule_a delimited by rule_b, i.e. as soon as a match for rule_b is found while matching rule_a, matching succeeds[4]
This hyperrule would “expand to” an infinite[5] number of regular rules, such as:
multistring_1 = "`" "'" *any⇥("'" "`")
multistring_2 = "``" "'" *any⇥("'" "``")
multistring_3 = "```" "'" *any⇥("'" "```")
...
multistring_5 = "`````" "'" *any⇥("'" "`````")
...
multistring_10 = "``````````" "'" *any⇥("'" "``````````")
...
This formal definition should be sufficient to implement the feature.
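As an illustration of that claim, here is a minimal sketch of a hand-written parser for the rule in plain Python. The function name `parse_multistring` and the choice to report errors with `ValueError` are assumptions made for this example, not part of the definition. The hyperrule parameter n is not passed in; it is read off the input by counting the opening backticks.

```python
def parse_multistring(src, pos=0):
    """Parse one multistring starting at src[pos].

    Returns (content, end_position); raises ValueError on malformed input.
    """
    # Count the opening backticks: this is the hyperrule parameter n.
    start = pos
    while pos < len(src) and src[pos] == "`":
        pos += 1
    n = pos - start
    if n == 0:
        raise ValueError("expected at least one backtick")
    if pos >= len(src) or src[pos] != "'":
        raise ValueError("expected an apostrophe after the backticks")
    pos += 1
    # The content ends at the first apostrophe followed by n backticks.
    closer = "'" + "`" * n
    end = src.find(closer, pos)
    if end == -1:
        raise ValueError("unterminated multistring")
    return src[pos:end], end + len(closer)


content, end = parse_multistring("```'a multiline\nmultistring'```")
print(repr(content))
```

Because the search for the closing delimiter starts after the opening apostrophe, the opening and closing delimiters can never share a character.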
Multistrings, as specified above, are very basic and solve only one problem, i.e. verbatim embedding of arbitrary text into another syntax (e.g. source code in some programming language) without needing to modify the embedded text. Thanks to multistrings, we can literally copy-paste anything into a syntax that supports them as a string and not worry about delimiter collision. When in doubt: add more backticks.
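The “add more backticks” advice can be automated. The sketch below wraps arbitrary text in the shortest multistring that can hold it, assuming the rule as defined above: the only sequence the content must avoid is an apostrophe followed by n backticks. The name `to_multistring` is hypothetical.

```python
import re


def to_multistring(text):
    """Wrap arbitrary text in the shortest multistring that holds it verbatim."""
    # Find every backtick run that directly follows an apostrophe; the
    # delimiter needs one more backtick than the longest such run.
    runs = re.findall(r"'(`*)", text)
    longest = max((len(run) for run in runs), default=0)
    fence = "`" * (longest + 1)
    return fence + "'" + text + "'" + fence


print(to_multistring("no collisions here"))
print(to_multistring("contains '`` inside"))
```

Backticks that do not follow an apostrophe never force a longer delimiter, which is why a single backtick can be wrapped with single-backtick delimiters.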
However, there are a number of possible extensions that we can add to the feature. One I’ll mention here briefly is what I call tags.
Tags can be seen as a generalization of Markdown language specifiers.
A tagged multistring may look like:
```tag'multistring'```
Such tags can be used as metadata to describe the content within the multistring. This metadata may direct transformation(s) of the multistring: e.g. to interpret it as a specific language, to adjust or remove indentation, to enable interpretation of \x escape sequences, to enable ${substitutions}, etc.
To allow tags, we’d extend our syntax (in a backwards-compatible way) like so (again in customized ABNF):
multistring(n) = {n}"`" tag "'" *any⇥("'" {n}"`")
The only change is the addition of tag in between the opening backticks and the apostrophe. For this to be backwards-compatible, tag must be allowed to match the empty string, so that untagged multistrings remain valid.
I’ll leave the precise definition of the tag rule for another time. Certainly it should not contain backticks or apostrophes and generally we’d want tags to be kept on the same line as the multistring delimiter. For an actual implementation I’d recommend starting with a conservative syntax for tags, with limited special symbols – we may want to use one or more of those to create a generalized version of multistrings (but that’s perhaps for another article).
I give a few examples of how tags could be used in Appendix III.
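To make the extended rule concrete, here is a sketch of a parser that also captures the tag. Since the tag rule is deliberately left open, the regular expression below encodes an assumption of mine: a tag is any (possibly empty) run of characters other than backticks, apostrophes and line breaks. The name `parse_tagged_multistring` is likewise hypothetical.

```python
import re

# Assumption: a tag is a possibly empty run of characters that are not
# backticks, apostrophes, or line breaks. The article leaves this rule open.
OPENER = re.compile(r"(`+)([^`'\r\n]*)'")


def parse_tagged_multistring(src, pos=0):
    """Return (tag, content, end_position) for a multistring at src[pos]."""
    opener = OPENER.match(src, pos)
    if opener is None:
        raise ValueError("expected backticks, an optional tag, and an apostrophe")
    fence, tag = opener.group(1), opener.group(2)
    # The closing delimiter is unchanged: an apostrophe plus the same backticks.
    closer = "'" + fence
    end = src.find(closer, opener.end())
    if end == -1:
        raise ValueError("unterminated multistring")
    return tag, src[opener.end():end], end + len(closer)


print(parse_tagged_multistring("```tag'multistring'```"))
```

An untagged multistring simply yields an empty tag, which is what keeps the extension backwards-compatible.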
Having defined the multistring rule, it may be time to use it as part of a new syntax! I encourage language designers and implementers to try it out. If you do, let me know how it went! Meanwhile, in a future article I intend to present a little format for configuration which makes extensive use of multistrings.
The described multistring syntax can be generalized – perhaps I’ll discuss the details in yet another article.
Should I write these, they shall be linked here. If you want to be notified of that, you can follow @jevko@layer8.space if you are on Mastodon or you can subscribe directly to that via RSS.
I hope multistrings will prove useful and we’ll be seeing more of them in the wild.
This is it for now.
Thank you for reading and until next time.
This post was discussed on reddit.com/r/ProgrammingLanguages.
If you like this, you can help by spreading it further. In any case, you can use this title and link:
Multistrings: a simple syntax for heredoc-style strings (2023)
Comments? Questions? Write to me at darius.j.chuck
Thank you,
Darius
Other more or less viable separators for multistrings include:
" (double quote): I chose ' over " because often languages use ' for raw strings which do not support escape sequences or substitutions – as is true for (untagged) multistrings. An implementation may choose to allow " as an alternative separator. Another (very minor) reason is that ' is generally one keypress less than ".
A nice advantage of ' also is that if we find ourselves needing to convert from a regular '-delimited string into a multistring, there is no need to delete or replace the delimiters. We only need to add a layer of `-delimiters around the regular string. This is particularly easy in a modern code editor.
For example if we have an '-delimited string such as:
'a string with an apostrophe: '
and we find that the next character we insert is ', making the string invalid:
'a string with an apostrophe: ''
we can quickly fix this by surrounding the whole string with `:
`'a string with an apostrophe: ''`
That said, | is a nice separator too and you can choose whatever you like. You could also replace ` with another delimiter that suits your language better.
Implementers beware.
In Markdown, an empty code block is denoted as:
```
```
Note the single linebreak between the delimiters.
However, an equivalent empty multistring is:
```''```
rather than:
```'```
The stated formal definition does not allow “fusing together” the opening and closing delimiters like this, which is what effectively happens in Markdown.
Instead, an empty multistring is always an opening delimiter immediately next to a closing delimiter. This is the same principle as in the familiar "" or '' empty strings.
Thanks to that, the following multistring:
```'```'```
is valid and contains ```.
Whereas in Markdown an analogous syntax:
```
```
```
means an empty code block followed by ``` (an unclosed, and thus invalid, code block). To make that work in Markdown, we would need to increase the number of backticks around the middle ```.
This edge case illustrates that the multistring syntax is more regular than Markdown, thanks to the simple formal definition.
By the way, the following is a minimal edge case of a multistring that can be used to test a parser:
`'`'`
It should parse as a multistring which contains `.
For completeness, this is the minimal empty multistring:
`''`
For example, a dedent tag could signify that the multistring should be postprocessed by removing the first linebreak, the indentation of each line up to the indentation of the last line, as well as the last line itself, achieving the behavior of raw string literals from C# 11 (thanks @useerup on reddit for mentioning those). For example, using dedent this multistring:
```dedent'
    {
      "key": "value"
    }
    '```
would be equivalent to this one without dedent:
```'{
  "key": "value"
}'```
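One possible reading of that description, as a sketch: the function below takes the raw content of a multistring and applies the three steps (drop the first linebreak, strip the last line's indentation from every line, drop the last line). The name `dedent_tag`, the error handling, and the four-space indentation in the sample input are assumptions for this example; the details of C# 11's own rules may differ.

```python
def dedent_tag(content):
    """Post-process multistring content as a hypothetical `dedent` tag might."""
    lines = content.split("\n")
    # The content must start with a line break and end with a line that
    # holds only the indentation placed before the closing delimiter.
    if len(lines) < 2 or lines[0] != "" or lines[-1].strip() != "":
        raise ValueError("content must start with a line break "
                         "and end with a whitespace-only line")
    indent = lines[-1]
    out = []
    for line in lines[1:-1]:  # skip the first line break and the last line
        if line.startswith(indent):
            out.append(line[len(indent):])
        elif line.strip() == "":
            out.append("")  # blank lines may be shorter than the indent
        else:
            raise ValueError("line is indented less than the closing delimiter")
    return "\n".join(out)


raw = '\n    {\n      "key": "value"\n    }\n    '
print(dedent_tag(raw))
```

Using the last line as the reference means the author controls the amount of stripped indentation simply by positioning the closing delimiter.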
An esc tag could signify that C-style escapes should be recognized and replaced within the multistring. E.g.:
```esc'\n\n\n'```
would be equivalent to:
```'


'```
i.e. a string which contains 3 linebreaks.
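A sketch of what such an esc postprocessor could look like follows. Which escapes count as “C-style” is not specified here, so the small table and the \xHH form below are an assumed subset, and `esc_tag` is a hypothetical name.

```python
import re

# Assumption: only this small subset of C-style escapes is supported.
SIMPLE_ESCAPES = {
    "n": "\n", "t": "\t", "r": "\r", "0": "\0",
    "\\": "\\", "'": "'", '"': '"',
}
# A backslash followed by either xHH (two hex digits) or any single character.
ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def esc_tag(content):
    """Replace the supported escape sequences in multistring content."""
    def replace(match):
        body = match.group(1)
        if body[0] == "x" and len(body) == 3:
            return chr(int(body[1:], 16))
        if body in SIMPLE_ESCAPES:
            return SIMPLE_ESCAPES[body]
        raise ValueError("unknown escape sequence: \\" + body)
    return ESCAPE.sub(replace, content)


print(repr(esc_tag(r"\n\n\n")))
```

Rejecting unknown escapes rather than passing them through is a design choice of this sketch; a lone backslash at the very end of the content is left untouched.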
A $ tag could turn the multistring into a template literal, where names of variables or expressions could be substituted for their values, e.g.:
`$'Hello, ${username}!'`
could be equivalent to:
`'Hello, John!'`
assuming John as the value of the username variable.
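For the variable-name case, Python's standard `string.Template` already understands `${name}` placeholders, so a sketch of the $ tag can lean on it. This covers only names, not arbitrary expressions, and the function name `template_tag` and the explicit `variables` mapping are assumptions for the example.

```python
from string import Template


def template_tag(content, variables):
    """Substitute ${name} placeholders in multistring content."""
    # Template also accepts $name and treats $$ as an escaped dollar sign;
    # substitute() raises KeyError when a variable is missing.
    return Template(content).substitute(variables)


print(template_tag("Hello, ${username}!", {"username": "John"}))
```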
Multiple tags could be allowed for one multistring, perhaps by comma-separating them. E.g.:
`$,esc'Hello,\n${username}!'`
Here we are using both the $ and esc tags to achieve something like:
`'Hello,
John!'`
In this way we could make up all kinds of useful tags and rules for them.
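One way to organize such tags is a table of postprocessors applied one after another. The sketch below assumes comma-separated tags applied left to right (the order is my assumption; it is not fixed above), uses `string.Template` for $, and handles only \n and \t for esc to stay short. All names are hypothetical.

```python
from string import Template


def apply_tags(tag, content, variables):
    """Apply comma-separated tags to multistring content, left to right."""
    processors = {
        "$": lambda text: Template(text).substitute(variables),
        # Simplification: only \n and \t are handled, and an escaped
        # backslash (\\) is not recognized.
        "esc": lambda text: text.replace("\\n", "\n").replace("\\t", "\t"),
    }
    for name in filter(None, tag.split(",")):
        if name not in processors:
            raise ValueError("unknown tag: " + name)
        content = processors[name](content)
    return content


print(apply_tags("$,esc", "Hello,\\n${username}!", {"username": "John"}))
```

With this ordering, escapes inside substituted values would also be processed; swapping the order of the tags avoids that.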
[1] Which are an arbitrary choice. This recipe can of course be adjusted by picking different delimiters, to fit the syntax of a specific programming language, implementer’s sense of aesthetics, or other constraints.
[2] Although an implementation is of course free to restrict the minimum number of backticks.
[3] If interested in the details of that, you may want to look into: two-level grammars, Van Wijngaarden grammars, affix grammars or extended affix grammars.
[4] To put it another way, rule_a⇥rule_b can be expressed in EBNF as: (rule_a - rule_b), rule_b.
[5] In practice if we limit n to 255, we’ll get 255 rules.