Darius J Chuck

Multistrings: a simple syntax for heredoc-style strings

2023-05-25

2023-05-26 edit: based on the feedback from this discussion (thanks everyone!), I simplified things by leaving only the more universal multistring variant. I also moved some information to appendices and simplified the formal definition, removing tags and reintroducing them as a possible extension.

In this article I will share a recipe I developed for a syntactic feature that is very useful, notoriously difficult to get right, and still not fully evolved, yet has been gaining more and more adoption in recent years. I hope it will be especially useful for designers and implementers of full-blown programming languages as well as more specialized languages (e.g. configuration or text markup formats).

Feel free to skip the introduction and get straight to the idea or the formal definition.

Introduction

I shall call this feature the “multistring”. It is a generalization of a multitude of related syntactic constructs, known variously as: raw string, raw string literal, multiline string, multiline string literal, template literal, here string, heredoc, here document, here text, hereis, here script, code block, fenced code block, inline code block, inline code, and more.

It is available in some (often flawed) form in:

Unix and other shells, such as Bourne shell (sh), C shell (csh), tcsh, KornShell (ksh), Bourne Again Shell (bash), Z shell, Windows PowerShell, DIGITAL Command Language,
programming languages, such as Perl, PHP, Ruby, JavaScript, Python, Julia, Tcl, C++, D, R, Racket,
configuration formats, such as YAML, TOML, HOCON,
text markup languages, such as Markdown,
other languages, such as NMAKE, Job Control Language, and more.

I’ll forgo trying to classify these in terms of how good the feature is, but it certainly varies. There are languages that have many different types of strings that seem to try to achieve approximations of what I’m about to describe, with none of them fully succeeding.

I believe that the general idea behind all manifestations of this feature is the same: to be able to embed an arbitrary sequence of characters in a syntax, without needing to modify that sequence to fit the limitations of the syntax. The purpose of this may be to create and populate a text file inside a script, to embed one language into another, to embed a fragment of source code of a language in itself as a string (suppressing normal interpretation), etc.

Multistrings aim to unify and simplify all these features into one extensible construct that could become (perhaps already is becoming) standard in programming languages, portable configuration or data-interchange formats, or other text-syntax-driven languages, new or already existing.

I think that if you were forced to choose only one type of string for your language, multistrings would be an excellent choice.

The idea

I propose one simple syntax construct, called the “multistring”, which shall look very similar to Markdown syntax for embedding code. I chose Markdown as the basis because its syntax is perhaps the simplest and most familiar, and among the least flawed.

The difference from the Markdown fenced code block syntax is that a multistring uses ' (apostrophes)¹ instead of linebreaks to separate the content from the delimiters. With that adjustment, almost all valid Markdown code blocks (all except empty ones) would also be valid multistrings, e.g.

```'a multiline
multistring'```

is a valid multistring.

Compare that to a Markdown code block:

```
a multiline
multistring
```

The only difference is the use of apostrophes instead of linebreaks.

Multistrings may also admit blocks delimited with double and single backticks², e.g.:

``'also a multiline
multistring'``

`'another multiline
multistring'`

Formal definition

The syntax for multistrings cannot, in principle, be fully defined as a rule in a standard metasyntax such as EBNF or ABNF. We need a hyperrule³ instead: a parametrizable kind of rule that can accept arguments to produce concrete rules.

This is a lot less scary than it sounds and actually easy to implement in any programming language. A “hyperrule” is simply a function with parameters.

For this reason, I will first introduce the proposed multistring syntax rule (or hyperrule if you will) in the form of Python-like pseudocode:

# the `multistring` hyperrule accepts one parameter `n` which specifies
# the number of repetitions of the multistring delimiter "`"
def multistring(
  # `n` is an integer in range 1 to infinity
  # in practice uint >= 1 or even uint8 >= 1 shall be sufficient
  n
):
  return sequence(
    "`".times(n),
    "'",
    # `anychar` is any unicode character
    zeroOrMore(anychar).terminatedBy(
      sequence(
        "'",
        "`".times(n)
      )
    ),
  )

Now here is the above more concisely expressed in customized ABNF:

multistring(n) = {n}"`" "'" *any⇥("'" {n}"`")

where the customizations are:

rule(a) = introduces a rule parametrized with the argument a (a hyperrule)
{n}rule matches rule repeated n times
rule_a⇥rule_b matches rule_a delimited by rule_b, i.e. as soon as a match for rule_b is found while matching rule_a, matching succeeds⁴

This hyperrule would “expand to” an infinite⁵ number of regular rules, such as:

multistring_1 = "`" "'" *any⇥("'" "`")
multistring_2 = "``" "'" *any⇥("'" "``")
multistring_3 = "```" "'" *any⇥("'" "```")
...
multistring_5 = "`````" "'" *any⇥("'" "`````")
...
multistring_10 = "``````````" "'" *any⇥("'" "``````````")
...
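
To make the "hyperrule is a function" reading concrete, here is a minimal sketch in Python. It assumes the standard `re` module, and the name `multistring_rule` is hypothetical, not part of the proposal. Calling it with a concrete `n` yields one ordinary rule, much like the expanded rules listed above.

```python
import re

def multistring_rule(n: int) -> re.Pattern:
    """Expand the hyperrule for one concrete n into an ordinary regex."""
    if n < 1:
        raise ValueError("n must be at least 1")
    fence = re.escape("`" * n)
    # The lazy `.*?` (with DOTALL so that it also spans linebreaks) stops at
    # the first apostrophe followed by n backticks, which plays the role of
    # the "terminated by" operator in the customized ABNF.
    return re.compile(fence + r"'(.*?)'" + fence, re.DOTALL)

# One concrete rule per value of n, produced on demand.
multistring_3 = multistring_rule(3)
match = multistring_3.match("```'a multiline\nmultistring'```")
print(match.group(1))
```

The pattern for a given `n` only matches input that opens with exactly `n` backticks followed by an apostrophe, so choosing which rule to apply is left to the caller in this sketch.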

This formal definition should be sufficient to implement the feature.
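
As an illustration of how little is needed, the following is one possible hand-written scanner, a sketch rather than a reference implementation. The function name `parse_multistring` and its return convention (content plus end position, or `None`) are assumptions made for this example. It reads `n` from the input by counting the opening backticks and then searches for the first matching closing delimiter.

```python
from typing import Optional, Tuple

def parse_multistring(src: str, pos: int = 0) -> Optional[Tuple[str, int]]:
    """Try to parse a multistring starting at src[pos].

    Returns (content, end_position), or None if no complete multistring
    starts at that position.
    """
    # Count the opening backticks; this fixes the parameter n.
    start = pos
    while pos < len(src) and src[pos] == "`":
        pos += 1
    n = pos - start
    if n == 0 or pos >= len(src) or src[pos] != "'":
        return None
    pos += 1  # skip the opening apostrophe

    # The content runs up to the first apostrophe followed by n backticks.
    closing = "'" + "`" * n
    end = src.find(closing, pos)
    if end == -1:
        return None  # unterminated multistring
    return src[pos:end], end + len(closing)

print(parse_multistring("x = `'hi'` + y", 4))
```

Returning the end position lets a surrounding parser continue right after the multistring, which is how such a routine would typically be embedded in a larger syntax.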

Possible extensions

Multistrings, as specified above, are very basic and solve only one problem, i.e. verbatim embedding of arbitrary text into another syntax (e.g. source code in some programming language) without needing to modify the embedded text. Thanks to multistrings, we can literally copy-paste anything into a syntax that supports them as a string and not worry about delimiter collision. When in doubt: add more backticks.
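
The "add more backticks" rule can also be automated. The sketch below assumes the formal definition above and uses a hypothetical helper name, `to_multistring`. Under that definition, the only thing that can end the content early is an apostrophe followed by `n` backticks, so the helper grows `n` until that sequence no longer occurs in the text.

```python
def to_multistring(text: str) -> str:
    """Wrap text verbatim, using the fewest backticks that avoid a collision."""
    n = 1
    # An apostrophe followed by n backticks inside the text would be taken
    # for the closing delimiter, so keep adding backticks until it is absent.
    while ("'" + "`" * n) in text:
        n += 1
    fence = "`" * n
    return fence + "'" + text + "'" + fence

print(to_multistring("plain text"))
print(to_multistring("a tricky '` sequence"))
```

Note that backticks alone never force a longer delimiter here: a text consisting only of backticks still fits between single-backtick delimiters, because no apostrophe precedes them.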

However, there are a number of possible extensions that we can add to the feature. One I’ll mention here briefly is what I call tags (see Appendix III for examples).

What’s next

Having defined the multistring rule, it may be time to use it as part of a new syntax! I encourage language designers and implementers to try it out. If you do, let me know how it went! Meanwhile, in a future article I intend to present a little format for configuration which makes extensive use of multistrings.

The described multistring syntax can be generalized – perhaps I’ll discuss the details in yet another article.

Should I write these, they shall be linked here. If you want to be notified of that, you can follow @jevko@layer8.space if you are on Mastodon or you can subscribe directly to that via RSS.

Conclusion

I hope multistrings will prove useful and we’ll be seeing more of them in the wild.

This is it for now.

Thank you for reading and until next time.

This post was discussed on reddit.com/r/ProgrammingLanguages.

If you like this, you can help by spreading it further. You can use this title and link:

Multistrings: a simple syntax for heredoc-style strings (2023)

https://djedr.github.io/posts/multistrings-2023-05-25.html

Comments? Questions? Write to me at darius.j.chuck

Thank you,

Darius

Appendix I: why choose apostrophe as the separator?

Other more or less viable separators for multistrings include:

" (double quote): I chose ' over " because often languages use ' for raw strings which do not support escape sequences or substitutions – as is true for (untagged) multistrings. An implementation may choose to allow " as an alternative separator. Another (very minor) reason is that ' is generally one keypress less than ".
linebreak: this would make multistrings look exactly like Markdown code blocks. I rejected it because it is slightly harder to implement, less obvious, less flexible, and more error-prone, and because I wanted to reduce confusion with Markdown.

Another nice advantage of ' is that if we find ourselves needing to convert a regular '-delimited string into a multistring, there is no need to delete or replace the delimiters. We only need to add a layer of `-delimiters around the regular string. This is particularly easy in a modern code editor.

For example if we have an '-delimited string such as:

'a string with an apostrophe: '

and we find that the next character we insert is ', making the string invalid:

'a string with an apostrophe: ''

we can quickly fix this by surrounding the whole string with `:

`'a string with an apostrophe: ''`

That said, | is a nice separator too and you can choose whatever you like. You could also replace ` with another delimiter that suits your language better.

Appendix II: edge cases

Implementers beware.

Empty multistrings are not like empty Markdown code blocks

In Markdown, an empty code block is denoted as:

```
```

Note the single linebreak between the delimiters.

However, an equivalent empty multistring is:

```''```

rather than:

```'```

The stated formal definition does not allow “fusing together” the opening and closing delimiters like this, which is what effectively happens in Markdown.

Instead, an empty multistring is always an opening delimiter immediately next to a closing delimiter. This is the same principle as in the familiar "" or '' empty strings.

Thanks to that the following multistring:

```'```'```

is valid and contains ```.

Whereas in Markdown an analogous syntax:

```
```
```

means an empty code block followed by ``` (an unclosed, and thus invalid, code block). To make that work in Markdown, we would need to increase the number of backticks around the middle ```.

This edge case illustrates that the multistring syntax is more regular than Markdown, thanks to the simple formal definition.

By the way, the following is a minimal edge case of a multistring that can be used to test a parser:

`'`'`

It should parse as a multistring which contains `.

For completeness, this is the minimal empty multistring:

`''`
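
The edge cases in this appendix make a compact test table. The sketch below checks them with a single regular expression that relies on a backreference to require the same number of backticks on both sides. This is an assumption about one possible implementation (Python's `re` module, with the hypothetical names `MULTISTRING` and `content_of`), not something the definition prescribes.

```python
import re

# One pattern for every n: group 1 captures the opening backticks and the
# backreference \1 demands the same run after the closing apostrophe.
MULTISTRING = re.compile(r"(`+)'(.*?)'\1", re.DOTALL)

def content_of(src: str):
    """Return the content of a multistring at the start of src, or None."""
    m = MULTISTRING.match(src)
    return m.group(2) if m else None

cases = {
    "`''`": "",            # minimal empty multistring
    "```''```": "",        # empty multistring with three backticks
    "`'`'`": "`",          # minimal edge case: contains one backtick
    "```'```'```": "```",  # contains three backticks
    "```'```": None,       # fused delimiters are not a multistring
}
for src, expected in cases.items():
    assert content_of(src) == expected, src
print("all edge cases pass")
```

The same table could be reused against any other parser for the syntax by swapping out `content_of`.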

Appendix III: examples for how tags could be used

For example, a dedent tag could signify that the multistring should be postprocessed by removing the first linebreak, all indentation that goes beyond the indentation of the last line, as well as the last line. This achieves the behavior of raw string literals from C# 11 (thanks @useerup on reddit for mentioning those). Using dedent, this multistring:

    ```dedent'
    {
      "key": "value
    }
    '```

would be equivalent to this one without dedent:

```'{
  "key": "value
}'```

An esc tag could signify that C-style escapes should be recognized and replaced within the multistring. E.g.:

```esc'\n\n\n'```

would be equivalent to:

```'


'```

i.e. a string which contains 3 linebreaks.

A $ tag could turn the multistring into a template literal, where names of variables or expressions could be substituted for their values, e.g.:

`$'Hello, ${username}!'`

could be equivalent to:

`'Hello, John!'`

assuming John as the value of the username variable.

Multiple tags could be allowed for one multistring, perhaps by comma-separating them. E.g.:

`$,esc'Hello,\n${username}!'`

Here we are using both the $ and esc tags to achieve something like:

`'Hello,
John!'`

In this way we could make up all kinds of useful tags and rules for them.
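
To show how the three example tags might fit together, here is a rough sketch. Everything in it is an assumption for illustration: the tag syntax is inferred from the examples above (tags sit between the backticks and the opening apostrophe, separated by commas), the names `TAGGED`, `dedent`, `unescape` and `evaluate` are hypothetical, tags are applied left to right, only a few escapes are handled, and the $ tag is approximated with Python's `string.Template`, which supports variable names but not arbitrary expressions.

```python
import re
from string import Template

# Backticks, optional comma-separated tags, apostrophe, content, closing.
TAGGED = re.compile(r"(`+)([^'`]*)'(.*?)'\1", re.DOTALL)
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\"}  # a small subset of C escapes

def dedent(content: str) -> str:
    lines = content.split("\n")
    indent = lines[-1]   # the last line holds only the indentation
    body = lines[1:-1]   # drop the first linebreak and the last line
    return "\n".join(
        line[len(indent):] if line.startswith(indent) else line
        for line in body
    )

def unescape(content: str) -> str:
    # Unknown escapes are left untouched.
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(0)), content)

def evaluate(src: str, variables: dict) -> str:
    m = TAGGED.match(src)
    if m is None:
        raise ValueError("not a multistring")
    content = m.group(3)
    for tag in (t for t in m.group(2).split(",") if t):
        if tag == "dedent":
            content = dedent(content)
        elif tag == "esc":
            content = unescape(content)
        elif tag == "$":
            content = Template(content).substitute(variables)
        else:
            raise ValueError(f"unknown tag: {tag}")
    return content

print(evaluate("`$,esc'Hello,\\n${username}!'`", {"username": "John"}))
```

Applying tags in the order they are written is only one possible rule. A real design would need to decide whether order matters, for instance when a substituted value itself contains backslashes.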

¹ The apostrophes are an arbitrary choice. This recipe can of course be adjusted by picking different delimiters, to fit the syntax of a specific programming language, the implementer’s sense of aesthetics, or other constraints.
² Although an implementation is of course free to restrict the minimum number of backticks.
³ If interested in the details of that, you may want to look into: two-level grammars, Van Wijngaarden grammars, affix grammars or extended affix grammars.
⁴ To put it another way, rule_a⇥rule_b can be expressed in EBNF as: (rule_a - rule_b), rule_b.
⁵ In practice if we limit n to 255, we’ll get 255 rules.