Skip to content

Commit ac45578

Browse files
authored
Ensure non-unicode strings aren't normalised (#3556)
In rare scenarios it's possible to end up calling `normalise_whitespace` on a non-unicode string. In this case, we shouldn't be removing whitespace characters that are specific to unicode as a `Encoding::CompatibilityError` error ends up being raised. This has only happened in practice once, and only in the test environment: https://good-machine.sentry.io/issues/6604598285/ In this case, a CSV file was uploaded where the encoding couldn't be detected, which lead to a unhandled exception being raised, rather than a validation error on the file.
2 parents f3c1bc1 + 0415c30 commit ac45578

3 files changed

Lines changed: 23 additions & 6 deletions

File tree

lib/core_ext/string/normalise_whitespace.rb

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,13 @@ class String
44
def normalise_whitespace
55
result = self
66

7-
# \u200D is a zero-width joiner (ZWJ) which is used in the frontend to display the NHS number
8-
result = result.tr("\u200D", "")
7+
if result.encoding == Encoding::UTF_8
8+
# \u200D is a zero-width joiner (ZWJ) which is used in the frontend to display the NHS number
9+
result = result.tr("\u200D", "")
910

10-
# \u00A0 is a non-breaking space
11-
result = result.tr("\u00A0", " ")
11+
# \u00A0 is a non-breaking space
12+
result = result.tr("\u00A0", " ")
13+
end
1214

1315
result.strip.gsub(/\s+/, " ").presence
1416
end

spec/lib/core_ext/string_normalise_whitespace_spec.rb

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -40,9 +40,14 @@
4040
end
4141

4242
context "with non-UTF-8 encoded strings" do
43+
let(:windows_1252_encoded_string) do
44+
"hello’s world".encode(Encoding::WINDOWS_1252)
45+
end
46+
4347
it "does not apply Unicode-specific transformations" do
44-
ascii_string = "hello world".encode(Encoding::ASCII)
45-
expect(ascii_string.normalise_whitespace).to eq("hello world")
48+
expect(windows_1252_encoded_string.normalise_whitespace).to eq(
49+
windows_1252_encoded_string
50+
)
4651
end
4752
end
4853
end

spec/lib/csv_parser_spec.rb

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,4 +42,14 @@
4242
expect(row[:header].value).to eq("value with ’ character")
4343
end
4444
end
45+
46+
context "when the encoding cannot be detected" do
47+
# This is not a good example of a CSV file, but it was the only
48+
# string I could find where the encoding wasn't detected.
49+
let(:data) { "\x92".b }
50+
51+
it "doesn't raise an error" do
52+
expect { table }.not_to raise_error
53+
end
54+
end
4555
end

0 commit comments

Comments
 (0)