Python: How to get the similar-sounding words together



























I am trying to get all the similar sounding words from a list.



I tried to get them using cosine similarity but that does not fulfil my purpose.



from sklearn.metrics.pairwise import cosine_similarity
dataList = ['two','fourth','forth','dessert','to','desert']
cosine_similarity(dataList)


I know this is not the right approach, but I cannot seem to get a result like:



result = ['xx', 'xx', 'yy', 'yy', 'zz', 'zz'] 


where identical values mark the words that sound similar.










Tags: python python-3.x list

asked yesterday by Marc Stoch, edited 21 hours ago by DirtyBit
























          1 Answer














          First, you need a suitable way to find similar-sounding words, i.e. a phonetic string-matching algorithm rather than cosine similarity. I would suggest:



          Using jellyfish:



          from jellyfish import soundex

          print(soundex("two"))
          print(soundex("to"))


          OUTPUT:



          T000
          T000


          Next, create a function that encodes the whole list, then sort the result so that equal codes end up next to each other:



          def getSoundexList(dList):
              res = [soundex(x) for x in dList]  # encode each word in dList
              # print(res)  # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
              return res

          dataList = ['two','fourth','forth','dessert','to','desert']
          print(sorted(getSoundexList(dataList)))  # sorting puts equal codes side by side


          OUTPUT:



          ['D263', 'D263', 'F630', 'F630', 'T000', 'T000']


          EDIT:



          Another way could be:



          Using fuzzy:



          import fuzzy
          soundex = fuzzy.Soundex(4)

          print(soundex("to"))
          print(soundex("two"))


          OUTPUT:



          T000
          T000


          EDIT 2:



          If you want them grouped, you could use groupby:



          from itertools import groupby

          def getSoundexList(dList):
              return sorted(soundex(x) for x in dList)  # groupby needs sorted input

          dataList = ['two','fourth','forth','dessert','to','desert']
          print([list(g) for _, g in groupby(getSoundexList(dataList))])


          OUTPUT:



          [['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]
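The sorting step matters because `itertools.groupby` only merges items that are adjacent. As a small illustration (using the unsorted codes shown earlier in this answer, hard-coded here so the snippet runs without any third-party package), compare grouping with and without sorting:

```python
from itertools import groupby

# Soundex codes in the original word order, copied from the answer above.
codes = ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']

# Without sorting, equal codes that are not adjacent land in separate groups.
unsorted_groups = [list(g) for _, g in groupby(codes)]

# After sorting, every distinct code forms exactly one group.
sorted_groups = [list(g) for _, g in groupby(sorted(codes))]

print(unsorted_groups)
print(sorted_groups)
```

Only the sorted variant yields one group per code, which is why `getSoundexList` sorts before the result is passed to `groupby`.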


          EDIT 3:



          This one's for @Eric Duminil: let's say you want both the words and their respective Soundex values:



          Using a dict along with itemgetter:



          from itertools import groupby
          from operator import itemgetter

          def getSoundexDict(dList):
              dict_ = {x: soundex(x) for x in dList}  # word -> Soundex code
              return sorted(dict_.items(), key=itemgetter(1))  # sort the pairs by code

          dataList = ['two','fourth','forth','dessert','to','desert']

          print([list(g) for _, g in groupby(getSoundexDict(dataList), key=itemgetter(1))])


          OUTPUT:



          [[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]
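If the sorted order is not needed, an alternative sketch is to collect the words into a dictionary keyed by their code. The helper name `group_by_code` and its `encode` parameter are hypothetical, introduced only for this example; `encode` can be any function that maps a word to a code, for instance the `soundex` function imported from jellyfish above.

```python
from collections import defaultdict

def group_by_code(words, encode):
    """Group words by the code returned from encode(word).

    `encode` is a stand-in for any phonetic encoder (hypothetical parameter).
    Groups appear in the order their code is first seen.
    """
    groups = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return dict(groups)

# Example with a toy encoder (first letter only), so it runs without jellyfish.
# With jellyfish installed, pass its soundex function instead.
demo = group_by_code(['two', 'fourth', 'to'], lambda w: w[0].upper())
print(demo)
```

This avoids both the sort and `groupby`, at the cost of returning a dict rather than a list of lists.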


          EDIT 4 (for OP):



          Soundex:




          Soundex is a system whereby values are assigned to names in such a
          manner that similar-sounding names get the same value. These values
          are known as soundex encodings. A search application based on soundex
          will not search for a name directly but rather will search for the
          soundex encoding. By doing so, it will obtain all names that sound
          like the name being sought.




          read more..
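To make the description above concrete, here is a minimal pure-Python sketch of the commonly described American Soundex rules. It is an illustration only, not the implementation used by jellyfish or fuzzy, and the function name `soundex_sketch` is made up for this example. The rules assumed here: keep the first letter, map the remaining consonants to digits, skip vowels, treat `h` and `w` as transparent, collapse adjacent identical digits, and pad or truncate to four characters.

```python
def soundex_sketch(word):
    """Return a 4-character Soundex-style code (illustrative sketch)."""
    # Consonant classes and their digits.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")  # the first letter's digit counts for collapsing
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal digits
        digit = codes.get(ch, "")  # vowels map to "" and reset prev
        if digit and digit != prev:
            result += digit
        prev = digit

    return (result + "000")[:4]  # pad with zeros, keep four characters

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print([soundex_sketch(w) for w in dataList])
```

For the six words in the question this produces the same codes as shown earlier in this answer; other inputs may differ between libraries in edge cases.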
































          • @EricDuminil Pardon, but I don't quite get how isSoundex returning a boolean would help?

            – DirtyBit
            23 hours ago













          • He means the name isSoundex is a binary statement ('is' or 'is not'), and should therefore be a boolean returning function. Maybe consider changing the name to something like getSoundexList?

            – user2397282
            22 hours ago











          • @user2397282 Crap, I overlooked it. Thank you. Edited! :)

            – DirtyBit
            22 hours ago






          • @EricDuminil Done! :)

            – DirtyBit
            21 hours ago











          • Thank you so much! Could you please explain a bit about soundex as well?

            – Marc Stoch
            2 hours ago











          answered yesterday by DirtyBit, edited 2 hours ago












